Structure from motion (SFM)
1. Affine structure from motion
Given a set of corresponding points in two or more images, compute the camera parameters and the 3D point coordinates
Find corresponding points in the images, then recover them in 3D space.
1.1 Recall: Corresponding point detection
- Detect SIFT features
- Match features between each pair of images
- Matching by features alone may produce incorrect correspondences
- Use RANSAC to estimate fundamental matrix between each pair
- Repeatedly sample 8 correspondences to estimate $F$; a correct $F$ makes the true matches satisfy the epipolar constraint, so the $F$ that retains the most inliers is taken as the correct one
- The $F$ estimated this way can then be used to remove erroneous matches
- Keep only the matches that are “inliers” with respect to the best fundamental matrix
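As a concrete illustration of this pipeline (not from the original notes), a sketch using OpenCV is shown below; the image file names, the ratio-test threshold, and the RANSAC parameters are placeholders.

```python
import cv2
import numpy as np

# Hypothetical input images; replace with real file paths.
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Detect SIFT keypoints and descriptors in both views.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. Match descriptors (brute-force matcher with Lowe's ratio test).
matcher = cv2.BFMatcher()
raw = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in raw if m.distance < 0.75 * n.distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# 3. Estimate the fundamental matrix with RANSAC; the mask marks inliers.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)

# 4. Keep only the inlier matches.
inliers1 = pts1[mask.ravel() == 1]
inliers2 = pts2[mask.ravel() == 1]
```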
1.2 Structure from motion
Given: $m$ images of $n$ fixed 3D points
- Estimate $m$ projection matrices $\mathbf{P}_{i}$ (the motion) and $n$ 3D points $\mathbf{X}_{j}$ (the structure) from the $mn$ correspondences $\mathbf{x}_{ij}$
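Each correspondence couples one camera and one point; in homogeneous coordinates, the unknowns must satisfy the standard projection relation (up to scale):

$$\mathbf{x}_{ij} \cong \mathbf{P}_{i} \mathbf{X}_{j}, \qquad i = 1, \ldots, m, \quad j = 1, \ldots, n$$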
1.3 Orthographic Projection
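The notes give only the heading here; for reference, orthographic projection keeps the $x$ and $y$ coordinates of a point (in the camera frame) and simply drops the depth:

$$\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}$$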
1.4 Affine cameras
- A general affine camera combines the effects of an affine transformation of the 3D space, orthographic projection, and an affine transformation of the image
- Affine projection is a linear mapping plus a translation in non-homogeneous (Euclidean) coordinates
- Given: $m$ images of $n$ fixed 3D points:
Problem: use the $m n$ correspondences $x_{i j}$ to estimate $m$ projection matrices $A_{i}$ and translation vectors $b_{i}$, and $n$ points $X_{j}$
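Written out with the quantities just defined, the affine projection of point $\mathbf{X}_{j}$ into view $i$ is

$$\mathbf{x}_{ij} = \mathbf{A}_{i} \mathbf{X}_{j} + \mathbf{b}_{i}, \qquad \mathbf{A}_{i} \in \mathbb{R}^{2 \times 3}, \quad \mathbf{b}_{i} \in \mathbb{R}^{2}$$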
Centering: subtract the centroid of the image points in each view
- For each image, center the image points (subtract their centroid)
- This transformation is equivalent to centering the 3D points first and then projecting
- For simplicity, set the origin of the world coordinate system to the centroid of the 3D points
- Setting the origin of the world coordinate system at the centroid of the 3D points simplifies the expressions
- After centering, each normalized 2D point is related to the 3D point $\mathbf{X}_{j}$ by $\hat{\mathbf{x}}_{ij} = \mathbf{A}_{i} \mathbf{X}_{j}$ (the translation $\mathbf{b}_{i}$ cancels)
- Let’s create a $2 m \times n$ data (measurement) matrix:
- Dimensions: each $\hat{\mathbf{x}}_{ij}$ is $2 \times 1$, each $\mathbf{A}_{i}$ is $2 \times 3$, and each $\mathbf{X}_{j}$ is $3 \times 1$
The measurement matrix $\mathbf{D} = \mathbf{M}\mathbf{S}$ must have rank 3!
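Concretely (writing out the standard factorization for reference), stacking the centered observations gives

$$\mathbf{D} = \begin{pmatrix} \hat{\mathbf{x}}_{11} & \cdots & \hat{\mathbf{x}}_{1n} \\ \vdots & & \vdots \\ \hat{\mathbf{x}}_{m1} & \cdots & \hat{\mathbf{x}}_{mn} \end{pmatrix} = \underbrace{\begin{pmatrix} \mathbf{A}_{1} \\ \vdots \\ \mathbf{A}_{m} \end{pmatrix}}_{\mathbf{M}\ (2m \times 3)} \underbrace{\begin{pmatrix} \mathbf{X}_{1} & \cdots & \mathbf{X}_{n} \end{pmatrix}}_{\mathbf{S}\ (3 \times n)}$$

so $\mathbf{D}$ is the product of a $2m \times 3$ and a $3 \times n$ matrix, which is why its rank is at most 3.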
1.5 Factorizing the measurement matrix
- Singular value decomposition of D: $\mathbf{D} = \mathbf{U}\mathbf{W}\mathbf{V}^{T}$
- Obtaining a factorization from SVD:
- After the singular value decomposition, we keep the columns of $U$ and $V$ associated with the three largest singular values
- Because the reconstruction is ambiguous, the recovered 3D structure may lose true depth information, i.e., it can appear deformed
- The recovered structure is only determined up to an affine transformation ($\mathbf{H}\mathbf{S}$), which is why such deformation can occur
- The decomposition is not unique. We get the same $\mathbf{D}$ by using any invertible $3 \times 3$ matrix $\mathbf{C}$ and applying the transformations $\mathbf{M} \rightarrow \mathbf{M}\mathbf{C}$, $\mathbf{S} \rightarrow \mathbf{C}^{-1}\mathbf{S}$
- That is because we have only an affine transformation and we have not enforced any Euclidean constraints (like forcing the image axes to be perpendicular, for example)
- One remedy: add prior knowledge
- A second remedy: impose suitable additional constraints obtained experimentally (e.g., the Euclidean constraints mentioned above)
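A minimal NumPy sketch of this factorization pipeline (assuming a complete $2m \times n$ measurement matrix with no missing observations; the function and variable names are my own):

```python
import numpy as np

def affine_sfm(D):
    """Factor a 2m x n measurement matrix D into motion M (2m x 3)
    and structure S (3 x n), up to an affine ambiguity."""
    # Center the measurements: subtract each row's mean (the image centroids).
    D_centered = D - D.mean(axis=1, keepdims=True)

    # Singular value decomposition of the centered measurement matrix.
    U, s, Vt = np.linalg.svd(D_centered, full_matrices=False)

    # Keep the three largest singular values (D has rank 3 for noise-free data).
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3, :]

    # One possible split of the singular values between motion and structure.
    M = U3 @ np.diag(np.sqrt(s3))      # stacked affine cameras A_i
    S = np.diag(np.sqrt(s3)) @ Vt3     # 3D points X_j
    return M, S

# Any invertible 3x3 matrix C gives the same D via M C and C^{-1} S,
# so M and S are only determined up to an affine transformation.
```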
2. Projective structure from motion
- Given: $m$ images of $n$ fixed 3D points
- Problem: estimate $m$ projection matrices $\boldsymbol{P}_{i}$ and $n$ 3D points $\boldsymbol{X}_{j}$ from the $m n$ correspondences $x_{i j} $
- With no calibration info, cameras and points can only be recovered up to a $4 \times 4$ projective transformation $\boldsymbol{H}$ :
- Adding calibration information further constrains this transformation, reducing the ambiguity
2.1 Problem
- The factorization method (by SVD) assumes that all points are visible in all images.
- This is not true if occlusions occur
- If occlusions occur, some measurements are missing, so those points cannot be used
- It is also not true if correspondences fail to be established
- Points whose correspondences cannot be found likewise cannot be used
2.2 Algebraic approach : Two-camera case
- Compute the fundamental matrix $\mathbf{F}$ between the two views: from at least 8 point correspondences, estimate the $\mathbf{F}$ relating camera 1 and camera 2
- Use $\mathbf{F}$ to estimate projective cameras
- Use these cameras to triangulate and estimate points in 3D
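A NumPy sketch of this recipe, using the common textbook construction $\mathbf{P}_1 = [\,\mathbf{I} \mid \mathbf{0}\,]$, $\mathbf{P}_2 = [\,[\mathbf{e}']_{\times}\mathbf{F} \mid \mathbf{e}'\,]$ and linear (DLT) triangulation; the exact construction used in the lecture may differ.

```python
import numpy as np

def skew(t):
    """Return the 3x3 skew-symmetric matrix [t]_x such that [t]_x v = t x v."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def cameras_from_F(F):
    """Canonical projective cameras consistent with a fundamental matrix F."""
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    # Epipole in the second image: the null vector of F^T (i.e. e'^T F = 0).
    _, _, Vt = np.linalg.svd(F.T)
    e2 = Vt[-1]
    P2 = np.hstack([skew(e2) @ F, e2.reshape(3, 1)])
    return P1, P2

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence (x1, x2 in image coords)."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X / X[3]        # homogeneous 3D point, scaled so the last entry is 1
```

The reconstruction obtained this way is, as stated above, only defined up to a projective transformation.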
3. Bundle adjustment
- Non-linear method for refining structure and motion
- Minimize reprojection error
- The reprojection error is a nonlinear function of the cameras and points (the mapping $D$ from parameters to reprojections is nonlinear)
- Nonlinear minimization, e.g., Newton's method
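As an illustration only, here is a toy refinement step built on `scipy.optimize.least_squares` (a trust-region / Levenberg-Marquardt style least-squares solver rather than a plain Newton method). The affine camera parameterization and the packing of all cameras and points into one vector are my own simplifications.

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, m, n, observations):
    """Residuals x_ij - (A_i X_j + b_i) for all observed points.

    params packs m affine cameras (A_i: 2x3, b_i: 2) followed by n 3D points.
    observations is a list of (i, j, x) with x a length-2 image point.
    """
    cam = params[:m * 8].reshape(m, 8)     # per camera: 6 numbers for A_i, 2 for b_i
    pts = params[m * 8:].reshape(n, 3)
    res = []
    for i, j, x in observations:
        A = cam[i, :6].reshape(2, 3)
        b = cam[i, 6:]
        res.extend(x - (A @ pts[j] + b))
    return np.asarray(res)

def bundle_adjust(cam0, pts0, observations):
    """Refine cameras and points by minimizing the total squared reprojection error."""
    m, n = len(cam0), len(pts0)
    x0 = np.concatenate([np.asarray(cam0).ravel(), np.asarray(pts0).ravel()])
    result = least_squares(reprojection_residuals, x0, args=(m, n, observations))
    cam = result.x[:m * 8].reshape(m, 8)
    pts = result.x[m * 8:].reshape(n, 3)
    return cam, pts
```

Because only observed $(i, j)$ pairs contribute residuals, this formulation also copes with missing points (occlusions), unlike the SVD factorization above.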
4. What is deep learning? How does it work?
- Fully connected feedforward neural network
- Loss function
- Optimization of deep neural network
4.1 Perceptron
- Mathematical model
- Distance from an incorrectly classified point to the decision line
- Goal: make $L(w, b)$ as small as possible
- Method: optimize $w$ and $b$ through gradient descent
- Let’s optimize w as an example
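A minimal NumPy sketch of this update (the learning rate, epoch count, and toy data are placeholders; the loss is the usual perceptron criterion over misclassified points):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Perceptron trained by gradient descent on misclassified points.

    X: (N, d) inputs, y: (N,) labels in {-1, +1}.
    L(w, b) = -sum over misclassified points of y * (w . x + b);
    its gradient w.r.t. w is -y * x and w.r.t. b is -y.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # misclassified (or on the line)
                w += lr * yi * xi           # w <- w - lr * dL/dw
                b += lr * yi                # b <- b - lr * dL/db
    return w, b

# Example: a linearly separable toy problem (class A vs. class B).
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
```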
Question
- Can we solve the problem below with perceptron?
- Goal: classifying samples to class A and class B
4.2 Idea: the Two-Layer Perceptron
- For the XOR problem, draw two lines instead of one
What did it do?
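What it does can be made concrete with hand-picked weights (my own choice, not from the notes): each hidden unit realizes one of the two lines, and the output unit combines them to produce XOR.

```python
import numpy as np

def step(z):
    """Hard-threshold activation of the classic perceptron."""
    return float(z > 0)

def xor_two_layer(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: two lines in the input plane.
    h1 = step(x @ np.array([1.0, 1.0]) - 0.5)   # fires when x1 + x2 > 0.5  (OR-like)
    h2 = step(x @ np.array([1.0, 1.0]) - 1.5)   # fires when x1 + x2 > 1.5  (AND-like)
    # Output layer: "OR but not AND" gives XOR.
    return step(np.array([h1, h2]) @ np.array([1.0, -2.0]) - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_two_layer(a, b)))   # prints 0, 1, 1, 0
```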
L4.3 et’s consider more general cases
Recall: Machine Learning ≈ Looking for a function f
- Speech Recognition
- Image Recognition
- Playing Go
- Dialogue System
4.4 Framework
4.5 Neuron
4.6 Other activation functions
- Rectified Linear Unit (ReLU)
- Leaky ReLU
- Softplus
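For reference, these three activations are:

$$\mathrm{ReLU}(z) = \max(0, z), \qquad \mathrm{LeakyReLU}(z) = \max(\alpha z,\, z)\ \ (\text{small } \alpha,\ \text{e.g. } 0.01), \qquad \mathrm{Softplus}(z) = \ln(1 + e^{z})$$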
4.7 Neural Network
- Network parameters $\theta$: all the weights and biases in the “neurons”
4.8 Fully Connect Feedforward Network
4.9 Matrix Operation
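The slide content is not reproduced in the notes; the matrix form of a fully connected feedforward pass is the standard one, stated here for completeness. With weight matrix $W^{\ell}$ and bias $b^{\ell}$ for layer $\ell$:

$$a^{1} = \sigma(W^{1} x + b^{1}), \qquad a^{\ell} = \sigma(W^{\ell} a^{\ell-1} + b^{\ell}) \ \text{ for } \ell = 2, \ldots, L$$

so the whole network is the composed function $y = f(x) = \sigma\big(W^{L} \cdots \sigma(W^{2}\, \sigma(W^{1} x + b^{1}) + b^{2}) \cdots + b^{L}\big)$, with the last $\sigma$ typically replaced by softmax for classification, as in the next subsection.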
4.10 Output Layer as Multi-Class Classifier
- Softmax layer as the output layer
- Motivation: better interpretability of the outputs
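For reference, the softmax layer turns the raw outputs (logits) $z_{1}, \ldots, z_{K}$ into a probability distribution over the $K$ classes, which is what makes the outputs easy to interpret:

$$y_{k} = \frac{e^{z_{k}}}{\sum_{k'=1}^{K} e^{z_{k'}}}, \qquad y_{k} > 0, \quad \sum_{k=1}^{K} y_{k} = 1$$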
4.11 Example Application
4.12 Loss for an Example
4.13 Total Loss
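The slide content is not reproduced here; the usual definition is that the total loss sums the per-example losses, and this is the quantity gradient descent minimizes:

$$L(\theta) = \sum_{n=1}^{N} l^{n}(\theta)$$

where $l^{n}$ is the loss on training example $n$ (e.g., the cross-entropy between the network output and the target).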
4.14 Gradient Descent
4.15 Local Minima
- Do we really minimize the total loss in practice?
4.16 Shuffle the samples for each epoch
4.17 Mini-batch Epoch
- Step #1: Randomly initialize network parameters
- Mini-batch is Faster
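A schematic training loop showing where the per-epoch shuffle and the mini-batch updates fit; `grad_fn` and the flat parameter vector are placeholders for whatever network is being trained.

```python
import numpy as np

def sgd_train(params, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch gradient descent.

    grad_fn(params, X_batch, y_batch) is assumed to return the gradient of the
    average loss on the batch; params is a flat parameter vector.
    """
    N = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(N)           # shuffle the samples each epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]  # one mini-batch
            g = grad_fn(params, X[idx], y[idx])
            params = params - lr * g               # update after every mini-batch
    return params
```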
4.18 Vanishing gradient and ReLU
- The gradient shrinks as it propagates backward, so backpropagation mainly updates the parameters of the layers close to the output
4.19 Rectified Linear Unit (ReLU)
4.20 ReLU variants
4.21 Learnable activation function - Maxout
- Learnable activation function [Ian J. Goodfellow, ICML’13]
You can have more than 2 elements in a group
- ReLU is a special case of Maxout
- Because the maxout output changes with the weights $w$, it is a learnable activation function
- How many linear pieces there are depends on how many elements are in a group
- Different samples activate different pieces, giving a different thin, linear sub-network for each example, so both the effective structure and the parameters adapt
- Given a training example $x$, we know which $z$ would be the max
- Train this thin and linear network
- Different thin and linear networks are used for different examples
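A NumPy sketch of a maxout layer with groups of $k$ linear pieces (the shapes and names are my own; the final comment notes the ReLU special case mentioned above):

```python
import numpy as np

def maxout(x, W, b):
    """Maxout activation.

    x: (d,) input; W: (k, m, d) weights and b: (k, m) biases for k linear
    pieces per output unit. Each of the m outputs is the max over its k pieces.
    """
    z = np.einsum('kmd,d->km', W, x) + b   # z[k, m] = w_km . x + b_km
    return z.max(axis=0)                   # element-wise max over each group

# With k = 2 and one piece fixed to zero weights and bias,
# max(w.x + b, 0) is recovered, i.e. ReLU is the special case noted above.
d, m, k = 4, 3, 2
x = np.random.randn(d)
W = np.random.randn(k, m, d)
b = np.random.randn(k, m)
print(maxout(x, W, b))
```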
4.22 Hard to find optimal parameters
4.23 In the physical world…
4.24 Review: Vanilla Gradient Descent
- Start at position $\theta^{0}$
- Compute gradient at $\theta^{0}$
- Move to $\theta^{1}=\theta^{0}-\eta \nabla L\left(\theta^{0}\right)$
- Compute gradient at $\theta^{1}$
- Move to $\theta^{2}=\theta^{1}-\eta \nabla L\left(\theta^{1}\right)$
- Stop when $\nabla L\left(\theta^{t}\right) \approx 0$
4.25 Momentum
- Start at point $\theta^{0}$
- Movement $\mathrm{v}^{0}=0$
- Compute gradient at $\theta^{0}$
- Movement $\mathrm{v}^{1}=\lambda \mathrm{v}^{0}-\eta \nabla L\left(\theta^{0}\right)$
- Move to $\theta^{1}=\theta^{0}+\mathrm{v}^{1}$
- Compute gradient at $\theta^{1}$
- Movement $\mathrm{v}^{2}=\lambda \mathrm{v}^{1}-\eta \nabla L\left(\theta^{1}\right)$
- Move to $\theta^{2}=\theta^{1}+\mathrm{v}^{2}$
- Movement is based not only on the current gradient, but also on the previous movement
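Both update rules in one NumPy sketch (`grad` is a placeholder for $\nabla L$; setting `lam = 0` recovers the vanilla rule above):

```python
import numpy as np

def gradient_descent_momentum(theta, grad, lr=0.1, lam=0.9, steps=100, tol=1e-8):
    """Gradient descent with momentum: v <- lam * v - lr * grad(theta); theta <- theta + v."""
    v = np.zeros_like(theta)              # movement v^0 = 0
    for _ in range(steps):
        g = grad(theta)
        if np.linalg.norm(g) < tol:       # stop when the gradient is ~ 0
            break
        v = lam * v - lr * g              # previous movement plus gradient step
        theta = theta + v
    return theta

# Example: minimize L(theta) = theta_0^2 + 10 * theta_1^2.
grad = lambda t: np.array([2.0 * t[0], 20.0 * t[1]])
print(gradient_descent_momentum(np.array([1.0, 1.0]), grad))
```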
4.26 Dropout
- Recall that Figure 3.3 in Section 3.8 (Multilayer Perceptron) describes a multilayer perceptron with a single hidden layer: 4 inputs and 5 hidden units, where hidden unit $h_{i}$ ($i=1, \ldots, 5$) is computed as $h_{i} = \phi(x_{1} w_{1 i} + x_{2} w_{2 i} + x_{3} w_{3 i} + x_{4} w_{4 i} + b_{i})$.
- Here $\phi$ is the activation function, $x_{1}, \ldots, x_{4}$ are the inputs, the weights of hidden unit $i$ are $w_{1 i}, \ldots, w_{4 i}$, and its bias is $b_{i}$. When dropout is applied to this hidden layer, each hidden unit has some probability of being dropped. Let the dropout probability be $p$: with probability $p$, $h_{i}$ is set to zero, and with probability $1-p$, $h_{i}$ is divided by $1-p$ (scaled up). The dropout probability is the hyperparameter of dropout. Concretely, let the random variable $\xi_{i}$ take the values 0 and 1 with probabilities $p$ and $1-p$, respectively. With dropout, the new hidden unit is $h_{i}^{\prime} = \frac{\xi_{i}}{1-p} h_{i}$.
- Since $E\left(\xi_{i}\right)=1-p$, it follows that $E\left(h_{i}^{\prime}\right) = \frac{E\left(\xi_{i}\right)}{1-p} h_{i} = h_{i}$.
- That is, dropout does not change the expected value of its input. Applying dropout to the hidden layer in Figure 3.3, one possible outcome is shown in Figure 3.5, where $h_{2}$ and $h_{5}$ are zeroed out. The output then no longer depends on $h_{2}$ and $h_{5}$, and during backpropagation the gradients of the weights attached to these two hidden units are zero. Because which hidden units are dropped during training is random, i.e., any of $h_{1}, \ldots, h_{5}$ may be zeroed, the output layer cannot rely too heavily on any single one of $h_{1}, \ldots, h_{5}$. This acts as regularization during training and helps combat overfitting. At test time, to obtain more deterministic results, dropout is generally not applied.
- The dropout operation can be implemented as a function `dropout(X, drop_prob)`; a sketch follows below.
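A sketch of how that function can be completed, written in NumPy since the original framework is not shown (this is the "inverted dropout" described above: zero with probability `drop_prob`, rescale the survivors by $1/(1-p)$):

```python
import numpy as np

def dropout(X, drop_prob):
    """Zero each element of X with probability drop_prob and rescale the rest by 1/(1 - drop_prob)."""
    assert 0 <= drop_prob <= 1
    X = np.asarray(X, dtype=float)
    keep_prob = 1 - drop_prob
    if keep_prob == 0:                    # everything is dropped
        return np.zeros_like(X)
    mask = (np.random.rand(*X.shape) < keep_prob).astype(float)
    return mask * X / keep_prob           # expected value equals X, as derived above

# At test time the function is simply not applied (equivalently, drop_prob = 0).
```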
- Each time before updating the parameters
- Each neuron has probability $p\%$ of being dropped out
- The structure of the network is changed.
Thinner!
- Using the new network for training
- For each mini-batch, we resample the dropout neurons
- Why should the weights be multiplied by $(1-p)\%$ at test time, where $p\%$ is the dropout rate?
- Dropout is a kind of ensemble
- Train a bunch of networks with different structures