  • Partition image into blocks and compute histogram of gradient orientations in each block


  • Insensitive to illumination changes, and tolerant of some variation in appearance
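A minimal sketch of the histogram step, using only the Python standard library. It is a simplification rather than a full HOG implementation (real HOG splits blocks into cells and interpolates votes between bins); the function name `orientation_histogram`, the 9-bin choice, and the toy image are assumptions made for illustration. It shows why the descriptor is insensitive to illumination: after L2 normalization, scaling the brightness leaves the histogram unchanged.

```python
import math

def orientation_histogram(block, n_bins=9):
    """L2-normalized histogram of unsigned gradient orientations for one block.

    `block` is a list of rows of floats. Gradients are central differences on
    interior pixels, and each pixel votes with its gradient magnitude.
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, len(block) - 1):
        for c in range(1, len(block[0]) - 1):
            gx = block[r][c + 1] - block[r][c - 1]
            gy = block[r + 1][c] - block[r - 1][c]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[int(angle // bin_width) % n_bins] += mag
    norm = math.sqrt(sum(h * h for h in hist)) or 1.0
    return [h / norm for h in hist]

# Toy block (assumed data): a vertical step edge, dark on the left.
block = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
dim_block = [[0.5 * value for value in row] for row in block]  # same scene, half as bright

hist = orientation_histogram(block)
dim_hist = orientation_histogram(dim_block)
```

Both histograms put all of their weight in the first bin (horizontal gradient), so the bright and dim versions of the block produce the same descriptor.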

1.1 Pedestrian detection with HOG

  • Train a pedestrian template using a linear support vector machine


  • At test time, convolve the feature map with the template (SVM)
  • Find local maxima of response
  • For multi-scale detection, repeat over multiple levels of a HOG pyramid
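The test-time procedure above can be sketched as follows. This is an illustrative sketch, not the original detector: the feature grid, the 2-dimensional cell descriptors, the template weights, and the bias are all made-up values, and `response_map` and `local_maxima` are hypothetical helper names. Each window score is the linear SVM decision value `w . x + b`; for multi-scale detection the same two functions would be run on every level of the feature pyramid.

```python
def response_map(features, template, bias=0.0):
    """Score every placement of a linear template on a grid of cell descriptors.

    features[r][c] and template[i][j] are equal-length lists of floats.
    """
    th, tw = len(template), len(template[0])
    rows = len(features) - th + 1
    cols = len(features[0]) - tw + 1
    scores = [[bias] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            s = bias
            for i in range(th):
                for j in range(tw):
                    cell = features[r + i][c + j]
                    s += sum(w * x for w, x in zip(template[i][j], cell))
            scores[r][c] = s
    return scores

def local_maxima(scores, threshold=0.0):
    """Return (row, col, score) for scores above threshold that beat all 8 neighbors."""
    peaks = []
    for r, row in enumerate(scores):
        for c, s in enumerate(row):
            if s <= threshold:
                continue
            neighbors = [scores[rr][cc]
                         for rr in range(max(r - 1, 0), min(r + 2, len(scores)))
                         for cc in range(max(c - 1, 0), min(c + 2, len(row)))
                         if (rr, cc) != (r, c)]
            if all(s > n for n in neighbors):
                peaks.append((r, c, s))
    return peaks

# Assumed toy data: a 5x5 grid of 2-D descriptors with one 2x2 "object" patch.
features = [[[0.0, 0.0] for _ in range(5)] for _ in range(5)]
for r in (2, 3):
    for c in (1, 2):
        features[r][c] = [1.0, 0.0]
template = [[[1.0, -1.0] for _ in range(2)] for _ in range(2)]

scores = response_map(features, template, bias=-3.0)
peaks = local_maxima(scores)
```

On this toy grid the only detection is the window whose top-left cell is at row 2, column 1, with score 1.0; windows that cover only part of the patch score below the threshold.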


1.2 Window-based detection: strengths

  • Sliding window detection and global appearance descriptors:
    • Simple detection protocol to implement
    • Good feature choices critical
    • Past successes for certain classes

1.3 High computational complexity

  • For example: 250,000 locations x 30 orientations x 4 scales = 30,000,000 evaluations!
  • If binary detectors are trained independently, the cost increases linearly with the number of classes

  • A rectangular box is not a good fit for every target, because objects are not necessarily rectangular

  • Non-rigid, deformable objects are not captured well by representations that assume a fixed 2D structure; otherwise a fixed viewpoint must be assumed

    • Not robust to non-rigid deformation


  • If windows are considered in isolation, context is lost


  • In practice, often entails a large, cropped training set (expensive)
    • The training windows have to be labeled
  • Requiring a good match to a global appearance description can lead to sensitivity to partial occlusions


2. Discriminative part-based models

  • Single rigid template usually not enough to represent a category
    • Many objects (e.g. humans) are articulated, or have parts that can vary in configuration


  • Many object categories look very different from different viewpoints, or from instance to instance
    • Appearance changes with the viewpoint


2.1 Solution

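  • First compute the response of the global (whole-object) template, then the responses of the local part operators
  • Whichever part is present, the object can be counted as detected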



  • Even when the spatial arrangement of the parts changes, the parts themselves can still be detected

3. Object proposals

3.1 Main idea:

  • Learn to generate category-independent regions/boxes that have object-like properties.
  • Let object detector search over “proposals”, not exhaustive sliding windows
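    • That is, find the windows that are likely to contain an object


  • Multi-scale saliency
    • When people look at a scene, their attention is drawn to particular points


  • Color contrast
    • An object usually differs clearly in color from its surroundings


  • Edge density: in general, the edges of an object form a closed contour


  • Superpixel straddling: clustering similar pixels together gives superpixels, and a single superpixel should not belong to two classes. A box around an object should not cut across superpixels; a box that does probably contains no object



  • About 1,000 windows are enough to cover the objects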

3.2 Summary

  • Object recognition as classification task
    • Boosting (face detection ex)
    • Support vector machines and HOG (person detection ex)
    • Sliding window search paradigm
      • Pros and cons
      • Speed up with attentional cascade
      • Discriminative part-based models, object proposals

4. Motion and Tracking

4.1 From images to videos

  • A video is a sequence of frames captured over time
  • Now our image data is a function of space $(x, y)$ and time $(t)$


4.2 Motion is a powerful perceptual cue

  • Sometimes, it is the only cue
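    • Consecutive frames are strongly correlated, so motion carries rich information
    • In the figure for this example, the two circles can be seen only while the pattern is moving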


  • Even “impoverished” motion data can evoke a strong percept
    • In the figure for this example, a moving person can be recognized


4.3 Uses of motion in computer vision

  • 3D shape reconstruction
    • Capture the scene from multiple viewpoints
  • Object segmentation
  • Learning and tracking of dynamical models
    • Object tracking
  • Event and activity recognition

4.4 Motion field

  • The motion field is the projection of the 3D scene motion into the image



4.5 Motion estimation: Optical flow

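  • Definition: optical flow is the apparent motion of brightness patterns in the image

    • Equivalently, it is the instantaneous velocity of the pixels that a moving object in space produces on the imaging plane
  • Ideally, optical flow would be the same as the motion field

  • Have to be careful: apparent motion can be caused by lighting changes without any actual motion
    • Think of a uniform rotating sphere under fixed lighting vs. a stationary sphere under moving illumination
    • In the first case the sphere rotates, but its uniform surface produces no change in image brightness, so there is motion without optical flow
    • In the second case the lighting changes while the object stays still, so there is optical flow without motion
  • Goal: recover image motion at each pixel from optical flow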

4.6 Estimating optical flow


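  • Over a short time interval, the displacement vector approximates the velocity vector

  • Given two subsequent frames, estimate the apparent motion field $u(x, y)$, $v(x, y)$ between them

    • $u$ and $v$ are the horizontal and vertical velocity components


  • Key assumptions
    • Brightness constancy: the projection of the same point looks the same in every frame
      • The brightness of a projected point does not change as it moves between frames
    • Small motion: points do not move very far
      • Time is continuous, or the motion is "small": the passage of time does not cause abrupt changes in position, so the displacement between adjacent frames is small
    • Spatial coherence: points move like their neighbors
      • Neighboring points have similar motion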

4.6.1 Key Assumptions: small motions


  • Between adjacent frames, the pixels of a region change gradually

4.6.2 Key Assumptions: spatial coherence


  • Spatial consistency: within a small neighborhood, the motion is similar

4.6.3 Key Assumptions: brightness Constancy


4.7 The brightness constancy constraint

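  • Brightness is assumed constant


  • Brightness Constancy Equation: $I(x, y, t)=I(x+u, y+v, t+1)$
  • Linearizing the right side using Taylor expansion: $I(x+u, y+v, t+1) \approx I(x, y, t)+I_{x} u+I_{y} v+I_{t}$
  • Substituting into the brightness constancy equation, with $I_{t}$ the derivative in the $t$ direction: $I_{x} u+I_{y} v+I_{t} \approx 0$, i.e. $\nabla I \cdot\left[\begin{array}{ll}u & v\end{array}\right]^{T}+I_{t}=0$
  • Can we use this equation to recover image motion $(u, v)$ at each pixel?
  • A flow increment perpendicular to the gradient has no effect on the equation: if $(u, v)$ satisfies it, so does $(u+u', v+v')$ whenever $\nabla I \cdot\left[\begin{array}{ll}u' & v'\end{array}\right]^{T}=0$
  • How many equations and unknowns per pixel?

    • One equation (this is a scalar equation!), two unknowns $(u, v)$
    • So the parameters cannot be determined uniquely
  • The component of the flow perpendicular to the gradient (i.e., parallel to the edge) cannot be measured


  • There are many solutions, so the estimated flow can disagree with the actual motion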
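A small numeric sketch of this ambiguity, using only the standard library. The derivative values are made up, and `normal_flow` and `satisfies_constraint` are hypothetical helper names. From the single constraint $I_{x} u+I_{y} v+I_{t}=0$, only the flow component along the gradient can be recovered; adding any amount of motion along the edge still satisfies the equation.

```python
def normal_flow(ix, iy, it):
    """Flow component along the gradient implied by ix*u + iy*v + it = 0."""
    g2 = ix * ix + iy * iy
    if g2 == 0.0:
        return None  # no gradient: the constraint carries no information
    scale = -it / g2
    return (scale * ix, scale * iy)

def satisfies_constraint(ix, iy, it, u, v, tol=1e-9):
    """Check whether the flow (u, v) is consistent with the brightness constancy constraint."""
    return abs(ix * u + iy * v + it) < tol

# Assumed derivatives at one pixel on a vertical edge: the gradient points along x.
ix, iy, it = 2.0, 0.0, -4.0

measured = normal_flow(ix, iy, it)  # the only component the pixel can measure
candidates = [(measured[0], v) for v in (-3.0, 0.0, 5.0)]  # add motion along the edge
consistent = [satisfies_constraint(ix, iy, it, u, v) for u, v in candidates]
```

Here `measured` is `(2.0, 0.0)`, and all three candidate flows satisfy the constraint even though their vertical components differ, which is exactly why one equation per pixel is not enough.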

4.8 The aperture problem

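  • The aperture problem: in motion estimation, a single local operator (an operation that measures the change in pixel values, e.g. the gradient) cannot determine an object's trajectory unambiguously. Each operator only sees the pixel changes in the local region it covers, and the same change in pixel values can be produced by many different object trajectories.


  • Seen through a small aperture, the motion looks like a simple translation, but the actual 3D motion is different


  • The 3D motion is a rotation, but in 2D it appears to move upward
  • This is the difference between local and global visual processing. Locally, our visual system is subject to the aperture-problem illusion, but when we observe the whole scene globally we can work out the different directions in which the three strips of paper are moving.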

4.9 Solving the ambiguity

  • How to get more equations for a pixel?
  • Spatial coherence constraint:
    • Assume the pixel’s neighbors have the same (u,v)
    • If we use a 5x5 window, that gives us 25 equations per pixel
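  • Overconstrained linear system $A d=b$, with one row per pixel $p_{i}$ in the window:
    • row $i$ of $A$ is $\left[\begin{array}{ll}I_{x}\left(p_{i}\right) & I_{y}\left(p_{i}\right)\end{array}\right]$, $d=\left[\begin{array}{ll}u & v\end{array}\right]^{T}$, and entry $i$ of $b$ is $-I_{t}\left(p_{i}\right)$
  • Least squares solution for $d$ given by $\left(A^{T} A\right) d=A^{T} b$
    • $A^{T} A=\left[\begin{array}{cc}\sum I_{x} I_{x} & \sum I_{x} I_{y} \\ \sum I_{x} I_{y} & \sum I_{y} I_{y}\end{array}\right]$ and $A^{T} b=-\left[\begin{array}{c}\sum I_{x} I_{t} \\ \sum I_{y} I_{t}\end{array}\right]$
  • The summations are over all pixels in the $K \times K$ window

  • Optimal $(u, v)$ satisfies Lucas-Kanade equation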

  • When is this solvable? I.e., what are good points to track?
    • $A^{T} A$ should be invertible
      • It is not always invertible
    • $A^{T} A$ should not be too small due to noise
      • eigenvalues $\lambda_{1}$ and $\lambda_{2}$ of $A^{T} A$ should not be too small
      • If $A^{T} A$ is very small, noise causes large perturbations in the solution, so the eigenvalues must not be too small
    • $A^{T} A$ should be well-conditioned
      • $\lambda_{1} / \lambda_{2}$ should not be too large ($\lambda_{1}$ = larger eigenvalue)
  • Does this remind you of anything?
    • Criteria for Harris corner detector
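The window-based least squares solve can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not a reference implementation: gradients are central differences on the first frame, the temporal derivative is a simple frame difference, and the test images (a quadratic blob and a linear ramp) as well as the names `lucas_kanade` and `min_eig` are invented for the example. The function returns `None` when the smaller eigenvalue of $A^{T} A$ is too small, which is the solvability condition listed above.

```python
import math

def lucas_kanade(frame1, frame2, cx, cy, half=2, min_eig=1e-6):
    """Estimate the flow of the (2*half+1)^2 window centered at (cx, cy).

    Frames are lists of rows (frame[y][x]). Returns (u, v, lam1, lam2) with
    lam1 >= lam2 the eigenvalues of A^T A, or None if A^T A is near-singular.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (frame1[y][x + 1] - frame1[y][x - 1]) / 2.0  # I_x
            iy = (frame1[y + 1][x] - frame1[y - 1][x]) / 2.0  # I_y
            it = frame2[y][x] - frame1[y][x]                  # I_t
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    # Eigenvalues of the symmetric 2x2 matrix A^T A.
    mean = (sxx + syy) / 2.0
    spread = math.hypot((sxx - syy) / 2.0, sxy)
    lam1, lam2 = mean + spread, mean - spread
    if lam2 < min_eig:
        return None
    # Solve (A^T A) d = A^T b with b = -I_t, using Cramer's rule.
    det = sxx * syy - sxy * sxy
    u = (sxy * syt - syy * sxt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v, lam1, lam2

def quadratic_blob(cx, cy, size=9):
    """Assumed test pattern: brightness grows with squared distance from (cx, cy)."""
    return [[(x - cx) ** 2 + (y - cy) ** 2 for x in range(size)] for y in range(size)]

frame1 = quadratic_blob(4.0, 4.0)
frame2 = quadratic_blob(4.3, 3.8)  # the same pattern moved by (u, v) = (0.3, -0.2)
flow = lucas_kanade(frame1, frame2, cx=4, cy=4)

# A linear ramp has gradients in one direction only: the aperture problem.
ramp1 = [[float(x) for x in range(9)] for _ in range(9)]
ramp2 = [[x - 0.3 for x in range(9)] for _ in range(9)]
ambiguous = lucas_kanade(ramp1, ramp2, cx=4, cy=4)
```

For the blob, the estimate is (0.3, -0.2) up to rounding and both eigenvalues are equal, so the system is well-conditioned. For the ramp, the smaller eigenvalue is zero and the function reports that the flow cannot be solved.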

4.10 Recall: second moment matrix

  • Estimation of optical flow is well-conditioned precisely for regions with high “cornerness”:


4.10.1 Low texture region


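  • Optical flow is hard to estimate in smooth (low-texture) regions and along edges
  • Corners are easier, because the gradient there varies in all directions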
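The three cases can be compared numerically with a short sketch. The 7x7 patches and the name `structure_eigenvalues` are assumptions for illustration; the function computes the eigenvalues of the second moment matrix $\sum \nabla I^{T} \nabla I$ over the interior of a patch.

```python
import math

def structure_eigenvalues(patch):
    """Eigenvalues (lam1 >= lam2) of the second moment matrix of a patch."""
    sxx = sxy = syy = 0.0
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            ix = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            iy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    mean = (sxx + syy) / 2.0
    spread = math.hypot((sxx - syy) / 2.0, sxy)
    return mean + spread, mean - spread

# Assumed toy patches.
flat = [[1.0] * 7 for _ in range(7)]                       # low texture
edge = [[0.0] * 3 + [1.0] * 4 for _ in range(7)]           # vertical step edge
corner = [[1.0 if x >= 3 and y >= 3 else 0.0 for x in range(7)]
          for y in range(7)]                               # bright quadrant

flat_eigs = structure_eigenvalues(flat)
edge_eigs = structure_eigenvalues(edge)
corner_eigs = structure_eigenvalues(corner)
```

The flat patch gives two zero eigenvalues, the edge gives one positive and one zero eigenvalue, and only the corner gives two clearly positive eigenvalues, so only there is the flow estimate well-conditioned.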



4.10.2 The aperture problem resolved



  • The flow is pinned down by intersecting the constraints from pixels with different gradient directions

4.11 Errors in Lucas-Kanade

  • The motion is large (larger than a pixel)
  • A point does not move like its neighbors
    • e.g., the deformation of non-rigid objects
  • Brightness constancy does not hold

Revisiting the small motion assumption


  • Is this motion small enough?
    • Probably not: it is much larger than one pixel
    • How might we solve this problem?
  • In other words, how can relatively large motions be measured?

4.12 Reduce the resolution!


  • After downsampling, a motion that was a two-pixel shift becomes a one-pixel shift, which makes the estimate more robust

4.13 Coarse-to-fine optical flow estimation


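  • First downsample the images to build a pyramid


  • Then estimate optical flow starting from the lowest-resolution level, and upsample the result
    • The $(u, v)$ found at the lower resolution is used as the initial value for the next level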
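The coarse-to-fine idea can be sketched on a 1D signal, which keeps the code short while preserving the structure: build a pyramid, estimate at the coarsest level, then double the estimate and refine it at each finer level. Everything here is an assumption for illustration (a single global shift instead of a per-pixel flow, pair-averaging as the downsampling filter, a Gaussian bump as test data, and the helper names).

```python
import math

def downsample(signal):
    """Halve the resolution by averaging neighboring pairs of samples."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]

def sample(signal, pos):
    """Linearly interpolate `signal` at a fractional position (clamped at the ends)."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    i = min(int(pos), len(signal) - 2)
    frac = pos - i
    return (1.0 - frac) * signal[i] + frac * signal[i + 1]

def refine_shift(f, g, d, iterations):
    """Lucas-Kanade style refinement of the shift d such that g(x + d) ~ f(x)."""
    for _ in range(iterations):
        num = den = 0.0
        for x in range(1, len(f) - 1):
            fx = (f[x + 1] - f[x - 1]) / 2.0   # spatial derivative
            ft = sample(g, x + d) - f[x]       # temporal difference after warping by d
            num += fx * ft
            den += fx * fx
        if den == 0.0:
            break
        d -= num / den
    return d

def coarse_to_fine_shift(f, g, levels=3, iterations=5):
    """Estimate the shift from the coarsest pyramid level down to full resolution."""
    pyramid = [(f, g)]
    for _ in range(levels - 1):
        f, g = downsample(f), downsample(g)
        pyramid.append((f, g))
    d = 0.0
    for level, (fl, gl) in enumerate(reversed(pyramid)):
        if level > 0:
            d *= 2.0  # a shift doubles when the resolution doubles
        d = refine_shift(fl, gl, d, iterations)
    return d

def bump(center, sigma=4.0, n=128):
    """Assumed test signal: a Gaussian bump."""
    return [math.exp(-0.5 * ((x - center) / sigma) ** 2) for x in range(n)]

frame1, frame2 = bump(60.0), bump(68.0)  # true shift: 8 samples
single_step = refine_shift(frame1, frame2, 0.0, iterations=1)
pyramid_estimate = coarse_to_fine_shift(frame1, frame2)
```

A single linearized step at full resolution badly underestimates the 8-sample shift, because the motion is large relative to the signal's structure. At the coarsest level the same motion is only 2 samples, so the estimate starts in the right place and the finer levels only need to correct a small residual.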


4.14 A point does not move like its neighbors

  • Motion segmentation


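  • First split the image into blocks, then cluster their motions to find the true directions; this separates the image into layers, each treated as one moving object
  • Brightness constancy does not hold
    • Feature matching
  • Detect keypoints first; their trajectories can then be tracked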

5. Feature Tracking



  • Find keypoints in the image, then track those keypoints from frame to frame; this is feature tracking

5.1 Single object tracking


  • Can handle occlusion effectively

5.2 Multiple object tracking


  • Problems that can arise
    • Objects overlap
    • Objects separate again (their IDs must not get mixed up)

5.3 Tracking with a fixed camera


  • The camera is fixed, so a person's scale in the image changes as they move

5.4 Tracking with a moving camera


  • With a moving camera, the background changes

5.5 Tracking with multiple cameras


  • The viewing angle changes

5.6 Challenges in Feature tracking

  • Figure out which features can be tracked
  • Efficiently track across frames
  • Some points may change appearance over time
    • e.g., due to rotation, moving into shadows, etc.
  • Drift: small errors can accumulate as appearance model is updated
    • Small errors between pairs of frames accumulate into a large error
  • Points may appear or disappear.

5.7 What are good features to track?

  • Intuitively, we want to avoid smooth regions and edges. But is there a more principled way to define good features?
  • Good features are stable and easy to compute
  • Key idea: “good” features to track are the ones whose motion can be estimated reliably
  • What kinds of image regions can we detect easily and consistently?

5.8 Motion estimation techniques

  • Optical flow
    • Recover image motion at each pixel from spatio-temporal image brightness variations (optical flow)
  • Feature-tracking

    • Extract visual features (corners, textured areas) and “track” them over multiple frames
  • Feature tracking can use an optical flow algorithm to help track the features

5.9 Optical flow can help track features

  • Once we have the features we want to track, Lucas-Kanade or other optical flow algorithms can help track those features



6. Shi-Tomasi feature tracker

6.1 Simple KLT tracker

  1. Find a good point to track (Harris corner)
  2. For each Harris corner compute motion (translation or affine) between consecutive frames.
  3. Link motion vectors in successive frames to get a track for each Harris point
  4. Introduce new Harris points by applying Harris detector at every m (10 or 15) frames
    • That is, check whether good new feature points have appeared
  5. Track new and old Harris points using steps 1-3

6.2 Recall: Challenges in Feature tracking

  • Figure out which features can be tracked
  • Some points may change appearance over time
  • Drift: small errors can accumulate as appearance model is updated
    • So choose relatively stable feature points as the targets to track
  • Points may appear or disappear.

    • Need to be able to add/delete tracked points
  • Check consistency of tracks by affine registration to the first observed instance of the feature

  • Affine model is more accurate for larger displacements

6.3 2D transformations

  • For background, see the 2D transformation review


6.3.1 Translation

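  • Let the initial feature be located at $(x, y)$.

  • In the next frame, it has translated to $(x', y')$.

  • We can write the transformation as: $x'=x+b_{1}$, $y'=y+b_{2}$

  • We can write this as a matrix transformation using homogeneous coordinates: $\left[\begin{array}{l}x' \\ y'\end{array}\right]=\left[\begin{array}{lll}1 & 0 & b_{1} \\ 0 & 1 & b_{2}\end{array}\right]\left[\begin{array}{l}x \\ y \\ 1\end{array}\right]$


  • Notation: $W(\boldsymbol{x} ; \boldsymbol{p})=\left[\begin{array}{lll}1 & 0 & b_{1} \\ 0 & 1 & b_{2}\end{array}\right]\left[\begin{array}{l}x \\ y \\ 1\end{array}\right]$
  • There are only two parameters: $\boldsymbol{p}=\left[\begin{array}{ll}b_{1} & b_{2}\end{array}\right]^{T}$
  • The derivative of the transformation w.r.t. $\mathbf{p}$: $\frac{\partial W}{\partial p}(\boldsymbol{x} ; \boldsymbol{p})=\left[\begin{array}{ll}1 & 0 \\ 0 & 1\end{array}\right]$
  • This is called the Jacobian.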

6.3.2 Similarity motion

  • Similarity motion, in the simplified form used here, includes scaling + translation.
  • We can write the transformations as: $x'=a x+b_{1}$, $y'=a y+b_{2}$

6.3.3 Affine motion

  • Affine motion includes scaling + rotation + shear + translation.
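A short sketch of the warp and its Jacobian for the affine case. The parameter ordering `p = (a1, a2, b1, a3, a4, b2)`, meaning $x'=a_{1} x+a_{2} y+b_{1}$ and $y'=a_{3} x+a_{4} y+b_{2}$, is an assumption chosen for this example, as are the function names. Translation is the special case with an identity linear part, and the finite-difference check is only there to confirm the analytic Jacobian.

```python
def warp_affine(point, p):
    """Apply the 2x3 affine matrix [[a1, a2, b1], [a3, a4, b2]] to (x, y, 1)."""
    x, y = point
    a1, a2, b1, a3, a4, b2 = p
    return (a1 * x + a2 * y + b1, a3 * x + a4 * y + b2)

def jacobian_affine(point):
    """dW/dp for the affine warp: a 2x6 matrix that depends only on (x, y)."""
    x, y = point
    return [[x, y, 1.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, x, y, 1.0]]

def numeric_jacobian(point, p, eps=1e-6):
    """Central-difference estimate of dW/dp, used only to check the analytic form."""
    rows = [[], []]
    for k in range(len(p)):
        plus, minus = list(p), list(p)
        plus[k] += eps
        minus[k] -= eps
        wp, wm = warp_affine(point, plus), warp_affine(point, minus)
        rows[0].append((wp[0] - wm[0]) / (2 * eps))
        rows[1].append((wp[1] - wm[1]) / (2 * eps))
    return rows

# Pure translation by (5, -1): identity linear part.
translation = (1.0, 0.0, 5.0, 0.0, 1.0, -1.0)
moved = warp_affine((2.0, 3.0), translation)

# A general affine warp (made-up parameters) to compare both Jacobians.
general = (1.1, 0.2, 5.0, -0.1, 0.9, -1.0)
analytic = jacobian_affine((2.0, 3.0))
numeric = numeric_jacobian((2.0, 3.0), general)
max_gap = max(abs(a - n)
              for row_a, row_n in zip(analytic, numeric)
              for a, n in zip(row_a, row_n))
```

The point (2, 3) moves to (7, 2) under the translation. The columns of the Jacobian that belong to `b1` and `b2` form the 2x2 identity, which is the Jacobian of the translation-only warp.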

6.4 Iterative KLT tracker

  • Given a video sequence, find all the features and track them across the video.
  • First, use Harris corner detection to find features and their location $\boldsymbol{x}$. For each feature at location $\boldsymbol{x}=\left[\begin{array}{ll}x & y\end{array}\right]^{T}$:
  • Choose a descriptor and create an initial template for that feature: $T(\boldsymbol{x})$.
  • Note that a descriptor template is computed for each feature in the initial frame; later descriptors of the feature are compared against this template
  • Our aim is to find the $\boldsymbol{p}$ that minimizes the difference between the template $T(\boldsymbol{x})$ and the description of the new location of $\boldsymbol{x}$ after undergoing the transformation.
    • That is, within the small region around the feature, minimize the difference between the descriptors before and after the transformation
  • For all the features $x$ in the image $I$, minimize $\sum_{x}[I(W(\boldsymbol{x} ; \boldsymbol{p}))-T(\boldsymbol{x})]^{2}$

    • $I(W(\boldsymbol{x} ; \boldsymbol{p}))$ is the estimate of where the features move to in the next frame after the transformation defined by $W(\boldsymbol{x} ; \boldsymbol{p})$. Recall that $\boldsymbol{p}$ is our vector of parameters.
    • Sum is over an image patch around $\boldsymbol{x}$.
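  • We will instead break down $\boldsymbol{p}=\boldsymbol{p}_{\mathbf{0}}+\Delta \boldsymbol{p}$

    • Large $+$ small $/$ residual motion
    • Where $\boldsymbol{p}_{\mathbf{0}}$ is going to be fixed and we will solve for $\Delta \boldsymbol{p}$, which is a small value.
    • We can initialize $\boldsymbol{p}_{\mathbf{0}}$ with our best guess of what the motion is and initialize $\Delta \boldsymbol{p}$ as zero.
  • The objective becomes $\sum_{x}\left[I\left(W\left(\boldsymbol{x} ; \boldsymbol{p}_{\mathbf{0}}+\Delta \boldsymbol{p}\right)\right)-T(\boldsymbol{x})\right]^{2}$, and a first-order Taylor expansion gives $I\left(W\left(\boldsymbol{x} ; \boldsymbol{p}_{\mathbf{0}}+\Delta \boldsymbol{p}\right)\right) \approx I\left(W\left(\boldsymbol{x} ; \boldsymbol{p}_{\mathbf{0}}\right)\right)+\nabla I \frac{\partial W}{\partial p} \Delta \boldsymbol{p}$
  • It’s a good thing we have already calculated what $\frac{\partial W}{\partial p}$ would look like for affine, translations and other transformations!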

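  • So our aim is to find the $\Delta \boldsymbol{p}$ that minimizes the following: $\sum_{x}\left[I\left(W\left(\boldsymbol{x} ; \boldsymbol{p}_{\mathbf{0}}\right)\right)+\nabla I \frac{\partial W}{\partial p} \Delta \boldsymbol{p}-T(\boldsymbol{x})\right]^{2}$

  • Where $\nabla I=\left[\begin{array}{ll}I_{x} & I_{y}\end{array}\right]$

  • Differentiating with respect to $\Delta \boldsymbol{p}$ and setting the result to zero: $\sum_{x}\left[\nabla I \frac{\partial W}{\partial p}\right]^{T}\left[I\left(W\left(\boldsymbol{x} ; \boldsymbol{p}_{\mathbf{0}}\right)\right)+\nabla I \frac{\partial W}{\partial p} \Delta \boldsymbol{p}-T(\boldsymbol{x})\right]=0$

  • Solving for $\Delta \boldsymbol{p}$, we get: $\Delta \boldsymbol{p}=H^{-1} \sum_{x}\left[\nabla I \frac{\partial W}{\partial p}\right]^{T}\left[T(\boldsymbol{x})-I\left(W\left(\boldsymbol{x} ; \boldsymbol{p}_{\mathbf{0}}\right)\right)\right]$
  • where $H=\sum_{x}\left[\nabla I \frac{\partial W}{\partial p}\right]^{T}\left[\nabla I \frac{\partial W}{\partial p}\right]$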
  • H matrix for translation transformations

Recall that

  1. $\nabla I=\left[\begin{array}{ll}I_{x} & I_{y}\end{array}\right]$ and
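  2. for translation motion, $\frac{\partial W}{\partial p}(\boldsymbol{x} ; \boldsymbol{p})=\left[\begin{array}{ll}1 & 0 \\ 0 & 1\end{array}\right]$
  • Substituting both into $H$ gives $H=\sum_{x}\left[\begin{array}{cc}I_{x}^{2} & I_{x} I_{y} \\ I_{x} I_{y} & I_{y}^{2}\end{array}\right]$, the same second moment matrix used by the Harris corner detector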
  • H matrix for affine transformations

6.5 Overall KLT tracker algorithm

  • Given the features from Harris detector:
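    • Presumably this means that the coordinates and descriptors of the features are available
    • Features could be tracked directly with optical flow, but optical flow accumulates error
    • Because of noise, the feature is compared with its appearance about 10 frames earlier; in general the feature can still be found after a 2D transformation
    • If the feature has disappeared, no suitable small motion exists that transforms it onto the new frame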
  1. Initialize $\boldsymbol{p}_{\mathbf{0}}$ and $\Delta \boldsymbol{p}$.
  2. Compute the initial templates $T(x)$ for each feature.
  3. Transform the features in the image $I$ with $W\left(\boldsymbol{x} ; \boldsymbol{p}_{\mathbf{0}}\right)$.
  4. Measure the error: $I\left(W\left(\boldsymbol{x} ; \boldsymbol{p}_{\mathbf{0}}\right)\right)-T(x)$.
  5. Compute the image gradients $\nabla I=\left[\begin{array}{ll}I_{x} & I_{y}\end{array}\right]$.
  6. Evaluate the Jacobian $\frac{\partial W}{\partial p}$.
  7. Compute steepest descent $\nabla I \frac{\partial W}{\partial p}$.
  8. Compute Inverse Hessian $H^{-1}$
  9. Calculate the change in parameters $\Delta \boldsymbol{p}$
  10. Update parameters $\boldsymbol{p}_{\mathbf{0}}=\boldsymbol{p}_{\mathbf{0}}+\Delta \boldsymbol{p}$
  11. Repeat 3 to 10 until $\Delta \boldsymbol{p}$ is small (the initial templates from step 2 stay fixed).
  • If $\Delta \boldsymbol{p}$ stays large, the feature is deleted

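  • Overall, the algorithm continuously monitors the features. Roughly every 10 frames it runs the following check: given two images and the feature template computed at the start, it applies a 2D transformation to a feature point from ten frames earlier to map it into the current frame, which gives the descriptor of the small region there. It then minimizes the RMS difference between the two. If a suitably small $\Delta p$ is found, the feature is intact; otherwise the feature has probably disappeared and is no longer tracked.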
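The iteration can be sketched for the simplest case, a pure translation warp $W(\boldsymbol{x} ; \boldsymbol{p})=\boldsymbol{x}+\boldsymbol{p}$, where the Jacobian is the identity and the steepest descent images are just the gradients. This is a sketch under stated assumptions rather than the tracker itself: raw pixel values stand in for the descriptor, gradients are central differences on the warped image, the Gaussian blob frames are synthetic, and `bilinear`, `track_translation`, and `blob` are hypothetical names.

```python
import math

def bilinear(img, x, y):
    """Sample img[y][x] at a fractional position, clamped to the image border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = min(int(x), w - 2), min(int(y), h - 2)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x0 + 1]
    bottom = (1 - fx) * img[y0 + 1][x0] + fx * img[y0 + 1][x0 + 1]
    return (1 - fy) * top + fy * bottom

def track_translation(template, image, corner, p0=(0.0, 0.0), max_iters=50, tol=1e-3):
    """Iterate the warp / error / gradient / update steps for a translation warp.

    `template` is a patch (list of rows) whose top-left pixel sat at `corner`
    (x, y) in the first frame. Returns (p, converged).
    """
    px, py = p0
    for _ in range(max_iters):
        hxx = hxy = hyy = bx = by = 0.0
        for j, row in enumerate(template):
            for i, t in enumerate(row):
                wx, wy = corner[0] + i + px, corner[1] + j + py  # warp with p0
                err = t - bilinear(image, wx, wy)                # T(x) - I(W(x; p0))
                gx = (bilinear(image, wx + 1, wy) - bilinear(image, wx - 1, wy)) / 2.0
                gy = (bilinear(image, wx, wy + 1) - bilinear(image, wx, wy - 1)) / 2.0
                # Identity Jacobian: the steepest descent vector is [gx, gy].
                hxx += gx * gx
                hxy += gx * gy
                hyy += gy * gy
                bx += gx * err
                by += gy * err
        det = hxx * hyy - hxy * hxy
        if abs(det) < 1e-12:
            return (px, py), False  # H is not invertible: nothing to track here
        dpx = (hyy * bx - hxy * by) / det   # delta p = H^-1 * sum(grad^T * err)
        dpy = (hxx * by - hxy * bx) / det
        px, py = px + dpx, py + dpy         # p0 <- p0 + delta p
        if math.hypot(dpx, dpy) < tol:
            return (px, py), True
    return (px, py), False

def blob(cx, cy, size=32, sigma=3.0):
    """Assumed test frame: a Gaussian blob centered at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]

frame1, frame2 = blob(15, 15), blob(17, 16)          # the blob moves by (2, 1)
corner = (10, 10)
template = [row[10:21] for row in frame1[10:21]]     # 11x11 patch around the blob
p, converged = track_translation(template, frame2, corner)
```

The estimate `p` should end up close to the true translation (2, 1). A tracker built on this would drop the feature when the loop fails to converge, which mirrors the rule of deleting features whose $\Delta \boldsymbol{p}$ stays large.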

