Image Segmentation

Recognition & Detection

1. Introduction to recognition

1.1 Activity recognition

What are these people doing?

walking
shopping
rolling a cart
sitting
talking
…

1.2 Categorization vs Single instance recognition

Where is the crunchy nut?

1.3 Visual Recognition

Design algorithms that have the capability to:
- Classify images or videos
- Detect and localize objects
- Estimate semantic and geometrical attributes.
- Classify human activities and events

1.4 Why is it difficult?

Want to find the object despite possibly large changes in scale, viewpoint, lighting, and partial occlusion

2. The machine learning framework

Training: given a training set of labeled examples $\{(x_1,y_1), \ldots, (x_N,y_N)\}$, estimate the prediction function $f$ by minimizing the prediction error on the training set
Testing: apply $f$ to a never-before-seen test example $x$ and output the predicted value $y = f(x)$
In both cases, the prediction function is applied to a feature representation of the image to get the desired output.

2.1 A simple pipeline - Training

2.2 “Classic” recognition pipeline

Hand-crafted feature representation
- Texture
- Edges
- Corners
Off-the-shelf trainable classifier

3. Bag of words

3.1 Origin

3.1.1 Origin 1: Texture Recognition

Texture is characterized by the repetition of basic elements or textons

Count how often each texton occurs in the image and use these frequencies as the feature vector

3.1.2 Origin 2: Bag-of-words models

Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983)

3.1.3 Bags of features for object recognition

Works pretty well for image-level classification and for recognizing object instances

3.2 Bag of features

First, take a bunch of images, extract features, and build up a “dictionary” or “visual vocabulary”: a list of common features
Given a new image, extract features and build a histogram: for each feature, find the closest visual word in the dictionary and increment its count

3.2.1 Bag of features: outline

Extract features

Learn “visual vocabulary”
- Keep only the representative feature primitives

Quantize features using visual vocabulary
- Compute the similarity to each visual word and assign the feature to the corresponding bin
Represent images by frequencies of “visual words”

3.2.2 Feature extraction

Regular grid
- Vogel & Schiele, 2003
- Fei-Fei & Perona, 2005

Interest point detector
- Csurka et al. 2004
- Fei-Fei & Perona, 2005
- Sivic et al. 2005

Other methods
- Random sampling (Vidal-Naquet & Ullman, 2002)
- Segmentation-based patches (Barnard et al. 2003)

3.3 Learning the visual vocabulary

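Cluster the feature points; the resulting cluster centers form the visual vocabulary
- With SIFT, for example, 10,000 128-dimensional descriptors are reduced to 3 128-dimensional cluster centers, and the histogram over these centers then turns an image into a 3-dimensional vector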
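The clustering step can be sketched with plain k-means (Lloyd's algorithm). This is a minimal illustration, not a tuned implementation: the toy 2-D points stand in for real descriptors such as 128-dimensional SIFT, and the function names, the number of iterations, and the random initialization are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers (the codebook)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: group every point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        for c, group in enumerate(groups):
            if group:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(group) for col in zip(*group)]
    return centers

# Toy "descriptors" (assumption): 2-D points scattered around three locations.
rng = random.Random(1)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
descriptors = [[bx + rng.uniform(-1, 1), by + rng.uniform(-1, 1)]
               for bx, by in blobs for _ in range(50)]
codebook = kmeans(descriptors, k=3)  # three visual words
```

Like any k-means run, the result depends on the initialization, so in practice the clustering is usually restarted several times and the best codebook kept.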

3.3.1 From clustering to vector quantization

Clustering is a common method for learning a visual vocabulary or codebook
- Unsupervised learning process
- Each cluster center produced by k-means becomes a codevector
- Provided the training set is representative, the codebook will be “universal”
The codebook is used for quantizing features
- A vector quantizer takes a feature vector and maps it to the index of the nearest codevector in a codebook
- Codebook = visual vocabulary
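- Codevector = visual word
  - In short: the cluster centers obtained by clustering are the codevectors we need, and the codevectors together form the codebook; any later image is then represented simply by quantizing its features against the codebook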
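A vector quantizer and the resulting bag-of-words histogram can be sketched as follows. The three-word codebook and the four toy features are made up for illustration, and the histogram is normalized to frequencies, which is one common choice rather than something these notes prescribe.

```python
def nearest_word(feature, codebook):
    """Vector quantizer: index of the codevector closest to `feature`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))

def bag_of_words(features, codebook):
    """Normalized histogram of visual-word frequencies for one image."""
    counts = [0] * len(codebook)
    for f in features:
        counts[nearest_word(f, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Hypothetical codebook with three 2-D visual words and four image features.
codebook = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
image_features = [[0.5, -0.2], [9.0, 1.0], [0.1, 0.3], [1.0, 8.5]]
histogram = bag_of_words(image_features, codebook)
print(histogram)  # [0.5, 0.25, 0.25]
```

Two features fall on word 0 and one each on words 1 and 2, so the image is represented by the 3-dimensional vector of word frequencies.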

3.3.2 Visual vocabularies

3.3.3 Visual vocabularies: Issues

How to choose vocabulary size?
- Too small: visual words not representative of all patches
- Too large: quantization artifacts, overfitting
Computational efficiency
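- Vocabulary trees (Nistér & Stewénius, 2006)

During training, build a tree-structured classification by calling k-means recursively; at test time, start with the coarse classification and descend the tree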

3.3.4 Large-scale image matching

Bag-of-words models have been useful in matching an image to a large database of object instances

3.3.5 Bags of features for object recognition

3.4 What about spatial information?

Using only the representation above discards the spatial information of the features

3.4.1 Spatial pyramids (Spatial Pyramid Matching)

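Divide the image into sub-regions, compute the features of each sub-region separately, and concatenate the per-region features into the complete feature vector.

Overview:

Suppose there are two point sets $X$ and $Y$ in which every point is $d$-dimensional (the space they live in is called the feature space below). Partition the feature space at scales $0, \ldots, L$: at scale $\ell$ each dimension is divided into $2^{\ell}$ cells, so the $d$-dimensional feature space contains $D=2^{d\ell}$ bins.
Two points, one from each set, that fall into the same bin are said to match. The number of matches in bin $i$ is defined as $\min(X_i, Y_i)$, where $X_i$ and $Y_i$ are the numbers of points from the two sets that fall into the $i$-th bin.
Let $\mathcal{I}^{\ell}$ be the total number of matches at scale $\ell$ (this equals the histogram intersection). Because the fine bins are contained in the coarse bins, the matches at scale $\ell$ include those already found at scale $\ell+1$; to avoid double counting, the new matches at scale $\ell$ are defined as the increment $\mathcal{I}^{\ell}-\mathcal{I}^{\ell+1}$.
Matches at different scales should get different weights: matches found in coarse (large) bins get a small weight and matches found in fine (small) bins get a large weight, so the weight at scale $\ell$ is defined as $\frac{1}{2^{L-\ell}}$.
Finally, the degree of match between the two point sets is defined as: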

$\begin{aligned} \kappa^{L}(X, Y) &=\mathcal{I}^{L}+\sum_{\ell=0}^{L-1} \frac{1}{2^{L-\ell}}\left(\mathcal{I}^{\ell}-\mathcal{I}^{\ell+1}\right) \\ &=\frac{1}{2^{L}} \mathcal{I}^{0}+\sum_{\ell=1}^{L} \frac{1}{2^{L-\ell+1}} \mathcal{I}^{\ell} \end{aligned}$
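As a sanity check on the two forms of $\kappa^{L}$, here is a small sketch for one-dimensional points in $[0, 1)$, where scale $\ell$ has $2^{\ell}$ bins. The point sets and function names are invented for the example.

```python
def intersection(points_x, points_y, level):
    """Histogram intersection I^l for 1-D points in [0, 1) with 2**level bins."""
    bins = 2 ** level
    hist_x = [0] * bins
    hist_y = [0] * bins
    for p in points_x:
        hist_x[min(int(p * bins), bins - 1)] += 1
    for p in points_y:
        hist_y[min(int(p * bins), bins - 1)] += 1
    return sum(min(a, b) for a, b in zip(hist_x, hist_y))

def pyramid_match(points_x, points_y, L):
    """Evaluate both forms of the kernel so they can be compared."""
    I = [intersection(points_x, points_y, l) for l in range(L + 1)]
    # First form: finest-level matches plus weighted new matches per level.
    first_form = I[L] + sum((I[l] - I[l + 1]) / 2 ** (L - l) for l in range(L))
    # Second form: a weighted sum of the per-level intersections.
    second_form = I[0] / 2 ** L + sum(I[l] / 2 ** (L - l + 1) for l in range(1, L + 1))
    return first_form, second_form

X = [0.1, 0.2, 0.6, 0.9]
Y = [0.1, 0.4, 0.7, 0.8]
first_form, second_form = pyramid_match(X, Y, L=2)
print(first_form, second_form)  # 3.5 3.5
```

Here $\mathcal{I}^{0}=\mathcal{I}^{1}=4$ and $\mathcal{I}^{2}=3$: three matches survive at the finest scale and one additional match appears only at scale 1, where it is counted with weight $\frac{1}{2}$.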

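One point deserves emphasis: the space that SPM partitions is not the feature space in which the two point sets above were described; as far as I can tell, the two are unrelated. The authors' SPM works as follows:
- Decompose the image space into bins at multiple scales by building a pyramid (informally, cut the image into square cells of different sizes)
- As in BoW, build a dictionary of size $M$, so that every feature can be mapped to one word of the dictionary. The dictionary is trained in feature space. The paper uses dense SIFT features.
- Count the occurrences of each word in each bin. The degree of match between two images is then defined as: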

$K^{L}(X, Y)=\sum_{m=1}^{M} \kappa^{L}\left(X_{m}, Y_{m}\right)$

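Note that for $L=0$ the model reduces to BoW.
SPM describes how to match two images. For scene classification, note that $K^{L}$ is the sum of $M(L+1)$ histogram intersections, which is the same as a single histogram intersection between two longer vectors. Each such vector is the concatenation of the (weighted) visual-word histograms of all sub-regions of the image, and it is the feature used for classification.
The authors' experiments show that, for different $L$, increasing $M$ from 200 to 400 has little effect on classification performance; in other words, the pyramid reduces the influence of the codebook size on the classification result.
The method can also serve as a template: the histogram computed in each sub-region can be of many kinds, from a simple color histogram to HOG, which gives PHOG. MATLAB code for SPM can be downloaded from the author's homepage. The spatial information used this way still has limitations: if the same image is rotated by 90 degrees, the match score drops considerably. The model therefore implicitly assumes that images are stored upright (people are standing, trees are standing, and so on). In addition, the pyramid's partition does not take the objects in the image into account (the objects are described only through SIFT features), a shortcoming the authors acknowledge in the paper. DPM presumably addresses this issue.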
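The "one long vector" view of SPM can be sketched as below. Keypoints are given as hypothetical `(x, y, word)` triples with image coordinates scaled to $[0, 1)$; the level weights follow the second form of $\kappa^{L}$, and all names are assumptions for the example.

```python
def spatial_pyramid_feature(keypoints, M, L):
    """keypoints: (x, y, word) with x, y in [0, 1) and word in range(M)."""
    feature = []
    for level in range(L + 1):
        cells = 2 ** level  # the image is cut into cells x cells sub-regions
        # Level weights from the second form of the pyramid match kernel.
        weight = 1 / 2 ** L if level == 0 else 1 / 2 ** (L - level + 1)
        hist = [0.0] * (cells * cells * M)
        for x, y, word in keypoints:
            col = min(int(x * cells), cells - 1)
            row = min(int(y * cells), cells - 1)
            hist[(row * cells + col) * M + word] += weight
        feature.extend(hist)  # concatenate the histograms of all sub-regions
    return feature

def histogram_intersection(a, b):
    return sum(min(p, q) for p, q in zip(a, b))

image_a = [(0.1, 0.1, 0), (0.8, 0.2, 1), (0.7, 0.9, 1)]
image_b = [(0.2, 0.2, 0), (0.9, 0.1, 1), (0.1, 0.8, 1)]
feat_a = spatial_pyramid_feature(image_a, M=2, L=1)
feat_b = spatial_pyramid_feature(image_b, M=2, L=1)
similarity = histogram_intersection(feat_a, feat_b)
print(len(feat_a), similarity)  # 10 2.5
```

With $M=2$ and $L=1$ the vector has $2 \times (1 + 4) = 10$ entries. The third keypoint of each image carries the same word but lies in a different quadrant, so it matches only at level 0; this is the spatial information that plain BoW ignores.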

4. “Classic” recognition pipeline

4.1 Recall: Many classifiers to choose from

K-nearest neighbor
SVM
Neural networks
Naive Bayes
Logistic regression
Randomized Forests
Etc.

4.2 Generalization

How well does a learned model generalize from the data it was trained on to a new test set?

4.2.1 Bias-Variance Trade-off

Models with too few parameters are inaccurate because of a large bias (not enough flexibility).

Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).

4.2.2 Bias versus variance

Components of generalization error
Bias: how much does the average model over all training sets differ from the true model?
- Error due to inaccurate assumptions/simplifications made by the model
Variance: how much do models estimated from different training sets differ from each other?

Underfitting:

model is too “simple” to represent all the relevant class characteristics
- High bias and low variance
- High training error and high test error

Overfitting:

model is too “complex” and fits irrelevant characteristics (noise) in the data
- Low bias and high variance
- Low training error and high test error

No classifier is inherently better than any other: you need to make assumptions to generalize
Errors
- Bias: due to over-simplifications
- Variance: due to inability to perfectly estimate parameters from limited data

4.2.3 How to reduce variance?

Choose a simpler classifier
Regularize the parameters
Get more training data

4.3 Remarks

Know your data:
- How much supervision do you have?
- How many training examples can you afford?
- How noisy?
Know your goal (i.e. task):
- Affects your choices of representation
- Affects your choices of learning algorithms
- Affects your choices of evaluation metrics
Understand the math behind each machine learning algorithm under consideration!

5. Object detection

5.1 From image classification to object detection

5.2 Window-based detection models

Building an object model
Given the representation, train a binary classifier

Scan the whole image with windows of different sizes

5.3 Window-based object detection: recap

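The basic idea is to generate many windows, run the learned classifier on the image content of each window, and use the classification score to decide whether the window contains the object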

Training:

Obtain training data
Define features
Define classifier

Given new image:

Slide window
Score by classifier
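The test-time half of this recap (slide a window, score it) can be sketched as follows. The "classifier" is a stand-in that returns mean brightness; a real detector would plug in a trained model, and the window size, stride, and threshold are arbitrary choices for the example.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield the top-left corner (x, y) of every window that fits in the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield x, y

def detect(image, win_w, win_h, stride, score_fn, threshold):
    """Score each window with `score_fn` and keep those at or above `threshold`."""
    image_h, image_w = len(image), len(image[0])
    detections = []
    for x, y in sliding_windows(image_w, image_h, win_w, win_h, stride):
        window = [row[x:x + win_w] for row in image[y:y + win_h]]
        score = score_fn(window)
        if score >= threshold:
            detections.append((x, y, win_w, win_h, score))
    return detections

# Toy "classifier" (hypothetical): mean brightness of the window.
def mean_brightness(window):
    return sum(map(sum, window)) / (len(window) * len(window[0]))

image = [[0] * 8 for _ in range(8)]
for r in range(2, 6):
    for c in range(4, 8):
        image[r][c] = 1  # a bright 4x4 "object"
detections = detect(image, 4, 4, 2, mean_brightness, threshold=0.9)
print(detections)  # [(4, 2, 4, 4, 1.0)]
```

To handle objects of different sizes, the same loop would be repeated with several window sizes or over a rescaled copy of the image.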

5.4 Challenges

Images may contain more than one class, multiple instances from the same class
Bounding box localization
- Localization accuracy affects classification
Evaluation
- Evaluation criteria differ

5.5 Object detection evaluation

At test time, predict bounding boxes, class labels, and confidence scores
For each detection, determine whether it is a true or false positive
- PASCAL criterion: Area(GT ∩ Det) / Area(GT ∪ Det) > 0.5
  - This ratio is the intersection over union ($IoU$)
- For multiple detections of the same ground truth box, only one considered a true positive
For each class, plot Recall-Precision curve and compute Average Precision (area under the curve)
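The true/false positive decision can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples. The matching rule here (visit detections by descending score, match each to the best still-unmatched ground-truth box) is a simplification for illustration and may differ in detail from the official PASCAL evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_detections(detections, ground_truth, threshold=0.5):
    """detections: list of (score, box). Returns [(score, is_true_positive)].
    Each ground-truth box can be claimed by at most one detection."""
    matched = set()
    labels = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best, best_iou = None, threshold
        for g, gt_box in enumerate(ground_truth):
            overlap = iou(box, gt_box)
            if g not in matched and overlap > best_iou:
                best, best_iou = g, overlap
        if best is not None:
            matched.add(best)
        labels.append((score, best is not None))
    return labels

ground_truth = [(0, 0, 10, 10)]
detections = [(0.9, (1, 1, 10, 10)), (0.8, (0, 0, 9, 9)), (0.3, (20, 20, 30, 30))]
labels = label_detections(detections, ground_truth)
print(labels)  # [(0.9, True), (0.8, False), (0.3, False)]
```

Both of the first two detections overlap the ground truth with IoU 0.81, but only the higher-scoring one counts as a true positive; the duplicate is a false positive.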

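Precision: no false detections
- Ensures that whatever is detected is correct

$\text{Precision}=\frac{TP}{TP+FP}=\frac{1}{1+\frac{FP}{TP}}$

Recall: no missed detections
- Ensures that no true instance is missed, for example when no defective sample may slip through
- The formula below is the fraction of all truly positive samples that are predicted positive, so maximizing recall pushes the model to find as many positive samples as possible

$\text{Recall}=\frac{TP}{TP+FN}$

The two must be traded off against each other

The area under the curve (AUC) is a single measure that reflects this trade-off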
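Precision, recall, and the area under their curve can be computed from a ranked list of detections as sketched below. The labels are made up, and the interpolation used for the area (replace each precision by the maximum precision at any higher rank) is one common convention, not necessarily the one a given benchmark uses.

```python
def precision_recall_curve(labels, num_ground_truth):
    """labels: [(score, is_true_positive)]; returns lists of precision and recall."""
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in sorted(labels, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))      # TP / (TP + FP)
        recalls.append(tp / num_ground_truth)  # TP / (TP + FN)
    return precisions, recalls

def average_precision(precisions, recalls):
    """Area under the precision-recall curve with interpolated precision."""
    ap, prev_recall = 0.0, 0.0
    for i, r in enumerate(recalls):
        ap += (r - prev_recall) * max(precisions[i:])
        prev_recall = r
    return ap

# Hypothetical ranked detections for a class with two ground-truth objects.
labels = [(0.9, True), (0.8, False), (0.7, True), (0.4, False)]
precisions, recalls = precision_recall_curve(labels, num_ground_truth=2)
ap = average_precision(precisions, recalls)
```

Walking down the ranking, recall can only grow while precision fluctuates, which is the trade-off described above: lowering the score threshold finds more objects at the cost of more false positives.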

6. Face detection

Slide a window across the image and evaluate a detection model at each location
- Thousands of windows to evaluate: efficiency and low false positive rates are essential
- Faces are rare: 0-10 per image
- A single image is unlikely to contain many faces

6.1 Viola-Jones face detector

6.2 Boosting intuition

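Complex features may need a complex curve to separate the classes
Is there a way to reduce the model complexity?
- Approximate the curve with several straight lines

Give the misclassified points a larger weight
- so that the next classifier gets them right

After several rounds
- we obtain several weak classifiers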

6.3 Boosting: training

Initially, weight each training example equally
In each boosting round:
- Find the weak learner that achieves the lowest weighted training error
- Raise weights of training examples misclassified by current weak learner
Compute final classifier as linear combination of all weak learners (weight of each learner is directly proportional to its accuracy)
Exact formulas for re-weighting and combining weak learners depend on the particular boosting scheme.

6.4 Viola-Jones face detector

Main idea:

Represent local texture with efficiently computable “rectangular” features within window of interest.
Select discriminative features to be weak classifiers
Use boosted combination of them as final classifier
Form a cascade of such classifiers, rejecting clear negatives quickly

6.5 Viola-Jones detector: features

6.5.1 “Rectangular” filters

Feature output is difference between adjacent regions
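- The sum of all pixel values on the left minus the sum of all pixel values on the right

The computation is expensive, because both the scale and the position of the box must be considered (iterating over all pixels)
These features are simple: sum all pixels in the white region and all pixels in the black region, then take the difference. For feature A in Figure 1, for example, first compute the two region sums $Sum(white)$ and $Sum(black)$,
then compute:

$feature=Sum(white)-Sum(black)$

To handle multiple scales, i.e. scanning windows of different sizes for faces of different sizes, the feature should be normalized. The final feature is:

$feature'=\frac{\text{feature}}{\text{pixel\_num}}$

where $\text{pixel\_num}$ is the number of pixels in the black (or white) region. With this normalization, the feature value at the corresponding position of a face stays roughly the same even when the scanning window size changes. As for why these features are called Haar-like: in the Haar wavelet, the Haar basis function is

$\psi(x) \equiv \begin{cases}1 & 0 \leq x<\frac{1}{2} \\ -1 & \frac{1}{2} \leq x<1 \\ 0 & \text{otherwise}\end{cases}$

If this one-dimensional function is extended to two dimensions, feature A above is what you get by multiplying a two-dimensional Haar basis function with the image pixel by pixel and summing. The other features can likewise be seen as two-dimensional extensions of the Haar basis function at different scales.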

6.5.2 Computing the integral image

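Integral image
- The sum of all pixel values above and to the left of a pixel, including the pixel itself

Cumulative row sum: $s(x, y)= s(x-1, y)+ i(x, y)$
- Each pass accumulates the row sum: $s(x,y)$ is the sum of the pixels in row $y$ up to and including column $x$, and $i(x,y)$ is the pixel value at the current position
Integral image: $ii(x, y)= ii(x, y-1)+ s(x, y)$
- $ii(x,y-1)$ is the integral image value in the previous row, and $s(x,y)$ is the cumulative row sum defined above

This way a single pass over the image yields the integral image value at every pixel, in $O(w\times l)$ time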
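The two recurrences translate directly into a single pass over the image. In this sketch the image is a list of rows indexed as `image[y][x]`, which is an indexing assumption made for the example.

```python
def integral_image(image):
    """image[y][x] -> ii[y][x] = sum of image over all rows <= y and columns <= x."""
    height, width = len(image), len(image[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        s = 0  # cumulative row sum s(x, y), reset at the start of every row
        for x in range(width):
            s += image[y][x]                               # s(x, y) = s(x-1, y) + i(x, y)
            ii[y][x] = (ii[y - 1][x] if y > 0 else 0) + s  # ii(x, y) = ii(x, y-1) + s(x, y)
    return ii

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
print(ii)  # [[1, 3, 6], [5, 12, 21], [12, 27, 45]]
```

The bottom-right entry, 45, is the sum of all nine pixels, and every entry was produced with two additions.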

6.5.3 Computing sum within a rectangle

Let $A,B,C,D$ be the values of the integral image at the corners of a rectangle
Then the sum of original image values within the rectangle can be computed as:
$sum = A-B-C+D$
Only 3 additions are required for any size of rectangle!
- Every rectangle needs only three additions/subtractions, which greatly reduces the computation
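A sketch of the four-corner lookup and of a two-rectangle Haar-like feature built on it. Two choices here are assumptions for the example: the integral image is padded with a zero row and column so that no boundary checks are needed, and the corners are named with $A$ at the top-left and $D$ at the bottom-right.

```python
def padded_integral_image(image):
    """ii[y][x] = sum of image rows < y and columns < x (extra zero row/column)."""
    height, width = len(image), len(image[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            ii[y + 1][x + 1] = image[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y)."""
    A = ii[y][x]          # top-left corner
    B = ii[y][x + w]      # top-right corner
    C = ii[y + h][x]      # bottom-left corner
    D = ii[y + h][x + w]  # bottom-right corner
    return A - B - C + D

def two_rect_feature(ii, x, y, w, h):
    """Left half minus right half of a (2w)-by-h window, divided by the
    number of pixels in one half (the pixel_num normalization)."""
    white = rect_sum(ii, x, y, w, h)
    black = rect_sum(ii, x + w, y, w, h)
    return (white - black) / (w * h)

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
ii = padded_integral_image(image)
print(rect_sum(ii, 1, 1, 2, 2))          # 34  (6 + 7 + 10 + 11)
print(two_rect_feature(ii, 0, 0, 2, 3))  # -2.0  ((33 - 45) / 6)
```

Once the integral image exists, the cost of a rectangle sum does not depend on the rectangle's size, which is what makes evaluating many features per window affordable.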

6.5.4 features

Too many features

Considering all possible filter parameters (position, scale, and type):
180,000+ possible features associated with each 24 x 24 window
Which subset of these features should we use to determine if a window has a face?
Use AdaBoost both to select the informative features and to form the classifier
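Finally, how are these features extracted from a frame? First choose the size of the detection window (it can be multi-scale, e.g. $24\times24$, $36\times36$, and so on). Taking $24\times24$ as an example, slide this window over the whole image and, at every position, extract a set of Haar-like features inside the window. The paper does not specify which feature sizes are extracted at which positions; this is fairly typical of Microsoft Research papers, which rarely give the complete framework. Many later researchers have studied this question, so the omission is not critical. In short, according to the paper, a $24\times24$ window yields roughly 160,000 feature dimensions
The paper also notes that although individual Haar-like features have weak representational power, the feature set is very high-dimensional and overcomplete, which compensates for that weakness. In addition, to detect faces of different sizes, earlier detection systems usually built an image pyramid from the input image (a set of copies ordered from large to small) and kept the detection window size fixed. Viola-Jones does the opposite: the input image size stays fixed and the detection window is rescaled. Rescaling the window is much cheaper than rescaling the image, which saves a great deal of time.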
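The order of magnitude of the feature count can be checked by enumeration. This sketch assumes the five upright base shapes commonly used for Haar-like features (two-rectangle horizontal and vertical, three-rectangle horizontal and vertical, and four-rectangle), each stretched by integer factors and placed at every position inside the window. Other feature sets or counting conventions give different totals, so this is not meant to reproduce the 180,000+ figure quoted above.

```python
def count_features(window, base_shapes):
    """Count every scaled and shifted copy of each base shape inside a square window."""
    total = 0
    for base_w, base_h in base_shapes:
        for w in range(base_w, window + 1, base_w):      # widths: multiples of base_w
            for h in range(base_h, window + 1, base_h):  # heights: multiples of base_h
                total += (window - w + 1) * (window - h + 1)  # possible positions
    return total

# (width, height) of the smallest instance of each feature type (assumed set).
base_shapes = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]
n_features = count_features(24, base_shapes)
print(n_features)  # 162336
```

Under these assumptions a $24\times24$ window holds 162,336 features, consistent with the "roughly 160,000" figure and far more than the 576 pixels in the window, which is why the set is called overcomplete.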

6.5.5 Viola-Jones detector: AdaBoost

Want to select the single rectangle feature and threshold that best separates positive (faces) and negative (non-faces) training examples, in terms of weighted error.
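- Intuitively, each window yields one feature value, which is used to train the corresponding classifier

In the next round, a different filter is used

Pseudocode: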

Given example images $\left(x_{1}, y_{1}\right), \ldots,\left(x_{n}, y_{n}\right)$ where $y_{i}=0,1$ for negative and positive examples respectively.
Initialize weights $w_{1, i}=\frac{1}{2 m}, \frac{1}{2 l}$ for $y_{i}=0,1$ respectively, where $m$ and $l$ are the number of negatives and positives respectively.
- Start with uniform weights on training examples
For $t=1, \ldots, T$ :
- Evaluate weighted error for each feature, pick best.
1. Normalize the weights,

$w_{t, i} \leftarrow \frac{w_{t, i}}{\sum_{j=1}^{n} w_{t, j}}$

so that $w_{t}$ is a probability distribution.

2. For each feature $j$, train a classifier $h_{j}$ which is restricted to using a single feature. The error is evaluated with respect to $w_{t}$,

$\epsilon_{j}=\sum_{i} w_{i}\left|h_{j}\left(x_{i}\right)-y_{i}\right|, \qquad h_{j}\left(x_{i}\right)=\begin{cases} 1 & \text{positive class} \\ 0 & \text{negative class} \end{cases}$

$\epsilon_{j}$ is the weighted error rate

3. Choose the classifier $h_{t}$ with the lowest error $\epsilon_{t}$.

Select the classifier with the lowest error rate as the weak classifier for this round

4. Update the weights:

Re-weight the examples:

Incorrectly classified -> more weight
Correctly classified -> less weight

$w_{t+1, i}=w_{t, i} \beta_{t}^{1-e_{i}}$

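$t$ is the boosting round and $i$ indexes the $i$-th training example

$e_{i}=\begin{cases} 0 & \text{if example } x_{i} \text{ is classified correctly} \\ 1 & \text{otherwise} \end{cases}, \qquad \beta_{t}=\frac{\epsilon_{t}}{1-\epsilon_{t}}$

Since $\beta_{t}<1$ whenever $\epsilon_{t}<\frac{1}{2}$, correctly classified examples get a smaller weight while misclassified ones keep theirs, so after normalization the misclassified examples weigh relatively more
- The next round can then focus on the examples that are still misclassified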
The final strong classifier is:

$h(x)= \begin{cases}1 & \sum_{t=1}^{T} \alpha_{t} h_{t}(x) \geq \frac{1}{2} \sum_{t=1}^{T} \alpha_{t} \\ 0 & \text{otherwise}\end{cases}, \qquad \alpha_{t}=\log \frac{1}{\beta_{t}}=\log \frac{1-\epsilon_{t}}{\epsilon_{t}}, \qquad \beta_{t}=\frac{\epsilon_{t}}{1-\epsilon_{t}}$

Final classifier is a combination of the weak ones, weighted according to the error they had.
- The weak classifiers are combined into a strong classifier
- The lower the error rate, the higher the weight
$\epsilon_{t}$ is the error rate
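The boosting loop can be sketched on toy data as follows. Each "feature" is just a list of numbers (one value per example) rather than a rectangle filter response, and the weak classifier is a threshold with a polarity; these choices, the function names, and the clamping of $\epsilon_t$ to avoid $\log 0$ are assumptions for the example, not details taken from the notes.

```python
import math

def train_stump(values, labels, weights):
    """Best threshold/polarity for one feature; returns (error, threshold, polarity)."""
    best = (float("inf"), 0.0, 1)
    for threshold in sorted(set(values)):
        for polarity in (1, -1):
            error = sum(w for v, y, w in zip(values, labels, weights)
                        if (1 if polarity * v < polarity * threshold else 0) != y)
            if error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(features, labels, rounds):
    """features[j][i] is the value of feature j on example i; labels are 0 or 1."""
    m, l = labels.count(0), labels.count(1)
    weights = [1 / (2 * m) if y == 0 else 1 / (2 * l) for y in labels]
    strong = []  # list of (alpha, feature index, threshold, polarity)
    for _ in range(rounds):
        total = sum(weights)
        weights = [w / total for w in weights]                        # 1. normalize
        stumps = [train_stump(f, labels, weights) for f in features]  # 2. one stump per feature
        j = min(range(len(features)), key=lambda k: stumps[k][0])     # 3. lowest error
        error, threshold, polarity = stumps[j]
        error = min(max(error, 1e-10), 1 - 1e-10)                     # guard against log(0)
        beta = error / (1 - error)
        for i, v in enumerate(features[j]):                           # 4. re-weight
            predicted = 1 if polarity * v < polarity * threshold else 0
            if predicted == labels[i]:                                # e_i = 0: shrink weight
                weights[i] *= beta
        strong.append((math.log(1 / beta), j, threshold, polarity))
    return strong

def predict(strong, x):
    """x[j] is the value of feature j for one example."""
    votes = sum(a for a, j, t, p in strong if p * x[j] < p * t)
    return 1 if votes >= 0.5 * sum(a for a, _, _, _ in strong) else 0

# Toy data: feature 0 separates the classes, feature 1 does not.
features = [[1, 2, 3, 4, 5, 6], [5, 1, 4, 2, 6, 3]]
labels = [0, 0, 0, 1, 1, 1]
strong = adaboost(features, labels, rounds=3)
predictions = [predict(strong, [a, b]) for a, b in zip(*features)]
```

On this toy data every round picks feature 0, because a single threshold on it already separates the classes. With real rectangle features no single stump is perfect, and the re-weighting steers later rounds toward the examples that earlier stumps got wrong.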
Even if the filters are fast to compute, each new image has a lot of possible windows to search.
How to make the detection more efficient?

6.5.6 Cascading classifiers for detection

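First, avoid missed detections as far as possible, i.e. accept a low precision at the start, then check again in the following stages and raise the precision step by step
- This continues until only a small set of candidate windows is left, which is then classified with many more features, i.e. with more weak classifiers
Form a cascade with low false negative rates early on
Apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative
- The first stage is built from a single feature with a low accuracy requirement, so that almost no face is missed; the windows that survive are passed on to later stages, which use more features and are more accurate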
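The early-rejection logic of a cascade can be sketched like this. The two stages are toy scoring functions on a flat list of pixel values, standing in for boosted classifiers of increasing size; their thresholds are arbitrary.

```python
def cascade_predict(stages, window):
    """stages: list of (score_fn, threshold), cheapest first.
    Returns (is_face, number_of_stages_evaluated)."""
    for n, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            return False, n  # rejected early; later stages never run
    return True, len(stages)

# Hypothetical stages on a toy "window" (a flat list of pixel values).
stages = [
    (lambda w: max(w), 0.5),           # cheap: is anything bright at all?
    (lambda w: sum(w) / len(w), 0.6),  # costlier: is it bright on average?
]

print(cascade_predict(stages, [0.1, 0.2, 0.1]))  # (False, 1)
print(cascade_predict(stages, [0.9, 0.1, 0.1]))  # (False, 2)
print(cascade_predict(stages, [0.9, 0.8, 0.7]))  # (True, 2)
```

Because most windows in an image are clear negatives, most of them leave at the first stage, and the average cost per window stays close to the cost of the cheapest classifier.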

6.5.7 How to train the cascade of classifiers

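The paper gives a simple but effective method.

1. The user chooses, for each stage, the maximum acceptable false positive rate $f$ and the minimum acceptable detection rate $d$.

2. The user chooses the overall target false positive rate $FPR_{target}$.

3. Initialize: overall $FPR=1$ and $D=1$, where $D$ is the overall detection rate; stage counter $i=0$.

4. Loop: while the current $FPR$ is greater than $FPR_{target}$, add another stage of AdaBoost classifier.

5. When training the current stage, keep adding features as long as the stage does not yet meet its false positive requirement $f_i$. Each time a feature is added, keep lowering the stage's decision threshold until the stage's detection rate satisfies $d_i>d$, then update $D_i=d_i\times D_{i-1}$.

6. Note that the negative training examples of each stage are the ones the previous stages misclassified, i.e. the false positives. This makes every subsequent stage concentrate on the harder (more easily misclassified) examples. Finally, if several overlapping face locations are detected, the detection rectangles are merged.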
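Because the overall rates are products of the per-stage rates, the number of stages needed for a given target can be estimated with a few lines of arithmetic. The per-stage values $f=0.5$ and $d=0.99$ and the target $0.001$ are hypothetical, and the sketch assumes every stage exactly meets its rates.

```python
def stages_needed(f, d, fpr_target):
    """Count stages until the overall false positive rate reaches the target,
    assuming every stage exactly meets its per-stage rates f and d."""
    fpr, detection, n = 1.0, 1.0, 0
    while fpr > fpr_target:
        n += 1
        fpr *= f        # overall FPR is the product of the stage FPRs
        detection *= d  # D_i = d_i * D_{i-1}
    return n, fpr, detection

n, fpr, detection = stages_needed(f=0.5, d=0.99, fpr_target=0.001)
print(n, fpr)  # 10 0.0009765625
```

Ten such stages push the false positive rate below one in a thousand while the detection rate only falls to $0.99^{10} \approx 0.904$, which is why each stage can afford to be a weak filter as long as it almost never rejects a face.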

The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances. Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.

Every stage is an AdaBoost classifier: the first stage has a single weak classifier, the second stage has two weak classifiers, and so on; each weak classifier uses a single feature