CNN
1. Why CNN for Image?
- Too many parameters in a fully connected network; we can reduce them via local connections, parameter sharing, and convolution
- Some patterns are much smaller than the whole image
- A neuron does not need to see the whole image, only a small region of it
- The same patterns appear in different regions.
- Parameters can be shared across locations for patterns that recur
- Subsampling the pixels will not change the object
- Subsampling does not change the semantics of the image
- We can subsample the pixels to make image smaller
- Fewer parameters for the network to process the image
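A quick back-of-the-envelope comparison makes the parameter savings concrete (the layer sizes here are hypothetical, chosen only for illustration):

```python
# Hypothetical sizes: a 100x100 grayscale image feeding 100 hidden units,
# versus a single shared 3x3 convolution filter.
H, W = 100, 100
fc_params = (H * W) * 100  # fully connected: every pixel connects to every neuron
conv_params = 3 * 3        # one 3x3 filter, shared across all image locations
```

The fully connected layer needs a million weights; the shared filter needs nine.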
1.2 The whole CNN
- Convolution is a weighted sum over a local region; since the same-size filter is applied everywhere, its parameters are shared
- Max pooling amounts to subsampling
1.3 CNN Convolution
- Since the filter for each output channel learns one pattern and is applied across the whole image, this amounts to parameter sharing
- Do the same process for every filter
- CNN Zero Padding
- CNN Colorful image
- Convolution v.s. Fully Connected
- Greatly reduces the number of parameters
- Producing multiple feature maps adds more nonlinear transforms, strengthening the network's representational capacity
- CNN Max Pooling
- Adds nonlinearity and reduces the number of parameters
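A minimal pure-Python sketch of the two operations just described, a valid convolution with one shared filter followed by 2x2 max pooling (the toy image and filter are made up):

```python
def conv2d(img, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in CNN layers)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    return [[sum(img[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool2x2(fmap):
    """2x2 max pooling with stride 2 (subsampling); assumes even dimensions."""
    return [[max(fmap[2 * i][2 * j], fmap[2 * i][2 * j + 1],
                 fmap[2 * i + 1][2 * j], fmap[2 * i + 1][2 * j + 1])
             for j in range(len(fmap[0]) // 2)] for i in range(len(fmap) // 2)]

img = [[1] * 6 for _ in range(6)]                 # toy 6x6 image
kernel = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]        # toy 3x3 diagonal filter
fmap = conv2d(img, kernel)                        # 4x4 feature map
small = max_pool2x2(fmap)                         # 2x2 map after subsampling
```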
Flatten
Convolutional Neural Network
What does CNN learn?
What is the essential difficulty?
- Deep learning bridges the semantic gap: it extracts high-level semantic patterns that are robust to illumination, rotation, and similar variations
What can CNN do for computer vision?
Before deep learning was born
Feature extraction example #1
Feature name: Local Binary Pattern (LBP)
Use the center pixel value to threshold the 3x3 neighborhood
The result is a binary number
The histogram of these labels is used as a texture descriptor
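The three LBP steps above can be sketched in a few lines (the clockwise neighbor ordering used here is one common convention; others exist):

```python
from collections import Counter

def lbp_code(patch):
    """LBP of a 3x3 patch: threshold the 8 neighbors by the center pixel,
    read the bits clockwise from the top-left, and return the binary
    number as an integer label in [0, 255]."""
    c = patch[1][1]
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if patch[i][j] >= c else 0 for i, j in order]
    return sum(b << k for k, b in enumerate(reversed(bits)))

def lbp_histogram(img):
    """Texture descriptor: histogram of LBP labels over all interior pixels."""
    h, w = len(img), len(img[0])
    codes = [lbp_code([row[j - 1:j + 2] for row in img[i - 1:i + 2]])
             for i in range(1, h - 1) for j in range(1, w - 1)]
    return Counter(codes)
```

A flat patch gives all-ones bits (label 255); a bright center surrounded by darker pixels gives label 0.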
Feature extraction example #2
Feature name: Scale invariant feature transform (SIFT)
Divide the 16x16 window into a 4x4 grid of cells
Compute an orientation histogram for each cell
16 cells x 8 orientations = 128 dimensional descriptor
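A toy version of the descriptor construction described above; real SIFT additionally applies Gaussian weighting, trilinear interpolation, and normalization, all omitted here:

```python
import math

def sift_like_descriptor(gx, gy):
    """Toy SIFT-style descriptor for a 16x16 window of image gradients
    (gx, gy): a 4x4 grid of 4x4-pixel cells, 8 orientation bins per cell,
    giving 16 cells x 8 orientations = a 128-dimensional vector."""
    desc = []
    for cy in range(4):
        for cx in range(4):
            hist = [0.0] * 8
            for i in range(4):
                for j in range(4):
                    y, x = 4 * cy + i, 4 * cx + j
                    mag = math.hypot(gx[y][x], gy[y][x])
                    ang = math.atan2(gy[y][x], gx[y][x]) % (2 * math.pi)
                    hist[int(ang / (2 * math.pi) * 8) % 8] += mag
            desc.extend(hist)
    return desc
```

For a window whose gradients all point right with unit magnitude, each cell puts its 16 votes into bin 0.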
What’s wrong with traditional features?
Image classification with deep learning
Four typical image classification nets
- AlexNet
Characters of AlexNet
Trained on two GPUs
Data augmentation
Clipping / flipping / …
Using ReLU rather than the sigmoid function
Overlapped pooling
Dropout in fully connected layers
- VGG
- Q: Why use smaller filters? (3x3 conv)
- A large filter can be decomposed into stacked small ones: the network gets deeper, keeps the same receptive field with fewer parameters, and runs faster
- Two stacked 3x3 convs have the same receptive field as one 5x5; three are equivalent to one 7x7 (receptive field = 2 x 3 + 1 = 7)
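With stride 1, each extra k x k conv layer grows the receptive field by k - 1, so the equivalences above can be checked directly:

```python
def receptive_field(n_layers, k=3):
    """Receptive field of n stacked k x k convs with stride 1: 1 + n * (k - 1)."""
    return 1 + n_layers * (k - 1)

# Weight counts for C input/output channels (biases ignored):
# two 3x3 layers:   2 * 9 * C*C = 18*C^2   vs. one 5x5 layer: 25*C^2
# three 3x3 layers: 3 * 9 * C*C = 27*C^2   vs. one 7x7 layer: 49*C^2
```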
- GoogLeNet
- Richer multi-scale information, preventing loss of information
- More patterns learned at each layer
- Why not going much deeper?
- ResNet
- The learned mapping becomes the residual: the branch fits F(x) = H(x) - x rather than the target H(x) directly
- Bottleneck residual block: reduces the amount of computation
- Benefits: keeps forward information propagation smooth and keeps backward gradient flow stable
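A minimal sketch of the residual connection: the branch learns only F(x), and the identity shortcut guarantees that information (and gradients) can always pass through unchanged.

```python
def residual_block(x, branch):
    """y = F(x) + x, elementwise, with an identity shortcut."""
    return [xi + fi for xi, fi in zip(x, branch(x))]

# If the branch outputs zeros, the block degenerates to the identity mapping,
# so adding depth cannot, in principle, make the representation worse:
y = residual_block([1.0, 2.0, 3.0], lambda x: [0.0] * len(x))
```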
From classification to segmentation
- Converting the segmentation problem to classification
- Cut a small patch (window) around each pixel and classify it with convolutions
- Enumerate all sliding windows and run convolutional classification on each
- This leads to an explosion in parameters and computation
Downsampling and Upsampling
First downsample so that there are fewer parameters, then upsample to recover the per-pixel classification result
Review: Unpooling
- Remember the positions of the maxima during pooling; during unpooling, put each value back at its remembered max position and set everything else to zero
- Review: Deconvolution
- Simply pad with zeros, then convolve
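The unpooling step above can be sketched as a pooling operator that records argmax positions plus an inverse that scatters values back:

```python
def max_pool_with_indices(fmap):
    """2x2 max pooling (stride 2) that also records where each max came from.
    Assumes even height and width."""
    h, w = len(fmap), len(fmap[0])
    pooled, idx = [], []
    for i in range(0, h, 2):
        prow, irow = [], []
        for j in range(0, w, 2):
            val, pos = max((fmap[i + di][j + dj], (i + di, j + dj))
                           for di in range(2) for dj in range(2))
            prow.append(val)
            irow.append(pos)
        pooled.append(prow)
        idx.append(irow)
    return pooled, idx

def unpool(pooled, idx, h, w):
    """Unpooling: put each value back at its remembered max position;
    every other entry stays zero."""
    out = [[0] * w for _ in range(h)]
    for prow, irow in zip(pooled, idx):
        for val, (i, j) in zip(prow, irow):
            out[i][j] = val
    return out
```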
Object detection
- Predict bounding boxes, class labels, and confidence scores
- For each detection, determine whether it is true or false
Basic idea to detection: Sliding windows
- Design decision criteria to keep only the boxes with the highest confidence
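One standard such criterion is non-maximum suppression (NMS): greedily keep the highest-scoring box and drop any remaining box that overlaps it too much. A minimal sketch:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box; suppress others with IoU above thresh."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is dropped
```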
R-CNN: Region proposals + CNN features
- Region-based Convolutional Neural Network (R-CNN)
- Fast R-CNN
RoI pooling goal
“Crop and resample” a fixed size feature representing a region of interest out of the feature map
Use nearest-neighbor interpolation of coordinates, then max pooling
Map each candidate box from the original image onto the feature map
- For each RoI, the network predicts probabilities for c+1 classes (including background) and four bounding-box offsets for each of the c classes
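A sketch of RoI pooling under the description above, with the RoI coordinates already mapped onto the feature map and rounded to integers:

```python
def roi_pool(fmap, roi, out_size=2):
    """'Crop and resample': max-pool the region (x0, y0, x1, y1) of a 2-D
    feature map into a fixed out_size x out_size grid. Assumes the RoI is
    at least out_size pixels in each dimension."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            ys = range(y0 + i * h // out_size, y0 + (i + 1) * h // out_size)
            xs = range(x0 + j * w // out_size, x0 + (j + 1) * w // out_size)
            row.append(max(fmap[y][x] for y in ys for x in xs))
        out.append(row)
    return out
```

Whatever the RoI size, the output is always out_size x out_size, so a fully connected head can follow.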
Fast R-CNN training
Bounding box regression
Faster R-CNN
- Slide a small window (3x3) over the conv5 layer
- Predict object/no object
- Regress bounding box coordinates with reference to anchors (3 scales x 3 aspect ratios)
- Initially every candidate box is adjusted, but in the end the loss is computed only on a subset of the boxes
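Anchor generation at one sliding-window position might look like this; the base size and exact scales are illustrative, roughly following the 3 scales x 3 aspect ratios scheme above:

```python
import math

def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """3 scales x 3 aspect ratios = 9 anchors centered at (0, 0), each as
    (x0, y0, x1, y1); every anchor keeps area = (base * scale)^2."""
    anchors = []
    for s in scales:
        area = float(base * s) ** 2
        for r in ratios:  # r = height / width
            w = math.sqrt(area / r)
            h = w * r
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors
```

The RPN then regresses offsets relative to each of these 9 reference boxes.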
YOLO
Streamlined detection architectures
- The Faster R-CNN pipeline separates proposal generation and region classification:
- Is it possible to do detection in one shot?
- Idea: No bounding box proposals. Predict a class and a box for every location in a grid.
- Divide the image into 7x7 cells.
- Each cell trains a detector.
- The detector needs to predict the object’s class distributions.
The detector also predicts bounding boxes and confidence scores.
Objective function
- Square roots of width and height are regressed so that size errors on small boxes have a larger influence on the loss
- Localization accuracy suffers compared to Fast(er) R-CNN due to coarser features and errors on small boxes
- 7x speedup over Faster R-CNN (45-155 FPS vs. 7-18 FPS)
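To weight small boxes more heavily, YOLO regresses sqrt(w) and sqrt(h) rather than w and h directly; a quick numerical check shows the effect (box sizes here are made up):

```python
import math

def size_term(pred_wh, true_wh):
    """YOLO-style size loss term: squared error on sqrt(w) and sqrt(h)."""
    return sum((math.sqrt(p) - math.sqrt(t)) ** 2
               for p, t in zip(pred_wh, true_wh))

# The same +5 pixel error hurts more on a 10-pixel box than on a 100-pixel box:
small_box_loss = size_term([15, 15], [10, 10])
big_box_loss = size_term([105, 105], [100, 100])
```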