Transfer learning for CV and self-supervised learning
1. Transfer learning for CV


1.1 Why?

1.2 Traditional vs. Transfer Learning

Learn some shared knowledge from the source data, then fine-tune it for the target task.
Traditional machine learning:
- learn a separate system for each task
 
Transfer learning:
- transfer the knowledge from the source model to the target task
 
Task description
- Source data: $(x^s, y^s)$, available in large amounts
 - Target data: $(x^t, y^t)$, only very little available
- One-shot learning: only a few examples in target domain
 
 - Example: (supervised) speaker adaptation
- Source data: audio data and transcriptions from many speakers
 - Target data: audio data and transcriptions from a specific user
 
 - Idea: train a model on the source data, then fine-tune it on the target data
- Challenge: only limited target data is available, so be careful about overfitting
 
 
1.3 Conservative Training

- Fine-tune with a very small learning rate, so the fine-tuned model stays close to the source model (a minimal sketch follows)
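A minimal PyTorch sketch of conservative fine-tuning, assuming a torchvision ResNet-18 pre-trained on ImageNet as the source model and a 10-class target task (both are illustrative assumptions):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Source model pre-trained on a large dataset (ImageNet, as an example).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)  # new head for the (assumed) 10-class target task

# Conservative training: a very small learning rate for the transferred layers,
# and a normal learning rate only for the freshly initialized head.
optimizer = optim.SGD(
    [
        {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")],
         "lr": 1e-4},  # stay close to the source model
        {"params": model.fc.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)
```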
 
1.4 Layer Transfer

- Which layer can be transferred (copied)?
- Speech: usually copy the last few layers
 - Image: usually copy the first few layers (see the sketch below)
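A minimal sketch of layer transfer for an image model, again assuming a pre-trained torchvision ResNet-18: the first few (general) layers are copied and frozen, and only the later layers plus a new head are trained on the target data.

```python
import torch.nn as nn
from torchvision import models

# Copy all weights from the source model, then freeze the first few layers.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)  # new head for the target task (10 classes assumed)

# For images, the earlier layers learn general features (edges, textures),
# so they are transferred as-is and kept frozen.
for module in (model.conv1, model.bn1, model.layer1, model.layer2):
    for p in module.parameters():
        p.requires_grad = False

# Only the remaining, more task-specific layers are updated on the target data.
trainable_params = [p for p in model.parameters() if p.requires_grad]
```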
 
 

1.5 Neural Network Layers: General to Specific
- Bottom/first/earlier layers: general learners
- Low-level notions of edges, visual shapes
 
 - Top/last/later layers: specific learners
- High-level features such as eyes, feathers
 
 
1.6 Multitask Learning
- The multi-layer structure makes neural networks well suited for multitask learning
- If the tasks are related, part of the parameters (typically the lower layers) can be shared across tasks (a sketch follows)
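A minimal sketch of this kind of hard parameter sharing, with an illustrative tiny backbone and two made-up tasks A and B:

```python
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Hard parameter sharing: one shared feature extractor, one head per task."""

    def __init__(self, num_classes_a=10, num_classes_b=5):
        super().__init__()
        self.shared = nn.Sequential(          # parameters shared by both tasks
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head_a = nn.Linear(32, num_classes_a)  # task-specific parameters
        self.head_b = nn.Linear(32, num_classes_b)

    def forward(self, x):
        h = self.shared(x)                    # shared representation
        return self.head_a(h), self.head_b(h)
```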
 
 

1.7 Progressive Neural Networks

- Does not rely on the tasks being related
 - Only shares features (previously trained networks are reused through lateral connections); parameters are not shared across tasks
 
2. Domain-adversarial training
2.1 Task description: domain adaptation

- How to remove the domain shift?
 - How to bridge the domain gap?
 - The domain can be a general concept:
- Datasets: transfer from an “easy” dataset to a “hard” one
 - Modalities: transfer from RGB to depth, infrared images, point clouds, etc.
 
 

 

2.2 Discrepancy-based approaches
We want the source and target feature distributions to be as close as possible, so that a model trained on the source data transfers to the target data.
Idea: minimize the domain distance in a feature space
- Existing works focus on designing a reasonable distance measure
 

- Example: metric-learning-based approaches (a commonly used discrepancy is sketched below)
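One widely used distance for this purpose is the maximum mean discrepancy (MMD). Below is a minimal sketch with an RBF kernel; the bandwidth `sigma` and the use of mini-batch features are illustrative assumptions, not a prescribed design.

```python
import torch

def rbf_mmd(source_feat, target_feat, sigma=1.0):
    """Squared MMD between source and target feature batches (shapes (N, D) and (M, D))."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)             # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))  # RBF kernel

    k_ss = kernel(source_feat, source_feat).mean()
    k_tt = kernel(target_feat, target_feat).mean()
    k_st = kernel(source_feat, target_feat).mean()
    return k_ss + k_tt - 2 * k_st  # minimized (jointly with the task loss) to align the domains
```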
 
2.3 Adversarial-based approaches

 - We want to find a feature space in which the source-domain and target-domain features are mixed together and cannot be told apart
 
2.4 Adversarial-based approaches
- Method 1: Domain-adversarial training
 


- Unlike a GAN, whose discriminator is trained to separate fake data from real data, domain-adversarial training wants the domain classifier to be unable to tell the domains apart
 

- Therefore, the gradient flowing from the domain classifier back to the feature extractor cannot be used for plain gradient descent; it is reversed (gradient reversal), so the feature extractor learns domain-confusing features (see the sketch below)
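A minimal sketch of a gradient reversal layer in PyTorch: the forward pass is the identity, while the backward pass flips (and scales) the gradient, so the feature extractor is updated to make the domains indistinguishable. The scaling factor `lambd` and the usage line are illustrative.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows back into the feature extractor; no gradient for lambd.
        return -ctx.lambd * grad_output, None

# Usage (hypothetical modules): domain_logits = domain_classifier(GradReverse.apply(features))
```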
 

- Method 2: GAN-based methods
 

2.5 Reconstruction-based approaches

- The data reconstruction of source or target samples is an auxiliary task that simultaneously focuses on creating a shared representation between the two domains and keeping the individual characteristics of each domain.
 
2.6 Knowledge distillation
- Distill the knowledge from a larger deep neural network into a small network
 

- Response-based knowledge
- Use the neural response of the last output layer of the teacher model to transfer.
 - Directly mimic the final prediction of the teacher model.
 - Simple yet effective
 
 

The closer the small (student) network's predictions are to the large (teacher) network's predictions, the better; a sketch of such a loss follows.
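A minimal sketch of a response-based distillation loss: the student mimics the teacher's softened predictions (KL divergence with temperature `T`) on top of the usual cross-entropy on ground-truth labels. The temperature and the `alpha` weighting are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based KD: match the teacher's softened predictions + normal supervised loss."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)   # standard cross-entropy on hard labels
    return alpha * kd + (1 - alpha) * ce
```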
- Feature-based knowledge
- Extend the transfer point from the last layer to intermediate layers
 - A good extension of response-based knowledge, especially for the training of thinner and deeper networks.
 - Generalize feature maps to attention maps
 

- Relation-based knowledge
- Both response-based and feature-based knowledge use the outputs of specific layers in the teacher model.
 - Relation-based knowledge further explores the relationships between different layers or data samples.
 
 - Considers the distributions of different features
 

- Extension: Cross-modal distillation
- The data or labels for some modalities might not be available during training or testing
 
 

3. Self-supervised learning
3.1 Motivation
Recall the idea of transfer learning: start with a general-purpose feature representation pre-trained on a large, diverse dataset and adapt it to specialized tasks
Challenge: overcoming reliance on supervised pre-training


3.2 Self-supervised pretext tasks
- Self-supervised learning methods solve “pretext” tasks that produce good features for downstream tasks.
- Learn with supervised learning objectives, e.g., classification, regression.
 - Labels of these pretext tasks are generated automatically
 
 - Example: learn to predict image transformations / complete corrupted images
 

3.2.1 Self-supervised learning workflow (I)

- Learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations
 
3.2.2 Self-supervised learning workflow (II)

- Attach a shallow network on top of the feature extractor; train the shallow network on the target task with a small amount of labeled data
 - Evaluate the learned feature encoders on downstream target tasks
 
3.2.3 Self-supervised vs. unsupervised learning
- The terms are sometimes used interchangeably in the literature, but self-supervised learning is a particular kind of unsupervised learning
 Unsupervised learning: any kind of learning without labels
- Clustering and quantization
 - Dimensionality reduction, manifold learning
 - Density estimation
… 
Self-supervised learning: the learner “makes up” labels from the data and then solves a supervised task
3.3 Self-supervised vs. generative learning
Both aim to learn from data without manual label annotation.
Generative learning aims to model data distribution, e.g., generating realistic images.
- Aims to generate images as close to real ones as possible, and therefore pays more attention to low-level details
 
Self-supervised learning aims to learn high-level semantic features with pretext tasks
- Learns only high-level semantic information
 
3.4 Types of self-supervised learning
- Predicting masked regions, predicting colors (colorization), predicting the future
 

- Predicting the arrangement of shuffled patches (jigsaw puzzles)
 

- Contrastive learning
 

3.5 Self-Supervision as data prediction
3.5.1 Colorization

The inherent ambiguity of colors has to be taken into account: as long as a predicted colorization could plausibly occur in nature, it should not be counted as wrong.
Colorization: Training data generation
- Convert the data to grayscale (the lightness channel) to obtain the network input
 

- Use the ab (color) channels as the supervision signal
 - Quantize the ab space and predict a distribution over the bins, which accounts for the color ambiguity (see the sketch below)
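A minimal sketch of generating one training pair for this pretext task, assuming scikit-image for the Lab conversion; the bin size of 10 and the offset of 110 are illustrative choices, not the values from any particular paper.

```python
import numpy as np
from skimage import color  # assumption: scikit-image is available

def make_colorization_pair(rgb, bin_size=10):
    """Turn an RGB image (H x W x 3) into an (input, target) pair.
    Input : the L (lightness) channel, i.e. the grayscale version of the image.
    Target: a per-pixel class index over a quantized ab grid, so the network
            predicts a distribution over plausible colors instead of one value."""
    lab = color.rgb2lab(rgb)                 # L in [0, 100], a/b roughly in [-110, 110]
    L, ab = lab[..., :1], lab[..., 1:]
    bins = np.floor((ab + 110.0) / bin_size).astype(np.int64)
    n_bins = int(np.ceil(220.0 / bin_size))  # bins per ab axis
    target = bins[..., 0] * n_bins + bins[..., 1]   # single class index per pixel
    return L, target
```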
 

3.6 Self-supervision by transformation prediction
- Pretext task: randomly sample a patch and one of its 8 neighbors, then guess the spatial relationship between the two patches
 


3.6.1 Context prediction: Details
- Leave a gap between the patches when cropping, to prevent the network from learning to exploit low-level cues along the patch edges
 

3.6.2 Jigsaw puzzle solving

- Instead of predicting a single relative position, the task considers the overall ordering (permutation) of all nine patches
 
Details
- To prevent overfitting, only 64 permutations are considered, chosen so that their pairwise Hamming distance is large
 

3.6.3 Rotation prediction
- Pretext task: recognize image rotation (0, 90, 180, 270 degrees)
 

- During training, feed all four rotated versions of each image into the same mini-batch (see the sketch below)
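A minimal sketch of building such a mini-batch in PyTorch; the tensor layout and function name are illustrative.

```python
import torch

def make_rotation_batch(images):
    """images: (B, C, H, W). Returns all four rotations of every image together with
    the rotation class (0: 0 deg, 1: 90 deg, 2: 180 deg, 3: 270 deg) as pseudo-labels."""
    rotated = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    x = torch.cat(rotated, dim=0)                           # (4B, C, H, W)
    y = torch.arange(4).repeat_interleave(images.size(0))   # (4B,) rotation labels
    return x, y
```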
 

3.7 Contrastive methods
- Encourage representations of transformed versions of the same image to be the same and different images to be different
 - Representations of the same image (under different transformations) should be as similar as possible, and representations of different images as dissimilar as possible
 
 

 

- Given: query point $x$, positive samples $x^{+}$, negative samples $x^{-}$
- Positives are typically transformed versions of $x$, negatives are random examples from the same mini-batch or memory bank
 
 - Key idea: learn a representation that makes $x$ similar to $x^{+}$ and dissimilar from $x^{-}$ (similarity is measured by the dot product of normalized features)
 - Given 1 positive sample and $N-1$ negative samples, the contrastive (InfoNCE) loss is

$$
\mathcal{L} = -\log \frac{\exp\left(\mathrm{sim}(x, x^{+}) / \tau\right)}{\exp\left(\mathrm{sim}(x, x^{+}) / \tau\right) + \sum_{j=1}^{N-1} \exp\left(\mathrm{sim}(x, x_{j}^{-}) / \tau\right)}
$$

- This is exactly the cross-entropy loss of an $N$-way softmax classifier that tries to pick out the positive sample among the $N$ candidates.
 - $\tau$ is the temperature hyperparameter; it determines how concentrated the softmax is
 - A smaller temperature makes the softmax sharper, i.e. the prediction is more concentrated (a sketch of the loss follows)
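A minimal sketch of this loss for a single query in PyTorch, assuming the features are already L2-normalized so that the dot product is the similarity; the tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, negs, tau=0.1):
    """InfoNCE-style loss for one query.
    q: (D,) query feature; pos: (D,) positive; negs: (N-1, D) negatives.
    Features are assumed to be L2-normalized, so dot product = similarity."""
    candidates = torch.cat([pos.unsqueeze(0), negs], dim=0)  # positive placed at index 0
    logits = candidates @ q / tau                            # (N,) similarities scaled by temperature
    labels = torch.zeros(1, dtype=torch.long)                # "class 0" = the positive sample
    return F.cross_entropy(logits.unsqueeze(0), labels)      # N-way softmax classification
```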
 

3.8 SimCLR: A Simple Framework for Contrastive Learning
Generate positive samples through data augmentation.
Use a projection head $g(\cdot)$ to project features into a space where the contrastive loss is applied

- SimCLR: Evaluation
 


- Train the feature encoder on ImageNet (the entire training set) using SimCLR.
 - Freeze the feature encoder and train a linear classifier on top with labeled data.
3.8.1 SimCLR design choices: projection head

- Linear / non-linear projection heads improve representation learning.
A possible explanation:
- the representation space $z = g(h)$ is trained to be invariant to data transformations
 - the contrastive objective may therefore discard information that is useful for downstream tasks
 - by applying the contrastive loss after the projection head $g(\cdot)$, more information can be preserved in the representation space $h$ (see the sketch below)
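A minimal sketch of such a non-linear projection head $g(\cdot)$; the hidden and output dimensions are illustrative. The contrastive loss is computed on $z = g(h)$ during pre-training, and $h$ is kept for downstream tasks.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Non-linear projection head g(.): maps the representation h to z for the contrastive loss."""

    def __init__(self, dim_h=2048, dim_z=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_h, dim_h), nn.ReLU(),
            nn.Linear(dim_h, dim_z),
        )

    def forward(self, h):
        return self.net(h)  # z = g(h); typically L2-normalized before the loss

# After pre-training, g(.) is discarded and h is used for downstream tasks.
```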
 
 
