Transfer learning for CV and self-supervised learning
1. Transfer learning for CV


1.1 Why?

1.2 Traditional vs. Transfer Learning

Learn some shared knowledge from the source data, then fine-tune it for the target task.
Traditional machine learning:
- learn a separate system for each task
 
Transfer learning:
- transfer the knowledge from the source model to the target task
 
Task description
- Source data: $(x^s, y^s)$, available in large amounts
 - Target data: $(x^t, y^t)$, only very little available
- One-shot learning: only a few examples in target domain
 
 - Example: (supervised) speaker adaptation
- Source data: audio data and transcriptions from many speakers
 - Target data: audio data and transcriptions from a specific user
 
 - Idea: train a model on the source data, then fine-tune it on the target data
- Challenge: only limited target data is available, so be careful about overfitting
 
 
1.3 Conservative Training

- Fine-tune with a very small learning rate, so the fine-tuned model stays close to the source model (a minimal sketch follows)
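A minimal PyTorch sketch of conservative fine-tuning, assuming a torchvision ResNet-18 pre-trained on ImageNet as the source model and a 10-class target task (both are illustrative assumptions):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Source model pre-trained on a large dataset (ImageNet, as an example).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)  # new head for the (assumed) 10-class target task

# Conservative training: a very small learning rate for the transferred layers,
# and a normal learning rate only for the freshly initialized head.
optimizer = optim.SGD(
    [
        {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")],
         "lr": 1e-4},  # stay close to the source model
        {"params": model.fc.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)
```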
 
1.4 Layer Transfer

- Which layer can be transferred (copied)?
- Speech: usually copy the last few layers
 - Image: usually copy the first few layers (see the sketch below)
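A minimal sketch of layer transfer for an image model, again assuming a pre-trained torchvision ResNet-18: the first few (general) layers are copied and frozen, and only the later layers plus a new head are trained on the target data.

```python
import torch.nn as nn
from torchvision import models

# Copy all weights from the source model, then freeze the first few layers.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)  # new head for the target task (10 classes assumed)

# For images, the earlier layers learn general features (edges, textures),
# so they are transferred as-is and kept frozen.
for module in (model.conv1, model.bn1, model.layer1, model.layer2):
    for p in module.parameters():
        p.requires_grad = False

# Only the remaining, more task-specific layers are updated on the target data.
trainable_params = [p for p in model.parameters() if p.requires_grad]
```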
 
 

1.5 Neural Network Layers: General to Specific
- Bottom/first/earlier layers: general learners
- Low-level notions of edges, visual shapes
 
 - Top/last/later layers: specific learners
- High-level features such as eyes, feathers
 
 
1.6 Multitask Learning
- The multi-layer structure makes neural networks well suited for multitask learning
- If the tasks are related, part of the parameters (typically the lower layers) can be shared across tasks (a sketch follows)
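A minimal sketch of this kind of hard parameter sharing, with an illustrative tiny backbone and two made-up tasks A and B:

```python
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Hard parameter sharing: one shared feature extractor, one head per task."""

    def __init__(self, num_classes_a=10, num_classes_b=5):
        super().__init__()
        self.shared = nn.Sequential(          # parameters shared by both tasks
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head_a = nn.Linear(32, num_classes_a)  # task-specific parameters
        self.head_b = nn.Linear(32, num_classes_b)

    def forward(self, x):
        h = self.shared(x)                    # shared representation
        return self.head_a(h), self.head_b(h)
```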
 
 

1.7 Progressive Neural Networks

- Does not rely on the tasks being related
 - Only shares features (previously trained networks are reused through lateral connections); parameters are not shared across tasks
 
2. Domain-adversarial training
2.1 Task description: domain adaptation

- How to remove the domain shift?
 - How to bridge the domain gap?
 - The domain can be a general concept:
- Datasets: transfer from an “easy” dataset to a “hard” one
 - Modalities: transfer from RGB to depth, infrared images, point clouds, etc.
 
 

 

2.2 Discrepancy-based approaches
We want the source and target feature distributions to be as close as possible, so that a model trained on the source data transfers to the target data.
Idea: minimize the domain distance in a feature space
- Existing works focus on designing a reasonable distance measure
 

- Example: metric-learning-based approaches (a commonly used discrepancy is sketched below)
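One widely used distance for this purpose is the maximum mean discrepancy (MMD). Below is a minimal sketch with an RBF kernel; the bandwidth `sigma` and the use of mini-batch features are illustrative assumptions, not a prescribed design.

```python
import torch

def rbf_mmd(source_feat, target_feat, sigma=1.0):
    """Squared MMD between source and target feature batches (shapes (N, D) and (M, D))."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)             # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))  # RBF kernel

    k_ss = kernel(source_feat, source_feat).mean()
    k_tt = kernel(target_feat, target_feat).mean()
    k_st = kernel(source_feat, target_feat).mean()
    return k_ss + k_tt - 2 * k_st  # minimized (jointly with the task loss) to align the domains
```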
 
2.3 Adversarial-based approaches

 - We want to find a feature space in which the source-domain and target-domain features are mixed together and cannot be told apart
 
2.4 Adversarial-based approaches
- Method 1: Domain-adversarial training
 


- Unlike a GAN, whose discriminator is trained to separate fake data from real data, domain-adversarial training wants the domain classifier to be unable to tell the domains apart
 

- Therefore, the gradient flowing from the domain classifier back to the feature extractor cannot be used for plain gradient descent; it is reversed (gradient reversal), so the feature extractor learns domain-confusing features (see the sketch below)
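A minimal sketch of a gradient reversal layer in PyTorch: the forward pass is the identity, while the backward pass flips (and scales) the gradient, so the feature extractor is updated to make the domains indistinguishable. The scaling factor `lambd` and the usage line are illustrative.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows back into the feature extractor; no gradient for lambd.
        return -ctx.lambd * grad_output, None

# Usage (hypothetical modules): domain_logits = domain_classifier(GradReverse.apply(features))
```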
 

- Method 2: GAN-based methods
 

2.5 Reconstruction-based approaches

- The data reconstruction of source or target samples is an auxiliary task that simultaneously focuses on creating a shared representation between the two domains and keeping the individual characteristics of each domain.
 
2.6 Knowledge distillation
- Distill the knowledge from a larger deep neural network into a small network
 

- Response-based knowledge
- Use the neural response of the last output layer of the teacher model to transfer.
 - Directly mimic the final prediction of the teacher model.
 - Simple yet effective
 
 

The closer the small (student) network's predictions are to the large (teacher) network's predictions, the better; a sketch of such a loss follows.
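A minimal sketch of a response-based distillation loss: the student mimics the teacher's softened predictions (KL divergence with temperature `T`) on top of the usual cross-entropy on ground-truth labels. The temperature and the `alpha` weighting are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based KD: match the teacher's softened predictions + normal supervised loss."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)   # standard cross-entropy on hard labels
    return alpha * kd + (1 - alpha) * ce
```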
- Feature-based knowledge
- Extend the transfer point from the last layer to intermediate layers
 - A good extension of response-based knowledge, especially for the training of thinner and deeper networks.
 - Generalize feature maps to attention maps
 

- Relation-based knowledge
- Both response-based and feature-based knowledge use the outputs of specific layers in the teacher model.
 - Relation-based knowledge further explores the relationships between different layers or data samples.
 
 - Considers the distributions of different features
 

- Extension: Cross-modal distillation
- The data or labels for some modalities might not be available during training or testing
 
 

3. Self-supervised learning
3.1 Motivation
Recall the idea of transfer learning: start with a general-purpose feature representation pre-trained on a large, diverse dataset and adapt it to specialized tasks
Challenge: overcoming reliance on supervised pre-training


3.2 Self-supervised pretext tasks
- Self-supervised learning methods solve “pretext” tasks that produce good features for downstream tasks.
- Learn with supervised learning objectives, e.g., classification, regression.
 - Labels of these pretext tasks are generated automatically
 
 - Example: learn to predict image transformations / complete corrupted images
 

3.2.1 Self-supervised learning workflow (I)

- Learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations
 
3.2.2 Self-supervised learning workflow (II)

- Attach a shallow network on top of the feature extractor; train the shallow network on the target task with a small amount of labeled data
 - Evaluate the learned feature encoders on downstream target tasks
 
3.2.3 Self-supervised vs. unsupervised learning
- The terms are sometimes used interchangeably in the literature, but self-supervised learning is a particular kind of unsupervised learning
 Unsupervised learning: any kind of learning without labels
- Clustering and quantization
 - Dimensionality reduction, manifold learning
 - Density estimation
… 
Self-supervised learning: the learner “makes up” labels from the data and then solves a supervised task
3.3 Self-supervised vs. generative learning
Both aim to learn from data without manual label annotation.
Generative learning aims to model data distribution, e.g., generating realistic images.
- Aims to generate images as close to real ones as possible, and therefore pays more attention to low-level details
 
Self-supervised learning aims to learn high-level semantic features with pretext tasks
- Learns only high-level semantic information
 
3.4 Types of self-supervised learning
- Predicting masked regions, predicting colors (colorization), predicting the future
 

- Predicting the arrangement of shuffled patches (jigsaw puzzles)
 

- Contrastive learning
 

3.5 Self-Supervision as data prediction
3.5.1 Colorization

The inherent ambiguity of colors has to be taken into account: as long as a predicted colorization could plausibly occur in nature, it should not be counted as wrong.
Colorization: Training data generation
- Convert the data to grayscale (the lightness channel) to obtain the network input
 

- Use the ab (color) channels as the supervision signal
 - Quantize the ab space and predict a distribution over the bins, which accounts for the color ambiguity (see the sketch below)
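A minimal sketch of generating one training pair for this pretext task, assuming scikit-image for the Lab conversion; the bin size of 10 and the offset of 110 are illustrative choices, not the values from any particular paper.

```python
import numpy as np
from skimage import color  # assumption: scikit-image is available

def make_colorization_pair(rgb, bin_size=10):
    """Turn an RGB image (H x W x 3) into an (input, target) pair.
    Input : the L (lightness) channel, i.e. the grayscale version of the image.
    Target: a per-pixel class index over a quantized ab grid, so the network
            predicts a distribution over plausible colors instead of one value."""
    lab = color.rgb2lab(rgb)                 # L in [0, 100], a/b roughly in [-110, 110]
    L, ab = lab[..., :1], lab[..., 1:]
    bins = np.floor((ab + 110.0) / bin_size).astype(np.int64)
    n_bins = int(np.ceil(220.0 / bin_size))  # bins per ab axis
    target = bins[..., 0] * n_bins + bins[..., 1]   # single class index per pixel
    return L, target
```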
 

3.6 Self-supervision by transformation prediction
- Pretext task: randomly sample a patch and one of its 8 neighbors, then guess the spatial relationship between the two patches
 


3.6.1 Context prediction: Details
- Leave a gap between the patches when cropping, to prevent the network from learning to exploit low-level cues along the patch edges
 

3.6.2 Jigsaw puzzle solving

- Instead of predicting a single relative position, the task considers the overall ordering (permutation) of all nine patches
 
Details
- To prevent overfitting, only 64 permutations are considered, chosen so that their pairwise Hamming distance is large
 

3.6.3 Rotation prediction
- Pretext task: recognize image rotation (0, 90, 180, 270 degrees)
 

- During training, feed all four rotated versions of each image into the same mini-batch (see the sketch below)
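A minimal sketch of building such a mini-batch in PyTorch; the tensor layout and function name are illustrative.

```python
import torch

def make_rotation_batch(images):
    """images: (B, C, H, W). Returns all four rotations of every image together with
    the rotation class (0: 0 deg, 1: 90 deg, 2: 180 deg, 3: 270 deg) as pseudo-labels."""
    rotated = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    x = torch.cat(rotated, dim=0)                           # (4B, C, H, W)
    y = torch.arange(4).repeat_interleave(images.size(0))   # (4B,) rotation labels
    return x, y
```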
 

3.7 Contrastive methods
- Encourage representations of transformed versions of the same image to be the same and different images to be different
 - Representations of the same image (under different transformations) should be as similar as possible, and representations of different images as dissimilar as possible
 
 

 

- Given: query point $x$, positive samples $x^{+}$, negative samples $x^{-}$
- Positives are typically transformed versions of $x$, negatives are random examples from the same mini-batch or memory bank
 
 - Key idea: learn a representation that makes $x$ similar to $x^{+}$ and dissimilar from $x^{-}$ (similarity is measured by the dot product of normalized features)
 - Given 1 positive sample and $N-1$ negative samples, the contrastive (InfoNCE) loss is

$$
\mathcal{L} = -\log \frac{\exp\left(\mathrm{sim}(x, x^{+}) / \tau\right)}{\exp\left(\mathrm{sim}(x, x^{+}) / \tau\right) + \sum_{j=1}^{N-1} \exp\left(\mathrm{sim}(x, x_{j}^{-}) / \tau\right)}
$$

- This is exactly the cross-entropy loss of an $N$-way softmax classifier that tries to pick out the positive sample among the $N$ candidates.
 - $\tau$ is the temperature hyperparameter; it determines how concentrated the softmax is
 - A smaller temperature makes the softmax sharper, i.e. the prediction is more concentrated (a sketch of the loss follows)
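A minimal sketch of this loss for a single query in PyTorch, assuming the features are already L2-normalized so that the dot product is the similarity; the tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, negs, tau=0.1):
    """InfoNCE-style loss for one query.
    q: (D,) query feature; pos: (D,) positive; negs: (N-1, D) negatives.
    Features are assumed to be L2-normalized, so dot product = similarity."""
    candidates = torch.cat([pos.unsqueeze(0), negs], dim=0)  # positive placed at index 0
    logits = candidates @ q / tau                            # (N,) similarities scaled by temperature
    labels = torch.zeros(1, dtype=torch.long)                # "class 0" = the positive sample
    return F.cross_entropy(logits.unsqueeze(0), labels)      # N-way softmax classification
```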
 

3.8 SimCLR: A Simple Framework for Contrastive Learning
Generate positive samples through data augmentation.
Use a projection head $g(\cdot)$ to project features into a space where the contrastive loss is applied

- SimCLR: Evaluation
 


- Train the feature encoder on ImageNet (the entire training set) using SimCLR.
 - Freeze the feature encoder and train a linear classifier on top with labeled data.
3.8.1 SimCLR design choices: projection head

- Linear / non-linear projection heads improve representation learning.
A possible explanation:
- the representation space $z = g(h)$ is trained to be invariant to data transformations
 - the contrastive objective may therefore discard information that is useful for downstream tasks
 - by applying the contrastive loss after the projection head $g(\cdot)$, more information can be preserved in the representation space $h$ (see the sketch below)
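A minimal sketch of such a non-linear projection head $g(\cdot)$; the hidden and output dimensions are illustrative. The contrastive loss is computed on $z = g(h)$ during pre-training, and $h$ is kept for downstream tasks.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Non-linear projection head g(.): maps the representation h to z for the contrastive loss."""

    def __init__(self, dim_h=2048, dim_z=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_h, dim_h), nn.ReLU(),
            nn.Linear(dim_h, dim_z),
        )

    def forward(self, h):
        return self.net(h)  # z = g(h); typically L2-normalized before the loss

# After pre-training, g(.) is discarded and h is used for downstream tasks.
```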
 
 
