Transfer learning for CV & self-supervised learning
1. Transfer learning for CV
1.1 Why?
1.2 Traditional vs. Transfer Learning
Learn shared knowledge from the source data, then fine-tune on the target data
Traditional machine learning:
- learn a separate system for each task
Transfer learning:
- transfer the knowledge from the source model to the target task
Task description
- Source data: $(x^s, y^s)$ A large amount
- Target data: $(x^t, y^t)$ Very little
- One-shot learning: only a few examples in target domain
- Example: (supervised) speaker adaption
- Source data: audio data and transcriptions from many speakers
- Target data: audio data and its transcriptions of specific user
- Idea: train a model on the source data, then fine-tune it on the target data
- Challenge: only limited target data, so be careful about overfitting
1.3 Conservative Training
- Fine-tune with a very low learning rate, so the target model stays close to the source model
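A minimal PyTorch-style sketch of conservative training; the model sizes, the learning rate, and the dummy target batch are illustrative assumptions rather than values from the lecture:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a model pre-trained on the large source dataset (assumption:
# in practice this would be loaded from a checkpoint).
source_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256),
                             nn.ReLU(), nn.Linear(256, 10))

# Start the target model from the source weights.
target_model = copy.deepcopy(source_model)

# Conservative training: a very low learning rate keeps the fine-tuned
# weights close to the source model and limits overfitting on the tiny target set.
optimizer = torch.optim.SGD(target_model.parameters(), lr=1e-4, momentum=0.9)

# Dummy batch standing in for the small labeled target dataset.
x_t = torch.randn(8, 3, 32, 32)
y_t = torch.randint(0, 10, (8,))

loss = F.cross_entropy(target_model(x_t), y_t)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```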
1.4 Layer Transfer
- Which layer can be transferred (copied)?
- Speech: usually copy the last few layers
- Image: usually copy the first few layers
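A sketch of layer transfer for images, assuming a small illustrative CNN as the source model: the first (general) layers are copied and frozen, and only a new task-specific head is trained on the target data.

```python
import torch
import torch.nn as nn

# Stand-in source model: for images, the first (general) layers are copied.
source_model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # early, general layers
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 100),                          # later, task-specific layer
)

# Copy and freeze the first few layers; only the new head is trained.
transferred = nn.Sequential(*list(source_model.children())[:4])
for p in transferred.parameters():
    p.requires_grad = False                      # frozen (copied) layers

target_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                            nn.Linear(32, 5))    # new layers for the target task
target_model = nn.Sequential(transferred, target_head)

optimizer = torch.optim.Adam(target_head.parameters(), lr=1e-3)
```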
1.5 Neural Network Layers: General to Specific
- Bottom/first/earlier layers: general learners
- Low-level notions of edges, visual shapes
- Top/last/later layers: specific learners
- High-level features such as eyes, feathers
1.6 Multitask Learning
- The multi-layer structure makes NN suitable for multitask learning
- If the tasks are related, some of the parameters can be shared
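A possible way to wire this up in PyTorch (the layer sizes and the two task heads are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared lower layers plus one output head per task."""
    def __init__(self):
        super().__init__()
        # Parameters shared by the related tasks.
        self.shared = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
        # Task-specific output layers.
        self.head_a = nn.Linear(64, 10)   # e.g., task A: 10-way classification
        self.head_b = nn.Linear(64, 2)    # e.g., task B: binary classification

    def forward(self, x):
        z = self.shared(x)
        return self.head_a(z), self.head_b(z)

net = MultiTaskNet()
out_a, out_b = net(torch.randn(4, 128))
# Joint training: sum (or weight) the per-task losses and backpropagate once.
```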
1.7 Progressive Neural Networks
- Does not rely on the tasks being related
- Only features are shared (outputs of previously trained columns are fed into the new column); parameters are not shared
2. Domain-adversarial training
2.1 Task description: domain adaptation
- How to remove the domain shift?
- How to bridge the domain gap?
- The domain can be a general concept:
- Datasets: transfer from an “easy” dataset to a “hard” one
- Modalities: transfer from RGB to depth, infrared images, point clouds, etc.
- Goal: remove the domain shift
2.2 Discrepancy-based approaches
We want the source and target data to be as close as possible (in feature space), so that a model trained on the source data transfers to the target data
Idea: minimize the domain distance in a feature space
- Works focus on designing a reasonable distance
- Example: metric-learning-based approaches
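As an illustration, a linear-kernel Maximum Mean Discrepancy (MMD) is one commonly used domain distance; the sketch below assumes pre-extracted feature batches and is not tied to any specific paper:

```python
import torch

def linear_mmd(source_feat: torch.Tensor, target_feat: torch.Tensor) -> torch.Tensor:
    """MMD with a linear kernel: squared distance between the mean
    feature embeddings of the two domains."""
    return (source_feat.mean(dim=0) - target_feat.mean(dim=0)).pow(2).sum()

# Illustrative usage: total loss = task loss on source + lambda * domain distance.
f_s = torch.randn(32, 256)   # features of a source-domain batch
f_t = torch.randn(32, 256)   # features of a target-domain batch
domain_loss = linear_mmd(f_s, f_t)
```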
2.3 Adversarial-based approaches
- We want to find a feature space in which the features of the two domains mix together and become indistinguishable
2.4 Adversarial-based approaches
- Method 1: Domain-adversarial training
- Unlike a GAN, whose discriminator is trained to tell fake data apart from real data, here we want the domain classifier to be unable to tell the two domains apart
- Therefore the gradient from the domain classifier is not simply descended by the feature extractor; it is reversed before reaching it (a gradient reversal layer; see the sketch after this list)
- Method 2: GAN-based methods
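A sketch of the gradient reversal layer used in Method 1 (the `lam` weighting factor and the illustrative wiring are assumptions):

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the feature extractor is updated to *fool* the domain
    classifier, while the classifier itself still does ordinary gradient descent
    on its own parameters."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Illustrative wiring: features -> (label head) and
#                      features -> grad_reverse -> (domain classifier)
feat = torch.randn(8, 128, requires_grad=True)
domain_clf = torch.nn.Linear(128, 2)
domain_logits = domain_clf(grad_reverse(feat, lam=0.5))
```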
2.5 Reconstruction-based approaches
- The data reconstruction of source or target samples is an auxiliary task that simultaneously focuses on creating a shared representation between the two domains and keeping the individual characteristics of each domain.
2.6 Knowledge distillation
- Distill the knowledge from a larger deep neural network into a small network
- Response-based knowledge
- Use the neural response of the last output layer of the teacher model to transfer.
- Directly mimic the final prediction of the teacher model.
- Simple yet effective
- The closer the small (student) network's predictions are to the large (teacher) network's, the better (see the distillation sketch at the end of this section)
- Feature-based knowledge
- Extend the transfer point from the last layer to intermediate layers
- A good extension of response-based knowledge, especially for the training of thinner and deeper networks.
- Generalize feature maps to attention maps
- Relation-based knowledge
- Both response-based and feature-based knowledge use the outputs of specific layers in the teacher model.
- Relation-based knowledge further explores the relationships between different layers or data samples.
- Considers the distributions of and relations between different features
- Extension: Cross-modal distillation
- The data or labels for some modalities might not be available during training or testing
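For the response-based case, a common formulation (used here only as an illustration) combines a softened KL term with the usual cross-entropy on ground-truth labels; the temperature `T` and weight `alpha` are assumed hyperparameters:

```python
import torch
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based distillation: match the student's softened predictions
    to the teacher's, plus standard cross-entropy on the true labels."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Dummy batch: the large teacher and the small student both predict 10 classes.
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = response_kd_loss(student_logits, teacher_logits, labels)
```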
3. Self-supervised learning
3.1 Motivation
Recall the idea of transfer learning: start with general-purpose feature representation pre-trained on a large, diverse dataset and adapt it to specialized tasks
Challenge: overcoming reliance on supervised pre-training
3.2 Self-supervised pretext tasks
- Self-supervised learning methods solve “pretext” tasks that produce good features for downstream tasks.
- Learn with supervised learning objectives, e.g., classification, regression.
- Labels of these pretext tasks are generated automatically
- Example: learn to predict image transformations / complete corrupted images
3.2.1 Self-supervised learning workflow (I)
- Learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations
3.2.2 Self-supervised learning workflow (II)
- Attach a shallow network on the feature extractor; train the shallow network on the target task with a small amount of labeled data
- Evaluate the learned feature encoders on downstream target tasks
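A sketch of this workflow under a linear-evaluation-style setup (the encoder architecture and the dummy labeled batch are placeholders):

```python
import torch
import torch.nn as nn

# Stand-in for a feature encoder trained with a self-supervised pretext task.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())

# Freeze the encoder; only the shallow head is trained on the small
# labeled downstream dataset.
for p in encoder.parameters():
    p.requires_grad = False

head = nn.Linear(512, 10)                      # shallow network on top
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(8, 3, 32, 32)                  # dummy labeled target batch
y = torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(head(encoder(x)), y)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```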
3.2.3 Self-supervised vs. unsupervised learning
- The terms are sometimes used interchangeably in the literature, but self-supervised learning is a particular kind of unsupervised learning
Unsupervised learning: any kind of learning without labels
- Clustering and quantization
- Dimensionality reduction, manifold learning
- Density estimation
…
Self-supervised learning: the learner “makes up” labels from the data and then solves a supervised task
3.3 Self-supervised vs. generative learning
Both aim to learn from data without manual label annotation.
Generative learning aims to model data distribution, e.g., generating realistic images.
- Wants to generate images as close to real ones as possible; pays more attention to details
Self-supervised learning aims to learn high-level semantic features with pretext tasks
- Only learns high-level semantic information
3.4 Types of self-supervised learning
- Predict occluded content, predict colorization, predict the future
- Predict jigsaw arrangements
- Contrastive learning
3.5 Self-Supervision as data prediction
3.5.1 Colorization
The inherent ambiguity of color has to be handled: as long as a colorization could plausibly occur in nature, it should not be judged as wrong
Colorization: Training data generation
- Convert the image to grayscale (keep the L channel) as the input
- Use the ab channels as the supervision signal
- Quantize the ab space and predict a distribution over the bins, which ultimately accounts for color ambiguity
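A simplified sketch of generating the quantized-ab targets; the uniform 10x10 binning is an assumption for brevity (the original colorization work restricts itself to in-gamut bins and adds class rebalancing):

```python
import torch
import torch.nn.functional as F

def ab_to_bins(ab, grid=10, ab_range=110.0):
    """Quantize continuous ab channels (shape [B, 2, H, W], roughly in
    [-110, 110]) into grid*grid discrete color bins."""
    idx = ((ab + ab_range) / (2 * ab_range) * grid).long().clamp(0, grid - 1)
    return idx[:, 0] * grid + idx[:, 1]          # [B, H, W] bin labels

# The network sees the grayscale L channel and predicts a *distribution*
# over color bins at every pixel, which keeps color ambiguity intact.
ab = torch.empty(2, 2, 8, 8).uniform_(-110, 110)             # dummy ab supervision
target_bins = ab_to_bins(ab)                                 # [2, 8, 8]
pred_logits = torch.randn(2, 100, 8, 8, requires_grad=True)  # 10*10 bins
loss = F.cross_entropy(pred_logits, target_bins)
```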
3.6 Self-supervision by transformation prediction
- Pretext task: randomly sample a patch and one of its 8 neighbors; guess the spatial relationship between the two patches
3.6.1 Context prediction: Details
- Leave a gap between the patches when cropping, to prevent the network from learning trivial edge/boundary cues
3.6.2 Jigsaw puzzle solving
- Instead of predicting the position of a single neighboring patch, consider the overall ordering (permutation) of all nine patches
Details
- To avoid overfitting, only 64 permutations are considered, chosen so that their pairwise Hamming distances are large
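A greedy sketch of how such a permutation set could be selected (the pool size and the greedy criterion are assumptions; the original work iterates over a much larger candidate set):

```python
import numpy as np

def select_permutations(num_perms=64, n_tiles=9, pool_size=2000, seed=0):
    """Greedily keep permutations of the 9 tiles whose pairwise Hamming
    distances are as large as possible, so the 64-way classification
    task stays discriminative."""
    rng = np.random.default_rng(seed)
    pool = [rng.permutation(n_tiles) for _ in range(pool_size)]
    chosen = [pool.pop()]
    while len(chosen) < num_perms:
        # Candidate whose minimum Hamming distance to the chosen set is largest.
        best = max(range(len(pool)),
                   key=lambda i: min(int((pool[i] != c).sum()) for c in chosen))
        chosen.append(pool.pop(best))
    return np.stack(chosen)          # shape (64, 9): tile orderings

perm_set = select_permutations()
```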
3.6.3 Rotation prediction
- Pretext task: recognize image rotation (0, 90, 180, 270 degrees)
- During training, feed in all four rotated versions of an image in the same mini-batch
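A sketch of generating the rotated batch and its automatic labels (the 4-way classifier itself is only stubbed out here):

```python
import torch
import torch.nn.functional as F

def make_rotation_batch(images):
    """Given a batch [B, C, H, W], return all four rotated copies and the
    automatically generated labels 0..3 (0, 90, 180, 270 degrees)."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels

# Dummy usage: a 4-way classifier is trained on the rotated batch.
imgs = torch.randn(8, 3, 32, 32)
x_rot, y_rot = make_rotation_batch(imgs)         # [32, 3, 32, 32], [32]
logits = torch.randn(32, 4, requires_grad=True)  # stand-in for model(x_rot)
loss = F.cross_entropy(logits, y_rot)
```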
3.7 Contrastive methods
- Encourage representations of transformed versions of the same image to be the same and representations of different images to be different
- We want representations of the same data (under its transformations) to be as similar as possible, and representations of different data to be as dissimilar as possible
- Given: query point $x$, positive samples $x^{+}$, negative samples $x^{-}$
- Positives are typically transformed versions of $x$, negatives are random examples from the same mini-batch or memory bank
- Key idea: learn representation to make $x$ similar to $x^{+}$, dissimilar from $x^{-}$ (similarity is measured by dot product of normalized features)
- Given 1 positive sample and $N-1$ negative samples, the contrastive loss is
$$\mathcal{L} = -\log \frac{\exp(x^{\top} x^{+} / \tau)}{\exp(x^{\top} x^{+} / \tau) + \sum_{j=1}^{N-1} \exp(x^{\top} x_{j}^{-} / \tau)}$$
- This is exactly the cross-entropy loss of an $N$-way softmax classifier: try to find the positive sample among the $N$ samples
- $\tau$ is the temperature hyperparameter (it determines how concentrated the softmax is)
- A smaller temperature makes the predicted distribution sharper (more concentrated); see the sketch after this list
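A sketch of this loss in PyTorch, assuming the query, positive, and negative features are already encoded; the `info_nce` helper and the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, tau=0.1):
    """Contrastive (InfoNCE) loss for one positive and N-1 negatives.
    Similarity is the dot product of L2-normalized features; the loss is
    cross-entropy of an N-way softmax whose correct class is the positive."""
    q = F.normalize(query, dim=-1)                    # [B, D]
    pos = F.normalize(positive, dim=-1)               # [B, D]
    neg = F.normalize(negatives, dim=-1)              # [B, N-1, D]
    l_pos = (q * pos).sum(dim=-1, keepdim=True)       # [B, 1]
    l_neg = torch.einsum("bd,bnd->bn", q, neg)        # [B, N-1]
    logits = torch.cat([l_pos, l_neg], dim=1) / tau   # positive is class 0
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Dummy tensors standing in for encoded features.
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 15, 128))
```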
3.8 SimCLR: A Simple Framework for Contrastive Learning
Generate positive samples through data augmentation.
Use a projection network $g(\cdot)$ to project the representation $h$ to a space $z$ where contrastive learning is applied
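A simplified SimCLR-style training step (the encoder, the projection head sizes, and the single-direction loss are assumptions; the full NT-Xent loss symmetrizes over both views and uses all other samples in the batch as negatives):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the encoder and the projection head g(.): the contrastive
# loss is computed on z = g(h), while h is kept for downstream use.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())
proj_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

def simclr_step(view1, view2, tau=0.5):
    """One contrastive step on two augmented views of the same batch:
    for each sample, its counterpart in the other view is the positive,
    every other sample in that view is a negative."""
    z1 = F.normalize(proj_head(encoder(view1)), dim=-1)   # [B, 128]
    z2 = F.normalize(proj_head(encoder(view2)), dim=-1)
    logits = z1 @ z2.t() / tau                            # [B, B] similarities
    labels = torch.arange(z1.size(0))                     # matching index = positive
    return F.cross_entropy(logits, labels)

# Dummy "augmented views": in practice these come from random crops,
# color jitter, blur, etc. applied to the same images.
v1, v2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
loss = simclr_step(v1, v2)
```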
- SimCLR: Evaluation
- Train the feature encoder on ImageNet (entire training set) using SimCLR
- Freeze the feature encoder, train a linear classifier on top with labeled data
3.8.1 SimCLR design choices: projection head
- Linear / non-linear projection heads improve representation learning.
A possible explanation:
- the representation space $z$ is trained to be invariant to data transformation
- the contrastive learning objective may discard information that is useful for downstream tasks
- by leveraging the projection head $g(\cdot)$, more information can be preserved in the $h$ representation space