Semi-Supervised Learning
1. What is semi-supervised learning?
- Humans learn in a semi-supervised way
1.1 Why does semi-supervised learning help?
- The distribution of the unlabeled data tells us something about the structure of the problem.
1.2 Low-density Separation Assumption
- We want the gap between the separated classes to be as large as possible, i.e., the decision boundary should pass through a low-density region.
Given: labelled data set $=\left\{\left(x^{r}, \hat{y}^{r}\right)\right\}_{r=1}^{R}$, unlabeled data set $=\left\{x^{u}\right\}_{u=1}^{U}$
- Self-training: repeat
  - Train a model from the labelled data
  - Apply it to the unlabeled data to obtain pseudo-labels
  - Move a confident subset of the pseudo-labelled examples into the labelled set
- Hard label vs. soft label
  - When using a neural network with parameters $\theta^{*}$ trained from the labelled data, a soft label just reproduces the network's current output and therefore has no effect on training, so hard labels should be used (a sketch of the loop follows below).
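A minimal sketch of this self-training loop, assuming a scikit-learn-style classifier with `fit`/`predict_proba`; the confidence threshold `tau` and round limit are hypothetical choices:

```python
import numpy as np

def self_train(model, X_lab, y_lab, X_unl, tau=0.95, max_rounds=10):
    """Iteratively pseudo-label confident unlabeled points with hard labels."""
    for _ in range(max_rounds):
        model.fit(X_lab, y_lab)                 # train on the current labelled set
        if len(X_unl) == 0:
            break
        proba = model.predict_proba(X_unl)      # soft predictions on unlabeled data
        conf, hard = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= tau                      # move only confident examples
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unl[keep]])
        y_lab = np.concatenate([y_lab, hard[keep]])
        X_unl = X_unl[~keep]
    return model
```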
1.3 Entropy-based Regularization
- We want the classifier's decisions to be black-and-white (confident).
- Entropy tells us how concentrated a prediction is: for unlabeled data we want the output distribution to be as concentrated as possible, i.e. minimize $E(y^{u})=-\sum_{m} y_{m}^{u} \ln y_{m}^{u}$ alongside the supervised loss (see the sketch below).
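A hedged sketch of this entropy regularizer in PyTorch: supervised cross-entropy plus $\lambda$ times the prediction entropy on unlabeled data; the weight `lam` is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_lab, y_lab, x_unl, lam=0.1):
    # Supervised term: standard cross-entropy on labelled data
    ce = F.cross_entropy(model(x_lab), y_lab)
    # Unsupervised term: entropy of predictions on unlabeled data;
    # low entropy = concentrated ("black-and-white") predictions
    p = F.softmax(model(x_unl), dim=1)
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    return ce + lam * entropy
```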
1.4 Smoothness Assumption
- Assumption: “similar” $x$ has the same $\hat{y}$
- Similar data points should have similar labels.
- More precisely:
- $x$ is not uniformly distributed.
- If $x^{1}$ and $x^{2}$ are close in a high density region, $\hat{y}^{1}$ and $\hat{y}^{2}$ are the same.
- Connected by a high density path
- We can insert many intermediate examples of the digit "2" into the training set, so that the leftmost "2" connects to the rightmost "2" through a high-density path.
- Example: classify astronomy vs. travel articles.
- If a connected high-density region links two documents, they can be classified into the same class.
1.5 Graph-based Approach
- How do we know whether $x^{1}$ and $x^{2}$ are connected by a high-density path?
Define the similarity $s\left(x^{i}, x^{j}\right)$ between $x^{i}$ and $x^{j}$
Add edges:
- K nearest neighbours
- $\varepsilon$-neighborhood
- Edge weight is proportional to $s\left(x^{i}, x^{j}\right)$; a common choice is the Gaussian radial basis function $s\left(x^{i}, x^{j}\right)=\exp \left(-\gamma\left\|x^{i}-x^{j}\right\|^{2}\right)$
- The labelled data influence their neighbours, and the influence propagates through the graph.
  - Labels propagate along paths in the graph.
  - This is not always effective: it requires enough data for the graph to be well connected.
- Define the smoothness of the labels on the graph:
  $S=\frac{1}{2} \sum_{i, j} w_{i, j}\left(y^{i}-y^{j}\right)^{2}=\mathbf{y}^{T} L \mathbf{y}$
  - $w_{i, j}$ is the similarity in feature space; the smaller $S$ is, the smoother the labelling.
  - $\mathbf{y}$: an $(R+U)$-dim vector over both labelled and unlabeled points. During label propagation the labels are initialized first: the $R$ entries carry the given labels and the $U$ entries start unlabeled.
  - $L$: the $(R+U) \times(R+U)$ graph Laplacian, $L=D-W$, where $D$ places the row sums of $W$ on its diagonal.
  - The smoothness term can be added at any layer of a deep network: the forward pass returns each layer's output so the penalty can be computed on that layer's representation as well (see the sketch below).
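A small numpy sketch of these definitions, assuming a kNN graph with Gaussian RBF weights; `k` and `gamma` are illustrative values.

```python
import numpy as np

def graph_smoothness(X, y, k=5, gamma=1.0):
    """Build a kNN graph with RBF weights and return S = y^T L y."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                # k nearest neighbours (skip self)
        W[i, nbrs] = np.exp(-gamma * d2[i, nbrs])        # s(x^i, x^j) = exp(-gamma ||.||^2)
    W = np.maximum(W, W.T)                               # symmetrize the graph
    D = np.diag(W.sum(axis=1))                           # row sums on the diagonal
    L = D - W                                            # graph Laplacian
    return y @ L @ y                                     # smaller = smoother labelling
```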
2. Unsupervised Neural Network
2.1 Recall: Unsupervised learning
- Data: just $x$, no labels!
- Goal: learn some underlying hidden structure of the data
- Examples: clustering, dimensionality reduction, density estimation, etc.
- K-means clustering
2.2 Auto-encoder
- We want the encoder to automatically distill the input into a compact code from which the decoder can reconstruct the input.
- Output of the hidden layer is the code
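A minimal sketch of such an auto-encoder in PyTorch; the layer sizes (784-dimensional input, 32-dimensional code) are assumed, MNIST-style.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Fully connected auto-encoder; the bottleneck output is the code."""
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),            # hidden-layer output = the code
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code
```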
2.3 Deep Auto-encoder
- Deeper networks learn more expressive and more discriminative representations.
- De-noising auto-encoder
  - We want a noise-corrupted image to be reconstructed as the clean image, i.e., the auto-encoder learns to remove noise on its own (a sketch follows below).
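One hedged training step for a de-noising auto-encoder, reusing the `AutoEncoder` sketch above: corrupt the input with Gaussian noise but reconstruct the clean image; the noise level 0.3 is an assumed choice.

```python
import torch

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                        # stand-in for a clean mini-batch
x_noisy = x + 0.3 * torch.randn_like(x)        # corrupted input
recon, _ = model(x_noisy)
loss = torch.nn.functional.mse_loss(recon, x)  # target is the clean image
opt.zero_grad(); loss.backward(); opt.step()
```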
2.4 Auto-encoder – Text Retrieval
- Documents about the same topic will have similar codes.
2.5 Auto-encoder – Similar Image Search
2.6 Auto-encoder for CNN
2.7 CNN – Unpooling
2.8 CNN – Deconvolution
- Greedy Layer-wise Pre-training
  - Train one layer at a time; after a layer is trained, freeze its parameters.
  - Finally, fine-tune the whole network end-to-end (a sketch follows below).
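A hedged sketch of greedy layer-wise pre-training: each encoder layer is trained as a shallow auto-encoder with a throwaway decoder, then frozen; the helper names and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

def pretrain_layer(layer, decoder, batches, epochs=5, lr=1e-3):
    """Train one (encoder layer, throwaway decoder) pair to reconstruct its input."""
    params = list(layer.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x in batches:
            recon = decoder(torch.relu(layer(x)))
            loss = nn.functional.mse_loss(recon, x)
            opt.zero_grad(); loss.backward(); opt.step()
    for p in layer.parameters():   # freeze after pre-training
        p.requires_grad = False

# Hypothetical usage: pre-train layer1 on raw inputs, then layer2 on the
# frozen layer1 codes; finally unfreeze everything and fine-tune.
layer1, dec1 = nn.Linear(784, 256), nn.Linear(256, 784)
batches = [torch.rand(64, 784) for _ in range(10)]    # stand-in data
pretrain_layer(layer1, dec1, batches)
codes = [torch.relu(layer1(x)).detach() for x in batches]
layer2, dec2 = nn.Linear(256, 64), nn.Linear(64, 256)
pretrain_layer(layer2, dec2, codes)
```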
2.9 Why VAE (Variational Auto-Encoders)?
- Can we generate new images by interpolating between a plain auto-encoder's codes and sampling there?
  - Generally not.
  - But we would like linear interpolation between codes to yield meaningful new images.
- The VAE turns the deterministic code vector into a distribution, i.e., it adds noise to the code.
  - $e$ is a sample drawn from a standard Gaussian and $\sigma$ sets the standard deviation of the noise, giving a code $c=m+\exp (\sigma) \odot e$ (a sketch follows below).
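A hedged sketch of this reparameterization step in PyTorch, assuming the network outputs $\sigma$ as a log standard deviation; the Gaussian KL regularizer shown is the standard closed form.

```python
import torch
import torch.nn as nn

class VAEBottleneck(nn.Module):
    """Turn a deterministic code into a distribution via reparameterization."""
    def __init__(self, hid_dim=256, code_dim=32):
        super().__init__()
        self.to_mean = nn.Linear(hid_dim, code_dim)     # m
        self.to_logstd = nn.Linear(hid_dim, code_dim)   # sigma (log std)

    def forward(self, h):
        m, log_std = self.to_mean(h), self.to_logstd(h)
        e = torch.randn_like(m)                 # e ~ N(0, I)
        c = m + torch.exp(log_std) * e          # noisy code c = m + exp(sigma) * e
        # KL(N(m, exp(sigma)^2) || N(0, 1)), summed over dimensions
        kl = 0.5 * (torch.exp(2 * log_std) + m ** 2 - 1 - 2 * log_std).sum()
        return c, kl
```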
2.10 Pokémon Creation
- In the learned code space, the vertical dimension controls the Pokémon's size and the horizontal dimension controls its orientation.
2.11 Problems of VAE
- It does not really try to simulate real images
- Because the reconstruction loss is pixel-level, an output only needs to be close pixel-by-pixel; it does not have to look realistic.
3. Generative Adversarial Network (GAN)
3.1 Basic Idea of GAN
- The data we want to generate has a distribution $P_{\text {data }}(x)$
- A generator G is a network. The network defines a probability distribution.
- The generator does not explicitly model the original data distribution.
3.2 Generative adversarial networks
- Train two networks with opposing objectives:
- Generator: learns to generate samples
- Discriminator: learns to distinguish between generated and real samples
- The two networks play a game against each other, and both improve over time.
3.3 Evolution
- Generator
- Each dimension of the input vector determines some feature of the generated image.
- Discriminator
3.4 The evolution of generation
- Fix one network while updating the other, alternating between the two.
- The discriminator $D(x)$ should output the probability that the sample $x$ is real
- That is, we want $D(x)$ to be close to 1 for real data and close to 0 for fake
- Expected conditional log-likelihood for real and generated data:
  $V(D, G)=\mathbb{E}_{x \sim p_{\text {data }}}[\log D(x)]+\mathbb{E}_{z \sim p(z)}[\log (1-D(G(z)))]$
  - The discriminator wants to tell real from fake, so it maximizes this quantity.
  - The generator wants the opposite: to make it as small as possible.
- We seed the generator with noise $z$ drawn from a simple distribution $p$ (Gaussian or uniform)
3.5 GAN objective
- The discriminator wants to correctly distinguish real and fake samples: $\max _{D} V(D, G)$
- The generator wants to fool the discriminator: $\min _{G} V(D, G)$
- Train the generator and discriminator jointly in a minimax game:
  $\min _{G} \max _{D} \mathbb{E}_{x \sim p_{\text {data }}}[\log D(x)]+\mathbb{E}_{z \sim p(z)}[\log (1-D(G(z)))]$
3.6 Training algorithm in practice
- Repeat until happy with the results:
  - Update discriminator: repeat for $k$ steps:
    - Sample a mini-batch of noise samples $z_{1}, \ldots, z_{m}$ and a mini-batch of real samples $x_{1}, \ldots, x_{m}$
    - Update the parameters of $D$ by stochastic gradient ascent on $\frac{1}{m} \sum_{i=1}^{m}\left[\log D\left(x_{i}\right)+\log \left(1-D\left(G\left(z_{i}\right)\right)\right)\right]$, pushing $D\left(x_{\text {data }}\right)$ close to 1 and $D(G(z))$ close to 0
  - Update generator:
    - Sample a mini-batch of noise samples $z_{1}, \ldots, z_{m}$
    - Update the parameters of $G$ by stochastic gradient ascent on $\frac{1}{m} \sum_{i=1}^{m} \log D\left(G\left(z_{i}\right)\right)$ (the non-saturating objective)
- A sketch of this loop follows below.
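A hedged PyTorch sketch of the alternating updates above; the MLP architectures, batch size, and learning rates are assumed, and the BCE losses implement the ascent objectives as equivalent descent problems.

```python
import torch
import torch.nn as nn

z_dim, x_dim, m = 64, 784, 128
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real, k=1):
    for _ in range(k):                               # k discriminator steps
        z = torch.randn(m, z_dim)
        d_real, d_fake = D(x_real), D(G(z).detach())
        # ascent on log D(x) + log(1 - D(G(z))) == descent on this BCE loss
        loss_d = bce(d_real, torch.ones_like(d_real)) + \
                 bce(d_fake, torch.zeros_like(d_fake))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    z = torch.randn(m, z_dim)                        # generator step
    d_fake = D(G(z))
    loss_g = bce(d_fake, torch.ones_like(d_fake))    # non-saturating log D(G(z))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```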
- The generator is a “black box” to the discriminator
- The generator is exposed to real data only via the output of the discriminator (and its gradients)
- At test time, the discriminator is discarded
3.7 Original GAN results
- The original GAN's samples are rather blurry, because blurry samples are harder for the discriminator to classify as fake.
3.8 Problems with GAN training
Stability
- Parameters can oscillate or diverge, generator loss does not correlate with sample quality
- Behavior very sensitive to hyperparameter selection
Mode collapse
- The generator ends up modeling only a small subset of the training data
- It imitates only a few modes and fails to reproduce the true multi-modality of the data
3.9 DCGAN
- Early, influential convolutional architecture for the generator
  - Use convolutions without pooling: strided convolutions take the place of pooling layers
- Discriminator architecture (empirically determined to give best training stability):
- Don’t use pooling, only strided convolutions
- Use Leaky ReLU activations (sparse gradients cause problems for training)
- Use only one FC layer before the softmax output
- Use batch normalization after most layers (in the generator also)
- Batch normalization reduces sensitivity to hyperparameter choices (a generator sketch follows below).
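A hedged DCGAN-style generator sketch following the rules above (strided transposed convolutions instead of pooling, batch norm after most layers); the channel sizes and 32×32 output are assumed.

```python
import torch.nn as nn

# Maps a (N, 100, 1, 1) noise tensor to a (N, 3, 32, 32) image.
gen = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 4, stride=1, padding=0),  # 1x1 -> 4x4
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 4x4 -> 8x8
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 8x8 -> 16x16
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),     # 16x16 -> 32x32
    nn.Tanh(),
)
```

The discriminator would mirror this with strided convolutions and Leaky ReLU activations.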
3.10 DCGAN results
- Interpolation between different points in the z space
- i.e., the learned z space is continuous: interpolated codes produce smoothly varying images
- Vector arithmetic in the z space
- Pose transformation by adding a “turn” vector
4. Conditional generation
- To condition the generation of samples on discrete side information (label) $y$, we need to add $y$ as an input to both the generator and the discriminator
- Adding the class label constrains what the generator produces (a sketch follows below).
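A hedged sketch of the conditioning: embed the label and concatenate it with the generator's noise and the discriminator's input; the dimensions are assumed.

```python
import torch
import torch.nn as nn

n_classes, z_dim = 10, 64
embed = nn.Embedding(n_classes, n_classes)   # learned label embedding

def gen_input(z, y):
    """G sees (z, y): concatenate noise with the label embedding."""
    return torch.cat([z, embed(y)], dim=1)

def disc_input(x, y):
    """D sees (x, y): the sample is judged together with its label."""
    return torch.cat([x, embed(y)], dim=1)
```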
4.1 BigGAN
- Class-conditional generation of ImageNet images at up to $512 \times 512$ resolution
- The truncation trick: sample $z$ only from a truncated region of the latent space, avoiding the blur contributed by low-density regions of the prior; restricting the code space in this way improves sample quality.
- But it can also reduce the variety of the samples, so there is a trade-off (a sketch follows below).
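A hedged PyTorch sketch of the truncation trick: resample any latent component whose magnitude exceeds a threshold, so codes come only from the high-density core of the prior; the threshold value is illustrative.

```python
import torch

def truncated_noise(n, z_dim, threshold=1.0):
    """Sample z ~ N(0, I), resampling components with |z| > threshold."""
    z = torch.randn(n, z_dim)
    mask = z.abs() > threshold
    while mask.any():
        z[mask] = torch.randn(int(mask.sum()))  # redraw out-of-range components
        mask = z.abs() > threshold
    return z

# Lower thresholds trade sample variety for per-sample quality.
z = truncated_noise(16, 128, threshold=0.5)
```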
5. Image-to-image translation
- Produce a modified image $y$ conditioned on an input image $x$ (note the change of notation)
- The generator receives $x$ as input
- Discriminator receives an $x, y$ pair and has to decide whether it is real or fake
- The input $x$ acts as a reference for the discriminator, turning the real/fake decision into a conditional one.
- E.g., for edges-to-shoes, the generated shoe should keep the shape given by the input edges.
5.1 Translating between maps and aerial photos
- Day to night
- Edges to photos
5.2 Unpaired image-to-image translation
Sometimes we cannot obtain paired samples.
Given two unordered image collections $X$ and $Y$, learn to "translate" an image from one into the other and vice versa
5.3 CycleGAN
- Given: domains $X$ and $Y$
- We want an image from $X$ to be translated into $Y$, and the inverse translation to map it back to the original image.
- This constrains the translated image to keep a shape similar to the input from $X$.
- Train two generators $F$ and $G$ and two discriminators $D_{X}$ and $D_{Y}$
- $G$ translates from $X$ to $Y, F$ translates from $Y$ to $X$
- $D_{X}$ recognizes images from $X, D_{Y}$ from $Y$
- Cycle consistency: we want $F(G(x)) \approx x$ and $G(F(y)) \approx y$ (see the sketch after this list)
- Illustration of cycle consistency:
- Translation between maps and aerial photos
- Tasks for which paired data is unavailable
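A hedged sketch of the CycleGAN generator objective: least-squares adversarial terms (as CycleGAN uses in practice) plus L1 cycle-consistency in both directions; `F_inv` stands for the generator $F$ (renamed to avoid clashing with `torch.nn.functional`), and the cycle weight 10 follows the paper's common setting.

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(G, F_inv, D_X, D_Y, x, y, lam=10.0):
    """Adversarial + cycle-consistency loss for both generators."""
    fake_y, fake_x = G(x), F_inv(y)
    # Least-squares adversarial losses: make each D score fakes as real (1)
    adv = ((D_Y(fake_y) - 1) ** 2).mean() + ((D_X(fake_x) - 1) ** 2).mean()
    # Cycle consistency: F(G(x)) ~ x and G(F(y)) ~ y
    cycle = F.l1_loss(F_inv(fake_y), x) + F.l1_loss(G(fake_x), y)
    return adv + lam * cycle
```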
5.4 CycleGAN: Limitations
Cannot handle shape changes (e.g., dog to cat)
Can get confused on images outside of the training domains (e.g., horse with rider)
- It cannot generalize to images outside of the training domains.
- Cannot close the gap with paired translation methods