Word Vector

1. Logistic Regression for x

Input:
- a document
- a fixed set of classes
Output: a predicted class

Input observation: vector

Weights: one per feature:

Sometimes we call the weights

Output: a predicted class

(multinomial logistic regression:

2. Making with sigmoids

$\hat{y}= \begin{cases}1 & \text { if } P(y=1 \mid x)>0.5 & \text { if } \mathrm{w} \cdot \mathrm{x}+\mathrm{b}>0 \\ 0 & \text { otherwise } & \text { if } \mathrm{w} \cdot \mathrm{x}+\mathrm{b} \leq 0\end{cases}$ $\begin{aligned} P(y=1) &=\sigma(w \cdot x+b) \\ &=\frac{1}{1+\exp (-(w \cdot x+b))} \\ P(y=0) &=1-\sigma(w \cdot x+b) \\ &=1-\frac{1}{1+\exp (-(w \cdot x+b))} \\ &=\frac{\exp (-(w \cdot x+b))}{1+\exp (-(w \cdot x+b))} \end{aligned}$ $L(\beta)=\prod^n_{i=1}y_iP(Y=y_i|X=x_i;\beta)\\$ $\begin{array}{l} logL(\beta)=\sum^n_{i=1}y_ilogP(Y=1|X=x_i;\beta)+(1-y_i)logP(Y=0|X=x_i;\beta)\\ \ \ \ \ \ \ \ \ \ \ \ \ \ =\sum^n_{i=1} y_ilog(\frac{1}{1+e^{-\beta^Tx}})+(1-y_i)log(\frac{e^{-\beta^Tx}}{1+e^{-\beta^Tx}})\\ \ \ \ \ \ \ \ \ \ \ \ \ \ =\sum^n_{i=1} y_i(log(\frac{1}{1+e^{-\beta^Tx}})-log(\frac{e^{-\beta^Tx}}{1+e^{-\beta^Tx}}))+log(\frac{e^{-\beta^Tx}}{1+e^{-\beta^Tx}})\\ \ \ \ \ \ \ \ \ \ \ \ \ \ =\sum^n_{i=1} y_i\beta^Tx-log(1+e^{\beta^Tx}) \end{array}$ $\begin{array}{ll} \bigtriangledown_\beta L(\beta)=\sum^n_{i=1}(y_i x - \frac{x_ie^{\beta^Tx_i}}{1+e^{\beta^Tx_i}})\\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ =\sum^n_{i=1}x_i(y_i - \sigma(\beta^Tx_i)) \end{array}$

For SGD:

$\begin{array}{ll}\\ \beta^{(k+1)}=\beta^{(k)}-\alpha \bigtriangledown L(\beta)\\ \ \ \ \ \ \ \ \ \ \ =\beta^{(k)}-\sum^{min\_batch}_{i=1}x_i(y_i - \sigma(\beta^Tx_i)) \end{array}$

Training: we learn weights w and b using stochastic gradient descent and cross-entropy loss.

3. How to present the word

把所有积极的词搜集起来，尝试用频率对单词进行表达
Suppose

$\begin{aligned} p(+|x)=P(Y=1|x) & = s (\begin{array}{l} W {x+b}) \end{array} \\ &=s([2.5,-5.0,-1.2,0.5,2.0,0.7]\dotproduct[3,2,1,3,0,4.19]+0.1) \\ &=s(0.833) \\ &=0.70 \\ p(-|x)=P(Y=0 | x) &=1-s(W x+b) \\ =& 0.30 \end{aligned}$

Classification problem :
- x: watermelons
- y : which class belongs to

用拉平的向量来表达这个图像

用已有的文本去训练出词向量

3.1 One-Hot

用单词的在集合的位置来表示单词

Problem1

同义词的表达，虽然使用上完全相同，但还是有细微差别

big和large虽然很相似，但是用法还是有细微的差别，不能完全等同

字符级别的相似性和词的相似性是有差异的，对于独热形式，使用内积，两个向量的相似性都是0，则无法用内积衡量两个词的相似性

以前，可以用一个树状结构来构建词向量，每个词都有其对应的上位词，通过比较这两个节点找到共同的父亲所经历的路径，来进行判别
但是构建一棵合理的树，需要大量专家知识，并且树的结构很好优化；另一方面由于语言变化迅速，这棵树必须经常更新

Problem2

造成大量的空间浪费

3.2 Distributional Representation

Def:

低维的，稠密的词向量表达
- Turney & Pantel (2010)
  “If units of text have similar vectors in a text frequency matrix, then they tend to have similar meanings.”
When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window):
Use the many contexts of w to build up a representation of w
- 用context表示中心词
  
  Example：
  - 随着国外新冠患者数量的猛增,让原本松了一口气的我们，再次紧张起来。
  - 治疗一列新冠轻症患者的费用在1万元上下。
  - 肆虐的新冠病毒究竟长什么样?
  - 基因测序等研究结果显示，新冠病毒与SARS冠状病毒同属冠状病毒科的β属冠状病毒

Word Vectors

We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.
Note: word vectors are sometimes called word embeddings. They are a distributed representation.

4. Similarity

$\begin{aligned} &\text { euclidean }(u, v)=\sqrt{\sum_{i=1}^{n}\left|u_{i}-v_{i}\right|^{2}} \\ &\operatorname{cosine}(u, v)=1-\frac{\sum_{i=1}^{n} u_{i} \times v_{i}}{\|u\|_{2} \times\|v\|_{2}} \end{aligned}$

一般先归一化再使度量相似度

使用cosine的原因：

欧氏距离能够体现个体数值特征的绝对差异，所以更多的用于需要从维度的数值大小中体现差异的分析，如使用用户行为指标分析用户价值的相似度或差异。
余弦距离更多的是从方向上区分差异，而对绝对的数值不敏感，更多的用于使用用户对内容评分来区分兴趣的相似度和差异，同时修正了用户间可能存在的度量标准不统一的问题（因为余弦距离对绝对数值不敏感）。

【下面举一个例子，来说明余弦计算文本相似度】

Example：

句子A：这只皮靴号码大了。那只号码合适

句子B：这只皮靴号码不小，那只更合适

怎样计算上面两句话的相似程度？
基本思路是：如果这两句话的用词越相似，它们的内容就应该越相似。因此，可以从词频入手，计算它们的相似程度。
第一步，分词。

句子A：这只/皮靴/号码/大了。那只/号码/合适。

句子B：这只/皮靴/号码/不/小，那只/更/合适。

第二步，列出所有的词。

这只，皮靴，号码，大了。那只，合适，不，小，很

第三步，计算词频。

句子A：这只1，皮靴1，号码2，大了1。那只1，合适1，不0，小0，更0

句子B：这只1，皮靴1，号码1，大了0。那只1，合适1，不1，小1，更1

第四步，写出词频向量。

　　句子A：(1，1，2，1，1，1，0，0，0)

　　句子B：(1，1，1，0，1，1，1，1，1)

到这里，问题就变成了如何计算这两个向量的相似程度。我们可以把它们想象成空间中的两条线段，都是从原点（[0, 0, …]）出发，指向不同的方向。两条线段之间形成一个夹角，如果夹角为0度，意味着方向相同、线段重合,这是表示两个向量代表的文本完全相等；如果夹角为90度，意味着形成直角，方向完全不相似；如果夹角为180度，意味着方向正好相反。因此，我们可以通过夹角的大小，来判断向量的相似程度。夹角越小，就代表越相似。
计算结果中夹角的余弦值为0.81非常接近于1，所以，上面的句子A和句子B是基本相似的

由此，我们就得到了文本相似度计算的处理流程是:

Other metrics

Matching coefficient

$\operatorname{matching}(u, v)=\sum_{i=1}^{n} \min \left(u_{i}, v_{i}\right)$

Jaccard distance

$\operatorname{jaccard}(u, v)=1-\frac{\operatorname{matchin} \mathbf{g}(u, v)}{\sum_{i=1}^{n} \max \left(u_{i}, v_{i}\right)}$

Dice distance

$\text { dice }(u, v)=1-\frac{2 \times \operatorname{matching}(u, v)}{\sum_{i=1}^{n} u_{i}+v_{i}}$

Overlap

$\operatorname{overlap}(u, v)=1-\frac{\text { matching }(u, v)}{\min \left(\sum_{i=1}^{n} u_{i}, \sum_{i=1}^{n} v_{i}\right)}$

证明上诉距离是否满足这个性质

5. Word2Vec Model

ldea:

We have a large corpus of text
Every word in a fixed vocabulary is represented by a vector.
Go through each position t in the text, which has a center word c and context (“outside”) words o
Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
Keep adjusting the word vectors to maximize this probability
- Mikolov, T., Sutskever, l., Chen, K., Corrado, G.S.and Dean,J., 2013. Distributed representations of words and phrasesand their compositionality.In Advances in neural information processing systems (pp.3111-3119).

Example: windows and process for computing the

5.1 Two model variants:

Skip-grams (SG)
Predict context (“outside”) words (position independent) given center word
Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words

6. SG

6.1 Example

The quick brown fox jumps over the lazy dog

For each position , predict context words within awindow of fixed size , given center word ;.
第一个连乘要扫描整个文档，第二个连乘是滑动窗口的长度
is all variablesto be optimized

$\text { Likelihood }=L(\theta)=\prod_{t=1}^{T} \prod_{-m \leq j \leq m \atop j \neq 0} P\left(w_{t+j} \mid w_{t} ; \theta\right)$

The objective function is the (average) negative log likelihood:
- sometimes called cost or loss function

$J(\theta)=-\frac{1}{T} \log L(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m \atop j \neq 0} \log P\left(w_{t+j} \mid w_{t} ; \theta\right)$

Minimizing objective function Maximizing predictive accuracy
归一化，消除词表长度的影响
Note: 对于一个词，它既可能是中间词，也可能是背景词

Question: How to calculate ?

Answer: We will use two vectors per word w:
- when w is a center word
- when w is a context word
Then for a center word c and a context word o:

$P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}$

(1) Dot product compares similarity of and .

$u^{T} v=u . v=\sum_{i=1}^{n} u_{i} v_{i}$

Larger dot product = larger probability

(2) Exponentiation makes anything positive

(3) Normalize over entire vocabulary to give probability distribution

This is an example of the softmax function $\operatorname{softmax}\left(x_{i}\right)=\frac{\exp \left(x_{i}\right)}{\sum_{j=1}^{n} \exp \left(x_{j}\right)}=p_{i}$
The softmax function maps arbitrary values to a probability distribution
- “max” because amplifies probability of largest
- “soft” because still assigns some probability to smaller
- Frequently used in Deep Learning

6.3 Word2Vec: skip-gram model

fake task

借助假的分类任务，从而学到与其他任务相似性的参数

输入一个读热向量，得到10000个神经元（即一个10000维向量），每个神经元对应一个词的概率
即相当于在做一个给定一个词，预测另一个词的任务
而这其中学习的隐藏层的参数其实就是词向量，或者说这个学习到的参数与词的概率分布具有同源分布

Process

首先从隐藏层矩阵挑出相应的中心词向量c

然后用中心词向量（300dim）乘以刚才的隐藏层矩阵，就可以得到（10000dim）对应每个词的概率分布

接着，我们对于给定窗口长度，分别在其对应的背景词位置使用softmax，从而得到这些背景词的条件概率分布，再使用交叉熵

CBOW AND SG

6.4 Problem

由于分类任务的巨大性（每次要对所有词向量做内积），计算量巨大
这将会导致
- Running gradient descent on a neural network that large is going to be slow.
- Need a huge amount of training data in order to tune that many weights and avoid over fitting.

7. Solusion

Subsampling frequent words to decrease the number of training examples.
Modifying the optimization objective with a technique they called “Negative Sampling” , which causes each training sample to update only a small percentage of the model’s weights.
Word pairs and “phases”

7.1 Subsampling

Subsampling frequent words to decrease the number of training examples.

对于频率较高的词，不用采样

自动调整采样的公式，经常出现的单词采样概率小，不常出现的概率大

7.2 Negative Sampling

Training a neural network means taking a training example and adjusting all of the neuron weights slightly so that it predicts that training sample more accurately.
Negative sampling addresses this issue by having each training sample only modify a small percentage of the weights , rather than all of them.
- 通过缩小最后的output，来使得计算量下降，并使得参数每次变化不会特别大

The “negative samples” are chosen using a “unigram distribution”. Essentially , the probability for selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples.

3. Word pairs and “phases”.

The authors pointed out that a word pair like “Boston Globe”(a newspaper) has a much different meaning than the individual words “Boston” and “Globe”.So it makes sense totreat “Boston Globe”, wherever it occurs in the text, as asingle word with its own word vector representation.

8. How to evaluate the generated embeddings?

Small windows (C=+| 2) : nearest words are syntacticall similar words in same taxonomy 最近的单词是语法相似在相同分类法中的单词
Large windows (c= +|- 5): nearest words are related words in samle semantic field 最近词是语义场中的相关词

距离平移性，通过平移操作的计算，观测生成的距离的相似性
- doctor-man+woman=nurse

可视化，svd，tsvg，压缩到二维图像去观测

并没有考虑反义词的现象，这有可能使得相似性不能很好地度量语义的相似

无法很好度量一些层次关系

9. Other interesting things:

用词向量看文化
不同语言学的语义有差异，可能会带来翻译不一致的问题
混合embedding，使得语义更加丰富