IR
1. HMM Exercise
- Transition probabilities $P(y_i \mid y_{i-1})$
- Emission probabilities $P(x_i \mid y_i)$
- For $i = 1, \dots, n$, the joint probability factorizes as $P(\mathbf{y}, \mathbf{x}) = P(y_1)\,P(x_1 \mid y_1)\prod_{i=2}^{n} P(y_i \mid y_{i-1})\,P(x_i \mid y_i)$
Analysis: because the HMM is locally normalized, it tends to transition into states that have few outgoing transitions.
From the plot we can see that the HMM's states rarely switch.
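A toy illustration of this effect (the numbers are mine, not from the exercise): suppose state $A$ has only 2 possible successor states while state $B$ has 5. Under local normalization, even completely uninformative transitions out of $A$ each receive probability $\tfrac{1}{2}$, while those out of $B$ each receive $\tfrac{1}{5}$. Any path routed through $A$ therefore accumulates a larger product of per-step probabilities than a comparable path through $B$, regardless of the observations; this is the label bias that a globally normalized model avoids.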
2. CRF vs. HMM
- Taking the log of the HMM joint probability, we easily get the following equation:

$$\log P(\mathbf{y}, \mathbf{x}) = \log P(y_1) + \sum_{i=2}^{n} \log P(y_i \mid y_{i-1}) + \sum_{i=1}^{n} \log P(x_i \mid y_i)$$

Why is this useful? Example:
$\mathbf{x}$: *the dog ate the homework*
- Similarly, we can rewrite the first, second, and third terms of the expression as potentials: each log-probability becomes a score indexed by the tag pair or tag-word pair, e.g. $\log P(y_i \mid y_{i-1}) = \phi_t(y_{i-1}, y_i)$ and $\log P(x_i \mid y_i) = \phi_e(y_i, x_i)$.
- So we can transform the original expression into:

$$\log P(\mathbf{y}, \mathbf{x}) = \sum_{i=1}^{n} \phi_t(y_{i-1}, y_i) + \sum_{i=1}^{n} \phi_e(y_i, x_i)$$

- For a CRF, $\phi_t$ and $\phi_e$ are learnable parameters: the emission table has shape $|\text{words}| \times |\text{tags}|$ and the transition table has shape $(|\text{tags}| + 2) \times (|\text{tags}| + 2)$ (the $+2$ covers the start and end tags). At training time, we maximize the posterior probability:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\big(\sum_i \phi_t(y_{i-1}, y_i) + \sum_i \phi_e(y_i, x_i)\big)}{\sum_{\mathbf{y}'} \exp\big(\sum_i \phi_t(y'_{i-1}, y'_i) + \sum_i \phi_e(y'_i, x_i)\big)}$$

- Taking the log:

$$\log P(\mathbf{y} \mid \mathbf{x}) = \sum_i \phi_t(y_{i-1}, y_i) + \sum_i \phi_e(y_i, x_i) - \log Z(\mathbf{x})$$

- By gradient ascent, the gradient is the observed feature counts minus the model's expected feature counts:

$$\frac{\partial \log P(\mathbf{y}^{*} \mid \mathbf{x})}{\partial \phi} = f(\mathbf{y}^{*}, \mathbf{x}) - \mathbb{E}_{\mathbf{y} \sim P(\mathbf{y} \mid \mathbf{x})}\big[f(\mathbf{y}, \mathbf{x})\big]$$
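A minimal sketch of this training update (the toy tag set, sentence, and names are illustrative, not from the notes). For a tiny tag set we can enumerate every tag sequence, so $Z(\mathbf{x})$ and the feature expectations are exact:

```python
# Brute-force CRF training by gradient ascent on a toy example.
import itertools
import math

TAGS = ["N", "V"]
words = ["the", "dog", "ate"]
gold = ["N", "N", "V"]

# Learnable potentials: transition phi_t[prev][cur], emission phi_e[tag][word].
phi_t = {p: {c: 0.0 for c in TAGS} for p in ["<s>"] + TAGS}
phi_e = {t: {w: 0.0 for w in words} for t in TAGS}

def score(y):
    s = phi_t["<s>"][y[0]] + sum(phi_t[y[i - 1]][y[i]] for i in range(1, len(y)))
    return s + sum(phi_e[y[i]][words[i]] for i in range(len(y)))

def log_Z():
    return math.log(sum(math.exp(score(y))
                        for y in itertools.product(TAGS, repeat=len(words))))

# Gradient ascent on log P(gold | x) = score(gold) - log Z(x); the gradient is
# observed minus expected feature counts (transitions held fixed for brevity).
lr = 0.5
for _ in range(50):
    logZ = log_Z()
    expected = {t: {w: 0.0 for w in words} for t in TAGS}
    for y in itertools.product(TAGS, repeat=len(words)):
        p = math.exp(score(y) - logZ)
        for i, t in enumerate(y):
            expected[t][words[i]] += p
    for i, t in enumerate(gold):
        phi_e[t][words[i]] += lr * 1.0            # observed count
    for t in TAGS:
        for w in words:
            phi_e[t][w] -= lr * expected[t][w]    # minus expected count

print("log P(gold | x) =", score(gold) - log_Z())  # climbs toward 0
```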
3. CRF
- Assume a linear-chain CRF with transition features $t_k$ defined on the edges and state features $s_l$ defined on the nodes.
- That is, the model is expressed through conditional scores on edges (transitions) and on states.
- The posterior probability can then be computed as:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i, k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i, l} \mu_l s_l(y_i, x, i)\Big)$$

- Assume there are $K_1$ transition features and $K_2$ state features, $K = K_1 + K_2$, and define:

$$f_k(y_{i-1}, y_i, x, i) = \begin{cases} t_k(y_{i-1}, y_i, x, i), & k = 1, 2, \dots, K_1 \\ s_l(y_i, x, i), & k = K_1 + l;\; l = 1, 2, \dots, K_2 \end{cases}$$

- Then sum the transition and state features over all positions $i$:

$$f_k(y, x) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x, i), \quad k = 1, 2, \dots, K$$

- Let $w_k$ denote the weight of feature $f_k$:

$$w_k = \begin{cases} \lambda_k, & k = 1, 2, \dots, K_1 \\ \mu_l, & k = K_1 + l;\; l = 1, 2, \dots, K_2 \end{cases}$$

- The conditional random field can then be expressed as:

$$P(y \mid x) = \frac{1}{Z(x)} \exp \sum_{k=1}^{K} w_k f_k(y, x), \qquad Z(x) = \sum_{y} \exp \sum_{k=1}^{K} w_k f_k(y, x)$$

- Let $w$ be the weight vector $w = (w_1, w_2, \dots, w_K)^{T}$ and let $F(y, x)$ denote the global feature vector:

$$F(y, x) = (f_1(y, x), f_2(y, x), \dots, f_K(y, x))^{T}$$

- Then the conditional random field can be written as the inner product of $w$ and $F(y, x)$:

$$P_w(y \mid x) = \frac{\exp(w \cdot F(y, x))}{Z_w(x)}$$

- where $Z_w(x) = \sum_{y} \exp(w \cdot F(y, x))$.
3.1 Inference:
- If y consists of 5 variables with 30 values each, how expensive are these computations? Naive enumeration considers $30^5$ sequences, so we need to
- constrain the form of our CRFs to make inference tractable.
- Because the emission potential should not depend on $\mathbf{x}$ arbitrarily, we give it position information: $\phi_e(y_i, i, \mathbf{x})$.
- Notation: omit $\mathbf{x}$ from the factor graph entirely (it is implicit). Don't include the initial distribution; it can be baked into the other factors.
- Sequential CRFs:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{i=2}^{n} \exp\big(\phi_t(y_{i-1}, y_i)\big) \prod_{i=1}^{n} \exp\big(\phi_e(y_i, i, \mathbf{x})\big)$$
3.2 Inference
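The argmax over tag sequences can be computed exactly with the Viterbi algorithm. A minimal sketch, assuming log-space potential tables (`emit` stands in for learned $\phi_e(y_i, i, \mathbf{x})$ scores, `trans` for $\phi_t$):

```python
import numpy as np

def viterbi(emit, trans):
    """emit: (n, T) log-potentials phi_e(y_i, i, x); trans: (T, T) phi_t."""
    n, T = emit.shape
    delta = np.zeros((n, T))            # best score of a prefix ending in tag t
    back = np.zeros((n, T), dtype=int)  # back-pointers
    delta[0] = emit[0]
    for i in range(1, n):
        # scores[p, c] = delta[i-1, p] + trans[p, c] + emit[i, c]
        scores = delta[i - 1][:, None] + trans + emit[i][None, :]
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0)
    y = [int(delta[-1].argmax())]       # best final tag, then follow pointers
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return y[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 30)), rng.normal(size=(30, 30))))
# Runtime is O(n * T^2): for 5 variables with 30 values, 5 * 900 ops, not 30^5.
```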
3.3 Training
- Logistic regression: $P(y \mid x) = \frac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))}$
- Maximize $\mathcal{L}\left(\mathbf{y}^{*}, \mathbf{x}\right)=\log P\left(\mathbf{y}^{*} \mid \mathbf{x}\right)$
- The gradient is completely analogous to logistic regression: observed features minus expected features,

$$\frac{\partial \mathcal{L}}{\partial w} = F(\mathbf{y}^{*}, \mathbf{x}) - \mathbb{E}_{\mathbf{y} \sim P(\mathbf{y} \mid \mathbf{x})}\big[F(\mathbf{y}, \mathbf{x})\big]$$

- The expectations are computed with the forward-backward algorithm.
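A minimal sketch of forward-backward, under the same log-space potential-table assumptions as the Viterbi sketch above; it returns $\log Z(\mathbf{x})$ and the per-position tag marginals $P(y_i = t \mid \mathbf{x})$ that the gradient's expectations are built from:

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(emit, trans):
    """emit: (n, T) log phi_e; trans: (T, T) log phi_t."""
    n, T = emit.shape
    alpha = np.zeros((n, T))   # alpha[i, t]: log-sum of prefixes ending in t
    beta = np.zeros((n, T))    # beta[i, t]: log-sum of suffixes starting at t
    alpha[0] = emit[0]
    for i in range(1, n):
        alpha[i] = emit[i] + logsumexp(alpha[i - 1][:, None] + trans, axis=0)
    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(trans + (emit[i + 1] + beta[i + 1])[None, :], axis=1)
    log_Z = logsumexp(alpha[-1])
    marginals = np.exp(alpha + beta - log_Z)   # P(y_i = t | x)
    return log_Z, marginals

rng = np.random.default_rng(0)
log_Z, marg = forward_backward(rng.normal(size=(4, 3)), rng.normal(size=(3, 3)))
print(log_Z, marg.sum(axis=1))  # each row of marginals sums to 1
```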
- Quasi-Newton methods:
- Learning objective (over the empirical distribution $\tilde{P}$):

$$\min_{w \in \mathbf{R}^{K}} f(w) = \sum_{x} \tilde{P}(x) \log \sum_{y} \exp\Big(\sum_{k=1}^{K} w_k f_k(x, y)\Big) - \sum_{x, y} \tilde{P}(x, y) \sum_{k=1}^{K} w_k f_k(x, y)$$

- Gradient function:

$$g(w) = \sum_{x, y} \tilde{P}(x)\, P_w(y \mid x)\, f(x, y) - E_{\tilde{P}}[f]$$

BFGS algorithm:
Input: feature functions $f_1, f_2, \dots, f_n$; empirical distribution $\tilde{P}(X, Y)$.
Output: optimal parameter value $\hat{w}$; optimal model $P_{\hat{w}}(y \mid x)$.
(1) Choose an initial point $w^{(0)}$, take $B_0$ to be a positive-definite symmetric matrix, and set $k = 0$.
(2) Compute $g_k = g(w^{(k)})$. If $g_k = 0$, stop; otherwise go to (3).
(3) Solve $B_k p_k = -g_k$ for $p_k$.
(4) Line search: find $\lambda_k$ such that $f(w^{(k)} + \lambda_k p_k) = \min_{\lambda \ge 0} f(w^{(k)} + \lambda p_k)$.
(5) Set $w^{(k+1)} = w^{(k)} + \lambda_k p_k$.
(6) Compute $g_{k+1} = g(w^{(k+1)})$. If $g_{k+1} = 0$, stop; otherwise compute $B_{k+1}$ by

$$B_{k+1} = B_k + \frac{y_k y_k^{T}}{y_k^{T} \delta_k} - \frac{B_k \delta_k \delta_k^{T} B_k}{\delta_k^{T} B_k \delta_k}$$

where $y_k = g_{k+1} - g_k$ and $\delta_k = w^{(k+1)} - w^{(k)}$.
(7) Set $k = k + 1$ and go to (3).
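In practice the quasi-Newton step is usually delegated to a library optimizer rather than hand-rolled; a sketch with SciPy's BFGS (the quadratic objective is a stand-in for the real $f(w)$ and $g(w)$ above):

```python
import numpy as np
from scipy.optimize import minimize

def f(w):   # the CRF negative log-likelihood would go here
    return 0.5 * np.sum((w - 1.0) ** 2)

def g(w):   # its gradient g(w)
    return w - 1.0

res = minimize(f, x0=np.zeros(5), jac=g, method="BFGS")
print(res.x)  # converges to the optimum w = 1
```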
4. Information Retrieval
4.1 Introduction
- Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
4.2 Term-document incidence matrices
(Term-document incidence matrix: rows are terms such as Brutus, Caesar, and Calpurnia; columns are documents; an entry is 1 if the document contains the term.)
4.3 Incidence vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then:
bitwise AND 110100 AND 110111 AND 101111 = 100100
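A sketch of this query using Python integers as incidence bit vectors (the four toy documents are mine; bit $i$ corresponds to document $i$, so the bit order differs from the slide-style strings above):

```python
docs = [
    "brutus caesar",            # doc 0
    "brutus caesar calpurnia",  # doc 1
    "caesar",                   # doc 2
    "brutus caesar cleopatra",  # doc 3
]
ALL = (1 << len(docs)) - 1      # mask with one bit per document

def incidence(term):
    """Bit i is set iff document i contains the term."""
    v = 0
    for i, text in enumerate(docs):
        if term in text.split():
            v |= 1 << i
    return v

# Brutus AND Caesar AND NOT Calpurnia, as bitwise ops on the vectors.
hits = incidence("brutus") & incidence("caesar") & (ALL & ~incidence("calpurnia"))
print([i for i in range(len(docs)) if hits >> i & 1])  # -> [0, 3]
```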
4.4 Problem: Can’t build the matrix
- Consider $N = 1$ million documents, each with about 1000 words.
- Avg 6 bytes/word including spaces/punctuation
- 6GB of data in the documents.
- Say there are $M = 500\text{K}$ distinct terms among these.
- A $500\text{K} \times 1\text{M}$ matrix has half a trillion 0's and 1's.
- But it has no more than one billion 1's ($10^6$ docs $\times$ 1000 words),
- so the matrix is extremely sparse.
- What’s a better representation?
- We only record the 1 positions.
4.5 Inverted index
- For each term t, we must store a list of all documents that contain t
- Identify each doc by a docID , a document serial number
- Can we use fixed-size arrays for this?
4.6 Initial stages of text processing
- Tokenization
- Cut character sequence into word tokens
- Deal with cases like "John's" and "a state-of-the-art solution"
- Normalization
- Map text and query term to same form
- You want U.S.A. and USA to match
- Stemming
- We may wish different forms of a root to match
- e.g., authorize, authorization
- Stop words
- We may omit very common words (or not)
- the, a, to, of
4.6.1 Indexer steps: Token sequence
- Sequence of (Modified token, Document ID) pairs.
4.6.2 Indexer steps: Sort
- Sort by terms
- And then docID
- Core indexing step
4.6.3 Indexer steps: Dictionary & Postings
- Multiple term entries in a single document are merged.
- Split into Dictionary and Postings
- Doc. frequency information is added.
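A sketch of these three indexer steps on a toy two-document collection (the variable names are illustrative):

```python
from collections import defaultdict

docs = {1: "I did enact Julius Caesar", 2: "So let it be with Caesar"}

pairs = [(tok.lower(), doc_id)                 # 4.6.1: (token, docID) pairs
         for doc_id, text in docs.items()
         for tok in text.split()]
pairs.sort()                                   # 4.6.2: sort by term, then docID

postings = defaultdict(list)                   # 4.6.3: dictionary & postings
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)          # merge duplicates within a doc

dictionary = {term: len(plist) for term, plist in postings.items()}  # doc freq
print(dictionary["caesar"], postings["caesar"])  # -> 2 [1, 2]
```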
4.6.4 Where do we pay in storage?
- In the dictionary (the terms) and in the postings (the docID pointers).
4.7 Query processing with an inverted index
4.7.1 Query processing: AND
- Consider processing the query:
- Brutus AND Caesar
- Locate Brutus in the Dictionary;
- Retrieve its postings.
- Locate Caesar in the Dictionary;
- Retrieve its postings.
- “Merge” the two postings (intersect the document sets):
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries
If the list lengths are $x$ and $y$, the merge takes $O(x + y)$ operations. Crucial: the postings must be sorted by docID.
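The merge as code: a minimal sketch of the two-pointer intersection (the postings values are illustrative):

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in O(x + y)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))   # -> [1, 2, 4]
```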
4.8 The Boolean Retrieval Model & Extended Boolean Models
Exercise: Adapt the merge for the queries:
- Brutus AND NOT Caesar
- Brutus OR NOT Caesar
Can we still run through the merge in time $O(x + y)$? What can we achieve? What about an arbitrary Boolean formula?
- (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in “linear” time?
- Linear in $N$ ($N$ is the total postings size)
- Can we do better?
4.9 Query optimization
4.9.1 Merge
Query: Brutus AND Calpurnia AND Caesar
Start by merging the two query terms with the smallest frequencies.
Process in order of increasing freq
- start with smallest set, then keep cutting further
4.9.2 More general optimization
- e.g., (madding OR crowd) AND (ignoble OR strife)
- Get doc. freq.’s for all terms.
- Estimate the size of each OR by the sum of its doc. freq.’s (conservative)
- Process in increasing order of OR sizes.
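A sketch of this heuristic with made-up document frequencies:

```python
# Estimate each OR group's size by the sum of its terms' doc frequencies
# (a conservative upper bound), then process groups smallest-first.
df = {"madding": 10, "crowd": 50, "ignoble": 5, "strife": 20}

query = [["madding", "crowd"], ["ignoble", "strife"]]  # (a OR b) AND (c OR d)
ordered = sorted(query, key=lambda group: sum(df[t] for t in group))
print(ordered)  # -> [['ignoble', 'strife'], ['madding', 'crowd']]
```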