IR
1. HMM Exercise
- Transition probabilities $P(y_i \mid y_{i-1})$
- Emission probabilities $P(x_i \mid y_i)$
- For each position $i$, the HMM multiplies one locally normalized transition factor and one emission factor.
Analysis: because the HMM is locally normalized, it tends to transition into states that have fewer outgoing transitions (the label bias problem). We can see that the HMM's states do not switch easily.
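As a hypothetical numeric illustration of this effect (numbers invented for this note): suppose state $a$ has only 2 outgoing transitions while state $b$ has 5, each roughly uniform. Transitions out of $a$ then score about $1/2$ while transitions out of $b$ score about $1/5$, so under local normalization any path passing through $a$ multiplies in larger factors, and decoding is biased toward low-branching states regardless of the observations.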
2. CRF vs. HMM
- We can easily derive the following equation:
  $$P(\mathbf{y} \mid \mathbf{x}) = \frac{P(\mathbf{y}, \mathbf{x})}{P(\mathbf{x})} \propto P(\mathbf{y}, \mathbf{x}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$$
Why can we drop $P(\mathbf{x})$? Because the input sentence is fixed, $P(\mathbf{x})$ is a constant that does not affect the $\arg\max$ over $\mathbf{y}$.
Example:
$\mathbf{x}$: the dog ate the homework
- Similarly, we can transform the first, second, and third factors into exponentiated potentials:
  $$P(y_1) \rightarrow \exp(\phi_{i}(y_1)), \quad P(y_i \mid y_{i-1}) \rightarrow \exp(\phi_{t}(y_{i-1}, y_i)), \quad P(x_i \mid y_i) \rightarrow \exp(\phi_{e}(y_i, i, \mathbf{x}))$$
- Therefore the original expression can be rewritten as:
  $$P(\mathbf{y} \mid \mathbf{x}) \propto \exp(\phi_{i}(y_1)) \prod_{i=2}^{n} \exp(\phi_{t}(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_{e}(y_i, i, \mathbf{x}))$$
- For the CRF, the potentials $\phi$ are learnable parameters:
  - initial scores: length $|\text{tags}|$
  - emission scores: $|\text{words}| \times |\text{tags}|$
  - transition scores: $|\text{tags}| \times (|\text{tags}| + 2)$ (adding start and end states)
  - At training time, we maximize the posterior probability $P(\mathbf{y} \mid \mathbf{x})$.
- Taking the log (writing the total score as $\mathbf{w}^{\top} \mathbf{f}(\mathbf{x}, \mathbf{y})$):
  $$\log P(\mathbf{y} \mid \mathbf{x}) = \mathbf{w}^{\top} \mathbf{f}(\mathbf{x}, \mathbf{y}) - \log \sum_{\mathbf{y}'} \exp\left(\mathbf{w}^{\top} \mathbf{f}(\mathbf{x}, \mathbf{y}')\right)$$
- By gradient ascent:
  $$\frac{\partial}{\partial \mathbf{w}} \log P(\mathbf{y}^{*} \mid \mathbf{x}) = \mathbf{f}(\mathbf{x}, \mathbf{y}^{*}) - \mathbb{E}_{\mathbf{y} \sim P(\mathbf{y} \mid \mathbf{x})}\left[\mathbf{f}(\mathbf{x}, \mathbf{y})\right]$$
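A minimal sketch of this gradient for a toy linear-chain CRF, computing the expectation by brute-force enumeration over all tag sequences (the featurization ignores the input for brevity; all names here are illustrative, not from the lecture):

```python
import itertools
import numpy as np

def feats(y, n_tags):
    """Toy indicator features: one count per (prev_tag, tag) transition."""
    f = np.zeros(n_tags * n_tags)
    for prev, cur in zip(y, y[1:]):
        f[prev * n_tags + cur] += 1
    return f

def grad_log_likelihood(w, y_star, n_tags):
    """Observed features minus expected features under P(y | x)."""
    seqs = list(itertools.product(range(n_tags), repeat=len(y_star)))
    scores = np.array([w @ feats(y, n_tags) for y in seqs])
    p = np.exp(scores - scores.max())
    p /= p.sum()                                   # brute-force P(y | x)
    expected = sum(pi * feats(y, n_tags) for pi, y in zip(p, seqs))
    return feats(y_star, n_tags) - expected

# one gradient-ascent step on a 3-tag, length-4 toy example
w = np.zeros(9)
w += 0.1 * grad_log_likelihood(w, y_star=(0, 1, 1, 2), n_tags=3)
```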
3. CRF
- Assume a linear-chain conditional random field:
  $$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left(\sum_{i, k} \lambda_{k} t_{k}(y_{i-1}, y_{i}, \mathbf{x}, i) + \sum_{i, l} \mu_{l} s_{l}(y_{i}, \mathbf{x}, i)\right)$$
  where $Z(\mathbf{x})$ sums the same exponential over all label sequences $\mathbf{y}$.
- In effect, this expresses the distribution through edge (transition) potentials and state potentials.
- The posterior probability of a given label sequence can then be computed by plugging it into the formula above.
- Suppose there are $K_{1}$ transition features and $K_{2}$ state features, $K = K_{1} + K_{2}$. Define:
  $$f_{k}(y_{i-1}, y_{i}, \mathbf{x}, i) = \begin{cases} t_{k}(y_{i-1}, y_{i}, \mathbf{x}, i), & k = 1, 2, \ldots, K_{1} \\ s_{l}(y_{i}, \mathbf{x}, i), & k = K_{1} + l;\ l = 1, 2, \ldots, K_{2} \end{cases}$$
- Then sum the transition and state features over all positions $i$:
  $$f_{k}(\mathbf{y}, \mathbf{x}) = \sum_{i=1}^{n} f_{k}(y_{i-1}, y_{i}, \mathbf{x}, i), \quad k = 1, 2, \ldots, K$$
- Use $w_{k}$ to denote the weight of feature $f_{k}$:
  $$w_{k} = \begin{cases} \lambda_{k}, & k = 1, 2, \ldots, K_{1} \\ \mu_{l}, & k = K_{1} + l;\ l = 1, 2, \ldots, K_{2} \end{cases}$$
- Thus the conditional random field can be expressed as:
  $$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp \sum_{k=1}^{K} w_{k} f_{k}(\mathbf{y}, \mathbf{x}), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp \sum_{k=1}^{K} w_{k} f_{k}(\mathbf{y}, \mathbf{x})$$
- $\mathbf{w}$ is the weight vector: $\mathbf{w} = (w_{1}, w_{2}, \ldots, w_{K})^{\top}$
- Let $F(\mathbf{y}, \mathbf{x})$ denote the global feature vector, i.e.:
  $$F(\mathbf{y}, \mathbf{x}) = (f_{1}(\mathbf{y}, \mathbf{x}), f_{2}(\mathbf{y}, \mathbf{x}), \ldots, f_{K}(\mathbf{y}, \mathbf{x}))^{\top}$$
- Then the conditional random field can be written as an inner product of $\mathbf{w}$ and $F(\mathbf{y}, \mathbf{x})$:
  $$P_{\mathbf{w}}(\mathbf{y} \mid \mathbf{x}) = \frac{\exp(\mathbf{w} \cdot F(\mathbf{y}, \mathbf{x}))}{Z_{\mathbf{w}}(\mathbf{x})}$$
- where
  $$Z_{\mathbf{w}}(\mathbf{x}) = \sum_{\mathbf{y}} \exp(\mathbf{w} \cdot F(\mathbf{y}, \mathbf{x}))$$
3.1 Inference
- If $\mathbf{y}$ consists of 5 variables with 30 values each, how expensive are these computations? Brute force would enumerate $30^{5} \approx 24$ million label sequences.
- We need to constrain the form of our CRFs to make inference tractable.
Because the emission potential should not depend on $\mathbf{x}$ arbitrarily, we add position information to it: $\phi_{e}(y_{i}, i, \mathbf{x})$.
Notation: omit $\mathbf{x}$ from the factor graph entirely (it is implicit). Don't include the initial distribution; it can be baked into the other factors.
- Sequential CRFs:
  $$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_{t}(y_{i-1}, y_{i})) \prod_{i=1}^{n} \exp(\phi_{e}(y_{i}, i, \mathbf{x}))$$
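A minimal Viterbi decoding sketch for this sequential CRF, assuming the potentials are given as score (log-potential) matrices; names and shapes are illustrative:

```python
import numpy as np

def viterbi(phi_e, phi_t):
    """phi_e: (n, T) emission scores; phi_t: (T, T) transition scores.
    Returns the highest-scoring tag sequence under the sequential CRF."""
    n, T = phi_e.shape
    best = phi_e[0].copy()              # best score of a path ending in each tag
    back = np.zeros((n, T), dtype=int)  # backpointers
    for i in range(1, n):
        cand = best[:, None] + phi_t    # cand[p, t]: extend best path ending in p by p -> t
        back[i] = cand.argmax(axis=0)
        best = cand.max(axis=0) + phi_e[i]
    path = [int(best.argmax())]
    for i in range(n - 1, 0, -1):       # follow backpointers
        path.append(int(back[i][path[-1]]))
    return path[::-1]

# tiny usage example: 4 positions, 3 tags, random scores
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```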
3.2 Inference

3.3 Training
- Logistic regression: $P(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w} \cdot f(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w} \cdot f(\mathbf{x}, y'))}$
  - Maximize $\mathcal{L}\left(\mathbf{y}^{*}, \mathbf{x}\right) = \log P\left(\mathbf{y}^{*} \mid \mathbf{x}\right)$
  - Gradient is completely analogous to logistic regression:
  $$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = F(\mathbf{y}^{*}, \mathbf{x}) - \mathbb{E}_{\mathbf{y} \sim P_{\mathbf{w}}(\mathbf{y} \mid \mathbf{x})}\left[F(\mathbf{y}, \mathbf{x})\right]$$
- Forward-backward algorithm: computes the tag marginals $P(y_{i} = s \mid \mathbf{x})$ needed for the expectation above:
  $$\alpha_{1}(s) = \exp(\phi_{e}(s, 1, \mathbf{x})), \qquad \alpha_{i}(s) = \sum_{s'} \alpha_{i-1}(s')\, \exp(\phi_{t}(s', s) + \phi_{e}(s, i, \mathbf{x}))$$
  $$\beta_{n}(s) = 1, \qquad \beta_{i}(s) = \sum_{s'} \beta_{i+1}(s')\, \exp(\phi_{t}(s, s') + \phi_{e}(s', i+1, \mathbf{x}))$$
  $$P(y_{i} = s \mid \mathbf{x}) = \frac{\alpha_{i}(s)\, \beta_{i}(s)}{\sum_{s'} \alpha_{i}(s')\, \beta_{i}(s')}$$
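A matching forward-backward sketch over the same score matrices as the Viterbi sketch (kept in probability space for readability; a real implementation would work in log space to avoid underflow):

```python
import numpy as np

def forward_backward(phi_e, phi_t):
    """Tag marginals P(y_i = s | x) for the sequential CRF.
    phi_e: (n, T) emission scores; phi_t: (T, T) transition scores."""
    n, T = phi_e.shape
    A = np.exp(phi_t)                       # transition potentials
    alpha = np.zeros((n, T))
    beta = np.ones((n, T))
    alpha[0] = np.exp(phi_e[0])
    for i in range(1, n):                   # forward pass
        alpha[i] = (alpha[i - 1] @ A) * np.exp(phi_e[i])
    for i in range(n - 2, -1, -1):          # backward pass
        beta[i] = A @ (beta[i + 1] * np.exp(phi_e[i + 1]))
    marginals = alpha * beta                # alpha_i(s) * beta_i(s) is proportional to P(y_i = s, x)
    return marginals / marginals.sum(axis=1, keepdims=True)
```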
- Quasi-Newton method (for CRF learning):
  - Learning optimization objective:
  $$\min_{\mathbf{w}} f(\mathbf{w}) = \sum_{\mathbf{x}} \tilde{P}(\mathbf{x}) \log \sum_{\mathbf{y}} \exp\left(\sum_{i=1}^{n} w_{i} f_{i}(\mathbf{x}, \mathbf{y})\right) - \sum_{\mathbf{x}, \mathbf{y}} \tilde{P}(\mathbf{x}, \mathbf{y}) \sum_{i=1}^{n} w_{i} f_{i}(\mathbf{x}, \mathbf{y})$$
  - Gradient function:
  $$g(\mathbf{w}) = \sum_{\mathbf{x}, \mathbf{y}} \tilde{P}(\mathbf{x})\, P_{\mathbf{w}}(\mathbf{y} \mid \mathbf{x})\, f(\mathbf{x}, \mathbf{y}) - E_{\tilde{P}}(f)$$
BFGS algorithm:
Input: feature functions $f_{1}, f_{2}, \ldots, f_{n}$; empirical distribution $\tilde{P}(X, Y)$.
Output: optimal parameter value $\hat{\mathbf{w}}$; optimal model $P_{\hat{\mathbf{w}}}(\mathbf{y} \mid \mathbf{x})$.
(1) Choose an initial point $\mathbf{w}^{(0)}$, take $B_{0}$ to be a positive-definite symmetric matrix, and set $k = 0$.
(2) Compute $g_{k} = g(\mathbf{w}^{(k)})$. If $g_{k} = 0$, stop; otherwise go to (3).
(3) Solve $B_{k} p_{k} = -g_{k}$ for $p_{k}$.
(4) Line search: find $\lambda_{k}$ such that
$$f(\mathbf{w}^{(k)} + \lambda_{k} p_{k}) = \min_{\lambda \geq 0} f(\mathbf{w}^{(k)} + \lambda p_{k})$$
(5) Set $\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \lambda_{k} p_{k}$.
(6) Compute $g_{k+1} = g(\mathbf{w}^{(k+1)})$. If $g_{k+1} = 0$, stop; otherwise compute
$$B_{k+1} = B_{k} - \frac{B_{k} \delta_{k} \delta_{k}^{\top} B_{k}}{\delta_{k}^{\top} B_{k} \delta_{k}} + \frac{y_{k} y_{k}^{\top}}{y_{k}^{\top} \delta_{k}}$$
where $y_{k} = g_{k+1} - g_{k}$ and $\delta_{k} = \mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}$.
(7) Set $k = k + 1$ and go to (3).
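In practice the BFGS update is rarely hand-coded; the sketch below feeds a brute-force version of the objective $f(\mathbf{w})$ and gradient $g(\mathbf{w})$, on the same toy featurization as the earlier gradient sketch, into SciPy's L-BFGS implementation (all of this is illustrative, not the lecture's code):

```python
import itertools
import numpy as np
from scipy.optimize import minimize

N_TAGS, Y_STAR = 3, (0, 1, 1, 2)           # toy setup from the earlier sketch

def feats(y):
    f = np.zeros(N_TAGS * N_TAGS)
    for prev, cur in zip(y, y[1:]):
        f[prev * N_TAGS + cur] += 1
    return f

ALL_F = [feats(y) for y in itertools.product(range(N_TAGS), repeat=len(Y_STAR))]

def f_obj(w):
    """Negative log-likelihood: log Z(x) - w . F(y*, x)."""
    return np.logaddexp.reduce([w @ f for f in ALL_F]) - w @ feats(Y_STAR)

def g_obj(w):
    """Gradient: expected features minus observed features."""
    scores = np.array([w @ f for f in ALL_F])
    p = np.exp(scores - np.logaddexp.reduce(scores))
    return sum(pi * fi for pi, fi in zip(p, ALL_F)) - feats(Y_STAR)

w_hat = minimize(f_obj, np.zeros(N_TAGS * N_TAGS), jac=g_obj, method="L-BFGS-B").x
```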
4. Information Retrieval
4.1 Introduction
- Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
 
4.2 Term-document incidence matrices
(Figure: 0/1 term-document incidence matrix, with terms such as Brutus, Caesar, and Calpurnia as rows and documents as columns.)
4.3 Incidence vectors
So we have a 0/1 vector for each term.
To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them:
110100 AND 110111 AND 101111 = 100100
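A minimal sketch of this query using Python integers as bit vectors (the vectors are the ones from the example):

```python
brutus    = 0b110100           # incidence vectors over 6 documents
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111           # all 6 document positions

# Brutus AND Caesar AND NOT Calpurnia
result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))   # -> 100100
```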
4.4 Problem: Can’t build the matrix
- Consider $N = 1$ million documents, each with about 1000 words.
- Avg 6 bytes/word including spaces/punctuation, so about 6 GB of data in the documents.
- Say there are $M = 500\text{K}$ distinct terms among these.
- The $500\text{K} \times 1\text{M}$ term-document matrix has half a trillion 0's and 1's ($5 \times 10^{5} \cdot 10^{6} = 5 \times 10^{11}$ entries).
- But it has no more than one billion 1's (at most 1000 words per document $\times$ 1M documents), so the matrix is extremely sparse.
- What's a better representation? We only record the 1 positions.
 
4.5 Inverted index
- For each term t, we must store a list of all documents that contain t.
- Identify each doc by a docID, a document serial number.
 
- Can we use fixed-size arrays for this? Not easily: postings lists grow as new documents are added, so we need variable-size postings (see the sketch below).
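A minimal sketch of variable-size postings in Python, using a dict from term to a growing, sorted docID list (the two-document corpus is invented for illustration):

```python
# Inverted index as variable-size postings: term -> sorted list of docIDs
index = {}

def add_document(doc_id, text):
    """Index one document (toy whitespace tokenization)."""
    for term in set(text.lower().split()):
        index.setdefault(term, []).append(doc_id)  # docIDs arrive in increasing order

add_document(1, "I did enact Julius Caesar")
add_document(2, "so let it be with Caesar")
print(index["caesar"])  # -> [1, 2]
```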
 
4.6 Initial stages of text processing
- Tokenization
  - Cut character sequence into word tokens
  - Deal with "John's", a state-of-the-art solution
- Normalization
  - Map text and query terms to the same form
  - You want U.S.A. and USA to match
- Stemming
  - We may wish different forms of a root to match
  - authorize, authorization
- Stop words
  - We may omit very common words (or not)
  - the, a, to, of
4.6.1 Indexer steps: Token sequence
- Sequence of (Modified token, Document ID) pairs.
 
4.6.2 Indexer steps: Sort
- Sort by terms
- And then docID
 - Core indexing step
 
 
4.6.3 Indexer steps: Dictionary & Postings
- Multiple term entries in a single document are merged.
 - Split into Dictionary and Postings
 - Doc. frequency information is added.
 
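A minimal sketch of these three indexer steps on the same toy corpus: generate (token, docID) pairs, sort them, then merge duplicates into a dictionary (with document frequencies) and postings:

```python
docs = {1: "I did enact Julius Caesar", 2: "so let it be with Caesar"}

# Step 1: sequence of (modified token, docID) pairs
pairs = [(tok, d) for d, text in docs.items() for tok in text.lower().split()]

# Step 2: sort by term, then docID (the core indexing step)
pairs.sort()

# Step 3: merge per-document duplicates; split into dictionary and postings
dictionary, postings = {}, {}
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:   # multiple entries in one doc are merged
        plist.append(doc_id)
    dictionary[term] = len(plist)          # document frequency

print(dictionary["caesar"], postings["caesar"])  # -> 2 [1, 2]
```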
4.6.4 Where do we pay in storage?
- In the dictionary (terms and counts) and in the postings (pointers to docIDs).
4.7 Query processing with an inverted index
4.7.1 Query processing: AND
- Consider processing the query: Brutus AND Caesar
  - Locate Brutus in the Dictionary; retrieve its postings.
  - Locate Caesar in the Dictionary; retrieve its postings.
  - "Merge" the two postings (intersect the document sets).
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries.
If the list lengths are $x$ and $y$, the merge takes $O(x + y)$ operations.
Crucially, this requires the postings to be sorted by docID.
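A sketch of the standard two-pointer merge (postings assumed sorted by docID; the example lists follow the classic Brutus/Caesar postings):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

# Brutus AND Caesar
print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [1, 2, 4, 5, 6, 16, 57, 132]))
# -> [1, 2, 4]
```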
4.8 The Boolean Retrieval Model & Extended Boolean Models
Exercise: Adapt the merge for the queries:
- Brutus AND NOT Caesar
- Brutus OR NOT Caesar
Can we still run through the merge in time $O(x + y)$? What can we achieve? What about an arbitrary Boolean formula?
- (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
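For the first exercise query, a sketch of the adapted merge: one pass keeping docIDs that are in the first list but not the second, still linear time:

```python
def and_not(p1, p2):
    """p1 AND NOT p2 over sorted postings, in O(len(p1) + len(p2)) time."""
    answer, i, j = [], 0, 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])   # docID only in p1: keep it
            i += 1
        elif p1[i] == p2[j]:
            i += 1                 # docID in both: exclude it
            j += 1
        else:
            j += 1                 # p2[j] < p1[i]: advance p2
    return answer

print(and_not([1, 2, 4, 11, 31], [2, 31, 54]))  # -> [1, 4, 11]
```

Brutus OR NOT Caesar is the harder case: its result can contain almost every document, so the "merge" degenerates to scanning nearly the whole docID space.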
Can we always merge in “linear” time?
- Linear in N (where N is the total postings size)
 
- Can we do better?
 
4.9 Query optimization
4.9.1 Merge order
Query: Brutus AND Calpurnia AND Caesar
Start by merging the two terms with the smallest document frequencies.
Process terms in order of increasing frequency:
- start with the smallest set, then keep cutting further
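A sketch of this heuristic, reusing the `intersect` function from the merge sketch above (`index` maps terms to postings and `df` to document frequencies; both are hypothetical inputs):

```python
def process_and_query(terms, index, df):
    """Process an AND query cheapest-first, in increasing document frequency."""
    terms = sorted(terms, key=lambda t: df[t])
    result = index[terms[0]]                  # start with the smallest set
    for t in terms[1:]:
        result = intersect(result, index[t])  # keep cutting further
        if not result:                        # early exit once the result is empty
            break
    return result
```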
4.9.2 More general optimization
- e.g., (madding OR crowd) AND (ignoble OR strife)
- Get doc. freq.'s for all terms.
- Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
- Process in increasing order of OR sizes.
 
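A sketch of the conservative OR-size estimate used to order the work (document frequencies are hypothetical):

```python
df = {"madding": 10, "crowd": 500, "ignoble": 40, "strife": 200}  # hypothetical doc. freq.'s

def estimate_or_size(or_terms):
    """Upper bound on an OR's result size: the sum of its terms' doc. frequencies."""
    return sum(df[t] for t in or_terms)

# (madding OR crowd) AND (ignoble OR strife): process the smaller OR first
query = [["madding", "crowd"], ["ignoble", "strife"]]
query.sort(key=estimate_or_size)
print(query)  # -> [['ignoble', 'strife'], ['madding', 'crowd']]
```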