IR
1. HMM Exercise
- Transition probabilities $P(y_i \mid y_{i-1})$
- Emission probabilities $P(x_i \mid y_i)$
- For $i = 1, \dots, n$, the joint probability factorizes as $P(\mathbf{y}, \mathbf{x}) = P(y_1)\,P(x_1 \mid y_1)\prod_{i=2}^{n} P(y_i \mid y_{i-1})\,P(x_i \mid y_i)$
Analysis: because the HMM is locally normalized, it tends to transition into states that have few outgoing transitions.
From the plot we can see that the HMM's states rarely switch.
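A toy illustration of this effect (the numbers are mine, not from the exercise): suppose state $A$ has only 2 possible successor states while state $B$ has 5. Under local normalization, even completely uninformative transitions out of $A$ each receive probability $\tfrac{1}{2}$, while those out of $B$ each receive $\tfrac{1}{5}$. Any path routed through $A$ therefore accumulates a larger product of per-step probabilities than a comparable path through $B$, regardless of the observations; this is the label bias that a globally normalized model avoids.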
2. CRF vs. HMM
- Taking the log of the HMM joint probability, we easily get the following equation:

$$\log P(\mathbf{y}, \mathbf{x}) = \log P(y_1) + \sum_{i=2}^{n} \log P(y_i \mid y_{i-1}) + \sum_{i=1}^{n} \log P(x_i \mid y_i)$$

Why is this useful? Example:
$\mathbf{x}$: *the dog ate the homework*
- Similarly, we can rewrite the first, second, and third terms of the expression as potentials: each log-probability becomes a score indexed by the tag pair or tag-word pair, e.g. $\log P(y_i \mid y_{i-1}) = \phi_t(y_{i-1}, y_i)$ and $\log P(x_i \mid y_i) = \phi_e(y_i, x_i)$.
- So we can transform the original expression into:

$$\log P(\mathbf{y}, \mathbf{x}) = \sum_{i=1}^{n} \phi_t(y_{i-1}, y_i) + \sum_{i=1}^{n} \phi_e(y_i, x_i)$$

- For a CRF, $\phi_t$ and $\phi_e$ are learnable parameters: the emission table has shape $|\text{words}| \times |\text{tags}|$ and the transition table has shape $(|\text{tags}| + 2) \times (|\text{tags}| + 2)$ (the $+2$ covers the start and end tags). At training time, we maximize the posterior probability:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\big(\sum_i \phi_t(y_{i-1}, y_i) + \sum_i \phi_e(y_i, x_i)\big)}{\sum_{\mathbf{y}'} \exp\big(\sum_i \phi_t(y'_{i-1}, y'_i) + \sum_i \phi_e(y'_i, x_i)\big)}$$

- Taking the log:

$$\log P(\mathbf{y} \mid \mathbf{x}) = \sum_i \phi_t(y_{i-1}, y_i) + \sum_i \phi_e(y_i, x_i) - \log Z(\mathbf{x})$$

- By gradient ascent, the gradient is the observed feature counts minus the model's expected feature counts:

$$\frac{\partial \log P(\mathbf{y}^{*} \mid \mathbf{x})}{\partial \phi} = f(\mathbf{y}^{*}, \mathbf{x}) - \mathbb{E}_{\mathbf{y} \sim P(\mathbf{y} \mid \mathbf{x})}\big[f(\mathbf{y}, \mathbf{x})\big]$$
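A minimal sketch of this training update (the toy tag set, sentence, and names are illustrative, not from the notes). For a tiny tag set we can enumerate every tag sequence, so $Z(\mathbf{x})$ and the feature expectations are exact:

```python
# Brute-force CRF training by gradient ascent on a toy example.
import itertools
import math

TAGS = ["N", "V"]
words = ["the", "dog", "ate"]
gold = ["N", "N", "V"]

# Learnable potentials: transition phi_t[prev][cur], emission phi_e[tag][word].
phi_t = {p: {c: 0.0 for c in TAGS} for p in ["<s>"] + TAGS}
phi_e = {t: {w: 0.0 for w in words} for t in TAGS}

def score(y):
    s = phi_t["<s>"][y[0]] + sum(phi_t[y[i - 1]][y[i]] for i in range(1, len(y)))
    return s + sum(phi_e[y[i]][words[i]] for i in range(len(y)))

def log_Z():
    return math.log(sum(math.exp(score(y))
                        for y in itertools.product(TAGS, repeat=len(words))))

# Gradient ascent on log P(gold | x) = score(gold) - log Z(x); the gradient is
# observed minus expected feature counts (transitions held fixed for brevity).
lr = 0.5
for _ in range(50):
    logZ = log_Z()
    expected = {t: {w: 0.0 for w in words} for t in TAGS}
    for y in itertools.product(TAGS, repeat=len(words)):
        p = math.exp(score(y) - logZ)
        for i, t in enumerate(y):
            expected[t][words[i]] += p
    for i, t in enumerate(gold):
        phi_e[t][words[i]] += lr * 1.0            # observed count
    for t in TAGS:
        for w in words:
            phi_e[t][w] -= lr * expected[t][w]    # minus expected count

print("log P(gold | x) =", score(gold) - log_Z())  # climbs toward 0
```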
3. CRF
- Assume a linear-chain CRF with transition features $t_k$ defined on the edges and state features $s_l$ defined on the nodes.
- That is, the model is expressed through conditional scores on edges (transitions) and on states.
- The posterior probability can then be computed as:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i, k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i, l} \mu_l s_l(y_i, x, i)\Big)$$

- Assume there are $K_1$ transition features and $K_2$ state features, $K = K_1 + K_2$, and define:

$$f_k(y_{i-1}, y_i, x, i) = \begin{cases} t_k(y_{i-1}, y_i, x, i), & k = 1, 2, \dots, K_1 \\ s_l(y_i, x, i), & k = K_1 + l;\; l = 1, 2, \dots, K_2 \end{cases}$$

- Then sum the transition and state features over all positions $i$:

$$f_k(y, x) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x, i), \quad k = 1, 2, \dots, K$$

- Let $w_k$ denote the weight of feature $f_k$:

$$w_k = \begin{cases} \lambda_k, & k = 1, 2, \dots, K_1 \\ \mu_l, & k = K_1 + l;\; l = 1, 2, \dots, K_2 \end{cases}$$

- The conditional random field can then be expressed as:

$$P(y \mid x) = \frac{1}{Z(x)} \exp \sum_{k=1}^{K} w_k f_k(y, x), \qquad Z(x) = \sum_{y} \exp \sum_{k=1}^{K} w_k f_k(y, x)$$

- Let $w$ be the weight vector $w = (w_1, w_2, \dots, w_K)^{T}$ and let $F(y, x)$ denote the global feature vector:

$$F(y, x) = (f_1(y, x), f_2(y, x), \dots, f_K(y, x))^{T}$$

- Then the conditional random field can be written as the inner product of $w$ and $F(y, x)$:

$$P_w(y \mid x) = \frac{\exp(w \cdot F(y, x))}{Z_w(x)}$$

- where $Z_w(x) = \sum_{y} \exp(w \cdot F(y, x))$.
3.1 Inference:
- If y consists of 5 variables with 30 values each, how expensive are these computations? Naive enumeration considers $30^5$ sequences, so we need to
- constrain the form of our CRFs to make inference tractable.
- Because the emission potential should not depend on $\mathbf{x}$ arbitrarily, we give it position information: $\phi_e(y_i, i, \mathbf{x})$.
- Notation: omit $\mathbf{x}$ from the factor graph entirely (it is implicit). Don't include the initial distribution; it can be baked into the other factors.
- Sequential CRFs:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{i=2}^{n} \exp\big(\phi_t(y_{i-1}, y_i)\big) \prod_{i=1}^{n} \exp\big(\phi_e(y_i, i, \mathbf{x})\big)$$
3.2 Inference
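The argmax over tag sequences can be computed exactly with the Viterbi algorithm. A minimal sketch, assuming log-space potential tables (`emit` stands in for learned $\phi_e(y_i, i, \mathbf{x})$ scores, `trans` for $\phi_t$):

```python
import numpy as np

def viterbi(emit, trans):
    """emit: (n, T) log-potentials phi_e(y_i, i, x); trans: (T, T) phi_t."""
    n, T = emit.shape
    delta = np.zeros((n, T))            # best score of a prefix ending in tag t
    back = np.zeros((n, T), dtype=int)  # back-pointers
    delta[0] = emit[0]
    for i in range(1, n):
        # scores[p, c] = delta[i-1, p] + trans[p, c] + emit[i, c]
        scores = delta[i - 1][:, None] + trans + emit[i][None, :]
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0)
    y = [int(delta[-1].argmax())]       # best final tag, then follow pointers
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return y[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 30)), rng.normal(size=(30, 30))))
# Runtime is O(n * T^2): for 5 variables with 30 values, 5 * 900 ops, not 30^5.
```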
3.3 Training
- Logistic regression: $P(y \mid x) = \frac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))}$
- Maximize $\mathcal{L}\left(\mathbf{y}^{*}, \mathbf{x}\right)=\log P\left(\mathbf{y}^{*} \mid \mathbf{x}\right)$
- The gradient is completely analogous to logistic regression: observed features minus expected features,

$$\frac{\partial \mathcal{L}}{\partial w} = F(\mathbf{y}^{*}, \mathbf{x}) - \mathbb{E}_{\mathbf{y} \sim P(\mathbf{y} \mid \mathbf{x})}\big[F(\mathbf{y}, \mathbf{x})\big]$$

- The expectations are computed with the forward-backward algorithm.
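A minimal sketch of forward-backward, under the same log-space potential-table assumptions as the Viterbi sketch above; it returns $\log Z(\mathbf{x})$ and the per-position tag marginals $P(y_i = t \mid \mathbf{x})$ that the gradient's expectations are built from:

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(emit, trans):
    """emit: (n, T) log phi_e; trans: (T, T) log phi_t."""
    n, T = emit.shape
    alpha = np.zeros((n, T))   # alpha[i, t]: log-sum of prefixes ending in t
    beta = np.zeros((n, T))    # beta[i, t]: log-sum of suffixes starting at t
    alpha[0] = emit[0]
    for i in range(1, n):
        alpha[i] = emit[i] + logsumexp(alpha[i - 1][:, None] + trans, axis=0)
    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(trans + (emit[i + 1] + beta[i + 1])[None, :], axis=1)
    log_Z = logsumexp(alpha[-1])
    marginals = np.exp(alpha + beta - log_Z)   # P(y_i = t | x)
    return log_Z, marginals

rng = np.random.default_rng(0)
log_Z, marg = forward_backward(rng.normal(size=(4, 3)), rng.normal(size=(3, 3)))
print(log_Z, marg.sum(axis=1))  # each row of marginals sums to 1
```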
- Quasi-Newton methods:
- Learning objective (over the empirical distribution $\tilde{P}$):

$$\min_{w \in \mathbf{R}^{K}} f(w) = \sum_{x} \tilde{P}(x) \log \sum_{y} \exp\Big(\sum_{k=1}^{K} w_k f_k(x, y)\Big) - \sum_{x, y} \tilde{P}(x, y) \sum_{k=1}^{K} w_k f_k(x, y)$$

- Gradient function:

$$g(w) = \sum_{x, y} \tilde{P}(x)\, P_w(y \mid x)\, f(x, y) - E_{\tilde{P}}[f]$$

BFGS algorithm:
Input: feature functions $f_1, f_2, \dots, f_n$; empirical distribution $\tilde{P}(X, Y)$.
Output: optimal parameter value $\hat{w}$; optimal model $P_{\hat{w}}(y \mid x)$.
(1) Choose an initial point $w^{(0)}$, take $B_0$ to be a positive-definite symmetric matrix, and set $k = 0$.
(2) Compute $g_k = g(w^{(k)})$. If $g_k = 0$, stop; otherwise go to (3).
(3) Solve $B_k p_k = -g_k$ for $p_k$.
(4) Line search: find $\lambda_k$ such that $f(w^{(k)} + \lambda_k p_k) = \min_{\lambda \ge 0} f(w^{(k)} + \lambda p_k)$.
(5) Set $w^{(k+1)} = w^{(k)} + \lambda_k p_k$.
(6) Compute $g_{k+1} = g(w^{(k+1)})$. If $g_{k+1} = 0$, stop; otherwise compute $B_{k+1}$ by

$$B_{k+1} = B_k + \frac{y_k y_k^{T}}{y_k^{T} \delta_k} - \frac{B_k \delta_k \delta_k^{T} B_k}{\delta_k^{T} B_k \delta_k}$$

where $y_k = g_{k+1} - g_k$ and $\delta_k = w^{(k+1)} - w^{(k)}$.
(7) Set $k = k + 1$ and go to (3).
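In practice the quasi-Newton step is usually delegated to a library optimizer rather than hand-rolled; a sketch with SciPy's BFGS (the quadratic objective is a stand-in for the real $f(w)$ and $g(w)$ above):

```python
import numpy as np
from scipy.optimize import minimize

def f(w):   # the CRF negative log-likelihood would go here
    return 0.5 * np.sum((w - 1.0) ** 2)

def g(w):   # its gradient g(w)
    return w - 1.0

res = minimize(f, x0=np.zeros(5), jac=g, method="BFGS")
print(res.x)  # converges to the optimum w = 1
```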
4. Information Retrieval
4.1 Introduction
- Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
4.2 Term-document incidence matrices
(Term-document incidence matrix: rows are terms such as Brutus, Caesar, and Calpurnia; columns are documents; an entry is 1 if the document contains the term.)
4.3 Incidence vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then:
bitwise AND 110100 AND 110111 AND 101111 = 100100
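A sketch of this query using Python integers as incidence bit vectors (the four toy documents are mine; bit $i$ corresponds to document $i$, so the bit order differs from the slide-style strings above):

```python
docs = [
    "brutus caesar",            # doc 0
    "brutus caesar calpurnia",  # doc 1
    "caesar",                   # doc 2
    "brutus caesar cleopatra",  # doc 3
]
ALL = (1 << len(docs)) - 1      # mask with one bit per document

def incidence(term):
    """Bit i is set iff document i contains the term."""
    v = 0
    for i, text in enumerate(docs):
        if term in text.split():
            v |= 1 << i
    return v

# Brutus AND Caesar AND NOT Calpurnia, as bitwise ops on the vectors.
hits = incidence("brutus") & incidence("caesar") & (ALL & ~incidence("calpurnia"))
print([i for i in range(len(docs)) if hits >> i & 1])  # -> [0, 3]
```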
4.4 Problem: Can’t build the matrix
- Consider $N = 1$ million documents, each with about 1000 words.
- Avg 6 bytes/word including spaces/punctuation
- 6GB of data in the documents.
- Say there are $M = 500\text{K}$ distinct terms among these.
- A $500\text{K} \times 1\text{M}$ matrix has half a trillion 0's and 1's.
- But it has no more than one billion 1's ($10^6$ docs $\times$ 1000 words),
- so the matrix is extremely sparse.
- What’s a better representation?
- We only record the 1 positions.
4.5 Inverted index
- For each term t, we must store a list of all documents that contain t
- Identify each doc by a docID , a document serial number
- Can we use fixed-size arrays for this?
4.6 Initial stages of text processing
- Tokenization
- Cut character sequence into word tokens
- Deal with cases like "John's" and "a state-of-the-art solution"
- Normalization
- Map text and query term to same form
- You want U.S.A. and USA to match
- Stemming
- We may wish different forms of a root to match
- e.g., authorize, authorization
- Stop words
- We may omit very common words (or not)
- the, a, to, of
4.6.1 Indexer steps: Token sequence
- Sequence of (Modified token, Document ID) pairs.
4.6.2 Indexer steps: Sort
- Sort by terms
- And then docID
- Core indexing step
4.6.3 Indexer steps: Dictionary & Postings
- Multiple term entries in a single document are merged.
- Split into Dictionary and Postings
- Doc. frequency information is added.
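A sketch of these three indexer steps on a toy two-document collection (the variable names are illustrative):

```python
from collections import defaultdict

docs = {1: "I did enact Julius Caesar", 2: "So let it be with Caesar"}

pairs = [(tok.lower(), doc_id)                 # 4.6.1: (token, docID) pairs
         for doc_id, text in docs.items()
         for tok in text.split()]
pairs.sort()                                   # 4.6.2: sort by term, then docID

postings = defaultdict(list)                   # 4.6.3: dictionary & postings
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)          # merge duplicates within a doc

dictionary = {term: len(plist) for term, plist in postings.items()}  # doc freq
print(dictionary["caesar"], postings["caesar"])  # -> 2 [1, 2]
```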
4.6.4 Where do we pay in storage?
- In the dictionary (the terms) and in the postings (the docID pointers).
4.7 Query processing with an inverted index
4.7.1 Query processing: AND
- Consider processing the query:
- Brutus AND Caesar
- Locate Brutus in the Dictionary;
- Retrieve its postings.
- Locate Caesar in the Dictionary;
- Retrieve its postings.
- “Merge” the two postings (intersect the document sets):
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries
If the list lengths are $x$ and $y$, the merge takes $O(x + y)$ operations. Crucial: the postings must be sorted by docID.
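The merge as code: a minimal sketch of the two-pointer intersection (the postings values are illustrative):

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in O(x + y)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))   # -> [1, 2, 4]
```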
4.8 The Boolean Retrieval Model & Extended Boolean Models
Exercise: Adapt the merge for the queries:
- Brutus AND NOT Caesar
- Brutus OR NOT Caesar
Can we still run through the merge in time $O(x + y)$? What can we achieve? What about an arbitrary Boolean formula?
- (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in “linear” time?
- Linear in $N$ ($N$ is the total postings size)
- Can we do better?
4.9 Query optimization
4.9.1 Merge
Query: Brutus AND Calpurnia AND Caesar
Start by merging the two query terms with the smallest frequencies.
Process in order of increasing freq
- start with smallest set, then keep cutting further
4.9.2 More general optimization
- e.g., (madding OR crowd) AND (ignoble OR strife)
- Get doc. freq.’s for all terms.
- Estimate the size of each OR by the sum of its doc. freq.’s (conservative)
- Process in increasing order of OR sizes.
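A sketch of this heuristic with made-up document frequencies:

```python
# Estimate each OR group's size by the sum of its terms' doc frequencies
# (a conservative upper bound), then process groups smallest-first.
df = {"madding": 10, "crowd": 50, "ignoble": 5, "strife": 20}

query = [["madding", "crowd"], ["ignoble", "strife"]]  # (a OR b) AND (c OR d)
ordered = sorted(query, key=lambda group: sum(df[t] for t in group))
print(ordered)  # -> [['ignoble', 'strife'], ['madding', 'crowd']]
```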