IR

1. HMM Exercise

  • Transition probabilities:
  • Emission probabilities:
  • For the given observation sequence:
  • Why: because the HMM is locally normalized, it transitions more readily into states that have relatively few outgoing transitions (a worked example follows this list).

  • We can see that the HMM rarely jumps to a different state.
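A small worked illustration of the effect above, with made-up numbers (not the values from the original exercise): suppose state $A$ has 2 outgoing transitions and state $B$ has 5, each roughly uniform. Local normalization then forces $P(\cdot \mid A) \approx 0.5$ per successor but $P(\cdot \mid B) \approx 0.2$ per successor, so any path passing through $A$ picks up a factor of about $0.5$ at that step versus about $0.2$ through $B$, regardless of how well the emissions fit. Decoding therefore prefers paths through low-branching states, which is why the states rarely switch.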

2. CRF vs. HMM

  • Starting from the HMM, we can easily get the equation as follows:
    $$P(\mathbf{y} \mid \mathbf{x})=\frac{P(\mathbf{y}, \mathbf{x})}{\sum_{\mathbf{y}^{\prime}} P\left(\mathbf{y}^{\prime}, \mathbf{x}\right)}, \qquad P(\mathbf{y}, \mathbf{x})=P\left(y_{1}\right) \prod_{i=2}^{n} P\left(y_{i} \mid y_{i-1}\right) \prod_{i=1}^{n} P\left(x_{i} \mid y_{i}\right)$$
  • Why can this be rewritten as a CRF-style log-linear model?

  • Example:

  • $\mathbf{x}$ = “the dog ate the homework”
  • Similarly, we can rewrite the first, second, and third terms of the expression, each probability becoming the exponential of its log.
  • Therefore, the original expression can be transformed into a single exponential of a sum of log-probabilities (a worked form follows this list).
  • For a CRF, these (log-)potentials are learnable parameters rather than fixed, locally normalized probabilities.
  • Length of the weight vector: tags $\times$ words ($sL$) for the emission part, plus tags $\times$ tags with start and end ($s^{2}+2s$) for the transition part.
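A sketch of the transformation just described, in generic notation (the symbols $\mathbf{w}$ and $f$ are the usual log-linear notation, not necessarily the lecture's):

$$P(\mathbf{y}, \mathbf{x})=P\left(y_{1}\right) \prod_{i=2}^{n} P\left(y_{i} \mid y_{i-1}\right) \prod_{i=1}^{n} P\left(x_{i} \mid y_{i}\right)=\exp \Big(\log P\left(y_{1}\right)+\sum_{i=2}^{n} \log P\left(y_{i} \mid y_{i-1}\right)+\sum_{i=1}^{n} \log P\left(x_{i} \mid y_{i}\right)\Big)$$

Collecting the log-probabilities into a weight vector $\mathbf{w}$ and the matching indicator counts into a feature vector $f(\mathbf{x}, \mathbf{y})$ gives $P(\mathbf{y}, \mathbf{x})=\exp \big(\mathbf{w}^{\top} f(\mathbf{x}, \mathbf{y})\big)$. A CRF keeps exactly this form but treats $\mathbf{w}$ as free parameters trained discriminatively, which is the sense in which the HMM is a special case.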

  • At training time, we maximize the posterior probability of the gold sequence $\mathbf{y}^{*}$.

  • Taking the log: $\mathcal{L}\left(\mathbf{y}^{*}, \mathbf{x}\right)=\log P\left(\mathbf{y}^{*} \mid \mathbf{x}\right)=\mathbf{w}^{\top} f\left(\mathbf{x}, \mathbf{y}^{*}\right)-\log \sum_{\mathbf{y}} \exp \big(\mathbf{w}^{\top} f(\mathbf{x}, \mathbf{y})\big)$
  • Gradient ascent then uses $\nabla_{\mathbf{w}} \mathcal{L}=f\left(\mathbf{x}, \mathbf{y}^{*}\right)-\mathbb{E}_{\mathbf{y} \sim P(\mathbf{y} \mid \mathbf{x})}[f(\mathbf{x}, \mathbf{y})]$

3. CRF


  • Assume $P(Y \mid X)$ is a linear-chain conditional random field.
  • In effect, the model is expressed through edge (transition) factors and state factors.
  • Then the posterior probability of a tag sequence $y$ given an observation sequence $x$ is
    $$P(y \mid x)=\frac{1}{Z(x)} \exp \Big(\sum_{i, k} \lambda_{k} t_{k}\left(y_{i-1}, y_{i}, x, i\right)+\sum_{i, l} \mu_{l} s_{l}\left(y_{i}, x, i\right)\Big)$$
    where $Z(x)$ normalizes over all tag sequences $y$.
  • Suppose there are $K_{1}$ transition features and $K_{2}$ state features, $K=K_{1}+K_{2}$, and define
    $$f_{k}\left(y_{i-1}, y_{i}, x, i\right)=\begin{cases}t_{k}\left(y_{i-1}, y_{i}, x, i\right), & k=1,2, \ldots, K_{1} \\ s_{l}\left(y_{i}, x, i\right), & k=K_{1}+l, \; l=1,2, \ldots, K_{2}\end{cases}$$
  • Then sum the transition and state features over all positions:
    $$f_{k}(y, x)=\sum_{i=1}^{n} f_{k}\left(y_{i-1}, y_{i}, x, i\right), \quad k=1,2, \ldots, K$$
  • Let $w_{k}$ denote the weight of feature $f_{k}$:
    $$w_{k}=\begin{cases}\lambda_{k}, & k=1,2, \ldots, K_{1} \\ \mu_{l}, & k=K_{1}+l, \; l=1,2, \ldots, K_{2}\end{cases}$$
  • Then the conditional random field can be expressed as
    $$P(y \mid x)=\frac{1}{Z(x)} \exp \sum_{k=1}^{K} w_{k} f_{k}(y, x), \qquad Z(x)=\sum_{y} \exp \sum_{k=1}^{K} w_{k} f_{k}(y, x)$$
  • Let $w=\left(w_{1}, w_{2}, \ldots, w_{K}\right)^{\mathrm{T}}$ be the weight vector and $F(y, x)=\left(f_{1}(y, x), f_{2}(y, x), \ldots, f_{K}(y, x)\right)^{\mathrm{T}}$ the global feature vector.
  • Then the conditional random field can be written as an inner product of $w$ and $F(y, x)$:
    $$P_{w}(y \mid x)=\frac{\exp (w \cdot F(y, x))}{Z_{w}(x)}$$
  • where $Z_{w}(x)=\sum_{y} \exp (w \cdot F(y, x))$. A small numeric sketch of this form follows.
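A minimal numeric sketch of the $w \cdot F(y, x)$ form above, in Python. The tag set, feature template, and weights are toy choices of mine, and $Z_w(x)$ is computed by brute-force enumeration, which is only feasible for very short sequences:

```python
import itertools
import math

TAGS = ["N", "V"]  # toy tag set

def features(y, x):
    """Global feature vector F(y, x): counts of (prev_tag, tag) and (tag, word) events."""
    counts = {}
    prev = "<s>"
    for tag, word in zip(y, x):
        counts[("trans", prev, tag)] = counts.get(("trans", prev, tag), 0) + 1
        counts[("emit", tag, word)] = counts.get(("emit", tag, word), 0) + 1
        prev = tag
    return counts

def score(w, y, x):
    """The inner product w . F(y, x), as a sparse dot product."""
    return sum(w.get(k, 0.0) * v for k, v in features(y, x).items())

def posterior(w, y, x):
    """P_w(y | x) = exp(w . F(y, x)) / Z_w(x), with Z_w(x) enumerated over all tag sequences."""
    Z = sum(math.exp(score(w, yp, x)) for yp in itertools.product(TAGS, repeat=len(x)))
    return math.exp(score(w, y, x)) / Z

w = {("emit", "N", "dog"): 1.5, ("emit", "V", "ate"): 1.2, ("trans", "N", "V"): 0.8}
x = ["the", "dog", "ate"]
print(posterior(w, ("N", "N", "V"), x))
```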

3.1 Inference:

  • If $\mathbf{y}$ consists of 5 variables with 30 values each, how expensive are these computations?
  • We need to constrain the form of our CRFs to make inference tractable.


  • Because it is best not to let features depend on arbitrary parts of the output, we add position information to them.

  • Notation: omit $\mathbf{x}$ from the factor graph entirely (it is implicit)

  • Don’t include the initial distribution; it can be baked into other factors
  • Sequential CRFs:
    $$P(\mathbf{y} \mid \mathbf{x})=\frac{1}{Z(\mathbf{x})} \prod_{i=2}^{n} \exp \big(\phi_{t}\left(y_{i-1}, y_{i}\right)\big) \prod_{i=1}^{n} \exp \big(\phi_{e}\left(y_{i}, i, \mathbf{x}\right)\big)$$

3.2 Inference

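A minimal Viterbi decoding sketch for the sequential CRF above, assuming the log-space potentials $\phi_t(y_{i-1}, y_i)$ and $\phi_e(y_i, i, \mathbf{x})$ are already given as score arrays (the concrete numbers are illustrative):

```python
import numpy as np

def viterbi(phi_e, phi_t):
    """Most likely tag sequence under a sequential CRF.

    phi_e: (n, S) emission log-potentials per position and tag.
    phi_t: (S, S) transition log-potentials phi_t[prev, cur].
    Returns the argmax tag sequence as a list of tag indices.
    """
    n, S = phi_e.shape
    best = np.zeros((n, S))             # best[i, s]: best score of any prefix ending in tag s
    back = np.zeros((n, S), dtype=int)  # backpointers
    best[0] = phi_e[0]
    for i in range(1, n):
        scores = best[i - 1][:, None] + phi_t + phi_e[i][None, :]  # indexed [prev, cur]
        back[i] = scores.argmax(axis=0)
        best[i] = scores.max(axis=0)
    # Follow backpointers from the best final tag.
    y = [int(best[-1].argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return y[::-1]

# Toy example: 3 positions, 2 tags, illustrative scores.
phi_e = np.array([[2.0, 0.1], [0.3, 1.5], [1.0, 1.2]])
phi_t = np.array([[0.5, -0.2], [-0.4, 0.6]])
print(viterbi(phi_e, phi_t))
```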

3.3 Training

  • Logistic regression: $P(y \mid \mathbf{x}) \propto \exp \big(\mathbf{w}^{\top} f(\mathbf{x}, y)\big)$
  • Maximize $\mathcal{L}\left(\mathbf{y}^{*}, \mathbf{x}\right)=\log P\left(\mathbf{y}^{*} \mid \mathbf{x}\right)$
  • Gradient is completely analogous to logistic regression: $\nabla_{\mathbf{w}} \mathcal{L}=f\left(\mathbf{x}, \mathbf{y}^{*}\right)-\mathbb{E}_{\mathbf{y} \sim P(\mathbf{y} \mid \mathbf{x})}[f(\mathbf{x}, \mathbf{y})]$
  • The expectations are computed with the forward–backward algorithm
  • Quasi-Newton methods can also be used.
  • Learning objective:
    $$\min _{w \in \mathbf{R}^{n}} f(w)=\sum_{x} \tilde{P}(x) \log \sum_{y} \exp \Big(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\Big)-\sum_{x, y} \tilde{P}(x, y) \sum_{i=1}^{n} w_{i} f_{i}(x, y)$$
  • Gradient:
    $$g(w)=\sum_{x, y} \tilde{P}(x) P_{w}(y \mid x) f(x, y)-E_{\tilde{P}}(f)$$
  • BFGS algorithm (a training sketch in code follows these steps):

    Input: feature functions $f_{1}, f_{2}, \ldots, f_{n}$; empirical distribution $\tilde{P}(X, Y)$

    Output: optimal parameter value $\hat{w}$; optimal model $P_{\hat{w}}(y \mid x)$

    (1) Choose an initial point $w^{(0)}$, take $B_{0}$ to be a positive definite symmetric matrix, and set $k=0$.

    (2) Compute $g_{k}=g\left(w^{(k)}\right)$. If $g_{k}=0$, stop; otherwise go to (3).

    (3) Solve $B_{k} p_{k}=-g_{k}$ for $p_{k}$.

    (4) Line search: find $\lambda_{k}$ such that $f\left(w^{(k)}+\lambda_{k} p_{k}\right)=\min _{\lambda \geq 0} f\left(w^{(k)}+\lambda p_{k}\right)$.

    (5) Set $w^{(k+1)}=w^{(k)}+\lambda_{k} p_{k}$.

    (6) Compute $g_{k+1}=g\left(w^{(k+1)}\right)$. If $g_{k+1}=0$, stop; otherwise compute $B_{k+1}$ by
    $$B_{k+1}=B_{k}+\frac{y_{k} y_{k}^{\mathrm{T}}}{y_{k}^{\mathrm{T}} \delta_{k}}-\frac{B_{k} \delta_{k} \delta_{k}^{\mathrm{T}} B_{k}}{\delta_{k}^{\mathrm{T}} B_{k} \delta_{k}}$$
    where $y_{k}=g_{k+1}-g_{k}$ and $\delta_{k}=w^{(k+1)}-w^{(k)}$.

    (7) Set $k=k+1$ and go to (3).

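A minimal training sketch consistent with the objective and gradient above. To keep it short, $Z(x)$ and the feature expectations are computed by brute-force enumeration rather than forward–backward, and the optimizer is SciPy's built-in BFGS; the tiny corpus, tag set, and feature template are all made up for illustration:

```python
import itertools
import math
import numpy as np
from scipy.optimize import minimize

# Toy setup: two tags and a two-sentence labeled corpus (illustrative only).
TAGS = ["N", "V"]
DATA = [(["the", "dog", "ate"], ["N", "N", "V"]),
        (["dogs", "ate"],       ["N", "V"])]

# One weight per (prev_tag, tag) transition feature and per (tag, word) emission feature.
WORDS = sorted({w for x, _ in DATA for w in x})
FEATS = [("trans", p, t) for p in ["<s>"] + TAGS for t in TAGS] + \
        [("emit", t, w) for t in TAGS for w in WORDS]
IDX = {f: i for i, f in enumerate(FEATS)}

def feat_vec(x, y):
    """Global feature vector F(y, x) as counts."""
    v = np.zeros(len(FEATS))
    prev = "<s>"
    for tag, word in zip(y, x):
        v[IDX[("trans", prev, tag)]] += 1
        v[IDX[("emit", tag, word)]] += 1
        prev = tag
    return v

def neg_log_likelihood_and_grad(w):
    """f(w) = -sum_x log P_w(y* | x) and its gradient, via brute-force enumeration."""
    nll, grad = 0.0, np.zeros(len(FEATS))
    for x, y_star in DATA:
        fs = [feat_vec(x, y) for y in itertools.product(TAGS, repeat=len(x))]
        scores = np.array([f @ w for f in fs])
        logZ = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
        probs = np.exp(scores - logZ)
        expected = sum(p * f for p, f in zip(probs, fs))  # E_{P_w(y|x)}[F(y, x)]
        gold = feat_vec(x, y_star)
        nll += logZ - gold @ w
        grad += expected - gold          # gradient of the negative log-likelihood
    return nll, grad

res = minimize(neg_log_likelihood_and_grad, np.zeros(len(FEATS)),
               jac=True, method="BFGS")
print("converged:", res.success, "final objective:", res.fun)
```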
4. Information Retrieval

4.1 Introduction

  • Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)

4.2 Term-document incidence matrices

(figure: term-document incidence matrix; each entry is 1 if the term occurs in the document, 0 otherwise)

4.3 Incidence vectors

  • So we have a 0/1 vector for each term.

  • To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them (see the sketch after this list)

  • 110100 AND 110111 AND 101111 = 100100
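A tiny sketch of that incidence-vector query, using Python integers as bit vectors (leftmost bit = document 1; the six-document collection is implied by the bit strings above, and nothing else about it is assumed):

```python
# Incidence vectors as Python ints; bit positions correspond to documents 1..6.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

ALL_DOCS = 0b111111  # mask for the 6 documents in this toy collection

# Brutus AND Caesar AND NOT Calpurnia
result = brutus & caesar & (~calpurnia & ALL_DOCS)
print(format(result, "06b"))  # -> 100100: documents 1 and 4 match
```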

4.4 Problem: Can’t build the matrix

  • Consider $N = 1$ million documents, each with about 1000 words.
  • Avg 6 bytes/word including spaces/punctuation
    • 6GB of data in the documents.
  • Say there are $M = 500\mathrm{K}$ distinct terms among these.
  • A $500\mathrm{K} \times 1\mathrm{M}$ matrix has half a trillion 0’s and 1’s.
  • But it has no more than one billion 1’s (one per token)
    • matrix is extremely sparse
  • What’s a better representation?
    • We only record the 1 positions.
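Checking the arithmetic behind these bullets:

$$10^{6} \text{ docs} \times 1000 \text{ words} \times 6 \text{ bytes} = 6 \times 10^{9} \text{ bytes} = 6\,\mathrm{GB}$$

$$5 \times 10^{5} \text{ terms} \times 10^{6} \text{ docs} = 5 \times 10^{11} \text{ cells, of which at most } 10^{6} \times 1000 = 10^{9} \text{ are } 1$$

So at most $10^{9} / (5 \times 10^{11}) = 0.2\%$ of the cells are 1, which is why recording only the 1 positions is so much cheaper.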

4.5 Inverted index

  • For each term $t$, we must store a list of all documents that contain $t$
    • Identify each doc by a docID, a document serial number
  • Can we use fixed-size arrays for this? Postings lists vary in length, so we need variable-size lists (a sketch of the representation follows).
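A minimal sketch of that representation: a dictionary mapping each term to a variable-length, sorted postings list of docIDs (the three documents are made up):

```python
# Dictionary maps each term to a variable-length, sorted list of docIDs (its postings).
docs = {1: "new home sales top forecasts",      # toy documents for illustration
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = {}
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        index.setdefault(term, []).append(doc_id)   # docIDs arrive in sorted order

print(index["home"])   # -> [1, 2, 3]
print(index["july"])   # -> [2, 3]
```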

4.6 Initial stages of text processing

  • Tokenization
    • Cut character sequence into word tokens
      • Deal with cases like “John’s” and “a state-of-the-art solution”
  • Normalization
    • Map text and query term to same form
      • You want U.S.A. and USA to match
  • Stemming
    • We may wish different forms of a root to match
      • authorize, authorization
  • Stop words
    • We may omit very common words (or not)
      • the, a, to, of
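A toy version of these stages. The tokenizer, normalizer, and suffix-stripping “stemmer” below are deliberately naive stand-ins of my own (a real system would use something like the Porter stemmer), but they show how the stages compose:

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}

def tokenize(text):
    """Cut the character sequence into word tokens (naively: runs of letters, apostrophes, dots)."""
    return re.findall(r"[A-Za-z'.]+", text)

def normalize(token):
    """Map text and query terms to the same form, e.g. U.S.A. -> usa."""
    return token.lower().replace(".", "").replace("'", "")

def stem(token):
    """Very naive suffix stripping, so different forms of a root can match."""
    for suffix in ("ization", "ations", "ation", "ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(normalize(t)) for t in tokenize(text) if normalize(t) not in STOP_WORDS]

print(preprocess("John's authorization of the U.S.A. policy"))
# -> ['john', 'author', 'usa', 'policy']
```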

4.6.1 Indexer steps: Token sequence

  • Sequence of (Modified token, Document ID) pairs.

4.6.2 Indexer steps: Sort

  • Sort by terms
    • And then docID
    • Core indexing step

4.6.3 Indexer steps: Dictionary & Postings

  • Multiple term entries in a single document are merged.
  • Split into Dictionary and Postings
  • Doc. frequency information is added.
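A compact sketch of these three indexer steps on a made-up two-document corpus: generate the (modified token, docID) pairs, sort them, then merge duplicates into a dictionary with document frequencies and postings:

```python
# Toy corpus; docIDs are serial numbers.
docs = {1: "I did enact Julius Caesar",
        2: "So let it be with Caesar"}

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(tok.lower(), doc_id) for doc_id, text in docs.items() for tok in text.split()]

# Step 2: sort by term, then by docID (the core indexing step).
pairs.sort()

# Step 3: merge multiple entries per (term, doc), split into dictionary and postings,
# and record the document frequency of each term.
dictionary = {}          # term -> (doc_freq, postings list)
for term, doc_id in pairs:
    freq, postings = dictionary.get(term, (0, []))
    if not postings or postings[-1] != doc_id:   # merge duplicates within a document
        postings.append(doc_id)
        freq += 1
    dictionary[term] = (freq, postings)

print(dictionary["caesar"])   # -> (2, [1, 2])
```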

4.6.4 Where do we pay in storage?

  • Terms (the dictionary) and pointers (the postings)

4.7 Query processing with an inverted index

4.7.1 Query processing: AND

  • Consider processing the query:
    • Brutus AND Caesar
      • Locate Brutus in the Dictionary;
        • Retrieve its postings.
      • Locate Caesar in the Dictionary;
        • Retrieve its postings.
      • “Merge” the two postings (intersect the document sets):

4.7.2 The merge

  • Walk through the two postings simultaneously, in time linear in the total number of postings entries

  • If the list lengths are $x$ and $y$, the merge takes $O(x+y)$ operations (see the sketch after this list).
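A sketch of that two-pointer merge: walk both sorted postings lists once, so the cost is linear in the total number of entries (the docIDs are illustrative):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))   # -> [1, 2, 4]
```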

4.8 The Boolean Retrieval Model & Extended Boolean Models

  • Exercise: Adapt the merge for the queries below (one possible solution is sketched after this list):

    • Brutus AND NOT Caesar
    • Brutus OR NOT Caesar
  • Can we still run through the merge in time $O(x+y)$? What can we achieve?

  • What about an arbitrary Boolean formula?

    • (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
  • Can we always merge in “linear” time?

    • Linear in $N$ ($N$ is the total postings size)
  • Can we do better?
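One possible answer to the first part of the exercise (my sketch, not an official solution): Brutus AND NOT Caesar can still be merged in a single $O(x+y)$ pass over the two sorted postings lists.

```python
def and_not(p1, p2):
    """Docs in postings list p1 but not in p2, in one linear pass (both lists sorted)."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j >= len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])      # nothing left in p2 can match p1[i]
            i += 1
        elif p1[i] == p2[j]:
            i += 1                    # excluded by the NOT term
            j += 1
        else:                         # p1[i] > p2[j]
            j += 1
    return answer

print(and_not([1, 2, 4, 11, 31], [2, 31, 54]))   # -> [1, 4, 11]
```

Brutus OR NOT Caesar is different: its result can contain almost every document in the collection, so a postings-only merge is no longer linear in $x+y$ but in the collection size $N$.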

4.9 Query optimization

4.9.1 Merge

  • Query: Brutus AND Calpurnia AND Caesar

  • Start the merge with the two least frequent terms (see the sketch after this list)

  • Process in order of increasing freq

    • start with smallest set, then keep cutting further
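A sketch of the frequency-ordered merge for Brutus AND Calpurnia AND Caesar: intersect starting from the rarest term, so the intermediate result only shrinks (the postings and their lengths are illustrative):

```python
def intersect_many(postings_by_term):
    """AND together several postings lists, processing terms in order of increasing frequency."""
    terms = sorted(postings_by_term, key=lambda t: len(postings_by_term[t]))
    result = list(postings_by_term[terms[0]])       # start with the smallest set
    for term in terms[1:]:
        other = set(postings_by_term[term])
        result = [d for d in result if d in other]  # keep cutting the result further
        if not result:                              # the intersection can only shrink
            break
    return result

postings = {"Brutus":    [1, 2, 4, 11, 31, 45, 173],
            "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
            "Calpurnia": [2, 31, 54, 101]}
print(intersect_many(postings))   # -> [2]
```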

4.9.2 More general optimization

  • e.g., (madding OR crowd) AND (ignoble OR strife) (a sketch follows this list)
  • Get doc. freq.’s for all terms.
  • Estimate the size of each OR by the sum of its doc. freq.’s (conservative)
  • Process in increasing order of OR sizes.
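A small sketch of that estimation step for (madding OR crowd) AND (ignoble OR strife): each OR's size is bounded by the sum of its terms' document frequencies, and the AND is then processed in increasing order of these estimates (the frequencies are made up):

```python
# Made-up document frequencies for the four terms.
doc_freq = {"madding": 10, "crowd": 500, "ignoble": 3, "strife": 80}

query = [["madding", "crowd"], ["ignoble", "strife"]]   # ANDed groups of ORed terms

# Estimate each OR's result size by the sum of its doc. frequencies (an upper bound),
# then process the AND in increasing order of these estimates.
estimates = {tuple(group): sum(doc_freq[t] for t in group) for group in query}
order = sorted(query, key=lambda g: estimates[tuple(g)])
print(order)   # -> [['ignoble', 'strife'], ['madding', 'crowd']]
```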
Author: Smurf
Link: http://example.com/2021/08/15/nlp%20learning/Chapter8_IR/
License: This work is licensed under CC BY-NC-SA 3.0 CN