Phrase queries and positional indexes


1. Phrase Queries

  • How can an information retrieval system support phrase queries?

  • For a phrase query like “southeast university”, the sentence “I went to university at southeast of China” is not a match.

    • The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works
  • Many more queries are implicit phrase queries:
    • natural language processing
  • For this, it no longer suffices to store only
    • <term : docs> entries
    • A plain (non-positional) inverted index cannot support phrase queries

1.1 Solution 1: Biword indexes

  • Index every consecutive pair of terms in the text as a phrase

    • store every consecutive pair of terms as its own index entry
  • For example, the text “Friends, Romans, Countrymen” would generate the biwords

    • friends romans
    • romans countrymen
  • Each of these biwords is now a dictionary term

  • Two word phrase query processing is now immediate.

    • this handles two-word phrase combinations directly
  • For longer phrases, split the query into consecutive biwords and take the conjunction (AND) of the biword queries (see the sketch below)

    • stanford university palo alto
    • stanford university AND university palo AND palo alto
  • Except in rare cases this works: it can produce false positives (documents that contain all the biwords but not the full phrase), but in most cases the results are acceptable
  • However, this approach incurs a large storage overhead, since every consecutive word pair becomes a dictionary term
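
A minimal sketch (not from the original notes) of a biword index: the tokenizer, toy documents, and function names are made up for illustration.

```python
from collections import defaultdict

def biwords(tokens):
    """Every consecutive pair of tokens, joined into a single dictionary term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    """Map each biword to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text.lower().split()):
            index[bw].add(doc_id)
    return index

def phrase_query(index, phrase):
    """Answer a longer phrase as the AND of its biwords (may yield false positives)."""
    parts = biwords(phrase.lower().split())
    result = index.get(parts[0], set()).copy() if parts else set()
    for bw in parts[1:]:
        result &= index.get(bw, set())
    return result

docs = {1: "stanford university palo alto", 2: "palo alto university"}
idx = build_biword_index(docs)
print(phrase_query(idx, "stanford university palo alto"))  # {1}
```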

1.2 Solution 2: Positional Indexes

  • In the postings, store, for each term the position(s) in which tokens of it appear:
  • <term, number of docs containing term;
    doc1: position1, position2 …
    doc2: position1, position2 …
    etc.>
    • i.e., for each document we also store the positions at which the term occurs
  • Positional Index Example
  • Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

  • <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, ...>
  • Docs 4 and 5 could contain the phrase, because their position lists for “be” contain two occurrences exactly 4 apart, as “to be … to be” requires
  • In general we have positional postings for both words, so how do we process the two lists together?
  • to:
    • 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191 …
  • be:
    • 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101 …
  • Algorithm idea: merge the two positional postings lists like a Boolean intersection. Start at doc 1: “to” has no posting for doc 1, so drop it; then doc 2: “be” has no posting for doc 2, so drop it; both lists meet at doc 4 (see the sketch at the end of this section)
  • Within doc 4, the requirement that “be” come exactly one position after “to” gives a set of candidates from the two position lists
    • to in doc 4: 8, 16, 190, 429, 433
    • be in doc 4: 17, 191, 291, 430, 434
  • Then the positional constraints from the rest of “to be or not to be” narrow these down to
    • 4: 429, 433
    • 4: 430, 434
  • A positional index is 2-4× as large as a non-positional index

  • Positional index size is 35-50% of the volume of the original text

    • Caveat: all of this holds for “English-like” languages
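
As a rough illustration of the merge described above, here is a minimal sketch of positional intersection for a two-term phrase; the postings layout and helper names are assumptions, and the data comes from the “to be” example.

```python
def positional_intersect(p1, p2, offset=1):
    """For two positional postings {doc_id: positions}, return the positions of
    term2 that occur exactly `offset` after an occurrence of term1."""
    result = {}
    for doc_id in p1.keys() & p2.keys():       # docs containing both terms
        pos2 = set(p2[doc_id])
        hits = [p + offset for p in p1[doc_id] if p + offset in pos2]
        if hits:
            result[doc_id] = hits
    return result

# Postings from the example above: {doc_id: [positions]}
postings_to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
postings_be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}

print(positional_intersect(postings_to, postings_be))  # {4: [17, 191, 430, 434]}
```

Longer phrases chain this step, keeping only the positions consistent with every remaining query term.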

2. Ranked Retrieval

  • Thus far, our queries have all been Boolean.
    • Documents either match or don’t.
    • Boolean matching is all-or-nothing, which is often too rigid
  • Good for expert users with precise understanding of their needs and the collection.

    • Also good for applications: Applications can easily consume 1000s
      of results
    • Benefit: expert users can write many precise queries; whatever matches is returned, and under a closed-world assumption a miss really means the information is absent
    • Closed world: if something exists, it will be found; open world: a miss may simply be due to gaps in the knowledge base
  • Not good for the majority of users.

    • Most users incapable of writing Boolean queries (or they are,
      but they think it’s too much work).
      • ordinary users generally cannot write Boolean queries
    • Most users don’t want to wade through 1000s of results.
      • This is particularly true of web search.
    • How do we fix this situation of getting either far too many answers or none at all?
  • Feast or Famine: Not a Problem in Ranked Retrieval

  • When a system produces a ranked result set, large result sets are not an issue

    • Indeed, the size of the result set is not an issue
    • We just show the top k ( ≈ 10) results
    • We don’t overwhelm the user
    • Premise: the ranking algorithm works

2.2 Scoring as The Basis Of Ranked Retrieval

  • We wish to return in order the documents most likely to be
    useful to the searcher

  • How can we rank order the documents in the collection with
    respect to a query

  • Assign a score, say in [0, 1], to each document

  • This score measures how well the document and query “match”

  • Let’s start with a one term query

  • If the query term does not occur in the document: score
    should be 0

    • first of all, a document that does not contain the query term must have score = 0
  • The more frequent the query term in the document, the
    higher the score (should be) … really?

    • Different documents contain the term with different frequencies; for a single query term, more occurrences usually does mean more relevant, but with multiple query terms this is not necessarily true

3.3 Scoring with the Jaccard Coefficient

  • A commonly used measure of overlap of two sets A and B is
    the Jaccard coefficient: $\text{jaccard}(A, B) = |A \cap B| \,/\, |A \cup B|$ (a short code sketch follows the example below)
  • A and B don’t have to be the same size.

  • Always assigns a number between 0 and 1.

  • What is the query document match score that the Jaccard coefficient computes for each of the two documents below?

    • Query : ides of march
    • Document 1: caesar died in march
      • J(Q,D1)=1/6
    • Document 2: the long march
      • J(Q,D2)=1/5
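
A small sketch that reproduces the two scores above; plain whitespace tokenization and no stop-word removal are assumed.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()
print(jaccard(query, doc1))  # 1/6 ≈ 0.17
print(jaccard(query, doc2))  # 1/5 = 0.20
```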

3.3.1 Issues With Jaccard For Scoring

  • It doesn’t consider term frequency (how many times a term
    occurs in a document)

    • Rare terms in a collection are more informative than frequent
      terms
      • the rarer a term is, the more information it carries
    • Jaccard doesn’t consider this information
  • We need a more sophisticated way of normalizing for length

    • Instead of $|A \cap B| / |A \cup B|$ we can use $|A \cap B| / \sqrt{|A \cup B|}$

3.4 Term Frequency Weighting

3.4.1 Recall: Term-document Incidence Matrix

image-20211123110509505

  • Each document is represented by a binary vector

  • Consider the number of occurrences of a term in a document:

    • Each document is a count vector in $\mathbb{N}^{|V|}$: a column below

image-20211123110648771

3.4.2 Term Frequency $\mathrm{tf}_{t,d}$

  • The term frequency $\mathrm{tf}_{t,d}$ of term $t$ in document $d$ is defined as the number of times that $t$ occurs in $d$.
    • i.e., how many times term t appears in document d
  • We want to use tf when computing query-document match scores. But how?
    • a term occurring 9 times vs. 10 times is not very different; relevance does not grow linearly with raw term frequency

3.4.3 Log-frequency Weighting

  • The log frequency weight of term t in d is
    $w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$
  • the +1 distinguishes a term that occurs once from a term that does not occur at all
  • Score for a document-query pair: sum over terms t in both q and d:
    $\text{score}(q, d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$
    • i.e., add up the weights of the terms that the query and the document share (see the sketch below)
  • The score is 0 if none of the query terms is present in the document.
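
A minimal sketch of log-frequency scoring as defined above; the toy document counts are hypothetical.

```python
import math

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query_terms, doc_counts):
    """Sum of log-frequency weights over terms appearing in both q and d."""
    return sum(log_tf_weight(doc_counts.get(t, 0)) for t in set(query_terms))

doc = {"insurance": 10, "car": 1}                        # hypothetical counts
print(overlap_score(["best", "car", "insurance"], doc))  # 0 + 1 + 2 = 3.0
```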

3.4.4 (Inverse) Document Frequency Weighting

Document Frequency
  • Rare terms are more informative than frequent terms
    • Recall stop words
  • We will give higher weight to terms that carry more information

3.4.5 idf Weight

  • Measured over the whole collection: the higher a term's df, the less information the term carries

  • $\mathrm{df}_t$ is the document frequency of $t$: the number of documents that contain $t$

    • $\mathrm{df}_t$ is an inverse measure of the informativeness of $t$
  • i.e., in how many of the documents the term appears
  • We define the idf (inverse document frequency) of $t$ by $\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$
    • We use $\log_{10}(N / \mathrm{df}_t)$ instead of $N / \mathrm{df}_t$ to “dampen” the effect of idf.
  • $N$ is the number of documents in the collection, and $\mathrm{df}_t$ is the number of documents in which t occurs
  • It will turn out that the base of the log is immaterial

image-20211125152421238

3.4.6 Effect of idf on Ranking

  • Question: Does idf have an effect on ranking for one-term queries, like

    • iPhone
  • idf has no effect on ranking one term queries

  • idf affects the ranking of documents for queries with at least two terms
    • if the query contains only one term, idf has no effect on its ranking; with two or more terms, it does
    • a larger tf makes the document more relevant to that term, while idf measures how informative the term is
  • For multi-term queries, idf turns the tf sum into a weighted sum, weighting each term's tf contribution by its informativeness

3.4.7 Collection vs. Document Frequency

  • collection frequency is the total number of occurrences of the term summed over the entire collection

image-20211125152944526

  • Why should idf be defined over documents rather than the collection?
    • if “the” appears in all 100 documents, it carries essentially no information
    • if “Southeast University” appears 100 times but only inside one document, it is actually highly informative
    • so we should use document frequency, not collection frequency

3.4.8 tf-idf Weighting

  • The tf-idf weight of a term is the product of its tf weight and its idf weight: $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$
  • Best known weighting scheme in information retrieval
    • Note: the “-” in tf-idf is a hyphen, not a minus sign!
    • Alternative names: tf.idf, tf × idf
  • Increases with the number of occurrences within a document
  • Increases with the rarity of the term in the collection

  • This lets us replace the original Boolean incidence matrix with a matrix of tf-idf values (see the sketch below)
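
A minimal sketch of the combined weight $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$; the toy document-frequency table and collection size are made up.

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, df, N):
    """tf-idf weight per term: (1 + log10 tf) * log10(N / df_t)."""
    counts = Counter(doc_tokens)
    return {t: (1 + math.log10(tf)) * math.log10(N / df[t])
            for t, tf in counts.items() if df.get(t, 0) > 0}

df = {"car": 100, "insurance": 50, "the": 1000}      # hypothetical df values
vec = tf_idf_vector("the car insurance car".split(), df, N=1000)
print(vec)  # "the" gets weight 0 (idf = 0); rarer terms get higher weights
```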

3.4.9 Final Ranking of Documents for a Query

Binary → count → weight matrix

image-20211125153713553

  • Each document is now represented by a real-valued vector of tf-idf weights

3.5 The Vector Space Model (VSM)

3.5.1 Documents As Vectors

  • Now we have a |V|-dimensional vector space
  • Terms are axes of the space
  • Documents are points or vectors in this space
  • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
  • These are very sparse vectors - most entries are zero
  • the vector representation is extremely sparse and very high-dimensional

3.5.2 Queries As Vectors

  • Key idea 1: Do the same for queries: represent them as vectors in the space
    • idf is computed exactly as for documents; for tf, treat the query itself as a (very short) document and count term occurrences in it
  • Key idea 2: Rank documents according to their proximity to the query in this space
    • proximity = similarity of vectors
    • proximity ≈ inverse of distance
    • note that the inverse of distance is only an approximation of similarity; the two are not the same thing

3.5.3 Formalizing Vector Space Proximity

  • Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.

  • The Euclidean distance between $\vec{q}$ and $\vec{d}$ is large even though the distribution of terms in the query and the distribution of terms in the document are very similar.

    • In short, similarity should depend on the distribution of terms rather than on their absolute values: a document is much longer than a query, so every component of its vector tends to be larger.

image-20211125154932464

3.5.4 Use Angle Instead Of Distance

  • The Euclidean distance between the two documents can be quite large
  • The angle between the two documents is 0, corresponding to maximal similarity.
  • Key idea: Rank documents according to angle with query.

3.5.5 From Angles To Cosines

  • The following two notions are equivalent.
    • Rank documents in increasing order of the angle between query and document
    • Rank documents in decreasing order of cosine(query, document)
  • Cosine is a monotonically decreasing function on the interval $[0°, 180°]$

3.5.6 computing cosines

  • A vector can be (length-) normalized by dividing each of its components by its length - for this we use the $L_2$ norm: $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$
  • Dividing a vector by its $L_2$ norm makes it a unit (length) vector (on the surface of the unit hypersphere)
  • Effect on the two documents $d$ and $d'$ ($d$ appended to itself) from the earlier slide: they have identical vectors after length-normalization.
    • Long and short documents now have comparable weights

image-20211125155732979

  • $\cos(\vec q, \vec d) = \dfrac{\vec q \cdot \vec d}{\|\vec q\|\,\|\vec d\|} = \dfrac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$, where $q_i$ is the tf-idf weight of term $i$ in the query, $d_i$ is the tf-idf weight of term $i$ in the document, and $\cos(\vec q, \vec d)$ is the cosine similarity of $\vec q$ and $\vec d$ or, equivalently, the cosine of the angle between $\vec q$ and $\vec d$.

  • For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

  • $\cos(\vec q, \vec d) = \vec q \cdot \vec d = \sum_{i=1}^{|V|} q_i d_i$ for $\vec q, \vec d$ length-normalized (see the sketch below the figure).
  • Cosine Similarity Illustrated

image-20211125155912496
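
A minimal sketch of length-normalized cosine similarity over sparse dict vectors; the term weights are invented, but the “document appended to itself” case from above comes out as similarity 1.

```python
import math

def l2_normalize(vec):
    """Divide every component by the vector's L2 norm (unit-length vector)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else vec

def cosine(q, d):
    """Cosine similarity = dot product of the length-normalized vectors."""
    qn, dn = l2_normalize(q), l2_normalize(d)
    return sum(w * dn.get(t, 0.0) for t, w in qn.items())

d = {"jealous": 10, "gossip": 2}                 # hypothetical weights
d_doubled = {t: 2 * w for t, w in d.items()}     # d appended to itself
print(cosine(d, d_doubled))                      # 1.0 (maximal similarity)
```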

3.5.7 Cosine similarity amongst 3 documents

  • How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?
  • Term frequencies (counts)

  • Note: To simplify this example, we don’t do idf weighting.

image-20211125161059077

3.6 Calculating tf-idf Cosine Scores in an IR System

3.6.1 tf-idf Weighting Has Many Variants

image-20211123120047019

  • Many search engines allow for different weightings for queries vs. documents
  • SMART Notation: denotes the combination in use in an engine, with the notation ddd.qqq using the acronyms from the previous table
  • A very standard weighting scheme is: lnc.ltc
  • Document: logarithmic tf (l as first character), no idf, and cosine normalization
  • Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization …

3.6.2 tf-idf example: lnc.ltc

  • Document: car insurance auto insurance
  • Query: best car insurance

image-20211125161546654

  • Exercise: what is N, the number of docs?

Computing cosine scores

image-20211125162858920
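
The figure above shows the standard term-at-a-time CosineScore pseudocode; a rough Python sketch of the same idea is below. The postings/weights structures and variable names are assumptions, not the exact data structures from the notes.

```python
import heapq

def cosine_scores(query_weights, postings, doc_lengths, k=10):
    """Term-at-a-time scoring.
    query_weights: {term: w_{t,q}}
    postings:      {term: {doc_id: w_{t,d}}}
    doc_lengths:   {doc_id: ||d||} (L2 norm of the document vector)
    Returns the top-k (score, doc_id) pairs."""
    scores = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]        # length normalization
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

# Hypothetical tiny index
postings = {"car": {1: 1.0, 2: 0.5}, "insurance": {1: 1.3}}
doc_lengths = {1: 1.6, 2: 0.5}
print(cosine_scores({"car": 1.0, "insurance": 2.0}, postings, doc_lengths))
```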

4. Evaluating Search Engines

  • Assume 10 rel docs in collection

image-20211123120946439

  • R: 1/10, 1/10, 1/10, 1/5, 3/10, 3/10, 2/5, 2/5, 2/5, 2/5
  • P: 1, 1/2, 1/3, 1/2, 3/5, 1/2, 4/7, 1/2, 4/9, 2/5

image-20211123121003441
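
A short sketch that reproduces the R and P sequences above; the relevance pattern (relevant results at ranks 1, 4, 5, and 7) is inferred from those numbers, and 10 relevant documents are assumed as stated.

```python
def precision_recall_at_ranks(is_relevant, total_relevant):
    """Precision and recall after each retrieved document (0/1 relevance list)."""
    found, precisions, recalls = 0, [], []
    for k, rel in enumerate(is_relevant, start=1):
        found += rel
        precisions.append(found / k)
        recalls.append(found / total_relevant)
    return precisions, recalls

ranked_relevance = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]   # inferred from the example
P, R = precision_recall_at_ranks(ranked_relevance, total_relevant=10)
print(R)  # 0.1, 0.1, 0.1, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4
print(P)  # 1.0, 0.5, 0.33, 0.5, 0.6, 0.5, 0.57, 0.5, 0.44, 0.4
```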

4.1 Two Current Evaluation Measures…

  • Mean average precision (MAP)
    • AP: Average of the precision value obtained for the top k documents, each time a relevant doc is retrieved
    • Avoids interpolation, use of fixed recall levels
    • Does weight most accuracy of top returned results
    • MAP for set of queries is arithmetic average of APs
    • Macro-averaging: each query counts equally
    • in short, for a set of queries, compute each query's average precision and take the mean
  • R-precision
    • If we have a known (though perhaps incomplete) set of relevant documents of size Rel, then calculate the precision of the top Rel docs returned
    • A perfect system could score 1.0.
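
A minimal sketch of AP, MAP, and R-precision for the ranking used above. It divides AP by the total number of relevant documents (the usual TREC convention); treat the exact denominator as an assumption.

```python
def average_precision(is_relevant, total_relevant):
    """Average of the precision values at the ranks where a relevant doc appears."""
    found, precisions_at_rel = 0, []
    for k, rel in enumerate(is_relevant, start=1):
        if rel:
            found += 1
            precisions_at_rel.append(found / k)
    return sum(precisions_at_rel) / total_relevant if total_relevant else 0.0

def mean_average_precision(per_query):
    """Macro-average of AP over queries: each query counts equally."""
    return sum(average_precision(r, n) for r, n in per_query) / len(per_query)

def r_precision(is_relevant, total_relevant):
    """Precision of the top Rel documents, Rel = size of the known relevant set."""
    return sum(is_relevant[:total_relevant]) / total_relevant

rel = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]            # ranking from the example above
print(average_precision(rel, 10))               # (1 + 1/2 + 3/5 + 4/7) / 10 ≈ 0.267
print(mean_average_precision([(rel, 10)]))      # single query: equals its AP
print(r_precision(rel, 10))                     # 4/10 = 0.4
```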