Phrase queries and positional indexes
1. Phrase Queries
How do we answer phrase queries in information retrieval, i.e., queries that must match an exact sequence of words such as "southeast university"?
Thus the sentence "I went to university at southeast of China" is not a match.
- The concept of phrase queries has proven easily understood by users; it is one of the few "advanced search" ideas that works
- Many more queries are implicit phrase queries:
- natural language processing
- For this, it no longer suffices to store only \<term : docs> entries: a plain inverted index of this form cannot answer phrase queries
1.1 Solution 1: Biword indexes
Index every consecutive pair of terms in the text as a phrase
- Every pair of adjacent words is stored as its own index entry
For example, the text "Friends, Romans, Countrymen" would generate the biwords
- friends romans
- romans countrymen
Each of these biwords is now a dictionary term
Two-word phrase query processing is now immediate.
- This handles two-word phrases directly
- Longer phrase queries can be broken into consecutive biwords and processed as the conjunction (AND) of the biword queries:
- stanford university palo alto
- "stanford university" AND "university palo" AND "palo alto"
- Without verifying the matching documents this can produce false positives, but in most cases the results are acceptable
- However, this method greatly inflates the index: the dictionary of biwords becomes very large (a small sketch of biword generation follows)
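The sketch below is only an illustration (not from the lecture) of how biwords can be generated from a text; the tokenization is deliberately naive.

```python
# Minimal sketch: generating biwords for a biword index.
# Tokenization is deliberately naive (lowercase, split on non-letters).
import re

def biwords(text: str) -> list[str]:
    tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```

Running a longer query such as "stanford university palo alto" then amounts to generating its biwords in the same way and intersecting their postings.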
1.2 Solution 2: Positional Indexes
- In the postings, store, for each term, the position(s) at which its tokens appear:
  <term, number of docs containing term;
  doc1: position1, position2 …;
  doc2: position1, position2 …;
  etc.>
- That is, in addition to the documents, we store the positions of the term within each document
- Positional Index Example
Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
- <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, ...>
- Docs 4 and 5 are the candidates, because each contains two occurrences of "be" exactly four positions apart, as required by "to be or not to be"
- In general both terms of a phrase have postings available, so how do we process the two lists together?
- to:
  - 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191, …
- be:
  - 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101, …
- Merge algorithm: walk the two postings lists together, as in a Boolean AND. Start at doc 1: "to" has no entry for doc 1, so discard it; next doc 2: "be" has no entry for doc 2, so discard it; the first document on both lists is doc 4 (see the sketch after this example)
- Within doc 4, the constraint that "be" must appear exactly one position after "to" yields a set of candidate positions:
- to in doc 4: 8, 16, 190, 429, 433
- be in doc 4: 17, 191, 291, 430, 434
- matching (to, be) position pairs: (16, 17), (190, 191), (429, 430), (433, 434)
- The position constraints of the remaining words of "to be or not to be" then filter these down to:
- to in doc 4: 429, 433
- be in doc 4: 430, 434
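Below is a sketch of this positional merge, assuming the positional index is stored as a dict term → {docID: sorted position list}; the function name and data layout are illustrative, not a prescribed implementation.

```python
# Phrase-query matching over a positional index:
# index is assumed to be {term: {docID: sorted list of positions}}.
def phrase_query(index, terms):
    # Documents that contain every term of the phrase (Boolean AND step)
    common_docs = set.intersection(*(set(index[t]) for t in terms))
    answers = {}
    for doc in sorted(common_docs):
        # The i-th term of the phrase must occur exactly i positions after the first term
        matches = [p for p in index[terms[0]][doc]
                   if all(p + i in index[t][doc] for i, t in enumerate(terms))]
        if matches:
            answers[doc] = matches
    return answers

index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_query(index, ["to", "be"]))   # {4: [16, 190, 429, 433]}
```

A real system would merge the sorted position lists linearly instead of using membership tests, but the candidate pairs found for doc 4 match the ones listed above.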
A positional index is 2-4 times as large as a non-positional index, a substantial storage overhead.
Positional index size is typically 35-50% of the volume of the original text.
- Caveat: all of this holds for “English like” languages
2. Ranked Retrieval
2.1 Problem with Boolean search:
- Thus far, our queries have all been Boolean.
- Documents either match or don’t.
- Boolean retrieval is all-or-nothing: a document either satisfies the query or it does not
Good for expert users with precise understanding of their needs and the collection.
- Also good for applications: applications can easily consume 1000s of results
- Benefit for experts: they can write many precise queries, and the results behave predictably; what matches is returned, and what is not returned does not match
- Closed-world view: if something is in the collection, it will be found; open-world view: failing to find it may simply be due to gaps in the knowledge base
Not good for the majority of users.
- Most users are incapable of writing Boolean queries (or they are, but they think it's too much work); ordinary users simply do not use Boolean queries
- Most users don’t want to wade through 1000s of results.
- This is particularly true of web search.
- How do we deal with this situation, where a query returns either a flood of answers or none at all?
Feast or Famine: Not a Problem in Ranked Retrieval
When a system produces a ranked result set, large result sets are not an issue
- Indeed, the size of the result set is not an issue
- We just show the top k ( ≈ 10) results
- We don’t overwhelm the user
- Premise: the ranking algorithm works
2.2 Scoring as The Basis Of Ranked Retrieval
We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the documents in the collection with respect to a query?
Assign a score, say in [0, 1], to each document.
This score measures how well the document and query "match".
Let's start with a one-term query.
- If the query term does not occur in the document, the score should be 0
- The more frequent the query term is in the document, the higher the score should be... really?
- For a single-term query, more occurrences do suggest more relevance, but term counts vary between documents, and for queries with multiple terms raw frequency alone is not enough
3.3 Scoring with the Jaccard Coefficient
- A commonly used measure of overlap of two sets A and B is the Jaccard coefficient:
  $J(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$
A and B don’t have to be the same size.
Always assigns a number between 0 and 1.
What is the query-document match score that the Jaccard coefficient computes for each of the two documents below? (A small sketch reproducing these values follows the example.)
- Query : ides of march
- Document 1: caesar died in march
- J(Q,D1)=1/6
- Document 2: the long march
- J(Q,D2)=1/5
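A quick sketch that reproduces these two values, treating the query and each document as a set of lowercased word types:

```python
# Jaccard coefficient between a query and a document,
# treating each as a set of word types.
def jaccard(a: str, b: str) -> float:
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B)

q = "ides of march"
print(jaccard(q, "caesar died in march"))  # 1/6 ≈ 0.167
print(jaccard(q, "the long march"))        # 1/5 = 0.2
```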
3.3.1 Issues With Jaccard For Scoring
- It doesn't consider term frequency (how many times a term occurs in a document)
- Rare terms in a collection are more informative than frequent terms: the rarer the term, the more information it carries
- Jaccard doesn't take this information into account
We need a more sophisticated way of normalizing for length.
- For example, we can use $\dfrac{|A \cap B|}{\sqrt{|A \cup B|}}$ instead of $\dfrac{|A \cap B|}{|A \cup B|}$
3.4 Term Frequency Weighting
3.4.1 Recall: Term-document Incidence Matrix
Each document is represented by a binary vector
Consider the number of occurrences of a term in a document:
- Each document is a count vector in $\mathbb{N}^{|V|}$: a column below
3.4.2 Term Frequency
- The term frequency $\mathrm{tf}_{t,d}$ of term $t$ in document $d$ is defined as the number of times that $t$ occurs in $d$.
- We want to use tf when computing query-document match scores. But how?
- Raw counts are not what we want: relevance does not grow linearly with term frequency (a document with 10 occurrences of a term is not meaningfully more relevant than one with 9)
3.4.3 Log-frequency Weighting
- The log-frequency weight of term t in d is
  $w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d}$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise
- This still distinguishes one occurrence (weight 1) from zero occurrences (weight 0), while damping large counts
- Score for a document-query pair: sum over terms t occurring in both q and d (a sketch follows this list):
  $\mathrm{score}(q, d) = \sum_{t \in q \cap d} \left(1 + \log_{10} \mathrm{tf}_{t,d}\right)$
- The score is 0 if none of the query terms is present in the document.
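A minimal sketch of this overlap score; the tf counts passed in below are made-up numbers used only for illustration.

```python
import math

# Log-frequency weighted overlap score for a query-document pair.
# tf_d maps each term of document d to its raw count in d.
def log_tf_score(query_terms: list[str], tf_d: dict[str, int]) -> float:
    return sum(1 + math.log10(tf_d[t])
               for t in query_terms if tf_d.get(t, 0) > 0)

print(log_tf_score(["to", "be"], {"to": 5, "be": 2, "or": 1}))
# (1 + log10 5) + (1 + log10 2) ≈ 3.0
```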
3.4.4 (Inverse) Document Frequency Weighting
Document Frequency
- Rare terms are more informative than frequent terms
- Recall stop words
- We therefore want to give more informative terms a higher weight
3.4.5 idf Weight
Measured over the whole collection: the higher a term's document frequency, the less informative the term is.
- $\mathrm{df}_t$ is the document frequency of $t$: the number of documents that contain $t$
- $\mathrm{df}_t$ is an inverse measure of the informativeness of $t$
- We define the idf (inverse document frequency) of $t$ by
  $\mathrm{idf}_t = \log_{10}\left(\dfrac{N}{\mathrm{df}_t}\right)$
  where $N$ is the number of documents in the collection and $\mathrm{df}_t$ is the number of documents in which $t$ appears
- We use $\log_{10}(N/\mathrm{df}_t)$ instead of $N/\mathrm{df}_t$ to dampen the effect of idf
- It will turn out that the base of the log is immaterial (a small numeric sketch follows)
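A small numeric sketch of the idf formula; the terms, df values, and N below are hypothetical, chosen only to show how idf shrinks as df grows.

```python
import math

# idf_t = log10(N / df_t); rarer terms get larger idf.
N = 1_000_000                      # assumed collection size
for term, df in [("the", 1_000_000), ("university", 100_000),
                 ("southeast", 10_000), ("countrymen", 100)]:
    print(term, math.log10(N / df))
# the 0.0, university 1.0, southeast 2.0, countrymen 4.0
```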
3.4.6 Effect of idf on Ranking
Question: Does idf have an effect on ranking for one-term queries, like
- iPhone
idf has no effect on ranking for one-term queries.
- idf affects the ranking of documents only for queries with at least two terms
- If the query contains a single word, every candidate document's score is scaled by the same idf factor, so the ranking is unchanged; with multiple query terms, idf changes the terms' relative contributions
- tf measures how strongly a term is associated with a particular document; idf measures how informative the term is across the collection
- For multi-term queries, idf effectively turns the sum of tf weights into a weighted sum
3.4.7 Collection vs. Document Frequency
- The collection frequency of a term is the total number of occurrences of the term summed over all documents in the collection
- Why should idf be based on document frequency rather than collection frequency?
- If "the" appears in all 100 documents of a collection, it carries essentially no information, even though its collection frequency is huge
- In contrast, "Southeast University" occurring 100 times within a single document is highly informative about that document
- So we use document frequency, not collection frequency
3.4.8 tf-idf Weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight:
  $w_{t,d} = \left(1 + \log_{10} \mathrm{tf}_{t,d}\right) \times \log_{10}\left(\dfrac{N}{\mathrm{df}_t}\right)$
- Best known weighting scheme in information retrieval
- Note: the “-“ in tf-idf is a hyphen, not a minus sign!
- Alternative names: tf.idf, tf × idf
- Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection
In this way, the original Boolean incidence matrix can be replaced by a matrix of tf-idf values (a small sketch of the weight computation follows).
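A sketch of the tf-idf weight as defined above; the numbers in the example call are hypothetical.

```python
import math

# tf-idf weight: w = (1 + log10 tf) * log10(N / df), zero if the term is absent.
def tf_idf(tf: int, df: int, N: int) -> float:
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf(tf=3, df=10_000, N=1_000_000))  # (1 + log10 3) * 2 ≈ 2.95
```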
3.4.9 Final Ranking of Documents for a Query
Binary → count → weight matrix
- Each document is now represented by a real-valued vector of tf-idf weights
3.5 The Vector Space Model (VSM)
3.5.1 Documents As Vectors
- Now we have a |V|-dimensional vector space
- Terms are axes of the space
- Documents are points or vectors in this space
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors – most entries are zero
- The vector representation is therefore extremely sparse and very high-dimensional
3.5.2 Queries As Vectors
- Key idea 1: Do the same for queries: represent them as vectors in the space
- idf is computed exactly as for documents; for tf, treat the query itself as a (very short) document and count term occurrences in it
- Key idea 2: Rank documents according to their proximity to the query in this space
- proximity = similarity of vectors
- proximity ≈ inverse of distance
- The inverse of distance is only an approximation of similarity; the two notions are not identical
3.5.3 Formalizing Vector Space Proximity
Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.
The Euclidean distance between $\vec{q}$ and $\vec{d}$ can be large even though the distribution of terms in the query $q$ and the distribution of terms in the document $d$ are very similar.
- In short, similarity should depend on the distribution of terms, not on their absolute magnitudes: documents are much longer than queries, so every component of a document vector tends to be larger.
3.5.4 Use Angle Instead Of Distance
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: Rank documents according to angle with query.
3.5.5 From Angles To Cosines
- The following two notions are equivalent.
- Rank documents in decreasing order of the angle between query and document
- Rank documents in increasing order of cosine(query,document)
- Cosine is a monotonically decreasing function on the interval $[0^{\circ}, 180^{\circ}]$
3.5.6 Computing Cosines
- A vector can be (length-)normalized by dividing each of its components by its length; for this we use the $L_2$ norm:
  $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$
- Dividing a vector by its $L_2$ norm makes it a unit (length) vector (on the surface of the unit hypersphere)
- Effect on the two documents $d$ and $d'$ ($d$ appended to itself) from the earlier slide: they have identical vectors after length normalization
- Long and short documents now have comparable weights
- The cosine similarity of query $\vec{q}$ and document $\vec{d}$ is
  $\cos(\vec{q}, \vec{d}) = \dfrac{\vec{q} \cdot \vec{d}}{\|\vec{q}\| \, \|\vec{d}\|} = \dfrac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\, \sqrt{\sum_{i=1}^{|V|} d_i^2}}$
  where $q_i$ is the tf-idf weight of term $i$ in the query and $d_i$ is the tf-idf weight of term $i$ in the document; equivalently, it is the cosine of the angle between $\vec{q}$ and $\vec{d}$
- For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): $\cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_i q_i d_i$ (a short sketch follows)
- Cosine Similarity Illustrated (figure omitted)
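A sketch of cosine similarity over sparse tf-idf vectors (dicts term → weight); the example vectors are made up, and are only meant to show that a document and the same document appended to itself become identical after length normalization.

```python
import math

def normalize(v: dict[str, float]) -> dict[str, float]:
    norm = math.sqrt(sum(w * w for w in v.values()))   # L2 norm
    return {t: w / norm for t, w in v.items()} if norm else v

def cosine(q: dict[str, float], d: dict[str, float]) -> float:
    # For length-normalized vectors, cosine similarity is just the dot product
    qn, dn = normalize(q), normalize(d)
    return sum(w * dn.get(t, 0.0) for t, w in qn.items())

d = {"jealous": 10.0, "gossip": 6.0}
d_doubled = {t: 2 * w for t, w in d.items()}   # d "appended to itself"
print(cosine(d, d_doubled))                    # 1.0 (up to floating-point rounding)
```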
3.5.7 Cosine similarity amongst 3 documents
- How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?
- Term frequencies (counts)
- Note: To simplify this example, we don’t do idf weighting.
3.6 Calculating tf-idf Cosine Scores in an IR System
3.6.1 tf-idf Weighting Has Many Variants
- Many search engines allow for different weightings for queries vs. documents
- SMART Notation: denotes the combination in use in an engine, with the notation ddd.qqq using the acronyms from the previous table
- A very standard weighting scheme is: lnc.ltc
- Document: logarithmic tf (l as first character), no idf (n), and cosine normalization (c)
- Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization (c)
3.6.2 tf-idf example: lnc.ltc
- Document: car insurance auto insurance
- Query: best car insurance
- Exercise: what is N, the number of docs? (A worked sketch with assumed numbers follows.)
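Below is a worked sketch of lnc.ltc scoring for this example. The document-frequency values and N are not given in these notes; they are assumed numbers, used only to make the arithmetic concrete.

```python
import math

N = 1_000_000                                                              # assumed collection size
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}   # assumed df values

doc_terms = ["car", "insurance", "auto", "insurance"]   # document: car insurance auto insurance
query_terms = ["best", "car", "insurance"]              # query: best car insurance

def counts(terms):
    tf = {}
    for t in terms:
        tf[t] = tf.get(t, 0) + 1
    return tf

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(w):
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()} if norm else w

# Document side (lnc): log tf, no idf, cosine normalization
d_w = normalize({t: log_tf(c) for t, c in counts(doc_terms).items()})
# Query side (ltc): log tf, idf, cosine normalization
q_w = normalize({t: log_tf(c) * math.log10(N / df[t]) for t, c in counts(query_terms).items()})

score = sum(q_w[t] * d_w.get(t, 0.0) for t in q_w)   # dot product of unit vectors
print(round(score, 3))                               # ≈ 0.8 with these assumed numbers
```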
Computing cosine scores
4. Evaluating Search Engines
- Assume 10 relevant docs in the collection, and consider the top 10 results returned for one query (the relevant results are at ranks 1, 4, 5, and 7)
- R: 1/10, 1/10, 1/10, 2/10, 3/10, 3/10, 4/10, 4/10, 4/10, 4/10
- P: 1/1, 1/2, 1/3, 2/4, 3/5, 3/6, 4/7, 4/8, 4/9, 4/10
- (A sketch computing these values follows.)
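A short sketch that recomputes the precision and recall values above, plus the resulting average precision; the 0/1 relevance pattern is read off the figures listed.

```python
# Relevance of the top-10 results (1 = relevant); 10 relevant docs exist in total.
rel = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
total_relevant = 10

hits = 0
precisions_at_hits = []
for k, r in enumerate(rel, start=1):
    hits += r
    p, rc = hits / k, hits / total_relevant
    if r:
        precisions_at_hits.append(p)
    print(f"rank {k}: P = {p:.2f}, R = {rc:.2f}")

# Average precision: mean of precision at each relevant result; here we average
# over all 10 relevant docs, so the 6 relevant docs never retrieved contribute 0.
ap = sum(precisions_at_hits) / total_relevant
print(f"AP = {ap:.3f}")   # ≈ 0.267
```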
4.1 Two Current Evaluation Measures…
- Mean average precision (MAP)
- AP: Average of the precision value obtained for the top k documents, each time a relevant doc is retrieved
- Avoids interpolation, use of fixed recall levels
- Weights the accuracy of the top-returned results most heavily
- MAP for set of queries is arithmetic average of APs
- Macro-averaging: each query counts equally
- In short: for a set of queries, compute each query's average precision and then average those values
- R-precision
- If have known (though perhaps incomplete) set of relevant documents of size Rel, then calculate precision of top Rel docs returned
- Perfect system could score 1.0.