Phrase queries and positional indexes


1. Phrase Queries

  • How can an information retrieval system support phrase queries?

  • For a phrase query like “southeast university”, the sentence “I went to university at southeast of China” is not a match.

    • The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works
  • Many more queries are implicit phrase queries:
    • natural language processing
  • For this, it no longer suffices to store only
    • <term : docs> entries
    • A plain (non-positional) inverted index cannot support phrase queries

1.1 Solution 1: Biword indexes

  • Index every consecutive pair of terms in the text as a phrase

    • store every consecutive pair of terms as its own index entry
  • For example, the text “Friends, Romans, Countrymen” would generate the biwords

    • friends romans
    • romans countrymen
  • Each of these biwords is now a dictionary term

  • Two word phrase query processing is now immediate.

    • this handles two-word phrase combinations directly
  • For longer phrases, split the query into consecutive biwords and take the conjunction (AND) of the biword queries (see the sketch below)

    • stanford university palo alto
    • stanford university AND university palo AND palo alto
  • Except in rare cases this works: it can produce false positives (documents that contain all the biwords but not the full phrase), but in most cases the results are acceptable
  • However, this approach incurs a large storage overhead, since every consecutive word pair becomes a dictionary term
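
A minimal sketch (not from the original notes) of a biword index: the tokenizer, toy documents, and function names are made up for illustration.

```python
from collections import defaultdict

def biwords(tokens):
    """Every consecutive pair of tokens, joined into a single dictionary term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    """Map each biword to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text.lower().split()):
            index[bw].add(doc_id)
    return index

def phrase_query(index, phrase):
    """Answer a longer phrase as the AND of its biwords (may yield false positives)."""
    parts = biwords(phrase.lower().split())
    result = index.get(parts[0], set()).copy() if parts else set()
    for bw in parts[1:]:
        result &= index.get(bw, set())
    return result

docs = {1: "stanford university palo alto", 2: "palo alto university"}
idx = build_biword_index(docs)
print(phrase_query(idx, "stanford university palo alto"))  # {1}
```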

1.2 Solution 2: Positional Indexes

  • In the postings, store, for each term the position(s) in which tokens of it appear:
  • <term, number of docs containing term;
    doc1: position1, position2 …
    doc2: position1, position2 …
    etc.>
    • i.e., for each document we also store the positions at which the term occurs
  • Positional Index Example
  • Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

  • <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, ...>
  • Docs 4 and 5 could contain the phrase, because their position lists for “be” contain two occurrences exactly 4 apart, as “to be … to be” requires
  • In general we have positional postings for both words, so how do we process the two lists together?
  • to:
    • 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191 …
  • be:
    • 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101 …
  • Algorithm idea: merge the two positional postings lists like a Boolean intersection. Start at doc 1: “to” has no posting for doc 1, so drop it; then doc 2: “be” has no posting for doc 2, so drop it; both lists meet at doc 4 (see the sketch at the end of this section)
  • Within doc 4, the requirement that “be” come exactly one position after “to” gives a set of candidates from the two position lists
    • to in doc 4: 8, 16, 190, 429, 433
    • be in doc 4: 17, 191, 291, 430, 434
  • Then the positional constraints from the rest of “to be or not to be” narrow these down to
    • 4: 429, 433
    • 4: 430, 434
  • A positional index is 2-4× as large as a non-positional index

  • Positional index size is 35-50% of the volume of the original text

    • Caveat: all of this holds for “English-like” languages
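
As a rough illustration of the merge described above, here is a minimal sketch of positional intersection for a two-term phrase; the postings layout and helper names are assumptions, and the data comes from the “to be” example.

```python
def positional_intersect(p1, p2, offset=1):
    """For two positional postings {doc_id: positions}, return the positions of
    term2 that occur exactly `offset` after an occurrence of term1."""
    result = {}
    for doc_id in p1.keys() & p2.keys():       # docs containing both terms
        pos2 = set(p2[doc_id])
        hits = [p + offset for p in p1[doc_id] if p + offset in pos2]
        if hits:
            result[doc_id] = hits
    return result

# Postings from the example above: {doc_id: [positions]}
postings_to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
postings_be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}

print(positional_intersect(postings_to, postings_be))  # {4: [17, 191, 430, 434]}
```

Longer phrases chain this step, keeping only the positions consistent with every remaining query term.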

2. Ranked Retrieval

  • Thus far, our queries have all been Boolean.
    • Documents either match or don’t.
    • Boolean matching is all-or-nothing, which is often too rigid
  • Good for expert users with precise understanding of their needs and the collection.

    • Also good for applications: Applications can easily consume 1000s
      of results
    • Benefit: expert users can write many precise queries; whatever matches is returned, and under a closed-world assumption a miss really means the information is absent
    • Closed world: if something exists, it will be found; open world: a miss may simply be due to gaps in the knowledge base
  • Not good for the majority of users.

    • Most users incapable of writing Boolean queries (or they are,
      but they think it’s too much work).
      • ordinary users generally cannot write Boolean queries
    • Most users don’t want to wade through 1000s of results.
      • This is particularly true of web search.
    • How do we fix this situation of getting either far too many answers or none at all?
  • Feast or Famine: Not a Problem in Ranked Retrieval

  • When a system produces a ranked result set, large result sets are not an issue

    • Indeed, the size of the result set is not an issue
    • We just show the top k ( ≈ 10) results
    • We don’t overwhelm the user
    • Premise: the ranking algorithm works

2.2 Scoring as The Basis Of Ranked Retrieval

  • We wish to return in order the documents most likely to be
    useful to the searcher

  • How can we rank order the documents in the collection with
    respect to a query

  • Assign a score, say in [0, 1], to each document

  • This score measures how well the document and query “match”

  • Let’s start with a one term query

  • If the query term does not occur in the document: score
    should be 0

    • first of all, a document that does not contain the query term must have score = 0
  • The more frequent the query term in the document, the
    higher the score (should be) … really?

    • Different documents contain the term with different frequencies; for a single query term, more occurrences usually does mean more relevant, but with multiple query terms this is not necessarily true

3.3 Scoring with the Jaccard Coefficient

  • A commonly used measure of overlap of two sets A and B is
    the Jaccard coefficient: $\text{jaccard}(A, B) = |A \cap B| \,/\, |A \cup B|$ (a short code sketch follows the example below)
  • A and B don’t have to be the same size.

  • Always assigns a number between 0 and 1.

  • What is the query document match score that the Jaccard coefficient computes for each of the two documents below?

    • Query : ides of march
    • Document 1: caesar died in march
      • J(Q,D1)=1/6
    • Document 2: the long march
      • J(Q,D2)=1/5
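
A small sketch that reproduces the two scores above; plain whitespace tokenization and no stop-word removal are assumed.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()
print(jaccard(query, doc1))  # 1/6 ≈ 0.17
print(jaccard(query, doc2))  # 1/5 = 0.20
```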

3.3.1 Issues With Jaccard For Scoring

  • It doesn’t consider term frequency (how many times a term
    occurs in a document)

    • Rare terms in a collection are more informative than frequent
      terms
      • the rarer a term is, the more information it carries
    • Jaccard doesn’t consider this information
  • We need a more sophisticated way of normalizing for length

    • Instead of $|A \cap B| / |A \cup B|$ we can use $|A \cap B| / \sqrt{|A \cup B|}$

3.4 Term Frequency Weighting

3.4.1 Recall: Term-document Incidence Matrix

image-20211123110509505

  • Each document is represented by a binary vector

  • Consider the number of occurrences of a term in a document:

    • Each document is a count vector in $\mathbb{N}^{|V|}$: a column below

image-20211123110648771

3.4.2 Term Frequency $\mathrm{tf}_{t,d}$

  • The term frequency $\mathrm{tf}_{t,d}$ of term $t$ in document $d$ is defined as the number of times that $t$ occurs in $d$.
    • i.e., how many times term t appears in document d
  • We want to use tf when computing query-document match scores. But how?
    • a term occurring 9 times vs. 10 times is not very different; relevance does not grow linearly with raw term frequency

3.4.3 Log-frequency Weighting

  • The log frequency weight of term t in d is
    $w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$
  • the +1 distinguishes a term that occurs once from a term that does not occur at all
  • Score for a document-query pair: sum over terms t in both q and d:
    $\text{score}(q, d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$
    • i.e., add up the weights of the terms that the query and the document share (see the sketch below)
  • The score is 0 if none of the query terms is present in the document.
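
A minimal sketch of log-frequency scoring as defined above; the toy document counts are hypothetical.

```python
import math

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query_terms, doc_counts):
    """Sum of log-frequency weights over terms appearing in both q and d."""
    return sum(log_tf_weight(doc_counts.get(t, 0)) for t in set(query_terms))

doc = {"insurance": 10, "car": 1}                        # hypothetical counts
print(overlap_score(["best", "car", "insurance"], doc))  # 0 + 1 + 2 = 3.0
```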

3.4.4 (Inverse) Document Frequency Weighting

Document Frequency
  • Rare terms are more informative than frequent terms
    • Recall stop words
  • We will give higher weight to terms that carry more information

3.4.5 idf Weight

  • Measured over the whole collection: the higher a term's df, the less information the term carries

  • $\mathrm{df}_t$ is the document frequency of $t$: the number of documents that contain $t$

    • $\mathrm{df}_t$ is an inverse measure of the informativeness of $t$
  • i.e., in how many of the documents the term appears
  • We define the idf (inverse document frequency) of $t$ by $\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$
    • We use $\log_{10}(N / \mathrm{df}_t)$ instead of $N / \mathrm{df}_t$ to “dampen” the effect of idf.
  • $N$ is the number of documents in the collection, and $\mathrm{df}_t$ is the number of documents in which t occurs
  • It will turn out that the base of the log is immaterial

image-20211125152421238

3.4.6 Effect of idf on Ranking

  • Question: Does idf have an effect on ranking for one-term queries, like

    • iPhone
  • idf has no effect on ranking one term queries

  • idf affects the ranking of documents for queries with at least two terms
    • if the query contains only one term, idf has no effect on its ranking; with two or more terms, it does
    • a larger tf makes the document more relevant to that term, while idf measures how informative the term is
  • For multi-term queries, idf turns the tf sum into a weighted sum, weighting each term's tf contribution by its informativeness

3.4.7 Collection vs. Document Frequency

  • collection frequency is the total number of occurrences of the term summed over the entire collection

image-20211125152944526

  • Why should idf be defined over documents rather than the collection?
    • if “the” appears in all 100 documents, it carries essentially no information
    • if “Southeast University” appears 100 times but only inside one document, it is actually highly informative
    • so we should use document frequency, not collection frequency

3.4.8 tf-idf Weighting

  • The tf-idf weight of a term is the product of its tf weight and its idf weight: $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$
  • Best known weighting scheme in information retrieval
    • Note: the “-” in tf-idf is a hyphen, not a minus sign!
    • Alternative names: tf.idf, tf × idf
  • Increases with the number of occurrences within a document
  • Increases with the rarity of the term in the collection

  • This lets us replace the original Boolean incidence matrix with a matrix of tf-idf values (see the sketch below)
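
A minimal sketch of the combined weight $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$; the toy document-frequency table and collection size are made up.

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, df, N):
    """tf-idf weight per term: (1 + log10 tf) * log10(N / df_t)."""
    counts = Counter(doc_tokens)
    return {t: (1 + math.log10(tf)) * math.log10(N / df[t])
            for t, tf in counts.items() if df.get(t, 0) > 0}

df = {"car": 100, "insurance": 50, "the": 1000}      # hypothetical df values
vec = tf_idf_vector("the car insurance car".split(), df, N=1000)
print(vec)  # "the" gets weight 0 (idf = 0); rarer terms get higher weights
```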

3.4.9 Final Ranking of Documents for a Query

Binary → count → weight matrix

image-20211125153713553

  • Each document is now represented by a real-valued vector of tf-idf weights

3.5 The Vector Space Model (VSM)

3.5.1 Documents As Vectors

  • Now we have a |V|-dimensional vector space
  • Terms are axes of the space
  • Documents are points or vectors in this space
  • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
  • These are very sparse vectors - most entries are zero
  • the vector representation is extremely sparse and very high-dimensional

3.5.2 Queries As Vectors

  • Key idea 1: Do the same for queries: represent them as vectors in the space
    • idf is computed exactly as for documents; for tf, treat the query itself as a (very short) document and count term occurrences in it
  • Key idea 2: Rank documents according to their proximity to the query in this space
    • proximity = similarity of vectors
    • proximity ≈ inverse of distance
    • note that the inverse of distance is only an approximation of similarity; the two are not the same thing

3.5.3 Formalizing Vector Space Proximity

  • Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.

  • The Euclidean distance between $\vec{q}$ and $\vec{d}$ is large even though the distribution of terms in the query and the distribution of terms in the document are very similar.

    • In short, similarity should depend on the distribution of terms rather than on their absolute values: a document is much longer than a query, so every component of its vector tends to be larger.

image-20211125154932464

3.5.4 Use Angle Instead Of Distance

  • The Euclidean distance between the two documents can be quite large
  • The angle between the two documents is 0, corresponding to maximal similarity.
  • Key idea: Rank documents according to angle with query.

3.5.5 From Angles To Cosines

  • The following two notions are equivalent.
    • Rank documents in increasing order of the angle between query and document
    • Rank documents in decreasing order of cosine(query, document)
  • Cosine is a monotonically decreasing function on the interval $[0°, 180°]$

3.5.6 computing cosines

  • A vector can be (length-) normalized by dividing each of its components by its length - for this we use the $L_2$ norm: $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$
  • Dividing a vector by its $L_2$ norm makes it a unit (length) vector (on the surface of the unit hypersphere)
  • Effect on the two documents $d$ and $d'$ ($d$ appended to itself) from the earlier slide: they have identical vectors after length-normalization.
    • Long and short documents now have comparable weights

image-20211125155732979

  • $\cos(\vec q, \vec d) = \dfrac{\vec q \cdot \vec d}{\|\vec q\|\,\|\vec d\|} = \dfrac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$, where $q_i$ is the tf-idf weight of term $i$ in the query, $d_i$ is the tf-idf weight of term $i$ in the document, and $\cos(\vec q, \vec d)$ is the cosine similarity of $\vec q$ and $\vec d$ or, equivalently, the cosine of the angle between $\vec q$ and $\vec d$.

  • For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

  • $\cos(\vec q, \vec d) = \vec q \cdot \vec d = \sum_{i=1}^{|V|} q_i d_i$ for $\vec q, \vec d$ length-normalized (see the sketch below the figure).
  • Cosine Similarity Illustrated

image-20211125155912496
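
A minimal sketch of length-normalized cosine similarity over sparse dict vectors; the term weights are invented, but the “document appended to itself” case from above comes out as similarity 1.

```python
import math

def l2_normalize(vec):
    """Divide every component by the vector's L2 norm (unit-length vector)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else vec

def cosine(q, d):
    """Cosine similarity = dot product of the length-normalized vectors."""
    qn, dn = l2_normalize(q), l2_normalize(d)
    return sum(w * dn.get(t, 0.0) for t, w in qn.items())

d = {"jealous": 10, "gossip": 2}                 # hypothetical weights
d_doubled = {t: 2 * w for t, w in d.items()}     # d appended to itself
print(cosine(d, d_doubled))                      # 1.0 (maximal similarity)
```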

3.5.7 Cosine similarity amongst 3 documents

  • How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?
  • Term frequencies (counts)

  • Note: To simplify this example, we don’t do idf weighting.

image-20211125161059077

3.6 Calculating tf-idf Cosine Scores in an IR System

3.6.1 tf-idf Weighting Has Many Variants

image-20211123120047019

  • Many search engines allow for different weightings for queries vs. documents
  • SMART Notation: denotes the combination in use in an engine, with the notation ddd.qqq using the acronyms from the previous table
  • A very standard weighting scheme is: lnc.ltc
  • Document: logarithmic tf (l as first character), no idf, and cosine normalization
  • Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization …

3.6.2 tf-idf example: lnc.ltc

  • Document: car insurance auto insurance
  • Query: best car insurance

image-20211125161546654

  • Exercise: what is N, the number of docs?

Computing cosine scores

image-20211125162858920
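
The figure above shows the standard term-at-a-time CosineScore pseudocode; a rough Python sketch of the same idea is below. The postings/weights structures and variable names are assumptions, not the exact data structures from the notes.

```python
import heapq

def cosine_scores(query_weights, postings, doc_lengths, k=10):
    """Term-at-a-time scoring.
    query_weights: {term: w_{t,q}}
    postings:      {term: {doc_id: w_{t,d}}}
    doc_lengths:   {doc_id: ||d||} (L2 norm of the document vector)
    Returns the top-k (score, doc_id) pairs."""
    scores = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]        # length normalization
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

# Hypothetical tiny index
postings = {"car": {1: 1.0, 2: 0.5}, "insurance": {1: 1.3}}
doc_lengths = {1: 1.6, 2: 0.5}
print(cosine_scores({"car": 1.0, "insurance": 2.0}, postings, doc_lengths))
```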

4. Evaluating Search Engines

  • Assume 10 rel docs in collection

image-20211123120946439

  • R: 1/10, 1/10, 1/10, 1/5, 3/10, 3/10, 2/5, 2/5, 2/5, 2/5
  • P: 1, 1/2, 1/3, 1/2, 3/5, 1/2, 4/7, 1/2, 4/9, 2/5

image-20211123121003441
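
A short sketch that reproduces the R and P sequences above; the relevance pattern (relevant results at ranks 1, 4, 5, and 7) is inferred from those numbers, and 10 relevant documents are assumed as stated.

```python
def precision_recall_at_ranks(is_relevant, total_relevant):
    """Precision and recall after each retrieved document (0/1 relevance list)."""
    found, precisions, recalls = 0, [], []
    for k, rel in enumerate(is_relevant, start=1):
        found += rel
        precisions.append(found / k)
        recalls.append(found / total_relevant)
    return precisions, recalls

ranked_relevance = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]   # inferred from the example
P, R = precision_recall_at_ranks(ranked_relevance, total_relevant=10)
print(R)  # 0.1, 0.1, 0.1, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4
print(P)  # 1.0, 0.5, 0.33, 0.5, 0.6, 0.5, 0.57, 0.5, 0.44, 0.4
```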

4.1 Two Current Evaluation Measures…

  • Mean average precision (MAP)
    • AP: Average of the precision value obtained for the top k documents, each time a relevant doc is retrieved
    • Avoids interpolation, use of fixed recall levels
    • Does weight most accuracy of top returned results
    • MAP for set of queries is arithmetic average of APs
    • Macro-averaging: each query counts equally
    • in short, for a set of queries, compute each query's average precision and take the mean
  • R-precision
    • If we have a known (though perhaps incomplete) set of relevant documents of size Rel, then calculate the precision of the top Rel docs returned
    • A perfect system could score 1.0.
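
A minimal sketch of AP, MAP, and R-precision for the ranking used above. It divides AP by the total number of relevant documents (the usual TREC convention); treat the exact denominator as an assumption.

```python
def average_precision(is_relevant, total_relevant):
    """Average of the precision values at the ranks where a relevant doc appears."""
    found, precisions_at_rel = 0, []
    for k, rel in enumerate(is_relevant, start=1):
        if rel:
            found += 1
            precisions_at_rel.append(found / k)
    return sum(precisions_at_rel) / total_relevant if total_relevant else 0.0

def mean_average_precision(per_query):
    """Macro-average of AP over queries: each query counts equally."""
    return sum(average_precision(r, n) for r, n in per_query) / len(per_query)

def r_precision(is_relevant, total_relevant):
    """Precision of the top Rel documents, Rel = size of the known relevant set."""
    return sum(is_relevant[:total_relevant]) / total_relevant

rel = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]            # ranking from the example above
print(average_precision(rel, 10))               # (1 + 1/2 + 3/5 + 4/7) / 10 ≈ 0.267
print(mean_average_precision([(rel, 10)]))      # single query: equals its AP
print(r_precision(rel, 10))                     # 4/10 = 0.4
```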