Statistical Reasoning
1. Definition
Statistical reasoning tries to find suitable statistical models to fit the samples and predict the expected probabilities of the inferred knowledge, i.e., the probability that yet-unseen knowledge holds.
- knowledge graph embedding based reasoning
- inductive rule learning based reasoning
- multi-hop reasoning
Tasks:
- Predicting the missing link: given e1 and e2, predict the relation r.
- Predicting the missing entity: given e1 (or e2) and relation r, predict the missing entity e2 (or e1).
- Fact Prediction: given a triple, predict whether it is true or false.
2. Embedding: Meaning of a Word
- What is the meaning of a word?
- By ontologies? By knowledge graphs?
- But ontologies and KGs are hard to construct and often incomplete; they cannot enumerate every meaning exhaustively.
- How to encode the meaning of a word?
3. One-hot Representation
- Vocabulary: (cat, mat, on, sat, the)
- cat: 10000 mat: 01000 on: 00100 sat: 00010 the: 00001
- “The cat sat on the mat”
- Disadvantage: too sparse.
- One-hot representation:
  - Foundation of the Bag-of-Words model.
  - Cannot measure semantic relatedness between words.
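A minimal numpy sketch (using the toy vocabulary above) of why one-hot vectors cannot measure semantic relatedness: any two distinct words get orthogonal vectors, so every pairwise similarity is zero.

```python
import numpy as np

# Toy vocabulary from the example above.
vocab = ["cat", "mat", "on", "sat", "the"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["cat"])                    # [1. 0. 0. 0. 0.]
# Distinct one-hot vectors are orthogonal: their dot product is always 0,
# so "cat" is no more similar to "mat" than to "the".
print(one_hot["cat"] @ one_hot["mat"])   # 0.0
print(one_hot["cat"] @ one_hot["the"])   # 0.0
```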
4. Distributional Representation
When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window); i.e., a word is represented by the words around it.
Use the many contexts of w to build up a representation of w
- i.e., build a dense vector for w.
5. Word Vectors
- We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.
- Note: word vectors are sometimes called word embeddings. They are a distributed representation.
Similarity: the semantic similarity of two words is measured by the similarity of their vectors (e.g., cosine similarity).
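A small sketch of the similarity computation, with made-up 4-dimensional vectors (the values are hypothetical, just to illustrate that words occurring in similar contexts end up with a high cosine):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors (1.0 = same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical dense vectors; in practice they are learned from contexts.
cat = np.array([0.8, 0.1, 0.6, 0.2])
dog = np.array([0.7, 0.2, 0.5, 0.3])   # occurs in contexts similar to "cat"
the = np.array([0.0, 0.9, 0.1, 0.8])   # a function word, different contexts

print(cosine_similarity(cat, dog))   # ~0.98, high
print(cosine_similarity(cat, the))   # ~0.25, low
```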
6. Advantage of Distributed Representation
- Deal with the data sparsity problem in NLP
- Realize knowledge transfer across domains and across objects
- Provide a unified representation for multi-task learning
6.1 Representation Learning
- What is representation learning?
- Objects are represented as dense, real-valued, low-dimensional vectors
6.2 Different ways of KG Representation
- Tensor: more degrees of freedom and can capture implicit knowledge, but hard to scale and hard to interpret (see the sketch below).
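As a concrete illustration of the tensor view (sizes and facts below are hypothetical): a KG can be stored as a binary 3-way adjacency tensor indexed by (head, relation, tail).

```python
import numpy as np

M, N = 4, 2                              # M entities, N relations (toy sizes)
X = np.zeros((M, N, M), dtype=np.int8)   # X[h, r, t] = 1 iff triple (h, r, t) holds
X[0, 0, 1] = 1                           # e.g., assert the fact (entity 0, relation 0, entity 1)
print(X[0, 0, 1], X[0, 0, 2])            # 1 0  (known fact vs. unknown/false)
```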
6.3 Knowledge Graph Embedding: Application
- Entity Prediction
- (卧虎藏龙, Has-director, ?)
- (卧虎藏龙, Has-director, Ang Lee)
- Relation Prediction
- Recommendation System
7. TransE: Take Relation as Translation
- For a fact (head, relation, tail), take the relation as a translation operator from the head to the tail; i.e., the head entity is translated to the tail entity by the relation.
TransE
- For each triple $(h, r, t)$, $h$ is translated to $t$ by $r$.
- TransE energy function: if the triple is true, the translated distance between $(h + r)$ and $t$ is short.
  - L1 (Manhattan) distance: $\|h + r - t\|_1 = \sum_i |h_i + r_i - t_i|$
  - L2 (Euclidean) distance: $\|h + r - t\|_2 = \sqrt{\sum_i (h_i + r_i - t_i)^2}$
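A one-function numpy sketch of this energy function (the norm choice selects L1 or L2 as above):

```python
import numpy as np

def transe_score(h: np.ndarray, r: np.ndarray, t: np.ndarray, norm: int = 1) -> float:
    """Translated distance ||h + r - t||; small for true triples, large for false ones."""
    return float(np.linalg.norm(h + r - t, ord=norm))  # ord=1 -> L1, ord=2 -> L2
```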
TransE
- True triple examples: …
- False triple examples: …
How to distinguish true triples from false ones?
- Minimize the distance between $(h + r)$ and $t$ for a true triple.
- Maximize the distance between $(h' + r)$ and a randomly sampled tail $t'$ (a negative example).
- I.e., minimize the translated distance for positive triples and maximize it for negative ones, as formalized by the loss below.
- $T_{batch}$ is a set of (positive triple, negative triple) pairs used in each update.
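These two goals are combined in TransE's margin-based ranking loss (the loss from the original TransE paper, restated in this section's notation), where $\lambda$ is the margin, $S'_{(h,r,t)}$ the corrupted triples built from $(h, r, t)$, and $[x]_+ = \max(0, x)$:

$$\mathcal{L} = \sum_{(h, r, t) \in S} \; \sum_{(h', r, t') \in S'_{(h, r, t)}} \big[\lambda + d(h + r,\, t) - d(h' + r,\, t')\big]_+$$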
Training algorithm:
1. Input: training set $S = \{(h, r, t)\}$, entity set $E$ and relation set $R$, margin $\lambda$, embedding dimension $k$.
2. Initialize the entity and relation embeddings.
3. Normalize the entity and relation embeddings (looping over each entity $e$; suppose the entity set $E$ contains $M$ entities).
4. Negative sampling: for each positive triple, corrupt its head or tail to build a negative triple, then update the embeddings by gradient descent on the margin loss, as sketched below.
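A minimal numpy sketch of these four steps on a toy, made-up triple set (real implementations sample minibatches and filter out corrupted triples that happen to be true facts):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 5 entities, 2 relations, triples as (head, relation, tail) indices.
n_ent, n_rel = 5, 2
S = [(0, 0, 1), (1, 0, 2), (0, 1, 3)]
k, margin, lr = 8, 1.0, 0.01

# Steps 1-2: initialize embeddings uniformly in [-6/sqrt(k), 6/sqrt(k)].
b = 6 / np.sqrt(k)
E = rng.uniform(-b, b, (n_ent, k))
R = rng.uniform(-b, b, (n_rel, k))
R /= np.linalg.norm(R, axis=1, keepdims=True)   # normalize relation embeddings once

def d(h, r, t):
    """Squared L2 translated distance of a triple."""
    return np.sum((E[h] + R[r] - E[t]) ** 2)

for epoch in range(200):
    E /= np.linalg.norm(E, axis=1, keepdims=True)   # step 3: normalize entity embeddings
    for (h, r, t) in S:
        # Step 4: negative sampling -- corrupt the head or the tail at random.
        if rng.random() < 0.5:
            h2, t2 = int(rng.integers(n_ent)), t
        else:
            h2, t2 = h, int(rng.integers(n_ent))
        if margin + d(h, r, t) - d(h2, r, t2) > 0:   # hinge loss is active
            g_pos = 2 * (E[h] + R[r] - E[t])         # gradient of the positive distance
            g_neg = 2 * (E[h2] + R[r] - E[t2])       # gradient of the negative distance
            E[h] -= lr * g_pos
            E[t] += lr * g_pos
            R[r] -= lr * (g_pos - g_neg)
            E[h2] += lr * g_neg
            E[t2] -= lr * g_neg
```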
- Evaluation protocol: for every test triple, replace the head (or tail) with each entity in the KG, compute the distance for every candidate, and sort the candidates by distance to obtain the rank of the correct entity.
- Link Prediction example: (WALL-E, _has_genre, ?)
- Metrics:
  - Mean Rank (MR): the mean of the predicted ranks of the correct entities. E.g., Entity 1: rank 50; Entity 2: rank 100; MR = (50 + 100)/2 = 75.
  - Hits@10: the proportion of correct entities ranked in the top 10.
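A sketch of the ranking computation and the two metrics (reusing the illustrative `E` and `R` embeddings from the training sketch above):

```python
import numpy as np

def rank_of_tail(E: np.ndarray, R: np.ndarray, h: int, r: int, t: int) -> int:
    """Rank of the true tail t when scoring (h, r, ?) against every entity."""
    scores = np.linalg.norm(E[h] + R[r] - E, ord=1, axis=1)   # distance to all candidates
    return int(np.sum(scores < scores[t])) + 1                # 1-based rank

# Metrics over a set of test ranks, e.g. the two ranks from the example above.
ranks = np.array([50, 100])
print(ranks.mean())          # Mean Rank = 75.0
print(np.mean(ranks <= 10))  # Hits@10 = 0.0
```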
8. Question
We have two types of relations in a KG, for example:
Symmetric Relation:
- e.g., (stu1, classmate, stu2), (stu2, classmate, stu1)
Composition Relation:
- e.g., (B, husband_of, A), (A, mother_of, C), (B, father_of, C)
Which relations can be modeled by TransE? Why?
- TransE cannot model symmetric relations: a non-trivial symmetric relation would force $r = 0$ and $h = t$ (see the worked equations below).
- TransE can model composition relations, when $r_3 = r_1 + r_2$.
- Can TransE model 1-to-N relations?
  - e.g., (qiguilin, teacher_of, stu1), (qiguilin, teacher_of, stu2), (qiguilin, teacher_of, stu3), (qiguilin, teacher_of, stu4)…
  - No: otherwise all the stu_i would be forced to have identical embeddings.
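The vector arithmetic behind these answers:

- Symmetric: $h + r = t$ and $t + r = h$ together imply $2r = 0$, i.e. $r = 0$ and $h = t$, so a non-trivial symmetric relation collapses.
- Composition: $h + r_1 = m$ and $m + r_2 = t$ give $h + (r_1 + r_2) = t$, which is exactly $r_3 = r_1 + r_2$.
- 1-to-N: $h + r = t_1$ and $h + r = t_2$ force $t_1 = t_2$, so all tail entities get the same embedding.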
Issue of TransE
- TransE is too simple to handle complex relations
- 1-to-N, N-to-1, and N-to-N relations cannot be modeled properly
9. Variants of TransE: TransH
For each relation, define a hyperplane $W_r$ and a relation vector $d_r$. Then project the head entity vector $h$ and the tail entity vector $t$ onto the hyperplane $W_r$, i.e., map the vectors onto the hyperplane and perform the translation there.
For example:
in TransE, $h$ and $h''$ would overlap (be forced to the same point), while in TransH only their projections $h_\perp$ and $h''_\perp$ onto the hyperplane overlap, so $h$ and $h''$ themselves can remain distinct.
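Concretely, following the TransH formulation: with the hyperplane's unit normal vector $w_r$ ($\|w_r\|_2 = 1$), entities are projected onto the hyperplane and the translation $d_r$ is applied between the projections:

$$h_\perp = h - (w_r^\top h)\, w_r, \qquad t_\perp = t - (w_r^\top t)\, w_r, \qquad f_r(h, t) = \|h_\perp + d_r - t_\perp\|_2^2$$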
10. Variants of TransE: TransR
- Both the TransE and TransH models assume that entities and relations are vectors in the same semantic space.
- TransR instead assumes that each relation has its own semantic space.
- E.g., Chairman Mao and Obama are close in the "president" space but far apart in the "poet" space.
TransR proposes:
Build entity and relation embeddings in separate entity and relation spaces;
Then project entities from the entity space into the corresponding relation space and build translations between the projected entities.
TransR:
- Maps entity embeddings into different (relation-specific) semantic spaces: $h_r = h M_r$, $t_r = t M_r$.
- The score (energy) function then has the same form as TransE: $f_r(h, t) = \|h_r + r - t_r\|_2^2$.
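A side-by-side numpy sketch of the three score functions (written with column vectors, so the TransR projection appears as `M_r @ h`):

```python
import numpy as np

def score_transe(h, r, t):
    """TransE: the relation is a translation in one shared space."""
    return np.linalg.norm(h + r - t, ord=1)

def score_transh(h, w_r, d_r, t):
    """TransH: translate between projections onto the relation hyperplane (||w_r|| = 1)."""
    h_p = h - (w_r @ h) * w_r
    t_p = t - (w_r @ t) * w_r
    return np.linalg.norm(h_p + d_r - t_p) ** 2

def score_transr(h, M_r, r, t):
    """TransR: project entities into the relation's own space via the matrix M_r."""
    return np.linalg.norm(M_r @ h + r - M_r @ t) ** 2
```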
11. Summary
Statistical reasoning uses statistical models to fit the samples and predicts the expected probabilities of the inferred knowledge.
Knowledge graph embedding based reasoning actually performs entity prediction and relation prediction with vector calculations.
Translation-based models are now widely used KG embedding models for KG completion and other applications due to their good performance and succinctness.