Knowledge Graph Alignment
1. Why do we need Knowledge Graph Alignment?
- Knowledge graph construction needs to fuse the data from multiple sources!
- Identifying the same entity with different descriptions from multiple sources is the key to knowledge graph alignment!
- i.e., we want to fuse the same information coming from different sources
The Same Entity in Multiple Sources
- Knowledge graph alignment identifies relationships between classes (equivalence, subClassOf, etc.), properties (equivalence, subPropertyOf, etc.), and instances (sameAs).
- Knowledge Graph Alignment consists of:
- ontology matching (i.e., schema matching),
- instance matching.
2. Ontology Matching
- It is the process of finding correspondences (i.e., relationships) between classes (or properties) of different ontologies.
- i.e., finding matching classes and properties across different ontologies; it operates at the level of classes and properties
2.1 Benefits of Ontology Matching
- Creating global ontologies from local ontologies
- Reuse information between ontologies
  - a widely accepted global ontology does not need to be rebuilt from scratch
- Dealing with heterogeneity
- Queries across multiple distributed resources
2.2 Ontology Matching Process
- Definition (Matching process): the matching process can be seen as a function $f$ which, from a pair of ontologies to match $o$ and $o'$, an input alignment $A$, a set of parameters $p$, and a set of resources $r$, returns an alignment $A'$ between these ontologies:
- parameters: weights, thresholds, …; resources: common-sense knowledge, domain-specific thesauri, …
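- In symbols: $A' = f(o, o', A, p, r)$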
2.3 Ontology Matching Techniques
- Element-level matching techniques
- Analysing entities or instances in isolation
- Ignoring their relations with other entities or their instances
- i.e., only the internal features of the terms themselves are considered, not the entities' external relations
- Structure-level techniques
- Analysing how entities or their instances appear together in a structure (e.g. by representing ontologies as a graph)
- i.e., the relations around an entity, such as $IsA$, are taken into account
2.4 Element-level Matching Techniques: String-based
2.4.1 Prefix
- takes as input two strings and checks whether the first string starts with the second one
- net $=$ network; but also hot $=$ hotel
- i.e., a match is produced when one string is exactly a prefix of the other, which can also yield false matches such as hot/hotel
2.4.2 Suffix
- takes as input two strings and checks whether the first string ends with the second one
- ID $=$ PID; but also word $=$ sword
- i.e., suffix matching, with the same risk of false matches such as word/sword
2.4.3 Edit distance - Levenshtein distance is used here
- takes as input two strings and calculates the number of edit operations (e.g., insertions, deletions, substitutions of characters) required to transform one string into the other
- normalized by the length of the longer string
- in the accompanying example, the normalized edit distance for editor works out to $2/7$
2.4.4 N-gram
- takes as input two strings and calculates the number of common n-grams (i.e., sequences of $n$ characters) between them, normalized by the larger number of n-grams, $\max(|\mathrm{ngrams}(\text{string1})|, |\mathrm{ngrams}(\text{string2})|)$ (this is the normalization used in the worked examples below)
- Example:
- trigrams(nikon) $=$ {nik, iko, kon}
- trigrams(nike) $=$ {nik, ike}
- $\operatorname{sim}(\text{nikon}, \text{nike}) = 1/3$
2.4.5 Exercise
Compute the trigram based similarity between two strings University and Universe
trigrams(University)={Uni,niv,ive,ver,ers,rsi,sit,ity}
trigrams(Universe)={Uni,niv,ive,ver,ers,rse}
$\text{sim(University,Universe)}=5/8$
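A minimal Python sketch of these string-based measures (function names are illustrative; the n-gram similarity is normalized by the larger n-gram count, which matches the two worked examples above):

```python
def prefix_match(s1, s2):
    # True if s1 starts with s2 (e.g., network vs. net, but also hotel vs. hot)
    return s1.startswith(s2)

def suffix_match(s1, s2):
    # True if s1 ends with s2 (e.g., PID vs. ID, but also sword vs. word)
    return s1.endswith(s2)

def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

def norm_edit_distance(s1, s2):
    # Normalized by the length of the longer string.
    return levenshtein(s1, s2) / max(len(s1), len(s2))

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_sim(s1, s2, n=3):
    # Common n-grams, normalized by the larger n-gram count.
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return len(g1 & g2) / max(len(g1), len(g2))

print(ngram_sim("nikon", "nike"))           # 1/3
print(ngram_sim("University", "Universe"))  # 5/8
```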
2.5 Element-level Matching Techniques
2.5.1 Language-based
Tokenization
- parses names into tokens by recognizing punctuation, case changes, etc.
- Hands-Free_Kits $\rightarrow$ ⟨hands, free, kits⟩
Lemmatization - morphologically analyses tokens in order to find all their possible basic forms
- Kits $\rightarrow$ Kit
Elimination
- discards “empty” tokens that are articles, prepositions, conjunctions, etc.
- a, the, by, type of, their, from
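These normalization steps can be sketched in a few lines of Python (the regex, the stopword list, and the naive suffix-stripping "lemmatizer" are simplifications; a real pipeline would use a proper morphological analyser):

```python
import re

STOPWORDS = {"a", "the", "by", "of", "their", "from"}

def tokenize(name):
    # Split on punctuation/underscores, then on case changes:
    # Hands-Free_Kits -> [hands, free, kits]
    tokens = []
    for part in re.split(r"[-_\s]+", name):
        tokens += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [t.lower() for t in tokens]

def lemmatize(token):
    # Naive stand-in for morphological analysis: kits -> kit.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def normalize(name):
    # Tokenization, elimination of "empty" tokens, then lemmatization.
    return [lemmatize(t) for t in tokenize(name) if t not in STOPWORDS]

print(normalize("Hands-Free_Kits"))  # ['hand', 'free', 'kit']
```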
2.5.2 Resource-based
WordNet
- $A \sqsubseteq B$ if $A$ is a hyponym of $B$
  - Brand $\sqsubseteq$ Name
- $A = B$ if they are synonyms
  - Quantity $=$ Amount
- $A \perp B$ if they are antonyms or siblings in the part-of hierarchy
  - Microprocessors $\perp$ PC Board
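A sketch of these WordNet tests with NLTK (assuming the WordNet corpus has been downloaded via `nltk.download('wordnet')`; choosing the right sense for an ontology label is glossed over here):

```python
from nltk.corpus import wordnet as wn

def is_hyponym_of(word_a, word_b):
    # A ⊑ B if some sense of A has some sense of B among its transitive hypernyms.
    for syn_a in wn.synsets(word_a):
        hypernyms = set(syn_a.closure(lambda s: s.hypernyms()))
        if any(syn_b in hypernyms for syn_b in wn.synsets(word_b)):
            return True
    return False

def are_synonyms(word_a, word_b):
    # A = B if the two words share at least one synset.
    return bool(set(wn.synsets(word_a)) & set(wn.synsets(word_b)))

print(is_hyponym_of("brand", "name"))      # Brand ⊑ Name
print(are_synonyms("quantity", "amount"))  # Quantity = Amount
```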
2.5.3 Constraint-based
Datatype comparison
- integer $<$ real
  - i.e., datatype comparison may indicate a subsumption relation
- date $\in$ [1/4/2005, 30/6/2005] $<$ date[year $=$ 2005]
- $\{a, c, g, t\}[1\text{--}10] < \{a, c, g, u, t\}+$
- Multiplicity comparison
  - Multiplicity is written as [minCardinality maxCardinality] of a property
  - $[1\ 1] < [0\ 10]$
  - if two multiplicity intervals are completely disjoint, the properties are probably unrelated
Can be turned into a distance by estimating the ratio of domain coverage of each datatype.
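As an illustration, multiplicity comparison can be treated as interval containment and overlap (an illustrative helper, not from the slides):

```python
def multiplicity_relation(a, b):
    # a, b are (minCardinality, maxCardinality) intervals of two properties.
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    if a_lo >= b_lo and a_hi <= b_hi:
        return "subsumed"     # e.g., [1, 1] < [0, 10]
    if a_hi < b_lo or b_hi < a_lo:
        return "disjoint"     # no overlap: the properties are likely unrelated
    return "overlapping"

print(multiplicity_relation((1, 1), (0, 10)))  # subsumed
```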
2.6 Structure-level Matching Techniques
- Graph-based Techniques:
- consider the input ontologies as labeled graphs;
- if two nodes from two ontologies are similar, their neighbors may also be somehow similar (similar subclasses, superclasses, and properties).
- Taxonomy-based Techniques:
- are also graph algorithms which consider only is-a relations between classes;
- is-a links connect terms that are already similar, therefore their neighbors may be somehow similar.
- Model-based Techniques
- handle the input ontologies based on their semantic interpretation (e.g., model-theoretic semantics);
- if two classes (or properties) are the same, then they share the same interpretation.
- Instance-based Techniques:
- compare sets of instances of classes to decide whether these classes match (i.e., whether equivalence or subClassOf relations hold).
  - e.g., if the instance sets are identical, the classes are likely equivalent
2.6.1 Exercise
- Compute the Jaccard similarity between the first-order neighbor classes of the classes Car and Automobile from different ontologies.
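The ontologies themselves are not reproduced here, but the computation is plain set intersection over union; a sketch with hypothetical neighbor sets:

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical first-order neighbor classes of Car and Automobile.
neighbors_car = {"Vehicle", "Engine", "Wheel"}
neighbors_automobile = {"Vehicle", "Engine", "Driver"}
print(jaccard(neighbors_car, neighbors_automobile))  # 2/4 = 0.5
```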
2.7 Matcher Composition
- Sequential composition of matchers
- Problem: error accumulation and low coverage
- e.g., if each matcher in the chain is 95% accurate, multiplying the accuracies through the chain drives the overall accuracy down
2.8 Parallel composition of matchers
- the matcher results can be aggregated by voting, or used as features fed into a machine learning model
- e.g., a single similarity measure composed of the similarity obtained from their names, the similarity of their superclasses, the similarity of their instances, and that of their properties
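A sketch of such a composed measure as a weighted sum (the weights and the component matchers are placeholders; in practice they are tuned or learned):

```python
def composed_similarity(c1, c2, matchers, weights):
    # Run several basic matchers in parallel and aggregate linearly.
    return sum(w * m(c1, c2) for m, w in zip(matchers, weights))

# matchers = [name_sim, superclass_sim, instance_sim, property_sim]  # placeholders
# sim = composed_similarity(c1, c2, matchers, weights=[0.4, 0.2, 0.2, 0.2])
```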
3. A Real-World Case: Book Ontology Matching
- Given two ontologies $O_{1}$ and $O_{2}$, generate candidate matched classes by pairing any two classes from the two ontologies. A pair of candidate matched classes is denoted as $(C_{1k}, C_{2p})$; note that $(C_{2p}, C_{1k})$ is a different pair since we measure asymmetric similarities between classes.
3.1 String Similarity
- $\mathrm{StringSimilarity}(C_{1k}, C_{2p}) = \mathrm{LCS}(l(C_{1k}), l(C_{2p})) \,/\, |l(C_{1k})|$
- where LCS means the length of the longest common substring, $l(\cdot)$ returns the label of the class, and $|\cdot|$ returns the length of the input label.
3.2 Neighbor Class Set Similarity
- $\mathrm{NCSSimilarity}(C_{1k}, C_{2p}) = |\mathrm{NCS}(C_{1k}) \cap \mathrm{NCS}(C_{2p})| \,/\, |\mathrm{NCS}(C_{1k})|$
- where $|\mathrm{NCS}(C_{1k}) \cap \mathrm{NCS}(C_{2p})|$ is the size of the intersection of $\mathrm{NCS}(C_{1k})$ and $\mathrm{NCS}(C_{2p})$, and $\mathrm{NCS}(\cdot)$ returns the set of first-order neighbor classes in the given ontology.
3.3 Textual Context Similarity
- The label $l(C)$ of a class $C$ is submitted to a Web search engine, and the returned snippets serve as its textual context;
  - i.e., each class label is first sent to a search engine to obtain longer text about the class
- In the top-k returned snippets of Web pages, the words co-occurring with $l(C)$ in the same sentence are extracted;
- After removing the stopwords and the words with low frequency (e.g., less than 3), TF-IDF is adopted for weighting each word $u$:
  - merging all snippets gives the term frequency, and the document frequency is computed over the individual snippets
- Textual Context Similarity.
- The textual context vector representation of a class $C$ is denoted as $\mathrm{TC}(C) = \langle w_1(C), w_2(C), \ldots, w_n(C) \rangle$, where $n$ is the number of all words.
- The textual context similarity between classes $C_{1k}$ and $C_{2p}$ is computed as the cosine of their vectors: $\mathrm{sim}(C_{1k}, C_{2p}) = \mathrm{TC}(C_{1k}) \cdot \mathrm{TC}(C_{2p}) \,/\, (\|\mathrm{TC}(C_{1k})\| \, \|\mathrm{TC}(C_{2p})\|)$, i.e., the inner product divided by the product of the norms.
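A compact sketch of building these TF-IDF vectors from the collected snippets and comparing them with the cosine (treating each snippet as one document; tokenization and stopword removal are simplified away):

```python
import math
from collections import Counter

def tfidf_vector(snippets, min_freq=3):
    # TF over the merged snippets; document frequency over individual snippets.
    words = [w for s in snippets for w in s.lower().split()]
    tf = Counter(words)
    n_docs = len(snippets)
    vec = {}
    for w, f in tf.items():
        if f < min_freq:
            continue  # drop low-frequency words
        df = sum(1 for s in snippets if w in s.lower().split())
        vec[w] = f * math.log(n_docs / df)
    return vec

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```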
3.4 Instance Set Similarity
- $\mathrm{ISSimilarity}(C_{1k}, C_{2p}) = |\mathrm{IS}(C_{1k}) \cap \mathrm{IS}(C_{2p})| \,/\, |\mathrm{IS}(C_{1k})|$
- where $\mathrm{IS}(\cdot)$ returns the instance set of the class; in the book domain, the same book instances can be identified by their ISBN.
3.5 Exercise
3.5.1
- Given two ontologies as follows, compute the String Similarity, Neighbor Class Set Similarity, Instance Set Similarity (introduced in the real-world case: book ontology matching) on the class pairs (book, Textbook) and (Textbook, book), respectively.
- StringSimilarity
- len(book)=4,len(Textbook)=8
- substring(book, Textbook)=book
- len(substring)=4
- StringSimilarity(book, Textbook) $= 4/4 = 1$
- StringSimilarity(Textbook, book) $= 4/8 = 1/2$
- Neighbor Class Set Similarity
- Neighbor(book)={literature,volume}
- Neighbor(Textbook)={volume,Engineering,Science}
- Neighbor Class Set Similarity(book, Textbook)=1/2
- Neighbor Class Set Similarity(Textbook, book)=1/3
- Instance Set Similarity
- Instance(book)={Red Sorghum,Introduction to Algorithm}
- Instance(Textbook)={Red Sorghum}
- Instance Set Similarity(book, Textbook)=1/2
- Instance Set Similarity(Textbook, book)=1
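These asymmetric similarities can be checked mechanically; a sketch over the exercise data, assuming (as the answers above imply) that each measure is normalized by its first argument:

```python
def lcs_len(s1, s2):
    # Length of the longest common (contiguous) substring, via DP.
    best = 0
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0] * (len(s2) + 1)
        for j, c2 in enumerate(s2, 1):
            if c1 == c2:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def string_sim(l1, l2):
    return lcs_len(l1, l2) / len(l1)   # normalized by the first label

def set_sim(s1, s2):
    return len(s1 & s2) / len(s1)      # normalized by the first set

print(string_sim("book", "Textbook"))  # 1.0
print(string_sim("Textbook", "book"))  # 0.5
print(set_sim({"literature", "volume"}, {"volume", "Engineering", "Science"}))  # 0.5
print(set_sim({"Red Sorghum", "Introduction to Algorithm"}, {"Red Sorghum"}))   # 0.5
```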
3.5.2
- Question: now we have String Similarity, Neighbor Class Set Similarity, Textual Context Similarity, and Instance Set Similarity, please tell which one belongs to the element-level matching techniques? Which one is a structure-level matching technique?
- String Similarity and Textual Context Similarity $\rightarrow$ element-level matching techniques
- Neighbor Class Set Similarity and Instance Set Similarity $\rightarrow$ structure-level matching techniques
3.6 Aggregate These Similarities: Self-training
Aggregate these similarities by a semi-supervised learning strategy, self-training, for binary classification of subClassOf relations:
- In each iteration, self-training accepts the labeled data as training data and learns a classifier.
- Then the classifier is applied to the unlabeled data and adds class pairs of high confidence to the labeled data to train a new classifier for the next iteration.
- The whole process will terminate if the difference between the predicted labels of the class pairs given by classifiers in the two consecutive iterations is smaller than a threshold or the maximal number of iterations is achieved.
- i.e., train a weak classifier on a small amount of labeled data, predict on the unlabeled data, add the high-confidence predictions to the training set, and retrain; terminate when the maximum number of iterations is reached or the predicted labels stop changing
The binary classifier can be an SVM, a Random Forest, a Neural Network, etc.
- among these, the traditional methods (SVM, Random Forest) work best here
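A minimal self-training loop with scikit-learn's RandomForestClassifier (feature extraction, the confidence threshold, and the stopping test are simplified; scikit-learn also provides a ready-made `SelfTrainingClassifier`):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(X_l, y_l, X_u, conf=0.9, max_iter=10):
    # X_l, y_l: labeled similarity-feature vectors; X_u: unlabeled ones.
    for _ in range(max_iter):
        clf = RandomForestClassifier().fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= conf
        if not confident.any():
            break  # no high-confidence pairs left to add
        # Move high-confidence predictions into the labeled set.
        y_new = clf.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, y_new])
        X_u = X_u[~confident]
    return clf
```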
In each iteration, pre-defined rules are applied to filter out misclassified relations.
RULE 1:
- Given two book classes $C_{1k}$ and $C_{2p}$, if the label string $l(C_{1k})$ is a suffix of the label string $l(C_{2p})$, and $l(C_{2p})$ does not contain “与”, “和”, or “\&”, then $C_{2p}$ is a subclass of $C_{1k}$.
- Example: 企业管理 subClassOf 管理
- if the classifier misclassifies such a pair (i.e., assigns it low confidence), this rule can correct the result
RULE 2:
- Given two book classes $C_{1k}$ and $C_{2p}$, if the label string $l(C_{2p})$ contains “与”, “和”, or “\&”, then use these symbols as separators to segment $l(C_{2p})$. If one of the segments equals $l(C_{1k})$, then $C_{1k}$ is a subclass of $C_{2p}$.
- i.e., segment first, then check for an exact match
- Example: 计算机 subClassOf 计算机与互联网
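Both rules are easy to implement directly on the label strings; a sketch:

```python
import re

SEPARATORS = ("与", "和", "&")

def rule1(l1, l2):
    # RULE 1: l1 is a proper suffix of l2 and l2 contains no separator
    #         => class(l2) subClassOf class(l1), e.g., 企业管理 subClassOf 管理.
    return l2.endswith(l1) and l1 != l2 and not any(s in l2 for s in SEPARATORS)

def rule2(l1, l2):
    # RULE 2: split l2 on the separators; if one segment equals l1
    #         => class(l1) subClassOf class(l2), e.g., 计算机 subClassOf 计算机与互联网.
    if not any(s in l2 for s in SEPARATORS):
        return False
    return l1 in re.split("与|和|&", l2)

print(rule1("管理", "企业管理"))          # True
print(rule2("计算机", "计算机与互联网"))   # True
```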
With the generated subClassOf relations, how do we get equivalent classes?
- $A \equiv B$ iff $A$ subClassOf $B$ and $B$ subClassOf $A$
3.7 OAEI (Ontology Alignment Evaluation Initiative)
Knowledge Graph Alignment consists of:
- ontology matching (i.e., schema matching),
- instance matching.
Instance Matching: it is the process of finding different instances of the same real-world object.
- across sources, the same entity typically has several distinct URIs
4. Instance Matching with Knowledge Graph Embedding
- Embedding maps discrete variables to continuous vector representations;
- Embedding learning techniques have achieved great progress in CV, NLP, Speech Recognition, etc.;
- Knowledge Graph Embedding aims to map entities and relations to continuous vector representations.
- Conventional approaches are challenged by the symbolic, linguistic and schematic heterogeneity of independently-created KGs
- Embedding-based approaches measure entity similarities based on entity embeddings
- Three key modules:
  - KG embedding
  - Alignment inference
  - How they interact
- Knowledge graph embedding: TransE
  - multiple TransE models are trained in order to align triples across KGs
Corpora: (partially-aligned) multilingual KGs
Enabling: inferable embeddings of multilingual semantics
Can be applied to:
- Knowledge alignment
- Cross-lingual Q&A
- Multilingual chat-bots
4.1 MTransE
- The knowledge model is the same as TransE
- The alignment model minimizes the distance between the embeddings of triples to be matched across different languages
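A tiny numpy sketch of the two objectives (dimensions, margins, and negative sampling are simplified, and MTransE's alignment model has several variants; this shows the distance-based one):

```python
import numpy as np

def transe_score(h, r, t):
    # TransE models a triple (h, r, t) as h + r ≈ t; lower score = more plausible.
    return np.linalg.norm(h + r - t)

def knowledge_loss(h, r, t, h_neg, t_neg, margin=1.0):
    # Margin-based ranking loss against a corrupted triple (h_neg, r, t_neg).
    return max(0.0, margin + transe_score(h, r, t) - transe_score(h_neg, r, t_neg))

def alignment_loss(e1, e2):
    # Pull together the embeddings of a pre-aligned cross-lingual entity pair.
    return np.linalg.norm(e1 - e2)
```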
4.2 Different alignment techniques
5. Instance Matching with Rules
- The Same Entity in Multiple Sources
Seeds are a small set of labeled matches obtained in advance.
Automatically discovering and refining dataset-specific matching rules in iterations
- Deriving these rules by finding the most discriminative data characteristics for a given data source pair.
5.1 Seeds - Lightweight Entity Matching
- Punctuation Cleaning:
  - Space Shuttle Endeavour $\approx$ Space Shuttle “Endeavour”
  - after removing punctuation, if the strings are identical, the two entities can be regarded as an exact match
Redirects Information:
- Redirects: an encyclopedia site (e.g., Baidu Baike) automatically redirects a page to another page, and the target page's title gives the redirect information
- A redirects to B means A and B are synonyms
Mining Properties Equivalences
- Once two entities are known to be equivalent, and both carry multiple property-value pairs, their properties can be matched.
- For each pair of existing matched instances, their property-value pairs are merged.
  - a single matched instance pair yields many such property pairs
Matching rule (frequent set mining):
- Algorithm: Association Rule Mining groups frequently co-occurring property-value pairs into instance-equivalence rules, as in the following example
- baidu:x and hudong:x are matched, iff.
- valueOf(baidu:标签) = valueOf(hudong:中文学名)
- and
- valueOf(baidu:拉丁学名) = valueOf(hudong:二名法)
- and
- valueOf(baidu:纲) = valueOf(hudong:纲)
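Applying such a mined rule is a conjunction of property-value equality tests; a sketch over property dictionaries (keys mirror the rule above):

```python
def apply_rule(baidu_props, hudong_props, rule):
    # rule: list of (baidu_property, hudong_property) pairs whose values must agree.
    return all(baidu_props.get(pb) is not None and
               baidu_props.get(pb) == hudong_props.get(ph)
               for pb, ph in rule)

species_rule = [("标签", "中文学名"), ("拉丁学名", "二名法"), ("纲", "纲")]
# matched = apply_rule(baidu_entity, hudong_entity, species_rule)
```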
The Wrapper Algorithm
- Start with the lightweight matching algorithms to mine rules, then match instances with the rules, and iterate.
- The wrapper is an implementation of Expectation-Maximization iterations.
  - Intuition: the matches $M$ are the missing data we want, and the matching rules $\Theta$ are the parameters to estimate.
- The E-step
- The E-step estimates the missing data (matches) using the observed data and the current estimate for the parameters (matching rules).
- The M-step
- The M-step computes parameters maximizing the likelihood function as the data estimated in E-step are used in lieu of the actual missing data.
- $M$: matches
- $\Theta$: parameters (matching rules)
- the likelihood is the probability of the instance matches given the current rules
The Likelihood Function
- Assuming that no equivalent instances exist within a single data source, we can infer that an instance is equivalent to at most one instance from the other data source.
- Incorrect matches in $M$ may result in a node connecting to more than one other node, which is contrary to the assumption.
- if the likelihood is high, the rules' confidence is high, and they can be added to the rule set
Precisions
- Precision is estimated by sampling a certain number of output matches.
- In the evaluation figure, the X-axis indicates the proportion of selected seeds in the complete reference matches.
Question
What are the advantages and disadvantages of the embedding-based method and rule-based method for instance matching?
- if data quality is low, embedding-based methods are more tolerant to noise
- if data quality is high, rule-based methods achieve higher precision