Knowledge Graph Alignment
1. Why do we need Knowledge Graph Alignment?
- Knowledge graph construction needs to fuse the data from multiple sources!
- Identifying the same entity with different descriptions from multiple sources is the key to knowledge graph alignment!
- i.e., we want to fuse the same information coming from different sources
The Same Entity in Multiple Sources
- Knowledge graph alignment identifies relationships between classes (equivalence, subClassOf, etc.), properties (equivalence, subPropertyOf, etc.), and instances (sameAs).
- Knowledge Graph Alignment consists of:
- ontology matching (i.e., schema matching),
- instance matching.
2. Ontology Matching
- It is the process of finding correspondences (i.e., relationships) between classes (or properties) of different ontologies.
- i.e., finding matching classes and properties across different ontologies; it operates at the level of classes and properties
2.1 Benefits of Ontology Matching
- Creating global ontologies from local ontologies
- Reuse information between ontologies
  - a widely accepted global ontology does not need to be rebuilt from scratch
- Dealing with heterogeneity
- Queries across multiple distributed resources
2.2 Ontology Matching Process
- Definition (Matching process): the matching process can be seen as a function $f$ which, from a pair of ontologies to match $o$ and $o'$, an input alignment $A$, a set of parameters $p$, and a set of resources $r$, returns an alignment $A'$ between these ontologies:
- parameters: weights, thresholds, …; resources: common-sense knowledge, domain-specific thesauri, …
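- In symbols: $A' = f(o, o', A, p, r)$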
2.3 Ontology Matching Techniques
- Element-level matching techniques
- Analysing entities or instances in isolation
- Ignoring their relations with other entities or their instances
- i.e., only the internal features of the terms themselves are considered, not the entities' external relations
- Structure-level techniques
- Analysing how entities or their instances appear together in a structure (e.g. by representing ontologies as a graph)
- i.e., the relations around an entity, such as $IsA$, are taken into account
2.4 Element-level Matching Techniques: String-based
2.4.1 Prefix
- takes as input two strings and checks whether the first string starts with the second one
- net $=$ network; but also hot $=$ hotel
- i.e., a match is produced when one string is exactly a prefix of the other, which can also yield false matches such as hot/hotel
2.4.2 Suffix
- takes as input two strings and checks whether the first string ends with the second one
- ID $=$ PID; but also word $=$ sword
- i.e., suffix matching, with the same risk of false matches such as word/sword
2.4.3 Edit distance - Levenshtein distance is used here
- takes as input two strings and calculates the number of edit operations (e.g., insertions, deletions, substitutions of characters) required to transform one string into the other
- normalized by the length of the longer string
- in the accompanying example, the normalized edit distance for editor works out to $2/7$
2.4.4 N-gram
- takes as input two strings and calculates the number of common n-grams (i.e., sequences of $n$ characters) between them, normalized by the larger number of n-grams, $\max(|\mathrm{ngrams}(\text{string1})|, |\mathrm{ngrams}(\text{string2})|)$ (this is the normalization used in the worked examples below)
- Example:
- trigrams(nikon) $=$ {nik, iko, kon}
- trigrams(nike) $=$ {nik, ike}
- $\operatorname{sim}(\text{nikon}, \text{nike}) = 1/3$
2.4.5 Exercise
Compute the trigram based similarity between two strings University and Universe
trigrams(University)={Uni,niv,ive,ver,ers,rsi,sit,ity}
trigrams(Universe)={Uni,niv,ive,ver,ers,rse}
$\text{sim(University,Universe)}=5/8$
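A minimal Python sketch of these string-based measures (function names are illustrative; the n-gram similarity is normalized by the larger n-gram count, which matches the two worked examples above):

```python
def prefix_match(s1, s2):
    # True if s1 starts with s2 (e.g., network vs. net, but also hotel vs. hot)
    return s1.startswith(s2)

def suffix_match(s1, s2):
    # True if s1 ends with s2 (e.g., PID vs. ID, but also sword vs. word)
    return s1.endswith(s2)

def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

def norm_edit_distance(s1, s2):
    # Normalized by the length of the longer string.
    return levenshtein(s1, s2) / max(len(s1), len(s2))

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_sim(s1, s2, n=3):
    # Common n-grams, normalized by the larger n-gram count.
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return len(g1 & g2) / max(len(g1), len(g2))

print(ngram_sim("nikon", "nike"))           # 1/3
print(ngram_sim("University", "Universe"))  # 5/8
```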
2.5 Element-level Matching Techniques
2.5.1 Language-based
Tokenization
- parses names into tokens by recognizing punctuation, case changes, etc.
- Hands-Free_Kits $\rightarrow$ ⟨hands, free, kits⟩
Lemmatization - morphologically analyses tokens in order to find all their possible basic forms
- Kits $\rightarrow$ Kit
Elimination
- discards “empty” tokens that are articles, prepositions, conjunctions, etc.
- a, the, by, type of, their, from
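These normalization steps can be sketched in a few lines of Python (the regex, the stopword list, and the naive suffix-stripping "lemmatizer" are simplifications; a real pipeline would use a proper morphological analyser):

```python
import re

STOPWORDS = {"a", "the", "by", "of", "their", "from"}

def tokenize(name):
    # Split on punctuation/underscores, then on case changes:
    # Hands-Free_Kits -> [hands, free, kits]
    tokens = []
    for part in re.split(r"[-_\s]+", name):
        tokens += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [t.lower() for t in tokens]

def lemmatize(token):
    # Naive stand-in for morphological analysis: kits -> kit.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def normalize(name):
    # Tokenization, elimination of "empty" tokens, then lemmatization.
    return [lemmatize(t) for t in tokenize(name) if t not in STOPWORDS]

print(normalize("Hands-Free_Kits"))  # ['hand', 'free', 'kit']
```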
2.5.2 Resource-based
WordNet
- $A \sqsubseteq B$ if $A$ is a hyponym of $B$
  - Brand $\sqsubseteq$ Name
- $A = B$ if they are synonyms
  - Quantity $=$ Amount
- $A \perp B$ if they are antonyms or siblings in the part-of hierarchy
  - Microprocessors $\perp$ PC Board
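A sketch of these WordNet tests with NLTK (assuming the WordNet corpus has been downloaded via `nltk.download('wordnet')`; choosing the right sense for an ontology label is glossed over here):

```python
from nltk.corpus import wordnet as wn

def is_hyponym_of(word_a, word_b):
    # A ⊑ B if some sense of A has some sense of B among its transitive hypernyms.
    for syn_a in wn.synsets(word_a):
        hypernyms = set(syn_a.closure(lambda s: s.hypernyms()))
        if any(syn_b in hypernyms for syn_b in wn.synsets(word_b)):
            return True
    return False

def are_synonyms(word_a, word_b):
    # A = B if the two words share at least one synset.
    return bool(set(wn.synsets(word_a)) & set(wn.synsets(word_b)))

print(is_hyponym_of("brand", "name"))      # Brand ⊑ Name
print(are_synonyms("quantity", "amount"))  # Quantity = Amount
```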
2.5.3 Constraint-based
Datatype comparison
- integer $<$ real
  - i.e., datatype comparison may indicate a subsumption relation
- date $\in$ [1/4/2005, 30/6/2005] $<$ date[year $=$ 2005]
- $\{a, c, g, t\}[1\text{--}10] < \{a, c, g, u, t\}+$
- Multiplicity comparison
  - Multiplicity is written as [minCardinality maxCardinality] of a property
  - $[1\ 1] < [0\ 10]$
  - if two multiplicity intervals are completely disjoint, the properties are probably unrelated
Can be turned into a distance by estimating the ratio of domain coverage of each datatype.
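As an illustration, multiplicity comparison can be treated as interval containment and overlap (an illustrative helper, not from the slides):

```python
def multiplicity_relation(a, b):
    # a, b are (minCardinality, maxCardinality) intervals of two properties.
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    if a_lo >= b_lo and a_hi <= b_hi:
        return "subsumed"     # e.g., [1, 1] < [0, 10]
    if a_hi < b_lo or b_hi < a_lo:
        return "disjoint"     # no overlap: the properties are likely unrelated
    return "overlapping"

print(multiplicity_relation((1, 1), (0, 10)))  # subsumed
```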
2.6 Structure-level Matching Techniques
- Graph-based Techniques:
- consider the input ontologies as labeled graphs;
- if two nodes from two ontologies are similar, their neighbors may also be somehow similar (similar subclasses, superclasses, and properties).
- Taxonomy-based Techniques:
- are also graph algorithms which consider only is-a relations between classes;
- is-a links connect terms that are already similar, therefore their neighbors may be somehow similar.
- Model-based Techniques
- handle the input ontologies based on their semantic interpretation (e.g., model-theoretic semantics);
- if two classes (or properties) are the same, then they share the same interpretation.
- Instance-based Techniques:
- compare sets of instances of classes to decide whether these classes match (i.e., whether equivalence or subClassOf relations hold).
  - e.g., if the instance sets are identical, the classes are likely equivalent
2.6.1 Exercise
- Compute the Jaccard similarity between the first-order neighbor classes of the classes Car and Automobile from different ontologies.
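The ontologies themselves are not reproduced here, but the computation is plain set intersection over union; a sketch with hypothetical neighbor sets:

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical first-order neighbor classes of Car and Automobile.
neighbors_car = {"Vehicle", "Engine", "Wheel"}
neighbors_automobile = {"Vehicle", "Engine", "Driver"}
print(jaccard(neighbors_car, neighbors_automobile))  # 2/4 = 0.5
```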
2.7 Matcher Composition
- Sequential composition of matchers
- Problem: error accumulation and low coverage
- e.g., if each matcher in the chain is 95% accurate, multiplying the accuracies through the chain drives the overall accuracy down
2.8 Parallel composition of matchers
- the matcher results can be aggregated by voting, or used as features fed into a machine learning model
- e.g., a single similarity measure composed of the similarity obtained from their names, the similarity of their superclasses, the similarity of their instances, and that of their properties
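A sketch of such a composed measure as a weighted sum (the weights and the component matchers are placeholders; in practice they are tuned or learned):

```python
def composed_similarity(c1, c2, matchers, weights):
    # Run several basic matchers in parallel and aggregate linearly.
    return sum(w * m(c1, c2) for m, w in zip(matchers, weights))

# matchers = [name_sim, superclass_sim, instance_sim, property_sim]  # placeholders
# sim = composed_similarity(c1, c2, matchers, weights=[0.4, 0.2, 0.2, 0.2])
```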
3. A Real-World Case: Book Ontology Matching
- Given two ontologies $O_{1}$ and $O_{2}$, generate candidate matched classes by pairing any two classes from the two ontologies. A pair of candidate matched classes is denoted as $(C_{1k}, C_{2p})$; note that $(C_{2p}, C_{1k})$ is a different pair since we measure asymmetric similarities between classes.
3.1 String Similarity
- $\mathrm{StringSimilarity}(C_{1k}, C_{2p}) = \mathrm{LCS}(l(C_{1k}), l(C_{2p})) \,/\, |l(C_{1k})|$
- where LCS means the length of the longest common substring, $l(\cdot)$ returns the label of the class, and $|\cdot|$ returns the length of the input label.
3.2 Neighbor Class Set Similarity
- $\mathrm{NCSSimilarity}(C_{1k}, C_{2p}) = |\mathrm{NCS}(C_{1k}) \cap \mathrm{NCS}(C_{2p})| \,/\, |\mathrm{NCS}(C_{1k})|$
- where $|\mathrm{NCS}(C_{1k}) \cap \mathrm{NCS}(C_{2p})|$ is the size of the intersection of $\mathrm{NCS}(C_{1k})$ and $\mathrm{NCS}(C_{2p})$, and $\mathrm{NCS}(\cdot)$ returns the set of first-order neighbor classes in the given ontology.
3.3 Textual Context Similarity
- The label $l(C)$ of a class $C$ is submitted to a Web search engine, and the returned snippets serve as its textual context;
  - i.e., each class label is first sent to a search engine to obtain longer text about the class
- In the top-k returned snippets of Web pages, the words co-occurring with $l(C)$ in the same sentence are extracted;
- After removing the stopwords and the words with low frequency (e.g., less than 3), TF-IDF is adopted for weighting each word $u$:
  - merging all snippets gives the term frequency, and the document frequency is computed over the individual snippets
- Textual Context Similarity.
- The textual context vector representation of a class $C$ is denoted as $\mathrm{TC}(C) = \langle w_1(C), w_2(C), \ldots, w_n(C) \rangle$, where $n$ is the number of all words.
- The textual context similarity between classes $C_{1k}$ and $C_{2p}$ is computed as the cosine of their vectors: $\mathrm{sim}(C_{1k}, C_{2p}) = \mathrm{TC}(C_{1k}) \cdot \mathrm{TC}(C_{2p}) \,/\, (\|\mathrm{TC}(C_{1k})\| \, \|\mathrm{TC}(C_{2p})\|)$, i.e., the inner product divided by the product of the norms.
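A compact sketch of building these TF-IDF vectors from the collected snippets and comparing them with the cosine (treating each snippet as one document; tokenization and stopword removal are simplified away):

```python
import math
from collections import Counter

def tfidf_vector(snippets, min_freq=3):
    # TF over the merged snippets; document frequency over individual snippets.
    words = [w for s in snippets for w in s.lower().split()]
    tf = Counter(words)
    n_docs = len(snippets)
    vec = {}
    for w, f in tf.items():
        if f < min_freq:
            continue  # drop low-frequency words
        df = sum(1 for s in snippets if w in s.lower().split())
        vec[w] = f * math.log(n_docs / df)
    return vec

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```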
3.4 Instance Set Similarity
- $\mathrm{ISSimilarity}(C_{1k}, C_{2p}) = |\mathrm{IS}(C_{1k}) \cap \mathrm{IS}(C_{2p})| \,/\, |\mathrm{IS}(C_{1k})|$
- where $\mathrm{IS}(\cdot)$ returns the instance set of the class; in the book domain, the same book instances can be identified by their ISBN.
3.5 Exercise
3.5.1
- Given two ontologies as follows, compute the String Similarity, Neighbor Class Set Similarity, Instance Set Similarity (introduced in the real-world case: book ontology matching) on the class pairs (book, Textbook) and (Textbook, book), respectively.
- StringSimilarity
- len(book)=4,len(Textbook)=8
- substring(book, Textbook)=book
- len(substring)=4
- StringSimilarity(book, Textbook) $= 4/4 = 1$
- StringSimilarity(Textbook, book) $= 4/8 = 1/2$
- Neighbor Class Set Similarity
- Neighbor(book)={literature,volume}
- Neighbor(Textbook)={volume,Engineering,Science}
- Neighbor Class Set Similarity(book, Textbook)=1/2
- Neighbor Class Set Similarity(Textbook, book)=1/3
- Instance Set Similarity
- Instance(book)={Red Sorghum,Introduction to Algorithm}
- Instance(Textbook)={Red Sorghum}
- Instance Set Similarity(book, Textbook)=1/2
- Instance Set Similarity(Textbook, book)=1
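These asymmetric similarities can be checked mechanically; a sketch over the exercise data, assuming (as the answers above imply) that each measure is normalized by its first argument:

```python
def lcs_len(s1, s2):
    # Length of the longest common (contiguous) substring, via DP.
    best = 0
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0] * (len(s2) + 1)
        for j, c2 in enumerate(s2, 1):
            if c1 == c2:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def string_sim(l1, l2):
    return lcs_len(l1, l2) / len(l1)   # normalized by the first label

def set_sim(s1, s2):
    return len(s1 & s2) / len(s1)      # normalized by the first set

print(string_sim("book", "Textbook"))  # 1.0
print(string_sim("Textbook", "book"))  # 0.5
print(set_sim({"literature", "volume"}, {"volume", "Engineering", "Science"}))  # 0.5
print(set_sim({"Red Sorghum", "Introduction to Algorithm"}, {"Red Sorghum"}))   # 0.5
```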
3.5.2
- Question: now we have String Similarity, Neighbor Class Set Similarity, Textual Context Similarity, and Instance Set Similarity, please tell which one belongs to the element-level matching techniques? Which one is a structure-level matching technique?
- String Similarity and Textual Context Similarity $\rightarrow$ element-level matching techniques
- Neighbor Class Set Similarity and Instance Set Similarity $\rightarrow$ structure-level matching techniques
3.6 Aggregate These Similarities: Self-training
Aggregate these similarities by a semi-supervised learning strategy, self-training, for binary classification of subClassOf relations:
- In each iteration, self-training accepts the labeled data as training data and learns a classifier.
- Then the classifier is applied to the unlabeled data and adds class pairs of high confidence to the labeled data to train a new classifier for the next iteration.
- The whole process will terminate if the difference between the predicted labels of the class pairs given by classifiers in the two consecutive iterations is smaller than a threshold or the maximal number of iterations is achieved.
- i.e., train a weak classifier on a small amount of labeled data, predict on the unlabeled data, add the high-confidence predictions to the training set, and retrain; terminate when the maximum number of iterations is reached or the predicted labels stop changing
The binary classifier can be an SVM, a Random Forest, a Neural Network, etc.
- among these, the traditional methods (SVM, Random Forest) work best here
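A minimal self-training loop with scikit-learn's RandomForestClassifier (feature extraction, the confidence threshold, and the stopping test are simplified; scikit-learn also provides a ready-made `SelfTrainingClassifier`):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(X_l, y_l, X_u, conf=0.9, max_iter=10):
    # X_l, y_l: labeled similarity-feature vectors; X_u: unlabeled ones.
    for _ in range(max_iter):
        clf = RandomForestClassifier().fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= conf
        if not confident.any():
            break  # no high-confidence pairs left to add
        # Move high-confidence predictions into the labeled set.
        y_new = clf.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, y_new])
        X_u = X_u[~confident]
    return clf
```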
In each iteration, pre-defined rules are applied to filter out misclassified relations.
RULE 1:
- Given two book classes $C_{1k}$ and $C_{2p}$, if the label string $l(C_{1k})$ is a suffix of the label string $l(C_{2p})$, and $l(C_{2p})$ does not contain “与”, “和”, or “\&”, then $C_{2p}$ is a subclass of $C_{1k}$.
- Example: 企业管理 subClassOf 管理
- if the classifier misclassifies such a pair (i.e., assigns it low confidence), this rule can correct the result
RULE 2:
- Given two book classes $C_{1k}$ and $C_{2p}$, if the label string $l(C_{2p})$ contains “与”, “和”, or “\&”, then use these symbols as separators to segment $l(C_{2p})$. If one of the segments equals $l(C_{1k})$, then $C_{1k}$ is a subclass of $C_{2p}$.
- i.e., segment first, then check for an exact match
- Example: 计算机 subClassOf 计算机与互联网
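Both rules are easy to implement directly on the label strings; a sketch:

```python
import re

SEPARATORS = ("与", "和", "&")

def rule1(l1, l2):
    # RULE 1: l1 is a proper suffix of l2 and l2 contains no separator
    #         => class(l2) subClassOf class(l1), e.g., 企业管理 subClassOf 管理.
    return l2.endswith(l1) and l1 != l2 and not any(s in l2 for s in SEPARATORS)

def rule2(l1, l2):
    # RULE 2: split l2 on the separators; if one segment equals l1
    #         => class(l1) subClassOf class(l2), e.g., 计算机 subClassOf 计算机与互联网.
    if not any(s in l2 for s in SEPARATORS):
        return False
    return l1 in re.split("与|和|&", l2)

print(rule1("管理", "企业管理"))          # True
print(rule2("计算机", "计算机与互联网"))   # True
```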
With the generated subClassOf relations, how do we get equivalent classes?
- $A \equiv B$ iff $A$ subClassOf $B$ and $B$ subClassOf $A$
3.7 OAEI (Ontology Alignment Evaluation Initiative)
Knowledge Graph Alignment consists of:
- ontology matching (i.e., schema matching),
- instance matching.
Instance Matching: it is the process of finding different instances of the same real-world object.
- across sources, the same entity typically has several distinct URIs
4. Instance Matching with Knowledge Graph Embedding
- Embedding maps discrete variables to continuous vector representations;
- Embedding learning techniques have achieved great progress in CV, NLP, Speech Recognition, etc.;
- Knowledge Graph Embedding aims to map entities and relations to continuous vector representations.
- Conventional approaches are challenged by the symbolic, linguistic and schematic heterogeneity of independently-created KGs
- Embedding-based approaches measure entity similarities based on entity embeddings
- Three key modules:
  - KG embedding
  - Alignment inference
  - How they interact
- Knowledge graph embedding: TransE
  - multiple TransE models are trained in order to align triples across KGs
Corpora: (partially-aligned) multilingual KGs
Enabling: inferable embeddings of multilingual semantics
Can be applied to:
- Knowledge alignment
- Cross-lingual Q&A
- Multilingual chat-bots
4.1 MTransE
- The knowledge model is the same as TransE
- The alignment model minimizes the distance between the embeddings of triples to be matched across different languages
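A tiny numpy sketch of the two objectives (dimensions, margins, and negative sampling are simplified, and MTransE's alignment model has several variants; this shows the distance-based one):

```python
import numpy as np

def transe_score(h, r, t):
    # TransE models a triple (h, r, t) as h + r ≈ t; lower score = more plausible.
    return np.linalg.norm(h + r - t)

def knowledge_loss(h, r, t, h_neg, t_neg, margin=1.0):
    # Margin-based ranking loss against a corrupted triple (h_neg, r, t_neg).
    return max(0.0, margin + transe_score(h, r, t) - transe_score(h_neg, r, t_neg))

def alignment_loss(e1, e2):
    # Pull together the embeddings of a pre-aligned cross-lingual entity pair.
    return np.linalg.norm(e1 - e2)
```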
4.2 Different alignment techniques
5. Instance Matching with Rules
- The Same Entity in Multiple Sources
Seeds are a small set of labeled matches obtained in advance.
Automatically discovering and refining dataset-specific matching rules in iterations
- Deriving these rules by finding the most discriminative data characteristics for a given data source pair.
5.1 Seeds - Lightweight Entity Matching
- Punctuation Cleaning:
  - Space Shuttle Endeavour $\approx$ Space Shuttle “Endeavour”
  - after removing punctuation, if the strings are identical, the two entities can be regarded as an exact match
Redirects Information:
- Redirects: an encyclopedia site (e.g., Baidu Baike) automatically redirects a page to another page, and the target page's title gives the redirect information
- A redirects to B means A and B are synonyms
Mining Properties Equivalences
- Once two entities are known to be equivalent, and both carry multiple property-value pairs, their properties can be matched.
- For each pair of existing matched instances, their property-value pairs are merged.
  - a single matched instance pair yields many such property pairs
Matching rule (frequent set mining):
- Algorithm: Association Rule Mining groups frequently co-occurring property-value pairs into instance-equivalence rules, as in the following example
- baidu:x and hudong:x are matched, iff.
- valueOf(baidu:标签) = valueOf(hudong:中文学名)
- and
- valueOf(baidu:拉丁学名) = valueOf(hudong:二名法)
- and
- valueOf(baidu:纲) = valueOf(hudong:纲)
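Applying such a mined rule is a conjunction of property-value equality tests; a sketch over property dictionaries (keys mirror the rule above):

```python
def apply_rule(baidu_props, hudong_props, rule):
    # rule: list of (baidu_property, hudong_property) pairs whose values must agree.
    return all(baidu_props.get(pb) is not None and
               baidu_props.get(pb) == hudong_props.get(ph)
               for pb, ph in rule)

species_rule = [("标签", "中文学名"), ("拉丁学名", "二名法"), ("纲", "纲")]
# matched = apply_rule(baidu_entity, hudong_entity, species_rule)
```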
The Wrapper Algorithm
- Start with the lightweight matching algorithms to mine rules, then match instances with the rules, and iterate.
- The wrapper is an implementation of Expectation-Maximization iterations.
  - Intuition: the matches $M$ are the missing data we want, and the matching rules $\Theta$ are the parameters to estimate.
- The E-step
- The E-step estimates the missing data (matches) using the observed data and the current estimate for the parameters (matching rules).
- The M-step
- The M-step computes parameters maximizing the likelihood function as the data estimated in E-step are used in lieu of the actual missing data.
- $M$: matches
- $\Theta$: parameters (matching rules)
- the likelihood is the probability of the instance matches given the current rules
The Likelihood Function
- Assuming that no equivalent instances exist within a single data source, we can infer that an instance is equivalent to at most one instance from the other data source.
- Incorrect matches in $M$ may result in a node connecting to more than one other node, which is contrary to the assumption.
- if the likelihood is high, the rules' confidence is high, and they can be added to the rule set
Precisions
- Precision is estimated by sampling a certain number of output matches.
- In the evaluation figure, the X-axis indicates the proportion of selected seeds in the complete reference matches.
Question
What are the advantages and disadvantages of the embedding-based method and rule-based method for instance matching?
- if data quality is low, embedding-based methods are more tolerant to noise
- if data quality is high, rule-based methods achieve higher precision