Knowledge Graph Alignment

1. Why do we need Knowledge Graph Alignment?

  • Knowledge graph construction needs to fuse the data from multiple sources!


  • Identifying the same entity with different descriptions from multiple sources is the key task of knowledge graph alignment!


  • We want to fuse the descriptions of the same information coming from different sources

  • The Same Entity in Multiple Sources

  • Knowledge graph alignment identifies relationships between classes (equivalence, subClassOf, etc.), properties (equivalence, subPropertyOf, etc.), and instances (sameAs).


  • Knowledge Graph Alignment consists of:
    • ontology matching (i.e., schema matching),
    • instance matching.

2. Ontology Matching

  • It is the process of finding correspondences (i.e., relationships) between classes (or properties) of different ontologies.
    • i.e., finding matching classes and properties across different ontologies
    • it operates on classes and properties


2.1 Benefits of Ontology Matching

  • Creating global ontologies from local ontologies
    • i.e., integrating local ontologies into a global one
  • Reuse information between ontologies
    • if the global ontology already captures the commonly accepted knowledge, it does not have to be rebuilt
  • Dealing with heterogeneity
    • i.e., handling heterogeneous representations of the same knowledge
  • Queries across multiple distributed resources
    • matching makes it easier to query across multiple distributed resources

2.2 Ontology Matching Process

  • Definition (Matching process). The matching process can be seen as a function $f$ which, from a pair of ontologies to match $o$ and $o'$, an input alignment $A$, a set of parameters $p$, and a set of resources $r$, returns an alignment $A'$ between these ontologies:

$$A' = f(o, o', A, p, r)$$

  • parameters: weights, thresholds, …; resources: common-sense knowledge, domain-specific thesauri, …

2.3 Ontology Matching Techniques

  • Element-level matching techniques
    • Analysing entities or instances in isolation
    • Ignoring their relations with other entities or their instances
    • considers only the internal features of the terms themselves, ignoring the entity's external relations
  • Structure-level techniques
    • Analysing how entities or their instances appear together in a structure (e.g. by representing ontologies as a graph)
    • considers the relations around an entity, e.g., $IsA$

2.4 Element-level Matching Techniques: String-based

2.4.1 Prefix

  • takes as input two strings and checks whether the first string starts with the second one
    • net $=$ network; but also hot $=$ hotel (a spurious match)
  • i.e., match by prefix: if one string is exactly a prefix of the other, the pair is matched

2.4.2 Suffix

  • takes as input two strings and checks whether the first string ends with the second one
  • ID = PID; but also word $=$ sword (a spurious match)
  • i.e., match by suffix, analogous to the prefix test

2.4.3 Edit distance - Levenshtein distance is used here

  • takes as input two strings and calculates the number of edit operations (e.g., insertions, deletions, substitutions) of characters required to transform one string into the other
  • normalized by the length of the longer string

  • (In the slide example, the normalized distance for “editor” should be corrected to $2/7$, i.e., two edit operations divided by the maximum length of 7.)

2.4.4 N-gram

  • takes as input two strings and calculates the number of common n-grams (i.e., sequences of $n$ characters) between them, normalized by the larger number of $n$-grams, $\max(|\mathrm{ngrams}(s_1)|, |\mathrm{ngrams}(s_2)|)$, which is what the examples below use
  • Example:
    • trigrams(nikon) $= \{$nik, iko, kon$\}$
    • trigrams(nike) $= \{$nik, ike$\}$
    • $\operatorname{sim}(\text{nikon}, \text{nike}) = 1/3$

2.4.5 Exercise

  • Compute the trigram-based similarity between the two strings University and Universe

  • trigrams(University)={Uni,niv,ive,ver,ers,rsi,sit,ity}

  • trigrams(Universe)={Uni,niv,ive,ver,ers,rse}

  • $\text{sim(University,Universe)}=5/8$
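
These string-based measures are easy to check in code. Below is a minimal Python sketch (not from the lecture; the helper names are my own) of the prefix/suffix tests, normalized edit distance, and $n$-gram similarity; it reproduces the nikon/nike and University/Universe numbers above.

```python
def prefix_match(s1: str, s2: str) -> bool:
    """Prefix test: does the first string start with the second?"""
    return s1.startswith(s2)

def suffix_match(s1: str, s2: str) -> bool:
    """Suffix test: does the first string end with the second?"""
    return s1.endswith(s2)

def levenshtein(a: str, b: str) -> int:
    """Number of insertions/deletions/substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_distance_norm(a: str, b: str) -> float:
    """Edit distance normalized by the length of the longer string."""
    return levenshtein(a, b) / max(len(a), len(b))

def ngrams(s: str, n: int = 3) -> set:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_sim(a: str, b: str, n: int = 3) -> float:
    """Common n-grams normalized by the larger n-gram set."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / max(len(ga), len(gb))

assert prefix_match("network", "net") and suffix_match("PID", "ID")
print(ngram_sim("nikon", "nike"))           # 0.333... = 1/3
print(ngram_sim("University", "Universe"))  # 0.625 = 5/8
```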

2.5 Element-level Matching Techniques

2.5.1 Language-based

Tokenization

  • parses names into tokens by recognizing punctuation and cases
  • Hands-Free_Kits $\rightarrow \langle$hands, free, kits$\rangle$

Lemmatization

  • morphologically analyses tokens in order to find all their possible basic forms
  • Kits $\rightarrow$ Kit

Elimination

  • discards “empty” tokens such as articles, prepositions, conjunctions, etc.
  • a, the, by, type of, their, from
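
These three steps can be sketched in a few lines of Python (an illustrative toy, not the lecture's implementation; the lemmatizer here only strips a plural "s"):

```python
import re

STOPWORDS = {"a", "the", "by", "of", "their", "from"}

def tokenize(name: str) -> list:
    """Split an ontology label on punctuation/underscores and case changes."""
    tokens = []
    for part in re.split(r"[\s\-_.]+", name):
        tokens += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [t.lower() for t in tokens]

def lemmatize(token: str) -> str:
    """Toy basic-form finder; a real system would use a morphological analyser."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def eliminate(tokens: list) -> list:
    """Discard 'empty' tokens (articles, prepositions, conjunctions, ...)."""
    return [t for t in tokens if t not in STOPWORDS]

print(eliminate([lemmatize(t) for t in tokenize("Hands-Free_Kits")]))
# -> ['hand', 'free', 'kit']
```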

2.5.2 Resource-based

WordNet

  • $A \sqsubseteq B$ if $A$ is a hyponym of $B$
    • Brand $\sqsubseteq$ Name
  • $A = B$ if they are synonyms
    • Quantity $=$ Amount
  • $A \perp B$ if they are antonyms or siblings in the part-of hierarchy
    • Microprocessors $\perp$ PC Board
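
As a rough illustration, these WordNet relations can be queried with NLTK (a sketch under the assumption that nltk and its wordnet corpus are installed; this is not the lecture's tooling):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def maybe_subsumed(word_a: str, word_b: str) -> bool:
    """A ⊑ B heuristic: some sense of word_a has a hypernym sense of word_b."""
    targets = set(wn.synsets(word_b))
    return any(targets & set(s.closure(lambda x: x.hypernyms()))
               for s in wn.synsets(word_a))

def maybe_equivalent(word_a: str, word_b: str) -> bool:
    """A = B heuristic: the two words share at least one synset."""
    return bool(set(wn.synsets(word_a)) & set(wn.synsets(word_b)))

print(maybe_subsumed("brand", "name"))         # expected True (Brand ⊑ Name)
print(maybe_equivalent("quantity", "amount"))  # expected True
```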

2.5.3 Constraint-based

  • Datatype comparison

    • integer < real
      • possibly indicating a subClassOf-like relation
    • date $\in$ [1/4/2005, 30/6/2005] < date[year $=$ 2005]
    • $\{a, c, g, t\}[1\text{-}10] < \{a, c, g, u, t\}+$
  • Multiplicity comparison

    • Multiplicity: [minCardinality maxCardinality] of a property
    • $[1 \ 1] < [0 \ 10]$
    • if the multiplicity intervals are completely disjoint, the two properties are probably unrelated
  • Can be turned into a distance by estimating the ratio of domain coverage of each datatype (e.g., how much of the real domain the integers cover).
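
A minimal sketch of the multiplicity comparison (my own encoding of the interval tests described above):

```python
def multiplicity_subsumed(a: tuple, b: tuple) -> bool:
    """[minCard maxCard] interval containment: a < b, e.g. (1, 1) < (0, 10).
    None stands for an unbounded maxCardinality."""
    (a_min, a_max), (b_min, b_max) = a, b
    upper_ok = b_max is None or (a_max is not None and a_max <= b_max)
    return b_min <= a_min and upper_ok

def multiplicity_disjoint(a: tuple, b: tuple) -> bool:
    """Completely disjoint intervals: the properties are probably unrelated."""
    return ((a[1] is not None and a[1] < b[0]) or
            (b[1] is not None and b[1] < a[0]))

print(multiplicity_subsumed((1, 1), (0, 10)))  # True
print(multiplicity_disjoint((0, 1), (3, 5)))   # True
```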

2.6 Structure-level Matching Techniques

  • Graph-based Techniques:
    • consider the input ontologies as labeled graphs;
    • if two nodes from two ontologies are similar, their neighbors may also be somehow similar (similar subclasses, superclasses, and properties).
  • Taxonomy-based Techniques:
    • are also graph algorithms which consider only is-a relations between classes;
    • is-a links connect terms that are already similar, therefore their neighbors may be somehow similar.
  • Model-based Techniques:
    • handle the input ontologies based on their semantic interpretation (e.g., model-theoretic semantics);
    • if two classes (or properties) are the same, then they share the same interpretation.
  • Instance-based Techniques:
    • compare sets of instances of classes to decide if these classes match or not (i.e., exist equivalence or subClassOf relations).
    • i.e., compare by instance sets: if the sets are exactly the same, the relation is equivalence

2.6.1 Exercise

  • Compute the Jaccard similarity between the first-order neighbor classes of the classes Car and Automobile from different ontologies.

[figure: two ontology fragments showing the first-order neighbor classes of Car and Automobile]
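
The computation itself is one line; the neighbor sets below are hypothetical stand-ins for whatever the figure shows:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical first-order neighbor sets (replace with the figure's classes):
neighbors_car = {"Vehicle", "Engine", "Wheel"}
neighbors_automobile = {"Vehicle", "Engine", "Door"}
print(jaccard(neighbors_car, neighbors_automobile))  # 2/4 = 0.5
```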

2.7 Matcher Composition

  • Sequential composition of matchers


  • Problem: error accumulation and low coverage
    • e.g., if the first matcher's precision is 0.95, multiplying precisions along the pipeline makes the overall precision smaller and smaller

2.8 Parallel composition of matchers


  • After obtaining the individual results, we can vote, or feed the similarities as features into a machine learning model
    • e.g., a single similarity measure composed from the similarity of the names, the similarity of the superclasses, the similarity of the instances, and that of the properties
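
A sketch of the weighted-sum variant of parallel composition (the component matchers and weights are illustrative assumptions):

```python
def name_sim(c1: str, c2: str) -> float:
    """Toy matcher: exact label match after lowercasing."""
    return 1.0 if c1.lower() == c2.lower() else 0.0

def prefix_sim(c1: str, c2: str) -> float:
    """Toy matcher: one label is a prefix of the other."""
    a, b = c1.lower(), c2.lower()
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0

def parallel_compose(c1: str, c2: str, matchers, weights) -> float:
    """Run all matchers independently and merge the scores by a weighted sum."""
    return sum(w * m(c1, c2) for m, w in zip(matchers, weights))

print(parallel_compose("Car", "car", [name_sim, prefix_sim], [0.6, 0.4]))  # 1.0
```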

3. A Real-World Case: Book Ontology Matching

  • Given two ontologies $O_1$ and $O_2$, generate candidate matched classes by pairing any two classes from the two ontologies. A pair of candidate matched classes is denoted as $(C_{1k}, C_{2p})$; note that $(C_{2p}, C_{1k})$ is a different pair, since we measure asymmetric similarities between classes.

3.1 String Similarity

  • $\text{StringSim}(C_{1k}, C_{2p}) = \dfrac{\text{LCS}(l(C_{1k}), l(C_{2p}))}{|l(C_{1k})|}$
  • where LCS means the length of the longest common substring, $l(\cdot)$ returns the label of the class, and $|\cdot|$ returns the length of the input label.

3.2 Neighbor Class Set Similarity

  • $\text{NeighborSim}(C_{1k}, C_{2p}) = \dfrac{|\mathrm{NCS}(C_{1k}) \cap \mathrm{NCS}(C_{2p})|}{|\mathrm{NCS}(C_{1k})|}$
  • where $|\mathrm{NCS}(C_{1k}) \cap \mathrm{NCS}(C_{2p})|$ is the size of the intersection of $\mathrm{NCS}(C_{1k})$ and $\mathrm{NCS}(C_{2p})$, and $\mathrm{NCS}(\cdot)$ returns the set of first-order neighbor classes in the given ontology.

3.3 Textual Context Similarity

  • We submit the label $l(C)$ of a class to a search engine and take the returned snippets as its textual context;
    • i.e., first send the class label to a search engine to obtain longer texts about it
  • In the top-k returned snippets of Web pages, the words co-occurring with $l(C)$ in the same sentence are extracted;
    • i.e., collect the top-k snippets and extract the co-occurring words
  • After removing the stopwords and the words with low frequency (e.g., less than 3), TF-IDF is adopted for weighting each word $u$:
    • merging all snippets of a class gives the term frequencies, and the classes' merged snippets serve as the documents for the document frequencies
  • Textual Context Similarity:
    • The textual context vector representation of a class $C$ is denoted as $\mathrm{TC}(C) = \langle w_1(C), w_2(C), \ldots, w_n(C)\rangle$, where $n$ is the number of all words.
    • The textual context similarity between classes $C_{1k}$ and $C_{2p}$ is computed as the cosine similarity, i.e., the inner product divided by the product of the norms:
    • $\text{TextSim}(C_{1k}, C_{2p}) = \dfrac{\mathrm{TC}(C_{1k}) \cdot \mathrm{TC}(C_{2p})}{\|\mathrm{TC}(C_{1k})\| \, \|\mathrm{TC}(C_{2p})\|}$
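
A compact sketch of the TF-IDF weighting and cosine similarity described above (the smoothing term and the toy word lists are my own choices, not the lecture's):

```python
import math
from collections import Counter

def tfidf_vectors(contexts: dict) -> dict:
    """contexts: {class: list of co-occurring words}. Returns TF-IDF vectors."""
    df = Counter()
    for words in contexts.values():
        df.update(set(words))
    n = len(contexts)
    vecs = {}
    for cls, words in contexts.items():
        tf = Counter(words)
        # +1 keeps terms shared by all classes from vanishing (a smoothing choice)
        vecs[cls] = {w: (tf[w] / len(words)) * (math.log(n / df[w]) + 1) for w in tf}
    return vecs

def cosine(u: dict, v: dict) -> float:
    """Inner product divided by the product of the vector norms."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

contexts = {"book": ["novel", "author", "isbn", "novel"],
            "Textbook": ["course", "author", "isbn"]}
vecs = tfidf_vectors(contexts)
print(cosine(vecs["book"], vecs["Textbook"]))
```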

3.4 Instance Set Similarity

  • $\text{InstanceSim}(C_{1k}, C_{2p}) = \dfrac{|\mathrm{IS}(C_{1k}) \cap \mathrm{IS}(C_{2p})|}{|\mathrm{IS}(C_{1k})|}$
  • where $\mathrm{IS}(\cdot)$ returns the instance set of the class; in the book domain, we can identify the same book instances using the ISBN number.

3.5 Exercise

3.5.1

  • Given two ontologies as follows, compute the String Similarity, Neighbor Class Set Similarity, Instance Set Similarity (introduced in the real-world case: book ontology matching) on the class pairs (book, Textbook) and (Textbook, book), respectively.

[figure: two ontologies $O_1$ and $O_2$ with the classes book and Textbook, their neighbor classes, and their instances]

  • String Similarity
    • len(book) = 4, len(Textbook) = 8
    • LCS(book, Textbook) = book, so the LCS length is 4
    • StringSim(book, Textbook) = 4/4 = 1
    • StringSim(Textbook, book) = 4/8 = 1/2
  • Neighbor Class Set Similarity
    • Neighbor(book) = {literature, volume}
    • Neighbor(Textbook) = {volume, Engineering, Science}
    • NeighborSim(book, Textbook) = 1/2
    • NeighborSim(Textbook, book) = 1/3
  • Instance Set Similarity
    • Instance(book) = {Red Sorghum, Introduction to Algorithm}
    • Instance(Textbook) = {Red Sorghum}
    • InstanceSim(book, Textbook) = 1/2
    • InstanceSim(Textbook, book) = 1
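
The three asymmetric similarities can be checked mechanically; the sketch below (with assumed helper names, not the lecture's code) reproduces the numbers above:

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common (contiguous) substring."""
    best, prev = 0, [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def string_sim(l1: str, l2: str) -> float:
    return lcs_len(l1, l2) / len(l1)   # normalized by the first label's length

def set_sim(s1: set, s2: set) -> float:
    return len(s1 & s2) / len(s1)      # normalized by the first set's size

nbr_book, nbr_text = {"literature", "volume"}, {"volume", "Engineering", "Science"}
ins_book = {"Red Sorghum", "Introduction to Algorithm"}
ins_text = {"Red Sorghum"}

print(string_sim("book", "Textbook"), string_sim("Textbook", "book"))  # 1.0 0.5
print(set_sim(nbr_book, nbr_text), set_sim(nbr_text, nbr_book))        # 0.5 0.33..
print(set_sim(ins_book, ins_text), set_sim(ins_text, ins_book))        # 0.5 1.0
```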

3.5.2

  • Question: now that we have String Similarity, Neighbor Class Set Similarity, Textual Context Similarity, and Instance Set Similarity, which ones belong to the element-level matching techniques, and which ones are structure-level matching techniques?
    • String Similarity and Textual Context Similarity are element-level matching techniques.
    • Neighbor Class Set Similarity and Instance Set Similarity are structure-level matching techniques.

3.6 Aggregating the Similarities: Self-Training

  • Aggregate these similarities with a semi-supervised learning strategy, self-training, for binary classification of subClassOf relations:

    • In each iteration, self-training accepts the labeled data as training data and learns a classifier.
    • Then the classifier is applied to the unlabeled data, and class pairs of high confidence are added to the labeled data to train a new classifier for the next iteration.
    • The whole process terminates when the difference between the predicted labels given by the classifiers in two consecutive iterations is smaller than a threshold, or when the maximal number of iterations is reached.
    • In short: train a weak classifier on a small amount of labeled data, predict the unlabeled data, add high-confidence predictions to the training set, and repeat.
  • The binary classifier can be an SVM, Random Forest, Neural Network, etc.

    • the traditional methods (SVM, Random Forest) work best here
  • In each iteration, rules are applied to filter out misclassified relations.

    • i.e., each iteration uses handcrafted rules to filter out misclassifications (a sketch of the two rules follows after this section)
  • RULE 1:

    • Given two book classes $C_{1k}$ and $C_{2p}$, if the label string $l(C_{1k})$ is a suffix of the label string $l(C_{2p})$, and $l(C_{2p})$ does not contain “与”, “和”, or “&”, then $C_{2p}$ is a subclass of $C_{1k}$.
    • Example: 企业管理 (business management) subClassOf 管理 (management)
    • if the classifier misclassifies a pair (i.e., gives it low confidence), this rule can correct it
  • RULE 2:

    • Given two book classes $C_{1k}$ and $C_{2p}$, if the label string $l(C_{2p})$ contains “与”, “和”, or “&”, then use these symbols as separators to segment $l(C_{2p})$. If one of the segmented strings equals $l(C_{1k})$, then $C_{1k}$ is a subclass of $C_{2p}$.
    • i.e., first segment the label, then check for an exact match
    • Example: 计算机 (computer) subClassOf 计算机与互联网 (computers & the Internet)

With the generated subClassOf relations, how do we get equivalent classes?

  • $A = B$ iff $A$ subClassOf $B$ and $B$ subClassOf $A$
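
A minimal sketch of the two rules (my own encoding; the conjunction list and the equal-label guard are assumptions):

```python
import re

CONJUNCTIONS = ("与", "和", "&")

def rule1_subclass(l_c1: str, l_c2: str) -> bool:
    """RULE 1: if l(C1k) is a suffix of l(C2p) and l(C2p) contains no
    conjunction symbol, then C2p subClassOf C1k."""
    return (l_c2.endswith(l_c1) and l_c2 != l_c1  # guard against equal labels
            and not any(c in l_c2 for c in CONJUNCTIONS))

def rule2_subclass(l_c1: str, l_c2: str) -> bool:
    """RULE 2: segment l(C2p) on the conjunction symbols; if some segment
    equals l(C1k), then C1k subClassOf C2p."""
    segments = re.split("[" + "".join(CONJUNCTIONS) + "]", l_c2)
    return any(seg.strip() == l_c1 for seg in segments)

assert rule1_subclass("管理", "企业管理")          # 企业管理 subClassOf 管理
assert rule2_subclass("计算机", "计算机与互联网")  # 计算机 subClassOf 计算机与互联网
```

The surrounding self-training loop itself could be built on, for instance, scikit-learn's SelfTrainingClassifier with a Random Forest base estimator, with the rules applied as a post-filter in each iteration.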

3.7 OAEI

[figures: OAEI (Ontology Alignment Evaluation Initiative) campaign overview and benchmark tracks]

  • Recall that Knowledge Graph Alignment consists of:

    • ontology matching (i.e., schema matching),
    • instance matching.
  • Instance Matching: the process of finding different instances that refer to the same real-world objects.


  • on average, the same real-world entity has about four URIs across different sources

4. Instance Matching with Knowledge Graph Embedding

  • Embedding maps discrete variables to continuous vector representations;
  • Embedding learning techniques have achieved great progress in CV, NLP, Speech Recognition, etc.;
  • Knowledge Graph Embedding aims to map entities and relations to continuous vector representations.


  • Conventional approaches are challenged by the symbolic, linguistic and schematic heterogeneity of independently-created KGs
  • Embedding-based approaches measure entity similarities based on entity embeddings
    • Three key modules
      • KG embedding
      • Alignment inference
      • How they interact


  • Knowledge graph embedding: TransE, which represents a true triple $(h, r, t)$ as a translation $h + r \approx t$ in the embedding space

  • i.e., train one TransE-style embedding per KG and align them to align triples (a minimal scoring sketch is given after this list)

  • Corpora: (partially-aligned) multilingual KGs

  • Enabling: inferable embeddings of multilingual semantics

  • Can be applied to:

    • Knowledge alignment
    • Cross-lingual Q&A
    • Multilingual chat-bots
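
For concreteness, a minimal numpy sketch of the TransE score and its margin-based ranking loss (random vectors stand in for trained embeddings; training itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_ent, n_rel = 50, 1000, 20
E = rng.normal(size=(n_ent, dim))   # entity embeddings
R = rng.normal(size=(n_rel, dim))   # relation embeddings

def transe_score(h: int, r: int, t: int) -> float:
    """TransE models a true triple as h + r ≈ t, so low ||h + r - t|| is good."""
    return float(np.linalg.norm(E[h] + R[r] - E[t]))

def margin_loss(pos: tuple, neg: tuple, gamma: float = 1.0) -> float:
    """Margin-based ranking loss over a true triple and a corrupted one."""
    return max(0.0, gamma + transe_score(*pos) - transe_score(*neg))

print(margin_loss((0, 1, 2), (0, 1, 3)))
```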


4.1 MTransE


  • The knowledge model is the same as in TransE, trained separately per language-specific KG
  • The alignment model pulls the embeddings of aligned triples in the different languages closer together

4.2 Different alignment techniques

[figures: different alignment techniques compared, e.g., MTransE's distance-based axis calibration, translation vectors, and linear transformations]

5. Instance Matching with Rules

  • The Same Entity in Multiple Sources


  • seeds are a small set of matched pairs (labeled data) obtained in advance

  • Automatically discovering and refining dataset-specific matching rules in iterations

    • deriving these rules by finding the most discriminative data characteristics for a given data source pair

5.1 Seeds - Lightweight Entity Matching

  • Punctuation Cleaning:
    • Space Shuttle Endeavour $\approx$​ Space Shuttle “Endeavour”
    • after removing punctuation, if the strings are identical, the two entities can be considered an exact match


  • Redirects Information:

    • Redirects: Baidu Baike automatically redirects some pages to another page, whose title is the redirect information
    • A redirects to B means A and B are synonyms
  • Mining Property Equivalences

    • once we know two entities are equivalent, and both carry many property-value pairs, we can match their properties


  • For each pair of existing matched instances, their property-value pairs are merged.
    • each pair of matched instances yields many such property pairs


  • Matching rule (frequent itemset mining):

    • Algorithm: association rule mining; property pairs that frequently take equal values are grouped together as an instance-equivalence rule, as in the following example (a sketch of applying such a rule follows after this list)
    • baidu:x and hudong:x are matched, iff.
    • valueOf(baidu:标签) = valueOf(hudong:中文学名)
    • and
    • valueOf(baidu:拉丁学名) = valueOf(hudong:二名法)
    • and
    • valueOf(baidu:纲) = valueOf(hudong:纲)
  • The Wrapper Algorithm

    • start with the lightweight matching algorithms to obtain seed matches, from which rules are mined
    • then apply the rules to match more instances, and iterate
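
Applying a mined rule is then a simple conjunction test; the sketch below hard-codes the example rule above, with made-up property values (the helper names are assumptions):

```python
# The mined rule from the example above, as (baidu property, hudong property) pairs.
RULE = [("标签", "中文学名"), ("拉丁学名", "二名法"), ("纲", "纲")]

def rule_says_match(baidu_props: dict, hudong_props: dict) -> bool:
    """baidu:x and hudong:x are matched iff every property pair has equal values."""
    return all(p_b in baidu_props and p_h in hudong_props
               and baidu_props[p_b] == hudong_props[p_h]
               for p_b, p_h in RULE)

baidu = {"标签": "丹顶鹤", "拉丁学名": "Grus japonensis", "纲": "鸟纲"}
hudong = {"中文学名": "丹顶鹤", "二名法": "Grus japonensis", "纲": "鸟纲"}
print(rule_says_match(baidu, hudong))  # True
```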


  • The wrapper is an implementation of Expectation-Maximization iterations.
  • Intuitively, the matches are the missing data to be estimated, and the matching rules are the parameters of the model.
  • The E-step
    • estimates the missing data (matches) using the observed data and the current estimate of the parameters (matching rules).
  • The M-step
    • computes the parameters maximizing the likelihood function, where the data estimated in the E-step are used in lieu of the actual missing data.
  • $M$: matches
  • $\Theta$: parameters (the matching rules)
  • The likelihood is the probability of the instance matches given the current rules.

  • The Likelihood Function

  • Assuming that no equivalent instances exist within a single data source, we can infer that an instance is equivalent to at most one instance from the other data source.
  • Incorrect matches in $M$ may result in a node connecting to more than one other node, which is contrary to the assumption.


  • If the likelihood $L$ is high, the rule's confidence is high, and we can add the rule to the rule set

  • Precisions

  • Precision is estimated by sampling a certain number of output matches.

  • The X-axis (of the omitted plot) indicates the proportion of selected seeds in the complete reference matches.

Question
What are the advantages and disadvantages of the embedding-based method and rule-based method for instance matching?

  • If the data quality is low, the embedding-based method is more tolerant to noise.
  • If the data quality is high, the rule-based method brings higher precision.