Knowledge Graph Construction from Semi-Structured Data

1. What is semi-structured data?

  • Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.

  • A typical example: online encyclopedias

    • Encyclopedia data

image-20211117095556087

2. Knowledge Graphs Built from Online Encyclopedias

image-20211117095649461

  • BabelNet: the world's largest multilingual dictionary

2.1 Modules of Knowledge Graph Construction from Online Encyclopedias

  • Fact Extraction
  • Type Inference
  • Taxonomy Induction

2.2 Fact Extraction

  • Facts are factual knowledge represented as triples, each of which is in the form of <subject, predicate, object>.
  • subject: an entity (an individual entity or a conceptual entity (i.e., concept))
  • predicate: property/relation
  • object: an entity or a literal value

2.2.1 Fact Extraction from Infoboxes

image-20211117100257777

image-20211117100347745

  • Editing Format: the wikitext form before rendering (used for extraction)
  • Rendered Output: the output after rendering

2.2.2 Method

  • Generic Infobox Extraction: does not map synonymous properties
    • No merging, i.e., synonymous property names are not normalized
    • e.g., birthdate vs. dateOfBirth
  • Mapping-based Infobox Extraction: maps synonymous properties

image-20211117100723382

2.2.2.1 Fact Extraction from DBpedia

image-20211117100511128

image-20211117101103546

2.2.2.2 Fact Extraction from YAGO

image-20211117101146730

  • YAGO also defines infobox heuristics to map infobox properties to YAGO relations.

  • Difference between YAGO and DBpedia: YAGO uses LEILA to parse literals of different types, such as dates and quantities, and normalizes units of measurement to ISO units.

    • LEILA is used to parse literals, which are then normalized to ISO units

image-20211117101238200

2.2.3 Fact Extraction from Categories

image-20211117101613004

  • Extraction with regular expressions

image-20211117155204555

  • subject: the article entity
    • Qiang Yang
  • predicate: the target relation
    • birthOnDate
  • object: the string captured by the brackets of the regular expression
    • “1964” (captured from the category “1964 births”)
  • After extraction, the extracted property needs further validation
    • Note: Domain and Range checking is necessary.

image-20211117155359253

  • As shown in the figure, the article title is taken directly as the subject, and regular-expression matching against the categories yields the predicate and object

image-20211117155639932

  • After matching, domain and range checking is required; a minimal sketch of the extraction step is shown below
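
A minimal Python sketch of this pattern-based extraction step (the category patterns and the property names here are illustrative, not the actual YAGO category maps):

```python
import re

# Illustrative category patterns: compiled regex -> target relation.
# The string captured by the brackets becomes the object of the fact.
CATEGORY_PATTERNS = [
    (re.compile(r"^(\d{4}) births$"), "birthOnDate"),
    (re.compile(r"^(.+) Prize winners$"), "hasWonPrize"),
]

def extract_facts_from_categories(article_title, categories):
    """Return (subject, predicate, object) triples extracted from category strings."""
    facts = []
    for category in categories:
        for pattern, target_relation in CATEGORY_PATTERNS:
            match = pattern.match(category)
            if match:
                # Domain and range checking should follow this step.
                facts.append((article_title, target_relation, match.group(1)))
    return facts

print(extract_facts_from_categories("Qiang Yang", ["1964 births", "Machine learning researchers"]))
# [('Qiang Yang', 'birthOnDate', '1964')]
```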

2.2.4 Exercise

  • In the Wikipedia article page of “Mo Yan”, the categories are as follows. Please extract facts from them in Turtle using the YAGO category maps without domain and range checking, and clarify the subsequent validation steps on the extracted facts.
  • YAGO category maps

image-20211117155941473

image-20211117102520453

image-20211117160104466

@prefix expr: <http://www.example.org/resource/> .
@prefix expp: <http://www.example.org/property/> .

expr:MoYan expp:bornOnDate "1955" ;
           expp:hasWonPrize expr:Han_Chinese_Nobel ,
                            expr:Nobel ,
                            expr:Mao_Dun_Literature_Prize .
  • The domain and range of each property need to be checked

2.3 Type Inference

  • One of the most valuable kinds of knowledge is type information, which refers to the axioms stating that an instance is of a certain type.
Barack Obama rdf:type President of the United States
Italy rdf:type Country in Europe

2.3.1 Type Inference: Applications

image-20211117104718864

image-20211117104846856

  • Type is the key relation that connects the schema layer and the instance layer

image-20211117104958686

image-20211117105012078

  • Type Inference in Knowledge Graph Construction: one setting builds type information from scratch
    • The other setting is completion: some instances already have types, and the missing ones are filled in

2.3.2 Type Inference from Infoboxes

2.3.2.1 Names of infobox templates
  • Names of infobox templates are extracted as entity types (DBpedia).
    • raw type: mapping to the DBpedia Ontology $\rightarrow$ ontology classes
    • The template name is looked up directly in the infobox and then mapped

image-20211117162049424

2.3.2.2 concept-instance pairs
  • Property-value pairs in infoboxes are mapped to concept-instance pairs (Zhishi.me).
    • Each article in the online encyclopedia is taken as an instance, so all articles form an instance set $I=\{i_1, i_2, \ldots, i_n\}$;
    • Each category (i.e., article category) in the online encyclopedia is treated as a concept (i.e., class), so all categories compose a concept/class set $C=\{c_1, c_2, \ldots, c_m\}$;
    • In the infobox of each article, a set of property-value pairs can be extracted as $\{\langle p_1, v_1\rangle, \ldots, \langle p_t, v_t\rangle\}$.
      • In other words, from Wikipedia we obtain an instance for each article and a concept for each category. For a given article, we take all property-value pairs from its infobox and check whether each property is in the concept set and each value is in the instance set; if both hold, we map the pair to v rdf:type p.
  • instances

image-20211117163103902

  • concepts

image-20211117163123252

  • property-value pairs: {<name, Schindler’s List>, …}

image-20211117105645678

  • Property-value pairs in infoboxes are mapped to concept-instance pairs (Zhishi.me).
    • For each property-value pair $\langle p_k, v_k\rangle$, if $p_k$ belongs to the concept set $C$ and $v_k$ belongs to the instance set $I$, then we infer that $p_k$ is the type of $v_k$ (a minimal sketch follows).
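
A small Python sketch of this rule, assuming the instance set, concept set, and infobox property-value pairs have already been extracted (all identifiers and example values below are illustrative):

```python
def infer_types_from_infoboxes(instances, concepts, infobox_pairs):
    """infobox_pairs maps each article to its list of (property, value) pairs.
    If a property name is also a concept and its value is an instance,
    we infer the triple (value, rdf:type, property)."""
    triples = []
    for article, pairs in infobox_pairs.items():
        for prop, value in pairs:
            if prop in concepts and value in instances:
                triples.append((value, "rdf:type", prop))
    return triples

instances = {"Schindler's List", "Steven Spielberg"}
concepts = {"film", "director"}
pairs = {"Schindler's List": [("name", "Schindler's List"), ("director", "Steven Spielberg")]}
print(infer_types_from_infoboxes(instances, concepts, pairs))
# [('Steven Spielberg', 'rdf:type', 'director')]
```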

image-20211117105735352

2.3.3 Type Inference from Categories

  • Each article is usually assigned to several categories in online encyclopedias, so these categories can be taken as important candidate types for entities/ instances.

    • On Wikipedia, each article is assigned category information, and these categories can serve as candidate types
  • YAGO is the first work to infer entity types from the categories in Wikipedia by using heuristics.

image-20211117105907833

image-20211117105924415

  • YAGO classifies all categories into conceptual ones, which are actually entity types, and non-conceptual ones.
    • However, not all categories are valid types, so filtering is needed
    • YAGO parses each category name like Naturalized citizens of Germany into a pre-modifier (Naturalized), a head (citizens) and a post-modifier (Germany);
      • That is, each category name is parsed into a pre-modifier + head + post-modifier form
    • If the head is not an acronym (e.g., NYC) and is a plural word, then the given category is a candidate conceptual category;
    • For each of the candidate conceptual categories, if it contains a non-conceptual word (manually summarized by YAGO), then it may be an administrative category (e.g., Articles with unsourced statements),
      • If a candidate category contains one of the non-conceptual words summarized by YAGO, it should be discarded
    • or a relational category (e.g., 1987 births), or a thematic category (Physics), and it will be filtered out.
      • Relational categories and thematic categories are filtered out as well (a rough sketch of the filtering heuristic is given below)
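
A rough Python sketch of this filtering heuristic, using a naive plural test and a tiny illustrative word list in place of YAGO's manually curated resources (the real system also parses the category name into pre-modifier, head and post-modifier with a noun-phrase parser):

```python
# Illustrative stand-in for YAGO's manually summarized non-conceptual words.
NON_CONCEPTUAL_WORDS = {"articles", "pages", "stubs", "templates"}

def is_conceptual_category(head):
    """head: the lexical head of a parsed category name, e.g. 'citizens'."""
    if head.isupper():                        # acronym such as 'NYC'
        return False
    if not head.lower().endswith("s"):        # naive plural check
        return False
    if head.lower() in NON_CONCEPTUAL_WORDS:  # administrative category
        return False
    return True

print(is_conceptual_category("citizens"))   # True
print(is_conceptual_category("NYC"))        # False (acronym)
print(is_conceptual_category("articles"))   # False (administrative)
```

Relational categories (e.g., 1987 births) and thematic categories (e.g., Physics) need additional checks that this sketch omits.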

image-20211117110531512

  • As the figure shows, most categories with plural head words are indeed correct conceptual categories
2.3.3.1 Problems:
  • The heuristics cannot be applied in some languages (e.g. Chinese and Japanese), in which nouns have no explicit singular or plural forms.
    • This is a drawback for languages without plural forms;
  • The heuristics cannot catch the semantic associations between instances and classes (i.e., categories), which may lead to mistakes in the process of type inference.
    • Semantic associations are not considered, which can lead to wrong mappings, because not every plural head corresponds to a valid category, as shown in the figure

image-20211117110650812

2.3.3.2 MulType
  • To solve the above problems, MulType mines a language-independent feature, the attribute (i.e., property), and uses it to perform type inference across languages.

  • The assumption of the method:

    • Given the attributes (i.e., properties) “actors, release date, director” of an instance, we intuitively infer its type as “Movie”.
      • Attributes such as actors are representative, so the type can be inferred
    • Given the attributes “name, foreign name” of an instance, we cannot infer its type by intuition.
      • name and foreign name are not discriminative, so the type cannot be inferred
    • This is the representative-attribute assumption
    • MulType uses attributes to build semantic associations between instances and classes (i.e., categories), and presents an attribute-driven type inference assumption: In Wikipedia, if an instance contains representative attributes of one of its classes (i.e., categories), there may exist a rdf:type relation from the instance to the category with a high probability.
      • If an instance has representative attributes of a class, it belongs to that class with high probability
2.3.3.3 Process
  • Based on the above assumption, MulType extracts type information in two steps: attribute extraction and type information generation.

image-20211117111202392

  • Attribute extraction:
    • Instance attribute extraction: Extract instance attributes directly from infoboxes (see Figure (a)).
    • Class attribute extraction: Extract class attributes from infobox templates (see Figure(b)), which are templates that provide standardized information (i.e., attributes) across related articles.
    • Not enough! There are only thousands of infobox templates.

image-20211117111320803

image-20211117111343430

2.3.3.4 Class attribute extraction:
  • Definition 1: Infobox Template based Extraction Rule (IT-ER). Given an infobox template $it$ and a class $c$, the local names of $it$ and $c$ are respectively denoted as $n(it)$ and $n(c)$. All the attributes of $it$ can be propagated to $c$ if
    • $n(it)$ and $n(c)$ are the same (ignoring case and the difference between singular and plural forms for some languages),
    • or “Category:$n(it)$” can redirect to $c$,
    • or $n(it)$ can redirect to $n(c)$, or $n(c)$ can redirect to $n(it)$.
  • In other words, to propagate attributes from an infobox template to a class, we match the template's local name against the class's local name; if they “match” under one of the conditions above, all the attributes of the template are propagated to that class.

Examples:

1) Infobox template“Template:Infobox islands”and class“Category:Islands”have the same local name“islands”;

  • same local name

2) The singular forms of“n(Category:Countries)” (i.e.,“Country”) and“n(Template:Infobox country)” (i.e.,“country”) are the same when ignoring case;

  • Case is ignored

3) If we submit the query “Category:n(Template:Infobox university)”to Wikipedia, it can be redirected to“Category:Universities and colleges”;

  • If there is a redirect relation between the two, their attributes are treated as the same, so the template's attributes are propagated to the category

image-20211117111749901

4) “n(Category:States of The United States)” can redirect to “n(Template:Infobox U.S. state)”.

  • The template and the category have a redirect relation; concretely, this relation can be seen from the small note under the article title on Wikipedia

image-20211117111908926

Top-Down

  • Definition 2: Top-Down Hierarchy-based Extraction Rule (TDH-ER). If a class c has no attribute extracted from infobox templates, and it has super-classes with attributes, then all the attributes of its super-classes should be inherited by c.

    • Classes that cannot obtain attributes from infobox templates can inherit them from their super-classes
  • This is inspired by the inheritance in object-oriented programming that one class can inherit all the attributes of its super-class.

    • Example: class Manager inherits the attribute name and gender from class Person.
  • Sets(one set corresponds to one language) of isA relations between classes are collected from open knowledge bases, including DBpedia, Yago, BabelNet, WikiTaxonomy and Zhishi.schema, in order to refine the class hierarchies of different languages in Wikipedia.
    • The class hierarchies are refined using the isA relations collected from these knowledge bases

Bottom-Up

  • Definition 3: Bottom-Up Hierarchy-based Extraction Rule (BUH-ER). If a class c has no attribute extracted from infobox templates, and it has hyponyms (including instances and sub-classes) with attributes, then the attributes can be propagated to c when they are shared by more than half of these hyponyms.

    • A class can obtain attributes from its hyponyms, but an attribute is propagated only when more than half of the hyponyms share it (see the sketch below)
  • This is based on the idea that a class is traditionally a placeholder for a set of instances sharing similar attributes (Dowty, Wall, & Peters, 1981).

    • Dowty, D. R., Wall, R., & Peters, S. (1981). Introduction to Montague semantics. Springer Netherlands.
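
A minimal Python sketch of this bottom-up propagation rule, assuming the attribute sets of the hyponyms are already available:

```python
from collections import Counter

def propagate_attributes_bottom_up(hyponym_attribute_sets):
    """hyponym_attribute_sets: one attribute set per hyponym of class c.
    Returns the attributes shared by more than half of the hyponyms,
    which are then propagated to c."""
    counts = Counter(attr for attrs in hyponym_attribute_sets for attr in set(attrs))
    threshold = len(hyponym_attribute_sets) / 2
    return {attr for attr, count in counts.items() if count > threshold}

print(propagate_attributes_bottom_up([
    {"name", "mayor", "population"},
    {"name", "population"},
    {"name", "area"},
]))
# {'name', 'population'}
```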

image-20211117112202599

image-20211117112533560

  • Remaining issues: about 70% of article pages have no infobox, and existing infoboxes are often incomplete
2.3.3.5 Acquiring the most similar instances
  • The idea here is to use similar instances for type inference, because not every instance has an infobox, so attributes cannot always be extracted from the instance itself
  • Type Information Generation

    • Acquisition of Most Similar Instances:
      • Obtain the instances most similar to the given instance
    • Infoboxes are not always contained by the corresponding article pages of instances.
    • The information in existing infoboxes may not be complete.
      • Missing attributes can be borrowed from similar instances
  • Try to leverage the attributes of the most similar instances to complement the given instance’s attributes

    • Example: The city Zhenjiang misses the attribute mayor, but its similar instance Changzhou has the attribute mayor, which can be borrowed by Zhenjiang.
  • To acquire the most similar instances that have attributes, different similarity metrics can be used to measure the similarity degree between instances.

  • Given two instances $i_1$ and $i_2$, the Context Similarity Metric is computed as
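
The formula itself is not reproduced in these notes; assuming the usual cosine similarity between the two embedding vectors, a plausible reconstruction is

$$\operatorname{CSM}(i_1, i_2) = \frac{v(i_1) \cdot v(i_2)}{\lVert v(i_1)\rVert \, \lVert v(i_2)\rVert}$$
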
  • where $v(i)$ is the vector representation of instance $i$​. Such vector representations can be trained by pre-trained models on the whole Wikipedia.

    • Roughly, each instance is embedded as a vector (word embedding), and the similarity between the vectors is computed
  • Given two instances $i_1$ and $i_2$, the Existing Classes Similarity Metric is computed as
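
The formula is likewise missing; assuming the Jaccard-style definition suggested by the notes and by the $1/8$ example below, a plausible reconstruction is

$$\operatorname{ECSM}(i_1, i_2) = \frac{|v(i_1) \cap v(i_2)|}{|v(i_1) \cup v(i_2)|}$$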

  • This is similar to the Jaccard similarity
  • where $v(i)$ is the set of classes (i.e., categories) in the Wikipedia article pages for each instance.
    • i.e., the set of categories listed on each instance's Wikipedia article page
  • Example of Existing Classes Similarity Metric : suppose two instances $i_1= \text{IntelliJ IDEA}, i_2= \text{Apache XMLBeans}$​

  • $\text{ECSM}(i_1,i_2)=1/8=0.125$

  • To combine the Context Similarity Metric and Existing Classes Similarity Metric, an Integrated Instance Similar Score is defined by maximizing the product of these two metrics:
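
The formula is not shown in these notes; a sketch consistent with the description below (add 1 to each metric, then take the product) would be

$$\operatorname{IISS}(i_1, i_2) = \big(\operatorname{CSM}(i_1, i_2) + 1\big)\times\big(\operatorname{ECSM}(i_1, i_2) + 1\big)$$
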
  • We add 1 to each value of both the metrics before the multiplication to avoid a 0 Integrated Instance Similar Score. If there does not exist an instance satisfying $\operatorname{IISS}\left(i_{1}, i_{2}\right)>1$, the instance $i_{1}$​ will not get any similar instance.
    • If many instances satisfy the condition, they can be ranked and the top ones (e.g., top 5) are kept

2.3.4 Type Information Generation

  • Type Inference with Random Graph Walk:
    • After extracting attributes of instances and classes, each instance, its attributes and the classes in its corresponding Wikipedia article page naturally form a graph, in which the given instance and its classes are linked by shared attributes.
    • To compute the probability of the class being the type of the given instance leveraging the graph structure, a random graph walk model is used to infer the types of each given instance.

image-20211117114127707

  • The initially built graph of instance $i$
2.3.4.1 Type Inference with Random Graph Walk:
  • After adding most similar instances of $i$​ that have attributes into the built graph
    • The most similar instances are added to the graph

image-20211117114307000

  • the attribute-driven type inference assumption:
    • In Wikipedia, if an instance contains representative attributes of one of its classes (i.e., categories), there may exist a rdf:type relation from the instance to the category with a high probability.
    • If an instance has representative attributes of one of its classes, there is a relatively high probability that it belongs to that class
  • To model the above assumption in the random graph walk model, it is supposed that if some class can be reached by more of its representative attributes on the built graph of each instance, then the probability of the class being the type of the given instance is higher.
    • The more representative attributes of a class an instance has, the higher the probability
  • A random graph walk is used to determine which categories are most likely to be reached
2.3.4.2 Transition probability from the instance to its attributes
  1. Transition Probability Starting from the Given Instance. Given an instance $i$, the transition probability $P_{IA}$ from $i$ to one of its attributes $a_{j}^{i}$ in the built graph is
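
The formula is missing here; assuming the attribute weights are normalized over all of $i$'s attributes, a plausible form is

$$P_{IA}(i, a_{j}^{i}) = \alpha \cdot \frac{\operatorname{Weight}(a_{j}^{i})}{\sum_{k} \operatorname{Weight}(a_{k}^{i})}$$
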
  • This is essentially a weighting over the attributes
  • where $\alpha \in [0,1]$ is a constant, and $\operatorname{Weight}(a_{j}^{i})$ is the reciprocal of the frequency with which $a_{j}^{i}$ occurs in the attribute sets of all the classes in Wikipedia.
  • The fewer classes an attribute is shared by, the more representative this attribute is.
  • The lower the frequency of an attribute, the higher its weight.
2.3.4.3 Transition probability from the instance to its most similar instances
  1. Transition Probability Starting from the Given Instance. Given an instance $i$, the transition probability $P_{IS}$ from $i$ to one of its most similar instances $s_{r}^{i}$ in the built graph is defined as
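
The formula is again missing; assuming it mirrors $P_{IA}$ with the remaining $1-\alpha$ probability mass distributed over the most similar instances, a plausible form is

$$P_{IS}(i, s_{r}^{i}) = (1-\alpha) \cdot \frac{\operatorname{IISS}(i, s_{r}^{i})}{\operatorname{IISS}(i, \cdot)}$$
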
  • The probability of moving from the instance to one of its similar instances
  • where $\alpha \in [0,1]$ is a constant, $\operatorname{IISS}(i, s_{r}^{i})$ is the Integrated Instance Similar Score between $i$ and $s_{r}^{i}$, and $\operatorname{IISS}(i, \cdot)$ is the sum of the Integrated Instance Similar Scores between $i$ and each of its most similar instances.
2.3.4.4 Transition probability from an attribute to a class
  1. Transition Probability Starting from an Attribute. Given an instance $i$, the transition probability $P_{AC}$ from the attribute $a_{j}^{i}$ to a class $c_{k}^{i}$ in the built graph is defined as
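
The formula is not reproduced here, but the worked example below ($P_{AC}(a_1, c_2) = 0.5$ when two classes are connected to the attribute) implies a uniform distribution over the connected classes:

$$P_{AC}(a_{j}^{i}, c_{k}^{i}) = \frac{1}{N_{C}}$$
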
  • The fewer classes an attribute is shared by, the more representative it is

    • e.g., if $a_1$ is shared by 10000 classes, its weight is $1/10000$
    • if $a_2$ is shared by 100 classes, its weight is $1/100$
    • The goal is to walk from $i$ to its correct types: we want the walk to pass through the most representative attributes, and the more a type is reached through representative attributes, the more likely $i$ belongs to that type
  • When walking a step from an instance to an attribute, the walk tends to choose the most representative attribute so that it has better chance to walk to the correct classes, i.e., types.

    • When walking, the instance tends to follow its most representative attributes, and thus has a better chance of reaching the correct classes
  • where $N_{C}$​ is the number of the classes that have edges connecting $a_{j}^{i}$​ in the built graph.
  • For example, $P_{A C}\left(a_{1}, c_{2}\right)=P_{A C}\left(a_{3}, c_{2}\right)=0.5$​

image-20211117214631173

  • The probability of the edge from $a_{j}$ to $c_{k}$ is the reciprocal of the number of classes connected to $a_{j}$
2.3.4.5 Process

a. Starting from an instance $i$, the walk may get to an attribute $a_{j}^{i}$ either by the directed edge from $i$ to $a_{j}^{i}$ with the transition probability $P_{IA}(i, a_{j}^{i})$, or by two steps through one of $i$'s most similar instances $s_{r}^{i}$ with the transition probabilities $P_{IS}(i, s_{r}^{i})$ and $P_{IA}(s_{r}^{i}, a_{j}^{i})$.

b. Starting from an attribute $a_{j}^{i}$, walk a random step to one of the classes $c_{k}^{i}$ with the transition probability $P_{A C}\left(a_{j}^{i}, c_{k}^{i}\right)$.
c. Based on the random walk process, the probability starting from the instance $i$ to a class $c_{k}^{i}$ is computed as
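
The formula is not reproduced in these notes; a plausible reconstruction that sums over the one-step and two-step paths described in (a) and (b) is

$$P_{rgw}(i, c_{k}^{i}) = \sum_{j}\Big(P_{IA}(i, a_{j}^{i}) + \sum_{r} P_{IS}(i, s_{r}^{i})\, P_{IA}(s_{r}^{i}, a_{j}^{i})\Big)\, P_{AC}(a_{j}^{i}, c_{k}^{i})$$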

d. After normalizing $P_{rgw}(i, c_{k}^{i})$, if it is greater than a threshold, then $c_{k}^{i}$ is inferred as $i$'s type.

3. Type Inference from Text

  • Two steps on type inference from text:

    • Type extraction, which extracts the types of each entity from the first sentence of its corresponding article.
    • i.e., extraction from the first sentence of each Wikipedia article
      • The first sentence in each article gives a textual definition of each entity, and it usually provides important type information.
    • Type disambiguation, which links extracted types to correct Wikipedia entities.
      • Since the types extracted in the first step are just plain strings, the second step aims to assign real senses to these types.
      • The extracted type strings are linked to the corresponding entities/classes, which achieves disambiguation
  • Type inference with a lightweight Syntactic Pattern: Type Extraction

  • Extraction is applied to the first sentence of each Wikipedia article
    • To extract entity types from the first sentences in Wikipedia articles, LHD directly applies a simple syntactic pattern as follows

image-20211117115557805

  • where A is a sequence of any tokens, and B denotes candidate types.
  • Type inference with a lightweight Syntactic Pattern: Type Extraction
    • Only using the pattern to extract entity types is effective but the precision cannot be guaranteed. Thus, several heuristics are proposed to improve the extraction quality.
    • The pattern is simple but noisy, so heuristics are used to remove the noise

image-20211117115623470

  • Search Linker. This linker takes the extracted type as the input, and leverages the Wikipedia Search API to return the disambiguated Wikipedia article.
    • This API is used to perform the disambiguation check

Exercise

1) Given the Wikipedia article page of “IntelliJ IDEA”, including the first sentence, categories, and the infobox, please write the correct types from them (directly as plain strings).

image-20211117115757471

  • First sentence: integrated development environment (IDE)
  • Categories: Free integrated development environments, Integrated development environments, Java development tools, Products
  • Infobox: Software

4. Taxonomy Induction

4.1 Taxonomy as a directed acyclic graph

  • A taxonomy is a directed acyclic graph consisting of is-a relations between entities, including conceptual entities and individual entities.

  • example: a part of the taxonomy of Google products

image-20211117120820857

4.2 Wikipedia categories do not form a taxonomy

  • Here, taxonomy induction means inducing a taxonomy from the online encyclopedia.

image-20211117220505971

image-20211117220551932

  • Wikipedia has its own categorization system, but categories do not form a taxonomy with a fully-fledged subsumption hierarchy, so it is only a thematically organized thesaurus.
    • That is, Wikipedia's category hierarchy does not always correspond to the is-a relations we want, so the taxonomy needs to be re-derived

image-20211117220618250

  • The categories in the blue boxes are not correct hyponyms

4.3 Solution

  • Taxonomy induction from Wikipedia is to refine the Wikipedia category system by removing not-is-a relations. (WikiTaxonomy)

  • The first step - Pre-Cleansing:

    • Pre-processing: remove the category entries that Wikipedia uses purely for administration
    • Remove the categories used for Wikipedia administration, i.e., remove the categories whose labels contain any of the following strings: wikipedia, wikiprojects, lists, mediawiki, template, user, portal, categories, articles, and pages.
    • Wikipedia organizes many category pairs using patterns: Y X and X by Z (e.g., Miles Davis albums, Albums by artist)
    • Category pairs organized by these patterns exist only for management purposes and do not express is-a relations

    • The relation between these categories is defined as is-refined-by, which is to better structure and simplify the categorization system and should be removed.

  • The second step - Syntax-based Method:

    • A syntax-based method
    • If a category pair shares the same lexical head, then there exists an is-a relation between these two categories,
      • Same head word → is-a
      • e.g., British Computer Scientists is-a Computer Scientists
    • If the lexical head of one of the categories occurs in a non-head position in the other category, then a not-is-a relation is labeled between these categories.
      • e.g., Crime comics not-is-a Crime
      • A simplified sketch of both rules is given below.
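
A simplified Python sketch of these two syntactic rules, naively taking the last word of a category label as its lexical head (the real method uses proper head-finding and lemmatization):

```python
def label_relation(category, super_category):
    """Label a category pair as 'is-a', 'not-is-a', or leave it undecided (None)."""
    cat_words = category.lower().split()
    sup_words = super_category.lower().split()
    head, super_head = cat_words[-1], sup_words[-1]
    if head == super_head:                      # shared lexical head
        return "is-a"
    if head in sup_words[:-1] or super_head in cat_words[:-1]:
        return "not-is-a"                       # a head occurs in a non-head position
    return None

print(label_relation("British Computer Scientists", "Computer Scientists"))  # is-a
print(label_relation("Crime comics", "Crime"))                               # not-is-a
```
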
  • The third step - Connectivity-based Method

  1. For each category $c$, we find the article titled as the category name, e.g., article Microsoft for category Microsoft;
  2. On the found article, we collect all categories whose lexical heads are plural nouns into a set $caSet = \{ca_{1}, ca_{2}, \ldots, ca_{n}\}$;
  3. For each of $c$'s super categories $sc$ in the category system, we label the relation between $c$ and $sc$ as is-a if the head lemma of $sc$ matches the head lemma of at least one category in $caSet$.

image-20211117221808305

  • The fourth step - Lexico-Syntactic based Method
    • Lexico-syntactic patterns are leveraged to identify is-a and not-is-a relations between categories from large-scale corpora, e.g., all articles in Wikipedia.

image-20211117221835785

  • The fifth step - Inference based Method
    • Propagate the previously found relations based on the transitivity of the is-a relation.

image-20211117221901039

4.4 Exercise

  • Please extract a taxonomy from the following sentence (denote the answer as A is-a B):
    • IBM, AMD, and Intel are High-tech companies using nanotechnology for several years.
    • High-tech is-a company
    • IBM is-a High-tech company
    • AMD is-a High-tech company
    • Intel is-a High-tech company

Author: Smurf
Link: http://example.com/2021/08/15/knowledge%20engineering/10.%20Knowledge%20Graph%20Construction%20from%20Semi-Structured%20Data/
License: This post is licensed under CC BY-NC-SA 3.0 CN