Knowledge Graph Construction from Semi-Structured Data and Unstructured Data

1. Taxonomy Induction

A taxonomy is a directed acyclic graph consisting of is-a relations between entities, including conceptual entities and individual entities.
example: a part of the taxonomy of Books

Here, Taxonomy induction is to induce a taxonomy from the online encyclopedia.
Wikipedia has its own categorization system, but categories do not form a taxonomy with a fully-fledged subsumption hierarchy, so it is only a thematically organized thesaurus.

Taxonomy induction from Wikipedia is to refine the Wikipedia category system by removing not-is-a relations. (WikiTaxonomy)

1.1 The first step - Pre-Cleansing:

Remove the categories used for Wikipedia administration, i.e., remove the categories whose labels contain any of the following strings: wikipedia, wikiprojects, lists, mediawiki, template, user, portal, categories, articles, and pages.
Wikipedia organizes many category pairs using patterns: Y X and X by Z (e.g., Miles Davis albums, Albums by artist)
- 不是Is-A关系，remove
The relation between these categories is defined as is-refined-by, which is to better structure and simplify the categorization system and should be removed.

1.2 The second step - Syntax-based Method：

if a category pair share the same lexical head, then there exists an is-a relation between these two categories,
- lexical head 词汇中心词
- e.g., British Computer Scientists is-a Computer Scientists
if the lexical head of one of the category occurs in non-head position in the other category, then a not-is-a relation is labeled between these categories.
- categories的中心词出现在非中心词的位置，则不是IS-A关系
- e.g., Crime comics not-is-a Crime

1.3 The third step - Connectivity-based Method ：

For each category $c$, we find the article titled as the category name, e.g., article Microsoft for category Microsoft;
- 首先对于每个category c，我们可以找到其article
On the found article, we collect all categories whose lexical heads are plural nouns $c a S e t=\left\{c a_{1}, c a_{2}, \ldots, c a_{n}\right\} ;$
- 然后在我们找到的article页面下，收集每个category对应的中心词集合
For each $c$ ‘s super category $s c$ in the category system, we label the relation between $c$ and $s c$ as $i s-a$, if the head lemma of $s c$ matches the head lemma of at least one category in caset.
- 对于c的上级类sc，我们可以从其名称找到他的head lemmma（词根）
- 比如Human names的中心词就是names
- 然后进行匹配，如果某个sc和c是IS-A关系，则其lexical head 和head lemma是相同的

1.4 The fourth step - Lexico-Syntactic based Method：

lexico-syntactic patterns are leveraged to identify is-a and not-is-a relations between categories from large-scale corpora, e.g., all article in Wikipedia.

1.5 The fifth step - Inference based Method：

propagate the previously found relations based on the properties of transitivity of the is-a relation.
- 传递性

1.6 Exercise

Please extract a taxonomy from the following sentence (denote the answer as A is-a B):
- IBM, AMD, and Intel are High-tech companies using nanotechnology for several years.
  - High-tech company is-a company
  - IBM is-a High-tech company
  - AMD is-a High-tech company
  - Intel is-a High-tech company

1.7 Summary: KG construction from Semi-Structured Data

Online encyclopedias are typical semi-structured data for knowledge graph construction.
We have introduced techniques on fact extraction, type inference, and taxonomy induction.
All introduced techniques have already been used to build real-word knowledge graphs, and shown good effectiveness and practicability.
There is no perfect technique on knowledge graph construction, so we need to study more.

2. Knowledge Graph Construction from Unstructured Data (Text)

2.1 Basic Tasks

面向文本的实体链接或者发现新的实体
关系抽取，已知两个实体看是否有关系/槽填充
事件抽取

2.2 Two Specific Tasks in Knowledge Engineering

General IS-A Relation Extraction
- (benefit to build taxonomies)
Terminology/ Term Extraction
- 术语抽取，利于领域构建知识图谱
- (benefit to domain-specific knowledge graph construction)

2.3 General is-a Relation Extraction

2.3.1 problem1

The general is-a relation is the semantic relationship between a more specific word (i.e., hyponym下位词) and the more general term (i.e., hypernym上位词).
- hyponym e.g., daisy and rose
- hypernym e.g., flower
Features of is-a relations:
- Transitivity: $A$ is-a $B, B$ is-a $C \rightarrow A$ is-a $C$
  e.g., dog is-a mammal, mammal is-a animal $\rightarrow$ dog is-a anima
- Asymmetry: $A$ is-a $B \nrightarrow B$ is-a $A$
  e.g., dog is-a animal $\nrightarrow$ animal is-a dog
Task Description:
- Triple generation:
  - Input: a large scale corpus
  - 即输入是一大段语料

Output: triples denoting is-a relations
- Beijing is-a capital, Beijing is-a city, Tianjin is-a city, Hebei is-a province, Shanghai is-a city

2.3.2 problem2

Task Description:
- IS-A relation prediction:
- Input: a pair of candidate hyponym and hypernym, and the corresponding vector representations
  - 输入为下位词和上位词组成的pair
- Output: true or false
  - e.g., Pair(dog, animal) → true Pair(dog, cat) → false

2.4 Methods Classification:

Pattern-based Methods (task: triple generation)
Distributional Methods (task: is-a relation prediction)
- Unsupervised Distributional Methods
- Supervised Distributional Methods

2.4.1 Pattern-based Methods：Hearst Patterns

Exercise

Please extract is-a relations from the following sentence with Hearst Patterns, and derive all is-a relations by the transitivity of the is-a relation.
There are further opportunities on exporting UK red meat to such countries as China, Japan, Korea, and Southeast Asian countries, including Singapore or Vietnam.
- China is-a country
- Japan is-a country
- Korea is-a country
- Southeast Asian country is-a country
- Singapore is-a Southeast Asian country
- Vietnam is-a Southeast Asian country
- Inference:
  - Singapore is-a country
  - Vietnam is-a country

Why Introducing Distributional Methods?

Limitations of Pattern-based Methods:
- The coverage and generalization are uncertain:
  - 无法推广
  - The hyponym and hypernym must appear in a sentence at the same time.
  - 召回率低
Distributional Methods aim to solve the problem of co-occurrence sparsity between the hyponym and hypernym.

2.4.2 Unsupervised Distributional Methods

Distributional Inclusion Hypothesis:
- It assumes that a hyponym only appears in some of its hypernym‘s contexts, but a hypernym appears in all contexts of its hyponyms.
- 下位词只出现在一些上位词的contexts里，而上位词出现在下位词所有contexts里
- e.g., the concept “fruit” has a broader spectrum of contexts than its hyponyms, such as ”apple“, ”banana“ and “pear”.
A Classic Asymmetric Distributional Similarity Measure: WeedsPrec.
- 一种无监督提取is-a关系的方法
- It captures the features of $u$, which are included in the set of features for a broader term $v$.
- For each term $u, v$ is candidate hypernym;
  - $u,v$都是候选上位词
- $f$ represents a contextual word with which $u$ co-occurs;
  - $f$代表u的context word
- $F_{u}$ is a set of $f$
  - $F_u\cap F_v$是$u,v$背景词的交集
- $W_{u}(\mathrm{f})$ quantifies the statistical association between the $f$ and $u$, such as Point-Wise Mutual Information (i.e., $P M I(u, f))$.
Point-Wise Mutual Information:
- compute the point-wise mutual information between a word w and a context word c.

$P M I(w, c)=\log \frac{p(w, c)}{p(w) p(c)}$

$N:$ How many sentences does the corpus contain?
$ f(w) \leq N:$ How many sentences contain $w$ ?
$ f(w, c) \leq f(w):$ How many sentences contain $w$ and $c$ ?
$ f(\mathrm{c}) \leq N:$ How many sentences contain $c$ ?
$p(w)=f(w) / N $
$p(\mathrm{c})=f(c) / N $
$p(w, c)=f(w, c) / N$
When the correlation between two words $w$ and $c$ is strong, $P(w, c)$ will be much larger than $P(w) \mathrm{P}(c)$, so $P M I(w, c)$ is larger.

2.5 Supervised Distributional Methods

Represent the term pair $(u, v)$ as a combination of $\boldsymbol{u}$ and $\boldsymbol{v}$ (vector representations)
- Concat : $u \oplus v$
- Diff: $v-u$
- Sum: $\quad u+v$
- Dot-product: $u \cdot v$
train a binary classifier over the representation $(\boldsymbol{u}, \boldsymbol{v})$

2.5.1 What’s wrong with simple calculations?

这种方法可能由于数据集的原因导致并没有学会推理而是记住某个词就是上位词

2.5.2 Project learning

学习如何将下位词映射到上位词的空间，再进行分类
Project learning learns a function to measure how possible there is an is-a relation between two words.
KeyPoint: A projection tensor $T$ is used to project the hyponym vector into the hypernym vector.
- 下位词映射到上位词

Input: Given a query $\mathbf{q}$ and a candidate hypernym $\mathbf{h}$
Output: The possibility that pair(q, h) is an is-a relation
Step1: look up word embeddings $\mathbf{q}$ and $\mathbf{h}$ through a embedding table
Step2 : Randomly initialize the projection vector $\boldsymbol{T}(K \times d \times d)$
Step3: Calculate the similarity vector s:

$s=q^{T} T_{i} h$

Step4: Map vector $s$ to score $\mathrm{y}, \mathbf{F}$ is generally a multilayer perceptron :

$\mathrm{y}=\mathrm{F}(s)$

3. Terminology/ Term Extraction

Terminology extraction is associated with some other tasks, such as NER,keyword extraction, etc.
Different from other tasks, terminology extraction is highly related to the domain.
Terminology extraction is th key issue of ontology construction, text summarization, knowledge graphs, etc.

3.1 Definition—Framework

3.1.1 Input Text:

Eg: He did not try to navigate after the first bold flight, for the reaction had taken something out of his soul.

3.1.2 Preprocessing

Tokenization: Tokenization describes splitting paragraphs into sentences, or sentences into individual words.
- Eg: [‘He’, ‘did’, ‘not’, ‘try’, ‘to’, ‘navigate’, ‘after’, ‘the’, ‘first’, ‘bold’, ‘flight’, ‘,’, ‘for’, ‘the’, ‘reaction’, ‘had’, ‘taken’, ‘something’, ‘out’, ‘of’, ‘his’, ‘soul’, ‘.’]
Cleaning(Stopwords):**A majority of the words in a given text are connecting parts of a sentence rather than showing subjects, objects or intent. Word like ‘the’ or ‘and’ can be removed by comparing text to a list of stopword.
Eg: [‘try’, ‘navigate’, ‘first’, ‘bold’, ‘flight’, ‘reaction’, ‘taken’, ‘something’, ‘soul’, ‘.’]
POS: Part of Speech (POS) often requires look at the proceeding and following words and combined with either a rule-based or stochastic method.
- 词性标注
- Eg: [(‘try’, ‘VB’), (‘to’, ‘TO’), (‘navigate’, ‘VB’), (‘first’, ‘JJ’), (‘bold’, ‘JJ’), (‘flight’, ‘NN’), (‘reaction’, ‘NN’), (‘taken’, ‘VBN’), (‘something’, ‘NN’), (‘soul’, ‘NN‘)]
Stemming:Stemming is a process where words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix.
- 通过去除后缀找词根
- Eg: The stemmed form of leafs is: leaf
- Eg:The stemmed form of leaves is: leav
Lemmazation:Lemmazation is an alternative approach from stemming to removing inflection.
- 找词根
- Eg: The lemmatized form of leafs is: leaf
- Eg: The lemmatized form of leaves is: leaf

3.1.3 Filtering

Filtering:
- Common Dictionary Filtering (Filter by common Chinese dictionary)
- 去除常用词
  - If Candidate Terms appear in common Chinese dictionary:Delete the Candidate Terms
Irregular Filtering (Filter irregular words)
- For each Candidate Terms in Candidate list, Calculate the frequency of strings appearing in the corpus:
  $\mathrm{A}=$ frequency of Candidate Terms striing
  $B=$ frequency of Candidate Terms string removing the first character
  $\mathrm{C}=$ frequency of Candidate Terms string removing the last character Then: score $=$ $\mathrm{A} /(\mathrm{B}+\mathrm{C}-\mathrm{A})$
  If score $<$ Threshold: Delete the Candidate Terms
Example:
A = “right of transit passage”
B = ”right of transit“ score = 0.99, keep the Candidate Terms:
C = “of transit passage”. ”right of transit passage“

3.2 Approaches

Linguistic-based approaches
Statistical-based approaches
Graph-based approaches

3.2.1 Linguistic-based approaches — Chunker

NPs——Noun Phrases:
- NPs: Noun Phrases
  - A type of phrase whose grammatical function is equivalent to a noun
- Noun phrases can name a person, place, thing or idea.
- Examples:
  - I want a skateboard.
  - The yellow house is for sale.
- Noun phrases can generally serve as subject, object, attributive and other components in a sentence.
Theory
- More than 90% of the terms extracted in corpus are Noun Phrases

Toolkit
- NLTK RegexpParser: Convert regular expressions into syntax trees
- Step1: Define patterns of NPs
- Step2: Find Candidate Terms using Chunker
- Step3: Candidate Terms Filtering

Step1: Define patterns of NPs

一些模板

{\}
{*+}
{\**+}
{\*+}

Step2: Find Candidate Terms using Chunker

Define NP patterns using methods before
- NPChunker = nltk.RegexpParser(patterns)
POS tagging for each sentence
- tagged_words = [nltk.pos_tag(word) for word in train_dataset]
Using NPChunker to get tree and Candidate Terms
- tree = NPChunker.parse(tagged_words)

3.2.2 Statistical-based approaches

Theory
- Candidate terminology with higher frequency is more likely to be a terminology.
Statistical Feature:
- Termhood: Measure the relevance between term and domain.
  - TF-IDF
Unithood: Measure the collocation and adhesion within term.
- MI(Mutual information)
Evaluate the importance of a word to a document.
Assuming word with higher TF value and higher IDF value is more relevant to domain.
- TF: Term frequency.

$\mathrm{TF}=\frac{\text { Number of certain word in a document }}{\text { Number of all words in a document }}$

IDF: Inverse document frequency

$\text{IDF}=\log \left(\frac{\text { Number of all documents in corpu }}{\text { Number of documents containing the certain word }+1}\right)$

Formula
- TF-IDF value is directly proportional to the number of occurrences of a word in the document and inversely proportional to the number of occurrences of the word in the whole corpus.

Exercise——TF-IDF

Suppose there are 100 words in a document, and the word “cow” appears three times. The word “cow” has appeared in 1,000 documents, and the total number of documents is 10,000,000. What is the TF-IDF score of the word “cow”?

$\begin{array}{ll} \text{TF}=\frac{3}{100}\\ \text{IDF}=\log \frac{10000000}{1000+1}=3.9996\\ TF\times IDF=0.119987 \end{array}$

例：假定《亚洲的网络技术》一文长度为1000个词，“亚洲”、“网络”、“技术”各出现20次，则这三个词的“词频”（TF）都为0.02。然后，搜索Google发现，包含“的”字的网页共有250亿张（假定这就是中文网页总数），包含“亚洲”的网页共有62.3亿张，包含“网络”的网页为0.484亿张，包含“技术”的网页为0.973亿张。计算“亚洲”、“网络”、“技术”的TF-IDF值.

$\begin{array}{ll} \text{IDF(亚洲)}=\lg\frac{250}{26.3}=0.603\\ \text{IDF(网络)}=\lg\frac{250}{0.484}=2.713\\ \text{IDF(技术)}=\lg\frac{250}{0.973}=2.410\\ \text{TF-IDF(亚洲)}=0.603\times 0.02=0.01206\\ \text{TF-IDF(网络)}=2.713\times 0.02=0.05426\\ \text{TF-IDF(技术)}=2.410\times 0.02=0\\ \end{array}$

4. Statistical-based approaches—MI

4.1 PMI

A special case of MI. It is used to calculate the degree of association between words in NLP.
- 用于计算两个词的联系程度

$P M I(x ; y)=\log _{2} \frac{p(x, y)}{p(x) p(y)}=\log _{2} \frac{p(x \mid y)}{p(x)}=\log _{2} \frac{p(y \mid x)}{p(y)}$

xy represents $2 \sim \mathrm{n}(\mathrm{n} \geq 2)$ words. For example, when two words are used, $\mathrm{x}$ represents the former word and $\mathrm{y}$ represents the latter word; In three words, $\mathrm{x}$ represents the first (two) words and y represents the last two (one) words; And so on. $-$ Usually used for double-word terminology.
- xy表示一个组合,xy是内部的词
When the correlation between words $x y$ is strong: $P M I>0$, and when it is weak: $P M I \approx 0$
A large PMI value means that the combination between words is tight, and the more likely it is to become a terminology.

4.2 Exercise— PMI

Use the following Co-occurrence Matrix to represent the frequency of simultaneous appearance of two words in text. Calculate the PMI between information and data.

$\begin{array}{l} P(\text{Information,data})=6/19\\ P(\text{Information})=11/19\\ P(\text{data})=7/19\\ PMI(\text{Information}|\text{data})=\log_2\frac{6/19}{(11/19)\times(7/19)}=\log_2\frac{114}{77} \end{array}$

5. Graph-based methods

5.1 PageRank And TextRank

PageRank:
- Calculate the importance of webpages based on the link between them.
  - 如果这个网页被多次链接，那么这个网页更重要
TextRank:
- Regard ‘word’ as ‘webpage’
  Calculate the importance of words based on the
- co-occurrence between them.
- Turn the directed graph in PageRank into an undirected graph.
- 把单词当成网页，算单词的重要性
Feature
- TextRank can extract terminologies from a single document without relying on other corpora.
  - 无需训练，可直接抽取术语

5.2 PageRank

5.2.1 PageRank’s Theory

Regard the Internet as a directed graph, with webpages as nodes in the graph and links between webpages as edges in the graph.
When a webpage is linked by many other webpages, it means that this webpage is more important, that is, the PR value (PageRank value) of this webpage will be higher.
If a webpage with a high PR value links to another webpage, the PR value of the linked webpage will increase accordingly.
简单理解就是互联网是一张巨大的有向图，网页被链接越多那么重要程度越高，并且这种重要程度可以传递给邻居

5.2.2 PageRank’s Formula

Divide the PR value of a webpage equally according to the total number of its links, and take this value as the voting value of the webpage to its’ linked webpage. Therefore, for webpage $i$, its PR value can be expressed as:
- $i$处的PR值等于邻居的PR值除以其自身的出度

$P R(\mathrm{i})=\sum_{j \in B_{\mathrm{i}}} \frac{P R(\mathrm{j})}{O u t(j)}$

$P R(\mathrm{i}): \mathrm{PR}($ PageRank )score of webpage i.
$B_{\mathrm{i}}$ : Collection of webpages that linked to webpage i.
Out(i): Out degree of webpage $\mathrm{i}$.

5.2.3 Example

As shown in the right figure, suppose a set consisting of only four webpages: $A, B, C$ and $\mathrm{D}$. If webpages $\mathrm{B}, \mathrm{C}$ and $\mathrm{D}$ are all linked to webpage $\mathrm{A}$, and webpages $\mathrm{B}, \mathrm{C}$ and $\mathrm{D}$ have no other links, then the PR value of webpage $A$ will be the sum of the PR values of webpages $B, C$ and $D$ :

$P R(A)=P R(B)+P R(C)+P R(D)$

As shown in the right figure, webpage $B$ has links to webpage $A$ and $C$, webpage $D$ has links to webpages $A, B$ and $C$. Suppose one webpage can only vote for another webpage once. So webpage B will vote for $1 / 2$ of the linked webpage and webpage $\mathrm{D}$ will vote for $1 / 3$ of the linked webpage. In this case, the PR value of webpage A is:

$P R(A)=\frac{P R(B)}{2}+\frac{P R(C)}{1}+\frac{P R(D)}{3}$

5.2.4 Extended formula of PageRank

The above formula assumes that users only click the link of the current webpage to browse the next webpage, but the random browsing method is more in line with the real situation. Thus, the random browsing model is generated, and the $P R$ value of each web page in the random browsing model is calculated by the following formula:
- d 阻尼系数，即有一定可能随机跳转页面
- 最终PageRank会收敛到一个稳态

$P R(i)=(1-\mathrm{d})+d \times \sum_{j \in B i} \frac{P R(j)}{\operatorname{Out}(j)}$

$P R$ value is initially set as $1 / \mathrm{N}$.
N: Num of all webpages.
d: Damping coefficient, representing the probability of browsing webpages accordance with the link, default value is $0.85$.
1-d: The probability that the viewer randomly browses another webpage.

5.3 Graph-based approaches—TextRank

PageRank:Construct graph according to the link relationship between webpages.
TextRank:Construct graph according to the co-occurrence relationship between words.

5.3.1 TextRank’s Formula:

其相当于构建了有向有权图,i点的重要程度由其邻居决定

$W S\left(\mathrm{~V}_{\mathrm{i}}\right)=(1-d)+d \times \sum_{\mathrm{V}_{\mathrm{i}} \epsilon\left(\mathrm{V}_{\mathrm{i}}\right)} \frac{w_{j i}}{\sum_{\mathrm{V}_{k} \in O u t\left(\mathrm{v}_{j}\right)} w_{j k}} W S\left(\mathrm{~V}_{j}\right)$

$w_{j i}$ :Weight of edge connecting node $\mathrm{i}$ and node $\mathrm{j}$.
$W S\left(\mathrm{~V}_{\mathrm{i}}\right):$ Weight of word $\mathrm{i}$, initially value is 1 .
$\operatorname{Out}\left(\mathrm{V}_{\mathrm{i}}\right):$ Out degree of word $\mathrm{i}$.

5.3.2 Steps Of TextRank

InputText:
- 淡黄的长裙, 蓬松的头发, 牵着我的手看最新展的油画
Preprocessing:
- 淡黄长裙蓬松头发
- 牵我手看最新展出油画
Construct graph $\mathbf{G}(\mathbf{V}, \mathbf{E}): \mathrm{V}$ is composed of words generated in the above steps, and then use the co-occurrence relationship to construct an edge between any two nodes. The edge between two nodes is only when their corresponding words cooccur in a window of length $\mathrm{K}$. Given $\mathrm{K}=2$ :

Iteration: Iteratively calculate the weight of each node until convergence according to the formula.

$W S\left(\mathrm{~V}_{\mathrm{i}}\right)=(1-d)+d \times \sum_{\mathrm{V}_{\mathrm{i}} \epsilon\left(\mathrm{V}_{\mathrm{i}}\right)} \frac{w_{j i}}{\sum_{\mathrm{V}_{k} \in \text { out }\left(\mathrm{v}_{j}\right)} w_{j k}} W S\left(\mathrm{~V}_{j}\right)$