Triple Extraction from Relational Web Tables
1. Tables are Everywhere on the Web
1.1 Statistics on Web Tables
- Web Tables : In 2008, the WebTables systems (developed by Google) extracts 14.1 billion HTML tables and finds 154 million are high-quality tables (1.1%).
- Web Tables: The Web Data Commons project extracts 233 million Web tables from HTML pages in 2015.
- Wikipedia Tables: In 2019, the Wikipedia snapshot contains more than 3 million tables from more than 520 thousand Wikipedia articles.
1.2 Web Table Types
1.3 Relational Table
Def
- Arelational table describes a set of entities in the core column(s)(相当于主键) along with their attributes in the remaining columns.
1.4 Steps of Triple Extraction
Entity Linking: Mention to Entity
- 使得可以找到真正的信息
Column Typing: Column to Class (i.e., Type)
- 给column分类
Relation Extraction: Semantic Association between Columns to Relation
- 推出列之间的关系
2. Entity Linking
2.1 What is Entity Linking (EL) in Web Tables?
- Mapping the string mentions in table cells to their referent entities in a given knowledge base (KB).
2.2 Detailed Steps of Entity Linking (Assumption: strings in a cell are a mention):
- Candidate Generation:
- Identifying candidate referent entities of each string mention in table cells when given a knowledge base.
- Entity Disambiguation:
- Choosing only one entity from the candidate set as the referent entity of each string mention.
2.3 Entity Linking: Candidate Generation
2.3.1 Dictionary-based method
- collecting all (string, entity) pairs from anchor texts, redirect pages, and disambiguation pages in Wikipedia;
ranking based on co-occurrence frequency.(成对出现概率)
可以理解为通过已知的实体链接来学习出表格里的链接
例如:
Jordan——-basketball player
——-professor
——-actor
- 出现的频率并不一样,可以做一个排序
2.3.2 String similarity based method
Def:
- computing string similarities between mentions and entities
Levenshtein distance
- Edit distance: Levenshtein distance
- The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
Jaccard similarity
- Jaccard similarity J(A, B) measures the similarity between finite sample set A and B.
Exercise
- Compute the Levenshtein distance between “kitten” and “sitting”.
- 1+1+1=3
- Compute the Jaccard similarity between “University of California, Davis” and “University of California, Berkeley”in word level.
- Compute the Jaccard similarity between “English”and “England”based on character-level trigrams.
2.3.3 Synonym-based method
- combine with the dictionary-based method or string similarity based method
- use a knowledge graph containing rich synonym information, e.g., BabelNet, WordNet
With the above methods, Top-K candidate referent entities for each mention can be obtained.
2.4 Entity Disambiguation
2.4.1 Def
Local Disambiguation
- Only use the contextual information of the given mention and target entity for disambiguation without considering the effects of other referent entities in the same table.
Global (Collective) Disambiguation
- Leverage the semantic associations between candidate referent entities to jointly disambiguate all entities for the mentions in a table.
- 不同mention里的实体可以帮助链接
2.4.2 Local Disambiguation: A Generative Model
- The classic generative model is originally proposed to entity linking in text, but it can be applied to Web tables.
- 假设Mention如下述生成过程:
- The generative process for text: three steps
- Choose the referent entity e from the knowledge base with P(e), e.g., “Michael Jeffrey Jordan”;
- Choose a name s with 𝑃(𝑠│𝑒), e.g., “Jordan”;
- Choose a context c with 𝑃(𝑐│𝑒), e.g., “joins Bulls in 1984”.
- Based on the above generative story, the probability of a name mention m (its context is c and its name is s) referring to a specific entity e can be expressed as (assume that s and c are independent given e)
- s|e 与 c|e 条件独立
where P(e) corresponds to the popularity knowledge,
- P(s|e) corresponds to the name knowledge , and
- P(c|e) corresponds to the context knowledge.
Given a name mention m, to perform entity disambiguation, we need to find the entity e which maximizes the probability P(e|m), i.e.,
- Training data:
- a set of annotated name mentions $𝑀=\{𝑚_1,𝑚_2,…,𝑚_𝑛\}$, each m is a triple $𝑚=\{𝑠,𝑒,𝑐\}$;
- example for text
- table: Annotated Web tables from Web Data Commons project.
- 只需将context换一下
Entity popular model:
- P(e) reflects the popularity knowledge of the entity e;
- if e1 is more popular than e2, then $P(e_1)>P(e_2)$;
- A more popular entity $e_1$ usually appears more times than a less popular entity $e_2$ in a large text corpus(这里指的是知识库), i.e., the frequency of occurrence $count(e_1)>count(e_2)$, e.g., in Wikipedia,
- 这里应该是看mentions到知识库里映射实体数来决定
- Thus, we model the popularity knowledge as:
- where $count(e)$ is the count of the name mentions whose referent entities are e in the training data, (对应所有mention中在知识库映射的实体是e的个数)
- N is the number of all entities
and |M| is the total name mention size.(用于平滑,使得所有实体至少有一个很小的base prob)
The estimation is further smoothed using the simple add-one smoothing
method for the zero probability problem.
Entity name model:
P(s|e) encodes the name knowledge of entities;
for a specific entity e, its more frequently used name should be assigned a higher P(s|e), and a zero P(s|e) value should be assigned to those never used names.
- after collecting all (entity, name) pairs from the name mention data set, then the maximum likelihood estimation is used to estimate the entity name model as
- 不同于实体训练,这时候训练集是一个pair,之前的实体训练只要从知识库里找到所有实体就行
where the count(e,s) is the count of the name mention s whose referent entity is e.
Problem: it cannot deal with unseen names.
- unseen names include:
- aliases 别词: “New York” $\rightarrow$ “Big Apple” ;
- acronyms 简写: “Michael Jeffrey Jordan” $\rightarrow$ “MJ” ;
- misspellings: “Barack Obama” $\rightarrow$ “Barack Opama” .
- to solve this problem, apply the statistical translation model (IBM Model 1) to capture all possible name variations of the given word by translation operations as
1) It is retained (translated into itself);
2) It is translated into its acronym;
3) It is omitted(省略)(translated into the word NULL);
4) It is translated into another word (misspelling or alias).
- P(s|e) can be modeled as
where $\varepsilon$ is a normalization factor, $f$ is the full name of entity e, $l_{s}$ and $l_{f}$ are respectively the lengths of $s$ and $f, s_{i}$ is the $i$-th word of $s$, and $f_{j}$ is the $j$-th word of $f$.
- 这么理解就是$f_j$是full name 第j个word,full name 就是原来训练中出现的word,而$s_i$是变换后出现的第i个word
unseen names include:
- aliases: “New York” $\rightarrow$ “Big Apple” ;
- acronyms: “Michael Jeffrey Jordan” $\rightarrow$ “MJ” ;
- misspellings: “Barack Obama” $\rightarrow$ “Barack Opama” .
Entity context model (use text to explain):
$P(c \mid e)$ encodes the context knowledge of entities;
- if the entity $e$ frequently appears in the context $c$, then $P(c \mid e)$ should be higher;
- 如果一个实体与知识库里出现的上下文频繁一致,则概率大
- e.g.,
- $C_1$: _wins NBA MVP.
- $C_2$: _is a researcher in machine learning.
Entity context model (use text to explain):
- the context of each name mention $m$ is the words surrounding $m$;
- the context knowledge of an entity $e$ is encoded in an unigram language model $M_{e} .$
- 这么理解,我们把对于某个实体的context拆开成word级别,根据一元语法,就可以计算每个word的概率,从而形成集合where $P_{e}(t)$ is the probability of the term $t$ appearing in the context of $e$. The term can be a word, a named entity, or a Wikipedia entity.
Entity context model (use text to explain):
- given a context $c$ containing $\mathrm{n}$ terms $t_{1}, t_{2} \ldots t_{n}$, the entity context model estimates the probability $P(c \mid e)$ as
- according to the annotated name mention data set $M, P_{e}(t)$ can be estimated by the maximum likelihood estimation:
- where $Count _{e}(t)$ is the frequency of occurrences of a term $t$ in the contexts of the name mentions whose referent entity is $e$.
- 从$e$的所有context收集所有的term,这样就可以统计每一个term的频率
Entity context model (use text to explain):
- due to the sparse data problem, $P_{e}(t)$ is smoothed by Jelinek-Mercer smoothing method as
- where $P_{g}(t)$ is a general language model estimated using the whole Wikipedia data.
2.4.3 More Features in Entity Disambiguation
Web Table Features:
- Features found in the table (T) or outside the table (C)
- 表格外同样可以找到很多资源
- Single table features (TS) refer to a value in a single cell while multiple features combine values coming from more than one cell (TM)
2.4.4 Global Disambiguation: A Graph Model
- Why global disambiguation works?
- suppose we have three mentions which are needed to perform entity linking
假设有三个mention需要做实体链接,借助他们的候选实体的语义关联,可以消除local ambiguation
Building an Entity Disambiguation Graph for each given table. Each graph consists of:
- Mention Nodes, Entity Nodes
- Mention-Entity Edges: undirected edges between mentions and its candidate referent entities,
- Entity-Entity Edges: undirected edges between entities.
Entity Disambiguation—Computing EL Impact Factors.
- Two EL impact factors are as follows:
- importance of each mention;
- semantic relatedness between different nodes.
- 测量不同节点的语义相关度
- Each node or edge in a constructed Entity Disambiguation Graph is assigned with a probability .
- For mention-entity edges, their probabilities refer to the semantic relatedness between mentions and entities, called Mention-Entity Semantic Relatedness.
- For entity-entity edges, their probabilities are seen as the semantic relatedness between entities, called Entity-Entity Semantic Relatedness
Mention-Entity Semantic Relatedness
- Two features are used to measure Mention-Entity Semantic Relatedness $S R_{m e}(m, e)$ between the given mention $m$ and entity $e$.
String Similarity Feature strSim(m,e): depending on the Levenshtein distance between the string labels of $m$ and $e$.
- 就算mention和实体的相似度
Mention-Entity Context Similarity Feature $contSim _{\text {me }}(m, e)$ :
- 计算mention和entity的背景相似度
The context of mention $m$ is denoted as menContext (m) :
a) Collecting other mentions in the row or column where $m$ locates.- 同行与列词的集合
b) Segmenting all the collected mentions into a set of words.
The context of entity $e$ is denoted as entContext(e):
a) Collecting all the RDF triples which $e$ exists in.
b) Segmenting all the objects (e is the subject) and subjects (e is the object) into a set of words.
从知识库里找所有包含e的三元组,从而组成相应的背景集合
$contSim_{m e}(m, e)$ is computed by the Jaccard similarity between menContext (m) and entContext ( e ).
使用Jaccard similarity计算相似度
Entity Disambiguation-Computing EL Impact Factors.
Two features are used to measure Entity-Entity Semantic Relatedness $S R_{e e}\left(e_{1}, e_{2}\right)$ between entities $e_{1}$ and $e_{2}$
1) Triple Relation Feature $\operatorname{I_{SRDF}}\left(e_{1}, e_{2}\right)$ : verifying whether $e_{1}$ and $e_{2}$ are in the same triple.
2) Entity-Entity Context Similarity Feature contSim $_{e e}\left(e_{1}, e_{2}\right):$
cont Sim $_{\text {ee }}\left(e_{1}, e_{2}\right)$ is computed by the $\operatorname{Jaccard}$ similarity between entContext $\left(e_{1}\right)$ and entContext $\left(e_{2}\right)$.
Entity Disambiguation-Iterative Probability Propagation.
- Iterative probability propagation is used for combining different $\boldsymbol{E} \boldsymbol{L}$ impact factors for the $E L$ decisions.
Given an Entity Disambiguation Graph $G$ with $n$ nodes, we denote $G$ as an $n \times n$ adjacency matrix $A, A_{i j}$ is the normalized semantic relatedness between node $i$ and node $j$, and $A_{i j}=A_{j i}$.
We define an $n \times 1$ vector $r$ to record the probabilities assigned to all the nodes, and iteratively compute $r$ with the following formula until convergence.
2.4.5 A Test on Zhishi.me—Consisting of Three Chinese Linked KBs.
- Applying our proposed approach to more than 70,000Web tables with each single KB in Zhishi.me.
- For each mention, and if any two identified entities from different KBs do not have the samesAs relation, then there exists a conflict here.
2) 对于每条mention,如果有两个相同的实体来自不同的知识库,但是他们没有相同的关系,那么就会产生冲突 - According to the statistics, there exist conflicts in the EL results of
38.94% mentions.- 单一知识库的实体映射问题,这其实可以理解就是,你一开始在做实体链接的时候,如果只用到一个知识库,那么由于这个知识库
Reasons of Conflicts
Reason 1: The samesAs relations are incomplete between KBs.
Solution:
- Completing samesAs relations with a supervised learning model using synonym feature, string similarity feature and entity-entity context similarity feature.
Reason 2: For some KBs, some potential correct referent entities do not rank the highest.
- 没有排序对
Solution:
Grouping the entities representing the same individual into different
sets using the existing and newly learned samesAs relations.- 把相同的实体组成一个集合
Computing the average ranking, the highest ranking and the number of the entities in each set, and then applying three heuristic rules to solving conflicts.
- 计算每个集合的平均rangking,再排序
Detail Rules
Rule 1: If both the average ranking and the highest ranking of the entities in a set rank the highest, and the number of the entities in this set is not less than half of the number of KBs, then we choose this set as the final EL results for the given mention.
Rule 2: If there exist two or more sets that the average ranking and the highest ranking of the sets’ corresponding entities are the same and rank the highest, also the number of the entities in each of these sets is not less than half of the number of KBs, then we choose one set at random as the final EL results for the given mention.
Rule 3: If the number of the entities in each set is less than half of the number of KBs, the original EL results of the given mention remain unchanged.
- 总结就是只有当相同的实体出现到一定数量,才使用这个特殊规则
- 对于一个mention我们去各个KB找相同的候选实体
3. Column Typing: Column to Class (i.e., Type)
- 利用同一列某一个mention的实体类型作为该mention的实体类型
- 或者利用Table本身head的信息比如city
4. Relation Extraction
Definition
Relation extraction refers to the task of associating a pair of columns in atable with the relation that holds between their contents and / or extractingrelationship information from tabular data and representing them in astructured format (e.g., as subject-predicate-object triples).
For binary relationships, the relationship between columns A and B islabeled with R if a substantial number of pairs of values from A and Boccur in the relations database.
- 利用知识库中本来存在的三元组,来推测出未知的三元组