Sequence Labelling and Relation Extraction

Contents

  1. IE
    1.1 Simple Introduction
    1.2 Low-level information extraction
    1.3 Named Entity Recognition (NER)
    1.4 Uses
    1.5 Concretely
    1.6 Sequence Models for Named Entity Recognition
    1.7 Features
      1.7.1 Features for sequence labeling
      1.7.2 Features: Word substrings
      1.7.3 Word Shapes
    1.8 Maximum entropy Markov models (MEMMs) or Conditional Markov models
      1.8.1 Sequence problems
      1.8.2 MEMM Inference in Systems
      1.8.3 Scoring individual labeling decisions is no more complex than standard classification decisions
    1.9 Search
      1.9.1 Greedy Inference
      1.9.2 Beam Search
      1.9.3 Viterbi Inference
    1.10 CRFs
  2. Extracting relations from text
    2.1 Extracting relation triples from text
    2.2 Why Relation Extraction?
    2.3 Automated Content Extraction (ACE)
    2.4 UMLS: Unified Medical Language System
    2.5 Databases of Wikipedia Relations
  3. How to build relation extractors
    3.1 Rules for extracting IS-A relation
      3.1.1 Extracting Richer Relations Using Rules
      3.1.2 Summary: Hand-built patterns for relations
    3.2 Supervised machine learning for relations
    3.3 Gazetteer and trigger word features for relation extraction
    3.4 Classifiers for supervised methods
    3.5 Summary: Supervised Relation Extraction


1. IE

1.1 Simple Introduction

(slide images)

  • From a limited set of texts, find the relevant passages, gather information from them, and finally represent it in structured form.

  • IE systems extract clear, factual information

    • Roughly: who did what to whom, when?
  • E.g.,
    • Gathering earnings, profits, headquarters, etc. from company reports
      • "The headquarters of Alibaba Group, and the global headquarters of the combined Alibaba Group, are located in Hangzhou."
      • headquarters("Alibaba Group", "Hangzhou")
        • i.e., the fact is expressed in structured form
  • Learn drug-gene product interactions from medical research literature

1.2 Low-level information extraction

  • Now available, and popular, in applications like text apps, mail apps, etc.

  • Often seems to be based on regular expressions and name lists

(slide images)

1.3 Named Entity Recognition (NER)

  • A very important sub-task: find and classify names in text.
  • For example:

(slide image)

  • Person names, organization/institution names, geographic locations, time/date expressions, monetary values, and domain-specific entities.

1.4 Uses

  • Named entities can be indexed, linked off, etc.
  • Sentiment can be attributed to companies or products.
    • i.e., attributing expressed sentiment to the product in question
  • A lot of IE relations are associations between named entities.
  • For question answering, answers are often named entities.

1.5 Concretely

  • Many web pages tag various entities, with links to topic pages, etc.

(slide image)

  • Google, Apple, …: smart recognizers for document content
    • NER improves the news-reading experience: recognized entities are linked to their corresponding URLs.

(slide image)

  • Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents)

  • The measures behave a bit funnily for IE/NER when there are boundary errors (which are common):

    • 紫金山森林公园位于南京市玄武区 ("Zijin Mountain Forest Park is located in Xuanwu District, Nanjing")
    • First Bank of China
      • The system may recognize only "Bank of China"
      • This counts as both a false positive (fp) and a false negative (fn)
  • Selecting nothing would have been better!

  • Some other metrics (e.g., the MUC scorer) give partial credit (according to complex rules)

1.6 Sequence Models for Named Entity Recognition

(slide image)

  • Training
    • Collect a set of representative training documents
    • Label each token for its entity class or other (O)
    • Design feature extractors appropriate to the text and classes
    • Train a sequence classifier to predict the labels from the data
  • Testing/Classifying
    • Receive a set of testing documents
    • Run sequence model inference to label each token
    • Appropriately output the recognized entities

(slide image)

  • The first encoding (IO) suffers from boundary problems: e.g., the person name "Mengqiu Huang" should be one entity, but recognition may split it in two.
    • With C classes there are C+1 labels, so the label space is smaller than in the second scheme.
    • The cause is that several PER tokens appear in a row, and we cannot tell whether to bundle them into one entity; but adjacent entities are usually not of the same class, so IO is often adequate on large corpora.
  • In the second encoding (BIO), B marks the beginning of an entity and I continues the entity from the previous token.

    • There are 2C+1 labels, so it is less efficient, but it yields higher accuracy.
  • More encoding schemes exist (IOBE, IOBS), but one must weigh training cost against accuracy. A sketch of decoding BIO tags into entity spans follows.
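To make the BIO scheme concrete, here is a minimal sketch (my own illustration, not from the notes) that decodes a BIO tag sequence into labeled entity spans:

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (label, start, end) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):  # a new entity begins here
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue  # the current entity continues
        else:  # "O" or an inconsistent I- tag closes any open entity
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans

tokens = ["Mengqiu", "Huang", "works", "at", "Stanford"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(bio_to_spans(tokens, tags))  # [('PER', 0, 2), ('ORG', 4, 5)]
```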

1.7 Features

1.7.1 Features for sequence labeling

  • Words
    • Current word
    • Previous/next word (context)
  • Other kinds of inferred linguistic classification (semantic-level and syntactic-level features)
    • Part-of-speech tags
  • Label context
    • Previous (and perhaps next) label (see the feature-extractor sketch below)
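For instance, a per-token feature extractor might look like the following minimal sketch (my own illustration; the function and feature names are hypothetical):

```python
def token_features(tokens, i, prev_label):
    """Feature dict for position i, conditioned on the previous label."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
        "is_capitalized": word[0].isupper(),
        "suffix3": word[-3:].lower(),
        "prev_label": prev_label,  # label context, as in an MEMM
    }

print(token_features(["Steve", "Jobs", "founded", "Apple"], 1, "B-PER"))
```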

1.7.2 Features: Word substrings

(slide image)

  • Whenever a substring like "xazo" appears, the word is a drug; "field" suggests a place; a colon suggests a movie title.
    • Features at this level of granularity are very effective for the downstream task.

1.7.3 Word Shapes

  • Map words to a simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.
  • The shape of a word itself carries information (see the sketch after the image below).

(slide image)
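A minimal word-shape sketch (my own illustration; the exact character classes and the run-collapsing rule are one common choice, not the only one):

```python
import re

def word_shape(word, max_len=4):
    """Map a word to a shape string: X/x/d for upper/lower/digit characters,
    punctuation kept as-is, repeated runs collapsed (e.g. "mRNA" -> "xX")."""
    classes = []
    for ch in word:
        if ch.isupper():
            classes.append("X")
        elif ch.islower():
            classes.append("x")
        elif ch.isdigit():
            classes.append("d")
        else:
            classes.append(ch)
    squeezed = re.sub(r"(.)\1+", r"\1", "".join(classes))  # collapse runs
    return squeezed[:max_len] + ("…" if len(squeezed) > max_len else "")

for w in ["Varicella-zoster", "mRNA", "CPA1", "3.2%"]:
    print(w, "->", word_shape(w))  # Xx-x, xX, Xd, d.d%
```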

1.8 Maximum entropy Markov models (MEMMs) or Conditional Markov models

1.8.1 Sequence problems

  • Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences…
  • We can think of our task as one of labeling each item

(slide image)

  • Given an input sequence, recognize/label each chunk of text.

1.8.2 MEMM Inference in Systems

  • For a Conditional Markov Model (CMM), a.k.a. a Maximum Entropy Markov Model (MEMM), the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions
  • A larger space of sequences is usually explored via search

1.8.3 Scoring individual labeling decisions is no more complex than standard classification decisions

  • We have some assumed labels to use for prior positions

  • We use features of those and the observed data (which can include current, previous, and next words) to predict the current label (the factorization is written out after the images below)

(slide images)
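Written out (this is the standard MEMM factorization, added here for clarity rather than copied from the slides), the model chains local maximum-entropy decisions:

```latex
P(s_1,\dots,s_n \mid w_{1:n}) = \prod_{i=1}^{n} P(s_i \mid s_{i-1}, w_{1:n}, i),
\qquad
P(s_i \mid s_{i-1}, w_{1:n}, i)
  = \frac{\exp\big(\sum_k \lambda_k f_k(s_i, s_{i-1}, w_{1:n}, i)\big)}
         {\sum_{s'} \exp\big(\sum_k \lambda_k f_k(s', s_{i-1}, w_{1:n}, i)\big)}
```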

1.9 Search

  • We search the space of label sequences; the simplest strategy is greedy: at each step choose the label that maximizes the score for the current word.

1.9.1 Greedy Inference

(slide image)

Greedy inference:
  • We just start at the left and use our classifier at each position to assign a label.
  • The classifier can depend on previous labeling decisions as well as observed data.
Advantages:
  • Fast; no extra memory requirements.
  • Very easy to implement.
  • With rich features including observations to the right, it may perform quite well.
Disadvantage:
  • Greedy: once we commit to a labeling error we cannot recover from it (a decoding sketch follows).
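A minimal greedy-decoding sketch (my own illustration; `score(tokens, i, prev_label, label)` is a hypothetical local scorer, e.g. a log-probability from the per-position classifier above):

```python
def greedy_decode(tokens, labels, score):
    """Left-to-right decoding: commit to the best label at each position."""
    out, prev = [], "<S>"
    for i in range(len(tokens)):
        best = max(labels, key=lambda y: score(tokens, i, prev, y))
        out.append(best)
        prev = best  # committed: a mistake here cannot be undone later
    return out
```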

1.9.2 Beam Search

(slide image)

Beam inference:
  • At each position keep the top k complete sequences.
  • Extend each sequence in each local way.
  • The extensions compete for the k slots at the next position.
Advantages:
  • Fast; beam sizes of 3-5 are almost as good as exact inference in many cases.
  • Easy to implement (no dynamic programming required).
Disadvantage:
  • Inexact: the globally best sequence can fall off the beam (see the sketch below).
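A minimal beam-search sketch (my own illustration, using the same hypothetical `score` function as the greedy sketch):

```python
import heapq

def beam_decode(tokens, labels, score, k=3):
    """Keep the k best partial label sequences at each position."""
    beam = [(0.0, [], "<S>")]  # (total score, labels so far, previous label)
    for i in range(len(tokens)):
        candidates = [
            (total + score(tokens, i, prev, y), seq + [y], y)
            for total, seq, prev in beam
            for y in labels
        ]
        beam = heapq.nlargest(k, candidates, key=lambda c: c[0])  # k slots
    return max(beam, key=lambda c: c[0])[1]
```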

(slide images)

1.9.3 Viterbi Inference

(slide image)

Viterbi inference:
  • Dynamic programming or memoization.
  • Requires a small window of state influence (e.g., only the past two states are relevant).
Advantages:
  • Exact: the globally best sequence is returned (see the sketch below).
Disadvantage:
  • Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway).
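A minimal Viterbi sketch (my own illustration, same hypothetical `score` function; first-order label dependence):

```python
def viterbi_decode(tokens, labels, score):
    """Exact best sequence under a first-order model, via dynamic programming."""
    n = len(tokens)
    # delta[i][y] = best score of any labeling of tokens[0..i] that ends in y
    delta = [{y: score(tokens, 0, "<S>", y) for y in labels}]
    back = [{}]
    for i in range(1, n):
        delta.append({})
        back.append({})
        for y in labels:
            p = max(labels, key=lambda q: delta[i - 1][q] + score(tokens, i, q, y))
            delta[i][y] = delta[i - 1][p] + score(tokens, i, p, y)
            back[i][y] = p  # remember the best predecessor
    y = max(labels, key=lambda l: delta[n - 1][l])
    seq = [y]
    for i in range(n - 1, 0, -1):  # follow backpointers
        y = back[i][y]
        seq.append(y)
    return list(reversed(seq))
```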

1.10 CRFs

Reference: "CRF条件随机场的原理、例子、公式推导和应用" (the principles, examples, derivation, and applications of CRFs), Zhihu (zhihu.com)

  • Another sequence model: Conditional Random Fields (CRFs)
  • A whole-sequence conditional model rather than a chaining of local models.
  • The space of c’s is now the space of sequences (the model is written out after this list)
    • But if the features f_k remain local, the conditional sequence likelihood can be calculated exactly using dynamic programming
  • Training is slower, but CRFs avoid causal-competition biases
  • These (or a variant using a max-margin criterion) are seen as the state of the art these days… but in practice they usually work much the same as MEMMs.
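Concretely (standard linear-chain CRF form, added for clarity), the whole-sequence conditional model is

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big(\sum_{t=1}^{n}\sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)\Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big(\sum_{t=1}^{n}\sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, \mathbf{x}, t)\Big)
```

Unlike the MEMM above, the normalizer Z(x) sums over whole label sequences, which is what avoids the per-step competition (label-bias) problem.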

2. Extracting relations from text

  • Company report: "International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…"
  • Extracted complex relation:
    • Company-Founding
      • Company: IBM
      • Location: New York
      • Date: June 16, 1911
      • Original-Name: Computing-Tabulating-Recording Co.
  • But we will focus on the simpler task of extracting relation triples
    • Founding-year(IBM, 1911)
    • Founding-location(IBM, New York)
    • i.e., extract relations from the text

(slide image)

2.1 Extracting relation triples from text

(slide image)

2.2 Why Relation Extraction?

  • NER: find and classify; relation extraction, too, ultimately becomes a classification problem

  • Create new structured knowledge graphs, useful for any app

  • Augment current knowledge graphs

    • Adding words to the WordNet thesaurus, facts to FreeBase or DBPedia
  • Support question answering

    • The granddaughter of which actor starred in the movie "E.T."?
      (acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y)
  • But which relations should we extract?

2.3 Automated Content Extraction (ACE)

(slide image)

  • Physical-Located PER-GPE
    • He was in Tennessee
  • Part-Whole-Subsidiary ORG-ORG
    • XYZ, the parent company of ABC
  • Person-Social-Family PER-PER
    • John's wife Yoko
  • Org-AFF-Founder PER-ORG
    • Steve Jobs, co-founder of Apple

2.4 UMLS: Unified Medical Language System

  • 134 entity types, 54 relations

(slide image)

Extracting UMLS relations from a sentence

  • Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
    • Echocardiography, Doppler DIAGNOSES Acquired stenosis

2.5 Databases of Wikipedia Relations

(slide image)

3. How to build relation extractors

  • Hand-written patterns
  • Supervised machine learning
  • Semi-supervised and unsupervised
    • Bootstrapping (using seeds)
    • Distant supervision
    • Unsupervised learning from the web

3.1 Rules for extracting IS-A relation

  • Early intuition from Hearst (1992)
    • "Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use"
  • What does Gelidium mean?
  • How do you know?

Hearst’s Patterns for extracting IS-A relations

  • Patterns expressing "X is a Y" (a regex sketch follows this list):
    • "Y such as X ((, X)* (, and|or) X)"
    • "such Y as X"
    • "X or other Y"
    • "X and other Y"
    • "Y including X"
    • "Y, especially X"
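A minimal sketch (my own illustration) of matching one Hearst pattern, "Y such as X", with a naive regex in place of real NP chunking:

```python
import re

# "Y such as X": Y is one or two lowercase words, X a capitalized word.
PATTERN = re.compile(r"(?P<y>\w+(?:\s\w+)?)\ssuch\sas\s(?P<x>[A-Z]\w+)")

sent = "Agar is a substance prepared from a mixture of red algae such as Gelidium."
m = PATTERN.search(sent)
if m:
    print(f"IS-A({m.group('x')}, {m.group('y')})")  # IS-A(Gelidium, red algae)
```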

3.1.1 Extracting Richer Relations Using Rules

  • Intuition: relations often hold between specific entities.
    • located-in(ORGANIZATION,LOCATION)
    • founded (PERSON,ORGANIZATION)
    • cures (DRUG, DISEASE)
  • Start with Named Entity tags to help extract relation!
  • Once the named-entity types are already known, it is much easier to determine the relation between the entities.

image-20211102115429268

  • But this does not always hold:

image-20211102115515916

  • Who holds what office in what organization? (a sketch over NE-tagged text follows this list)
    • PERSON, POSITION of ORG
      • George Marshall, Secretary of State of the United States
    • PERSON (named|appointed|chose|etc.) PERSON Prep? POSITION
      • Truman appointed Marshall Secretary of State
    • PERSON [be]? (named|appointed|etc.) Prep? ORG POSITION
      • George Marshall was named US Secretary of State
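A minimal sketch (my own illustration) of applying the first pattern over text whose named entities have already been tagged inline (the bracket format here is made up for the example):

```python
import re

tagged = "[PER George Marshall] , Secretary of State of [ORG the United States]"
m = re.search(
    r"\[PER (?P<per>[^\]]+)\] , (?P<pos>[\w ]+) of \[ORG (?P<org>[^\]]+)\]",
    tagged,
)
if m:
    print(f"holds-office({m.group('per')}, {m.group('pos')}, {m.group('org')})")
    # holds-office(George Marshall, Secretary of State, the United States)
```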

3.1.2 Summary: Hand-built patterns for relations

Plus:

  • Human patterns tend to be high-precision.
  • Can be tailored to specific domains

Minus:

  • Human patterns are often low-recall
  • A lot of work to think of all possible patterns!
  • We don't want to have to do this for every relation!
  • We'd like better accuracy

3.2 Supervised machine learning for relations

(slide image)

  • Choose a set of relations we’d like to extract
  • Choose a set of relevant named entities
  • Find and label data
    • Choose a representative corpus
    • Label the named entities in the corpus
    • Hand-label the relations between these entities
      • i.e., annotate with an NLP tool and finally export structured data such as CSV
    • Break into training, development, and test sets
Steps:
  • Find all pairs of named entities (usually in the same sentence)

  • Decide if the entities are related

    • First check whether any relation holds at all; if not, filter the pair out immediately
  • If yes, classify the relation

Why the extra step?
  • Whether a relation exists is decided from local features; those features can then be bundled and passed on to the relation classifier (a sketch of the two stages follows)

  • Faster classification training by eliminating most pairs.

  • Can use distinct feature sets appropriate for each task.
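A minimal sketch (my own illustration) of the two-stage scheme; `detector`, `classifier`, and `pair_features` are hypothetical stand-ins for a trained relation detector, a trained relation classifier, and a feature extractor:

```python
def extract_relations(entity_pairs, pair_features, detector, classifier):
    """Stage 1 filters unrelated pairs; stage 2 names the relation."""
    triples = []
    for e1, e2, sentence in entity_pairs:
        feats = pair_features(e1, e2, sentence)
        if detector.predict([feats])[0]:          # stage 1: related at all?
            rel = classifier.predict([feats])[0]  # stage 2: which relation?
            triples.append((rel, e1, e2))
    return triples
```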

(slide image)

  • After running entity recognition on each sentence, it is turned into one row of the dataset, as above.

(slide images)

  • From the sentence's syntactic dependencies we obtain the dependency path between the entities, as above.

3.3 Gazetteer and trigger word features for relation extraction

  • Trigger list for family: kinship terms
    • parent, wife, husband, grandparent, etc. [from WordNet]
  • Gazetteer:

    • Lists of useful geo or geopolitical words
      • Country name list
      • Other sub-entities
  • Example: "American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said." (a feature sketch follows the slide image below)

(slide image)
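A minimal sketch (my own illustration) of trigger-word and gazetteer features for an entity pair; the tiny word lists stand in for WordNet kinship terms and a country-name gazetteer:

```python
KINSHIP_TRIGGERS = {"parent", "wife", "husband", "grandparent", "son", "daughter"}
COUNTRY_GAZETTEER = {"United States", "China", "France", "Japan"}

def gazetteer_features(e1, e2, words_between):
    return {
        "has_kinship_trigger": any(w.lower() in KINSHIP_TRIGGERS for w in words_between),
        "e1_is_country": e1 in COUNTRY_GAZETTEER,
        "e2_is_country": e2 in COUNTRY_GAZETTEER,
    }

print(gazetteer_features("John", "Yoko", ["'s", "wife"]))
# {'has_kinship_trigger': True, 'e1_is_country': False, 'e2_is_country': False}
```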

3.4 Classifiers for supervised methods

  • Now you can use any classifier you like

    • MaxEnt
    • Naive Bayes
    • SVM
  • Train it on the training set, tune on the dev set, test on the test set

3.5 Summary: Supervised Relation Extraction

Plus:

  • Can get high accuracies with enough hand-labeled training data, if the test data is similar enough to the training data

Minus:

  • Labeling a large training set is expensive

  • Supervised models are brittle and don't generalize well to different genres
