Language Modeling
The Language Modeling Problem
Setup: assume a (finite) vocabulary of words $\mathcal{V}$, e.g. $\mathcal{V} = \{\text{the, a, man, telescope, } \ldots\}$
We can construct an (infinite) set of strings $\mathcal{V}^{\dagger} = \{ w_1 w_2 \ldots w_n : n \ge 1,\; w_i \in \mathcal{V} \}$
Data: given a training set of example sentences $x \in \mathcal{V}^{\dagger}$
Problem: estimate a probability distribution $p$ with $\sum_{x \in \mathcal{V}^{\dagger}} p(x) = 1$ and $p(x) \ge 0$ for all $x \in \mathcal{V}^{\dagger}$
Probabilistic Language Modeling
Goal
- assign a probability to a sentence (a sequence of words): $P(W) = P(w_1, w_2, \ldots, w_n)$
Related task
- probability of an upcoming word: $P(w_n \mid w_1, w_2, \ldots, w_{n-1})$
A model that computes either of these is called a language model or LM
How to compute $P(W)$: use the chain rule of probability: $P(w_1 w_2 \ldots w_n) = \prod_{i} P(w_i \mid w_1 w_2 \ldots w_{i-1})$
e.g. $P(\text{I want english food}) = P(\text{I}) \, P(\text{want} \mid \text{I}) \, P(\text{english} \mid \text{I want}) \, P(\text{food} \mid \text{I want english})$
Markov Assumption
- First-order Markov processes: $P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$
- Second-order Markov processes: $P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$
- i.e. we approximate each component in the chain-rule product by conditioning only on the most recent word(s)
Unigram model
- $P(w_1 w_2 \ldots w_n) \approx \prod_{i} P(w_i)$
Problem
- each word is generated independently, so word order and context are ignored
Bigram model
- condition on the previous word: $P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$
Estimating bigram probabilities: maximum likelihood estimate $P(w_i \mid w_{i-1}) = \dfrac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$
e.g. for the following mini-corpus (a code sketch follows it):
<s> I am Sam </s>
<s> Sam am I </s>
<s> I do not like green eggs and ham </s>
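A minimal sketch in Python of computing these maximum-likelihood bigram estimates from the mini-corpus (the counting scheme and helper names here are my own, not prescribed by the notes):

```python
from collections import defaultdict

# Toy corpus from the notes above
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam am I </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = defaultdict(int)   # count(w_{i-1}, w_i)
unigram_counts = defaultdict(int)  # count(w_{i-1}): counts of words in the "previous" position

for sentence in corpus:
    tokens = sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[(prev, curr)] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate P(curr | prev) = count(prev, curr) / count(prev)."""
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))   # 2/3: "I" follows <s> in two of the three sentences
print(bigram_prob("am", "Sam"))  # 1/2
```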
Raw bigram counts
Raw unigram counts
Raw bigram probabilities
Example: $P(\text{<s> I want english food </s>})$
- Given the bigram probabilities estimated above
- Get $P(\text{<s> I want english food </s>}) = P(\text{I} \mid \text{<s>}) \, P(\text{want} \mid \text{I}) \, P(\text{english} \mid \text{want}) \, P(\text{food} \mid \text{english}) \, P(\text{</s>} \mid \text{food})$
Practical Issue
- We do everything in log space: $p_1 \times p_2 \times p_3 \times p_4 = \exp(\log p_1 + \log p_2 + \log p_3 + \log p_4)$ (see the sketch below)
- avoids underflow
- adding is faster than multiplying
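For instance, a sentence can be scored in log space as below, reusing the hypothetical bigram_prob helper from the earlier sketch:

```python
import math

def sentence_logprob(tokens):
    """Sum of log bigram probabilities; exponentiating recovers the raw product."""
    return sum(math.log(bigram_prob(prev, curr))
               for prev, curr in zip(tokens, tokens[1:]))

# log(2/3) + log(1/3) + log(1/2) + log(1/2)
print(sentence_logprob("<s> I am Sam </s>".split()))
```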
N-gram model
- We can extend to trigrams, 4-grams, 5-grams
- In general this is an insufficient model of language
- because language has long-distance dependencies
- e.g. "The computer which I had just put into the machine room on the fifth floor crashed" (the verb "crashed" depends on "computer", many words earlier)
- But we can often get away with N-gram models
How to evaluate language models: Perplexity
- Perplexity is the inverse probability of the test set, normalized by the number of words: $PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$
- Lower perplexity = better model
- Chain rule: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$
- Bigrams: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_{i-1})}}$ (a code sketch follows)
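A small sketch of bigram perplexity on a toy "test" sentence, again assuming the hypothetical bigram_prob helper from the earlier sketch:

```python
import math

def perplexity(tokens):
    """PP(W) = exp(-(1/N) * sum of log bigram probabilities): the inverse
    probability of the sequence, normalized by the number of predictions N."""
    n = len(tokens) - 1  # number of bigram predictions made
    log_prob = sum(math.log(bigram_prob(prev, curr))
                   for prev, curr in zip(tokens, tokens[1:]))
    return math.exp(-log_prob / n)

# Any unseen bigram would give P = 0 and break this, which is
# exactly the motivation for smoothing in the next section.
print(perplexity("<s> I am Sam </s>".split()))
```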
Smoothing
Add-one estimation / Laplace smoothing
- Pretend we saw each word one more time than we did: $P_{\text{Add-1}}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$, where $V$ is the vocabulary size (sketch below)
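A quick sketch of the Laplace-smoothed estimate, reusing the toy counts from the earlier sketch (the vocabulary here simply consists of every token type in the toy corpus):

```python
# Vocabulary size V over the toy corpus (including <s> and </s>)
V = len({tok for sentence in corpus for tok in sentence.split()})

def add_one_bigram_prob(prev, curr):
    """Laplace estimate: (count(prev, curr) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, curr)] + 1) / (unigram_counts[prev] + V)

# Nonzero even though the bigram (I, Sam) never occurs in the corpus
print(add_one_bigram_prob("I", "Sam"))
```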
Advanced smoothing algorithms
Good-Turing
- Replace the zero frequency of unseen events using the frequency of things we've seen only once: $P_{GT}(\text{unseen}) = \frac{N_1}{N}$; more generally, re-estimate counts as $c^* = \frac{(c+1)\,N_{c+1}}{N_c}$, where $N_c$ is the number of n-gram types seen exactly $c$ times (sketch below)
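A tiny illustration of the Good-Turing idea on the toy bigram counts built earlier (just the reserved mass for unseen events, not a full Good-Turing smoother):

```python
from collections import Counter

# Frequency of frequencies: N_c = number of distinct bigram types seen exactly c times
freq_of_freq = Counter(bigram_counts.values())

N = sum(bigram_counts.values())  # total bigram tokens observed
N1 = freq_of_freq[1]             # bigram types seen exactly once

# Good-Turing: total probability mass reserved for all unseen bigrams
print(N1 / N)
```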
Kneser-Ney
Interpolation
- Simple interpolation: $\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$ (a sketch follows this list)
- Choose the $\lambda$s to maximize the probability of held-out validation data
- Fix the N-gram probabilities (without smoothing) on the training data
- Then search for the $\lambda$s that give the largest probability to the validation set
- Can use any optimization technique (line search or EM usually easiest)
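A minimal sketch of simple interpolation between bigram and unigram MLE estimates, reusing the earlier toy counts; the λ weights below are arbitrary placeholders rather than values tuned on a validation set as described above:

```python
from collections import Counter

# Unigram counts over all tokens (including sentence-final </s>)
word_counts = Counter(tok for sentence in corpus for tok in sentence.split())
total_tokens = sum(word_counts.values())

def interpolated_prob(prev, curr, lam_bi=0.7, lam_uni=0.3):
    """Simple interpolation: lam_bi * P_MLE(curr | prev) + lam_uni * P_MLE(curr).
    The lambdas must sum to 1."""
    bigram = (bigram_counts[(prev, curr)] / unigram_counts[prev]
              if unigram_counts[prev] else 0.0)
    unigram = word_counts[curr] / total_tokens
    return lam_bi * bigram + lam_uni * unigram

# Nonzero via the unigram term even though the bigram (I, Sam) was never seen
print(interpolated_prob("I", "Sam"))
```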
Unknown words
- If we know all the words in advance
- Vocabulary V is fixed
- Closed vocabulary task
- Often we don’t know this
- Out Of Vocabulary = OOV words
- Open vocabulary task
- Instead: create an unknown word token <UNK>
- Training of <UNK> probabilities
- Create a fixed lexicon L of size V
- At the text normalization phase, change any training word not in L to <UNK>
- Now we train its probabilities like a normal word
- At decoding time
- If text input: use <UNK> probabilities for any word not in the training lexicon (see the sketch after this list)
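As a rough sketch of the open-vocabulary handling above (the choice of lexicon and the function name are mine; it reuses the toy corpus from the earlier sketch):

```python
UNK = "<UNK>"

# Fixed lexicon L: here simply every word type seen in training
# (a real system might instead keep only the V most frequent words)
lexicon = {tok for sentence in corpus for tok in sentence.split()}

def normalize(tokens):
    """Replace any token outside the lexicon with <UNK>; applied at text
    normalization time for training data and again to input text at decoding time."""
    return [tok if tok in lexicon else UNK for tok in tokens]

print(normalize("<s> I like green spam </s>".split()))
# ['<s>', 'I', 'like', 'green', '<UNK>', '</s>']
```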