Language Modeling
The Language Modeling Problem
Setup: assume a (finite) vocabulary of words $\mathcal{V}$, e.g. $\mathcal{V} = \{\text{the, a, man, telescope, } \ldots\}$
We can construct an (infinite) set of strings $\mathcal{V}^{\dagger} = \{ w_1 w_2 \ldots w_n : n \ge 1,\; w_i \in \mathcal{V} \}$
Data: given a training set of example sentences $x \in \mathcal{V}^{\dagger}$
Problem: estimate a probability distribution $p$ with $\sum_{x \in \mathcal{V}^{\dagger}} p(x) = 1$ and $p(x) \ge 0$ for all $x \in \mathcal{V}^{\dagger}$
Probabilistic Language Modeling
Goal
- assign a probability to a sentence (a sequence of words): $P(W) = P(w_1, w_2, \ldots, w_n)$
Related task
- probability of an upcoming word: $P(w_n \mid w_1, w_2, \ldots, w_{n-1})$
A model that computes either of these is called a language model or LM
How to compute $P(W)$: use the chain rule of probability: $P(w_1 w_2 \ldots w_n) = \prod_{i} P(w_i \mid w_1 w_2 \ldots w_{i-1})$
e.g. $P(\text{I want english food}) = P(\text{I}) \, P(\text{want} \mid \text{I}) \, P(\text{english} \mid \text{I want}) \, P(\text{food} \mid \text{I want english})$
Markov Assumption
- First-order Markov processes: $P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$
- Second-order Markov processes: $P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$
- i.e. we approximate each component in the chain-rule product by conditioning only on the most recent word(s)
Unigram model
- $P(w_1 w_2 \ldots w_n) \approx \prod_{i} P(w_i)$
Problem
- each word is generated independently, so word order and context are ignored
Bigram model
- condition on the previous word: $P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$
Estimating bigram probabilities: maximum likelihood estimate $P(w_i \mid w_{i-1}) = \dfrac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$
e.g. for the following mini-corpus (a code sketch follows it):
<s> I am Sam </s>
<s> Sam am I </s>
<s> I do not like green eggs and ham </s>
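A minimal sketch in Python of computing these maximum-likelihood bigram estimates from the mini-corpus (the counting scheme and helper names here are my own, not prescribed by the notes):

```python
from collections import defaultdict

# Toy corpus from the notes above
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam am I </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = defaultdict(int)   # count(w_{i-1}, w_i)
unigram_counts = defaultdict(int)  # count(w_{i-1}): counts of words in the "previous" position

for sentence in corpus:
    tokens = sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[(prev, curr)] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate P(curr | prev) = count(prev, curr) / count(prev)."""
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))   # 2/3: "I" follows <s> in two of the three sentences
print(bigram_prob("am", "Sam"))  # 1/2
```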
Raw bigram counts
Raw unigram counts
Raw bigram probabilities
Example: $P(\text{<s> I want english food </s>})$
- Given the bigram probabilities estimated above
- Get $P(\text{<s> I want english food </s>}) = P(\text{I} \mid \text{<s>}) \, P(\text{want} \mid \text{I}) \, P(\text{english} \mid \text{want}) \, P(\text{food} \mid \text{english}) \, P(\text{</s>} \mid \text{food})$
Practical Issue
- We do everything in log space: $p_1 \times p_2 \times p_3 \times p_4 = \exp(\log p_1 + \log p_2 + \log p_3 + \log p_4)$ (see the sketch below)
- avoids underflow
- adding is faster than multiplying
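For instance, a sentence can be scored in log space as below, reusing the hypothetical bigram_prob helper from the earlier sketch:

```python
import math

def sentence_logprob(tokens):
    """Sum of log bigram probabilities; exponentiating recovers the raw product."""
    return sum(math.log(bigram_prob(prev, curr))
               for prev, curr in zip(tokens, tokens[1:]))

# log(2/3) + log(1/3) + log(1/2) + log(1/2)
print(sentence_logprob("<s> I am Sam </s>".split()))
```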
N-gram model
- We can extend to trigrams, 4-grams, 5-grams
- In general this is an insufficient model of language
- because language has long-distance dependencies
- e.g. "The computer which I had just put into the machine room on the fifth floor crashed" (the verb "crashed" depends on "computer", many words earlier)
- But we can often get away with N-gram models
How to evaluate language models: Perplexity
- Perplexity is the inverse probability of the test set, normalized by the number of words: $PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$
- Lower perplexity = better model
- Chain rule: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$
- Bigrams: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_{i-1})}}$ (a code sketch follows)
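A small sketch of bigram perplexity on a toy "test" sentence, again assuming the hypothetical bigram_prob helper from the earlier sketch:

```python
import math

def perplexity(tokens):
    """PP(W) = exp(-(1/N) * sum of log bigram probabilities): the inverse
    probability of the sequence, normalized by the number of predictions N."""
    n = len(tokens) - 1  # number of bigram predictions made
    log_prob = sum(math.log(bigram_prob(prev, curr))
                   for prev, curr in zip(tokens, tokens[1:]))
    return math.exp(-log_prob / n)

# Any unseen bigram would give P = 0 and break this, which is
# exactly the motivation for smoothing in the next section.
print(perplexity("<s> I am Sam </s>".split()))
```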
Smoothing
Add-one estimation / Laplace smoothing
- Pretend we saw each word one more time than we did: $P_{\text{Add-1}}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$, where $V$ is the vocabulary size (sketch below)
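A quick sketch of the Laplace-smoothed estimate, reusing the toy counts from the earlier sketch (the vocabulary here simply consists of every token type in the toy corpus):

```python
# Vocabulary size V over the toy corpus (including <s> and </s>)
V = len({tok for sentence in corpus for tok in sentence.split()})

def add_one_bigram_prob(prev, curr):
    """Laplace estimate: (count(prev, curr) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, curr)] + 1) / (unigram_counts[prev] + V)

# Nonzero even though the bigram (I, Sam) never occurs in the corpus
print(add_one_bigram_prob("I", "Sam"))
```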
Advanced smoothing algorithms
Good-Turing
- Replace the zero frequency of unseen events using the frequency of things we've seen only once: $P_{GT}(\text{unseen}) = \frac{N_1}{N}$; more generally, re-estimate counts as $c^* = \frac{(c+1)\,N_{c+1}}{N_c}$, where $N_c$ is the number of n-gram types seen exactly $c$ times (sketch below)
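A tiny illustration of the Good-Turing idea on the toy bigram counts built earlier (just the reserved mass for unseen events, not a full Good-Turing smoother):

```python
from collections import Counter

# Frequency of frequencies: N_c = number of distinct bigram types seen exactly c times
freq_of_freq = Counter(bigram_counts.values())

N = sum(bigram_counts.values())  # total bigram tokens observed
N1 = freq_of_freq[1]             # bigram types seen exactly once

# Good-Turing: total probability mass reserved for all unseen bigrams
print(N1 / N)
```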
Kneser-Ney
Interpolation
- Simple interpolation: $\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$ (a sketch follows this list)
- Choose the $\lambda$s to maximize the probability of held-out validation data
- Fix the N-gram probabilities (without smoothing) on the training data
- Then search for the $\lambda$s that give the largest probability to the validation set
- Can use any optimization technique (line search or EM usually easiest)
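A minimal sketch of simple interpolation between bigram and unigram MLE estimates, reusing the earlier toy counts; the λ weights below are arbitrary placeholders rather than values tuned on a validation set as described above:

```python
from collections import Counter

# Unigram counts over all tokens (including sentence-final </s>)
word_counts = Counter(tok for sentence in corpus for tok in sentence.split())
total_tokens = sum(word_counts.values())

def interpolated_prob(prev, curr, lam_bi=0.7, lam_uni=0.3):
    """Simple interpolation: lam_bi * P_MLE(curr | prev) + lam_uni * P_MLE(curr).
    The lambdas must sum to 1."""
    bigram = (bigram_counts[(prev, curr)] / unigram_counts[prev]
              if unigram_counts[prev] else 0.0)
    unigram = word_counts[curr] / total_tokens
    return lam_bi * bigram + lam_uni * unigram

# Nonzero via the unigram term even though the bigram (I, Sam) was never seen
print(interpolated_prob("I", "Sam"))
```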
Unknown words
- If we know all the words in advance
- Vocabulary V is fixed
- Closed vocabulary task
- Often we don’t know this
- Out Of Vocabulary = OOV words
- Open vocabulary task
- Instead: create an unknown word token <UNK>
- Training of <UNK> probabilities
- Create a fixed lexicon L of size V
- At the text normalization phase, change any training word not in L to <UNK>
- Now we train its probabilities like a normal word
- At decoding time
- If text input: use <UNK> probabilities for any word not in the training lexicon (see the sketch after this list)
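As a rough sketch of the open-vocabulary handling above (the choice of lexicon and the function name are mine; it reuses the toy corpus from the earlier sketch):

```python
UNK = "<UNK>"

# Fixed lexicon L: here simply every word type seen in training
# (a real system might instead keep only the V most frequent words)
lexicon = {tok for sentence in corpus for tok in sentence.split()}

def normalize(tokens):
    """Replace any token outside the lexicon with <UNK>; applied at text
    normalization time for training data and again to input text at decoding time."""
    return [tok if tok in lexicon else UNK for tok in tokens]

print(normalize("<s> I like green spam </s>".split()))
# ['<s>', 'I', 'like', 'green', '<UNK>', '</s>']
```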