Language Modeling

The Language Modeling Problem

  • Setup: assume a (finite) vocabulary of words: $\mathcal{V}$

  • We can construct an (infinite) set of strings: $\mathcal{V}^{\dagger}$, the set of all finite word sequences drawn from $\mathcal{V}$

  • Data: given a training set of example sentences $x \in \mathcal{V}^{\dagger}$

  • Problem: estimate a probability distribution $p$ with $p(x) \geq 0$ and $\sum_{x \in \mathcal{V}^{\dagger}} p(x) = 1$

[figure: the language modeling problem]

Probabilistic Language Modeling

  • Goal

    • assign a probability to a sentence: $P(W) = P(w_1, w_2, \dots, w_n)$
  • Related task

    • probability of an upcoming word: $P(w_n \mid w_1, \dots, w_{n-1})$
  • A model that computes either of these is called a language model or LM

How to compute

  • Use the chain rule of probability: $P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})$

e.g. (worked expansion below)
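
As an illustration, the chain rule expands a short sentence into a product of conditional probabilities (the sentence here is just an example):

$$
P(\text{the cat sat}) = P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot P(\text{sat} \mid \text{the cat})
$$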

Markov Assumption

  • First-order Markov processes: $P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-1})$
  • Second-order Markov processes: $P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$
  • We approximate each component in the product with a limited history

Unigram model

  • Condition on nothing: $P(w_1 w_2 \dots w_n) \approx \prod_{i} P(w_i)$

Problem

  • Word order and context are ignored, so the model cannot capture dependencies between words

Bigram model

  • Condition on the previous word only: $P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-1})$

Estimating

  • Maximum likelihood estimate (MLE): $P(w_i \mid w_{i-1}) = \dfrac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}$

e.g.

  • <s> I am Sam </s>, <s> Sam am I </s>, <s> I do not like green eggs and ham </s> (see the sketch below)
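
A minimal sketch of MLE bigram estimation on this toy corpus (the function and variable names are my own, not from the notes):

```python
from collections import Counter

# Toy training corpus from the example above, already wrapped in <s> ... </s>
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam am I </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Count unigrams and bigrams over the tokenized sentences
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_mle(w_prev, w):
    """MLE estimate P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_mle("<s>", "I"))     # 2/3
print(bigram_mle("am", "Sam"))    # 1/2
print(bigram_mle("Sam", "</s>"))  # 1/2
```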

Raw bigram counts

[figure: raw bigram counts]

Raw unigram counts

[figure: raw unigram counts]

Raw bigram probabilities

[figure: raw bigram probabilities]

Example: $P(\text{I want english food})$

  • Given: the bigram probabilities estimated from the tables above
  • Get: $P(\text{<s> I want english food </s>}) = P(\text{I} \mid \text{<s>}) \, P(\text{want} \mid \text{I}) \, P(\text{english} \mid \text{want}) \, P(\text{food} \mid \text{english}) \, P(\text{</s>} \mid \text{food})$

Practical Issue

  • We do everything in log space: $\log(p_1 p_2 p_3 p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4$ (see the sketch below)
    • Avoid underflow
    • Adding is also faster than multiplying
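
A small illustration of why log space matters; the probabilities here are made up for demonstration:

```python
import math

# Hypothetical per-word probabilities for a long sentence (made-up values)
probs = [0.01] * 200

# Direct multiplication underflows to 0.0 for long products
direct = 1.0
for p in probs:
    direct *= p
print(direct)    # 0.0 due to floating-point underflow

# Summing log-probabilities stays well-behaved
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # ≈ -921.03, i.e. the log of the true product
```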

N-gram model

  • We can extend to trigrams, 4-grams, 5-grams
  • In general this is an insufficient model of language
    • because language has long-distance dependencies
      • e.g. "The computer which I had just put into the machine room on the fifth floor crashed"
  • But we can often get away with N-gram models (see the sketch below)
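
A minimal, generic n-gram extractor (my own helper, not from the original notes) showing how the same counting machinery extends from bigrams to higher orders:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "<s> I want english food </s>".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```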

How to evaluate language models: Perplexity

  • Perplexity is the inverse probability of the test set, normalized by the number of words: $PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}$
  • Lower perplexity = better model
  • Chain rule: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_1 \dots w_{i-1})}}$
  • Bigrams: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_{i-1})}}$ (see the sketch below)
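
A sketch of computing bigram perplexity for a single test sentence; the bigram probabilities in the table are made-up values for illustration:

```python
import math

# Hypothetical bigram probabilities P(w | w_prev) (made-up values)
bigram_prob = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.33,
    ("want", "english"): 0.01,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def bigram_perplexity(tokens):
    """Perplexity = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over the N bigram predictions."""
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = sum(math.log(bigram_prob[pair]) for pair in pairs)
    return math.exp(-log_prob / len(pairs))

print(bigram_perplexity("<s> I want english food </s>".split()))
```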

Smoothing

Add-one estimation / Laplace smoothing

  • Pretend we saw each word one more time than we did: $P_{\text{Add-1}}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$, where $V$ is the vocabulary size (see the sketch below)

[figures: add-one (Laplace) smoothed bigram estimates]
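
A minimal sketch of add-one smoothing on top of the earlier counting scheme; the counts and vocabulary size here are assumptions for illustration:

```python
from collections import Counter

# Toy counts (assumed for illustration)
unigram_counts = Counter({"I": 3, "am": 2, "Sam": 2})
bigram_counts = Counter({("I", "am"): 1, ("am", "Sam"): 1})
V = 11  # assumed vocabulary size, including <s> and </s>

def add_one_bigram(w_prev, w):
    """Laplace-smoothed estimate: (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(add_one_bigram("I", "am"))   # seen bigram: (1 + 1) / (3 + 11)
print(add_one_bigram("I", "Sam"))  # unseen bigram gets a small non-zero probability: 1 / 14
```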

Advanced smoothing algorithms

  • Good-Turing

    • Use the count of things we’ve seen only once to estimate the count of things we’ve never seen (see the formulas below)

    [figure: Good-Turing count re-estimation example]

  • Kneser-Ney: base the lower-order part on continuation probability (how many distinct contexts a word completes) rather than raw frequency
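
For reference, the standard textbook forms of these two ideas, where $N_c$ is the number of n-gram types occurring exactly $c$ times and $N$ is the total number of observed n-grams:

$$
c^{*} = (c+1)\,\frac{N_{c+1}}{N_{c}}, \qquad P^{*}_{\text{GT}}(\text{unseen}) = \frac{N_1}{N}
$$

$$
P_{\text{CONTINUATION}}(w) \propto \bigl|\{\, w_{i-1} : c(w_{i-1}, w) > 0 \,\}\bigr|
$$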

Interpolation

  • Simple interpolation: mix unigram, bigram, and trigram estimates, $\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)$ with $\sum_j \lambda_j = 1$ (see the sketch below)
  • Choose the λs to maximize the probability of validation data
    • Fix the N-gram probabilities without smoothing (on the training data)
    • Then search for λs that give largest probability to validation set
    • Can use any optimization technique (line search or EM usually easiest)
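
A sketch of simple linear interpolation; the component probability functions and λ values here are placeholders, not trained models:

```python
# Placeholder component models: each returns a probability for w in its context
def p_unigram(w):
    return 0.05  # assumed value for illustration

def p_bigram(w, w1):
    return 0.20  # assumed value for illustration

def p_trigram(w, w1, w2):
    return 0.40  # assumed value for illustration

# Interpolation weights, tuned on validation data; must sum to 1
LAMBDAS = (0.5, 0.3, 0.2)  # trigram, bigram, unigram

def interpolated(w, w1, w2):
    """Linear interpolation of trigram, bigram, and unigram estimates."""
    l3, l2, l1 = LAMBDAS
    return l3 * p_trigram(w, w1, w2) + l2 * p_bigram(w, w1) + l1 * p_unigram(w)

print(interpolated("food", "english", "want"))  # 0.5*0.4 + 0.3*0.2 + 0.2*0.05 = 0.27
```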

Unknown words

  • If we know all the words in advance
    • Vocabulary V is fixed
    • Closed vocabulary task
  • Often we don’t know this
    • Out Of Vocabulary = OOV words
    • Open vocabulary task
  • Instead: create an unknown word token <UNK>
    • Training of <UNK> probabilities
      • Create a fixed lexicon L of size V
      • At text normalization phase, any training word not in L is changed to <UNK>
      • Now we train its probabilities like a normal word
    • At decoding time
      • If text input: use <UNK> probabilities for any word not seen in training (see the sketch below)
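
A minimal sketch of the <UNK> replacement step; the lexicon-building rule (keep words seen at least twice) is an assumption for illustration:

```python
from collections import Counter

# Toy training text (illustrative)
train_tokens = "the cat sat on the mat the dog sat".split()

# Build a fixed lexicon L: here, keep words seen at least twice (assumed rule)
counts = Counter(train_tokens)
lexicon = {w for w, c in counts.items() if c >= 2}

def normalize(tokens, lexicon):
    """Replace any token outside the lexicon with the <UNK> token."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

print(normalize(train_tokens, lexicon))
# At decoding time the same mapping is applied to unseen test words:
print(normalize("the cat chased a bird".split(), lexicon))
```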