BOW (Bag of Words)
1. Language models have a history of over one hundred years
Past:
- n-gram Language Model
Present:
- Neural Language Model
- Pretrained Language Model
Future:
- Brain-Inspired Language Model
2. Brain-Inspired Language Model
- The human language system in the brain can be roughly divided into three regions: one for storing language, one for sentiment, and one for representation (imagery)
- When people read a sentence, they picture it in their mind. So even if two sentences are very similar on the surface, the images they evoke can be quite different
3. Text Classification
- Assigning subject categories, topics, or genres
- Spam detection
- Authorship identification
- Age/gender identification
- Language Identification
- Sentiment analysis
3.1 Input
- a document
- a fixed set of classes
3.2 Output
- a predicted class
3.3 Methods
Rules based on combinations of words or other features
- spam: black-list-address OR ("折扣" [discount] AND "降价" [price cut])
Accuracy can be high if the rules are carefully refined by an expert,
but building and maintaining these rules is expensive
3.4 The Bag of Words Representation
- Count the words in the document to obtain a dictionary mapping each word to its frequency in that document
- Counting every word is often not useful, so in practice we usually count only the informative (content) words, as in the sketch below
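A minimal sketch of building a bag-of-words count (the tokenizer and stop-word list here are illustrative assumptions, not from the notes):

```python
from collections import Counter

# Illustrative stop-word list; real systems use a longer list or a library-provided one.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "it"}

def bag_of_words(document: str) -> Counter:
    """Lowercase, split on whitespace, drop stop words, and count word frequencies."""
    tokens = document.lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

print(bag_of_words("I love love this movie and it is great"))
# Counter({'love': 2, 'i': 1, 'this': 1, 'movie': 1, 'great': 1})
```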
4. How to learn the classifier
4.1 Let’s start with Naive Bayes
(Simple “naïve” classification method based on Bayes rule)
4.1.1 Imagine two people Alice and Bob whose word usage pattern you know:
Alice often uses words: love, great, wonderful
Bob often uses words: dog, ball, wonderful
Alice's word probabilities: love (0.1), great (0.8), wonderful (0.1)
Bob's word probabilities: love (0.3), ball (0.2), wonderful (0.5)
Can you guess who sends: “wonderful love”?
- Bob is the more likely sender (worked numbers below)
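Assuming the two senders are equally likely a priori and that word choices are independent, the comparison is:
P("wonderful love" | Alice) = 0.1 × 0.1 = 0.01
P("wonderful love" | Bob) = 0.5 × 0.3 = 0.15
Since 0.15 > 0.01, Bob is the more likely sender.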
4.1.2 Suppose there are two bowls of cookies:
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
Bowl 2 contains 20 of each.
Now suppose you choose one of the bowls at random and, without looking, select a cookie at random. The cookie is vanilla.
What is the probability that it came from Bowl 1?
- P(c|x) is the posterior probability of class c given the features x.
- P(c) is the prior probability of the class.
- P(x|c) is the likelihood: the probability of the features given the class.
- P(x) is the evidence: the marginal probability of the features.
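Putting these together, Bayes' rule is P(c|x) = P(x|c) P(c) / P(x). For the cookie question (the bowl is chosen uniformly, so P(Bowl 1) = P(Bowl 2) = 0.5):
P(vanilla | Bowl 1) = 30/40 = 0.75, P(vanilla | Bowl 2) = 20/40 = 0.5
P(vanilla) = 0.5 × 0.75 + 0.5 × 0.5 = 0.625
P(Bowl 1 | vanilla) = (0.75 × 0.5) / 0.625 = 0.6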
4.1.3 A fruit example
Based on our training set we can also say the following:
Of 500 bananas, 400 (0.8) are Long, 350 (0.7) are Sweet, and 450 (0.9) are Yellow
Out of 300 oranges, 0 are Long, 150 (0.5) are Sweet, and 300 (1.0) are Yellow
Of the remaining 200 fruits, 100 (0.5) are Long, 150 (0.75) are Sweet, and 50 (0.25) are Yellow
- So, for a fruit that is Long, Sweet, and Yellow, it's a banana (worked numbers below)
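Assuming the implied query is a fruit that is Long, Sweet, and Yellow, and using all 500 + 300 + 200 = 1000 training fruits for the priors:
P(Banana) = 0.5, P(Orange) = 0.3, P(Other) = 0.2
score(Banana) ∝ 0.5 × 0.8 × 0.7 × 0.9 = 0.252
score(Orange) ∝ 0.3 × 0 × 0.5 × 1.0 = 0
score(Other) ∝ 0.2 × 0.5 × 0.75 × 0.25 ≈ 0.019
Banana has by far the largest score, so the prediction is banana.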
4.2 Naive Bayes Classifier
Example: given that a review is positive, what is the probability of the review text itself, P(document | class)? Does that direction feel strange?
Parameters:
- Assume the features (words) are conditionally independent given the class
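Under that independence assumption the classifier reduces to the standard naive Bayes decision rule, in the same notation used later in these notes:
c_NB = argmax_c P(c) × Π_i P(x_i | c)
where x_1 … x_n are the words (features) of the document.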
5. Learning the Naive Bayes Model
- Simply use the frequencies in the data (maximum likelihood estimates)
- P(c_j) is estimated as the fraction of training documents that belong to class c_j
- P(w_i|c_j) is estimated as count(w_i, c_j) / Σ_w count(w, c_j), the fraction of word tokens in class c_j that are w_i
- To get these counts, create a mega-document for topic j by concatenating all docs in this topic
- Use the frequency of w in the mega-document
5.1 Laplace (add-1) smoothing
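The add-1 estimate of the word likelihood, with V the vocabulary:
P(w_i | c) = (count(w_i, c) + 1) / (Σ_w count(w, c) + |V|)
This keeps any word's probability from being exactly zero just because it never occurred with class c in training.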
5.2 Unknown words
Add one extra word to the vocabulary, the "unknown word"; words not seen in training are mapped to it at test time (see the sketch at the end of Section 6)
6. Try again with Textual examples
Priors:
Conditional Probabilities:
Choosing a class:
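The original numbers from this slide are not reproduced here; the sketch below uses a tiny made-up corpus to show the three steps (priors, add-1 conditional probabilities, choosing a class), including the unknown word from Section 5.2:

```python
import math
from collections import Counter

# Hypothetical toy training set (illustrative only; not the example from the slides).
train = [
    ("pos", "great wonderful love this movie"),
    ("pos", "wonderful great acting"),
    ("neg", "boring terrible movie"),
    ("neg", "terrible plot do not watch"),
]

UNK = "<UNK>"

# Priors: fraction of training documents in each class.
class_docs = Counter(c for c, _ in train)
n_docs = sum(class_docs.values())
prior = {c: class_docs[c] / n_docs for c in class_docs}

# Mega-document per class: concatenate the class's docs and count word frequencies.
word_counts = {c: Counter() for c in class_docs}
for c, doc in train:
    word_counts[c].update(doc.split())

# Vocabulary, plus one extra "unknown word" slot (Section 5.2).
vocab = {w for counts in word_counts.values() for w in counts} | {UNK}

def log_likelihood(word, c):
    """Add-1 (Laplace) smoothed log P(word | c); unseen words map to <UNK>."""
    if word not in vocab:
        word = UNK
    return math.log((word_counts[c][word] + 1) /
                    (sum(word_counts[c].values()) + len(vocab)))

def classify(doc):
    """Choose the class maximizing log P(c) + sum over words of log P(word | c)."""
    scores = {c: math.log(prior[c]) + sum(log_likelihood(w, c) for w in doc.split())
              for c in prior}
    return max(scores, key=scores.get)

print(classify("a wonderful movie"))  # -> 'pos' on this toy data
```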
7. Sentiment Classification: Dealing with Negation
I really like this movie
I really don’t like this movie
Negation changes the meaning of “like” to negative.
Negation can also change negative to positive-ish
- Don’t dismiss this film
- Doesn’t let us get bored
7.1 Sentiment Classification: Dealing with Negation
Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of EMNLP-2002, 79-86.
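A common preprocessing trick associated with this line of work, as usually presented in course materials, is to prepend NOT_ to every word between a negation cue and the next punctuation mark, so that "like" and "NOT_like" become different features. A minimal sketch (the negation list and tokenizer regex are illustrative assumptions):

```python
import re

# Illustrative negation cues; real systems use a longer list (never, no, n't, ...).
NEGATIONS = {"not", "don't", "doesn't", "didn't", "no", "never"}
PUNCT = set(".,!?;")

def mark_negation(text: str) -> list[str]:
    """Prefix NOT_ to tokens between a negation word and the next punctuation mark."""
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCT:
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok in NEGATIONS:
                negating = True
    return out

print(mark_negation("I really don't like this movie, but the acting is great."))
# ['i', 'really', "don't", 'NOT_like', 'NOT_this', 'NOT_movie', ',',
#  'but', 'the', 'acting', 'is', 'great', '.']
```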
8. Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature
- URL, email address, dictionaries, network features
But if, as in the previous slides
- We use only word features
- We use all of the words in the text (not a subset)
Then
- Naive Bayes has an important similarity to language modeling.
Each class = a unigram language model
Assigning each word: P(word | c)
Assigning each sentence: P(s|c)=Π P(word|c)
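Seen this way, scoring a sentence under a class is just multiplying its per-word unigram probabilities. A tiny sketch with made-up (and not normalized) per-class word probabilities:

```python
import math

# Hypothetical per-class unigram probabilities, for illustration only.
unigram = {
    "pos": {"i": 0.1, "love": 0.20, "this": 0.1, "movie": 0.1, "boring": 0.01},
    "neg": {"i": 0.1, "love": 0.02, "this": 0.1, "movie": 0.1, "boring": 0.20},
}

def log_p_sentence(sentence: str, c: str) -> float:
    """log P(s | c) = sum over words of log P(word | c) under class c's unigram model."""
    return sum(math.log(unigram[c][w]) for w in sentence.split())

s = "i love this movie"
print({c: log_p_sentence(s, c) for c in unigram})  # the "pos" model gives s the higher score
```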
9. Evaluation
Precision only tells us what fraction of the examples the model labels positive are actually positive, so high precision on its own is not evidence that the model is good; it has to be read together with recall, as shown below.
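For reference, with TP, FP, and FN the counts of true positives, false positives, and false negatives:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
A classifier that labels only one obvious example as positive can reach a precision of 1.0 while its recall stays near zero, which is why precision and recall are reported together (often combined as F1).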