BOW (Bag of Words)
1. Language models have a history of over one hundred years
Past:
- n-gram Language Model
Present:
- Neural Language Model
- Pretrained Language Model
Future:
- Brain-Inspired Language Model
2. Brain-Inspired Language Model
- The human language system in the brain can be roughly divided into three regions: one for storing language, one for sentiment, and one for representation (imagery)
- When people read a sentence, they picture it in their mind. So even if two sentences are very similar on the surface, the images they evoke can be quite different
3. Text Classification
- Assigning subject categories, topics, or genres
- Spam detection
- Authorship identification
- Age/gender identification
- Language Identification
- Sentiment analysis
3.1 Input
- a document
- a fixed set of classes
3.2 Output
- a predicted class
3.3 Methods
Rules based on combinations of words or other features
- spam: black-list-address OR ("折扣" [discount] AND "降价" [price cut])
Accuracy can be high if the rules are carefully refined by an expert,
but building and maintaining these rules is expensive
3.4 The Bag of Words Representation
- Count the words in the document to obtain a dictionary mapping each word to its frequency in that document
- Counting every word is often not useful, so in practice we usually count only the informative (content) words, as in the sketch below
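A minimal sketch of building a bag-of-words count (the tokenizer and stop-word list here are illustrative assumptions, not from the notes):

```python
from collections import Counter

# Illustrative stop-word list; real systems use a longer list or a library-provided one.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "it"}

def bag_of_words(document: str) -> Counter:
    """Lowercase, split on whitespace, drop stop words, and count word frequencies."""
    tokens = document.lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

print(bag_of_words("I love love this movie and it is great"))
# Counter({'love': 2, 'i': 1, 'this': 1, 'movie': 1, 'great': 1})
```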
4. How to learn the classifier
4.1 Let’s start with Naive Bayes
(Simple “naïve” classification method based on Bayes rule)
4.1.1 Imagine two people Alice and Bob whose word usage pattern you know:
Alice often uses words: love, great, wonderful
Bob often uses words: dog, ball, wonderful
Alice's word probabilities: love (0.1), great (0.8), wonderful (0.1)
Bob's word probabilities: love (0.3), ball (0.2), wonderful (0.5)
Can you guess who sends: “wonderful love”?
- Bob is the more likely sender (worked numbers below)
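Assuming the two senders are equally likely a priori and that word choices are independent, the comparison is:
P("wonderful love" | Alice) = 0.1 × 0.1 = 0.01
P("wonderful love" | Bob) = 0.5 × 0.3 = 0.15
Since 0.15 > 0.01, Bob is the more likely sender.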
4.1.2 Suppose there are two bowls of cookies:
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
Bowl 2 contains 20 of each.
Now suppose you choose one of the bowls at random and, without looking, select a cookie at random. The cookie is vanilla.
What is the probability that it came from Bowl 1?
- P(c|x) is the posterior probability of class c given the features x.
- P(c) is the prior probability of the class.
- P(x|c) is the likelihood: the probability of the features given the class.
- P(x) is the evidence: the marginal probability of the features.
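Putting these together, Bayes' rule is P(c|x) = P(x|c) P(c) / P(x). For the cookie question (the bowl is chosen uniformly, so P(Bowl 1) = P(Bowl 2) = 0.5):
P(vanilla | Bowl 1) = 30/40 = 0.75, P(vanilla | Bowl 2) = 20/40 = 0.5
P(vanilla) = 0.5 × 0.75 + 0.5 × 0.5 = 0.625
P(Bowl 1 | vanilla) = (0.75 × 0.5) / 0.625 = 0.6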
4.1.3 A fruit example
Based on our training set we can also say the following:
Of 500 bananas, 400 (0.8) are Long, 350 (0.7) are Sweet, and 450 (0.9) are Yellow
Out of 300 oranges, 0 are Long, 150 (0.5) are Sweet, and 300 (1.0) are Yellow
Of the remaining 200 fruits, 100 (0.5) are Long, 150 (0.75) are Sweet, and 50 (0.25) are Yellow
- So, for a fruit that is Long, Sweet, and Yellow, it's a banana (worked numbers below)
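Assuming the implied query is a fruit that is Long, Sweet, and Yellow, and using all 500 + 300 + 200 = 1000 training fruits for the priors:
P(Banana) = 0.5, P(Orange) = 0.3, P(Other) = 0.2
score(Banana) ∝ 0.5 × 0.8 × 0.7 × 0.9 = 0.252
score(Orange) ∝ 0.3 × 0 × 0.5 × 1.0 = 0
score(Other) ∝ 0.2 × 0.5 × 0.75 × 0.25 ≈ 0.019
Banana has by far the largest score, so the prediction is banana.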
4.2 Naive Bayes Classifier
Example: given that a review is positive, what is the probability of the review text itself, P(document | class)? Does that direction feel strange?
Parameters:
- Assume the features (words) are conditionally independent given the class
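Under that independence assumption the classifier reduces to the standard naive Bayes decision rule, in the same notation used later in these notes:
c_NB = argmax_c P(c) × Π_i P(x_i | c)
where x_1 … x_n are the words (features) of the document.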
5. Learning the Naive Bayes Model
- Simply use the frequencies in the data (maximum likelihood estimates)
- P(c_j) is estimated as the fraction of training documents that belong to class c_j
- P(w_i|c_j) is estimated as count(w_i, c_j) / Σ_w count(w, c_j), the fraction of word tokens in class c_j that are w_i
- To get these counts, create a mega-document for topic j by concatenating all docs in this topic
- Use the frequency of w in the mega-document
5.1 Laplace (add-1) smoothing
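The add-1 estimate of the word likelihood, with V the vocabulary:
P(w_i | c) = (count(w_i, c) + 1) / (Σ_w count(w, c) + |V|)
This keeps any word's probability from being exactly zero just because it never occurred with class c in training.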
5.2 Unknown words
Add one extra word to the vocabulary, the "unknown word"; words not seen in training are mapped to it at test time (see the sketch at the end of Section 6)
6. Try again with Textual examples
Priors:
Conditional Probabilities:
Choosing a class:
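The original numbers from this slide are not reproduced here; the sketch below uses a tiny made-up corpus to show the three steps (priors, add-1 conditional probabilities, choosing a class), including the unknown word from Section 5.2:

```python
import math
from collections import Counter

# Hypothetical toy training set (illustrative only; not the example from the slides).
train = [
    ("pos", "great wonderful love this movie"),
    ("pos", "wonderful great acting"),
    ("neg", "boring terrible movie"),
    ("neg", "terrible plot do not watch"),
]

UNK = "<UNK>"

# Priors: fraction of training documents in each class.
class_docs = Counter(c for c, _ in train)
n_docs = sum(class_docs.values())
prior = {c: class_docs[c] / n_docs for c in class_docs}

# Mega-document per class: concatenate the class's docs and count word frequencies.
word_counts = {c: Counter() for c in class_docs}
for c, doc in train:
    word_counts[c].update(doc.split())

# Vocabulary, plus one extra "unknown word" slot (Section 5.2).
vocab = {w for counts in word_counts.values() for w in counts} | {UNK}

def log_likelihood(word, c):
    """Add-1 (Laplace) smoothed log P(word | c); unseen words map to <UNK>."""
    if word not in vocab:
        word = UNK
    return math.log((word_counts[c][word] + 1) /
                    (sum(word_counts[c].values()) + len(vocab)))

def classify(doc):
    """Choose the class maximizing log P(c) + sum over words of log P(word | c)."""
    scores = {c: math.log(prior[c]) + sum(log_likelihood(w, c) for w in doc.split())
              for c in prior}
    return max(scores, key=scores.get)

print(classify("a wonderful movie"))  # -> 'pos' on this toy data
```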
7. Sentiment Classification: Dealing with Negation
I really like this movie
I really don’t like this movie
Negation changes the meaning of “like” to negative.
Negation can also change negative to positive-ish
- Don’t dismiss this film
- Doesn’t let us get bored
7.1 Sentiment Classification: Dealing with Negation
Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of EMNLP-2002, 79-86.
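A common preprocessing trick associated with this line of work, as usually presented in course materials, is to prepend NOT_ to every word between a negation cue and the next punctuation mark, so that "like" and "NOT_like" become different features. A minimal sketch (the negation list and tokenizer regex are illustrative assumptions):

```python
import re

# Illustrative negation cues; real systems use a longer list (never, no, n't, ...).
NEGATIONS = {"not", "don't", "doesn't", "didn't", "no", "never"}
PUNCT = set(".,!?;")

def mark_negation(text: str) -> list[str]:
    """Prefix NOT_ to tokens between a negation word and the next punctuation mark."""
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCT:
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok in NEGATIONS:
                negating = True
    return out

print(mark_negation("I really don't like this movie, but the acting is great."))
# ['i', 'really', "don't", 'NOT_like', 'NOT_this', 'NOT_movie', ',',
#  'but', 'the', 'acting', 'is', 'great', '.']
```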
8. Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature
- URL, email address, dictionaries, network features
But if, as in the previous slides
- We use only word features
- We use all of the words in the text (not a subset)
Then
- Naive Bayes has an important similarity to language modeling.
Each class = a unigram language model
Assigning each word: P(word | c)
Assigning each sentence: P(s|c)=Π P(word|c)
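Seen this way, scoring a sentence under a class is just multiplying its per-word unigram probabilities. A tiny sketch with made-up (and not normalized) per-class word probabilities:

```python
import math

# Hypothetical per-class unigram probabilities, for illustration only.
unigram = {
    "pos": {"i": 0.1, "love": 0.20, "this": 0.1, "movie": 0.1, "boring": 0.01},
    "neg": {"i": 0.1, "love": 0.02, "this": 0.1, "movie": 0.1, "boring": 0.20},
}

def log_p_sentence(sentence: str, c: str) -> float:
    """log P(s | c) = sum over words of log P(word | c) under class c's unigram model."""
    return sum(math.log(unigram[c][w]) for w in sentence.split())

s = "i love this movie"
print({c: log_p_sentence(s, c) for c in unigram})  # the "pos" model gives s the higher score
```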
9. Evaluation
Precision only tells us what fraction of the examples the model labels positive are actually positive, so high precision on its own is not evidence that the model is good; it has to be read together with recall, as shown below.
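For reference, with TP, FP, and FN the counts of true positives, false positives, and false negatives:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
A classifier that labels only one obvious example as positive can reach a precision of 1.0 while its recall stays near zero, which is why precision and recall are reported together (often combined as F1).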