
Chatbots

  1. What are Chatbots / Dialogue
    1.1 Two kinds of conversational agents
  2. Properties of Human Conversation
    2.1 Properties
    2.2 Grounding
      2.2.1 Grounding: Establishing Common Ground
    2.3 Conversations have structure
    2.4 Another kind of structure: Subdialogues
    2.5 Conversational Initiative
    2.6 Even harder problems: Inference
  3. Rule-based Chatbots
    3.1 ELIZA pattern/transform rules
      3.1.1 Memory
      3.1.2 Ethical implications: Anthropomorphism and Privacy
    3.2 PARRY: A computational model of schizophrenia
      3.2.1 Model of mental state
      3.2.2 PARRY's responses depend on mental state
  4. Corpus-based Chatbots
    4.1 Two architectures for corpus-based chatbots
    4.2 Response by retrieval: classic IR method
      4.2.1 Response by retrieving and refining knowledge
    4.3 Response by generation
    4.4 Challenge: The blandness problem
      4.4.1 Mutual Information for Neural Network Generation
    4.5 Challenge: The consistency problem
      4.5.1 Personal modeling as multi-task learning
    4.6 Challenge: Long conversational context
  5. Hybrid Architectures
  6. The Frame-based ("GUS") Dialogue Architecture
    6.1 The Frame
    6.2 Control structure for the GUS frame architecture
    6.3 GUS slots have condition-action rules attached
      6.3.1 GUS: Natural Language Understanding for filling dialogue slots
      6.3.2 How to fill slots: Rule-based slot filling
    6.4 A more sophisticated version of the frame-based architecture
      6.4.1 A multi-turn task-oriented dialogue-state architecture
      6.4.2 Components in a dialogue-state architecture
    6.5 Dialogue Acts
    6.6 How to fill slots: Machine-learning slot filling
      6.6.1 Slot filling as sequence labeling: BIO tagging
    6.7 The task of dialogue state tracking
    6.8 Dialogue Policy
      6.8.1 Policy example: Confirmation and Rejection
      6.8.2 Confirmation
      6.8.3 Rejection
        6.8.3.1 Progressive prompting for rejection
        6.8.3.2 Using confidence to decide whether to confirm
    6.9 Natural Language Generation (NLG)
      6.9.1 Sentence Realization
      6.9.2 Generating clarification questions
  7. Reinforcement Learning (RL)
    7.1 RL vs. SL (supervised learning)
    7.2 Conversation as RL
  8. Evaluating Dialogue Systems
    8.1 Evaluating chatbots and task-based dialogue
    8.2 Chatbots are evaluated by human participants
      8.2.1 Participant evaluation
      8.2.2 Observer evaluation: ACUTE-EVAL
    8.3 Automatic evaluation is an open problem
      8.3.1 Task-based systems are evaluated by task success
    8.4 Evaluation Metrics: Slot error rate

    1. What are Chatbots / Dialogue

    1.1 Two kinds of conversational agents

    (Task-based) Dialogue Agents (closed domain)

    • Personal assistants; help users achieve a certain task in vertical domains, e.g., education and medicine

    • Combination of rules and statistical components

    • Frames with slots and values: a set of slots, to be filled with information of a given type. Each associated with a question to the user.

    Chatbots (Open domain)

    • No specific goal, focus on humanlike conversations

    • For fun, or even for therapy

    • Rule-based: Pattern-action rules (ELIZA) + A mental model (PARRY): The first system to pass the Turing Test!

    • Corpus-based: Information Retrieval (XiaoIce) or Neural encoder-decoder (variants of E2E seq2seq model) (BlenderBot)

    image-20211228194720625

    2. Properties of Human Conversation

    2.1 Properties

    image-20211228195046773

    • Turns

      • We call each contribution a “turn”
    • Interruptions and Barge-in: allowing the user to interrupt

    • End-pointing

      • The task for a speech system of deciding whether the user has stopped talking
    • Each turn (utterance) in a dialogue is a kind of action (a speech act)

    • Constatives: committing the speaker to something’s being the case

      • answering, claiming, confirming, denying, disagreeing, stating
    • Directives: attempts by the speaker to get the addressee to do something
      • advising, asking, forbidding, inviting, ordering, requesting
    • Commissives: committing the speaker to some future course of action
      • promising, planning, vowing, betting, opposing
    • Acknowledgments: express the speaker’s attitude regarding the hearer
      with respect to some social action
      • apologizing, greeting, thanking, accepting an acknowledgment

    2.2 Grounding

    • Participants in conversation or any joint activity need to establish common ground.
    • Principle of closure: agents performing an action require evidence, sufficient for
      current purposes, that they have succeeded in performing it.
    • Speech is an action too! So speakers need to ground each other’s utterances.

      • Grounding: acknowledging that the hearer has understood

    2.2.1 Grounding: Establishing Common Ground

    image-20211228200259088

    • Special acknowledgment words and connectives can smooth the conversation.

    image-20211228200725314

    2.3 Conversations have structure

    • Local structure between adjacent speech acts, from the field of conversation analysis

    • Called adjacency pairs:

      • QUESTION … ANSWER
      • PROPOSAL … ACCEPTANCE/REJECTION
      • COMPLIMENT (“Nice jacket!”) … DOWNPLAYER (“Oh, this old thing?”)

    2.4 Another kind of structure: Subdialogues

    • Correction subdialogues

      • A subdialogue can be inserted at any point; its result must be correct and may affect the main dialogue
      • Agent: OK. There’s #two non-stops# (# marks overlapping speech)
      • Client: #Act- actually#, what day of the week is the 15th?
      • Agent: It’s a Friday
      • Client: Uh hmm. I would consider staying there an extra day til Sunday.
      • Agent: OK…OK. On Sunday I have …
    • Clarification subdialogues

      • User: What do you have going to UNKNOWN WORD on the 5th?
      • System: Let’s see, going where on the 5th
      • User: Going to Hong Kong. (clarifying the destination)
      • System: OK, here are some flights…
    • Presequences: a conversation may open with a preliminary exchange before the real request

      • User: Can you make train reservations?
      • System: Yes I can
      • User: Great, I’d like to reserve a seat on the 4pm train to New York.

    2.5 Conversational Initiative

    • Some conversations are controlled by one person

      • A reporter interviewing a chef asks questions, and the chef responds.
      • The reporter has the conversational initiative
    • Most human conversations have mixed initiative:
      • I lead, then you lead, then I lead.
    • Mixed initiative is very hard for NLP systems, which often default to simpler styles that can be frustrating for humans:
      • User initiative (user asks or commands, system responds)
      • System initiative (system asks user questions to fill out a form, user can’t change the direction)

    2.6 Even harder problems: Inference

    • Agent: And, what day in May did you want to travel?

    • Client: OK, uh, I need to be there for a meeting that’s from the 12th to the 15th.

    • The agent must infer that the client wants to arrive by the 12th: the answer is implied, never stated.

    3. Rule-based Chatbots

    3.1 ELIZA pattern/transform rules

    • Basic idea: extract a keyword from the input, then rank and apply the transform rules stored for that keyword

    • s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/

    • s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/

    • s/.* all .*/IN WHAT WAY?/

    • s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE?/
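    These substitution rules can be sketched in Python with `re` (the rule list here is a tiny illustrative subset, not ELIZA's full script):

    ```python
    import re

    # Each rule: (pattern, template); \1 echoes the captured group,
    # mirroring the "1" in the s/.../.../ notation above.
    RULES = [
        (r".* i'?m (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
        (r".* i am (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
        (r".* all .*", "IN WHAT WAY?"),
        (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE?"),
    ]

    def respond(utterance):
        """Apply the first matching transform rule; fall back to a stock reply."""
        text = " " + utterance.lower().strip(".!?") + " "
        for pattern, template in RULES:
            if re.match(pattern, text):
                return re.sub(pattern, template, text).strip().upper()
        return "PLEASE GO ON"
    ```

    For example, `respond("I'm sad today")` produces the "I AM SORRY TO HEAR YOU ARE SAD" transform, because the first rule's captured group is echoed back.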

    image-20211228203316601

    • In a pattern, 0 matches any string of words (like *)
    • A number like 3 in a transform refers to constituent #3 of the matched pattern

    • Example

      • You hate me
      • WHAT MAKES YOU THINK I HATE YOU
    • User: I know everybody laughed at me
    • “I” is very general:

      • I: (I *) -> (YOU SAY YOU 2)
      • ELIZA: YOU SAY YOU KNOW EVERYBODY LAUGHED AT YOU
    • “Everybody” is more specific and interesting

      • Everybody: (Everybody *) -> (WHO IN PARTICULAR ARE YOU THINKING OF?)
      • ELIZA: WHO IN PARTICULAR ARE YOU THINKING OF?
    • Implementation: keywords stored with their rank
      • Everybody 5 (list of transformation rules)
      • I 0 (list of transformation rules)

    image-20211228203858920

    3.1.1 Memory

    • How parts of the dialogue are stored for later use: below, the keyword MY is extracted and a randomly chosen transform of the sentence is stored.

    image-20211228204042864

    • Whenever “MY” is the highest-ranked keyword
      • Randomly select a transform on the MEMORY list
      • Apply it to the sentence
      • Store the result on a (first-in first-out) queue
    • Later, if no keyword matches a sentence
      • Return the top of the MEMORY queue instead
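    The MEMORY mechanism above can be sketched as follows (the transform strings are illustrative, not ELIZA's actual MEMORY list):

    ```python
    import random
    from collections import deque

    # Illustrative MEMORY transforms; {x} is the text after "MY"
    MEMORY_TRANSFORMS = [
        "LETS DISCUSS FURTHER WHY YOUR {x}",
        "EARLIER YOU SAID YOUR {x}",
        "DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR {x}",
    ]

    memory = deque()  # first-in first-out queue of stored transforms

    def on_user_turn(sentence, matched_keyword):
        """If MY is the top keyword, store a random transform of the sentence;
        if no keyword matched at all, pop the front of the MEMORY queue."""
        words = sentence.upper().split()
        if matched_keyword == "MY":
            rest = " ".join(words[words.index("MY") + 1:])
            memory.append(random.choice(MEMORY_TRANSFORMS).format(x=rest))
        if matched_keyword is None and memory:
            return memory.popleft()
        return None
    ```

    A later keyword-less turn then surfaces the stored material, which is what makes ELIZA appear to "remember" earlier topics.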

    3.1.2 Ethical implications: Anthropomorphism and Privacy

    • Some users became absorbed in conversations with the chatbot (anthropomorphism), and even demanded that their conversations with it be kept private (privacy).

    3.2 PARRY: A computational model of schizophrenia

    • Another chatbot, with a clinical psychology focus
    • Used to study schizophrenia

    • Same pattern-response structure as ELIZA

    • But with a much richer:
      • control structure
      • language understanding capabilities
      • model of mental state
        • variables modeling levels of Anger, Fear, and Mistrust

    3.2.1 Model of mental state

    • Affect variables:
      • Fear (0-20), Anger (0-20), Mistrust (0-15)

    • Start with all variables low

    • After each user turn:
      • Each user statement can change Fear and Anger
        • E.g., insults increase Anger; flattery decreases Anger
        • Mentions of his delusions increase Fear
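    A sketch of such a mental-state model (the trigger words, increments, and clamping below are illustrative assumptions, not PARRY's actual rules):

    ```python
    class MentalState:
        """PARRY-style affect variables, updated after each user turn."""
        def __init__(self):
            self.fear = 0      # range 0-20
            self.anger = 0     # range 0-20
            self.mistrust = 0  # range 0-15

        def _clamp(self):
            self.fear = max(0, min(20, self.fear))
            self.anger = max(0, min(20, self.anger))
            self.mistrust = max(0, min(15, self.mistrust))

        def update(self, utterance):
            text = utterance.lower()
            if "stupid" in text or "crazy" in text:   # insult raises Anger
                self.anger += 5
            if "nice" in text or "smart" in text:     # flattery lowers Anger
                self.anger -= 3
            if "mafia" in text:                       # mention of his delusions raises Fear
                self.fear += 4
            self._clamp()
    ```

    The response-selection rules (next section) then branch on these variables, so the same input yields different replies at different affect levels.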

    3.2.2 Parry’s responses depend on mental state

    image-20211228211636793

    4. Corpus-based Chatbots

    • Dialogue as a Markov Decision Process (MDP)
      • Given state s, select action a according to a (hierarchical) policy π
      • Receive reward r, observe new state s′
      • Continue the cycle until the episode terminates
    • Goal of dialogue learning: find the optimal policy π that maximizes expected rewards

    • A unified view: dialogue as optimal decision making

    image-20211228212436449

    4.1 Two architectures for corpus-based chatbots

    • Response by retrieval
      • Use information retrieval to grab a response (appropriate to the context) from some corpus
    • Response by generation

      • Use a language model or encoder-decoder to generate the response given the dialogue context
    • Modern corpus-based chatbots are very data-intensive

    • They commonly require hundreds of millions or even billions of words

    image-20211228213015790

    4.2 Response by retrieval: classic IR method

    • Given a user turn q and a training corpus of conversations
    • Find the turn t in the corpus that is most similar (tf-idf cosine) to q
    • Say the response r that followed t
    • Deep-learning variants work the same way, only replacing tf-idf cosine with a learned similarity:
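    A toy sketch of this retrieval method (hand-made three-turn corpus; a real system would index millions of turns):

    ```python
    import math
    from collections import Counter

    # (turn, response) pairs; toy data for illustration only
    CORPUS = [
        ("do you like the weather today", "yes it is sunny"),
        ("can you recommend a restaurant", "try the noodle place downtown"),
        ("what movies are playing tonight", "the new sci-fi film starts at eight"),
    ]

    # Build idf over the corpus turns (add-one style smoothing)
    _docs = [turn.split() for turn, _ in CORPUS]
    _df = Counter(w for d in _docs for w in set(d))
    IDF = {w: math.log(len(_docs) / df) + 1.0 for w, df in _df.items()}

    def tf_idf(tokens):
        return {w: c * IDF.get(w, 0.0) for w, c in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def retrieve_response(query):
        """Return the response whose turn is most tf-idf-cosine-similar to query."""
        qv = tf_idf(query.split())
        best_turn, best_resp = max(CORPUS, key=lambda tr: cosine(qv, tf_idf(tr[0].split())))
        return best_resp
    ```

    The deep-learning variants mentioned above keep exactly this loop and only swap `cosine` over tf-idf vectors for a dot product of learned sentence encodings.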

    4.2.1 Response by retrieving and refining knowledge

    • Can we generate responses from informative text rather than dialogue?
      • Use IR to retrieve informative passages and answer from them
      • To respond to turns like “Tell me something about Beijing”
        • XiaoIce collects sentences from public lectures and news articles
        • and searches them using IR, based on query expansion from the user’s turn
      • Augmented encoder-decoder model
        • use IR to retrieve passages from Wikipedia
        • concatenate each Wikipedia sentence to the dialogue context with a separator token
        • give this as encoder context to the encoder-decoder model, which learns to incorporate the text into its response

    4.3 Response by generation

    • Think of response production as an encoder-decoder task
    • Generate each token of the response by conditioning on the encoding of the entire query and the response so far

    image-20211228214314489

    image-20211228214325196

    • Alternative approach: fine-tune a large language model on conversational data

    • The Chirpy Cardinal system (Paranjape et al., 2020):

      • fine-tunes GPT-2
      • on the EmpatheticDialogues dataset (Rashkin et al., 2019)
    • Ongoing research problems: neural chatbots can get repetitive, bland, and inconsistent

    image-20211228214453111

    4.4 Challenge: The blandness problem

    image-20211228214522318

    • Responses are often dull and uninteresting

    • Blandness problem: cause and remedies

    • Cause: the common MLE (maximum likelihood) objective favors safe, generic responses

    image-20211228214647488

    • Remedy: a mutual information objective
      • optimizes the conditional probabilities in both directions, making the response more relevant to the source

    image-20211228214725463

    Mutual Information for Neural Network Generation
    • Mutual information objective:

    image-20211228214926703

    image-20211228215055039

    image-20211228215037410

    • The extra term penalizes targets that are generically probable on their own, constraining the response to be specific to the source
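    One common way to apply the MMI objective is as a reranker over N-best candidates: score each candidate T by log p(T|S) − λ·log p(T). A minimal sketch (λ and the candidate log-probabilities below are made-up illustrative numbers, not model outputs):

    ```python
    def mmi_score(log_p_t_given_s, log_p_t, lam=0.5):
        """Mutual-information-style score: reward likely responses,
        but penalize ones that are probable regardless of the source."""
        return log_p_t_given_s - lam * log_p_t

    # N-best list: (response, log p(T|S), log p(T) under a language model)
    candidates = [
        ("i don't know", -1.0, -1.5),               # likely, but generic (high LM prob)
        ("the museum opens at nine", -2.0, -9.0),   # less likely, but specific
    ]

    best = max(candidates, key=lambda c: mmi_score(c[1], c[2]))
    ```

    Under plain MLE the generic "i don't know" wins; the −λ·log p(T) term flips the ranking toward the specific answer.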

    • Sample outputs

    image-20211228215305020

    4.5 Challenge: The consistency problem

    • I.e., the system contradicts itself across turns

    • E2E systems often exhibit poor response consistency:

    image-20211228215357025

    • Conversational data:
      • the root cause: in the training data, the mapping from question to response is not one-to-one (the same question has many different answers)

    image-20211228215424130

    image-20211228221500540

    • Remedy: encode persona information into each question-answer pair, so that the answer is unique and less ambiguous.

    image-20211228221634627

    4.5.1 Personal modeling as multi-task learning

    image-20211228221720482

    • Improving personalization with multiple losses

    image-20211228221951855

    • Optimization goal: the persona should be able to “predict” its own responses
    • As in the figure: each input passes through its own hidden layer, the results are concatenated and passed through a final hidden layer to produce the output, and four losses are optimized jointly, adding constraints on identity

    4.6 Challenge: Long conversational context

    • Models have little memory for long conversations

    • It can be challenging for an LSTM/GRU to encode very long context (i.e., more than 200 words; Khandelwal et al., 2018)

    • Hierarchical Encoder-Decoder (HRED) (Serban et al., 2016)

      • Encodes the utterance (word by word) and the conversation (turn by turn): both the current utterance and the entire dialogue history are encoded

    image-20211228222531235

    • Hierarchical Latent Variable Encoder-Decoder (VHRED) (Serban et al., 2017)
      • Adds a latent variable to the decoder
      • Trained by maximizing a variational lower bound on the log likelihood

    image-20211228223305186

    5. Hybrid Architectures

    • Chirpy Cardinal (Paranjape et al., 2020): response generation from a series of different generators
    • GPT-2 fine-tuned on EmpatheticDialogues
    • GPT-2 fine-tuned to paraphrase content from Wikipedia
    • Rule-based movie or music generators that produce scripted conversation about a movie or a musician
      • asking the user’s opinion about a movie,
      • giving a fun fact,
      • asking the user their opinion on an actor in the movie.

    6. The Frame-based (“GUS”) Dialogue Architecture

    • Sometimes called “task-based dialogue agents”

      • Systems whose goal is to help a user solve a task, like making a travel reservation or buying a product
    • Architecture:

      • First proposed in the GUS system of 1977
      • A knowledge structure representing user intentions
      • One or more frames (each consisting of slots with values)

    6.1 The Frame

    • A set of slots, to be filled with information of a given type
    • Each slot is associated with a question to the user
    • Sometimes called a domain ontology

    image-20211228224211103

    6.2 Control structure for GUS frame architecture

    • System asks questions of user, filling any slots that user specifies

    • User might fill many slots at a time:

      • I want a flight from San Francisco to Denver one way leaving after five p.m . on Tuesday
    • When frame is filled, do database query
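    This control structure can be sketched as a loop over unfilled slots (toy flight frame; a real system would parse the user's answer and may fill several slots at once):

    ```python
    # Toy frame: slot -> [question to ask, filler]
    FRAME = {
        "ORIGIN": ["What city are you leaving from?", None],
        "DESTINATION": ["Where are you going?", None],
        "DEPT_DATE": ["What day would you like to leave?", None],
    }

    def fill(slot, value):
        FRAME[slot][1] = value

    def next_question():
        """Ask the question for the first unfilled slot;
        None means the frame is full and it is time for the database query."""
        for question, filler in FRAME.values():
            if filler is None:
                return question
        return None
    ```

    The dialogue loop is then: while `next_question()` is not None, ask it and fill whatever slots the user's answer specifies; once it returns None, run the database query.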

    6.3 GUS slots have condition-action rules attached

    • Some rules are attached to the DESTINATION slot of the plane-booking frame:

      • Once the user has specified the destination:
        • enter that city as the default StayLocation for the hotel-booking frame
      • Once the user has specified the DESTINATION DAY for a short trip:
        • automatically copy it as the ARRIVAL DAY
    • Other frames cover things like:

      • car or hotel reservations
      • general route information
        • “Which airlines fly from Boston to San Francisco?”
      • information about airfare practices
        • “Do I have to stay a specific number of days to get a decent airfare?”
    • Frame detection:
      • the system must detect which slot of which frame the user is filling
      • and switch dialogue control to that frame

    6.3.1 GUS: Natural Language Understanding for filling dialog slots

    • Domain classification

      • Asking weather? Booking a flight? Programming an alarm clock?
    • Intent determination

      • Find a Movie, Show Flight, Remove Calendar Appt
    • Slot filling

      • Extract the actual slots and fillers

    image-20211228230022041

    image-20211228230027655

    6.3.2 How to fill slots: Rule-based slot filling

    • Write regular expressions or grammar rules

      • wake me (up) | set (the|an) alarm | get me up
    • Do text normalization

    • A template is a pre-built response string

    • Templates can be fixed:

      • “Hello, how can I help you?”
    • Or have variables:
      • “What time do you want to leave CITY-ORIG?”
      • “Will you return to CITY-ORIG from CITY-DEST?”
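    A sketch of this rule-based approach: one regular expression for the set-alarm intent plus a template with a variable (the pattern, slot name, and templates are illustrative):

    ```python
    import re

    # Rule: recognize a set-alarm request and extract the TIME slot
    ALARM_RE = re.compile(
        r"(wake me( up)?|set (the|an) alarm|get me up)"
        r".*?(?P<time>\d{1,2}(:\d{2})?\s*(a\.?m\.?|p\.?m\.?)?)",
        re.IGNORECASE,
    )

    def parse_alarm(utterance):
        """Return the extracted TIME filler, or None if the rule doesn't fire."""
        m = ALARM_RE.search(utterance)
        return m.group("time").strip() if m else None

    def realize(template, **slots):
        """Fill a template's variables with slot values."""
        return template.format(**slots)
    ```

    For instance `realize("What time do you want to leave {CITY_ORIG}?", CITY_ORIG="Boston")` instantiates the variable template above.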

    6.4 A more sophisticated version of the frame-based architecture

    6.4.1 A multi-turn task-oriented dialogue-state architecture

    image-20211228230516701

    6.4.2 Components in a dialogue-state architecture

    • NLU: extracts slot fillers from the user’s utterance using machine learning
    • Dialogue state tracker: maintains the current state of the dialogue (the user’s most recent dialogue act, and the set of slot-filler constraints the user has expressed so far)
    • Dialogue policy: decides what the system should do or say next
      • GUS policy: ask questions until the frame is full, then report back the results of a database query
      • More sophisticated: know when to answer questions, when to ask a clarification question, when to make a suggestion, etc.
    • NLG: produce more natural, less templated utterances

    6.5 Dialogue Acts

    • Combine the ideas of speech acts and grounding into a single representation

    image-20211228231605094

    image-20211228231618938

    6.6 How to fill slots: Machine-learning slot filling

    • Machine-learning classifiers map words to semantic frame fillers
    • Given a set of labeled sentences
      • Input:
        • “I want to fly to San Francisco on Monday please”
      • Output:
        • Destination: SF
        • Depart-time: Monday
    • Build a classifier to map from one to the other
    • Requirement: lots of labeled data

    6.6.1 Slot filling as sequence labeling: BIO tagging

    • The BIO tagging paradigm

    • Idea: train a classifier to label each input word with a tag that tells us what slot (if any) it fills

    image-20211228231942539

    • We create a B and I tag for each slot type
    • And convert the training data to this format
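    Decoding BIO tags back into slot fillers can be sketched as follows (the tag names `DEST` and `DEPART_TIME` are illustrative):

    ```python
    def decode_bio(tokens, tags):
        """Extract slot fillers from BIO tags,
        e.g. B-DEST I-DEST over 'san francisco' -> {'DEST': 'san francisco'}."""
        slots, current_slot, current_words = {}, None, []
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):               # start of a new slot span
                if current_slot:
                    slots[current_slot] = " ".join(current_words)
                current_slot, current_words = tag[2:], [token]
            elif tag.startswith("I-") and current_slot == tag[2:]:
                current_words.append(token)        # continue the current span
            else:                                  # O tag: close any open span
                if current_slot:
                    slots[current_slot] = " ".join(current_words)
                current_slot, current_words = None, []
        if current_slot:
            slots[current_slot] = " ".join(current_words)
        return slots

    tokens = "i want to fly to san francisco on monday".split()
    tags = ["O", "O", "O", "O", "O", "B-DEST", "I-DEST", "O", "B-DEPART_TIME"]
    ```

    Running `decode_bio(tokens, tags)` recovers the two slot fillers from the example sentence; the extracted strings are then normalized against the ontology as described below.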

    Slot filling using contextual embeddings

    image-20211228232102065

    Once we have the BIO tag of the sentence

    image-20211228232233477

    • We can extract the filler string for each slot
    • and then normalize it to the correct form in the ontology
      • like “SFO” for San Francisco
      • using homonym dictionaries (SF = SFO = San Francisco)

    6.7 The task of dialogue state tracking

    • Intuition: keep the slots filled in earlier turns until the frame is complete

    image-20211228232416605

    image-20211228232443956

    • Dialogue act interpretation algorithm:
      • 1-of-N supervised classification to choose the act (e.g., inform)
      • based on encodings of the current sentence + prior dialogue acts
    • Simple dialogue state tracker:
      • run a slot filler after each sentence
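    A minimal tracker in this spirit: a store that keeps earlier slot constraints until they are overridden (the act and slot-filler inputs are assumed to come from an upstream NLU component):

    ```python
    class DialogueState:
        """Accumulate slot-filler constraints across turns;
        later turns override earlier values for the same slot."""
        def __init__(self):
            self.slots = {}
            self.last_act = None

        def update(self, dialogue_act, slot_fillers):
            self.last_act = dialogue_act
            self.slots.update(slot_fillers)  # earlier slots persist until overridden

    state = DialogueState()
    state.update("inform", {"DEST": "Hong Kong"})
    state.update("inform", {"DEPART_DATE": "the 5th"})
    ```

    After the two turns, the state holds both constraints, which is exactly the "keep earlier slots until the frame is full" behavior described above.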

    A special case of dialogue act detection: detecting correction acts

    • If the system misrecognizes an utterance
    • the user might make a correction
      • repeating themselves
      • rephrasing
      • saying “no” to a confirmation question

    Features for detecting corrections in spoken dialogue

    image-20211228233218953

    6.8 Dialogue Policy

    • At each turn, predict the action to take, given the entire history
    • Simplify by conditioning only on the current dialogue state (the filled frame slots) and the last system and user turns

    6.8.1 Policy example: Confirmation and Rejection

    • Dialogue systems make errors
    • So they need mechanisms to make sure they have understood the user
    • Two important mechanisms:
      • confirming understandings with the user
      • rejecting utterances that the system is likely to have misunderstood

    6.8.2 Confirmation

    Explicit confirmation strategy

    image-20211228233717922

    Implicit confirmation strategy

    image-20211228233753222

    Confirmation strategy tradeoffs

    • Explicit confirmation makes it easier for users to correct the system’s misrecognitions since a user can just answer “no” to the confirmation question.
    • But explicit confirmation is also awkward and increases the length of the conversation ( Danieli and Gerbino 1995, Walker et al. 1998).

    6.8.3 Rejection

    image-20211228233917422

    Progressive prompting for rejection
    • Don’t just repeat the question “When would you like to leave?” Give user guidance about what they can say:

    image-20211228233957485

    Using confidence to decide whether to confirm:
    • ASR or NLU systems can assign a confidence value indicating how likely it is that they understood the user, based on e.g.:
      • acoustic log-likelihood of the utterance
      • prosodic features
      • ratio of the score of the best to the second-best interpretation
    • Systems can then use fixed confidence thresholds:
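    One common sketch: two or three thresholds split the confidence range into reject / explicit confirm / implicit confirm / accept (the threshold values below are illustrative, not from the source):

    ```python
    def confirmation_policy(confidence, t_low=0.3, t_mid=0.6, t_high=0.9):
        """Map an ASR/NLU confidence score in [0, 1] to a confirmation action."""
        if confidence < t_low:
            return "reject"             # too unsure: reprompt the user
        if confidence < t_mid:
            return "explicit_confirm"   # "You want to fly to Boston, is that right?"
        if confidence < t_high:
            return "implicit_confirm"   # "OK, flying to Boston. What day?"
        return "accept"                 # use the value without confirming
    ```

    The thresholds trade off the awkwardness of explicit confirmation against the cost of acting on a misrecognition, matching the tradeoffs discussed above.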

    6.9 Natural Language Generation (NLG)

    • NLG in information state architecture modeled in two stages:
      • content planning (what to say)
      • sentence realization (how to say it).
    • We’ll focus on sentence realization here.

    6.9.1 Sentence Realization

    • Intuition: the answer has already been found; the task is to generate the full sentence. The input is the slots and values (including the answer); the output is a sentence.

    • Assume content planning has been done by the dialogue policy

      • It has chosen the dialogue act to generate
      • and some attributes (slots and values) that the planner wants to say to the user
        • either to give the user the answer, or as part of a confirmation strategy
    Two samples of input and output for a sentence realizer

    image-20211229103542942

    • Training data is hard to come by

      • We don’t see every restaurant in every situation
    • Common way to improve generalization:

      • Delexicalization: replacing words in the training set that represent slot values with a generic placeholder token
      • I.e., mask the slot values during training so the model learns only the general sentence frame; after generation, the placeholders are swapped back for the real values

    image-20211229104043750

    image-20211229104053174

    image-20211229104517457

    • Output:
      • restaurant_name has decent service
    • Relexicalize to:
      • Au Midi has decent service
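    Delexicalization and relexicalization can be sketched as simple string substitution over slot values (real systems align values to the text more carefully):

    ```python
    def delexicalize(sentence, slots):
        """Replace each slot value with its placeholder token, for training."""
        for placeholder, value in slots.items():
            sentence = sentence.replace(value, placeholder)
        return sentence

    def relexicalize(sentence, slots):
        """Swap the real values back into the generated template."""
        for placeholder, value in slots.items():
            sentence = sentence.replace(placeholder, value)
        return sentence

    slots = {"restaurant_name": "Au Midi"}
    ```

    Here `delexicalize` turns "Au Midi has decent service" into the training form "restaurant_name has decent service", and `relexicalize` reverses the mapping on the generated output.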

    6.9.2 Generating clarification questions

    • This covers how to generate confirmation/clarification questions, e.g., with rules:

    • User: What do you have going to UNKNOWN WORD on the 5th?

    • System: Going where on the 5th?

    • The system repeats “going” and “on the 5th” to make clear which aspect of the user’s turn needs to be clarified
    • Methods for generating clarification questions:

      • rules like ‘replace “going to UNKNOWN WORD” with “going where”’
      • classifiers that guess which slots were misrecognized

    7. Reinforcement Learning (RL)

    image-20211229105431652

    7.1 RL vs. SL (supervised learning)

    image-20211229105642640

    • Differences from supervised learning

      • Learn by trial-and-error (“experimenting”)
        • Need efficient exploration
      • Optimize long-term reward
        • Need temporal credit assignment
    • Similarities to supervised learning

      • Generalization and representation
      • Hierarchical problem solving

    7.2 Conversation as RL

    image-20211229110004560

    • Observation and action
      • Raw representation (utterances in natural-language form)
      • Semantic representation (intent-slot-value form)
    • Reward
      • +10 upon successful termination
      • −10 upon unsuccessful termination
      • −1 per turn (to encourage shorter dialogues)
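    Under this reward scheme, an episode's return is the sum of per-turn penalties plus the terminal reward; the policy π is optimized to maximize its expectation. A sketch (the discount factor gamma is included for generality and is an assumption, not from the source):

    ```python
    def episode_return(num_turns, success, gamma=1.0):
        """Return of one dialogue episode: -1 per turn,
        then +10 on successful termination or -10 on failure."""
        ret = 0.0
        for t in range(num_turns):
            ret += (gamma ** t) * (-1.0)          # per-turn penalty
        terminal = 10.0 if success else -10.0     # terminal reward
        ret += (gamma ** num_turns) * terminal
        return ret
    ```

    For example, a 5-turn successful dialogue returns −5 + 10 = 5, while a 5-turn failure returns −15, so the learner is pushed toward short, successful dialogues.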

    8. Evaluating Dialogue Systems

    8.1 Evaluating chatbots and task based dialogue

    • Task-based dialogue:

      • mainly by measuring task performance
    • Chatbots

      • mainly by human evaluation

    8.2 Chatbots are evaluated by human participants

    • Participant evaluation: the human who talked to the chatbot assigns a score
    • Observer evaluation: a third party who reads a transcript of a human/chatbot conversation assigns a score

    8.2.1 Participant evaluation

    • A human chats with the model for 6 turns and rates 8 dimensions of quality:

      • avoiding repetition, interestingness, making sense, fluency, listening, inquisitiveness, humanness, engagingness
    • Avoiding repetition: How repetitive was this user?

      • Repeated themselves over and over • Sometimes said the same thing twice • Always said something new
    • Making sense: How often did this user say something which didn’t make sense?

      • Never made any sense • Most responses didn’t make sense • Some responses didn’t make sense • Everything made perfect sense
    • Engagingness: How much did you enjoy talking to this user?
      • Not at all • A little • Somewhat • A lot

    8.2.2 Observer evaluation: ACUTE-EVAL

    • Annotators look at two conversations (A + B) and decide which is better:
    • Engagingness: Who would you prefer to talk to for a long conversation?
    • Interestingness: If you had to say one of these speakers is interesting and one is boring, who would you say is more interesting?
    • Humanness: Which speaker sounds more human?
    • Knowledgeable: If you had to say that one speaker is more knowledgeable and one is more ignorant, who is more knowledgeable?

    8.3 Automatic evaluation is an open problem

    • Automatic evaluation methods (like the BLEU scores used for Machine Translation) are generally not used for chatbots.
      • They correlate poorly with human judgements.
    • One current research direction: Adversarial Evaluation
      • Inspired by the Turing Test
      • train a “Turing like” classifier to distinguish between human responses and machine responses.
      • The more successful a dialogue system is at fooling the evaluator, the better the system.

    8.3.1 Task-based systems are evaluated by task success!

    1. End-to-end evaluation (Task Success)
    2. Slot Error Rate for a Sentence

    8.4 Evaluation Metrics: Slot error rate

    • “Make an appointment with Chris at 10:30 in Gates 104”

    image-20211229111605860

    • Slot error rate: 1/3
    • Task success : At end, was the correct meeting added to the calendar?
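    Slot error rate for a sentence = (# inserted + deleted + substituted slot values) / (# reference slots). A sketch for the example above (the hypothesis with a misrecognized time is an illustrative assumption):

    ```python
    def slot_error_rate(reference, hypothesis):
        """Errors = reference slots that are missing or wrong in the hypothesis,
        plus spurious slots the hypothesis inserted."""
        errors = sum(1 for slot, value in reference.items()
                     if hypothesis.get(slot) != value)      # deletions + substitutions
        errors += sum(1 for slot in hypothesis
                      if slot not in reference)             # insertions
        return errors / len(reference)

    reference = {"PERSON": "Chris", "TIME": "10:30 am", "ROOM": "Gates 104"}
    hypothesis = {"PERSON": "Chris", "TIME": "11:30 am", "ROOM": "Gates 104"}
    ```

    With one substituted slot out of three reference slots, the rate is 1/3, matching the example above.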
    Author: Smurf
    Link: http://example.com/2021/08/15/nlp%20learning/Chapter12_Chatbots/
    License: CC BY-NC-SA 3.0 CN