JP2018155980A

JP2018155980A - Dialogue device and dialogue method

Info

Publication number: JP2018155980A
Application number: JP2017053989A
Authority: JP
Inventors: 整加藤; Hitoshi Kato; 拓磨峰村; Takuma Minemura; 純一伊藤; Junichi Ito; 政登藤井; Masato Fujii; 裕人今野; Hiroto Konno
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2017-03-21
Filing date: 2017-03-21
Publication date: 2018-10-04
Anticipated expiration: 2037-03-21
Also published as: JP6772916B2

Abstract

PROBLEM TO BE SOLVED: To properly decide whether to perform task-oriented dialogue or non-task-oriented dialogue in a dialogue device.
SOLUTION: The device includes utterance recognition means for recognizing the content of an utterance made by a user, response generating means for generating a response to the utterance according to a response policy, which is a rule for generating a response to the utterance, evaluation obtaining means for determining a user evaluation, which is a value representing how favorable the response was, and updating means that performs reinforcement learning with the user evaluation as a reward and updates the response policy. According to the response policy, the response generating means determines a response mode indicating whether to generate a task-oriented response or a non-task-oriented response, and generates a response according to the response mode.
SELECTED DRAWING: Figure 1

Description

The present invention relates to an apparatus that conducts dialogue with a user.

In recent years, many apparatuses that converse with a user in natural language have been proposed. In particular, in addition to task-oriented dialogue such as "providing information in response to a question", technology for non-task-oriented dialogue such as chat has advanced remarkably. For example, Patent Literature 1 discloses a method of generating a dialogue strategy with a user through learning.

As this technology has advanced, apparatuses capable of both task-oriented dialogue and non-task-oriented dialogue have appeared. A task-oriented dialogue is a dialogue that requests a specific task, such as "Tell me tomorrow's weather". A non-task-oriented dialogue is a dialogue that does not request a specific task, such as "Clear May weather feels great, doesn't it?". Using the two appropriately makes a more natural dialogue possible.

Patent Literature 1: JP 2006-72477 A

When both task-oriented and non-task-oriented dialogue are possible, there are situations in which the apparatus must estimate which mode of dialogue the speaker (user) wants.
For example, if the user says "Nice weather, isn't it?", it is clear that a non-task-oriented dialogue has started. However, if the user says "Will it be sunny tomorrow?", the text alone does not reveal whether the user wants to know tomorrow's weather forecast or wants to talk about tomorrow's plans.

The present invention has been made in view of the above problem, and its object is to provide a dialogue apparatus that appropriately decides whether to conduct a task-oriented dialogue or a non-task-oriented dialogue.

The dialogue apparatus according to the present invention comprises: utterance recognition means for recognizing the content of an utterance made by a user; response generation means for generating a response to the utterance according to a response policy, which is a rule for generating a response to the utterance; evaluation acquisition means for determining a user evaluation, which is a value representing how favorable the response was; and update means for performing reinforcement learning using the user evaluation as a reward and updating the response policy, wherein the response generation means determines, according to the response policy, a response mode indicating whether to generate a task-oriented response or a non-task-oriented response, and generates a response according to the response mode.

The response policy is a policy (rule) that defines what kind of response is returned in what state of the dialogue. In the present invention, whether to generate a task-oriented response or a non-task-oriented response is decided according to the established response policy.
The response policy is updated by performing reinforcement learning using rewards obtained from the results of dialogue with the user. For example, the apparatus may start with a default response policy and update it by learning through dialogue. In the present invention, a value representing how favorable the response provided to the user was (the user evaluation) is determined after the fact and used as the reward for reinforcement learning.

The user evaluation (that is, how favorable the response was) may be determined based on the state of the user. For example, the user evaluation may be raised if the user's reaction to the response was favorable and lowered if it was not. Whether the user's reaction is favorable can be determined, for example, from the results of sensing the user's speech or facial expression.
The user evaluation may also be determined based on the content of the dialogue after the response is made. For example, the length of time for which the dialogue continued can be used.
With this configuration, reinforcement learning is carried out through dialogue and the response policy improves. That is, as learning progresses, the apparatus becomes able to decide appropriately whether to generate a task-oriented response or a non-task-oriented response.

The dialogue apparatus according to the present invention may further comprise state estimation means for estimating the current dialogue state, based on the content of the utterance, from among a plurality of dialogue states each associated with a degree to which task-oriented dialogue is taking place, and the response generation means may perform the reinforcement learning using the plurality of dialogue states.

A dialogue state is a state determined by the content of the user's utterance. The dialogue states are discretized according to the degree of task orientation. Performing reinforcement learning with such dialogue states makes it possible to decide appropriately, according to the current dialogue state, whether to adopt the task-oriented response mode or the non-task-oriented response mode.

The reinforcement learning may be Q-learning, and the response policy may be Q values associated with combinations of dialogue state and response mode.

Q-learning, which is one type of reinforcement learning, is well suited to the dialogue apparatus according to the present invention.
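For reference, the textbook tabular Q-learning update can be written as follows. This is the standard form, not a formula quoted from this specification: here s is a dialogue state, a is a response mode, and r is the reward (the user evaluation), while the learning rate α and discount factor γ are assumed hyperparameters that this section does not specify.

```latex
% Textbook tabular Q-learning update (assumed form; not quoted from this specification).
% s_t: dialogue state, a_t: response mode, r_{t+1}: reward (user evaluation),
% \alpha: learning rate, \gamma: discount factor (both assumed hyperparameters).
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Bigl[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Bigr]
```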

The dialogue apparatus according to the present invention may further comprise target estimation means for identifying the target that the user is referring to in the dialogue, and may perform the dialogue state estimation and the reinforcement learning separately for each identified target.

When the target that the user is referring to in the dialogue changes (that is, when the topic shifts), whether the user wants a task-oriented response or a non-task-oriented response may change considerably. Performing dialogue state estimation and reinforcement learning for each topic therefore makes it possible to select a more appropriate response mode.

The dialogue apparatus according to the present invention may further comprise speaker estimation means for identifying the user, and may perform the dialogue state estimation and the reinforcement learning separately for each identified user.

The criteria for judging whether a task-oriented response or a non-task-oriented response is preferable differ from user to user. Performing dialogue state determination and learning separately for each user therefore makes it possible to select a more appropriate response mode.

The evaluation acquisition means may determine the user evaluation based on the user's utterance made after a response according to the response policy.
The evaluation acquisition means may also determine the user evaluation based on how long a series of dialogue continued after a response according to the response policy.
The evaluation acquisition means may also determine the user evaluation based on the time from a response according to the response policy until the user's next utterance.

For example, if an inappropriate response is made, such as a non-task-oriented response when the user wants a task-oriented dialogue, the dialogue can be expected to break off. The user evaluation can therefore be determined based on the user's utterance following the response, the duration of the dialogue, the time until the user speaks, and so on.

The present invention can be specified as a dialogue apparatus including at least some of the above means. It can also be specified as a dialogue method performed by the dialogue apparatus. The above processes and means can be freely combined and implemented as long as no technical contradiction arises.

According to the present invention, a dialogue apparatus can appropriately decide whether to conduct a task-oriented dialogue or a non-task-oriented dialogue.

FIG. 1 is a system configuration diagram of the dialogue apparatus according to the first embodiment.
FIG. 2 is a diagram explaining the dialogue states in the present invention.
FIG. 3 is an example of the reward table in the first embodiment.
FIG. 4 is a flowchart of the processing performed by the dialogue apparatus according to the first embodiment.
FIG. 5 is an example of a dialogue between a user and the apparatus.
FIG. 6 is a system configuration diagram of the dialogue apparatus according to the second embodiment.
FIG. 7 is a system configuration diagram of the dialogue apparatus according to the third embodiment.

(First embodiment)
<System overview>
Hereinafter, preferred embodiments of the present invention will be described with reference to the drawings.
The dialogue apparatus according to the first embodiment is a system that converses with a user by acquiring the user's speech, performing speech recognition, and generating a response sentence based on the recognition result.

The dialogue apparatus according to the present embodiment is configured to support both task-oriented dialogue and non-task-oriented dialogue. A task-oriented dialogue is a dialogue aimed at accomplishing a specific task, and a non-task-oriented dialogue is a dialogue not aimed at accomplishing a specific task. For example, when the user says "Tell me tomorrow's weather", the task to be accomplished is "providing weather information". When the user says "Turn down the air conditioner temperature", the task to be accomplished is lowering the set temperature of the air conditioner. In a non-task-oriented dialogue, on the other hand, there is no task to accomplish, so the dialogue consists mainly of chat.

As described above, when the user's utterance requests a concrete task, as in "Please do such-and-such", it is clear that a task-oriented dialogue has started. However, when the user says something like "Will it be sunny tomorrow?" or "Isn't it a bit hot?", it may not be possible to determine which type of dialogue is intended. The dialogue apparatus according to the first embodiment uses reinforcement learning to judge whether a task-oriented or a non-task-oriented response should be given, and responds in the appropriate mode.

<System configuration>
Next, the system configuration of an apparatus for realizing the above function is described.
FIG. 1 is a system configuration diagram of the dialogue apparatus according to the first embodiment. The dialogue apparatus 100 according to the present embodiment includes an input/output unit 101, an utterance acquisition unit 102, a response generation unit 103, a state estimation unit 104, a learning unit 105, and an evaluation acquisition unit 106.

The input/output unit 101 is a means for inputting and outputting speech through a built-in microphone and speaker. It converts input speech into speech data and outputs supplied speech data as sound.

The utterance acquisition unit 102 is a means for acquiring, via the input/output unit 101, the speech uttered by the user and recognizing it. The utterance acquisition unit 102 performs speech recognition on the acquired speech data using a known technique. For example, the utterance acquisition unit 102 stores an acoustic model and a recognition dictionary, extracts features by comparing the acquired speech data with the acoustic model, and performs speech recognition by matching the extracted features against the recognition dictionary. The recognition result is sent to the response generation unit 103 and the state estimation unit 104.
The utterance acquisition unit 102 may also perform speech recognition using a service outside the apparatus. For example, as shown in FIG. 1, it may be configured to communicate with an external speech recognition server over a network and obtain the recognition result.

The response generation unit 103 is a means for generating a response sentence to be provided to the user, based on the text acquired from the utterance acquisition unit 102. The response sentence may be based, for example, on a dialogue scenario (dialogue dictionary) stored in advance, or on information obtained by searching a database or the web.
The response generation unit 103 also determines a response mode, which indicates whether to generate a task-oriented response or a non-task-oriented response, based on the dialogue state estimated by the state estimation unit 104 (described later) and the information stored in the learning unit 105, and generates a different type of response depending on that response mode. The specific method is described later.

The state estimation unit 104 is a means for estimating the state of the dialogue with the user. In this specification, a dialogue state is one of a plurality of states, each associated with a degree to which task-oriented dialogue is taking place, and is determined from the content of the user's utterance.
The dialogue states are described here with reference to FIG. 2. In this specification, the state at a given point in the dialogue is specified in association with the degree to which task-oriented dialogue is taking place (hereinafter, the task degree).
Let S be the set of states the dialogue can take. If the task degree is discretized into ten levels, S is as follows, where S₀ denotes the state in which there is no utterance.
S = {S₀, S₁, S₂, ..., S₁₀}
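The state set can be sketched in Python as below. This is a minimal illustration, not the patent's implementation: the names `DIALOGUE_STATES` and `task_degree_to_state`, the [0.0, 1.0] input scale for the task degree, and the equal-width binning are all assumptions, since the specification only says that the task degree is judged by a known technique and discretized into ten levels.

```python
# Hypothetical names; the specification only defines S = {S0, S1, ..., S10}.
NO_UTTERANCE = 0       # S0: the state in which there is no utterance
NUM_TASK_LEVELS = 10   # the task degree is discretized into ten levels

# State indices 0..10 stand for S0..S10.
DIALOGUE_STATES = list(range(NUM_TASK_LEVELS + 1))


def task_degree_to_state(task_degree: float) -> int:
    """Map a task degree to a state index 1..10 (S1 = least task-oriented).

    Assumption: the task degree is a score in [0.0, 1.0] and the ten levels
    are equal-width bins. Neither detail comes from the specification.
    """
    if not 0.0 <= task_degree <= 1.0:
        raise ValueError("task_degree must be in [0.0, 1.0]")
    # A score of exactly 1.0 is clamped into the top bin (S10).
    return min(int(task_degree * NUM_TASK_LEVELS) + 1, NUM_TASK_LEVELS)
```

With this binning, a score near 0 maps to S₁ and a score near 1 maps to S₁₀, matching the ordering described for FIG. 2.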

In the example of FIG. 2, state S₁ is the state in which the dialogue is least task-oriented, and state S₁₀ is the state in which the dialogue is most task-oriented. The dialogue state can change (transition) each time the user or the apparatus speaks. Transitions between any of the dialogue states are possible, but the dialogue state may also remain unchanged after an utterance.
The state estimation unit 104 judges the task degree based on the content of the user's utterance and estimates the dialogue state each time. A known technique can be used to judge the task degree.

However, judging the task degree by a uniform criterion alone may not be enough to select an appropriate response mode. For example, given the utterance "Will it be sunny tomorrow?", user A may intend to request a weather forecast, whereas user B may intend to talk about tomorrow's plans.
The dialogue apparatus according to the present embodiment therefore performs reinforcement learning by the means described later and also uses the learning results when determining the response mode.

The learning unit 105 is a means for accumulating, through reinforcement learning, the data used to determine the response mode. For each dialogue state, the learning unit 105 uses Q-learning to learn which response mode maximizes the expected reward (that is, which response mode yields a response that satisfies the user).
FIGS. 3(A) and 3(B) are examples of the table (reward table) held by the learning unit 105. In the present embodiment, each dialogue state (corresponding to the dialogue state estimated by the state estimation unit 104) and response mode (A₁ denotes a non-task-oriented response and A₂ denotes a task-oriented response) is stored in association with an expected reward.
The expected reward is the reward expected in reinforcement learning; in the present embodiment it is a value representing how appropriate a response is. The expected reward is updated through learning based on, for example, the user's satisfaction. How learning proceeds and how the expected reward is handled are described later.

FIG. 3(A) shows the reward table with its initial values, and FIG. 3(B) shows the reward table after learning. In FIG. 3(B), for example, when the dialogue state is S₁, a larger reward can be expected from a non-task-oriented response (that is, the user is more satisfied). Likewise, when the dialogue state is S₁₀, a larger reward can be expected from a task-oriented response.

The learning unit 105 has means for temporarily or permanently storing the data used by the apparatus. It is preferable to use a storage medium, such as flash memory, that supports fast reads and writes and has a large capacity.

The evaluation acquisition unit 106 is a means for acquiring the reward with which the learning unit 105 performs reinforcement learning. In the present embodiment, the reward is a value (the user evaluation) representing how favorable the response generated by the response generation unit 103 was, and it is acquired after the fact based on the user's reaction.

The method of acquiring the reward is described here.
If the response generated by the response generation unit 103 was in an appropriate response mode (that is, it matched the response mode the user wanted), the dialogue can be expected to continue smoothly. If it was not in an appropriate response mode, the dialogue can be expected not to continue smoothly. In the present embodiment, therefore, the reward is calculated based on whether a series of dialogue continued after the response was provided to the user.
For example, the reward is +1.0 if the dialogue continued and -1.0 if the dialogue broke off. The calculated reward is sent to the learning unit 105 and used as learning data for evaluating the response generated immediately before.
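The reward rule in this example can be sketched as follows. The values +1.0 and -1.0 come from the text; the names `continuation_reward`, `REWARD_CONTINUED`, and `REWARD_INTERRUPTED` are hypothetical, and how the apparatus detects that a dialogue "continued" is not specified here, so the sketch takes that judgment as a boolean input.

```python
REWARD_CONTINUED = 1.0      # the dialogue continued after the response
REWARD_INTERRUPTED = -1.0   # the dialogue broke off after the response


def continuation_reward(dialogue_continued: bool) -> float:
    """Return the reward for the response generated immediately before.

    Assumption: the caller has already decided whether the dialogue
    continued; the detection method is outside this sketch.
    """
    return REWARD_CONTINUED if dialogue_continued else REWARD_INTERRUPTED
```

Keeping the detection step outside the function makes it easy to swap in other reward definitions without touching the learning code.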

In this example the reward is determined by whether the dialogue continued, but it may be calculated by any other method that can evaluate whether the response mode was appropriate. For example, the reward may be calculated from how long the dialogue continued (in time or in number of turns), or from the time it took the user to react.
The reward may also be calculated from the results of sensing the user. For example, the user's satisfaction may be estimated from the tone of voice and used as the reward. The reward may of course also be calculated from the content of the utterance (for example, whether the user said something that rejects the response mode of the immediately preceding response).

The dialogue apparatus 100 can be configured as an information processing apparatus having a CPU, a main storage device, and an auxiliary storage device. The functions described below are realized when a program stored in the auxiliary storage device is loaded into the main storage device and executed by the CPU. All or some of the illustrated functions may instead be executed by a purpose-designed circuit (such as a semiconductor integrated circuit).

<Process flowchart>
Next, the processing performed by each means shown in FIG. 1 is described with reference to FIG. 4, which is a flowchart of the processing. The flowchart shown in FIG. 4 starts from an initial state in which the apparatus has not yet performed any learning.

First, in step S11, default values are inserted into the reward table. In the present embodiment, ten dialogue states are defined as shown in FIG. 3(A), and A₁ (non-task-oriented response) and A₂ (task-oriented response) are associated with each of them. An expected reward (Q value) is associated with each response. The expected reward (Q value) is a dimensionless number indicating which mode yields the higher reward (user satisfaction) when used for the response. In this example, 5 is inserted in every entry as the default value.
When a dialogue with the user starts, the process proceeds to step S12.
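Step S11 can be sketched as building a table keyed by (dialogue state, response mode). The ten states, the two modes A₁ and A₂, and the default value 5 come from the text; the dictionary layout and the names `init_reward_table`, `STATES`, and `MODES` are assumptions made for illustration.

```python
# Hypothetical layout: one Q value per (state index, response mode) pair.
STATES = range(1, 11)   # S1..S10 (S0, "no utterance", has no table row here)
MODES = ("A1", "A2")    # A1: non-task-oriented, A2: task-oriented
DEFAULT_Q = 5.0         # default expected reward given in the text


def init_reward_table(default: float = DEFAULT_Q) -> dict:
    """Return a reward table with every entry set to the default Q value."""
    return {(state, mode): default for state in STATES for mode in MODES}


reward_table = init_reward_table()
```

A flat dictionary keeps lookups simple; a per-user or per-topic variant could hold one such table for each user or topic.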

In step S12, the utterance acquisition unit 102 acquires the user's utterance and performs speech recognition. The text obtained as the recognition result is sent to the response generation unit 103 and the state estimation unit 104.
Next, the state estimation unit 104 estimates the current dialogue state based on the content of the user's utterance. For example, if the initial state among the states shown in FIG. 2 is S₀ (the state meaning no utterance), it estimates which of the states S₁ to S₁₀ the dialogue has transitioned to from S₀. If the current state is already one of S₁ to S₁₀, it estimates whether the dialogue has transitioned to another state. The state may also remain unchanged (that is, the utterance results in a transition to the same state).

Next, in step S13, the response generation unit 103 selects a response mode by referring to the state estimated by the state estimation unit 104 in the previous step and to the reward table stored in the learning unit 105. For example, suppose the reward table stored in the learning unit 105 is as shown in FIG. 3(B) and the dialogue state estimated in step S12 is S₂. In this case, the response mode with the highest expected reward in dialogue state S₂ is A₁ (non-task-oriented response). The response generation unit 103 therefore selects the non-task-oriented response as the response mode.
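The selection in step S13 amounts to picking the mode with the largest Q value for the estimated state. The sketch below assumes a table keyed by (state index, mode) and a hypothetical function name `select_response_mode`; the Q values in the usage lines are made up for illustration and are not the values of FIG. 3(B). Tie-breaking is also an assumption: Python's `max` returns the first maximal item, so a tie resolves to A₁ here.

```python
def select_response_mode(reward_table: dict, state: int,
                         modes: tuple = ("A1", "A2")) -> str:
    """Return the response mode with the highest expected reward for `state`.

    A1 is the non-task-oriented mode and A2 the task-oriented mode.
    On a tie, the first mode in `modes` wins (an assumed tie-break).
    """
    return max(modes, key=lambda mode: reward_table[(state, mode)])


# Illustrative values only (not taken from FIG. 3(B)).
example_table = {(2, "A1"): 7.0, (2, "A2"): 3.0,
                 (9, "A1"): 2.0, (9, "A2"): 8.0}
mode_for_s2 = select_response_mode(example_table, 2)  # picks "A1"
mode_for_s9 = select_response_mode(example_table, 9)  # picks "A2"
```

A purely greedy choice like this never explores the other mode; whether the apparatus adds any exploration is not stated in this section.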

Next, in step S14, the response generation unit 103 generates a response according to the selected response mode and outputs it as speech via the input/output unit 101.

In step S15, whether to perform reinforcement learning is selected based on the user's reaction to the output response. Reinforcement learning is ON by default and continues until the end of learning is determined in step S19, described later.

Steps S16 to S19 perform reinforcement learning based on the user's reaction to the output response. In these steps, the response that the apparatus gave in the previous conversation turn is evaluated based on the utterance the user has just made.
FIG. 5 shows an example of conversation turns. In this example, at time t1, a user who wants to know tomorrow's weather forecast says "Will it be sunny tomorrow?", and at time t2 the dialogue apparatus returns the non-task-oriented response "I hope it's sunny!" (a response in a mode the user did not want) (step S14).
The user then, after a short silence, rephrases at time t4: "Tell me tomorrow's weather."

In this case, the process of step S12 is performed at time t1 and the process of step S14 at time t2, but because no previous conversation turn exists, the processes from step S16 onward are skipped.

On the other hand, when the user's utterance is acquired at time t4, a previous conversation turn exists (a turn with "no utterance"), so step S16 is executed and the evaluation acquisition unit 106 calculates the reward.
The reward here is the reward used in reinforcement learning: a score for an action selection (in this example, the selection of a response mode). That is, in step S16, whether the response mode chosen in the previous conversation turn was correct is determined based on the user's next utterance.
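The delayed evaluation described above (no evaluation on the first turn, evaluation of the previous turn on every later one) can be sketched as follows. The class name `TurnTracker`, its method, and the `reward_fn` callback are hypothetical names introduced for this example; how the reward itself is computed is left open here.

```python
class TurnTracker:
    """Remembers the previous turn so it can be scored when the next utterance arrives."""

    def __init__(self):
        self.previous = None  # (state, mode) of the previous conversation turn, if any

    def on_user_utterance(self, state, mode, reward_fn):
        """Return (state, mode, reward) for the previous turn, or None if there is none."""
        evaluation = None
        if self.previous is not None:
            prev_state, prev_mode = self.previous
            # Step S16: score the previously selected response mode.
            evaluation = (prev_state, prev_mode, reward_fn())
        # Remember the current turn so the next utterance can evaluate it.
        self.previous = (state, mode)
        return evaluation
```

On the first utterance the method returns `None`, which corresponds to skipping step S16 and the steps after it.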

There are various methods for calculating the reward; the following are some examples. Methods other than those listed may of course be used, as long as they can estimate the user's satisfaction with the response mode selected in the previous conversation turn.

(1) Time or number of turns for which the dialogue continued without interruption
For example, whether the device responded in the desired response mode can be estimated from how long the dialogue continued without interruption or from the number of conversation turns. This is because a dialogue is less likely to be sustained when the device responds in an inappropriate response mode.

(2) The user's satisfaction (or dissatisfaction) obtained by analyzing the speech uttered by the user
For example, whether the user is satisfied can be estimated by analyzing the pitch, tone, volume, and similar properties of the voice.

(3) Time until the next utterance is obtained from the user
This criterion is used because the user may be confused, and therefore slow to reply, when the device responds in an inappropriate response mode. Applied to the example of FIG. 5, silence occurs at time t3, so at time t4 the device can learn that the reward for the response made at time t2 was low.
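Criterion (3) can be illustrated with a small function that turns the delay before the user's next utterance into a reward. The 3-second threshold and the function name `reward_from_latency` are assumptions for this sketch; the source does not state how long a silence must be. The two reward values follow the -1.0 / +1.0 example given for step S17.

```python
def reward_from_latency(response_time, next_utterance_time, silence_threshold=3.0):
    """Return -1.0 if the user stayed silent longer than the threshold, else +1.0.

    Times are in seconds. The threshold is an assumed value, not one from the source.
    """
    latency = next_utterance_time - response_time
    return -1.0 if latency > silence_threshold else 1.0
```

For the FIG. 5 example, a long gap between the response at t2 and the utterance at t4 would yield the low reward.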

Next, in step S17, the learning unit 105 performs reinforcement learning based on the obtained reward and updates the reward table.
In Q-learning, the expected reward for taking action a in state s under policy π is written Q^π(s, a). The policy π is the reward table, and the state s is the dialogue state estimated in step S12. The action a is either A₁ or A₂.
In this example, a reward of -1.0 is given when the conversation breaks off as a result of the device's response (that is, when silence occurs and the state transitions to S₀), and a reward of +1.0 is given when the conversation continues (that is, when the state transitions to one of S₁ to S₁₀).
As this learning continues and the reward table is updated, the Q value that maximizes the reward converges to a specific value for each dialogue state.
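The update in step S17 can be sketched with the standard tabular Q-learning rule. The source does not spell out the update formula, so the rule itself, the learning rate `ALPHA`, the discount factor `GAMMA`, and all function names below are assumptions; only the -1.0 / +1.0 rewards and the state and mode labels come from the text.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)
MODES = ("A1", "A2")

def new_table():
    """Create a reward table whose Q values start at 0.0 for every state and mode."""
    return defaultdict(lambda: {mode: 0.0 for mode in MODES})

def reward_for_transition(next_state):
    """-1.0 when the conversation breaks off (state S0), +1.0 when it continues."""
    return -1.0 if next_state == "S0" else 1.0

def q_update(q, state, mode, next_state):
    """Apply one Q-learning update and return the change in the Q value."""
    reward = reward_for_transition(next_state)
    best_next = max(q[next_state].values())
    old = q[state][mode]
    q[state][mode] = old + ALPHA * (reward + GAMMA * best_next - old)
    return q[state][mode] - old
```

The returned change in the Q value is the kind of quantity that a convergence check such as the one in step S18 could compare against a threshold.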

In step S18, it is determined whether the amount of change in the Q value (expected reward) caused by updating the reward table is greater than a threshold. If the change is greater than the threshold, the process returns to step S12. If the change is sufficiently small, the Q value is considered to have already converged to its target value, so the process proceeds to step S19 and the flag for performing reinforcement learning is set to OFF. From then on, the process no longer proceeds to step S16 and the subsequent steps.
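The convergence check of steps S18 and S19 amounts to comparing the magnitude of the latest Q-value change with a threshold. The threshold value of 0.01 and the function name `update_learning_flag` are assumptions for this sketch.

```python
def update_learning_flag(q_change, threshold=0.01, learning_on=True):
    """Return the new state of the reinforcement-learning flag.

    The flag stays ON while the Q value is still changing by more than the
    threshold (step S18) and is switched OFF once the change is small (step S19).
    Once OFF, it stays OFF.
    """
    if not learning_on:
        return False
    return abs(q_change) > threshold
```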

Repeating the processing described above updates, through Q-learning, the reward table used to select the optimal response mode. This makes it possible to appropriately determine whether to give a task-oriented response or a non-task-oriented response.

(Second embodiment)
The second embodiment identifies the user taking part in the dialogue and performs learning for each user. FIG. 6 is a system configuration diagram of the dialogue device 100 according to the second embodiment.
The dialogue device 100 according to the second embodiment differs from that of the first embodiment in that it includes means for identifying the user taking part in the dialogue (user identification unit 107).

The user identification unit 107 is means for identifying a user based on the acquired speech. For example, the user identification unit 107 assigns a unique identifier to the user who produced the speech, based on the result of analyzing that speech. Although speech analysis is used as the example here, the user may instead be identified based on, for example, a face image of the user.
In the second embodiment, multiple reward tables are stored, one per user identifier, so that reinforcement learning can be performed for each user. Performing the processing of the first embodiment for each user in this way enables more personalized responses.

When the dialogue device interacts with multiple users at the same time, processing may be performed while switching the reward table each time the user currently in dialogue changes. When the user in dialogue cannot be identified, a default reward table (a reward table trained to maximize the reward for a typical user) may be used.
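Per-user reward tables with a default fallback can be sketched as a small store keyed by user identifier. The class name `RewardTableStore`, the use of `None` for an unidentified user, and the choice to initialize each new user's table as a copy of the default table are assumptions for this example; the source does not say how a new user's table is initialized.

```python
import copy

class RewardTableStore:
    """Holds one reward table per user identifier plus a shared default table."""

    def __init__(self, default_table):
        self._default = default_table
        self._tables = {}

    def table_for(self, user_id):
        """Return the table for `user_id`, or the default table if the user is unknown."""
        if user_id is None:  # user could not be identified
            return self._default
        if user_id not in self._tables:
            # Assumption: a newly seen user starts from a copy of the default table.
            self._tables[user_id] = copy.deepcopy(self._default)
        return self._tables[user_id]
```

Switching tables when the current speaker changes then reduces to calling `table_for` with the identifier produced for each utterance.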

(Third embodiment)
The third embodiment identifies the object that the user refers to in the dialogue (that is, the topic) and performs learning for each topic. FIG. 7 is a system configuration diagram of the dialogue device 100 according to the third embodiment.
The dialogue device 100 according to the third embodiment differs from that of the first embodiment in that it includes means for identifying the object that the user refers to in the dialogue (topic identification unit 108).

The topic identification unit 108 is means for identifying the object that the user refers to in the dialogue (the object being discussed). The topic of a dialogue can be identified, for example, by performing morphological analysis on the text obtained from speech recognition and analyzing the resulting words. For example, the topic identification unit 108 assigns a unique identifier (topic identifier) to each topic.
In the third embodiment, multiple reward tables are stored, one per topic identifier, so that reinforcement learning can be performed for each topic. Performing the processing of the first embodiment for each topic in this way enables more accurate responses.
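Topic identification can be illustrated with a simple keyword count over the recognized text. This is a stand-in, not the method of the source: a regular-expression word split replaces morphological analysis, and the keyword lists, topic identifiers, and function name `identify_topic` are invented for the example.

```python
import re
from collections import Counter

# Hypothetical keyword lists per topic identifier (illustrative only).
TOPIC_KEYWORDS = {
    "weather": {"weather", "sunny", "rain", "forecast"},
    "music": {"song", "music", "band"},
}

def identify_topic(text):
    """Return the topic identifier with the most keyword hits, or None if none match."""
    # Simplified tokenization standing in for morphological analysis.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for topic, keywords in TOPIC_KEYWORDS.items():
        counts[topic] = sum(1 for word in words if word in keywords)
    topic, hits = counts.most_common(1)[0]
    return topic if hits > 0 else None
```

The returned identifier could then be used to pick the reward table for that topic.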

(Modification)
The above embodiments are merely examples, and the present invention can be implemented with appropriate modifications within a range that does not depart from its gist.
For example, the second and third embodiments may be combined so that both the user and the topic are used. In this case, the reward tables may be integrated or used separately. When the reward tables are not integrated, a response mode may be determined from each table individually, and which one to adopt may then be decided based on the resulting Q values.
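When the user table and the topic table are kept separate, one possible reading of the last sentence is to pick the best mode from each table and adopt the one with the higher Q value. The function names and the tie-breaking rule (the user table wins on a tie) are assumptions for this sketch.

```python
def best_mode(table, state):
    """Return (mode, q_value) for the highest-valued response mode in `state`."""
    modes = table[state]
    mode = max(modes, key=modes.get)
    return mode, modes[mode]

def select_combined(user_table, topic_table, state):
    """Choose a mode from each table, then adopt the one with the higher Q value."""
    user_choice = best_mode(user_table, state)
    topic_choice = best_mode(topic_table, state)
    # max() returns the first of equal candidates, so the user table wins ties.
    return max(user_choice, topic_choice, key=lambda choice: choice[1])[0]
```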

DESCRIPTION OF SYMBOLS
100 Dialogue device
101 Input/output unit
102 Utterance acquisition unit
103 Response generation unit
104 State estimation unit
105 Learning unit
106 Evaluation acquisition unit

Claims

1. A dialogue device comprising:
utterance recognition means for recognizing the content of an utterance made by a user;
response generation means for generating a response to the utterance according to a response policy, which is a rule for generating a response to an utterance;
evaluation acquisition means for determining a user evaluation, which is a value representing how favorably the response was received; and
update means for performing reinforcement learning using the user evaluation as a reward and updating the response policy,
wherein the response generation means determines, according to the response policy, a response mode indicating whether to generate a task-oriented response or a non-task-oriented response, and generates a response according to the response mode.

2. The dialogue device according to claim 1, further comprising state estimation means for estimating a current dialogue state, based on the content of the utterance, from among a plurality of dialogue states associated with the degree to which the dialogue is task-oriented,
wherein the response generation means performs the reinforcement learning using the plurality of dialogue states.

3. The dialogue device according to claim 2,
wherein the reinforcement learning is Q-learning, and
the response policy is a Q value associated with each combination of dialogue state and response mode.

4. The dialogue device according to claim 2 or 3, further comprising object estimation means for identifying an object that the user refers to in the dialogue,
wherein the estimation of the dialogue state and the reinforcement learning are performed for each identified object.

5. The dialogue device according to any one of claims 2 to 4, further comprising speaker estimation means for identifying the user,
wherein the estimation of the dialogue state and the reinforcement learning are performed for each identified user.

6. The dialogue device according to claim 1,
wherein the evaluation acquisition means determines the user evaluation based on the user's utterance made after the response according to the response policy.

7. The dialogue device according to claim 1,
wherein the evaluation acquisition means determines the user evaluation based on the length of the series of dialogues that follows the response according to the response policy.

8. The dialogue device according to claim 1,
wherein the evaluation acquisition means determines the user evaluation based on the time from the response according to the response policy until the user speaks again.

9. A dialogue method comprising:
an utterance recognition step of recognizing the content of an utterance made by a user;
a response generation step of generating a response to the utterance according to a response policy, which is a rule for generating a response to an utterance;
an evaluation acquisition step of determining a user evaluation, which is a value representing how favorably the response was received; and
an update step of performing reinforcement learning using the user evaluation as a reward and updating the response policy,
wherein, in the response generation step, a response mode indicating whether to generate a task-oriented response or a non-task-oriented response is determined according to the response policy, and a response is generated according to the response mode.

10. A program for causing a computer to execute the dialogue method according to claim 9.