JP7367609B2 - Response sentence generation device, reinforcement learning device, response sentence generation method, model generation method, program

Info

Publication number: JP7367609B2
Application number: JP2020086759A
Authority: JP
Inventors: Masahiro Mizukami; Seiya Kawano
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2020-05-18
Filing date: 2020-05-18
Publication date: 2023-10-24
Anticipated expiration: 2040-05-18
Also published as: JP2021182039A

Description

The present invention relates to a response sentence generation device, a reinforcement learning device, a response sentence generation method, a model generation method, and a program for generating a response sentence to a dialogue history.

Non-Patent Document 1 and similar work are known as techniques for selecting a response sentence. Non-Patent Document 1 describes an approach that takes entrainment into account during a dialogue and applies it to response sentence selection in an example-based dialogue system. Entrainment, also called alignment tendency or synchrony, is the phenomenon in which the behavior of the speakers in a dialogue, such as their manner of speaking and tone of voice, becomes synchronized or similar.

Non-Patent Document 1: Masahiro Mizukami, Koichiro Yoshino, Graham Neubig, Satoshi Nakamura, "Evaluation of a response sentence selection model based on entrainment analysis", Proceedings of the 23rd Annual Meeting of the Association for Natural Language Processing, pp. 370-373, March 2017.

As dialogue systems have developed, they are increasingly expected not only to return reasonable responses but also to reproduce more human-like responses and dialogue phenomena specific to humans, thereby improving their human-likeness. Most dialogue systems to date are single-turn question-and-answer systems that return a reasonable response to an arbitrary input, using data obtained from human-to-human dialogue as a reference. Non-Patent Document 1 describes a technique that first selects response sentence candidates and then, using word frequencies computed from the past dialogue history (also called a language model), chooses the candidate that exhibits the most entrainment as the appropriate response sentence. However, in Non-Patent Document 1 the response sentence candidates are still selected on the basis of the immediately preceding utterance; they are not generated from a dialogue history consisting of multiple past utterances.

In recent dialogue systems, generating responses with a neural network (hereinafter, the "sentence generation method") has become common, and models that "can take context into account", taking the entire dialogue as input and outputting the next response, are also being proposed. However, research has also shown that even when all exchanges are given as input, the model ultimately attends only to the user's immediately preceding utterance. This indicates that even a model capable of considering context does not consider it sufficiently. Moreover, no model that can take entrainment into account has been proposed for sentence generation methods that take the entire dialogue as input.

For example, suppose the following dialogue takes place between a user and a system.
User: Hello.
System: Hello.
User: Nice weather, isn't it?
System: Yes, I'm glad it's sunny.
User: When it's sunny, I feel like going camping or on a picnic.
System: Do you like camping?
User: Yeah, I like it. Do you like camping?
An example of a system response to the above dialogue that takes entrainment into account is "Yeah, I like camping too." This response is in tune with the user's way of speaking. By contrast, an example of a response from a current system that cannot take entrainment into account is "I like camping." (phrased in a formal register in the original Japanese). The content is the same, but as dialogue it is a plain and unnatural response.

An object of the present invention is to generate, based on the degree of entrainment, a response sentence to a dialogue history consisting of a plurality of past utterances.

The response sentence generation device of the present invention includes a recording unit and a response generation unit. The recording unit records a response generation model that takes a dialogue history as input and outputs a response sentence. The response generation model is a model trained by reinforcement learning using an expected reward value based on the degree of entrainment between the dialogue history and the response sentence. The response generation unit takes the dialogue history as input and, using the response generation model, outputs a response sentence to the dialogue history.

The reinforcement learning device of the present invention performs reinforcement learning on a response generation model that outputs a response sentence to an input dialogue history. The reinforcement learning device includes a reward calculation model unit and a parameter update unit. The reward calculation model unit takes as input at least the other party's dialogue history, the response sentence generated for that dialogue history, and a reference response for that dialogue history; it calculates an expected reward value based on the degree of entrainment between the dialogue history and the response sentence, and outputs that expected reward value. The parameter update unit takes the response generation model and the expected reward value as input, updates the parameters of the response generation model using the expected reward value, and outputs the updated parameters.

According to the response sentence generation device of the present invention, a response sentence to a dialogue history consisting of a plurality of past utterances can be generated based on the degree of entrainment. In addition, according to the reinforcement learning device of the present invention, the response generation model used by the response sentence generation device of the present invention can be trained by reinforcement learning.

FIG. 1 is a diagram showing a configuration example of the response sentence generation device and the reinforcement learning device.
FIG. 2 is a diagram showing the processing flow of response sentence generation.
FIG. 3 is a diagram showing the processing flow of reinforcement learning.
FIG. 4 is a diagram showing the number of dialogues and utterances in the ConvAI2 dataset.
FIG. 5 is a diagram showing the response sentence generation results obtained with each response generation model.
FIG. 6 is a diagram showing an example of the functional configuration of a computer.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference number, and duplicate explanations are omitted.

FIG. 1 shows a configuration example of the response sentence generation device and the reinforcement learning device. FIG. 2 shows the processing flow of response sentence generation, and FIG. 3 shows the processing flow of reinforcement learning. The response sentence generation device 100 includes a recording unit 120 and a response generation unit 110. The recording unit 120 records a response generation model that takes a dialogue history as input and outputs a response sentence. The response generation model is a model trained by reinforcement learning using an expected reward value based on the degree of entrainment between the dialogue history and the response sentence. The response generation unit 110 takes the dialogue history as input (S101), generates a response sentence to the dialogue history using the response generation model (S110), and outputs it (S102). Note that the "dialogue history" consists of a plurality of utterances and contains no other information, such as the degree of entrainment.

The reinforcement learning device 200 performs reinforcement learning on a response generation model that outputs a response sentence to an input dialogue history. The reinforcement learning device 200 includes a reward calculation model unit 210 and a parameter update unit 220. In the reinforcement learning processing flow, a dialogue history is input to the response sentence generation device 100, and the pair of that dialogue history and a reference response is input to the reinforcement learning device 200 as training data (S201). The response generation unit 110 generates a response sentence to the dialogue history using the response generation model (S110). The reward calculation model unit 210 takes as input at least the other party's dialogue history, the response sentence generated for that dialogue history, and the reference response for that dialogue history; it calculates an expected reward value based on the degree of entrainment between the dialogue history and the response sentence, and outputs that expected reward value (S210). Here, "other" means a party who is not the one about to speak (the one for whom a response sentence is being generated), and "self" means the party who is about to speak (the one for whom a response sentence is being generated). For example, in a dialogue between user A and user B, while an utterance (response sentence) of user A is being generated, the self is user A and the other is user B. Conversely, while an utterance (response sentence) of user B is being generated, the self is user B and the other is user A. In this way, "self" and "other" are determined by whose response sentence is being generated. The parameter update unit 220 takes the response generation model and the expected reward value as input, updates the parameters of the response generation model using the expected reward value, and outputs the updated parameters (S221). These steps are repeated (S202). The processing of each component is described in detail below.

<Response generation unit 110, response generation model>
Given a dialogue history $H = \{H_{i-1}, H_{i-2}, \ldots, H_{i-N}\}$, the response generation unit 110 generates a response sentence $R_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,t}\}$. Here, $i$ is the dialogue turn, $N$ is the history length, and $t$ is the word position; $H_{i-N}$ is the $(i-N)$-th utterance, and $w_{i,t}$ is the $t$-th word of the $i$-th utterance (the response sentence being processed). The response generation unit 110 may use a model consisting of a hierarchical encoder, which encodes the input dialogue history into a fixed-length context representation, and a decoder, which generates the utterance (response sentence) from the context representation received from the hierarchical encoder. This technique is described in, for example, Reference 1 (Serban, I. V., Sordoni, A., Bengio, Y., Courville, A. C., and Pineau, J.: Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models., in AAAI, pp. 3776-3784 (2016).). To generate response sentences that reflect the dialogue history more fully, the response generation unit 110 may additionally apply an attention mechanism over the dialogue history to the model of Reference 1. In other words, the response generation unit 110 may use a hierarchical encoder-decoder with an attention mechanism. A GRU (Gated Recurrent Unit) may be used as the RNN (Recurrent Neural Network) cell.

In the hierarchical encoder, an utterance encoder encodes each utterance $H_i$ in the dialogue history into a fixed-length utterance representation $h_i$. Here, $h_i$ is the vector $u_{i,t}$ finally obtained by recursively applying the following equation to each word $w_{i,t}$ of $H_i$.
$$u_{i,t} = \mathrm{GRU}(u_{i,t-1}, \mathrm{Embedding}(w_{i,t}))$$
Here, Embedding is a linear transformation function that maps the word $w_{i,t}$ to a fixed-length dense vector. Then, in the context encoder, the context representation $c_i$ of the dialogue history is obtained by recursively applying the following equation to the utterance representations $h_i$ obtained from the utterance encoder.
$$c_i = \mathrm{GRU}(c_{i-1}, h_i)$$
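The two-level recursion above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the model in the patent: `toy_cell` is a hypothetical stand-in for a trained GRU cell, and `toy_embedding` is a hypothetical deterministic embedding used so the example runs without any learned parameters.

```python
def toy_cell(state, x):
    # Stand-in for a GRU cell (assumption): blends the previous state with the input.
    return [0.5 * s + 0.5 * v for s, v in zip(state, x)]


def toy_embedding(word, dim=4):
    # Hypothetical deterministic embedding derived from character codes.
    total = sum(ord(ch) for ch in word)
    return [((total * (k + 1)) % 7) / 7.0 for k in range(dim)]


def encode_utterance(words, cell=toy_cell, embed=toy_embedding, dim=4):
    # u_{i,t} = cell(u_{i,t-1}, Embedding(w_{i,t})); h_i is the final u_{i,t}.
    state = [0.0] * dim
    for word in words:
        state = cell(state, embed(word, dim))
    return state


def encode_history(utterances, cell=toy_cell, dim=4):
    # c_i = cell(c_{i-1}, h_i), applied to the utterances in chronological order.
    context = [0.0] * dim
    contexts = []
    for words in utterances:
        context = cell(context, encode_utterance(words, cell=cell, dim=dim))
        contexts.append(context)
    return contexts
```

Calling `encode_history([["hello"], ["nice", "weather"]])` returns one context vector per utterance; the last one plays the role of the context representation handed to the decoder.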

In the decoder, the context representation $c_{i-1}$ of the dialogue history $H$ obtained from the hierarchical encoder is used as the initial state $h'_0$, and the decoder's intermediate state $h'_t$ and the word generation probability $p_t$ are obtained as follows.
$$h'_t = \mathrm{GRU}(h'_{t-1}, \mathrm{Embedding}(w_{i,t-1}))$$
$$p_t = \mathrm{softmax}(\mathrm{Linear}(h'_t))$$
Here, Linear is a linear transformation function that maps $h'_t$ to a dense vector whose dimension equals the vocabulary size. The word $w_{i,t}$ is sampled from $p_t$ and used as the input to the next step.
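The sample-and-feed-back loop of the decoder can be sketched as follows. The function `step_fn` is a hypothetical placeholder for "GRU step followed by Linear" and is supplied by the caller; the start and end markers `<s>` and `</s>` and the `max_len` cutoff are assumptions that do not appear in the source.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of scores.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(step_fn, init_state, vocab, rng, max_len=20, bos="<s>", eos="</s>"):
    # step_fn(state, previous_word) -> (new_state, logits over vocab); hypothetical.
    state, prev, words = init_state, bos, []
    for _ in range(max_len):
        state, logits = step_fn(state, prev)
        probs = softmax(logits)                           # p_t
        prev = rng.choices(vocab, weights=probs, k=1)[0]  # w_{i,t} sampled from p_t
        if prev == eos:
            break
        words.append(prev)                                # fed back at the next step
    return words
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.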

Because the response generation unit 110 deals with entrainment, which is a conversational phenomenon, it needs a response generation model that reflects the dialogue history more fully. An attention mechanism may therefore be introduced into the decoder described above so that the information in each utterance of the dialogue history is handled more efficiently. Specifically, letting $c_{i-1-N:i-1}$ be the sequence of context vectors obtained by the context encoder and $h'_t$ be the intermediate state of the decoder at step $t$, alignment weights are computed for each intermediate state and the context vector $\bar{h}$ is obtained as follows. [Equation not reproduced in this text.] (In the original text, typographic limitations cause $\bar{h}$ to be written as "h^-" and $\hat{h}_t$ as "h_t^"; in both cases the mark is placed above the "h".)

The response generation unit 110 may then use the context vector $\bar{h}$ to predict the output word at step $t$ as follows.
$$\hat{h}_t = \tanh(\mathrm{Linear}([\bar{h}, h'_t]))$$
$$p_t = \mathrm{softmax}(\mathrm{Linear}(\hat{h}_t))$$
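The equation for the alignment weights does not survive in this text, so the following is only a sketch under a stated assumption: the score between a context vector and the decoder state is taken to be their dot product, a common choice for attention. The functions `attend` and `attentional_state`, and the explicit `weight_matrix` and `bias` arguments standing in for Linear, are hypothetical names introduced for illustration.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(contexts, dec_state):
    # Alignment scores: dot product between each context vector and h'_t (assumption).
    scores = [sum(c_k * s_k for c_k, s_k in zip(c, dec_state)) for c in contexts]
    weights = softmax(scores)
    # Context vector h_bar: attention-weighted sum of the context vectors.
    h_bar = [sum(w * c[k] for w, c in zip(weights, contexts))
             for k in range(len(dec_state))]
    return weights, h_bar


def attentional_state(h_bar, dec_state, weight_matrix, bias):
    # h_hat_t = tanh(Linear([h_bar, h'_t])), with Linear given as explicit weights.
    joined = h_bar + dec_state
    return [math.tanh(sum(w * x for w, x in zip(row, joined)) + b)
            for row, b in zip(weight_matrix, bias)]
```

The word distribution would then be a softmax over a second linear map of the returned vector, as in the two equations above.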

<Reward calculation model unit 210>
The reward calculation model unit 210 calculates the expected reward value used by the parameter update unit 220 (S210). To that end, a reward calculation model is defined for evaluating the degree of entrainment of a generated response sentence. As a simple example, the similarity between the other party's immediately preceding utterance $H_{i-1}$ and the generated response sentence $R_i$, measured with WDM (Word Mover's Distance, more commonly abbreviated WMD), may be defined as the reward $r_{previous}$. The details of WDM are given in, for example, Reference 2 (Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K.: From word embeddings to document distances, in International conference on machine learning, pp. 957-966 (2015).). However, the WDM of Reference 2 is an unnormalized similarity index, so using it directly as a reward in reinforcement learning is undesirable. Therefore, $\mathrm{WDM}_{norm}$, which normalizes WDM to the range 0 to 1, is defined as below and used to obtain the reward $r_{previous}$; the right-hand side of the first equation is $e$ raised to the power $-\mathrm{WDM}(H_{i-1}, R_i)^2$.
$$\mathrm{WDM}_{norm}(H_{i-1}, R_i) = e^{-\mathrm{WDM}(H_{i-1}, R_i)^2}$$
$$r_{previous}(H_{i-1}, R_i) = \mathrm{WDM}_{norm}(H_{i-1}, R_i)$$
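The normalization is easy to check numerically: a distance of 0 maps to a reward of 1, and larger distances decay toward 0. The sketch below implements only the normalization; `toy_distance` is a hypothetical stand-in based on word overlap and is not Word Mover's Distance, which would require word embeddings and an optimal-transport solver.

```python
import math


def wdm_norm(distance):
    # Maps a non-negative distance to (0, 1]: exp(-distance^2).
    return math.exp(-(distance ** 2))


def toy_distance(utt_a, utt_b):
    # Stand-in distance (assumption): 1 - Jaccard overlap of the word sets.
    # This is NOT Word Mover's Distance; it only makes the example runnable.
    a, b = set(utt_a.split()), set(utt_b.split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def r_previous(prev_utterance, response, distance_fn=toy_distance):
    # Reward for similarity to the other party's immediately preceding utterance.
    return wdm_norm(distance_fn(prev_utterance, response))
```

Any real distance function with the same two-argument signature can be passed as `distance_fn`.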

Furthermore, entrainment does not necessarily occur only with respect to the other party's immediately preceding utterance; it is preferable to determine, from the other party's utterance history and according to the context, which utterance should be the target of entrainment. The reward calculation model unit 210 may therefore also take information about an ideal degree of entrainment as input and calculate the expected reward value based on the value relative to that ideal degree of entrainment. To this end, $r_{LID}$, which indicates the degree of entrainment with respect to the other party's utterances, is defined as follows. [Equation not reproduced in this text.]

Here, $R_i^{ref}$ is the reference response sentence corresponding to the dialogue history $H$ (the correct response sentence contained in the training data), and $H^{other}$ is the dialogue history with all of the self's utterances removed (in other words, the other party's dialogue history). $U_{entrained}$ denotes the utterance in the given other-party dialogue history $H^{other}$ that is most similar to the reference response sentence. As noted above, "self" and "other" are determined by whose response sentence is being generated.
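The selection of $U_{entrained}$ described above amounts to filtering out the self's utterances and taking an argmax over what remains. The following sketch assumes the history is a list of `(speaker, utterance)` pairs and uses a hypothetical word-overlap similarity in place of the normalized WDM; the function names are illustrative only.

```python
def toy_similarity(utt_a, utt_b):
    # Stand-in similarity (assumption): Jaccard overlap of the word sets.
    a, b = set(utt_a.split()), set(utt_b.split())
    return len(a & b) / len(a | b) if (a or b) else 1.0


def other_history(history, self_speaker):
    # H^other: the dialogue history with all of the self's utterances removed.
    return [utt for speaker, utt in history if speaker != self_speaker]


def select_entrained(history, self_speaker, reference, similarity=toy_similarity):
    # U_entrained: the other party's utterance most similar to the reference response.
    others = other_history(history, self_speaker)
    return max(others, key=lambda utt: similarity(utt, reference))
```

With the camping dialogue from the background section and the reference "yeah i like camping too", the selected utterance is the user's last turn.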

With the degree of entrainment $r_{LID}$, the reward is given in terms of the value relative to the ideal entrainment value $r_{ideal}$, so that the generated response sentence resembles the entrainment target utterance computed from the reference response sentence. This makes it possible to train a response generation model that indirectly maximizes LID (Local Interpersonal Distance). LID is one of the entrainment evaluation metrics and is described in, for example, Reference 3 (Nasir, M., Chakravarthula, S. N., Baucom, B. R., Atkins, D. C., Georgiou, P., and Narayanan, S.: Modeling Interpersonal Linguistic Coordination in Conversations Using Word Mover's Distance, in Proc. Interspeech 2019, pp. 1423-1427 (2019).). To obtain the ideal entrainment value $r_{ideal}$, the reward calculation model unit 210 calculates the actual degree of entrainment of every response example in the training data using the following equation, and the model is trained with the value at the top A% taken as the ideal degree of entrainment $r_{ideal}$. [Equation not reproduced in this text.]

A% is, for example, 90%, 70%, or 50%, and may be chosen according to how much entrainment the response sentences should exhibit, with a larger percentage for a larger degree of entrainment. The expected reward value may then be obtained by Monte Carlo Tree Search using rollouts from each step $t$. Monte Carlo Tree Search is described concretely in, for example, Reference 4 (Yu, L., Zhang, W., Wang, J., and Yu, Y.: SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient., in AAAI, pp. 2852-2858 (2017).).
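Two pieces of this paragraph can be sketched in code. Both are interpretations rather than the patent's procedure: `ideal_entrainment` reads "the value at the top A%" as the A-th percentile of the training-set entrainment values (consistent with a larger A giving a larger target), and `expected_reward` is plain Monte Carlo averaging over rollouts that complete a partial response, not a full tree search. The callables `rollout_fn` and `reward_fn` are hypothetical and supplied by the caller.

```python
import random


def ideal_entrainment(values, a_percent):
    # Assumed reading: the A-th percentile of the observed entrainment values.
    ordered = sorted(values)
    index = min(len(ordered) - 1, (len(ordered) * a_percent) // 100)
    return ordered[index]


def expected_reward(prefix, rollout_fn, reward_fn, n_rollouts, rng):
    # Average reward over n_rollouts completions of the partial response `prefix`.
    # rollout_fn(prefix, rng) -> complete word list; reward_fn(words) -> float.
    total = 0.0
    for _ in range(n_rollouts):
        total += reward_fn(rollout_fn(prefix, rng))
    return total / n_rollouts
```

In a training loop, `expected_reward` would be called once per decoding step with the words generated so far as `prefix`.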

<Parameter update unit 220>
The parameter update unit 220 may use, for example, the REINFORCE algorithm, which is a type of policy-gradient reinforcement learning. The REINFORCE algorithm is described in, for example, Reference 5 (Williams, R. J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine learning, Vol. 8, No. 3-4, pp. 229-256 (1992).).

In response sentence generation by the response generation unit 110, when the context representation of the dialogue history $H = \{H_{i-1}, H_{i-2}, \ldots, H_{i-N}\}$ is input, a response sentence $R_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,t}\}$, which is a word sequence, is generated. This generation process can be regarded as a sequence of actions executed according to a policy in a Markov decision process. The parameter update unit 220 may use the expected reward value output by the reward calculation model unit 210 and update the parameters of the response generation model with the REINFORCE algorithm so that the expected reward value increases.

Let $G_\theta$ be the response generation model with parameters $\theta$, and let $p$ be the probability of generating the word $w_t$. The objective function $J_{REINFORCE}$ and its gradient may be defined as in the following equations. [Equations not reproduced in this text.]

In step S221, the parameter update unit 220 may update the parameters $\theta$ of the response generation model $G_\theta$ based on the objective function $J_{REINFORCE}$ and its gradient.
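As an illustration of a REINFORCE-style update, the toy below applies one gradient-ascent step to the logits of a single softmax word distribution. It is a sketch under strong simplifications: the "parameters" are the logits themselves rather than network weights, and `reward` stands for the expected reward value supplied for the sampled word. It relies on the standard identity that the gradient of log p(action) with respect to logit k is 1[k = action] - p_k.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    # Gradient ascent on reward * log p(action):
    # d/d logit_k of log p(action) = 1[k == action] - p_k.
    probs = softmax(logits)
    return [z + lr * reward * ((1.0 if k == action else 0.0) - p)
            for k, (z, p) in enumerate(zip(logits, probs))]
```

A positive reward raises the probability of the sampled word and a zero reward leaves the distribution unchanged, which is the behavior the update in step S221 relies on.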

Note that if the parameters $\theta$ are updated based only on the objective function $J_{REINFORCE}$ and its gradient, there is a risk that the response generation model will collapse because the degree of entrainment is weighted too heavily. Where that risk exists, the parameter update unit 220 may also use a loss function when updating $\theta$. Updating with a loss function is a conventional technique: $\theta$ is updated so that the negative log-likelihood $J_{MLE}(\theta)$ between the word prediction at each decoder step and the correct word becomes small (S222). To keep the influence of the log-likelihood $J_{MLE}(\theta)$ from becoming dominant, it may be multiplied by a coefficient $\lambda$ in the update; $\lambda$ may be set to, for example, 0.1. In this way, the update based on entrainment and the update based on the loss function may be combined with appropriate weighting.
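One way to read this weighting is as a single scalar to minimize: the negated expected reward plus $\lambda$ times the negative log-likelihood. The sketch below assumes that combination; the function names and the representation of per-step word probabilities as nested lists are hypothetical.

```python
import math


def negative_log_likelihood(step_probs, target_ids):
    # J_MLE: sum over decoder steps of -log p_t(correct word).
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))


def combined_loss(expected_reward, step_probs, target_ids, lam=0.1):
    # Assumed combination: maximize the reward term, minimize lam * J_MLE.
    return -expected_reward + lam * negative_log_likelihood(step_probs, target_ids)
```

With `lam=0.1`, a unit change in the reward term moves the loss ten times as much as a unit change in the likelihood term, which keeps the likelihood from dominating while still anchoring the model to the reference responses.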

According to the response sentence generation device 100, a response sentence to a dialogue history consisting of a plurality of past utterances can be generated based on the degree of entrainment without degrading the quality of the response sentence. In addition, according to the reinforcement learning device 200, the response generation model used by the response sentence generation device 100 can be trained by reinforcement learning to reflect the degree of entrainment while preserving the quality of the response sentences.

<Experiment>
The PersonaChat dataset provided in Reference 6 (ConvAI2, [retrieved April 27, 2020], Internet <http://convai.io/>.) was used to train and evaluate the response generation models. FIG. 4 shows the number of dialogues and utterances in the ConvAI2 dataset. Because the evaluation dataset was not publicly available, the development data was split and used as a development dataset and an evaluation dataset. FIG. 5 shows the response sentence generation results obtained with each response generation model.

MLE denotes the case in which a response generation model trained by minimizing the negative log-likelihood was used. A response generation model marked REINFORCE in the training method column is one trained by reinforcement learning based on entrainment. Four response generation models were used in the evaluation experiment: a Sequence-to-Sequence model (SEQ2SEQ); a model that adds to SEQ2SEQ an attention mechanism with inner-product score calculation (Attention-SEQ2SEQ); the hierarchical encoder-decoder model described above (HED); and the hierarchical encoder-decoder model with an attention mechanism (Attention-HED). The attention mechanism introduced in Attention-SEQ2SEQ is described in Reference 7 (Luong, M.-T., Pham, H., and Manning, C. D.: Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).). The row whose model is Human shows the evaluation for the case in which a human gave the response.

All response generation models used a GRU as the RNN, with the word embedding dimension set to 300, the hidden layer dimension set to 300, and the number of layers set to 1. The vocabulary size was 15,000, and unknown words were replaced with the special symbol "UNK". The response generation models were pre-trained with the cross-entropy loss. After that, the following response generation models were produced, differing in the reward used by the reward calculation model unit 210: a model trained by reinforcement learning with $r_{previous}$ (hereinafter "$r_{previous}$"); a model trained with $r_{LID}$ where the value at the top 90% was taken as the ideal degree of entrainment $r_{ideal}$ (hereinafter "$r_{LID}^{90\%}$"); a model trained with $r_{LID}$ where the value at the top 70% was taken as $r_{ideal}$ (hereinafter "$r_{LID}^{70\%}$"); and a model trained with $r_{LID}$ where the value at the top 50% was taken as $r_{ideal}$ (hereinafter "$r_{LID}^{50\%}$"). In training the response generation models, the batch size was 64, the learning rate was $1 \times 10^{-4}$, and SGD (Stochastic Gradient Descent) was used as the optimizer.
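For reference, the hyperparameters stated in this paragraph can be collected in a single configuration object. The class and field names below are hypothetical; only the values come from the text, and `batch_size` assumes that the reported size of 64 refers to the mini-batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values are those reported in the experiment; field names are illustrative.
    rnn_cell: str = "GRU"
    embedding_dim: int = 300
    hidden_dim: int = 300
    num_layers: int = 1
    vocab_size: int = 15000
    unknown_token: str = "UNK"
    batch_size: int = 64  # assumed to be the mini-batch size
    learning_rate: float = 1e-4
    optimizer: str = "SGD"


# The three r_LID variants differ only in the percentile used for r_ideal.
R_IDEAL_PERCENTILES = (90, 70, 50)
```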

As in conventional work, response sentence generation was evaluated using perplexity (PPL), an index for measuring the performance of a language model, and using the similarity (relevance) between the generated response and the reference utterance. Relevance was evaluated using BLEU and the average of the normalized WDM_norm (×100). The column labeled "WDM^-" in FIG. 5 shows the evaluation results using the normalized WDM_norm. For PPL, a smaller value is better; for BLEU and WDM^-, a larger value is better. In addition, to evaluate response sentence generation with a focus on entrainment, the average (×100) of the reward values that each reward calculation model gives to the response utterances was used. The columns labeled "r^-_previous", "r^-_LID^90%", "r^-_LID^70%", and "r^-_LID^50%" in FIG. 5 show the evaluation results obtained with the respective reward calculation models. These results have a maximum value of 100, and a larger value is better. The notation "WDM^-" means "WDM" with a bar "-" placed above it, and the notation "r^-" means "r" with a bar "-" placed above it. In the calculation of WDM, 200-dimensional distributed word representation vectors pre-trained on TWITTER (registered trademark) data were normalized so that their norm is 1. These vectors are described in Reference 8 (GloVe: Global Vectors for Word Representation, [retrieved April 27, 2020], Internet <https://nlp.stanford.edu/projects/glove/>).
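Two of the quantities mentioned above can be sketched in a few lines: perplexity as the exponential of the average negative log-likelihood, and the normalization of a word vector to norm 1. The function names and the input format (one model probability per reference token) are assumptions for illustration, and the Euclidean norm is assumed because the text does not name the norm.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities the model assigns to each reference token."""
    # Average negative log-likelihood per token, then exponentiate.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def normalize_to_unit_norm(vector):
    """Scale a vector so that its Euclidean norm is 1 (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical inputs chosen only to show the calls.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
unit = normalize_to_unit_norm([3.0, 4.0])
```

A model that assigns probability 0.25 to every reference token is as uncertain as a uniform choice among four words, which is why a lower perplexity indicates a better language model.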

The results in FIG. 5 show that MLE, which corresponds to a response generation model using the conventional loss function, is far inferior to human responses in the entrainment-based evaluation results ("r^-_previous", "r^-_LID^90%", "r^-_LID^70%", "r^-_LID^50%"). This suggests that existing response generation models not only lack the ability to entrain but also fail to make effective use of the information in the dialogue history. In contrast, the response generation models to which reinforcement learning was applied using an expected reward value based on the degree of entrainment have nearly the same PPL as MLE, while their entrainment-based evaluation results ("r^-_previous", "r^-_LID^90%", "r^-_LID^70%", "r^-_LID^50%") are greatly improved. In particular, the evaluations enclosed by the thick line in FIG. 5 show that reinforcement learning succeeded in reaching the specified ideal entrainment value r_ideal. As for the relevance between the generated response sentence and the reference response sentence, applying reinforcement learning tends to lower BLEU but greatly improves WDM^-. Therefore, the evaluations by PPL, BLEU, and WDM^- show that the response generation model can be reinforcement-learned based on the degree of entrainment without impairing the quality of the response sentences.
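The REINFORCE-style training referred to in this evaluation can be illustrated with a toy policy. In the sketch below, the policy is a softmax over a fixed list of candidate responses rather than a neural response generation model, and the reward values are hypothetical stand-ins for the entrainment-based reward; all names and numbers are assumptions for illustration and do not reproduce the experimental setup.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def reinforce_update(logits, action, reward, learning_rate):
    """One REINFORCE step: logits += lr * reward * grad(log pi(action))."""
    probs = softmax(logits)
    # For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - pi_k.
    return [
        logit + learning_rate * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


def train(rewards, steps, learning_rate, seed=0):
    """Sample a candidate, observe its reward, and update the policy."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        logits = reinforce_update(logits, action, rewards[action], learning_rate)
    return softmax(logits)


# Hypothetical rewards for three candidate responses.
before = softmax([0.0, 0.0, 0.0])
after = softmax(reinforce_update([0.0, 0.0, 0.0], 1, 0.9, 0.5))
final_probs = train([0.1, 0.9, 0.3], steps=50, learning_rate=0.1)
```

A single update with a positive reward raises the probability of the sampled candidate, which is the mechanism by which responses with a higher reward become more likely over training.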

[Program, recording medium]
The various processes described above can be carried out by loading a program that executes each step of the above method into the recording unit 2020 of the computer 2000 shown in FIG. 6, and causing the control unit 2010, the input unit 2030, the output unit 2040, the display unit 2050, and so on to operate.

A program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any type, such as a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute the process according to that program, or it may execute the process according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized only by issuing execution instructions and obtaining results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the present device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

100 Response sentence generation device
110 Response generation unit
120 Recording unit
200 Reinforcement learning device
210 Reward calculation model unit
220 Parameter updating unit

Claims

A response sentence generation device comprising:
a recording unit that records a response generation model for outputting a response sentence with a dialogue history as input; and
a response generation unit that receives a dialogue history as input and outputs a response sentence to the dialogue history using the response generation model,
wherein the response generation model is a model subjected to reinforcement learning using an expected reward value based on a degree of entrainment between a dialogue history and a response sentence.

The response sentence generation device according to claim 1,
wherein the response generation unit uses a hierarchical encoder-decoder with an attention mechanism.

A reinforcement learning device for performing reinforcement learning on a response generation model for outputting a response sentence to an input dialogue history, the reinforcement learning device comprising:
a reward calculation model unit that receives as input at least a dialogue history of the other party, a response sentence generated for the dialogue history, and a reference response for the dialogue history, calculates an expected reward value based on the degree of entrainment between the dialogue history and the response sentence, and outputs the expected reward value; and
a parameter updating unit that receives the response generation model and the expected reward value as input, updates parameters of the response generation model using the expected reward value, and outputs the updated parameters.

The reinforcement learning device according to claim 3,
wherein the parameter updating unit updates the parameters also using a loss function.

The reinforcement learning device according to claim 3 or 4,
wherein the reward calculation model unit also receives information regarding an ideal degree of entrainment and calculates the expected reward value based on a value relative to the ideal degree of entrainment.

A response sentence generation method using a response sentence generation device that records a response generation model for outputting a response sentence with a dialogue history as input, the method comprising:
a step of inputting a dialogue history;
a response generation step of generating a response sentence corresponding to the dialogue history using the response generation model; and
a response sentence output step of outputting the response sentence,
wherein the response generation model is a model subjected to reinforcement learning using an expected reward value based on the degree of entrainment between the dialogue history and the response sentence.

A model generation method that performs reinforcement learning on a response generation model for outputting a response sentence to an input dialogue history and generates a reinforcement-learned response generation model, the method comprising:
a reward calculation step of receiving as input at least a dialogue history of the other party, a response sentence generated for the dialogue history, and a reference response for the dialogue history, calculating an expected reward value based on the degree of entrainment between the dialogue history and the response sentence, and outputting the expected reward value; and
a parameter updating step of receiving the response generation model and the expected reward value as input, updating parameters of the response generation model using the expected reward value, and outputting the updated parameters.

A program for causing a computer to function as the response sentence generation device according to claim 1 or 2, or the reinforcement learning device according to any one of claims 3 to 5.