JP7377900B2

JP7377900B2 - Dialogue text generation device, dialogue text generation method, and program

Info

Publication number: JP7377900B2
Application number: JP2022040286A
Authority: JP
Inventors: 徳章川前
Original assignee: NTT Comware Corp
Current assignee: NTT Comware Corp
Priority date: 2022-03-15
Filing date: 2022-03-15
Publication date: 2023-11-10
Anticipated expiration: 2042-03-15
Also published as: JP2023135199A

Description

The present invention relates to a dialogue text generation device, a dialogue text generation method, and a program.

Automatic text generation has emerged in natural language processing, one of the application fields of AI. Text generated automatically by recent AI is of such quality that it is difficult to distinguish from text written by humans. As one application, the technology is increasingly being applied to utterance generation that enables dialogue with humans. Conventional models are capable of general conversation, but applying them to a business or service requires customizing (retraining) the model for each domain.

Sergey Golovanov, Rauf Kurbanov, Sergey Nikolenko, Kyryl Truskovskyi, Alexander Tselousov, and Thomas Wolf, "Large-Scale Transfer Learning for Natural Language Generation", In ACL, 2019, 6053-6058
Yu Cao, Wei Bi, Meng Fang, and Dacheng Tao, "Pretrained Language Models for Dialogue Generation with Multiple Input Sources", In EMNLP, 2020, 909-917

However, retraining has several problems: it can lower the accuracy of responses to general utterances, the generated utterances can break down as language, and it is difficult to strike a balance with the model before retraining. In addition, it is difficult to reflect the context of an utterance (the attributes of the speaker and the content of the utterances so far) through retraining.

The present invention has been made in view of the above, and its object is to provide a dialogue text generation technique that combines the model before retraining and the model after retraining in a well-balanced manner and generates utterances that reflect the context of the utterance.

A dialogue text generation device according to one aspect of the present invention generates a response text to an utterance and includes a learning unit and a generation unit. The learning unit receives utterance texts to which speaker attributes are attached and extracts words from a previous utterance text and a next utterance text. It then inputs, to a Transformer incorporating a multi-view attention mechanism that controls access to other tokens for each token type, the words of the previous utterance text and the attributes as conditions and the words of the next utterance text as the output result. It trains the Transformer so as to minimize an objective function representing the prediction accuracy of the attributes when some of the attributes are removed and an objective function representing the degree of matching between the previous utterance text and the next utterance text. The generation unit inputs the immediately preceding utterance text and the speaker's attributes to the trained Transformer and concatenates the words output recursively by the Transformer to generate a response text that is the speaker's next utterance. The multi-view attention mechanism has a self-attention mask that allows attribute tokens to access all tokens and allows each word token of the next utterance text to access all attribute tokens and the tokens of words that appeared before that word. When the Transformer computes attention, the attention for word tokens of the next utterance text is set to an infinitely small value so that the tokens of subsequent words are not referenced.

According to the present invention, it is possible to provide a dialogue text generation technique that combines the model before retraining and the model after retraining in a well-balanced manner and generates utterances that reflect the context of the utterance.

FIG. 1 is a diagram showing an example of the configuration of a dialogue system according to this embodiment.
FIG. 2 is a diagram showing an example of the deep learning model proposed in this embodiment.
FIG. 3 is a diagram showing an example of a self-attention mask.
FIG. 4 is a flowchart showing an example of the flow of learning processing.
FIG. 5 is a diagram showing an example of learning data.
FIG. 6 is a flowchart showing an example of the flow of dialogue generation processing.

[System configuration]
Embodiments of the present invention will be described below with reference to the drawings.

The dialogue system of this embodiment treats utterance generation as conditional text generation conditioned on the context of the utterance (the utterance history and the attributes of the speaker). Given, as context, utterance text that includes the immediately preceding utterance and the attributes of the speaker of the utterance text to be generated, the system generates the next utterance text according to those conditions. For example, if "May I help you?" is input as the immediately preceding utterance text and "Speaker A" is input as the speaker attribute, the system generates the next utterance text by Speaker A, such as "Yes, please." If the same utterance text is given but the speaker attribute is changed to "Speaker B", the system generates a different utterance text according to the speaker attribute, such as "No, thank you." Speaker attributes are, for example, the speaker's identifier, age, occupation, or location. A plurality of speaker attributes may be input.

FIG. 1 is a diagram showing an example of the configuration of the dialogue system according to this embodiment. The dialogue system 1 shown in FIG. 1 includes a learning unit 10, a generation unit 20, a data storage unit 30, a calculation result storage unit 40, and an input/output unit 50. Each unit of the dialogue system 1 may be implemented by a computer equipped with an arithmetic processing unit, a storage device, and the like, with the processing of each unit executed by a program. This program is stored in a storage device included in the dialogue system 1, and it can be recorded on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or provided over a network.

The learning unit 10 receives, as learning data, a series of utterance texts in which each utterance is annotated with the attributes of its speaker, and splits two consecutive utterance texts into words. It inputs the words of the previous utterance text and the speaker attributes as conditions, trains the model so that attributes can be placed in the same semantic space as words, and trains the model to generate the next utterance text corresponding to the previous utterance text.

The generation unit 20 inputs the immediately preceding utterance text and the attributes of the speaker into the trained model, and generates the utterance text that will be the speaker's next utterance.

The data storage unit 30 holds learning data including a series of utterance texts. Each utterance is annotated with the attributes of its speaker.

The calculation result storage unit 40 holds calculation results such as the deep learning model that generates text, the parameters of the deep learning model obtained through training, and the distributed vectors (distributed embedding representations) of attributes and words.

The input/output unit 50 receives attributes and an utterance from the user terminal 5 and sends them to the generation unit 20, then receives the generated utterance text from the generation unit 20 and returns it to the user terminal 5 as a response.

[Proposed model]
The model proposed in this embodiment will be described with reference to FIGS. 2 and 3. The proposed model shown in FIG. 2 is a deep learning model in which a multi-view attention mechanism is introduced into a Transformer. The proposed model uses utterance text that includes the immediately preceding utterance as a condition: when the utterance text (HISTORY) and the speaker's attributes (ATTRIBUTE) are input, the model autoregressively generates the next utterance text (UTTERANCE) by a speaker with the input attributes. The Transformer is a deep learning model used mainly in natural language processing. When a Transformer is used as a left-to-right language model, it probabilistically predicts the word that follows the input words, and text can be generated by repeating this recursively. The multi-view attention mechanism is a scheme that, when computing the degree of association (attention) between words, between attributes, or between words and attributes, lets attributes be referenced bidirectionally while letting words reference only attributes and words that appeared earlier in the text. Because the multi-view attention mechanism has different self-attention masks for controlling access to the context for each kind of token (attribute or word), the Transformer can share parameters between the attribute side and the word side.

FIG. 3 shows an example of the self-attention mask used in the multi-view attention mechanism. In the example of FIG. 3, the self-attention mask is drawn with the referencing tokens (words h, attributes a, words u) arranged vertically and the referenced tokens (words h, attributes a, words u) arranged horizontally. Here h corresponds to the word tokens of the utterance history, a corresponds to the attribute tokens, and u corresponds to the word tokens of the utterance to be generated (the next utterance text). A black circle indicates a token that can be referenced. The words h are the words that make up the immediately preceding utterance text. The words u are the words that make up the next utterance text. The words h are input only and produce no output, so they do not reference the words h, u or the attributes a. An attribute a can reference all of the words h, u and the attributes a. A word u can reference all of the words h, all of the attributes a, and only those words u that have appeared so far in the next utterance text. The Transformer incorporating the multi-view attention mechanism is described below.
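The access rules described for FIG. 3 can be sketched as a small mask builder. This is a minimal illustration, not the patent's implementation: the function name `build_mask`, the token ordering (h, then a, then u), and the choice to let a word u reference itself are assumptions made here.

```python
def build_mask(n_h, n_a, n_u):
    """Return (kinds, mask) where mask[i][j] is True if token i may reference token j.

    Assumed layout: n_h history words, then n_a attributes, then n_u utterance words.
    """
    kinds = ["h"] * n_h + ["a"] * n_a + ["u"] * n_u
    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for i, src in enumerate(kinds):
        for j, dst in enumerate(kinds):
            if src == "a":
                # Attributes may reference every token (bidirectional access).
                mask[i][j] = True
            elif src == "u":
                # Utterance words see all h, all a, and u tokens up to themselves
                # (including the token itself, which is an assumption).
                mask[i][j] = dst in ("h", "a") or j <= i
            # src == "h": input only, references nothing, so the row stays False.
    return kinds, mask


kinds, mask = build_mask(n_h=2, n_a=1, n_u=3)
for kind, row in zip(kinds, mask):
    # Filled circles mark referenceable tokens, mirroring the black circles of FIG. 3.
    print(kind, " ".join("●" if ok else "·" for ok in row))
```

Rows are the referencing tokens and columns the referenced tokens, matching the vertical and horizontal arrangement described for the figure.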

Attention is a score representing the degree of association between tokens (between attributes, between words, and between attributes and words). Each token has Q (query), K (key), and V (value) vectors. As shown in the following equation, attention is a weighted sum of V, where the weights are computed from Q and K. In this embodiment, the multi-view attention mechanism is introduced into the attention computation to control access to other tokens.
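The equation itself is not reproduced in this text. As a hedged reconstruction, the standard masked scaled dot-product form that agrees with the symbols defined in the surrounding paragraphs would read as follows. The exact form in the patent may differ; here H^{l-1} is assumed to denote the output of the previous layer, and the mask entries are assumed to be 0 for permitted access and negative infinity otherwise.

```latex
\begin{aligned}
Q &= H^{l-1} W_l^{Q}, \qquad K = H^{l-1} W_l^{K}, \qquad V = H^{l-1} W_l^{V} \\
M_{ij} &=
\begin{cases}
0 & \text{if token } i \text{ may reference token } j \\
-\infty & \text{otherwise}
\end{cases} \\
A_l &= \operatorname{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} + M \right) V
\end{aligned}
```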

W_l^Q, W_l^K, W_l^V ∈ R^{d_h × d_k} are learnable weights for computing Q, K, V ∈ R^{x × d_k}, respectively. d is the number of dimensions shared by the query and the key. M ∈ R^{x × x} is the self-attention mask. When attention is computed, a word token is given an infinitely small value for subsequent words (i < j) so that it does not reference them. H is a parameter constituting the Transformer, and is expressed by the following equation.
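As a rough illustration of how the mask M enters the attention computation, the following pure-Python sketch applies a boolean access mask inside scaled dot-product attention. It assumes the common softmax(QK^T / sqrt(d_k) + M)V form; the function name `masked_attention` and the handling of a row with no accessible token (returning zeros) are choices made for this example only.

```python
import math

NEG_INF = float("-inf")  # stands in for the "infinitely small value" of the mask


def masked_attention(Q, K, V, allowed):
    """Scaled dot-product attention where allowed[i][j] gates access of token i to token j."""
    d_k = len(Q[0])
    width = len(V[0])
    out = []
    for i, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            + (0.0 if allowed[i][j] else NEG_INF)
            for j, k in enumerate(K)
        ]
        top = max(scores)
        if top == NEG_INF:
            # No accessible token at all (e.g. an input-only row): return zeros.
            out.append([0.0] * width)
            continue
        # Numerically stable softmax; masked positions get weight exp(-inf) = 0.
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, V)) / total for c in range(width)]
        )
    return out


Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
causal = [[True, False], [True, True]]  # token 0 cannot look ahead to token 1
print(masked_attention(Q, K, V, causal))
```

Because a masked position contributes a weight of exactly zero, the first token's output depends only on itself, while the second token mixes both values.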

H_a^0 is the input to the Transformer. For each token, it combines the distributed embedding of the attribute or word (Token Embedding), the distributed embedding of the position (Positional Embedding), and the distributed embedding of the data type (Segment Embedding). H_a^l is the output of the l-th layer and the input to the next layer. In FIG. 2, [CLS] is a token indicating the beginning and [SEP] is a token indicating a separator. [SOA] is a token indicating the start of the attributes, and [EOA] is a token indicating the end of the attributes. [EOT] is a token indicating the end of the text.
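The combination of the three embeddings can be sketched as below. The text only says that the embeddings are combined; element-wise addition, as used in BERT-style models, is an assumption here, and the function name `input_representation` and the tiny lookup tables are purely illustrative.

```python
def input_representation(token_ids, segment_ids, token_emb, pos_emb, seg_emb):
    """Build one input vector per token by adding token, position, and segment embeddings.

    Element-wise addition is an assumption; the source only says the three are combined.
    """
    return [
        [t + p + s for t, p, s in zip(token_emb[tok], pos_emb[pos], seg_emb[seg])]
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]


# Hypothetical 2-dimensional lookup tables, indexed by id.
token_emb = [[1.0, 0.0], [0.0, 2.0]]
pos_emb = [[0.1, 0.1], [0.2, 0.2]]
seg_emb = [[0.0, 0.5], [0.5, 0.0]]  # e.g. segment 0 = HISTORY, segment 1 = ATTRIBUTE

H0 = input_representation([1, 0], [0, 1], token_emb, pos_emb, seg_emb)
print(H0)
```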

The proposed model introduces MAM and NUM as training tasks. MAM trains the model so that attributes can be placed in the same semantic space as words. MAM is defined by the following equation.

Here, ζ denotes the parameters to be learned. Let the attribute group in the j-th text be a_j = {a_{j,1}, ..., a_{j,i}} and the word group be w_j = {w_{j,1}, ..., w_{j,i}}. The word group includes the words of the immediately preceding utterance text (HISTORY) and the next utterance text (UTTERANCE). The notation \m (m with a backslash) indicates that the m-th attribute is masked. MAM represents the prediction accuracy of the attributes when some of the attributes are removed, and it trains the model to correctly estimate the masked attributes.
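The masking idea behind MAM can be illustrated with a short sketch. The defining equation is not reproduced in this text, so the loss below (negative log-likelihood of the masked attribute) is an assumption, as are the `[MASK]` placeholder token and the function names `mask_attribute` and `mam_loss`.

```python
import math


def mask_attribute(attributes, m, mask_token="[MASK]"):
    """Hide the m-th attribute; return the masked list and the attribute to be predicted."""
    masked = list(attributes)
    masked[m] = mask_token
    return masked, attributes[m]


def mam_loss(predicted_probs, target):
    """Negative log-likelihood of the true attribute (assumed form of the MAM objective)."""
    return -math.log(predicted_probs[target])


masked, target = mask_attribute(["speaker_A", "age_30s"], m=0)
# Hypothetical model output: a probability for each candidate value of the masked attribute.
probs = {"speaker_A": 0.5, "speaker_B": 0.5}
print(masked, target, mam_loss(probs, target))
```

A lower loss means the model assigns higher probability to the attribute that was hidden, which is what minimizing the objective is meant to encourage.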

NUM is defined by the following equation.

Let s_ζ(h_t, h_{t+1}) be the scoring function for consecutive utterances. NUM evaluates how well the immediately preceding utterance text and the next utterance text fit together. NUM trains the model so that the accuracy of predicting a next utterance text that fits the context improves.

The model is trained by minimizing the following objective function.

L_LM is the objective function for training the Transformer decoder. Minimizing L_LM improves the prediction accuracy of the words generated autoregressively.

[Operation]
Next, the learning process will be described with reference to the flowchart in FIG. 4.

In step S11, the learning unit 10 reads a series of utterance texts and their attributes from the data storage unit 30. FIG. 5 shows an example of a series of utterance texts. Each utterance in the series is annotated with attributes. In the example of FIG. 5, an identifier that identifies the speaker is attached. A plurality of attributes may be attached as the attributes of the speaker.

In step S12, the learning unit 10 acquires two consecutive utterance texts and splits each of them into words by morphological analysis. In the example of FIG. 5, the learning unit 10 first acquires "Hi!" by Speaker A and "May I help you?" by Speaker B, and splits each into words.

In step S13, the learning unit 10 inputs the words of the immediately preceding utterance text, the attributes of the speaker, and the words of the next utterance text into the model, and updates the model parameters so as to minimize the objective function described above. Specifically, the learning unit 10 inputs the words of the immediately preceding utterance text as HISTORY, the attributes of the speaker of the next utterance as ATTRIBUTE, and the words of the next utterance text as UTTERANCE. A plurality of speaker attributes may be input.

In step S14, the learning unit 10 determines whether all utterance texts in the series have been processed. If unprocessed utterance texts remain, the learning unit 10 returns to step S12, shifts by one utterance text, acquires the two consecutive utterance texts that include the next utterance text, and continues processing. Specifically, in the example of FIG. 5, it acquires "May I help you?" by Speaker B and "Yes, please." by Speaker A and continues processing.
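The sliding of the utterance pair in steps S12 to S14 can be sketched as follows. This is only an illustration: whitespace splitting stands in for the morphological analysis, and the function name `training_examples`, the dictionary keys, and the attribute strings such as `speaker_A` are assumptions made for this example.

```python
def training_examples(dialogue):
    """Yield (HISTORY, ATTRIBUTE, UTTERANCE) examples from consecutive utterance pairs.

    dialogue: list of (attributes, text) tuples in spoken order.
    Whitespace splitting is a stand-in for morphological analysis.
    """
    for (_, prev_text), (next_attrs, next_text) in zip(dialogue, dialogue[1:]):
        yield {
            "HISTORY": prev_text.split(),      # words of the immediately preceding utterance
            "ATTRIBUTE": list(next_attrs),     # attributes of the speaker of the next utterance
            "UTTERANCE": next_text.split(),    # words of the next utterance (training target)
        }


dialogue = [
    (["speaker_A"], "Hi!"),
    (["speaker_B"], "May I help you?"),
    (["speaker_A"], "Yes, please."),
]
for example in training_examples(dialogue):
    print(example)
```

Each utterance except the first serves once as the target, and each utterance except the last serves once as the history, which is the one-step shift described in step S14.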

In this embodiment, only the immediately preceding utterance text is used for the target UTTERANCE, which keeps the computational cost lower than in conventional techniques that use all of the earlier utterance texts.

Next, the dialogue generation process will be described with reference to the flowchart in FIG. 6.

In step S21, the generation unit 20 inputs the immediately preceding utterance text and the attributes received from the user terminal 5 into the model. The immediately preceding utterance text is split into words before being input into the model.

In step S22, the generation unit 20 concatenates the words that the model outputs recursively to generate the next utterance text.

In step S23, the generation unit 20 determines whether to end the generation of utterance text. For example, when generating a continuous series of conversational turns, the generation unit 20 returns to step S21 and continues processing, using the next utterance text generated in step S22 as the immediately preceding utterance text.
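The recursive word-by-word generation of steps S21 and S22 can be sketched as a simple loop. The trained Transformer is replaced here by a hypothetical callable `next_word`; the toy model `toy_next_word`, its scripted replies, the `max_len` limit, and stopping on an `[EOT]` token are assumptions made so that the example runs on its own.

```python
def generate_utterance(next_word, history, attributes, max_len=20, eot="[EOT]"):
    """Call the model repeatedly, feeding back the words generated so far."""
    utterance = []
    for _ in range(max_len):
        word = next_word(history, attributes, utterance)
        if word == eot:  # end-of-text token terminates generation
            break
        utterance.append(word)
    return utterance


def toy_next_word(history, attributes, utterance):
    """Stand-in for the trained model: replies from a fixed script per speaker attribute."""
    script = {
        "speaker_A": ["Yes,", "please."],
        "speaker_B": ["No,", "thank", "you."],
    }[attributes[0]]
    return script[len(utterance)] if len(utterance) < len(script) else "[EOT]"


history = "May I help you?".split()
print(" ".join(generate_utterance(toy_next_word, history, ["speaker_A"])))
print(" ".join(generate_utterance(toy_next_word, history, ["speaker_B"])))
```

To produce a continued conversation as in step S23, the returned words would be passed back in as `history` for the next call, with the attributes of the next speaker.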

When the utterance text generation process by the generation unit 20 is complete, the generated utterance text is returned from the input/output unit 50 to the user terminal 5.

As described above, the dialogue system 1 of this embodiment includes the learning unit 10 and the generation unit 20 and generates response text to utterances. The learning unit 10 receives utterance texts to which speaker attributes are attached and extracts words from the previous utterance text and the next utterance text. It inputs, to a Transformer incorporating a multi-view attention mechanism that controls access to other tokens for each token type, the words of the previous utterance text and the attributes as conditions and the words of the next utterance text as the output result. It trains the Transformer so as to minimize an objective function representing the prediction accuracy of the attributes when some of the attributes are removed and an objective function representing the degree of matching between the previous utterance text and the next utterance text. The generation unit 20 inputs the immediately preceding utterance text and the attributes of the speaker into the trained Transformer and concatenates the words output recursively by the Transformer to generate a response text that will be the speaker's next utterance. This makes it possible to provide a dialogue system 1 that combines the model before retraining and the model after retraining in a well-balanced manner and generates utterances that reflect the context of the utterance.

1 Dialogue system
10 Learning unit
20 Generation unit
30 Data storage unit
40 Calculation result storage unit
50 Input/output unit
5 User terminal

Claims

A dialogue text generation device that generates a response text to an utterance, comprising:
a learning unit that receives utterance texts to which speaker attributes are attached, extracts words from a previous utterance text and a next utterance text, inputs, to a Transformer incorporating a multi-view attention mechanism that controls access to other tokens for each token type, the words of the previous utterance text and the attributes as conditions and the words of the next utterance text as an output result, and trains the Transformer so as to minimize an objective function representing the prediction accuracy of the attributes when some of the attributes are removed and an objective function representing the degree of matching between the previous utterance text and the next utterance text; and
a generation unit that inputs the immediately preceding utterance text and the attributes of the speaker into the trained Transformer, and concatenates the words output recursively by the Transformer to generate a response text that will be the speaker's next utterance,
wherein the multi-view attention mechanism has a self-attention mask that allows the tokens of the attributes to access all tokens and allows each word token of the next utterance text to access all of the tokens of the attributes and the tokens of words that appeared before that word, and, when the Transformer computes attention, sets the attention to an infinitely small value for the word tokens of the next utterance text so that the tokens of subsequent words are not referenced.

The dialogue text generation device according to claim 1,
wherein the learning unit receives the utterance texts of a series of dialogues and trains while shifting the previous utterance text and the next utterance text by one utterance at a time.

A dialogue text generation method for generating a response text to an utterance, wherein a computer:
receives utterance texts to which speaker attributes are attached, extracts words from a previous utterance text and a next utterance text, inputs, to a Transformer incorporating a multi-view attention mechanism that controls access to other tokens for each token type, the words of the previous utterance text and the attributes as conditions and the words of the next utterance text as an output result, and trains the Transformer so as to minimize an objective function representing the prediction accuracy of the attributes when some of the attributes are removed and an objective function representing the degree of matching between the previous utterance text and the next utterance text; and
inputs the immediately preceding utterance text and the attributes of the speaker into the trained Transformer, and concatenates the words output recursively by the Transformer to generate a response text that will be the speaker's next utterance,
wherein the multi-view attention mechanism has a self-attention mask that allows the tokens of the attributes to access all tokens and allows each word token of the next utterance text to access all of the tokens of the attributes and the tokens of words that appeared before that word, and, when the Transformer computes attention, sets the attention to an infinitely small value for the word tokens of the next utterance text so that the tokens of subsequent words are not referenced.

The dialogue text generation method according to claim 3,
wherein the computer receives the utterance texts of a series of dialogues and trains while shifting the previous utterance text and the next utterance text by one utterance at a time.

A program that causes a computer to operate as each unit of the dialogue text generation device according to claim 1 or 2.