JP7428245B2

JP7428245B2 - Response sentence generator and program

Info

Publication number: JP7428245B2
Application number: JP2022524741A
Authority: JP
Inventors: 雅博水上; 弘晃杉山; 宏美成松; 庸浩有本; 竜一郎東中
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2020-05-20
Filing date: 2020-05-20
Publication date: 2024-02-06
Anticipated expiration: 2040-05-20
Also published as: JPWO2021234838A1; JP2024028569A; WO2021234838A1

Description

The present invention relates to dialogue technology for conversing with a user, and in particular to a technique for generating system utterances that reflect individuality.

As dialogue systems have advanced, demand has grown for giving them traits such as personality and character (hereinafter collectively called "individuality"); see, for example, Non-Patent Document 1. Many conventional commercial dialogue systems are rule-based, and a dialogue system with individuality was built by preparing, in advance, response rules that reflect that individuality. In recent dialogue systems, generating responses with a neural network (hereinafter the "sentence generation method") has become common, and a way to take individuality into account in the sentence generation method is also desired.

One way to take individuality into account in the sentence generation method is to supply, together with the input sentence, individuality information that can be used in the response. For example, suppose a dialogue system that, without considering individuality, generates the response "I like curry and rice" for the input sentence "What food do you like?". To take individuality into account, individuality information such as "For food, I like fried chicken. My hobby is surfing. I have a dog." is supplied together with the input sentence "What food do you like?", and the system can then generate the response "I like fried chicken". This approach learns the general relationship between utterances and responses and, when the individuality information supplied alongside the input can be used directly in the response, generates a response sentence that reflects it.

[Non-Patent Document 1] Ryuichiro Higashinaka, Masahiro Mizukami, Hidetoshi Kawabata, Emi Yamaguchi, Noritake Adachi, and Junji Tomita, "Role play-based question-answering by real users for building chatbots with consistent personalities", Proceedings of the SIGDIAL 2018 Conference, pp. 264-272, Melbourne, Australia, July 2018.

However, individuality specific to a particular person (for example, when one wants to give a dialogue system the individuality of a well-known figure such as Oda Nobunaga or Toyotomi Hideyoshi) can be difficult to put into words, or can call for responses that depart from the general relationship between utterances and responses. For example, for the input "Next year is the Year of the Monkey, isn't it?", a response reflecting Toyotomi Hideyoshi's individuality would be expected to be something like "That's my year, then" or "Who are you calling a monkey?!". The conventional approach, however, works only when the individuality information can be reflected within the general utterance-response relationship. Even given an input such as "My name is Toyotomi Hideyoshi. One of the Three Great Unifiers. I served Oda Nobunaga and went on to unify the country. Next year is the Year of the Monkey, isn't it?", it cannot generate responses like those above.

In view of the technical problem described above, an object of the present invention is to provide dialogue technology that can generate a response sentence reflecting individuality without individuality information being supplied as input.

To solve the above problem, a response sentence generation device according to a first aspect of the present invention includes an input unit that receives an input sentence and a speaker identifier representing a speaker, and a response sentence generation unit that obtains a response sentence by inputting the input sentence and the speaker identifier into a response sentence generation model. The response sentence generation model includes a speaker model that obtains a speaker embedding vector from the speaker identifier, an encoder that generates a sentence vector from an uttered sentence, a decoder that generates the response sentence using an attention vector representing the content of attention paid to the uttered sentence, and an attention mechanism that generates the attention vector using a content vector representing the internal state of the decoder, the sentence vector, and the speaker embedding vector.

A response sentence generation model learning device according to a second aspect of the present invention includes a learning data storage unit that stores learning data consisting of an uttered sentence, a response sentence with which a predetermined speaker responds to the uttered sentence, and a speaker identifier representing that speaker, and a model learning unit that uses the learning data to train a response sentence generation model that takes an uttered sentence and a speaker identifier as input and outputs a response sentence responding to the uttered sentence. The response sentence generation model includes a speaker model that obtains a speaker embedding vector from the speaker identifier, an encoder that generates a sentence vector from the uttered sentence, a decoder that generates the response sentence using an attention vector representing the content of attention paid to the uttered sentence, and an attention mechanism that generates the attention vector using a content vector representing the internal state of the decoder, the sentence vector, and the speaker embedding vector.

According to the present invention, a response sentence that reflects individuality can be generated without individuality information being supplied as input.

FIG. 1 is a diagram illustrating the functional configuration of the response sentence generation device.
FIG. 2 is a diagram illustrating the processing procedure of the response sentence generation method.
FIG. 3 is a diagram illustrating the functional configuration of the response sentence generation model.
FIG. 4 is a diagram illustrating the functional configuration of the response sentence generation model learning device.
FIG. 5 is a diagram illustrating the processing procedure of the response sentence generation model learning method.
FIG. 6 is a diagram illustrating the functional configuration of a computer.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference number, and duplicate description is omitted.

The symbol "^-" used in the text should properly appear directly above the character that follows it, but because of the limits of text notation it is written immediately before that character. In formulas, the symbol appears in its proper position, directly above the character.

[Summary of the Invention]
In the present invention, a dialogue system using the sentence generation method is configured to assume an arbitrary speaker and generate a response sentence that reflects that speaker's individuality. The individuality information that conventional sentence generation methods needed in order to take individuality into account is no longer required. For example, from the input sentence "Next year is the Year of the Monkey, isn't it?", the system can generate the response "Who are you calling a monkey?!", which reflects the individuality of Toyotomi Hideyoshi.

To this end, in a neural network that learns utterance-response relationships that differ from one individuality to another, a framework for taking each speaker's individuality into account is introduced into the attention mechanism, which learns the correspondence between input and output, so that the input-output correspondence characteristic of each individuality is learned. For example, speaker-specific tendencies of attention, such as "this person would likely focus on the word 'monkey'" or "this person would likely read such-and-such a meaning into the word 'monkey'", are realized in response generation. This improves the performance of individuality-aware response generation (that is, the quality of the generated response sentences).

[Embodiment]
An embodiment of the present invention consists of a response sentence generation device and method that, in a dialogue system using the sentence generation method, generate a response sentence for an input sentence based on a user utterance, and a response sentence generation model learning device and method that train the response sentence generation model used by that device and method.

<Response sentence generation device>
As shown in FIG. 1, the response sentence generation device 1 of the embodiment takes as input an input sentence representing the content of a user utterance and a speaker identifier that uniquely identifies a speaker, and outputs a response sentence representing the content of the system utterance made in reply to the input sentence. The response sentence generation device 1 includes, for example, a model storage unit 10, an input unit 11, and a response sentence generation unit 12. The response sentence generation method of the embodiment is realized by the response sentence generation device 1 performing the processing of each step illustrated in FIG. 2.

The response sentence generation device 1 is a special device configured by loading a special program into a known or dedicated computer that has, for example, a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The response sentence generation device 1 executes each process under the control of the central processing unit, for example. Data input to the response sentence generation device 1 and data obtained in each process are stored, for example, in the main storage device, and the data stored there is read out to the central processing unit as needed and used in other processes. At least part of the response sentence generation device 1 may be implemented in hardware such as an integrated circuit. Each storage unit of the response sentence generation device 1 can be implemented with, for example, a main storage device such as RAM, an auxiliary storage device made up of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or middleware such as a relational database or a key-value store.

The model storage unit 10 stores a trained response sentence generation model. As shown in FIG. 3, the response sentence generation model 100 takes an input sentence and a speaker identifier as input and outputs a response sentence. It includes, for example, a speaker model 101, an encoder 102, a decoder 103, and an attention mechanism 104. The input sentence is, for example, an uttered sentence representing a question the user has put to the dialogue system. The speaker identifier uniquely identifies the person whose individuality is to be reflected. The response sentence is, for example, an uttered sentence representing the dialogue system's answer to the question given as the input sentence.

The speaker model 101 is a trained model that takes a speaker identifier as input, converts it into a speaker embedding vector, and outputs that vector. For the speaker model 101, the model called the Speaker model described in Reference 1 can be used, for example.
[Reference 1] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan, "A persona-based neural conversation model," arXiv preprint, arXiv:1603.06155, 2016.
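As a rough illustration of what the speaker model does, the sketch below treats it as a lookup table with one d-dimensional row per speaker identifier. The class name `SpeakerModel`, the identifiers `"hideyoshi"` and `"nobunaga"`, and the random initial values are assumptions made for this example; in the embodiment the rows would be learned parameters.

```python
import random


class SpeakerModel:
    """Hypothetical lookup table: speaker identifier -> embedding vector."""

    def __init__(self, speaker_ids, dim, seed=0):
        rng = random.Random(seed)
        # One row per known speaker. Random values stand in for the
        # parameters that training would normally produce.
        self.table = {
            sid: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for sid in speaker_ids
        }

    def embed(self, speaker_id):
        """Return the speaker embedding vector s_u for the given identifier."""
        return self.table[speaker_id]


speaker_model = SpeakerModel(["hideyoshi", "nobunaga"], dim=4)
s_u = speaker_model.embed("hideyoshi")
```

The same identifier always yields the same vector, so the only per-speaker input the rest of the model needs is the identifier itself.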

The encoder 102 takes an uttered sentence as input, converts it into a sentence vector, and outputs that vector. The decoder 103 generates and outputs a response sentence using the attention vector output by the attention mechanism 104. The encoder 102 and the decoder 103 are the same as those used in conventional sentence generation methods; for the conventional sentence generation method, see Reference 2.
[Reference 2] Vinyals, Oriol, and Quoc Le, "A neural conversational model," arXiv preprint, arXiv:1506.05869, 2015.

The attention mechanism 104 takes as input the speaker embedding vector output by the speaker model 101, the sentence vector output by the encoder 102, and the content vector representing the internal state of the decoder 103, and generates and outputs an attention vector. First, using the speaker embedding vector, the sentence vector, and the content vector, the attention mechanism 104 generates attention weights, a vector representing which parts of the input sentence to attend to (hereinafter the "tendency of attention"). Next, using the attention weights, the speaker embedding vector, and the sentence vector, it generates an attention vector representing what was attended to in the input sentence according to that tendency (hereinafter the "content of attention").

The difference from the attention mechanism of conventional sentence generation methods is that the speaker embedding vector is referenced both when computing the attention weights and when computing the attention vector. As a result, the tendency of attention and the content of attention change according to individuality. The tendency of attention is a characteristic such as "Toyotomi Hideyoshi pays strong attention to the word 'monkey'". The content of attention is a characteristic such as "Toyotomi Hideyoshi takes the word 'monkey' negatively". Such characteristics need not be assigned by hand in advance; they come to be reflected in the attention vector by preparing, as training data, data in which the tendency and content of attention are reflected, specifically a large amount of sentence data tied to speakers.

Specifically, the attention mechanism 104 computes the following formula.

Here, ○ is the operator denoting the element-wise product. t is a variable indicating that the decoder is outputting the t-th word. i is a variable indicating the i-th word of the N-word input sentence given to the encoder. h_t^(dec) is the d-dimensional content vector representing the internal state of the decoder, where d is the size (number of dimensions) of the computation in the attention mechanism. H^(enc) is the N×d-dimensional sentence vector generated by the encoder, and h_i^(enc)∈H^(enc) is its element corresponding to the i-th word. s_u is the d-dimensional speaker embedding vector generated by the speaker model. f(·) and g(·) are mutually different linear transformations. Each may be a first-order linear transformation or an arbitrary M-th-order linear transformation; a function whose output stays within a fixed range such as 0 to 1 or -1 to 1 may be defined using, for example, the sigmoid or softsign function; or these may be combined. a_i is the attention weight for the i-th word of the input sentence.
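Read together with the step-by-step description in the next paragraph, the four lines of the formula can be written out roughly as below. This is a reconstruction from the text, not the original figure. The first three lines follow the description directly. The fourth line assumes a scaled dot product normalized by softmax, as is usual in a Transformer; the text states only that a_i is computed from ^-h_i,k^(enc) and h_t^(dec). The symbol c_t for the attention vector is a name introduced here, and \bar{h} corresponds to ^-h in the text notation.

```latex
\begin{align}
c_t &= \sum_{i=1}^{N} a_i \, \bar{h}_{i,v}^{(enc)} \\
\bar{h}_{i,k}^{(enc)} &= f(s_u) \circ h_i^{(enc)} \\
\bar{h}_{i,v}^{(enc)} &= g(s_u) \circ h_i^{(enc)} \\
a_i &= \frac{\exp\left( h_t^{(dec)} \cdot \bar{h}_{i,k}^{(enc)} / \sqrt{d} \right)}
            {\sum_{j=1}^{N} \exp\left( h_t^{(dec)} \cdot \bar{h}_{j,k}^{(enc)} / \sqrt{d} \right)}
\end{align}
```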

That is, the attention mechanism 104 computes the attention vector as follows. First, to compute the attention weight a_i, the speaker embedding vector s_u is transformed with the M-th-order linear transformation f(·). The element-wise product of the transformed speaker embedding vector f(s_u) and each element h_i^(enc) of the encoded sentence vector H^(enc) is computed and denoted ^-h_i,k^(enc) (the second line of the formula). The i-th attention weight a_i is then computed from ^-h_i,k^(enc), which has been modified by the speaker embedding vector s_u, and the decoder's content vector h_t^(dec) (the fourth line of the formula). Next, to compute the attention vector, the speaker embedding vector s_u is transformed with the M-th-order linear transformation g(·). The element-wise product of the transformed speaker embedding vector g(s_u) and each element h_i^(enc) of the encoded sentence vector H^(enc) is computed and denoted ^-h_i,v^(enc) (the third line of the formula). The subscripts k and v of ^-h_i,k^(enc) and ^-h_i,v^(enc), which are computed from each element of the encoded sentence vector and the linearly transformed speaker embedding vector, stand for key and value; they are used because, by convention in attention mechanisms, the weights are called keys and the vectors to which the weights are applied are called values. Finally, the product of the attention weight a_i and ^-h_i,v^(enc) is computed for every i, and the products are summed. This sum is the attention vector, the final output (the first line of the formula).
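The computation just described can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: f and g are modeled as bias-free first-order linear maps given by hypothetical weight matrices `W_f` and `W_g`, and the attention weight is assumed to be a softmax over scaled dot products, which the text does not spell out. The function names are illustrative.

```python
import math


def linear(W, x):
    """Apply a linear map, given as a list of rows, to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def hadamard(a, b):
    """Element-wise product of two vectors of equal length."""
    return [x * y for x, y in zip(a, b)]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_attention(h_dec, H_enc, s_u, W_f, W_g):
    """Return (attention weights a_i, attention vector) for one decoder step."""
    d = len(h_dec)
    f_s = linear(W_f, s_u)  # f(s_u)
    g_s = linear(W_g, s_u)  # g(s_u)
    keys = [hadamard(f_s, h_i) for h_i in H_enc]    # ^-h_i,k
    values = [hadamard(g_s, h_i) for h_i in H_enc]  # ^-h_i,v
    # Assumed scoring: scaled dot product of the content vector with each key.
    scores = [sum(q * k for q, k in zip(h_dec, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Attention vector: weighted sum of the speaker-modified values.
    attention = [sum(a * v[j] for a, v in zip(weights, values))
                 for j in range(d)]
    return weights, attention


identity = [[1.0, 0.0], [0.0, 1.0]]
H_enc = [[1.0, 0.0], [0.0, 1.0]]  # two input words, d = 2
h_dec = [1.0, 1.0]
w_a, c_a = speaker_attention(h_dec, H_enc, [2.0, 0.0], identity, identity)
w_b, c_b = speaker_attention(h_dec, H_enc, [0.0, 2.0], identity, identity)
```

With the same input sentence and decoder state, the two toy speaker vectors put the larger weight on different words, which is the speaker-dependent "tendency of attention" the text describes.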

With reference to FIG. 2, the processing procedure of the response sentence generation method executed by the response sentence generation device 1 of the embodiment is described.

In step S11, an input sentence and a speaker identifier are input to the input unit 11. The input unit 11 outputs the input sentence and the speaker identifier to the response sentence generation unit 12.

In step S12, the response sentence generation unit 12 receives the input sentence and the speaker identifier from the input unit 11 and inputs them into the response sentence generation model stored in the model storage unit 10, thereby obtaining a response sentence that reflects the speaker's individuality. The word sequence that forms the response sentence is obtained by repeatedly outputting the word associated with the vector obtained from the output layer of the response sentence generation model. The response sentence generation unit 12 outputs the obtained response sentence as the output of the response sentence generation device 1.
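The word-by-word output in step S12 can be pictured as a simple loop. The sketch below is only an illustration: `step_fn` is a hypothetical stand-in for one forward pass of the response sentence generation model, and the end-of-sentence token `"</s>"` and the length cap `max_len` are assumptions, since the text does not say how generation stops.

```python
def generate_response(step_fn, input_words, speaker_id, max_len=20, eos="</s>"):
    """Repeatedly ask the model for the next word until it emits eos."""
    response = []
    for _ in range(max_len):
        word = step_fn(input_words, speaker_id, response)
        if word == eos:
            break
        response.append(word)
    return response


def toy_step(input_words, speaker_id, prefix):
    """Toy model: returns canned words per speaker, then the eos token."""
    canned = {"hideyoshi": ["who", "are", "you", "calling", "a", "monkey", "?!"]}
    words = canned.get(speaker_id, ["i", "see"])
    return words[len(prefix)] if len(prefix) < len(words) else "</s>"


question = ["next", "year", "is", "the", "year", "of", "the", "monkey"]
reply_hideyoshi = generate_response(toy_step, question, "hideyoshi")
reply_other = generate_response(toy_step, question, "someone_else")
```

The same input sentence produces different word sequences for different speaker identifiers, and no individuality description is passed in beyond the identifier.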

<Response sentence generation model learning device>
As shown in FIG. 4, the response sentence generation model learning device 2 of the embodiment includes, for example, a learning data storage unit 20, a model learning unit 21, and a model storage unit 10. The response sentence generation model learning method of the embodiment is realized by the response sentence generation model learning device 2 performing the processing of each step illustrated in FIG. 5.

The response sentence generation model learning device 2 is a special device configured by loading a special program into a known or dedicated computer that has, for example, a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The response sentence generation model learning device 2 executes each process under the control of the central processing unit, for example. Data input to the response sentence generation model learning device 2 and data obtained in each process are stored, for example, in the main storage device, and the data stored there is read out to the central processing unit as needed and used in other processes. At least part of the response sentence generation model learning device 2 may be implemented in hardware such as an integrated circuit. Each storage unit of the response sentence generation model learning device 2 can be implemented with, for example, a main storage device such as RAM, an auxiliary storage device made up of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or middleware such as a relational database or a key-value store.

With reference to FIG. 5, the processing procedure of the response sentence generation model learning method executed by the response sentence generation model learning device 2 of the embodiment is described.

The learning data storage unit 20 stores learning data. The learning data consists of an uttered sentence (for example, a question), a response sentence with which a predetermined speaker responds to that uttered sentence, and a speaker identifier representing that speaker. The learning data may be collected from dialogues actually conducted with a dialogue system or the like, may be created by hand with a specific person in mind, or may be a mixture of the two.
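One way to picture a record of this learning data is as a triple of utterance, response, and speaker identifier. The sketch below is illustrative only: the class name `TrainingExample`, the helper `group_by_speaker`, the speaker identifiers, and the three records are made up for this example and are not taken from the actual data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    utterance: str   # the uttered sentence, e.g. a question
    response: str    # how the given speaker responds to it
    speaker_id: str  # identifier of that speaker


def group_by_speaker(examples):
    """Collect the examples belonging to each speaker identifier."""
    grouped = defaultdict(list)
    for ex in examples:
        grouped[ex.speaker_id].append(ex)
    return dict(grouped)


examples = [
    TrainingExample("Next year is the Year of the Monkey, isn't it?",
                    "Who are you calling a monkey?!", "hideyoshi"),
    TrainingExample("What food do you like?",
                    "I like fried chicken.", "speaker_a"),
    TrainingExample("Do you have any pets?",
                    "I have a dog.", "speaker_a"),
]
by_speaker = group_by_speaker(examples)
```

Note that no free-text individuality description appears in a record; the speaker identifier is the only link between a response and the person who gave it.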

In step S20, the response sentence generation model learning device 2 reads the learning data from the learning data storage unit 20 and inputs the read learning data to the model learning unit 21.

In step S21, the model learning unit 21 uses the input learning data to learn the parameters of the neural network of the response sentence generation model 100. The training method is the same as the conventional method, disclosed in Reference 2, for training a model that generates an output from an input and a speaker identifier. That is, softmax cross entropy over the model's output sentence is used as the loss function, and the parameters of the encoder 102, the decoder 103, and the attention mechanism 104 are learned so as to minimize the loss. In training the attention mechanism 104, the parameters of f and g are updated repeatedly, either a predetermined number of times or until a predetermined condition is satisfied. At the same time, the parameters of the speaker model 101, which converts a speaker identifier into a speaker embedding vector, are learned in the same way as for the conventional Speaker model. The model learning unit 21 stores the parameters of the trained response sentence generation model 100 in the model storage unit 10.
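The loss named in step S21 can be illustrated with a few lines of plain Python. The sketch assumes that the model produces one vector of unnormalized scores (logits) over the vocabulary per output position and that the sentence loss is the mean over positions; the averaging and the function names are assumptions for this example, and the gradient-based parameter update itself is not shown.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Negative log of the softmax probability assigned to the target word."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def sentence_loss(logits_per_step, target_indices):
    """Mean softmax cross entropy over the words of one response sentence."""
    losses = [softmax_cross_entropy(logits, target)
              for logits, target in zip(logits_per_step, target_indices)]
    return sum(losses) / len(losses)


uniform_loss = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], 0)
confident_loss = softmax_cross_entropy([5.0, 0.0, 0.0, 0.0], 0)
mean_loss = sentence_loss([[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]], [0, 0])
```

A model that spreads its probability evenly over four words pays log 4 for each word, while one that favors the correct word pays much less; training pushes all components, including the speaker-dependent attention, toward the second case.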

[Experimental results]
To measure the effect of the above embodiment, an experiment was conducted using the role-play question-answering data disclosed in Non-Patent Document 1. Specifically, 50,000 question-answer pairs of role-play data for three people were used as training data and 2,000 pairs as evaluation data. The number of dimensions d of the attention mechanism was set to 512, and a Transformer was used for the encoder and decoder. The attention mechanism of this embodiment was implemented by replacing the self-attention and source-target attention inside the Transformer. The model was trained on the training data, and answers were generated for the questions in the evaluation data. BLEU-1, BLEU-4, and PPL were used as evaluation metrics. That is, the answers generated by this embodiment were compared with the reference answers tied to the evaluation questions using BLEU-1 and BLEU-4 (larger is better). In addition, perplexity (PPL), derived from the probability with which the model generates the reference answers, was computed (smaller is better).
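For readers unfamiliar with the metrics, the sketch below shows the standard definitions of perplexity and single-reference BLEU-1 on pre-tokenized text. It is not the evaluation script used in the experiment, which the text does not describe; tokenization of Japanese text is out of scope here, and the function names are illustrative.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """exp of the mean negative log probability the model gave each token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def bleu1(candidate, reference):
    """Clipped unigram precision times the brevity penalty (one reference)."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # A candidate word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    precision = clipped / len(candidate)
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return brevity_penalty * precision


ppl = perplexity([0.25, 0.25, 0.25, 0.25])
exact = bleu1(["a", "b"], ["a", "b"])
partial = bleu1(["a", "a"], ["a", "b"])
```

A model that assigns probability 0.25 to every reference token has a perplexity of about 4, as if it were choosing among four equally likely words; lower values mean the reference answers are less surprising to the model.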

Table 1 shows the experimental results. The method of this embodiment achieved the best score on every evaluation metric.

Embodiments of the present invention have been described above, but the specific configuration is not limited to these embodiments; it goes without saying that design changes made as appropriate without departing from the spirit of the invention are included in the invention. The various processes described in the embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capacity of the device that executes them or as needed.

［プログラム、記録媒体］ [Program, recording medium]
上記実施形態で説明した各装置における各種の処理機能をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムを図６に示すコンピュータの記憶部１０２０に読み込ませ、演算処理部１０１０、入力部１０３０、出力部１０４０などに動作させることにより、上記各装置における各種の処理機能がコンピュータ上で実現される。 When the various processing functions of each device described in the above embodiments are realized by a computer, the processing contents of the functions that each device should have are described by a program. By loading this program into the storage unit 1020 of the computer shown in FIG. 6 and causing the arithmetic processing unit 1010, the input unit 1030, the output unit 1040, and so on to operate, the various processing functions of each of the above devices are realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体は、例えば、非一時的な記録媒体であり、磁気記録装置、光ディスク等である。 A program describing the contents of this process can be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a non-transitory recording medium, such as a magnetic recording device, an optical disk, or the like.

また、このプログラムの流通は、例えば、そのプログラムを記録したDVD、CD-ROM等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 Further, this program is distributed by, for example, selling, transferring, lending, etc. portable recording media such as DVDs and CD-ROMs on which the program is recorded. Furthermore, this program may be distributed by storing the program in the storage device of the server computer and transferring the program from the server computer to another computer via a network.

このようなプログラムを実行するコンピュータは、例えば、まず、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、一旦、自己の非一時的な記憶装置である補助記録部１０５０に格納する。そして、処理の実行時、このコンピュータは、自己の非一時的な記憶装置である補助記録部１０５０に格納されたプログラムを一時的な記憶装置である記憶部１０２０に読み込み、読み込んだプログラムに従った処理を実行する。また、このプログラムの別の実行形態として、コンピュータが可搬型記録媒体から直接プログラムを読み込み、そのプログラムに従った処理を実行することとしてもよく、さらに、このコンピュータにサーバコンピュータからプログラムが転送されるたびに、逐次、受け取ったプログラムに従った処理を実行することとしてもよい。また、サーバコンピュータから、このコンピュータへのプログラムの転送は行わず、その実行指示と結果取得のみによって処理機能を実現する、いわゆるASP（Application Service Provider）型のサービスによって、上述の処理を実行する構成としてもよい。なお、本形態におけるプログラムには、電子計算機による処理の用に供する情報であってプログラムに準ずるもの（コンピュータに対する直接の指令ではないがコンピュータの処理を規定する性質を有するデータ等）を含むものとする。 A computer that executes such a program, for example, first stores the program recorded on the portable recording medium, or the program transferred from the server computer, in the auxiliary recording unit 1050, which is its own non-transitory storage device. When executing the processing, the computer reads the program stored in the auxiliary recording unit 1050 into the storage unit 1020, which is a transitory storage device, and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or, each time a program is transferred from the server computer to this computer, the computer may sequentially execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which does not transfer the program from the server computer to this computer and realizes the processing functions only through execution instructions and acquisition of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

また、この形態では、コンピュータ上で所定のプログラムを実行させることにより、本装置を構成することとしたが、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 Further, in this embodiment, the present apparatus is configured by executing a predetermined program on a computer, but at least a part of these processing contents may be realized by hardware.

Claims

an input section for inputting an input sentence and a speaker identifier representing a speaker;
a response sentence generation unit that obtains a response sentence by inputting the input sentence and the speaker identifier into a response sentence generation model;
including;
The response sentence generation model is
a speaker model that calculates a speaker embedding vector from a speaker identifier;
an encoder that generates a sentence vector from an uttered sentence;
a decoder that generates a response sentence using an attention vector representing the content of attention to the uttered sentence;
an attention mechanism that generates the attention vector using a content vector representing an internal state of the decoder, the sentence vector, and the speaker embedding vector;
including;
The attention mechanism generates the attention vector by calculating the following equation, where H^(enc) is the sentence vector, N is the number of elements of the sentence vector, h_i^(enc) is the element of the sentence vector corresponding to the i-th word, h_t^(dec) is the content vector when calculating the t-th element of the response sentence, s_u is the speaker embedding vector, f is a first linear transformation, and g is a second linear transformation:

Response sentence generator.

A program for causing a computer to function as the response sentence generation device according to claim 1.