WO2021234838A1 - Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor - Google Patents


Info

Publication number
WO2021234838A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
vector
speaker
response
attention
Application number
PCT/JP2020/019887
Other languages
French (fr)
Japanese (ja)
Inventor
Masahiro Mizukami (水上雅博)
Hiroaki Sugiyama (杉山弘晃)
Hiromi Narimatsu (成松宏美)
Tsunehiro Arimoto (有本庸浩)
Ryuichiro Higashinaka (東中竜一郎)
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority to JP2022524741A (JP7428245B2)
Priority to PCT/JP2020/019887
Publication of WO2021234838A1
Priority to JP2024008193A (JP2024028569A)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/42: Data-driven translation
    • G06F 40/44: Statistical methods, e.g. probability models
    • G06F 40/55: Rule-based translation
    • G06F 40/56: Natural language generation


Abstract

In the present invention, response sentences that reflect individuality are generated without inputting individuality information. An input unit (11) receives an input sentence and a speaker identifier representing a speaker. A response sentence generation unit (12) obtains a response sentence by inputting the input sentence and the speaker identifier into a response sentence generation model. The response sentence generation model contains: a speaker model that obtains a speaker embedding vector from a speaker identifier; an encoder that generates a sentence vector from an utterance sentence; a decoder that generates a response sentence using an attention vector that represents the content of attention to the utterance sentence; and an attention mechanism that generates the attention vector using the speaker embedding vector, the sentence vector, and a content vector representing the internal state of the decoder.

Description

Response sentence generation device, response sentence generation model learning device, methods therefor, and programs
The present invention relates to dialogue techniques for interacting with a user, and in particular to techniques for generating system utterances that reflect individuality.
With the development of dialogue systems, there is growing demand for giving dialogue systems features such as personality and character (hereinafter collectively called "individuality") (for example, Non-Patent Document 1). Many conventional commercial dialogue systems use a rule-based method, and build individuality into the system by preparing response rules that reflect it in advance. In recent dialogue systems, generating responses with a neural network (hereinafter the "sentence generation method") has become common, and methods that take individuality into account are expected in the sentence generation method as well.
One way to consider individuality in the sentence generation method is to input individuality information usable in the response together with the input sentence. For example, suppose a dialogue system that, without considering individuality, generates the response "I like curry rice" to the input "What food do you like?". To consider individuality, one can input, together with the input sentence "What food do you like?", individuality information such as "My favorite food is fried chicken. My hobby is surfing. I have a dog.", and the system can then generate the response "I like fried chicken". In this approach, the system learns the general relationship between utterances and responses, and generates a response reflecting the individuality information whenever that information, supplied at input time, can be used directly in the response.
However, the individuality peculiar to a specific person (for example, when one wants to give a dialogue system the individuality of a figure whose personality is well known, such as Nobunaga Oda or Hideyoshi Toyotomi) can be difficult to verbalize, and may require responses that deviate from the general relationship between utterance and response. For example, when responding to the input "Next year is the year of the monkey" in a way that reflects Hideyoshi Toyotomi's individuality, responses such as "It is my year" ("washi no toshi ja na") or "Who are you calling a monkey!!" are expected. However, the conventional approach is effective only when individuality information can be mapped onto the general utterance-response relationship, so such responses cannot be generated even from an input like "My name is Hideyoshi Toyotomi. One of the three great unifiers. I served Nobunaga Oda and achieved the unification of the country. Next year is the year of the monkey."
In view of the above technical problems, an object of the present invention is to provide a dialogue technique that can generate response sentences reflecting individuality without inputting individuality information.
To solve the above problems, a response sentence generation device according to a first aspect of the present invention includes: an input unit that receives an input sentence and a speaker identifier representing a speaker; and a response sentence generation unit that obtains a response sentence by inputting the input sentence and the speaker identifier into a response sentence generation model. The response sentence generation model includes: a speaker model that obtains a speaker embedding vector from a speaker identifier; an encoder that generates a sentence vector from an utterance sentence; a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and an attention mechanism that generates the attention vector using a content vector representing the internal state of the decoder, the sentence vector, and the speaker embedding vector.
A response sentence generation model learning device according to a second aspect of the present invention includes: a learning data storage unit that stores learning data consisting of an utterance sentence, a response sentence in which a predetermined speaker responds to the utterance sentence, and a speaker identifier representing the speaker; and a model learning unit that uses the learning data to learn a response sentence generation model that takes an utterance sentence and a speaker identifier as input and outputs a response sentence responding to the utterance sentence. The response sentence generation model includes: a speaker model that obtains a speaker embedding vector from a speaker identifier; an encoder that generates a sentence vector from an utterance sentence; a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and an attention mechanism that generates the attention vector using a content vector representing the internal state of the decoder, the sentence vector, and the speaker embedding vector.
According to the present invention, a response sentence reflecting individuality can be generated without inputting individuality information.
FIG. 1 is a diagram illustrating the functional configuration of the response sentence generation device.
FIG. 2 is a diagram illustrating the processing procedure of the response sentence generation method.
FIG. 3 is a diagram illustrating the functional configuration of the response sentence generation model.
FIG. 4 is a diagram illustrating the functional configuration of the response sentence generation model learning device.
FIG. 5 is a diagram illustrating the processing procedure of the response sentence generation model learning method.
FIG. 6 is a diagram illustrating the functional configuration of a computer.
Hereinafter, embodiments of the present invention will be described in detail. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.
The symbol "¯" used in the text should properly be written directly above the character that follows it, but due to limitations of text notation it is written immediately before that character. In the equations, these symbols are written in their proper position, directly above the character.
[Outline of the invention]
In the present invention, in a dialogue system using the sentence generation method, an arbitrary speaker is assumed and response sentences reflecting that speaker's individuality are generated. The individuality information that conventional sentence generation methods required in order to consider individuality becomes unnecessary. For example, from the input sentence "Next year is the year of the monkey", the system can generate the response sentence "Who are you calling a monkey!!", reflecting Hideyoshi Toyotomi's individuality.
To that end, in the neural network that learns utterance-response relationships that differ for each individuality, a framework for considering each speaker's individual characteristics is introduced into the attention mechanism, which learns the correspondence between input and output, so that input-output correspondences characteristic of each individuality are learned. For example, tendencies of attention that differ from speaker to speaker, such as "this person is likely to focus on the word monkey" or "this person is likely to read this meaning into the word monkey", are realized in response sentence generation. As a result, the performance of response sentence generation that considers individuality (that is, the quality of the generated response sentences) improves.
[Embodiment]
An embodiment of the present invention consists of a response sentence generation device and method that, in a dialogue system using the sentence generation method, generate a response sentence for an input sentence based on a user utterance, and a response sentence generation model learning device and method that learn the response sentence generation model used by that device and method.
<Response sentence generation device>
As shown in FIG. 1, the response sentence generation device 1 of the embodiment takes as input an input sentence representing the content of a user utterance and a speaker identifier that uniquely identifies a speaker, and outputs a response sentence representing the content of the system utterance for that input sentence. The response sentence generation device 1 includes, for example, a model storage unit 10, an input unit 11, and a response sentence generation unit 12. The response sentence generation method of the embodiment is realized by the response sentence generation device 1 performing the processing of each step illustrated in FIG. 2.
The response sentence generation device 1 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The response sentence generation device 1 executes each process under the control of the central processing unit, for example. Data input to the response sentence generation device 1 and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as needed and used for other processing. At least part of the response sentence generation device 1 may be implemented by hardware such as an integrated circuit. Each storage unit of the response sentence generation device 1 can be configured, for example, as a main storage device such as RAM, as an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or with middleware such as a relational database or a key-value store.
A trained response sentence generation model is stored in the model storage unit 10. As shown in FIG. 3, the response sentence generation model 100 takes an input sentence and a speaker identifier as input and outputs a response sentence. The response sentence generation model 100 includes, for example, a speaker model 101, an encoder 102, a decoder 103, and an attention mechanism 104. The input sentence is, for example, an utterance sentence representing the content of a question the user addressed to the dialogue system. The speaker identifier is an identifier that uniquely identifies the person whose individuality is to be reflected. The response sentence is, for example, an utterance sentence representing the content of the dialogue system's answer to the question sentence given as the input sentence.
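For illustration, the four components can be wired together as in the following Python (PyTorch) sketch. The class and argument names are assumptions for illustration, not taken from the patent, and the component modules themselves are supplied by the caller.

```python
import torch.nn as nn

# Illustrative wiring of the response sentence generation model 100
# (roles mirror FIG. 3; every name here is an assumption).
class ResponseSentenceGenerationModel(nn.Module):
    def __init__(self, speaker_model, encoder, decoder):
        super().__init__()
        self.speaker_model = speaker_model  # 101: speaker identifier -> s_u
        self.encoder = encoder              # 102: input sentence -> H_enc
        self.decoder = decoder              # 103, with attention mechanism 104

    def forward(self, input_ids, speaker_id, response_prefix_ids):
        s_u = self.speaker_model(speaker_id)  # speaker embedding vector
        H_enc = self.encoder(input_ids)       # sentence vector (N x d)
        # The decoder consults attention mechanism 104 at each output step,
        # passing its content vector h_t together with H_enc and s_u.
        return self.decoder(response_prefix_ids, H_enc, s_u)  # vocabulary logits
```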
The speaker model 101 is a trained model that takes a speaker identifier as input and converts the speaker identifier into a speaker embedding vector. For the speaker model 101, for example, the model called the Speaker model described in Reference 1 can be used.
[Reference 1] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan, "A persona-based neural conversation model," arXiv preprint arXiv:1603.06155, 2016.
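A minimal sketch of such a speaker model, assuming, as in the Speaker model of Reference 1, that it amounts to a learned embedding table indexed by the speaker identifier (the dimension d = 512 follows the experiment below; all names are illustrative):

```python
import torch
import torch.nn as nn

class SpeakerModel(nn.Module):
    """Speaker model 101 as an embedding lookup (sketch; the table-lookup
    formulation follows Reference 1 rather than the patent text)."""
    def __init__(self, num_speakers: int, d: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(num_speakers, d)

    def forward(self, speaker_id: torch.Tensor) -> torch.Tensor:
        # Integer speaker identifier(s) -> speaker embedding vector(s) s_u.
        return self.embedding(speaker_id)
```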
The encoder 102 takes an utterance sentence as input and converts the utterance sentence into a sentence vector. The decoder 103 generates and outputs a response sentence using the attention vector output by the attention mechanism 104. The encoder 102 and the decoder 103 are the same as the encoder and decoder used in the conventional sentence generation method; for the conventional sentence generation method, see Reference 2.
[Reference 2] Vinyals, Oriol, and Quoc Le, "A neural conversational model," arXiv preprint arXiv:1506.05869, 2015.
The attention mechanism 104 takes as input the speaker embedding vector output by the speaker model 101, the sentence vector output by the encoder 102, and the content vector representing the internal state of the decoder 103, and generates and outputs the attention vector. The attention mechanism 104 first uses the speaker embedding vector, the sentence vector, and the content vector to generate attention weights, a vector expressing which parts of the input sentence to focus on (hereinafter, the "tendency of attention"). Next, the attention mechanism 104 uses the attention weights, the speaker embedding vector, and the sentence vector to generate the attention vector, which expresses what was attended to in the input sentence according to the tendency of attention (hereinafter, the "content of attention").
The difference from the attention mechanism in the conventional sentence generation method is that the speaker embedding vector is referenced both in computing the attention weights and in computing the attention vector. This changes the tendency of attention and the content of attention according to individuality. A tendency of attention is, for example, a feature such as "Hideyoshi Toyotomi focuses strongly on the word monkey". A content of attention is, for example, a feature such as "Hideyoshi Toyotomi takes the word monkey negatively". Such features need not be assigned manually in advance; they are reflected in the attention vector by preparing, as learning data, data in which the tendency and content of attention appear, specifically a large number of sentence pairs tied to speakers.
Specifically, the attention mechanism 104 computes the following:

\[
\begin{aligned}
\mathrm{attention}\bigl(H^{(enc)}, h_t^{(dec)}, s_u\bigr) &= \sum_{i=1}^{N} a_i\,\bar{h}_{i,v}^{(enc)} \\
\bar{h}_{i,k}^{(enc)} &= f(s_u) \circ h_i^{(enc)} \\
\bar{h}_{i,v}^{(enc)} &= g(s_u) \circ h_i^{(enc)} \\
a_i &= \frac{\exp\bigl(h_t^{(dec)\top}\,\bar{h}_{i,k}^{(enc)}\bigr)}{\sum_{j=1}^{N}\exp\bigl(h_t^{(dec)\top}\,\bar{h}_{j,k}^{(enc)}\bigr)}
\end{aligned}
\]
Here, ∘ is the operator denoting the element-wise product. t is a variable indicating that the decoder is outputting the t-th word. i is a variable indexing the i-th word of the N-word input sentence given to the encoder. h_t^(dec) is the d-dimensional content vector representing the internal state of the decoder, where d is the size (number of dimensions) of the computational part of the attention mechanism. H^(enc) is the N×d-dimensional sentence vector generated by the encoder, and h_i^(enc) ∈ H^(enc) is the element corresponding to the i-th word of the sentence vector. s_u is the d-dimensional speaker embedding vector generated by the speaker model. f(·) and g(·) are distinct linear transformations; each may be a first-order linear transformation or an arbitrary M-th order linear transformation, may be defined as a function whose output falls within a fixed range such as 0 to 1 or -1 to 1 using a sigmoid or softsign function, or may combine these. a_i is the attention weight for the i-th word of the input sentence.
That is, the attention mechanism 104 computes the attention vector as follows. First, to compute the attention weights a_i, the speaker embedding vector s_u is transformed with the M-th order linear transformation f(·). The element-wise product of the transformed speaker embedding vector f(s_u) and each element h_i^(enc) of the encoded sentence vector H^(enc) is computed to give ¯h_i,k^(enc) (corresponding to the second line of the equations). Using ¯h_i,k^(enc), which has been deformed by the speaker embedding vector, and the decoder's content vector h_t^(dec), the i-th attention weight a_i is computed (corresponding to the fourth line). Next, to compute the attention vector, s_u is transformed with the M-th order linear transformation g(·), and the element-wise product of g(s_u) with each element h_i^(enc) gives ¯h_i,v^(enc) (corresponding to the third line). The subscripts k and v of ¯h_i,k^(enc) and ¯h_i,v^(enc), which are computed from the elements of the encoded sentence vector and the linearly transformed speaker embedding vector, are the initials of key and value; by convention in attention mechanisms, the quantity used for the weights is called the key and the vector to which the weights are applied the value, hence these subscripts. Finally, the product of the attention weight a_i and ¯h_i,v^(enc) is computed for every i, and their sum is taken. This sum is the attention vector, the final output (corresponding to the first line).
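The computation can be sketched in Python (PyTorch) as follows for a single decoding step. The function and argument names are assumptions; f_lin and g_lin stand for the linear transformations f and g, and the softmax corresponds to the weight normalization of the fourth line.

```python
import torch
import torch.nn.functional as F

def speaker_attention(H_enc, h_t_dec, s_u, f_lin, g_lin):
    """Attention mechanism 104 for one decoding step (sketch).
    H_enc: (N, d) sentence vector; h_t_dec: (d,) decoder content vector;
    s_u: (d,) speaker embedding vector; f_lin, g_lin: the maps f and g."""
    h_k = f_lin(s_u) * H_enc              # 2nd line: element product -> keys
    h_v = g_lin(s_u) * H_enc              # 3rd line: element product -> values
    a = F.softmax(h_k @ h_t_dec, dim=0)   # 4th line: attention weights a_i
    return (a.unsqueeze(1) * h_v).sum(0)  # 1st line: attention vector (d,)
```

With f_lin and g_lin each an independent torch.nn.Linear(d, d), for example, this realizes the first-order linear transformation case described above.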
With reference to FIG. 2, the processing procedure of the response sentence generation method executed by the response sentence generation device 1 of the embodiment is described.
In step S11, an input sentence and a speaker identifier are input to the input unit 11. The input unit 11 outputs the input sentence and the speaker identifier to the response sentence generation unit 12.
In step S12, the response sentence generation unit 12 receives the input sentence and the speaker identifier from the input unit 11 and inputs them into the response sentence generation model stored in the model storage unit 10, thereby obtaining and outputting a response sentence that reflects the speaker's individuality. In outputting the response sentence, the word string constituting the response sentence is obtained by repeatedly emitting the word tied to the vector obtained from the output layer of the response sentence generation model. The response sentence generation unit 12 makes the obtained response sentence the output of the response sentence generation device 1.
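A sketch of this loop follows. Greedy argmax word selection and the token ids (bos_id, eos_id) are assumptions for illustration; the patent states only that word emission is repeated.

```python
import torch

def generate_response(model, input_ids, speaker_id, bos_id, eos_id, max_len=50):
    """Step S12 as a generation loop (sketch): repeatedly emit the word tied
    to the output-layer vector until an end-of-sentence word or length limit."""
    out = [bos_id]
    for _ in range(max_len):
        logits = model(input_ids, speaker_id, torch.tensor(out))
        next_word = int(logits[-1].argmax())  # word tied to the output vector
        if next_word == eos_id:
            break
        out.append(next_word)
    return out[1:]  # word ids of the response sentence
```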
<Response sentence generation model learning device>
As shown in FIG. 4, the response sentence generation model learning device 2 of the embodiment includes, for example, a learning data storage unit 20, a model learning unit 21, and a model storage unit 10. The response sentence generation model learning method of the embodiment is realized by the response sentence generation model learning device 2 performing the processing of each step illustrated in FIG. 5.
The response sentence generation model learning device 2 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The response sentence generation model learning device 2 executes each process under the control of the central processing unit, for example. Data input to the response sentence generation model learning device 2 and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as needed and used for other processing. At least part of the response sentence generation model learning device 2 may be implemented by hardware such as an integrated circuit. Each storage unit of the response sentence generation model learning device 2 can be configured, for example, as a main storage device such as RAM, as an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or with middleware such as a relational database or a key-value store.
With reference to FIG. 5, the processing procedure of the response sentence generation model learning method executed by the response sentence generation model learning device 2 of the embodiment is described.
Learning data are stored in the learning data storage unit 20. The learning data consist, for example, of an utterance sentence that is a question, a response sentence in which a predetermined speaker responds to that utterance sentence, and a speaker identifier representing that speaker. The learning data may be collected from dialogues actually conducted with a dialogue system or the like, may be created manually with a specific person in mind, or may be a mixture of both.
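For concreteness, one learning-data record might look like the following; the field names are assumptions, and the patent specifies only the triple of utterance sentence, response sentence, and speaker identifier.

```python
# Hypothetical learning-data records (field names are illustrative).
learning_data = [
    {
        "utterance": "来年はサル年ですね",    # question sentence
        "response": "誰がサルじゃ!!",         # the predetermined speaker's response
        "speaker_id": "toyotomi_hideyoshi",  # speaker identifier
    },
    # ... records collected from real dialogues, authored manually for a
    # specific person, or a mixture of both.
]
```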
In step S20, the response sentence generation model learning device 2 reads the learning data from the learning data storage unit 20 and inputs the read learning data to the model learning unit 21.
In step S21, the model learning unit 21 learns the parameters of the neural network of the response sentence generation model 100 using the input learning data. The learning method for the response sentence generation model is the same as the learning method, disclosed in Reference 2, for a conventional model that generates output using an input and a speaker identifier. That is, the softmax cross entropy over the model's output sentence is used as the loss function, and the parameters of the encoder 102, the decoder 103, and the attention mechanism 104 are learned so as to minimize the loss. In learning the parameters of the attention mechanism 104, the parameters of f and g are updated a predetermined number of times or until a predetermined condition is satisfied. At the same time, the parameters of the speaker model 101, which converts speaker identifiers into speaker embedding vectors, are learned in the same way as the conventional Speaker model. The model learning unit 21 stores the parameters of the learned response sentence generation model 100 in the model storage unit 10.
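One update of this procedure could be sketched as follows; teacher forcing with shifted response ids and the batch field names are assumptions.

```python
import torch.nn.functional as F

def train_step(model, optimizer, batch):
    """One parameter update for model 100 (sketch). The loss is softmax cross
    entropy over the output sentence; minimizing it updates encoder 102,
    decoder 103, attention mechanism 104 (including f and g) and, as in the
    conventional Speaker model, speaker model 101."""
    optimizer.zero_grad()
    logits = model(batch["utterance_ids"], batch["speaker_id"],
                   batch["response_in_ids"])  # response prefix (teacher forcing)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           batch["response_out_ids"].reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```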
[Experimental results]
To measure the effect of the above embodiment, an experiment was conducted using the role-play ("narikiri") question-answering data disclosed in Non-Patent Document 1. Specifically, 50,000 role-play question-answer pairs for three persons were used as learning data, and 2,000 pairs as evaluation data. The number of dimensions d of the attention mechanism was set to 512, and a Transformer was used for the encoder and the decoder. The attention mechanism of this embodiment was implemented by replacing the self-attention and the source-target attention in the Transformer. The model was trained on the learning data, and answers were generated for the question sentences of the evaluation data. BLEU-1, BLEU-4, and PPL were used as evaluation measures. That is, the answer sentences generated by this embodiment were compared with the reference answer sentences tied to the evaluation-data questions using BLEU-1 and BLEU-4 (larger values are better), and the model's generation probability of the reference answer sentences was computed as perplexity (PPL; smaller values are better).
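As a sketch of these measures (the use of NLTK's sentence_bleu and the formulation of PPL as the exponential of the mean cross entropy are assumptions, not specified in the patent):

```python
import math
from nltk.translate.bleu_score import sentence_bleu

def bleu_1_4(reference_tokens, hypothesis_tokens):
    """BLEU-1 and BLEU-4 of a generated answer against the reference answer
    (larger values are better)."""
    b1 = sentence_bleu([reference_tokens], hypothesis_tokens,
                       weights=(1, 0, 0, 0))
    b4 = sentence_bleu([reference_tokens], hypothesis_tokens,
                       weights=(0.25, 0.25, 0.25, 0.25))
    return b1, b4

def perplexity(mean_cross_entropy: float) -> float:
    """PPL of the model on the reference answers (smaller values are better)."""
    return math.exp(mean_cross_entropy)
```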
Table 1 shows the experimental results. The method of this embodiment achieved the best score on all evaluation measures.
[Table 1]
Although embodiments of the present invention have been described above, the specific configuration is not limited to these embodiments; it goes without saying that design changes and the like made as appropriate within a scope not departing from the spirit of the invention are included in the invention. The various processes described in the embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually according to the processing capability of the device executing them or as needed.
[Program, recording medium]
When the various processing functions of each device described in the above embodiments are realized by a computer, the processing content of the functions each device should have is described by a program. By loading this program into the storage unit 1020 of the computer shown in FIG. 6 and having the arithmetic processing unit 1010, the input unit 1030, the output unit 1040, and so on operate on it, the various processing functions of each of the above devices are realized on the computer.
The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a non-transitory recording medium such as a magnetic recording device or an optical disc.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
A computer that executes such a program, for example, first stores the program recorded on the portable recording medium or transferred from the server computer in the auxiliary recording unit 1050, its own non-transitory storage device. When executing processing, the computer reads the program stored in its auxiliary recording unit 1050 into the storage unit 1020, a temporary storage device, and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may successively execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and acquisition of results, without transferring the program from the server computer to this computer. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (data that are not direct commands to the computer but have the property of defining the computer's processing, and the like).
In this embodiment, the present device is configured by executing a predetermined program on a computer, but at least part of the processing content may be realized in hardware.

Claims (7)

  1.  A response sentence generation device comprising:
     an input unit that receives an input sentence and a speaker identifier representing a speaker; and
     a response sentence generation unit that obtains a response sentence by inputting the input sentence and the speaker identifier into a response sentence generation model,
     wherein the response sentence generation model includes:
     a speaker model that obtains a speaker embedding vector from a speaker identifier;
     an encoder that generates a sentence vector from an utterance sentence;
     a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and
     an attention mechanism that generates the attention vector using a content vector representing an internal state of the decoder, the sentence vector, and the speaker embedding vector.
  2.  The response sentence generation device according to claim 1, wherein
     the attention mechanism calculates attention weights from the content vector and the element-wise product of the sentence vector with a vector obtained by transforming the speaker embedding vector by a first linear transformation, and generates the attention vector by using the attention weights to weight vectors obtained by transforming the sentence vector and the speaker embedding vector by a second linear transformation.
  3.  The response sentence generation device according to claim 2, wherein
     the attention mechanism generates the attention vector by computing the following formula, where H^(enc) is the sentence vector, N is the number of elements of the sentence vector, h_i^(enc) is the element corresponding to the i-th word of the sentence vector, h_t^(dec) is the content vector used when obtaining the t-th element of the response sentence, s_u is the speaker embedding vector, f is the first linear transformation, and g is the second linear transformation:
     [Formula 1 (published as the image JPOXMLDOC01-appb-M000001)]
  4.  A response sentence generation model learning device comprising:
     a learning data storage unit that stores learning data consisting of an utterance sentence, a response sentence with which a predetermined speaker responds to the utterance sentence, and a speaker identifier representing the speaker; and
     a model learning unit that uses the learning data to learn a response sentence generation model that takes an utterance sentence and a speaker identifier as input and outputs a response sentence responding to the utterance sentence,
     wherein the response sentence generation model includes:
     a speaker model that obtains a speaker embedding vector from a speaker identifier;
     an encoder that generates a sentence vector from an utterance sentence;
     a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and
     an attention mechanism that generates the attention vector using a content vector representing an internal state of the decoder, the sentence vector, and the speaker embedding vector.
  5.  A response sentence generation method in which:
     an input unit inputs an input sentence and a speaker identifier representing a speaker; and
     a response sentence generation unit obtains a response sentence by inputting the input sentence and the speaker identifier into a response sentence generation model,
     wherein the response sentence generation model includes:
     a speaker model that obtains a speaker embedding vector from a speaker identifier;
     an encoder that generates a sentence vector from an utterance sentence;
     a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and
     an attention mechanism that generates the attention vector using a content vector representing an internal state of the decoder, the sentence vector, and the speaker embedding vector.
  6.  A response sentence generation model learning method in which:
     a learning data storage unit stores learning data consisting of an utterance sentence, a response sentence with which a predetermined speaker responds to the utterance sentence, and a speaker identifier representing the speaker; and
     a model learning unit uses the learning data to learn a response sentence generation model that takes an utterance sentence and a speaker identifier as input and outputs a response sentence responding to the utterance sentence,
     wherein the response sentence generation model includes:
     a speaker model that obtains a speaker embedding vector from a speaker identifier;
     an encoder that generates a sentence vector from an utterance sentence;
     a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and
     an attention mechanism that generates the attention vector using a content vector representing an internal state of the decoder, the sentence vector, and the speaker embedding vector.
  7.  A program for causing a computer to function as the response sentence generation device according to any one of claims 1 to 3 or the response sentence generation model learning device according to claim 4.
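For illustration, the speaker-conditioned attention of claims 2 and 3 can be written out as follows. The exact formula of claim 3 is published only as an image (JPOXMLDOC01-appb-M000001), so this reconstruction follows the claim wording alone; in particular, the softmax normalization and the concatenation fed to the second linear transformation g are plausible readings, not confirmed details of the invention.

    \alpha_{t,i} = \operatorname{softmax}_{i}\left( {h_t^{(dec)}}^{\top} \left( h_i^{(enc)} \odot f(s_u) \right) \right), \qquad
    a_t = \sum_{i=1}^{N} \alpha_{t,i} \, g\left( \left[ h_i^{(enc)} ; s_u \right] \right)

A minimal numerical sketch of the same computation, under the same assumptions (all names, dimensions, and the use of NumPy are invented for the example):

    # Sketch of the speaker-conditioned attention of claims 2 and 3.
    # The softmax and the concatenation fed to g are assumptions;
    # the patent publishes the exact formula only as an image.
    import numpy as np

    rng = np.random.default_rng(0)
    N, d_enc, d_spk = 6, 8, 4              # input words, encoder dim, speaker dim

    H_enc = rng.normal(size=(N, d_enc))    # sentence vector: one row per input word
    h_dec = rng.normal(size=d_enc)         # content vector: decoder state at step t
    s_u = rng.normal(size=d_spk)           # speaker embedding vector for speaker u

    W_f = rng.normal(size=(d_enc, d_spk))            # first linear transformation f
    W_g = rng.normal(size=(d_enc, d_enc + d_spk))    # second linear transformation g

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Attention weights from the content vector and the element-wise product
    # of each element of the sentence vector with f(s_u) (claim 2).
    scores = (H_enc * (W_f @ s_u)) @ h_dec           # shape (N,)
    alpha = softmax(scores)

    # Attention vector: attention-weighted sum of g applied to the sentence
    # vector together with the speaker embedding (read here as concatenation).
    HS = np.concatenate([H_enc, np.tile(s_u, (N, 1))], axis=1)  # (N, d_enc + d_spk)
    a_t = alpha @ (HS @ W_g.T)                       # attention vector, shape (d_enc,)

    print(alpha.round(3), a_t.shape)

Because the speaker embedding enters the attention weights through the element-wise product h_i^(enc) ⊙ f(s_u), the same input sentence can be attended to differently for different speaker identifiers, which is what lets the decoder produce speaker-specific responses from the identifier alone.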
PCT/JP2020/019887 2020-05-20 2020-05-20 Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor WO2021234838A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2022524741A JP7428245B2 (en) 2020-05-20 2020-05-20 Response sentence generator and program
PCT/JP2020/019887 WO2021234838A1 (en) 2020-05-20 2020-05-20 Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor
JP2024008193A JP2024028569A (en) 2020-05-20 2024-01-23 Response sentence generation device, response sentence generation model learning device, methods thereof, and programs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/019887 WO2021234838A1 (en) 2020-05-20 2020-05-20 Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor

Publications (1)

Publication Number Publication Date
WO2021234838A1 (en)

Family

ID=78708291

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/019887 WO2021234838A1 (en) 2020-05-20 2020-05-20 Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor

Country Status (2)

Country Link
JP (2) JP7428245B2 (en)
WO (1) WO2021234838A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190197121A1 (en) * 2017-12-22 2019-06-27 Samsung Electronics Co., Ltd. Method and apparatus with natural language generation
WO2019198386A1 (en) * 2018-04-13 2019-10-17 国立研究開発法人情報通信研究機構 Request rephrasing system, method for training of request rephrasing model and of request determination model, and conversation system
WO2019212729A1 (en) * 2018-05-03 2019-11-07 Microsoft Technology Licensing, Llc Generating response based on user's profile and reasoning on contexts
CN110874402A (en) * 2018-08-29 2020-03-10 北京三星通信技术研究有限公司 Reply generation method, device and computer readable medium based on personalized information
CN111078854A (en) * 2019-12-13 2020-04-28 北京金山数字娱乐科技有限公司 Question-answer prediction model training method and device and question-answer prediction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
INABA, MICHIMASA ET AL.: "Estimating User Interest from Chat Dialogues", Proceedings of the 84th SIG-SLUD, SIG-SLUD-B802, 15 November 2018 (2018-11-15), pages 155-160 *

Also Published As

Publication number Publication date
JPWO2021234838A1 (en) 2021-11-25
JP7428245B2 (en) 2024-02-06
JP2024028569A (en) 2024-03-04

Similar Documents

Publication Publication Date Title
US11055497B2 (en) Natural language generation of sentence sequences from textual data with paragraph generation model
WO2020186778A1 (en) Error word correction method and device, computer device, and storage medium
Mazaré et al. Training millions of personalized dialogue agents
JP2022500726A (en) Global-local memory pointer network for task-oriented dialogue
JP2019215841A (en) Question generator, question generation method, and program
JP7315065B2 (en) QUESTION GENERATION DEVICE, QUESTION GENERATION METHOD AND PROGRAM
JP2018055548A (en) Interactive device, learning device, interactive method, learning method, and program
US10963819B1 (en) Goal-oriented dialog systems and methods
CN111930914B (en) Problem generation method and device, electronic equipment and computer readable storage medium
JP6649536B1 (en) Dialogue processing device, learning device, dialogue processing method, learning method and program
JP2019159823A (en) Learning program, learning method and learning device
JP2023544336A (en) System and method for multilingual speech recognition framework
JP7070653B2 (en) Learning devices, speech recognition ranking estimators, their methods, and programs
JP2020154076A (en) Inference unit, learning method and learning program
Zhang et al. Gazev: Gan-based zero-shot voice conversion over non-parallel speech corpus
KR20200023664A (en) Response inference method and apparatus
KR20210045217A (en) Device and method for emotion transplantation
JP7469698B2 (en) Audio signal conversion model learning device, audio signal conversion device, audio signal conversion model learning method and program
WO2021234838A1 (en) Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor
JP6082657B2 (en) Pose assignment model selection device, pose assignment device, method and program thereof
CN111797220A (en) Dialog generation method and device, computer equipment and storage medium
US20230140480A1 (en) Utterance generation apparatus, utterance generation method, and program
KR20230072656A (en) Device and method for generating dialogue based on pre-trained language model
Xu et al. Linear transformation on x‐vector for text‐independent speaker verification
JP2022174517A (en) Machine learning program, machine learning method, and information processing apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20936368

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022524741

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20936368

Country of ref document: EP

Kind code of ref document: A1