WO2021234838A1 - Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor - Google Patents


Info

Publication number
WO2021234838A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
vector
speaker
response
attention
Application number
PCT/JP2020/019887
Other languages
French (fr)
Japanese (ja)
Inventor
Masahiro Mizukami (水上雅博)
Hiroaki Sugiyama (杉山弘晃)
Hiromi Narimatsu (成松宏美)
Tsunehiro Arimoto (有本庸浩)
Ryuichiro Higashinaka (東中竜一郎)
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority to JP2022524741A (JP7428245B2)
Priority to PCT/JP2020/019887
Publication of WO2021234838A1
Priority to JP2024008193A (JP2024028569A)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/42: Data-driven translation
    • G06F 40/44: Statistical methods, e.g. probability models
    • G06F 40/55: Rule-based translation
    • G06F 40/56: Natural language generation


Abstract

In the present invention, response sentences that reflect individuality are generated without inputting individuality information. An input unit (11) receives an input sentence and a speaker identifier representing a speaker. A response sentence generation unit (12) obtains a response sentence by inputting the input sentence and the speaker identifier into a response sentence generation model. The response sentence generation model contains: a speaker model that obtains a speaker embedding vector from a speaker identifier; an encoder that generates a sentence vector from an utterance sentence; a decoder that generates a response sentence using an attention vector that represents the content of attention to the utterance sentence; and an attention mechanism that generates the attention vector using the speaker embedding vector, the sentence vector, and a content vector representing the internal state of the decoder.

Description

Response sentence generation device, response sentence generation model learning device, methods therefor, and programs
The present invention relates to dialogue techniques for interacting with a user, and in particular to techniques for generating system utterances that reflect individuality.
With the development of dialogue systems, there is growing demand for giving dialogue systems features such as personality and character (hereinafter collectively called "individuality") (for example, Non-Patent Document 1). Many conventional commercial dialogue systems use a rule-based method, and build individuality into the system by preparing response rules that reflect it in advance. In recent dialogue systems, generating responses with a neural network (hereinafter the "sentence generation method") has become common, and methods that take individuality into account are expected in the sentence generation method as well.
One way to consider individuality in the sentence generation method is to input individuality information usable in the response together with the input sentence. For example, suppose a dialogue system that, without considering individuality, generates the response "I like curry rice" to the input "What food do you like?". To consider individuality, one can input, together with the input sentence "What food do you like?", individuality information such as "My favorite food is fried chicken. My hobby is surfing. I have a dog.", and the system can then generate the response "I like fried chicken". In this approach, the system learns the general relationship between utterances and responses, and generates a response reflecting the individuality information whenever that information, supplied at input time, can be used directly in the response.
However, the individuality peculiar to a specific person (for example, when one wants to give a dialogue system the individuality of a figure whose personality is well known, such as Nobunaga Oda or Hideyoshi Toyotomi) can be difficult to verbalize, and may require responses that deviate from the general relationship between utterance and response. For example, when responding to the input "Next year is the year of the monkey" in a way that reflects Hideyoshi Toyotomi's individuality, responses such as "It is my year" ("washi no toshi ja na") or "Who are you calling a monkey!!" are expected. However, the conventional approach is effective only when individuality information can be mapped onto the general utterance-response relationship, so such responses cannot be generated even from an input like "My name is Hideyoshi Toyotomi. One of the three great unifiers. I served Nobunaga Oda and achieved the unification of the country. Next year is the year of the monkey."
In view of the above technical problems, an object of the present invention is to provide a dialogue technique that can generate response sentences reflecting individuality without inputting individuality information.
To solve the above problems, a response sentence generation device according to a first aspect of the present invention includes: an input unit that receives an input sentence and a speaker identifier representing a speaker; and a response sentence generation unit that obtains a response sentence by inputting the input sentence and the speaker identifier into a response sentence generation model. The response sentence generation model includes: a speaker model that obtains a speaker embedding vector from a speaker identifier; an encoder that generates a sentence vector from an utterance sentence; a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and an attention mechanism that generates the attention vector using a content vector representing the internal state of the decoder, the sentence vector, and the speaker embedding vector.
A response sentence generation model learning device according to a second aspect of the present invention includes: a learning data storage unit that stores learning data consisting of an utterance sentence, a response sentence in which a predetermined speaker responds to the utterance sentence, and a speaker identifier representing the speaker; and a model learning unit that uses the learning data to learn a response sentence generation model that takes an utterance sentence and a speaker identifier as input and outputs a response sentence responding to the utterance sentence. The response sentence generation model includes: a speaker model that obtains a speaker embedding vector from a speaker identifier; an encoder that generates a sentence vector from an utterance sentence; a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and an attention mechanism that generates the attention vector using a content vector representing the internal state of the decoder, the sentence vector, and the speaker embedding vector.
According to the present invention, a response sentence reflecting individuality can be generated without inputting individuality information.
FIG. 1 is a diagram illustrating the functional configuration of the response sentence generation device.
FIG. 2 is a diagram illustrating the processing procedure of the response sentence generation method.
FIG. 3 is a diagram illustrating the functional configuration of the response sentence generation model.
FIG. 4 is a diagram illustrating the functional configuration of the response sentence generation model learning device.
FIG. 5 is a diagram illustrating the processing procedure of the response sentence generation model learning method.
FIG. 6 is a diagram illustrating the functional configuration of a computer.
Hereinafter, embodiments of the present invention will be described in detail. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.
The symbol "¯" used in the text should properly be written directly above the character that follows it, but due to limitations of text notation it is written immediately before that character. In the equations, these symbols are written in their proper position, directly above the character.
[Outline of the invention]
In the present invention, in a dialogue system using the sentence generation method, an arbitrary speaker is assumed and response sentences reflecting that speaker's individuality are generated. The individuality information that conventional sentence generation methods required in order to consider individuality becomes unnecessary. For example, from the input sentence "Next year is the year of the monkey", the system can generate the response sentence "Who are you calling a monkey!!", reflecting Hideyoshi Toyotomi's individuality.
To that end, in the neural network that learns utterance-response relationships that differ for each individuality, a framework for considering each speaker's individual characteristics is introduced into the attention mechanism, which learns the correspondence between input and output, so that input-output correspondences characteristic of each individuality are learned. For example, tendencies of attention that differ from speaker to speaker, such as "this person is likely to focus on the word monkey" or "this person is likely to read this meaning into the word monkey", are realized in response sentence generation. As a result, the performance of response sentence generation that considers individuality (that is, the quality of the generated response sentences) improves.
[Embodiment]
An embodiment of the present invention consists of a response sentence generation device and method that, in a dialogue system using the sentence generation method, generate a response sentence for an input sentence based on a user utterance, and a response sentence generation model learning device and method that learn the response sentence generation model used by that device and method.
<Response sentence generation device>
As shown in FIG. 1, the response sentence generation device 1 of the embodiment takes as input an input sentence representing the content of a user utterance and a speaker identifier that uniquely identifies a speaker, and outputs a response sentence representing the content of the system utterance for that input sentence. The response sentence generation device 1 includes, for example, a model storage unit 10, an input unit 11, and a response sentence generation unit 12. The response sentence generation method of the embodiment is realized by the response sentence generation device 1 performing the processing of each step illustrated in FIG. 2.
The response sentence generation device 1 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The response sentence generation device 1 executes each process under the control of the central processing unit, for example. Data input to the response sentence generation device 1 and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as needed and used for other processing. At least part of the response sentence generation device 1 may be implemented by hardware such as an integrated circuit. Each storage unit of the response sentence generation device 1 can be configured, for example, as a main storage device such as RAM, as an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or with middleware such as a relational database or a key-value store.
A trained response sentence generation model is stored in the model storage unit 10. As shown in FIG. 3, the response sentence generation model 100 takes an input sentence and a speaker identifier as input and outputs a response sentence. The response sentence generation model 100 includes, for example, a speaker model 101, an encoder 102, a decoder 103, and an attention mechanism 104. The input sentence is, for example, an utterance sentence representing the content of a question the user addressed to the dialogue system. The speaker identifier is an identifier that uniquely identifies the person whose individuality is to be reflected. The response sentence is, for example, an utterance sentence representing the content of the dialogue system's answer to the question sentence given as the input sentence.
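For illustration, the four components can be wired together as in the following Python (PyTorch) sketch. The class and argument names are assumptions for illustration, not taken from the patent, and the component modules themselves are supplied by the caller.

```python
import torch.nn as nn

# Illustrative wiring of the response sentence generation model 100
# (roles mirror FIG. 3; every name here is an assumption).
class ResponseSentenceGenerationModel(nn.Module):
    def __init__(self, speaker_model, encoder, decoder):
        super().__init__()
        self.speaker_model = speaker_model  # 101: speaker identifier -> s_u
        self.encoder = encoder              # 102: input sentence -> H_enc
        self.decoder = decoder              # 103, with attention mechanism 104

    def forward(self, input_ids, speaker_id, response_prefix_ids):
        s_u = self.speaker_model(speaker_id)  # speaker embedding vector
        H_enc = self.encoder(input_ids)       # sentence vector (N x d)
        # The decoder consults attention mechanism 104 at each output step,
        # passing its content vector h_t together with H_enc and s_u.
        return self.decoder(response_prefix_ids, H_enc, s_u)  # vocabulary logits
```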
The speaker model 101 is a trained model that takes a speaker identifier as input and converts the speaker identifier into a speaker embedding vector. For the speaker model 101, for example, the model called the Speaker model described in Reference 1 can be used.
[Reference 1] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan, "A persona-based neural conversation model," arXiv preprint arXiv:1603.06155, 2016.
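A minimal sketch of such a speaker model, assuming, as in the Speaker model of Reference 1, that it amounts to a learned embedding table indexed by the speaker identifier (the dimension d = 512 follows the experiment below; all names are illustrative):

```python
import torch
import torch.nn as nn

class SpeakerModel(nn.Module):
    """Speaker model 101 as an embedding lookup (sketch; the table-lookup
    formulation follows Reference 1 rather than the patent text)."""
    def __init__(self, num_speakers: int, d: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(num_speakers, d)

    def forward(self, speaker_id: torch.Tensor) -> torch.Tensor:
        # Integer speaker identifier(s) -> speaker embedding vector(s) s_u.
        return self.embedding(speaker_id)
```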
The encoder 102 takes an utterance sentence as input and converts the utterance sentence into a sentence vector. The decoder 103 generates and outputs a response sentence using the attention vector output by the attention mechanism 104. The encoder 102 and the decoder 103 are the same as the encoder and decoder used in the conventional sentence generation method; for the conventional sentence generation method, see Reference 2.
[Reference 2] Vinyals, Oriol, and Quoc Le, "A neural conversational model," arXiv preprint arXiv:1506.05869, 2015.
The attention mechanism 104 takes as input the speaker embedding vector output by the speaker model 101, the sentence vector output by the encoder 102, and the content vector representing the internal state of the decoder 103, and generates and outputs the attention vector. The attention mechanism 104 first uses the speaker embedding vector, the sentence vector, and the content vector to generate attention weights, a vector expressing which parts of the input sentence to focus on (hereinafter, the "tendency of attention"). Next, the attention mechanism 104 uses the attention weights, the speaker embedding vector, and the sentence vector to generate the attention vector, which expresses what was attended to in the input sentence according to the tendency of attention (hereinafter, the "content of attention").
The difference from the attention mechanism in the conventional sentence generation method is that the speaker embedding vector is referenced both in computing the attention weights and in computing the attention vector. This changes the tendency of attention and the content of attention according to individuality. A tendency of attention is, for example, a feature such as "Hideyoshi Toyotomi focuses strongly on the word monkey". A content of attention is, for example, a feature such as "Hideyoshi Toyotomi takes the word monkey negatively". Such features need not be assigned manually in advance; they are reflected in the attention vector by preparing, as learning data, data in which the tendency and content of attention appear, specifically a large number of sentence pairs tied to speakers.
Specifically, the attention mechanism 104 computes the following:

\[
\begin{aligned}
\mathrm{attention}\bigl(H^{(enc)}, h_t^{(dec)}, s_u\bigr) &= \sum_{i=1}^{N} a_i\,\bar{h}_{i,v}^{(enc)} \\
\bar{h}_{i,k}^{(enc)} &= f(s_u) \circ h_i^{(enc)} \\
\bar{h}_{i,v}^{(enc)} &= g(s_u) \circ h_i^{(enc)} \\
a_i &= \frac{\exp\bigl(h_t^{(dec)\top}\,\bar{h}_{i,k}^{(enc)}\bigr)}{\sum_{j=1}^{N}\exp\bigl(h_t^{(dec)\top}\,\bar{h}_{j,k}^{(enc)}\bigr)}
\end{aligned}
\]
Here, ∘ is the operator denoting the element-wise product. t is a variable indicating that the decoder is outputting the t-th word. i is a variable indexing the i-th word of the N-word input sentence given to the encoder. h_t^(dec) is the d-dimensional content vector representing the internal state of the decoder, where d is the size (number of dimensions) of the computational part of the attention mechanism. H^(enc) is the N×d-dimensional sentence vector generated by the encoder, and h_i^(enc) ∈ H^(enc) is the element corresponding to the i-th word of the sentence vector. s_u is the d-dimensional speaker embedding vector generated by the speaker model. f(·) and g(·) are distinct linear transformations; each may be a first-order linear transformation or an arbitrary M-th order linear transformation, may be defined as a function whose output falls within a fixed range such as 0 to 1 or -1 to 1 using a sigmoid or softsign function, or may combine these. a_i is the attention weight for the i-th word of the input sentence.
That is, the attention mechanism 104 computes the attention vector as follows. First, to compute the attention weights a_i, the speaker embedding vector s_u is transformed with the M-th order linear transformation f(·). The element-wise product of the transformed speaker embedding vector f(s_u) and each element h_i^(enc) of the encoded sentence vector H^(enc) is computed to give ¯h_i,k^(enc) (corresponding to the second line of the equations). Using ¯h_i,k^(enc), which has been deformed by the speaker embedding vector, and the decoder's content vector h_t^(dec), the i-th attention weight a_i is computed (corresponding to the fourth line). Next, to compute the attention vector, s_u is transformed with the M-th order linear transformation g(·), and the element-wise product of g(s_u) with each element h_i^(enc) gives ¯h_i,v^(enc) (corresponding to the third line). The subscripts k and v of ¯h_i,k^(enc) and ¯h_i,v^(enc), which are computed from the elements of the encoded sentence vector and the linearly transformed speaker embedding vector, are the initials of key and value; by convention in attention mechanisms, the quantity used for the weights is called the key and the vector to which the weights are applied the value, hence these subscripts. Finally, the product of the attention weight a_i and ¯h_i,v^(enc) is computed for every i, and their sum is taken. This sum is the attention vector, the final output (corresponding to the first line).
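The computation can be sketched in Python (PyTorch) as follows for a single decoding step. The function and argument names are assumptions; f_lin and g_lin stand for the linear transformations f and g, and the softmax corresponds to the weight normalization of the fourth line.

```python
import torch
import torch.nn.functional as F

def speaker_attention(H_enc, h_t_dec, s_u, f_lin, g_lin):
    """Attention mechanism 104 for one decoding step (sketch).
    H_enc: (N, d) sentence vector; h_t_dec: (d,) decoder content vector;
    s_u: (d,) speaker embedding vector; f_lin, g_lin: the maps f and g."""
    h_k = f_lin(s_u) * H_enc              # 2nd line: element product -> keys
    h_v = g_lin(s_u) * H_enc              # 3rd line: element product -> values
    a = F.softmax(h_k @ h_t_dec, dim=0)   # 4th line: attention weights a_i
    return (a.unsqueeze(1) * h_v).sum(0)  # 1st line: attention vector (d,)
```

With f_lin and g_lin each an independent torch.nn.Linear(d, d), for example, this realizes the first-order linear transformation case described above.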
With reference to FIG. 2, the processing procedure of the response sentence generation method executed by the response sentence generation device 1 of the embodiment is described.
In step S11, an input sentence and a speaker identifier are input to the input unit 11. The input unit 11 outputs the input sentence and the speaker identifier to the response sentence generation unit 12.
In step S12, the response sentence generation unit 12 receives the input sentence and the speaker identifier from the input unit 11 and inputs them into the response sentence generation model stored in the model storage unit 10, thereby obtaining and outputting a response sentence that reflects the speaker's individuality. In outputting the response sentence, the word string constituting the response sentence is obtained by repeatedly emitting the word tied to the vector obtained from the output layer of the response sentence generation model. The response sentence generation unit 12 makes the obtained response sentence the output of the response sentence generation device 1.
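A sketch of this loop follows. Greedy argmax word selection and the token ids (bos_id, eos_id) are assumptions for illustration; the patent states only that word emission is repeated.

```python
import torch

def generate_response(model, input_ids, speaker_id, bos_id, eos_id, max_len=50):
    """Step S12 as a generation loop (sketch): repeatedly emit the word tied
    to the output-layer vector until an end-of-sentence word or length limit."""
    out = [bos_id]
    for _ in range(max_len):
        logits = model(input_ids, speaker_id, torch.tensor(out))
        next_word = int(logits[-1].argmax())  # word tied to the output vector
        if next_word == eos_id:
            break
        out.append(next_word)
    return out[1:]  # word ids of the response sentence
```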
<Response sentence generation model learning device>
As shown in FIG. 4, the response sentence generation model learning device 2 of the embodiment includes, for example, a learning data storage unit 20, a model learning unit 21, and a model storage unit 10. The response sentence generation model learning method of the embodiment is realized by the response sentence generation model learning device 2 performing the processing of each step illustrated in FIG. 5.
The response sentence generation model learning device 2 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The response sentence generation model learning device 2 executes each process under the control of the central processing unit, for example. Data input to the response sentence generation model learning device 2 and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as needed and used for other processing. At least part of the response sentence generation model learning device 2 may be implemented by hardware such as an integrated circuit. Each storage unit of the response sentence generation model learning device 2 can be configured, for example, as a main storage device such as RAM, as an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or with middleware such as a relational database or a key-value store.
With reference to FIG. 5, the processing procedure of the response sentence generation model learning method executed by the response sentence generation model learning device 2 of the embodiment is described.
Learning data are stored in the learning data storage unit 20. The learning data consist, for example, of an utterance sentence that is a question, a response sentence in which a predetermined speaker responds to that utterance sentence, and a speaker identifier representing that speaker. The learning data may be collected from dialogues actually conducted with a dialogue system or the like, may be created manually with a specific person in mind, or may be a mixture of both.
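For concreteness, one learning-data record might look like the following; the field names are assumptions, and the patent specifies only the triple of utterance sentence, response sentence, and speaker identifier.

```python
# Hypothetical learning-data records (field names are illustrative).
learning_data = [
    {
        "utterance": "来年はサル年ですね",    # question sentence
        "response": "誰がサルじゃ!!",         # the predetermined speaker's response
        "speaker_id": "toyotomi_hideyoshi",  # speaker identifier
    },
    # ... records collected from real dialogues, authored manually for a
    # specific person, or a mixture of both.
]
```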
In step S20, the response sentence generation model learning device 2 reads the learning data from the learning data storage unit 20 and inputs the read learning data to the model learning unit 21.
In step S21, the model learning unit 21 learns the parameters of the neural network of the response sentence generation model 100 using the input learning data. The learning method for the response sentence generation model is the same as the learning method, disclosed in Reference 2, for a conventional model that generates output using an input and a speaker identifier. That is, the softmax cross entropy over the model's output sentence is used as the loss function, and the parameters of the encoder 102, the decoder 103, and the attention mechanism 104 are learned so as to minimize the loss. In learning the parameters of the attention mechanism 104, the parameters of f and g are updated a predetermined number of times or until a predetermined condition is satisfied. At the same time, the parameters of the speaker model 101, which converts speaker identifiers into speaker embedding vectors, are learned in the same way as the conventional Speaker model. The model learning unit 21 stores the parameters of the learned response sentence generation model 100 in the model storage unit 10.
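One update of this procedure could be sketched as follows; teacher forcing with shifted response ids and the batch field names are assumptions.

```python
import torch.nn.functional as F

def train_step(model, optimizer, batch):
    """One parameter update for model 100 (sketch). The loss is softmax cross
    entropy over the output sentence; minimizing it updates encoder 102,
    decoder 103, attention mechanism 104 (including f and g) and, as in the
    conventional Speaker model, speaker model 101."""
    optimizer.zero_grad()
    logits = model(batch["utterance_ids"], batch["speaker_id"],
                   batch["response_in_ids"])  # response prefix (teacher forcing)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           batch["response_out_ids"].reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```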
[Experimental results]
To measure the effect of the above embodiment, an experiment was conducted using the role-play ("narikiri") question-answering data disclosed in Non-Patent Document 1. Specifically, 50,000 role-play question-answer pairs for three persons were used as learning data, and 2,000 pairs as evaluation data. The number of dimensions d of the attention mechanism was set to 512, and a Transformer was used for the encoder and the decoder. The attention mechanism of this embodiment was implemented by replacing the self-attention and the source-target attention in the Transformer. The model was trained on the learning data, and answers were generated for the question sentences of the evaluation data. BLEU-1, BLEU-4, and PPL were used as evaluation measures. That is, the answer sentences generated by this embodiment were compared with the reference answer sentences tied to the evaluation-data questions using BLEU-1 and BLEU-4 (larger values are better), and the model's generation probability of the reference answer sentences was computed as perplexity (PPL; smaller values are better).
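As a sketch of these measures (the use of NLTK's sentence_bleu and the formulation of PPL as the exponential of the mean cross entropy are assumptions, not specified in the patent):

```python
import math
from nltk.translate.bleu_score import sentence_bleu

def bleu_1_4(reference_tokens, hypothesis_tokens):
    """BLEU-1 and BLEU-4 of a generated answer against the reference answer
    (larger values are better)."""
    b1 = sentence_bleu([reference_tokens], hypothesis_tokens,
                       weights=(1, 0, 0, 0))
    b4 = sentence_bleu([reference_tokens], hypothesis_tokens,
                       weights=(0.25, 0.25, 0.25, 0.25))
    return b1, b4

def perplexity(mean_cross_entropy: float) -> float:
    """PPL of the model on the reference answers (smaller values are better)."""
    return math.exp(mean_cross_entropy)
```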
Table 1 shows the experimental results. The method of this embodiment achieved the best score on all evaluation measures.
[Table 1]
Although embodiments of the present invention have been described above, the specific configuration is not limited to these embodiments; it goes without saying that design changes and the like made as appropriate within a scope not departing from the spirit of the invention are included in the invention. The various processes described in the embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually according to the processing capability of the device executing them or as needed.
[Program, recording medium]
When the various processing functions of each device described in the above embodiments are realized by a computer, the processing content of the functions each device should have is described by a program. By loading this program into the storage unit 1020 of the computer shown in FIG. 6 and having the arithmetic processing unit 1010, the input unit 1030, the output unit 1040, and so on operate on it, the various processing functions of each of the above devices are realized on the computer.
The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a non-transitory recording medium such as a magnetic recording device or an optical disc.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
A computer that executes such a program, for example, first stores the program recorded on the portable recording medium or transferred from the server computer in the auxiliary recording unit 1050, its own non-transitory storage device. When executing processing, the computer reads the program stored in its auxiliary recording unit 1050 into the storage unit 1020, a temporary storage device, and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may successively execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and acquisition of results, without transferring the program from the server computer to this computer. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (data that are not direct commands to the computer but have the property of defining the computer's processing, and the like).
In this embodiment, the present device is configured by executing a predetermined program on a computer, but at least part of the processing content may be realized in hardware.

Claims (7)

  1.  A response sentence generation device comprising:
     an input unit that receives an input sentence and a speaker identifier representing a speaker; and
     a response sentence generation unit that obtains a response sentence by inputting the input sentence and the speaker identifier into a response sentence generation model,
     wherein the response sentence generation model includes:
     a speaker model that obtains a speaker embedding vector from a speaker identifier;
     an encoder that generates a sentence vector from an utterance sentence;
     a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and
     an attention mechanism that generates the attention vector using a content vector representing an internal state of the decoder, the sentence vector, and the speaker embedding vector.
  2.  The response sentence generation device according to claim 1, wherein
     the attention mechanism calculates attention weights from the content vector and the element-wise product of the sentence vector with a vector obtained by transforming the speaker embedding vector by a first linear transformation, and generates the attention vector by using the attention weights to weight vectors obtained by transforming the sentence vector and the speaker embedding vector by a second linear transformation.
  3.  The response sentence generation device according to claim 2, wherein
     the attention mechanism generates the attention vector by computing the following formula, where H^(enc) is the sentence vector, N is the number of elements of the sentence vector, h_i^(enc) is the element corresponding to the i-th word of the sentence vector, h_t^(dec) is the content vector used when obtaining the t-th element of the response sentence, s_u is the speaker embedding vector, f is the first linear transformation, and g is the second linear transformation:
     [Formula 1 (published as the image JPOXMLDOC01-appb-M000001)]
  4.  A response sentence generation model learning device comprising:
     a learning data storage unit that stores learning data consisting of an utterance sentence, a response sentence with which a predetermined speaker responds to the utterance sentence, and a speaker identifier representing the speaker; and
     a model learning unit that uses the learning data to learn a response sentence generation model that takes an utterance sentence and a speaker identifier as input and outputs a response sentence responding to the utterance sentence,
     wherein the response sentence generation model includes:
     a speaker model that obtains a speaker embedding vector from a speaker identifier;
     an encoder that generates a sentence vector from an utterance sentence;
     a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and
     an attention mechanism that generates the attention vector using a content vector representing an internal state of the decoder, the sentence vector, and the speaker embedding vector.
  5.  A response sentence generation method in which:
     an input unit inputs an input sentence and a speaker identifier representing a speaker; and
     a response sentence generation unit obtains a response sentence by inputting the input sentence and the speaker identifier into a response sentence generation model,
     wherein the response sentence generation model includes:
     a speaker model that obtains a speaker embedding vector from a speaker identifier;
     an encoder that generates a sentence vector from an utterance sentence;
     a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and
     an attention mechanism that generates the attention vector using a content vector representing an internal state of the decoder, the sentence vector, and the speaker embedding vector.
  6.  A response sentence generation model learning method in which:
     a learning data storage unit stores learning data consisting of an utterance sentence, a response sentence with which a predetermined speaker responds to the utterance sentence, and a speaker identifier representing the speaker; and
     a model learning unit uses the learning data to learn a response sentence generation model that takes an utterance sentence and a speaker identifier as input and outputs a response sentence responding to the utterance sentence,
     wherein the response sentence generation model includes:
     a speaker model that obtains a speaker embedding vector from a speaker identifier;
     an encoder that generates a sentence vector from an utterance sentence;
     a decoder that generates a response sentence using an attention vector representing the content of attention to the utterance sentence; and
     an attention mechanism that generates the attention vector using a content vector representing an internal state of the decoder, the sentence vector, and the speaker embedding vector.
  7.  A program for causing a computer to function as the response sentence generation device according to any one of claims 1 to 3 or the response sentence generation model learning device according to claim 4.
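For illustration, the speaker-conditioned attention of claims 2 and 3 can be written out as follows. The exact formula of claim 3 is published only as an image (JPOXMLDOC01-appb-M000001), so this reconstruction follows the claim wording alone; in particular, the softmax normalization and the concatenation fed to the second linear transformation g are plausible readings, not confirmed details of the invention.

    \alpha_{t,i} = \operatorname{softmax}_{i}\left( {h_t^{(dec)}}^{\top} \left( h_i^{(enc)} \odot f(s_u) \right) \right), \qquad
    a_t = \sum_{i=1}^{N} \alpha_{t,i} \, g\left( \left[ h_i^{(enc)} ; s_u \right] \right)

A minimal numerical sketch of the same computation, under the same assumptions (all names, dimensions, and the use of NumPy are invented for the example):

    # Sketch of the speaker-conditioned attention of claims 2 and 3.
    # The softmax and the concatenation fed to g are assumptions;
    # the patent publishes the exact formula only as an image.
    import numpy as np

    rng = np.random.default_rng(0)
    N, d_enc, d_spk = 6, 8, 4              # input words, encoder dim, speaker dim

    H_enc = rng.normal(size=(N, d_enc))    # sentence vector: one row per input word
    h_dec = rng.normal(size=d_enc)         # content vector: decoder state at step t
    s_u = rng.normal(size=d_spk)           # speaker embedding vector for speaker u

    W_f = rng.normal(size=(d_enc, d_spk))            # first linear transformation f
    W_g = rng.normal(size=(d_enc, d_enc + d_spk))    # second linear transformation g

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Attention weights from the content vector and the element-wise product
    # of each element of the sentence vector with f(s_u) (claim 2).
    scores = (H_enc * (W_f @ s_u)) @ h_dec           # shape (N,)
    alpha = softmax(scores)

    # Attention vector: attention-weighted sum of g applied to the sentence
    # vector together with the speaker embedding (read here as concatenation).
    HS = np.concatenate([H_enc, np.tile(s_u, (N, 1))], axis=1)  # (N, d_enc + d_spk)
    a_t = alpha @ (HS @ W_g.T)                       # attention vector, shape (d_enc,)

    print(alpha.round(3), a_t.shape)

Because the speaker embedding enters the attention weights through the element-wise product h_i^(enc) ⊙ f(s_u), the same input sentence can be attended to differently for different speaker identifiers, which is what lets the decoder produce speaker-specific responses from the identifier alone.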
PCT/JP2020/019887 2020-05-20 2020-05-20 Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor WO2021234838A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2022524741A JP7428245B2 (en) 2020-05-20 2020-05-20 Response sentence generator and program
PCT/JP2020/019887 WO2021234838A1 (en) 2020-05-20 2020-05-20 Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor
JP2024008193A JP2024028569A (en) 2020-05-20 2024-01-23 Response sentence generation device, response sentence generation model learning device, methods thereof, and programs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/019887 WO2021234838A1 (en) 2020-05-20 2020-05-20 Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor

Publications (1)

Publication Number Publication Date
WO2021234838A1 (en)

Family

ID=78708291

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/019887 WO2021234838A1 (en) 2020-05-20 2020-05-20 Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor

Country Status (2)

Country Link
JP (2) JP7428245B2 (en)
WO (1) WO2021234838A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190197121A1 (en) * 2017-12-22 2019-06-27 Samsung Electronics Co., Ltd. Method and apparatus with natural language generation
WO2019198386A1 (en) * 2018-04-13 2019-10-17 国立研究開発法人情報通信研究機構 Request rephrasing system, method for training of request rephrasing model and of request determination model, and conversation system
WO2019212729A1 (en) * 2018-05-03 2019-11-07 Microsoft Technology Licensing, Llc Generating response based on user's profile and reasoning on contexts
CN110874402A (en) * 2018-08-29 2020-03-10 北京三星通信技术研究有限公司 Reply generation method, device and computer readable medium based on personalized information
CN111078854A (en) * 2019-12-13 2020-04-28 北京金山数字娱乐科技有限公司 Question-answer prediction model training method and device and question-answer prediction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
INABA, MICHIMASA ET AL.: "Estimating User Interest from Chat Dialogues", Proceedings of the 84th SIG-SLUD, SIG-SLUD-B802, 15 November 2018 (2018-11-15), pages 155-160 *

Also Published As

Publication number Publication date
JPWO2021234838A1 (en) 2021-11-25
JP7428245B2 (en) 2024-02-06
JP2024028569A (en) 2024-03-04

Similar Documents

Publication Publication Date Title
US11055497B2 (en) Natural language generation of sentence sequences from textual data with paragraph generation model
WO2020186778A1 (en) Error word correction method and device, computer device, and storage medium
Mazaré et al. Training millions of personalized dialogue agents
JP2022500726A (en) Global-local memory pointer network for task-oriented dialogue
JP2019215841A (en) Question generator, question generation method, and program
JP7315065B2 (en) QUESTION GENERATION DEVICE, QUESTION GENERATION METHOD AND PROGRAM
JP2018055548A (en) Interactive device, learning device, interactive method, learning method, and program
US10963819B1 (en) Goal-oriented dialog systems and methods
CN111930914B (en) Problem generation method and device, electronic equipment and computer readable storage medium
JP6649536B1 (en) Dialogue processing device, learning device, dialogue processing method, learning method and program
JP2019159823A (en) Learning program, learning method and learning device
JP2023544336A (en) System and method for multilingual speech recognition framework
JP7070653B2 (en) Learning devices, speech recognition ranking estimators, their methods, and programs
JP2020154076A (en) Inference unit, learning method and learning program
Zhang et al. Gazev: Gan-based zero-shot voice conversion over non-parallel speech corpus
KR20200023664A (en) Response inference method and apparatus
KR20210045217A (en) Device and method for emotion transplantation
JP7469698B2 (en) Audio signal conversion model learning device, audio signal conversion device, audio signal conversion model learning method and program
WO2021234838A1 (en) Response-sentence generating device, response-sentence-generation model learning device, and method and program therefor
JP6082657B2 (en) Pose assignment model selection device, pose assignment device, method and program thereof
CN111797220A (en) Dialog generation method and device, computer equipment and storage medium
US20230140480A1 (en) Utterance generation apparatus, utterance generation method, and program
KR20230072656A (en) Device and method for generating dialogue based on pre-trained language model
Xu et al. Linear transformation on x‐vector for text‐independent speaker verification
JP2022174517A (en) Machine learning program, machine learning method, and information processing apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20936368

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022524741

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20936368

Country of ref document: EP

Kind code of ref document: A1