CN112905776A - Emotional dialogue model construction method, emotional dialogue system and method - Google Patents


Info

Publication number
CN112905776A
CN112905776A (application CN202110283821.0A)
Authority
CN
China
Prior art keywords
emotion
reply
model
dialogue
discriminator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110283821.0A
Other languages
Chinese (zh)
Other versions
CN112905776B (en)
Inventor
李凯伟
马力
Current Assignee
Northwestern University
Original Assignee
Northwestern University
Priority date
Filing date
Publication date
Application filed by Northwestern University filed Critical Northwestern University
Priority to CN202110283821.0A
Publication of CN112905776A
Application granted
Publication of CN112905776B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an emotional dialogue model construction method, an emotional dialogue system, and an emotional dialogue method. The disclosed method trains a generative adversarial network to obtain the model: a generator produces dialogue replies; a content discriminator distinguishes whether an input reply text sequence is a "universal reply" and takes part in the adversarial training of the model; and an emotion discriminator distinguishes whether the emotion category of the reply text produced by the generation model matches a designated emotion category, so that adding the emotion discriminator guides the emotion of the generated dialogue text closer to the designated category. During human-machine conversation, the method and system can generate replies that satisfy a specific emotion, achieving both reply diversity and emotional consistency, which improves the quality of machine-generated replies and the user experience of human-machine interaction.

Description

Emotional dialogue model construction method, emotional dialogue system and method
Technical Field
The invention belongs to the technical field of man-machine conversation, and particularly relates to an emotion conversation model construction method and a related emotion conversation system and method.
Background
As a new mode of human-machine interaction, human-machine dialogue can save considerable manpower. Functionally it divides into task-oriented dialogue and chit-chat dialogue, which serve different user needs.
Task-oriented dialogue is used in fixed-domain scenarios such as intelligent customer service, ticket booking, and weather inquiry (for example, Taobao and Jingdong intelligent customer service), where understanding of user intent and dialogue management are realized from task-related domain knowledge and dialogue history.
Chit-chat dialogue is mostly used in chatbots such as Microsoft XiaoIce; the conversation covers diverse topics, the system plays a role closer to a human, and it can give an appropriate reply to any text the user inputs.
In practice, however, people want the robot not only to be intelligent and understand the intent of the conversation, but also to understand human emotion and offer personalized emotional communication. A chit-chat dialogue system can meet this need for emotional exchange: it can converse continuously with the user, understand the user's emotion in the conversation, and express emotion in its replies.
Current dialogue generation systems mainly use the sequence-to-sequence model (Seq2Seq), which selects generated words by maximum likelihood estimation. Words that appear frequently in the training data are therefore easily chosen; but generic, low-information sentences such as "OK" and "I don't know" have high frequency, so the model tends to produce generic answers and its replies are relatively uniform. Moreover, the model does not encode or decode emotion and ignores the emotional relationship between question and answer, so it is difficult to communicate naturally with the user and to build an emotional bond between user and machine.
Disclosure of Invention
Aiming at the defects or shortcomings of the prior art, the invention provides a method for constructing an emotional dialogue model.
Therefore, the construction method of the emotion conversation model provided by the invention comprises the following steps:
step1, constructing an emotion dialogue corpus data set and a general reply data set, wherein the emotion dialogue corpus data set comprises a plurality of sentences and emotion category labels of the sentences, and the general reply data set comprises a plurality of general sentences;
step2, training a generative adversarial network using the emotion dialogue corpus data set and the general reply data set, wherein the generative adversarial network comprises a generator and discriminators, the generator comprises an encoder and a decoder, and the discriminators comprise a first discriminator and a second discriminator;
the encoder vectorizes each statement in the emotion dialogue corpus data set to obtain a semantic expression vector; the decoder generates replies of sentences according to semantic expression vectors and randomly assigned emotion category labels, the replies contain emotion information, and the objective function of the generation process is-r-log (P (Y | X, e)), r ═ a rewardedc+b*Rewarde(ii) a Log (Y | X) is the objective function of the underlying Seq2Seq dialogue model, X is the semantic representation vector, Y is the reply to X, e is the randomly assigned emotion class label, and P (Y | X) is the probability of generation based on X maximizing Y.
Pc≥P1While, Rewardc=1-Pc,Pc<P1While, Rewardc=PcWherein: 1>P1≥0.4;
Pe≥P2While, Rewarde=Pe,Pe<P2While, Rewarde=1-PeWherein: 1>P2≥0.4;
a is RewardcWeight of term, 0<a<1;
b is RewardeWeight of term, 0<b<1;
r takes any value of [0,1] initially and randomly;
the first discriminator judges and outputs the probability P that each reply belongs to the general reply according to the general reply data setc
The second discriminator is used for judging each reply bandAnd outputs the consistency probability P of the emotion category label and the randomly designated emotion category label when the reply is generatede
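The reward shaping above can be sketched in a few lines. This is an illustrative reconstruction, not code from the patent; the default thresholds and weights (P_1 = P_2 = 0.5, a = b = 0.5) are taken from the embodiment later in the description.

```python
def reward(p_c, p_e, p1=0.5, p2=0.5, a=0.5, b=0.5):
    """Combine the two discriminator outputs into the scalar reward r."""
    # Content term: Reward_c = 1 - P_c when P_c >= P_1, else P_c,
    # so a reply judged very likely generic earns little content reward.
    reward_c = 1.0 - p_c if p_c >= p1 else p_c
    # Emotion term: Reward_e = P_e when P_e >= P_2, else 1 - P_e.
    reward_e = p_e if p_e >= p2 else 1.0 - p_e
    return a * reward_c + b * reward_e
```

For example, a reply judged 90% generic but 90% emotion-consistent receives r = 0.5·0.1 + 0.5·0.9 = 0.5, which then scales the generator's log-likelihood objective −r·log(P(Y|X, e)).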
Further, the first discriminator adopts a keyword extraction method or template matching.
Optionally, the universal reply data set includes a universal sentence data set and a non-universal sentence data set; the first discriminator is trained with the universal reply data set, and the trained first discriminator judges and outputs the probability P_c that each reply belongs to a universal reply; the first discriminator may adopt a CNN, an LSTM, a GRU, a Bi-LSTM, or a Bi-GRU model.
The invention provides another construction method of an emotion conversation model, which comprises the following steps:
step 1: constructing an emotion dialogue corpus data set, wherein the emotion dialogue corpus data set comprises a plurality of sentences, emotion category labels of the sentences, a plurality of general sentences and general reply category labels;
step2, training a generative adversarial network using the emotion dialogue corpus data set and the general reply data set, wherein the generative adversarial network comprises a generator and discriminators, the generator comprises an encoder and a decoder, and the discriminators comprise a first discriminator and a second discriminator; the first discriminator adopts a CNN, an LSTM, a GRU, a Bi-LSTM, or a Bi-GRU model;
the encoder vectorizes each sentence in the emotion dialogue corpus data set to obtain a semantic representation vector; the decoder generates a reply for each sentence from its semantic representation vector and a randomly assigned emotion category label, so that the reply carries emotion information; the objective function of the generation process is −r·log(P(Y|X, e)), with r = a·Reward_c + b·Reward_e, where log(P(Y|X)) is the objective of the underlying Seq2Seq dialogue model, X is the semantic representation vector, Y is the reply to X, e is the randomly assigned emotion category label, and P(Y|X, e) is the probability of generating Y given X and e.
When P_c ≥ P_1, Reward_c = 1 − P_c; when P_c < P_1, Reward_c = P_c; where 1 > P_1 ≥ 0.4;
When P_e ≥ P_2, Reward_e = P_e; when P_e < P_2, Reward_e = 1 − P_e; where 1 > P_2 ≥ 0.4;
a is the weight of the Reward_c term, 0 < a < 1;
b is the weight of the Reward_e term, 0 < b < 1;
r is initialized to a random value in [0, 1];
the first discriminator is trained with the emotion dialogue corpus data set, and the trained first discriminator judges and outputs the probability P_c that each reply belongs to a universal reply;
the second discriminator judges the emotion category of the emotion information carried by each reply and outputs the probability P_e that this category is consistent with the emotion category label randomly assigned when the reply was generated.
Optionally, the second discriminator may adopt a CNN, an LSTM, a GRU, a Bi-LSTM, or a Bi-GRU model.
The invention also provides an emotional dialogue method. The provided dialogue method adopts the emotion dialogue model trained by the method to carry out emotion dialogue.
The invention also provides an emotional dialogue system, which comprises an input module, a dialogue reply generation module and an output module;
the input module is used for inputting conversation content;
the dialogue reply generation module adopts the emotion dialogue model trained by the method and is used for generating the reply of the dialogue content;
the output module is used for outputting the reply content.
The beneficial technical effects of the invention are as follows:
(I) During human-machine conversation, the method can generate replies that satisfy a specific emotion, achieving both reply diversity and emotional consistency; this improves the quality of machine-generated replies and the user experience of human-machine interaction.
(II) The reply generation process emphasizes feedback information from the conversation, counteracting the tendency of current dialogue generation models to produce safe, low-information replies.
Drawings
The present invention will be explained in further detail with reference to examples.
FIG. 1 is a schematic diagram of the working principle of the model of the present invention;
FIG. 2 shows the emotional resonance of the input and the reply of the model of the present invention in the test results.
Detailed Description
Unless otherwise indicated, the terminology herein is to be understood in light of the knowledge of one of ordinary skill in the relevant art.
The source conversation data set of the invention can be crawled from the web or taken from a public data set; for example, the existing "Emotional Conversation Generation Task 4" data set was collected from customer-service conversations on platforms such as Taobao, Weibo, and Meituan.
The dialogue data set is preprocessed to obtain a dialogue corpus data set or an emotion dialogue corpus data set. Preprocessing deletes unwanted content: redundant punctuation and symbols (such as @-mentions, emoticons, and repeated symbols) are removed from the conversation data; sentences of unreasonable length, such as those with fewer than 3 or more than 25 characters as determined by length analysis, are deleted; and, in some cases, regional dialect expressions such as Cantonese are removed. When preprocessing is finished, the dialogue corpus data set or emotion dialogue corpus data set is obtained. An existing labeled data set only needs preprocessing to yield the emotion dialogue corpus data set; raw data taken directly from the web is preprocessed and then emotion-classified to determine the emotion category of each sentence. Emotion classification can be done manually, with an emotion classification method (such as keyword matching), or with an emotion classification model (such as a BiLSTM model).
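A minimal sketch of the preprocessing step described above. The regular expressions are illustrative stand-ins for the cleaning rules (the patent names the rule types, not their implementation), and length is counted in characters as in the text.

```python
import re

def preprocess(sentences, min_len=3, max_len=25):
    """Strip redundant symbols, then filter sentences by character length."""
    cleaned = []
    for s in sentences:
        s = re.sub(r"@\S+", "", s)               # drop @-mentions
        s = re.sub(r"([!?.,;:~])\1+", r"\1", s)  # collapse repeated punctuation
        s = s.strip()
        if min_len <= len(s) <= max_len:         # length-analysis filter
            cleaned.append(s)
    return cleaned
```

Sentences surviving this pass would then be emotion-classified to form the emotion dialogue corpus data set.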
The emotion dialogue corpus data set consists of sentences and their emotion category labels, where the labels follow recognized human emotion types; the emotion types currently recognized in the field include: happy, angry, sad, like, disgust, fear, surprise, and others. Examples of sentences and their emotion category labels are shown in Table 1.
TABLE 1 Emotion categories

Input                                             | Emotion category | Label
There is a tree in the courtyard                  | Others           | 0
The alcohol is terrible                           | Sadness          | 1
How to say that                                   | Disgust          | 2
I am angry                                        | Anger            | 3
Haha, I see                                       | Happiness        | 4
Feeling of love                                   | Like             | 5
She froze with fright, as if nailed to the ground | Fear             | 6
It is so beautiful                                | Surprise         | 7
The universal reply data set can be constructed in different ways. One approach is to form it directly from a set of universal sentences, specifically the short conversational sentences users employ in web data, i.e. everyday common phrases: for example, sentences in the data shorter than three words and/or sentences that occur repeatedly more than three times, such as "together", "kay", "good, yes", and "yes, yes". A second approach constructs a data set of non-universal sentences, or adds one on top of the universal sentence set, for training the first discriminator. A third approach relies on the emotion dialogue corpus data set containing universal sentences: a universal-reply category label is added while that data set is built, indicating either that the sentence is a universal reply (e.g., denoted "1") or that it is not (e.g., denoted "0").
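The first construction route can be sketched as follows. The thresholds mirror the text ("shorter than three words", "repeated more than three times"); splitting on whitespace is a simplifying assumption for English-style text, since the original Chinese corpus would need word segmentation first.

```python
from collections import Counter

def build_universal_set(replies, max_words=2, min_repeats=4):
    """Collect candidate universal replies from a corpus of reply sentences."""
    counts = Counter(replies)
    universal = set()
    for sentence, n in counts.items():
        short = len(sentence.split()) <= max_words   # under three words
        frequent = n >= min_repeats                  # repeated more than three times
        if short or frequent:
            universal.add(sentence)
    return universal
```

The resulting set (optionally paired with sampled non-universal sentences) supplies the training data for the first discriminator.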
The generative adversarial network of the invention consists of a generator and two discriminators. The generator may adopt a recurrent neural network, such as an LSTM model or a Seq2Seq model; the first and second discriminators may be CNN, LSTM, GRU, Bi-LSTM, or Bi-GRU models trained on the data sets of the invention (for instance, the emotion dialogue corpus data set), or already-trained models from the prior art.
The following embodiment of the invention is provided; note that the invention is not limited to this embodiment, and all equivalent changes based on the technical solutions of the invention fall within its protection scope.
Embodiment:
the embodiment provides an emotional dialogue model construction method provided by the invention. The emotion Conversation data set used in this embodiment is "Emotional conversion Generation Task 4", as shown in Table 2; in the operation of this embodiment, a general reply category label is added to each statement in the "electronic conversion Generation Task 4" dataset, specifically, the general reply category label of a statement in the dataset with a length of less than three words and a statement that appears three times or more continuously is "1", and the other statements are "0";
TABLE 2 Emotion dialogue data set
[Table 2 appears as an image in the original document.]
The generator used in the generative adversarial network of this embodiment is based on the Seq2Seq framework, with LSTM networks for both the encoder and the decoder. The first discriminator is a CNN trained on each sentence of the "Emotional Conversation Generation Task 4" data set and its universal-reply category label; the second discriminator adopts a Bi-LSTM model (the model disclosed in Zhou H, Huang M, Zhang T, et al. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory [J]. 2017). The parameter settings were: vocabulary size 40000, word embedding size 100, all model parameters initialized randomly, batch size 128, and the Adam optimizer with an initial learning rate of 0.001.
As shown in FIG. 1, the generator embeds emotion information during decoding; the first discriminator outputs the probability that the generated reply belongs to the universal replies; the second discriminator judges the consistency between the emotion of the generated reply and the randomly assigned emotion category; and the feedback obtained from the two discriminators is returned as a reward to the generation model to guide the generation of emotional text. The specific process is as follows:
step1, any sentence is converted into vector form with a word2vec model, i.e. the sentence is represented by the vector X = {x_1, x_2, … x_i, … x_m}, where m is the length of the sentence and x_i is the word vector of the i-th element of X;
step2, the encoder converts the vector X = {x_1, x_2, …, x_m} into a semantic representation vector;
step3, emotion category labels are randomly assigned, and the decoder generates a reply for each sentence (examples shown in Table 3) from its semantic representation vector and the randomly assigned emotion category label. Specifically, the randomly assigned emotion category label is one-hot encoded (for example, with the eight categories of Table 1, "like" is encoded as [0,0,0,0,0,1,0,0]) to obtain the emotion vector of the sentence; this emotion vector is embedded into the decoding process, and the reply of the corresponding sentence is generated from the semantic vector representation and the embedded emotion vector. The objective function of the generation process is −r·log(P(Y|X, e)); in this example P_1 = 0.5, P_2 = 0.5, a = 0.5, b = 0.5;
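The emotion embedding of step3 can be sketched as below. The category order follows Table 1; conditioning the decoder by concatenating the emotion vector with each step's input is one common realization, assumed here rather than taken from the patent, which only states that the emotion vector is "embedded into the decoding process".

```python
import numpy as np

# Category order assumed from Table 1 of the description.
EMOTIONS = ["others", "sadness", "disgust", "anger",
            "happiness", "like", "fear", "surprise"]

def one_hot_emotion(label):
    """One-hot encode a randomly assigned emotion category label."""
    vec = np.zeros(len(EMOTIONS))
    vec[EMOTIONS.index(label)] = 1.0
    return vec

def decoder_step_input(word_vec, emotion_label):
    """Condition a decoding step on the target emotion by concatenation."""
    return np.concatenate([word_vec, one_hot_emotion(emotion_label)])
```

With this layout, "like" (label 5) maps to the one-hot vector [0,0,0,0,0,1,0,0], matching the example in the text.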
TABLE 3 Emotion dialog example
[Table 3 appears as an image in the original document.]
The first discriminator first performs a convolution operation to obtain a feature matrix for each reply, then applies max pooling in a pooling layer, and finally, after a fully connected layer, outputs the probability P_c that the reply belongs to a universal reply. Example:
TABLE 4 First discriminator working example

Input               | Reply | Probability P_c
The rainbow is good | Yes   | 0.91
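The convolution, max-pooling, and fully-connected pipeline of the first discriminator can be sketched as a bare NumPy forward pass. The weights below are random placeholders (a real discriminator would be trained as described), and the shapes are purely illustrative.

```python
import numpy as np

def cnn_discriminator(embeddings, conv_filters, fc_weights):
    """Forward pass: convolve word windows, max-pool per filter,
    then a fully connected layer with sigmoid giving P_c."""
    seq_len = embeddings.shape[0]
    k = conv_filters.shape[1]              # filter height (words per window)
    feature_maps = np.array([
        [np.sum(embeddings[i:i + k] * f) for i in range(seq_len - k + 1)]
        for f in conv_filters              # one feature map per filter
    ])
    pooled = feature_maps.max(axis=1)      # max pooling over positions
    logit = pooled @ fc_weights            # fully connected layer
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid -> probability P_c

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))              # a 6-word reply, embedding size 8
filters = rng.normal(size=(4, 3, 8))       # 4 filters spanning 3 words each
fc = rng.normal(size=4)
p_c = cnn_discriminator(emb, filters, fc)  # a probability in (0, 1)
```

In the trained embodiment this P_c feeds the Reward_c term of the generator's reward.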
The second discriminator judges the emotion category of the emotion information carried by each reply and outputs the probability P_e that this category is consistent with the emotion category label randomly assigned when the reply was generated. Example:
TABLE 5 second discriminator working example
[Table 5 appears as an image in the original document.]
Simulation comparison:
Simulation input: 1000 entries randomly selected from the "Emotional Conversation Generation Task 4" data set.
The 1000 entries were input to the model trained in the embodiment above, to the Seq2Seq model (Vinyals O, Le Q. A Neural Conversational Model [J]. Computer, 2015), and to the ECM (Zhou H, Huang M, Zhang T, et al. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory [J]. 2017), and the output results of the models were compared and evaluated.
the effectiveness of the emotion-based dialog generation method provided by the invention is evaluated, and the method mainly comprises the following two aspects: diversity and emotional consistency of the generated replies. Where emotional reasonableness is measured by emotional scores and semantic aspects are reflected by diversity and confusion. Meanwhile, manual evaluation is carried out, so that the annotator scores the responses generated by each sentence in the aspect of emotion and the aspect of semantics, and the evaluation rule is shown in table 6.
TABLE 6 rules for human evaluation
[Table 6 appears as an image in the original document.]
The comparison baselines are the Seq2Seq model and the ECM model. The Seq2Seq method generates a sentence with an encoder and a decoder and has long been regarded as the basic model of dialogue generation; the invention is compared with it on the diversity of the generated reply text. The ECM model was the first to introduce emotion information into large-scale conversation generation, producing emotional replies through emotion-vector embedding and internal/external memory mechanisms; the invention is compared with it on the emotional intensity and emotional accuracy of the replies. The experimental comparison results are given in Table 7 and Table 8, respectively.
Table 7 automatic evaluation of experimental results
[Table 7 appears as an image in the original document.]
Perplexity evaluates whether the content of the model's generated reply is fluent; the lower the perplexity, the higher the sentence probability and the better the model. See: Li J, Galley M, Brockett C, et al. A Persona-Based Neural Conversation Model [J]. 2016.
dist-1 is obtained by dividing the number of different unary groups in the generated reply by the total number of the unary groups in the generated reply, Dist-2 is obtained by dividing the number of different binary groups in the generated reply by the total number of the binary groups in the generated reply, and the larger Dist-1 and Dist-2 are, the more different information is contained in the generated reply; see in particular: xing C, Wu W, Wu Y, et al.Topic Aware Neural Response Generation [ J ]. 2016;
the Accuracy rate that the emotion type of the generated reply is consistent with the randomly designated emotion type is represented by the Accuracy, and the larger the Accuracy is, the reply of the randomly designated emotion type can be generated by the description model; see in particular: study of the see xiwson controllable chat conversation system 2019.
As shown in Table 7, the Seq2Seq model readily produces "general replies" of very low emotional intensity, since it gives the same response for different emotion categories. The ECM avoids safe, universal replies to some extent with the beam search algorithm, but its simple emotion-embedding model leaves the emotional factor in the reply weak. With the reward mechanism added, the present invention more easily generates varied, information-rich sentences with richer emotion.
TABLE 8 results of the Manual evaluation experiment
[Table 8 appears as an image in the original document.]
Since "others" contains no specific emotion category, it is not scored. Table 8 shows that across the replies of the five emotion categories, after the emotion constraint is added, the model of the invention improves greatly in both content and emotion compared with the Seq2Seq model, with average scores 0.35 points higher in content and 0.12 points higher in emotion. Compared with the ECM model, the proposed model performs better in the "like" and "happy" emotion categories, with average scores 0.12 points higher in content and 0.03 points higher in emotion. It can also be seen that all three models receive low emotion scores for replies in the "disgust" and "anger" categories; this may be because the dialogue training corpus is relatively small for these two categories, so the selection of the generated sentence ignores the "disgust" and "anger" constraints and reply optimization deviates. Overall, compared with the two baseline models, the proposed model improves the replies of all five emotion categories in both content and emotion.
Observing the emotional change between the 1000 inputs and their replies, the input and the reply typically carry the same or similar emotion labels. For example, an input tagged "like" rarely receives an "angry" reply, while each emotion type interacts emotionally with the others, indicating that the degree of emotional resonance depends not only on emotional similarity but is also affected by emotional complementarity. The resonance of "like" toward "happy" is not equal to that of "happy" toward "like", showing that the degree of emotional resonance in interaction is directional: as shown in FIG. 2, the matrix is not completely symmetric about the main diagonal. (For the evaluation method of emotional resonance used in this embodiment, see Liu Ning, Research on Emotional Anthropomorphic Strategies in Human-Computer Interaction [D]. 2020.)

Claims (7)

1. A construction method of an emotional dialogue model is characterized by comprising the following steps:
step1, constructing an emotion dialogue corpus data set and a general reply data set, wherein the emotion dialogue corpus data set comprises a plurality of sentences and emotion category labels of the sentences, and the general reply data set comprises a plurality of general sentences;
step2, training a generative adversarial network using the emotion dialogue corpus data set and the general reply data set, wherein the generative adversarial network comprises a generator and discriminators, the generator comprises an encoder and a decoder, and the discriminators comprise a first discriminator and a second discriminator;
the encoder vectorizes each sentence in the emotion dialogue corpus data set to obtain a semantic representation vector; the decoder generates a reply for each sentence from its semantic representation vector and a randomly assigned emotion category label, so that the reply carries emotion information; the objective function of the generation process is −r·log(P(Y|X, e)), with r = a·Reward_c + b·Reward_e, where log(P(Y|X)) is the objective of the underlying Seq2Seq dialogue model, X is the semantic representation vector, Y is the reply to X, e is the randomly assigned emotion category label, and P(Y|X, e) is the probability of generating Y given X and e.
When P_c ≥ P_1, Reward_c = 1 − P_c; when P_c < P_1, Reward_c = P_c; where 1 > P_1 ≥ 0.4;
When P_e ≥ P_2, Reward_e = P_e; when P_e < P_2, Reward_e = 1 − P_e; where 1 > P_2 ≥ 0.4;
a is the weight of the Reward_c term, 0 < a < 1;
b is the weight of the Reward_e term, 0 < b < 1;
r is initialized to a random value in [0, 1];
the first discriminator judges, against the general reply data set, and outputs the probability P_c that each reply belongs to a general reply;
the second discriminator judges the emotion category of the emotion information carried by each reply and outputs the probability P_e that this category is consistent with the emotion category label randomly assigned when the reply was generated.
2. The method of claim 1, wherein the first discriminator uses keyword extraction or template matching.
3. The method of constructing an emotional dialogue model of claim 1, wherein the universal reply data set comprises a universal sentence data set and a non-universal sentence data set; the first discriminator is trained with the universal reply data set, and the trained first discriminator judges and outputs the probability P_c that each reply belongs to a universal reply; the first discriminator adopts a CNN, an LSTM, a GRU, a Bi-LSTM, or a Bi-GRU model.
4. A construction method of an emotional dialogue model is characterized by comprising the following steps:
step 1: constructing an emotion dialogue corpus data set, wherein the emotion dialogue corpus data set comprises a plurality of sentences, emotion category labels of the sentences, a plurality of universal sentences, and universal reply category labels;
step 2: training a generative adversarial network using the emotion dialogue corpus data set and a universal reply data set, wherein the generative adversarial network comprises a generator and a discriminator, the generator comprises an encoder and a decoder, and the discriminator comprises a first discriminator and a second discriminator; the first discriminator adopts a CNN network, an LSTM model, a GRU model, a Bi-LSTM model, or a Bi-GRU model;
the encoder vectorizes each sentence in the emotion dialogue corpus data set to obtain a semantic representation vector; the decoder generates a reply to each sentence according to the semantic representation vector and a randomly assigned emotion category label, the reply containing emotion information; the objective function of the generation process is -r*log(P(Y|X,e)), where r = a*Reward_c + b*Reward_e; log(P(Y|X)) is the objective function of the basic Seq2Seq dialogue model, X is the semantic representation vector, Y is the reply to X, e is the randomly assigned emotion category label, and P(Y|X) is the probability, maximized during training, of generating Y from X.
when P_c ≥ P_1, Reward_c = 1 - P_c; when P_c < P_1, Reward_c = P_c; wherein 1 > P_1 ≥ 0.4;
when P_e ≥ P_2, Reward_e = P_e; when P_e < P_2, Reward_e = 1 - P_e; wherein 1 > P_2 ≥ 0.4;
a is the weight of the Reward_c term, 0 < a < 1;
b is the weight of the Reward_e term, 0 < b < 1;
r is initially assigned a random value in [0, 1];
the first discriminator is trained on the emotion dialogue corpus data set, and the trained first discriminator judges and outputs the probability P_c that each reply belongs to a universal reply;
the second discriminator judges the emotion category label to which the emotion information carried by each reply belongs, and outputs the probability P_e that this label is consistent with the emotion category label randomly assigned when the reply was generated.
5. The method of constructing an emotional dialogue model according to claim 1 or 4, wherein the second discriminator adopts a CNN network, an LSTM model, a GRU model, a Bi-LSTM model, or a Bi-GRU model.
6. An emotional dialogue method, characterized in that an emotional dialogue is performed by using an emotional dialogue model trained by the method of any one of claims 1 to 5.
7. An emotional dialogue system, characterized by comprising an input module, a dialogue reply generation module, and an output module;
the input module is used for inputting conversation content;
the dialogue reply generation module adopts an emotional dialogue model trained by the method of any one of claims 1 to 5 to generate the reply of the dialogue content;
the output module is used for outputting the reply content.
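The three claimed modules can be wired together as a simple pipeline. This is a hypothetical sketch: `toy_model` stands in for the trained emotional dialogue model of claims 1–5, and the module names are illustrative, not from the patent.

```python
class EmotionalDialogueSystem:
    """Input module -> dialogue reply generation module -> output module."""

    def __init__(self, model):
        self.model = model  # trained emotional dialogue model (stand-in here)

    def read_input(self, text):
        """Input module: accept and normalize the dialogue content."""
        return text.strip()

    def generate_reply(self, text, emotion):
        """Dialogue reply generation module: produce an emotion-conditioned reply."""
        return self.model(text, emotion)

    def respond(self, text, emotion="happy"):
        """Output module: return the generated reply content."""
        return self.generate_reply(self.read_input(text), emotion)

# Stand-in for the trained generator: echoes input with an emotion tag.
toy_model = lambda x, e: f"[{e}] reply to: {x}"
system = EmotionalDialogueSystem(toy_model)
```

Calling `system.respond(" hi ")` passes the normalized input through the stand-in generator and returns the tagged reply.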
CN202110283821.0A 2021-03-17 2021-03-17 Emotional dialogue model construction method, emotional dialogue system and method Active CN112905776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110283821.0A CN112905776B (en) 2021-03-17 2021-03-17 Emotional dialogue model construction method, emotional dialogue system and method

Publications (2)

Publication Number Publication Date
CN112905776A true CN112905776A (en) 2021-06-04
CN112905776B CN112905776B (en) 2023-03-31

Family

ID=76105293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110283821.0A Active CN112905776B (en) 2021-03-17 2021-03-17 Emotional dialogue model construction method, emotional dialogue system and method

Country Status (1)

Country Link
CN (1) CN112905776B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688309A (en) * 2021-07-23 2021-11-23 北京三快在线科技有限公司 Training method for generating model and generation method and device for recommendation reason
CN114187997A (en) * 2021-11-16 2022-03-15 同济大学 Psychological consultation chat robot implementation method for depressed people
CN114385802A (en) * 2022-01-10 2022-04-22 重庆邮电大学 Common-emotion conversation generation method integrating theme prediction and emotion inference
WO2023159759A1 (en) * 2022-02-22 2023-08-31 平安科技(深圳)有限公司 Model training method and apparatus, emotion message generation method and apparatus, device and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180174020A1 (en) * 2016-12-21 2018-06-21 Microsoft Technology Licensing, Llc Systems and methods for an emotionally intelligent chat bot
CN108874972A (en) * 2018-06-08 2018-11-23 青岛里奥机器人技术有限公司 A kind of more wheel emotion dialogue methods based on deep learning
CN111128240A (en) * 2019-12-19 2020-05-08 浙江大学 Speech emotion recognition method based on anti-semantic erasure
CN111241250A (en) * 2020-01-22 2020-06-05 中国人民大学 Emotional dialogue generation system and method
CN111522936A (en) * 2020-04-24 2020-08-11 上海智臻智能网络科技股份有限公司 Intelligent customer service dialogue reply generation method and device containing emotion and electronic equipment
US20210034708A1 (en) * 2019-08-01 2021-02-04 Oracle International Corporation Using neural network and score weighing to incorporate contextual data in sentiment analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LANTAO YU: "《SeqGAN:sequence generative adversarial nets with policy gradient》", 《ARXIV》 *
周震卿等: "基于TextCNN情感预测器的情感监督聊天机器人", 《微型电脑应用》 *
王明申: "《基于词级权重与对抗性ECM模型的对话生成方法研究》", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Also Published As

Publication number Publication date
CN112905776B (en) 2023-03-31

Similar Documents

Publication Publication Date Title
CN109241255B (en) Intention identification method based on deep learning
CN108897857B (en) Chinese text subject sentence generating method facing field
CN112905776B (en) Emotional dialogue model construction method, emotional dialogue system and method
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN107798140A (en) A kind of conversational system construction method, semantic controlled answer method and device
CN109063164A (en) A kind of intelligent answer method based on deep learning
CN113505209A (en) Intelligent question-answering system for automobile field
CN112989033B (en) Microblog emotion classification method based on emotion category description
CN111914556A (en) Emotion guiding method and system based on emotion semantic transfer map
CN113297364A (en) Natural language understanding method and device for dialog system
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
CN111949762B (en) Method and system for context-based emotion dialogue and storage medium
CN113435211A (en) Text implicit emotion analysis method combined with external knowledge
CN114648016A (en) Event argument extraction method based on event element interaction and tag semantic enhancement
CN112818106A (en) Evaluation method of generating type question and answer
CN116049387A (en) Short text classification method, device and medium based on graph convolution
CN116561251A (en) Natural language processing method
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
CN114416969A (en) LSTM-CNN online comment sentiment classification method and system based on background enhancement
CN112200674B (en) Stock market emotion index intelligent calculation information system
CN113326367A (en) Task type dialogue method and system based on end-to-end text generation
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN113065324A (en) Text generation method and device based on structured triples and anchor templates
CN113220964A (en) Opinion mining method based on short text in network communication field
Chowanda et al. Generative Indonesian conversation model using recurrent neural network with attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant