CN113946670B - Contrast type context understanding enhancement method for dialogue emotion recognition - Google Patents

Contrast type context understanding enhancement method for dialogue emotion recognition

Info

Publication number
CN113946670B
Authority
CN
China
Prior art keywords
emotion
representation
context
model
session
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111217510.0A
Other languages
Chinese (zh)
Other versions
CN113946670A (en)
Inventor
Song Dawei (宋大为)
Zhang Hanqing (张寒青)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202111217510.0A priority Critical patent/CN113946670B/en
Publication of CN113946670A publication Critical patent/CN113946670A/en
Application granted granted Critical
Publication of CN113946670B publication Critical patent/CN113946670B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a contrast type context understanding enhancement method for dialogue emotion recognition, and belongs to the technical field of computer and information science. First, based on an existing dialogue emotion analysis framework, the hidden state sequence used for emotion classification is extracted. Then, based on the extracted sequence representation, comparison sample pairs containing context-aware semantic patterns are constructed. Next, a contrastive learning loss function is used so that the model learns the patterns contained in these samples, which enhances the model's understanding of the dialogue context. Finally, the contrast loss and the emotion classification loss function are added together and multi-task learning is performed to complete the training of the network model. The method is highly adaptable and can be flexibly embedded into existing emotion classification models, enabling them, to a certain extent, to judge emotion from the viewpoint of understanding the dialogue context, and it effectively improves the classification accuracy and the robustness to perturbations of existing models.

Description

Contrast type context understanding enhancement method for dialogue emotion recognition
Technical Field
The invention relates to a contrast type context understanding enhancement method for dialogue emotion recognition, and belongs to the technical field of computers and information science.
Background
The research goal of conversational emotion recognition (CER) is to identify the emotion of each utterance in a conversation. Effective dialogue emotion recognition is critical to the construction of dialogue systems. If a dialogue system can take the emotional state of the user into account, it can exhibit human-like empathy, which is of great value for improving the friendliness of its human-machine interaction. Accordingly, research on conversational emotion recognition has attracted increasing attention in recent years.
With the progress of deep learning technology, neural-network-based emotion recognition methods have achieved clear performance gains. Most existing approaches strive to build more effective utterance representations in order to better model the dialogue context. Specifically, the utterances in a dialogue are treated as a sequence, and sequence models commonly used in natural language processing, such as recurrent neural networks (RNNs), Transformers, and graph neural networks (e.g., GCNs), are used to aggregate the various emotional influences on each target utterance (such as inter-speaker influence, intra-speaker influence, topic, and personality), yielding a final utterance-level vector representation on which emotion classification is performed. However, the course of a dialogue is affected by many factors, such as its topic, intention, viewpoints, and argumentative logic, so it remains difficult for these methods to judge the emotion of the current utterance by genuinely understanding the context information; the classification accuracy and robustness of existing models are therefore limited to a certain extent.
To address this, the present application provides a contrast type context understanding enhancement method for dialogue emotion recognition. By introducing contrastive learning, an existing dialogue emotion classification model is forced to attend to the context information while performing emotion discrimination, which enhances its understanding of the dialogue context and improves the accuracy and robustness of its emotion classification.
Disclosure of Invention
The invention aims to provide a contrast type context understanding enhancement method for dialogue emotion recognition, addressing the technical problems of low classification accuracy and poor model robustness caused by insufficient understanding of the dialogue context in existing neural-network-based dialogue emotion recognition methods. By introducing contrastive learning, an existing dialogue emotion classification model is forced to attend to the context information while performing emotion discrimination, thereby enhancing its understanding of the dialogue context and improving the accuracy and robustness of its emotion classification.
The innovation of the invention lies in the following. First, based on an existing dialogue emotion analysis framework, the hidden state sequence used for emotion classification is extracted. Then, based on the extracted sequence representation, comparison sample pairs containing context-aware semantic patterns are constructed. Next, a contrastive learning loss function is used so that the model learns the patterns contained in these samples, which enhances the model's understanding of the dialogue context. Finally, the contrast loss and the emotion classification loss function are added together and multi-task learning is performed to complete the training of the network model.
The technical scheme of the invention is realized by the following steps.
A contrast type context understanding enhancement method for dialogue emotion recognition comprises the following steps:
Step 1: extract the emotion representation sequence from an existing emotion classification framework.
Specifically, the following method may be employed:
Step 1.1: vectorize the utterance text of the conversation to obtain a corresponding distributed text representation.
Step 1.2: feed the text representation from step 1.1 into an existing dialogue emotion classification model to obtain the emotion representation sequence before the model's fully connected classification layer.
Step 2: construct comparison sample pairs containing context-aware characteristics.
Specifically, the following method may be employed:
Step 2.1: encode the historical information of each target utterance's emotion representation to be classified to obtain an abstract representation of its context.
Step 2.2: use the target utterance itself and the adjacent utterances in the same direction as positive examples for the target's context representation, and use utterance emotion representations sampled from other, unrelated dialogues as negative examples, thereby completing the construction of the comparison sample pairs.
Step 2.3: repeat steps 2.1 and 2.2 in the opposite direction of the dialogue flow to construct corresponding comparison sample pairs for each target utterance.
Step 3: construct a contrast loss function and perform joint training with the original emotion classification framework.
Specifically, the following method may be employed:
Step 3.1: construct a contrast loss function so that, in the learned latent semantic space, the negative example pairs constructed in step 2 are pushed apart and the positive example pairs are pulled closer together.
Step 3.2: add the contrast loss function to the loss function of the original dialogue emotion classification framework and train jointly with the original network to obtain a new dialogue emotion classification model.
The obtained dialogue emotion classification model is then used to judge the emotion in the target dialogue and classify the dialogue emotion.
Advantageous effects
Compared with the prior art, the method has the following advantages:
The method is highly adaptable and can be flexibly embedded into existing emotion classification models, enabling them, to a certain extent, to judge emotion from the viewpoint of understanding the dialogue context, and it effectively improves the classification accuracy and the robustness to perturbations of existing models.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention.
Fig. 2 is a process diagram of constructing the comparison sample pairs.
Detailed Description
For a better illustration of the objects and advantages of the invention, a more detailed description of the specific embodiments of the method of the invention will be given below with reference to examples.
A contrast type context understanding enhancement method for dialogue emotion recognition comprises the following steps:
Step 1: extract the emotion representation sequence from an existing emotion classification framework.
Specifically, step 1 includes the following steps.
Step 1.1: for a segment of text dialogue, the text content of the dialogue is mapped, using word2vec word embeddings, into a text sequence in vector form:

$U_l = \{u_1, u_2, \ldots, u_{T_l}\}$

where $u$ denotes the conversation text, $T_l$ denotes the number of conversation turns in the $l$-th dialogue segment of the training data, and each $u_t$ is the text information of one turn.
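A toy sketch of this vectorization step follows (Python/PyTorch, which the patent does not prescribe); the vocabulary, the nn.Embedding table standing in for pre-trained word2vec vectors, and the mean pooling are illustrative assumptions only.

```python
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "i": 1, "am": 2, "so": 3, "happy": 4, "today": 5}   # toy vocabulary
embed = nn.Embedding(len(vocab), 300)   # stand-in for pre-trained word2vec vectors (dim 300)

def vectorize(utterance):
    # Map each token to its embedding and mean-pool into one distributed utterance vector u_t.
    ids = torch.tensor([vocab.get(tok, 0) for tok in utterance.lower().split()])
    return embed(ids).mean(dim=0)                      # shape: (300,)

# U_l for a two-turn toy dialogue, shape (T_l, 300)
U = torch.stack([vectorize(u) for u in ["I am so happy today", "So am I"]])
```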
Step 1.2: as shown in fig. 1, given an existing dialogue emotion classification model CER, the vectorized text sequence obtained in step 1.1 is fed into the emotion classification model, and the emotion sequence representation before the model's fully connected classification layer is obtained:

$H = \mathrm{CER}(U_l) = \{h_1, h_2, \ldots, h_{T_l}\}$

where $H$ is the emotion sequence representation, each $h_t$ is the emotion vector representation of the utterance at time $t$, and CER denotes the existing dialogue emotion classification model.
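The following minimal sketch illustrates step 1 as a whole. The class ExistingCERModel, its GRU encoder, and all dimensions are hypothetical stand-ins for whatever existing dialogue emotion classification model is being enhanced; only the idea of taking the hidden states before the classification layer comes from the method itself.

```python
import torch
import torch.nn as nn

class ExistingCERModel(nn.Module):
    """Placeholder for an existing dialogue emotion classifier (RNN/Transformer/GCN based)."""
    def __init__(self, input_dim=300, hidden_dim=128, num_emotions=7):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim, batch_first=True)   # context encoder
        self.classifier = nn.Linear(hidden_dim, num_emotions)            # fully connected classification layer

    def forward(self, utterance_vectors):
        # utterance_vectors: (1, T_l, input_dim), the vectorized dialogue from step 1.1
        H, _ = self.encoder(utterance_vectors)     # H = {h_1, ..., h_{T_l}}: emotion sequence representation
        logits = self.classifier(H)                # used only for the original classification loss
        return H.squeeze(0), logits.squeeze(0)

model = ExistingCERModel()
dialogue = torch.randn(1, 10, 300)                 # toy word2vec-style utterance vectors, T_l = 10
H, logits = model(dialogue)                        # H (10, 128) is taken before the classification layer
```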
Step 2: as shown in fig. 2, comparison sample pairs containing context-aware characteristics are constructed based on the emotion sequence representation $H$ obtained in step 1.
Specifically, step 2 includes the following steps.
Step 2.1: using a sequence model $\overrightarrow{f}$, the historical utterance information of each target utterance emotion representation $h_k$ to be classified is encoded to obtain an abstract representation of its context:

$\overrightarrow{c}_k = \overrightarrow{f}(h_1, h_2, \ldots, h_{k-1})$

where $h_{k-1}$ is the emotion vector of the utterance at time $k-1$.
Step 2.2: the representations of the target utterance and of the $w$ utterances following it in the same direction are taken as positive examples of the target's context representation, forming the set of forward positive example pairs:

$P_k^{\rightarrow} = \{\, p_k = (\overrightarrow{c}_k, h_j) \mid k \le j \le k + w \,\}$

where $P_k^{\rightarrow}$ denotes the set of forward positive sample pairs, $p_k$ denotes a single forward positive pair, $h_{k+w}$ is the emotion vector of the utterance at time $k+w$ with $k+w < T_l$, and $k$ is the position of the target utterance in the conversation.
Similarly, a sequence model in the opposite direction, $\overleftarrow{f}$, encodes the future sequence information of the target utterance's emotion representation, giving the abstract context representation $\overleftarrow{c}_k = \overleftarrow{f}(h_{T_l}, h_{T_l-1}, \ldots, h_{k+1})$, from which the positive example pairs in the opposite direction are constructed:

$P_k^{\leftarrow} = \{\, (\overleftarrow{c}_k, h_j) \mid k - w \le j \le k \,\}$

where $P_k^{\leftarrow}$ denotes the set of backward positive sample pairs and $h_{k-w}$ is the emotion representation vector of the utterance at time $k-w$.
Together, these give the set $P_k$ of all positive sample pairs of the target utterance:

$P_k = P_k^{\rightarrow} \cup P_k^{\leftarrow}$
Step 2.3: combining the context representations $\overrightarrow{c}_k$ and $\overleftarrow{c}_k$ in the two directions obtained in step 2.2, utterance emotion representations sampled from other, unrelated dialogue data are taken as negative examples to construct the negative sample pairs:

$N_k^{\rightarrow} = \{\, n_k = (\overrightarrow{c}_k, \tilde{h}) \,\}, \qquad N_k^{\leftarrow} = \{\, n_k = (\overleftarrow{c}_k, \tilde{h}) \,\}$

where $\tilde{h}$ denotes an utterance emotion representation randomly sampled from another dialogue, $N_k^{\rightarrow}$ and $N_k^{\leftarrow}$ respectively denote the negative sample pair sets in the forward and backward directions, and $n_k$ is a single comparison sample pair therein.
The set of negative pairs $N_k$ for the target utterance $h_k$ is then:

$N_k = N_k^{\rightarrow} \cup N_k^{\leftarrow}$
$P_k$ and $N_k$ contain the patterns that enable the model to perceive the dialogue context. For the emotion representation of every utterance, the corresponding comparison sample pairs are obtained through the same process. Combining them with the contrast loss allows the model to learn the features contained in these comparison samples; a minimal implementation sketch is given below.
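A hedged sketch of the pair construction of step 2: the choice of GRUs for the forward and backward sequence models, the window w, the number of negatives, and the assumption that the target utterance has both history and future are illustrative, not prescribed by the method.

```python
import torch
import torch.nn as nn

hidden_dim, w = 128, 2
f_fwd = nn.GRU(hidden_dim, hidden_dim, batch_first=True)   # encodes the history h_1 ... h_{k-1}
f_bwd = nn.GRU(hidden_dim, hidden_dim, batch_first=True)   # encodes the future h_{T_l} ... h_{k+1}

def build_pairs(H, k, other_H, num_neg=4):
    """H: (T, d) emotion sequence of one dialogue; k: target index with 0 < k < T - 1;
    other_H: (T', d) emotion sequence taken from an unrelated dialogue."""
    T = H.size(0)
    # Step 2.1 / 2.3: abstract context representations in both directions.
    _, c_fwd = f_fwd(H[:k].unsqueeze(0))                    # history of the target utterance
    _, c_bwd = f_bwd(H[k + 1:].flip(0).unsqueeze(0))        # future, read in reverse order
    c_fwd, c_bwd = c_fwd.squeeze(), c_bwd.squeeze()
    # Step 2.2: positive pairs = target itself plus w neighbours in each direction.
    positives = [(c_fwd, H[j]) for j in range(k, min(k + w + 1, T))]
    positives += [(c_bwd, H[j]) for j in range(max(k - w, 0), k + 1)]
    # Negative pairs: emotion representations sampled from the unrelated dialogue.
    idx = torch.randint(0, other_H.size(0), (num_neg,))
    negatives = [(c_fwd, other_H[i]) for i in idx] + [(c_bwd, other_H[i]) for i in idx]
    return positives, negatives
```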
Step 3: construct a contrast loss function from the comparison sample pairs obtained in step 2, and perform joint training with the emotion classification framework.
Further, step 3 includes the following steps.
Step 3.1: construct a contrast loss function so that the negative example pairs among the comparison samples are pushed apart and the positive example pairs are pulled closer.
For a target utterance $h_k$, its corresponding comparison sample set is $D_k = \{P_k, N_k\}$. For any pair $(c_j, h_j) \in D_k$, the two representations are first concatenated, and a matching score $o_j$ between them is then computed by a fully connected multi-layer perceptron (MLP):

$o_j = \mathrm{MLP}([\, c_j ; h_j \,])$

where $c_j$ and $h_j$ respectively denote the context representation and the utterance vector representation in the comparison pair, $[\cdot\,;\cdot]$ denotes concatenation, and MLP is a fully connected perceptron network.
Then, the matching score $o_j$ is normalized into the range $(0, 1)$ by a sigmoid function:

$s_j = \mathrm{sigmoid}(o_j)$ (9)
Based on the matching scores computed for each sample pair, a contrast loss is constructed that increases the matching scores of positive sample pairs and decreases those of negative sample pairs:

$L_c(D_k) = -\dfrac{1}{|P_k|}\sum_{p \in P_k} \log s_p \;-\; \dfrac{1}{|N_k|}\sum_{n \in N_k} \log\bigl(1 - s_n\bigr)$

where $s_p$ and $s_n$ respectively denote the matching scores of the corresponding positive and negative sample pairs, $|P_k|$ denotes the number of positive pairs, $|N_k|$ denotes the number of corresponding negative pairs, $L_c$ is the contrast loss for each target utterance, and $D_k$ is its comparison sample set.
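A sketch of the matching network and contrast loss of step 3.1 follows. The MLP sizes are assumptions, and the binary cross-entropy form of the loss is a reconstruction consistent with the stated behaviour (raise positive-pair scores, lower negative-pair scores) rather than a verbatim formula.

```python
import torch
import torch.nn as nn

hidden_dim = 128
mlp = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

def pair_scores(pairs):
    # Concatenate context and utterance representations, score with the MLP, squash with sigmoid.
    cat = torch.stack([torch.cat([c, h], dim=-1) for c, h in pairs])
    return torch.sigmoid(mlp(cat)).squeeze(-1)

def contrast_loss(positives, negatives, eps=1e-8):
    s_pos, s_neg = pair_scores(positives), pair_scores(negatives)
    # Push positive-pair scores toward 1 and negative-pair scores toward 0.
    return -(torch.log(s_pos + eps).mean() + torch.log(1.0 - s_neg + eps).mean())
```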
Step 3.2: the loss function $L(\theta)$ of the whole network is obtained by adding the original emotion classification loss function and the contrast loss function, in the specific form:

$L(\theta) = \sum_{l} \sum_{t=1}^{T_l} \bigl( L_e(u_t) + \lambda\, L_c(D_t) \bigr)$

where $\theta$ denotes all parameters of the whole network; $T_l$ is the number of dialogue turns contained in the $l$-th dialogue of the training data; $L_e(u_t)$ is the emotion classification loss for the target utterance $u_t$; $L_c(D_t)$ is the contrast loss function; $\lambda$ is the contrast-loss intensity parameter, used to control the strength of the context enhancement task; and $L(\theta)$ is the loss function of the whole network.
By jointly training on these two tasks, the whole network achieves the effect of enhancing the context understanding of the existing dialogue emotion classification model; a sketch of the joint objective is given below.
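Finally, a sketch of the joint objective of step 3.2, reusing the contrast_loss sketch above; the value assigned to λ (lambda_c) and the per-turn loop over a single dialogue are illustrative assumptions.

```python
import torch.nn.functional as F

lambda_c = 0.5   # assumed value of the contrast-loss intensity parameter λ

def joint_loss(logits, labels, pos_pairs, neg_pairs):
    """logits: (T_l, num_emotions); labels: (T_l,); pos_pairs / neg_pairs: per-turn pair lists."""
    total = 0.0
    for t in range(logits.size(0)):
        l_e = F.cross_entropy(logits[t:t + 1], labels[t:t + 1])     # emotion classification loss L_e(u_t)
        l_c = contrast_loss(pos_pairs[t], neg_pairs[t])             # contrast loss L_c(D_t)
        total = total + l_e + lambda_c * l_c
    return total                                                    # summed over the T_l turns of one dialogue
```

In this reading, the contrast term acts as an auxiliary task, so λ trades off the strength of the context enhancement task against the original classification objective.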
Experiment verification
Three representative dialogue emotion models were chosen as baseline models, and experiments were performed on the MELD and IEMOCAP datasets, respectively. The results show that, after adding the context understanding enhancement method proposed in this application, classification accuracy improves by 2-3% over the baseline models; moreover, in perturbation tests in which the context of each utterance is replaced, the method exhibits stronger robustness and still maintains a higher classification accuracy.

Claims (2)

1. A contrast type context understanding enhancement method for dialogue emotion recognition is characterized by comprising the following steps:
step 1: extracting hidden state sequences for emotion classification based on the existing dialogue emotion analysis frame;
Step 1.1: vectorizing a conversation text of a conversation to obtain a corresponding distributed text representation;
the text sequence content of the dialogue is represented as a text sequence in vector form:

$U_l = \{u_1, u_2, \ldots, u_{T_l}\}$

where $u$ denotes the conversation text, $T_l$ denotes the number of conversation turns in the $l$-th dialogue of the training data, and each $u_t$ is the text information of one turn;
Step 1.2: sending the text representation into the existing dialogue emotion classification model to obtain the emotion representation sequence before the model's fully connected classification layer;
given an existing dialogue emotion classification model CER, the obtained vectorized text sequence is fed into the emotion classification model, and the emotion sequence representation before the model's fully connected layer for emotion classification is obtained:

$H = \mathrm{CER}(U_l) = \{h_1, h_2, \ldots, h_{T_l}\}$

where $H$ is the emotion sequence representation, each $h_t$ is the emotion vector representation of the utterance at time $t$, and CER is the existing dialogue emotion classification model;
step 2: constructing a contrast sample pair containing a context semantic perception mode based on the extracted sequence representation;
Step 2.1: encoding the historical information of each target utterance emotion representation to be classified to obtain an abstract representation of its context;
using a sequence model $\overrightarrow{f}$, the historical utterance information of each target utterance emotion representation $h_k$ to be classified is encoded to obtain the abstract representation of the context:

$\overrightarrow{c}_k = \overrightarrow{f}(h_1, h_2, \ldots, h_{k-1})$

where $h_{k-1}$ is the emotion vector of the utterance at time $k-1$;
Step 2.2: taking the target utterance and the utterance representations adjacent to it in the same direction as positive examples of the target utterance's context representation, and taking utterance emotion representations sampled from other, unrelated dialogues as negative examples, thereby completing the construction of the comparison sample pairs;
taking the representations of the target utterance and of the $w$ utterances following it in the same direction as positive examples of the target's context representation, and forming the set of forward positive example pairs:

$P_k^{\rightarrow} = \{\, p_k = (\overrightarrow{c}_k, h_j) \mid k \le j \le k + w \,\}$

where $P_k^{\rightarrow}$ denotes the set of forward positive sample pairs, $p_k$ denotes a single forward positive pair, $h_{k+w}$ is the emotion vector of the utterance at time $k+w$ with $k+w < T_l$, and $k$ is the position of the target utterance in the conversation;
using a sequence model in the opposite direction, $\overleftarrow{f}$, the future sequence information of the target utterance's emotion representation is encoded to obtain the abstract context representation $\overleftarrow{c}_k = \overleftarrow{f}(h_{T_l}, h_{T_l-1}, \ldots, h_{k+1})$, and the positive example pairs in the opposite direction are then constructed:

$P_k^{\leftarrow} = \{\, (\overleftarrow{c}_k, h_j) \mid k - w \le j \le k \,\}$

where $P_k^{\leftarrow}$ denotes the set of backward positive sample pairs and $h_{k-w}$ is the emotion representation vector of the utterance at time $k-w$;
in summary, the set $P_k$ of all positive sample pairs of the target utterance is obtained:

$P_k = P_k^{\rightarrow} \cup P_k^{\leftarrow}$
Step 2.3: repeating steps 2.1 and 2.2 in the opposite direction of the dialogue flow to construct corresponding comparison sample pairs for each target utterance;
combining the forward and backward context representations $\overrightarrow{c}_k$ and $\overleftarrow{c}_k$, utterance emotion representations sampled from other, unrelated dialogue data are taken as negative examples to construct the negative sample pairs:

$N_k^{\rightarrow} = \{\, n_k = (\overrightarrow{c}_k, \tilde{h}) \,\}, \qquad N_k^{\leftarrow} = \{\, n_k = (\overleftarrow{c}_k, \tilde{h}) \,\}$

where $\tilde{h}$ denotes an utterance emotion representation randomly sampled from another dialogue, $N_k^{\rightarrow}$ and $N_k^{\leftarrow}$ respectively denote the negative sample pair sets in the forward and backward directions, and $n_k$ is a single comparison sample pair therein;
the set of negative pairs $N_k$ for the target utterance $h_k$ is represented as:

$N_k = N_k^{\rightarrow} \cup N_k^{\leftarrow}$
$P_k$ and $N_k$ contain the patterns that enable the model to perceive the dialogue context; for the emotion representation of each utterance, the corresponding comparison sample pairs are obtained through the same process; combining them with the contrast loss allows the model to learn the features contained in these comparison samples;
Step 3: constructing a contrastive learning loss function so that the model learns, from the samples, the patterns they contain; adding the contrast loss and the emotion classification loss function and performing multi-task learning to complete the training of the network model;
and judging emotion in the target dialogue by using the dialogue emotion classification model, and realizing classification of dialogue emotion.
2. The method for enhancing the comparative context understanding of dialog emotion recognition as claimed in claim 1, wherein the implementation method of step 3 is as follows:
Step 3.1: constructing a contrast loss function so that the negative example pairs among the comparison samples are pushed apart and the positive example pairs are pulled closer;
for a target utterance $h_k$, its corresponding comparison sample set is $D_k = \{P_k, N_k\}$; for any pair $(c_j, h_j) \in D_k$, the two representations are first concatenated and a matching score $o_j$ between them is then computed by a fully connected perceptron (MLP):

$o_j = \mathrm{MLP}([\, c_j ; h_j \,])$

where $c_j$ and $h_j$ respectively denote the context representation and the utterance vector representation in the comparison pair, $[\cdot\,;\cdot]$ denotes concatenation, and MLP is a fully connected perceptron network;
then, the matching score $o_j$ is normalized into the range $(0, 1)$ by a sigmoid function:

$s_j = \mathrm{sigmoid}(o_j)$ (9)
based on the matching scores computed for each sample pair, a contrast loss is constructed that increases the matching scores of positive sample pairs and decreases those of negative sample pairs:

$L_c(D_k) = -\dfrac{1}{|P_k|}\sum_{p \in P_k} \log s_p \;-\; \dfrac{1}{|N_k|}\sum_{n \in N_k} \log\bigl(1 - s_n\bigr)$

where $s_p$ and $s_n$ respectively denote the matching scores of the corresponding positive and negative sample pairs, $|P_k|$ denotes the number of positive pairs, $|N_k|$ denotes the number of corresponding negative pairs, $L_c$ is the contrast loss for each target utterance, and $D_k$ is its comparison sample set;
Step 3.2: the loss function $L(\theta)$ of the whole network is obtained by adding the original emotion classification loss function and the contrast loss function, in the specific form:

$L(\theta) = \sum_{l} \sum_{t=1}^{T_l} \bigl( L_e(u_t) + \lambda\, L_c(D_t) \bigr)$

where $\theta$ denotes all parameters of the whole network; $T_l$ is the number of dialogue turns contained in the $l$-th dialogue of the training data; $L_e(u_t)$ is the emotion classification loss for the target utterance $u_t$; $L_c(D_t)$ is the contrast loss function; $\lambda$ is the contrast-loss intensity parameter, used to control the strength of the context enhancement task; and $L(\theta)$ is the loss function of the whole network;
the whole network achieves the effect of enhancing the context understanding of the existing dialogue emotion classification model by jointly training on these two tasks.
CN202111217510.0A 2021-10-19 2021-10-19 Contrast type context understanding enhancement method for dialogue emotion recognition Active CN113946670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111217510.0A CN113946670B (en) 2021-10-19 2021-10-19 Contrast type context understanding enhancement method for dialogue emotion recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111217510.0A CN113946670B (en) 2021-10-19 2021-10-19 Contrast type context understanding enhancement method for dialogue emotion recognition

Publications (2)

Publication Number Publication Date
CN113946670A CN113946670A (en) 2022-01-18
CN113946670B true CN113946670B (en) 2024-05-10

Family

ID=79331406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111217510.0A Active CN113946670B (en) 2021-10-19 2021-10-19 Contrast type context understanding enhancement method for dialogue emotion recognition

Country Status (1)

Country Link
CN (1) CN113946670B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114756678B (en) * 2022-03-25 2024-05-14 鼎富智能科技有限公司 Unknown intention text recognition method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874972A (en) * 2018-06-08 2018-11-23 青岛里奥机器人技术有限公司 A kind of more wheel emotion dialogue methods based on deep learning
KR20200119410A (en) * 2019-03-28 2020-10-20 한국과학기술원 System and Method for Recognizing Emotions from Korean Dialogues based on Global and Local Contextual Information
CN112949684A (en) * 2021-01-28 2021-06-11 天津大学 Multimodal dialogue emotion information detection method based on reinforcement learning framework
CN113065344A (en) * 2021-03-24 2021-07-02 大连理工大学 Cross-corpus emotion recognition method based on transfer learning and attention mechanism
CN113254625A (en) * 2021-07-15 2021-08-13 国网电子商务有限公司 Emotion dialogue generation method and system based on interactive fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10818312B2 (en) * 2018-12-19 2020-10-27 Disney Enterprises, Inc. Affect-driven dialog generation

Also Published As

Publication number Publication date
CN113946670A (en) 2022-01-18

Similar Documents

Publication Publication Date Title
Tan et al. The artificial intelligence renaissance: deep learning and the road to human-level machine intelligence
CN108710704B (en) Method and device for determining conversation state, electronic equipment and storage medium
CN113516968B (en) End-to-end long-term speech recognition method
CN113987179B (en) Dialogue emotion recognition network model based on knowledge enhancement and backtracking loss, construction method, electronic equipment and storage medium
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN115964467A (en) Visual situation fused rich semantic dialogue generation method
Chen et al. Distilled binary neural network for monaural speech separation
CN113065344A (en) Cross-corpus emotion recognition method based on transfer learning and attention mechanism
CN114091478A (en) Dialog emotion recognition method based on supervised contrast learning and reply generation assistance
CN111899766B (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
CN112905772A (en) Semantic correlation analysis method and device and related products
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113946670B (en) Contrast type context understanding enhancement method for dialogue emotion recognition
CN110069611A (en) A kind of the chat robots reply generation method and device of theme enhancing
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Zhao et al. Knowledge-aware bayesian co-attention for multimodal emotion recognition
CN115249479A (en) BRNN-based power grid dispatching complex speech recognition method, system and terminal
CN113656569A (en) Generating type dialogue method based on context information reasoning
CN113177113A (en) Task type dialogue model pre-training method, device, equipment and storage medium
CN111414466A (en) Multi-round dialogue modeling method based on depth model fusion
CN111160512A (en) Method for constructing dual-discriminator dialog generation model based on generative confrontation network
CN115795010A (en) External knowledge assisted multi-factor hierarchical modeling common-situation dialogue generation method
CN117980915A (en) Contrast learning and masking modeling for end-to-end self-supervised pre-training
CN115169363A (en) Knowledge-fused incremental coding dialogue emotion recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant