CN112948554B - Real-time multimodal dialogue emotion analysis method based on reinforcement learning and domain knowledge - Google Patents

Real-time multimodal dialogue emotion analysis method based on reinforcement learning and domain knowledge

Info

Publication number
CN112948554B
CN112948554B (application CN202110222049.1A)
Authority
CN
China
Prior art keywords
layer
dialogue
information
emotion
neural network
Prior art date
Legal status
Active
Application number
CN202110222049.1A
Other languages
Chinese (zh)
Other versions
CN112948554A (en)
Inventor
张科
李苑青
王靖宇
苏雨
谭明虎
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202110222049.1A
Publication of CN112948554A
Application granted
Publication of CN112948554B
Legal status: Active

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3343 Query execution using phonetics
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a real-time multimodal dialogue emotion analysis method based on reinforcement learning and domain knowledge, and belongs to the technical field of user emotion tendency analysis. Because real-time multimodal emotion analysis cannot obtain information occurring after the target, a new model and network structure are designed by combining reinforcement learning with a recurrent neural network; the multimodal information in the target and the preceding sampling period is fully extracted, fused, and analyzed, and recognition efficiency and accuracy are further improved by incorporating domain knowledge.

Description

Real-time multimodal dialogue emotion analysis method based on reinforcement learning and domain knowledge
Technical Field
The invention belongs to the technical field of user emotion tendency analysis, and particularly relates to a real-time multimodal dialogue emotion analysis model and method based on reinforcement learning and domain knowledge.
Background
Multimodal user emotion analysis has been a very popular research field in recent years, with broad development potential and application prospects, for example: driver fatigue monitoring in autonomous driving systems, airport security monitoring for dangerous individuals in crowds, companion monitoring of autism patients in the medical field, and companion, alarm, and monitoring services for elderly people living alone and for children in the smart home field. In existing multimodal emotion analysis technology, the modalities used for analysis vary with the research direction; the four main ones are visual signals, acoustic signals, text information, and electroencephalogram (EEG) signals. EEG signals offer the relatively highest accuracy, but they require dedicated signal acquisition sensor equipment, making them difficult to popularize conveniently and widely in daily life. Vision, sound, and text are therefore the most common input modalities in multimodal user emotion analysis research. Prior techniques using these three modalities fall into two categories: the first analyzes sentence by sentence or segment by segment, i.e., without considering context information; the second considers context, judging the user's emotion at a certain point in time on the basis of the entire dialogue content. The former has strong real-time performance but poor accuracy because context is ignored; the latter greatly improves recognition accuracy but lacks real-time capability in practical applications, losing the ability to monitor in real time.
Recurrent neural networks have been a very active research direction in machine learning in recent years. Reinforcement learning, as a paradigm and methodology of machine learning, has increasingly been combined with recurrent neural networks, making algorithm design more flexible and greatly expanding the range of applications. Correspondingly, different application fields come with different domain knowledge, i.e., common-sense constraints and guidance for the problem under study, which can optimize an algorithm's results to a certain extent, for example by filtering out causal relationships that contradict common sense or actual conditions and by increasing the probability that more likely events are selected. By combining reinforcement learning and domain knowledge, recurrent neural networks have made breakthrough progress in image processing, text analysis, speech recognition, and other directions, with the advantages of short training time, few training parameters, and simple design.
Liu Qiyuan and Zhang Dong (Multi-modal sentiment analysis based on context-enhanced LSTM, Computer Science, 2019, 46(11): 181-185) proposed a context-enhanced LSTM method for multimodal emotion analysis that captures both the information within each single modality and the interaction information between modalities. LSTM is a type of recurrent neural network: for the utterances of each modality, they combine contextual features and encode each modality separately with an LSTM, capturing the information within that modality; the independent single-modality information is then fused, and another LSTM extracts the interaction information between the modalities, forming a multimodal feature representation; finally, a max-pooling strategy reduces the dimensionality of the multimodal representation, on which the emotion classifier is constructed. The algorithm achieves good recognition accuracy on public datasets and greatly improves training speed. However, this multimodal emotion analysis model takes all the context information related to the recognition target as input; it is a post-hoc analysis and cannot perform emotion analysis in real time.
Disclosure of Invention
Technical problem to be solved
Existing multimodal emotion analysis models perform post-hoc analysis of the analyzed target: they require not only the information before the target but also the information after it, which does not meet the requirements and practical conditions of real-time multimodal dialogue emotion analysis. To address this inability of the prior art to analyze in real time, the invention provides a real-time multimodal dialogue emotion analysis model and method based on reinforcement learning and domain knowledge.
Technical solution
The reinforcement learning model based on a recurrent neural network for emotion analysis is characterized by comprising 12 layers: the first layer is an input layer, the middle 10 layers are hidden layers (comprising 2 recurrent neural network layers, 2 normalization layers, 1 activation layer, and 5 fully connected layers), and the last layer is an output layer. The input is the three-modality information (image, text, and speech) of the current dialogue sampling segment, and single-modality feature processing is performed first: the image, text, and speech feature processing layers each comprise a normalization layer, a recurrent neural network layer, and a fully connected layer. The three modalities are then fused through a normalization layer, a recurrent neural network layer, an activation layer, and one fully connected layer, and finally three fully connected layers output the results. The network output is the probability of each emotion type for the last sentence of dialogue information in the sampling segment.
The technical scheme of the invention is as follows: the emotion types include 6 types: happiness, excitement, frustration, sadness, anger, and neutral.
The technical scheme of the invention is as follows: the sampling segment comprises 4 consecutive sentences of dialogue.
A real-time multimodal dialogue emotion analysis method based on reinforcement learning and domain knowledge, characterized by comprising the following steps:
step 1: acquiring a multimodal dialogue information database, and generating dialogue emotion domain knowledge from the database;
step 2: constructing the reinforcement learning model based on the recurrent neural network described above, and training the model;
step 3: collecting multimodal dialogue information in real time, sampling sequentially in the order in which the dialogue occurs, analyzing the emotion of the dialogue in real time with the reinforcement learning model based on the recurrent neural network trained in step 2, outputting the probability of each emotion type, and correcting the recognition result with the domain knowledge to obtain the final classification result.
The technical scheme of the invention is as follows: the construction of the real-time multimodal dialogue emotion analysis model in step 2 is specifically as follows:
1) Representing the input multimodal information as:
s(t)=[V(t),T(t),A(t)]
where t is the current sampling time, s(t) is the state information at the current sampling time, V(t) is the image information, T(t) is the text information, and A(t) is the speech information within the current sampling segment;
2) Training the model on the multimodal dialogue information database; the multimodal information at sampling time t is passed through the normalization layer, recurrent neural network layer, activation layer, and fully connected layers to obtain the output layer result, given by:
action(t)=Q(s(t))
where Q is the reinforcement learning algorithm model based on the recurrent neural network, and action(t) is the emotion type recognition result it outputs at the current sampling time; a reward function R is calculated by comparing action(t) with the true emotion type label(t); the loss function of the whole network is then obtained from the difference between the expected value and the estimated value, where the expected value eval is calculated as:
eval=Q(s(t+1))
The estimated value epet is calculated from the reward R and the network output, which yields the loss function loss:
loss=E[epet-eval]
where E is the expectation of epet - eval.
The technical scheme of the invention is as follows: and step 2, training a reinforcement learning model based on the cyclic neural network by adopting a gradient descent and back propagation algorithm.
Advantageous effects
Compared with existing multimodal dialogue emotion analysis models, the model provided by the invention focuses on the real-time performance of dialogue emotion analysis: the dialogue is divided, in order of occurrence, into coherent sampled segments containing the information related to the target; a recurrent neural network processes and fuses the multimodal information; and the recognition results are screened and corrected with reference to the domain knowledge, thereby realizing real-time dialogue emotion analysis.
The novel multimodal emotion analysis model combining reinforcement learning, a recurrent neural network, and domain knowledge can perform emotion analysis in real time during a dialogue; it guarantees real-time performance while still taking into account the multimodal information and domain knowledge related to the target sentence, improving recognition accuracy.
The real-time multimodal dialogue emotion analysis method based on reinforcement learning and domain knowledge addresses the fact that real-time multimodal emotion analysis cannot obtain information occurring after the target: a new model and network structure are designed by combining reinforcement learning with a recurrent neural network; the multimodal information in the target and the preceding sampling period is fully extracted, fused, and analyzed; and recognition efficiency and accuracy are further improved by incorporating domain knowledge.
Drawings
FIG. 1 is a block diagram of the real-time multimodal dialogue emotion analysis model based on reinforcement learning and domain knowledge;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is a graph of the test results of the present invention.
Detailed Description
The invention will now be further described with reference to the examples and the figures.
To realize real-time and rapid multimodal dialogue emotion analysis, the invention provides a novel multimodal emotion analysis model combining reinforcement learning with a recurrent neural network and domain knowledge. A dueling (competition) network structure is adopted as the iterative training algorithm for reinforcement learning, and the recurrent neural network serves as the network model. On the basis of a general public dialogue dataset, statistics are gathered over the 6 basic emotion types (happiness, excitement, frustration, sadness, anger, and neutral), the correlation among the 4 sentences within the sampling length is calculated, and the output of the model is corrected accordingly.
In a multimodal dialogue, every 4 sentences form a sampling segment (sampling length 4), taken in order of occurrence with a stride of 1. The multimodal dialogue information (image, text, and speech) in each sampling segment serves as a state in the reinforcement learning environment. The 4th sentence in the segment is the target of the multimodal emotion analysis, and the first 3 sentences provide the associated reference information for it. This information is the input to the recurrent neural network, which computes the likelihood of the target sentence belonging to each of the 6 candidate emotion types. The probabilities are then normalized and corrected with domain knowledge to produce the final ranking, and the emotion type with the highest probability is taken as the emotion type of the target, that is, as the action selected in the current state, and compared with the true emotion type to obtain the reward. Finally, the action completes the state transition: the next state is the multimodal dialogue information of the next sampling segment in the current dialogue, until the dialogue ends and recognition is complete.
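By way of illustration, the sliding-window sampling just described might look as follows in Python; the dialogue data structure and its field names are assumptions for the sketch, not part of the patent:

```python
# Illustrative sliding-window sampling over one dialogue: every 4
# consecutive utterances form a sampling segment (stride 1); the 4th
# utterance is the analysis target and the first 3 provide context.
WINDOW = 4  # sampling length used by the patent

def sample_segments(dialogue):
    """dialogue: utterances in order of occurrence, each e.g. a dict
    with 'video', 'text', 'audio' features (and 'label' in training)."""
    for end in range(WINDOW, len(dialogue) + 1):
        segment = dialogue[end - WINDOW:end]  # state s(t): 3 context + target
        target = segment[-1]                  # the 4th sentence is analyzed
        yield segment, target
```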
As shown in FIG. 1, the reinforcement learning algorithm based on a recurrent neural network has a 12-layer structure: an input layer, an output layer, and 10 hidden layers in between, comprising 2 recurrent neural network layers, 2 normalization layers, 1 activation layer, and 5 fully connected layers. The network takes the image, text, and speech information of the current dialogue sampling segment as input and first performs single-modality feature processing: the image, text, and speech feature processing paths each comprise a normalization layer, a recurrent neural network layer, and a fully connected layer. The three modalities are then fused through a normalization layer, a recurrent neural network layer, an activation layer, and one fully connected layer, and finally three fully connected layers produce the output. The network output is the probability of each of the 6 emotion types for the last sentence of dialogue in the sampling segment, i.e., a Q table. Finally, the computed probabilities are corrected with the domain knowledge corresponding to the current dialogue to obtain the corrected Q table, and the emotion type with the highest probability is selected as the recognition result.
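A minimal PyTorch sketch of this topology is given below; the choice of GRU cells, the hidden sizes, and the per-modality feature dimensions are assumptions, since the patent does not specify them:

```python
# Minimal sketch of the 12-layer network of FIG. 1 (sizes assumed).
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """Single-modality path: normalization -> recurrent layer -> fully connected."""
    def __init__(self, in_dim, hid):
        super().__init__()
        self.norm = nn.LayerNorm(in_dim)
        self.rnn = nn.GRU(in_dim, hid, batch_first=True)
        self.fc = nn.Linear(hid, hid)

    def forward(self, x):               # x: (batch, 4 utterances, in_dim)
        _, h = self.rnn(self.norm(x))   # h: (1, batch, hid)
        return self.fc(h.squeeze(0))

class EmotionQNet(nn.Module):
    """Fusion: normalization -> recurrent -> activation -> 1 FC, then 3 FC layers."""
    def __init__(self, dims=(128, 128, 128), hid=64, n_emotions=6):
        super().__init__()
        self.branches = nn.ModuleList([ModalityBranch(d, hid) for d in dims])
        self.fuse_norm = nn.LayerNorm(3 * hid)
        self.fuse_rnn = nn.GRU(3 * hid, hid, batch_first=True)
        self.head = nn.Sequential(
            nn.ReLU(), nn.Linear(hid, hid),            # activation + 1 FC
            nn.Linear(hid, hid), nn.Linear(hid, hid),  # 3 output FC layers...
            nn.Linear(hid, n_emotions))                # ...ending in 6 scores

    def forward(self, video, text, audio):
        feats = [b(x) for b, x in zip(self.branches, (video, text, audio))]
        z = self.fuse_norm(torch.cat(feats, dim=-1)).unsqueeze(1)
        _, h = self.fuse_rnn(z)
        return self.head(h.squeeze(0))  # uncorrected Q table over the 6 types
```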
As shown in FIG. 2, the embodiment of the invention provides a real-time multimodal dialogue emotion analysis method based on reinforcement learning and domain knowledge, comprising the following steps:
step one, acquiring a multi-mode dialogue information database and statistical dialogue emotion domain knowledge. The method specifically comprises the following steps: the multi-modal dialogue database with good diversity is constructed, and the multi-modal dialogue database needs to have the characteristics of average sex proportion of the talkers, approximately uniform distribution of talking contents and emotion types and the like. After the database is determined, sampling is sequentially completed by taking a complete dialogue as a unit and taking the occurrence sequence of the dialogue as a unit to form a sample library, and the probability of occurrence of six emotion types under different bases is calculated by taking the sampling length as a unit and taking the corresponding three emotion types of the previous three sentences as the basis to generate domain knowledge K of dialogue emotion analysis.
Step two, construct the reinforcement learning algorithm model based on a recurrent neural network and train it using gradient descent and the backpropagation algorithm, as follows:
(1) Construct the reinforcement learning algorithm model based on a recurrent neural network according to FIG. 1, and initialize all parameters and weights with random numbers. The input multimodal information is represented as:
s(t)=[V(t),T(t),A(t)]
where t is the current sampling time, s(t) is the state information at the current sampling time, V(t) is the image information, T(t) is the text information, and A(t) is the speech information within the current sampling segment.
(2) Train the model on the multimodal dialogue information database; the multimodal information at sampling time t is passed through the normalization layer, recurrent neural network layer, activation layer, and fully connected layers to obtain the output layer result, given by:
action(t)=Q(s(t))
where Q is the reinforcement learning algorithm model based on a recurrent neural network constructed according to FIG. 1, and action(t) is the emotion type recognition result it outputs at the current sampling time; a reward function R is calculated by comparing action(t) with the true emotion type label(t). The loss function of the whole network is then obtained from the difference between the expected value and the estimated value, where the expected value eval is calculated as:
eval=Q(s(t+1))
The estimated value epet is calculated from the reward R and the network output, which yields the loss function loss:
loss=E[epet-eval]
where E is the expectation of epet - eval. Training of the model is completed by backpropagating the loss function loss.
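The formulas for the reward R and the estimated value epet appear only as images in the original filing and are not reproduced in this text. The sketch below therefore assumes a standard DQN-style temporal-difference form, with a ±1 reward for a correct/incorrect prediction and epet = R + γ·max Q(s(t+1)), reusing the EmotionQNet sketched earlier; these are assumptions, not the patent's exact definitions:

```python
# Hypothetical training step under the assumptions stated above.
import torch

def train_step(q_net, optimizer, s_t, s_t1, label_t, gamma=0.9):
    q_values = q_net(*s_t)                  # Q(s(t)) over the 6 emotion types
    action_t = q_values.argmax(dim=-1)      # predicted emotion type action(t)

    # Assumed reward: +1 if the prediction matches label(t), -1 otherwise.
    reward = (action_t == label_t).float() * 2.0 - 1.0

    with torch.no_grad():                   # target side uses the next state
        epet = reward + gamma * q_net(*s_t1).max(dim=-1).values

    eval_ = q_values.gather(1, action_t.unsqueeze(1)).squeeze(1)
    # The patent writes loss = E[epet - eval]; the squared difference is
    # used here so the objective has a well-defined minimum.
    loss = (epet - eval_).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()                         # backpropagation
    optimizer.step()                        # gradient descent update
    return loss.item()
```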
Step three, adopt the dialogues in the dataset that were not used for training as test examples, perform real-time dialogue emotion analysis with the reinforcement learning model based on the recurrent neural network, output the probability of each of the six emotion types, and correct the recognition result with the domain knowledge: the output probability values are added to the corresponding domain knowledge entries to obtain the final classification result. The specific process is as follows:
(1) Taking each dialogue as a unit, sample sequentially in the order in which the dialogue occurs, and recognize with the reinforcement learning model based on the recurrent neural network;
(2) Normalize the recognition result and correct it with the domain knowledge to obtain the final recognition result.
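A minimal sketch of the correction step is given below, reusing the EMOTIONS list and the domain knowledge K from the earlier sketches; using softmax for the normalization is an assumption, since the patent only states that the recognition result is normalized:

```python
# Sketch of the domain-knowledge correction for one sampling segment:
# normalize the Q table, add the matching entry of K, take the argmax.
import torch
import torch.nn.functional as F

def correct_with_domain_knowledge(q_values, context_emotions, K):
    """q_values: tensor of 6 Q-values; context_emotions: labels of the
    3 preceding sentences; K: dict from build_domain_knowledge."""
    probs = F.softmax(q_values, dim=-1)           # normalized recognition result
    prior = torch.tensor([K[tuple(context_emotions)][e] for e in EMOTIONS])
    corrected = probs + prior                     # add output probabilities and K
    return EMOTIONS[corrected.argmax().item()]    # final emotion type
```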
As shown in FIG. 3, the solid black line is the test result of the method of the invention, and the dashed lines are the results of other existing methods. The abscissa is the dialogue length, measured in whole sentences per speaker; the dialogue length grows as the dialogue proceeds, and the maximum length in the tested database is 50, i.e., at most 50 speaker turns. The ordinate is the recognition accuracy, in the range [0, 1]. The figure shows, first, that only the method of the invention can dynamically recognize the user's emotional tendency in real time as the dialogue proceeds, a capability the other methods lack; second, that the accuracy of the method exceeds that of existing methods for dialogue lengths up to 35. Beyond 35, the number of measurable dialogues of that length in the database drops sharply, so the results oscillate, but the average accuracy remains higher than that of existing methods, demonstrating the effectiveness of the method.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the spirit and scope of the invention.

Claims (1)

1. A real-time multimodal dialogue emotion analysis method based on reinforcement learning and domain knowledge, characterized in that the adopted model comprises 12 layers: the first layer is an input layer, the middle 10 layers are hidden layers (comprising 2 recurrent neural network layers, 2 normalization layers, 1 activation layer, and 5 fully connected layers), and the last layer is an output layer; the three-modality information (image, text, and speech) of the current dialogue sampling segment is input, and single-modality feature processing is performed first; the image, text, and speech feature processing layers each comprise a normalization layer, a recurrent neural network layer, and a fully connected layer; the three modalities are then fused through a normalization layer, a recurrent neural network layer, an activation layer, and one fully connected layer, and finally three fully connected layers output the results; the network output is the probability of each emotion type for the last sentence of dialogue information in the sampling segment; the emotion types include 6 types: happiness, excitement, frustration, sadness, anger, and neutral; the sampling segment comprises 4 consecutive sentences of dialogue; the method comprises the following steps:
step 1: acquiring a multimodal dialogue information database, and generating dialogue emotion domain knowledge from the database;
step 2: building the reinforcement learning model based on a recurrent neural network and training the model;
1) Representing the input multimodal information as:
s(t)=[V(t),T(t),A(t)]
where t is the current sampling time, s(t) is the state information at the current sampling time, V(t) is the image information, T(t) is the text information, and A(t) is the speech information within the current sampling segment;
2) training the model on the multimodal dialogue information database; the multimodal information at sampling time t is passed through the normalization layer, recurrent neural network layer, activation layer, and fully connected layers to obtain the output layer result, given by:
action(t)=Q(s(t))
where Q is the reinforcement learning algorithm model based on the recurrent neural network, and action(t) is the emotion type recognition result it outputs at the current sampling time; a reward function R is calculated by comparing action(t) with the true emotion type label(t); the loss function of the whole network is then obtained from the difference between the expected value and the estimated value, where the expected value eval is calculated as:
eval=Q(s(t+1))
the estimated value epet is calculated from the reward R and the network output, which yields the loss function loss:
loss=E[epet-eval]
where E is the expectation of epet-eval;
training the reinforcement learning model based on the recurrent neural network by gradient descent and the backpropagation algorithm;
step 3: collecting multimodal dialogue information in real time, sampling sequentially in the order in which the dialogue occurs, analyzing the emotion of the dialogue in real time with the reinforcement learning model based on the recurrent neural network trained in step 2, outputting the probability of each emotion type, and correcting the recognition result with the domain knowledge to obtain the final classification result.
CN202110222049.1A 2021-02-28 2021-02-28 Real-time multi-mode dialogue emotion analysis method based on reinforcement learning and domain knowledge Active CN112948554B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110222049.1A CN112948554B (en) 2021-02-28 2021-02-28 Real-time multi-mode dialogue emotion analysis method based on reinforcement learning and domain knowledge


Publications (2)

Publication Number Publication Date
CN112948554A (en) 2021-06-11
CN112948554B 2024-03-08

Family

ID=76246708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110222049.1A Active CN112948554B (en) 2021-02-28 2021-02-28 Real-time multi-mode dialogue emotion analysis method based on reinforcement learning and domain knowledge

Country Status (1)

Country Link
CN (1) CN112948554B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113592001B (en) * 2021-08-03 2024-02-02 西北工业大学 Multi-mode emotion recognition method based on deep canonical correlation analysis

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844442A (en) * 2016-12-16 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 Multi-modal Recognition with Recurrent Neural Network Image Description Methods based on FCN feature extractions
CN108764268A (en) * 2018-04-02 2018-11-06 华南理工大学 A kind of multi-modal emotion identification method of picture and text based on deep learning
CN108804611A (en) * 2018-05-30 2018-11-13 浙江大学 A kind of dialogue reply generation method and system based on self comment Sequence Learning
CN110610138A (en) * 2019-08-22 2019-12-24 西安理工大学 Facial emotion analysis method based on convolutional neural network
WO2020173133A1 (en) * 2019-02-27 2020-09-03 平安科技(深圳)有限公司 Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium
CN112163169A (en) * 2020-09-29 2021-01-01 海南大学 Multi-mode user emotion analysis method based on knowledge graph

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019011824A1 (en) * 2017-07-11 2019-01-17 Koninklijke Philips N.V. Multi-modal dialogue agent


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Thoma, M. et al. Learning Multimodal Transition Dynamics for Model-Based Reinforcement Learning. arXiv, 2017. *
He Jun, Zhang Caiqing, Li Xiaozhen, Zhang Dehai. A survey of multimodal fusion technology for deep learning. Computer Engineering, 2020(05). *
Liu Qiyuan et al. Multi-modal sentiment analysis based on context-enhanced LSTM. Computer Science, 2019, 46(11). *
Liu Jingjing, Wu Xiaofeng. Multimodal emotion recognition and spatial annotation based on long short-term memory networks. Journal of Fudan University (Natural Science), 2020(05). *
Lin Minhong et al. Multimodal sentiment analysis based on attention neural networks. Computer Science, 2020(S2). *

Also Published As

Publication number Publication date
CN112948554A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Chang et al. Learning representations of emotional speech with deep convolutional generative adversarial networks
CN110136749A (en) The relevant end-to-end speech end-point detecting method of speaker and device
CN110956953B (en) Quarrel recognition method based on audio analysis and deep learning
Senthilkumar et al. Speech emotion recognition based on Bi-directional LSTM architecture and deep belief networks
Sahu et al. Multi-Modal Learning for Speech Emotion Recognition: An Analysis and Comparison of ASR Outputs with Ground Truth Transcription.
CN111798874A (en) Voice emotion recognition method and system
CN112329438B (en) Automatic lie detection method and system based on domain countermeasure training
Zhou et al. ICRC-HIT: A deep learning based comment sequence labeling system for answer selection challenge
CN111401105B (en) Video expression recognition method, device and equipment
Parthasarathy et al. Predicting speaker recognition reliability by considering emotional content
Lin et al. DeepEmoCluster: A semi-supervised framework for latent cluster representation of speech emotions
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN112948554B (en) Real-time multi-mode dialogue emotion analysis method based on reinforcement learning and domain knowledge
Mu et al. Speech emotion recognition using convolutional-recurrent neural networks with attention model
Wu et al. The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge.
Atkar et al. Speech Emotion Recognition using Dialogue Emotion Decoder and CNN Classifier
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Whitehill et al. Whosecough: In-the-wild cougher verification using multitask learning
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
Parab et al. Stress and emotion analysis using IoT and deep learning
Maji et al. Multimodal emotion recognition based on deep temporal features using cross-modal transformer and self-attention
CN113707175A (en) Acoustic event detection system based on feature decomposition classifier and self-adaptive post-processing
CN113128284A (en) Multi-mode emotion recognition method and device
Febriansyah et al. SER: speech emotion recognition application based on extreme learning machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant