CN114020897A - Conversation emotion recognition method and related device - Google Patents
- Publication number
- Publication CN114020897A; application CN202111648205.7A
- Authority
- CN
- China
- Prior art keywords
- feature
- expressions
- context
- speaker
- emotion recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/3329 — Natural language query formulation or dialogue systems
- G06F16/35 — Clustering; Classification (information retrieval of unstructured textual data)
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
Abstract
The application discloses a conversation emotion recognition method comprising the following steps: extracting sentence features from dialogue data with a trained feature extractor to obtain a plurality of sentence feature data; modeling context feature expressions of the plurality of sentence feature data with a bidirectional LSTM to obtain a plurality of context feature expressions; extracting speaker features from the plurality of context feature expressions with a graph neural network using a layered stack graph convolution structure to obtain a plurality of speaker feature expressions; and classifying emotion based on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain an emotion recognition result. Because context features and speaker features are added to the emotion recognition process, the accuracy of conversation emotion recognition is improved. The application also discloses related devices, including a conversation emotion recognition apparatus, a server, and a computer-readable storage medium, which achieve the same beneficial effects.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a server, and a computer-readable storage medium for recognizing conversation emotion.
Background
With the spread of artificial intelligence applications, making machines more emotionally aware has become an urgent problem for both academia and industry. Within emotion recognition, one important direction is emotion recognition in conversation, because in real life people mainly convey emotion through dialogue: AI-based customer service is gradually shedding the traditionally cold and rigid scripted answers, but letting the machine understand the speaker's mood at each moment of a conversation remains a serious challenge. The dialogue-based emotion recognition task studies how a speaker's emotion changes over the course of a conversation and recognizes the emotion conveyed by each utterance.
In the related art, a graph neural network (GNN) is used to model the context relationships in a dialogue, and explicit relationship modeling can capture the individual characteristics of each speaker. GNN-based conversational emotion recognition schemes generally include four components: feature extraction, dialogue context modeling and expression, speaker modeling and expression, and feature fusion with emotion classification. However, the graph convolution operation used in conventional conversational emotion recognition has a simple structure and insufficient capability to extract discriminative information, which reduces the accuracy of conversational emotion recognition.
Therefore, how to improve the accuracy of conversational emotion recognition is a key issue that those skilled in the art are interested in.
Disclosure of Invention
The purpose of the present application is to provide a dialogue emotion recognition method, a dialogue emotion recognition device, a server, and a computer-readable storage medium, which are capable of sufficiently modeling discrimination information and improving the accuracy of dialogue emotion recognition.
In order to solve the above technical problem, the present application provides a dialog emotion recognition method, including:
adopting a trained feature extractor to extract sentence features from the dialogue data to obtain a plurality of sentence feature data;
performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
carrying out speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions;
and carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
Optionally, the speaker feature extraction is performed on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions, including:
carrying out graph construction processing based on the plurality of context feature expressions to obtain a conversation feature graph;
and carrying out speaker feature extraction on the conversation feature graph based on a layered stack network structure to obtain the plurality of speaker feature expressions.
Optionally, performing speaker feature extraction on the dialog feature graph based on a layered stack network structure to obtain the multiple speaker feature expressions, including:
obtaining a feature expression of a first layer in the hierarchical stacking network structure through relational graph convolution;
obtaining a feature expression of a second layer in the hierarchical stacked network structure through an attention graph convolutional network;
acquiring feature expression of a third layer, feature expression of a fourth layer and feature expression of a fifth layer in the hierarchical stacking network structure based on an attention mechanism;
and processing the feature expression of the third layer, the feature expression of the fourth layer and the feature expression of the fifth layer based on a multi-head attention mechanism to obtain the feature expressions of the multiple speakers.
Optionally, performing sentence feature extraction on the dialogue data by using a trained feature extractor to obtain a plurality of sentence feature data, including:
preprocessing the dialogue data to obtain processed dialogue data;
and adopting a trained feature extractor to extract sentence features from the processed dialogue data to obtain a plurality of sentence feature data.
Optionally, performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions, including:
using a bidirectional LSTM as a modeling tool;
and carrying out context information expression processing on the plurality of sentence characteristic data by using the modeling tool to obtain a plurality of context characteristic expressions.
Optionally, the training process of the feature extractor includes:
acquiring training set data;
and training the initial feature extractor by adopting the training set to obtain the trained feature extractor.
Optionally, performing emotion classification based on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain an emotion recognition result, including:
performing feature splicing on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain feature vectors;
and processing the feature vectors through a full connection layer to obtain the emotion recognition result.
The present application also provides a dialogue emotion recognition apparatus, including:
the feature extraction module is used for carrying out sentence feature extraction on the dialogue data by adopting the trained feature extractor to obtain a plurality of sentence feature data;
the context modeling module is used for carrying out context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
the speaker modeling module is used for extracting speaker characteristics of the plurality of context characteristic expressions based on the graph neural network and the layered stack graph convolution structure to obtain a plurality of speaker characteristic expressions;
and the emotion classification module is used for carrying out emotion classification on the basis of the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
The present application further provides a server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the dialog emotion recognition method as described above when executing the computer program.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, realizes the steps of the dialog emotion recognition method as described above.
The application provides a conversation emotion recognition method, which comprises the following steps: adopting a trained feature extractor to extract sentence features from the dialogue data to obtain a plurality of sentence feature data; performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions; carrying out speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions; and carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
The method first extracts a plurality of sentence feature data, then models them to obtain a plurality of context feature expressions and a plurality of speaker feature expressions, and finally performs emotion classification based on both to obtain the emotion recognition result.
The application also provides a conversation emotion recognition device, a server and a computer readable storage medium, which have the beneficial effects and are not described in detail herein.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of a method for emotion recognition in a dialog according to an embodiment of the present application;
FIG. 2 is a flowchart of another emotion recognition method for dialog provided in an embodiment of the present application;
FIG. 3 is a flowchart illustrating a training process of a method for emotion recognition in dialog according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a dialogue emotion recognition apparatus according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a conversation emotion recognition method, a conversation emotion recognition device, a server and a computer readable storage medium, so as to improve the accuracy of conversation emotion recognition.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related art, the graph neural network (GNN) is used to model the context relationships in a dialogue, and explicit relationship modeling can capture the individual characteristics of each speaker. GNN-based conversational emotion recognition schemes generally include four components: feature extraction, dialogue context modeling and expression, speaker modeling and expression, and feature fusion with emotion classification. However, the graph convolution operation used in conventional dialogue emotion recognition has a simple structure and insufficient capability to extract discriminative information, which reduces accuracy.
Therefore, the conversation emotion recognition method provided here first extracts a plurality of sentence feature data, then models them to obtain context feature expressions and speaker feature expressions, and finally classifies emotion based on both to obtain the emotion recognition result. Adding context features and speaker features to the emotion recognition process improves the accuracy of conversation emotion recognition.
The following describes a method for recognizing dialogue emotion according to an embodiment.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for emotion recognition in a dialog according to an embodiment of the present application.
In this embodiment, the method may include:
s101, sentence feature extraction is carried out on the dialogue data by adopting a trained feature extractor to obtain a plurality of sentence feature data;
it can be seen that this step is intended to perform sentence feature extraction on the conversational data using a trained feature extractor to obtain a plurality of sentence feature data.
That is, sentence-level initial expressions in the dialogue are extracted in a pre-training manner: a network consisting of a convolutional layer, a max-pooling layer, and a fully connected layer is constructed and pre-trained on a dataset with emotion annotations; the trained model then serves as the feature extractor that produces the initial, analyzable features.
Further, the step may include:
step 1, preprocessing dialogue data to obtain processed dialogue data;
and 2, carrying out sentence characteristic extraction on the processed dialogue data by adopting the trained characteristic extractor to obtain a plurality of sentence characteristic data.
It can be seen that the present alternative is mainly to illustrate how feature extraction is performed. In the alternative, the dialogue data is preprocessed to obtain processed dialogue data, and a trained feature extractor is used for sentence feature extraction of the processed dialogue data to obtain a plurality of sentence feature data.
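The pre-trained extractor described above (convolutional layer, max-pooling layer, fully connected layer over word vectors) can be sketched as a single forward pass. This is a minimal NumPy illustration under assumed toy dimensions and random weights, not the patent's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    """Valid 1-D convolution over a (seq_len, emb_dim) word-vector matrix.
    w: (kernel, emb_dim, n_filters), b: (n_filters,). ReLU applied."""
    k = w.shape[0]
    out = np.stack([x[i:i + k].reshape(-1) @ w.reshape(k * x.shape[1], -1) + b
                    for i in range(x.shape[0] - k + 1)])
    return np.maximum(out, 0.0)

def extract_sentence_feature(word_vecs, w, b, w_fc, b_fc):
    h = conv1d(word_vecs, w, b)           # (seq_len-k+1, n_filters)
    pooled = h.max(axis=0)                # global max-pooling over positions
    return np.tanh(pooled @ w_fc + b_fc)  # fully connected layer -> sentence feature

# Toy setup: 6 words with 50-d (GloVe-like) vectors, 8 filters of width 3, 4-d output.
sent = rng.standard_normal((6, 50))
w, b = rng.standard_normal((3, 50, 8)) * 0.1, np.zeros(8)
w_fc, b_fc = rng.standard_normal((8, 4)) * 0.1, np.zeros(4)
feat = extract_sentence_feature(sent, w, b, w_fc, b_fc)
print(feat.shape)  # (4,)
```

In the patent's scheme this network would first be trained on an emotion-annotated dataset; the sketch only shows the inference-time feature extraction.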
S102, performing context feature expression modeling on the feature data of the sentences based on the bidirectional LSTM to obtain a plurality of context feature expressions;
on the basis of S101, the step aims to perform context feature expression modeling on a plurality of sentence feature data based on bidirectional LSTM, and obtain a plurality of context feature expressions.
This stage models the interaction information between different sentences across the whole dialogue. In the prior art, a bidirectional GRU (Gated Recurrent Unit) is used as the modeling tool in order to fully consider the emotional relations before and after each point in the conversation.
In this step, to improve modeling performance, a bidirectional LSTM (Long Short-Term Memory) is used instead.
Further, the step may include:
step 1, using a bidirectional LSTM as a modeling tool;
and 2, carrying out context information expression processing on the plurality of sentence characteristic data by using the modeling tool to obtain a plurality of context characteristic expressions.
It can be seen that this alternative mainly illustrates how context features are extracted: a bidirectional LSTM is used as the modeling tool, and the plurality of sentence feature data are processed into context information expressions, yielding the plurality of context feature expressions.
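The bidirectional-LSTM context modeling of S102 can be sketched as two passes of a from-scratch LSTM cell whose forward and backward hidden states are concatenated per utterance. The cell, weight shapes, and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM step; W maps [x; h] to the 4 stacked gate pre-activations."""
    z = np.concatenate([x, h]) @ W
    d = h.size
    i, f, g, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), np.tanh(z[2*d:3*d]), sigmoid(z[3*d:])
    c = f * c + i * g
    return o * np.tanh(c), c

def bi_lstm(xs, W_fwd, W_bwd, d):
    """Concatenate forward and backward hidden states for every utterance.
    Boundary states are zero-initialized, as in the patent's description."""
    h = c = np.zeros(d)
    fwd = []
    for x in xs:                      # left-to-right pass
        h, c = lstm_step(x, h, c, W_fwd)
        fwd.append(h)
    h = c = np.zeros(d)
    bwd = []
    for x in reversed(xs):            # right-to-left pass
        h, c = lstm_step(x, h, c, W_bwd)
        bwd.append(h)
    return [np.concatenate([f, b]) for f, b in zip(fwd, reversed(bwd))]

# Five utterance features of dimension 4, hidden size 3 per direction.
utts = [rng.standard_normal(4) for _ in range(5)]
W_f = rng.standard_normal((4 + 3, 12)) * 0.1
W_b = rng.standard_normal((4 + 3, 12)) * 0.1
ctx = bi_lstm(utts, W_f, W_b, 3)
print(len(ctx), ctx[0].shape)  # 5 (6,)
```

Each `ctx[i]` plays the role of the context feature expression for utterance i.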
S103, speaker feature extraction is carried out on the plurality of context feature expressions based on the graph neural network and the layered stack graph convolution structure, and a plurality of speaker feature expressions are obtained;
on the basis of S102, the step aims to extract the speaker characteristics of a plurality of context characteristic expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker characteristic expressions.
The information modeled at this stage consists mainly of the characteristics of different speakers. A conversation may involve several speakers, and the emotion-related characteristics differ from speaker to speaker; to model these characteristics, researchers have proposed using a GNN to capture the differences. First, a bidirectional graph structure is constructed from the conversation: each node represents the expression feature of one sentence, and edges are defined as connections between any two nodes; to reduce the amount of computation, a window is defined to limit the number of connections per node. Then, based on how the emotions of different speakers change during the conversation, the edges are divided into categories according to speaker dependency and the temporal order of the dialogue. Finally, features are extracted with a two-layer graph convolution operation.
Further, in the present embodiment, a discriminant feature extraction is further performed based on the convolution structure of the hierarchical stacked graph, so as to improve the accuracy of speaker feature extraction.
Further, the step may include:
step 1, carrying out graph construction processing based on a plurality of context feature expressions to obtain a conversation feature graph;
and 2, carrying out speaker feature extraction on the dialogue feature graph based on the layered stack network structure to obtain a plurality of speaker feature expressions.
It can be seen that this alternative mainly illustrates how speaker features are extracted: graph construction is performed on the plurality of context feature expressions to obtain a conversation feature graph, and speaker feature extraction is then performed on that graph with the layered stack network structure to obtain the plurality of speaker feature expressions.
Further, step 2 in the last alternative may include:
step 1, obtaining a feature expression of a first layer in a layered stack network structure through convolution of a relation graph;
step 2, acquiring feature expression of a second layer in the layered stack network structure through an attention graph convolutional network;
step 3, acquiring feature expression of a third layer, feature expression of a fourth layer and feature expression of a fifth layer in the layered stack network structure based on an attention mechanism;
and 4, processing the feature expression of the third layer, the feature expression of the fourth layer, and the feature expression of the fifth layer with a multi-head attention mechanism to obtain the plurality of speaker feature expressions.
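The final multi-head attention step can be sketched with plain scaled dot-product attention over the stacked layer outputs. The choice of queries from the fifth layer and keys/values from the third and fourth layers is a hypothetical reading of the aggregation, and all names and shapes are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, kv, n_heads):
    """Scaled dot-product attention with n_heads parallel heads.
    q: (n_q, d), kv: (n_kv, d); d must be divisible by n_heads.
    Keys double as values in this minimal sketch (no projection matrices)."""
    d = q.shape[1]
    dh = d // n_heads
    out = []
    for h in range(n_heads):
        qs, ks = q[:, h*dh:(h+1)*dh], kv[:, h*dh:(h+1)*dh]
        scores = softmax(qs @ ks.T / np.sqrt(dh))
        out.append(scores @ ks)
    return np.concatenate(out, axis=1)

# Hypothetical layer-3/4/5 outputs for one 5-utterance dialogue, fused so that
# queries come from layer 5 and keys/values from layers 3 and 4.
l3, l4, l5 = (rng.standard_normal((5, 8)) for _ in range(3))
fused = multi_head_attention(l5, np.vstack([l3, l4]), n_heads=2)
print(fused.shape)  # (5, 8)
```

A production version would add learned query/key/value/output projections per head.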
And S104, carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
On the basis of S103, the step aims to carry out emotion classification based on a plurality of context characteristic expressions and a plurality of speaker characteristic expressions to obtain an emotion recognition result. That is, the characteristics of the dialogue context modeling expression and the speaker modeling expression are fused and sent into a classifier consisting of a full connection layer with a softmax activation function for final emotion classification.
Further, the step may include:
step 1, performing feature splicing on a plurality of context feature expressions and a plurality of speaker feature expressions to obtain feature vectors;
and 2, processing the characteristic vectors through the full connection layer to obtain an emotion recognition result.
It can be seen that the present alternative is mainly illustrative of how sentiment classification may be performed. In the alternative scheme, a plurality of context feature expressions and a plurality of speaker feature expressions are subjected to feature splicing to obtain feature vectors, and the feature vectors are processed through a full connection layer to obtain emotion recognition results.
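The splice-then-classify step above admits a very small sketch: concatenate one utterance's context and speaker expressions, apply a fully connected layer, and normalize with softmax. The six-class output and all weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(ctx_feat, spk_feat, W, b):
    """Splice the context and speaker expressions of one utterance and map
    them through a fully connected layer with softmax to emotion scores."""
    v = np.concatenate([ctx_feat, spk_feat])   # feature splicing
    return softmax(v @ W + b)

# 6-d context feature + 4-d speaker feature -> 6 emotion classes (the class
# count is illustrative, not taken from the patent).
ctx, spk = rng.standard_normal(6), rng.standard_normal(4)
W, b = rng.standard_normal((10, 6)) * 0.1, np.zeros(6)
probs = classify(ctx, spk, W, b)
print(probs.shape)  # (6,)
```

The predicted emotion is then `probs.argmax()`; during training the same output feeds a cross-entropy loss.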
In addition, the method in this embodiment may further include:
step 1, acquiring training set data;
and 2, training the initial feature extractor by adopting a training set to obtain the trained feature extractor.
It can be seen that this alternative also illustrates how the training is performed. In this alternative, training set data is obtained, and the training set is used to train the initial feature extractor to obtain a trained feature extractor.
In summary, this embodiment first extracts a plurality of sentence feature data, then models them to obtain context feature expressions and speaker feature expressions, and finally classifies emotion based on both to obtain the emotion recognition result. Adding context features and speaker features to the emotion recognition process improves the accuracy of conversation emotion recognition.
The following further describes a method for recognizing dialog emotion according to a specific embodiment.
Referring to fig. 2, fig. 2 is a flowchart illustrating another emotion recognition method according to an embodiment of the present disclosure.
The embodiment provides a dialog emotion recognition method based on layered stack graph convolution. Firstly, extracting initial emotion feature information in a conversation sentence, then constructing an emotion recognition network based on layered stack graph convolution, carrying out supervised training on network parameters by using data labels, and after the training is finished, inputting data to carry out prediction output of emotion recognition.
The core of the technical scheme is the emotion recognition network based on layered stack graph convolution. As shown in FIG. 2, its backbone comprises a feature extraction layer, a dialogue context modeling expression layer, a speaker modeling expression layer, and a feature fusion and emotion classification layer. The feature extraction layer extracts single-sentence-level features from the input dialogue content; the dialogue context modeling expression layer then models the sentence-level features at the level of the dialogue context to obtain context-aware features; these enter the speaker modeling expression layer, which builds a graph-based layered stack graph convolution structure to model and express speaker characteristics; finally, the output features of the two expression layers are spliced, fused, and sent to a classifier to obtain the final emotion recognition result.
For the symbolic description of the emotion recognition task in a conversation, let $U = \{u_1, u_2, \dots, u_N\}$ denote a dialogue, where $N$ is the number of sentences in the dialogue, $u_i$ denotes the $i$-th sentence, and $s(u_i)$ denotes the speaker of sentence $u_i$. The emotion recognition task in conversation aims to build a model that predicts, for each $u_i$, the corresponding label $y_i$.
Based on the above description, the operation process in this embodiment may include:
and step 1, feature extraction. And extracting targeted emotion relevant features by adopting a pre-training strategy. Specifically, a simple network composed of a convolutional layer, a maximum pooling layer and a full-link layer is constructed, a word vector model GloVe processed by natural language is used as initialization of word features in an input sentence, emotion classification training is carried out on an emotion recognition data set, finally, the trained network is used as a feature extractor, and finally, the features of the full-link layer are extracted and used as output of a feature extraction stage. To reduce the use of symbols, it is still used hereTo represent the corresponding characteristics of the sentence.
Step 2, dialogue context modeling and expression. All sentences are treated equally here, with no distinction between speakers. A bidirectional LSTM serves as the modeling tool for expressing context information; the process can be written as

$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}\big(\overrightarrow{h}_{i-1}, u_i\big), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}\big(\overleftarrow{h}_{i+1}, u_i\big), \qquad c_i = \big[\overrightarrow{h}_i \,;\, \overleftarrow{h}_i\big]$$

where $c_i$ is the feature expression containing dialogue context information, the arrows index the two directions of the bidirectional LSTM, $u_i$ is a sentence in the dialogue, and in particular the boundary states $\overrightarrow{h}_0$ and $\overleftarrow{h}_{N+1}$ are initialized with all-zero vectors.
Step 3, speaker modeling and expression. In the conversation process, different speakers influence one another; at the same time, a speaker's own emotion tends to remain unchanged over short periods. The relationships among the different speakers are therefore modeled with a graph. The constructed graph is represented as a triple comprising the set of vertices in the graph, the set of edges in the graph, and the set of relationships between pairs of vertices.
The relationship between the vertices representing two sentences in a conversation can be defined from two aspects: the speakers the sentences correspond to, and the relative position of the two sentences in the conversation. Specifically, for a dialogue containing two speakers A and B, the speakers of any two chosen sentences fall into four cases: AA, BB, AB, BA, where AA means both sentences were spoken by speaker A, and the others follow analogously. The relative position in the dialogue can be defined as before or after. Combined, there are 8 possible relationship cases in total, represented by the integers 0 to 7. As for the edges of the graph, although any two sentences in a dialogue are in principle related, a window of preceding and following sentences is defined for reasons of computational cost: for a given vertex, only the edges connecting it to a fixed number of preceding vertices, a fixed number of following vertices, and itself are considered.
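The graph construction just described (8 speaker/position relation labels, edges restricted to a window of preceding and following sentences plus a self-loop) can be sketched as follows; the window sizes are illustrative hyper-parameters, and self-loops are labeled with the "after" variant of the same-speaker relation:

```python
def relation_id(speakers, i, j):
    """Relation label between sentences i and j: speaker pair
    (AA/BB/AB/BA for a two-speaker dialogue) combined with relative
    position (j before or after i), giving 8 labels, 0..7."""
    pair = {('A', 'A'): 0, ('B', 'B'): 1, ('A', 'B'): 2, ('B', 'A'): 3}
    base = pair[(speakers[i], speakers[j])]
    return base if j < i else base + 4

def build_edges(speakers, past=2, future=2):
    """Edges from each vertex to its `past` predecessors, `future`
    successors, and itself; window sizes are hyper-parameters chosen
    for computational cost in the described scheme."""
    n = len(speakers)
    edges = []
    for i in range(n):
        for j in range(max(0, i - past), min(n, i + future + 1)):
            edges.append((i, j, relation_id(speakers, i, j)))
    return edges

edges = build_edges(['A', 'B', 'A', 'B'])
```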
After the graph construction is completed, further discriminative feature extraction is carried out through the layered stacked graph convolution structure. In particular, the feature of each vertex in the graph is initialized with the context expression of its corresponding sentence, and the representation of the first layer in the hierarchical stack is obtained by a relational graph convolution operation:
where the neighbor set contains the sequence numbers of all vertices connected to the given vertex under a given relation, the normalization constant equals numerically the number of elements in that set, ReLU is the activation function and takes non-negative values, and the subscripted matrices denote the corresponding parameter matrices to be trained.
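The relational graph convolution of the first stack layer can be sketched as follows; dimensions and weights are illustrative stand-ins for the trained parameter matrices, and the per-relation normalization constant is the neighbor count, as described above:

```python
import random

random.seed(2)

DIM, RELS = 4, 8  # illustrative feature dimension; 8 relation labels

# one parameter matrix per relation, plus a self-connection matrix
W_rel  = [[[random.gauss(0, 0.3) for _ in range(DIM)] for _ in range(DIM)]
          for _ in range(RELS)]
W_self = [[random.gauss(0, 0.3) for _ in range(DIM)] for _ in range(DIM)]

def matvec(W, x):
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def rgcn_layer(feats, edges):
    """First-layer relational graph convolution:
    h_i' = ReLU( sum_r sum_{j in N_i^r} (1/|N_i^r|) W_r h_j + W_0 h_i )."""
    out = []
    for i in range(len(feats)):
        acc = matvec(W_self, feats[i])              # self connection
        by_rel = {}                                 # neighbors grouped by relation
        for (a, b, r) in edges:
            if a == i and b != i:
                by_rel.setdefault(r, []).append(b)
        for r, nbrs in by_rel.items():
            c = len(nbrs)                           # normalization constant |N_i^r|
            for j in nbrs:
                m = matvec(W_rel[r], feats[j])
                acc = [x + y / c for x, y in zip(acc, m)]
        out.append([max(0.0, v) for v in acc])      # ReLU -> non-negative values
    return out

feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]
edges = [(0, 1, 2), (1, 0, 6), (1, 2, 3), (2, 1, 7), (2, 3, 2), (3, 2, 6)]
h1 = rgcn_layer(feats, edges)
```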
The second layer uses an attention (AM) graph convolutional neural network controlled by a single parameter to dynamically aggregate the information around each vertex in the graph; the specific calculation process can be expressed as:
where the neighborhood set contains the sequence numbers of the vertices adjacent to the given vertex in the graph, the union operation on sets adds the vertex's connection to itself, and the cosine distance is computed between feature vectors, with the norm denoting the L2 norm. The normalized coefficient represents the degree of correlation between the features of two vertices, the single weight is the parameter this layer needs to learn, and the denominator performs the normalization calculation.
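The single-parameter cosine-attention aggregation of the second layer can be sketched as follows; the scalar `w` stands in for the layer's one learnable weight, and the neighborhood is taken in union with each vertex's self-loop:

```python
import math

def cosine(u, v):
    """Cosine similarity with L2 norms (guarded against zero vectors)."""
    nu = math.sqrt(sum(x * x for x in u)) or 1e-9
    nv = math.sqrt(sum(x * x for x in v)) or 1e-9
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def attention_conv(feats, neighbours, w=1.0):
    """Second stack layer: aggregate each vertex's neighbourhood (plus
    itself) with softmax weights over a w-scaled cosine similarity;
    `w` is the single parameter the layer learns."""
    out = []
    for i, h in enumerate(feats):
        idx = sorted(set(neighbours[i]) | {i})   # union with the self loop
        scores = [math.exp(w * cosine(h, feats[j])) for j in idx]
        z = sum(scores)                          # normalization denominator
        alphas = [s / z for s in scores]         # normalized correlations
        out.append([sum(a * feats[j][d] for a, j in zip(alphas, idx))
                    for d in range(len(h))])
    return out

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nbrs = {0: [1], 1: [0, 2], 2: [1]}
h2 = attention_conv(feats, nbrs)
```

Because the attention weights sum to one, each output is a convex combination of the neighborhood features.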
The next three layers all perform further aggregation of the features in the graph based on the Transformer calculation mode, with residual connections between the layers. Here TransConv denotes one layer of calculation with the Transformer as its main computation, and the three-layer stacked Transformer structure can then be formally described as follows:
where the outputs denote the full vertex output sets of the different convolution layers, for a total of five graph convolution layers. The TransConv computation process uses a multi-head attention mechanism involving query (retrieval) vertex features, key vertex features, and value vertex features; the query features are computed from the currently attended vertex, while the key and value features are computed from the neighborhood vertices. The three kinds of features are computed similarly:
To combine the vertex features of different neighborhoods by weighting, the weights are calculated from the query vertex features and the key vertex features:
where the subscript c is the index of the head in the multi-head attention mechanism, the number of heads is a fixed hyperparameter, and each attention head has its own vector dimension. After the weights are obtained, the aggregated value vertex features can be computed by weighting, and the outputs of the multi-head attention structure are spliced as follows:
where the operator denotes the vector splicing (concatenation) operation.
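The multi-head attention computation described above (query, key, and value features per vertex, per-head softmax weights, concatenation of the head outputs) can be sketched in pure Python; the dimensions, head count, and random projection matrices are illustrative stand-ins for the trained parameters:

```python
import math
import random

random.seed(3)

DIM, HEADS = 4, 2
DH = DIM // HEADS   # per-head vector dimension

def proj(seed):
    """A hypothetical projection matrix (trained in the real model)."""
    rnd = random.Random(seed)
    return [[rnd.gauss(0, 0.5) for _ in range(DIM)] for _ in range(DIM)]

Wq, Wk, Wv = proj(10), proj(11), proj(12)   # query/key/value projections

def matvec(W, x):
    return [sum(W[i][j] * x[j] for j in range(DIM)) for i in range(DIM)]

def multi_head(feats, neighbours):
    """Per head c: a_ij = softmax_j(q_i^c . k_j^c / sqrt(d_head)); the
    weighted value features are then concatenated over the heads."""
    out = []
    for i in range(len(feats)):
        q = matvec(Wq, feats[i])
        idx = sorted(set(neighbours[i]) | {i})
        ks = [matvec(Wk, feats[j]) for j in idx]
        vs = [matvec(Wv, feats[j]) for j in idx]
        heads = []
        for c in range(HEADS):
            lo, hi = c * DH, (c + 1) * DH
            scores = [math.exp(sum(q[d] * k[d] for d in range(lo, hi))
                               / math.sqrt(DH)) for k in ks]
            z = sum(scores)
            heads += [sum(s / z * v[d] for s, v in zip(scores, vs))
                      for d in range(lo, hi)]
        out.append(heads)   # concatenation of the head outputs
    return out

feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
attn = multi_head(feats, {0: [1], 1: [0, 2], 2: [1]})
```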
Subsequently, the output of the TransConv layer is obtained using a gated residual connection:
where the subscripted terms denote the weight matrices and biases to be trained, sigmoid and ReLU are activation functions, LayerNorm denotes the layer normalization operation, and the residual splicing operation is performed on the semicolon-separated features within the parentheses.
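The gated residual connection with layer normalization can be sketched as follows; the gate weights and feature values are illustrative, and a scalar sigmoid gate stands in for whatever gating dimensionality the trained model actually uses:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_norm(v, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    m = sum(v) / len(v)
    var = sum((x - m) ** 2 for x in v) / len(v)
    return [(x - m) / math.sqrt(var + eps) for x in v]

def gated_residual(h_in, h_attn, w_gate):
    """Gate g = sigmoid(w . [h_in ; h_attn]) mixes the layer input with
    the attention output, followed by layer normalization; `w_gate`
    stands in for the trained gate weights."""
    concat = h_in + h_attn                      # residual splicing [h_in ; h_attn]
    g = sigmoid(sum(w * x for w, x in zip(w_gate, concat)))
    mixed = [g * a + (1 - g) * b for a, b in zip(h_attn, h_in)]
    return layer_norm(mixed)

out = gated_residual([1.0, -1.0, 0.5], [0.2, 0.3, -0.1], [0.1] * 6)
```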
The above is the TransConv computation process used in the three Transformer layers; in the fifth and last layer, the vector splicing and the nonlinear ReLU mapping are removed:
where the remaining symbols have the same meaning as above. Through the above feature processing, the vertex features output by the final fifth layer are obtained.
Step 4, feature fusion and emotion classification. The feature fusion here splices the output of the graph convolution operations with the original sentence expression; the probability distribution over the different categories is then obtained through a fully connected layer, and the category index corresponding to the maximum probability is the output category:
where the subscripted terms denote the weight matrix and bias to be trained, the spliced vector is the final concatenated feature, which after a nonlinear mapping yields a dense feature; the final outputs are the class probability vector and the predicted category taken from it.
The parameters in the model can be updated using a back-propagation algorithm (based on SGD) by computing the cross-entropy loss against the true classes. Once the model converges, the parameter updates stop, the parameters are fixed as a feasible solution of the proposed algorithm, and testing on the test data can be carried out.
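The cross-entropy loss and a single SGD step can be illustrated on toy logits. Here, purely for illustration, the step is taken directly on the logits rather than on real network parameters, to show that the softmax cross-entropy gradient (the predicted distribution minus the one-hot label) reduces the loss:

```python
import math

def softmax(logits):
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])

def sgd_step(logits, label, lr=0.5):
    """One SGD step on the logits themselves (a toy stand-in for model
    parameters): grad = softmax(logits) - onehot(label)."""
    p = softmax(logits)
    grad = [pi - (1.0 if k == label else 0.0) for k, pi in enumerate(p)]
    return [x - lr * g for x, g in zip(logits, grad)]

logits = [0.5, -0.2, 0.1]
before = cross_entropy(logits, 0)
after = cross_entropy(sgd_step(logits, 0), 0)
```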
In the specific implementation, the process can be divided into three parts: training, validation, and testing. After the start, a pre-training network for emotion recognition is first built and trained (this mainly refers to step 1), yielding initial expressions of the sentences in a dialogue from which emotion features can be extracted. Then the training data and the model for layered stacked graph convolution dialogue emotion recognition are defined, and the model parameters are updated using the training data. If the model convergence condition is not met, the calculation and updating of the model parameters continue; if it is met, the testing stage begins: the test data are input, the model's computed output is produced, and the whole process ends. It should be noted that the model convergence condition here includes not only the training reaching a set number of iterations or the decrease of the training error stabilizing within a certain range; a threshold on the error between the predicted and true values may also be set, and training may be stopped when the model's error falls below the given threshold. For the model loss function, a cross-entropy loss function suitable for multi-classification is used, or other improved losses suitable for multi-class models. For parameter updating, the RMSProp (Root Mean Square Propagation) algorithm may be adopted, as may other gradient-based parameter optimization methods, including but not limited to Stochastic Gradient Descent (SGD), AdaGrad (Adaptive Gradient), Adam (Adaptive Moment Estimation), Adamax (a variant of Adam based on the infinity norm), ASGD (Averaged Stochastic Gradient Descent), RMSProp, and the like.
Referring to fig. 3, fig. 3 is a flowchart illustrating a training process of a method for emotion recognition in dialog according to an embodiment of the present application.
Based on the above description, a neural network is constructed according to the content of the present invention to perform emotion recognition, so as to describe the specific implementation of the present invention in detail. It should be noted that the embodiments described here only explain the present invention and do not limit it.
The multimodal emotion recognition data set IEMOCAP (Interactive Emotional Dyadic Motion Capture database) is downloaded; it contains two-person conversations annotated with emotion labels. The IEMOCAP data set contains 2199 video clips, which are divided proportionally into three parts: a training set (80%), a validation set (10%), and a test set (10%). Here the text of the speech transcriptions is used as the processed data, and the label of each sample sentence is defined as one of six emotion categories. Before the following operations, the data set is divided into the training set, validation set, and test set.
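The 80/10/10 split described above can be sketched as follows; the shuffling seed is arbitrary, and the integer items stand in for the 2199 clips:

```python
import random

def split_dataset(items, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle and split into train/validation/test by the given ratios."""
    rnd = random.Random(seed)
    items = items[:]                  # leave the caller's list untouched
    rnd.shuffle(items)
    n = len(items)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(list(range(2199)))
```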
According to step 1, a simple neural network is constructed as the feature extraction network and trained with the training data and sentence labels. After training, this network is used to extract the features of all the data, yielding the sentence features used in the subsequent steps.
The network structure is constructed according to the calculation methods of steps 2 to 4, with the number of heads in the multi-head attention set to 3; the training data are input and forward calculation is performed to obtain the emotion recognition output of the final model.
During training, the cross-entropy loss function described above measures the discrepancy between the model's predicted output and the label values in the data set.
According to the parameter optimization discussion above, a suitable optimization method is selected to update the parameters of the model according to the actual implementation situation. In this implementation, the parameters are updated using the Adam method.
During training, the parameters are first updated on the training set; after each epoch (one full pass over the training set) of parameter adjustment, the loss is computed on the validation set and recorded. The number of training epochs is set, here to 120, and the model with the minimum loss on the validation set is selected as the final trained model.
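The epoch loop with validation-based model selection can be sketched as follows; the one-parameter "model", its update rule, and the noisy validation loss are toy stand-ins used only to show the keep-the-best-on-validation logic:

```python
import random

random.seed(4)

EPOCHS = 120  # as in the described implementation

def train_one_epoch(param):
    # stand-in update: nudge the parameter toward an assumed optimum of 3.0
    return param + 0.1 * (3.0 - param)

def val_loss(param):
    # stand-in validation loss, minimized near param == 3.0, with noise
    return (param - 3.0) ** 2 + random.uniform(0, 1e-3)

param, best_param, best_loss, history = 0.0, None, float('inf'), []
for epoch in range(EPOCHS):
    param = train_one_epoch(param)           # update on the training set
    loss = val_loss(param)                   # then evaluate on validation
    history.append(loss)
    if loss < best_loss:                     # keep the model with the lowest
        best_loss, best_param = loss, param  # validation loss seen so far
```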
And inputting the test data into the trained model for forward calculation to obtain the final emotion recognition output.
Compared with existing conversational emotion recognition methods, the proposed layered stacked graph convolution method for conversational emotion recognition has the following significant advantages: it improves the discrimination of emotional expression using the layered stacked graph convolution structure; it fuses information from surrounding nodes in the dialogue graph using a single-parameter attention mechanism; and it performs global emotional feature aggregation through the self-attention computation of the Transformer structure.
In summary, in this embodiment, a plurality of sentence feature data are first extracted; modeling then yields the context feature expressions and the speaker feature expressions; and finally emotion classification based on the plurality of context feature expressions and the plurality of speaker feature expressions produces the emotion recognition result. Adding context features and speaker features to the emotion recognition process improves the accuracy of conversational emotion recognition.
In the following, the dialogue emotion recognition apparatus provided in the embodiment of the present application is introduced, and the dialogue emotion recognition apparatus described below and the dialogue emotion recognition method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a dialog emotion recognition device according to an embodiment of the present application.
In this embodiment, the apparatus may include:
a feature extraction module 100, configured to perform sentence feature extraction on the conversational data by using a trained feature extractor to obtain a plurality of sentence feature data;
the context modeling module 200 is configured to perform context feature expression modeling on a plurality of sentence feature data based on bidirectional LSTM to obtain a plurality of context feature expressions;
the speaker modeling module 300 is configured to perform speaker feature extraction on the multiple context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain multiple speaker feature expressions;
and the emotion classification module 400 is used for performing emotion classification based on the multiple context feature expressions and the multiple speaker feature expressions to obtain an emotion recognition result.
An embodiment of the present application further provides a server, including:
a memory for storing a computer program;
a processor for implementing the steps of the dialog emotion recognition method as described in the above embodiments when the computer program is executed.
The present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the dialog emotion recognition method according to the above embodiments.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The present application provides a method, apparatus, server, and computer-readable storage medium for emotion recognition in a dialog. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
Claims (10)
1. A conversation emotion recognition method is characterized by comprising the following steps:
adopting a trained feature extractor to extract sentence features from the dialogue data to obtain a plurality of sentence feature data;
performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
carrying out speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions;
and carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
2. The method for recognizing dialogue emotion according to claim 1, wherein the step of performing speaker feature extraction on the plurality of contextual feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions comprises:
carrying out graph construction processing based on the plurality of context feature expressions to obtain a conversation feature graph;
and carrying out speaker feature extraction on the conversation feature diagram based on a layered stack network structure to obtain the multiple speaker feature expressions.
3. The conversation emotion recognition method of claim 2, wherein performing speaker feature extraction on the conversation feature map based on the layered stack network structure to obtain the plurality of speaker feature expressions comprises:
obtaining a feature expression of a first layer in the hierarchical stacking network structure through relational graph convolution;
obtaining a feature representation of a second layer in the hierarchical stacked network structure by an attention-seeking convolutional neural network;
acquiring feature expression of a third layer, feature expression of a fourth layer and feature expression of a fifth layer in the hierarchical stacking network structure based on an attention mechanism;
and processing the feature expression of the third layer, the feature expression of the fourth layer and the feature expression of the fifth layer based on a multi-head attention mechanism to obtain the feature expressions of the multiple speakers.
4. The method of claim 1, wherein the sentence feature extraction is performed on the dialogue data by using a trained feature extractor to obtain a plurality of sentence feature data, and the method comprises:
preprocessing the dialogue data to obtain processed dialogue data;
and adopting a trained feature extractor to extract sentence features from the processed dialogue data to obtain a plurality of sentence feature data.
5. The method of recognizing dialogue emotion according to claim 1, wherein the modeling of context feature expression for the plurality of sentence feature data based on bidirectional LSTM to obtain a plurality of context feature expressions comprises:
using a bidirectional LSTM as a modeling tool;
and carrying out context information expression processing on the plurality of sentence characteristic data by using the modeling tool to obtain a plurality of context characteristic expressions.
6. The method for recognizing dialogue emotion according to claim 1, wherein the training process of the feature extractor comprises:
acquiring training set data;
and training the initial feature extractor by adopting the training set to obtain the trained feature extractor.
7. The method for recognizing dialogue emotion according to claim 1, wherein the emotion classification is performed based on the plurality of contextual characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result, and the method comprises:
performing feature splicing on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain feature vectors;
and processing the feature vectors through a full connection layer to obtain the emotion recognition result.
8. A conversational emotion recognition apparatus, comprising:
the feature extraction module is used for carrying out sentence feature extraction on the dialogue data by adopting the trained feature extractor to obtain a plurality of sentence feature data;
the context modeling module is used for carrying out context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
the speaker modeling module is used for extracting speaker characteristics of the plurality of context characteristic expressions based on the graph neural network and the layered stack graph convolution structure to obtain a plurality of speaker characteristic expressions;
and the emotion classification module is used for carrying out emotion classification on the basis of the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
9. A server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of conversational emotion recognition according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the dialog emotion recognition method according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111648205.7A CN114020897A (en) | 2021-12-31 | 2021-12-31 | Conversation emotion recognition method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114020897A true CN114020897A (en) | 2022-02-08 |
Family
ID=80069434
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111648205.7A Pending CN114020897A (en) | 2021-12-31 | 2021-12-31 | Conversation emotion recognition method and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114020897A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170372694A1 (en) * | 2016-06-23 | 2017-12-28 | Panasonic Intellectual Property Management Co., Ltd. | Dialogue act estimation method, dialogue act estimation apparatus, and storage medium |
CN111950275A (en) * | 2020-08-06 | 2020-11-17 | 平安科技(深圳)有限公司 | Emotion recognition method and device based on recurrent neural network and storage medium |
CN113609289A (en) * | 2021-07-06 | 2021-11-05 | 河南工业大学 | Multi-mode dialog text-based emotion recognition method |
CN113641822A (en) * | 2021-08-11 | 2021-11-12 | 哈尔滨工业大学 | Fine-grained emotion classification method based on graph neural network |
CN113656564A (en) * | 2021-07-20 | 2021-11-16 | 国网天津市电力公司 | Power grid service dialogue data emotion detection method based on graph neural network |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114913590A (en) * | 2022-07-15 | 2022-08-16 | 山东海量信息技术研究院 | Data emotion recognition method, device and equipment and readable storage medium |
CN115374281A (en) * | 2022-08-30 | 2022-11-22 | 重庆理工大学 | Session emotion analysis method based on multi-granularity fusion and graph convolution network |
CN115374281B (en) * | 2022-08-30 | 2024-04-05 | 重庆理工大学 | Session emotion analysis method based on multi-granularity fusion and graph convolution network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20220208 |