CN114020897A - Conversation emotion recognition method and related device - Google Patents


Info

Publication number
CN114020897A
CN114020897A
Authority
CN
China
Prior art keywords
feature
expressions
context
speaker
emotion recognition
Prior art date
Legal status
Pending
Application number
CN202111648205.7A
Other languages
Chinese (zh)
Inventor
鲁璐
李仁刚
赵雅倩
王斌强
董刚
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111648205.7A
Publication of CN114020897A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/35: Clustering; Classification
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks

Abstract

The application discloses a conversation emotion recognition method, which comprises the following steps: extracting sentence features from dialogue data with a trained feature extractor to obtain a plurality of sentence feature data; performing context feature expression modeling on the plurality of sentence feature data based on a bidirectional LSTM to obtain a plurality of context feature expressions; performing speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stacked graph convolution structure to obtain a plurality of speaker feature expressions; and performing emotion classification based on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain an emotion recognition result. Context features and speaker features are added in the emotion recognition process, so that the accuracy of conversation emotion recognition is improved. The application also discloses related devices, including a conversation emotion recognition apparatus, a server and a computer-readable storage medium, which achieve the same beneficial effects.

Description

Conversation emotion recognition method and related device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a server, and a computer-readable storage medium for recognizing conversation emotion.
Background
With the deployment of artificial intelligence applications, how to make machines more emotionally aware has become an urgent problem for both academia and industry. Within emotion recognition, one important direction is emotion recognition in conversation, because in real life people mainly convey emotion through dialogue: customer service based on artificial intelligence is gradually moving away from the traditional cold and rigid style of language, but during a conversation, letting the machine understand the mood of the speaker at each moment remains a serious challenge. The research content of the dialogue-based emotion recognition task is to study the emotional changes of speakers during a conversation and to recognize the emotion information contained in each spoken sentence.
In the related art, a Graph Neural Network (GNN) is used to model the context relationship in the dialogue process, and the individual characteristics of the speakers in the dialogue can be captured through explicit relationship modeling. GNN-based conversational emotion recognition schemes generally include four components: feature extraction, dialogue context modeling and expression, speaker modeling and expression, and feature fusion with emotion classification. However, the graph convolution operation used in conventional conversational emotion recognition has a simple structure and insufficient capability to extract discriminative information, which reduces the accuracy of conversational emotion recognition.
Therefore, how to improve the accuracy of conversational emotion recognition is a key issue of concern to those skilled in the art.
Disclosure of Invention
The purpose of the present application is to provide a dialogue emotion recognition method, a dialogue emotion recognition device, a server, and a computer-readable storage medium, which are capable of sufficiently modeling discrimination information and improving the accuracy of dialogue emotion recognition.
In order to solve the above technical problem, the present application provides a dialog emotion recognition method, including:
adopting a trained feature extractor to extract sentence features from the dialogue data to obtain a plurality of sentence feature data;
performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
carrying out speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions;
and carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
Optionally, the speaker feature extraction is performed on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions, including:
carrying out graph construction processing based on the plurality of context feature expressions to obtain a conversation feature graph;
and carrying out speaker feature extraction on the conversation feature graph based on a layered stack network structure to obtain the plurality of speaker feature expressions.
Optionally, performing speaker feature extraction on the dialog feature graph based on a layered stack network structure to obtain the multiple speaker feature expressions, including:
obtaining a feature expression of a first layer in the hierarchical stacking network structure through relational graph convolution;
obtaining a feature expression of a second layer in the hierarchical stacked network structure through an attention graph convolutional neural network;
acquiring feature expression of a third layer, feature expression of a fourth layer and feature expression of a fifth layer in the hierarchical stacking network structure based on an attention mechanism;
and processing the feature expression of the third layer, the feature expression of the fourth layer and the feature expression of the fifth layer based on a multi-head attention mechanism to obtain the feature expressions of the multiple speakers.
Optionally, performing sentence feature extraction on the dialogue data by using a trained feature extractor to obtain a plurality of sentence feature data, including:
preprocessing the dialogue data to obtain processed dialogue data;
and adopting a trained feature extractor to extract sentence features from the processed dialogue data to obtain a plurality of sentence feature data.
Optionally, performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions, including:
using a bidirectional LSTM as a modeling tool;
and carrying out context information expression processing on the plurality of sentence characteristic data by using the modeling tool to obtain a plurality of context characteristic expressions.
Optionally, the training process of the feature extractor includes:
acquiring training set data;
and training the initial feature extractor by adopting the training set to obtain the trained feature extractor.
Optionally, performing emotion classification based on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain an emotion recognition result, including:
performing feature splicing on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain feature vectors;
and processing the feature vectors through a full connection layer to obtain the emotion recognition result.
The present application also provides a dialogue emotion recognition apparatus, including:
the feature extraction module is used for carrying out sentence feature extraction on the dialogue data by adopting the trained feature extractor to obtain a plurality of sentence feature data;
the context modeling module is used for carrying out context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
the speaker modeling module is used for extracting speaker characteristics of the plurality of context characteristic expressions based on the graph neural network and the layered stack graph convolution structure to obtain a plurality of speaker characteristic expressions;
and the emotion classification module is used for carrying out emotion classification on the basis of the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
The present application further provides a server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the dialog emotion recognition method as described above when executing the computer program.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, realizes the steps of the dialog emotion recognition method as described above.
The application provides a conversation emotion recognition method, which comprises the following steps: adopting a trained feature extractor to extract sentence features from the dialogue data to obtain a plurality of sentence feature data; performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions; carrying out speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions; and carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
The method first extracts a plurality of sentence feature data, then models them to obtain a plurality of context feature expressions and a plurality of speaker feature expressions, and finally performs emotion classification based on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain an emotion recognition result. Context features and speaker features are added in the emotion recognition process, so that the accuracy of conversation emotion recognition is improved.
The application also provides a conversation emotion recognition device, a server and a computer readable storage medium, which have the beneficial effects and are not described in detail herein.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of a method for emotion recognition in a dialog according to an embodiment of the present application;
FIG. 2 is a flowchart of another emotion recognition method for dialog provided in an embodiment of the present application;
FIG. 3 is a flowchart illustrating a training process of a method for emotion recognition in dialog according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a dialogue emotion recognition apparatus according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a conversation emotion recognition method, a conversation emotion recognition device, a server and a computer readable storage medium, so as to improve the accuracy of conversation emotion recognition.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related art, the graph neural network (GNN) is used to model the context relationship in the dialogue process, and the individual characteristics of the speakers in the dialogue can be captured through explicit relationship modeling. GNN-based conversational emotion recognition schemes generally include four components: feature extraction, dialogue context modeling and expression, speaker modeling and expression, and feature fusion with emotion classification. However, the graph convolution operation used in conventional dialogue emotion recognition has a simple structure and insufficient capability to extract discriminative information, which reduces the accuracy of dialogue emotion recognition.
Therefore, in the conversation emotion recognition method provided by the present application, a plurality of sentence feature data are first extracted; a plurality of context feature expressions and a plurality of speaker feature expressions are then obtained by modeling; and finally emotion classification is performed based on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain the emotion recognition result. Context features and speaker features are added in the emotion recognition process, which improves the accuracy of conversation emotion recognition.
The following describes a method for recognizing dialogue emotion according to an embodiment.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for emotion recognition in a dialog according to an embodiment of the present application.
In this embodiment, the method may include:
s101, sentence feature extraction is carried out on the dialogue data by adopting a trained feature extractor to obtain a plurality of sentence feature data;
it can be seen that this step is intended to perform sentence feature extraction on the conversational data using a trained feature extractor to obtain a plurality of sentence feature data.
That is, the sentence-level initial expressions in the dialogue are extracted in a pre-training manner: a network consisting of a convolutional layer, a max-pooling layer and a fully connected layer is constructed and pre-trained on a data set with emotion annotations, and the trained model is then used as the feature extractor for initial feature extraction, producing initial features that can be analyzed.
Further, the step may include:
step 1, preprocessing dialogue data to obtain processed dialogue data;
and 2, carrying out sentence characteristic extraction on the processed dialogue data by adopting the trained characteristic extractor to obtain a plurality of sentence characteristic data.
It can be seen that the present alternative is mainly to illustrate how feature extraction is performed. In the alternative, the dialogue data is preprocessed to obtain processed dialogue data, and a trained feature extractor is used for sentence feature extraction of the processed dialogue data to obtain a plurality of sentence feature data.
S102, performing context feature expression modeling on the feature data of the sentences based on the bidirectional LSTM to obtain a plurality of context feature expressions;
on the basis of S101, the step aims to perform context feature expression modeling on a plurality of sentence feature data based on bidirectional LSTM, and obtain a plurality of context feature expressions.
In the related art, the interaction information between different sentences in the whole dialogue is modeled; to fully consider the emotional relations before and after an utterance in the conversation, a bidirectional GRU (Gated Recurrent Unit) is used as the modeling tool.
In this step, in order to improve the modeling performance, bidirectional LSTM (Long Short-Term Memory) is used for modeling.
Further, the step may include:
step 1, using a bidirectional LSTM as a modeling tool;
and 2, carrying out context information expression processing on the plurality of sentence characteristic data by using the modeling tool to obtain a plurality of context characteristic expressions.
It can be seen that the present alternative scheme mainly illustrates how to extract context features. In the alternative, the bidirectional LSTM is used as a modeling tool, and the modeling tool is used to perform context information expression processing on the plurality of sentence feature data to obtain a plurality of context feature expressions.
S103, speaker feature extraction is carried out on the plurality of context feature expressions based on the graph neural network and the layered stack graph convolution structure, and a plurality of speaker feature expressions are obtained;
on the basis of S102, the step aims to extract the speaker characteristics of a plurality of context characteristic expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker characteristic expressions.
The information modeled at this stage is mainly the characteristics of the different speakers: a conversation may involve several speakers, and the emotion-related characteristics of different speakers differ. To model such characteristics, researchers have proposed using a GNN to capture these difference characteristics. First, a bidirectional graph structure is constructed from the conversation, where each node in the graph represents the expression feature corresponding to a certain sentence in the conversation and the edges are defined as connections between any two nodes; to reduce the amount of computation, windows are defined to limit the number of connections between nodes. Then, considering that the emotions of different speakers in the conversation change, the edges in the graph are divided into different categories according to the speaker dependency relationship and the temporal order of the conversation; finally, feature extraction is performed using a two-layer graph convolution operation.
Further, in the present embodiment, discriminative feature extraction is additionally performed based on the layered stacked graph convolution structure, so as to improve the accuracy of speaker feature extraction.
Further, the step may include:
step 1, carrying out graph construction processing based on a plurality of context feature expressions to obtain a conversation feature graph;
and 2, carrying out speaker feature extraction on the dialogue feature graph based on the layered stack network structure to obtain a plurality of speaker feature expressions.
It can be seen that the present alternative is primarily illustrative of how speaker characteristics can be extracted. In the alternative, graph construction processing is performed based on a plurality of context feature expressions to obtain a conversation feature graph, and speaker feature extraction is performed on the conversation feature graph based on a layered stacking network structure to obtain a plurality of speaker feature expressions.
Further, step 2 in the last alternative may include:
step 1, obtaining a feature expression of a first layer in a layered stack network structure through convolution of a relation graph;
step 2, acquiring the feature expression of a second layer in the layered stack network structure through an attention graph convolutional neural network;
step 3, acquiring feature expression of a third layer, feature expression of a fourth layer and feature expression of a fifth layer in the layered stack network structure based on an attention mechanism;
and 4, processing the feature expression of the third layer, the feature expression of the fourth layer and the feature expression of the fifth layer based on a multi-head attention mechanism to obtain a plurality of speaker feature expressions.
And S104, carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
On the basis of S103, the step aims to carry out emotion classification based on a plurality of context characteristic expressions and a plurality of speaker characteristic expressions to obtain an emotion recognition result. That is, the characteristics of the dialogue context modeling expression and the speaker modeling expression are fused and sent into a classifier consisting of a full connection layer with a softmax activation function for final emotion classification.
Further, the step may include:
step 1, performing feature splicing on a plurality of context feature expressions and a plurality of speaker feature expressions to obtain feature vectors;
and 2, processing the characteristic vectors through the full connection layer to obtain an emotion recognition result.
It can be seen that the present alternative is mainly illustrative of how sentiment classification may be performed. In the alternative scheme, a plurality of context feature expressions and a plurality of speaker feature expressions are subjected to feature splicing to obtain feature vectors, and the feature vectors are processed through a full connection layer to obtain emotion recognition results.
In addition, the method in this embodiment may further include:
step 1, acquiring training set data;
and 2, training the initial feature extractor by adopting a training set to obtain the trained feature extractor.
It can be seen that this alternative also illustrates how the training is performed. In this alternative, training set data is obtained, and the training set is used to train the initial feature extractor to obtain a trained feature extractor.
In summary, in the embodiment, the plurality of sentence characteristic data are extracted first, then the context characteristic expression and the plurality of speaker characteristic expressions are obtained through modeling, finally the emotion recognition result is obtained through emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions, the context characteristic and the speaker characteristic are added in the emotion recognition process, and the accuracy of conversation emotion recognition is improved.
The following further describes a method for recognizing dialog emotion according to a specific embodiment.
Referring to fig. 2, fig. 2 is a flowchart illustrating another emotion recognition method according to an embodiment of the present disclosure.
This embodiment provides a dialogue emotion recognition method based on layered stacked graph convolution. First, the initial emotion feature information in the conversation sentences is extracted; then an emotion recognition network based on layered stacked graph convolution is constructed, and the network parameters are trained in a supervised manner using the data labels; after training is finished, data can be input to produce the prediction output of emotion recognition.
The core of the technical scheme is the emotion recognition network based on layered stacked graph convolution. As shown in FIG. 2, the backbone framework of this network comprises a feature extraction layer, a dialogue context modeling and expression layer, a speaker modeling and expression layer, and a feature fusion and emotion classification layer. The feature extraction layer mainly extracts single-sentence-level features from the input dialogue content; the dialogue context modeling and expression layer then models the sentence-level features at the dialogue context level to obtain features with context; these features next enter the speaker modeling and expression layer, which builds a graph-based layered stacked graph convolution structure to model and express the speaker-specific information; finally, the output features of the two expression layers are concatenated, fused and sent to a classifier to obtain the final emotion recognition result.
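As an illustration of how these four layers are composed, a minimal PyTorch-style sketch of the backbone follows; the framework choice, sub-module interfaces and all names are assumptions made for illustration and are not taken from the patent.

```python
import torch.nn as nn

class HierStackGCNEmotionRecognizer(nn.Module):
    """Illustrative backbone: feature extraction, dialogue context modeling,
    speaker modeling (layered stacked graph convolution) and classification."""
    def __init__(self, extractor, context_encoder, speaker_encoder, classifier):
        super().__init__()
        self.extractor = extractor               # feature extraction layer
        self.context_encoder = context_encoder   # dialogue context modeling layer (BiLSTM)
        self.speaker_encoder = speaker_encoder   # speaker modeling layer (graph-based stack)
        self.classifier = classifier             # feature fusion and emotion classification layer

    def forward(self, token_ids, speakers):
        # token_ids: tokenized sentences of one dialogue; speakers: speaker id per sentence
        sentence_feats = self.extractor(token_ids)                      # sentence-level features
        context_feats = self.context_encoder(sentence_feats)            # context features
        speaker_feats = self.speaker_encoder(context_feats, speakers)   # speaker features
        return self.classifier(context_feats, speaker_feats)            # emotion prediction
```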
For the symbolic description of the emotion recognition task in a conversation, a dialogue is denoted as $U=\{u_1,u_2,\dots,u_N\}$, where $N$ represents the number of sentences contained in the dialogue, $u_i$ represents a sentence in the dialogue, and $s(u_i)$ represents the speaker corresponding to sentence $u_i$. The emotion recognition task in conversation aims to build a model that predicts the label $y_i$ corresponding to each $u_i$.
Based on the above description, the operation process in this embodiment may include:
and step 1, feature extraction. And extracting targeted emotion relevant features by adopting a pre-training strategy. Specifically, a simple network composed of a convolutional layer, a maximum pooling layer and a full-link layer is constructed, a word vector model GloVe processed by natural language is used as initialization of word features in an input sentence, emotion classification training is carried out on an emotion recognition data set, finally, the trained network is used as a feature extractor, and finally, the features of the full-link layer are extracted and used as output of a feature extraction stage. To reduce the use of symbols, it is still used here
Figure 509837DEST_PATH_IMAGE004
To represent the corresponding characteristics of the sentence.
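A minimal sketch of such a pre-trained sentence feature extractor is given below, assuming a PyTorch implementation with GloVe-initialized embeddings; the layer sizes, kernel sizes and class count are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class SentenceFeatureExtractor(nn.Module):
    """Conv + max-pool + fully-connected sentence encoder (illustrative sizes)."""
    def __init__(self, vocab_size, embed_dim=300, num_filters=100,
                 kernel_sizes=(3, 4, 5), feature_dim=100, num_classes=6,
                 glove_weights=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        if glove_weights is not None:            # initialize word features with GloVe vectors
            self.embedding.weight.data.copy_(glove_weights)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), feature_dim)
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)                       # (batch, embed, seq)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]   # max pooling over time
        features = self.fc(torch.cat(pooled, dim=1))                        # sentence features
        return features, self.classifier(features)                          # features + emotion logits
```

After pre-training on the emotion classification objective, only the first returned tensor (the fully connected layer features) would be used as the output of this stage.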
Step 2: dialogue context modeling and expression. All sentences are treated equally in dialogue context modeling, without distinguishing specific speakers. A bidirectional LSTM is used as the modeling tool to express the context information in the dialogue; the process can be expressed as:

$$\overrightarrow{g_i}=\overrightarrow{\mathrm{LSTM}}\big(u_i,\overrightarrow{g_{i-1}}\big),\qquad \overleftarrow{g_i}=\overleftarrow{\mathrm{LSTM}}\big(u_i,\overleftarrow{g_{i+1}}\big),\qquad g_i=\big[\overrightarrow{g_i};\overleftarrow{g_i}\big]$$

where $g_i$ denotes the feature expression containing the dialogue context information, the arrows index the two directions of the bidirectional LSTM, $u_i$ denotes a sentence in the dialogue, and in particular the boundary states $\overrightarrow{g_0}$ and $\overleftarrow{g_{N+1}}$ are initialized with all-zero vectors.
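The context modeling step can be sketched as follows, assuming PyTorch; the hidden dimension is an illustrative assumption, and the default zero initial LSTM states play the role of the all-zero boundary vectors.

```python
import torch.nn as nn

class DialogueContextEncoder(nn.Module):
    """Bidirectional LSTM over the sequence of sentence features of one dialogue."""
    def __init__(self, feature_dim=100, hidden_dim=100):
        super().__init__()
        self.bilstm = nn.LSTM(feature_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, sentence_features):        # (batch, N, feature_dim)
        # default zero initial states correspond to the all-zero boundary vectors
        context_features, _ = self.bilstm(sentence_features)
        return context_features                  # shape (batch, N, 2 * hidden_dim)
```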
Step 3: speaker modeling and expression. During a conversation, different speakers influence one another, while each speaker also tends to keep their own emotion stable over a short period of time; a graph is therefore used to model the relationships among the different speakers. The constructed graph is denoted $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{R})$, where $\mathcal{V}$ represents the set of vertices in the graph, $\mathcal{E}$ represents the set of edges in the graph, and $\mathcal{R}$ represents the set of relations between two vertices in the graph.

For two vertices $v_i$ and $v_j$ representing two sentences in the conversation, their relation $r_{ij}$ can be defined from two aspects: the speakers corresponding to the sentences, and the relative position of the two sentences in the conversation. Specifically, for a dialogue containing two speakers A and B, any two sentences may correspond to four speaker cases: AA, BB, AB, BA, where AA means both sentences were spoken by speaker A, and the other cases are analogous. The relative position relation in the dialogue can be defined as before or after. Combined, there are 8 possible relation cases in total, represented by the integers 0 to 7. For the edges in the graph, although any two sentences in a dialogue are in principle related, a forward and backward range is defined to limit the amount of computation: for a vertex $v_i$, only the edges connecting it to the $p$ preceding vertices $v_{i-p},\dots,v_{i-1}$, the $q$ following vertices $v_{i+1},\dots,v_{i+q}$, and itself are considered.
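A sketch of the graph construction under the assumptions above (a two-speaker dialogue with speakers labelled 'A' and 'B', illustrative window sizes p and q, and one possible 0 to 7 encoding of the eight relation types) could look like this:

```python
def build_dialogue_graph(speakers, p=10, q=10):
    """Construct dialogue-graph edges and relation ids (illustrative encoding).

    speakers: one speaker label per sentence, assumed to be 'A' or 'B'.
    Each vertex i is connected to its p preceding vertices, its q following
    vertices and itself; the relation id in 0..7 combines the speaker pair
    (AA, BB, AB, BA) with the relative position (before/after).
    """
    pair_index = {('A', 'A'): 0, ('B', 'B'): 1, ('A', 'B'): 2, ('B', 'A'): 3}
    edges, relations = [], []
    n = len(speakers)
    for i in range(n):
        for j in range(max(0, i - p), min(n, i + q + 1)):
            rel = pair_index[(speakers[i], speakers[j])] * 2 + (0 if j <= i else 1)
            edges.append((j, i))       # information flows from vertex j into vertex i
            relations.append(rel)
    return edges, relations
```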
After the graph construction is completed, further discriminative feature extraction is performed through the layered stacked graph convolution structure. Specifically, the feature of a vertex $v_i$ in the graph is initialized with the context feature expression $g_i$ of the corresponding sentence, and the representation $h_i^{(1)}$ of the first layer in the layered stack is obtained through a relational graph convolution operation:

$$h_i^{(1)}=\mathrm{ReLU}\Big(\sum_{r\in\mathcal{R}}\sum_{j\in N_i^{r}}\frac{1}{|N_i^{r}|}\,W_r^{(1)}g_j+W_0^{(1)}g_i\Big)$$

where $N_i^{r}$ denotes the set of indices of all vertices connected to vertex $v_i$ under relation $r$, the normalization constant $|N_i^{r}|$ is numerically the number of elements in that set, ReLU denotes the activation function taking non-negative values, and the subscripted $W$ denote the parameter matrices to be trained.
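A minimal PyTorch-style sketch of this first-layer relational graph convolution, using one weight matrix per relation and mean aggregation over each relation-specific neighborhood (module and argument names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RelationalGraphConv(nn.Module):
    """First-layer relational graph convolution: one weight matrix per relation,
    mean aggregation over each relation-specific neighborhood, plus a self loop."""
    def __init__(self, in_dim, out_dim, num_relations=8):
        super().__init__()
        self.rel_weights = nn.ModuleList(
            nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations))
        self.self_weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, edges, relations):
        # x: (N, in_dim) vertex features; edges: list of (src, dst); relations: parallel ids
        out = self.self_weight(x)
        edges = torch.as_tensor(edges, device=x.device)
        relations = torch.as_tensor(relations, device=x.device)
        for r, w in enumerate(self.rel_weights):
            mask = relations == r
            if not mask.any():
                continue
            src, dst = edges[mask, 0], edges[mask, 1]
            msgs = torch.zeros_like(out).index_add(0, dst, w(x[src]))   # sum messages per vertex
            counts = torch.bincount(dst, minlength=x.size(0)).clamp(min=1)
            out = out + msgs / counts.unsqueeze(1)                      # 1/|N_i^r| normalization
        return torch.relu(out)
```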
At the second layer, an attention (AM) graph convolutional neural network controlled by a single parameter is used to dynamically aggregate the information around each vertex in the graph; the specific calculation of its output $h_i^{(2)}$ can be expressed as:

$$\alpha_{ij}=\frac{\exp\big(\beta\cdot\cos(h_i^{(1)},h_j^{(1)})\big)}{\sum_{k\in\tilde N_i}\exp\big(\beta\cdot\cos(h_i^{(1)},h_k^{(1)})\big)},\qquad h_i^{(2)}=\sum_{j\in\tilde N_i}\alpha_{ij}\,h_j^{(1)}$$

where $\tilde N_i=N_i\cup\{i\}$ denotes the set of vertex indices in the neighborhood of vertex $i$ in the graph, the union operation adding the vertex's connection to itself; the cosine distance is $\cos(x,y)=\dfrac{x^{\top}y}{\lVert x\rVert_2\,\lVert y\rVert_2}$, where $\lVert\cdot\rVert_2$ denotes the L2 norm; $\alpha_{ij}$ denotes the normalized degree of correlation between the features of vertex $i$ and vertex $j$; $\beta$ is the single weight that this layer needs to learn; and the sum in the denominator performs the normalization.
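A sketch of this single-parameter attention aggregation, assuming PyTorch; the neighborhoods are passed in as index lists and the scalar beta is the layer's only learnable weight:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleParamAttentionConv(nn.Module):
    """Second-layer aggregation: cosine-similarity attention over each vertex's
    neighborhood (vertex itself included), scaled by one learnable parameter."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(1.0))   # the single learnable weight

    def forward(self, h, neighbors):
        # h: (N, D) vertex features; neighbors[i]: iterable of neighbor indices of vertex i
        rows = []
        for i, nbrs in enumerate(neighbors):
            idx = torch.tensor(sorted(set(nbrs) | {i}), device=h.device)
            sims = F.cosine_similarity(h[i].unsqueeze(0), h[idx], dim=1)
            alpha = F.softmax(self.beta * sims, dim=0)             # normalized correlations
            rows.append((alpha.unsqueeze(1) * h[idx]).sum(dim=0))  # weighted aggregation
        return torch.stack(rows)
```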
The next three layers all perform further aggregation of the features in the graph based on the Transformer computation pattern, with residual connections adopted between the layers. Here TransConv denotes one layer of computation whose main calculation mode is the Transformer; the three-layer stacked structure can then be formally described as:

$$H^{(3)}=\mathrm{TransConv}\big(H^{(2)}\big),\qquad H^{(4)}=\mathrm{TransConv}\big(H^{(3)}\big),\qquad H^{(5)}=\mathrm{TransConv}\big(H^{(4)}\big)$$

where $H^{(l)}=\{h_i^{(l)}\}$ denotes the set of all vertex outputs of the different convolution layers, for a total of five graph convolution layers. The TransConv computation with the multi-head attention mechanism involves query (retrieval) vertex features $q_i$, key vertex features $k_j$ and value vertex features $v_j$. The query vertex features are computed from the currently attended vertex, while the key and value vertex features are computed from the neighborhood vertices; the three are computed in a similar way:

$$q_i=W_q h_i^{(l)}+b_q,\qquad k_j=W_k h_j^{(l)}+b_k,\qquad v_j=W_v h_j^{(l)}+b_v$$

where the subscripted $W$ and $b$ denote the weight matrices and biases to be trained.

In order to weight and combine the vertex features of different neighborhoods, attention weights are calculated from the query vertex features and the key vertex features:

$$\alpha_{ij}^{c}=\frac{\exp\big(\langle q_i^{c},k_j^{c}\rangle/\sqrt{d}\big)}{\sum_{k\in\tilde N_i}\exp\big(\langle q_i^{c},k_k^{c}\rangle/\sqrt{d}\big)}$$

where the superscript $c$ indexes the heads in the multi-head attention mechanism, the number of heads is denoted by $C$, and $d$ denotes the vector dimension corresponding to each attention head. After the weights are obtained, the aggregated value vertex features can be computed by weighting, and the outputs of the multi-head attention structure are concatenated:

$$\hat h_i=\big\Vert_{c=1}^{C}\sum_{j\in\tilde N_i}\alpha_{ij}^{c}\,v_j^{c}$$

where $\Vert$ denotes the vector concatenation operation.

Subsequently, the output $h_i^{(l+1)}$ of the TransConv layer is obtained using a gated residual connection:

$$\gamma_i=\mathrm{sigmoid}\big(W_g\big[\hat h_i;h_i^{(l)}\big]+b_g\big),\qquad h_i^{(l+1)}=\mathrm{LayerNorm}\Big(\mathrm{ReLU}\big(\gamma_i\odot\hat h_i+(1-\gamma_i)\odot h_i^{(l)}\big)\Big)$$

where the subscripted $W$ and $b$ denote the weight matrix and bias to be trained, sigmoid and ReLU denote the activation functions, LayerNorm denotes the layer normalization operation, and $[\cdot;\cdot]$ denotes the residual concatenation of the features separated by a semicolon within the brackets.
The above is the TransConv computation used in the third and fourth layers; in the last, fifth layer, the vector concatenation and the non-linear mapping ReLU are removed:

$$\hat h_i=\frac{1}{C}\sum_{c=1}^{C}\sum_{j\in\tilde N_i}\alpha_{ij}^{c}\,v_j^{c},\qquad h_i^{(5)}=\mathrm{LayerNorm}\big(\gamma_i\odot\hat h_i+(1-\gamma_i)\odot h_i^{(4)}\big)$$

where the symbols have the same meaning as above. Through the above feature processing, the vertex feature output by the final fifth layer is denoted $h_i^{(5)}$.
Step 4: feature fusion and emotion classification. The feature fusion here concatenates the output $h_i^{(5)}$ of the graph convolution operations with the context expression $g_i$ of the original sentence; the probability distribution over the different classes is then obtained through fully connected layers, and the class index with the maximum probability is the output class:

$$z_i=\big[g_i;h_i^{(5)}\big],\qquad \tilde z_i=\mathrm{ReLU}\big(W_z z_i+b_z\big),\qquad P_i=\mathrm{softmax}\big(W_s\tilde z_i+b_s\big),\qquad \hat y_i=\arg\max_k P_i[k]$$

where the subscripted $W$ and $b$ denote the weight matrices and biases to be trained, $z_i$ denotes the final concatenated feature vector, $\tilde z_i$ is the dense feature after the non-linear mapping, $P_i$ denotes the class probability vector of the final output, and $\hat y_i$ is the predicted class of the final output.
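A sketch of the fusion and classification head under these definitions, assuming PyTorch; the dimensions and the six-class output are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionClassifier(nn.Module):
    """Concatenate context features g_i and speaker features h_i, then classify
    through fully connected layers with a softmax output."""
    def __init__(self, context_dim, speaker_dim, hidden_dim=100, num_classes=6):
        super().__init__()
        self.fc = nn.Linear(context_dim + speaker_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, context_feat, speaker_feat):
        z = torch.cat([context_feat, speaker_feat], dim=-1)   # feature splicing
        dense = F.relu(self.fc(z))                            # non-linear mapping
        probs = F.softmax(self.out(dense), dim=-1)            # class probability vector
        return probs, probs.argmax(dim=-1)                    # probabilities and predicted class
```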
Parameters in the model can be updated using a back-propagation algorithm (based on SGD) by computing the cross-entropy loss function against the real classes. When the model converges, the parameter updates end; the parameters are then fixed as a feasible solution of the proposed algorithm, and testing on the test data can be carried out.
In a specific implementation, the process can be divided into three parts: training, validation and testing. After the start, the pre-training network for emotion recognition is first trained (mainly referring to step 1), which yields initial sentence expressions in the dialogue from which emotion features can be extracted. Then the training data and the layered stacked graph convolution dialogue emotion recognition model are defined, and the model parameters are updated using the training data. If the model convergence condition is not met, the calculation and updating of the model parameters continue; if it is met, the testing stage is entered, the test data are input, the output of the model calculation is produced, and the whole process ends. It should be noted that the model convergence condition here includes not only the number of training iterations reaching a set value or the decrease of the training error stabilizing within a certain range; a threshold on the error between the predicted value and the true value may also be set, and when the model error is smaller than the given threshold, training may be stopped. For the model loss function, a cross-entropy loss function suitable for multi-class classification is used, or another improved loss suitable for multi-class models. For parameter updating, the RMSProp (Root Mean Square Propagation) algorithm may be adopted, and other gradient-based parameter optimization methods may also be used, including but not limited to Stochastic Gradient Descent (SGD), AdaGrad (Adaptive Gradient), Adam (Adaptive Moment Estimation), Adamax (a variant of Adam based on the infinity norm), ASGD (Averaged Stochastic Gradient Descent), RMSProp, and the like.
Referring to fig. 3, fig. 3 is a flowchart illustrating a training process of a method for emotion recognition in dialog according to an embodiment of the present application.
Based on the above description, a neural network is constructed according to the content of the present invention to perform emotion recognition, so as to describe the specific implementation of the present invention in detail. It should be noted that the embodiments described here are only for explaining the present invention and are not intended to limit it.
The multimodal emotion recognition data set IEMOCAP (Interactive Emotional Dyadic Motion Capture database) is downloaded; it contains two-person conversations annotated with emotion labels. The IEMOCAP data set contains 2199 conversation video clips, which are divided proportionally into three parts: a training set (80%), a validation set (10%) and a test set (10%). Here the text of the speech transcription is used as the processed data, and the label of each sample sentence is defined as one of six emotion categories. Before the following operations, the data set is divided into the training set, the validation set and the test set.
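An 80/10/10 split of the kind described here could be performed with a small helper such as the following; the seed and the helper name are illustrative assumptions.

```python
import random

def split_dataset(samples, seed=0):
    """80/10/10 split into training, validation and test sets (proportions from the text)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train, n_val = int(0.8 * len(samples)), int(0.1 * len(samples))
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```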
According to step 1, a simple neural network is constructed as the feature extraction network; the network is trained with the training data and the sentence labels, and after training it is used to extract the features of all data, obtaining the sentence features $u_i$ used in the subsequent process.
A network structure is constructed according to the calculation methods in steps 2 to 4, with the number of heads in the multi-head attention set to 3; the training data are input and forward computation is performed to obtain the emotion recognition output $\hat y_i$ of the final model.
During training, the cross-entropy loss function described above measures the difference between the model's predicted output and the label values in the data set.
A suitable optimization method is selected, according to the actual implementation situation, to update the parameters of the model that need updating. In this implementation, the parameters are updated using the Adam method.
During training, the parameters are first updated on the training set; after each epoch (one full pass over the whole training set), the loss is computed on the validation set and recorded. The number of training epochs is set here to 120. The model with the minimum loss on the validation set is selected as the final output of training.
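The training-and-selection procedure described here can be sketched as follows, assuming PyTorch, a model that returns class logits, and data loaders yielding (utterances, speakers, labels); the Adam optimizer and 120 epochs follow the text, while the learning rate and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, val_loader, epochs=120, lr=1e-4):
    """Supervised training with cross-entropy loss and validation-based selection."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state = float('inf'), None
    for epoch in range(epochs):
        model.train()
        for utterances, speakers, labels in train_loader:
            loss = F.cross_entropy(model(utterances, speakers), labels)
            optimizer.zero_grad()
            loss.backward()                      # back-propagate the cross-entropy loss
            optimizer.step()
        model.eval()
        with torch.no_grad():                    # record the loss on the validation set
            val_loss = sum(F.cross_entropy(model(u, s), y).item()
                           for u, s, y in val_loader) / len(val_loader)
        if val_loss < best_val:                  # keep the model with the minimum validation loss
            best_val = val_loss
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model
```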
And inputting the test data into the trained model for forward calculation to obtain the final emotion recognition output.
Compared with existing conversational emotion recognition methods, the proposed conversational emotion recognition method based on layered stacked graph convolution has the following significant advantages: the layered stacked graph convolution structure improves the discriminability of the emotion expression; a single-parameter attention mechanism fuses the information of surrounding nodes in the dialogue graph; and the self-attention computation of the Transformer structure performs global aggregation of emotional features.
In summary, in the embodiment, the plurality of sentence characteristic data are extracted first, then modeling is performed to obtain the context characteristic expression and the plurality of speaker characteristic expressions, finally emotion classification is performed based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain the emotion recognition result, the context characteristic and the speaker characteristic are added in the emotion recognition process, and the accuracy of conversation emotion recognition is improved.
In the following, the dialogue emotion recognition apparatus provided in the embodiment of the present application is introduced, and the dialogue emotion recognition apparatus described below and the dialogue emotion recognition method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a dialog emotion recognition device according to an embodiment of the present application.
In this embodiment, the apparatus may include:
a feature extraction module 100, configured to perform sentence feature extraction on the conversational data by using a trained feature extractor to obtain a plurality of sentence feature data;
the context modeling module 200 is configured to perform context feature expression modeling on a plurality of sentence feature data based on bidirectional LSTM to obtain a plurality of context feature expressions;
the speaker modeling module 300 is configured to perform speaker feature extraction on the multiple context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain multiple speaker feature expressions;
and the emotion classification module 400 is used for performing emotion classification based on the multiple context feature expressions and the multiple speaker feature expressions to obtain an emotion recognition result.
An embodiment of the present application further provides a server, including:
a memory for storing a computer program;
a processor for implementing the steps of the dialog emotion recognition method as described in the above embodiments when the computer program is executed.
The present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the dialog emotion recognition method according to the above embodiments.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The present application provides a method, apparatus, server, and computer-readable storage medium for emotion recognition in a dialog. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A conversation emotion recognition method is characterized by comprising the following steps:
adopting a trained feature extractor to extract sentence features from the dialogue data to obtain a plurality of sentence feature data;
performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
carrying out speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions;
and carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
2. The method for recognizing dialogue emotion according to claim 1, wherein the step of performing speaker feature extraction on the plurality of contextual feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions comprises:
carrying out graph construction processing based on the plurality of context feature expressions to obtain a conversation feature graph;
and carrying out speaker feature extraction on the conversation feature graph based on a layered stack network structure to obtain the plurality of speaker feature expressions.
3. The conversation emotion recognition method of claim 2, wherein performing speaker feature extraction on the conversation feature graph based on the layered stack network structure to obtain the plurality of speaker feature expressions comprises:
obtaining a feature expression of a first layer in the hierarchical stacking network structure through relational graph convolution;
obtaining a feature expression of a second layer in the hierarchical stacked network structure through an attention graph convolutional neural network;
acquiring feature expression of a third layer, feature expression of a fourth layer and feature expression of a fifth layer in the hierarchical stacking network structure based on an attention mechanism;
and processing the feature expression of the third layer, the feature expression of the fourth layer and the feature expression of the fifth layer based on a multi-head attention mechanism to obtain the feature expressions of the multiple speakers.
4. The method of claim 1, wherein the sentence feature extraction is performed on the dialogue data by using a trained feature extractor to obtain a plurality of sentence feature data, and the method comprises:
preprocessing the dialogue data to obtain processed dialogue data;
and adopting a trained feature extractor to extract sentence features from the processed dialogue data to obtain a plurality of sentence feature data.
5. The method of recognizing dialogue emotion according to claim 1, wherein the modeling of context feature expression for the plurality of sentence feature data based on bidirectional LSTM to obtain a plurality of context feature expressions comprises:
using a bidirectional LSTM as a modeling tool;
and carrying out context information expression processing on the plurality of sentence characteristic data by using the modeling tool to obtain a plurality of context characteristic expressions.
6. The method for recognizing dialogue emotion according to claim 1, wherein the training process of the feature extractor comprises:
acquiring training set data;
and training the initial feature extractor by adopting the training set to obtain the trained feature extractor.
7. The method for recognizing dialogue emotion according to claim 1, wherein the emotion classification is performed based on the plurality of contextual characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result, and the method comprises:
performing feature splicing on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain feature vectors;
and processing the feature vectors through a full connection layer to obtain the emotion recognition result.
8. A conversational emotion recognition apparatus, comprising:
the feature extraction module is used for carrying out sentence feature extraction on the dialogue data by adopting the trained feature extractor to obtain a plurality of sentence feature data;
the context modeling module is used for carrying out context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
the speaker modeling module is used for extracting speaker characteristics of the plurality of context characteristic expressions based on the graph neural network and the layered stack graph convolution structure to obtain a plurality of speaker characteristic expressions;
and the emotion classification module is used for carrying out emotion classification on the basis of the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
9. A server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of conversational emotion recognition according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the dialog emotion recognition method according to any of claims 1 to 7.
CN202111648205.7A 2021-12-31 2021-12-31 Conversation emotion recognition method and related device Pending CN114020897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111648205.7A CN114020897A (en) 2021-12-31 2021-12-31 Conversation emotion recognition method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111648205.7A CN114020897A (en) 2021-12-31 2021-12-31 Conversation emotion recognition method and related device

Publications (1)

Publication Number Publication Date
CN114020897A true CN114020897A (en) 2022-02-08

Family

ID=80069434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111648205.7A Pending CN114020897A (en) 2021-12-31 2021-12-31 Conversation emotion recognition method and related device

Country Status (1)

Country Link
CN (1) CN114020897A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170372694A1 (en) * 2016-06-23 2017-12-28 Panasonic Intellectual Property Management Co., Ltd. Dialogue act estimation method, dialogue act estimation apparatus, and storage medium
CN111950275A (en) * 2020-08-06 2020-11-17 平安科技(深圳)有限公司 Emotion recognition method and device based on recurrent neural network and storage medium
CN113609289A (en) * 2021-07-06 2021-11-05 河南工业大学 Multi-mode dialog text-based emotion recognition method
CN113656564A (en) * 2021-07-20 2021-11-16 国网天津市电力公司 Power grid service dialogue data emotion detection method based on graph neural network
CN113641822A (en) * 2021-08-11 2021-11-12 哈尔滨工业大学 Fine-grained emotion classification method based on graph neural network

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913590A (en) * 2022-07-15 2022-08-16 山东海量信息技术研究院 Data emotion recognition method, device and equipment and readable storage medium
CN115374281A (en) * 2022-08-30 2022-11-22 重庆理工大学 Session emotion analysis method based on multi-granularity fusion and graph convolution network
CN115374281B (en) * 2022-08-30 2024-04-05 重庆理工大学 Session emotion analysis method based on multi-granularity fusion and graph convolution network

Similar Documents

Publication Publication Date Title
CN111275085B (en) Online short video multi-modal emotion recognition method based on attention fusion
KR102071582B1 (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN109299267B (en) Emotion recognition and prediction method for text conversation
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN111353029B (en) Semantic matching-based multi-turn spoken language understanding method
CN114020897A (en) Conversation emotion recognition method and related device
CN113220886A (en) Text classification method, text classification model training method and related equipment
Li et al. Learning fine-grained cross modality excitement for speech emotion recognition
CN111027292B (en) Method and system for generating limited sampling text sequence
CN114818703B (en) Multi-intention recognition method and system based on BERT language model and TextCNN model
CN111159375A (en) Text processing method and device
CN116168324A (en) Video emotion recognition method based on cyclic interaction transducer and dimension cross fusion
CN111046178A (en) Text sequence generation method and system
CN110992943B (en) Semantic understanding method and system based on word confusion network
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Chen et al. Sequence-to-sequence modelling for categorical speech emotion recognition using recurrent neural network
CN115497465A (en) Voice interaction method and device, electronic equipment and storage medium
CN115587337A (en) Method, device and storage medium for recognizing abnormal sound of vehicle door
CN111914553A (en) Financial information negative subject judgment method based on machine learning
CN114841151A (en) Medical text entity relation joint extraction method based on decomposition-recombination strategy
CN113239690A (en) Chinese text intention identification method based on integration of Bert and fully-connected neural network
CN116244474A (en) Learner learning state acquisition method based on multi-mode emotion feature fusion
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220208