CN114020897A - Conversation emotion recognition method and related device - Google Patents


Info

Publication number
CN114020897A
CN114020897A
Authority
CN
China
Prior art keywords
feature
expressions
context
speaker
emotion recognition
Prior art date
Legal status
Pending
Application number
CN202111648205.7A
Other languages
Chinese (zh)
Inventor
鲁璐
李仁刚
赵雅倩
王斌强
董刚
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111648205.7A
Publication of CN114020897A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/35: Clustering; Classification
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks

Abstract

The application discloses a conversation emotion recognition method, which comprises the following steps: extracting sentence features from dialogue data with a trained feature extractor to obtain a plurality of sentence feature data; performing context feature expression modeling on the plurality of sentence feature data based on a bidirectional LSTM to obtain a plurality of context feature expressions; performing speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stacked graph convolution structure to obtain a plurality of speaker feature expressions; and performing emotion classification based on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain an emotion recognition result. Context features and speaker features are added in the emotion recognition process, so that the accuracy of conversation emotion recognition is improved. The application also discloses related devices, including a conversation emotion recognition apparatus, a server and a computer-readable storage medium, which achieve the same beneficial effects.

Description

Conversation emotion recognition method and related device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a server, and a computer-readable storage medium for recognizing conversation emotion.
Background
With the deployment of artificial intelligence applications, how to make machines more emotionally aware has become an urgent problem for both academia and industry. Within emotion recognition, one important direction is emotion recognition in conversation, because in real life people mainly convey emotion through dialogue: customer service based on artificial intelligence is gradually moving away from the traditional cold and rigid style of language, but during a conversation, letting the machine understand the mood of the speaker at each moment remains a serious challenge. The research content of the dialogue-based emotion recognition task is to study the emotional changes of speakers during a conversation and to recognize the emotion information contained in each spoken sentence.
In the related art, a Graph Neural Network (GNN) is used to model the context relationship in the dialogue process, and the individual characteristics of the speakers in the dialogue can be captured through explicit relationship modeling. GNN-based conversational emotion recognition schemes generally include four components: feature extraction, dialogue context modeling and expression, speaker modeling and expression, and feature fusion with emotion classification. However, the graph convolution operation used in conventional conversational emotion recognition has a simple structure and insufficient capability to extract discriminative information, which reduces the accuracy of conversational emotion recognition.
Therefore, how to improve the accuracy of conversational emotion recognition is a key issue of concern to those skilled in the art.
Disclosure of Invention
The purpose of the present application is to provide a dialogue emotion recognition method, a dialogue emotion recognition device, a server, and a computer-readable storage medium, which are capable of sufficiently modeling discrimination information and improving the accuracy of dialogue emotion recognition.
In order to solve the above technical problem, the present application provides a dialog emotion recognition method, including:
adopting a trained feature extractor to extract sentence features from the dialogue data to obtain a plurality of sentence feature data;
performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
carrying out speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions;
and carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
Optionally, the speaker feature extraction is performed on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions, including:
carrying out graph construction processing based on the plurality of context feature expressions to obtain a conversation feature graph;
and carrying out speaker feature extraction on the conversation feature graph based on a layered stack network structure to obtain the plurality of speaker feature expressions.
Optionally, performing speaker feature extraction on the dialog feature graph based on a layered stack network structure to obtain the multiple speaker feature expressions, including:
obtaining a feature expression of a first layer in the hierarchical stacking network structure through relational graph convolution;
obtaining a feature expression of a second layer in the hierarchical stacked network structure through an attention graph convolutional neural network;
acquiring feature expression of a third layer, feature expression of a fourth layer and feature expression of a fifth layer in the hierarchical stacking network structure based on an attention mechanism;
and processing the feature expression of the third layer, the feature expression of the fourth layer and the feature expression of the fifth layer based on a multi-head attention mechanism to obtain the feature expressions of the multiple speakers.
Optionally, performing sentence feature extraction on the dialogue data by using a trained feature extractor to obtain a plurality of sentence feature data, including:
preprocessing the dialogue data to obtain processed dialogue data;
and adopting a trained feature extractor to extract sentence features from the processed dialogue data to obtain a plurality of sentence feature data.
Optionally, performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions, including:
using a bidirectional LSTM as a modeling tool;
and carrying out context information expression processing on the plurality of sentence characteristic data by using the modeling tool to obtain a plurality of context characteristic expressions.
Optionally, the training process of the feature extractor includes:
acquiring training set data;
and training the initial feature extractor by adopting the training set to obtain the trained feature extractor.
Optionally, performing emotion classification based on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain an emotion recognition result, including:
performing feature splicing on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain feature vectors;
and processing the feature vectors through a full connection layer to obtain the emotion recognition result.
The present application also provides a dialogue emotion recognition apparatus, including:
the feature extraction module is used for carrying out sentence feature extraction on the dialogue data by adopting the trained feature extractor to obtain a plurality of sentence feature data;
the context modeling module is used for carrying out context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
the speaker modeling module is used for extracting speaker characteristics of the plurality of context characteristic expressions based on the graph neural network and the layered stack graph convolution structure to obtain a plurality of speaker characteristic expressions;
and the emotion classification module is used for carrying out emotion classification on the basis of the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
The present application further provides a server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the dialog emotion recognition method as described above when executing the computer program.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, realizes the steps of the dialog emotion recognition method as described above.
The application provides a conversation emotion recognition method, which comprises the following steps: adopting a trained feature extractor to extract sentence features from the dialogue data to obtain a plurality of sentence feature data; performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions; carrying out speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions; and carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
The method first extracts a plurality of sentence feature data, then models them to obtain a plurality of context feature expressions and a plurality of speaker feature expressions, and finally performs emotion classification based on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain an emotion recognition result. Context features and speaker features are added in the emotion recognition process, so that the accuracy of conversation emotion recognition is improved.
The application also provides a conversation emotion recognition device, a server and a computer readable storage medium, which have the beneficial effects and are not described in detail herein.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of a method for emotion recognition in a dialog according to an embodiment of the present application;
FIG. 2 is a flowchart of another emotion recognition method for dialog provided in an embodiment of the present application;
FIG. 3 is a flowchart illustrating a training process of a method for emotion recognition in dialog according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a dialogue emotion recognition apparatus according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a conversation emotion recognition method, a conversation emotion recognition device, a server and a computer readable storage medium, so as to improve the accuracy of conversation emotion recognition.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related art, the graph neural network (GNN) is used to model the context relationship in the dialogue process, and the individual characteristics of the speakers in the dialogue can be captured through explicit relationship modeling. GNN-based conversational emotion recognition schemes generally include four components: feature extraction, dialogue context modeling and expression, speaker modeling and expression, and feature fusion with emotion classification. However, the graph convolution operation used in conventional dialogue emotion recognition has a simple structure and insufficient capability to extract discriminative information, which reduces the accuracy of dialogue emotion recognition.
Therefore, in the conversation emotion recognition method provided by the present application, a plurality of sentence feature data are first extracted; a plurality of context feature expressions and a plurality of speaker feature expressions are then obtained by modeling; and finally emotion classification is performed based on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain the emotion recognition result. Context features and speaker features are added in the emotion recognition process, which improves the accuracy of conversation emotion recognition.
The following describes a method for recognizing dialogue emotion according to an embodiment.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for emotion recognition in a dialog according to an embodiment of the present application.
In this embodiment, the method may include:
s101, sentence feature extraction is carried out on the dialogue data by adopting a trained feature extractor to obtain a plurality of sentence feature data;
it can be seen that this step is intended to perform sentence feature extraction on the conversational data using a trained feature extractor to obtain a plurality of sentence feature data.
That is, the sentence-level initial expressions in the dialogue are extracted in a pre-training manner: a network consisting of a convolutional layer, a max-pooling layer and a fully connected layer is constructed and pre-trained on a data set with emotion annotations, and the trained model is then used as the feature extractor for initial feature extraction, producing initial features that can be analyzed.
Further, the step may include:
step 1, preprocessing dialogue data to obtain processed dialogue data;
and 2, carrying out sentence characteristic extraction on the processed dialogue data by adopting the trained characteristic extractor to obtain a plurality of sentence characteristic data.
It can be seen that the present alternative is mainly to illustrate how feature extraction is performed. In the alternative, the dialogue data is preprocessed to obtain processed dialogue data, and a trained feature extractor is used for sentence feature extraction of the processed dialogue data to obtain a plurality of sentence feature data.
S102, performing context feature expression modeling on the feature data of the sentences based on the bidirectional LSTM to obtain a plurality of context feature expressions;
on the basis of S101, the step aims to perform context feature expression modeling on a plurality of sentence feature data based on bidirectional LSTM, and obtain a plurality of context feature expressions.
In the related art, the interaction information between different sentences in the whole dialogue is modeled; to fully consider the emotional relations before and after an utterance in the conversation, a bidirectional GRU (Gated Recurrent Unit) is used as the modeling tool.
In this step, in order to improve the modeling performance, bidirectional LSTM (Long Short-Term Memory) is used for modeling.
Further, the step may include:
step 1, using a bidirectional LSTM as a modeling tool;
and 2, carrying out context information expression processing on the plurality of sentence characteristic data by using the modeling tool to obtain a plurality of context characteristic expressions.
It can be seen that the present alternative scheme mainly illustrates how to extract context features. In the alternative, the bidirectional LSTM is used as a modeling tool, and the modeling tool is used to perform context information expression processing on the plurality of sentence feature data to obtain a plurality of context feature expressions.
S103, speaker feature extraction is carried out on the plurality of context feature expressions based on the graph neural network and the layered stack graph convolution structure, and a plurality of speaker feature expressions are obtained;
on the basis of S102, the step aims to extract the speaker characteristics of a plurality of context characteristic expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker characteristic expressions.
The information modeled at this stage is mainly the characteristics of the different speakers: a conversation may involve several speakers, and the emotion-related characteristics of different speakers differ. To model such characteristics, researchers have proposed using a GNN to capture these difference characteristics. First, a bidirectional graph structure is constructed from the conversation, where each node in the graph represents the expression feature corresponding to a certain sentence in the conversation and the edges are defined as connections between any two nodes; to reduce the amount of computation, windows are defined to limit the number of connections between nodes. Then, considering that the emotions of different speakers in the conversation change, the edges in the graph are divided into different categories according to the speaker dependency relationship and the temporal order of the conversation; finally, feature extraction is performed using a two-layer graph convolution operation.
Further, in the present embodiment, discriminative feature extraction is additionally performed based on the layered stacked graph convolution structure, so as to improve the accuracy of speaker feature extraction.
Further, the step may include:
step 1, carrying out graph construction processing based on a plurality of context feature expressions to obtain a conversation feature graph;
and 2, carrying out speaker feature extraction on the dialogue feature graph based on the layered stack network structure to obtain a plurality of speaker feature expressions.
It can be seen that the present alternative is primarily illustrative of how speaker characteristics can be extracted. In the alternative, graph construction processing is performed based on a plurality of context feature expressions to obtain a conversation feature graph, and speaker feature extraction is performed on the conversation feature graph based on a layered stacking network structure to obtain a plurality of speaker feature expressions.
Further, step 2 in the last alternative may include:
step 1, obtaining a feature expression of a first layer in a layered stack network structure through convolution of a relation graph;
step 2, acquiring the feature expression of a second layer in the layered stack network structure through an attention graph convolutional neural network;
step 3, acquiring feature expression of a third layer, feature expression of a fourth layer and feature expression of a fifth layer in the layered stack network structure based on an attention mechanism;
and 4, processing the feature expression of the third layer, the feature expression of the fourth layer and the feature expression of the fifth layer based on a multi-head attention mechanism to obtain a plurality of speaker feature expressions.
And S104, carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
On the basis of S103, the step aims to carry out emotion classification based on a plurality of context characteristic expressions and a plurality of speaker characteristic expressions to obtain an emotion recognition result. That is, the characteristics of the dialogue context modeling expression and the speaker modeling expression are fused and sent into a classifier consisting of a full connection layer with a softmax activation function for final emotion classification.
Further, the step may include:
step 1, performing feature splicing on a plurality of context feature expressions and a plurality of speaker feature expressions to obtain feature vectors;
and 2, processing the characteristic vectors through the full connection layer to obtain an emotion recognition result.
It can be seen that the present alternative is mainly illustrative of how sentiment classification may be performed. In the alternative scheme, a plurality of context feature expressions and a plurality of speaker feature expressions are subjected to feature splicing to obtain feature vectors, and the feature vectors are processed through a full connection layer to obtain emotion recognition results.
In addition, the method in this embodiment may further include:
step 1, acquiring training set data;
and 2, training the initial feature extractor by adopting a training set to obtain the trained feature extractor.
It can be seen that this alternative also illustrates how the training is performed. In this alternative, training set data is obtained, and the training set is used to train the initial feature extractor to obtain a trained feature extractor.
In summary, in the embodiment, the plurality of sentence characteristic data are extracted first, then the context characteristic expression and the plurality of speaker characteristic expressions are obtained through modeling, finally the emotion recognition result is obtained through emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions, the context characteristic and the speaker characteristic are added in the emotion recognition process, and the accuracy of conversation emotion recognition is improved.
The following further describes a method for recognizing dialog emotion according to a specific embodiment.
Referring to fig. 2, fig. 2 is a flowchart illustrating another emotion recognition method according to an embodiment of the present disclosure.
This embodiment provides a dialogue emotion recognition method based on layered stacked graph convolution. First, the initial emotion feature information in the conversation sentences is extracted; then an emotion recognition network based on layered stacked graph convolution is constructed, and the network parameters are trained in a supervised manner using the data labels; after training is finished, data can be input to produce the prediction output of emotion recognition.
The core of the technical scheme is the emotion recognition network based on layered stacked graph convolution. As shown in FIG. 2, the backbone framework of this network comprises a feature extraction layer, a dialogue context modeling and expression layer, a speaker modeling and expression layer, and a feature fusion and emotion classification layer. The feature extraction layer mainly extracts single-sentence-level features from the input dialogue content; the dialogue context modeling and expression layer then models the sentence-level features at the dialogue context level to obtain features with context; these features next enter the speaker modeling and expression layer, which builds a graph-based layered stacked graph convolution structure to model and express the speaker-specific information; finally, the output features of the two expression layers are concatenated, fused and sent to a classifier to obtain the final emotion recognition result.
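As an illustration of how these four layers are composed, a minimal PyTorch-style sketch of the backbone follows; the framework choice, sub-module interfaces and all names are assumptions made for illustration and are not taken from the patent.

```python
import torch.nn as nn

class HierStackGCNEmotionRecognizer(nn.Module):
    """Illustrative backbone: feature extraction, dialogue context modeling,
    speaker modeling (layered stacked graph convolution) and classification."""
    def __init__(self, extractor, context_encoder, speaker_encoder, classifier):
        super().__init__()
        self.extractor = extractor               # feature extraction layer
        self.context_encoder = context_encoder   # dialogue context modeling layer (BiLSTM)
        self.speaker_encoder = speaker_encoder   # speaker modeling layer (graph-based stack)
        self.classifier = classifier             # feature fusion and emotion classification layer

    def forward(self, token_ids, speakers):
        # token_ids: tokenized sentences of one dialogue; speakers: speaker id per sentence
        sentence_feats = self.extractor(token_ids)                      # sentence-level features
        context_feats = self.context_encoder(sentence_feats)            # context features
        speaker_feats = self.speaker_encoder(context_feats, speakers)   # speaker features
        return self.classifier(context_feats, speaker_feats)            # emotion prediction
```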
For the symbolic description of the emotion recognition task in a conversation, a dialogue is denoted as $U=\{u_1,u_2,\dots,u_N\}$, where $N$ represents the number of sentences contained in the dialogue, $u_i$ represents a sentence in the dialogue, and $s(u_i)$ represents the speaker corresponding to sentence $u_i$. The emotion recognition task in conversation aims to build a model that predicts the label $y_i$ corresponding to each $u_i$.
Based on the above description, the operation process in this embodiment may include:
and step 1, feature extraction. And extracting targeted emotion relevant features by adopting a pre-training strategy. Specifically, a simple network composed of a convolutional layer, a maximum pooling layer and a full-link layer is constructed, a word vector model GloVe processed by natural language is used as initialization of word features in an input sentence, emotion classification training is carried out on an emotion recognition data set, finally, the trained network is used as a feature extractor, and finally, the features of the full-link layer are extracted and used as output of a feature extraction stage. To reduce the use of symbols, it is still used here
Figure 509837DEST_PATH_IMAGE004
To represent the corresponding characteristics of the sentence.
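A minimal sketch of such a pre-trained sentence feature extractor is given below, assuming a PyTorch implementation with GloVe-initialized embeddings; the layer sizes, kernel sizes and class count are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class SentenceFeatureExtractor(nn.Module):
    """Conv + max-pool + fully-connected sentence encoder (illustrative sizes)."""
    def __init__(self, vocab_size, embed_dim=300, num_filters=100,
                 kernel_sizes=(3, 4, 5), feature_dim=100, num_classes=6,
                 glove_weights=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        if glove_weights is not None:            # initialize word features with GloVe vectors
            self.embedding.weight.data.copy_(glove_weights)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), feature_dim)
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)                       # (batch, embed, seq)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]   # max pooling over time
        features = self.fc(torch.cat(pooled, dim=1))                        # sentence features
        return features, self.classifier(features)                          # features + emotion logits
```

After pre-training on the emotion classification objective, only the first returned tensor (the fully connected layer features) would be used as the output of this stage.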
Step 2: dialogue context modeling and expression. All sentences are treated equally in dialogue context modeling, without distinguishing specific speakers. A bidirectional LSTM is used as the modeling tool to express the context information in the dialogue; the process can be expressed as:

$$\overrightarrow{g_i}=\overrightarrow{\mathrm{LSTM}}\big(u_i,\overrightarrow{g_{i-1}}\big),\qquad \overleftarrow{g_i}=\overleftarrow{\mathrm{LSTM}}\big(u_i,\overleftarrow{g_{i+1}}\big),\qquad g_i=\big[\overrightarrow{g_i};\overleftarrow{g_i}\big]$$

where $g_i$ denotes the feature expression containing the dialogue context information, the arrows index the two directions of the bidirectional LSTM, $u_i$ denotes a sentence in the dialogue, and in particular the boundary states $\overrightarrow{g_0}$ and $\overleftarrow{g_{N+1}}$ are initialized with all-zero vectors.
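The context modeling step can be sketched as follows, assuming PyTorch; the hidden dimension is an illustrative assumption, and the default zero initial LSTM states play the role of the all-zero boundary vectors.

```python
import torch.nn as nn

class DialogueContextEncoder(nn.Module):
    """Bidirectional LSTM over the sequence of sentence features of one dialogue."""
    def __init__(self, feature_dim=100, hidden_dim=100):
        super().__init__()
        self.bilstm = nn.LSTM(feature_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, sentence_features):        # (batch, N, feature_dim)
        # default zero initial states correspond to the all-zero boundary vectors
        context_features, _ = self.bilstm(sentence_features)
        return context_features                  # shape (batch, N, 2 * hidden_dim)
```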
Step 3: speaker modeling and expression. During a conversation, different speakers influence one another, while each speaker also tends to keep their own emotion stable over a short period of time; a graph is therefore used to model the relationships among the different speakers. The constructed graph is denoted $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{R})$, where $\mathcal{V}$ represents the set of vertices in the graph, $\mathcal{E}$ represents the set of edges in the graph, and $\mathcal{R}$ represents the set of relations between two vertices in the graph.

For two vertices $v_i$ and $v_j$ representing two sentences in the conversation, their relation $r_{ij}$ can be defined from two aspects: the speakers corresponding to the sentences, and the relative position of the two sentences in the conversation. Specifically, for a dialogue containing two speakers A and B, any two sentences may correspond to four speaker cases: AA, BB, AB, BA, where AA means both sentences were spoken by speaker A, and the other cases are analogous. The relative position relation in the dialogue can be defined as before or after. Combined, there are 8 possible relation cases in total, represented by the integers 0 to 7. For the edges in the graph, although any two sentences in a dialogue are in principle related, a forward and backward range is defined to limit the amount of computation: for a vertex $v_i$, only the edges connecting it to the $p$ preceding vertices $v_{i-p},\dots,v_{i-1}$, the $q$ following vertices $v_{i+1},\dots,v_{i+q}$, and itself are considered.
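A sketch of the graph construction under the assumptions above (a two-speaker dialogue with speakers labelled 'A' and 'B', illustrative window sizes p and q, and one possible 0 to 7 encoding of the eight relation types) could look like this:

```python
def build_dialogue_graph(speakers, p=10, q=10):
    """Construct dialogue-graph edges and relation ids (illustrative encoding).

    speakers: one speaker label per sentence, assumed to be 'A' or 'B'.
    Each vertex i is connected to its p preceding vertices, its q following
    vertices and itself; the relation id in 0..7 combines the speaker pair
    (AA, BB, AB, BA) with the relative position (before/after).
    """
    pair_index = {('A', 'A'): 0, ('B', 'B'): 1, ('A', 'B'): 2, ('B', 'A'): 3}
    edges, relations = [], []
    n = len(speakers)
    for i in range(n):
        for j in range(max(0, i - p), min(n, i + q + 1)):
            rel = pair_index[(speakers[i], speakers[j])] * 2 + (0 if j <= i else 1)
            edges.append((j, i))       # information flows from vertex j into vertex i
            relations.append(rel)
    return edges, relations
```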
After the graph construction is completed, further discriminative feature extraction is performed through the layered stacked graph convolution structure. Specifically, the feature of a vertex $v_i$ in the graph is initialized with the context feature expression $g_i$ of the corresponding sentence, and the representation $h_i^{(1)}$ of the first layer in the layered stack is obtained through a relational graph convolution operation:

$$h_i^{(1)}=\mathrm{ReLU}\Big(\sum_{r\in\mathcal{R}}\sum_{j\in N_i^{r}}\frac{1}{|N_i^{r}|}\,W_r^{(1)}g_j+W_0^{(1)}g_i\Big)$$

where $N_i^{r}$ denotes the set of indices of all vertices connected to vertex $v_i$ under relation $r$, the normalization constant $|N_i^{r}|$ is numerically the number of elements in that set, ReLU denotes the activation function taking non-negative values, and the subscripted $W$ denote the parameter matrices to be trained.
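A minimal PyTorch-style sketch of this first-layer relational graph convolution, using one weight matrix per relation and mean aggregation over each relation-specific neighborhood (module and argument names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RelationalGraphConv(nn.Module):
    """First-layer relational graph convolution: one weight matrix per relation,
    mean aggregation over each relation-specific neighborhood, plus a self loop."""
    def __init__(self, in_dim, out_dim, num_relations=8):
        super().__init__()
        self.rel_weights = nn.ModuleList(
            nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations))
        self.self_weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, edges, relations):
        # x: (N, in_dim) vertex features; edges: list of (src, dst); relations: parallel ids
        out = self.self_weight(x)
        edges = torch.as_tensor(edges, device=x.device)
        relations = torch.as_tensor(relations, device=x.device)
        for r, w in enumerate(self.rel_weights):
            mask = relations == r
            if not mask.any():
                continue
            src, dst = edges[mask, 0], edges[mask, 1]
            msgs = torch.zeros_like(out).index_add(0, dst, w(x[src]))   # sum messages per vertex
            counts = torch.bincount(dst, minlength=x.size(0)).clamp(min=1)
            out = out + msgs / counts.unsqueeze(1)                      # 1/|N_i^r| normalization
        return torch.relu(out)
```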
At the second layer, an attention (AM) graph convolutional neural network controlled by a single parameter is used to dynamically aggregate the information around each vertex in the graph; the specific calculation of its output $h_i^{(2)}$ can be expressed as:

$$\alpha_{ij}=\frac{\exp\big(\beta\cdot\cos(h_i^{(1)},h_j^{(1)})\big)}{\sum_{k\in\tilde N_i}\exp\big(\beta\cdot\cos(h_i^{(1)},h_k^{(1)})\big)},\qquad h_i^{(2)}=\sum_{j\in\tilde N_i}\alpha_{ij}\,h_j^{(1)}$$

where $\tilde N_i=N_i\cup\{i\}$ denotes the set of vertex indices in the neighborhood of vertex $i$ in the graph, the union operation adding the vertex's connection to itself; the cosine distance is $\cos(x,y)=\dfrac{x^{\top}y}{\lVert x\rVert_2\,\lVert y\rVert_2}$, where $\lVert\cdot\rVert_2$ denotes the L2 norm; $\alpha_{ij}$ denotes the normalized degree of correlation between the features of vertex $i$ and vertex $j$; $\beta$ is the single weight that this layer needs to learn; and the sum in the denominator performs the normalization.
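A sketch of this single-parameter attention aggregation, assuming PyTorch; the neighborhoods are passed in as index lists and the scalar beta is the layer's only learnable weight:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleParamAttentionConv(nn.Module):
    """Second-layer aggregation: cosine-similarity attention over each vertex's
    neighborhood (vertex itself included), scaled by one learnable parameter."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(1.0))   # the single learnable weight

    def forward(self, h, neighbors):
        # h: (N, D) vertex features; neighbors[i]: iterable of neighbor indices of vertex i
        rows = []
        for i, nbrs in enumerate(neighbors):
            idx = torch.tensor(sorted(set(nbrs) | {i}), device=h.device)
            sims = F.cosine_similarity(h[i].unsqueeze(0), h[idx], dim=1)
            alpha = F.softmax(self.beta * sims, dim=0)             # normalized correlations
            rows.append((alpha.unsqueeze(1) * h[idx]).sum(dim=0))  # weighted aggregation
        return torch.stack(rows)
```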
The next three layers all perform further aggregation of the features in the graph based on the Transformer computation pattern, with residual connections adopted between the layers. Here TransConv denotes one layer of computation whose main calculation mode is the Transformer; the three-layer stacked structure can then be formally described as:

$$H^{(3)}=\mathrm{TransConv}\big(H^{(2)}\big),\qquad H^{(4)}=\mathrm{TransConv}\big(H^{(3)}\big),\qquad H^{(5)}=\mathrm{TransConv}\big(H^{(4)}\big)$$

where $H^{(l)}=\{h_i^{(l)}\}$ denotes the set of all vertex outputs of the different convolution layers, for a total of five graph convolution layers. The TransConv computation with the multi-head attention mechanism involves query (retrieval) vertex features $q_i$, key vertex features $k_j$ and value vertex features $v_j$. The query vertex features are computed from the currently attended vertex, while the key and value vertex features are computed from the neighborhood vertices; the three are computed in a similar way:

$$q_i=W_q h_i^{(l)}+b_q,\qquad k_j=W_k h_j^{(l)}+b_k,\qquad v_j=W_v h_j^{(l)}+b_v$$

where the subscripted $W$ and $b$ denote the weight matrices and biases to be trained.

In order to weight and combine the vertex features of different neighborhoods, attention weights are calculated from the query vertex features and the key vertex features:

$$\alpha_{ij}^{c}=\frac{\exp\big(\langle q_i^{c},k_j^{c}\rangle/\sqrt{d}\big)}{\sum_{k\in\tilde N_i}\exp\big(\langle q_i^{c},k_k^{c}\rangle/\sqrt{d}\big)}$$

where the superscript $c$ indexes the heads in the multi-head attention mechanism, the number of heads is denoted by $C$, and $d$ denotes the vector dimension corresponding to each attention head. After the weights are obtained, the aggregated value vertex features can be computed by weighting, and the outputs of the multi-head attention structure are concatenated:

$$\hat h_i=\big\Vert_{c=1}^{C}\sum_{j\in\tilde N_i}\alpha_{ij}^{c}\,v_j^{c}$$

where $\Vert$ denotes the vector concatenation operation.

Subsequently, the output $h_i^{(l+1)}$ of the TransConv layer is obtained using a gated residual connection:

$$\gamma_i=\mathrm{sigmoid}\big(W_g\big[\hat h_i;h_i^{(l)}\big]+b_g\big),\qquad h_i^{(l+1)}=\mathrm{LayerNorm}\Big(\mathrm{ReLU}\big(\gamma_i\odot\hat h_i+(1-\gamma_i)\odot h_i^{(l)}\big)\Big)$$

where the subscripted $W$ and $b$ denote the weight matrix and bias to be trained, sigmoid and ReLU denote the activation functions, LayerNorm denotes the layer normalization operation, and $[\cdot;\cdot]$ denotes the residual concatenation of the features separated by a semicolon within the brackets.
The above is the TransConv computation used in the third and fourth layers; in the last, fifth layer, the vector concatenation and the non-linear mapping ReLU are removed:

$$\hat h_i=\frac{1}{C}\sum_{c=1}^{C}\sum_{j\in\tilde N_i}\alpha_{ij}^{c}\,v_j^{c},\qquad h_i^{(5)}=\mathrm{LayerNorm}\big(\gamma_i\odot\hat h_i+(1-\gamma_i)\odot h_i^{(4)}\big)$$

where the symbols have the same meaning as above. Through the above feature processing, the vertex feature output by the final fifth layer is denoted $h_i^{(5)}$.
Step 4: feature fusion and emotion classification. The feature fusion here concatenates the output $h_i^{(5)}$ of the graph convolution operations with the context expression $g_i$ of the original sentence; the probability distribution over the different classes is then obtained through fully connected layers, and the class index with the maximum probability is the output class:

$$z_i=\big[g_i;h_i^{(5)}\big],\qquad \tilde z_i=\mathrm{ReLU}\big(W_z z_i+b_z\big),\qquad P_i=\mathrm{softmax}\big(W_s\tilde z_i+b_s\big),\qquad \hat y_i=\arg\max_k P_i[k]$$

where the subscripted $W$ and $b$ denote the weight matrices and biases to be trained, $z_i$ denotes the final concatenated feature vector, $\tilde z_i$ is the dense feature after the non-linear mapping, $P_i$ denotes the class probability vector of the final output, and $\hat y_i$ is the predicted class of the final output.
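A sketch of the fusion and classification head under these definitions, assuming PyTorch; the dimensions and the six-class output are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionClassifier(nn.Module):
    """Concatenate context features g_i and speaker features h_i, then classify
    through fully connected layers with a softmax output."""
    def __init__(self, context_dim, speaker_dim, hidden_dim=100, num_classes=6):
        super().__init__()
        self.fc = nn.Linear(context_dim + speaker_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, context_feat, speaker_feat):
        z = torch.cat([context_feat, speaker_feat], dim=-1)   # feature splicing
        dense = F.relu(self.fc(z))                            # non-linear mapping
        probs = F.softmax(self.out(dense), dim=-1)            # class probability vector
        return probs, probs.argmax(dim=-1)                    # probabilities and predicted class
```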
Parameters in the model can be updated using a back-propagation algorithm (based on SGD) by computing the cross-entropy loss function against the real classes. When the model converges, the parameter updates end; the parameters are then fixed as a feasible solution of the proposed algorithm, and testing on the test data can be carried out.
In a specific implementation, the process can be divided into three parts: training, validation and testing. After the start, the pre-training network for emotion recognition is first trained (mainly referring to step 1), which yields initial sentence expressions in the dialogue from which emotion features can be extracted. Then the training data and the layered stacked graph convolution dialogue emotion recognition model are defined, and the model parameters are updated using the training data. If the model convergence condition is not met, the calculation and updating of the model parameters continue; if it is met, the testing stage is entered, the test data are input, the output of the model calculation is produced, and the whole process ends. It should be noted that the model convergence condition here includes not only the number of training iterations reaching a set value or the decrease of the training error stabilizing within a certain range; a threshold on the error between the predicted value and the true value may also be set, and when the model error is smaller than the given threshold, training may be stopped. For the model loss function, a cross-entropy loss function suitable for multi-class classification is used, or another improved loss suitable for multi-class models. For parameter updating, the RMSProp (Root Mean Square Propagation) algorithm may be adopted, and other gradient-based parameter optimization methods may also be used, including but not limited to Stochastic Gradient Descent (SGD), AdaGrad (Adaptive Gradient), Adam (Adaptive Moment Estimation), Adamax (a variant of Adam based on the infinity norm), ASGD (Averaged Stochastic Gradient Descent), RMSProp, and the like.
Referring to fig. 3, fig. 3 is a flowchart illustrating a training process of a method for emotion recognition in dialog according to an embodiment of the present application.
Based on the above description, a neural network is constructed according to the content of the present invention to perform emotion recognition, so as to describe the specific implementation of the present invention in detail. It should be noted that the embodiments described here are only for explaining the present invention and are not intended to limit it.
The multimodal emotion recognition data set IEMOCAP (Interactive Emotional Dyadic Motion Capture database) is downloaded; it contains two-person conversations annotated with emotion labels. The IEMOCAP data set contains 2199 conversation video clips, which are divided proportionally into three parts: a training set (80%), a validation set (10%) and a test set (10%). Here the text of the speech transcription is used as the processed data, and the label of each sample sentence is defined as one of six emotion categories. Before the following operations, the data set is divided into the training set, the validation set and the test set.
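An 80/10/10 split of the kind described here could be performed with a small helper such as the following; the seed and the helper name are illustrative assumptions.

```python
import random

def split_dataset(samples, seed=0):
    """80/10/10 split into training, validation and test sets (proportions from the text)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train, n_val = int(0.8 * len(samples)), int(0.1 * len(samples))
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```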
According to step 1, a simple neural network is constructed as the feature extraction network; the network is trained with the training data and the sentence labels, and after training it is used to extract the features of all data, obtaining the sentence features $u_i$ used in the subsequent process.
A network structure is constructed according to the calculation methods in steps 2 to 4, with the number of heads in the multi-head attention set to 3; the training data are input and forward computation is performed to obtain the emotion recognition output $\hat y_i$ of the final model.
During training, the cross-entropy loss function described above measures the difference between the model's predicted output and the label values in the data set.
A suitable optimization method is selected, according to the actual implementation situation, to update the parameters of the model that need updating. In this implementation, the parameters are updated using the Adam method.
During training, the parameters are first updated on the training set; after each epoch (one full pass over the whole training set), the loss is computed on the validation set and recorded. The number of training epochs is set here to 120. The model with the minimum loss on the validation set is selected as the final output of training.
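The training-and-selection procedure described here can be sketched as follows, assuming PyTorch, a model that returns class logits, and data loaders yielding (utterances, speakers, labels); the Adam optimizer and 120 epochs follow the text, while the learning rate and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, val_loader, epochs=120, lr=1e-4):
    """Supervised training with cross-entropy loss and validation-based selection."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state = float('inf'), None
    for epoch in range(epochs):
        model.train()
        for utterances, speakers, labels in train_loader:
            loss = F.cross_entropy(model(utterances, speakers), labels)
            optimizer.zero_grad()
            loss.backward()                      # back-propagate the cross-entropy loss
            optimizer.step()
        model.eval()
        with torch.no_grad():                    # record the loss on the validation set
            val_loss = sum(F.cross_entropy(model(u, s), y).item()
                           for u, s, y in val_loader) / len(val_loader)
        if val_loss < best_val:                  # keep the model with the minimum validation loss
            best_val = val_loss
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model
```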
And inputting the test data into the trained model for forward calculation to obtain the final emotion recognition output.
Compared with existing conversational emotion recognition methods, the proposed conversational emotion recognition method based on layered stacked graph convolution has the following significant advantages: the layered stacked graph convolution structure improves the discriminability of the emotion expression; a single-parameter attention mechanism fuses the information of surrounding nodes in the dialogue graph; and the self-attention computation of the Transformer structure performs global aggregation of emotional features.
In summary, in the embodiment, the plurality of sentence characteristic data are extracted first, then modeling is performed to obtain the context characteristic expression and the plurality of speaker characteristic expressions, finally emotion classification is performed based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain the emotion recognition result, the context characteristic and the speaker characteristic are added in the emotion recognition process, and the accuracy of conversation emotion recognition is improved.
In the following, the dialogue emotion recognition apparatus provided in the embodiment of the present application is introduced, and the dialogue emotion recognition apparatus described below and the dialogue emotion recognition method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a dialog emotion recognition device according to an embodiment of the present application.
In this embodiment, the apparatus may include:
a feature extraction module 100, configured to perform sentence feature extraction on the conversational data by using a trained feature extractor to obtain a plurality of sentence feature data;
the context modeling module 200 is configured to perform context feature expression modeling on a plurality of sentence feature data based on bidirectional LSTM to obtain a plurality of context feature expressions;
the speaker modeling module 300 is configured to perform speaker feature extraction on the multiple context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain multiple speaker feature expressions;
and the emotion classification module 400 is used for performing emotion classification based on the multiple context feature expressions and the multiple speaker feature expressions to obtain an emotion recognition result.
An embodiment of the present application further provides a server, including:
a memory for storing a computer program;
a processor for implementing the steps of the dialog emotion recognition method as described in the above embodiments when the computer program is executed.
The present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the dialog emotion recognition method according to the above embodiments.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The present application provides a method, apparatus, server, and computer-readable storage medium for emotion recognition in a dialog. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A conversation emotion recognition method is characterized by comprising the following steps:
adopting a trained feature extractor to extract sentence features from the dialogue data to obtain a plurality of sentence feature data;
performing context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
carrying out speaker feature extraction on the plurality of context feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions;
and carrying out emotion classification based on the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
2. The method for recognizing dialogue emotion according to claim 1, wherein the step of performing speaker feature extraction on the plurality of contextual feature expressions based on a graph neural network and a layered stack graph convolution structure to obtain a plurality of speaker feature expressions comprises:
carrying out graph construction processing based on the plurality of context feature expressions to obtain a conversation feature graph;
and carrying out speaker feature extraction on the conversation feature graph based on a layered stack network structure to obtain the plurality of speaker feature expressions.
3. The conversation emotion recognition method of claim 2, wherein performing speaker feature extraction on the conversation feature graph based on the layered stack network structure to obtain the plurality of speaker feature expressions comprises:
obtaining a feature expression of a first layer in the hierarchical stacking network structure through relational graph convolution;
obtaining a feature expression of a second layer in the hierarchical stacked network structure through an attention graph convolutional neural network;
acquiring feature expression of a third layer, feature expression of a fourth layer and feature expression of a fifth layer in the hierarchical stacking network structure based on an attention mechanism;
and processing the feature expression of the third layer, the feature expression of the fourth layer and the feature expression of the fifth layer based on a multi-head attention mechanism to obtain the feature expressions of the multiple speakers.
4. The method of claim 1, wherein the sentence feature extraction is performed on the dialogue data by using a trained feature extractor to obtain a plurality of sentence feature data, and the method comprises:
preprocessing the dialogue data to obtain processed dialogue data;
and adopting a trained feature extractor to extract sentence features from the processed dialogue data to obtain a plurality of sentence feature data.
5. The method of recognizing dialogue emotion according to claim 1, wherein the modeling of context feature expression for the plurality of sentence feature data based on bidirectional LSTM to obtain a plurality of context feature expressions comprises:
using a bidirectional LSTM as a modeling tool;
and carrying out context information expression processing on the plurality of sentence characteristic data by using the modeling tool to obtain a plurality of context characteristic expressions.
6. The method for recognizing dialogue emotion according to claim 1, wherein the training process of the feature extractor comprises:
acquiring training set data;
and training the initial feature extractor by adopting the training set to obtain the trained feature extractor.
7. The method for recognizing dialogue emotion according to claim 1, wherein the emotion classification is performed based on the plurality of contextual characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result, and the method comprises:
performing feature splicing on the plurality of context feature expressions and the plurality of speaker feature expressions to obtain feature vectors;
and processing the feature vectors through a full connection layer to obtain the emotion recognition result.
8. A conversational emotion recognition apparatus, comprising:
the feature extraction module is used for carrying out sentence feature extraction on the dialogue data by adopting the trained feature extractor to obtain a plurality of sentence feature data;
the context modeling module is used for carrying out context feature expression modeling on the plurality of sentence feature data based on the bidirectional LSTM to obtain a plurality of context feature expressions;
the speaker modeling module is used for extracting speaker characteristics of the plurality of context characteristic expressions based on the graph neural network and the layered stack graph convolution structure to obtain a plurality of speaker characteristic expressions;
and the emotion classification module is used for carrying out emotion classification on the basis of the plurality of context characteristic expressions and the plurality of speaker characteristic expressions to obtain an emotion recognition result.
9. A server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of conversational emotion recognition according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the dialog emotion recognition method according to any of claims 1 to 7.
CN202111648205.7A 2021-12-31 2021-12-31 Conversation emotion recognition method and related device Pending CN114020897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111648205.7A CN114020897A (en) 2021-12-31 2021-12-31 Conversation emotion recognition method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111648205.7A CN114020897A (en) 2021-12-31 2021-12-31 Conversation emotion recognition method and related device

Publications (1)

Publication Number Publication Date
CN114020897A true CN114020897A (en) 2022-02-08

Family

ID=80069434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111648205.7A Pending CN114020897A (en) 2021-12-31 2021-12-31 Conversation emotion recognition method and related device

Country Status (1)

Country Link
CN (1) CN114020897A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170372694A1 (en) * 2016-06-23 2017-12-28 Panasonic Intellectual Property Management Co., Ltd. Dialogue act estimation method, dialogue act estimation apparatus, and storage medium
CN111950275A (en) * 2020-08-06 2020-11-17 平安科技(深圳)有限公司 Emotion recognition method and device based on recurrent neural network and storage medium
CN113609289A (en) * 2021-07-06 2021-11-05 河南工业大学 Multi-mode dialog text-based emotion recognition method
CN113656564A (en) * 2021-07-20 2021-11-16 国网天津市电力公司 Power grid service dialogue data emotion detection method based on graph neural network
CN113641822A (en) * 2021-08-11 2021-11-12 哈尔滨工业大学 Fine-grained emotion classification method based on graph neural network

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913590A (en) * 2022-07-15 2022-08-16 山东海量信息技术研究院 Data emotion recognition method, device and equipment and readable storage medium
CN115374281A (en) * 2022-08-30 2022-11-22 重庆理工大学 Session emotion analysis method based on multi-granularity fusion and graph convolution network
CN115374281B (en) * 2022-08-30 2024-04-05 重庆理工大学 Session emotion analysis method based on multi-granularity fusion and graph convolution network

Similar Documents

Publication Publication Date Title
CN111275085B (en) Online short video multi-modal emotion recognition method based on attention fusion
KR102071582B1 (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN109299267B (en) Emotion recognition and prediction method for text conversation
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN111353029B (en) Semantic matching-based multi-turn spoken language understanding method
CN114020897A (en) Conversation emotion recognition method and related device
CN113220886A (en) Text classification method, text classification model training method and related equipment
Li et al. Learning fine-grained cross modality excitement for speech emotion recognition
CN111027292B (en) Method and system for generating limited sampling text sequence
CN114818703B (en) Multi-intention recognition method and system based on BERT language model and TextCNN model
CN111159375A (en) Text processing method and device
CN116168324A (en) Video emotion recognition method based on cyclic interaction transducer and dimension cross fusion
CN111046178A (en) Text sequence generation method and system
CN110992943B (en) Semantic understanding method and system based on word confusion network
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Chen et al. Sequence-to-sequence modelling for categorical speech emotion recognition using recurrent neural network
CN115497465A (en) Voice interaction method and device, electronic equipment and storage medium
CN115587337A (en) Method, device and storage medium for recognizing abnormal sound of vehicle door
CN111914553A (en) Financial information negative subject judgment method based on machine learning
CN114841151A (en) Medical text entity relation joint extraction method based on decomposition-recombination strategy
CN113239690A (en) Chinese text intention identification method based on integration of Bert and fully-connected neural network
CN116244474A (en) Learner learning state acquisition method based on multi-mode emotion feature fusion
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220208