CN113095357A - Multi-mode emotion recognition method and system based on attention mechanism and GMN - Google Patents

Multi-mode emotion recognition method and system based on attention mechanism and GMN Download PDF

Info

Publication number
CN113095357A
Authority
CN
China
Prior art keywords
emotion recognition
network
feature vector
model
feature
Prior art date
Legal status
Pending
Application number
CN202110239787.7A
Other languages
Chinese (zh)
Inventor
曹叶文
陈炜青
周冠群
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202110239787.7A priority Critical patent/CN113095357A/en
Publication of CN113095357A publication Critical patent/CN113095357A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 - Facial expression recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal emotion recognition method and system based on an attention mechanism and a GMN (gated memory network). An acquired video to be recognized is preprocessed to obtain text, speech and facial expression features; the text, speech and facial expression features are concurrently input into the LSTMs model of a trained multi-modal emotion recognition network, which outputs a first feature vector; the memory output values of all adjacent timestamps of the LSTMs model are weighted and summed to obtain a first weighted feature; the first weighted feature is input into a trained gated memory network GMN, which outputs a second feature vector; a trained global attention mechanism network GTAN weights and sums the memory output values of all timestamps under each LSTM model to obtain a third feature vector; the first, second and third feature vectors are fused to obtain a fused feature vector; and emotion recognition is performed on the fused feature vector to obtain an emotion recognition result.

Description

Multi-mode emotion recognition method and system based on attention mechanism and GMN
Technical Field
The invention relates to the technical field of emotion recognition, in particular to a multi-mode emotion recognition method and system based on an attention mechanism and GMN.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the development of artificial intelligence, machines are expected to better recognize people's true expressions so as to provide services that meet people's expectations, and the demand for human-computer interaction is growing. However, most so-called intelligent terminals can only perform simple execution tasks and cannot achieve real human-computer interaction. The key to realizing real human-computer interaction is to enable the intelligent terminal to correctly recognize the emotions that people show, which is called emotion recognition. Emotional expression is an important part of human development and communication. People can recognize each other's emotions through changes in vocal tone, expressive words, facial expressions and body movements. In the field of artificial intelligence, emotion recognition is an important technology for human-computer interaction; it integrates multiple disciplines such as speech signal processing, psychology, pattern recognition and video image processing, and can be applied in fields such as education, transportation and medical care.
Emotion recognition essentially belongs to pattern recognition in computer technology, and it requires acquiring and processing the information with which humans express emotion. The most common data sources in daily life are audio and video, and psychological studies have shown that facial expressions in video, together with speech and text in audio, play a crucial role in the expression of human emotion. Audio-based emotion recognition is generally speech emotion recognition, and video-based emotion recognition is generally facial expression recognition. Although these two single-modality approaches have both developed greatly, human emotion is formed by combining multi-modal information; the information among modalities is complementary, and emotion recognition that fuses audio and video can make full use of this multi-modal information. Therefore, multi-modal emotion recognition has become an important research topic.
Multi-modal emotion recognition was initially explored using classifiers such as support vector machines (SVM), linear regression and logistic regression. In early multi-modal emotion recognition methods, for the video signal, an optical flow method was used to detect the movement and moving speed of facial key points (such as the mouth corners and inner eyebrow corners), and a KNN algorithm was used to judge the emotion category of the video modality; for the speech signal, the pitch features of the speech and an HMM algorithm were used to judge the emotion category of the audio modality; finally, the video-modality and audio-modality emotion categories were combined by weighting to obtain the final recognition result. Other methods combine video, audio and text, and use multiple kernel learning (MKL) within an SVM to merge the three modalities, thereby obtaining higher recognition accuracy. Methods proposed in recent years include emotion recognition that feeds a Mel spectrogram of the audio signal to a CNN and face frames of the video signal to a 3D CNN, and emotion recognition that fuses the audio features of the speech signal with the dense features and CNN-based features of the image frames at the score level.
Although multi-modal emotion recognition can overcome the single-source, non-complementary nature of information in single-modality emotion recognition, how to process and fuse the information of different modalities is a difficult problem. Traditional multi-modal information fusion frameworks comprise data-layer fusion, feature-layer fusion and decision-layer fusion, and each of the three frameworks has its own strengths. In practical tasks, however, the practical problem must be considered in order to select the best fusion mode. This work adopts a deep-learning feature-layer fusion approach to process the text information, audio signal and video signal.
Disclosure of Invention
In order to solve the defects of the prior art, the invention provides a multi-modal emotion recognition method and system based on an attention mechanism and GMN;
in a first aspect, the invention provides a multi-modal emotion recognition method based on an attention mechanism and GMN;
the multimode emotion recognition method based on the attention mechanism and the GMN comprises the following steps:
preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
the method comprises the steps that text features, voice features and facial expression features of a video to be recognized are input into an LSTMs model of a trained multi-modal emotion recognition network in a concurrent mode, and a first feature vector is output;
carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
In a second aspect, the invention provides a multi-modal emotion recognition system based on an attention mechanism and GMN;
an attention mechanism and GMN based multi-modal emotion recognition system, comprising:
a pre-processing module configured to: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
a first feature vector acquisition module configured to: the method comprises the steps that text features, voice features and facial expression features of a video to be recognized are input into an LSTMs model of a trained multi-modal emotion recognition network in a concurrent mode, and a first feature vector is output;
a second feature vector acquisition module configured to: carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
a third feature vector acquisition module configured to: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
an emotion recognition module configured to: the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
In a third aspect, the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
for the defects of insufficient information and poor robustness of traditional single-mode emotion recognition, multi-mode emotion recognition has the advantages of sufficient information and complementary modes. For the complementation of information, it is important that the feature information of different modalities will not affect each other, and in view of this problem, the double attention mechanism can solve the problem well. The invention can obtain the optimal contribution ratio among the modes by utilizing the attention mechanism to the weight distribution of each mode, so that the information is fully fused and interacted across the modes, but the information between the modes is not mutually exclusive. In addition, the invention can store the information after interaction in the gated memory network by using the gated memory network, so that the information can be maximally utilized. Moreover, the invention can also highlight the specific part containing the strong emotional characteristics by learning a string of weight parameters through a weighted aggregation strategy of an attention mechanism, then learning the importance degree of each frame output from an LSTM output sequence, and then combining the importance degrees.
The invention relates to a multi-modal emotion recognition method based on a dual attention mechanism and a gated memory network. It considers both the influence of a single modality's information at different moments and the influence of different modalities at the same moment on emotion recognition. Finally, the outputs of the gated memory network, the global attention mechanism and the LSTMs system are fused for information complementation, and a good emotion recognition effect can be obtained, so the method has a good application prospect.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2 is a diagram illustrating a network connection relationship according to the first embodiment;
fig. 3 is a schematic diagram of a network connection of the DTAN model according to the first embodiment;
fig. 4 is a schematic diagram of network connection of a GMN model according to a first embodiment;
fig. 5 is a schematic diagram of a network connection of the GTAN model according to the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment provides a multi-modal emotion recognition method based on an attention mechanism and GMN;
as shown in fig. 2, the method for multi-modal emotion recognition based on attention mechanism and GMN includes:
s101: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
s102: the method comprises the steps that text features, voice features and facial expression features of a video to be recognized are input into an LSTMs model of a trained multi-modal emotion recognition network in a concurrent mode, and a first feature vector is output;
s103: carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
s104: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
s105: the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
As one or more embodiments, the S101: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized; the method comprises the following specific steps:
separating the video to be identified to obtain an audio signal and a video signal;
carrying out voice recognition on the audio signal to obtain text information;
performing feature extraction on the text information to obtain text features;
carrying out feature extraction on the audio signal to obtain a voice feature;
and carrying out feature extraction on the video signal to obtain facial expression features.
Exemplarily, the video to be recognized is separated into an audio signal and a video signal; the audio-video separation of the original video is performed using a fast video converter tool.
Illustratively, the voice recognition of the audio signal results in text information; the method specifically comprises the following steps:
iFlytek automatic speech recognition software is used to perform automatic speech recognition on the audio and obtain text data from the audio.
Further, performing feature extraction on the text information to obtain text features; the method specifically comprises the following steps:
Word vectorization feature extraction is performed on the text information using a GloVe (Global Vectors for Word Representation) model to obtain text features.
Exemplarily, the feature extraction is performed on the text information to obtain text features; the method specifically comprises the following steps:
for text data, a 300-dimensional pre-trained Glove model is used for embedding, each word obtains a 300-dimensional text feature, and finally a T300 feature vector matrix is obtained.
Further, performing feature extraction on the audio signal to obtain a voice feature; the method specifically comprises the following steps:
and performing feature extraction on the audio signal by using a speech processing algorithm library Covarep to obtain speech features.
Illustratively, the feature extraction is performed on the audio signal to obtain a voice feature; the method specifically comprises the following steps:
for audio data, firstly, the audio data is segmented according to the frequency of 100 frames per second, and then the audio signal is subjected to a Covarep feature extraction tool to obtain T1X 74 eigenvector matrix.
Further, performing feature extraction on the video signal to obtain facial expression features; the method specifically comprises the following steps:
and carrying out face contour recognition, face key point extraction, face contour correction, sight line estimation, head posture and facial motion unit feature extraction on the video signal to obtain facial expression features.
Illustratively, the feature extraction is performed on the video signal to obtain facial expression features; the method specifically comprises the following steps:
for video data, an openface2.0 facial behavior analysis tool is used for feature extraction. Inputting complete video data into an Openface2.0 tool can obtain 68 face key points, face shape parameters, head pose estimation, sight line estimation, face behavior units, Hog characteristics and the like, and finally obtain T2X 711 eigenvector matrix.
As one or more embodiments, the S101: after the step of preprocessing the acquired video to be recognized to obtain the text feature, the voice feature and the facial expression feature of the video to be recognized, the step S102: the method comprises the following steps of concurrently inputting text features, voice features and facial expression features of a video to be recognized into corresponding LSTMs models, and outputting a first feature vector, wherein the method also comprises the following steps of:
s101-2: and performing data alignment and standardization processing on all the obtained features.
The information of the three modalities is aligned using the Penn Phonetics Lab Forced Aligner (P2FA).
The features are aligned in the time dimension, so that information interaction can be performed among the modalities conveniently. After alignment, the features are the text $l \in \mathbb{R}^{T \times d_l}$ with $d_l = 300$, the speech $a \in \mathbb{R}^{T \times d_a}$ with $d_a = 74$, and the video $v \in \mathbb{R}^{T \times d_v}$ with $d_v = 711$.
For the overall aligned data N = {l, a, v}, the data needs to be normalized, so that each dimension of the feature is scaled to a specific interval and the dimensional expression becomes dimensionless. The method employed in the present invention is Z-score feature normalization, also known as standard deviation normalization. The formula is as follows:
$z = \frac{x - \mu}{\sigma}$
where x is the input sample, μ is the mean of all sample data, and σ is the standard deviation of all sample data. The data after standardization is beneficial to accelerating the convergence speed based on the gradient descent method or the random gradient descent method, and the accuracy of the model can be improved.
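A minimal sketch of the Z-score normalization described above; the small eps term added to the denominator is an assumption to avoid division by zero and is not part of the original formula.

```python
import numpy as np

def zscore_normalize(x, mu=None, sigma=None, eps=1e-8):
    """Z-score (standard deviation) normalization: z = (x - mu) / sigma."""
    if mu is None:
        mu = x.mean(axis=0)       # mean of all sample data, per feature dimension
    if sigma is None:
        sigma = x.std(axis=0)     # standard deviation of all sample data
    return (x - mu) / (sigma + eps), mu, sigma

# Example: normalize a T x 74 audio feature matrix with its own statistics.
audio = np.random.randn(120, 74) * 5.0 + 2.0
audio_norm, mu, sigma = zscore_normalize(audio)
print(audio_norm.mean(), audio_norm.std())   # approximately 0 and 1
```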
As one or more embodiments, the multi-modal emotion recognition network has a network structure comprising: LSTMs model (Long Short Term Memory Networks), DTAN model (Delta-Time Attention Network, Delta-Time improved Attention mechanism), GMN model (Gated Memory Network), and GTAN model (Global-Time Attention Network);
the LSTMs model is connected with the GMN model through the DTAN model, and the GMN model is connected with the fusion module;
the LSTMs model is connected with the fusion module;
the LSTMs model is connected with the GTAN model, and the GTAN model is connected with the fusion module;
the fusion module is connected with the first full-connection layer, the first full-connection layer is connected with the second full-connection layer, and the second full-connection layer is connected with the output layer.
Wherein, the LSTMs model comprises: a first LSTM model, a second LSTM model and a third LSTM model in parallel; each LSTM model comprises a plurality of memories connected in series; memory refers to the cell state $c_t$ in the LSTM model, a unit that stores data information at time t;
the DTAN model comprises: sequentially connected fully-connected neural network DaThe softmax network and the first multiplier; the fully-connected neural network DaThe full-connection layer FC1, the Dropout layer and the full-connection layer FC2 are sequentially connected; wherein, the full connection layer F2 is connected with the softmax network; the input end of the full connection layer FC1 is connected with the input end of the first multiplier; the output terminal of the first multiplier is used as the output terminal of the DTAN model.
The working principle of the DTAN model is as follows: the memories of the LSTMs model at times t-1 and t are concatenated as $c^{[t-1,t]}$ and passed to the trainable fully-connected neural network $D_a$; a softmax network is then applied to the output of $D_a$ to normalize its range to (0, 1], yielding the cross-modal attention coefficients $a^{[t-1,t]}$.
As shown in FIG. 3, pre_c_l, pre_c_a and pre_c_v are respectively the LSTM memory outputs of text, speech and image at time t-1, and their concatenation forms $c_{t-1}$; c_l, c_a and c_v are respectively the LSTM memory outputs of text, speech and image at time t, and their concatenation forms $c_t$; $c_{t-1}$ and $c_t$ are concatenated into $c^{[t-1,t]}$, which is input into the fully-connected layer FC1 of the fully-connected neural network $D_a$;
the fully-connected neural network $D_a$ consists of two fully-connected layers FC1 and FC2 and one Dropout layer to prevent overfitting;
the softmax layer produces the activation scores of the LSTM memory of each modality at times t and t-1, i.e. the attention coefficients $a^{[t-1,t]}$;
$a^{[t-1,t]}$ and $c^{[t-1,t]}$ are multiplied element-wise to obtain the first weighted feature $\hat{c}^{[t-1,t]} = c^{[t-1,t]} \odot a^{[t-1,t]}$, where $\odot$ denotes element-wise multiplication of two vectors of the same dimension.
Wherein, the GMN model comprises: a $D_u$ network, a $D_{\gamma_1}$ network and a $D_{\gamma_2}$ network;
the $D_u$, $D_{\gamma_1}$ and $D_{\gamma_2}$ networks are all fully-connected neural networks;
the $D_u$ network comprises: a fully-connected layer FC3, a Dropout layer and a fully-connected layer FC4 connected in sequence;
the $D_{\gamma_1}$ network comprises: a fully-connected layer FC5, a Dropout layer and a fully-connected layer FC6 connected in sequence;
the $D_{\gamma_2}$ network comprises: a fully-connected layer FC7, a Dropout layer and a fully-connected layer FC8 connected in sequence;
the input ends of the full connection layer FC3, the full connection layer FC5 and the full connection layer FC7 are all connected with the output end of the DTAN model;
the full connection layer FC4 is connected with the input end of the second multiplier through a first sigma function;
the full connection layer FC6 is connected with the input end of the third multiplier through a second sigma function;
the full connection layer FC8 is connected with the input end of the third multiplier through a tanh function;
the output end of the second multiplier and the output end of the third multiplier are both connected with the input end of the fourth multiplier, and the output end of the fourth multiplier is connected with the input end of the second multiplier.
$\hat{u}_t = D_u(\hat{c}^{[t-1,t]})$
$\gamma_1 = D_{\gamma_1}(\hat{c}^{[t-1,t]})$
$\gamma_2 = D_{\gamma_2}(\hat{c}^{[t-1,t]})$
wherein $D_u$ is used to generate the cross-modality update component $\hat{u}_t$ of the multimodal gated memory network, $D_{\gamma_1}$ is used to control the retention gate $\gamma_1$, whose purpose is to remember the current state of the multimodal gated memory network, and $D_{\gamma_2}$ is used to control the update gate $\gamma_2$, whose purpose is to update the memory of the multimodal gated memory network based on the update component $\hat{u}_t$.
As shown in FIG. 4, $\hat{c}^{[t-1,t]}$ is the output of the DTAN, i.e. the first weighted feature; the $D_u$, $D_{\gamma_1}$ and $D_{\gamma_2}$ networks are all fully-connected neural networks: fully-connected layers FC3 and FC4 belong to the $D_u$ network, FC5 and FC6 belong to the $D_{\gamma_1}$ network, and FC7 and FC8 belong to the $D_{\gamma_2}$ network; the Dropout layers prevent overfitting; $\odot$ multiplies the corresponding elements of two vectors with the same dimension and yields a vector of the same dimension as the two input vectors; at each timestamp t of the recursion through the whole network, the retention gate $\gamma_1$, the update gate $\gamma_2$ and the current modal interaction update component $\hat{u}_t$ are used to update $u_t$; $u_t$ needs to be initialized at time 0.
Wherein, the GTAN model comprises: a scoring network and a softmax network connected in series; the scoring network comprises parallel multipliers p1, p2, p3, ..., pn; each multiplier of the scoring network is connected with the softmax function through its corresponding summation function; the input end of each multiplier is connected with the input end of the multiplier pn; and the softmax function is followed by the weighting coefficients $\alpha_n$.
The working principle of the GTAN model is that, for each modality, the matrix $H_n$ composed of the outputs $h_t^n$ of the LSTM at each time receives an attention weight assignment $\alpha_n$ to obtain the final output $z_n$:
$z_n = H_n \alpha_n = \sum_{t=1}^{T} \alpha_{n,t}\, h_t^n$
where n is the index of the three modalities, n = 1, 2, 3. The goal is to mine the best allocation coefficients over the entire time series, so as to highlight the frames with the most emotional color and to supplement the DTAN and GMN information, as shown in FIG. 5.
Wherein $h_t^n$ is the output of the LSTM at each time; n is the index of the three modalities, with n = 1, 2, 3 representing text, speech and image respectively, that is, the three modalities each perform the above steps to obtain their respective z vectors, and the three z vectors are then concatenated to obtain the final z vector; $\alpha_n$ is the weight distribution coefficient of the outputs $h_t^n$ of the nth modality at each time; and $\odot$ denotes element-wise multiplication of two vectors of the same dimension.
Illustratively, LSTMs consists of a plurality of Long Short Term Memory (LSTM) networks, one for each modality. The modality corresponding to the first LSTM model is a text feature, the modality corresponding to the second LSTM model is a voice feature and the modality corresponding to the third LSTM model is a facial expression feature; each LSTM encodes its modality-specific dynamics and interactions.
Illustratively, the Delta-Time Attention Network (DTAN) is an improved attention mechanism whose goal is to discover the memory information interactions between different modalities and across time in the LSTMs system.
Illustratively, a multimodal gated memory network GMN is used. A multimodal Gated Memory Network (GMN) is a storage module for storing cross-time interaction and cross-modality interaction information.
Illustratively, the global attention mechanism is the GTAN. The Global-Time Attention Network (GTAN) mines the optimal allocation coefficients over the entire time series to highlight the frames with the most emotional color and to supplement information for the features obtained by the DTAN and GMN.
In one or more embodiments, the multi-modal emotion recognition network, the training step includes:
constructing a training set, wherein the training set is a text feature, a voice feature and a facial expression feature which correspond to the same video of a known emotion category label;
inputting the text features of the training set into a first LSTM model; at the same time, the user can select the desired position,
inputting the voice characteristics of the training set into a second LSTM model; at the same time, the user can select the desired position,
inputting facial expression features of the training set into a third LSTM model;
using the known emotion category labels as output values of the multi-mode emotion recognition network;
training a multi-mode emotion recognition network; and obtaining the trained multi-modal emotion recognition network.
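The following is a minimal training-loop sketch for the training step above, written with PyTorch; the model object and the data loader are assumed placeholders for the multi-modal emotion recognition network and training set described here, while the learning rate, weight decay and gradient-clipping values are taken from Table 1.

```python
import torch
from torch import nn

def train(model, loader, epochs=10, lr=0.005, weight_decay=0.1, clip=1.0):
    """Sketch: train a model on batches of (text, audio, video, label)."""
    criterion = nn.L1Loss()                              # L1 loss, as used below
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=weight_decay)
    model.train()
    for _ in range(epochs):
        for text, audio, video, label in loader:
            optimizer.zero_grad()
            pred = model(text, audio, video)             # fused emotion prediction
            loss = criterion(pred.squeeze(-1), label.float())
            loss.backward()
            torch.nn.utils.clip_grad_value_(model.parameters(), clip)
            optimizer.step()
```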
Illustratively, for each modality-characterized sequence, the long short term memory network (LSTM) encodes the features of each modality over time. At each input timestamp, feature information from each modality is input into the assigned respective LSTM model.
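As a minimal sketch of this per-modality encoding, the module below assigns one LSTM to each modality, using the hidden sizes listed in Table 1 (32/64/32 for text, speech, video), and concatenates the last-timestep outputs as the first feature vector; reading the spliced coding vectors as the last-timestep outputs is an assumption.

```python
import torch
from torch import nn

class ModalityEncoders(nn.Module):
    """Three parallel LSTMs, one per modality, plus concatenation."""
    def __init__(self, d_text=300, d_audio=74, d_video=711):
        super().__init__()
        self.lstm_l = nn.LSTM(d_text, 32, batch_first=True)    # text
        self.lstm_a = nn.LSTM(d_audio, 64, batch_first=True)   # speech
        self.lstm_v = nn.LSTM(d_video, 32, batch_first=True)   # facial expression

    def forward(self, text, audio, video):
        # Each input: (batch, T, d_modality); the sequences are time-aligned.
        h_l, _ = self.lstm_l(text)
        h_a, _ = self.lstm_a(audio)
        h_v, _ = self.lstm_v(video)
        # First feature vector: splice of the three last-timestep encodings.
        first_feat = torch.cat([h_l[:, -1], h_a[:, -1], h_v[:, -1]], dim=-1)
        return first_feat, (h_l, h_a, h_v)
```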
As one or more embodiments, the S102: the method comprises the steps that text features, voice features and facial expression features of a video to be recognized are input into an LSTMs model of a trained multi-modal emotion recognition network in a concurrent mode, and a first feature vector is output; the method comprises the following specific steps:
inputting text characteristics of a video to be recognized into a first LSTM model, and outputting a first coding vector by the first LSTM model; at the same time, the user can select the desired position,
inputting the voice characteristics of the video to be recognized into a second LSTM model, and outputting a second coding vector by the second LSTM model; at the same time, the user can select the desired position,
inputting facial expression characteristics of a video to be recognized into a third LSTM model, and outputting a third coding vector by the third LSTM model;
and splicing the first, second and third coded vectors to obtain a first feature vector.
As one or more embodiments, the S103: carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; the method comprises the following specific steps:
The input to the DTAN is the concatenation of the memories at times t-1 and t, denoted $c^{[t-1,t]}$. These memories are passed to a trainable fully-connected neural network $D_a$ to obtain the attention coefficients $a^{[t-1,t]}$:
$a^{[t-1,t]} = \mathrm{softmax}(D_a(c^{[t-1,t]}))$
$a^{[t-1,t]}$ is the softmax activation score, i.e. the weighting value, of each LSTM memory at times t-1 and t. Applying softmax on the output layer of $D_a$ normalizes the coefficients of $c^{[t-1,t]}$ to the range (0, 1]. The output of the DTAN is defined as:
$\hat{c}^{[t-1,t]} = c^{[t-1,t]} \odot a^{[t-1,t]}$
where $\hat{c}^{[t-1,t]}$ is the memory retained after the LSTM memories pass through the DTAN, i.e. the first weighted feature.
DTAN is also able to discover modal interactions that occur at different timestamps because it involves the memory c in LSTM systems. These memories may carry information about the inputs observed across different timestamps.
Illustratively, the goal of the DTAN is to outline the cross-modal interactions between the memories of different modalities in the LSTMs system at timestamp t. Therefore, at time t, an attention mechanism is applied to the concatenation of the LSTM memories $c_t$ to automatically assign the weight coefficients. Modal interaction is achieved by assigning high coefficients to the modality that dominates the emotional effect at timestamp t and low coefficients to the other modalities. However, using only the LSTM memory $c_t$ at time t for coefficient assignment is not ideal; the memory $c_{t-1}$ at time t-1 is also required, so that the DTAN can freely keep the memory information of the LSTM system at a constant size and assign high coefficients to it only when it is about to change.
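A minimal PyTorch sketch of the DTAN computation described above (FC1 → Dropout → FC2 → softmax, then element-wise weighting of the concatenated memories); the hidden size is an illustrative assumption.

```python
import torch
from torch import nn

class DTAN(nn.Module):
    """Delta-Time attention over the concatenated LSTM memories at t-1 and t."""
    def __init__(self, dim_c_total, hidden=128, dropout=0.25):
        super().__init__()
        self.fc1 = nn.Linear(2 * dim_c_total, hidden)     # FC1
        self.drop = nn.Dropout(dropout)                   # Dropout layer
        self.fc2 = nn.Linear(hidden, 2 * dim_c_total)     # FC2

    def forward(self, c_prev, c_curr):
        # c_prev, c_curr: (batch, dim_c_total), concatenated memories of all modalities.
        c_cat = torch.cat([c_prev, c_curr], dim=-1)       # c^[t-1,t]
        a = torch.softmax(self.fc2(self.drop(self.fc1(c_cat))), dim=-1)
        return c_cat * a                                  # first weighted feature
```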
As one or more embodiments, the S103: inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector; wherein t is a positive integer; the method comprises the following specific steps:
The first weighted feature $\hat{c}^{[t-1,t]}$ is used as the input of a trainable neural network $D_u$ to generate the cross-modality update component $\hat{u}_t \in \mathbb{R}^{d_{mem}}$ of the multimodal gated memory network, where $d_{mem}$ is the dimension of the multimodal gated memory network.
The multimodal gated memory is controlled by a two-gate arrangement, $\gamma_1, \gamma_2$, referred to as the retention gate and the update gate respectively. At each timestamp t, $\gamma_1$ determines how much of the current state of the multimodal gated memory network to remember, and $\gamma_2$ determines how much of the memory of the multimodal gated memory network to update based on the update component $\hat{u}_t$. $\gamma_1$ and $\gamma_2$ are each controlled by a trainable neural network, $D_{\gamma_1}$ and $D_{\gamma_2}$ respectively:
$\gamma_1 = D_{\gamma_1}(\hat{c}^{[t-1,t]}), \quad \gamma_2 = D_{\gamma_2}(\hat{c}^{[t-1,t]})$
The output of the DTAN, $\hat{c}^{[t-1,t]}$, is used as the input of the gating mechanisms of the multimodal gated memory network, and the update component is computed as:
$\hat{u}_t = D_u(\hat{c}^{[t-1,t]})$
At each timestamp t of the recursion through the whole network, the retention gate $\gamma_1$, the update gate $\gamma_2$ and the current modal interaction update component $\hat{u}_t$ are used to update u by the following formula:
$u_t = \gamma_1 \odot u_{t-1} + \gamma_2 \odot \tanh(\hat{u}_t)$
illustratively, u of the multimodal gating memorisation network GMN is a neural component that stores a history of interactions across time. It acts as a supplemental memory to the memory in the LSTM system. The output of the DTAN is passed directly to the multimodal gated memory network to represent cross-modal interactions made up of key dimensions of different modalities in the LSTM memory system.
As one or more embodiments, the S104: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network performs weighted summation on memories of all timestamps under each LSTM model to obtain a third feature vector; the method comprises the following specific steps:
For each modality, the matrix $H_n$ composed of the outputs $h_t^n$ of the LSTM system at each time receives an automatic attention weight assignment $\alpha_n$ to obtain the final output $z_n$:
$z_n = H_n \alpha_n = \sum_{t=1}^{T} \alpha_{n,t}\, h_t^n$
where $\alpha_n$ is the automatic optimal allocation coefficient of the modality over globally different times, whose range is normalized to (0, 1] by the softmax activation function. The final output $z_n$ is an information supplementary vector to the outputs of the DTAN, GMN and LSTM.
Illustratively, for global attention mechanism (GTAN), the goal is to mine the best allocation coefficients for the entire time series and supplement the information for DTAN and GMN. And carrying out automatic weight distribution of an Attention mechanism on a matrix formed by the output of each time of the LSTM system of each mode to obtain final output.
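A minimal sketch of the global-time attention for one modality; scoring each timestep by its dot product with the last-timestep output is an assumption consistent with the multiplier structure of FIG. 5, not a detail stated explicitly above.

```python
import torch

def global_time_attention(H):
    """GTAN sketch for one modality.

    H: (batch, T, d) matrix of LSTM outputs h_t^n. Returns the weighted sum
    z_n and the attention coefficients alpha_n.
    """
    query  = H[:, -1, :]                              # last-timestep output as query
    scores = torch.einsum('btd,bd->bt', H, query)     # one score per timestamp
    alpha  = torch.softmax(scores, dim=-1)            # normalized coefficients alpha_n
    z = torch.einsum('bt,btd->bd', alpha, H)          # z_n = sum_t alpha_t * h_t
    return z, alpha
```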
As one or more embodiments, the S105: the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; performing emotion recognition on the fused feature vector to obtain an emotion recognition result; the method comprises the following specific steps:
the trained multi-mode emotion recognition network splices the first, second and third feature vectors to obtain a fusion feature vector;
and performing emotion recognition on the fusion feature vectors, and classifying by using a full-connection neural network to obtain an emotion recognition result.
The output u of the gated memory network GMN, the output of each modality's LSTM at the last moment and the output of each modality's global attention mechanism are concatenated. The concatenated result is then passed through two fully-connected neural network layers FC to obtain the final emotion prediction result.
The loss function used by the multi-modal emotion recognition network is the L1 loss. The L1 loss is the absolute value of the difference between the predicted value and the true label value, also known as the Manhattan distance. The expression is as follows:
$L1Loss = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.
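A minimal sketch of the fusion step and the L1 loss: the three feature vectors are concatenated and passed through two fully-connected layers. The layer widths and the absence of an activation between the layers are assumptions.

```python
import torch
from torch import nn

class FusionHead(nn.Module):
    """Concatenate the first, second and third feature vectors, then two FC layers."""
    def __init__(self, d_first, d_second, d_third, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(d_first + d_second + d_third, hidden)
        self.fc2 = nn.Linear(hidden, 1)      # scalar emotion score in [-3, 3]

    def forward(self, first, second, third):
        fused = torch.cat([first, second, third], dim=-1)
        return self.fc2(self.fc1(fused))

criterion = nn.L1Loss()   # mean absolute difference between prediction and label
```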
Training the whole emotion recognition network, and comprehensively evaluating the performance of the emotion recognition network, wherein the evaluation standard is as follows: binary Accuracy, F1 Score, weighted Accuracy, MAE and r coefficients.
As shown in FIG. 1, the multi-modal emotion recognition method based on a double attention mechanism and a gated memory network is disclosed. The invention uses the MOSI dataset to validate the proposed algorithm, comprising in particular the following steps:
step 1, the MOSI data set provides data of three modes of text, audio and video. Therefore, no additional audio-video separation and automatic voice recognition operation are needed. For the MOSI data set, the division of a training set, a verification set and a test set is standard division of a comparison experiment, wherein the training set is 1284 samples, the verification set is 229 samples and the test set is 686 samples.
The emotion labels of the dataset lie in a linear range from -3 to +3, going from strongly negative to strongly positive. The intensity was annotated by online workers from Amazon Mechanical Turk. For each video, the annotator has seven choices: strongly positive (labeled +3), positive (+2), weakly positive (+1), neutral (0), weakly negative (-1), negative (-2), strongly negative (-3). In addition, for emotion recognition, scores between 0 and 3 are treated as positive emotion and scores between -3 and 0 as negative emotion. Under this binary split, there are 1176 positive samples and 1023 negative samples.
Step 2. The text data is embedded using a 300-dimensional pre-trained GloVe model; each word obtains a 300-dimensional text feature, finally giving a T × 300 feature vector matrix. For audio data, the audio is first segmented at a rate of 100 frames per second, and the audio signal is then passed through the COVAREP feature extraction tool to obtain a T1 × 74 feature vector matrix. For video data, the OpenFace 2.0 facial behavior analysis tool is used for feature extraction; inputting the complete video data into OpenFace 2.0 yields 68 facial key points, face shape parameters, head pose estimation, gaze estimation, facial action units, HOG features and the like, finally giving a T2 × 711 feature vector matrix.
Step 3. The extracted features of the three modalities of text, speech and video are not aligned in time, and the features only play their corresponding roles within their respective modalities. The information of the three modalities is aligned using the P2FA forced alignment tool. Finally, the aligned features are the text $l \in \mathbb{R}^{T \times d_l}$ with $d_l = 300$, the speech $a \in \mathbb{R}^{T \times d_a}$ with $d_a = 74$, and the video $v \in \mathbb{R}^{T \times d_v}$ with $d_v = 711$.
For the overall aligned data N = {l, a, v}, the invention also needs to normalize the data. The method employed is Z-score feature normalization, also known as standard deviation normalization. The formula is as follows:
$z = \frac{x - \mu}{\sigma}$
where x is the input sample, μ is the mean of all sample data, and σ is the standard deviation of all sample data.
Step 4. After feature extraction of the text, audio and video, the feature information of each modality is input into its own LSTM. For the input N = {l, a, v}, the input of the nth modality is defined as $x^n \in \mathbb{R}^{T \times d_{x_n}}$, where $d_{x_n}$ is the input dimension of the nth input modality. For the nth modality, the memory of the assigned LSTM is denoted as $c^n = \{c_t^n : t \le T\}$, and the output of each LSTM is defined as $h^n = \{h_t^n : t \le T\}$, where $d_{c_n}$ denotes the dimension of the memory $c^n$ in the nth LSTM.
The update rules of the nth LSTM are:
$i_n = \sigma(W_i^n x_t^n + U_i^n h_{t-1}^n + b_i^n)$
$f_n = \sigma(W_f^n x_t^n + U_f^n h_{t-1}^n + b_f^n)$
$o_n = \sigma(W_o^n x_t^n + U_o^n h_{t-1}^n + b_o^n)$
$m_n = W_m^n x_t^n + U_m^n h_{t-1}^n + b_m^n$
$c_t^n = f_n \odot c_{t-1}^n + i_n \odot m_n$
$h_t^n = o_n \odot \tanh(c_t^n)$
wherein $i_n, f_n, o_n$ are respectively the input gate, forget gate and output gate of the nth LSTM, $m_n$ is the memory update of the nth LSTM at time t, $\odot$ denotes the element-wise product, and $\sigma$ is the sigmoid activation function.
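A single-timestep sketch that mirrors the update rules above for one modality; stacking the four affine maps into one linear layer is only an implementation convenience.

```python
import torch
from torch import nn

class LSTMCellSketch(nn.Module):
    """One step of the per-modality LSTM: gates i, f, o and memory update m."""
    def __init__(self, d_in, d_c):
        super().__init__()
        self.W = nn.Linear(d_in, 4 * d_c)   # W_i, W_f, W_o, W_m stacked
        self.U = nn.Linear(d_c, 4 * d_c)    # U_i, U_f, U_o, U_m stacked

    def forward(self, x_t, h_prev, c_prev):
        i, f, o, m = (self.W(x_t) + self.U(h_prev)).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * m            # memory update
        h_t = o * torch.tanh(c_t)           # output
        return h_t, c_t
```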
Step 5. The goal of the DTAN is to outline the cross-modal interactions between the memories of different modalities in the LSTMs system at timestamp t. The invention uses both the LSTM memory $c_t$ at time t and the LSTM memory $c_{t-1}$ at time t-1 for coefficient assignment, so that the DTAN is free to keep the memory information of the LSTM system at a constant size and to assign high coefficients to it only when it is about to change.
The input to the DTAN is the concatenation of the memories at times t-1 and t, denoted $c^{[t-1,t]}$. These memories are passed to a trainable fully-connected neural network $D_a$ to obtain the attention coefficients $a^{[t-1,t]}$:
$a^{[t-1,t]} = \mathrm{softmax}(D_a(c^{[t-1,t]}))$
$a^{[t-1,t]}$ is the softmax activation score of each LSTM memory at times t-1 and t. Applying softmax on the output layer of $D_a$ normalizes the coefficients of $c^{[t-1,t]}$ to the range (0, 1]. The output of the DTAN is defined as:
$\hat{c}^{[t-1,t]} = c^{[t-1,t]} \odot a^{[t-1,t]}$
$\hat{c}^{[t-1,t]}$ is the memory retained after the LSTM memories pass through the DTAN. The DTAN is also able to discover modal interactions that occur at different timestamps because it involves the memories c in the LSTM system; these memories may carry information about the inputs observed across different timestamps.
Step 6. The output of the DTAN, $\hat{c}^{[t-1,t]}$, is passed directly to the multimodal gated memory network GMN to represent which dimensions in the LSTM memory system constitute cross-modal interactions. First, $\hat{c}^{[t-1,t]}$ is used as the input of a trainable neural network $D_u$ to generate the cross-modality update component $\hat{u}_t \in \mathbb{R}^{d_{mem}}$ of the multimodal gated memory network, where $d_{mem}$ is the dimension of the multimodal gated memory network:
$\hat{u}_t = D_u(\hat{c}^{[t-1,t]})$
The multimodal gated memory is controlled by a two-gate arrangement, $\gamma_1, \gamma_2$, called the retention gate and the update gate respectively. At each timestamp t, $\gamma_1$ determines how much of the current state of the multimodal gated memory network to remember, and $\gamma_2$ determines how much of the memory of the multimodal gated memory network to update based on the update component $\hat{u}_t$. $\gamma_1$ and $\gamma_2$ are each controlled by a trainable neural network, $D_{\gamma_1}$ and $D_{\gamma_2}$ respectively:
$\gamma_1 = D_{\gamma_1}(\hat{c}^{[t-1,t]}), \quad \gamma_2 = D_{\gamma_2}(\hat{c}^{[t-1,t]})$
At each timestamp t of the recursion through the whole network, the retention gate $\gamma_1$, the update gate $\gamma_2$ and the current modal interaction update component $\hat{u}_t$ are used to update u by the following formula:
$u_t = \gamma_1 \odot u_{t-1} + \gamma_2 \odot \tanh(\hat{u}_t)$
step 7. for the Global attention System (GTAN), the invention outputs for each moment of the LSTM system of each modality
Figure BDA0002961673080000159
Composed matrix HnAutomatic weight assignment alpha for the Attention mechanismnTo obtain a final output zn
Figure BDA00029616730800001510
The formula is as follows:
Figure BDA00029616730800001511
Figure BDA00029616730800001512
αnfor the automatic optimal allocation coefficients of the modality at globally different times, the softmax activation function normalizes its range to (0, 1)]And (3) removing the solvent. Final output
Figure BDA00029616730800001513
Is an excellent information supplementary vector output by DTAN, GMN and LSTM.
Step 8. The output u of the multimodal gated memory network GMN is concatenated with the last-moment output of the LSTM of each modality and the final output of the GTAN of each modality; the specific formula is:
$r_T = [u_T, h_T, z_n],\ n \in N$
Then the concatenated result $r_T$ is passed through two fully-connected neural network layers FC to obtain the final emotion prediction result $\hat{y}$:
$\hat{y} = W_2(W_1 r_T)$
where $W_1$ and $W_2$ are respectively the two trainable matrices of the fully-connected neural network. The loss function used by the model is the L1 loss, i.e. the absolute value of the difference between the predicted value and the true label value, also known as the Manhattan distance:
$L1Loss = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.
Step 9. For the MOSI dataset, the evaluation criteria are binary accuracy (BA), F1 score, multi-class weighted accuracy, mean absolute error (MAE) and the r coefficient. The formulas are as follows:
$Accuracy = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(\hat{y}_i = y_i\right)$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value; weighted accuracy is the usual accuracy, calculated as the fraction of correct answers over all examples. The larger the accuracy, the better the recognition effect.
$Precision = \frac{TP}{TP + FP}$
$Recall = \frac{TP}{TP + FN}$
$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$
wherein TP is the number predicted positive and actually positive; FP is the number predicted positive and actually negative; TN is the number predicted negative and actually negative; FN is the number predicted negative and actually positive. F1-Score is the harmonic mean of precision and recall; the larger the value, the better the recognition.
$MAE = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
where n is the number of samples in the test set, $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. For MAE, the smaller the value, the better the recognition effect.
$r = 1 - \frac{E_{model}}{E_{mean}}$
where $E_{model}$ denotes the loss generated using the model's predictions and $E_{mean}$ denotes the loss resulting from using the mean. The closer r is to 1, the better the recognition effect.
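A small sketch of the binary metrics and MAE defined above, using NumPy; the guards against division by zero are additions for robustness.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / max(tp + tn + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return acc, precision, recall, f1

def mae(y_true, y_pred):
    """Mean absolute error between real-valued labels and predictions."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```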
TABLE 1 Model parameters
Learning rate: 0.005
Optimizer: Adam
Batch size: 56
Dropout coefficient: 0.25
Number of iterations: 3000
weight_decay: 0.1
grad_clip_value: 1.0
hidden_sizes: 32, 64, 32 (text, speech, video)
Comparative experiment:
single-mode emotion recognition:
the method carries out single-mode emotion recognition aiming at three modes of text, voice and video respectively, and carries out comparison experiments with subsequent dual-mode and multi-mode emotion recognition. The invention uses LSTM and FC models without DTAN, GTAN and GMN for single-mode emotion recognition.
Table 2 shows the results of single-modality emotion recognition on the MOSI dataset. It can be clearly seen that in single-modality emotion recognition using only LSTM and FC, the speech and video results are not as good as the text results: their binary classification accuracy is only 54.9% and 54.6%, while text reaches 71.1%.
TABLE 2 MOSI data set single modal emotion recognition results
Bimodal emotion recognition:
The invention then performs bimodal emotion recognition. Table 3 gives the MOSI dataset bimodal emotion recognition results; LSTM-A, V in Table 3 is the model obtained by concatenating the LSTM outputs of the speech and video modalities and processing them with FC. LSTM-A, T uses the speech and text modalities and LSTM-V, T the video and text modalities, with the same model structure. The BA of LSTM-A, V is 57.4%, the BA of LSTM-A, T is 73.0%, and the accuracy of LSTM-V, T is 70.9%. Compared with single-modality emotion recognition, the improvement from bimodal emotion recognition is quite remarkable: the single-modality BA of speech and video is 54.9% and 54.6%, while their combination reaches 57.4%. In addition, the recognition results of bimodal combinations that include the text modality are relatively high, which is consistent with the single-modality results and improves upon them.
TABLE 3 MOSI data set bimodal emotion recognition results
Multi-modal emotion recognition:
following is multimodal emotion recognition. Table 4 shows the comparison of the multi-modal emotion recognition results of the MOSI dataset using the present model with the single modality. Experiments show that the multi-modal emotion recognition based on the model is obviously improved on various evaluation indexes such as BA, F1, Ac-7, MAE, r and the like.
TABLE 4 MOSI data set multimodal emotion recognition results
Table 5 is the confusion matrix of the test set on the MOSI data set, and Table 6 is the binary classification report for the MOSI data set.
TABLE 5 MOSI test set confusion matrix

Emotion category    Negative (predicted)    Positive (predicted)    Total
Negative (true)     306                     73                      379
Positive (true)     104                     203                     307
Total               410                     276                     686
TABLE 6 MOSI data set binary classification report

Evaluation index    Precision    Recall    F1-score
Result (%)          74.1         74.1      74.1
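For reference, a short sketch (assuming scikit-learn is available) showing how a binary classification report of this kind can be derived from the confusion-matrix counts of Table 5; the label arrays below are reconstructed from the counts purely for illustration and this is not the original evaluation script.

import numpy as np
from sklearn.metrics import classification_report

# Rebuild label arrays from the confusion-matrix counts in Table 5
# (negative = 0, positive = 1).
y_true = np.array([0] * 379 + [1] * 307)
y_pred = np.array([0] * 306 + [1] * 73      # true negatives, then false positives
                  + [0] * 104 + [1] * 203)  # false negatives, then true positives
print(classification_report(y_true, y_pred, digits=3))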
Ablation experiments:

In order to verify the effect of the double attention mechanism and the gated memory network, ablation comparison experiments are carried out on them.
Network (no DTAN-mem, no GTAN) in Table 7 refers to using only LSTM and FC for tri-modal emotion recognition, without the DTAN, GMN and GTAN components. Compared with the final model, its recognition effect is obviously reduced: BA, F1 and Ac-7 decrease by 3.3%, 3.3% and 4.5%, respectively, MAE increases by 0.094, and r decreases by 0.074. The recognition effect is nevertheless still improved compared with the previous single-modal and bimodal emotion recognition. This shows that introducing additional modalities can significantly improve emotion recognition accuracy, and that the recognition effect is better when the components proposed in this chapter are used.

Network (no-GTAN) is the model of this chapter without the GTAN component; its emotion recognition effect is reduced compared with the final model: BA, F1 and Ac-7 decrease by 1.3%, 1.3% and 6.2%, respectively, MAE increases by 0.074, and r decreases by 0.035. This confirms the role of the GTAN component: the global attention mechanism can mine the optimal distribution coefficients over the whole time sequence to highlight the frames with the most emotional color, supplementing the information of the DTAN-GMN and LSTMs coding systems.

Network (no-DTAN-mem) is the model of this chapter without the DTAN and GMN components; its emotion recognition effect is likewise reduced compared with the final model: BA, F1 and Ac-7 decrease by 2.0%, 2.0% and 1.1%, respectively, MAE increases by 0.023, and r decreases by 0.039. The DTAN and GMN components can mine the optimal distribution coefficients of the information between different modalities at the same moment to highlight the modality with the strongest emotional color at the current moment, supplementing the information of the GTAN and LSTMs coding systems.
TABLE 7 MOSI data set ablation experimental comparison
In conclusion, the multi-modal emotion recognition method based on the double attention mechanism and the gated memory network greatly improves emotion recognition performance. It not only highlights the frames with strong emotional color at different moments within a single modality, but also takes into account the information interaction between different modalities at the same moment, and it achieves this performance improvement with few additional parameters. In addition, by incorporating semantics, the rich emotional information contained in the semantics is fully utilized, which is of great help to overall emotion recognition.

The multi-modal emotion recognition method based on the double attention mechanism and the gated memory network comprises preprocessing and feature extraction of multi-modal emotion data, model design based on the double attention mechanism and the gated memory network, and a fusion layer. It considers both the influence of the information of a single modality at different moments on emotion recognition and the influence of different modalities at the same moment. Finally, the gated memory network, the global attention mechanism and the output information of the LSTMs are fused for information complementation, and a good emotion recognition effect can be obtained, so the method has a good application prospect.

The beneficial effects of the invention are as follows: the invention provides a multi-modal emotion recognition method based on a double attention mechanism and a gated memory network, which considers both the influence of the information of a single modality at different moments on emotion recognition and the influence of different modalities at the same moment. Finally, the gated memory network, the global attention mechanism and the output information of the LSTMs system are fused for information complementation, and a good emotion recognition effect can be obtained, so the method has a good application prospect.
The second embodiment provides a multi-modal emotion recognition system based on an attention mechanism and GMN;
an attention mechanism and GMN based multi-modal emotion recognition system, comprising:
a pre-processing module configured to: preprocess the acquired video to be recognized to obtain the text features, voice features and facial expression features of the video to be recognized;

a first feature vector acquisition module configured to: concurrently input the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of the trained multi-modal emotion recognition network, and output a first feature vector;

a second feature vector acquisition module configured to: perform weighted summation on the memory output values of all adjacent timestamps of the LSTMs models to obtain a first weighted feature, input the first weighted feature into the gated memory network GMN of the trained multi-modal emotion recognition network, and output a second feature vector;

a third feature vector acquisition module configured to: obtain a third feature vector through the trained global attention mechanism network GTAN of the multi-modal emotion recognition network, which performs weighted summation on the memory output values of all timestamps under each LSTM model;

an emotion recognition module configured to: fuse the first feature vector, the second feature vector and the third feature vector through the trained multi-modal emotion recognition network to obtain a fused feature vector, and perform emotion recognition on the fused feature vector to obtain an emotion recognition result.
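A condensed, illustrative sketch of how the five modules above could be wired together in PyTorch is given below. The class name, constructor arguments and layer sizes are assumptions made only for illustration; the DTAN, GMN and GTAN submodules are assumed to be implemented separately (a component-level sketch follows claim 7).

import torch
import torch.nn as nn

class EmotionRecognitionSystem(nn.Module):
    # Illustrative wiring of the five modules: per-modality LSTM encoders,
    # a DTAN + GMN branch, a GTAN branch, and an FC fusion/classification head.
    def __init__(self, lstm_text, lstm_voice, lstm_video,
                 dtan, gmn, gtan_text, gtan_voice, gtan_video,
                 mem_dim, fusion_dim, num_classes=2):
        super().__init__()
        self.lstms = nn.ModuleList([lstm_text, lstm_voice, lstm_video])
        self.dtan, self.gmn = dtan, gmn
        self.gtans = nn.ModuleList([gtan_text, gtan_voice, gtan_video])
        self.mem_dim = mem_dim
        self.fc1 = nn.Linear(fusion_dim, 128)   # first fully-connected layer
        self.fc2 = nn.Linear(128, 64)           # second fully-connected layer
        self.out = nn.Linear(64, num_classes)   # output layer

    def forward(self, text, voice, video):
        # First feature vector: concatenation of the three LSTM encodings
        # (here taken as the last-timestep outputs).
        seqs = [lstm(x)[0] for lstm, x in zip(self.lstms, (text, voice, video))]
        h_cat = torch.cat(seqs, dim=-1)          # (batch, time, total_hidden)
        first = h_cat[:, -1]
        # Second feature vector: DTAN-weighted adjacent memories drive the GMN.
        u = h_cat.new_zeros(h_cat.size(0), self.mem_dim)
        for t in range(1, h_cat.size(1)):
            c_hat = self.dtan(h_cat[:, t - 1], h_cat[:, t])  # first weighted feature
            u = self.gmn(c_hat, u)
        second = u
        # Third feature vector: GTAN attention over all timestamps of each LSTM.
        third = torch.cat([g(seq) for g, seq in zip(self.gtans, seqs)], dim=-1)
        # Fuse the three feature vectors and classify through FC1, FC2 and output.
        fused = torch.cat([first, second, third], dim=-1)
        return self.out(torch.relu(self.fc2(torch.relu(self.fc1(fused)))))

For example, with the hidden sizes of Table 1 (32, 64 and 32) and an assumed memory dimension of 64, fusion_dim would be (32+64+32) + 64 + (32+64+32) = 320.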
It should be noted here that the preprocessing module, the first feature vector acquisition module, the second feature vector acquisition module, the third feature vector acquisition module and the emotion recognition module correspond to steps S101 to S105 in the first embodiment; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the above modules may be executed, as part of a system, in a computer system such as a set of computer-executable instructions.
The third embodiment of the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
The fourth embodiment also provides a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The multimode emotion recognition method based on the attention mechanism and GMN is characterized by comprising the following steps:
preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of the trained multi-modal emotion recognition network, and outputting a first feature vector;

carrying out weighted summation on the memory output values of all adjacent timestamps of the LSTMs models to obtain a first weighted feature; inputting the first weighted feature into the gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;

performing, by the trained global attention mechanism network GTAN of the multi-modal emotion recognition network, weighted summation on the memory output values of all timestamps under each LSTM model to obtain a third feature vector;

fusing, by the trained multi-modal emotion recognition network, the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fused feature vector to obtain an emotion recognition result.
2. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, comprising: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized; the method comprises the following specific steps:
separating the video to be identified to obtain an audio signal and a video signal;
carrying out voice recognition on the audio signal to obtain text information;
performing feature extraction on the text information to obtain text features;
carrying out feature extraction on the audio signal to obtain a voice feature;
and carrying out feature extraction on the video signal to obtain facial expression features.
3. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, wherein, after the step of preprocessing the acquired video to be recognized to obtain the text features, voice features and facial expression features of the video to be recognized, and before the step of concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the corresponding LSTM models and outputting the first feature vector, the method further comprises the following step:
and performing data alignment and standardization processing on all the obtained features.
4. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, wherein the multi-modal emotion recognition network comprises the following network structure: an LSTMs model, a DTAN model, a GMN model and a GTAN model;
the LSTMs model is connected with the GMN model through the DTAN model, and the GMN model is connected with the fusion module;
the LSTMs model is connected with the fusion module;
the LSTMs model is connected with the GTAN model, and the GTAN model is connected with the fusion module;
the fusion module is connected with the first full-connection layer, the first full-connection layer is connected with the second full-connection layer, and the second full-connection layer is connected with the output layer.
5. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, wherein the training step of the multi-modal emotion recognition network comprises:

constructing a training set, wherein the training set consists of the text features, voice features and facial expression features corresponding to the same video with a known emotion category label;

inputting the text features of the training set into a first LSTM model; meanwhile,

inputting the voice features of the training set into a second LSTM model; meanwhile,

inputting the facial expression features of the training set into a third LSTM model;
using the known emotion category labels as output values of the multi-mode emotion recognition network;
training a multi-mode emotion recognition network; and obtaining the trained multi-modal emotion recognition network.
6. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, wherein the step of concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of the trained multi-modal emotion recognition network and outputting a first feature vector comprises the following specific steps:

inputting the text features of the video to be recognized into a first LSTM model, the first LSTM model outputting a first coding vector; meanwhile,

inputting the voice features of the video to be recognized into a second LSTM model, the second LSTM model outputting a second coding vector; meanwhile,

inputting the facial expression features of the video to be recognized into a third LSTM model, the third LSTM model outputting a third coding vector;

and splicing the first, second and third coding vectors to obtain the first feature vector.
7. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, comprising:
carrying out weighted summation on the memory output values of all adjacent timestamps of the LSTMs model to obtain a first weighted feature; the method comprises the following specific steps:

the input to the DTAN is the concatenation of the memories at times t-1 and t, denoted c[t-1,t]; these memories are passed to a trainable fully-connected neural network Da to obtain the attention coefficients a[t-1,t]:

a[t-1,t]=softmax(Da(c[t-1,t]))

a[t-1,t] is the softmax activation score, namely the weight value, of the memory of each LSTM at times t-1 and t; applying softmax on the output layer of Da regularizes the range of the coefficients of c[t-1,t] to (0, 1]; the output of the DTAN is defined as:

ĉ[t-1,t]=c[t-1,t]⊙a[t-1,t]

wherein ĉ[t-1,t] is the memory retained after the memory of the LSTM passes through the DTAN, namely the first weighted feature;
alternatively,
inputting the first weighted feature into the gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector, wherein t is a positive integer; the method comprises the following specific steps:

the first weighted feature ĉ[t-1,t] is used as the input of a trainable neural network Du to generate the cross-modality update component û(t) of the multimodal gated memory network, û(t)∈R^dmem, wherein dmem is the dimension of the multimodal gated memory network:

û(t)=Du(ĉ[t-1,t])

the multi-view gated memory is controlled by a two-gate arrangement, namely γ1 and γ2, referred to as the retention gate and the update gate, respectively; at each timestamp t, γ1 determines how much of the current state of the multimodal gated memory network to retain, and γ2 determines how much of the memory of the multimodal gated memory network to update based on the update component û(t); γ1 and γ2 are controlled by two trainable neural networks Dγ1 and Dγ2, respectively; the output of the DTAN, ĉ[t-1,t], is used as the input of the gating mechanisms Dγ1 and Dγ2 of the multimodal gated memory network, and the formula is:

γ1=Dγ1(ĉ[t-1,t]), γ2=Dγ2(ĉ[t-1,t])

at each timestamp t of the whole network recursion, the retention gate γ1, the update gate γ2 and the current modal interaction update component û(t) are used to update the memory u by the following formula:

u(t)=γ1⊙u(t-1)+γ2⊙tanh(û(t))
alternatively,
the trained global attention mechanism network GTAN of the multi-modal emotion recognition network performs weighted summation on the memories of all timestamps under each LSTM model to obtain a third feature vector; the method comprises the following specific steps:

for each modality n, the outputs h(t) of the LSTM system of the modality at each time instant form a matrix Hn; the attention mechanism automatically assigns the weights αn to obtain the final output zn, zn∈R^dn; the formulas are as follows:

αn=softmax(Dn(Hn))

zn=Hn^T·αn

wherein Dn is a trainable fully-connected neural network, and αn is the automatic optimal allocation coefficient of the modality at globally different times; the softmax activation function normalizes its range to (0, 1]; the final output zn is an information supplementary vector to the outputs of the DTAN, GMN and LSTMs;
alternatively,
the step that the trained multi-modal emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector and performs emotion recognition on the fused feature vector to obtain an emotion recognition result comprises the following specific steps:

the trained multi-modal emotion recognition network splices the first, second and third feature vectors to obtain the fused feature vector;

and performing emotion recognition on the fused feature vector, classifying it with a fully-connected neural network to obtain the emotion recognition result.
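For illustration, the formulas of claim 7 can be transcribed almost one-to-one into PyTorch modules as below. The class names are arbitrary; the use of plain linear layers, sigmoid gates and the tanh nonlinearity in the memory update are assumptions where the claim does not fix the layer types, and Da, Du, Dγ1, Dγ2 and Dn follow the symbols used above.

import torch
import torch.nn as nn

class DTAN(nn.Module):
    # a[t-1,t] = softmax(Da(c[t-1,t])); output = c[t-1,t] weighted element-wise by a.
    def __init__(self, dim):
        super().__init__()
        self.Da = nn.Linear(2 * dim, 2 * dim)          # trainable fully-connected network Da

    def forward(self, c_prev, c_curr):
        c_pair = torch.cat([c_prev, c_curr], dim=-1)   # c[t-1,t]
        a = torch.softmax(self.Da(c_pair), dim=-1)     # attention coefficients a[t-1,t]
        return c_pair * a                              # retained memory (first weighted feature)

class GMN(nn.Module):
    # u(t) = gamma1 * u(t-1) + gamma2 * tanh(u_hat(t)), with u_hat = Du(c_hat),
    # gamma1 = D_gamma1(c_hat) (retention gate), gamma2 = D_gamma2(c_hat) (update gate).
    def __init__(self, dim, mem_dim):
        super().__init__()
        self.Du = nn.Linear(2 * dim, mem_dim)
        self.D_gamma1 = nn.Sequential(nn.Linear(2 * dim, mem_dim), nn.Sigmoid())
        self.D_gamma2 = nn.Sequential(nn.Linear(2 * dim, mem_dim), nn.Sigmoid())

    def forward(self, c_hat, u_prev):
        u_hat = self.Du(c_hat)            # cross-modality update component
        gamma1 = self.D_gamma1(c_hat)     # how much of the previous memory to retain
        gamma2 = self.D_gamma2(c_hat)     # how much of the update component to write
        return gamma1 * u_prev + gamma2 * torch.tanh(u_hat)

class GTAN(nn.Module):
    # alpha_n = softmax over time of Dn(Hn); z_n = attention-weighted sum of Hn.
    def __init__(self, dim):
        super().__init__()
        self.Dn = nn.Linear(dim, 1)

    def forward(self, H):                 # H: (batch, time, dim)
        alpha = torch.softmax(self.Dn(H), dim=1)
        return (alpha * H).sum(dim=1)     # z_n: weighted sum over all timestamps

These components match the interfaces assumed in the pipeline sketch given in the second embodiment: DTAN(total_hidden), GMN(total_hidden, mem_dim) and one GTAN(hidden_n) per modality.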
8. The multimode emotion recognition system based on the attention mechanism and the GMN is characterized by comprising the following components:
a pre-processing module configured to: preprocess the acquired video to be recognized to obtain the text features, voice features and facial expression features of the video to be recognized;

a first feature vector acquisition module configured to: concurrently input the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of the trained multi-modal emotion recognition network, and output a first feature vector;

a second feature vector acquisition module configured to: perform weighted summation on the memory output values of all adjacent timestamps of the LSTMs models to obtain a first weighted feature, input the first weighted feature into the gated memory network GMN of the trained multi-modal emotion recognition network, and output a second feature vector;

a third feature vector acquisition module configured to: obtain a third feature vector through the trained global attention mechanism network GTAN of the multi-modal emotion recognition network, which performs weighted summation on the memory output values of all timestamps under each LSTM model;

an emotion recognition module configured to: fuse the first feature vector, the second feature vector and the third feature vector through the trained multi-modal emotion recognition network to obtain a fused feature vector, and perform emotion recognition on the fused feature vector to obtain an emotion recognition result.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202110239787.7A 2021-03-04 2021-03-04 Multi-mode emotion recognition method and system based on attention mechanism and GMN Pending CN113095357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110239787.7A CN113095357A (en) 2021-03-04 2021-03-04 Multi-mode emotion recognition method and system based on attention mechanism and GMN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110239787.7A CN113095357A (en) 2021-03-04 2021-03-04 Multi-mode emotion recognition method and system based on attention mechanism and GMN

Publications (1)

Publication Number Publication Date
CN113095357A true CN113095357A (en) 2021-07-09

Family

ID=76666377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110239787.7A Pending CN113095357A (en) 2021-03-04 2021-03-04 Multi-mode emotion recognition method and system based on attention mechanism and GMN

Country Status (1)

Country Link
CN (1) CN113095357A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275085A (en) * 2020-01-15 2020-06-12 重庆邮电大学 Online short video multi-modal emotion recognition method based on attention fusion
CN111898670A (en) * 2020-07-24 2020-11-06 深圳市声希科技有限公司 Multi-mode emotion recognition method, device, equipment and storage medium
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈炜青 (Chen Weiqing): "Research on Multi-modal Emotion Recognition Based on Deep Learning", China Excellent Doctoral and Master's Dissertations Full-text Database (Master), Information Science and Technology Series *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255635B (en) * 2021-07-19 2021-10-15 中国科学院自动化研究所 Multi-mode fused psychological stress analysis method
CN113255635A (en) * 2021-07-19 2021-08-13 中国科学院自动化研究所 Multi-mode fused psychological stress analysis method
CN113723463A (en) * 2021-08-02 2021-11-30 北京工业大学 Emotion classification method and device
WO2023050708A1 (en) * 2021-09-29 2023-04-06 苏州浪潮智能科技有限公司 Emotion recognition method and apparatus, device, and readable storage medium
CN113674767A (en) * 2021-10-09 2021-11-19 复旦大学 Depression state identification method based on multi-modal fusion
CN114155882A (en) * 2021-11-30 2022-03-08 浙江大学 Method and device for judging road rage emotion based on voice recognition
CN114155882B (en) * 2021-11-30 2023-08-22 浙江大学 Method and device for judging emotion of road anger based on voice recognition
CN114218380A (en) * 2021-12-03 2022-03-22 淮阴工学院 Multi-mode-based cold chain loading user portrait label extraction method and device
CN114218380B (en) * 2021-12-03 2022-07-29 淮阴工学院 Multi-mode-based cold chain loading user portrait label extraction method and device
CN114648805A (en) * 2022-05-18 2022-06-21 华中科技大学 Course video sight correction model, training method thereof and sight drop point estimation method
CN115271002B (en) * 2022-09-29 2023-02-17 广东机电职业技术学院 Identification method, first-aid decision method, medium and life health intelligent monitoring system
CN115271002A (en) * 2022-09-29 2022-11-01 广东机电职业技术学院 Identification method, first-aid decision method, medium and life health intelligent monitoring system
CN116070169A (en) * 2023-01-28 2023-05-05 天翼云科技有限公司 Model training method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113095357A (en) Multi-mode emotion recognition method and system based on attention mechanism and GMN
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
CN111275085B (en) Online short video multi-modal emotion recognition method based on attention fusion
Pandey et al. Deep learning techniques for speech emotion recognition: A review
CN109003625B (en) Speech emotion recognition method and system based on ternary loss
Cai et al. Multi-modal emotion recognition from speech and facial expression based on deep learning
Deng et al. Multimodal utterance-level affect analysis using visual, audio and text features
CN108804453A (en) A kind of video and audio recognition methods and device
WO2023050708A1 (en) Emotion recognition method and apparatus, device, and readable storage medium
CN113065344A (en) Cross-corpus emotion recognition method based on transfer learning and attention mechanism
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
Cai et al. Multimodal sentiment analysis based on recurrent neural network and multimodal attention
Wang et al. A novel multiface recognition method with short training time and lightweight based on ABASNet and H-softmax
Sahu et al. Modeling feature representations for affective speech using generative adversarial networks
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
Sun et al. EmotionNAS: Two-stream Architecture Search for Speech Emotion Recognition
Gong et al. Human interaction recognition based on deep learning and HMM
Bakhshi et al. Multimodal emotion recognition based on speech and physiological signals using deep neural networks
Khalane et al. Context-aware multimodal emotion recognition
CN116167015A (en) Dimension emotion analysis method based on joint cross attention mechanism
CN113626553B (en) Cascade binary Chinese entity relation extraction method based on pre-training model
CN114693949A (en) Multi-modal evaluation object extraction method based on regional perception alignment network
Ng et al. The investigation of different loss functions with capsule networks for speech emotion recognition
Torabi et al. Action classification and highlighting in videos

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210709