CN113095357A - Multi-mode emotion recognition method and system based on attention mechanism and GMN - Google Patents

Multi-mode emotion recognition method and system based on attention mechanism and GMN Download PDF

Info

Publication number
CN113095357A
Authority
CN
China
Prior art keywords
emotion recognition
network
feature vector
model
feature
Prior art date
Legal status
Pending
Application number
CN202110239787.7A
Other languages
Chinese (zh)
Inventor
曹叶文
陈炜青
周冠群
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202110239787.7A priority Critical patent/CN113095357A/en
Publication of CN113095357A publication Critical patent/CN113095357A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 - Facial expression recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal emotion recognition method and system based on an attention mechanism and a GMN (gated memory network). An acquired video to be recognized is preprocessed to obtain text, speech and facial expression features; the text, speech and facial expression features are concurrently input into the LSTMs model of a trained multi-modal emotion recognition network, which outputs a first feature vector; the memory output values of all adjacent timestamps of the LSTMs model are weighted and summed to obtain a first weighted feature; the first weighted feature is input into a trained gated memory network GMN, which outputs a second feature vector; a trained global attention mechanism network GTAN weights and sums the memory output values of all timestamps under each LSTM model to obtain a third feature vector; the first, second and third feature vectors are fused to obtain a fused feature vector; and emotion recognition is performed on the fused feature vector to obtain an emotion recognition result.

Description

Multi-mode emotion recognition method and system based on attention mechanism and GMN
Technical Field
The invention relates to the technical field of emotion recognition, in particular to a multi-mode emotion recognition method and system based on an attention mechanism and GMN.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the development of artificial intelligence, machines are expected to better recognize people's true expressions so as to provide services that meet people's expectations, and the demand for human-computer interaction is growing. However, most so-called intelligent terminals can only perform simple execution tasks and cannot achieve real human-computer interaction. The key to realizing real human-computer interaction is to enable the intelligent terminal to correctly recognize the emotions that people show, which is called emotion recognition. Emotional expression is an important part of human development and communication. People can recognize each other's emotions through changes in vocal tone, expressive words, facial expressions and body movements. In the field of artificial intelligence, emotion recognition is an important technology for human-computer interaction; it integrates multiple disciplines such as speech signal processing, psychology, pattern recognition and video image processing, and can be applied in fields such as education, transportation and medical care.
Emotion recognition essentially belongs to pattern recognition in computer technology, and it requires acquiring and processing the information with which humans express emotion. The most common data sources in daily life are audio and video, and psychological studies have shown that facial expressions in video, together with speech and text in audio, play a crucial role in the expression of human emotion. Audio-based emotion recognition is generally speech emotion recognition, and video-based emotion recognition is generally facial expression recognition. Although these two single-modality approaches have both developed greatly, human emotion is formed by combining multi-modal information; the information among modalities is complementary, and emotion recognition that fuses audio and video can make full use of this multi-modal information. Therefore, multi-modal emotion recognition has become an important research topic.
Multi-modal emotion recognition was initially explored using classifiers such as support vector machines (SVM), linear regression and logistic regression. In early multi-modal emotion recognition methods, for the video signal, an optical flow method was used to detect the movement and moving speed of facial key points (such as the mouth corners and inner eyebrow corners), and a KNN algorithm was used to judge the emotion category of the video modality; for the speech signal, the pitch features of the speech and an HMM algorithm were used to judge the emotion category of the audio modality; finally, the video-modality and audio-modality emotion categories were combined by weighting to obtain the final recognition result. Other methods combine video, audio and text, and use multiple kernel learning (MKL) within an SVM to merge the three modalities, thereby obtaining higher recognition accuracy. Methods proposed in recent years include emotion recognition that feeds a Mel spectrogram of the audio signal to a CNN and face frames of the video signal to a 3D CNN, and emotion recognition that fuses the audio features of the speech signal with the dense features and CNN-based features of the image frames at the score level.
Although multi-modal emotion recognition can overcome the single-source, non-complementary nature of information in single-modality emotion recognition, how to process and fuse the information of different modalities is a difficult problem. Traditional multi-modal information fusion frameworks comprise data-layer fusion, feature-layer fusion and decision-layer fusion, and each of the three frameworks has its own strengths. In practical tasks, however, the practical problem must be considered in order to select the best fusion mode. This work adopts a deep-learning feature-layer fusion approach to process the text information, audio signal and video signal.
Disclosure of Invention
In order to solve the defects of the prior art, the invention provides a multi-modal emotion recognition method and system based on an attention mechanism and GMN;
in a first aspect, the invention provides a multi-modal emotion recognition method based on an attention mechanism and GMN;
the multimode emotion recognition method based on the attention mechanism and the GMN comprises the following steps:
preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
the method comprises the steps that text features, voice features and facial expression features of a video to be recognized are input into an LSTMs model of a trained multi-modal emotion recognition network in a concurrent mode, and a first feature vector is output;
carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
In a second aspect, the invention provides a multi-modal emotion recognition system based on an attention mechanism and GMN;
an attention mechanism and GMN based multi-modal emotion recognition system, comprising:
a pre-processing module configured to: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
a first feature vector acquisition module configured to: the method comprises the steps that text features, voice features and facial expression features of a video to be recognized are input into an LSTMs model of a trained multi-modal emotion recognition network in a concurrent mode, and a first feature vector is output;
a second feature vector acquisition module configured to: carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
a third feature vector acquisition module configured to: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
an emotion recognition module configured to: the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
In a third aspect, the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
for the defects of insufficient information and poor robustness of traditional single-mode emotion recognition, multi-mode emotion recognition has the advantages of sufficient information and complementary modes. For the complementation of information, it is important that the feature information of different modalities will not affect each other, and in view of this problem, the double attention mechanism can solve the problem well. The invention can obtain the optimal contribution ratio among the modes by utilizing the attention mechanism to the weight distribution of each mode, so that the information is fully fused and interacted across the modes, but the information between the modes is not mutually exclusive. In addition, the invention can store the information after interaction in the gated memory network by using the gated memory network, so that the information can be maximally utilized. Moreover, the invention can also highlight the specific part containing the strong emotional characteristics by learning a string of weight parameters through a weighted aggregation strategy of an attention mechanism, then learning the importance degree of each frame output from an LSTM output sequence, and then combining the importance degrees.
The invention relates to a multi-modal emotion recognition method based on a dual attention mechanism and a gated memory network. It considers both the influence of a single modality's information at different moments and the influence of different modalities at the same moment on emotion recognition. Finally, the outputs of the gated memory network, the global attention mechanism and the LSTMs system are fused for information complementation, and a good emotion recognition effect can be obtained, so the method has a good application prospect.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2 is a diagram illustrating a network connection relationship according to the first embodiment;
fig. 3 is a schematic diagram of a network connection of the DTAN model according to the first embodiment;
fig. 4 is a schematic diagram of network connection of a GMN model according to a first embodiment;
fig. 5 is a schematic diagram of a network connection of the GTAN model according to the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment provides a multi-modal emotion recognition method based on an attention mechanism and GMN;
as shown in fig. 2, the method for multi-modal emotion recognition based on attention mechanism and GMN includes:
s101: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
s102: the method comprises the steps that text features, voice features and facial expression features of a video to be recognized are input into an LSTMs model of a trained multi-modal emotion recognition network in a concurrent mode, and a first feature vector is output;
s103: carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
s104: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
s105: the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
As one or more embodiments, the S101: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized; the method comprises the following specific steps:
separating the video to be identified to obtain an audio signal and a video signal;
carrying out voice recognition on the audio signal to obtain text information;
performing feature extraction on the text information to obtain text features;
carrying out feature extraction on the audio signal to obtain a voice feature;
and carrying out feature extraction on the video signal to obtain facial expression features.
Exemplarily, the video to be recognized is separated into an audio signal and a video signal; the audio-video separation of the original video is performed using a fast video converter tool.
Illustratively, the voice recognition of the audio signal results in text information; the method specifically comprises the following steps:
iFlytek automatic speech recognition software is used to perform automatic speech recognition on the audio and obtain text data from the audio.
Further, performing feature extraction on the text information to obtain text features; the method specifically comprises the following steps:
Word vectorization feature extraction is performed on the text information using a GloVe (Global Vectors for Word Representation) model to obtain text features.
Exemplarily, the feature extraction is performed on the text information to obtain text features; the method specifically comprises the following steps:
for text data, a 300-dimensional pre-trained Glove model is used for embedding, each word obtains a 300-dimensional text feature, and finally a T300 feature vector matrix is obtained.
Further, performing feature extraction on the audio signal to obtain a voice feature; the method specifically comprises the following steps:
and performing feature extraction on the audio signal by using a speech processing algorithm library Covarep to obtain speech features.
Illustratively, the feature extraction is performed on the audio signal to obtain a voice feature; the method specifically comprises the following steps:
for audio data, firstly, the audio data is segmented according to the frequency of 100 frames per second, and then the audio signal is subjected to a Covarep feature extraction tool to obtain T1X 74 eigenvector matrix.
Further, performing feature extraction on the video signal to obtain facial expression features; the method specifically comprises the following steps:
and carrying out face contour recognition, face key point extraction, face contour correction, sight line estimation, head posture and facial motion unit feature extraction on the video signal to obtain facial expression features.
Illustratively, the feature extraction is performed on the video signal to obtain facial expression features; the method specifically comprises the following steps:
for video data, an openface2.0 facial behavior analysis tool is used for feature extraction. Inputting complete video data into an Openface2.0 tool can obtain 68 face key points, face shape parameters, head pose estimation, sight line estimation, face behavior units, Hog characteristics and the like, and finally obtain T2X 711 eigenvector matrix.
As one or more embodiments, the S101: after the step of preprocessing the acquired video to be recognized to obtain the text feature, the voice feature and the facial expression feature of the video to be recognized, the step S102: the method comprises the following steps of concurrently inputting text features, voice features and facial expression features of a video to be recognized into corresponding LSTMs models, and outputting a first feature vector, wherein the method also comprises the following steps of:
s101-2: and performing data alignment and standardization processing on all the obtained features.
The information of the three modalities is aligned using the Penn Phonetics Lab Forced Aligner (P2FA).
The features are aligned in the time dimension, so that information interaction can be performed among the modalities conveniently. After alignment, the features are the text $l \in \mathbb{R}^{T \times d_l}$ with $d_l = 300$, the speech $a \in \mathbb{R}^{T \times d_a}$ with $d_a = 74$, and the video $v \in \mathbb{R}^{T \times d_v}$ with $d_v = 711$.
For the overall aligned data N = {l, a, v}, the data needs to be normalized, so that each dimension of the feature is scaled to a specific interval and the dimensional expression becomes dimensionless. The method employed in the present invention is Z-score feature normalization, also known as standard deviation normalization. The formula is as follows:
$z = \frac{x - \mu}{\sigma}$
where x is the input sample, μ is the mean of all sample data, and σ is the standard deviation of all sample data. The data after standardization is beneficial to accelerating the convergence speed based on the gradient descent method or the random gradient descent method, and the accuracy of the model can be improved.
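A minimal sketch of the Z-score normalization described above; the small eps term added to the denominator is an assumption to avoid division by zero and is not part of the original formula.

```python
import numpy as np

def zscore_normalize(x, mu=None, sigma=None, eps=1e-8):
    """Z-score (standard deviation) normalization: z = (x - mu) / sigma."""
    if mu is None:
        mu = x.mean(axis=0)       # mean of all sample data, per feature dimension
    if sigma is None:
        sigma = x.std(axis=0)     # standard deviation of all sample data
    return (x - mu) / (sigma + eps), mu, sigma

# Example: normalize a T x 74 audio feature matrix with its own statistics.
audio = np.random.randn(120, 74) * 5.0 + 2.0
audio_norm, mu, sigma = zscore_normalize(audio)
print(audio_norm.mean(), audio_norm.std())   # approximately 0 and 1
```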
As one or more embodiments, the multi-modal emotion recognition network has a network structure comprising: LSTMs model (Long Short Term Memory Networks), DTAN model (Delta-Time Attention Network, Delta-Time improved Attention mechanism), GMN model (Gated Memory Network), and GTAN model (Global-Time Attention Network);
the LSTMs model is connected with the GMN model through the DTAN model, and the GMN model is connected with the fusion module;
the LSTMs model is connected with the fusion module;
the LSTMs model is connected with the GTAN model, and the GTAN model is connected with the fusion module;
the fusion module is connected with the first full-connection layer, the first full-connection layer is connected with the second full-connection layer, and the second full-connection layer is connected with the output layer.
Wherein, the LSTMs model comprises: a first LSTM model, a second LSTM model and a third LSTM model in parallel; each LSTM model comprises a plurality of memories connected in series; memory refers to the cell state $c_t$ in the LSTM model, a unit that stores data information at time t;
the DTAN model comprises: sequentially connected fully-connected neural network DaThe softmax network and the first multiplier; the fully-connected neural network DaThe full-connection layer FC1, the Dropout layer and the full-connection layer FC2 are sequentially connected; wherein, the full connection layer F2 is connected with the softmax network; the input end of the full connection layer FC1 is connected with the input end of the first multiplier; the output terminal of the first multiplier is used as the output terminal of the DTAN model.
The working principle of the DTAN model is as follows: the memories of the LSTMs model at times t-1 and t are concatenated as $c^{[t-1,t]}$ and passed to the trainable fully-connected neural network $D_a$; a softmax network is then applied to the output of $D_a$ to normalize its range to (0, 1], yielding the cross-modal attention coefficients $a^{[t-1,t]}$.
As shown in FIG. 3, pre_c_l, pre_c_a and pre_c_v are respectively the LSTM memory outputs of text, speech and image at time t-1, and their concatenation forms $c_{t-1}$; c_l, c_a and c_v are respectively the LSTM memory outputs of text, speech and image at time t, and their concatenation forms $c_t$; $c_{t-1}$ and $c_t$ are concatenated into $c^{[t-1,t]}$, which is input into the fully-connected layer FC1 of the fully-connected neural network $D_a$;
the fully-connected neural network $D_a$ consists of two fully-connected layers FC1 and FC2 and one Dropout layer to prevent overfitting;
the softmax layer produces the activation scores of the LSTM memory of each modality at times t and t-1, i.e. the attention coefficients $a^{[t-1,t]}$;
$a^{[t-1,t]}$ and $c^{[t-1,t]}$ are multiplied element-wise to obtain the first weighted feature $\hat{c}^{[t-1,t]} = c^{[t-1,t]} \odot a^{[t-1,t]}$, where $\odot$ denotes element-wise multiplication of two vectors of the same dimension.
Wherein, the GMN model comprises: a $D_u$ network, a $D_{\gamma_1}$ network and a $D_{\gamma_2}$ network;
the $D_u$, $D_{\gamma_1}$ and $D_{\gamma_2}$ networks are all fully-connected neural networks;
the $D_u$ network comprises: a fully-connected layer FC3, a Dropout layer and a fully-connected layer FC4 connected in sequence;
the $D_{\gamma_1}$ network comprises: a fully-connected layer FC5, a Dropout layer and a fully-connected layer FC6 connected in sequence;
the $D_{\gamma_2}$ network comprises: a fully-connected layer FC7, a Dropout layer and a fully-connected layer FC8 connected in sequence;
the input ends of the full connection layer FC3, the full connection layer FC5 and the full connection layer FC7 are all connected with the output end of the DTAN model;
the full connection layer FC4 is connected with the input end of the second multiplier through a first sigma function;
the full connection layer FC6 is connected with the input end of the third multiplier through a second sigma function;
the full connection layer FC8 is connected with the input end of the third multiplier through a tanh function;
the output end of the second multiplier and the output end of the third multiplier are both connected with the input end of the fourth multiplier, and the output end of the fourth multiplier is connected with the input end of the second multiplier.
$\hat{u}_t = D_u(\hat{c}^{[t-1,t]})$
$\gamma_1 = D_{\gamma_1}(\hat{c}^{[t-1,t]})$
$\gamma_2 = D_{\gamma_2}(\hat{c}^{[t-1,t]})$
wherein $D_u$ is used to generate the cross-modality update component $\hat{u}_t$ of the multimodal gated memory network, $D_{\gamma_1}$ is used to control the retention gate $\gamma_1$, whose purpose is to remember the current state of the multimodal gated memory network, and $D_{\gamma_2}$ is used to control the update gate $\gamma_2$, whose purpose is to update the memory of the multimodal gated memory network based on the update component $\hat{u}_t$.
As shown in FIG. 4, $\hat{c}^{[t-1,t]}$ is the output of the DTAN, i.e. the first weighted feature; the $D_u$, $D_{\gamma_1}$ and $D_{\gamma_2}$ networks are all fully-connected neural networks: fully-connected layers FC3 and FC4 belong to the $D_u$ network, FC5 and FC6 belong to the $D_{\gamma_1}$ network, and FC7 and FC8 belong to the $D_{\gamma_2}$ network; the Dropout layers prevent overfitting; $\odot$ multiplies the corresponding elements of two vectors with the same dimension and yields a vector of the same dimension as the two input vectors; at each timestamp t of the recursion through the whole network, the retention gate $\gamma_1$, the update gate $\gamma_2$ and the current modal interaction update component $\hat{u}_t$ are used to update $u_t$; $u_t$ needs to be initialized at time 0.
Wherein, the GTAN model comprises: a scoring network and a softmax network connected in series; the scoring network comprises parallel multipliers p1, p2, p3, ..., pn; each multiplier of the scoring network is connected with the softmax function through its corresponding summation function; the input end of each multiplier is connected with the input end of the multiplier pn; and the softmax function is followed by the weighting coefficients $\alpha_n$.
The working principle of the GTAN model is that, for each modality, the matrix $H_n$ composed of the outputs $h_t^n$ of the LSTM at each time receives an attention weight assignment $\alpha_n$ to obtain the final output $z_n$:
$z_n = H_n \alpha_n = \sum_{t=1}^{T} \alpha_{n,t}\, h_t^n$
where n is the index of the three modalities, n = 1, 2, 3. The goal is to mine the best allocation coefficients over the entire time series, so as to highlight the frames with the most emotional color and to supplement the DTAN and GMN information, as shown in FIG. 5.
Wherein $h_t^n$ is the output of the LSTM at each time; n is the index of the three modalities, with n = 1, 2, 3 representing text, speech and image respectively, that is, the three modalities each perform the above steps to obtain their respective z vectors, and the three z vectors are then concatenated to obtain the final z vector; $\alpha_n$ is the weight distribution coefficient of the outputs $h_t^n$ of the nth modality at each time; and $\odot$ denotes element-wise multiplication of two vectors of the same dimension.
Illustratively, LSTMs consists of a plurality of Long Short Term Memory (LSTM) networks, one for each modality. The modality corresponding to the first LSTM model is a text feature, the modality corresponding to the second LSTM model is a voice feature and the modality corresponding to the third LSTM model is a facial expression feature; each LSTM encodes its modality-specific dynamics and interactions.
Illustratively, the Delta-Time Attention Network (DTAN) is an improved attention mechanism whose goal is to discover the memory information interactions between different modalities and across time in the LSTMs system.
Illustratively, a multimodal gated memory network GMN is used. A multimodal Gated Memory Network (GMN) is a storage module for storing cross-time interaction and cross-modality interaction information.
Illustratively, the global attention mechanism is the GTAN. The Global-Time Attention Network (GTAN) mines the optimal allocation coefficients over the entire time series to highlight the frames with the most emotional color and to supplement information for the features obtained by the DTAN and GMN.
In one or more embodiments, the multi-modal emotion recognition network, the training step includes:
constructing a training set, wherein the training set is a text feature, a voice feature and a facial expression feature which correspond to the same video of a known emotion category label;
inputting the text features of the training set into a first LSTM model; at the same time, the user can select the desired position,
inputting the voice characteristics of the training set into a second LSTM model; at the same time, the user can select the desired position,
inputting facial expression features of the training set into a third LSTM model;
using the known emotion category labels as output values of the multi-mode emotion recognition network;
training a multi-mode emotion recognition network; and obtaining the trained multi-modal emotion recognition network.
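The following is a minimal training-loop sketch for the training step above, written with PyTorch; the model object and the data loader are assumed placeholders for the multi-modal emotion recognition network and training set described here, while the learning rate, weight decay and gradient-clipping values are taken from Table 1.

```python
import torch
from torch import nn

def train(model, loader, epochs=10, lr=0.005, weight_decay=0.1, clip=1.0):
    """Sketch: train a model on batches of (text, audio, video, label)."""
    criterion = nn.L1Loss()                              # L1 loss, as used below
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=weight_decay)
    model.train()
    for _ in range(epochs):
        for text, audio, video, label in loader:
            optimizer.zero_grad()
            pred = model(text, audio, video)             # fused emotion prediction
            loss = criterion(pred.squeeze(-1), label.float())
            loss.backward()
            torch.nn.utils.clip_grad_value_(model.parameters(), clip)
            optimizer.step()
```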
Illustratively, for each modality-characterized sequence, the long short term memory network (LSTM) encodes the features of each modality over time. At each input timestamp, feature information from each modality is input into the assigned respective LSTM model.
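As a minimal sketch of this per-modality encoding, the module below assigns one LSTM to each modality, using the hidden sizes listed in Table 1 (32/64/32 for text, speech, video), and concatenates the last-timestep outputs as the first feature vector; reading the spliced coding vectors as the last-timestep outputs is an assumption.

```python
import torch
from torch import nn

class ModalityEncoders(nn.Module):
    """Three parallel LSTMs, one per modality, plus concatenation."""
    def __init__(self, d_text=300, d_audio=74, d_video=711):
        super().__init__()
        self.lstm_l = nn.LSTM(d_text, 32, batch_first=True)    # text
        self.lstm_a = nn.LSTM(d_audio, 64, batch_first=True)   # speech
        self.lstm_v = nn.LSTM(d_video, 32, batch_first=True)   # facial expression

    def forward(self, text, audio, video):
        # Each input: (batch, T, d_modality); the sequences are time-aligned.
        h_l, _ = self.lstm_l(text)
        h_a, _ = self.lstm_a(audio)
        h_v, _ = self.lstm_v(video)
        # First feature vector: splice of the three last-timestep encodings.
        first_feat = torch.cat([h_l[:, -1], h_a[:, -1], h_v[:, -1]], dim=-1)
        return first_feat, (h_l, h_a, h_v)
```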
As one or more embodiments, the S102: the method comprises the steps that text features, voice features and facial expression features of a video to be recognized are input into an LSTMs model of a trained multi-modal emotion recognition network in a concurrent mode, and a first feature vector is output; the method comprises the following specific steps:
inputting text characteristics of a video to be recognized into a first LSTM model, and outputting a first coding vector by the first LSTM model; at the same time, the user can select the desired position,
inputting the voice characteristics of the video to be recognized into a second LSTM model, and outputting a second coding vector by the second LSTM model; at the same time, the user can select the desired position,
inputting facial expression characteristics of a video to be recognized into a third LSTM model, and outputting a third coding vector by the third LSTM model;
and splicing the first, second and third coded vectors to obtain a first feature vector.
As one or more embodiments, the S103: carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; the method comprises the following specific steps:
The input to the DTAN is the concatenation of the memories at times t-1 and t, denoted $c^{[t-1,t]}$. These memories are passed to a trainable fully-connected neural network $D_a$ to obtain the attention coefficients $a^{[t-1,t]}$:
$a^{[t-1,t]} = \mathrm{softmax}(D_a(c^{[t-1,t]}))$
$a^{[t-1,t]}$ is the softmax activation score, i.e. the weighting value, of each LSTM memory at times t-1 and t. Applying softmax on the output layer of $D_a$ normalizes the coefficients of $c^{[t-1,t]}$ to the range (0, 1]. The output of the DTAN is defined as:
$\hat{c}^{[t-1,t]} = c^{[t-1,t]} \odot a^{[t-1,t]}$
where $\hat{c}^{[t-1,t]}$ is the memory retained after the LSTM memories pass through the DTAN, i.e. the first weighted feature.
DTAN is also able to discover modal interactions that occur at different timestamps because it involves the memory c in LSTM systems. These memories may carry information about the inputs observed across different timestamps.
Illustratively, the goal of the DTAN is to outline the cross-modal interactions between the memories of different modalities in the LSTMs system at timestamp t. Therefore, at time t, an attention mechanism is applied to the concatenation of the LSTM memories $c_t$ to automatically assign the weight coefficients. Modal interaction is achieved by assigning high coefficients to the modality that dominates the emotional effect at timestamp t and low coefficients to the other modalities. However, using only the LSTM memory $c_t$ at time t for coefficient assignment is not ideal; the memory $c_{t-1}$ at time t-1 is also required, so that the DTAN can freely keep the memory information of the LSTM system at a constant size and assign high coefficients to it only when it is about to change.
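A minimal PyTorch sketch of the DTAN computation described above (FC1 → Dropout → FC2 → softmax, then element-wise weighting of the concatenated memories); the hidden size is an illustrative assumption.

```python
import torch
from torch import nn

class DTAN(nn.Module):
    """Delta-Time attention over the concatenated LSTM memories at t-1 and t."""
    def __init__(self, dim_c_total, hidden=128, dropout=0.25):
        super().__init__()
        self.fc1 = nn.Linear(2 * dim_c_total, hidden)     # FC1
        self.drop = nn.Dropout(dropout)                   # Dropout layer
        self.fc2 = nn.Linear(hidden, 2 * dim_c_total)     # FC2

    def forward(self, c_prev, c_curr):
        # c_prev, c_curr: (batch, dim_c_total), concatenated memories of all modalities.
        c_cat = torch.cat([c_prev, c_curr], dim=-1)       # c^[t-1,t]
        a = torch.softmax(self.fc2(self.drop(self.fc1(c_cat))), dim=-1)
        return c_cat * a                                  # first weighted feature
```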
As one or more embodiments, the S103: inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector; wherein t is a positive integer; the method comprises the following specific steps:
The first weighted feature $\hat{c}^{[t-1,t]}$ is used as the input of a trainable neural network $D_u$ to generate the cross-modality update component $\hat{u}_t \in \mathbb{R}^{d_{mem}}$ of the multimodal gated memory network, where $d_{mem}$ is the dimension of the multimodal gated memory network.
The multimodal gated memory is controlled by a two-gate arrangement, $\gamma_1, \gamma_2$, referred to as the retention gate and the update gate respectively. At each timestamp t, $\gamma_1$ determines how much of the current state of the multimodal gated memory network to remember, and $\gamma_2$ determines how much of the memory of the multimodal gated memory network to update based on the update component $\hat{u}_t$. $\gamma_1$ and $\gamma_2$ are each controlled by a trainable neural network, $D_{\gamma_1}$ and $D_{\gamma_2}$ respectively:
$\gamma_1 = D_{\gamma_1}(\hat{c}^{[t-1,t]}), \quad \gamma_2 = D_{\gamma_2}(\hat{c}^{[t-1,t]})$
The output of the DTAN, $\hat{c}^{[t-1,t]}$, is used as the input of the gating mechanisms of the multimodal gated memory network, and the update component is computed as:
$\hat{u}_t = D_u(\hat{c}^{[t-1,t]})$
At each timestamp t of the recursion through the whole network, the retention gate $\gamma_1$, the update gate $\gamma_2$ and the current modal interaction update component $\hat{u}_t$ are used to update u by the following formula:
$u_t = \gamma_1 \odot u_{t-1} + \gamma_2 \odot \tanh(\hat{u}_t)$
illustratively, u of the multimodal gating memorisation network GMN is a neural component that stores a history of interactions across time. It acts as a supplemental memory to the memory in the LSTM system. The output of the DTAN is passed directly to the multimodal gated memory network to represent cross-modal interactions made up of key dimensions of different modalities in the LSTM memory system.
As one or more embodiments, the S104: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network performs weighted summation on memories of all timestamps under each LSTM model to obtain a third feature vector; the method comprises the following specific steps:
For each modality, the matrix $H_n$ composed of the outputs $h_t^n$ of the LSTM system at each time receives an automatic attention weight assignment $\alpha_n$ to obtain the final output $z_n$:
$z_n = H_n \alpha_n = \sum_{t=1}^{T} \alpha_{n,t}\, h_t^n$
where $\alpha_n$ is the automatic optimal allocation coefficient of the modality over globally different times, whose range is normalized to (0, 1] by the softmax activation function. The final output $z_n$ is an information supplementary vector to the outputs of the DTAN, GMN and LSTM.
Illustratively, for global attention mechanism (GTAN), the goal is to mine the best allocation coefficients for the entire time series and supplement the information for DTAN and GMN. And carrying out automatic weight distribution of an Attention mechanism on a matrix formed by the output of each time of the LSTM system of each mode to obtain final output.
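A minimal sketch of the global-time attention for one modality; scoring each timestep by its dot product with the last-timestep output is an assumption consistent with the multiplier structure of FIG. 5, not a detail stated explicitly above.

```python
import torch

def global_time_attention(H):
    """GTAN sketch for one modality.

    H: (batch, T, d) matrix of LSTM outputs h_t^n. Returns the weighted sum
    z_n and the attention coefficients alpha_n.
    """
    query  = H[:, -1, :]                              # last-timestep output as query
    scores = torch.einsum('btd,bd->bt', H, query)     # one score per timestamp
    alpha  = torch.softmax(scores, dim=-1)            # normalized coefficients alpha_n
    z = torch.einsum('bt,btd->bd', alpha, H)          # z_n = sum_t alpha_t * h_t
    return z, alpha
```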
As one or more embodiments, the S105: the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; performing emotion recognition on the fused feature vector to obtain an emotion recognition result; the method comprises the following specific steps:
the trained multi-mode emotion recognition network splices the first, second and third feature vectors to obtain a fusion feature vector;
and performing emotion recognition on the fusion feature vectors, and classifying by using a full-connection neural network to obtain an emotion recognition result.
The output u of the gated memory network GMN, the output of each modality's LSTM at the last moment and the output of each modality's global attention mechanism are concatenated. The concatenated result is then passed through two fully-connected neural network layers FC to obtain the final emotion prediction result.
The loss function used by the multi-modal emotion recognition network is the L1 loss. The L1 loss is the absolute value of the difference between the predicted value and the true label value, also known as the Manhattan distance. The expression is as follows:
$L1Loss = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.
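A minimal sketch of the fusion step and the L1 loss: the three feature vectors are concatenated and passed through two fully-connected layers. The layer widths and the absence of an activation between the layers are assumptions.

```python
import torch
from torch import nn

class FusionHead(nn.Module):
    """Concatenate the first, second and third feature vectors, then two FC layers."""
    def __init__(self, d_first, d_second, d_third, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(d_first + d_second + d_third, hidden)
        self.fc2 = nn.Linear(hidden, 1)      # scalar emotion score in [-3, 3]

    def forward(self, first, second, third):
        fused = torch.cat([first, second, third], dim=-1)
        return self.fc2(self.fc1(fused))

criterion = nn.L1Loss()   # mean absolute difference between prediction and label
```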
Training the whole emotion recognition network, and comprehensively evaluating the performance of the emotion recognition network, wherein the evaluation standard is as follows: binary Accuracy, F1 Score, weighted Accuracy, MAE and r coefficients.
As shown in FIG. 1, the multi-modal emotion recognition method based on a double attention mechanism and a gated memory network is disclosed. The invention uses the MOSI dataset to validate the proposed algorithm, comprising in particular the following steps:
step 1, the MOSI data set provides data of three modes of text, audio and video. Therefore, no additional audio-video separation and automatic voice recognition operation are needed. For the MOSI data set, the division of a training set, a verification set and a test set is standard division of a comparison experiment, wherein the training set is 1284 samples, the verification set is 229 samples and the test set is 686 samples.
The emotion labels of the dataset lie in a linear range from -3 to +3, going from strongly negative to strongly positive. The intensity was annotated by online workers from Amazon Mechanical Turk. For each video, the annotator has seven choices: strongly positive (labeled +3), positive (+2), weakly positive (+1), neutral (0), weakly negative (-1), negative (-2), strongly negative (-3). In addition, for emotion recognition, scores between 0 and 3 are treated as positive emotion and scores between -3 and 0 as negative emotion. Under this binary split, there are 1176 positive samples and 1023 negative samples.
Step 2. The text data is embedded using a 300-dimensional pre-trained GloVe model; each word obtains a 300-dimensional text feature, finally giving a T × 300 feature vector matrix. For audio data, the audio is first segmented at a rate of 100 frames per second, and the audio signal is then passed through the COVAREP feature extraction tool to obtain a T1 × 74 feature vector matrix. For video data, the OpenFace 2.0 facial behavior analysis tool is used for feature extraction; inputting the complete video data into OpenFace 2.0 yields 68 facial key points, face shape parameters, head pose estimation, gaze estimation, facial action units, HOG features and the like, finally giving a T2 × 711 feature vector matrix.
Step 3. The extracted features of the three modalities of text, speech and video are not aligned in time, and the features only play their corresponding roles within their respective modalities. The information of the three modalities is aligned using the P2FA forced alignment tool. Finally, the aligned features are the text $l \in \mathbb{R}^{T \times d_l}$ with $d_l = 300$, the speech $a \in \mathbb{R}^{T \times d_a}$ with $d_a = 74$, and the video $v \in \mathbb{R}^{T \times d_v}$ with $d_v = 711$.
For the overall aligned data N = {l, a, v}, the invention also needs to normalize the data. The method employed is Z-score feature normalization, also known as standard deviation normalization. The formula is as follows:
$z = \frac{x - \mu}{\sigma}$
where x is the input sample, μ is the mean of all sample data, and σ is the standard deviation of all sample data.
Step 4. After feature extraction of the text, audio and video, the feature information of each modality is input into its own LSTM. For the input N = {l, a, v}, the input of the nth modality is defined as $x^n \in \mathbb{R}^{T \times d_{x_n}}$, where $d_{x_n}$ is the input dimension of the nth input modality. For the nth modality, the memory of the assigned LSTM is denoted as $c^n = \{c_t^n : t \le T\}$, and the output of each LSTM is defined as $h^n = \{h_t^n : t \le T\}$, where $d_{c_n}$ denotes the dimension of the memory $c^n$ in the nth LSTM.
The update rules of the nth LSTM are:
$i_n = \sigma(W_i^n x_t^n + U_i^n h_{t-1}^n + b_i^n)$
$f_n = \sigma(W_f^n x_t^n + U_f^n h_{t-1}^n + b_f^n)$
$o_n = \sigma(W_o^n x_t^n + U_o^n h_{t-1}^n + b_o^n)$
$m_n = W_m^n x_t^n + U_m^n h_{t-1}^n + b_m^n$
$c_t^n = f_n \odot c_{t-1}^n + i_n \odot m_n$
$h_t^n = o_n \odot \tanh(c_t^n)$
wherein $i_n, f_n, o_n$ are respectively the input gate, forget gate and output gate of the nth LSTM, $m_n$ is the memory update of the nth LSTM at time t, $\odot$ denotes the element-wise product, and $\sigma$ is the sigmoid activation function.
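A single-timestep sketch that mirrors the update rules above for one modality; stacking the four affine maps into one linear layer is only an implementation convenience.

```python
import torch
from torch import nn

class LSTMCellSketch(nn.Module):
    """One step of the per-modality LSTM: gates i, f, o and memory update m."""
    def __init__(self, d_in, d_c):
        super().__init__()
        self.W = nn.Linear(d_in, 4 * d_c)   # W_i, W_f, W_o, W_m stacked
        self.U = nn.Linear(d_c, 4 * d_c)    # U_i, U_f, U_o, U_m stacked

    def forward(self, x_t, h_prev, c_prev):
        i, f, o, m = (self.W(x_t) + self.U(h_prev)).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * m            # memory update
        h_t = o * torch.tanh(c_t)           # output
        return h_t, c_t
```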
Step 5. The goal of the DTAN is to outline the cross-modal interactions between the memories of different modalities in the LSTMs system at timestamp t. The invention uses both the LSTM memory $c_t$ at time t and the LSTM memory $c_{t-1}$ at time t-1 for coefficient assignment, so that the DTAN is free to keep the memory information of the LSTM system at a constant size and to assign high coefficients to it only when it is about to change.
The input to the DTAN is the concatenation of the memories at times t-1 and t, denoted $c^{[t-1,t]}$. These memories are passed to a trainable fully-connected neural network $D_a$ to obtain the attention coefficients $a^{[t-1,t]}$:
$a^{[t-1,t]} = \mathrm{softmax}(D_a(c^{[t-1,t]}))$
$a^{[t-1,t]}$ is the softmax activation score of each LSTM memory at times t-1 and t. Applying softmax on the output layer of $D_a$ normalizes the coefficients of $c^{[t-1,t]}$ to the range (0, 1]. The output of the DTAN is defined as:
$\hat{c}^{[t-1,t]} = c^{[t-1,t]} \odot a^{[t-1,t]}$
$\hat{c}^{[t-1,t]}$ is the memory retained after the LSTM memories pass through the DTAN. The DTAN is also able to discover modal interactions that occur at different timestamps because it involves the memories c in the LSTM system; these memories may carry information about the inputs observed across different timestamps.
Step 6. The output of the DTAN, $\hat{c}^{[t-1,t]}$, is passed directly to the multimodal gated memory network GMN to represent which dimensions in the LSTM memory system constitute cross-modal interactions. First, $\hat{c}^{[t-1,t]}$ is used as the input of a trainable neural network $D_u$ to generate the cross-modality update component $\hat{u}_t \in \mathbb{R}^{d_{mem}}$ of the multimodal gated memory network, where $d_{mem}$ is the dimension of the multimodal gated memory network:
$\hat{u}_t = D_u(\hat{c}^{[t-1,t]})$
The multimodal gated memory is controlled by a two-gate arrangement, $\gamma_1, \gamma_2$, called the retention gate and the update gate respectively. At each timestamp t, $\gamma_1$ determines how much of the current state of the multimodal gated memory network to remember, and $\gamma_2$ determines how much of the memory of the multimodal gated memory network to update based on the update component $\hat{u}_t$. $\gamma_1$ and $\gamma_2$ are each controlled by a trainable neural network, $D_{\gamma_1}$ and $D_{\gamma_2}$ respectively:
$\gamma_1 = D_{\gamma_1}(\hat{c}^{[t-1,t]}), \quad \gamma_2 = D_{\gamma_2}(\hat{c}^{[t-1,t]})$
At each timestamp t of the recursion through the whole network, the retention gate $\gamma_1$, the update gate $\gamma_2$ and the current modal interaction update component $\hat{u}_t$ are used to update u by the following formula:
$u_t = \gamma_1 \odot u_{t-1} + \gamma_2 \odot \tanh(\hat{u}_t)$
step 7. for the Global attention System (GTAN), the invention outputs for each moment of the LSTM system of each modality
Figure BDA0002961673080000159
Composed matrix HnAutomatic weight assignment alpha for the Attention mechanismnTo obtain a final output zn
Figure BDA00029616730800001510
The formula is as follows:
Figure BDA00029616730800001511
Figure BDA00029616730800001512
αnfor the automatic optimal allocation coefficients of the modality at globally different times, the softmax activation function normalizes its range to (0, 1)]And (3) removing the solvent. Final output
Figure BDA00029616730800001513
Is an excellent information supplementary vector output by DTAN, GMN and LSTM.
Step 8. The output u of the multimodal gated memory network GMN is concatenated with the last-moment output of the LSTM of each modality and the final output of the GTAN of each modality; the specific formula is:
$r_T = [u_T, h_T, z_n],\ n \in N$
Then the concatenated result $r_T$ is passed through two fully-connected neural network layers FC to obtain the final emotion prediction result $\hat{y}$:
$\hat{y} = W_2(W_1 r_T)$
where $W_1$ and $W_2$ are respectively the two trainable matrices of the fully-connected neural network. The loss function used by the model is the L1 loss, i.e. the absolute value of the difference between the predicted value and the true label value, also known as the Manhattan distance:
$L1Loss = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.
Step 9. For the MOSI dataset, the evaluation criteria are binary accuracy (BA), F1 score, multi-class weighted accuracy, mean absolute error (MAE) and the r coefficient. The formulas are as follows:
$Accuracy = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(\hat{y}_i = y_i\right)$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value; weighted accuracy is the usual accuracy, calculated as the fraction of correct answers over all examples. The larger the accuracy, the better the recognition effect.
$Precision = \frac{TP}{TP + FP}$
$Recall = \frac{TP}{TP + FN}$
$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$
wherein TP is the number predicted positive and actually positive; FP is the number predicted positive and actually negative; TN is the number predicted negative and actually negative; FN is the number predicted negative and actually positive. F1-Score is the harmonic mean of precision and recall; the larger the value, the better the recognition.
$MAE = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
where n is the number of samples in the test set, $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. For MAE, the smaller the value, the better the recognition effect.
$r = 1 - \frac{E_{model}}{E_{mean}}$
where $E_{model}$ denotes the loss generated using the model's predictions and $E_{mean}$ denotes the loss resulting from using the mean. The closer r is to 1, the better the recognition effect.
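A small sketch of the binary metrics and MAE defined above, using NumPy; the guards against division by zero are additions for robustness.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / max(tp + tn + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return acc, precision, recall, f1

def mae(y_true, y_pred):
    """Mean absolute error between real-valued labels and predictions."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```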
TABLE 1 Model parameters
Learning rate: 0.005
Optimizer: Adam
Batch size: 56
Dropout coefficient: 0.25
Number of iterations: 3000
weight_decay: 0.1
grad_clip_value: 1.0
hidden_sizes: 32, 64, 32 (text, speech, video)
Comparative experiment:
single-mode emotion recognition:
the method carries out single-mode emotion recognition aiming at three modes of text, voice and video respectively, and carries out comparison experiments with subsequent dual-mode and multi-mode emotion recognition. The invention uses LSTM and FC models without DTAN, GTAN and GMN for single-mode emotion recognition.
Table 2 shows the results of single-modality emotion recognition on the MOSI dataset. It can be clearly seen that in single-modality emotion recognition using only LSTM and FC, the speech and video results are not as good as the text results: their binary classification accuracy is only 54.9% and 54.6%, while text reaches 71.1%.
TABLE 2 MOSI data set single modal emotion recognition results
Bimodal emotion recognition:
The invention then performs bimodal emotion recognition. Table 3 gives the MOSI dataset bimodal emotion recognition results; LSTM-A, V in Table 3 is the model obtained by concatenating the LSTM outputs of the speech and video modalities and processing them with FC. LSTM-A, T uses the speech and text modalities and LSTM-V, T the video and text modalities, with the same model structure. The BA of LSTM-A, V is 57.4%, the BA of LSTM-A, T is 73.0%, and the accuracy of LSTM-V, T is 70.9%. Compared with single-modality emotion recognition, the improvement from bimodal emotion recognition is quite remarkable: the single-modality BA of speech and video is 54.9% and 54.6%, while their combination reaches 57.4%. In addition, the recognition results of bimodal combinations that include the text modality are relatively high, which is consistent with the single-modality results and improves upon them.
TABLE 3 MOSI data set bimodal emotion recognition results
Multi-modal emotion recognition:
following is multimodal emotion recognition. Table 4 shows the comparison of the multi-modal emotion recognition results of the MOSI dataset using the present model with the single modality. Experiments show that the multi-modal emotion recognition based on the model is obviously improved on various evaluation indexes such as BA, F1, Ac-7, MAE, r and the like.
TABLE 4 MOSI data set multimodal emotion recognition results
Table 5 is the confusion matrix of the test set on the MOSI data set, and Table 6 is the binary classification report for the MOSI data set.
TABLE 5 MOSI test set confusion matrix

Emotion category    Negative (predicted)    Positive (predicted)    Total
Negative (true)     306                     73                      379
Positive (true)     104                     203                     307
Total               410                     276                     686
TABLE 6 MOSI data set binary classification report

Evaluation index    Precision    Recall    F1-score
Result (%)          74.1         74.1      74.1
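For reference, a short sketch (assuming scikit-learn is available) showing how a binary classification report of this kind can be derived from the confusion-matrix counts of Table 5; the label arrays below are reconstructed from the counts purely for illustration and this is not the original evaluation script.

import numpy as np
from sklearn.metrics import classification_report

# Rebuild label arrays from the confusion-matrix counts in Table 5
# (negative = 0, positive = 1).
y_true = np.array([0] * 379 + [1] * 307)
y_pred = np.array([0] * 306 + [1] * 73      # true negatives, then false positives
                  + [0] * 104 + [1] * 203)  # false negatives, then true positives
print(classification_report(y_true, y_pred, digits=3))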
Ablation experiments:

In order to verify the effect of the double attention mechanism and the gated memory network, ablation comparison experiments are carried out on them.
Network (no DTAN-mem, no GTAN) in Table 7 refers to using only LSTM and FC for tri-modal emotion recognition, without the DTAN, GMN and GTAN components. Compared with the final model, its recognition effect is obviously reduced: BA, F1 and Ac-7 decrease by 3.3%, 3.3% and 4.5%, respectively, MAE increases by 0.094, and r decreases by 0.074. The recognition effect is nevertheless still improved compared with the previous single-modal and bimodal emotion recognition. This shows that introducing additional modalities can significantly improve emotion recognition accuracy, and that the recognition effect is better when the components proposed in this chapter are used.

Network (no-GTAN) is the model of this chapter without the GTAN component; its emotion recognition effect is reduced compared with the final model: BA, F1 and Ac-7 decrease by 1.3%, 1.3% and 6.2%, respectively, MAE increases by 0.074, and r decreases by 0.035. This confirms the role of the GTAN component: the global attention mechanism can mine the optimal distribution coefficients over the whole time sequence to highlight the frames with the most emotional color, supplementing the information of the DTAN-GMN and LSTMs coding systems.

Network (no-DTAN-mem) is the model of this chapter without the DTAN and GMN components; its emotion recognition effect is likewise reduced compared with the final model: BA, F1 and Ac-7 decrease by 2.0%, 2.0% and 1.1%, respectively, MAE increases by 0.023, and r decreases by 0.039. The DTAN and GMN components can mine the optimal distribution coefficients of the information between different modalities at the same moment to highlight the modality with the strongest emotional color at the current moment, supplementing the information of the GTAN and LSTMs coding systems.
TABLE 7 MOSI data set ablation experimental comparison
In conclusion, the multi-modal emotion recognition method based on the double attention mechanism and the gated memory network greatly improves emotion recognition performance. It not only highlights the frames with strong emotional color at different moments within a single modality, but also takes into account the information interaction between different modalities at the same moment, and it achieves this performance improvement with few additional parameters. In addition, by incorporating semantics, the rich emotional information contained in the semantics is fully utilized, which is of great help to overall emotion recognition.

The multi-modal emotion recognition method based on the double attention mechanism and the gated memory network comprises preprocessing and feature extraction of multi-modal emotion data, model design based on the double attention mechanism and the gated memory network, and a fusion layer. It considers both the influence of the information of a single modality at different moments on emotion recognition and the influence of different modalities at the same moment. Finally, the gated memory network, the global attention mechanism and the output information of the LSTMs are fused for information complementation, and a good emotion recognition effect can be obtained, so the method has a good application prospect.

The beneficial effects of the invention are as follows: the invention provides a multi-modal emotion recognition method based on a double attention mechanism and a gated memory network, which considers both the influence of the information of a single modality at different moments on emotion recognition and the influence of different modalities at the same moment. Finally, the gated memory network, the global attention mechanism and the output information of the LSTMs system are fused for information complementation, and a good emotion recognition effect can be obtained, so the method has a good application prospect.
The second embodiment provides a multi-modal emotion recognition system based on an attention mechanism and GMN;
an attention mechanism and GMN based multi-modal emotion recognition system, comprising:
a pre-processing module configured to: preprocess the acquired video to be recognized to obtain the text features, voice features and facial expression features of the video to be recognized;

a first feature vector acquisition module configured to: concurrently input the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of the trained multi-modal emotion recognition network, and output a first feature vector;

a second feature vector acquisition module configured to: perform weighted summation on the memory output values of all adjacent timestamps of the LSTMs models to obtain a first weighted feature, input the first weighted feature into the gated memory network GMN of the trained multi-modal emotion recognition network, and output a second feature vector;

a third feature vector acquisition module configured to: obtain a third feature vector through the trained global attention mechanism network GTAN of the multi-modal emotion recognition network, which performs weighted summation on the memory output values of all timestamps under each LSTM model;

an emotion recognition module configured to: fuse the first feature vector, the second feature vector and the third feature vector through the trained multi-modal emotion recognition network to obtain a fused feature vector, and perform emotion recognition on the fused feature vector to obtain an emotion recognition result.
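A condensed, illustrative sketch of how the five modules above could be wired together in PyTorch is given below. The class name, constructor arguments and layer sizes are assumptions made only for illustration; the DTAN, GMN and GTAN submodules are assumed to be implemented separately (a component-level sketch follows claim 7).

import torch
import torch.nn as nn

class EmotionRecognitionSystem(nn.Module):
    # Illustrative wiring of the five modules: per-modality LSTM encoders,
    # a DTAN + GMN branch, a GTAN branch, and an FC fusion/classification head.
    def __init__(self, lstm_text, lstm_voice, lstm_video,
                 dtan, gmn, gtan_text, gtan_voice, gtan_video,
                 mem_dim, fusion_dim, num_classes=2):
        super().__init__()
        self.lstms = nn.ModuleList([lstm_text, lstm_voice, lstm_video])
        self.dtan, self.gmn = dtan, gmn
        self.gtans = nn.ModuleList([gtan_text, gtan_voice, gtan_video])
        self.mem_dim = mem_dim
        self.fc1 = nn.Linear(fusion_dim, 128)   # first fully-connected layer
        self.fc2 = nn.Linear(128, 64)           # second fully-connected layer
        self.out = nn.Linear(64, num_classes)   # output layer

    def forward(self, text, voice, video):
        # First feature vector: concatenation of the three LSTM encodings
        # (here taken as the last-timestep outputs).
        seqs = [lstm(x)[0] for lstm, x in zip(self.lstms, (text, voice, video))]
        h_cat = torch.cat(seqs, dim=-1)          # (batch, time, total_hidden)
        first = h_cat[:, -1]
        # Second feature vector: DTAN-weighted adjacent memories drive the GMN.
        u = h_cat.new_zeros(h_cat.size(0), self.mem_dim)
        for t in range(1, h_cat.size(1)):
            c_hat = self.dtan(h_cat[:, t - 1], h_cat[:, t])  # first weighted feature
            u = self.gmn(c_hat, u)
        second = u
        # Third feature vector: GTAN attention over all timestamps of each LSTM.
        third = torch.cat([g(seq) for g, seq in zip(self.gtans, seqs)], dim=-1)
        # Fuse the three feature vectors and classify through FC1, FC2 and output.
        fused = torch.cat([first, second, third], dim=-1)
        return self.out(torch.relu(self.fc2(torch.relu(self.fc1(fused)))))

For example, with the hidden sizes of Table 1 (32, 64 and 32) and an assumed memory dimension of 64, fusion_dim would be (32+64+32) + 64 + (32+64+32) = 320.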
It should be noted here that the preprocessing module, the first feature vector acquisition module, the second feature vector acquisition module, the third feature vector acquisition module and the emotion recognition module correspond to steps S101 to S105 in the first embodiment; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the above modules may be executed, as part of a system, in a computer system such as a set of computer-executable instructions.
The third embodiment of the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
The fourth embodiment also provides a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The multimode emotion recognition method based on the attention mechanism and GMN is characterized by comprising the following steps:
preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of the trained multi-modal emotion recognition network, and outputting a first feature vector;

carrying out weighted summation on the memory output values of all adjacent timestamps of the LSTMs models to obtain a first weighted feature; inputting the first weighted feature into the gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;

performing, by the trained global attention mechanism network GTAN of the multi-modal emotion recognition network, weighted summation on the memory output values of all timestamps under each LSTM model to obtain a third feature vector;

fusing, by the trained multi-modal emotion recognition network, the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fused feature vector to obtain an emotion recognition result.
2. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, comprising: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized; the method comprises the following specific steps:
separating the video to be identified to obtain an audio signal and a video signal;
carrying out voice recognition on the audio signal to obtain text information;
performing feature extraction on the text information to obtain text features;
carrying out feature extraction on the audio signal to obtain a voice feature;
and carrying out feature extraction on the video signal to obtain facial expression features.
3. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, wherein, after the step of preprocessing the acquired video to be recognized to obtain the text features, voice features and facial expression features of the video to be recognized, and before the step of concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the corresponding LSTM models and outputting the first feature vector, the method further comprises the following step:
and performing data alignment and standardization processing on all the obtained features.
4. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, wherein the multi-modal emotion recognition network comprises the following network structure: an LSTMs model, a DTAN model, a GMN model and a GTAN model;
the LSTMs model is connected with the GMN model through the DTAN model, and the GMN model is connected with the fusion module;
the LSTMs model is connected with the fusion module;
the LSTMs model is connected with the GTAN model, and the GTAN model is connected with the fusion module;
the fusion module is connected with the first full-connection layer, the first full-connection layer is connected with the second full-connection layer, and the second full-connection layer is connected with the output layer.
5. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, wherein the training step of the multi-modal emotion recognition network comprises:

constructing a training set, wherein the training set consists of the text features, voice features and facial expression features corresponding to the same video with a known emotion category label;

inputting the text features of the training set into a first LSTM model; meanwhile,

inputting the voice features of the training set into a second LSTM model; meanwhile,

inputting the facial expression features of the training set into a third LSTM model;
using the known emotion category labels as output values of the multi-mode emotion recognition network;
training a multi-mode emotion recognition network; and obtaining the trained multi-modal emotion recognition network.
6. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, wherein the step of concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of the trained multi-modal emotion recognition network and outputting a first feature vector comprises the following specific steps:

inputting the text features of the video to be recognized into a first LSTM model, the first LSTM model outputting a first coding vector; meanwhile,

inputting the voice features of the video to be recognized into a second LSTM model, the second LSTM model outputting a second coding vector; meanwhile,

inputting the facial expression features of the video to be recognized into a third LSTM model, the third LSTM model outputting a third coding vector;

and splicing the first, second and third coding vectors to obtain the first feature vector.
7. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, comprising:
carrying out weighted summation on the memory output values of all adjacent timestamps of the LSTMs model to obtain a first weighted feature; the method comprises the following specific steps:

the input to the DTAN is the concatenation of the memories at times t-1 and t, denoted c[t-1,t]; these memories are passed to a trainable fully-connected neural network Da to obtain the attention coefficients a[t-1,t]:

a[t-1,t]=softmax(Da(c[t-1,t]))

a[t-1,t] is the softmax activation score, namely the weight value, of the memory of each LSTM at times t-1 and t; applying softmax on the output layer of Da regularizes the range of the coefficients of c[t-1,t] to (0, 1]; the output of the DTAN is defined as:

ĉ[t-1,t]=c[t-1,t]⊙a[t-1,t]

wherein ĉ[t-1,t] is the memory retained after the memory of the LSTM passes through the DTAN, namely the first weighted feature;
alternatively,
inputting the first weighted feature into the gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector, wherein t is a positive integer; the method comprises the following specific steps:

the first weighted feature ĉ[t-1,t] is used as the input of a trainable neural network Du to generate the cross-modality update component û(t) of the multimodal gated memory network, û(t)∈R^dmem, wherein dmem is the dimension of the multimodal gated memory network:

û(t)=Du(ĉ[t-1,t])

the multi-view gated memory is controlled by a two-gate arrangement, namely γ1 and γ2, referred to as the retention gate and the update gate, respectively; at each timestamp t, γ1 determines how much of the current state of the multimodal gated memory network to retain, and γ2 determines how much of the memory of the multimodal gated memory network to update based on the update component û(t); γ1 and γ2 are controlled by two trainable neural networks Dγ1 and Dγ2, respectively; the output of the DTAN, ĉ[t-1,t], is used as the input of the gating mechanisms Dγ1 and Dγ2 of the multimodal gated memory network, and the formula is:

γ1=Dγ1(ĉ[t-1,t]), γ2=Dγ2(ĉ[t-1,t])

at each timestamp t of the whole network recursion, the retention gate γ1, the update gate γ2 and the current modal interaction update component û(t) are used to update the memory u by the following formula:

u(t)=γ1⊙u(t-1)+γ2⊙tanh(û(t))
alternatively,
the trained global attention mechanism network GTAN of the multi-modal emotion recognition network performs weighted summation on the memories of all timestamps under each LSTM model to obtain a third feature vector; the method comprises the following specific steps:

for each modality n, the outputs h(t) of the LSTM system of the modality at each time instant form a matrix Hn; the attention mechanism automatically assigns the weights αn to obtain the final output zn, zn∈R^dn; the formulas are as follows:

αn=softmax(Dn(Hn))

zn=Hn^T·αn

wherein Dn is a trainable fully-connected neural network, and αn is the automatic optimal allocation coefficient of the modality at globally different times; the softmax activation function normalizes its range to (0, 1]; the final output zn is an information supplementary vector to the outputs of the DTAN, GMN and LSTMs;
alternatively,
the step that the trained multi-modal emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector and performs emotion recognition on the fused feature vector to obtain an emotion recognition result comprises the following specific steps:

the trained multi-modal emotion recognition network splices the first, second and third feature vectors to obtain the fused feature vector;

and performing emotion recognition on the fused feature vector, classifying it with a fully-connected neural network to obtain the emotion recognition result.
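For illustration, the formulas of claim 7 can be transcribed almost one-to-one into PyTorch modules as below. The class names are arbitrary; the use of plain linear layers, sigmoid gates and the tanh nonlinearity in the memory update are assumptions where the claim does not fix the layer types, and Da, Du, Dγ1, Dγ2 and Dn follow the symbols used above.

import torch
import torch.nn as nn

class DTAN(nn.Module):
    # a[t-1,t] = softmax(Da(c[t-1,t])); output = c[t-1,t] weighted element-wise by a.
    def __init__(self, dim):
        super().__init__()
        self.Da = nn.Linear(2 * dim, 2 * dim)          # trainable fully-connected network Da

    def forward(self, c_prev, c_curr):
        c_pair = torch.cat([c_prev, c_curr], dim=-1)   # c[t-1,t]
        a = torch.softmax(self.Da(c_pair), dim=-1)     # attention coefficients a[t-1,t]
        return c_pair * a                              # retained memory (first weighted feature)

class GMN(nn.Module):
    # u(t) = gamma1 * u(t-1) + gamma2 * tanh(u_hat(t)), with u_hat = Du(c_hat),
    # gamma1 = D_gamma1(c_hat) (retention gate), gamma2 = D_gamma2(c_hat) (update gate).
    def __init__(self, dim, mem_dim):
        super().__init__()
        self.Du = nn.Linear(2 * dim, mem_dim)
        self.D_gamma1 = nn.Sequential(nn.Linear(2 * dim, mem_dim), nn.Sigmoid())
        self.D_gamma2 = nn.Sequential(nn.Linear(2 * dim, mem_dim), nn.Sigmoid())

    def forward(self, c_hat, u_prev):
        u_hat = self.Du(c_hat)            # cross-modality update component
        gamma1 = self.D_gamma1(c_hat)     # how much of the previous memory to retain
        gamma2 = self.D_gamma2(c_hat)     # how much of the update component to write
        return gamma1 * u_prev + gamma2 * torch.tanh(u_hat)

class GTAN(nn.Module):
    # alpha_n = softmax over time of Dn(Hn); z_n = attention-weighted sum of Hn.
    def __init__(self, dim):
        super().__init__()
        self.Dn = nn.Linear(dim, 1)

    def forward(self, H):                 # H: (batch, time, dim)
        alpha = torch.softmax(self.Dn(H), dim=1)
        return (alpha * H).sum(dim=1)     # z_n: weighted sum over all timestamps

These components match the interfaces assumed in the pipeline sketch given in the second embodiment: DTAN(total_hidden), GMN(total_hidden, mem_dim) and one GTAN(hidden_n) per modality.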
8. The multimode emotion recognition system based on the attention mechanism and the GMN is characterized by comprising the following components:
a pre-processing module configured to: preprocess the acquired video to be recognized to obtain the text features, voice features and facial expression features of the video to be recognized;

a first feature vector acquisition module configured to: concurrently input the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of the trained multi-modal emotion recognition network, and output a first feature vector;

a second feature vector acquisition module configured to: perform weighted summation on the memory output values of all adjacent timestamps of the LSTMs models to obtain a first weighted feature, input the first weighted feature into the gated memory network GMN of the trained multi-modal emotion recognition network, and output a second feature vector;

a third feature vector acquisition module configured to: obtain a third feature vector through the trained global attention mechanism network GTAN of the multi-modal emotion recognition network, which performs weighted summation on the memory output values of all timestamps under each LSTM model;

an emotion recognition module configured to: fuse the first feature vector, the second feature vector and the third feature vector through the trained multi-modal emotion recognition network to obtain a fused feature vector, and perform emotion recognition on the fused feature vector to obtain an emotion recognition result.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202110239787.7A 2021-03-04 2021-03-04 Multi-mode emotion recognition method and system based on attention mechanism and GMN Pending CN113095357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110239787.7A CN113095357A (en) 2021-03-04 2021-03-04 Multi-mode emotion recognition method and system based on attention mechanism and GMN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110239787.7A CN113095357A (en) 2021-03-04 2021-03-04 Multi-mode emotion recognition method and system based on attention mechanism and GMN

Publications (1)

Publication Number Publication Date
CN113095357A true CN113095357A (en) 2021-07-09

Family

ID=76666377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110239787.7A Pending CN113095357A (en) 2021-03-04 2021-03-04 Multi-mode emotion recognition method and system based on attention mechanism and GMN

Country Status (1)

Country Link
CN (1) CN113095357A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275085A (en) * 2020-01-15 2020-06-12 重庆邮电大学 Online short video multi-modal emotion recognition method based on attention fusion
CN111898670A (en) * 2020-07-24 2020-11-06 深圳市声希科技有限公司 Multi-mode emotion recognition method, device, equipment and storage medium
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈炜青 (Chen Weiqing): "Research on Multi-modal Emotion Recognition Based on Deep Learning", China Excellent Doctoral and Master's Dissertations Full-text Database (Master), Information Science and Technology Series *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255635B (en) * 2021-07-19 2021-10-15 中国科学院自动化研究所 Multi-mode fused psychological stress analysis method
CN113255635A (en) * 2021-07-19 2021-08-13 中国科学院自动化研究所 Multi-mode fused psychological stress analysis method
CN113723463A (en) * 2021-08-02 2021-11-30 北京工业大学 Emotion classification method and device
WO2023050708A1 (en) * 2021-09-29 2023-04-06 苏州浪潮智能科技有限公司 Emotion recognition method and apparatus, device, and readable storage medium
CN113674767A (en) * 2021-10-09 2021-11-19 复旦大学 Depression state identification method based on multi-modal fusion
CN114155882A (en) * 2021-11-30 2022-03-08 浙江大学 Method and device for judging road rage emotion based on voice recognition
CN114155882B (en) * 2021-11-30 2023-08-22 浙江大学 Method and device for judging emotion of road anger based on voice recognition
CN114218380A (en) * 2021-12-03 2022-03-22 淮阴工学院 Multi-mode-based cold chain loading user portrait label extraction method and device
CN114218380B (en) * 2021-12-03 2022-07-29 淮阴工学院 Multi-mode-based cold chain loading user portrait label extraction method and device
CN114648805A (en) * 2022-05-18 2022-06-21 华中科技大学 Course video sight correction model, training method thereof and sight drop point estimation method
CN115271002B (en) * 2022-09-29 2023-02-17 广东机电职业技术学院 Identification method, first-aid decision method, medium and life health intelligent monitoring system
CN115271002A (en) * 2022-09-29 2022-11-01 广东机电职业技术学院 Identification method, first-aid decision method, medium and life health intelligent monitoring system
CN116070169A (en) * 2023-01-28 2023-05-05 天翼云科技有限公司 Model training method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113095357A (en) Multi-mode emotion recognition method and system based on attention mechanism and GMN
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
CN111275085B (en) Online short video multi-modal emotion recognition method based on attention fusion
Pandey et al. Deep learning techniques for speech emotion recognition: A review
CN109003625B (en) Speech emotion recognition method and system based on ternary loss
Cai et al. Multi-modal emotion recognition from speech and facial expression based on deep learning
Deng et al. Multimodal utterance-level affect analysis using visual, audio and text features
CN108804453A (en) A kind of video and audio recognition methods and device
WO2023050708A1 (en) Emotion recognition method and apparatus, device, and readable storage medium
CN113065344A (en) Cross-corpus emotion recognition method based on transfer learning and attention mechanism
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
Cai et al. Multimodal sentiment analysis based on recurrent neural network and multimodal attention
Wang et al. A novel multiface recognition method with short training time and lightweight based on ABASNet and H-softmax
Sahu et al. Modeling feature representations for affective speech using generative adversarial networks
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
Sun et al. EmotionNAS: Two-stream Architecture Search for Speech Emotion Recognition
Gong et al. Human interaction recognition based on deep learning and HMM
Bakhshi et al. Multimodal emotion recognition based on speech and physiological signals using deep neural networks
Khalane et al. Context-aware multimodal emotion recognition
CN116167015A (en) Dimension emotion analysis method based on joint cross attention mechanism
CN113626553B (en) Cascade binary Chinese entity relation extraction method based on pre-training model
CN114693949A (en) Multi-modal evaluation object extraction method based on regional perception alignment network
Ng et al. The investigation of different loss functions with capsule networks for speech emotion recognition
Torabi et al. Action classification and highlighting in videos

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210709