CN113095357A - Multi-mode emotion recognition method and system based on attention mechanism and GMN - Google Patents
- Publication number: CN113095357A
- Application number: CN202110239787.7A
- Authority
- CN
- China
- Prior art keywords
- emotion recognition
- network
- feature vector
- model
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/253 — Fusion techniques of extracted features
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F40/30 — Semantic analysis
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06V40/174 — Facial expression recognition
- G10L25/63 — Speech or voice analysis techniques specially adapted for estimating an emotional state
Abstract
The invention discloses a multi-modal emotion recognition method and system based on an attention mechanism and a GMN (Gated Memory Network). An acquired video to be recognized is preprocessed to obtain text, voice and facial expression features. These features are concurrently input into the LSTMs model of a trained multi-modal emotion recognition network, which outputs a first feature vector. The memory output values of all adjacent timestamps of the LSTMs model are weighted and summed to obtain a first weighted feature; this is input into the trained gated memory network GMN, which outputs a second feature vector. The trained global attention mechanism network GTAN performs a weighted summation over the memory output values of all timestamps of each LSTM model to obtain a third feature vector. The first, second and third feature vectors are fused to obtain a fused feature vector, and emotion recognition is performed on the fused feature vector to obtain the emotion recognition result.
Description
Technical Field
The invention relates to the technical field of emotion recognition, in particular to a multi-mode emotion recognition method and system based on an attention mechanism and GMN.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the development of artificial intelligence, machines are expected to recognize people's true expressions better in order to serve them better and provide services that meet their expectations, and the call for human-computer interaction grows ever louder. However, most so-called intelligent terminals can only perform simple execution tasks and cannot achieve true human-computer interaction. The key to realizing true human-computer interaction is enabling the intelligent terminal to correctly recognize the emotions people display, which is called emotion recognition. Emotional expression is an important part of human development and communication: people recognize one another's emotions through changes in voice tone, emotive words, facial expressions and body movements. In the field of artificial intelligence, emotion recognition is an important technology for human-computer interaction; it integrates multiple disciplines such as speech signal processing, psychology, pattern recognition and video image processing, and can be applied in fields such as education, transportation and medical care.
Emotion recognition essentially belongs to pattern recognition in computer technology: it requires acquiring and then processing data that carries human emotional information. The most common data sources in daily life are audio and video, and psychological studies have shown that facial expressions in video, together with speech and text in audio, play a crucial role in the expression of human emotion. Audio-based emotion recognition is generally speech emotion recognition, while video-based emotion recognition is generally facial expression recognition. Although these two single-modal approaches have developed considerably, human emotion is composed of multi-modal information; the information across modalities is complementary, and emotion recognition that fuses audio and video can make full use of this multi-modal information. Multi-modal emotion recognition has therefore become an important research focus.
Multi-modal emotion recognition was initially explored using classifiers such as support vector machines (SVM), linear regression and logistic regression. In early multi-modal methods, for the video signal, an optical-flow method detected the movement and moving speed of facial key points (such as mouth corners and inner eyebrow corners), and a KNN algorithm judged the emotion category of the video modality; for the voice signal, pitch features and an HMM algorithm judged the emotion category of the voice modality; finally, the video-modality and audio-modality emotion categories were weighted and combined to obtain the recognition result. Other methods combine video, audio and text, using multiple-kernel learning (MKL) within an SVM to merge the three modalities and obtain higher recognition accuracy. Methods produced in recent years include feeding a mel-frequency spectrogram of the audio signal into a CNN and face frames of the video signal into a 3D CNN, and fusing, at the score level, audio features of the voice signal with dense features and CNN-based features of the image frames to recognize emotion.
Although multi-modal emotion recognition overcomes the single-modal defects of limited, non-complementary information, how to process and fuse information from different modalities remains a difficult problem. The traditional multi-modal information fusion frameworks are data-layer fusion, feature-layer fusion and decision-layer fusion, each with its own strengths. In a practical task, however, the best fusion mode must be chosen with the actual problem in mind. This invention adopts a deep-learning feature-layer fusion approach to process the text information, audio signals and video signals.
Disclosure of Invention
In order to solve the defects of the prior art, the invention provides a multi-modal emotion recognition method and system based on an attention mechanism and GMN;
in a first aspect, the invention provides a multi-modal emotion recognition method based on an attention mechanism and GMN;
the multimode emotion recognition method based on the attention mechanism and the GMN comprises the following steps:
preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the LSTMs model of the trained multi-modal emotion recognition network, and outputting a first feature vector;
carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
In a second aspect, the invention provides a multi-modal emotion recognition system based on an attention mechanism and GMN;
an attention mechanism and GMN based multi-modal emotion recognition system, comprising:
a pre-processing module configured to: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
a first feature vector acquisition module configured to: concurrently input the text features, voice features and facial expression features of the video to be recognized into the LSTMs model of the trained multi-modal emotion recognition network and output a first feature vector;
a second feature vector acquisition module configured to: carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
a third feature vector acquisition module configured to: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
an emotion recognition module configured to: the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
In a third aspect, the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
compared with traditional single-modal emotion recognition, which suffers from insufficient information and poor robustness, multi-modal emotion recognition offers sufficient information and complementary modalities. For information to complement well, the feature information of different modalities must not interfere with one another, and the dual attention mechanism solves this problem well. By using the attention mechanism to distribute weights across the modalities, the invention obtains the optimal contribution ratio among them, so that information is fully fused and interacts across modalities without the modalities excluding one another. In addition, the gated memory network stores the post-interaction information so that it can be utilized to the maximum. Moreover, through the weighted-aggregation strategy of the attention mechanism, the invention learns a set of weight parameters describing the importance of each frame in the LSTM output sequence and combines them, thereby highlighting the specific segments that carry strong emotional features.
The invention is a multi-modal emotion recognition method based on a dual attention mechanism and a gated memory network. It considers both the influence of a single modality's information at different moments on emotion recognition and the influence of different modalities at the same moment. Finally, the outputs of the gated memory network, the global attention mechanism and the LSTMs system are fused for information complementation, achieving a good emotion recognition effect; the method therefore has good application prospects.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2 is a diagram illustrating a network connection relationship according to the first embodiment;
fig. 3 is a schematic diagram of a network connection of the DTAN model according to the first embodiment;
fig. 4 is a schematic diagram of network connection of a GMN model according to a first embodiment;
fig. 5 is a schematic diagram of a network connection of the GTAN model according to the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment provides a multi-modal emotion recognition method based on an attention mechanism and GMN;
as shown in fig. 2, the method for multi-modal emotion recognition based on attention mechanism and GMN includes:
S101: preprocessing the acquired video to be recognized to obtain text features, voice features and facial expression features of the video to be recognized;
S102: concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the LSTMs model of the trained multi-modal emotion recognition network, and outputting a first feature vector;
S103: carrying out weighted summation on the memory output values of all adjacent timestamps of the LSTMs model to obtain a first weighted feature; inputting the first weighted feature into the gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
S104: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network performs weighted summation over the memory output values of all timestamps under each LSTM model to obtain a third feature vector;
S105: the trained multi-modal emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fused feature vector to obtain an emotion recognition result.
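As a rough illustration of how steps S102–S105 fit together, the following minimal numpy sketch uses random arrays to stand in for trained LSTM memories and attention scores; the GMN branch is omitted for brevity, and all dimensions are made-up values, not figures from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

T, d = 20, 32  # timestamps and per-modality memory size (assumed)
# Simulated LSTM memory outputs c_t for the three modalities
c = {m: rng.normal(size=(T, d)) for m in ("text", "speech", "face")}

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# S102: first feature vector -- final memory of each modality, concatenated
first = np.concatenate([c[m][-1] for m in c])  # shape (3*d,)

# S104: GTAN-style global temporal attention -- one weight per timestamp
# (random scores stand in for the trained attention network)
alpha = softmax(rng.normal(size=T))  # (T,), sums to 1
third = np.concatenate([(alpha[:, None] * c[m]).sum(axis=0) for m in c])

# S105: fuse the vectors (the GMN's second vector is skipped in this sketch)
fused = np.concatenate([first, third])
print(fused.shape)  # (192,)
```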
As one or more embodiments, the S101: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized; the method comprises the following specific steps:
separating the video to be identified to obtain an audio signal and a video signal;
carrying out voice recognition on the audio signal to obtain text information;
performing feature extraction on the text information to obtain text features;
carrying out feature extraction on the audio signal to obtain a voice feature;
and carrying out feature extraction on the video signal to obtain facial expression features.
Exemplarily, the video to be recognized is separated to obtain an audio signal and a video signal; the original video is split into audio and video using a fast video-converter tool.
Illustratively, performing voice recognition on the audio signal to obtain text information specifically comprises the following step:
using iFlytek automatic speech recognition software to transcribe the audio automatically and obtain text data from it.
Further, performing feature extraction on the text information to obtain text features specifically comprises the following step:
performing word-vectorization feature extraction on the text information using the GloVe (Global Vectors for Word Representation) model to obtain text features.
Exemplarily, the feature extraction performed on the text information to obtain text features specifically comprises the following step:
for text data, a pre-trained 300-dimensional GloVe model is used for embedding; each word receives a 300-dimensional text feature, finally yielding a T×300 feature-vector matrix.
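The GloVe embedding step amounts to a per-token table lookup. Below is a hedged sketch with a tiny made-up vocabulary; a real run would load the published 300-dimensional GloVe vectors from disk, and the zero-vector fallback for unknown words is an assumption, not something the patent specifies:

```python
import numpy as np

# Hypothetical miniature GloVe table: word -> 300-dim vector.
rng = np.random.default_rng(1)
glove = {w: rng.normal(size=300) for w in ["i", "feel", "happy"]}
unk = np.zeros(300)  # fallback for out-of-vocabulary words (assumed)

def embed(tokens):
    """Map T tokens to a T x 300 feature-vector matrix."""
    return np.stack([glove.get(t.lower(), unk) for t in tokens])

features = embed("I feel happy today".split())
print(features.shape)  # (4, 300); "today" is OOV, so its row is zeros
```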
Further, performing feature extraction on the audio signal to obtain voice features specifically comprises the following step:
performing feature extraction on the audio signal using the COVAREP speech-processing library to obtain speech features.
Illustratively, the feature extraction performed on the audio signal to obtain voice features specifically comprises the following step:
for audio data, the signal is first segmented at a rate of 100 frames per second and then passed through the COVAREP feature-extraction tool to obtain a T1×74 feature-vector matrix.
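COVAREP itself is an external toolbox, but the 100-frames-per-second segmentation can be sketched directly. The sampling rate and analysis-window length below are assumptions for illustration, not values stated in the patent:

```python
import numpy as np

sr = 16_000          # assumed sampling rate
hop = sr // 100      # 100 frames per second -> 160-sample hop
win = 400            # assumed 25 ms analysis window

signal = np.zeros(sr)  # one second of (silent) audio for illustration
n_frames = 1 + (len(signal) - win) // hop
frames = np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])
print(frames.shape)  # (98, 400) -- roughly 100 frames per second of audio
```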
Further, performing feature extraction on the video signal to obtain facial expression features; the method specifically comprises the following steps:
and carrying out face contour recognition, face key point extraction, face contour correction, sight line estimation, head posture and facial motion unit feature extraction on the video signal to obtain facial expression features.
Illustratively, the feature extraction performed on the video signal to obtain facial expression features specifically comprises the following step:
for video data, the OpenFace 2.0 facial-behavior analysis tool is used for feature extraction. Feeding the complete video into OpenFace 2.0 yields 68 facial key points, face shape parameters, head-pose estimation, gaze estimation, facial action units, HOG features and the like, finally producing a T2×711 feature-vector matrix.
As one or more embodiments, after step S101 (preprocessing the acquired video to be recognized to obtain its text, voice and facial expression features) and before step S102 (concurrently inputting the text, voice and facial expression features of the video to be recognized into the corresponding LSTMs models and outputting a first feature vector), the method further comprises:
S101-2: performing data alignment and standardization on all the obtained features.
The information of the three modalities is aligned using the Penn Phonetics Lab Forced Aligner (P2FA) from the University of Pennsylvania.
The features are aligned in the time dimension so that information can conveniently interact across modalities. The aligned data is characterized by text l of size T × d_l with d_l = 300, speech a of size T × d_a with d_a = 74, and video v of size T × d_v with d_v = 711.
For the aligned overall data N = {l, a, v}, the data needs to be standardized so that each feature dimension is normalized to a specific interval, turning dimensional quantities into dimensionless ones. The method employed in the invention is Z-score feature standardization, also known as standard-deviation normalization. The formula is as follows:

x′ = (x − μ) / σ

where x is the input sample, μ is the mean of all sample data, and σ is the standard deviation of all sample data. Standardized data helps accelerate convergence under gradient-descent or stochastic-gradient-descent training and can improve model accuracy.
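A minimal implementation of the Z-score standardization described above, applied column-wise so that each feature dimension is normalized independently:

```python
import numpy as np

def z_score(x, mu=None, sigma=None):
    """Z-score (standard-deviation) normalization: x' = (x - mu) / sigma."""
    mu = x.mean(axis=0) if mu is None else mu
    sigma = x.std(axis=0) if sigma is None else sigma
    return (x - mu) / sigma

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
z = z_score(x)
print(z.mean(axis=0))  # ~[0, 0]
print(z.std(axis=0))   # ~[1, 1]
```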
As one or more embodiments, the multi-modal emotion recognition network has a network structure comprising: an LSTMs model (Long Short-Term Memory networks), a DTAN model (Delta-Time Attention Network), a GMN model (Gated Memory Network) and a GTAN model (Global-Time Attention Network);
the LSTMs model is connected with the GMN model through the DTAN model, and the GMN model is connected with the fusion module;
the LSTMs model is connected with the fusion module;
the LSTMs model is connected with the GTAN model, and the GTAN model is connected with the fusion module;
the fusion module is connected with the first full-connection layer, the first full-connection layer is connected with the second full-connection layer, and the second full-connection layer is connected with the output layer.
Wherein the LSTMs model comprises a first LSTM model, a second LSTM model and a third LSTM model in parallel; each LSTM model comprises a plurality of memories connected in series, where "memory" refers to the c_t unit in the LSTM model that stores data information at time t;
the DTAN model comprises: sequentially connected fully-connected neural network DaThe softmax network and the first multiplier; the fully-connected neural network DaThe full-connection layer FC1, the Dropout layer and the full-connection layer FC2 are sequentially connected; wherein, the full connection layer F2 is connected with the softmax network; the input end of the full connection layer FC1 is connected with the input end of the first multiplier; the output terminal of the first multiplier is used as the output terminal of the DTAN model.
The working principle of the DTAN model is as follows: the memories of the LSTMs model at times t−1 and t are concatenated into c_[t−1,t] and passed to the trainable fully-connected neural network D_a; a softmax network then normalizes the output of D_a to the range (0, 1], yielding the same-time cross-modal attention coefficients a_[t−1,t].
As shown in FIG. 3, pre_c_l, pre_c_a, and pre_c_v are the LSTM memory outputs of text, speech, and image at time t-1, respectively; concatenating the three yields c_{t-1}. c_l, c_a, and c_v are the LSTM memory outputs of text, speech, and image at time t, respectively; concatenating the three yields c_t. c_{t-1} and c_t are then concatenated into c^[t-1,t], which is input into the fully-connected layer FC1 of the fully-connected neural network D_a;
the fully-connected neural network D_a comprises two fully-connected layers, FC1 and FC2, and one Dropout layer to prevent overfitting;
the softmax layer scores the activations of the LSTM memories of each modality at times t and t-1, i.e. the attention coefficients a^[t-1,t];
a^[t-1,t] and c^[t-1,t] are multiplied element-wise to obtain the first weighted feature ĉ^[t-1,t]; the ⊙ operator denotes element-wise multiplication of two vectors of the same dimension, producing a vector of that same dimension.
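The DTAN step above — concatenate the memories of two adjacent timestamps, score them, softmax-normalize, and reweight element-wise — can be sketched in plain Python. This is a minimal illustration: the identity `score_fn` is a hypothetical stand-in for the trained D_a network (FC1 → Dropout → FC2), and the toy memory vectors are invented.

```python
import math

def softmax(xs):
    """Numerically stable softmax; outputs lie in (0, 1] and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dtan_step(c_prev, c_curr, score_fn):
    """One DTAN step: concatenate memories at t-1 and t, score them,
    softmax-normalize the scores, and reweight the memories element-wise."""
    c_cat = c_prev + c_curr                       # c^[t-1,t]
    scores = score_fn(c_cat)                      # stand-in for the trained D_a
    a = softmax(scores)                           # attention coefficients a^[t-1,t]
    return [ai * ci for ai, ci in zip(a, c_cat)]  # first weighted feature

# toy concatenated modality memories at two adjacent timestamps
c_prev = [0.2, -0.1, 0.4]
c_curr = [0.5, 0.3, -0.2]
weighted = dtan_step(c_prev, c_curr, lambda c: c)  # identity scorer for illustration
```

The output keeps the dimension of the concatenated memories, exactly as the ⊙ operation requires.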
the D_u network comprises: a fully-connected layer FC3, a Dropout layer, and a fully-connected layer FC4, connected in sequence;
the D_γ1 network comprises: a fully-connected layer FC5, a Dropout layer, and a fully-connected layer FC6, connected in sequence;
the D_γ2 network comprises: a fully-connected layer FC7, a Dropout layer, and a fully-connected layer FC8, connected in sequence;
the input ends of the full connection layer FC3, the full connection layer FC5 and the full connection layer FC7 are all connected with the output end of the DTAN model;
the full connection layer FC4 is connected with the input end of the second multiplier through a first sigma function;
the full connection layer FC6 is connected with the input end of the third multiplier through a second sigma function;
the full connection layer FC8 is connected with the input end of the third multiplier through a tanh function;
the output end of the second multiplier and the output end of the third multiplier are both connected with the input end of the fourth multiplier, and the output end of the fourth multiplier is connected with the input end of the second multiplier.
Wherein the D_u network generates the cross-modality update component û of the multimodal gated memory network; the D_γ1 network controls the retention gate γ1, whose purpose is to decide how much of the current state of the multimodal gated memory network to remember; and the D_γ2 network controls the update gate γ2, whose purpose is to decide how the update component û revises the memory of the multimodal gated memory network.
As shown in FIG. 4, ĉ^[t-1,t] is the output of the DTAN, i.e. the first weighted feature. The D_u, D_γ1, and D_γ2 networks are all fully-connected neural networks: fully-connected layers FC3 and FC4 belong to the D_u network, FC5 and FC6 belong to the D_γ1 network, and FC7 and FC8 belong to the D_γ2 network; each Dropout layer prevents overfitting. The ⊙ operator multiplies the corresponding elements of two vectors of the same dimension and yields a vector of that same dimension. At each timestamp t of the network recursion, the retention gate γ1, the update gate γ2, and the current modal-interaction update component û are used to update u_t; u_t must be initialized at time 0.
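The gated-memory recursion described above can be sketched in a few lines. This is a minimal illustration, assuming a Memory-Fusion-Network-style update u_t = γ1 ⊙ u_{t-1} + γ2 ⊙ tanh(û_t), which matches the retention-gate / update-gate / tanh description; the raw gate and update values passed in are hypothetical stand-ins for the outputs of the trainable D_γ1, D_γ2, and D_u networks.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gmn_step(u_prev, gamma1_raw, gamma2_raw, u_hat_raw):
    """One gated-memory update: u_t = g1 ⊙ u_{t-1} + g2 ⊙ tanh(u_hat).
    g1 is the retention gate, g2 the update gate, u_hat the update component."""
    g1 = [sigmoid(x) for x in gamma1_raw]    # retention gate (from D_gamma1)
    g2 = [sigmoid(x) for x in gamma2_raw]    # update gate (from D_gamma2)
    upd = [math.tanh(x) for x in u_hat_raw]  # squashed update component (from D_u)
    return [a * u + b * v for a, u, b, v in zip(g1, u_prev, g2, upd)]

u0 = [0.0, 0.0]                              # u is initialized at time 0
u1 = gmn_step(u0, [2.0, -2.0], [0.5, 0.5], [1.0, -1.0])
```

With u initialized to zero, the first step is driven entirely by the update gate, as expected from the recurrence.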
In the GTAN model, the softmax function produces the weighting coefficients α^n; each multiplier weights the LSTM output of the corresponding timestamp by its coefficient, and the weighted outputs are summed to form the modality output.
The GTAN model operates as follows: for each modality, the matrix H^n formed by the outputs h_t^n of that modality's LSTM at every time step is given an attention-mechanism weight assignment α^n to obtain the final output z^n, where n = 1, 2, 3 indexes the three modalities. The goal is to mine the optimal allocation coefficients for the entire time series, so as to highlight the most emotional frames and supplement the DTAN and GMN information, as shown in FIG. 5.
Here h_t^n is the output of the LSTM at each time step, and n = 1, 2, 3 indexes the three modalities — text, speech, and image; that is, each of the three modalities performs the above steps to obtain its own z vector, and the three z vectors are then concatenated into the final z vector. α^n holds the weight-distribution coefficients of the outputs h_t^n of the nth modality at every time step; the ⊙ operator denotes element-wise multiplication of two vectors of the same dimension.
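The GTAN weighted sum over time can be sketched as follows. This is a minimal illustration for a single modality; the attention weights `alpha` are hypothetical values standing in for the softmax-normalized coefficients α^n the trained network would produce.

```python
def gtan_output(H, alpha):
    """Weighted sum over time for one modality n: z = sum_t alpha_t * h_t.
    H is a list of per-timestep LSTM outputs h_t (each a vector);
    alpha is the softmax-normalized weight for each timestep."""
    dim = len(H[0])
    z = [0.0] * dim
    for a_t, h_t in zip(alpha, H):
        for i in range(dim):
            z[i] += a_t * h_t[i]
    return z

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 timesteps, hidden size 2
alpha = [0.2, 0.3, 0.5]                     # hypothetical attention weights (sum to 1)
z = gtan_output(H, alpha)                   # -> [0.7, 0.8]
```

The per-modality z vectors would then be concatenated across the three modalities, as the text describes.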
Illustratively, LSTMs consists of a plurality of Long Short Term Memory (LSTM) networks, one for each modality. The modality corresponding to the first LSTM model is a text feature, the modality corresponding to the second LSTM model is a voice feature and the modality corresponding to the third LSTM model is a facial expression feature; each LSTM encodes its modality-specific dynamics and interactions.
Illustratively, Delta-Time improves attention mechanism DTAN. The Delta-Time Attention Network (DTAN) Network is an improved Attention mechanism and aims to find out the memory information interaction and Time interaction between different modes in the LSTM system.
Illustratively, a multimodal gated memory network GMN is used. A multimodal Gated Memory Network (GMN) is a storage module for storing cross-time interaction and cross-modality interaction information.
Illustratively, the global attention mechanism is GTAN. The Global-Time Attention Network (GTAN) mines the optimal allocation coefficients of the entire time series to highlight the most emotional frames and to supplement information for the features obtained by the DTAN and GMN.
In one or more embodiments, the multi-modal emotion recognition network, the training step includes:
constructing a training set, wherein the training set is a text feature, a voice feature and a facial expression feature which correspond to the same video of a known emotion category label;
inputting the text features of the training set into the first LSTM model; meanwhile,
inputting the voice features of the training set into the second LSTM model; meanwhile,
inputting the facial expression features of the training set into the third LSTM model;
using the known emotion category labels as output values of the multi-mode emotion recognition network;
training a multi-mode emotion recognition network; and obtaining the trained multi-modal emotion recognition network.
Illustratively, for each modality-characterized sequence, the long short term memory network (LSTM) encodes the features of each modality over time. At each input timestamp, feature information from each modality is input into the assigned respective LSTM model.
As one or more embodiments, the S102: the method comprises the steps that text features, voice features and facial expression features of a video to be recognized are input into an LSTMs model of a trained multi-modal emotion recognition network in a concurrent mode, and a first feature vector is output; the method comprises the following specific steps:
inputting the text features of a video to be recognized into the first LSTM model, which outputs a first coding vector; meanwhile,
inputting the voice features of the video to be recognized into the second LSTM model, which outputs a second coding vector; meanwhile,
inputting the facial expression features of the video to be recognized into the third LSTM model, which outputs a third coding vector;
and splicing the first, second and third coded vectors to obtain a first feature vector.
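The splicing step can be sketched as a simple concatenation. This is a minimal illustration; the encoding lengths below are hypothetical, and recording the slice boundaries is an optional convenience not stated in the source.

```python
def first_feature_vector(text_code, speech_code, face_code):
    """Splice the three per-modality LSTM coding vectors into one feature vector.
    The returned bounds record where each modality's segment begins and ends."""
    fused = list(text_code) + list(speech_code) + list(face_code)
    bounds = (len(text_code), len(text_code) + len(speech_code))
    return fused, bounds

# hypothetical encoding sizes for the three modalities
fused, bounds = first_feature_vector([0.1] * 4, [0.2] * 3, [0.3] * 5)
```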
As one or more embodiments, the S103: carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; the method comprises the following specific steps:
the input to the DTAN is the concatenation of the memories at times t-1 and t, denoted c^[t-1,t]. These memories are passed to a trainable fully-connected neural network D_a to obtain the attention coefficients a^[t-1,t]:

a^[t-1,t] = softmax(D_a(c^[t-1,t]))

a^[t-1,t] is the softmax activation score, i.e. the weighting value, for each LSTM memory at times t-1 and t. Applying softmax on the output layer of D_a normalizes the coefficients of c^[t-1,t] into the range (0, 1]. The output of the DTAN is defined as:

ĉ^[t-1,t] = a^[t-1,t] ⊙ c^[t-1,t]
where ĉ^[t-1,t] is the memory retained after the LSTM memories pass through the DTAN, i.e. the first weighted feature.
The DTAN is also able to discover modal interactions that occur at different timestamps because it operates on the memories c in the LSTMs system, which may carry information about inputs observed across different timestamps.
Illustratively, the goal of the DTAN is to outline the cross-modal interaction between the memories of different modalities in the LSTMs system at timestamp t. Therefore, at time t, an attention mechanism automatically assigns weight coefficients over the concatenation of the LSTM memories c_t. Modal interaction is achieved by assigning a high coefficient to the modality that dominates the emotional effect at timestamp t and low coefficients to the other modalities. However, performing coefficient assignment using only the memory c_t at time t is undesirable; the memory c_{t-1} at time t-1 must also be included, so that the DTAN can keep the memory information of the LSTMs system at a constant size and assign high coefficients only when the memories are about to change.
As one or more embodiments, the S103: inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector; wherein t is a positive integer; the method comprises the following specific steps:
the first weighted feature ĉ^[t-1,t] is used as the input of a trainable neural network D_u to generate the cross-modality update component û of the multimodal gated memory network, where d_mem is the dimension of the multimodal gated memory network;
the multi-view gated memory is controlled by two gates, γ1 and γ2, called the retention gate and the update gate, respectively. At each timestamp t, γ1 decides how much of the current state of the multimodal gated memory network to remember, while γ2 decides, based on the update component û, how the memory of the multimodal gated memory network is updated. γ1 and γ2 are each controlled by a trainable neural network, D_γ1 and D_γ2, both taking the DTAN output ĉ^[t-1,t] as input:

γ_i = D_γi(ĉ^[t-1,t]), i = 1, 2
at each timestamp t of the whole network recursion, the retention gate γ1, the update gate γ2, and the current modal-interaction update component û are used to update u as follows:

u_t = γ1 ⊙ u_{t-1} + γ2 ⊙ tanh(û_t)
illustratively, u of the multimodal gated memory network GMN is a neural component that stores a history of interactions across time. It acts as supplemental memory to the memories in the LSTMs system. The output of the DTAN is passed directly to the multimodal gated memory network to represent cross-modal interactions made up of key dimensions of different modalities in the LSTM memory system.
As one or more embodiments, the S104: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network performs weighted summation on memories of all timestamps under each LSTM model to obtain a third feature vector; the method comprises the following specific steps:
the matrix H^n formed by the outputs h_t^n of each modality's LSTM system at every time step is given the automatic attention-mechanism weight assignment α^n to obtain the final output z^n:

z^n = Σ_t α_t^n h_t^n
α^n contains the automatic optimal allocation coefficients of the modality at different times globally; the softmax activation function normalizes its range to (0, 1]. The final output z^n is an information-supplement vector for the outputs of the DTAN, GMN, and LSTMs.
Illustratively, for global attention mechanism (GTAN), the goal is to mine the best allocation coefficients for the entire time series and supplement the information for DTAN and GMN. And carrying out automatic weight distribution of an Attention mechanism on a matrix formed by the output of each time of the LSTM system of each mode to obtain final output.
As one or more embodiments, the S105: the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; performing emotion recognition on the fused feature vector to obtain an emotion recognition result; the method comprises the following specific steps:
the trained multi-mode emotion recognition network splices the first, second and third feature vectors to obtain a fusion feature vector;
and performing emotion recognition on the fusion feature vectors, and classifying by using a full-connection neural network to obtain an emotion recognition result.
The output u of the gated memory network GMN, the output of each modality's LSTM at the last time step, and the output of each modality's global attention mechanism are concatenated. The concatenated result is then passed through two fully-connected layers FC to obtain the final emotion prediction result.
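The fusion step — concatenate u, the last LSTM outputs, and the GTAN outputs, then apply two FC layers — can be sketched as follows. This is a minimal illustration with tiny hypothetical sizes; the weights W1/W2, biases, and the ReLU between the layers are invented stand-ins for the trained network, not disclosed values.

```python
def linear(W, b, x):
    """Dense layer: y = W x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def fuse_and_predict(u, h_last, z, W1, b1, W2, b2):
    """Concatenate GMN output u, last LSTM outputs h_last, and GTAN output z,
    then pass the fused vector through two FC layers to get the prediction."""
    r = list(u) + list(h_last) + list(z)
    return linear(W2, b2, relu(linear(W1, b1, r)))

# toy sizes: fused vector of length 4 -> hidden size 2 -> scalar score
u, h_last, z = [0.5], [1.0, -1.0], [0.25]
W1 = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
b1 = [0.0, 0.0]
W2 = [[1.0, 1.0]]
b2 = [0.0]
y = fuse_and_predict(u, h_last, z, W1, b1, W2, b2)  # -> [0.375]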
The loss function used by the multi-modal emotion recognition network is the L1 loss. The L1 loss is the absolute value of the difference between the predicted value and the true label value, also known as the Manhattan distance:

L1 = (1/n) Σ_i |y_i − ŷ_i|
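The L1 loss reduces to a one-line mean of absolute differences; a minimal sketch with invented label/prediction values:

```python
def l1_loss(y_true, y_pred):
    """Mean absolute error between true labels and predictions (Manhattan distance)."""
    assert len(y_true) == len(y_pred)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

# hypothetical labels on the MOSI-style [-3, +3] scale and model predictions
loss = l1_loss([3.0, -2.0, 0.0], [2.5, -1.0, 0.5])  # -> (0.5 + 1.0 + 0.5) / 3
```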
The whole emotion recognition network is trained and its performance comprehensively evaluated, with the following evaluation criteria: binary accuracy, F1 score, weighted accuracy, MAE, and the r coefficient.
As shown in FIG. 1, the multi-modal emotion recognition method based on a double attention mechanism and a gated memory network is disclosed. The invention uses the MOSI dataset to validate the proposed algorithm, comprising in particular the following steps:
The emotional labels of the dataset lie on a linear range from -3 to +3, going from strongly negative to strongly positive. The intensity was annotated by online workers from Amazon Mechanical Turk. For each video, the annotator has seven choices: strong positive (labeled +3), positive (+2), weak positive (+1), neutral (0), weak negative (-1), negative (-2), strong negative (-3). In addition, for emotion recognition, values between 0 and 3 express positive emotion and values between -3 and 0 express negative emotion. Under this binary split, there are 1176 positive samples and 1023 negative samples.
Step 3. The extracted features of the three modalities — text, speech, and video — are not aligned in the temporal sense, and each feature only plays its role within its own modality. The information of the three modalities is aligned using the P2FA forced-alignment tool. Finally, the aligned data have the feature shapes: text l = T × d_l with d_l = 300, speech a = T × d_a with d_a = 74, and video v = T × d_v with d_v = 711.
For the aligned overall data N = {l, a, v}, the data also need to be standardized. The method employed is Z-score feature standardization, also known as standard-deviation normalization:

z = (x − μ) / σ

where x is the input sample, μ is the mean of all sample data, and σ is the standard deviation of all sample data.
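Z-score standardization is straightforward to sketch; after it, the data have zero mean and unit variance, which is what speeds up gradient-descent convergence. The sample values below are invented for illustration.

```python
import math

def zscore(samples):
    """Standardize each value: z = (x - mu) / sigma over the whole sample set."""
    mu = sum(samples) / len(samples)
    var = sum((x - mu) ** 2 for x in samples) / len(samples)
    sigma = math.sqrt(var)
    return [(x - mu) / sigma for x in samples]

z = zscore([1.0, 2.0, 3.0, 4.0, 5.0])
```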
Step 4. After feature extraction of the text, audio, and video, the feature information of each modality is input into its own LSTM. For the input N = {l, a, v}, the input of the nth modality is defined as x^n ∈ R^(T × d_n), where d_n is the input dimension of the nth modality. For the nth modality, the memory of its LSTM is denoted c^n, and the output of each LSTM is defined as h^n ∈ R^(T × d_{c_n}), where d_{c_n} denotes the dimension of the memory c^n in the nth LSTM.
Wherein the update rule of the nth LSTM is:
where i_n, f_n, and o_n are the input gate, forget gate, and output gate of the nth LSTM, respectively; m_n is the memory update of the nth LSTM at time t; ⊙ denotes the element-wise product; and σ is the sigmoid activation function.
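The update rule referenced above (whose typeset equations were lost in extraction) follows the standard LSTM cell. A minimal per-element sketch, assuming the standard formulation c_t = f ⊙ c_{t-1} + i ⊙ tanh(m) and h_t = o ⊙ tanh(c_t); the pre-activation values passed in are hypothetical stand-ins for the learned affine maps of the current input and previous hidden state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(c_prev, i_raw, f_raw, o_raw, m_raw):
    """Standard LSTM cell update, element-wise:
    c_t = f ⊙ c_{t-1} + i ⊙ tanh(m),  h_t = o ⊙ tanh(c_t)."""
    i = [sigmoid(x) for x in i_raw]      # input gate
    f = [sigmoid(x) for x in f_raw]      # forget gate
    o = [sigmoid(x) for x in o_raw]      # output gate
    m = [math.tanh(x) for x in m_raw]    # candidate memory update
    c = [fj * cj + ij * mj for fj, cj, ij, mj in zip(f, c_prev, i, m)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return c, h

c1, h1 = lstm_step([0.0], [1.0], [1.0], [1.0], [2.0])
```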
Step 5. The goal of the DTAN is to outline the cross-modal interaction between the memories of different modalities in the LSTMs system at timestamp t. The invention uses the LSTM memory c_t at time t and the LSTM memory c_{t-1} at time t-1 for coefficient assignment, so that the DTAN is free to keep the memory information of the LSTMs system at a constant size and to assign high coefficients only when the memories are about to change.
The input to the DTAN is the concatenation of the memories at times t-1 and t, denoted c^[t-1,t]. These memories are passed to a trainable fully-connected neural network D_a to obtain the attention coefficients a^[t-1,t]:

a^[t-1,t] = softmax(D_a(c^[t-1,t]))

a^[t-1,t] is the softmax activation score at times t-1 and t for each LSTM memory. Applying softmax on the output layer of D_a normalizes the coefficients of c^[t-1,t] into the range (0, 1]. The output of the DTAN is defined as:

ĉ^[t-1,t] = a^[t-1,t] ⊙ c^[t-1,t]
ĉ^[t-1,t] is the memory retained after the LSTM memories pass through the DTAN. The DTAN can also discover modal interactions that occur at different timestamps because it operates on the memories c in the LSTMs system; these memories may carry information about inputs observed across different timestamps.
Step 6. The output ĉ^[t-1,t] of the DTAN is passed directly to the multimodal gated memory network GMN to represent which dimensions of the LSTM memory system constitute cross-modal interactions. First, ĉ^[t-1,t] is used as the input of a trainable neural network D_u to generate the cross-modality update component û of the multimodal gated memory network, where d_mem is the dimension of the multimodal gated memory network.
The multi-view gated memory is controlled by two gates, γ1 and γ2, called the retention gate and the update gate, respectively. At each timestamp t, γ1 decides how much of the current state of the multimodal gated memory network to remember, while γ2 decides, based on the update component û, how the memory of the multimodal gated memory network is updated. γ1 and γ2 are each controlled by a trainable neural network, D_γ1 and D_γ2, both taking the DTAN output ĉ^[t-1,t] as input:

γ_i = D_γi(ĉ^[t-1,t]), i = 1, 2
at each timestamp t of the whole network recursion, the retention gate γ1, the update gate γ2, and the current modal-interaction update component û are used to update u as follows:

u_t = γ1 ⊙ u_{t-1} + γ2 ⊙ tanh(û_t)
Step 7. For the global attention mechanism (GTAN), the matrix H^n formed by the outputs h_t^n of each modality's LSTM system at every time step is given the automatic attention-mechanism weight assignment α^n to obtain the final output z^n:

z^n = Σ_t α_t^n h_t^n
α^n contains the automatic optimal allocation coefficients of the modality at different times globally; the softmax activation function normalizes its range to (0, 1]. The final output z^n is an information-supplement vector for the outputs of the DTAN, GMN, and LSTMs.
Step 8, for the output u of the multi-mode gating memory network GMN, the invention cascades the output u with the last moment of the LSTM and the last output of the GTAN of each mode, and the specific formula is as follows:
rT=[uT,hT,zn],n∈N
Then the concatenated result r_T is passed through two fully-connected layers FC to obtain the final emotion prediction result ŷ,
where W_1 and W_2 are the two trainable matrices of the fully-connected neural network. The loss function used by the model is the L1 loss, the absolute value of the difference between the predicted value and the true label value, also known as the Manhattan distance:

L1 = (1/n) Σ_i |y_i − ŷ_i|

where y_i is the actual value and ŷ_i is the predicted value.
Step 9. For the MOSI dataset, the evaluation criteria are binary accuracy, F1 score, multi-class weighted accuracy, mean absolute error (MAE), and the r coefficient, defined as follows:
where y_i is the actual value and ŷ_i is the predicted value; weighted accuracy is the usual accuracy, computed as the fraction of all examples answered correctly. The larger the accuracy, the better the recognition effect.
where TP is the number predicted positive and actually positive; FP is the number predicted positive but actually negative; TN is the number predicted negative and actually negative; FN is the number predicted negative but actually positive. F1-Score is the harmonic mean of precision and recall; for F1-Score, the larger the value, the better the recognition.
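Precision, recall, and F1 follow directly from those four counts. A minimal sketch; the counts below are invented for illustration, not taken from the paper's tables.

```python
def binary_metrics(tp, fp, tn, fn):
    """Precision, recall, and F1 (harmonic mean of precision and recall)
    computed from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical counts for illustration
p, r, f1 = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```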
where n is the number of samples in the test set, y_i is the actual value, and ŷ_i is the predicted value. For MAE, the smaller the value, the better the recognition effect.
where the first term represents the loss generated using the model and the second represents the loss resulting from using the mean. The closer r is to 1, the better the recognition effect.
TABLE 1 Model parameter table

| Parameter | Value |
| --- | --- |
| Learning rate | 0.005 |
| Optimizer | Adam |
| Batch size | 56 |
| Dropout coefficient | 0.25 |
| Number of iterations | 3000 |
| weight_decay | 0.1 |
| grad_clip_value | 1.0 |
| hidden_sizes | 32, 64, 32 (text, speech, video) |
Comparative experiment:
single-mode emotion recognition:
The method performs single-modality emotion recognition for each of the three modalities — text, speech, and video — and compares the results against the subsequent bimodal and multimodal emotion recognition experiments. Single-modality emotion recognition uses LSTM and FC models without the DTAN, GTAN, and GMN components.
Table 2 shows the results of single-modality emotion recognition on the MOSI dataset. In single-modality recognition using only LSTM and FC, speech and video perform noticeably worse than text: their binary classification accuracies are only 54.9% and 54.6%, while text reaches 71.1%.
TABLE 2 MOSI data set single modal emotion recognition results
And (3) bimodal emotion recognition:
The invention then performs bimodal emotion recognition; Table 3 gives the MOSI-dataset bimodal results. LSTM-A,V in Table 3 is the model that applies LSTMs to the speech and video modalities, concatenates their outputs, and processes them with FC layers; LSTM-A,T (speech and text) and LSTM-V,T (video and text) use the same model structure. The BA of LSTM-A,V is 57.4%, the BA of LSTM-A,T is 73.0%, and the accuracy of LSTM-V,T is 70.9%. Compared with single-modality emotion recognition, the improvement from bimodal recognition is quite noticeable: the single-modality BAs of speech and video are 54.9% and 54.6%, while their combination reaches 57.4%. In addition, the bimodal results that include the text modality are relatively high, which is consistent with the single-modality results, and all results are improved.
TABLE 3 MOSI data set bimodal emotion recognition results
Multi-modal emotion recognition:
following is multimodal emotion recognition. Table 4 shows the comparison of the multi-modal emotion recognition results of the MOSI dataset using the present model with the single modality. Experiments show that the multi-modal emotion recognition based on the model is obviously improved on various evaluation indexes such as BA, F1, Ac-7, MAE, r and the like.
TABLE 4 MOSI data set multimodal emotion recognition results
Table 5 is the confusion matrix of the test set on the MOSI dataset, and Table 6 is the binary classification report for the MOSI dataset.
TABLE 5 MOSI test set confusion matrix

| Emotion category | Negative (predicted) | Positive (predicted) | Total |
| --- | --- | --- | --- |
| Negative (true) | 306 | 73 | 379 |
| Positive (true) | 104 | 203 | 307 |
| Total | 410 | 276 | 686 |
TABLE 6 MOSI dataset binary classification report

| Evaluation index | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| Results | 74.1 | 74.1 | 74.1 |
Ablation experiment
In order to verify the functions of the double attention mechanism and the gated memory network, the invention carries out ablation comparison experiments on them.
Network (no DTAN-mem, no GTAN) in Table 7 refers to tri-modal emotion recognition using only LSTM and FC, without the DTAN, GMN, and GTAN components. Its recognition performance drops clearly relative to the final model: BA, F1, and Ac-7 decrease by 3.3%, 3.3%, and 4.5% respectively, MAE increases by 0.094, and r decreases by 0.074. The result is nevertheless still better than the earlier single-modality and bimodal emotion recognition, which shows that introducing additional modalities significantly improves emotion recognition accuracy and that the proposed components improve recognition further.
Network (no-GTAN) is the proposed model without the GTAN component; its emotion recognition performance drops relative to the final model: BA, F1, and Accuracy-7 decrease by 1.3%, 1.3%, and 6.2% respectively, MAE increases by 0.074, and r decreases by 0.035. The GTAN component's global attention mechanism mines the optimal allocation coefficients of the entire time series to highlight the most emotional frames and supplements the information of the DTAN-GMN and LSTMs coding systems.
Network (no-DTAN-mem) is the proposed model without the DTAN and GMN components; its emotion recognition performance also drops relative to the final model: BA, F1, and Accuracy-7 decrease by 2.0%, 2.0%, and 1.1% respectively, MAE increases by 0.023, and r decreases by 0.039. The DTAN and GMN components mine the optimal allocation coefficients of the information between different modalities at the same moment, highlight the modality carrying emotional color at the current time, and supplement the information of the GTAN and LSTMs coding systems.
TABLE 7 MOSI data set ablation experimental comparison
In conclusion, the multimodal emotion recognition method based on the double attention mechanism and the gated memory network greatly improves emotion recognition performance. It not only highlights the frames with strong emotional color at different moments within a single modality, but also considers the information interaction among different modalities at the same moment, while improving performance with few parameters. In addition, by incorporating semantics, the emotion information rich in the semantics is fully utilized, which greatly helps overall emotion recognition.
A multi-modal emotion recognition method based on a double attention mechanism and a gated memory network comprises the steps of preprocessing and feature extraction of multi-modal emotion data, model design based on the double attention mechanism and the gated memory network, and a fusion layer. The influence of information of a single mode at different moments on emotion recognition is considered, and the influence of different modes on emotion recognition at the same moment is also considered. Finally, the gated memory network, the global attention mechanism and the output information of the LSTMs are fused, information complementation is carried out, and a good emotion recognition effect can be obtained, so that the method has a good application prospect.
The invention has the beneficial effects that: the invention relates to a multimode emotion recognition method based on a double attention mechanism and a gated memory network. The influence of information of a single mode at different moments on emotion recognition is considered, and the influence of different modes on emotion recognition at the same moment is also considered. Finally, the gated memory network, the global attention mechanism and the output information of the LSTMs system are fused, information complementation is carried out, and a good emotion recognition effect can be obtained, so that the method has a good application prospect.
The second embodiment provides a multi-modal emotion recognition system based on an attention mechanism and GMN;
an attention mechanism and GMN based multi-modal emotion recognition system, comprising:
a pre-processing module configured to: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
a first feature vector acquisition module configured to: the method comprises the steps that text features, voice features and facial expression features of a video to be recognized are input into an LSTMs model of a trained multi-modal emotion recognition network in a concurrent mode, and a first feature vector is output;
a second feature vector acquisition module configured to: carrying out weighted summation on the memory output values of all adjacent time stamps of the LSTMs model to obtain a first weighted characteristic; inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
a third feature vector acquisition module configured to: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
an emotion recognition module configured to: the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
It should be noted here that the preprocessing module, the first feature vector obtaining module, the second feature vector obtaining module, the third feature vector obtaining module and the emotion recognition module correspond to steps S101 to S105 in the first embodiment, and the modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
The third embodiment of the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
The fourth embodiment also provides a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A multi-modal emotion recognition method based on an attention mechanism and a gated memory network (GMN), characterized by comprising the following steps:
preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of a trained multi-modal emotion recognition network, and outputting a first feature vector;
carrying out weighted summation on the memory output values of all adjacent timestamps of the LSTMs models to obtain a first weighted feature; inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
2. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, comprising: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized; the method comprises the following specific steps:
separating the video to be identified to obtain an audio signal and a video signal;
carrying out voice recognition on the audio signal to obtain text information;
performing feature extraction on the text information to obtain text features;
carrying out feature extraction on the audio signal to obtain a voice feature;
and carrying out feature extraction on the video signal to obtain facial expression features.
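The claim-2 preprocessing pipeline can be sketched as follows. This is only an illustrative outline: the `separate`, `speech_to_text` and `*_features` helpers are hypothetical placeholders (a real system would use an A/V demuxer, an ASR engine, acoustic feature extraction and a facial-expression encoder), and all feature dimensions are assumptions, not the patent's actual choices.

```python
import numpy as np

def separate(video):
    # Split the raw recording into an audio track and an image-frame track.
    return video["audio"], video["frames"]

def speech_to_text(audio):
    # Placeholder ASR: a real system would return the transcript of the audio.
    return "hello world"

def text_features(text, dim=300):
    # Toy word embeddings: one deterministic vector per token.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal((len(text.split()), dim))

def audio_features(audio, dim=74):
    # Toy acoustic features: one vector per audio frame.
    return np.asarray(audio).reshape(-1, 1) * np.ones((1, dim))

def face_features(frames, dim=35):
    # Toy facial-expression features: one vector per video frame.
    return np.asarray(frames).reshape(-1, 1) * np.ones((1, dim))

video = {"audio": [0.1, 0.2, 0.3], "frames": [1.0, 2.0]}
audio, frames = separate(video)          # audio signal / video signal
text = speech_to_text(audio)             # text information via speech recognition
t_feat = text_features(text)             # text features,   shape (num_tokens, 300)
a_feat = audio_features(audio)           # voice features,  shape (num_frames, 74)
v_feat = face_features(frames)           # face features,   shape (num_frames, 35)
```

The three feature sequences are what the subsequent claims feed into the three LSTMs.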
3. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, comprising: after the step of preprocessing the acquired video to be recognized to obtain the text feature, the voice feature and the facial expression feature of the video to be recognized, concurrently inputting the text feature, the voice feature and the facial expression feature of the video to be recognized into the corresponding LSTMs model, and before the step of outputting the first feature vector, the method further comprises the following steps:
and performing data alignment and standardization processing on all the obtained features.
4. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, wherein the multi-modal emotion recognition network comprises the following network structure: an LSTMs model, a DTAN model, a GMN model and a GTAN model;
the LSTMs model is connected with the GMN model through the DTAN model, and the GMN model is connected with the fusion module;
the LSTMs model is connected with the fusion module;
the LSTMs model is connected with the GTAN model, and the GTAN model is connected with the fusion module;
the fusion module is connected with the first full-connection layer, the first full-connection layer is connected with the second full-connection layer, and the second full-connection layer is connected with the output layer.
5. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, comprising: the multi-mode emotion recognition network comprises the following training steps:
constructing a training set, wherein the training set consists of the text features, voice features and facial expression features corresponding to the same videos with known emotion category labels;
inputting the text features of the training set into a first LSTM model; meanwhile,
inputting the voice features of the training set into a second LSTM model; meanwhile,
inputting the facial expression features of the training set into a third LSTM model;
using the known emotion category labels as output values of the multi-mode emotion recognition network;
training a multi-mode emotion recognition network; and obtaining the trained multi-modal emotion recognition network.
6. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, wherein the step of concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of the trained multi-modal emotion recognition network and outputting the first feature vector comprises the following specific steps:
inputting the text features of the video to be recognized into a first LSTM model, the first LSTM model outputting a first coding vector; meanwhile,
inputting the voice features of the video to be recognized into a second LSTM model, the second LSTM model outputting a second coding vector; meanwhile,
inputting the facial expression features of the video to be recognized into a third LSTM model, the third LSTM model outputting a third coding vector;
and splicing the first, second and third coded vectors to obtain a first feature vector.
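The claim-6 encoding step can be sketched with a minimal numpy LSTM: three independent LSTMs encode the three modality sequences and their final hidden states are spliced into the first feature vector. The weight initialization, hidden size and input dimensions here are illustrative assumptions, not the patent's trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(x_seq, hidden, seed):
    # One plain LSTM over the sequence; returns its final hidden state.
    rng = np.random.default_rng(seed)
    d_in = x_seq.shape[1]
    W = rng.standard_normal((4 * hidden, d_in + hidden)) * 0.1  # stand-in weights
    b = np.zeros(4 * hidden)
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in x_seq:
        z = W @ np.concatenate([x, h]) + b
        i, f, o, g = np.split(z, 4)          # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

T, H = 5, 8                                  # sequence length, hidden size (assumed)
text_seq = np.random.default_rng(0).standard_normal((T, 300))
audio_seq = np.random.default_rng(1).standard_normal((T, 74))
video_seq = np.random.default_rng(2).standard_normal((T, 35))

h_text = lstm_encode(text_seq, H, seed=3)    # first coding vector
h_audio = lstm_encode(audio_seq, H, seed=4)  # second coding vector
h_video = lstm_encode(video_seq, H, seed=5)  # third coding vector

# First feature vector: splice (concatenate) the three coding vectors.
first_feature = np.concatenate([h_text, h_audio, h_video])
```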
7. The multi-modal emotion recognition method based on attention mechanism and GMN of claim 1, comprising:
carrying out weighted summation on the memory output values of all adjacent timestamps of the LSTMs models to obtain a first weighted feature; the method comprises the following specific steps:
the input to the DTAN is the concatenation of the memories at times t-1 and t, denoted c^[t-1,t]; these memories are passed to a trainable fully-connected neural network D_a to obtain the attention coefficients a^[t-1,t]:
a^[t-1,t] = softmax(D_a(c^[t-1,t]))
a^[t-1,t] is the softmax activation score, i.e. the weight, of each LSTM memory at times t-1 and t; applying softmax to c^[t-1,t] at the output layer of D_a regularizes the range of the coefficients to (0, 1]; the output of the DTAN is defined as:
ĉ^[t-1,t] = c^[t-1,t] ⊙ a^[t-1,t]
wherein ĉ^[t-1,t] is the memory retained after the LSTM memories pass through the DTAN, namely the first weighted feature;
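The DTAN step above can be sketched in a few lines of numpy: concatenate the memories at t-1 and t, score them with a small fully-connected network D_a, softmax-normalize, and reweight the memories elementwise. The single weight matrix `W_a` is a random stand-in for the trained D_a, and the memory size is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 6                                   # total memory size over all modalities (assumed)
c_prev = rng.standard_normal(d)         # memories at time t-1
c_curr = rng.standard_normal(d)         # memories at time t

c_cat = np.concatenate([c_prev, c_curr])           # c^[t-1,t]
W_a = rng.standard_normal((2 * d, 2 * d)) * 0.1    # stand-in for the trained D_a
a = softmax(W_a @ c_cat)                            # attention coefficients a^[t-1,t]
c_hat = c_cat * a                                   # first weighted feature ĉ^[t-1,t]
```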
alternatively,
inputting the first weighted feature into a gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector; wherein t is a positive integer; the method comprises the following specific steps:
the first weighted feature ĉ^[t-1,t] is used as the input of a trainable neural network D_u to generate the cross-modality update component û_t ∈ R^{d_mem} of the multimodal gated memory network, û_t = D_u(ĉ^[t-1,t]), where d_mem is the dimension of the multimodal gated memory network;
the multi-view gated memory is controlled by two gates, γ1 and γ2, referred to as the retention gate and the update gate respectively; at each timestamp t, γ1 controls how much of the current state of the multimodal gated memory network is remembered, and γ2 controls how much of its memory is updated based on the update component û_t; γ1 and γ2 are each controlled by a trainable neural network, D_γ1 and D_γ2; using the DTAN output ĉ^[t-1,t] as the input of the gating mechanism of the multimodal gated memory network, the formulas are:
γ1 = D_γ1(ĉ^[t-1,t]), γ2 = D_γ2(ĉ^[t-1,t])
at each timestamp t of the whole network recursion, the retention gate γ1, the update gate γ2 and the current cross-modal interaction update component û_t update the memory u by the following formula:
u_t = γ1 ⊙ u_{t-1} + γ2 ⊙ tanh(û_t)
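The gated memory update can be sketched as below, with `W_u`, `W_g1`, `W_g2` as random stand-ins for the trainable networks D_u, D_γ1 and D_γ2, and all dimensions assumed for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_mem, d_in = 8, 12                       # memory and DTAN-output sizes (assumed)
c_hat = rng.standard_normal(d_in)         # DTAN output ĉ^[t-1,t] at time t

W_u = rng.standard_normal((d_mem, d_in)) * 0.1    # stand-in for D_u
W_g1 = rng.standard_normal((d_mem, d_in)) * 0.1   # stand-in for D_γ1 (retention gate)
W_g2 = rng.standard_normal((d_mem, d_in)) * 0.1   # stand-in for D_γ2 (update gate)

u_prev = np.zeros(d_mem)                  # memory state u_{t-1}
u_hat = W_u @ c_hat                       # cross-modality update component û_t
g1 = sigmoid(W_g1 @ c_hat)                # retention gate γ1 in (0, 1)
g2 = sigmoid(W_g2 @ c_hat)                # update gate γ2 in (0, 1)
u = g1 * u_prev + g2 * np.tanh(u_hat)     # u_t = γ1 ⊙ u_{t-1} + γ2 ⊙ tanh(û_t)
```

The final memory state u (or its last value over the recursion) plays the role of the second feature vector.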
alternatively,
the trained global attention mechanism network GTAN of the multi-modal emotion recognition network performs weighted summation on memories of all timestamps under each LSTM model to obtain a third feature vector; the method comprises the following specific steps:
for the matrix H_n composed of the outputs of the LSTM of each modality at every time instant, the attention mechanism automatically assigns weights α_n to obtain the final output z_n; the formula is as follows:
α_n = softmax(G_a(H_n)), z_n = Σ_t α_{n,t} h_{n,t}
wherein G_a is a trainable fully-connected scoring network analogous to D_a in the DTAN; α_n are the automatically optimized allocation coefficients of the modality at globally different times, and the softmax activation function normalizes their range to (0, 1]; the final output z_n is an information-supplementary vector to the outputs of the DTAN, GMN and LSTMs;
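The GTAN step for one modality can be sketched as a softmax-weighted sum over time. The scoring parameters `w` stand in for the trainable attention network, and the sequence length and hidden size are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 5, 8                               # timestamps, LSTM output size (assumed)
H_n = rng.standard_normal((T, d))         # matrix of outputs h_1..h_T of one LSTM

w = rng.standard_normal(d) * 0.1          # stand-in attention scoring parameters
alpha = softmax(H_n @ w)                  # α_n: one weight per timestamp, sums to 1
z_n = alpha @ H_n                         # z_n = Σ_t α_{n,t} h_{n,t}
```

Computing z_n for each of the three modalities and splicing them yields the third feature vector.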
alternatively,
the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; performing emotion recognition on the fused feature vector to obtain an emotion recognition result; the method comprises the following specific steps:
the trained multi-mode emotion recognition network splices the first, second and third feature vectors to obtain a fusion feature vector;
and performing emotion recognition on the fusion feature vectors, and classifying by using a full-connection neural network to obtain an emotion recognition result.
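The final fusion and classification step can be sketched as follows: splice the three feature vectors, pass the result through two fully-connected layers and a softmax output layer, and take the arg-max class as the emotion label. The layer sizes, random weights and the 6-class label set are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
v1 = rng.standard_normal(24)   # first feature vector (spliced LSTM encodings)
v2 = rng.standard_normal(8)    # second feature vector (GMN memory)
v3 = rng.standard_normal(24)   # third feature vector (spliced GTAN outputs)

fused = np.concatenate([v1, v2, v3])               # fusion feature vector
W1 = rng.standard_normal((32, fused.size)) * 0.1   # first fully-connected layer
W2 = rng.standard_normal((16, 32)) * 0.1           # second fully-connected layer
W_out = rng.standard_normal((6, 16)) * 0.1         # output layer, 6 emotion classes

probs = softmax(W_out @ relu(W2 @ relu(W1 @ fused)))
label = int(np.argmax(probs))                      # emotion recognition result
```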
8. A multi-modal emotion recognition system based on an attention mechanism and GMN, characterized by comprising:
a pre-processing module configured to: preprocessing the acquired video to be recognized to obtain text characteristics, voice characteristics and facial expression characteristics of the video to be recognized;
a first feature vector acquisition module configured to: concurrently inputting the text features, voice features and facial expression features of the video to be recognized into the LSTMs models of the trained multi-modal emotion recognition network, and outputting a first feature vector;
a second feature vector acquisition module configured to: carrying out weighted summation on the memory output values of all adjacent timestamps of the LSTMs models to obtain a first weighted feature; inputting the first weighted feature into the gated memory network GMN of the trained multi-modal emotion recognition network, and outputting a second feature vector;
a third feature vector acquisition module configured to: the trained global attention mechanism network GTAN of the multi-modal emotion recognition network weights and sums the output values of the memories of all the timestamps under each LSTM model to obtain a third feature vector;
an emotion recognition module configured to: the trained multi-mode emotion recognition network fuses the first feature vector, the second feature vector and the third feature vector to obtain a fused feature vector; and performing emotion recognition on the fusion feature vector to obtain an emotion recognition result.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein the processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory so as to cause the electronic device to perform the method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110239787.7A CN113095357A (en) | 2021-03-04 | 2021-03-04 | Multi-mode emotion recognition method and system based on attention mechanism and GMN |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113095357A true CN113095357A (en) | 2021-07-09 |
Family
ID=76666377
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110239787.7A Pending CN113095357A (en) | 2021-03-04 | 2021-03-04 | Multi-mode emotion recognition method and system based on attention mechanism and GMN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113095357A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111275085A (en) * | 2020-01-15 | 2020-06-12 | 重庆邮电大学 | Online short video multi-modal emotion recognition method based on attention fusion |
CN111898670A (en) * | 2020-07-24 | 2020-11-06 | 深圳市声希科技有限公司 | Multi-mode emotion recognition method, device, equipment and storage medium |
CN112348075A (en) * | 2020-11-02 | 2021-02-09 | 大连理工大学 | Multi-mode emotion recognition method based on contextual attention neural network |
Non-Patent Citations (1)
Title |
---|
Chen Weiqing: "Research on Multimodal Emotion Recognition Based on Deep Learning", China Excellent Doctoral and Master's Dissertations Full-text Database (Master's), Information Science and Technology Series * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255635B (en) * | 2021-07-19 | 2021-10-15 | 中国科学院自动化研究所 | Multi-mode fused psychological stress analysis method |
CN113255635A (en) * | 2021-07-19 | 2021-08-13 | 中国科学院自动化研究所 | Multi-mode fused psychological stress analysis method |
CN113723463A (en) * | 2021-08-02 | 2021-11-30 | 北京工业大学 | Emotion classification method and device |
WO2023050708A1 (en) * | 2021-09-29 | 2023-04-06 | 苏州浪潮智能科技有限公司 | Emotion recognition method and apparatus, device, and readable storage medium |
CN113674767A (en) * | 2021-10-09 | 2021-11-19 | 复旦大学 | Depression state identification method based on multi-modal fusion |
CN114155882A (en) * | 2021-11-30 | 2022-03-08 | 浙江大学 | Method and device for judging road rage emotion based on voice recognition |
CN114155882B (en) * | 2021-11-30 | 2023-08-22 | 浙江大学 | Method and device for judging emotion of road anger based on voice recognition |
CN114218380A (en) * | 2021-12-03 | 2022-03-22 | 淮阴工学院 | Multi-mode-based cold chain loading user portrait label extraction method and device |
CN114218380B (en) * | 2021-12-03 | 2022-07-29 | 淮阴工学院 | Multi-mode-based cold chain loading user portrait label extraction method and device |
CN114648805A (en) * | 2022-05-18 | 2022-06-21 | 华中科技大学 | Course video sight correction model, training method thereof and sight drop point estimation method |
CN115271002B (en) * | 2022-09-29 | 2023-02-17 | 广东机电职业技术学院 | Identification method, first-aid decision method, medium and life health intelligent monitoring system |
CN115271002A (en) * | 2022-09-29 | 2022-11-01 | 广东机电职业技术学院 | Identification method, first-aid decision method, medium and life health intelligent monitoring system |
CN116070169A (en) * | 2023-01-28 | 2023-05-05 | 天翼云科技有限公司 | Model training method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113095357A (en) | Multi-mode emotion recognition method and system based on attention mechanism and GMN | |
CN112348075B (en) | Multi-mode emotion recognition method based on contextual attention neural network | |
CN111275085B (en) | Online short video multi-modal emotion recognition method based on attention fusion | |
Pandey et al. | Deep learning techniques for speech emotion recognition: A review | |
CN109003625B (en) | Speech emotion recognition method and system based on ternary loss | |
Cai et al. | Multi-modal emotion recognition from speech and facial expression based on deep learning | |
Deng et al. | Multimodal utterance-level affect analysis using visual, audio and text features | |
CN108804453A (en) | A kind of video and audio recognition methods and device | |
WO2023050708A1 (en) | Emotion recognition method and apparatus, device, and readable storage medium | |
CN113065344A (en) | Cross-corpus emotion recognition method based on transfer learning and attention mechanism | |
CN114091466A (en) | Multi-modal emotion analysis method and system based on Transformer and multi-task learning | |
CN114140885A (en) | Emotion analysis model generation method and device, electronic equipment and storage medium | |
Cai et al. | Multimodal sentiment analysis based on recurrent neural network and multimodal attention | |
Wang et al. | A novel multiface recognition method with short training time and lightweight based on ABASNet and H-softmax | |
Sahu et al. | Modeling feature representations for affective speech using generative adversarial networks | |
CN113870863A (en) | Voiceprint recognition method and device, storage medium and electronic equipment | |
Sun et al. | EmotionNAS: Two-stream Architecture Search for Speech Emotion Recognition | |
Gong et al. | Human interaction recognition based on deep learning and HMM | |
Bakhshi et al. | Multimodal emotion recognition based on speech and physiological signals using deep neural networks | |
Khalane et al. | Context-aware multimodal emotion recognition | |
CN116167015A (en) | Dimension emotion analysis method based on joint cross attention mechanism | |
CN113626553B (en) | Cascade binary Chinese entity relation extraction method based on pre-training model | |
CN114693949A (en) | Multi-modal evaluation object extraction method based on regional perception alignment network | |
Ng et al. | The investigation of different loss functions with capsule networks for speech emotion recognition | |
Torabi et al. | Action classification and highlighting in videos |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210709 |