CN116052291A - Multi-modal emotion recognition method based on non-aligned sequence - Google Patents

Multi-modal emotion recognition method based on non-aligned sequence

Info

Publication number
CN116052291A
CN116052291A
Authority
CN
China
Prior art keywords
mode
characterization
visual
audio
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111254440.6A
Other languages
Chinese (zh)
Inventor
刘峰
付子旺
齐佳音
周爱民
李志斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University Of International Business And Economics
Beijing University of Posts and Telecommunications
East China Normal University
Original Assignee
Shanghai University Of International Business And Economics
Beijing University of Posts and Telecommunications
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University Of International Business And Economics, Beijing University of Posts and Telecommunications, East China Normal University
Priority to CN202111254440.6A
Publication of CN116052291A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modal emotion recognition method based on a non-aligned sequence, relating to the technical field of computer vision, which comprises the following steps: S1, extracting features of a target video, namely a text characterization, a visual characterization and an audio characterization, from the text modality, the visual modality and the audio modality respectively; S2, performing feature preprocessing on the text, visual and audio characterizations according to the temporal structure of each modality; and S3, fusing the results of S2 through a cross-modal Transformer module to obtain a high-order complementary characterization. The method can learn a cross-modal fusion characterization from the complementary features of the three modalities; without losing the original modal features, the result is more robust and the processing more efficient, accuracy and parameter count are balanced, and the practical application value of multi-modal emotion recognition is improved.

Description

Multi-modal emotion recognition method based on non-aligned sequence
Technical Field
The invention relates to the technical field of computer vision, and in particular to a multi-modal emotion recognition method based on a non-aligned sequence.
Background
Multimodal emotion recognition is one of the challenging tasks in multimodal machine learning, and it continues to attract attention because of the strong robustness and remarkable performance that multiple modalities provide. The goal of multimodal emotion recognition is to recognize a person's emotional attitude from a video clip, mainly involving three modalities: natural language, facial expression and audio. Compared with a single modality, a multimodal system provides richer information, better matches human expressive behaviour, and can more completely reflect the emotional state of a sequence; it therefore has broad application prospects in fields such as social robots, medical care and education quality evaluation.
However, the field still faces several challenges. On the one hand, because different modalities are sampled at different frequencies, the three modal sequences are usually unaligned; manual alignment is labor-intensive, requires extensive domain knowledge, and is difficult to carry out. On the other hand, existing models cannot balance performance and parameter count: influenced by pre-trained models, networks with outstanding performance usually carry a very large number of parameters. The invention therefore focuses on efficiently learning fusion characterizations of the different modalities from non-aligned sequences for multimodal emotion recognition.
In existing research, human multimodal emotion recognition can be categorized by fusion strategy into feature-level fusion, decision-level fusion and hybrid-model fusion. Feature-level fusion extracts features from each modality and then concatenates them into a single feature set. Decision-level fusion estimates the credibility of each unimodal model and then coordinates and combines their decisions. With the development of deep learning, hybrid-model fusion can choose the fusion position flexibly and improves performance markedly compared with the former two approaches. Among hybrid-model fusion approaches for non-aligned multimodal emotion recognition, Tsai et al. proposed the Multimodal Transformer (MulT), which applies pairwise cross-modal attention to unaligned multimodal sequences to enhance the characterization of each modality; Lv Fengmao et al. proposed integrating the information of the three modalities through a message hub and adopting a progressive strategy to enhance the high-order mixed-modality features.
However, the current technology has the following problems. First, pairwise modality fusion produces redundant information and does not consider that the three modalities should act on the final emotion output simultaneously. Second, integrating the information of different modalities through a centralized message hub is inefficient, because the three modalities must exchange messages with the hub frequently; this approach ignores the complementary information of the three modalities, which can complete cross-modal fusion without introducing a third party. Third, influenced by pre-trained models, existing networks have very large parameter counts and cannot be applied in real scenarios.
Disclosure of Invention
The technical problem addressed by the invention is to provide a multi-modal emotion recognition method based on a non-aligned sequence, which improves the relevance of the three modalities during fusion; without losing the original modal features, the result is more robust and the processing more efficient, accuracy and parameter count are balanced, and the practical application value of multi-modal emotion recognition is improved.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A multi-modal emotion recognition method based on a non-aligned sequence, the method comprising the following steps: S1, extracting features of a target video, namely a text characterization, a visual characterization and an audio characterization, from the text modality, the visual modality and the audio modality respectively; S2, performing feature preprocessing on the text, visual and audio characterizations according to the temporal structure of each modality; and S3, fusing the results of S2 through a cross-modal Transformer module to obtain a high-order complementary characterization. The cross-modal Transformer module designed by the invention can learn a cross-modal fusion characterization from the complementary features of the three modalities; without losing the original modal features, it is more robust and processes more efficiently, balances accuracy and parameter count, and improves the practical application value of multi-modal emotion recognition.
S1 specifically comprises: feeding the transcription of the target video into a pre-trained GloVe model to obtain a 300-dimensional text characterization; using the Facet open-source tool to obtain a 35-dimensional facial expression unit characterization vector of the visual modality, i.e., the visual characterization; and using COVAREP to obtain a 74-dimensional audio characterization.
S2 specifically comprises: feeding the audio characterization and the visual characterization into 1-dimensional temporal convolutions with different kernel sizes for preprocessing; and preprocessing the text characterization with a two-layer BiLSTM.
The cross-modal Transformer module comprises: a local temporal learning unit for computing the timing characterizations of the audio modality and the visual modality; a cross-modal feature fusion unit for computing the fusion characterization of the audio, visual and text modalities; and a global self-attention characterization unit for computing the high-order complementary characterization of the audio, visual and text modalities.
S3 comprises the following steps: S31, obtaining the timing characterizations of the audio modality and the visual modality through the local temporal learning unit of the cross-modal Transformer module; S32, fusing the result of S31 with the text characterization of S1 through the cross-modal feature fusion unit; and S33, passing the result of S32 through the global self-attention characterization unit, i.e., into a Transformer encoder again, to obtain the high-order complementary characterization.
After S3, the method further comprises: S5, obtaining the category output of the final emotion.
Before S5, the method further comprises: S4, concatenating the high-order complementary characterization obtained in S33 with the timing characterizations of the audio modality and the visual modality obtained in S31 to obtain the final result.
Drawings
The invention and its features, aspects and advantages will become more apparent from the detailed description of non-limiting embodiments with reference to the following drawings. Like numbers refer to like parts throughout. The drawings are not intended to be drawn to scale, emphasis instead being placed upon illustrating the principles of the invention.
Fig. 1 is a schematic diagram of the multi-modal emotion recognition method based on a non-aligned sequence according to embodiment 1 of the present invention;
Fig. 2 is a comparison of experimental data for the multi-modal emotion recognition method based on a non-aligned sequence according to embodiment 1 of the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings and specific examples, which are not intended to limit the invention.
In the multi-modal emotion recognition method for non-aligned sequences provided by the invention, the task mainly involves a language modality (L), a visual modality (V) and an audio modality (A). The overall flow of the method is shown in Fig. 1: features of the target video, namely the text characterization, the visual characterization and the audio characterization, are first extracted from the text modality, the visual modality and the audio modality respectively.
For the text characterization, the transcription of the video is fed into a pre-trained GloVe model to obtain a 300-dimensional text representation; for the visual characterization, the Facet open-source tool is used to obtain a 35-dimensional facial expression unit characterization vector; for the audio characterization, COVAREP is used to obtain a 74-dimensional audio signal. The characterizations of the three modalities obtained by feature extraction are defined as

$$X_{\{L,V,A\}} \in \mathbb{R}^{T_{\{L,V,A\}} \times d_{\{L,V,A\}}},$$

where $T_{(\cdot)}$ denotes the length of the sequence and $d_{(\cdot)}$ denotes the dimension of the extracted modality.
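As a concrete illustration of these shapes, the following minimal snippet builds placeholder tensors with the stated feature dimensions; the sequence lengths are arbitrary example values, since the three modalities are sampled at different rates (a sketch, not part of the patent).

```python
# Placeholder tensors for the extracted, non-aligned unimodal characterizations.
# The sequence lengths T_L, T_V, T_A are illustrative assumptions; only the
# feature dimensions (300, 35, 74) come from the description above.
import torch

T_L, T_V, T_A = 50, 375, 500           # lengths differ because sampling rates differ
X_L = torch.randn(T_L, 300)            # GloVe text characterization
X_V = torch.randn(T_V, 35)             # Facet facial expression unit characterization
X_A = torch.randn(T_A, 74)             # COVAREP audio characterization
```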
After the text, visual and audio characterizations are obtained, the three characterizations are preprocessed according to the temporal structure of each modality. For the audio and visual characterizations, to ensure that the input sequences capture the structure of adjacent elements, the two modalities are fed into 1-dimensional temporal convolutions with different kernel sizes:

$$\hat{X}_{\{V,A\}} = \mathrm{BN}\big(\mathrm{Conv1D}\big(X_{\{V,A\}},\, k_{\{V,A\}}\big)\big) \in \mathbb{R}^{T_{\{V,A\}} \times d_f},$$

where BN denotes batch normalization, $k_{\{V,A\}}$ denotes the convolution kernel sizes of the visual and audio modalities, and $d_f$ denotes the common dimension. For the text characterization, considering that the modality itself carries long-term dependencies and contextual information, and that a BiLSTM captures long-term semantic information well, a two-layer BiLSTM is used for feature preprocessing:

$$\hat{X}_{L} = \mathrm{LN}\big(\mathrm{BiLSTM}\big(X_{L}\big)\big) \in \mathbb{R}^{T_{L} \times d_f},$$

where LN denotes layer normalization, which is robust for sequence-structured text. Through these operations, on the one hand the information of adjacent elements is inherited, and on the other hand the different modalities are brought in advance to a common dimension from the unaligned multimodal data.
The preprocessed results are then fused through a cross-modal Transformer module, which is divided into a local temporal learning unit, a cross-modal feature fusion unit and a global self-attention characterization unit. The timing characterizations of the audio modality and the visual modality are obtained through the local temporal learning unit, computed as

$$\hat{X}^{\mathrm{PE}}_{\{V,A\}} = \hat{X}_{\{V,A\}} + \mathrm{PE}\big(T_{\{V,A\}},\, d_f\big),$$
$$F_{\{V,A\}} = \mathrm{TransformerEncoder}\big(\hat{X}^{\mathrm{PE}}_{\{V,A\}}\big),$$

where $\mathrm{PE}(T_{\{V,A\}}, d_f)$ denotes the position encoding, $\hat{X}^{\mathrm{PE}}_{\{V,A\}}$ denotes the position-encoded result, and TransformerEncoder denotes a Transformer encoder. We use $F_{\{V,A\}}$ to denote the result after local temporal learning.
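The sketch below illustrates this local temporal learning step with PyTorch's built-in Transformer encoder; the sinusoidal position-encoding helper, the head count and the layer count are assumptions made for illustration.

```python
# Sketch of the local temporal learning unit: add a sinusoidal position encoding
# to the preprocessed visual/audio sequence, then pass it through a Transformer
# encoder to obtain the timing characterization F_V (and likewise F_A).
import math
import torch
import torch.nn as nn

def position_encoding(T, d_f):
    pe = torch.zeros(T, d_f)
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_f, 2, dtype=torch.float32) * (-math.log(10000.0) / d_f))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

d_f = 40
enc_layer = nn.TransformerEncoderLayer(d_model=d_f, nhead=4, batch_first=True)
local_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

Xh_V = torch.randn(1, 375, d_f)                            # preprocessed visual sequence
F_V = local_encoder(Xh_V + position_encoding(375, d_f))    # timing characterization
```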
Based on the results of temporal learning, a residual-based cross-modal fusion method is designed, which takes $F_{\{V,A\}}$ and $\hat{X}_L$ as input and produces a fusion characterization of the three modalities. Specifically, the mappings of the two characterizations are obtained by a linear operation, the interaction features of the two modalities are obtained by add and tanh operations, and the fused result is finally obtained by softmax:

$$F_{LVA} = \hat{X}_L + \mathrm{softmax}\Big(\tanh\big(L(F_{\{V,A\}}) + L(\hat{X}_L)\big)\Big) \odot \hat{X}_L,$$

where $L$ denotes a linear operation and $F_{LVA}$ denotes the final result of cross-modal feature fusion. In this process, the final result contains not only inter-modal features but also the original intra-modal features of the text, forming a high-order feature aggregate. To ensure that no information is lost, the original text characterization is continually reinforced through the residual structure.
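A possible realization of this residual fusion is sketched below. Since the description does not spell out how sequences of different lengths are combined before the add operation, the mean-pooling of the audio/visual timing features over time is an added assumption, as are the layer sizes.

```python
# Sketch of the residual cross-modal fusion: linear mappings of the two
# characterizations, add + tanh for their interaction, softmax for fusion
# weights, and a residual path that keeps the original text characterization.
# Mean-pooling the (non-aligned) audio/visual sequence over time is an assumption.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_f=40):
        super().__init__()
        self.lin_av = nn.Linear(d_f, d_f)   # maps the audio/visual timing features
        self.lin_l = nn.Linear(d_f, d_f)    # maps the text characterization

    def forward(self, F_av, Xh_L):
        g_av = self.lin_av(F_av.mean(dim=1, keepdim=True))   # (B, 1, d_f) pooled A/V
        inter = torch.tanh(g_av + self.lin_l(Xh_L))          # (B, T_L, d_f) interaction
        weights = torch.softmax(inter, dim=-1)               # fusion weights
        return Xh_L + weights * Xh_L                         # residual keeps text info

fusion = CrossModalFusion(d_f=40)
F_V, F_A = torch.randn(1, 375, 40), torch.randn(1, 500, 40)   # timing characterizations
Xh_L = torch.randn(1, 50, 40)                                 # preprocessed text
F_LVA = fusion(torch.cat([F_V, F_A], dim=1), Xh_L)            # fused characterization
```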
The fused result is then fed into a Transformer encoder again to learn its own neighbouring features and information and obtain the final high-order complementary characterization:

$$F_F = \mathrm{TransformerEncoder}\big(F_{LVA}\big),$$

where $F_F$ denotes the final result of the global self-attention characterization.
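A corresponding sketch of the global self-attention step follows; the encoder depth and head count are again illustrative assumptions.

```python
# Sketch of the global self-attention characterization unit: the fused result
# is passed through a Transformer encoder once more.
import torch
import torch.nn as nn

d_f = 40
F_LVA = torch.randn(1, 50, d_f)   # fused characterization from the previous step
global_layer = nn.TransformerEncoderLayer(d_model=d_f, nhead=4, batch_first=True)
global_encoder = nn.TransformerEncoder(global_layer, num_layers=2)
F_F = global_encoder(F_LVA)       # high-order complementary characterization
```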
Through these three operations, the cross-modal Transformer module not only obtains a fused-modality characterization without losing the original information, but also processes unaligned multimodal sequences efficiently.
Finally, the high-order complementary characterization is concatenated with the timing characterizations of the audio modality and the visual modality to obtain a new result $I = [F_F, F_V, F_A]$, and a two-layer fully connected network is used to obtain the category output of the final emotion:

$$\hat{y} = W_2\,\sigma\big(W_1 I + b_1\big) + b_2 \in \mathbb{R}^{d_{out}},$$

where $d_{out}$ denotes the category dimension of the output emotion, $W_1$ and $W_2$ are weight vectors, $b_1$ and $b_2$ are bias vectors, and $\sigma$ denotes the ReLU nonlinear activation function.
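The sketch below illustrates the concatenation and the two-layer fully connected head; how each sequence is reduced to a fixed-size vector before concatenation is not specified, so the mean-pooling step, the hidden width and the category count are assumptions.

```python
# Sketch of the final classification head: pool each representation over time
# (an assumption), concatenate into I = [F_F, F_V, F_A], and apply a two-layer
# fully connected network with a ReLU activation.
import torch
import torch.nn as nn

d_f, d_out = 40, 8   # d_out: number of emotion categories (dataset-dependent)

F_F = torch.randn(1, 50, d_f).mean(dim=1)    # pooled high-order characterization
F_V = torch.randn(1, 375, d_f).mean(dim=1)   # pooled visual timing characterization
F_A = torch.randn(1, 500, d_f).mean(dim=1)   # pooled audio timing characterization

I = torch.cat([F_F, F_V, F_A], dim=-1)       # (1, 3 * d_f)
head = nn.Sequential(nn.Linear(3 * d_f, d_f), nn.ReLU(), nn.Linear(d_f, d_out))
logits = head(I)                             # category output of the final emotion
```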
The method is applied to three public datasets: IEMOCAP, CMU-MOSI and CMU-MOSEI. Experiments are conducted under two word-alignment settings, i.e. with the text modality manually aligned to the audio and visual modalities, and with non-aligned sequences. During the experiments, the hyper-parameters are tuned separately for the three datasets to reach the best performance; the Adam optimizer is used for training, the common dimension is set to 40, a learning-rate decay strategy is adopted, and dropout is set to 0.2 to prevent overfitting. The experimental results show that the model of the invention achieves the best performance with the smallest parameter count of only 0.41M. The invention can learn cross-modal fusion characterizations from the complementary features of the three modalities; without losing the original modal features, the results are more robust and the processing more efficient. The method effectively balances accuracy and parameter count, retaining considerable accuracy while remaining easy to apply in practice, and its attention to lightweight application under guaranteed accuracy greatly improves the practical application value of multi-modal emotion recognition, as shown in Fig. 2.
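For reference, a minimal training-loop sketch reflecting the reported settings (Adam optimizer, learning-rate decay, dropout 0.2) is shown below; the placeholder model, batch size, learning rate and decay schedule are assumptions, since the description does not list them.

```python
# Training-configuration sketch: Adam optimizer, learning-rate decay, dropout 0.2.
# The tiny placeholder model and the specific hyper-parameter values are assumed.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Dropout(0.2), nn.Linear(120, 8))   # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = nn.CrossEntropyLoss()

for epoch in range(40):
    x, y = torch.randn(16, 120), torch.randint(0, 8, (16,))  # dummy batch
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```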
The foregoing describes preferred embodiments of the present invention. It should be understood that the invention is not limited to the specific embodiments described above, and that devices and structures not described in detail are to be understood as being implemented in a manner common in the art. Any person skilled in the art may make many possible variations and modifications, or adapt them into equivalent embodiments, without departing from the technical solution of the present invention, and such changes do not affect the essential content of the invention. Therefore, any simple modification, equivalent variation or modification of the above embodiments according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (7)

1. A method for multi-modal emotion recognition based on a non-aligned sequence, the method comprising the steps of:
S1, extracting features of a target video, namely a text characterization, a visual characterization and an audio characterization, from a text modality, a visual modality and an audio modality respectively;
S2, performing feature preprocessing on the text characterization, the visual characterization and the audio characterization according to the temporal structures of the text modality, the visual modality and the audio modality;
and S3, fusing the results of S2 through a cross-modal Transformer module to obtain a high-order complementary characterization.
2. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 1, wherein S1 specifically comprises:
feeding the transcription of the target video into a pre-trained GloVe model to obtain a 300-dimensional text characterization;
obtaining a 35-dimensional facial expression unit characterization vector of the visual modality, i.e., the visual characterization, using the Facet open-source tool;
and using COVAREP to obtain a 74-dimensional audio characterization.
3. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 2, wherein S2 specifically comprises:
feeding the audio characterization and the visual characterization into 1-dimensional temporal convolutions with different kernel sizes for preprocessing;
and preprocessing the text characterization with a two-layer BiLSTM.
4. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 1, wherein the cross-modal Transformer module comprises:
a local temporal learning unit for computing the timing characterizations of the audio modality and the visual modality;
a cross-modal feature fusion unit for computing the fusion characterization of the audio modality, the visual modality and the text modality;
and a global self-attention characterization unit for computing the high-order complementary characterization of the audio modality, the visual modality and the text modality.
5. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 4, wherein S3 comprises:
S31, obtaining the timing characterizations of the audio modality and the visual modality through the local temporal learning unit of the cross-modal Transformer module;
S32, fusing the result of S31 with the text characterization of S1 through the cross-modal feature fusion unit;
and S33, passing the result of S32 through the global self-attention characterization unit, i.e., into a Transformer encoder again, to obtain the high-order complementary characterization.
6. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 1, further comprising, after S3:
S5, obtaining the category output of the final emotion.
7. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 6, further comprising, before S5:
S4, concatenating the high-order complementary characterization obtained in S33 with the timing characterizations of the audio modality and the visual modality obtained in S31 to obtain the final result.
CN202111254440.6A 2021-10-27 2021-10-27 Multi-mode emotion recognition method based on non-aligned sequence Pending CN116052291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111254440.6A CN116052291A (en) 2021-10-27 2021-10-27 Multi-mode emotion recognition method based on non-aligned sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111254440.6A CN116052291A (en) 2021-10-27 2021-10-27 Multi-mode emotion recognition method based on non-aligned sequence

Publications (1)

Publication Number Publication Date
CN116052291A true CN116052291A (en) 2023-05-02

Family

ID=86124134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111254440.6A Pending CN116052291A (en) 2021-10-27 2021-10-27 Multi-mode emotion recognition method based on non-aligned sequence

Country Status (1)

Country Link
CN (1) CN116052291A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117423168A (en) * 2023-12-19 2024-01-19 湖南三湘银行股份有限公司 User emotion recognition method and system based on multi-modal feature fusion
CN117423168B (en) * 2023-12-19 2024-04-02 湖南三湘银行股份有限公司 User emotion recognition method and system based on multi-modal feature fusion

Similar Documents

Publication Publication Date Title
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN111402861B (en) Voice recognition method, device, equipment and storage medium
CN107844481B (en) Text recognition error detection method and device
CN111930918B (en) Cross-modal bilateral personalized man-machine social interaction dialog generation method and system
CN111382231B (en) Intention recognition system and method
CN112434142B (en) Method for marking training sample, server, computing equipment and storage medium
CN111027291A (en) Method and device for adding punctuation marks in text and training model and electronic equipment
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN117521675A (en) Information processing method, device, equipment and storage medium based on large language model
CN112101044A (en) Intention identification method and device and electronic equipment
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
CN115964638A (en) Multi-mode social data emotion classification method, system, terminal, equipment and application
CN117746078B (en) Object detection method and system based on user-defined category
CN116542256A (en) Natural language understanding method and device integrating dialogue context information
CN114494969A (en) Emotion recognition method based on multimode voice information complementary AND gate control
CN116052291A (en) Multi-mode emotion recognition method based on non-aligned sequence
CN115186071A (en) Intention recognition method and device, electronic equipment and readable storage medium
CN117271745A (en) Information processing method and device, computing equipment and storage medium
CN111160512A (en) Method for constructing dual-discriminator dialog generation model based on generative confrontation network
CN113946670B (en) Contrast type context understanding enhancement method for dialogue emotion recognition
CN115952360A (en) Domain-adaptive cross-domain recommendation method and system based on user and article commonality modeling
CN114419514B (en) Data processing method, device, computer equipment and storage medium
CN112434133B (en) Intention classification method and device, intelligent terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination