CN116052291A - Multi-modal emotion recognition method based on non-aligned sequence - Google Patents

Multi-modal emotion recognition method based on non-aligned sequence

Info

Publication number
CN116052291A
CN116052291A
Authority
CN
China
Prior art keywords
mode
characterization
visual
audio
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111254440.6A
Other languages
Chinese (zh)
Inventor
刘峰
付子旺
齐佳音
周爱民
李志斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University Of International Business And Economics
Beijing University of Posts and Telecommunications
East China Normal University
Original Assignee
Shanghai University Of International Business And Economics
Beijing University of Posts and Telecommunications
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University Of International Business And Economics, Beijing University of Posts and Telecommunications, East China Normal University
Priority to CN202111254440.6A
Publication of CN116052291A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modal emotion recognition method based on a non-aligned sequence, relating to the technical field of computer vision, which comprises the following steps: S1, extracting features of a target video, namely a text characterization, a visual characterization and an audio characterization, from the text modality, the visual modality and the audio modality respectively; S2, performing feature preprocessing on the text, visual and audio characterizations according to the temporal structure of each modality; and S3, fusing the results of S2 through a cross-modal Transformer module to obtain a high-order complementary characterization. The method can learn a cross-modal fusion characterization from the complementary features of the three modalities; without losing the original modal features, the result is more robust and the processing more efficient, accuracy and parameter count are balanced, and the practical application value of multi-modal emotion recognition is improved.

Description

Multi-modal emotion recognition method based on non-aligned sequence
Technical Field
The invention relates to the technical field of computer vision, and in particular to a multi-modal emotion recognition method based on a non-aligned sequence.
Background
Multimodal emotion recognition is one of the challenging tasks in multimodal machine learning, and it continues to attract attention because of the strong robustness and remarkable performance that multiple modalities provide. The goal of multimodal emotion recognition is to recognize a person's emotional attitude from a video clip, mainly involving three modalities: natural language, facial expression and audio. Compared with a single modality, a multimodal system provides richer information, better matches human expressive behaviour, and can more completely reflect the emotional state of a sequence; it therefore has broad application prospects in fields such as social robots, medical care and education quality evaluation.
However, the field still faces several challenges. On the one hand, because different modalities are sampled at different frequencies, the three modal sequences are usually unaligned; manual alignment is labor-intensive, requires extensive domain knowledge, and is difficult to carry out. On the other hand, existing models cannot balance performance and parameter count: influenced by pre-trained models, networks with outstanding performance usually carry a very large number of parameters. The invention therefore focuses on efficiently learning fusion characterizations of the different modalities from non-aligned sequences for multimodal emotion recognition.
In existing research, human multimodal emotion recognition can be categorized by fusion strategy into feature-level fusion, decision-level fusion and hybrid-model fusion. Feature-level fusion extracts features from each modality and then concatenates them into a single feature set. Decision-level fusion estimates the credibility of each unimodal model and then coordinates and combines their decisions. With the development of deep learning, hybrid-model fusion can choose the fusion position flexibly and improves performance markedly compared with the former two approaches. Among hybrid-model fusion approaches for non-aligned multimodal emotion recognition, Tsai et al. proposed the Multimodal Transformer (MulT), which applies pairwise cross-modal attention to unaligned multimodal sequences to enhance the characterization of each modality; Lv Fengmao et al. proposed integrating the information of the three modalities through a message hub and adopting a progressive strategy to enhance the high-order mixed-modality features.
However, the current technology has the following problems. First, pairwise modality fusion produces redundant information and does not consider that the three modalities should act on the final emotion output simultaneously. Second, integrating the information of different modalities through a centralized message hub is inefficient, because the three modalities must exchange messages with the hub frequently; this approach ignores the complementary information of the three modalities, which can complete cross-modal fusion without introducing a third party. Third, influenced by pre-trained models, existing networks have very large parameter counts and cannot be applied in real scenarios.
Disclosure of Invention
The technical problem addressed by the invention is to provide a multi-modal emotion recognition method based on a non-aligned sequence, which improves the relevance of the three modalities during fusion; without losing the original modal features, the result is more robust and the processing more efficient, accuracy and parameter count are balanced, and the practical application value of multi-modal emotion recognition is improved.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A multi-modal emotion recognition method based on a non-aligned sequence, the method comprising the following steps: S1, extracting features of a target video, namely a text characterization, a visual characterization and an audio characterization, from the text modality, the visual modality and the audio modality respectively; S2, performing feature preprocessing on the text, visual and audio characterizations according to the temporal structure of each modality; and S3, fusing the results of S2 through a cross-modal Transformer module to obtain a high-order complementary characterization. The cross-modal Transformer module designed by the invention can learn a cross-modal fusion characterization from the complementary features of the three modalities; without losing the original modal features, it is more robust and processes more efficiently, balances accuracy and parameter count, and improves the practical application value of multi-modal emotion recognition.
S1 specifically comprises: feeding the transcription of the target video into a pre-trained GloVe model to obtain a 300-dimensional text characterization; using the Facet open-source tool to obtain a 35-dimensional facial expression unit characterization vector of the visual modality, i.e., the visual characterization; and using COVAREP to obtain a 74-dimensional audio characterization.
S2 specifically comprises: feeding the audio characterization and the visual characterization into 1-dimensional temporal convolutions with different kernel sizes for preprocessing; and preprocessing the text characterization with a two-layer BiLSTM.
The cross-modal Transformer module comprises: a local temporal learning unit for computing the timing characterizations of the audio modality and the visual modality; a cross-modal feature fusion unit for computing the fusion characterization of the audio, visual and text modalities; and a global self-attention characterization unit for computing the high-order complementary characterization of the audio, visual and text modalities.
S3 comprises the following steps: S31, obtaining the timing characterizations of the audio modality and the visual modality through the local temporal learning unit of the cross-modal Transformer module; S32, fusing the result of S31 with the text characterization of S1 through the cross-modal feature fusion unit; and S33, passing the result of S32 through the global self-attention characterization unit, i.e., into a Transformer encoder again, to obtain the high-order complementary characterization.
After S3, the method further comprises: S5, obtaining the category output of the final emotion.
Before S5, the method further comprises: S4, concatenating the high-order complementary characterization obtained in S33 with the timing characterizations of the audio modality and the visual modality obtained in S31 to obtain the final result.
Drawings
The invention and its features, aspects and advantages will become more apparent from the detailed description of non-limiting embodiments with reference to the following drawings. Like numbers refer to like parts throughout. The drawings are not intended to be drawn to scale, emphasis instead being placed upon illustrating the principles of the invention.
Fig. 1 is a schematic diagram of the multi-modal emotion recognition method based on a non-aligned sequence according to embodiment 1 of the present invention;
Fig. 2 is a comparison of experimental data for the multi-modal emotion recognition method based on a non-aligned sequence according to embodiment 1 of the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings and specific examples, which are not intended to limit the invention.
In the multi-modal emotion recognition method for non-aligned sequences provided by the invention, the task mainly involves a language modality (L), a visual modality (V) and an audio modality (A). The overall flow of the method is shown in Fig. 1: features of the target video, namely the text characterization, the visual characterization and the audio characterization, are first extracted from the text modality, the visual modality and the audio modality respectively.
For the text characterization, the transcription of the video is fed into a pre-trained GloVe model to obtain a 300-dimensional text representation; for the visual characterization, the Facet open-source tool is used to obtain a 35-dimensional facial expression unit characterization vector; for the audio characterization, COVAREP is used to obtain a 74-dimensional audio signal. The characterizations of the three modalities obtained by feature extraction are defined as

$$X_{\{L,V,A\}} \in \mathbb{R}^{T_{\{L,V,A\}} \times d_{\{L,V,A\}}},$$

where $T_{(\cdot)}$ denotes the length of the sequence and $d_{(\cdot)}$ denotes the dimension of the extracted modality.
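As a concrete illustration of these shapes, the following minimal snippet builds placeholder tensors with the stated feature dimensions; the sequence lengths are arbitrary example values, since the three modalities are sampled at different rates (a sketch, not part of the patent).

```python
# Placeholder tensors for the extracted, non-aligned unimodal characterizations.
# The sequence lengths T_L, T_V, T_A are illustrative assumptions; only the
# feature dimensions (300, 35, 74) come from the description above.
import torch

T_L, T_V, T_A = 50, 375, 500           # lengths differ because sampling rates differ
X_L = torch.randn(T_L, 300)            # GloVe text characterization
X_V = torch.randn(T_V, 35)             # Facet facial expression unit characterization
X_A = torch.randn(T_A, 74)             # COVAREP audio characterization
```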
After the text, visual and audio characterizations are obtained, the three characterizations are preprocessed according to the temporal structure of each modality. For the audio and visual characterizations, to ensure that the input sequences capture the structure of adjacent elements, the two modalities are fed into 1-dimensional temporal convolutions with different kernel sizes:

$$\hat{X}_{\{V,A\}} = \mathrm{BN}\big(\mathrm{Conv1D}\big(X_{\{V,A\}},\, k_{\{V,A\}}\big)\big) \in \mathbb{R}^{T_{\{V,A\}} \times d_f},$$

where BN denotes batch normalization, $k_{\{V,A\}}$ denotes the convolution kernel sizes of the visual and audio modalities, and $d_f$ denotes the common dimension. For the text characterization, considering that the modality itself carries long-term dependencies and contextual information, and that a BiLSTM captures long-term semantic information well, a two-layer BiLSTM is used for feature preprocessing:

$$\hat{X}_{L} = \mathrm{LN}\big(\mathrm{BiLSTM}\big(X_{L}\big)\big) \in \mathbb{R}^{T_{L} \times d_f},$$

where LN denotes layer normalization, which is robust for sequence-structured text. Through these operations, on the one hand the information of adjacent elements is inherited, and on the other hand the different modalities are brought in advance to a common dimension from the unaligned multimodal data.
The preprocessed results are then fused through a cross-modal Transformer module, which is divided into a local temporal learning unit, a cross-modal feature fusion unit and a global self-attention characterization unit. The timing characterizations of the audio modality and the visual modality are obtained through the local temporal learning unit, computed as

$$\hat{X}^{\mathrm{PE}}_{\{V,A\}} = \hat{X}_{\{V,A\}} + \mathrm{PE}\big(T_{\{V,A\}},\, d_f\big),$$
$$F_{\{V,A\}} = \mathrm{TransformerEncoder}\big(\hat{X}^{\mathrm{PE}}_{\{V,A\}}\big),$$

where $\mathrm{PE}(T_{\{V,A\}}, d_f)$ denotes the position encoding, $\hat{X}^{\mathrm{PE}}_{\{V,A\}}$ denotes the position-encoded result, and TransformerEncoder denotes a Transformer encoder. We use $F_{\{V,A\}}$ to denote the result after local temporal learning.
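The sketch below illustrates this local temporal learning step with PyTorch's built-in Transformer encoder; the sinusoidal position-encoding helper, the head count and the layer count are assumptions made for illustration.

```python
# Sketch of the local temporal learning unit: add a sinusoidal position encoding
# to the preprocessed visual/audio sequence, then pass it through a Transformer
# encoder to obtain the timing characterization F_V (and likewise F_A).
import math
import torch
import torch.nn as nn

def position_encoding(T, d_f):
    pe = torch.zeros(T, d_f)
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_f, 2, dtype=torch.float32) * (-math.log(10000.0) / d_f))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

d_f = 40
enc_layer = nn.TransformerEncoderLayer(d_model=d_f, nhead=4, batch_first=True)
local_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

Xh_V = torch.randn(1, 375, d_f)                            # preprocessed visual sequence
F_V = local_encoder(Xh_V + position_encoding(375, d_f))    # timing characterization
```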
Based on the results of temporal learning, a residual-based cross-modal fusion method is designed, which takes $F_{\{V,A\}}$ and $\hat{X}_L$ as input and produces a fusion characterization of the three modalities. Specifically, the mappings of the two characterizations are obtained by a linear operation, the interaction features of the two modalities are obtained by add and tanh operations, and the fused result is finally obtained by softmax:

$$F_{LVA} = \hat{X}_L + \mathrm{softmax}\Big(\tanh\big(L(F_{\{V,A\}}) + L(\hat{X}_L)\big)\Big) \odot \hat{X}_L,$$

where $L$ denotes a linear operation and $F_{LVA}$ denotes the final result of cross-modal feature fusion. In this process, the final result contains not only inter-modal features but also the original intra-modal features of the text, forming a high-order feature aggregate. To ensure that no information is lost, the original text characterization is continually reinforced through the residual structure.
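A possible realization of this residual fusion is sketched below. Since the description does not spell out how sequences of different lengths are combined before the add operation, the mean-pooling of the audio/visual timing features over time is an added assumption, as are the layer sizes.

```python
# Sketch of the residual cross-modal fusion: linear mappings of the two
# characterizations, add + tanh for their interaction, softmax for fusion
# weights, and a residual path that keeps the original text characterization.
# Mean-pooling the (non-aligned) audio/visual sequence over time is an assumption.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_f=40):
        super().__init__()
        self.lin_av = nn.Linear(d_f, d_f)   # maps the audio/visual timing features
        self.lin_l = nn.Linear(d_f, d_f)    # maps the text characterization

    def forward(self, F_av, Xh_L):
        g_av = self.lin_av(F_av.mean(dim=1, keepdim=True))   # (B, 1, d_f) pooled A/V
        inter = torch.tanh(g_av + self.lin_l(Xh_L))          # (B, T_L, d_f) interaction
        weights = torch.softmax(inter, dim=-1)               # fusion weights
        return Xh_L + weights * Xh_L                         # residual keeps text info

fusion = CrossModalFusion(d_f=40)
F_V, F_A = torch.randn(1, 375, 40), torch.randn(1, 500, 40)   # timing characterizations
Xh_L = torch.randn(1, 50, 40)                                 # preprocessed text
F_LVA = fusion(torch.cat([F_V, F_A], dim=1), Xh_L)            # fused characterization
```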
The fused result is then fed into a Transformer encoder again to learn its own neighbouring features and information and obtain the final high-order complementary characterization:

$$F_F = \mathrm{TransformerEncoder}\big(F_{LVA}\big),$$

where $F_F$ denotes the final result of the global self-attention characterization.
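A corresponding sketch of the global self-attention step follows; the encoder depth and head count are again illustrative assumptions.

```python
# Sketch of the global self-attention characterization unit: the fused result
# is passed through a Transformer encoder once more.
import torch
import torch.nn as nn

d_f = 40
F_LVA = torch.randn(1, 50, d_f)   # fused characterization from the previous step
global_layer = nn.TransformerEncoderLayer(d_model=d_f, nhead=4, batch_first=True)
global_encoder = nn.TransformerEncoder(global_layer, num_layers=2)
F_F = global_encoder(F_LVA)       # high-order complementary characterization
```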
Through these three operations, the cross-modal Transformer module not only obtains a fused-modality characterization without losing the original information, but also processes unaligned multimodal sequences efficiently.
Finally, the high-order complementary characterization is concatenated with the timing characterizations of the audio modality and the visual modality to obtain a new result $I = [F_F, F_V, F_A]$, and a two-layer fully connected network is used to obtain the category output of the final emotion:

$$\hat{y} = W_2\,\sigma\big(W_1 I + b_1\big) + b_2 \in \mathbb{R}^{d_{out}},$$

where $d_{out}$ denotes the category dimension of the output emotion, $W_1$ and $W_2$ are weight vectors, $b_1$ and $b_2$ are bias vectors, and $\sigma$ denotes the ReLU nonlinear activation function.
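The sketch below illustrates the concatenation and the two-layer fully connected head; how each sequence is reduced to a fixed-size vector before concatenation is not specified, so the mean-pooling step, the hidden width and the category count are assumptions.

```python
# Sketch of the final classification head: pool each representation over time
# (an assumption), concatenate into I = [F_F, F_V, F_A], and apply a two-layer
# fully connected network with a ReLU activation.
import torch
import torch.nn as nn

d_f, d_out = 40, 8   # d_out: number of emotion categories (dataset-dependent)

F_F = torch.randn(1, 50, d_f).mean(dim=1)    # pooled high-order characterization
F_V = torch.randn(1, 375, d_f).mean(dim=1)   # pooled visual timing characterization
F_A = torch.randn(1, 500, d_f).mean(dim=1)   # pooled audio timing characterization

I = torch.cat([F_F, F_V, F_A], dim=-1)       # (1, 3 * d_f)
head = nn.Sequential(nn.Linear(3 * d_f, d_f), nn.ReLU(), nn.Linear(d_f, d_out))
logits = head(I)                             # category output of the final emotion
```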
The method is applied to three public datasets: IEMOCAP, CMU-MOSI and CMU-MOSEI. Experiments are conducted under two word-alignment settings, i.e. with the text modality manually aligned to the audio and visual modalities, and with non-aligned sequences. During the experiments, the hyper-parameters are tuned separately for the three datasets to reach the best performance; the Adam optimizer is used for training, the common dimension is set to 40, a learning-rate decay strategy is adopted, and dropout is set to 0.2 to prevent overfitting. The experimental results show that the model of the invention achieves the best performance with the smallest parameter count of only 0.41M. The invention can learn cross-modal fusion characterizations from the complementary features of the three modalities; without losing the original modal features, the results are more robust and the processing more efficient. The method effectively balances accuracy and parameter count, retaining considerable accuracy while remaining easy to apply in practice, and its attention to lightweight application under guaranteed accuracy greatly improves the practical application value of multi-modal emotion recognition, as shown in Fig. 2.
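For reference, a minimal training-loop sketch reflecting the reported settings (Adam optimizer, learning-rate decay, dropout 0.2) is shown below; the placeholder model, batch size, learning rate and decay schedule are assumptions, since the description does not list them.

```python
# Training-configuration sketch: Adam optimizer, learning-rate decay, dropout 0.2.
# The tiny placeholder model and the specific hyper-parameter values are assumed.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Dropout(0.2), nn.Linear(120, 8))   # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = nn.CrossEntropyLoss()

for epoch in range(40):
    x, y = torch.randn(16, 120), torch.randint(0, 8, (16,))  # dummy batch
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```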
The foregoing describes preferred embodiments of the present invention. It should be understood that the invention is not limited to the specific embodiments described above, and that devices and structures not described in detail are to be understood as being implemented in a manner common in the art. Any person skilled in the art may make many possible variations and modifications, or adapt them into equivalent embodiments, without departing from the technical solution of the present invention, and such changes do not affect the essential content of the invention. Therefore, any simple modification, equivalent variation or modification of the above embodiments according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (7)

1. A method for multi-modal emotion recognition based on a non-aligned sequence, the method comprising the steps of:
S1, extracting features of a target video, namely a text characterization, a visual characterization and an audio characterization, from a text modality, a visual modality and an audio modality respectively;
S2, performing feature preprocessing on the text characterization, the visual characterization and the audio characterization according to the temporal structures of the text modality, the visual modality and the audio modality;
and S3, fusing the results of S2 through a cross-modal Transformer module to obtain a high-order complementary characterization.
2. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 1, wherein S1 specifically comprises:
feeding the transcription of the target video into a pre-trained GloVe model to obtain a 300-dimensional text characterization;
obtaining a 35-dimensional facial expression unit characterization vector of the visual modality, i.e., the visual characterization, using the Facet open-source tool;
and using COVAREP to obtain a 74-dimensional audio characterization.
3. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 2, wherein S2 specifically comprises:
feeding the audio characterization and the visual characterization into 1-dimensional temporal convolutions with different kernel sizes for preprocessing;
and preprocessing the text characterization with a two-layer BiLSTM.
4. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 1, wherein the cross-modal Transformer module comprises:
a local temporal learning unit for computing the timing characterizations of the audio modality and the visual modality;
a cross-modal feature fusion unit for computing the fusion characterization of the audio modality, the visual modality and the text modality;
and a global self-attention characterization unit for computing the high-order complementary characterization of the audio modality, the visual modality and the text modality.
5. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 4, wherein S3 comprises:
S31, obtaining the timing characterizations of the audio modality and the visual modality through the local temporal learning unit of the cross-modal Transformer module;
S32, fusing the result of S31 with the text characterization of S1 through the cross-modal feature fusion unit;
and S33, passing the result of S32 through the global self-attention characterization unit, i.e., into a Transformer encoder again, to obtain the high-order complementary characterization.
6. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 1, further comprising, after S3:
S5, obtaining the category output of the final emotion.
7. The multi-modal emotion recognition method based on a non-aligned sequence according to claim 6, further comprising, before S5:
S4, concatenating the high-order complementary characterization obtained in S33 with the timing characterizations of the audio modality and the visual modality obtained in S31 to obtain the final result.
CN202111254440.6A 2021-10-27 2021-10-27 Multi-mode emotion recognition method based on non-aligned sequence Pending CN116052291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111254440.6A CN116052291A (en) 2021-10-27 2021-10-27 Multi-mode emotion recognition method based on non-aligned sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111254440.6A CN116052291A (en) 2021-10-27 2021-10-27 Multi-mode emotion recognition method based on non-aligned sequence

Publications (1)

Publication Number Publication Date
CN116052291A true CN116052291A (en) 2023-05-02

Family

ID=86124134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111254440.6A Pending CN116052291A (en) 2021-10-27 2021-10-27 Multi-mode emotion recognition method based on non-aligned sequence

Country Status (1)

Country Link
CN (1) CN116052291A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117423168A (en) * 2023-12-19 2024-01-19 湖南三湘银行股份有限公司 User emotion recognition method and system based on multi-modal feature fusion
CN117423168B (en) * 2023-12-19 2024-04-02 湖南三湘银行股份有限公司 User emotion recognition method and system based on multi-modal feature fusion

Similar Documents

Publication Publication Date Title
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN111402861B (en) Voice recognition method, device, equipment and storage medium
CN107844481B (en) Text recognition error detection method and device
CN111930918B (en) Cross-modal bilateral personalized man-machine social interaction dialog generation method and system
CN111382231B (en) Intention recognition system and method
CN112434142B (en) Method for marking training sample, server, computing equipment and storage medium
CN111027291A (en) Method and device for adding punctuation marks in text and training model and electronic equipment
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN117521675A (en) Information processing method, device, equipment and storage medium based on large language model
CN112101044A (en) Intention identification method and device and electronic equipment
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
CN115964638A (en) Multi-mode social data emotion classification method, system, terminal, equipment and application
CN117746078B (en) Object detection method and system based on user-defined category
CN116542256A (en) Natural language understanding method and device integrating dialogue context information
CN114494969A (en) Emotion recognition method based on multimode voice information complementary AND gate control
CN116052291A (en) Multi-mode emotion recognition method based on non-aligned sequence
CN115186071A (en) Intention recognition method and device, electronic equipment and readable storage medium
CN117271745A (en) Information processing method and device, computing equipment and storage medium
CN111160512A (en) Method for constructing dual-discriminator dialog generation model based on generative confrontation network
CN113946670B (en) Contrast type context understanding enhancement method for dialogue emotion recognition
CN115952360A (en) Domain-adaptive cross-domain recommendation method and system based on user and article commonality modeling
CN114419514B (en) Data processing method, device, computer equipment and storage medium
CN112434133B (en) Intention classification method and device, intelligent terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination