CN115983280A - Multi-modal emotion analysis method and system for uncertain modal loss - Google Patents

Multi-modal emotion analysis method and system for uncertain modal loss

Info

Publication number
CN115983280A
Authority
CN
China
Prior art keywords
modal
emotion analysis
loss
uncertain
transformer
Prior art date
Legal status
Granted
Application number
CN202310081044.0A
Other languages
Chinese (zh)
Other versions
CN115983280B (en)
Inventor
刘志中
周斌
初佃辉
孟令强
孙宇航
Current Assignee
Yantai University
Original Assignee
Yantai University
Priority date
Filing date
Publication date
Application filed by Yantai University filed Critical Yantai University
Priority to CN202310081044.0A priority Critical patent/CN115983280B/en
Publication of CN115983280A publication Critical patent/CN115983280A/en
Application granted granted Critical
Publication of CN115983280B publication Critical patent/CN115983280B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a multi-modal emotion analysis method and system for uncertain modal loss, relating to the technical field of data processing. The scheme comprises the following steps: acquiring multi-modal data with uncertain missing modalities, covering three modalities: text, visual and audio; and processing the three kinds of modal data through a trained multi-modal emotion analysis network to generate and output a final emotion classification. Based on a modality translation module, the invention translates the visual and audio modalities into the text modality, thereby improving the quality of the visual and audio modalities and capturing the deep interaction among different modalities. The complete modalities are pre-trained to obtain the joint feature of the complete modalities, which guides the joint feature of the missing modalities to approach the joint feature of the complete modalities; there is no need to consider which modality is missing, only to approach the complete-modality joint feature vector, so the method has stronger universality.

Description

Multi-modal emotion analysis method and system for uncertain modal loss
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a multi-modal emotion analysis method and system for uncertain modal loss.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
At present, in an environment of rapidly developing informatization and intelligence, multi-modal sentiment analysis (MSA) plays an important role in natural human-computer interaction, personalized advertising, opinion mining, decision making and other fields; the technology is dedicated to recognizing human emotions from different sources such as text, speech or facial expressions. In recent years, with the rapid development of the mobile internet and intelligent terminals, more and more users express their viewpoints and feelings through social platforms such as Twitter and microblogs, generating a large amount of social data; data on social platforms has evolved from a single text form to multi-modal data, such as text, audio and images.
Compared with single-modal data, multi-modal data contains richer information and is more beneficial to identifying the real emotion of the user. Currently, multi-modal emotion analysis has attracted wide attention and become a research hotspot in the field of artificial intelligence, and many effective multi-modal emotion analysis models have appeared, for example: multi-modal emotion analysis models based on recurrent neural networks, on Transformers, and on convolutional neural networks. The existing multi-modal emotion analysis models can identify the emotion of a user fairly well and have promoted the development of the field.
However, existing MSA models are proposed under the assumption that all modalities (text, visual and audio) are available; in real-world scenarios, uncertain missing modalities always occur due to uncontrollable factors. For example, as shown in FIG. 1, visual content may be unavailable because the camera is turned off or occluded; voice content may be unavailable because the user is silent; voice and text may be lost due to monitoring-device errors; or the face may not be detected due to illumination or occlusion. Thus, the assumption that all modalities are available at all times does not always hold in real-world scenarios, and most existing multi-modal emotion analysis models may fail when modalities are randomly missing; therefore, how to deal with missing modalities in multi-modal emotion analysis is becoming a new challenge.
Currently, some research work has focused on the problem of missing modalities. Han et al. propose a joint training method that implicitly merges multi-modal information from auxiliary modalities, thereby improving single-modality emotion analysis performance. The method proposed by Srinivas et al. studies the problem of missing audio-visual modalities in automatic audio-visual expression recognition, investigates the performance of a Transformer when one modality is absent, and carries out ablation studies to evaluate the model, with results showing that the work generalizes well when a modality is missing. To solve the problem of missing modalities in object recognition, Tran et al. estimate the missing data by exploiting the correlation between different modalities.
Zhao et al. propose a Missing Modality Imagination Network (MMIN) to handle the problem of uncertain missing modalities; MMIN learns a robust joint multi-modal representation that can predict the representation of any missing modality given the available modalities under different missing conditions. Zeng et al. propose a Tag-Assisted Transformer Encoder (TATE) network to handle the problem of uncertain missing modalities, where TATE includes a tag encoding module that covers both single-modality and multi-modality missing situations.
These studies have achieved good results and promoted research on emotion analysis under the condition that specific modalities are missing. Nevertheless, the following shortcomings remain: first, existing works only perform a splicing operation for feature fusion and cannot capture the interaction among different modal features; second, existing works do not exploit the advantage of the text modality in MSA, which affects the effect of multi-modal emotion analysis; in addition, existing works need to enumerate the missing situations of different modalities, handle each case separately, and then perform emotion analysis, which increases the complexity of the model.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-modal emotion analysis method and system for uncertain modal loss, in which the visual and audio modalities are translated into the text modality by a modality translation module, and the joint feature of the complete modalities, obtained by pre-training on complete modalities, guides the joint feature of the missing modalities to approach the joint feature of the complete modalities; there is no need to consider which modality is missing, only to approach the complete-modality joint feature, so the method and system have stronger universality.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the invention provides a multi-modal emotion analysis method for uncertain modal loss in a first aspect;
the multi-modal emotion analysis method for uncertain modal loss comprises the following steps:
multi-modal data with uncertain missing is acquired, including three modalities: text, visual and audio;
processing the three modal data through the trained multi-modal emotion analysis network to generate and output a final emotion classification;
the multi-modal emotion analysis network comprises a modal translation module, wherein the modal translation module extracts single-modal characteristics of three modal data based on a multi-head self-attention mechanism, and a Transformer decoder is used for monitoring a Transformer encoder to enable the visual characteristics and the audio characteristics to approach text characteristics, so that the visual characteristics and the audio characteristics are translated into text characteristics.
Further, the extracting of the single-mode features of the three-mode data based on the multi-head self-attention mechanism specifically includes:
extracting context features of each modality by using a Transformer encoder;
residual error connection is carried out on the extracted context characteristics, and normalization is carried out;
and performing linear transformation on the normalized context characteristics to obtain the single-mode characteristics of the three-mode data.
Further, supervising the Transformer encoder by using the Transformer decoder specifically comprises:
constructing a Transformer decoder by taking the text features as the Query of the multi-head attention mechanism and the modal features to be translated as the Key and Value of the multi-head attention mechanism;
and supervising the Transformer encoder to translate the modal characteristics to be translated to the text characteristics according to the translation loss between the modal characteristics to be translated and the text characteristics output by the Transformer decoder.
Further, the multi-modal emotion analysis network further comprises a common space projection module;
the public space projection module is used for carrying out linear transformation on the three modal characteristics after the modal translation to obtain the autocorrelation public space of each modal and fusing the autocorrelation public space into the joint characteristics of the missing modal.
Further, the multi-modal emotion analysis network further comprises a pre-training module and a Transformer encoder module;
the pre-training module is used for pre-training the multi-modal emotion analysis network by using all complete modal data;
and the Transformer encoder module, under the supervision of the pre-trained multi-modal emotion analysis network, guides the joint feature of the missing modalities to approach the joint feature of the complete modalities, and encodes the joint feature of the missing modalities so as to generate a joint feature of the complete modalities.
Further, the multi-modal emotion analysis network further comprises a Transformer decoder module;
and taking the output of the Transformer encoder module as the input of the Transformer decoder module, and guiding the Transformer encoder to learn the long-term dependence relationship among different modes through the decoder loss.
Further, the overall training objective of the multi-modal emotion analysis network is obtained by a weighted sum of the classification loss, the pre-training loss, the Transformer encoder-decoder loss and the modality translation loss.
The invention provides a multi-modal emotion analysis system facing uncertain modal loss in a second aspect.
A multi-modal emotion analysis system for uncertain modal loss comprises:
a data acquisition unit configured to: multi-modal data with uncertain missing is acquired, including three modalities: text, visual and audio;
an emotion analysis unit configured to: processing the three kinds of modal data through the trained multi-modal emotion analysis network to generate and output a final emotion classification;
the multi-modal emotion analysis network comprises a modality translation module; the modality translation module extracts single-modality features of the three kinds of modal data based on a multi-head self-attention mechanism, and a Transformer decoder is used to supervise the Transformer encoder so that the visual and audio features approach the text features, thereby translating the visual and audio features into text features.
A third aspect of the present invention provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the steps in the method for multimodal emotion analysis oriented to uncertain modal absence according to the first aspect of the present invention.
A fourth aspect of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for multimodal emotion analysis oriented to uncertain mode absence according to the first aspect of the present invention when executing the program.
The above one or more technical solutions have the following beneficial effects:
in order to solve the problems that existing works neither exploit the advantage of the text modality in emotion analysis models nor consider the deep interaction among different modalities during feature fusion, the invention provides a modality translation module that translates the visual and audio modalities into the text modality, improving the quality of the visual and audio modalities and capturing the deep interaction among the different modalities.
In order to solve the problem of uncertain missing modalities, the invention pre-trains on complete modalities to obtain the joint feature of the complete modalities, which guides the joint feature of the missing modalities to approach the joint feature of the complete modalities; there is no need to consider which modality is missing, only to approach the complete-modality joint feature vector, so the method has stronger universality.
Experiments were conducted on two public multi-modal datasets, CMU-MOSI and IEMOCAP; the experimental results show that, compared with several baselines on the two datasets, the proposed model achieves notable improvements, verifying its effectiveness.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.
FIG. 1 is an exemplary diagram of missing modalities in multimodal sentiment analysis.
Fig. 2 is a flow chart of the method of the first embodiment.
FIG. 3 is a diagram of a multi-modal emotion analysis network structure according to the first embodiment.
Fig. 4 is a configuration diagram of a modality translation module according to the first embodiment.
Detailed Description
The invention is further described with reference to the following figures and examples.
Example one
Aiming at the problem of uncertain mode deletion in multi-modal emotion analysis, the embodiment discloses a multi-modal emotion analysis method for uncertain mode deletion;
as shown in fig. 2, the multi-modal emotion analysis method for uncertain modal loss includes:
step S1: multi-modal data with uncertain absence is acquired, comprising three modalities: text, visual, and audio.
Given the multi-modal data to be analyzed P = [X_v, X_a, X_t], where v, a and t represent the visual, audio and text modalities respectively, and X_v, X_a and X_t represent the visual, audio and text modal data respectively. Without loss of generality, this embodiment uses X̃_M to denote a missing modality, where M ∈ {v, a, t}; for example, if the visual modality is missing, the multi-modal data can be represented as P = [X̃_v, X_a, X_t].
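As a purely illustrative sketch of this notation (not part of the embodiment), the following Python/PyTorch snippet assembles P = [X_v, X_a, X_t] and marks a missing modality; zero-filling the absent modality and keeping a presence mask is an assumed convention, the sequence length of 20 is likewise an assumption, while the feature dimensions 709/33/768 are the ones reported later in this embodiment.
```python
import torch

# Feature dimensions reported later in this embodiment (visual 709, audio 33, text 768).
DIMS = {"v": 709, "a": 33, "t": 768}

def make_multimodal_sample(x_v=None, x_a=None, x_t=None, seq_len=20):
    """Assemble P = [X_v, X_a, X_t]; a modality passed as None is treated as
    missing, zero-filled and flagged in a presence mask (an illustrative
    convention only -- the embodiment does not prescribe how X~_M is stored)."""
    sample, mask = {}, {}
    for name, x in zip("vat", (x_v, x_a, x_t)):
        sample[name] = torch.zeros(seq_len, DIMS[name]) if x is None else x
        mask[name] = x is not None
    return sample, mask

# Example: the visual modality is absent, i.e. P = [X~_v, X_a, X_t].
sample, mask = make_multimodal_sample(x_a=torch.randn(20, 33),
                                      x_t=torch.randn(20, 768))
```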
Step S2: and processing the three modal data through the trained multi-modal emotion analysis network to generate and output final emotion classification.
This embodiment discloses a Transformer-based multi-modal emotion analysis network, whose structure is shown in fig. 3; it comprises a modality translation module, a common space projection module, a pre-training module, a Transformer encoder-decoder module and a classification module. Based on these modules, the specific process of processing multi-modal data with the trained multi-modal emotion analysis network is as follows:
step S201: and inputting the visual mode and the audio mode into a mode translation module, translating into a text mode, and coding the text mode by using a Transformer coder to obtain the single-mode characteristics of the data of the three modes.
First, some key concepts in the Transformer are briefly introduced. Given an input X, the Query is defined as Q = XW^Q, the Key as K = XW^K and the Value as V = XW^V, where W^Q ∈ R^(d×d), W^K ∈ R^(d×d) and W^V ∈ R^(d×d) are weight matrices. The attention mechanism in the Transformer is calculated as shown in formula (1):
Attention(Q, K, V) = Softmax(QK^T / √d_k)V    (1)
where Softmax is the normalized exponential function, T denotes the matrix transpose, and d_k is the dimension of the matrix K.
Since the multi-head attention mechanism has multiple attention heads and can capture information from different subspaces, in order to learn the expression of multiple semantics in multiple modalities, this embodiment uses the multi-head attention mechanism to extract features in the different semantic spaces of each modality; the multi-head attention mechanism is formalized as shown in formula (2):
E_M = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h)W^O    (2)
where W^O ∈ R^(d×d) is a weight matrix and h is the number of attention heads; the i-th head is expressed as shown in formula (3):
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (3)
where W_i^Q, W_i^K and W_i^V are the weight matrices of the i-th Query, Key and Value.
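For reference, a minimal PyTorch sketch of the scaled dot-product attention of formula (1) and the multi-head combination of formulas (2)-(3) is given below; the hidden size of 300 follows Table 1, while the number of heads and the tensor shapes are assumptions, and torch.nn.MultiheadAttention is used as a stand-in for the per-head projections and the W^O output projection.
```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V):
    """Formula (1): Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

# Formulas (2)-(3): h parallel heads, concatenated and projected by W^O;
# nn.MultiheadAttention packages the same computation.
d, h = 300, 4                      # hidden size per Table 1 / assumed head count
mha = nn.MultiheadAttention(embed_dim=d, num_heads=h, batch_first=True)

X = torch.randn(8, 20, d)          # (batch, sequence length, d) -- assumed shapes
E_M, _ = mha(X, X, X)              # self-attention: Q = K = V = X
```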
Existing research has shown that, in multi-modal emotion analysis, the emotion analysis accuracy based on the text modality is about 70%-80%, while the accuracy based on video or audio is about 60%-70%. Based on these results, in order to improve the effect of multi-modal emotion analysis, this embodiment uses a Transformer-based modality translation module to translate the visual and audio modalities into the text modality, and improves the quality of the multi-modal features by making the visual and audio modalities approach the text modality.
Before modality translation, the three kinds of modal data X_m ∈ R^(l(m)×d(m)), m ∈ {v, a, t}, are first converted in dimension by a fully connected layer; the converted data dimension is X_m ∈ R^(l(m)×d), where l(·) and d(·) denote the sequence length and the data dimension, respectively.
The modality translation module is shown in fig. 4, and a specific calculation process of the modality translation module is described by taking the visual modality translation as an example.
(1) Extracting the single-modality features of the three kinds of modal data based on the multi-head self-attention mechanism.
First, for the three kinds of single-modal data, a Transformer encoder is used to extract the context features of each modality; the specific calculation process is shown in formulas (4) to (6):
E_vm = MultiHead(X̃_v, X̃_v, X̃_v)    (4)
E_am = MultiHead(X_a, X_a, X_a)    (5)
E_tm = MultiHead(X_t, X_t, X_t)    (6)
where X̃_v, X_a and X_t represent the (possibly missing) visual modality, the audio modality and the text modality, respectively; since the multi-head self-attention mechanism is used in the Transformer encoder, Q, K and V in the attention formula are identical here.
Next, residual connections are applied to the extracted context features of each modality, and the results are fed into a Layernorm layer for normalization; this process is shown in formulas (7) to (9):
E'_v = Layernorm(X̃_v + E_vm)    (7)
E'_a = Layernorm(X_a + E_am)    (8)
E'_t = Layernorm(X_t + E_tm)    (9)
where Layernorm denotes layer normalization, X̃_v, X_a and X_t represent the (possibly missing) visual, audio and text modalities, and E_vm, E_am and E_tm represent the context features of the visual, audio and text modalities, respectively.
Then, the normalized context features are fed into a feed-forward fully connected layer for linear transformation, thereby completing the encoding of the three kinds of single-modal data and obtaining the single-modality features of the three modalities; this process is shown in formulas (10) to (12):
E_v = ReLU(W_vl E'_v + b_vl)    (10)
E_a = ReLU(W_al E'_a + b_al)    (11)
E_t = ReLU(W_tl E'_t + b_tl)    (12)
where W_vl, W_al and W_tl are weight matrices, b_vl, b_al and b_tl denote biases, and ReLU is the activation function.
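A sketch of one modality branch of formulas (4)-(12) — multi-head self-attention, a residual connection with Layernorm, and a feed-forward layer with ReLU — is shown below, assuming the inputs have already been projected to a common dimension d by the fully connected layer described above; layer sizes and head counts are placeholders, not the embodiment's exact configuration.
```python
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """One modality branch of the translation module: formulas (4)-(12)."""
    def __init__(self, d=300, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.ffn = nn.Linear(d, d)

    def forward(self, x):                      # x: (batch, seq, d), already projected to d
        e_m, _ = self.attn(x, x, x)            # formulas (4)-(6): E_m = MultiHead(X, X, X)
        e_prime = self.norm(x + e_m)           # formulas (7)-(9): residual + Layernorm
        return torch.relu(self.ffn(e_prime))   # formulas (10)-(12): ReLU(W E' + b)

enc_v, enc_a, enc_t = UnimodalEncoder(), UnimodalEncoder(), UnimodalEncoder()
E_v = enc_v(torch.randn(8, 20, 300))           # visual single-modality features
```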
(2) During the extraction of the single-modality features of the three kinds of modal data, a Transformer decoder is used to supervise the Transformer encoder so that the visual and audio features approach the text features, thereby translating the visual and audio features into text features.
After the single-modality features of the visual (or audio) modality and of the text modality are obtained, a Transformer decoder is used to supervise the Transformer encoder, so that the visual feature E_v or the audio feature E_a generated by the encoder approaches the text feature E_t, i.e. the encoder is guided to translate the visual or audio features into text features.
The specific operation is as follows:
First, the single-modality features of the visual modality (or of the audio modality) and the single-modality features of the text modality are used as the input of the Transformer decoder.
Then, the single-modality feature of the text modality is used as the Query of the multi-head attention mechanism, and the single-modality feature of the visual modality (or of the audio modality) is used as the Key and Value of the multi-head attention mechanism; the specific calculation process is shown in formulas (13) and (14):
D_vtm = MultiHead(E_t, E_v, E_v)    (13)
D_atm = MultiHead(E_t, E_a, E_a)    (14)
Then, D_vtm and D_atm are respectively passed through residual connections and fed into a Layernorm layer for normalization; the normalized results are then fed into a feed-forward fully connected layer for linear transformation, completing the Transformer decoder module. The specific operation process is shown in formulas (15)-(18):
D'_vt = Layernorm(E_t + D_vtm)    (15)
D'_at = Layernorm(E_t + D_atm)    (16)
D_vt = ReLU(W_vtl D'_vt + b_vtl)    (17)
D_at = ReLU(W_atl D'_at + b_atl)    (18)
where W_vtl and W_atl are weight matrices, b_vtl and b_atl are learnable biases, and ReLU is the activation function.
Finally, the modality translation loss (Λ_VtoT) is computed between the output E_v of the visual encoder and the output D_vt of the decoder; by minimizing this loss, the encoder is supervised to translate the visual modal features into text modal features.
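The decoder-side supervision of formulas (13)-(18) can be sketched as follows: the text feature serves as the Query and the modality being translated supplies the Key and Value, after which the translation loss of formulas (33)-(34) below pulls the encoder output toward the text space; shapes and hyper-parameters here are assumptions.
```python
import torch
import torch.nn as nn

class TranslationDecoder(nn.Module):
    """Formulas (13)-(18): text features are the Query, the modality being
    translated (visual or audio) supplies the Key and Value."""
    def __init__(self, d=300, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.ffn = nn.Linear(d, d)

    def forward(self, E_t, E_x):
        D_xtm, _ = self.attn(E_t, E_x, E_x)    # formula (13)/(14): Q=E_t, K=V=E_x
        D_prime = self.norm(E_t + D_xtm)       # formula (15)/(16)
        return torch.relu(self.ffn(D_prime))   # formula (17)/(18)

dec_vt = TranslationDecoder()
E_t, E_v = torch.randn(8, 20, 300), torch.randn(8, 20, 300)
D_vt = dec_vt(E_t, E_v)
# Lambda_VtoT is then the JS-style divergence between D_vt and E_t
# (formula (34)); minimising it pulls the visual encoder toward the text space.
```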
Step S202: and inputting the single-mode features of the three-mode data into a common space projection module, and fusing the single-mode features into a joint feature (MJF) of the missing mode.
After the single-modality features of the three kinds of modal data are obtained, the common space projection module performs a linear transformation on the three modality features to obtain the autocorrelation common space of each modality, and then splices them into the joint feature of the missing modalities; the common space projection is shown in formulas (19) to (21):
C_v = [W_va E_v || W_vt E_v]    (19)
C_a = [W_va E_a || W_ta E_a]    (20)
C_t = [W_vt E_t || W_ta E_t]    (21)
where W_va, W_vt and W_ta are all weight matrices, || denotes the splicing (concatenation) operation, and E_v, E_a and E_t represent the single-modality features of the visual, audio and text modalities, respectively.
Since the multi-modal features are randomly missing at this point, the joint feature of the missing modalities, obtained by splicing all the common-space vectors, is denoted C_all, as shown in formula (22):
C_all = [C_v || C_a || C_t]    (22)
the benefits of such a treatment are the following: firstly, a weight matrix is trained by two modalities together, and interaction information between the two modalities is reserved in the weight matrix; secondly, when the missing modal characteristics approach the complete modal characteristics, only the integral joint characteristics can be concerned; thus, no matter which modality is missing, the approach is only to the complete combined modality feature.
Step S203: and (3) coding the missing joint features by using a Transformer coder-decoder module to obtain joint features approaching to the complete mode, and supervising the coding of the MJF by using a pre-training model in the Transformer coder-decoder module so as to enable the MJF to approach to the joint features of the complete mode.
Specifically, the joint feature C_all of the missing modalities is used as the input of the Transformer encoder, and the encoded output E_out is obtained as shown in formulas (23) to (25):
E_allm = MultiHead(C_all, C_all, C_all)    (23)
E'_out = Layernorm(C_all + E_allm)    (24)
E_out = W_e2 ReLU(W_e1 E'_out + b_e1) + b_e2    (25)
where the input Query, Key and Value are identical and are all C_all; W_e1 and W_e2 are weight matrices, b_e1 and b_e2 are two learnable biases, and E_out is the joint feature of the missing modalities after linear transformation, i.e. the joint feature approaching the complete modalities.
(1) While the Transformer encoder encodes the MJF, a pre-trained model is used to guide the encoding so that the MJF approaches the joint feature of the complete modalities.
The structure of the pre-trained model is the multi-modal emotion analysis network with the pre-training module removed, and it is trained with complete modal data; the output E_pre of the pre-training module is computed in the same way as the Transformer encoder output E_out, i.e. obtained by modality translation and common space projection followed by splicing.
(2) In order to efficiently model the long-term dependency of information between modalities, a Transformer encoder-decoder is used to capture the dependency information between the joint features; the output E_out of the encoder is used as the input of the decoder, and the output D_out of the decoder is expressed as shown in formulas (26) to (28):
D_outm = MultiHead(E_out, E_out, E_out)    (26)
D'_out = Layernorm(E_out + D_outm)    (27)
D_out = W_d2 ReLU(W_d1 D'_out + b_d1) + b_d2    (28)
where W_d1 and W_d2 are parameter matrices, b_d1 and b_d2 are two learnable biases, and ReLU is the activation function.
Finally, the Transformer encoder-decoder loss is computed from the output E_out of the Transformer encoder and the output D_out of the Transformer decoder.
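A sketch of the joint-feature Transformer encoder-decoder of formulas (23)-(28) follows; the two-layer feed-forward form of formulas (25) and (28) is inferred from the two weight matrices and two biases named in the text, and the joint dimension of 6×300 (three modalities, each projected twice) is an assumption.
```python
import torch
import torch.nn as nn

class JointBlock(nn.Module):
    """Self-attention + Layernorm + two-layer feed-forward, used both as the
    encoder of formulas (23)-(25) and as the decoder of formulas (26)-(28)."""
    def __init__(self, d=1800, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x):
        m, _ = self.attn(x, x, x)          # Q = K = V = x
        return self.ffn(self.norm(x + m))  # residual + Layernorm + feed-forward

d_joint = 6 * 300                          # C_all = [C_v || C_a || C_t], each 2*d (assumed)
encoder, decoder = JointBlock(d_joint), JointBlock(d_joint)

C_all = torch.randn(8, 20, d_joint)        # joint feature of the (possibly missing) modalities
E_out = encoder(C_all)                     # formulas (23)-(25)
D_out = decoder(E_out)                     # formulas (26)-(28)
# E_out is pulled toward the pre-trained complete-modality feature E_pre and
# D_out toward C_all via the JS-style losses of formulas (31)-(32).
```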
Step S204: joint feature E to approximate a complete modality out And inputting the emotion data into a classification module to generate and output a final emotion classification.
In the classification module, the output E of the Transformer encoder is processed out Inputting the data into a fully-connected network with a softmax activation function to obtain a prediction score
Figure BDA0004067511500000126
Based on the prediction score, an emotion classification is determined.
The overall training objective of the multi-modal emotion analysis network is obtained by a weighted sum of the classification loss (Λ_cls), the pre-training loss (Λ_pretrain), the Transformer encoder-decoder loss (Λ_de) and the modality translation losses (Λ_AtoT, Λ_VtoT), as shown in formula (29):
Λ = Λ_cls + λ_1 Λ_pretrain + λ_2 Λ_de + λ_3 Λ_VtoT + λ_4 Λ_AtoT    (29)
where λ_1, λ_2, λ_3 and λ_4 are the corresponding weights.
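Formula (29) is a plain weighted sum of the five loss terms; a short sketch with placeholder tensors standing in for the losses computed in the preceding steps, and λ_i = 0.1 as in Table 1:
```python
import torch

# Placeholder scalars standing in for the loss terms computed earlier in S201-S204.
loss_cls, loss_pretrain, loss_de, loss_VtoT, loss_AtoT = [torch.tensor(0.5) for _ in range(5)]

# Formula (29): weighted sum of the losses; Table 1 sets every loss weight to 0.1.
lam1 = lam2 = lam3 = lam4 = 0.1
total_loss = (loss_cls + lam1 * loss_pretrain + lam2 * loss_de
              + lam3 * loss_VtoT + lam4 * loss_AtoT)
```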
(1) Pre-training loss (Λ_pretrain)
The pre-training loss is calculated from the difference between the pre-training output (E_pre) and the Transformer encoder output (E_out). This embodiment adopts the Kullback-Leibler (KL) divergence loss to guide the reconstruction of the missing modalities; the KL divergence is given in formula (30):
D_KL(p || q) = Σ_x p(x) log(p(x) / q(x))    (30)
where p and q are two probability distributions. Since the KL divergence is asymmetric, this embodiment adopts the Jensen-Shannon (JS) divergence loss in place of the KL divergence; the JS formula is shown in (31):
Λ_pretrain = JS(E_out || E_pre) = D_KL(E_out || E_pre) + D_KL(E_pre || E_out)    (31)
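The symmetric KL form of formulas (30)-(34), as written in this embodiment, can be sketched as follows; treating feature tensors as probability distributions via a softmax over the feature dimension is an assumed normalisation not stated in the text.
```python
import torch
import torch.nn.functional as F

def kl(p, q):
    """Formula (30): D_KL(p || q) = sum p * log(p / q), with p and q probabilities."""
    return F.kl_div(q.log(), p, reduction="batchmean")

def js_loss(a, b):
    """Formulas (31)-(34) as written in the embodiment: KL(a||b) + KL(b||a)."""
    p, q = F.softmax(a, dim=-1), F.softmax(b, dim=-1)   # assumed normalisation
    return kl(p, q) + kl(q, p)

E_out, E_pre = torch.randn(8, 20, 1800), torch.randn(8, 20, 1800)
loss_pretrain = js_loss(E_out, E_pre)      # formula (31)
```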
(2) Transformer encoder-decoder loss (Λ_de)
Similar to the pre-training loss, the decoder loss is obtained by computing the JS divergence between the output of the Transformer decoder (D_out) and the joint feature (C_all), as shown in formula (32):
Λ_de = JS(D_out || C_all) = D_KL(D_out || C_all) + D_KL(C_all || D_out)    (32)
(3) Modality translation losses (Λ_AtoT and Λ_VtoT)
For the translation task, only the visual and audio modalities are translated to the text modality, so only the JS divergence losses between the outputs of the audio and visual decoders in the modality translation module (D_at and D_vt) and the text encoder representation (E_t) are computed, as shown in (33) and (34):
Λ_AtoT = JS(D_at || E_t) = D_KL(D_at || E_t) + D_KL(E_t || D_at)    (33)
Λ_VtoT = JS(D_vt || E_t) = D_KL(D_vt || E_t) + D_KL(E_t || D_vt)    (34)
(4) Classification loss (Λ_cls)
For the final classification module, the output E_out of the encoder is fed into a fully connected network with a softmax activation function to obtain the prediction score ŷ, as shown in formula (35):
ŷ = softmax(W_c E_out + b_c)    (35)
where W_c and b_c are learnable weights and biases. A standard cross-entropy loss is applied to the classification task, as shown in formula (36):
Λ_cls = -(1/N) Σ_{n=1}^{N} y_n log(ŷ_n)    (36)
where N is the number of samples, y_n is the true emotion classification of the n-th sample, and ŷ_n is the predicted emotion classification.
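A sketch of the classification head of formulas (35)-(36); pooling the encoder output to one vector per sample and the three-class head are assumptions, and torch.nn.functional.cross_entropy combines the softmax of formula (35) with the cross-entropy of formula (36).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_joint, n_classes = 1800, 3                 # assumed pooled feature size / 3-way sentiment
classifier = nn.Linear(d_joint, n_classes)   # W_c and b_c of formula (35)

E_out = torch.randn(8, d_joint)              # pooled encoder output (pooling is assumed)
logits = classifier(E_out)
y_hat = F.softmax(logits, dim=-1)            # formula (35): softmax(W_c E_out + b_c)

y = torch.randint(0, n_classes, (8,))        # ground-truth emotion labels
loss_cls = F.cross_entropy(logits, y)        # formula (36): standard cross-entropy
```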
Experiments were conducted on the public datasets CMU-MOSI and IEMOCAP to verify that the model presented in this embodiment achieves significant improvements over several baseline models.
All experiments were performed on a Windows 10 system with an Intel(R) Core(TM) i9-10900K CPU, an Nvidia 3090 GPU and 96 GB of RAM. For the CMU-MOSI and IEMOCAP datasets, the model parameter size is 90.7M, and the average run time per epoch is 29 seconds and 1 minute, respectively. The datasets and experimental settings are described as follows:
data set: the present embodiment performed experiments on CMU-MOSI and IEMOCAP datasets, which are both multi-modal baseline datasets for emotion analysis, including visual, textual, and audio modalities; for the CMU-MOSI dataset, it contains 2199 fragments from 93 opinion videos on YouTube. The label of each sample is annotated with the sentiment score in [ -3,3 ]. The present embodiment converts the scores into negative, neutral, and positive labels; for an IEMOCAP data set, it contains 5 sessions, each session containing approximately 30 videos, where each video contains at least 24 utterances; the annotation tag is: neutral, depressed, angry, sad, happy, excited, surprised, fear, disappointed, and others; specifically, the present embodiment reports three classifications (negative: [ -3, 0), neutral: [0] and positive: (0, 3 ]) results, two categories (negative: [ depressed, angry, sad, fear, disappointed ], positive: [ happy, excited ]) results were reported on the IEMOCAP dataset.
Parameters: The learning rate is set to 0.001, the batch size to 32 and the hidden layer size to 300; an Adam optimizer is used to minimize the total loss, the number of epochs is set to 20 and the loss weights are set to 0.1. The parameter summary is shown in Table 1.
Table 1: detailed parameter settings in all experiments
Evaluation metrics: The performance of the model is measured using Accuracy and Macro-F1, which are given by formulas (37) and (38):
Accuracy = N_true / N    (37)
Macro-F1 = (1/C) Σ_{c=1}^{C} 2 P_c R_c / (P_c + R_c)    (38)
where N_true is the number of correctly predicted samples, N is the total number of samples, C is the number of classes, P is the precision (positive predictive value), and R is the recall.
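A sketch of the two metrics using scikit-learn (an assumed tooling choice); accuracy_score implements formula (37) and f1_score with average="macro" implements the per-class averaging of formula (38).
```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels and predictions for illustration only.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)                    # formula (37): N_true / N
macro_f1 = f1_score(y_true, y_pred, average="macro")    # formula (38): mean per-class 2PR/(P+R)
print(f"Accuracy={acc:.3f}  Macro-F1={macro_f1:.3f}")
```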
Multi-modal data pre-processing
Visual representation: The CMU-MOSI and IEMOCAP datasets mainly consist of human dialogues, so the visual features are dominated by human faces; facial features are extracted with the OpenFace 2.0 toolkit, finally yielding a 709-dimensional visual feature representation covering facial, head and eye movements.
Text representation: for each text utterance, extracting text features using a pre-trained Bert model; finally, a pre-trained BERT model (12-layer, 768-dimensional hidden layers, 12 heads) was used to obtain 768-dimensional word vectors.
Audio representation: acoustic features are extracted by Librosa; for the CMU-MOSI and IEMOCAP datasets, each audio was mixed to mono and resampled to 16000Hz; furthermore, each frame is separated by 512 samples, and zero-crossing rate, mel Frequency Cepstral Coefficient (MFCC) and Constant Q Transform (CQT) features are selected to represent the audio segment; finally, the three features are stitched together to produce a 33-dimensional acoustic feature.
Baseline model
The baseline results selected for this embodiment are based on the work of Zeng et al.; the selected baseline models are as follows:
AE: an efficient data encoding network is trained to replicate its input to its output.
CRA: a missing modality reconstruction framework employs a residual join mechanism to approximate differences between input data.
And MCTN: a method of learning a robust joint representation by transitioning between modalities, the transition from a source modality to a target modality may capture joint information between modalities.
TransM: a multimodal fusion method based on end-to-end translation utilizes a Transformer to convert between a modality and coded multimodal characteristics.
MMIN: a unified multi-modal recognition model employs cascaded residual autoencoders and cyclic consistency learning to recover missing modalities.
TATE: the tag assisted transformer encoder network employing tag encoding techniques covers all uncertain cases and supervises joint representation learning.
Results of the experiment
This embodiment reports the three-class results on CMU-MOSI and the two-class results on IEMOCAP; the experimental results are shown in Tables 2 and 3. Overall, the results show a downward trend as the missing rate increases.
Table 2: Experimental results in the case of single-modality missing
Table 3: Experimental results in the case of multi-modality missing
For the single-modality missing case, the experimental results are shown in Table 2, with the missing rate set from 0 to 0.5. Under the complete-modality condition, the M-F1 value on the CMU-MOSI dataset is 2.29% lower than that of the MMIN model, and the ACC value on CMU-MOSI is 0.01% lower than that of the TATE model; under single-modality missing with a missing rate of 0.1, the M-F1 value on CMU-MOSI is 0.78% lower than that of the TATE model. In all other settings, the method proposed in this embodiment achieves the best results, verifying the validity of the model.
For the multi-modality missing case, the relevant results are shown in Table 3. The experimental results show that, with multiple modalities randomly missing at a missing rate of 0.4, the ACC value of the proposed model is 0.52% lower than that of the TATE model on the CMU-MOSI dataset; on the IEMOCAP dataset, the proposed model improves on the other baselines by about 0.21%-5.21% on M-F1 and about 0.75%-4.05% on ACC, demonstrating its robustness.
Ablation study
To explore the effect of the different modules in TMTN, the model proposed in this embodiment was evaluated under single-modality missing with the following settings: 1) using only one modality; 2) using two modalities; 3) removing modality translation and then using a single modality; 4) removing modality translation and then using two modalities; 5) removing the modality translation module; 6) removing the common space projection module; 7) removing the pre-training module.
Table 4: comparison of all modules in TMTN
According to Table 4, the performance drops rapidly when the text modality is absent, confirming that text information is dominant in multi-modal emotion analysis; however, no similar reduction is observed when the visual modality is removed, possibly because visual information is not well extracted due to small changes in the face. The upper part of the table shows the effect of modal features after modality translation: combining multiple modalities provides better performance than a single modality, indicating that complementary features can be learned across modalities. To verify the validity of the modality translation method, the experiments shown in the middle part of Table 4 were performed; the untranslated visual and audio modalities do not perform as well as the translated ones. In addition, the combination of untranslated visual and audio modalities with the text modality was also investigated, again confirming that a gap remains between the untranslated and translated combinations.
Regarding the influence of the different modules, after removing the modality translation module, the performance of the proposed model drops by about 1.28%-5.57% on M-F1 and about 1.04%-8.33% on ACC relative to the whole model; after removing the pre-training module, the performance drops by about 2.41%-8.31% on M-F1 and about 2.6%-11.98% on ACC. When the common space projection module is removed, its operation is replaced by a direct splicing operation to keep the model runnable; in this case, the performance drops by about 1.43%-6.85% on M-F1 and about 2.08%-9.37% on ACC compared with the TMTN model under complete modalities.
The ablation experiments demonstrate the effectiveness of the modality translation module; because the pre-training network is trained with complete modalities, the pre-training module plays a good supervisory role. A notable observation from the table is that the modality translation and pre-training operations allow the model to maintain strong stability and good performance even when modalities are missing.
Multi-classification under IEMOCAP dataset
This embodiment also explores multi-class performance on the IEMOCAP dataset. Besides the two-class results, the happy, angry, sad and neutral emotion labels are selected for four-class experiments, and the happy, angry, sad, neutral, depressed, excited and surprised labels are selected for seven-class experiments; the detailed distribution and results are shown in Tables 5 and 6, respectively. Both the M-F1 value and the ACC decrease as the number of categories increases. Furthermore, a closer look at Table 6 shows that the overall performance drops sharply when the number of classes is 7, possibly because confusion among multiple classes makes the model difficult to converge. Compared with the TATE model, the method of this embodiment obtains better results in the 2-class, 4-class and 7-class settings; notably, the model performs about 10% higher on average than TATE in the 4-class setting, and in the 7-class setting its performance degrades more slowly as the missing rate increases.
Table 5: detailed distribution of IEMOCAP datasets
Table 6: IEMOCAP dataset multi-classification results
This embodiment provides a Transformer-based multi-modal emotion analysis network to address the problem of uncertain missing modalities. The model adopts a modality translation method to translate the visual and audio modalities into the text modality so as to improve their quality; in addition, the learning of the missing-modality features is supervised by a pre-trained module so that they approach the complete joint features, thereby reconstructing the missing-modality features. The experiments on the CMU-MOSI and IEMOCAP datasets demonstrate the validity of the proposed model under missing-modality conditions.
Example two
The embodiment discloses a multi-modal emotion analysis system for uncertain modal loss;
the multimode emotion analysis system for uncertain modal loss comprises:
a data acquisition unit configured to: multi-modal data with uncertain absence is acquired, comprising three modalities: text, visual and audio;
an emotion analysis unit configured to: processing the three kinds of modal data through the trained multi-modal emotion analysis network to generate and output a final emotion classification;
the multi-modal emotion analysis network comprises a modality translation module; the modality translation module extracts single-modality features of the three kinds of modal data based on a multi-head self-attention mechanism, and a Transformer decoder is used to supervise the Transformer encoder so that the visual and audio features approach the text features, thereby translating the visual and audio features into text features.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer readable storage medium, on which a computer program is stored, where the program, when executed by a processor, implements the steps in the method for multimodal emotion analysis towards uncertain modality missing as in the first embodiment of the present disclosure.
Example four
An object of the present embodiment is to provide an electronic device.
The electronic device comprises a memory, a processor and a program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the method for multimodal emotion analysis facing uncertain mode deletion according to the first embodiment of the disclosure.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The multi-modal emotion analysis method for uncertain modal loss is characterized by comprising the following steps:
multi-modal data with uncertain absence is acquired, comprising three modalities: text, visual and audio;
processing the three modal data through the trained multi-modal emotion analysis network to generate and output a final emotion classification;
the multi-modal emotion analysis network comprises a modality translation module; the modality translation module extracts single-modality features of the three kinds of modal data based on a multi-head self-attention mechanism, and a Transformer decoder is used to supervise the Transformer encoder so that the visual and audio features approach the text features, thereby translating the visual and audio features into the text features.
2. The multi-modal sentiment analysis method facing uncertain modal absence according to claim 1, wherein the single-modal features of three modal data extracted based on the multi-head self-attention mechanism are specifically:
extracting context features of each modality by using a Transformer encoder;
residual error connection is carried out on the extracted context characteristics, and normalization is carried out;
and performing linear transformation on the normalized context characteristics to obtain the single-mode characteristics of the three-mode data.
3. The method for multimodal emotion analysis oriented to uncertain modal loss according to claim 1, wherein the Transformer encoder is supervised by using a Transformer decoder, specifically:
constructing a Transformer decoder by taking the text features as the Query of the multi-head attention mechanism and the modal features to be translated as the Key and Value of the multi-head attention mechanism;
and supervising the Transformer encoder to translate the modal characteristics to be translated into the text characteristics according to the translation loss between the modal characteristics to be translated and the text characteristics output by the Transformer decoder.
4. The multimodal emotion analysis method oriented to uncertain modal loss as recited in claim 1, wherein the multimodal emotion analysis network further comprises a common space projection module;
the public space projection module is used for carrying out linear transformation on the three modal characteristics after the modal translation to obtain the autocorrelation public space of each modal and fusing the autocorrelation public space into the joint characteristics of the missing modal.
5. The multimodal emotion analysis method oriented to uncertain modal loss of claim 1, wherein the multimodal emotion analysis network further comprises a pre-training module and a Transformer encoder module;
the pre-training module is used for pre-training the multi-modal emotion analysis network by using all complete modal data;
and the Transformer encoder module guides the approach of the combined features of the missing modes to the combined features of the complete modes under the supervision of a pre-trained multi-mode emotion analysis network, and encodes the combined features of the missing modes so as to generate the combined features of the complete modes.
6. The multimodal emotion analysis method oriented to uncertain modal loss of claim 5, wherein the multimodal emotion analysis network further comprises a transform decoder module;
and taking the output of the Transformer encoder module as the input of the Transformer decoder module, and guiding the Transformer encoder to learn the long-term dependence relationship among different modes through the decoder loss.
7. The method of multimodal emotion analysis oriented to uncertain modal loss as recited in claim 1, wherein the overall training objective of the multimodal emotion analysis network is obtained by a weighted sum of the classification loss, the pre-training loss, the Transformer encoder-decoder loss and the modality translation loss.
8. The multi-modal emotion analysis system for uncertain modal loss is characterized by comprising:
a data acquisition unit configured to: multi-modal data with uncertain absence is acquired, comprising three modalities: text, visual and audio;
an emotion analysis unit configured to: processing the three kinds of modal data through the trained multi-modal emotion analysis network to generate and output a final emotion classification;
the multi-modal emotion analysis network comprises a modality translation module; the modality translation module extracts single-modality features of the three kinds of modal data based on a multi-head self-attention mechanism, and a Transformer decoder is used to supervise the Transformer encoder so that the visual and audio features approach the text features, thereby translating the visual and audio features into text features.
9. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of any of claims 1-7.
10. A storage medium storing non-transitory computer-readable instructions, wherein the non-transitory computer-readable instructions, when executed by a computer, perform the instructions of the method of any one of claims 1-7.
CN202310081044.0A 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion Active CN115983280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310081044.0A CN115983280B (en) 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310081044.0A CN115983280B (en) 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion

Publications (2)

Publication Number Publication Date
CN115983280A 2023-04-18
CN115983280B (en) 2023-08-15

Family

ID=85976060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310081044.0A Active CN115983280B (en) 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion

Country Status (1)

Country Link
CN (1) CN115983280B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973044A (en) * 2021-02-22 2022-08-30 上海大学 Video emotion analysis method for enhancing multi-head attention based on bimodal information
CN113312530A (en) * 2021-06-09 2021-08-27 哈尔滨工业大学 Multi-mode emotion classification method taking text as core
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN113806609A (en) * 2021-09-26 2021-12-17 郑州轻工业大学 Multi-modal emotion analysis method based on MIT and FSM
CN114091466A (en) * 2021-10-13 2022-02-25 山东师范大学 Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN113971837A (en) * 2021-10-27 2022-01-25 厦门大学 Knowledge-based multi-modal feature fusion dynamic graph neural sign language translation method
CN114418069A (en) * 2022-01-19 2022-04-29 腾讯科技(深圳)有限公司 Method and device for training encoder and storage medium
CN115063709A (en) * 2022-04-14 2022-09-16 齐鲁工业大学 Multi-modal emotion analysis method and system based on cross-modal attention and hierarchical fusion
CN114973062A (en) * 2022-04-25 2022-08-30 西安电子科技大学 Multi-modal emotion analysis method based on Transformer
CN115019405A (en) * 2022-05-27 2022-09-06 中国科学院计算技术研究所 Multi-modal fusion-based tumor classification method and system
CN115544279A (en) * 2022-10-11 2022-12-30 合肥工业大学 Multi-modal emotion classification method based on cooperative attention and application thereof
CN115565071A (en) * 2022-10-26 2023-01-03 深圳大学 Hyperspectral image transform network training and classifying method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIANDIAN ZENG et al.: "Robust Multimodal Sentiment Analysis via Tag Encoding of Uncertain Missing Modalities", IEEE Transactions on Multimedia, pp. 1-14 *
WEI LUO et al.: "Multimodal Reconstruct and Align Net for Missing Modality Problem in Sentiment Analysis", Lecture Notes in Computer Science, vol. 13834, pp. 411-422 *
XU Zhijing et al.: "Multimodal Emotion Recognition Based on Transformer-ESIM Attention Mechanism", Computer Engineering and Applications, vol. 58, no. 10, pp. 132-138 *

Also Published As

Publication number Publication date
CN115983280B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
CN111488739B (en) Implicit chapter relation identification method for generating image enhancement representation based on multiple granularities
Yuan et al. Transformer-based feature reconstruction network for robust multimodal sentiment analysis
CN113420807A (en) Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN111523534B (en) Image description method
CN112348111B (en) Multi-modal feature fusion method and device in video, electronic equipment and medium
Sheng et al. Deep learning for visual speech analysis: A survey
CN114417097A (en) Emotion prediction method and system based on time convolution and self-attention
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
Ray et al. A multimodal corpus for emotion recognition in sarcasm
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN113392265A (en) Multimedia processing method, device and equipment
Xu et al. A comprehensive survey of automated audio captioning
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
CN114817564A (en) Attribute extraction method and device and storage medium
CN115374281B (en) Session emotion analysis method based on multi-granularity fusion and graph convolution network
CN116956920A (en) Multi-mode named entity identification method for multi-task collaborative characterization
Wang et al. MT-TCCT: Multi-task learning for multimodal emotion recognition
CN115983280B (en) Multi-mode emotion analysis method and system for uncertain mode deletion
Ai et al. A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning
CN115809438A (en) Multi-modal emotion analysis method, system, device and storage medium
CN115019137A (en) Method and device for predicting multi-scale double-flow attention video language event
Zou et al. Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation
Nguyen et al. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system
Cai et al. Multimodal emotion recognition based on long-distance modeling and multi-source data fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant