CN115983280A - Multi-modal emotion analysis method and system for uncertain modal loss - Google Patents

Multi-modal emotion analysis method and system for uncertain modal loss

Info

Publication number
CN115983280A
Authority
CN
China
Prior art keywords
modal
emotion analysis
loss
uncertain
transformer
Prior art date
Legal status
Granted
Application number
CN202310081044.0A
Other languages
Chinese (zh)
Other versions
CN115983280B (en)
Inventor
刘志中
周斌
初佃辉
孟令强
孙宇航
Current Assignee
Yantai University
Original Assignee
Yantai University
Priority date
Filing date
Publication date
Application filed by Yantai University filed Critical Yantai University
Priority to CN202310081044.0A priority Critical patent/CN115983280B/en
Publication of CN115983280A publication Critical patent/CN115983280A/en
Application granted granted Critical
Publication of CN115983280B publication Critical patent/CN115983280B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a multi-modal emotion analysis method and system for uncertain modal loss, relating to the technical field of data processing. The scheme comprises the following steps: acquiring multi-modal data with uncertain missing modalities, covering three modalities: text, visual and audio; and processing the three kinds of modal data through a trained multi-modal emotion analysis network to generate and output a final emotion classification. Based on a modality translation module, the invention translates the visual and audio modalities into the text modality, thereby improving the quality of the visual and audio modalities and capturing the deep interaction among different modalities. The complete modalities are pre-trained to obtain the joint feature of the complete modalities, which guides the joint feature of the missing modalities to approach the joint feature of the complete modalities; there is no need to consider which modality is missing, only to approach the complete-modality joint feature vector, so the method has stronger universality.

Description

Multi-modal emotion analysis method and system for uncertain modal loss
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a multi-modal emotion analysis method and system for uncertain modal loss.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
At present, in an environment of rapidly developing informatization and intelligence, multi-modal sentiment analysis (MSA) plays an important role in natural human-computer interaction, personalized advertising, opinion mining, decision making and other fields; the technology is dedicated to recognizing human emotions from different sources such as text, speech or facial expressions. In recent years, with the rapid development of the mobile internet and intelligent terminals, more and more users express their viewpoints and feelings through social platforms such as Twitter and microblogs, generating a large amount of social data; data on social platforms has evolved from a single text form to multi-modal data, such as text, audio and images.
Compared with single-modal data, multi-modal data contains richer information and is more beneficial to identifying the real emotion of the user. Currently, multi-modal emotion analysis has attracted wide attention and become a research hotspot in the field of artificial intelligence, and many effective multi-modal emotion analysis models have appeared, for example: multi-modal emotion analysis models based on recurrent neural networks, on Transformers, and on convolutional neural networks. The existing multi-modal emotion analysis models can identify the emotion of a user fairly well and have promoted the development of the field.
However, existing MSA models are proposed under the assumption that all modalities (text, visual and audio) are available; in real-world scenarios, uncertain missing modalities always occur due to uncontrollable factors. For example, as shown in FIG. 1, visual content may be unavailable because the camera is turned off or occluded; voice content may be unavailable because the user is silent; voice and text may be lost due to monitoring-device errors; or the face may not be detected due to illumination or occlusion. Thus, the assumption that all modalities are available at all times does not always hold in real-world scenarios, and most existing multi-modal emotion analysis models may fail when modalities are randomly missing; therefore, how to deal with missing modalities in multi-modal emotion analysis is becoming a new challenge.
Currently, some research work has focused on the problem of missing modalities. Han et al. propose a joint training method that implicitly merges multi-modal information from auxiliary modalities, thereby improving single-modality emotion analysis performance. The method proposed by Srinivas et al. studies the problem of missing audio-visual modalities in automatic audio-visual expression recognition, investigates the performance of a Transformer when one modality is absent, and carries out ablation studies to evaluate the model, with results showing that the work generalizes well when a modality is missing. To solve the problem of missing modalities in object recognition, Tran et al. estimate the missing data by exploiting the correlation between different modalities.
Zhao et al. propose a Missing Modality Imagination Network (MMIN) to handle the problem of uncertain missing modalities; MMIN learns a robust joint multi-modal representation that can predict the representation of any missing modality given the available modalities under different missing conditions. Zeng et al. propose a Tag-Assisted Transformer Encoder (TATE) network to handle the problem of uncertain missing modalities, where TATE includes a tag encoding module that covers both single-modality and multi-modality missing situations.
These studies have achieved good results and promoted research on emotion analysis under the condition that specific modalities are missing. Nevertheless, the following shortcomings remain: first, existing works only perform a splicing operation for feature fusion and cannot capture the interaction among different modal features; second, existing works do not exploit the advantage of the text modality in MSA, which affects the effect of multi-modal emotion analysis; in addition, existing works need to enumerate the missing situations of different modalities, handle each case separately, and then perform emotion analysis, which increases the complexity of the model.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-modal emotion analysis method and system for uncertain modal loss, in which the visual and audio modalities are translated into the text modality by a modality translation module, and the joint feature of the complete modalities, obtained by pre-training on complete modalities, guides the joint feature of the missing modalities to approach the joint feature of the complete modalities; there is no need to consider which modality is missing, only to approach the complete-modality joint feature, so the method and system have stronger universality.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the invention provides a multi-modal emotion analysis method for uncertain modal loss in a first aspect;
the multi-modal emotion analysis method for uncertain modal loss comprises the following steps:
multi-modal data with uncertain missing is acquired, including three modalities: text, visual and audio;
processing the three modal data through the trained multi-modal emotion analysis network to generate and output a final emotion classification;
the multi-modal emotion analysis network comprises a modal translation module, wherein the modal translation module extracts single-modal characteristics of three modal data based on a multi-head self-attention mechanism, and a Transformer decoder is used for monitoring a Transformer encoder to enable the visual characteristics and the audio characteristics to approach text characteristics, so that the visual characteristics and the audio characteristics are translated into text characteristics.
Further, the extracting of the single-mode features of the three-mode data based on the multi-head self-attention mechanism specifically includes:
extracting context features of each modality by using a Transformer encoder;
residual error connection is carried out on the extracted context characteristics, and normalization is carried out;
and performing linear transformation on the normalized context characteristics to obtain the single-mode characteristics of the three-mode data.
Further, supervising the Transformer encoder by using the Transformer decoder specifically comprises:
constructing a Transformer decoder by taking the text features as the Query of the multi-head attention mechanism and the modal features to be translated as the Key and Value of the multi-head attention mechanism;
and supervising the Transformer encoder to translate the modal characteristics to be translated to the text characteristics according to the translation loss between the modal characteristics to be translated and the text characteristics output by the Transformer decoder.
Further, the multi-modal emotion analysis network further comprises a common space projection module;
the public space projection module is used for carrying out linear transformation on the three modal characteristics after the modal translation to obtain the autocorrelation public space of each modal and fusing the autocorrelation public space into the joint characteristics of the missing modal.
Further, the multi-modal emotion analysis network further comprises a pre-training module and a Transformer encoder module;
the pre-training module is used for pre-training the multi-modal emotion analysis network by using all complete modal data;
and the Transformer encoder module, under the supervision of the pre-trained multi-modal emotion analysis network, guides the joint feature of the missing modalities to approach the joint feature of the complete modalities, and encodes the joint feature of the missing modalities so as to generate a joint feature of the complete modalities.
Further, the multi-modal emotion analysis network further comprises a Transformer decoder module;
and taking the output of the Transformer encoder module as the input of the Transformer decoder module, and guiding the Transformer encoder to learn the long-term dependence relationship among different modes through the decoder loss.
Further, the overall training objective of the multi-modal emotion analysis network is obtained by a weighted sum of the classification loss, the pre-training loss, the Transformer encoder-decoder loss and the modality translation loss.
The invention provides a multi-modal emotion analysis system facing uncertain modal loss in a second aspect.
A multi-modal emotion analysis system for uncertain modal loss comprises:
a data acquisition unit configured to: multi-modal data with uncertain missing is acquired, including three modalities: text, visual and audio;
an emotion analysis unit configured to: processing the three kinds of modal data through the trained multi-modal emotion analysis network to generate and output a final emotion classification;
the multi-modal emotion analysis network comprises a modality translation module; the modality translation module extracts single-modality features of the three kinds of modal data based on a multi-head self-attention mechanism, and a Transformer decoder is used to supervise the Transformer encoder so that the visual and audio features approach the text features, thereby translating the visual and audio features into text features.
A third aspect of the present invention provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the steps in the method for multimodal emotion analysis oriented to uncertain modal absence according to the first aspect of the present invention.
A fourth aspect of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for multimodal emotion analysis oriented to uncertain mode absence according to the first aspect of the present invention when executing the program.
The above one or more technical solutions have the following beneficial effects:
in order to solve the problems that existing works neither exploit the advantage of the text modality in emotion analysis models nor consider the deep interaction among different modalities during feature fusion, the invention provides a modality translation module that translates the visual and audio modalities into the text modality, improving the quality of the visual and audio modalities and capturing the deep interaction among the different modalities.
In order to solve the problem of uncertain missing modalities, the invention pre-trains on complete modalities to obtain the joint feature of the complete modalities, which guides the joint feature of the missing modalities to approach the joint feature of the complete modalities; there is no need to consider which modality is missing, only to approach the complete-modality joint feature vector, so the method has stronger universality.
Experiments were conducted on two public multi-modal datasets, CMU-MOSI and IEMOCAP; the experimental results show that, compared with several baselines on the two datasets, the proposed model achieves notable improvements, verifying its effectiveness.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.
FIG. 1 is an exemplary diagram of missing modalities in multimodal sentiment analysis.
Fig. 2 is a flow chart of the method of the first embodiment.
FIG. 3 is a diagram of a multi-modal emotion analysis network structure according to the first embodiment.
Fig. 4 is a configuration diagram of a modality translation module according to the first embodiment.
Detailed Description
The invention is further described with reference to the following figures and examples.
Example one
Aiming at the problem of uncertain mode deletion in multi-modal emotion analysis, the embodiment discloses a multi-modal emotion analysis method for uncertain mode deletion;
as shown in fig. 2, the multi-modal emotion analysis method for uncertain modal loss includes:
step S1: multi-modal data with uncertain absence is acquired, comprising three modalities: text, visual, and audio.
Given the multi-modal data to be analyzed P = [X_v, X_a, X_t], where v, a and t represent the visual, audio and text modalities respectively, and X_v, X_a and X_t represent the visual, audio and text modal data respectively. Without loss of generality, this embodiment uses X̃_M to denote a missing modality, where M ∈ {v, a, t}; for example, if the visual modality is missing, the multi-modal data can be represented as P = [X̃_v, X_a, X_t].
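As a purely illustrative sketch of this notation (not part of the embodiment), the following Python/PyTorch snippet assembles P = [X_v, X_a, X_t] and marks a missing modality; zero-filling the absent modality and keeping a presence mask is an assumed convention, the sequence length of 20 is likewise an assumption, while the feature dimensions 709/33/768 are the ones reported later in this embodiment.
```python
import torch

# Feature dimensions reported later in this embodiment (visual 709, audio 33, text 768).
DIMS = {"v": 709, "a": 33, "t": 768}

def make_multimodal_sample(x_v=None, x_a=None, x_t=None, seq_len=20):
    """Assemble P = [X_v, X_a, X_t]; a modality passed as None is treated as
    missing, zero-filled and flagged in a presence mask (an illustrative
    convention only -- the embodiment does not prescribe how X~_M is stored)."""
    sample, mask = {}, {}
    for name, x in zip("vat", (x_v, x_a, x_t)):
        sample[name] = torch.zeros(seq_len, DIMS[name]) if x is None else x
        mask[name] = x is not None
    return sample, mask

# Example: the visual modality is absent, i.e. P = [X~_v, X_a, X_t].
sample, mask = make_multimodal_sample(x_a=torch.randn(20, 33),
                                      x_t=torch.randn(20, 768))
```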
Step S2: and processing the three modal data through the trained multi-modal emotion analysis network to generate and output final emotion classification.
This embodiment discloses a Transformer-based multi-modal emotion analysis network, whose structure is shown in fig. 3; it comprises a modality translation module, a common space projection module, a pre-training module, a Transformer encoder-decoder module and a classification module. Based on these modules, the specific process of processing multi-modal data with the trained multi-modal emotion analysis network is as follows:
step S201: and inputting the visual mode and the audio mode into a mode translation module, translating into a text mode, and coding the text mode by using a Transformer coder to obtain the single-mode characteristics of the data of the three modes.
First, some key concepts in the Transformer are briefly introduced. Given an input X, the Query is defined as Q = XW^Q, the Key as K = XW^K and the Value as V = XW^V, where W^Q ∈ R^(d×d), W^K ∈ R^(d×d) and W^V ∈ R^(d×d) are weight matrices. The attention mechanism in the Transformer is calculated as shown in formula (1):
Attention(Q, K, V) = Softmax(QK^T / √d_k)V    (1)
where Softmax is the normalized exponential function, T denotes the matrix transpose, and d_k is the dimension of the matrix K.
Since the multi-head attention mechanism has multiple attention heads and can capture information from different subspaces, in order to learn the expression of multiple semantics in multiple modalities, this embodiment uses the multi-head attention mechanism to extract features in the different semantic spaces of each modality; the multi-head attention mechanism is formalized as shown in formula (2):
E_M = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h)W^O    (2)
where W^O ∈ R^(d×d) is a weight matrix and h is the number of attention heads; the i-th head is expressed as shown in formula (3):
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (3)
where W_i^Q, W_i^K and W_i^V are the weight matrices of the i-th Query, Key and Value.
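For reference, a minimal PyTorch sketch of the scaled dot-product attention of formula (1) and the multi-head combination of formulas (2)-(3) is given below; the hidden size of 300 follows Table 1, while the number of heads and the tensor shapes are assumptions, and torch.nn.MultiheadAttention is used as a stand-in for the per-head projections and the W^O output projection.
```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V):
    """Formula (1): Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

# Formulas (2)-(3): h parallel heads, concatenated and projected by W^O;
# nn.MultiheadAttention packages the same computation.
d, h = 300, 4                      # hidden size per Table 1 / assumed head count
mha = nn.MultiheadAttention(embed_dim=d, num_heads=h, batch_first=True)

X = torch.randn(8, 20, d)          # (batch, sequence length, d) -- assumed shapes
E_M, _ = mha(X, X, X)              # self-attention: Q = K = V = X
```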
Existing research has shown that, in multi-modal emotion analysis, the emotion analysis accuracy based on the text modality is about 70%-80%, while the accuracy based on video or audio is about 60%-70%. Based on these results, in order to improve the effect of multi-modal emotion analysis, this embodiment uses a Transformer-based modality translation module to translate the visual and audio modalities into the text modality, and improves the quality of the multi-modal features by making the visual and audio modalities approach the text modality.
Before modality translation, the three kinds of modal data X_m ∈ R^(l(m)×d(m)), m ∈ {v, a, t}, are first converted in dimension by a fully connected layer; the converted data dimension is X_m ∈ R^(l(m)×d), where l(·) and d(·) denote the sequence length and the data dimension, respectively.
The modality translation module is shown in fig. 4, and a specific calculation process of the modality translation module is described by taking the visual modality translation as an example.
(1) Extracting the single-modality features of the three kinds of modal data based on the multi-head self-attention mechanism.
First, for the three kinds of single-modal data, a Transformer encoder is used to extract the context features of each modality; the specific calculation process is shown in formulas (4) to (6):
E_vm = MultiHead(X̃_v, X̃_v, X̃_v)    (4)
E_am = MultiHead(X_a, X_a, X_a)    (5)
E_tm = MultiHead(X_t, X_t, X_t)    (6)
where X̃_v, X_a and X_t represent the (possibly missing) visual modality, the audio modality and the text modality, respectively; since the multi-head self-attention mechanism is used in the Transformer encoder, Q, K and V in the attention formula are identical here.
Next, residual connections are applied to the extracted context features of each modality, and the results are fed into a Layernorm layer for normalization; this process is shown in formulas (7) to (9):
E'_v = Layernorm(X̃_v + E_vm)    (7)
E'_a = Layernorm(X_a + E_am)    (8)
E'_t = Layernorm(X_t + E_tm)    (9)
where Layernorm denotes layer normalization, X̃_v, X_a and X_t represent the (possibly missing) visual, audio and text modalities, and E_vm, E_am and E_tm represent the context features of the visual, audio and text modalities, respectively.
Then, the normalized context features are fed into a feed-forward fully connected layer for linear transformation, thereby completing the encoding of the three kinds of single-modal data and obtaining the single-modality features of the three modalities; this process is shown in formulas (10) to (12):
E_v = ReLU(W_vl E'_v + b_vl)    (10)
E_a = ReLU(W_al E'_a + b_al)    (11)
E_t = ReLU(W_tl E'_t + b_tl)    (12)
where W_vl, W_al and W_tl are weight matrices, b_vl, b_al and b_tl denote biases, and ReLU is the activation function.
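A sketch of one modality branch of formulas (4)-(12) — multi-head self-attention, a residual connection with Layernorm, and a feed-forward layer with ReLU — is shown below, assuming the inputs have already been projected to a common dimension d by the fully connected layer described above; layer sizes and head counts are placeholders, not the embodiment's exact configuration.
```python
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """One modality branch of the translation module: formulas (4)-(12)."""
    def __init__(self, d=300, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.ffn = nn.Linear(d, d)

    def forward(self, x):                      # x: (batch, seq, d), already projected to d
        e_m, _ = self.attn(x, x, x)            # formulas (4)-(6): E_m = MultiHead(X, X, X)
        e_prime = self.norm(x + e_m)           # formulas (7)-(9): residual + Layernorm
        return torch.relu(self.ffn(e_prime))   # formulas (10)-(12): ReLU(W E' + b)

enc_v, enc_a, enc_t = UnimodalEncoder(), UnimodalEncoder(), UnimodalEncoder()
E_v = enc_v(torch.randn(8, 20, 300))           # visual single-modality features
```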
(2) During the extraction of the single-modality features of the three kinds of modal data, a Transformer decoder is used to supervise the Transformer encoder so that the visual and audio features approach the text features, thereby translating the visual and audio features into text features.
After the single-modality features of the visual (or audio) modality and of the text modality are obtained, a Transformer decoder is used to supervise the Transformer encoder, so that the visual feature E_v or the audio feature E_a generated by the encoder approaches the text feature E_t, i.e. the encoder is guided to translate the visual or audio features into text features.
The specific operation is as follows:
First, the single-modality features of the visual modality (or of the audio modality) and the single-modality features of the text modality are used as the input of the Transformer decoder.
Then, the single-modality feature of the text modality is used as the Query of the multi-head attention mechanism, and the single-modality feature of the visual modality (or of the audio modality) is used as the Key and Value of the multi-head attention mechanism; the specific calculation process is shown in formulas (13) and (14):
D_vtm = MultiHead(E_t, E_v, E_v)    (13)
D_atm = MultiHead(E_t, E_a, E_a)    (14)
Then, D_vtm and D_atm are respectively passed through residual connections and fed into a Layernorm layer for normalization; the normalized results are then fed into a feed-forward fully connected layer for linear transformation, completing the Transformer decoder module. The specific operation process is shown in formulas (15)-(18):
D'_vt = Layernorm(E_t + D_vtm)    (15)
D'_at = Layernorm(E_t + D_atm)    (16)
D_vt = ReLU(W_vtl D'_vt + b_vtl)    (17)
D_at = ReLU(W_atl D'_at + b_atl)    (18)
where W_vtl and W_atl are weight matrices, b_vtl and b_atl are learnable biases, and ReLU is the activation function.
Finally, the modality translation loss (Λ_VtoT) is computed between the output E_v of the visual encoder and the output D_vt of the decoder; by minimizing this loss, the encoder is supervised to translate the visual modal features into text modal features.
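The decoder-side supervision of formulas (13)-(18) can be sketched as follows: the text feature serves as the Query and the modality being translated supplies the Key and Value, after which the translation loss of formulas (33)-(34) below pulls the encoder output toward the text space; shapes and hyper-parameters here are assumptions.
```python
import torch
import torch.nn as nn

class TranslationDecoder(nn.Module):
    """Formulas (13)-(18): text features are the Query, the modality being
    translated (visual or audio) supplies the Key and Value."""
    def __init__(self, d=300, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.ffn = nn.Linear(d, d)

    def forward(self, E_t, E_x):
        D_xtm, _ = self.attn(E_t, E_x, E_x)    # formula (13)/(14): Q=E_t, K=V=E_x
        D_prime = self.norm(E_t + D_xtm)       # formula (15)/(16)
        return torch.relu(self.ffn(D_prime))   # formula (17)/(18)

dec_vt = TranslationDecoder()
E_t, E_v = torch.randn(8, 20, 300), torch.randn(8, 20, 300)
D_vt = dec_vt(E_t, E_v)
# Lambda_VtoT is then the JS-style divergence between D_vt and E_t
# (formula (34)); minimising it pulls the visual encoder toward the text space.
```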
Step S202: and inputting the single-mode features of the three-mode data into a common space projection module, and fusing the single-mode features into a joint feature (MJF) of the missing mode.
After the single-modality features of the three kinds of modal data are obtained, the common space projection module performs a linear transformation on the three modality features to obtain the autocorrelation common space of each modality, and then splices them into the joint feature of the missing modalities; the common space projection is shown in formulas (19) to (21):
C_v = [W_va E_v || W_vt E_v]    (19)
C_a = [W_va E_a || W_ta E_a]    (20)
C_t = [W_vt E_t || W_ta E_t]    (21)
where W_va, W_vt and W_ta are all weight matrices, || denotes the splicing (concatenation) operation, and E_v, E_a and E_t represent the single-modality features of the visual, audio and text modalities, respectively.
Since the multi-modal features are randomly missing at this point, the joint feature of the missing modalities, obtained by splicing all the common-space vectors, is denoted C_all, as shown in formula (22):
C_all = [C_v || C_a || C_t]    (22)
the benefits of such a treatment are the following: firstly, a weight matrix is trained by two modalities together, and interaction information between the two modalities is reserved in the weight matrix; secondly, when the missing modal characteristics approach the complete modal characteristics, only the integral joint characteristics can be concerned; thus, no matter which modality is missing, the approach is only to the complete combined modality feature.
Step S203: and (3) coding the missing joint features by using a Transformer coder-decoder module to obtain joint features approaching to the complete mode, and supervising the coding of the MJF by using a pre-training model in the Transformer coder-decoder module so as to enable the MJF to approach to the joint features of the complete mode.
Specifically, the joint feature C_all of the missing modalities is used as the input of the Transformer encoder, and the encoded output E_out is obtained as shown in formulas (23) to (25):
E_allm = MultiHead(C_all, C_all, C_all)    (23)
E'_out = Layernorm(C_all + E_allm)    (24)
E_out = W_e2 ReLU(W_e1 E'_out + b_e1) + b_e2    (25)
where the input Query, Key and Value are identical and are all C_all; W_e1 and W_e2 are weight matrices, b_e1 and b_e2 are two learnable biases, and E_out is the joint feature of the missing modalities after linear transformation, i.e. the joint feature approaching the complete modalities.
(1) While the Transformer encoder encodes the MJF, a pre-trained model is used to guide the encoding so that the MJF approaches the joint feature of the complete modalities.
The structure of the pre-trained model is the multi-modal emotion analysis network with the pre-training module removed, and it is trained with complete modal data; the output E_pre of the pre-training module is computed in the same way as the Transformer encoder output E_out, i.e. obtained by modality translation and common space projection followed by splicing.
(2) In order to efficiently model the long-term dependency of information between modalities, a Transformer encoder-decoder is used to capture the dependency information between the joint features; the output E_out of the encoder is used as the input of the decoder, and the output D_out of the decoder is expressed as shown in formulas (26) to (28):
D_outm = MultiHead(E_out, E_out, E_out)    (26)
D'_out = Layernorm(E_out + D_outm)    (27)
D_out = W_d2 ReLU(W_d1 D'_out + b_d1) + b_d2    (28)
where W_d1 and W_d2 are parameter matrices, b_d1 and b_d2 are two learnable biases, and ReLU is the activation function.
Finally, the Transformer encoder-decoder loss is computed from the output E_out of the Transformer encoder and the output D_out of the Transformer decoder.
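A sketch of the joint-feature Transformer encoder-decoder of formulas (23)-(28) follows; the two-layer feed-forward form of formulas (25) and (28) is inferred from the two weight matrices and two biases named in the text, and the joint dimension of 6×300 (three modalities, each projected twice) is an assumption.
```python
import torch
import torch.nn as nn

class JointBlock(nn.Module):
    """Self-attention + Layernorm + two-layer feed-forward, used both as the
    encoder of formulas (23)-(25) and as the decoder of formulas (26)-(28)."""
    def __init__(self, d=1800, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x):
        m, _ = self.attn(x, x, x)          # Q = K = V = x
        return self.ffn(self.norm(x + m))  # residual + Layernorm + feed-forward

d_joint = 6 * 300                          # C_all = [C_v || C_a || C_t], each 2*d (assumed)
encoder, decoder = JointBlock(d_joint), JointBlock(d_joint)

C_all = torch.randn(8, 20, d_joint)        # joint feature of the (possibly missing) modalities
E_out = encoder(C_all)                     # formulas (23)-(25)
D_out = decoder(E_out)                     # formulas (26)-(28)
# E_out is pulled toward the pre-trained complete-modality feature E_pre and
# D_out toward C_all via the JS-style losses of formulas (31)-(32).
```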
Step S204: joint feature E to approximate a complete modality out And inputting the emotion data into a classification module to generate and output a final emotion classification.
In the classification module, the output E of the Transformer encoder is processed out Inputting the data into a fully-connected network with a softmax activation function to obtain a prediction score
Figure BDA0004067511500000126
Based on the prediction score, an emotion classification is determined.
The overall training objective of the multi-modal emotion analysis network is obtained by a weighted sum of the classification loss (Λ_cls), the pre-training loss (Λ_pretrain), the Transformer encoder-decoder loss (Λ_de) and the modality translation losses (Λ_AtoT, Λ_VtoT), as shown in formula (29):
Λ = Λ_cls + λ_1 Λ_pretrain + λ_2 Λ_de + λ_3 Λ_VtoT + λ_4 Λ_AtoT    (29)
where λ_1, λ_2, λ_3 and λ_4 are the corresponding weights.
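Formula (29) is a plain weighted sum of the five loss terms; a short sketch with placeholder tensors standing in for the losses computed in the preceding steps, and λ_i = 0.1 as in Table 1:
```python
import torch

# Placeholder scalars standing in for the loss terms computed earlier in S201-S204.
loss_cls, loss_pretrain, loss_de, loss_VtoT, loss_AtoT = [torch.tensor(0.5) for _ in range(5)]

# Formula (29): weighted sum of the losses; Table 1 sets every loss weight to 0.1.
lam1 = lam2 = lam3 = lam4 = 0.1
total_loss = (loss_cls + lam1 * loss_pretrain + lam2 * loss_de
              + lam3 * loss_VtoT + lam4 * loss_AtoT)
```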
(1) Pre-training loss (Λ_pretrain)
The pre-training loss is calculated from the difference between the pre-training output (E_pre) and the Transformer encoder output (E_out). This embodiment adopts the Kullback-Leibler (KL) divergence loss to guide the reconstruction of the missing modalities; the KL divergence is given in formula (30):
D_KL(p || q) = Σ_x p(x) log(p(x) / q(x))    (30)
where p and q are two probability distributions. Since the KL divergence is asymmetric, this embodiment adopts the Jensen-Shannon (JS) divergence loss in place of the KL divergence; the JS formula is shown in (31):
Λ_pretrain = JS(E_out || E_pre) = D_KL(E_out || E_pre) + D_KL(E_pre || E_out)    (31)
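The symmetric KL form of formulas (30)-(34), as written in this embodiment, can be sketched as follows; treating feature tensors as probability distributions via a softmax over the feature dimension is an assumed normalisation not stated in the text.
```python
import torch
import torch.nn.functional as F

def kl(p, q):
    """Formula (30): D_KL(p || q) = sum p * log(p / q), with p and q probabilities."""
    return F.kl_div(q.log(), p, reduction="batchmean")

def js_loss(a, b):
    """Formulas (31)-(34) as written in the embodiment: KL(a||b) + KL(b||a)."""
    p, q = F.softmax(a, dim=-1), F.softmax(b, dim=-1)   # assumed normalisation
    return kl(p, q) + kl(q, p)

E_out, E_pre = torch.randn(8, 20, 1800), torch.randn(8, 20, 1800)
loss_pretrain = js_loss(E_out, E_pre)      # formula (31)
```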
(2) Transformer encoder-decoder loss (Λ_de)
Similar to the pre-training loss, the decoder loss is obtained by computing the JS divergence between the output of the Transformer decoder (D_out) and the joint feature (C_all), as shown in formula (32):
Λ_de = JS(D_out || C_all) = D_KL(D_out || C_all) + D_KL(C_all || D_out)    (32)
(3) Modality translation losses (Λ_AtoT and Λ_VtoT)
For the translation task, only the visual and audio modalities are translated to the text modality, so only the JS divergence losses between the outputs of the audio and visual decoders in the modality translation module (D_at and D_vt) and the text encoder representation (E_t) are computed, as shown in (33) and (34):
Λ_AtoT = JS(D_at || E_t) = D_KL(D_at || E_t) + D_KL(E_t || D_at)    (33)
Λ_VtoT = JS(D_vt || E_t) = D_KL(D_vt || E_t) + D_KL(E_t || D_vt)    (34)
(4) Classification loss (Λ_cls)
For the final classification module, the output E_out of the encoder is fed into a fully connected network with a softmax activation function to obtain the prediction score ŷ, as shown in formula (35):
ŷ = softmax(W_c E_out + b_c)    (35)
where W_c and b_c are learnable weights and biases. A standard cross-entropy loss is applied to the classification task, as shown in formula (36):
Λ_cls = -(1/N) Σ_{n=1}^{N} y_n log(ŷ_n)    (36)
where N is the number of samples, y_n is the true emotion classification of the n-th sample, and ŷ_n is the predicted emotion classification.
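A sketch of the classification head of formulas (35)-(36); pooling the encoder output to one vector per sample and the three-class head are assumptions, and torch.nn.functional.cross_entropy combines the softmax of formula (35) with the cross-entropy of formula (36).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_joint, n_classes = 1800, 3                 # assumed pooled feature size / 3-way sentiment
classifier = nn.Linear(d_joint, n_classes)   # W_c and b_c of formula (35)

E_out = torch.randn(8, d_joint)              # pooled encoder output (pooling is assumed)
logits = classifier(E_out)
y_hat = F.softmax(logits, dim=-1)            # formula (35): softmax(W_c E_out + b_c)

y = torch.randint(0, n_classes, (8,))        # ground-truth emotion labels
loss_cls = F.cross_entropy(logits, y)        # formula (36): standard cross-entropy
```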
Experiments were conducted on the public datasets CMU-MOSI and IEMOCAP to verify that the model presented in this embodiment achieves significant improvements over several baseline models.
All experiments were performed on a Windows 10 system with an Intel(R) Core(TM) i9-10900K CPU, an Nvidia 3090 GPU and 96 GB of RAM. For the CMU-MOSI and IEMOCAP datasets, the model parameter size is 90.7M, and the average run time per epoch is 29 seconds and 1 minute, respectively. The datasets and experimental settings are described as follows:
data set: the present embodiment performed experiments on CMU-MOSI and IEMOCAP datasets, which are both multi-modal baseline datasets for emotion analysis, including visual, textual, and audio modalities; for the CMU-MOSI dataset, it contains 2199 fragments from 93 opinion videos on YouTube. The label of each sample is annotated with the sentiment score in [ -3,3 ]. The present embodiment converts the scores into negative, neutral, and positive labels; for an IEMOCAP data set, it contains 5 sessions, each session containing approximately 30 videos, where each video contains at least 24 utterances; the annotation tag is: neutral, depressed, angry, sad, happy, excited, surprised, fear, disappointed, and others; specifically, the present embodiment reports three classifications (negative: [ -3, 0), neutral: [0] and positive: (0, 3 ]) results, two categories (negative: [ depressed, angry, sad, fear, disappointed ], positive: [ happy, excited ]) results were reported on the IEMOCAP dataset.
Parameters: The learning rate is set to 0.001, the batch size to 32 and the hidden layer size to 300; an Adam optimizer is used to minimize the total loss, the number of epochs is set to 20 and the loss weights are set to 0.1. The parameter summary is shown in Table 1.
Table 1: detailed parameter settings in all experiments
Evaluation metrics: The performance of the model is measured using Accuracy and Macro-F1, which are given by formulas (37) and (38):
Accuracy = N_true / N    (37)
Macro-F1 = (1/C) Σ_{c=1}^{C} 2 P_c R_c / (P_c + R_c)    (38)
where N_true is the number of correctly predicted samples, N is the total number of samples, C is the number of classes, P is the precision (positive predictive value), and R is the recall.
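A sketch of the two metrics using scikit-learn (an assumed tooling choice); accuracy_score implements formula (37) and f1_score with average="macro" implements the per-class averaging of formula (38).
```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels and predictions for illustration only.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)                    # formula (37): N_true / N
macro_f1 = f1_score(y_true, y_pred, average="macro")    # formula (38): mean per-class 2PR/(P+R)
print(f"Accuracy={acc:.3f}  Macro-F1={macro_f1:.3f}")
```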
Multi-modal data pre-processing
Visual representation: The CMU-MOSI and IEMOCAP datasets mainly consist of human dialogues, so the visual features are dominated by human faces; facial features are extracted with the OpenFace 2.0 toolkit, finally yielding a 709-dimensional visual feature representation covering facial, head and eye movements.
Text representation: for each text utterance, extracting text features using a pre-trained Bert model; finally, a pre-trained BERT model (12-layer, 768-dimensional hidden layers, 12 heads) was used to obtain 768-dimensional word vectors.
Audio representation: acoustic features are extracted by Librosa; for the CMU-MOSI and IEMOCAP datasets, each audio was mixed to mono and resampled to 16000Hz; furthermore, each frame is separated by 512 samples, and zero-crossing rate, mel Frequency Cepstral Coefficient (MFCC) and Constant Q Transform (CQT) features are selected to represent the audio segment; finally, the three features are stitched together to produce a 33-dimensional acoustic feature.
Baseline model
The baseline results selected for this embodiment are based on the work of Zeng et al.; the selected baseline models are as follows:
AE: an efficient data encoding network is trained to replicate its input to its output.
CRA: a missing modality reconstruction framework employs a residual join mechanism to approximate differences between input data.
And MCTN: a method of learning a robust joint representation by transitioning between modalities, the transition from a source modality to a target modality may capture joint information between modalities.
TransM: a multimodal fusion method based on end-to-end translation utilizes a Transformer to convert between a modality and coded multimodal characteristics.
MMIN: a unified multi-modal recognition model employs cascaded residual autoencoders and cyclic consistency learning to recover missing modalities.
TATE: the tag assisted transformer encoder network employing tag encoding techniques covers all uncertain cases and supervises joint representation learning.
Results of the experiment
This embodiment reports the three-class results on CMU-MOSI and the two-class results on IEMOCAP; the experimental results are shown in Tables 2 and 3. Overall, the results show a downward trend as the missing rate increases.
Table 2: Experimental results in the case of single-modality missing
Table 3: Experimental results in the case of multi-modality missing
For the single-modality missing case, the experimental results are shown in Table 2, with the missing rate set from 0 to 0.5. Under the complete-modality condition, the M-F1 value on the CMU-MOSI dataset is 2.29% lower than that of the MMIN model, and the ACC value on CMU-MOSI is 0.01% lower than that of the TATE model; under single-modality missing with a missing rate of 0.1, the M-F1 value on CMU-MOSI is 0.78% lower than that of the TATE model. In all other settings, the method proposed in this embodiment achieves the best results, verifying the validity of the model.
For the multi-modality missing case, the relevant results are shown in Table 3. The experimental results show that, with multiple modalities randomly missing at a missing rate of 0.4, the ACC value of the proposed model is 0.52% lower than that of the TATE model on the CMU-MOSI dataset; on the IEMOCAP dataset, the proposed model improves on the other baselines by about 0.21%-5.21% on M-F1 and about 0.75%-4.05% on ACC, demonstrating its robustness.
Ablation study
To explore the effect of the different modules in TMTN, the model proposed in this embodiment was evaluated under single-modality missing with the following settings: 1) using only one modality; 2) using two modalities; 3) removing modality translation and then using a single modality; 4) removing modality translation and then using two modalities; 5) removing the modality translation module; 6) removing the common space projection module; 7) removing the pre-training module.
Table 4: comparison of all modules in TMTN
According to Table 4, the performance drops rapidly when the text modality is absent, confirming that text information is dominant in multi-modal emotion analysis; however, no similar reduction is observed when the visual modality is removed, possibly because visual information is not well extracted due to small changes in the face. The upper part of the table shows the effect of modal features after modality translation: combining multiple modalities provides better performance than a single modality, indicating that complementary features can be learned across modalities. To verify the validity of the modality translation method, the experiments shown in the middle part of Table 4 were performed; the untranslated visual and audio modalities do not perform as well as the translated ones. In addition, the combination of untranslated visual and audio modalities with the text modality was also investigated, again confirming that a gap remains between the untranslated and translated combinations.
Regarding the influence of the different modules, after removing the modality translation module, the performance of the proposed model drops by about 1.28%-5.57% on M-F1 and about 1.04%-8.33% on ACC relative to the whole model; after removing the pre-training module, the performance drops by about 2.41%-8.31% on M-F1 and about 2.6%-11.98% on ACC. When the common space projection module is removed, its operation is replaced by a direct splicing operation to keep the model runnable; in this case, the performance drops by about 1.43%-6.85% on M-F1 and about 2.08%-9.37% on ACC compared with the TMTN model under complete modalities.
The ablation experiments demonstrate the effectiveness of the modality translation module; because the pre-training network is trained with complete modalities, the pre-training module plays a good supervisory role. A notable observation from the table is that the modality translation and pre-training operations allow the model to maintain strong stability and good performance even when modalities are missing.
Multi-classification under IEMOCAP dataset
This embodiment also explores multi-class performance on the IEMOCAP dataset. Besides the two-class results, the happy, angry, sad and neutral emotion labels are selected for four-class experiments, and the happy, angry, sad, neutral, depressed, excited and surprised labels are selected for seven-class experiments; the detailed distribution and results are shown in Tables 5 and 6, respectively. Both the M-F1 value and the ACC decrease as the number of categories increases. Furthermore, a closer look at Table 6 shows that the overall performance drops sharply when the number of classes is 7, possibly because confusion among multiple classes makes the model difficult to converge. Compared with the TATE model, the method of this embodiment obtains better results in the 2-class, 4-class and 7-class settings; notably, the model performs about 10% higher on average than TATE in the 4-class setting, and in the 7-class setting its performance degrades more slowly as the missing rate increases.
Table 5: detailed distribution of IEMOCAP datasets
Table 6: IEMOCAP dataset multi-classification results
This embodiment provides a Transformer-based multi-modal emotion analysis network to address the problem of uncertain missing modalities. The model adopts a modality translation method to translate the visual and audio modalities into the text modality so as to improve their quality; in addition, the learning of the missing-modality features is supervised by a pre-trained module so that they approach the complete joint features, thereby reconstructing the missing-modality features. The experiments on the CMU-MOSI and IEMOCAP datasets demonstrate the validity of the proposed model under missing-modality conditions.
Example two
The embodiment discloses a multi-modal emotion analysis system for uncertain modal loss;
the multimode emotion analysis system for uncertain modal loss comprises:
a data acquisition unit configured to: multi-modal data with uncertain absence is acquired, comprising three modalities: text, visual and audio;
an emotion analysis unit configured to: processing the three kinds of modal data through the trained multi-modal emotion analysis network to generate and output a final emotion classification;
the multi-modal emotion analysis network comprises a modality translation module; the modality translation module extracts single-modality features of the three kinds of modal data based on a multi-head self-attention mechanism, and a Transformer decoder is used to supervise the Transformer encoder so that the visual and audio features approach the text features, thereby translating the visual and audio features into text features.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer readable storage medium, on which a computer program is stored, where the program, when executed by a processor, implements the steps in the method for multimodal emotion analysis towards uncertain modality missing as in the first embodiment of the present disclosure.
Example four
An object of the present embodiment is to provide an electronic device.
The electronic device comprises a memory, a processor and a program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the method for multimodal emotion analysis facing uncertain mode deletion according to the first embodiment of the disclosure.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The multi-modal emotion analysis method for uncertain modal loss is characterized by comprising the following steps:
multi-modal data with uncertain absence is acquired, comprising three modalities: text, visual and audio;
processing the three modal data through the trained multi-modal emotion analysis network to generate and output a final emotion classification;
the multi-modal emotion analysis network comprises a modality translation module; the modality translation module extracts single-modality features of the three kinds of modal data based on a multi-head self-attention mechanism, and a Transformer decoder is used to supervise the Transformer encoder so that the visual and audio features approach the text features, thereby translating the visual and audio features into the text features.
2. The multi-modal sentiment analysis method facing uncertain modal absence according to claim 1, wherein the single-modal features of three modal data extracted based on the multi-head self-attention mechanism are specifically:
extracting context features of each modality by using a Transformer encoder;
residual error connection is carried out on the extracted context characteristics, and normalization is carried out;
and performing linear transformation on the normalized context characteristics to obtain the single-mode characteristics of the three-mode data.
3. The method for multimodal emotion analysis oriented to uncertain modal loss according to claim 1, wherein the Transformer encoder is supervised by using a Transformer decoder, specifically:
constructing a Transformer decoder by taking the text features as the Query of the multi-head attention mechanism and the modal features to be translated as the Key and Value of the multi-head attention mechanism;
and supervising the Transformer encoder to translate the modal characteristics to be translated into the text characteristics according to the translation loss between the modal characteristics to be translated and the text characteristics output by the Transformer decoder.
4. The multimodal emotion analysis method oriented to uncertain modal loss as recited in claim 1, wherein the multimodal emotion analysis network further comprises a common space projection module;
the public space projection module is used for carrying out linear transformation on the three modal characteristics after the modal translation to obtain the autocorrelation public space of each modal and fusing the autocorrelation public space into the joint characteristics of the missing modal.
5. The multimodal emotion analysis method oriented to uncertain modal loss of claim 1, wherein the multimodal emotion analysis network further comprises a pre-training module and a Transformer encoder module;
the pre-training module is used for pre-training the multi-modal emotion analysis network by using all complete modal data;
and the Transformer encoder module guides the approach of the combined features of the missing modes to the combined features of the complete modes under the supervision of a pre-trained multi-mode emotion analysis network, and encodes the combined features of the missing modes so as to generate the combined features of the complete modes.
6. The multimodal emotion analysis method oriented to uncertain modal loss of claim 5, wherein the multimodal emotion analysis network further comprises a transform decoder module;
and taking the output of the Transformer encoder module as the input of the Transformer decoder module, and guiding the Transformer encoder to learn the long-term dependence relationship among different modes through the decoder loss.
7. The method of multimodal emotion analysis oriented to uncertain modal loss as recited in claim 1, wherein the overall training objective of the multimodal emotion analysis network is obtained by a weighted sum of the classification loss, the pre-training loss, the Transformer encoder-decoder loss and the modality translation loss.
8. The multi-modal emotion analysis system for uncertain modal loss is characterized by comprising:
a data acquisition unit configured to: multi-modal data with uncertain absence is acquired, comprising three modalities: text, visual and audio;
an emotion analysis unit configured to: processing the three kinds of modal data through the trained multi-modal emotion analysis network to generate and output a final emotion classification;
the multi-modal emotion analysis network comprises a modality translation module; the modality translation module extracts single-modality features of the three kinds of modal data based on a multi-head self-attention mechanism, and a Transformer decoder is used to supervise the Transformer encoder so that the visual and audio features approach the text features, thereby translating the visual and audio features into text features.
9. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of any of claims 1-7.
10. A storage medium storing non-transitory computer-readable instructions, wherein the non-transitory computer-readable instructions, when executed by a computer, perform the instructions of the method of any one of claims 1-7.
CN202310081044.0A 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion Active CN115983280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310081044.0A CN115983280B (en) 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310081044.0A CN115983280B (en) 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion

Publications (2)

Publication Number Publication Date
CN115983280A 2023-04-18
CN115983280B (en) 2023-08-15

Family

ID=85976060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310081044.0A Active CN115983280B (en) 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion

Country Status (1)

Country Link
CN (1) CN115983280B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973044A (en) * 2021-02-22 2022-08-30 上海大学 Video emotion analysis method for enhancing multi-head attention based on bimodal information
CN113312530A (en) * 2021-06-09 2021-08-27 哈尔滨工业大学 Multi-mode emotion classification method taking text as core
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN113806609A (en) * 2021-09-26 2021-12-17 郑州轻工业大学 Multi-modal emotion analysis method based on MIT and FSM
CN114091466A (en) * 2021-10-13 2022-02-25 山东师范大学 Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN113971837A (en) * 2021-10-27 2022-01-25 厦门大学 Knowledge-based multi-modal feature fusion dynamic graph neural sign language translation method
CN114418069A (en) * 2022-01-19 2022-04-29 腾讯科技(深圳)有限公司 Method and device for training encoder and storage medium
CN115063709A (en) * 2022-04-14 2022-09-16 齐鲁工业大学 Multi-modal emotion analysis method and system based on cross-modal attention and hierarchical fusion
CN114973062A (en) * 2022-04-25 2022-08-30 西安电子科技大学 Multi-modal emotion analysis method based on Transformer
CN115019405A (en) * 2022-05-27 2022-09-06 中国科学院计算技术研究所 Multi-modal fusion-based tumor classification method and system
CN115544279A (en) * 2022-10-11 2022-12-30 合肥工业大学 Multi-modal emotion classification method based on cooperative attention and application thereof
CN115565071A (en) * 2022-10-26 2023-01-03 深圳大学 Hyperspectral image transform network training and classifying method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIANDIAN ZENG et al.: "Robust Multimodal Sentiment Analysis via Tag Encoding of Uncertain Missing Modalities", IEEE Transactions on Multimedia, pp. 1-14 *
WEI LUO et al.: "Multimodal Reconstruct and Align Net for Missing Modality Problem in Sentiment Analysis", Lecture Notes in Computer Science, vol. 13834, pp. 411-422 *
XU Zhijing et al.: "Multimodal Emotion Recognition Based on Transformer-ESIM Attention Mechanism", Computer Engineering and Applications, vol. 58, no. 10, pp. 132-138 *

Also Published As

Publication number Publication date
CN115983280B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
CN111488739B (en) Implicit chapter relation identification method for generating image enhancement representation based on multiple granularities
Yuan et al. Transformer-based feature reconstruction network for robust multimodal sentiment analysis
CN113420807A (en) Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN111523534B (en) Image description method
CN112348111B (en) Multi-modal feature fusion method and device in video, electronic equipment and medium
Sheng et al. Deep learning for visual speech analysis: A survey
CN114417097A (en) Emotion prediction method and system based on time convolution and self-attention
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
Ray et al. A multimodal corpus for emotion recognition in sarcasm
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN113392265A (en) Multimedia processing method, device and equipment
Xu et al. A comprehensive survey of automated audio captioning
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
CN114817564A (en) Attribute extraction method and device and storage medium
CN115374281B (en) Session emotion analysis method based on multi-granularity fusion and graph convolution network
CN116956920A (en) Multi-mode named entity identification method for multi-task collaborative characterization
Wang et al. MT-TCCT: Multi-task learning for multimodal emotion recognition
CN115983280B (en) Multi-mode emotion analysis method and system for uncertain mode deletion
Ai et al. A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning
CN115809438A (en) Multi-modal emotion analysis method, system, device and storage medium
CN115019137A (en) Method and device for predicting multi-scale double-flow attention video language event
Zou et al. Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation
Nguyen et al. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system
Cai et al. Multimodal emotion recognition based on long-distance modeling and multi-source data fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant