CN115983280B - Multi-mode emotion analysis method and system for uncertain mode deletion - Google Patents

Multi-mode emotion analysis method and system for uncertain mode deletion Download PDF

Info

Publication number
CN115983280B
CN115983280B (application CN202310081044.0A)
Authority
CN
China
Prior art keywords
modal
mode
Transformer
emotion analysis
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310081044.0A
Other languages
Chinese (zh)
Other versions
CN115983280A (en)
Inventor
刘志中
周斌
初佃辉
孟令强
孙宇航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yantai University
Original Assignee
Yantai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yantai University filed Critical Yantai University
Priority to CN202310081044.0A priority Critical patent/CN115983280B/en
Publication of CN115983280A publication Critical patent/CN115983280A/en
Application granted granted Critical
Publication of CN115983280B publication Critical patent/CN115983280B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a multi-modal emotion analysis method and system for uncertain modal deletion, relating to the technical field of data processing. The method specifically comprises the following steps: acquiring multi-modal data with uncertain deletions, including three modalities: text, visual, and audio; and processing the three modal data through a trained multi-modal emotion analysis network to generate and output the final emotion classification. The invention translates the visual and audio modalities into the text modality based on a modal translation module, which improves the quality of the visual and audio modalities and captures the deep interaction between different modalities. By pre-training on the complete modalities, the joint features of the complete modalities are obtained and used to guide the joint features of the missing modalities to approximate them; there is no need to consider which modality is missing, and only the joint feature vector of the complete modalities needs to be approximated, so the method has stronger universality.

Description

Multi-mode emotion analysis method and system for uncertain mode deletion
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a multi-mode emotion analysis method and system for uncertain modal deletion.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
At present, in an environment of rapid informatization and intelligent development, multi-modal emotion analysis (MSA) plays a great role in fields such as natural human-computer interaction, personalized advertising, opinion mining and decision making; this technology aims to identify human emotion through different channels such as text, speech or facial expression; in recent years, with the rapid development of the mobile internet and intelligent terminals, more and more users express their own views and feelings through social platforms such as Twitter and Weibo, generating a large amount of social data; data on social platforms has evolved from a single text form to multi-modal data such as text, audio and images.
Compared with single-modality data, multi-modal data contains richer information and is more beneficial to identifying the true emotion of the user; currently, multi-modal emotion analysis attracts a great deal of attention and has become a research hotspot in the field of artificial intelligence, and a number of effective multi-modal emotion analysis models have been presented, for example: a multi-modal emotion analysis model based on a recurrent neural network, a multi-modal emotion analysis model based on a Transformer, and a multi-modal emotion analysis model based on a graph convolutional neural network; the existing multi-modal emotion analysis models can better identify the emotion of a user and promote the development of the multi-modal emotion analysis field.
However, existing MSA models are proposed under the assumption that all modalities (text, visual and audio) are available when emotion analysis is performed; in real-world scenarios, uncertain modality missing always occurs due to some uncontrollable factors; for example, as shown in FIG. 1, visual content is not available because the camera is turned off or occluded; voice content is not available because the user is silent; voice and text are lost due to errors of the monitoring equipment; or the face cannot be detected due to illumination or occlusion problems; thus, the assumption that all modalities are available at any time does not always hold in real-world scenarios; most existing multi-modal emotion analysis models may fail when modalities are randomly missing; thus, how to deal with missing modalities in multi-modal emotion analysis is becoming a new challenge.
Currently, some research efforts have focused on the problem of missing modalities; Han et al propose a joint training method that implicitly fuses multi-modal information from auxiliary modalities, thereby improving the performance of single-modality emotion analysis; the method proposed by Srinivas et al studies the problem of missing audio-visual modalities in automatic audio-visual expression recognition, investigates the performance of a Transformer when one modality is absent, and carries out an ablation study to evaluate the model, and the results prove that this work has good universality when a modality is absent; to solve the problem of missing modalities in object recognition, Tran et al estimate missing data by exploiting the correlation between different modalities.
Zhao et al propose a Missing Modality Imagination Network (MMIN) to address the problem of uncertain missing modalities; MMIN learns a robust joint multi-modal representation that can predict the representation of any missing modality given the available modalities under different missing-modality conditions; Zeng et al propose a tag-assisted Transformer encoder (TATE) network to address the problem of uncertain missing modalities, where TATE contains a tag encoding module that can cover both single-modality and multi-modality missing cases.
The above research has achieved good results and promoted the study of emotion analysis under the condition of missing specific modalities; however, the following disadvantages still remain: first, the existing works only perform a concatenation operation when realizing feature fusion and cannot capture the interaction among different modal features; second, the existing works do not consider the advantages of the text modality in MSA, which affects the effect of multi-modal emotion analysis; in addition, the existing works need to consider the missing situations of different modalities, then process the different missing situations, and then carry out emotion analysis, which increases the complexity of the model.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-modal emotion analysis method and system for uncertain modal deletion, which translate the visual and audio modalities into the text modality based on a modal translation module, and guide the joint features of the missing modalities to approximate the joint features of the complete modalities obtained by pre-training on the complete modalities; there is no need to consider which modality is missing, and only the joint feature vector of the complete modalities needs to be approximated, so the method has stronger universality.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the first aspect of the invention provides a multi-mode emotion analysis method for uncertain mode deletion;
a multi-mode emotion analysis method for uncertain mode deletion comprises the following steps:
acquiring multi-modal data with indeterminate deletions, including three modalities: text, visual, and audio;
processing the three modal data through the trained multi-modal emotion analysis network to generate and output final emotion classification;
the multi-modal emotion analysis network comprises a modal translation module, wherein the modal translation module extracts single-modal characteristics of three modal data based on a multi-head self-attention mechanism, and uses a Transformer decoder to supervise the Transformer encoder, so that visual characteristics and audio characteristics approximate the text characteristics, and the visual characteristics and the audio characteristics are translated into text characteristics.
Further, the multi-head self-attention mechanism-based extraction of the single-mode characteristics of the three-mode data comprises the following specific steps:
extracting context features of each modality using a Transformer encoder;
residual connection is carried out on the extracted context characteristics, and normalization is carried out;
and carrying out linear transformation on the normalized context characteristics to obtain single-mode characteristics of the three-mode data.
Further, the use of a Transformer decoder to supervise the Transformer encoder is specifically:
constructing a Transformer decoder by taking text features as the Query of a multi-head attention mechanism and the modal features to be translated as the Key and Value of the multi-head attention mechanism;
supervising the Transformer encoder to translate the modal features to be translated into the text features through the translation loss between the modal features to be translated and the text features output by the Transformer decoder.
Further, the multi-modal emotion analysis network further comprises a public space projection module;
the common space projection module is used for carrying out linear transformation on the three mode characteristics after the mode translation to obtain an autocorrelation common space of each mode, and fusing the autocorrelation common space into the joint characteristics of the missing modes.
Further, the multi-modal emotion analysis network further comprises a pre-training module and a Transformer encoder module;
the pre-training module is used for pre-training the multi-modal emotion analysis network by using all complete modal data;
the Transformer encoder module guides the joint characteristics of the missing mode to approach the joint characteristics of the complete mode under the supervision of the pre-trained multi-modal emotion analysis network, and encodes the joint characteristics of the missing mode, so that the joint characteristics of the complete mode are generated.
Further, the multi-modal emotion analysis network further comprises a Transformer decoder module;
and taking the output of the Transformer encoder module as the input of the Transformer decoder module, and guiding the Transformer encoder to learn the long-term dependency relationship among different modes through the decoder loss.
Further, the overall training objective of the multi-modal emotion analysis network is derived from a weighted sum of the classification loss, the pre-training loss, the Transformer encoder-decoder loss, and the modal translation loss.
The second aspect of the invention provides a multimodal emotion analysis system oriented to uncertain modality absence.
A multimodal emotion analysis system for uncertain modality deletions comprising:
a data acquisition unit configured to: acquiring multi-modal data with indeterminate deletions, including three modalities: text, visual, and audio;
an emotion analysis unit configured to: processing the three modal data through the trained multi-modal emotion analysis network to generate and output final emotion classification;
the multi-modal emotion analysis network comprises a modal translation module, wherein the modal translation module extracts single-modal features of three modal data based on a multi-head self-attention mechanism, and uses a Transformer decoder to supervise the Transformer encoder, so that visual features and audio features approach text features, and the visual features and the audio features are translated into text features.
A third aspect of the present invention provides a computer readable storage medium having stored thereon a program which, when executed by a processor, performs the steps in the multi-modal emotion analysis method for uncertain modal deletion according to the first aspect of the present invention.
A fourth aspect of the present invention provides an electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps in the multimodal emotion analysis method for indeterminate modality loss according to the first aspect of the invention when the program is executed.
The one or more of the above technical solutions have the following beneficial effects:
in order to solve the problem that the advantages of text modes in an emotion analysis model and deep interaction among different modes are not considered in feature fusion in the existing work, the invention provides a mode translation module which translates visual and audio modes into the text modes, improves the quality of the visual and audio modes and can capture the deep interaction among different modes.
In order to solve the problem of uncertain modal deletion, the method and the device obtain the joint characteristics of the complete modal by pre-training the complete modal to guide the joint characteristics of the missing modal to approach the joint characteristics of the complete modal, do not need to consider which mode is missing, only need to approach the joint characteristic vector of the complete modal, and have stronger universality.
Experiments were performed on two common multi-modal data sets, CMU-MOSI and IEMOCAP; the experimental results show that the proposed model achieves significant improvements over several strong baselines on both data sets, verifying the effectiveness of the proposed model.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is an exemplary diagram of missing modalities in a multimodal emotion analysis.
Fig. 2 is a flow chart of a method of the first embodiment.
Fig. 3 is a diagram showing a network structure of multi-modal emotion analysis according to the first embodiment.
Fig. 4 is a schematic diagram of a modal translating module according to the first embodiment.
Detailed Description
The invention will be further described with reference to the drawings and examples.
Example 1
Aiming at the problem of uncertain modal deletion in multi-modal emotion analysis, the embodiment discloses a multi-modal emotion analysis method for uncertain modal deletion;
as shown in fig. 2, the multi-mode emotion analysis method for uncertain modal deletion includes:
step S1: acquiring multi-modal data with indeterminate deletions, including three modalities: text, visual, and audio.
Given the multi-modal data to be analyzed P = [X_v, X_a, X_t], where v, a and t represent visual, audio and text respectively, and X_v, X_a and X_t represent the visual modal data, the audio modal data and the text modal data respectively; without loss of generality, this embodiment uses X̃_M to represent a missing modality, where M ∈ {v, a, t}; for example, assuming no visual modality exists, the multi-modal data can then be expressed as P = [X̃_v, X_a, X_t].
Step S2: and processing the three modal data through the trained multi-modal emotion analysis network to generate and output final emotion classification.
This embodiment discloses a Transformer-based multi-modal emotion analysis network, the structure of which is shown in fig. 3; the multi-modal emotion analysis network comprises a modal translation module, a common space projection module, a pre-training module, a Transformer encoder-decoder module and a classification module; based on these modules, the specific process of processing the multi-modal data with the trained multi-modal emotion analysis network is as follows:
step S201: the visual mode and the audio mode are input into a mode translation module and translated into a text mode, and the text mode is encoded by a transducer encoder to obtain single mode characteristics of three mode data.
First, some key concepts in the Transformer are briefly introduced; given an input X, define Query as Q = X W_Q, Key as K = X W_K, and Value as V = X W_V, where W_Q ∈ R^(d×d), W_K ∈ R^(d×d) and W_V ∈ R^(d×d) are weight matrices; the calculation of the attention mechanism in the Transformer is shown in equation (1):

Attention(Q, K, V) = Softmax(Q K^T / √d_k) V   (1)

where Softmax is the normalized exponential function, T denotes the matrix transpose, and d_k is the dimension of the matrix K.
Since the multi-head attention mechanism has multiple attention heads simultaneously and can capture information from different subspaces, in order to learn multiple semantic expressions of the multiple modalities, this embodiment uses the multi-head attention mechanism to extract features in the different semantic spaces of each modality; the multi-head attention mechanism is formalized as shown in formula (2):

E_M = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W_O   (2)

where W_O ∈ R^(d×d) is a weight matrix and h is the number of attention heads; the i-th head, head_i, is expressed as shown in formula (3):

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)   (3)

where W_i^Q, W_i^K and W_i^V are the weight matrices of the i-th Query, Key and Value.
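In a non-limiting example, equations (1)-(3) can be sketched in PyTorch as follows; the class name, default dimensions and variable names are illustrative assumptions and do not correspond to the actual implementation of this embodiment:

```python
# Illustrative sketch of equations (1)-(3): scaled dot-product attention and
# its multi-head form. d and h follow the notation in the text; all other
# names and defaults are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d: int = 300, h: int = 4):
        super().__init__()
        assert d % h == 0
        self.h, self.d_k = h, d // h
        # W_Q, W_K, W_V of equation (1) and W_O of equation (2)
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.W_O = nn.Linear(d, d, bias=False)

    def forward(self, q_in, k_in, v_in):
        B, L, _ = q_in.shape
        # Project and split into h heads: (B, h, L, d_k)
        Q = self.W_Q(q_in).view(B, -1, self.h, self.d_k).transpose(1, 2)
        K = self.W_K(k_in).view(B, -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_V(v_in).view(B, -1, self.h, self.d_k).transpose(1, 2)
        # Equation (1) applied per head, i.e. equation (3)
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        heads = F.softmax(scores, dim=-1) @ V
        # Equation (2): concatenate the h heads and apply W_O
        concat = heads.transpose(1, 2).contiguous().view(B, L, -1)
        return self.W_O(concat)
```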
Prior research has proved that, in multi-modal emotion analysis, the emotion analysis accuracy based on the text modality is about 70-80%, while the emotion analysis accuracy based on video or audio is about 60-70%; in order to improve the effect of multi-modal emotion analysis, this embodiment uses a Transformer-based modal translation module to translate the visual modality and the audio modality into the text modality, and improves the quality of the multi-modal features by making the visual modality and the audio modality close to the text modality; the modal translation module is also applicable to scenes in which the visual and audio modalities are absent.
Before modal translation, the three types of modal data are first subjected to dimension transformation using a fully connected layer, so that the transformed data of each modality has dimension l(·)×d(·), where l(·) and d(·) represent the sequence length and the data dimension, respectively.
The modal translation module is shown in fig. 4, and the specific calculation process of the modal translation module is illustrated by taking visual modal translation as an example.
(1) And extracting single-mode characteristics of the three-mode data based on a multi-head self-attention mechanism.
First, for the three single-modality data, the contextual features of each modality are extracted using a Transformer encoder, and the specific calculation process is shown in equations (4)-(6):

E_vm = MultiHead(X_v, X_v, X_v)   (4)

E_am = MultiHead(X_a, X_a, X_a)   (5)

E_tm = MultiHead(X_t, X_t, X_t)   (6)

where X_v, X_a and X_t represent the (possibly missing) visual modality, the audio modality and the text modality, respectively; since a multi-head self-attention mechanism is used in the Transformer encoder, Q, K and V in the attention mechanism formula are the same here.
Secondly, a residual connection is applied to the extracted context features of each modality, and the result is input to a LayerNorm layer for normalization; the process is shown in equations (7)-(9):

E'_v = Layernorm(X_v + E_vm)   (7)

E'_a = Layernorm(X_a + E_am)   (8)

E'_t = Layernorm(X_t + E_tm)   (9)

where Layernorm represents layer normalization, X_v, X_a and X_t represent the visual, audio and text modalities respectively, and E_vm, E_am and E_tm are the contextual features of the visual modality, the audio modality and the text modality, respectively.
Then, the normalized context features are input into a feed-forward fully connected layer for linear transformation, thereby finishing the encoding of the three single-modality data, i.e. obtaining the single-modality features of the three modal data; the process is shown in equations (10)-(12):

E_v = ReLU(E'_v W_vl + b_vl)   (10)

E_a = ReLU(E'_a W_al + b_al)   (11)

E_t = ReLU(E'_t W_tl + b_tl)   (12)

where W_vl, W_al and W_tl are weight matrices, b_vl, b_al and b_tl indicate the biases, and ReLU is the activation function.
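As a non-limiting illustration of equations (4)-(12), one encoder step per modality (multi-head self-attention with Q = K = V, residual connection with LayerNorm, and a ReLU feed-forward projection) may be sketched as follows; it is built on torch.nn.MultiheadAttention, and all names and default sizes are assumptions rather than the actual implementation:

```python
# Illustrative sketch of equations (4)-(12): per-modality unimodal encoding.
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    def __init__(self, d: int = 300, h: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.ffn = nn.Linear(d, d)  # W_ml and b_ml in equations (10)-(12)

    def forward(self, x_m):                     # x_m: (batch, seq_len, d)
        e_m, _ = self.attn(x_m, x_m, x_m)       # equations (4)-(6), Q = K = V
        e_prime = self.norm(x_m + e_m)          # equations (7)-(9)
        return torch.relu(self.ffn(e_prime))    # equations (10)-(12)

# One such encoder is used per modality, e.g. E_v = UnimodalEncoder()(X_v).
```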
(2) When extracting the single-modality features of the three modal data, the Transformer decoder is used to supervise the Transformer encoder, causing the visual and audio features to approximate the text features, thereby translating the visual and audio features into text features.
After obtaining the single-modality feature of the visual modality and the single-modality feature of the text modality respectively, the Transformer encoder is supervised by the Transformer decoder, so that the visual feature E_v or the audio feature E_a generated by the encoder approximates the text feature E_t, i.e. the encoder is guided to translate visual or audio features into text features.
The specific operation is as follows:
first, the unimodal feature of the visual modality (or the unimodal feature of the audio modality) and the unimodal feature of the text modality are taken as inputs to the transducer decoder.
Then, taking the unimodal feature of the text mode as a Query of the multi-head attention mechanism, and taking the unimodal feature of the visual mode (or the unimodal feature of the audio mode) as the Key and Value of the multi-head attention mechanism for decoding, wherein the concrete calculation process is as shown in a formula (13) (14):
D vtm =MultiHead(E t ,E v ,E v ) (13)
D atm =MultiHead(E t ,E a ,E a ) (14)
Thereafter, D_vtm and D_atm are each passed through a residual connection and then input to a LayerNorm layer for normalization; next, the normalized results are input to the feed-forward fully connected layer for linear transformation to complete the Transformer decoder module; the specific operations are shown in equations (15)-(18):

D'_vt = Layernorm(E_t + D_vtm)   (15)

D'_at = Layernorm(E_t + D_atm)   (16)

D_vt = ReLU(D'_vt W_vtl + b_vtl)   (17)

D_at = ReLU(D'_at W_atl + b_atl)   (18)

where W_vtl and W_atl are weight matrices, b_vtl and b_atl are learnable biases, and ReLU is the activation function.
Finally, the modal translation loss (Λ_VtoT) is calculated from the output E_v of the visual encoder and the output D_vt of the decoder; by minimizing this loss, the encoder is supervised to translate the visual modality features into text modality features.
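As a non-limiting illustration of equations (13)-(18), the decoder branch of the modal translation module may be sketched as follows; the text feature serves as Query and the visual (or audio) feature as Key/Value, and all names and default sizes are assumptions:

```python
# Illustrative sketch of equations (13)-(18): cross-attention decoder that
# supervises the translation of the visual/audio features towards the text.
import torch
import torch.nn as nn

class TranslationDecoder(nn.Module):
    def __init__(self, d: int = 300, h: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.ffn = nn.Linear(d, d)  # W_vtl/W_atl and b_vtl/b_atl

    def forward(self, e_t, e_m):                    # e_t: text, e_m: visual or audio
        d_mtm, _ = self.cross_attn(e_t, e_m, e_m)   # equations (13)-(14)
        d_prime = self.norm(e_t + d_mtm)            # equations (15)-(16)
        return torch.relu(self.ffn(d_prime))        # equations (17)-(18)

# D_vt = TranslationDecoder()(E_t, E_v); D_vt is then used in the modal
# translation loss that supervises the visual encoder.
```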
Step S202: and inputting the single-mode characteristics of the three-mode data into a public space projection module, and fusing the single-mode characteristics into the joint characteristics (MJF) of the missing modes.
After obtaining the single-modality features of the three modal data, the common space projection module performs a linear transformation on the three modal features to obtain the autocorrelation common space of each modality, and then concatenates them into the joint features of the missing modalities; the common space projection is shown in equations (19)-(21):

C_v = [W_va E_v || W_vt E_v]   (19)

C_a = [W_va E_a || W_ta E_a]   (20)

C_t = [W_vt E_t || W_ta E_t]   (21)

where W_va, W_vt and W_ta are weight matrices, || represents the concatenation operation, and E_v, E_a and E_t represent the single-modality features of visual, audio and text, respectively.

Since the multi-modal features are randomly missing at this point, the joint feature obtained by concatenating all the common vectors under missing modalities is denoted C_all, as shown in equation (22):

C_all = [C_v || C_a || C_t]   (22)
the benefits of this treatment are the following: firstly, a weight matrix is trained by two modalities together, and interaction information between the two modalities is reserved in the weight; secondly, when the missing modal feature approaches the complete modal feature, only the integral joint feature can be concerned; thus, no matter which mode is missing, the mode is only approximated to the complete joint mode characteristic.
Step S203: the missing joint features are encoded by using a transducer encoder-decoder module, so that joint features approaching the complete mode are obtained, and in the transducer encoder-decoder module, the encoding of the MJF is supervised by using a pre-training model, so that the MJF approaches the joint features of the complete mode.
Specifically, the joint feature C_all of the missing modalities is taken as the input of the Transformer encoder; after encoding, the output E_out is obtained, as shown in equations (23)-(25):

E_allm = MultiHead(C_all, C_all, C_all)   (23)

E'_out = Layernorm(C_all + E_allm)   (24)

E_out = ReLU(E'_out W_1^e + b_1^e) W_2^e + b_2^e   (25)

where the input Query, Key and Value are the same and are all C_all, W_1^e and W_2^e are weight matrices, b_1^e and b_2^e are two learnable biases, and E_out is the joint feature of the missing modalities after the linear transformation, i.e. the joint feature approaching the complete modalities.
(1) When the MJF is encoded by the Transformer encoder, a pre-trained model is used to guide the encoding of the MJF by the Transformer encoder, so that the MJF approximates the joint features of the complete modalities.
The structure of the pre-trained model is the multi-modal emotion analysis network with the pre-training module removed, and the pre-trained model is trained with complete modal data; the output E_pre of the pre-training module is calculated in the same way as the Transformer encoder output E_out, i.e. obtained through modal translation and common space projection followed by concatenation.
(2) In order to efficiently model the long-term dependencies of inter-modality information, a Transformer encoder-decoder is used to capture the dependency information between the joint features; the output E_out of the encoder is taken as the input of the decoder, and the output D_out of the decoder is expressed as shown in equations (26)-(28):

D_outm = MultiHead(E_out, E_out, E_out)   (26)

D'_out = Layernorm(E_out + D_outm)   (27)

D_out = ReLU(D'_out W_1^d + b_1^d) W_2^d + b_2^d   (28)

where W_1^d and W_2^d are parameter matrices, b_1^d and b_2^d are two learnable biases, and ReLU is the activation function.

Finally, the Transformer encoder-decoder loss is calculated from the output E_out of the Transformer encoder and the output D_out of the Transformer decoder.
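As a non-limiting illustration of equations (23)-(28), the Transformer encoder and decoder over the joint feature share the same block structure (self-attention, residual connection with LayerNorm, and a two-layer ReLU feed-forward network matching the two weight matrices and two biases above); the default sizes are assumptions, and C_all is treated as a (batch, seq_len, dim) tensor:

```python
# Illustrative sketch of equations (23)-(28): joint-feature encoder/decoder block.
import torch
import torch.nn as nn

class JointTransformerBlock(nn.Module):
    def __init__(self, d_joint: int = 1800, h: int = 4, d_ff: int = 1800):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_joint, h, batch_first=True)
        self.norm = nn.LayerNorm(d_joint)
        self.ffn = nn.Sequential(nn.Linear(d_joint, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_joint))

    def forward(self, x):
        a, _ = self.attn(x, x, x)   # equation (23) / (26)
        n = self.norm(x + a)        # equation (24) / (27)
        return self.ffn(n)          # equation (25) / (28)

# encoder, decoder = JointTransformerBlock(), JointTransformerBlock()
# E_out = encoder(C_all); D_out = decoder(E_out)
```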
Step S204: joint feature E that will approximate the full modality out Input to a classification module to generate and output a final emotion classification.
In the classification module, the output E of the transducer encoder out Input to fully connected network with softmax activation function to obtain predictive scoreBased on the predictive score, an emotion classification is determined.
The overall training objective of the multi-modal emotion analysis network is obtained by the weighted summation of the classification loss (Λ_cls), the pre-training loss (Λ_pretrain), the Transformer encoder-decoder loss (Λ_de) and the modal translation losses (Λ_AtoT and Λ_VtoT), as shown in equation (29):

Λ = Λ_cls + λ_1 Λ_pretrain + λ_2 Λ_de + λ_3 Λ_VtoT + λ_4 Λ_AtoT   (29)

where λ_1, λ_2, λ_3 and λ_4 are the corresponding weights.
(1) Pre-training loss (Λ_pretrain)
The pre-training loss is calculated from the difference between the pre-training output (E_pre) and the Transformer encoder output (E_out), and the Kullback-Leibler (KL) divergence is adopted to guide the reconstruction of the missing modalities; the formula of the KL divergence is shown in equation (30):

D_KL(p || q) = Σ_x p(x) log(p(x) / q(x))   (30)

where p and q are two probability distributions; since the KL divergence is asymmetric, this embodiment uses the Jensen-Shannon (JS) divergence loss instead of the KL divergence, and the JS formula is shown in equation (31):

Λ_pretrain = JS(E_out || E_pre) = D_KL(E_out || E_pre) + D_KL(E_pre || E_out)   (31)
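In a non-limiting example, the symmetric KL loss of equations (30)-(31) (referred to as the JS divergence loss above) may be sketched as follows; converting the feature vectors into probability distributions with a softmax is an assumption of this sketch, and the same function is reused for the losses in equations (32)-(34):

```python
# Illustrative sketch of equations (30)-(31): symmetric KL ("JS") loss.
import torch
import torch.nn.functional as F

def js_loss(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    p = F.softmax(x, dim=-1) + eps
    q = F.softmax(y, dim=-1) + eps
    kl_pq = (p * (p / q).log()).sum(dim=-1).mean()   # D_KL(p || q), equation (30)
    kl_qp = (q * (q / p).log()).sum(dim=-1).mean()   # D_KL(q || p)
    return kl_pq + kl_qp                             # equation (31)

# Lambda_pretrain = js_loss(E_out, E_pre)
```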
(2) Transformer encoder-decoder loss (Λ_de)
Similar to the pre-training loss, the decoder loss is obtained by calculating the JS divergence between the output of the Transformer decoder (D_out) and the joint feature representation (C_all); the calculation process is shown in equation (32):

Λ_de = JS(D_out || C_all) = D_KL(D_out || C_all) + D_KL(C_all || D_out)   (32)
(3) Modal translation losses (Λ_AtoT and Λ_VtoT)
For the translation task, only the visual and audio modalities are translated into the text modality, and thus only the JS divergence losses between the outputs of the audio and visual decoders in the modal translation module (D_at and D_vt) and the text encoder representation in the modal translation module (E_t) are calculated, as shown in equations (33)-(34):

Λ_AtoT = JS(D_at || E_t) = D_KL(D_at || E_t) + D_KL(E_t || D_at)   (33)

Λ_VtoT = JS(D_vt || E_t) = D_KL(D_vt || E_t) + D_KL(E_t || D_vt)   (34)
(4) Classification loss (Λ_cls)
For the final classification module, the output E_out of the encoder is input into a fully connected network with a softmax activation function to obtain the predictive score ŷ, as shown in equation (35):

ŷ = Softmax(W_c E_out + b_c)   (35)

where W_c and b_c are a learnable weight and bias; a standard cross-entropy loss is applied to the classification task, and the cross-entropy loss formula is shown in equation (36):

Λ_cls = -(1/N) Σ_{n=1}^{N} y_n log(ŷ_n)   (36)

where N is the number of samples, y_n is the true emotion classification of the n-th sample, and ŷ_n is the predicted emotion classification.
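As a non-limiting illustration of equations (35)-(36), the classification head and cross-entropy loss may be sketched as follows; nn.CrossEntropyLoss applies the softmax internally, so the linear layer only produces the logits W_c E_out + b_c, and the joint-feature size of 1800 is an assumption:

```python
# Illustrative sketch of equations (35)-(36): softmax classifier and cross-entropy.
import torch
import torch.nn as nn

classifier = nn.Linear(1800, 3)          # W_c and b_c; 3 classes for CMU-MOSI
criterion = nn.CrossEntropyLoss()        # equation (36), softmax applied internally

# logits = classifier(E_out)             # equation (35) before the softmax
# loss_cls = criterion(logits, labels)   # labels: (batch,) class indices
```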
Experiments were performed on the public data sets CMU-MOSI and IEMOCAP, verifying that the model proposed in this example achieved significant improvements over several baseline models.
All experiments were performed on a Windows 10 system with an Intel(R) Core(TM) i9-10900K CPU, an Nvidia 3090 GPU and 96 GB RAM; for the CMU-MOSI and IEMOCAP data sets, the model parameter size was 90.7 M, and the average run time per epoch was 29 seconds and 1 minute respectively; the data sets and experimental settings are described as follows:
data set: experiments were performed on CMU-MOSI and IEMOCAP datasets, both datasets being multimodal baseline datasets of emotion analysis, including visual, text and audio modalities; for the CMU-MOSI dataset, it contains 2199 segments from 93 opinion videos on YouTube. The label of each sample is annotated with the emotion score in [ -3,3 ]. The present embodiment converts the score into passive, neutral, and active labels; for an IEMOCAP dataset, it contains 5 sessions, each session containing about 30 videos, where each video contains at least 24 utterances; the annotation tags are: neutral, depressed, anger, sadness, happiness, excitement, surprise, fear, disappointment, and others; specifically, the present example reports three classifications (negative: [ -3, 0), neutral) on the CMU-MOSI dataset: [0] positive: (0, 3) results two classification (negative: [ frustration, anger, sadness, fear, disappointment ], positive: [ happy, excited ]) results were reported on the IEMOCAP dataset.
Parameters: in this embodiment, the learning rate is set to 0.001, the batch size is set to 32, and the hidden layer size is set to 300; an Adam optimizer is used to minimize the total loss, the number of epochs is set to 20, the loss weights are set to 0.1, and the parameters are summarized in Table 1.
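A non-limiting sketch of this training setup is given below; it assumes that model returns the weighted total loss of equation (29) and that train_loader yields training batches, both of which are placeholders:

```python
# Illustrative sketch of the training setup (Adam, lr=0.001, batch size 32,
# 20 epochs); `model` and `train_loader` are assumed to exist.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(20):
    for batch in train_loader:
        optimizer.zero_grad()
        total_loss = model(batch)   # assumed to return the total loss of equation (29)
        total_loss.backward()
        optimizer.step()
```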
Table 1: detailed parameter settings in all experiments
Evaluation index: the performance of the model was measured using Accuracy and Macro-F1, and the formulas are shown in equations (37)-(38):

Accuracy = N_true / N   (37)

Macro-F1 = (1/C) Σ_{c=1}^{C} 2 P_c R_c / (P_c + R_c)   (38)

where N_true is the number of correctly predicted samples, N is the total number of samples, C is the number of classes, P is the precision, and R is the recall.
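In a non-limiting example, equations (37)-(38) correspond to the standard accuracy and macro-F1 metrics, e.g. as computed with scikit-learn; the label arrays below are illustrative only:

```python
# Illustrative sketch of equations (37)-(38) with scikit-learn.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 1, 0]   # illustrative ground-truth class indices
y_pred = [0, 1, 1, 1, 0]   # illustrative predictions
acc = accuracy_score(y_true, y_pred)                  # equation (37)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # equation (38)
```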
Multimodal data preprocessing
Visual representation: the CMU-MOSI and IEMOCAP datasets consist primarily of human conversations, and the visual features consist primarily of human faces; facial features are extracted with the OpenFace 2.0 toolkit, finally obtaining a 709-dimensional visual feature representation including facial, head and eye movements.
Text representation: for each text utterance, text features are extracted using a pre-trained BERT model; a pre-trained BERT model (12 layers, 768-dimensional hidden layers, 12 heads) is employed to obtain 768-dimensional word vectors.
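A non-limiting sketch of such text feature extraction with the Hugging Face transformers library is given below; the checkpoint name "bert-base-uncased" (12 layers, 768-dimensional hidden states, 12 heads) is an assumption:

```python
# Illustrative sketch of BERT-based text feature extraction.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I really enjoyed this movie", return_tensors="pt")
with torch.no_grad():
    word_vectors = bert(**inputs).last_hidden_state   # shape (1, seq_len, 768)
```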
Audio representation: acoustic features are extracted with Librosa; for the CMU-MOSI and IEMOCAP datasets, each audio track is down-mixed to mono and resampled to 16000 Hz; furthermore, frames are separated by 512 samples, and zero-crossing rate, Mel-frequency cepstral coefficient (MFCC) and Constant-Q Transform (CQT) features are selected to represent the audio segment; finally, stitching the three features together produces 33-dimensional acoustic features.
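A non-limiting sketch of such acoustic feature extraction with Librosa is given below; using 20 MFCCs and the 12-bin chroma-CQT as the CQT feature (which, together with the 1-dimensional zero-crossing rate, gives 33 dimensions) is an assumption of this sketch:

```python
# Illustrative sketch of the acoustic feature extraction: mono 16 kHz audio,
# 512-sample hop, ZCR + MFCC + chroma-CQT stitched into 33-dimensional frames.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000, mono=True)           # assumed file name
zcr = librosa.feature.zero_crossing_rate(y, hop_length=512)          # (1, T)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=512)   # (20, T)
cqt = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=512)         # (12, T)
acoustic = np.concatenate([zcr, mfcc, cqt], axis=0).T                # (T, 33)
```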
Baseline model
The baseline model results selected in this example are based on the work of Zeng et al; the selected baseline models are as follows:
AE: an efficient data coding network is trained to replicate its input to its output.
CRA: a missing modality reconstruction framework employs a residual connection mechanism to approximate differences between input data.
MCTN: a method of learning a robust joint representation by transitioning between modalities, the transition from a source modality to a target modality may capture joint information between modalities.
TransM: a multimode fusion method based on end-to-end translation uses a transducer to convert between a modality and encoded multimode characteristics.
MMIN: a unified multi-mode recognition model employs a cascading residual auto-encoder and cyclic consistency learning to recover missing modes.
TATE: a tag-assisted transformer encoder network employing tag encoding techniques covers all uncertain cases and oversees joint representation learning.
Experimental results
In the experimental results module, this example reports the three-classification results on CMU-MOSI and the two-classification results on IEMOCAP; the experimental results are shown in Tables 2 and 3; overall, the results show a decreasing trend as the missing rate increases.
TABLE 2 Experimental results in the case of a Mono-modal loss
TABLE 3 experimental results in the case of multimodal loss
For the single-modality missing case, the experimental results are shown in Table 2, with the missing rate set to 0-0.5; under the complete-modality condition, the M-F1 value on the CMU-MOSI dataset is 2.29% lower than that of the MMIN model, and the ACC value on the CMU-MOSI dataset is 0.01% lower than that of the TATE model; under the single-modality missing condition, when the missing rate is 0.1, the M-F1 value on the CMU-MOSI dataset is 0.78% lower than that of the TATE model; in addition, the method provided by this embodiment obtains the best results in all other settings, verifying the validity of the model of this embodiment.
For the multi-modality missing case, the relevant results are shown in Table 3; the experimental results show that, with multiple modalities randomly missing, when the missing rate is 0.4 the ACC value of the proposed model is 0.52% lower than that of the TATE model on the CMU-MOSI dataset; on the IEMOCAP dataset, compared with the other baseline models, the proposed model improves M-F1 by about 0.21% to 5.21% and the ACC value by about 0.75% to 4.05%, demonstrating the robustness of the model proposed in this embodiment.
Ablation study
In order to explore the effects of different modules in a TMTN, the model proposed in this embodiment was evaluated in the absence of a single modality, set as follows: 1) Only one modality is used; 2) Two modalities are used; 3) Removing the mode translation and then using a single mode; 4) Removing the mode translation and then using the dual modes; 5) Removing the modal translation module; 6) Removing the public space projection module; 7) The pre-training module is removed.
Table 4: comparison of all modules in TMTN
According to table 4, it can be found that when the text mode is absent, the performance is drastically reduced, and it is verified that the text information is dominant in the multimodal emotion analysis; however, no similar reduction was observed when the visual modality was removed; the possible reasons are: visual information is not well extracted due to small changes in the face; in addition, the upper half of the table shows the effect of modal characteristics after modal translation, and it can be found that the combination of multiple modalities provides better performance than a single modality, indicating that complementary characteristics can be learned between multiple modalities; to verify the validity of the modal translation method, the present example conducted an experiment as shown in the middle part of table 4; it can be found that the performance of the untranslated visual and audio modalities is not as good as the translated visual and audio modalities; in addition, the present example also investigated the combination of the untranslated visual and audio modalities with the text modalities, and the results again confirm that there is still a gap between the untranslated and translated modality combinations.
For the influence of different modules, after the modal translation module is removed, the performance of the model on M-F1 is reduced by about 1.28 to 5.57 percent relative to the whole model, and the performance on ACC is reduced by about 1.04 to 8.33 percent; after removal of the pre-training module, the performance of the model was reduced by about 2.41% to 8.31% on M-F1 and about 2.6% to 11.98% on ACC index. When the common space projection module is removed, in order to ensure the normal operation of the model, the operation of the common space projection module is replaced by direct splicing operation, and the performance of the model is reduced by about 1.43 to 6.85 percent on M-F1 and about 2.08 to 9.37 percent on ACC compared with a TMTN model when the mode is complete.
The ablation experiment proves the effectiveness of the modal translation module; because the complete mode is used for training the pre-training network, the pre-training module can be seen to play a good role in supervision; from this table, a remarkable result can be obtained, namely, the mode translation operation and the pre-training operation in the model of this embodiment can still enable the model to maintain stronger stability and better effect under the condition of mode missing.
Multi-classification under an IEMOCAP dataset
This embodiment also explores the performance of multiple classifications on the IEMOCAP dataset; in addition to the two-classification results, a four-classification experiment is performed by selecting the happy, angry, sad and neutral emotion labels, and a seven-classification experiment is performed with the happy, angry, sad, neutral, frustrated, excited and surprised emotion labels; the detailed distributions and results are shown in Tables 5 and 6, respectively; it can be seen that both the M-F1 value and the ACC decrease as the number of categories increases; further, a careful examination of Table 6 shows that when the number of classes is 7, the overall performance drops dramatically, which may be due to confusion among multiple classes making model convergence difficult; in addition, comparing the model proposed in this embodiment with the TATE model, the results indicate that the method of this embodiment obtains better results in the 2-class, 4-class and 7-class settings; notably, the performance of the model of this example is on average 10% higher than the TATE model on the 4-class task; further, in the 7-class setting, the effect of the model proposed in this embodiment drops more slowly as the missing rate increases.
Table 5: detailed distribution of IEMOCAP datasets
Table 6: multiple classification results for IEMOCAP datasets
This embodiment provides a Transformer-based multi-modal emotion analysis network for solving the problem of uncertain missing modalities; the model adopts a modal translation method to translate the visual and audio modalities into the text modality so as to improve the quality of the visual and audio modalities; in addition, the learning of the missing modal features is supervised by the pre-trained module, so that the missing modal features approximate the complete joint features in order to reconstruct the missing modal features; experiments on the CMU-MOSI and IEMOCAP datasets demonstrate the effectiveness of the proposed model in the absence of modalities.
Example two
The embodiment discloses a multi-mode emotion analysis system for uncertain mode deletion;
a multimodal emotion analysis system for uncertain modality deletions comprising:
a data acquisition unit configured to: acquiring multi-modal data with indeterminate deletions, including three modalities: text, visual, and audio;
an emotion analysis unit configured to: processing the three modal data through the trained multi-modal emotion analysis network to generate and output final emotion classification;
the multi-modal emotion analysis network comprises a modal translation module, wherein the modal translation module extracts single-modal features of three modal data based on a multi-head self-attention mechanism, and uses a Transformer decoder to supervise the Transformer encoder, so that visual features and audio features approach text features, and the visual features and the audio features are translated into text features.
Example III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps in a multimodal emotion analysis method for uncertainty modality-oriented deletion as described in an embodiment of the present disclosure.
Example IV
An object of the present embodiment is to provide an electronic apparatus.
The electronic device comprises a memory, a processor and a program stored on the memory and capable of running on the processor, wherein the processor realizes the steps in the multi-mode emotion analysis method for uncertain mode deletion according to the first embodiment of the disclosure when executing the program.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. The multi-modal emotion analysis method for uncertain modal deletion is characterized by comprising the following steps of:
acquiring multi-modal data with indeterminate deletions, including three modalities: text, visual, and audio;
processing the three modal data through the trained multi-modal emotion analysis network to generate and output final emotion classification;
the multi-modal emotion analysis network comprises a modal translation module, wherein the modal translation module extracts single-modal characteristics of three modal data based on a multi-head attention mechanism, and uses a Transformer decoder to supervise the Transformer encoder, so that visual characteristics and audio characteristics approximate the text characteristics, and the visual characteristics and the audio characteristics are translated into the text characteristics;
the single-mode characteristics of three mode data are extracted based on the multi-head attention mechanism, and the single-mode characteristics are specifically as follows:
extracting context features of each modality using a Transformer encoder;
residual connection is carried out on the extracted context characteristics, and normalization is carried out;
performing linear transformation on the normalized context characteristics to obtain single-mode characteristics of three-mode data;
the use of a Transformer decoder to supervise the Transformer encoder is described in detail:
constructing a Transformer decoder by taking text features as Query of a multi-head attention mechanism and modal features to be translated as Key and Value of the multi-head attention mechanism;
the method comprises the steps that through the translation loss between the modal feature to be translated and the text feature output by the Transformer decoder, the Transformer encoder is supervised to translate the modal feature to be translated to the text feature;
the multi-modal emotion analysis network further comprises a public space projection module;
the public space projection module is used for carrying out linear transformation on three mode characteristics after the mode translation to obtain an autocorrelation public space of each mode, and fusing the autocorrelation public space into a joint characteristic of a missing mode;
the multi-modal emotion analysis network further comprises a pre-training module and a Transformer encoder module;
the pre-training module is used for pre-training the multi-modal emotion analysis network by using all complete modal data;
the Transformer encoder module guides the joint characteristics of the missing mode to approach the joint characteristics of the complete mode under the supervision of a pre-trained multi-modal emotion analysis network, and encodes the joint characteristics of the missing mode so as to generate the joint characteristics of the complete mode;
the multi-modal emotion analysis network further comprises a Transformer decoder module;
and taking the output of the Transformer encoder module as the input of the Transformer decoder module, and guiding the Transformer encoder to learn the long-term dependency relationship among different modes through decoder loss.
2. The method for multi-modal emotion analysis for uncertain modal loss as recited in claim 1, wherein the overall training objective of the multi-modal emotion analysis network is derived from a weighted summation of the classification loss, the pre-training loss, the Transformer encoder-decoder loss, and the modal translation loss.
3. A multi-modal emotion analysis system for the absence of an uncertain modality, comprising:
a data acquisition unit configured to: acquiring multi-modal data with indeterminate deletions, including three modalities: text, visual, and audio;
an emotion analysis unit configured to: processing the three modal data through the trained multi-modal emotion analysis network to generate and output final emotion classification;
the multi-modal emotion analysis network comprises a modal translation module, wherein the modal translation module extracts single-modal features of three modal data based on a multi-head attention mechanism, and uses a Transformer decoder to supervise the Transformer encoder, so that visual features and audio features approach text features, and the visual features and the audio features are translated into the text features;
the single-mode characteristics of three mode data are extracted based on the multi-head attention mechanism, and the single-mode characteristics are specifically as follows:
extracting context features of each modality using a Transformer encoder;
residual connection is carried out on the extracted context characteristics, and normalization is carried out;
performing linear transformation on the normalized context characteristics to obtain single-mode characteristics of three-mode data;
the use of a Transformer decoder to supervise the Transformer encoder is described in detail:
constructing a Transformer decoder by taking text features as Query of a multi-head attention mechanism and modal features to be translated as Key and Value of the multi-head attention mechanism;
the method comprises the steps that through the translation loss between the modal feature to be translated and the text feature output by the Transformer decoder, the Transformer encoder is supervised to translate the modal feature to be translated to the text feature;
the multi-modal emotion analysis network further comprises a public space projection module;
the public space projection module is used for carrying out linear transformation on three mode characteristics after the mode translation to obtain an autocorrelation public space of each mode, and fusing the autocorrelation public space into a joint characteristic of a missing mode;
the multi-modal emotion analysis network further comprises a pre-training module and a Transformer encoder module;
the pre-training module is used for pre-training the multi-modal emotion analysis network by using all complete modal data;
the Transformer encoder module guides the joint characteristics of the missing mode to approach the joint characteristics of the complete mode under the supervision of a pre-trained multi-modal emotion analysis network, and encodes the joint characteristics of the missing mode so as to generate the joint characteristics of the complete mode;
the multi-modal emotion analysis network further comprises a Transformer decoder module;
and taking the output of the Transformer encoder module as the input of the Transformer decoder module, and guiding the Transformer encoder to learn the long-term dependency relationship among different modes through decoder loss.
4. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer-readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of any of the preceding claims 1-2.
5. A storage medium, characterized by non-transitory storing computer-readable instructions, wherein the instructions of the method of any one of claims 1-2 are performed when the non-transitory computer-readable instructions are executed by a computer.
CN202310081044.0A 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion Active CN115983280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310081044.0A CN115983280B (en) 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310081044.0A CN115983280B (en) 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion

Publications (2)

Publication Number Publication Date
CN115983280A CN115983280A (en) 2023-04-18
CN115983280B true CN115983280B (en) 2023-08-15

Family

ID=85976060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310081044.0A Active CN115983280B (en) 2023-01-31 2023-01-31 Multi-mode emotion analysis method and system for uncertain mode deletion

Country Status (1)

Country Link
CN (1) CN115983280B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312530A (en) * 2021-06-09 2021-08-27 哈尔滨工业大学 Multi-mode emotion classification method taking text as core
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN113806609A (en) * 2021-09-26 2021-12-17 郑州轻工业大学 Multi-modal emotion analysis method based on MIT and FSM
CN113971837A (en) * 2021-10-27 2022-01-25 厦门大学 Knowledge-based multi-modal feature fusion dynamic graph neural sign language translation method
CN114091466A (en) * 2021-10-13 2022-02-25 山东师范大学 Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN114418069A (en) * 2022-01-19 2022-04-29 腾讯科技(深圳)有限公司 Method and device for training encoder and storage medium
CN114973062A (en) * 2022-04-25 2022-08-30 西安电子科技大学 Multi-modal emotion analysis method based on Transformer
CN114973044A (en) * 2021-02-22 2022-08-30 上海大学 Video emotion analysis method for enhancing multi-head attention based on bimodal information
CN115019405A (en) * 2022-05-27 2022-09-06 中国科学院计算技术研究所 Multi-modal fusion-based tumor classification method and system
CN115063709A (en) * 2022-04-14 2022-09-16 齐鲁工业大学 Multi-modal emotion analysis method and system based on cross-modal attention and hierarchical fusion
CN115544279A (en) * 2022-10-11 2022-12-30 合肥工业大学 Multi-modal emotion classification method based on cooperative attention and application thereof
CN115565071A (en) * 2022-10-26 2023-01-03 深圳大学 Hyperspectral image transform network training and classifying method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Robust Multimodal Sentiment Analysis via Tag Encoding of Uncertain Missing Modalities; Jiandian Zeng et al.; IEEE Transactions on Multimedia; pages 1-14 *

Also Published As

Publication number Publication date
CN115983280A (en) 2023-04-18

Similar Documents

Publication Publication Date Title
WO2021233112A1 (en) Multimodal machine learning-based translation method, device, equipment, and storage medium
Sun et al. Multimodal cross-and self-attention network for speech emotion recognition
CN113420807A (en) Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
Huan et al. Video multimodal emotion recognition based on Bi-GRU and attention fusion
Zeng et al. Tag-assisted multimodal sentiment analysis under uncertain missing modalities
Wang et al. Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition.
CN112348111B (en) Multi-modal feature fusion method and device in video, electronic equipment and medium
CN115577161A (en) Multi-mode emotion analysis model fusing emotion resources
CN113705315B (en) Video processing method, device, equipment and storage medium
CN114417097A (en) Emotion prediction method and system based on time convolution and self-attention
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN113392265A (en) Multimedia processing method, device and equipment
Goncalves et al. Improving speech emotion recognition using self-supervised learning with domain-specific audiovisual tasks
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
Wang et al. MT-TCCT: Multi-task learning for multimodal emotion recognition
CN115983280B (en) Multi-mode emotion analysis method and system for uncertain mode deletion
Ai et al. A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning
Zou et al. Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation
CN115019137A (en) Method and device for predicting multi-scale double-flow attention video language event
Bai et al. Low-rank multimodal fusion algorithm based on context modeling
Nguyen et al. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation
CN114329005A (en) Information processing method, information processing device, computer equipment and storage medium
Xie et al. Global-shared Text Representation based Multi-Stage Fusion Transformer Network for Multi-modal Dense Video Captioning
Xu Multimodal Sentiment Analysis Data Sets and Preprocessing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant