CN113435496A - Self-adaptive fusion multi-mode emotion classification method based on attention mechanism - Google Patents
Self-adaptive fusion multi-mode emotion classification method based on attention mechanism
- Publication number
- CN113435496A CN113435496A CN202110703330.7A CN202110703330A CN113435496A CN 113435496 A CN113435496 A CN 113435496A CN 202110703330 A CN202110703330 A CN 202110703330A CN 113435496 A CN113435496 A CN 113435496A
- Authority
- CN
- China
- Prior art keywords
- emotion
- sentence
- text
- representation
- visual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to the field of multi-modal emotion analysis, and in particular to a multi-modal emotion classification method based on adaptive fusion with an attention mechanism. In the technical scheme, a word sequence X = [x_1, x_2, ..., x_N] is given; emotion-related features in the sentence sequence are captured from different angles, and visual emotion features are learned by relying on the sentence context; the multi-modal emotion prediction process is realized through a context guide module and a multi-modal complementary fusion module. The method has the advantages that (1) it not only emphasizes important text information but can also learn the potential emotion expression of vision in the text; (2) the fused representation contains not only the important aspects shared by the modalities but also the unique aspects of the text that vision ignores, while fusing adaptively by relying on the overall emotion; (3) the average accuracy is improved from 62.80 to 65.80, and the improvement rate rises from 43.1% to 49.92%.
Description
Technical Field
The invention relates to the field of multi-modal emotion analysis, and in particular to a multi-modal emotion classification method based on adaptive fusion with an attention mechanism.
Background
With the rapid development of the internet, users are increasingly willing to publish content (such as reviews) on social platforms, which causes the rapid growth of information on the internet. It is reported that 90% of consumers tend to read reviews about goods prior to consumption, and 88% of consumers choose to view reviews and suggestions written by acquaintances. Businesses therefore wish to know consumer preferences and consumer opinions when shaping product design and marketing strategies. Emotion analysis is central to understanding user-generated content. Currently, emotion analysis based on a single modality has been the focus of previous research, and text-based emotion classification methods can be roughly divided into three types: methods based on emotion dictionaries, traditional methods based on machine learning, and methods based on deep learning. Due to the popularity of smartphones, the content on social platforms has gradually evolved from single-modality documents into rich documents that combine multiple modalities (such as text and images), and this user-generated multi-modal content describes the user's current experience more accurately than text alone.
Introducing visual information alleviates, to a certain extent, the problem of the text misidentifying important aspects. How to fuse multiple modalities effectively and improve the emotion classification effect is a hot problem in the field. There are two main views of the relationship between the modalities. One view holds that the image and the text are equally effective and that both can be regarded as independent features contributing to emotion classification. For example, image and text features are obtained separately with deep neural networks and combined with a multi-layer perceptron (MLP) to infer the potential emotional state of the user; or the Deep Multimodal Attention Fusion (DMAF) framework exploits the distinctive features of, and the intrinsic relevance between, the visual and textual modalities; or an image-text consistency method judges whether the contents of the two are consistent and then adaptively fuses text features with visual features extracted from the traditional SentiBank, classifying image-text emotion with a support vector machine (SVM); or two-layer multimodal hypergraph learning (Bi-MHG) is proposed, which explicitly models the correlation between the visual, textual and emoticon modalities by connecting the two layers through shared multimodal feature relevance; or, considering that the fusion of multi-modal information needs to be based on the overall emotion, an attention-based modality-gated network (AMGN) is proposed, in which a modality-gated long short-term memory network (LSTM) adaptively selects the modality with strong emotion to learn multi-modal features. The other view holds that the text is dominant and that images should be regarded as an aid that helps the text focus on important aspects.
Traditional machine-learning-based methods consider that the images in a review can emphasize certain important aspects, so that visual information plays an auxiliary role for the overall emotion rather than serving as an independent feature; in the prior art, when a visual aspect attention network is designed, the image guides the model to focus on the important sentences in the review. However, the inventors believe that an image in a review is a presentation of a specific thing, so that the attention network tends to give more attention to sentences describing facts rather than sentences reflecting emotion, and the potential information of the image is not fully utilized. Obtaining emotion features directly from images is tricky, and text reviews are needed to further mine the features that are effective for the emotion classification task. By guiding the emotion module with the context in the model, the visually enhanced sentences are made to attend to the important emotion-related aspects, so that the potential visual emotion is learned. In addition, the image categories in online review data sets are mostly singular, basically presenting the food and environment aspects of a restaurant, whereas the reviews considered by the invention tend to be comprehensive and cover more aspects. The aspects embodied by the images therefore do not cover all the aspects important for document emotion classification. A previous proposal uses a global mean image as compensation, but the aspects missing from different review images differ, so such processing is too simple and not flexible enough. The inventors instead take the aspects reflected by the important sentences in the text as complementary features, which solves the problem that the images focus only on a single aspect and yields more accurate classification results than capturing the important aspects from the images alone.
Disclosure of Invention
The invention aims to provide a multi-modal emotion classification method based on adaptive fusion with an attention mechanism, so as to overcome the defects of the prior art.
The invention is realized by the following technical scheme, which comprises a pre-training model for learning bidirectional context information, namely the BERT_base model, a network with 768 hidden units, 12 attention heads and feed-forward sublayers, and comprises the following steps:
Step one, a word sequence X = [x_1, x_2, ..., x_N] is given, where x_i is the sum of the word, segment and position embeddings and N is the maximum length of the sequence. The word sequence X is input into the pre-training model for encoding, and the output of the last encoder layer is taken as the sentence hidden state h_i:
h_i = BERT(X_i)
For the multiple images attached to the document, the images are uniformly resized to 224 × 224, the last fully connected layer of the residual network is removed, and the output of the last convolutional layer is used as the representation of image I_j:
a_j = ResNet(I_j)
The image representation a_j is a 2048-dimensional vector encoded from the image I_j;
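A minimal PyTorch sketch of this feature-extraction step is given below. It assumes the Hugging Face transformers BertModel and the torchvision resnet152; the helper names encode_sentences and encode_images are illustrative and not part of the patent.

```python
import torch
from transformers import BertModel, BertTokenizer
from torchvision import models, transforms

# Sentence encoder: the last-layer hidden states of BERT-base serve as h_i.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_sentences(sentences, max_len=128):
    batch = tokenizer(sentences, padding="max_length", truncation=True,
                      max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    return out.last_hidden_state            # (num_sentences, max_len, 768)

# Image encoder: resize to 224 x 224, drop the final fully connected layer of
# ResNet and keep the 2048-d output of the last convolutional stage as a_j.
resnet = models.resnet152(pretrained=True)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])   # remove the fc layer
preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

def encode_images(pil_images):
    x = torch.stack([preprocess(img) for img in pil_images])    # (M, 3, 224, 224)
    with torch.no_grad():
        a = backbone(x).flatten(1)                               # (M, 2048)
    return a
```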
Step two, emotion-related features in the sentence sequence are captured from different angles. First, from the visual angle, the sentences in the document that are related to the images are enhanced with the image information: the visual feature embedding a_j and the sentence-level hidden state h_i are projected into the same space with a nonlinear transformation;
then the degree of correlation between the sentence hidden states and a specific image is learned by matrix multiplication, a softmax function is applied to obtain the weights α_{j,i}, and a weighted summation yields the visually enhanced sentence representation t_j:
p_j = relu(W_p a_j + b_p)
q_i = relu(W_q h_i + b_q)
t_j = Σ_i α_{j,i} h_i
where W_p, W_q, b_p and b_q are the weights and biases of the multi-layer perceptrons, relu is the nonlinear activation function, and α_{j,i} captures the correlation between the visual representation and the sentence hidden state;
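The visual attention of this step might be implemented as in the following sketch; expressing α_{j,i} as a softmax over the matrix product of the projections is an assumption consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttention(nn.Module):
    """Visually enhanced sentences: t_j = sum_i alpha_{j,i} * h_i."""
    def __init__(self, img_dim=2048, sent_dim=768, proj_dim=256):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, proj_dim)    # W_p, b_p
        self.proj_sent = nn.Linear(sent_dim, proj_dim)  # W_q, b_q

    def forward(self, a, h):
        # a: (M, img_dim) image features; h: (S, sent_dim) sentence hidden states
        p = F.relu(self.proj_img(a))          # (M, proj_dim)
        q = F.relu(self.proj_sent(h))         # (S, proj_dim)
        scores = p @ q.t()                    # (M, S) correlation by matrix multiplication
        alpha = F.softmax(scores, dim=-1)     # alpha_{j,i}: weight of sentence i for image j
        t = alpha @ h                         # (M, sent_dim) visually enhanced sentences
        return t, alpha
```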
Step three, from the perspective of the text: in the emotion analysis task, sentences containing emotion information are more important than sentences describing facts. Therefore, a self-attention mechanism is adopted at the sentence level so that the relations between sentences are learned on the sentence-level hidden states h_i, yielding the relative importance of each sentence representation; this is normalized with softmax to obtain the attention weight β_i, and finally the attention weights and the sentence-level hidden states h_i are combined by weighted summation to obtain the text self-enhanced sentence representation s_i:
u_i = W_u tanh(W_s h_i + b_s)
where W_u, W_s, b_s and b_u are the respective weights and biases, and β_i reflects the different importance of each sentence in the document;
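A sketch of the sentence-level self-attention follows. Whether s_i is the β_i-scaled hidden state of each sentence or a single weighted sum is not spelled out above; the sketch keeps one representation per sentence scaled by its weight, which is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceSelfAttention(nn.Module):
    """Text self-enhancement: u_i = W_u tanh(W_s h_i + b_s), beta = softmax(u)."""
    def __init__(self, sent_dim=768, attn_dim=256):
        super().__init__()
        self.w_s = nn.Linear(sent_dim, attn_dim)   # W_s, b_s
        self.w_u = nn.Linear(attn_dim, 1)          # W_u, b_u

    def forward(self, h):
        # h: (S, sent_dim) sentence-level hidden states
        u = self.w_u(torch.tanh(self.w_s(h)))      # (S, 1) unnormalized importance
        beta = F.softmax(u, dim=0)                 # beta_i: importance of sentence i
        s = beta * h                               # (S, sent_dim) self-enhanced sentences
        return s, beta
```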
Step four, because it is difficult to obtain emotion information directly from the images, the visual emotion features must be learned by relying on the sentence context. The visually enhanced sentence representation t_j from step two and the text self-enhanced sentence representation s_i from step three are used, with s_i serving as the context that guides the image-enhanced sentences to pay more attention to emotion-related features. The context-guided complementary fusion network CGCFN mainly comprises a context guide module CGM and a multi-modal complementary fusion module MCFM;
to this end, in the context-guided complementary fusion network CGCFN, the context guide module CGM (Context Guide Module), one of the core modules, mainly relies on a context-guided attention mechanism: given the context representation, the visually enhanced sentences learn the features that vision and text share with respect to emotion, yielding the potential visual emotion features. The context representation s_i and the image-related sentence representation t_j are projected into the same space by means of different parameter matrices, and their degree of correlation is computed to obtain the emotion weight coefficient γ_{j,i} of the visually enhanced sentence and, further, the visual emotion representation c_i; the emotion weight coefficient is computed as follows:
u_j = tanh(W_u t_j + b_e)
v_i = tanh(W_v s_i + b_f)
c_i = γ_{j,i} · t_j
where W_u ∈ R^{c×e}, W_v ∈ R^{c×e}, b_u ∈ R^c and b_v ∈ R^c are the weight and bias parameters respectively, a sigmoid function is adopted, and γ_{j,i} embodies how the context representation guides the visually enhanced sentences to capture emotion-related information;
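The context guide module might be sketched as follows. The element-wise sigmoid form of γ_{j,i} over the projected correlation, and the aggregation of the t_j over the images to obtain c_i, are assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class ContextGuideModule(nn.Module):
    """CGM: the context s_i guides the visually enhanced sentences t_j toward emotion."""
    def __init__(self, dim=768, proj_dim=256):
        super().__init__()
        self.w_u = nn.Linear(dim, proj_dim)   # W_u, b_e (visual side)
        self.w_v = nn.Linear(dim, proj_dim)   # W_v, b_f (context side)

    def forward(self, t, s):
        # t: (M, dim) visually enhanced sentences; s: (S, dim) context representations
        u = torch.tanh(self.w_u(t))           # (M, proj_dim)
        v = torch.tanh(self.w_v(s))           # (S, proj_dim)
        gamma = torch.sigmoid(v @ u.t())      # (S, M) emotion weight coefficients gamma_{j,i}
        c = gamma @ t                         # (S, dim) visual emotion representation c_i
        return c, gamma
```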
Step five, because the content reflected by the images in a review is mostly singular and cannot cover all the important aspects in the text, the complementary or reinforcing relationship between the modalities is adjusted dynamically by learning the interaction between vision and text. When the correlation between the image and the text is high, this correlation is used to strengthen the visual emotion representation c_i; when the correlation is not high, the text self-enhanced sentence representation s_i is relied upon as a complementary feature. In the context-guided complementary fusion network CGCFN, the multi-modal complementary fusion module MCFM (Multi-modal Complementary Fusion Module), one of the core modules, consists of a gate function and a self-attention mechanism: the gate function learns the cross-modal interaction, assigns different weights to the visual emotion features and dynamically converts the relationship between the modalities, and the self-attention mechanism fuses the text features and the visual emotion features to obtain the final multi-modal representation. Specifically, the degree of correlation between the image and the text is first computed: the visual feature embedding and the text representation are projected into the same space through one layer of neurons with a nonlinear transformation and then multiplied, and the modal gate function g_{j,i} is obtained through a nonlinear transformation with the sigmoid activation function. The gate function g_{j,i} and the visual emotion representation c_i then learn their interaction through element-wise multiplication, and at the same time the text self-enhanced sentence representation s_i is added element-wise, yielding the adaptive multi-modal emotion representation d_i:
e_j = tanh(W_e m_j + b_e)
f_i = tanh(W_f h_i + b_f)
where W_e, W_f, b_e and b_f are the corresponding weights and biases, and ⊙ denotes element-wise multiplication. The gate function g_{j,i}, which learns the correlation between vision and text, dynamically adjusts the relationship between the two modal representations through back-propagation: when the image and the text are closely related, g_{j,i} is large and the visual emotion representation c_i contributes more to the multi-modal representation; conversely, when the image-text correspondence is weak, g_{j,i} is small, ensuring that the current multi-modal representation depends more on the sentence representation s_i of the text itself;
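The gating part of the MCFM can be sketched as below. The visual input m_j of the gate is taken here to be the per-sentence visual emotion representation c_i, and the gate is applied element-wise per sentence; both choices are assumptions, since the description above does not pin them down.

```python
import torch
import torch.nn as nn

class ModalGate(nn.Module):
    """MCFM gating: d_i = g_{j,i} * c_i + s_i, with the gate learned from vision-text correlation."""
    def __init__(self, dim=768):
        super().__init__()
        self.w_e = nn.Linear(dim, dim)   # W_e, b_e (visual side)
        self.w_f = nn.Linear(dim, dim)   # W_f, b_f (text side)

    def forward(self, c, h, s):
        # c: (S, dim) visual emotion representations; h: (S, dim) sentence hidden states;
        # s: (S, dim) text self-enhanced sentence representations
        e = torch.tanh(self.w_e(c))      # projected visual side (e_j)
        f = torch.tanh(self.w_f(h))      # projected text side (f_i)
        g = torch.sigmoid(e * f)         # modal gate g_{j,i}
        d = g * c + s                    # adaptive multi-modal emotion representation d_i
        return d, g
```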
Step six, the multi-modal emotion representation d_i from step five is input into the self-attention mechanism of the multi-modal complementary fusion module MCFM, and effective multi-modal fusion is performed to obtain the final multi-modal representation d relevant to the emotion classification task. In addition, the input of the pre-training model BERT is a sequence whose first token is the classification token [CLS]; the final hidden state corresponding to [CLS] learns global information and is often used as the aggregated sequence representation in classification tasks. Therefore, d_cls is input into a fully connected layer and a softmax function to obtain the final emotion prediction φ:
k_i = W_k(W_d d_i + b_d)
φ = softmax(W_c d_cls + b_c)
where W_d, W_k and b_d are the weights and bias of the multi-layer perceptron MLP, and δ_i reflects the contribution that the emotion representation of each modality makes to the final multi-modal representation;
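The fusion-and-classification step might look like the sketch below; obtaining δ_i as a softmax over k_i, reading d_cls from the first position of the fused sequence, and the number of classes are assumptions made to keep the example runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Self-attention fusion over the multi-modal representations d_i, then a softmax classifier."""
    def __init__(self, dim=768, attn_dim=256, num_classes=5):
        super().__init__()
        self.w_d = nn.Linear(dim, attn_dim)              # W_d, b_d
        self.w_k = nn.Linear(attn_dim, 1, bias=False)    # W_k
        self.w_c = nn.Linear(dim, num_classes)           # W_c, b_c

    def forward(self, d):
        # d: (S, dim) adaptive multi-modal emotion representations; position 0 holds [CLS]
        k = self.w_k(self.w_d(d))                        # (S, 1)  k_i = W_k (W_d d_i + b_d)
        delta = F.softmax(k, dim=0)                      # delta_i: contribution of each position
        fused = delta * d                                # weighted multi-modal representations
        d_cls = fused[0]                                 # aggregated [CLS] representation d_cls
        phi = F.softmax(self.w_c(d_cls), dim=-1)         # emotion prediction phi
        return phi
```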
Step seven, the above describes the multi-modal emotion prediction process; the model is trained in an end-to-end fashion by minimizing the cross-entropy loss function between the prediction φ and the true label l of document d.
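A minimal end-to-end training sketch with the cross-entropy objective follows; the linear placeholder model and the synthetic batches only stand in for the full network and a Yelp-style review loader.

```python
import torch
import torch.nn as nn

# Placeholders so the sketch runs: `model` stands in for the full CGCFN and
# `batches` for mini-batches produced by a review data loader.
model = nn.Linear(768, 5)
batches = [(torch.randn(8, 768), torch.randint(0, 5, (8,))) for _ in range(10)]

criterion = nn.CrossEntropyLoss()                  # cross-entropy between the prediction and the true label l
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

model.train()
for features, labels in batches:
    logits = model(features)                       # phi = softmax(logits) at prediction time
    loss = criterion(logits, labels)               # end-to-end objective, minimized by back-propagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```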
The method has the advantages that (1) on the basis of a visual attention mechanism and a self-attention mechanism, a context guide module is constructed, and the model is guided to combine the visually enhanced sentence representations with the context information to further mine the emotion information in the images, so that not only is important text information emphasized, but the potential emotion representation of vision in the text can also be learned. (2) A multi-modal complementary fusion method is provided, in which the relationship between the emotion representations of the modalities is adjusted dynamically through a gate function; the adjusted multi-modal representation contains not only the important aspects shared by the modalities but also the unique aspects of the text that vision ignores, and is fused adaptively by relying on the overall emotion. (3) Extensive experiments performed on the Yelp data set show that the invention is superior to the methods of the prior art schemes: the accuracy (Acc) on the 5 city data sets is improved from 67.61, 70.70, 62.38, 61.45 and 62.40 to 71.43, 69.54, 65.47, 64.61 and 66.32 respectively, the average accuracy (Avg) is improved from 62.80 to 65.80, and the improvement rate (Improvement) rises from 43.1% to 49.92%.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
In order to make the inventive content of the present invention better understood by those skilled in the art, a preferred embodiment of the present invention is further described below with reference to FIG. 1. It comprises a pre-training model for learning bidirectional context information, namely the BERT_base model, a network with 768 hidden units, 12 attention heads and feed-forward sublayers, and is characterized by the following steps:
Step one, a word sequence X = [x_1, x_2, ..., x_N] is given, where x_i is the sum of the word, segment and position embeddings. The word sequence X is input into the pre-training model for encoding, and the output of the last encoder layer is taken as the sentence hidden state h_i: h_i = BERT(X_i)
For the multiple images attached to the document, the images are uniformly resized to 224 × 224, the last fully connected layer of the residual network is removed, and the output of the last convolutional layer is used as the representation of image I_j:
a_j = ResNet(I_j)
The image representation a_j is a 2048-dimensional vector encoded from the image I_j;
Step two, emotion-related features in the sentence sequence are captured from different angles. First, from the visual angle, the sentences in the document that are related to the images are enhanced with the image information: the visual feature embedding a_j and the sentence-level hidden state h_i are projected into the same space with a nonlinear transformation;
then the degree of correlation between the sentence hidden states and a specific image is learned by matrix multiplication, a softmax function is applied to obtain the weights α_{j,i}, and a weighted summation yields the visually enhanced sentence representation t_j:
p_j = relu(W_p a_j + b_p)
q_i = relu(W_q h_i + b_q)
t_j = Σ_i α_{j,i} h_i
where W_p, W_q, b_p and b_q are the weights and biases of the multi-layer perceptrons, relu is the nonlinear activation function, and α_{j,i} captures the correlation between the visual representation and the sentence hidden state;
Step three, from the perspective of the text: in the emotion analysis task, sentences containing emotion information are more important than sentences describing facts. Therefore, a self-attention mechanism is adopted at the sentence level so that the relations between sentences are learned on the sentence-level hidden states h_i, yielding the relative importance of each sentence representation; this is normalized with softmax to obtain the attention weight β_i, and finally the attention weights and the sentence-level hidden states h_i are combined by weighted summation to obtain the text self-enhanced sentence representation s_i:
u_i = W_u tanh(W_s h_i + b_s)
where W_u, W_s, b_s and b_u are the respective weights and biases, and β_i reflects the different importance of each sentence in the document;
Step four, because it is difficult to obtain emotion information directly from the images, the visual emotion features must be learned by relying on the sentence context. The visually enhanced sentence representation t_j from step two and the text self-enhanced sentence representation s_i from step three are used, with s_i serving as the context that guides the image-enhanced sentences to pay more attention to emotion-related features. The context-guided complementary fusion network CGCFN mainly comprises a context guide module CGM and a multi-modal complementary fusion module MCFM;
to this end, in the context-guided complementary fusion network CGCFN (Context Guide Complementary Fusion Network), the context guide module CGM (Context Guide Module), one of the core modules, mainly relies on a context-guided attention mechanism: through the context representation, the visually enhanced sentences learn the features that vision and text share with respect to emotion, yielding the potential visual emotion features. The context representation s_i and the image-related sentence representation t_j are projected into the same space by means of different parameter matrices, and their degree of correlation is computed to obtain the emotion weight coefficient γ_{j,i} of the visually enhanced sentence and, further, the visual emotion representation c_i; the emotion weight coefficient is computed as follows:
u_j = tanh(W_u t_j + b_e)
v_i = tanh(W_v s_i + b_f)
c_i = γ_{j,i} · t_j
where W_u ∈ R^{c×e}, W_v ∈ R^{c×e}, b_u ∈ R^c and b_v ∈ R^c are the weight and bias parameters respectively, a sigmoid function is adopted, and γ_{j,i} embodies how the context representation guides the visually enhanced sentences to capture emotion-related information;
Step five, because the content reflected by the images in a review is mostly singular and cannot cover all the important aspects in the text, the complementary or reinforcing relationship between the modalities is adjusted dynamically by learning the interaction between vision and text. When the correlation between the image and the text is high, this correlation is used to strengthen the visual emotion representation c_i; when the correlation is not high, the text self-enhanced sentence representation s_i is relied upon as a complementary feature. In the context-guided complementary fusion network CGCFN, the multi-modal complementary fusion module MCFM (Multi-modal Complementary Fusion Module), one of the core modules, consists of a gate function and a self-attention mechanism: the gate function learns the cross-modal interaction, assigns different weights to the visual emotion features and dynamically converts the relationship between the modalities, and the self-attention mechanism fuses the text features and the visual emotion features to obtain the final multi-modal representation. Specifically, the degree of correlation between the image and the text is first computed: the visual feature embedding and the text representation are projected into the same space through one layer of neurons with a nonlinear transformation and then multiplied, and the modal gate function g_{j,i} is obtained through a nonlinear transformation with the sigmoid activation function. The gate function g_{j,i} and the visual emotion representation c_i then learn their interaction through element-wise multiplication, and at the same time the text self-enhanced sentence representation s_i is added element-wise, yielding the adaptive multi-modal emotion representation d_i:
e_j = tanh(W_e m_j + b_e)
f_i = tanh(W_f h_i + b_f)
where W_e, W_f, b_e and b_f are the corresponding weights and biases, and ⊙ denotes element-wise multiplication. The gate function g_{j,i}, which learns the correlation between vision and text, dynamically adjusts the relationship between the two modal representations through back-propagation: when the image and the text are closely related, g_{j,i} is large and the visual emotion representation c_i contributes more to the multi-modal representation; conversely, when the image-text correspondence is weak, g_{j,i} is small, ensuring that the current multi-modal representation depends more on the sentence representation s_i of the text itself;
Step six, the multi-modal emotion representation d_i from step five is input into the self-attention mechanism of the multi-modal complementary fusion module MCFM, and effective multi-modal fusion is performed to obtain the multi-modal representation d relevant to the emotion classification task. In addition, the input of the pre-training model BERT is a sequence whose first token is the classification token [CLS]; the final hidden state corresponding to [CLS] learns global information and is often used as the aggregated sequence representation in classification tasks. Therefore, d_cls is input into a fully connected layer and a softmax function to obtain the final emotion prediction φ:
k_i = W_k(W_d d_i + b_d)
φ = softmax(W_c d_cls + b_c)
where W_d, W_k and b_d are the weights and bias of the MLP, and δ_i reflects the contribution that the emotion representation of each modality makes to the final multi-modal representation;
Step seven, the above describes the multi-modal emotion prediction process; the model is trained in an end-to-end fashion by minimizing the cross-entropy loss function between the prediction φ and the true label l of document d.
The present invention is compared to previously proposed multimodal emotion analysis models as shown in table 1:
table 1: comparison of the results of the method of the invention with the other methods
Table 1 lists the comparison between the reference models and the proposed model. It can be seen that the average accuracy of the context-guided complementary fusion network CGCFN (Context Guide Complementary Fusion Network) on the Yelp data set reaches 65.8%, which is 6.3% higher than that of the VistaNet model and 4.8% higher than that of SFNN. The effect obtained on the CH data set is inferior to that of the reference model, which is because the CH data set contains fewer samples than the LA, NY and other data sets and cannot cover the general characteristics of a larger data set.
Further, the effectiveness of each module of the context-guided complementary fusion network CGCFN is studied by ablation experiments: starting from the most basic configuration, modules are added step by step to form the final model architecture, as shown in Table 2.
table 2: ablation experiment of the method of the invention
First, only the text part is relied on, i.e. the text feature extraction module and the self-attention mechanism over the sequence; as shown in the first row, the average accuracy is 63.09%. The image features extracted with ResNet152 are then used to learn the important sentence representations through the visual attention module, giving an effect 0.5% higher than the text part alone, as shown in the second row of Table 2. Next, the context guide module is added and the visual emotion representation is obtained through further learning; the effect improves by 1.6%, as shown in the third row of Table 2. Finally, the multi-modal complementary fusion module is added, and the emotion representations of the text and visual modalities are fused in an effective, balanced and adaptive way with the gate function and the self-attention mechanism; as shown in the fourth row of Table 2, the average accuracy reaches 65.80%, an improvement of 4.3%. The ablation results show that each sub-module of the context-guided complementary fusion network CGCFN makes its own contribution and is effective.
In this embodiment, referring to FIG. 1, the review text "Redeem after 9:30 pm! Compared with the outside, the inside is clean and tidy." is first preprocessed and input into the pre-training model, namely the bidirectional encoder representations from Transformers, to obtain the sequence features; the 3 images of the review are preprocessed and input into the deep residual network to obtain the image representations.
Next, the enhanced sequence features are learned from two angles. The first starts from vision: the visual attention module uses the image features of each image to interact with the sequence features through inner products, obtaining the visually enhanced sequence features. The second starts from the text: a sentence-level self-attention mechanism learns the relationships within the sequence to obtain the text self-enhanced sequence features. Because the images of the review do not contain emotion beyond the text, the context guide module uses the text self-enhanced sequence features as the context to guide the visually enhanced sequence features to attend to the emotion information shared by the images and the text, obtaining the visual emotion features.
Then, the gate function of the multi-modal complementary fusion module dynamically adjusts the weights of the visual emotion features and the text self-enhanced sequence features in the multi-modal representation by learning the interaction of the text and the images; the result is input into the self-attention mechanism for multi-modal fusion, and the features important to the emotion classification task are learned on the basis of the overall emotion of the document, giving the multi-modal document representation.
Finally, the [CLS] token of the multi-modal document representation is input into the emotion classifier for prediction, as composed in the sketch below.
In conclusion, the invention not only emphasizes important text information but can also learn the potential emotion expression of vision in the text. The fused representation contains not only the important aspects shared by the modalities but also the unique aspects of the text that vision ignores, while fusing adaptively by relying on the overall emotion. The average accuracy is improved from 62.80 to 65.80, and the improvement rate rises from 43.1% to 49.92%.
Claims (1)
1. A multi-modal emotion classification method based on adaptive fusion with an attention mechanism, comprising a pre-training model for learning bidirectional context information, namely the BERT_base model, a network with 768 hidden units, 12 attention heads and feed-forward sublayers, characterized by the following steps:
step one, a word sequence X = [x_1, x_2, ..., x_N] is given, where x_i is the sum of the word, segment and position embeddings and N is the maximum length of the sequence; the word sequence X is input into the pre-training model for encoding, and the output of the last encoder layer is taken as the sentence hidden state h_i:
h_i = BERT(X_i)
for the multiple images attached to the document, the images are uniformly resized to 224 × 224, the last fully connected layer of the residual network is removed, and the output of the last convolutional layer is used as the representation of image I_j:
a_j = ResNet(I_j)
the image representation a_j is a 2048-dimensional vector encoded from the image I_j;
step two, emotion-related features in the sentence sequence are captured from different angles: first, from the visual angle, the sentences in the document that are related to the images are enhanced with the image information, and the visual feature embedding a_j and the sentence-level hidden state h_i are projected into the same space with a nonlinear transformation;
then the degree of correlation between the sentence hidden states and a specific image is learned by matrix multiplication, a softmax function is applied to obtain the weights α_{j,i}, and a weighted summation yields the visually enhanced sentence representation t_j:
p_j = relu(W_p a_j + b_p)
q_i = relu(W_q h_i + b_q)
t_j = Σ_i α_{j,i} h_i
where W_p, W_q, b_p and b_q are the weights and biases of the multi-layer perceptrons, relu is the nonlinear activation function, and α_{j,i} captures the correlation between the visual representation and the sentence hidden state;
step three, the sentences containing emotion information are more important than the sentences describing facts; therefore, a self-attention mechanism is adopted at the sentence level so that the relations between sentences are learned on the sentence-level hidden states h_i, yielding the relative importance of each sentence representation; this is normalized with softmax to obtain the attention weight β_i, and finally the attention weights and the sentence-level hidden states h_i are combined by weighted summation to obtain the text self-enhanced sentence representation s_i:
u_i = W_u tanh(W_s h_i + b_s)
where W_u, W_s, b_s and b_u are the respective weights and biases, and β_i reflects the different importance of each sentence in the document;
step four, the visual emotion features are learned by relying on the sentence context: the visually enhanced sentence representation t_j from step two and the text self-enhanced sentence representation s_i from step three are used, with s_i serving as the context that guides the image-enhanced sentences to pay more attention to emotion-related features; the context-guided complementary fusion network CGCFN mainly comprises a context guide module CGM and a multi-modal complementary fusion module MCFM;
the context guide module CGM, one of the core modules in the context-guided complementary fusion network CGCFN, mainly relies on a context-guided attention mechanism: through the context representation, the visually enhanced sentences learn the features that vision and text share with respect to emotion, yielding the potential visual emotion features; the context representation s_i and the image-related sentence representation t_j are projected into the same space by means of different parameter matrices, and their degree of correlation is computed to obtain the emotion weight coefficient γ_{j,i} of the visually enhanced sentence and, further, the visual emotion representation c_i; the emotion weight coefficient is computed as follows:
u_j = tanh(W_u t_j + b_e)
v_i = tanh(W_v s_i + b_f)
c_i = γ_{j,i} · t_j
where W_u ∈ R^{c×e}, W_v ∈ R^{c×e}, b_u ∈ R^c and b_v ∈ R^c are the weight and bias parameters respectively, a sigmoid function is adopted, and γ_{j,i} embodies how the context representation guides the visually enhanced sentences to capture emotion-related information;
step five, the complementary or reinforcing relationship between the modalities is adjusted dynamically by learning the interaction between vision and text: when the correlation between the image and the text is high, this correlation is used to strengthen the visual emotion representation c_i; when the correlation is not high, the text self-enhanced sentence representation s_i is relied upon as a complementary feature; to this end, the multi-modal complementary fusion module MCFM, one of the core modules in the context-guided complementary fusion network CGCFN, consists of a gate function and a self-attention mechanism: the gate function learns the cross-modal interaction, assigns different weights to the visual emotion features and dynamically converts the relationship between the modalities, and the self-attention mechanism fuses the text features and the visual emotion features to obtain the final multi-modal representation; specifically, the degree of correlation between the image and the text is first computed: the visual feature embedding and the text representation are projected into the same space through one layer of neurons with a nonlinear transformation and then multiplied, and the modal gate function g_{j,i} is obtained through a nonlinear transformation with the sigmoid activation function; the gate function g_{j,i} and the visual emotion representation c_i then learn their interaction through element-wise multiplication, and at the same time the text self-enhanced sentence representation s_i is added element-wise, yielding the adaptive multi-modal emotion representation d_i:
e_j = tanh(W_e m_j + b_e)
f_i = tanh(W_f h_i + b_f)
where W_e, W_f, b_e and b_f are the corresponding weights and biases, and ⊙ denotes element-wise multiplication; the gate function g_{j,i}, which learns the correlation between vision and text, dynamically adjusts the relationship between the two modal representations through back-propagation: when the image and the text are closely related, g_{j,i} is large and the visual emotion representation c_i contributes more to the multi-modal representation; conversely, when the image-text correspondence is weak, g_{j,i} is small, ensuring that the current multi-modal representation depends more on the sentence representation s_i of the text itself;
step six, the multi-modal emotion representation d_i from step five is input into the self-attention mechanism of the multi-modal complementary fusion module MCFM, and effective multi-modal fusion is performed to obtain the multi-modal representation d relevant to the emotion classification task; in addition, the input of the pre-training model BERT is a sequence whose first token is the classification token [CLS]; the final hidden state corresponding to [CLS] learns global information and is often used as the aggregated sequence representation in classification tasks; therefore, d_cls is input into a fully connected layer and a softmax function to obtain the final emotion prediction φ:
k_i = W_k(W_d d_i + b_d)
φ = softmax(W_c d_cls + b_c)
where W_d, W_k and b_d are the weights and bias of the MLP, and δ_i reflects the contribution that the emotion representation of each modality makes to the final multi-modal representation;
step seven, the above describes the multi-modal emotion prediction process; the model is trained in an end-to-end fashion by minimizing the cross-entropy loss function between the prediction φ and the true label l of document d.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110703330.7A CN113435496B (en) | 2021-06-24 | 2021-06-24 | Self-adaptive fusion multi-mode emotion classification method based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110703330.7A CN113435496B (en) | 2021-06-24 | 2021-06-24 | Self-adaptive fusion multi-mode emotion classification method based on attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113435496A true CN113435496A (en) | 2021-09-24 |
CN113435496B CN113435496B (en) | 2022-09-02 |
Family
ID=77753844
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110703330.7A Active CN113435496B (en) | 2021-06-24 | 2021-06-24 | Self-adaptive fusion multi-mode emotion classification method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113435496B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815903A (en) * | 2019-01-24 | 2019-05-28 | 同济大学 | A kind of video feeling classification method based on adaptive converged network |
CN110874411A (en) * | 2019-11-20 | 2020-03-10 | 福州大学 | Cross-domain emotion classification system based on attention mechanism fusion |
CN111275085A (en) * | 2020-01-15 | 2020-06-12 | 重庆邮电大学 | Online short video multi-modal emotion recognition method based on attention fusion |
CN111753549A (en) * | 2020-05-22 | 2020-10-09 | 江苏大学 | Multi-mode emotion feature learning and recognition method based on attention mechanism |
CN112348075A (en) * | 2020-11-02 | 2021-02-09 | 大连理工大学 | Multi-mode emotion recognition method based on contextual attention neural network |
CN112559683A (en) * | 2020-12-11 | 2021-03-26 | 苏州元启创人工智能科技有限公司 | Multi-mode data and multi-interaction memory network-based aspect-level emotion analysis method |
CN112860888A (en) * | 2021-01-26 | 2021-05-28 | 中山大学 | Attention mechanism-based bimodal emotion analysis method |
Non-Patent Citations (2)
Title |
---|
FEIRAN HUANG 等: "Image-text sentiment analysis via deep multimodal attention fusion", 《KNOWLEDGE-BASED SYSTEM》 * |
吴良庆等: "基于情感信息辅助的多模态情绪识别", 《北京大学学报(自然科学版)》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113850842A (en) * | 2021-09-26 | 2021-12-28 | 北京理工大学 | Anti-occlusion target tracking method based on attention mask |
CN114170460A (en) * | 2021-11-24 | 2022-03-11 | 北京化工大学 | Multi-mode fusion-based artwork classification method and system |
CN114169450A (en) * | 2021-12-10 | 2022-03-11 | 同济大学 | Social media data multi-modal attitude analysis method |
CN114626441A (en) * | 2022-02-23 | 2022-06-14 | 苏州大学 | Implicit multi-mode matching method and system based on visual contrast attention |
CN115034202A (en) * | 2022-04-13 | 2022-09-09 | 天津大学 | Deep learning text matching method based on enhancement mode fusion grammar information |
CN115083005A (en) * | 2022-06-13 | 2022-09-20 | 广东省人民医院 | ROP image classification system and method based on deep learning |
CN114969458A (en) * | 2022-06-28 | 2022-08-30 | 昆明理工大学 | Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance |
CN114969458B (en) * | 2022-06-28 | 2024-04-26 | 昆明理工大学 | Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion |
CN115019237A (en) * | 2022-06-30 | 2022-09-06 | 中国电信股份有限公司 | Multi-modal emotion analysis method and device, electronic equipment and storage medium |
CN115019237B (en) * | 2022-06-30 | 2023-12-08 | 中国电信股份有限公司 | Multi-mode emotion analysis method and device, electronic equipment and storage medium |
CN115730153A (en) * | 2022-08-30 | 2023-03-03 | 郑州轻工业大学 | Multi-mode emotion analysis method based on emotion correlation and emotion label generation |
CN115730153B (en) * | 2022-08-30 | 2023-05-26 | 郑州轻工业大学 | Multi-mode emotion analysis method based on emotion association and emotion label generation |
CN116719930A (en) * | 2023-04-28 | 2023-09-08 | 西安工程大学 | Multi-mode emotion analysis method based on visual attention |
CN117033733A (en) * | 2023-10-09 | 2023-11-10 | 北京民谐文化传播有限公司 | Intelligent automatic classification and label generation system and method for library resources |
CN117033733B (en) * | 2023-10-09 | 2023-12-22 | 北京民谐文化传播有限公司 | Intelligent automatic classification and label generation system and method for library resources |
Also Published As
Publication number | Publication date |
---|---|
CN113435496B (en) | 2022-09-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113435496B (en) | Self-adaptive fusion multi-mode emotion classification method based on attention mechanism | |
CN109933664B (en) | Fine-grained emotion analysis improvement method based on emotion word embedding | |
KR102222451B1 (en) | An apparatus for predicting the status of user's psychology and a method thereof | |
CN111246256B (en) | Video recommendation method based on multi-mode video content and multi-task learning | |
US11227108B2 (en) | Convolutional neural network architecture with adaptive filters | |
Bilquise et al. | Emotionally intelligent chatbots: a systematic literature review | |
CN109344404B (en) | Context-aware dual-attention natural language reasoning method | |
CN109284506A (en) | A kind of user comment sentiment analysis system and method based on attention convolutional neural networks | |
CN107862087A (en) | Sentiment analysis method, apparatus and storage medium based on big data and deep learning | |
CN112199956A (en) | Entity emotion analysis method based on deep representation learning | |
CN112131469A (en) | Deep learning recommendation method based on comment text | |
Shen et al. | A voice of the customer real-time strategy: An integrated quality function deployment approach | |
CN112989033A (en) | Microblog emotion classification method based on emotion category description | |
CN111274396B (en) | Visual angle level text emotion classification method and system based on external knowledge | |
CN115630145A (en) | Multi-granularity emotion-based conversation recommendation method and system | |
CN112307755A (en) | Multi-feature and deep learning-based spam comment identification method | |
CN113255360A (en) | Document rating method and device based on hierarchical self-attention network | |
CN113268592B (en) | Short text object emotion classification method based on multi-level interactive attention mechanism | |
Das | A multimodal approach to sarcasm detection on social media | |
CN113570154A (en) | Multi-granularity interactive recommendation method and system fusing dynamic interests of users | |
CN115659990A (en) | Tobacco emotion analysis method, device and medium | |
Lin et al. | Social media popularity prediction based on multi-modal self-attention mechanisms | |
Suddul et al. | A Smart Virtual Tutor with Facial Emotion Recognition for Online Learning | |
Shi et al. | Product feature extraction from Chinese online reviews: Application to product improvement | |
CN115309894A (en) | Text emotion classification method and device based on confrontation training and TF-IDF |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||