CN113435496A - Self-adaptive fusion multi-mode emotion classification method based on attention mechanism - Google Patents

Self-adaptive fusion multi-mode emotion classification method based on attention mechanism

Info

Publication number
CN113435496A
Authority
CN
China
Prior art keywords
emotion
sentence
text
representation
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110703330.7A
Other languages
Chinese (zh)
Other versions
CN113435496B (en)
Inventor
蒋斌
袁梦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202110703330.7A priority Critical patent/CN113435496B/en
Publication of CN113435496A publication Critical patent/CN113435496A/en
Application granted granted Critical
Publication of CN113435496B publication Critical patent/CN113435496B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to the field of multi-modal emotion analysis, and in particular to a multi-modal emotion classification method based on adaptive fusion with an attention mechanism. The technical scheme is as follows: given a word sequence X = [x_1, x_2, ..., x_N], emotion-related features in the sentence sequence are captured from different angles, and visual emotion features are learned by relying on the sentence context; the multi-modal emotion prediction process is carried out through a context guide module and a multi-modal complementary fusion module. The method has the advantages that (1) it not only emphasizes important text information but can also learn the latent visual emotion representation in the text; (2) the fused representation contains not only the important aspects shared by the modalities but also the aspects unique to the text that the visual modality ignores, while fusing adaptively according to the overall emotion; (3) the average accuracy is improved from 62.80 to 65.80, and the improvement rate rises from 43.1% to 49.92%.

Description

Self-adaptive fusion multi-mode emotion classification method based on attention mechanism
Technical Field
The invention relates to the field of multi-modal emotion analysis, and in particular to a multi-modal emotion classification method based on adaptive fusion with an attention mechanism.
Background
With the rapid development of the internet, users are increasingly willing to publish content (such as reviews) on social platforms, which has caused rapid growth of information on the internet. It is reported that 90% of consumers tend to read reviews about goods before purchasing, and 88% of consumers choose to view reviews and suggestions written by acquaintances. Businesses therefore want to know consumer preferences and opinions to inform product design and marketing strategies. Emotion analysis is central to understanding user-generated content. At present, emotion analysis based on a single modality has been the focus of previous research, and text-based emotion classification methods can be roughly divided into three types: methods based on emotion dictionaries, traditional machine-learning methods, and deep-learning methods. With the popularity of smartphones, content on social platforms has gradually evolved from single-modality documents into rich documents combining multiple modalities (such as text and images), and this user-generated multi-modal content describes the user's experience more accurately than text alone.
Introducing visual information alleviates, to a certain extent, the problem of the text misidentifying important aspects. How to fuse multiple modalities effectively and improve emotion classification is a hot problem in the field. There are two main views of the relationship between the modalities. One view holds that image and text are equally effective and can both be regarded as independent features contributing to emotion classification: image and text features can be obtained separately with deep neural networks and combined with a multi-layer perceptron (MLP) to infer the user's latent emotional state; the Deep Multimodal Attention Fusion (DMAF) framework exploits the distinctive features and intrinsic relevance between the visual and textual modalities; an image-text consistency method judges whether the contents of the two are consistent and then adaptively fuses text features with visual features extracted by the traditional SentiBank, using a Support Vector Machine (SVM) to classify image-text emotion; two-layer multimodal hypergraph learning (Bi-MHG) explicitly models the correlation between the visual, textual and emoticon modalities by connecting two layers that share the relevance of multimodal features; and, considering that the fusion of multi-modal information should be based on the overall emotion, an attention-based modality-gated network (AMGN) has been proposed in which a modality-gated long short-term memory network (LSTM) adaptively selects the modality with stronger emotion to learn multi-modal features. The other view holds that text is dominant, and images should be treated as an aid that helps the text focus on important aspects.
Traditional machine-learning-based methods consider that the images in a review can emphasize certain important aspects, so visual information plays an auxiliary role for the overall emotion rather than serving as an independent feature; in the prior art, when a visual aspect attention network is designed, the image guides the model to focus on important sentences in the review. However, the inventors believe that the image in a review is a presentation of a specific thing, so such an attention network tends to give more attention to sentences describing facts rather than sentences reflecting emotion, and does not make full use of the latent information of the image. Obtaining emotion features directly from images is tricky, and text reviews are needed to further mine features that are valid for the emotion classification task. In the present model, a context guide module makes the visually enhanced sentences pay attention to the important emotion-related aspects, so that the latent visual emotion is learned. In addition, the image categories in online review datasets are mostly limited, basically presenting the food and environment of a restaurant, whereas the reviews considered by the invention tend to be comprehensive and cover more aspects. Therefore, the aspects embodied by the images do not cover all aspects important for document-level emotion classification. A previously proposed remedy is to use a global MEAN image as compensation, but the missing aspects differ from review to review, so such processing is too simple and not flexible enough. The inventors consider that the aspects reflected by important sentences in the text can serve as complementary features, which solves the problem that the image focuses on only a single aspect and yields more accurate classification results than capturing important aspects from the image alone.
Disclosure of Invention
The invention aims to provide a multi-modal emotion classification method based on adaptive fusion of an attention mechanism, so that the defects in the prior art are overcome.
The invention is realized by the following technical scheme, which comprises a pre-training model for learning bidirectional context information, namely the BERT_base model, a feed-forward network with 768 hidden units and 12 attention heads, and comprises the following steps:
step one, a word sequence X is given as [ X ]1,x2,....xN]Wherein x isiIs the sum of words, segments and position embeddings and N is the maximum length of the sequence. Inputting the word sequence X into a pre-training model for coding, and taking the output of the last layer of the coder as a sentence hiding state hi
hi=Bert(Xi)
For multiple images attached to the document, the size of the images is uniformly adjusted to 224 x 224, the last complete connection layer of the residual error network is removed, and the output of the last convolution layer is used as IjAn image representation of (a);
aj=Re sNet(Ij)
image representation ajIs from the image IjEncoding the obtained 2048-dimensional vector;
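To make this feature-extraction step concrete, the following is a minimal PyTorch-style sketch of one plausible implementation: sentences are encoded with a pre-trained BERT_base model and each attached image is resized to 224 x 224 and encoded into a 2048-dimensional vector by a residual network whose last fully connected layer is removed. The library choices, variable names, the use of the [CLS] position as the sentence state, and ResNet-152 in particular (the ablation study below mentions resnet152) are assumptions, not the patent's reference code.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel   # assumed Hugging Face BERT_base
from torchvision import models, transforms
from PIL import Image

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")  # 768 hidden units, 12 attention heads

def encode_sentences(sentences, max_len=128):
    """h_i = BERT(X_i): word, segment and position embeddings are handled inside BERT."""
    enc = tokenizer(sentences, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    # assumption: use the last-layer [CLS] state as the 768-d hidden state of each sentence
    return out.last_hidden_state[:, 0, :]

resnet = models.resnet152(weights="IMAGENET1K_V1")   # torchvision >= 0.13; older versions use pretrained=True
resnet.fc = nn.Identity()                            # drop the last fully connected layer -> 2048-d output
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                   # uniformly resize images to 224 x 224
    transforms.ToTensor(),
])

def encode_images(paths):
    """a_j = ResNet(I_j): one 2048-dimensional vector per attached image."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        return resnet(batch)                         # shape (num_images, 2048)
```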
capturing emotion-related features in a sentence sequence from different angles, firstly enhancing the sentences related to the sentences in the document by using image information from the visual angle, and embedding the visual features into the sentences a by using nonlinear conversionjAnd sentence level hidden state hiProjecting to the same space;
then, the correlation degree between the hidden state of the sentence and the specific image is learned by matrix multiplication, and a softmax function is applied to obtain a weight alphaj,iWeighted summation to obtain sentence characterization t with visual enhancementj
pj=relu(Wpaj+bp)
qi=relu(Wqhi+bq)
Figure BDA0003130279910000031
tj=αj,ihi
Wherein, Wp、Wq、bpAnd bqFor weighting and biasing of multi-tier perceptrons, rel is usedu nonlinear activation function, where αj,iCapturing a correlation of the visual representation and the sentence hiding state;
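A minimal sketch of this visual attention step, assuming PyTorch and the dimensions used above (2048-d image features, 768-d sentence states); the projection size and class name are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttention(nn.Module):
    """Visually enhanced sentence representations t_j (a sketch of step two)."""
    def __init__(self, img_dim=2048, txt_dim=768, proj_dim=256):
        super().__init__()
        self.W_p = nn.Linear(img_dim, proj_dim)   # projects image features a_j
        self.W_q = nn.Linear(txt_dim, proj_dim)   # projects sentence hidden states h_i

    def forward(self, a, h):
        # a: (J, img_dim) image representations; h: (I, txt_dim) sentence hidden states
        p = F.relu(self.W_p(a))                   # p_j = relu(W_p a_j + b_p)
        q = F.relu(self.W_q(h))                   # q_i = relu(W_q h_i + b_q)
        alpha = F.softmax(p @ q.t(), dim=-1)      # alpha_{j,i}: image-sentence correlation
        t = alpha @ h                             # t_j = sum_i alpha_{j,i} h_i
        return t, alpha
```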
thirdly, from the perspective of a text, in the emotion analysis task, sentences containing emotion information are more important than sentences describing facts; therefore, a self-attention mechanism is adopted at the sentence level to ensure that the sentence level is in a hidden state hiLearning the relation between sentences to obtain the relative importance of sentence representation, normalizing it with softmax to obtain its attention weight βiFinally, attention is paid to the weights and sentence-level hidden states hiObtaining the sentence representation s with self-enhanced text after weighted summationi
ui=Wu tanh(Wshi+bs)
Figure BDA0003130279910000032
Figure BDA0003130279910000033
Wherein, Wu、Ws、bsAnd buAre respective weights and offsets, of whichiReflecting the different importance degree of each sentence in the document;
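A corresponding sketch of the sentence-level self-attention; reading s_i as the hidden state scaled by its attention weight (so that a per-sentence representation is kept for the later steps) is one plausible interpretation of the weighted summation described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceSelfAttention(nn.Module):
    """Text self-enhanced sentence representations s_i (a sketch of step three)."""
    def __init__(self, txt_dim=768, attn_dim=256):
        super().__init__()
        self.W_s = nn.Linear(txt_dim, attn_dim)
        self.W_u = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h):
        # h: (I, txt_dim) sentence hidden states
        u = self.W_u(torch.tanh(self.W_s(h)))     # u_i = W_u tanh(W_s h_i + b_s)
        beta = F.softmax(u, dim=0)                # beta_i: relative importance of sentence i
        s = beta * h                              # s_i = beta_i h_i, kept per sentence
        return s, beta
```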
step four, because it is difficult to directly obtain the emotional information of the image, the visual emotional characteristics need to be learned by depending on the sentence context; obtaining sentence expression t after visual enhancement according to the step twojThe sentence with the self-enhanced text in the step three is represented as siAs context, it is made to guide the image-enhanced sentence to pay more attention to features related to emotion; the context guidance complementary fusion network CGCFN mainly comprises a context guidance module CGM and a multi-mode complementary fusion module MCFM;
for this purpose, in the context-oriented complementary convergence network CGCFN, the context-oriented module cgm (context Guide module) of one of the core modules is given a context representation by mainly relying on the context-oriented attention mechanismThe sentences after visual enhancement learn common characteristics of vision and texts in the aspect of emotion to obtain visual potential emotional characteristics. Characterizing a context siSentence t related to imagejProjecting the images to the same space by means of different parameter matrixes, and calculating the correlation degree of the two to obtain the emotion weight coefficient gamma of the vision enhancement sentencej,iFurther obtaining a visual emotion representation ci(ii) a The calculation process for calculating the emotion weight coefficient is as follows:
uj=tanh(Wutj+be)
vi=tanh(Wvsi+bf)
Figure BDA0003130279910000041
ci=γj,itj
wherein, Wu∈Rcxe,Wv∈Rcxe,bu∈Rc,bv∈RcRespectively as weight and bias parameters, using sigmoid function, gammaj,iThe relevance of sentence capture and emotion information embodying the context characterization guide vision enhancement;
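A sketch of the Context Guide Module under the same assumptions; the indexing convention (rows indexed by sentences i, columns by the visually enhanced representations j) and the projection size are choices made only for illustration.

```python
import torch
import torch.nn as nn

class ContextGuideModule(nn.Module):
    """CGM: the text self-enhanced context s guides the visually enhanced sentences t
    toward emotion-related features (a sketch of step four)."""
    def __init__(self, dim=768, proj_dim=256):
        super().__init__()
        self.W_u = nn.Linear(dim, proj_dim)   # projects t_j
        self.W_v = nn.Linear(dim, proj_dim)   # projects s_i

    def forward(self, t, s):
        # t: (J, dim) visually enhanced sentences; s: (I, dim) context representations
        u = torch.tanh(self.W_u(t))           # u_j = tanh(W_u t_j + b_e)
        v = torch.tanh(self.W_v(s))           # v_i = tanh(W_v s_i + b_f)
        gamma = torch.sigmoid(v @ u.t())      # gamma_{i,j}: emotion weight of each visually enhanced sentence
        c = gamma @ t                         # c_i = sum_j gamma_{i,j} t_j -> visual emotion representation
        return c, gamma
```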
step five, because most of the contents reflected by the images in the comments are single, all important aspects in the text cannot be covered, and the complementary or enhanced relation between the modalities is dynamically adjusted by learning the interaction between the vision and the text; when the image and text are high in relevance, the relevance of the image and the text is utilized to strengthen the visual emotion representation tj(ii) a Relying on text-enhanced sentence representations s when image and text relevance is not highiAs Complementary features, in the context-oriented Complementary Fusion network CGCFN, a multi-modal Complementary Fusion module mcfm (multi Complementary Fusion module) which is one of the core modules is composed of a gate function and a self-attention mechanism, the gate function learns the interaction between the cross-modalities, different weights are given to the visual emotion features, the relationship between the modalities is dynamically converted, the self-attention mechanism fuses the text features and the visual emotion features, and the final multi-modal Complementary Fusion module mcfm (multi Complementary Fusion module) is obtainedAnd (6) state representation. Specifically, firstly, the degree of correlation between an image and a text is calculated, visual feature embedding and text representation are projected to the same space through a layer of neurons with nonlinear conversion, then the visual feature embedding and the text representation are multiplied, and a mode gate function g is obtained through nonlinear conversion with an activation function sigmoidj,iThen let a gate function gj,iAnd characterization of visual emotion ciLearning the interaction of the two by the element multiplication method, and simultaneously adding the sentence s with self-enhanced text by the element addition methodiObtaining an adaptive multi-modal emotion representation di
ej=tanh(Wemj+be)
fi=tanh(Wfhi+bf)
Figure BDA0003130279910000042
Figure BDA0003130279910000043
Wherein, We、Wf、beAnd bfFor the purpose of the corresponding weight and the bias,
Figure BDA0003130279910000044
representing element-by-element multiplication; gate function g for learning relevance of vision and textj,iDynamically adjusting the relationship between the two modal representations using back propagation, g, when the image and text are closely relatedj,iWill be very large, visual emotion characterization ciThe contribution to the multi-modal representation is greater; conversely, when the image-text correspondence is weak, gj,iWill be small, ensuring that the current multi-modal representation is more dependent on the sentence characterization s of the text itselfi
Step six: the multi-modal emotion representations d_i from step five are input into the self-attention mechanism of the multi-modal complementary fusion module MCFM, and effective multi-modal fusion yields the final multi-modal representation d relevant to the emotion classification task. In addition, the input of the pre-training model BERT is a sequence whose first token is the classification token [CLS]; the final hidden state corresponding to [CLS] learns global information and is often used in classification tasks as the aggregated sequence representation. Therefore, d_cls is input into a fully connected layer and a softmax function to obtain the final emotion prediction φ:
k_i = W_k (W_d d_i + b_d)
δ_i = softmax(k_i)
d_cls = Σ_i δ_i d_i
φ = softmax(W_c d_cls + b_c)
where W_d, W_k and b_d are the weights and biases of the multi-layer perceptron MLP, and δ_i reflects the contribution that the emotion representation of each modality makes to the final multi-modal representation;
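A sketch of the fusion and classification head; the number of emotion classes (five, matching Yelp star ratings) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Self-attention fusion over the adaptive representations d_i, then prediction
    of the emotion distribution phi (a sketch of step six)."""
    def __init__(self, dim=768, attn_dim=256, num_classes=5):
        super().__init__()
        self.W_d = nn.Linear(dim, attn_dim)
        self.W_k = nn.Linear(attn_dim, 1, bias=False)
        self.W_c = nn.Linear(dim, num_classes)

    def forward(self, d):
        # d: (I, dim) adaptive multi-modal emotion representations
        k = self.W_k(self.W_d(d))                 # k_i = W_k (W_d d_i + b_d)
        delta = F.softmax(k, dim=0)               # delta_i: contribution of each representation
        d_cls = (delta * d).sum(dim=0)            # aggregated multi-modal representation d_cls
        phi = F.softmax(self.W_c(d_cls), dim=-1)  # final emotion prediction phi
        return phi, d_cls
```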
step seven, the multi-modal emotion prediction process is described; the model is trained in an end-to-end fashion by minimizing the cross-entropy loss function:
Figure BDA0003130279910000053
where l is the true tag value of document d.
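A minimal end-to-end training step under these definitions; `model` and `train_loader` are hypothetical names, and nn.CrossEntropyLoss is used here because it combines the softmax and the negative log-likelihood, so the model is assumed to expose the pre-softmax scores (logits) rather than φ itself.

```python
import torch
import torch.nn as nn

# model: a module composing the sketches above and returning class logits (hypothetical)
criterion = nn.CrossEntropyLoss()                          # cross-entropy against the true label l
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # learning rate is an assumption

for texts, images, labels in train_loader:                 # train_loader: hypothetical DataLoader
    logits = model(texts, images)                          # forward pass through the whole network
    loss = criterion(logits, labels)                       # minimize cross-entropy, end to end
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```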
The method has the following advantages. (1) On the basis of a visual attention mechanism and a self-attention mechanism, a context guide module is constructed that guides the model, using the visually enhanced sentence representations together with the context information, to further mine the emotion information in the image; thus not only is important text information emphasized, but the latent visual emotion representation in the text can also be learned. (2) A multi-modal complementary fusion method is provided in which the relationship between the modalities' emotion representations is dynamically adjusted through a gate function; the adjusted multi-modal representation contains not only the important aspects shared by the modalities but also the aspects unique to the text that the visual modality ignores, and is fused adaptively according to the overall emotion. (3) Extensive experiments on the Yelp dataset show that the invention outperforms the methods of the prior art: the accuracy (Acc) on the five city datasets improves from 67.61, 70.70, 62.38, 61.45 and 62.40 to 71.43, 69.54, 65.47, 64.61 and 66.32 respectively, the average accuracy (Avg) improves from 62.80 to 65.80, and the improvement rate rises from 43.1% to 49.92%.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
In order to make the inventive content of the present invention better understood by those skilled in the art, a preferred embodiment of the present invention is further described below with reference to FIG. 1. It comprises a pre-training model for learning bidirectional context information, namely the BERT_base model, a feed-forward network with 768 hidden units and 12 attention heads, and is characterized by the following steps:
step one, a word sequence X is given as [ X ]1,x2,....xN]Wherein x isiIs the sum of words, segments and position embedding, the word sequence X is input into a pre-training model for coding, and the output of the last layer of a coder is taken as a sentence hiding state hi:hi=Bert(Xi)
For multiple images attached to the document, the size of the images is uniformly adjusted to 224 x 224, the last complete connection layer of the residual error network is removed, and the output of the last convolution layer is used as IjAn image representation of (a);
aj=ResNet(Ij)
image representation ajIs from the image IjEncoding the obtained 2048-dimensional vector;
capturing emotion-related features in the sentence sequence from different angles, firstly enhancing the sentences related to the sentences in the document by using image information from the visual angle, and embedding the visual features into the sentences by using nonlinear conversionajAnd sentence level hidden state hiProjecting to the same space;
then, the correlation degree between the hidden state of the sentence and the specific image is learned by matrix multiplication, and a softmax function is applied to obtain a weight alphaj,iWeighted summation to obtain sentence characterization t with visual enhancementj
pj=relu(Wpaj+bp)
qi=relu(Wqhi+bq)
Figure BDA0003130279910000061
tj=αj,ihi
Wherein, Wp、Wq、bpAnd bqFor the weights and biases of the multi-layer perceptron, a relu nonlinear activation function is used, where alphaj,iCapturing a correlation of the visual representation and the sentence hiding state;
thirdly, from the perspective of a text, in the emotion analysis task, sentences containing emotion information are more important than sentences describing facts; therefore, a self-attention mechanism is adopted at the sentence level to ensure that the sentence level is in a hidden state hiLearning the relation between sentences to obtain the relative importance of sentence representation, normalizing it with softmax to obtain its attention weight βiFinally, attention is paid to the weights and sentence-level hidden states hiObtaining the sentence representation s with self-enhanced text after weighted summationi
ui=Wu tanh(Wshi+bs)
Figure BDA0003130279910000071
Figure BDA0003130279910000072
Wherein, Wu、Ws、bsAnd buAre respective weights and offsets, of whichiReflecting the different importance degree of each sentence in the document;
step four, because it is difficult to directly obtain the emotional information of the image, the visual emotional characteristics need to be learned by depending on the sentence context; obtaining sentence expression t after visual enhancement according to the step twojThe sentence with the self-enhanced text in the step three is represented as siAs context, it is made to guide the image-enhanced sentence to pay more attention to features related to emotion; the context guidance complementary fusion network CGCFN mainly comprises a context guidance module CGM and a multi-mode complementary fusion module MCFM;
therefore, in a context Guide Complementary Fusion network CGCFN (context Guide Complementary Fusion network), a context Guide module CGM (context Guide module) of one of core modules is mainly used for learning common characteristics of vision and texts in the aspect of emotion by virtue of a context Guide attention mechanism through context representation so as to obtain visual potential emotion characteristics; characterizing a context siSentence t related to imagejProjecting the images to the same space by means of different parameter matrixes, and calculating the correlation degree of the two to obtain the emotion weight coefficient gamma of the vision enhancement sentencej,iFurther obtaining a visual emotion representation ci(ii) a The calculation process for calculating the emotion weight coefficient is as follows:
uj=tanh(Wutj+be)
vi=tanh(Wvsi+bf)
Figure BDA0003130279910000073
ci=γj,itj
wherein, Wu∈Rcxe,Wv∈Rcxe,bu∈Rc,bv∈RcRespectively weight and bias parameters, adopting sigmoid function,γj,ithe relevance of sentence capture and emotion information embodying the context characterization guide vision enhancement;
step five, because most of the contents reflected by the images in the comments are single, all important aspects in the text cannot be covered, and the complementary or enhanced relation between the modalities is dynamically adjusted by learning the interaction between the vision and the text; when the image and text are high in relevance, the relevance of the image and the text is utilized to strengthen the visual emotion representation tj(ii) a Relying on text-enhanced sentence representations s when image and text relevance is not highiAs Complementary features, in a context-oriented Complementary Fusion network CGCFN, a multi-modal Complementary Fusion module mcfm (multi Complementary Fusion module) which is one of core modules is composed of a gate function and a self-attention mechanism, the gate function learns interaction between cross-modalities, different weights are given to visual emotion features, the relationship between modalities is dynamically converted, and the self-attention mechanism fuses text features and the visual emotion features to obtain a final multi-modal representation; specifically, firstly, the degree of correlation between an image and a text is calculated, visual feature embedding and text representation are projected to the same space through a layer of neurons with nonlinear conversion, then the visual feature embedding and the text representation are multiplied, and a mode gate function g is obtained through nonlinear conversion with an activation function sigmoidj,iThen let a gate function gj,iAnd characterization of visual emotion ciLearning the interaction of the two by the element multiplication method, and simultaneously adding the sentence s with self-enhanced text by the element addition methodiObtaining an adaptive multi-modal emotion representation di
ej=tanh(Wemj+be)
fi=tanh(Wfhi+bf)
Figure BDA0003130279910000081
Figure BDA0003130279910000082
Wherein, We、Wf、beAnd bfFor the purpose of the corresponding weight and the bias,
Figure BDA0003130279910000083
representing element-by-element multiplication; gate function g for learning relevance of vision and textj,iDynamically adjusting the relationship between the two modal representations using back propagation, g, when the image and text are closely relatedj,iWill be very large, visual emotion characterization ciThe contribution to the multi-modal representation is greater; conversely, when the image-text correspondence is weak, gj,iWill be small, ensuring that the current multi-modal representation is more dependent on the sentence characterization s of the text itselfi
Step six: the multi-modal emotion representations d_i from step five are input into the self-attention mechanism of the multi-modal complementary fusion module MCFM, and effective multi-modal fusion yields the multi-modal representation d relevant to the emotion classification task. In addition, the input of the pre-training model BERT is a sequence whose first token is the classification token [CLS]; the final hidden state corresponding to [CLS] learns global information and is often used in classification tasks as the aggregated sequence representation. Therefore, d_cls is input into a fully connected layer and a softmax function to obtain the final emotion prediction φ:
k_i = W_k (W_d d_i + b_d)
δ_i = softmax(k_i)
d_cls = Σ_i δ_i d_i
φ = softmax(W_c d_cls + b_c)
where W_d, W_k and b_d are the weights and biases of the MLP, and δ_i reflects the contribution that the emotion representation of each modality makes to the final multi-modal representation;
step seven, the multi-modal emotion prediction process is described; the model is trained in an end-to-end fashion by minimizing the cross-entropy loss function:
Figure BDA0003130279910000093
where l is the true tag value of document d.
The present invention is compared to previously proposed multimodal emotion analysis models as shown in table 1:
table 1: comparison of the results of the method of the invention with the other methods
Table 1 lists the comparison between the baseline models and our model. It can be seen that the average accuracy of the context guide complementary fusion network CGCFN (Context Guide Complementary Fusion Network) on the Yelp dataset reaches 65.8%, which is 6.3% higher than the VistaNet model and 4.8% higher than SFNN. The result on the CH dataset is inferior to the baseline model because the CH dataset contains far fewer samples than the LA, NY and other datasets and cannot cover the general characteristics of the larger datasets.
Further, the effectiveness of each module of the context guide complementary fusion network CGCFN is studied through ablation experiments: starting from the most basic configuration, modules are added step by step to form the final model architecture, as shown in Table 2.
table 2: ablation experiment of the method of the invention
First, only the text part is used, i.e. the text feature extraction module and the inter-sentence self-attention mechanism; as shown in the first row, the average accuracy is 63.09%. Then the image features extracted with resnet152 are used to learn important sentence representations through the visual attention module, which is 0.5% higher than the text-only part, as shown in the second row of Table 2. Next, the context guide module is added and the visual emotion representation is learned, improving the result by a further 1.6%, as shown in the third row of Table 2. Finally, the multi-modal complementary fusion module is added, and the gate function and self-attention mechanism perform balanced, adaptive fusion of the text-modality and visual-modality emotion representations; as shown in the fourth row of Table 2, the average accuracy reaches 65.80%, an improvement of 4.3%. The ablation results show that every sub-module of the context guide complementary fusion network CGCFN makes its own contribution and is effective.
In this embodiment, referring to FIG. 1, the review text ("redeem after 9 o' clock 30 pm! Compared with the outside, the inside is clean and tidy.") is first preprocessed and input into the pre-trained model, i.e. Bidirectional Encoder Representations from Transformers (BERT), to obtain the sequence features; the 3 images of the review are preprocessed and input into the deep residual network to obtain the image representations.
Second, the enhanced sequence features are learned from two angles. The first starts from vision: the visual attention module uses the features of each image to interact with the sequence features through inner products, obtaining the visually enhanced sequence features. The second starts from the text: a sentence-level self-attention mechanism learns the relationships within the sequence, obtaining the text self-enhanced sequence features. Because the images of the review do not contain emotion beyond the text, the context guide module uses the text self-enhanced sequence features as context to guide the visually enhanced sequence features toward the emotion information shared by image and text, obtaining the visual emotion features.
Next, the gate function of the multi-modal complementary fusion module learns the interaction between text and images to dynamically adjust the weights of the visual emotion features and the text self-enhanced sequence features in the multi-modal representation; these are then input into the self-attention mechanism for multi-modal fusion, which learns the features important to the emotion classification task based on the overall emotion of the document and yields the multi-modal document representation.
Finally, the [CLS] representation of the multi-modal document is input into the emotion classifier for prediction; the modules can be composed as sketched below.
In conclusion, the method not only emphasizes important text information but can also learn the latent visual emotion representation in the text. The fused representation contains not only the important aspects shared by the modalities but also the aspects unique to the text that the visual modality ignores, and is fused adaptively according to the overall emotion. The average accuracy is improved from 62.80 to 65.80, and the improvement rate rises from 43.1% to 49.92%.

Claims (1)

1. A multi-modal emotion classification method based on adaptive fusion with an attention mechanism, comprising a pre-training model for learning bidirectional context information, namely the BERT_base model, a feed-forward network with 768 hidden units and 12 attention heads, characterized by the following steps:
step one, a word sequence X is given as [ X ]1,x2,....xN]Wherein x isiIs the sum of words, segments and position embeddings and N is the maximum length of the sequence. Inputting the word sequence X into a pre-training model for coding, and taking the output of the last layer of the coder as a sentence hiding state hi
hi=Bert(Xi)
For multiple images attached to the document, the size of the images is uniformly adjusted to 224 x 224, the last complete connection layer of the residual error network is removed, and the output of the last convolution layer is used as IjAn image representation of (a);
aj=ResNet(Ij)
image representation ajIs from the image IjEncoding the obtained 2048-dimensional vector;
capturing emotion-related features in a sentence sequence from different angles, firstly enhancing the sentences related to the sentences in the document by using image information from the visual angle, and embedding the visual features into the sentences a by using nonlinear conversionjAnd sentence level hidden state hiProjecting to the same space;
reuse of matrix multiplication learningThe sentence hiding state and the correlation degree of the specific image are applied, and the softmax function is applied to obtain the weight alphaj,iWeighted summation to obtain sentence characterization t with visual enhancementj
pj=relu(Wpaj+bp)
qi=relu(Wqhi+bq)
Figure FDA0003130279900000011
tj=αj,ihi
Wherein, Wp、Wq、bpAnd bqFor the weights and biases of the multi-layer perceptron, a relu nonlinear activation function is used, where alphaj,iCapturing a correlation of the visual representation and the sentence hiding state;
thirdly, the sentences containing the emotional information are more important than the sentences describing the facts; therefore, a self-attention mechanism is adopted at the sentence level to ensure that the sentence level is in a hidden state hiLearning the relation between sentences to obtain the relative importance of sentence representation, normalizing it with softmax to obtain its attention weight βiFinally, attention is paid to the weights and sentence-level hidden states hiObtaining the sentence representation s with self-enhanced text after weighted summationi
ui=Wutanh(Wshi+bs)
Figure FDA0003130279900000021
Figure FDA0003130279900000022
Wherein, Wu、Ws、bsAnd buAre respective weights and offsets, whereinBeta of (A)iReflecting the different importance degree of each sentence in the document;
step four, learning visual emotional characteristics by depending on sentence contexts; obtaining sentence expression t after visual enhancement according to the step twojThe sentence with the self-enhanced text in the step three is represented as siAs context, it is made to guide the image-enhanced sentence to pay more attention to features related to emotion; the context guidance complementary fusion network CGCFN mainly comprises a context guidance module CGM and a multi-mode complementary fusion module MCFM;
a context guidance module CGM of one of the core modules in the context guidance complementary fusion network CGCFN; the module mainly depends on a context guide attention mechanism, and the sentences subjected to vision enhancement learn common characteristics of vision and texts in the aspect of emotion through context representation to obtain potential visual emotion characteristics; characterizing a context siSentence t related to imagejProjecting the images to the same space by means of different parameter matrixes, and calculating the correlation degree of the two to obtain the emotion weight coefficient gamma of the vision enhancement sentencej,iFurther obtaining visual emotion representation ci; the calculation process for calculating the emotion weight coefficient is as follows:
uj=tanh(Wutj+be)
vi=tanh(Wvsi+bf)
Figure FDA0003130279900000023
ci=γj,itj
wherein, Wu∈Rcxe,Wv∈Rcxe,bu∈Rc,bv∈RcRespectively as weight and bias parameters, using sigmoid function, gammaj,iThe relevance of sentence capture and emotion information embodying the context characterization guide vision enhancement;
step five, dynamically adjusting complementation between modalities by learning interaction between vision and textsOr enhancing the relationship; when the image and text are high in relevance, the relevance of the image and the text is utilized to strengthen the visual emotion representation tj(ii) a Relying on text-enhanced sentence representations s when image and text relevance is not highiAs a complementary feature, for this purpose, a multimodal complementary fusion module MCFM of one of the core modules in the context-guided complementary fusion network CGCFN; the module consists of a gate function and a self-attention mechanism, wherein the gate function learns the interaction among cross-modal states, gives different weights to visual emotion characteristics, dynamically converts the relationship among the modal states, and the self-attention mechanism fuses text characteristics and the visual emotion characteristics to obtain final multi-modal representation; specifically, firstly, the degree of correlation between an image and a text is calculated, visual feature embedding and text representation are projected to the same space through a layer of neurons with nonlinear conversion, then the visual feature embedding and the text representation are multiplied, and a mode gate function g is obtained through nonlinear conversion with an activation function sigmoidj,iThen let a gate function gj,iAnd characterization of visual emotion ciLearning the interaction of the two by the element multiplication method, and simultaneously adding the sentence s with self-enhanced text by the element addition methodiObtaining an adaptive multi-modal emotion representation di
ej=tanh(Wemj+be)
fi=tanh(Wfhi+bf)
Figure FDA0003130279900000031
Figure FDA0003130279900000032
Wherein, We、Wf、beAnd bfFor the purpose of the corresponding weight and the bias,
Figure FDA0003130279900000033
representing element-by-element multiplication; study the designGate function g for learning relevance of vision and textj,iDynamically adjusting the relationship between the two modal representations using back propagation, g, when the image and text are closely relatedj,iWill be very large, visual emotion characterization ciThe contribution to the multi-modal representation is greater; conversely, when the image-text correspondence is weak, gj,iWill be small, ensuring that the current multi-modal representation is more dependent on the sentence characterization s of the text itselfi
Step six: the multi-modal emotion representations d_i from step five are input into the self-attention mechanism of the multi-modal complementary fusion module MCFM, and effective multi-modal fusion yields the multi-modal representation d relevant to the emotion classification task. In addition, the input of the pre-training model BERT is a sequence whose first token is the classification token [CLS]; the final hidden state corresponding to [CLS] learns global information and is often used in classification tasks as the aggregated sequence representation. Therefore, d_cls is input into a fully connected layer and a softmax function to obtain the final emotion prediction φ:
k_i = W_k (W_d d_i + b_d)
δ_i = softmax(k_i)
d_cls = Σ_i δ_i d_i
φ = softmax(W_c d_cls + b_c)
where W_d, W_k and b_d are the weights and biases of the MLP, and δ_i reflects the contribution that the emotion representation of each modality makes to the final multi-modal representation;
step seven, the multi-modal emotion prediction process is described; the model is trained in an end-to-end fashion by minimizing the cross-entropy loss function:
Figure FDA0003130279900000036
where l is the true tag value of document d.
CN202110703330.7A 2021-06-24 2021-06-24 Self-adaptive fusion multi-mode emotion classification method based on attention mechanism Active CN113435496B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703330.7A CN113435496B (en) 2021-06-24 2021-06-24 Self-adaptive fusion multi-mode emotion classification method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110703330.7A CN113435496B (en) 2021-06-24 2021-06-24 Self-adaptive fusion multi-mode emotion classification method based on attention mechanism

Publications (2)

Publication Number Publication Date
CN113435496A true CN113435496A (en) 2021-09-24
CN113435496B CN113435496B (en) 2022-09-02

Family

ID=77753844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703330.7A Active CN113435496B (en) 2021-06-24 2021-06-24 Self-adaptive fusion multi-mode emotion classification method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113435496B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114169450A (en) * 2021-12-10 2022-03-11 同济大学 Social media data multi-modal attitude analysis method
CN114626441A (en) * 2022-02-23 2022-06-14 苏州大学 Implicit multi-mode matching method and system based on visual contrast attention
CN114969458A (en) * 2022-06-28 2022-08-30 昆明理工大学 Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance
CN115019237A (en) * 2022-06-30 2022-09-06 中国电信股份有限公司 Multi-modal emotion analysis method and device, electronic equipment and storage medium
CN115034202A (en) * 2022-04-13 2022-09-09 天津大学 Deep learning text matching method based on enhancement mode fusion grammar information
CN115083005A (en) * 2022-06-13 2022-09-20 广东省人民医院 ROP image classification system and method based on deep learning
CN115730153A (en) * 2022-08-30 2023-03-03 郑州轻工业大学 Multi-mode emotion analysis method based on emotion correlation and emotion label generation
CN116719930A (en) * 2023-04-28 2023-09-08 西安工程大学 Multi-mode emotion analysis method based on visual attention
CN117033733A (en) * 2023-10-09 2023-11-10 北京民谐文化传播有限公司 Intelligent automatic classification and label generation system and method for library resources

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815903A (en) * 2019-01-24 2019-05-28 同济大学 A kind of video feeling classification method based on adaptive converged network
CN110874411A (en) * 2019-11-20 2020-03-10 福州大学 Cross-domain emotion classification system based on attention mechanism fusion
CN111275085A (en) * 2020-01-15 2020-06-12 重庆邮电大学 Online short video multi-modal emotion recognition method based on attention fusion
CN111753549A (en) * 2020-05-22 2020-10-09 江苏大学 Multi-mode emotion feature learning and recognition method based on attention mechanism
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN112559683A (en) * 2020-12-11 2021-03-26 苏州元启创人工智能科技有限公司 Multi-mode data and multi-interaction memory network-based aspect-level emotion analysis method
CN112860888A (en) * 2021-01-26 2021-05-28 中山大学 Attention mechanism-based bimodal emotion analysis method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815903A (en) * 2019-01-24 2019-05-28 同济大学 A kind of video feeling classification method based on adaptive converged network
CN110874411A (en) * 2019-11-20 2020-03-10 福州大学 Cross-domain emotion classification system based on attention mechanism fusion
CN111275085A (en) * 2020-01-15 2020-06-12 重庆邮电大学 Online short video multi-modal emotion recognition method based on attention fusion
CN111753549A (en) * 2020-05-22 2020-10-09 江苏大学 Multi-mode emotion feature learning and recognition method based on attention mechanism
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN112559683A (en) * 2020-12-11 2021-03-26 苏州元启创人工智能科技有限公司 Multi-mode data and multi-interaction memory network-based aspect-level emotion analysis method
CN112860888A (en) * 2021-01-26 2021-05-28 中山大学 Attention mechanism-based bimodal emotion analysis method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FEIRAN HUANG et al.: "Image-text sentiment analysis via deep multimodal attention fusion", Knowledge-Based Systems *
吴良庆 et al.: "Multimodal emotion recognition assisted by sentiment information", Journal of Peking University (Natural Science Edition) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114169450A (en) * 2021-12-10 2022-03-11 同济大学 Social media data multi-modal attitude analysis method
CN114626441A (en) * 2022-02-23 2022-06-14 苏州大学 Implicit multi-mode matching method and system based on visual contrast attention
CN115034202A (en) * 2022-04-13 2022-09-09 天津大学 Deep learning text matching method based on enhancement mode fusion grammar information
CN115083005A (en) * 2022-06-13 2022-09-20 广东省人民医院 ROP image classification system and method based on deep learning
CN114969458A (en) * 2022-06-28 2022-08-30 昆明理工大学 Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance
CN114969458B (en) * 2022-06-28 2024-04-26 昆明理工大学 Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion
CN115019237B (en) * 2022-06-30 2023-12-08 中国电信股份有限公司 Multi-mode emotion analysis method and device, electronic equipment and storage medium
CN115019237A (en) * 2022-06-30 2022-09-06 中国电信股份有限公司 Multi-modal emotion analysis method and device, electronic equipment and storage medium
CN115730153A (en) * 2022-08-30 2023-03-03 郑州轻工业大学 Multi-mode emotion analysis method based on emotion correlation and emotion label generation
CN115730153B (en) * 2022-08-30 2023-05-26 郑州轻工业大学 Multi-mode emotion analysis method based on emotion association and emotion label generation
CN116719930A (en) * 2023-04-28 2023-09-08 西安工程大学 Multi-mode emotion analysis method based on visual attention
CN117033733A (en) * 2023-10-09 2023-11-10 北京民谐文化传播有限公司 Intelligent automatic classification and label generation system and method for library resources
CN117033733B (en) * 2023-10-09 2023-12-22 北京民谐文化传播有限公司 Intelligent automatic classification and label generation system and method for library resources

Also Published As

Publication number Publication date
CN113435496B (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN113435496B (en) Self-adaptive fusion multi-mode emotion classification method based on attention mechanism
CN109933664B (en) Fine-grained emotion analysis improvement method based on emotion word embedding
KR102222451B1 (en) An apparatus for predicting the status of user's psychology and a method thereof
CN111246256B (en) Video recommendation method based on multi-mode video content and multi-task learning
US11227108B2 (en) Convolutional neural network architecture with adaptive filters
CN109344404B (en) Context-aware dual-attention natural language reasoning method
CN109284506A (en) A kind of user comment sentiment analysis system and method based on attention convolutional neural networks
CN112131469A (en) Deep learning recommendation method based on comment text
Shen et al. A voice of the customer real-time strategy: An integrated quality function deployment approach
CN112989033A (en) Microblog emotion classification method based on emotion category description
CN111274396B (en) Visual angle level text emotion classification method and system based on external knowledge
Mishra et al. IIIT_DWD@ HASOC 2020: Identifying offensive content in Indo-European languages.
Wang et al. Gated hierarchical attention for image captioning
CN115630145A (en) Multi-granularity emotion-based conversation recommendation method and system
CN112307755A (en) Multi-feature and deep learning-based spam comment identification method
CN114648031A (en) Text aspect level emotion recognition method based on bidirectional LSTM and multi-head attention mechanism
CN111651661A (en) Image-text cross-media retrieval method
CN113254637B (en) Grammar-fused aspect-level text emotion classification method and system
CN113268592B (en) Short text object emotion classification method based on multi-level interactive attention mechanism
Li et al. Image aesthetics assessment with attribute-assisted multimodal memory network
Das A multimodal approach to sarcasm detection on social media
CN111368524A (en) Microblog viewpoint sentence recognition method based on self-attention bidirectional GRU and SVM
CN115659990A (en) Tobacco emotion analysis method, device and medium
Suddul et al. A Smart Virtual Tutor with Facial Emotion Recognition for Online Learning
CN115309894A (en) Text emotion classification method and device based on confrontation training and TF-IDF

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant