CN113435496A - Self-adaptive fusion multi-mode emotion classification method based on attention mechanism - Google Patents
Self-adaptive fusion multi-mode emotion classification method based on attention mechanism
- Publication number
- CN113435496A CN113435496A CN202110703330.7A CN202110703330A CN113435496A CN 113435496 A CN113435496 A CN 113435496A CN 202110703330 A CN202110703330 A CN 202110703330A CN 113435496 A CN113435496 A CN 113435496A
- Authority
- CN
- China
- Prior art keywords
- emotion
- sentence
- text
- representation
- visual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to the field of multi-modal emotion analysis, and in particular to a multi-modal emotion classification method based on adaptive fusion with an attention mechanism. In the technical scheme, a word sequence X = [x_1, x_2, ..., x_N] is given; emotion-related features in the sentence sequence are captured from different angles, and visual emotion features are learned by relying on the sentence context; the multi-modal emotion prediction process is realized through a context guide module and a multi-modal complementary fusion module. The method has the advantages that (1) it not only emphasizes important text information but can also learn the potential emotion expression of vision in the text; (2) the fused representation contains not only the important aspects shared by the modalities but also the unique aspects of the text that vision ignores, while fusing adaptively by relying on the overall emotion; (3) the average accuracy is improved from 62.80 to 65.80, and the improvement rate rises from 43.1% to 49.92%.
Description
Technical Field
The invention relates to the field of multi-modal emotion analysis, and in particular to a multi-modal emotion classification method based on adaptive fusion with an attention mechanism.
Background
With the rapid development of the internet, users are increasingly willing to publish content (such as reviews) on social platforms, which causes the rapid growth of information on the internet. It is reported that 90% of consumers tend to read reviews about goods prior to consumption, and 88% of consumers choose to view reviews and suggestions written by acquaintances. Businesses therefore wish to know consumer preferences and consumer opinions when shaping product design and marketing strategies. Emotion analysis is central to understanding user-generated content. Currently, emotion analysis based on a single modality has been the focus of previous research, and text-based emotion classification methods can be roughly divided into three types: methods based on emotion dictionaries, traditional methods based on machine learning, and methods based on deep learning. Due to the popularity of smartphones, the content on social platforms has gradually evolved from single-modality documents into rich documents that combine multiple modalities (such as text and images), and this user-generated multi-modal content describes the user's current experience more accurately than text alone.
Introducing visual information alleviates, to a certain extent, the problem of the text misidentifying important aspects. How to fuse multiple modalities effectively and improve the emotion classification effect is a hot problem in the field. There are two main views of the relationship between the modalities. One view holds that the image and the text are equally effective and that both can be regarded as independent features contributing to emotion classification. For example, image and text features are obtained separately with deep neural networks and combined with a multi-layer perceptron (MLP) to infer the potential emotional state of the user; or the Deep Multimodal Attention Fusion (DMAF) framework exploits the distinctive features of, and the intrinsic relevance between, the visual and textual modalities; or an image-text consistency method judges whether the contents of the two are consistent and then adaptively fuses text features with visual features extracted from the traditional SentiBank, classifying image-text emotion with a support vector machine (SVM); or two-layer multimodal hypergraph learning (Bi-MHG) is proposed, which explicitly models the correlation between the visual, textual and emoticon modalities by connecting the two layers through shared multimodal feature relevance; or, considering that the fusion of multi-modal information needs to be based on the overall emotion, an attention-based modality-gated network (AMGN) is proposed, in which a modality-gated long short-term memory network (LSTM) adaptively selects the modality with strong emotion to learn multi-modal features. The other view holds that the text is dominant and that images should be regarded as an aid that helps the text focus on important aspects.
Traditional machine-learning-based methods consider that the images in a review can emphasize certain important aspects, so that visual information plays an auxiliary role for the overall emotion rather than serving as an independent feature; in the prior art, when a visual aspect attention network is designed, the image guides the model to focus on the important sentences in the review. However, the inventors believe that an image in a review is a presentation of a specific thing, so that the attention network tends to give more attention to sentences describing facts rather than sentences reflecting emotion, and the potential information of the image is not fully utilized. Obtaining emotion features directly from images is tricky, and text reviews are needed to further mine the features that are effective for the emotion classification task. By guiding the emotion module with the context in the model, the visually enhanced sentences are made to attend to the important emotion-related aspects, so that the potential visual emotion is learned. In addition, the image categories in online review data sets are mostly singular, basically presenting the food and environment aspects of a restaurant, whereas the reviews considered by the invention tend to be comprehensive and cover more aspects. The aspects embodied by the images therefore do not cover all the aspects important for document emotion classification. A previous proposal uses a global mean image as compensation, but the aspects missing from different review images differ, so such processing is too simple and not flexible enough. The inventors instead take the aspects reflected by the important sentences in the text as complementary features, which solves the problem that the images focus only on a single aspect and yields more accurate classification results than capturing the important aspects from the images alone.
Disclosure of Invention
The invention aims to provide a multi-modal emotion classification method based on adaptive fusion with an attention mechanism, so as to overcome the defects of the prior art.
The invention is realized by the following technical scheme, which comprises a pre-training model for learning bidirectional context information, namely the BERT_base model, a network with 768 hidden units, 12 attention heads and feed-forward sublayers, and comprises the following steps:
Step one, a word sequence X = [x_1, x_2, ..., x_N] is given, where x_i is the sum of the word, segment and position embeddings and N is the maximum length of the sequence. The word sequence X is input into the pre-training model for encoding, and the output of the last encoder layer is taken as the sentence hidden state h_i:
h_i = BERT(X_i)
For the multiple images attached to the document, the images are uniformly resized to 224 × 224, the last fully connected layer of the residual network is removed, and the output of the last convolutional layer is used as the representation of image I_j:
a_j = ResNet(I_j)
The image representation a_j is a 2048-dimensional vector encoded from the image I_j;
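A minimal PyTorch sketch of this feature-extraction step is given below. It assumes the Hugging Face transformers BertModel and the torchvision resnet152; the helper names encode_sentences and encode_images are illustrative and not part of the patent.

```python
import torch
from transformers import BertModel, BertTokenizer
from torchvision import models, transforms

# Sentence encoder: the last-layer hidden states of BERT-base serve as h_i.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_sentences(sentences, max_len=128):
    batch = tokenizer(sentences, padding="max_length", truncation=True,
                      max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    return out.last_hidden_state            # (num_sentences, max_len, 768)

# Image encoder: resize to 224 x 224, drop the final fully connected layer of
# ResNet and keep the 2048-d output of the last convolutional stage as a_j.
resnet = models.resnet152(pretrained=True)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])   # remove the fc layer
preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

def encode_images(pil_images):
    x = torch.stack([preprocess(img) for img in pil_images])    # (M, 3, 224, 224)
    with torch.no_grad():
        a = backbone(x).flatten(1)                               # (M, 2048)
    return a
```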
Step two, emotion-related features in the sentence sequence are captured from different angles. First, from the visual angle, the sentences in the document that are related to the images are enhanced with the image information: the visual feature embedding a_j and the sentence-level hidden state h_i are projected into the same space with a nonlinear transformation;
then the degree of correlation between the sentence hidden states and a specific image is learned by matrix multiplication, a softmax function is applied to obtain the weights α_{j,i}, and a weighted summation yields the visually enhanced sentence representation t_j:
p_j = relu(W_p a_j + b_p)
q_i = relu(W_q h_i + b_q)
t_j = Σ_i α_{j,i} h_i
where W_p, W_q, b_p and b_q are the weights and biases of the multi-layer perceptrons, relu is the nonlinear activation function, and α_{j,i} captures the correlation between the visual representation and the sentence hidden state;
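The visual attention of this step might be implemented as in the following sketch; expressing α_{j,i} as a softmax over the matrix product of the projections is an assumption consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttention(nn.Module):
    """Visually enhanced sentences: t_j = sum_i alpha_{j,i} * h_i."""
    def __init__(self, img_dim=2048, sent_dim=768, proj_dim=256):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, proj_dim)    # W_p, b_p
        self.proj_sent = nn.Linear(sent_dim, proj_dim)  # W_q, b_q

    def forward(self, a, h):
        # a: (M, img_dim) image features; h: (S, sent_dim) sentence hidden states
        p = F.relu(self.proj_img(a))          # (M, proj_dim)
        q = F.relu(self.proj_sent(h))         # (S, proj_dim)
        scores = p @ q.t()                    # (M, S) correlation by matrix multiplication
        alpha = F.softmax(scores, dim=-1)     # alpha_{j,i}: weight of sentence i for image j
        t = alpha @ h                         # (M, sent_dim) visually enhanced sentences
        return t, alpha
```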
Step three, from the perspective of the text: in the emotion analysis task, sentences containing emotion information are more important than sentences describing facts. Therefore, a self-attention mechanism is adopted at the sentence level so that the relations between sentences are learned on the sentence-level hidden states h_i, yielding the relative importance of each sentence representation; this is normalized with softmax to obtain the attention weight β_i, and finally the attention weights and the sentence-level hidden states h_i are combined by weighted summation to obtain the text self-enhanced sentence representation s_i:
u_i = W_u tanh(W_s h_i + b_s)
where W_u, W_s, b_s and b_u are the respective weights and biases, and β_i reflects the different importance of each sentence in the document;
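A sketch of the sentence-level self-attention follows. Whether s_i is the β_i-scaled hidden state of each sentence or a single weighted sum is not spelled out above; the sketch keeps one representation per sentence scaled by its weight, which is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceSelfAttention(nn.Module):
    """Text self-enhancement: u_i = W_u tanh(W_s h_i + b_s), beta = softmax(u)."""
    def __init__(self, sent_dim=768, attn_dim=256):
        super().__init__()
        self.w_s = nn.Linear(sent_dim, attn_dim)   # W_s, b_s
        self.w_u = nn.Linear(attn_dim, 1)          # W_u, b_u

    def forward(self, h):
        # h: (S, sent_dim) sentence-level hidden states
        u = self.w_u(torch.tanh(self.w_s(h)))      # (S, 1) unnormalized importance
        beta = F.softmax(u, dim=0)                 # beta_i: importance of sentence i
        s = beta * h                               # (S, sent_dim) self-enhanced sentences
        return s, beta
```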
Step four, because it is difficult to obtain emotion information directly from the images, the visual emotion features must be learned by relying on the sentence context. The visually enhanced sentence representation t_j from step two and the text self-enhanced sentence representation s_i from step three are used, with s_i serving as the context that guides the image-enhanced sentences to pay more attention to emotion-related features. The context-guided complementary fusion network CGCFN mainly comprises a context guide module CGM and a multi-modal complementary fusion module MCFM;
to this end, in the context-guided complementary fusion network CGCFN, the context guide module CGM (Context Guide Module), one of the core modules, mainly relies on a context-guided attention mechanism: given the context representation, the visually enhanced sentences learn the features that vision and text share with respect to emotion, yielding the potential visual emotion features. The context representation s_i and the image-related sentence representation t_j are projected into the same space by means of different parameter matrices, and their degree of correlation is computed to obtain the emotion weight coefficient γ_{j,i} of the visually enhanced sentence and, further, the visual emotion representation c_i; the emotion weight coefficient is computed as follows:
u_j = tanh(W_u t_j + b_e)
v_i = tanh(W_v s_i + b_f)
c_i = γ_{j,i} · t_j
where W_u ∈ R^{c×e}, W_v ∈ R^{c×e}, b_u ∈ R^c and b_v ∈ R^c are the weight and bias parameters respectively, a sigmoid function is adopted, and γ_{j,i} embodies how the context representation guides the visually enhanced sentences to capture emotion-related information;
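The context guide module might be sketched as follows. The element-wise sigmoid form of γ_{j,i} over the projected correlation, and the aggregation of the t_j over the images to obtain c_i, are assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class ContextGuideModule(nn.Module):
    """CGM: the context s_i guides the visually enhanced sentences t_j toward emotion."""
    def __init__(self, dim=768, proj_dim=256):
        super().__init__()
        self.w_u = nn.Linear(dim, proj_dim)   # W_u, b_e (visual side)
        self.w_v = nn.Linear(dim, proj_dim)   # W_v, b_f (context side)

    def forward(self, t, s):
        # t: (M, dim) visually enhanced sentences; s: (S, dim) context representations
        u = torch.tanh(self.w_u(t))           # (M, proj_dim)
        v = torch.tanh(self.w_v(s))           # (S, proj_dim)
        gamma = torch.sigmoid(v @ u.t())      # (S, M) emotion weight coefficients gamma_{j,i}
        c = gamma @ t                         # (S, dim) visual emotion representation c_i
        return c, gamma
```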
Step five, because the content reflected by the images in a review is mostly singular and cannot cover all the important aspects in the text, the complementary or reinforcing relationship between the modalities is adjusted dynamically by learning the interaction between vision and text. When the correlation between the image and the text is high, this correlation is used to strengthen the visual emotion representation c_i; when the correlation is not high, the text self-enhanced sentence representation s_i is relied upon as a complementary feature. In the context-guided complementary fusion network CGCFN, the multi-modal complementary fusion module MCFM (Multi-modal Complementary Fusion Module), one of the core modules, consists of a gate function and a self-attention mechanism: the gate function learns the cross-modal interaction, assigns different weights to the visual emotion features and dynamically converts the relationship between the modalities, and the self-attention mechanism fuses the text features and the visual emotion features to obtain the final multi-modal representation. Specifically, the degree of correlation between the image and the text is first computed: the visual feature embedding and the text representation are projected into the same space through one layer of neurons with a nonlinear transformation and then multiplied, and the modal gate function g_{j,i} is obtained through a nonlinear transformation with the sigmoid activation function. The gate function g_{j,i} and the visual emotion representation c_i then learn their interaction through element-wise multiplication, and at the same time the text self-enhanced sentence representation s_i is added element-wise, yielding the adaptive multi-modal emotion representation d_i:
e_j = tanh(W_e m_j + b_e)
f_i = tanh(W_f h_i + b_f)
where W_e, W_f, b_e and b_f are the corresponding weights and biases, and ⊙ denotes element-wise multiplication. The gate function g_{j,i}, which learns the correlation between vision and text, dynamically adjusts the relationship between the two modal representations through back-propagation: when the image and the text are closely related, g_{j,i} is large and the visual emotion representation c_i contributes more to the multi-modal representation; conversely, when the image-text correspondence is weak, g_{j,i} is small, ensuring that the current multi-modal representation depends more on the sentence representation s_i of the text itself;
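The gating part of the MCFM can be sketched as below. The visual input m_j of the gate is taken here to be the per-sentence visual emotion representation c_i, and the gate is applied element-wise per sentence; both choices are assumptions, since the description above does not pin them down.

```python
import torch
import torch.nn as nn

class ModalGate(nn.Module):
    """MCFM gating: d_i = g_{j,i} * c_i + s_i, with the gate learned from vision-text correlation."""
    def __init__(self, dim=768):
        super().__init__()
        self.w_e = nn.Linear(dim, dim)   # W_e, b_e (visual side)
        self.w_f = nn.Linear(dim, dim)   # W_f, b_f (text side)

    def forward(self, c, h, s):
        # c: (S, dim) visual emotion representations; h: (S, dim) sentence hidden states;
        # s: (S, dim) text self-enhanced sentence representations
        e = torch.tanh(self.w_e(c))      # projected visual side (e_j)
        f = torch.tanh(self.w_f(h))      # projected text side (f_i)
        g = torch.sigmoid(e * f)         # modal gate g_{j,i}
        d = g * c + s                    # adaptive multi-modal emotion representation d_i
        return d, g
```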
Step six, the multi-modal emotion representation d_i from step five is input into the self-attention mechanism of the multi-modal complementary fusion module MCFM, and effective multi-modal fusion is performed to obtain the final multi-modal representation d relevant to the emotion classification task. In addition, the input of the pre-training model BERT is a sequence whose first token is the classification token [CLS]; the final hidden state corresponding to [CLS] learns global information and is often used as the aggregated sequence representation in classification tasks. Therefore, d_cls is input into a fully connected layer and a softmax function to obtain the final emotion prediction φ:
k_i = W_k(W_d d_i + b_d)
φ = softmax(W_c d_cls + b_c)
where W_d, W_k and b_d are the weights and bias of the multi-layer perceptron MLP, and δ_i reflects the contribution that the emotion representation of each modality makes to the final multi-modal representation;
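The fusion-and-classification step might look like the sketch below; obtaining δ_i as a softmax over k_i, reading d_cls from the first position of the fused sequence, and the number of classes are assumptions made to keep the example runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Self-attention fusion over the multi-modal representations d_i, then a softmax classifier."""
    def __init__(self, dim=768, attn_dim=256, num_classes=5):
        super().__init__()
        self.w_d = nn.Linear(dim, attn_dim)              # W_d, b_d
        self.w_k = nn.Linear(attn_dim, 1, bias=False)    # W_k
        self.w_c = nn.Linear(dim, num_classes)           # W_c, b_c

    def forward(self, d):
        # d: (S, dim) adaptive multi-modal emotion representations; position 0 holds [CLS]
        k = self.w_k(self.w_d(d))                        # (S, 1)  k_i = W_k (W_d d_i + b_d)
        delta = F.softmax(k, dim=0)                      # delta_i: contribution of each position
        fused = delta * d                                # weighted multi-modal representations
        d_cls = fused[0]                                 # aggregated [CLS] representation d_cls
        phi = F.softmax(self.w_c(d_cls), dim=-1)         # emotion prediction phi
        return phi
```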
Step seven, the above describes the multi-modal emotion prediction process; the model is trained in an end-to-end fashion by minimizing the cross-entropy loss function between the prediction φ and the true label l of document d.
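A minimal end-to-end training sketch with the cross-entropy objective follows; the linear placeholder model and the synthetic batches only stand in for the full network and a Yelp-style review loader.

```python
import torch
import torch.nn as nn

# Placeholders so the sketch runs: `model` stands in for the full CGCFN and
# `batches` for mini-batches produced by a review data loader.
model = nn.Linear(768, 5)
batches = [(torch.randn(8, 768), torch.randint(0, 5, (8,))) for _ in range(10)]

criterion = nn.CrossEntropyLoss()                  # cross-entropy between the prediction and the true label l
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

model.train()
for features, labels in batches:
    logits = model(features)                       # phi = softmax(logits) at prediction time
    loss = criterion(logits, labels)               # end-to-end objective, minimized by back-propagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```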
The method has the advantages that (1) on the basis of a visual attention mechanism and a self-attention mechanism, a context guide module is constructed, and the model is guided to combine the visually enhanced sentence representations with the context information to further mine the emotion information in the images, so that not only is important text information emphasized, but the potential emotion representation of vision in the text can also be learned. (2) A multi-modal complementary fusion method is provided, in which the relationship between the emotion representations of the modalities is adjusted dynamically through a gate function; the adjusted multi-modal representation contains not only the important aspects shared by the modalities but also the unique aspects of the text that vision ignores, and is fused adaptively by relying on the overall emotion. (3) Extensive experiments performed on the Yelp data set show that the invention is superior to the methods of the prior art schemes: the accuracy (Acc) on the 5 city data sets is improved from 67.61, 70.70, 62.38, 61.45 and 62.40 to 71.43, 69.54, 65.47, 64.61 and 66.32 respectively, the average accuracy (Avg) is improved from 62.80 to 65.80, and the improvement rate (Improvement) rises from 43.1% to 49.92%.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
In order to make the inventive content of the present invention better understood by those skilled in the art, a preferred embodiment of the present invention is further described below with reference to FIG. 1. It comprises a pre-training model for learning bidirectional context information, namely the BERT_base model, a network with 768 hidden units, 12 attention heads and feed-forward sublayers, and is characterized by the following steps:
Step one, a word sequence X = [x_1, x_2, ..., x_N] is given, where x_i is the sum of the word, segment and position embeddings. The word sequence X is input into the pre-training model for encoding, and the output of the last encoder layer is taken as the sentence hidden state h_i: h_i = BERT(X_i)
For the multiple images attached to the document, the images are uniformly resized to 224 × 224, the last fully connected layer of the residual network is removed, and the output of the last convolutional layer is used as the representation of image I_j:
a_j = ResNet(I_j)
The image representation a_j is a 2048-dimensional vector encoded from the image I_j;
Step two, emotion-related features in the sentence sequence are captured from different angles. First, from the visual angle, the sentences in the document that are related to the images are enhanced with the image information: the visual feature embedding a_j and the sentence-level hidden state h_i are projected into the same space with a nonlinear transformation;
then the degree of correlation between the sentence hidden states and a specific image is learned by matrix multiplication, a softmax function is applied to obtain the weights α_{j,i}, and a weighted summation yields the visually enhanced sentence representation t_j:
p_j = relu(W_p a_j + b_p)
q_i = relu(W_q h_i + b_q)
t_j = Σ_i α_{j,i} h_i
where W_p, W_q, b_p and b_q are the weights and biases of the multi-layer perceptrons, relu is the nonlinear activation function, and α_{j,i} captures the correlation between the visual representation and the sentence hidden state;
Step three, from the perspective of the text: in the emotion analysis task, sentences containing emotion information are more important than sentences describing facts. Therefore, a self-attention mechanism is adopted at the sentence level so that the relations between sentences are learned on the sentence-level hidden states h_i, yielding the relative importance of each sentence representation; this is normalized with softmax to obtain the attention weight β_i, and finally the attention weights and the sentence-level hidden states h_i are combined by weighted summation to obtain the text self-enhanced sentence representation s_i:
u_i = W_u tanh(W_s h_i + b_s)
where W_u, W_s, b_s and b_u are the respective weights and biases, and β_i reflects the different importance of each sentence in the document;
Step four, because it is difficult to obtain emotion information directly from the images, the visual emotion features must be learned by relying on the sentence context. The visually enhanced sentence representation t_j from step two and the text self-enhanced sentence representation s_i from step three are used, with s_i serving as the context that guides the image-enhanced sentences to pay more attention to emotion-related features. The context-guided complementary fusion network CGCFN mainly comprises a context guide module CGM and a multi-modal complementary fusion module MCFM;
to this end, in the context-guided complementary fusion network CGCFN (Context Guide Complementary Fusion Network), the context guide module CGM (Context Guide Module), one of the core modules, mainly relies on a context-guided attention mechanism: through the context representation, the visually enhanced sentences learn the features that vision and text share with respect to emotion, yielding the potential visual emotion features. The context representation s_i and the image-related sentence representation t_j are projected into the same space by means of different parameter matrices, and their degree of correlation is computed to obtain the emotion weight coefficient γ_{j,i} of the visually enhanced sentence and, further, the visual emotion representation c_i; the emotion weight coefficient is computed as follows:
u_j = tanh(W_u t_j + b_e)
v_i = tanh(W_v s_i + b_f)
c_i = γ_{j,i} · t_j
where W_u ∈ R^{c×e}, W_v ∈ R^{c×e}, b_u ∈ R^c and b_v ∈ R^c are the weight and bias parameters respectively, a sigmoid function is adopted, and γ_{j,i} embodies how the context representation guides the visually enhanced sentences to capture emotion-related information;
Step five, because the content reflected by the images in a review is mostly singular and cannot cover all the important aspects in the text, the complementary or reinforcing relationship between the modalities is adjusted dynamically by learning the interaction between vision and text. When the correlation between the image and the text is high, this correlation is used to strengthen the visual emotion representation c_i; when the correlation is not high, the text self-enhanced sentence representation s_i is relied upon as a complementary feature. In the context-guided complementary fusion network CGCFN, the multi-modal complementary fusion module MCFM (Multi-modal Complementary Fusion Module), one of the core modules, consists of a gate function and a self-attention mechanism: the gate function learns the cross-modal interaction, assigns different weights to the visual emotion features and dynamically converts the relationship between the modalities, and the self-attention mechanism fuses the text features and the visual emotion features to obtain the final multi-modal representation. Specifically, the degree of correlation between the image and the text is first computed: the visual feature embedding and the text representation are projected into the same space through one layer of neurons with a nonlinear transformation and then multiplied, and the modal gate function g_{j,i} is obtained through a nonlinear transformation with the sigmoid activation function. The gate function g_{j,i} and the visual emotion representation c_i then learn their interaction through element-wise multiplication, and at the same time the text self-enhanced sentence representation s_i is added element-wise, yielding the adaptive multi-modal emotion representation d_i:
e_j = tanh(W_e m_j + b_e)
f_i = tanh(W_f h_i + b_f)
where W_e, W_f, b_e and b_f are the corresponding weights and biases, and ⊙ denotes element-wise multiplication. The gate function g_{j,i}, which learns the correlation between vision and text, dynamically adjusts the relationship between the two modal representations through back-propagation: when the image and the text are closely related, g_{j,i} is large and the visual emotion representation c_i contributes more to the multi-modal representation; conversely, when the image-text correspondence is weak, g_{j,i} is small, ensuring that the current multi-modal representation depends more on the sentence representation s_i of the text itself;
Step six, the multi-modal emotion representation d_i from step five is input into the self-attention mechanism of the multi-modal complementary fusion module MCFM, and effective multi-modal fusion is performed to obtain the multi-modal representation d relevant to the emotion classification task. In addition, the input of the pre-training model BERT is a sequence whose first token is the classification token [CLS]; the final hidden state corresponding to [CLS] learns global information and is often used as the aggregated sequence representation in classification tasks. Therefore, d_cls is input into a fully connected layer and a softmax function to obtain the final emotion prediction φ:
k_i = W_k(W_d d_i + b_d)
φ = softmax(W_c d_cls + b_c)
where W_d, W_k and b_d are the weights and bias of the MLP, and δ_i reflects the contribution that the emotion representation of each modality makes to the final multi-modal representation;
Step seven, the above describes the multi-modal emotion prediction process; the model is trained in an end-to-end fashion by minimizing the cross-entropy loss function between the prediction φ and the true label l of document d.
The present invention is compared to previously proposed multimodal emotion analysis models as shown in table 1:
table 1: comparison of the results of the method of the invention with the other methods
Table 1 lists the comparison between the reference models and the proposed model. It can be seen that the average accuracy of the context-guided complementary fusion network CGCFN (Context Guide Complementary Fusion Network) on the Yelp data set reaches 65.8%, which is 6.3% higher than that of the VistaNet model and 4.8% higher than that of SFNN. The effect obtained on the CH data set is inferior to that of the reference model, which is because the CH data set contains fewer samples than the LA, NY and other data sets and cannot cover the general characteristics of a larger data set.
Further, the effectiveness of each module of the context-guided complementary fusion network CGCFN is studied by ablation experiments: starting from the most basic configuration, modules are added step by step to form the final model architecture, as shown in Table 2.
table 2: ablation experiment of the method of the invention
First, only the text part is relied on, i.e. the text feature extraction module and the self-attention mechanism over the sequence; as shown in the first row, the average accuracy is 63.09%. The image features extracted with ResNet152 are then used to learn the important sentence representations through the visual attention module, giving an effect 0.5% higher than the text part alone, as shown in the second row of Table 2. Next, the context guide module is added and the visual emotion representation is obtained through further learning; the effect improves by 1.6%, as shown in the third row of Table 2. Finally, the multi-modal complementary fusion module is added, and the emotion representations of the text and visual modalities are fused in an effective, balanced and adaptive way with the gate function and the self-attention mechanism; as shown in the fourth row of Table 2, the average accuracy reaches 65.80%, an improvement of 4.3%. The ablation results show that each sub-module of the context-guided complementary fusion network CGCFN makes its own contribution and is effective.
In this embodiment, referring to FIG. 1, the review text "Redeem after 9:30 pm! Compared with the outside, the inside is clean and tidy." is first preprocessed and input into the pre-training model, namely the bidirectional encoder representations from Transformers, to obtain the sequence features; the 3 images of the review are preprocessed and input into the deep residual network to obtain the image representations.
Next, the enhanced sequence features are learned from two angles. The first starts from vision: the visual attention module uses the image features of each image to interact with the sequence features through inner products, obtaining the visually enhanced sequence features. The second starts from the text: a sentence-level self-attention mechanism learns the relationships within the sequence to obtain the text self-enhanced sequence features. Because the images of the review do not contain emotion beyond the text, the context guide module uses the text self-enhanced sequence features as the context to guide the visually enhanced sequence features to attend to the emotion information shared by the images and the text, obtaining the visual emotion features.
Then, the gate function of the multi-modal complementary fusion module dynamically adjusts the weights of the visual emotion features and the text self-enhanced sequence features in the multi-modal representation by learning the interaction of the text and the images; the result is input into the self-attention mechanism for multi-modal fusion, and the features important to the emotion classification task are learned on the basis of the overall emotion of the document, giving the multi-modal document representation.
Finally, the [CLS] token of the multi-modal document representation is input into the emotion classifier for prediction, as composed in the sketch below.
In conclusion, the invention not only emphasizes important text information but can also learn the potential emotion expression of vision in the text. The fused representation contains not only the important aspects shared by the modalities but also the unique aspects of the text that vision ignores, while fusing adaptively by relying on the overall emotion. The average accuracy is improved from 62.80 to 65.80, and the improvement rate rises from 43.1% to 49.92%.
Claims (1)
1. A multi-modal emotion classification method based on adaptive fusion with an attention mechanism, comprising a pre-training model for learning bidirectional context information, namely the BERT_base model, a network with 768 hidden units, 12 attention heads and feed-forward sublayers, characterized by the following steps:
step one, a word sequence X = [x_1, x_2, ..., x_N] is given, where x_i is the sum of the word, segment and position embeddings and N is the maximum length of the sequence; the word sequence X is input into the pre-training model for encoding, and the output of the last encoder layer is taken as the sentence hidden state h_i:
h_i = BERT(X_i)
for the multiple images attached to the document, the images are uniformly resized to 224 × 224, the last fully connected layer of the residual network is removed, and the output of the last convolutional layer is used as the representation of image I_j:
a_j = ResNet(I_j)
the image representation a_j is a 2048-dimensional vector encoded from the image I_j;
step two, emotion-related features in the sentence sequence are captured from different angles: first, from the visual angle, the sentences in the document that are related to the images are enhanced with the image information, and the visual feature embedding a_j and the sentence-level hidden state h_i are projected into the same space with a nonlinear transformation;
then the degree of correlation between the sentence hidden states and a specific image is learned by matrix multiplication, a softmax function is applied to obtain the weights α_{j,i}, and a weighted summation yields the visually enhanced sentence representation t_j:
p_j = relu(W_p a_j + b_p)
q_i = relu(W_q h_i + b_q)
t_j = Σ_i α_{j,i} h_i
where W_p, W_q, b_p and b_q are the weights and biases of the multi-layer perceptrons, relu is the nonlinear activation function, and α_{j,i} captures the correlation between the visual representation and the sentence hidden state;
step three, the sentences containing emotion information are more important than the sentences describing facts; therefore, a self-attention mechanism is adopted at the sentence level so that the relations between sentences are learned on the sentence-level hidden states h_i, yielding the relative importance of each sentence representation; this is normalized with softmax to obtain the attention weight β_i, and finally the attention weights and the sentence-level hidden states h_i are combined by weighted summation to obtain the text self-enhanced sentence representation s_i:
u_i = W_u tanh(W_s h_i + b_s)
where W_u, W_s, b_s and b_u are the respective weights and biases, and β_i reflects the different importance of each sentence in the document;
step four, the visual emotion features are learned by relying on the sentence context: the visually enhanced sentence representation t_j from step two and the text self-enhanced sentence representation s_i from step three are used, with s_i serving as the context that guides the image-enhanced sentences to pay more attention to emotion-related features; the context-guided complementary fusion network CGCFN mainly comprises a context guide module CGM and a multi-modal complementary fusion module MCFM;
the context guide module CGM, one of the core modules in the context-guided complementary fusion network CGCFN, mainly relies on a context-guided attention mechanism: through the context representation, the visually enhanced sentences learn the features that vision and text share with respect to emotion, yielding the potential visual emotion features; the context representation s_i and the image-related sentence representation t_j are projected into the same space by means of different parameter matrices, and their degree of correlation is computed to obtain the emotion weight coefficient γ_{j,i} of the visually enhanced sentence and, further, the visual emotion representation c_i; the emotion weight coefficient is computed as follows:
u_j = tanh(W_u t_j + b_e)
v_i = tanh(W_v s_i + b_f)
c_i = γ_{j,i} · t_j
where W_u ∈ R^{c×e}, W_v ∈ R^{c×e}, b_u ∈ R^c and b_v ∈ R^c are the weight and bias parameters respectively, a sigmoid function is adopted, and γ_{j,i} embodies how the context representation guides the visually enhanced sentences to capture emotion-related information;
step five, the complementary or reinforcing relationship between the modalities is adjusted dynamically by learning the interaction between vision and text: when the correlation between the image and the text is high, this correlation is used to strengthen the visual emotion representation c_i; when the correlation is not high, the text self-enhanced sentence representation s_i is relied upon as a complementary feature; to this end, the multi-modal complementary fusion module MCFM, one of the core modules in the context-guided complementary fusion network CGCFN, consists of a gate function and a self-attention mechanism: the gate function learns the cross-modal interaction, assigns different weights to the visual emotion features and dynamically converts the relationship between the modalities, and the self-attention mechanism fuses the text features and the visual emotion features to obtain the final multi-modal representation; specifically, the degree of correlation between the image and the text is first computed: the visual feature embedding and the text representation are projected into the same space through one layer of neurons with a nonlinear transformation and then multiplied, and the modal gate function g_{j,i} is obtained through a nonlinear transformation with the sigmoid activation function; the gate function g_{j,i} and the visual emotion representation c_i then learn their interaction through element-wise multiplication, and at the same time the text self-enhanced sentence representation s_i is added element-wise, yielding the adaptive multi-modal emotion representation d_i:
e_j = tanh(W_e m_j + b_e)
f_i = tanh(W_f h_i + b_f)
where W_e, W_f, b_e and b_f are the corresponding weights and biases, and ⊙ denotes element-wise multiplication; the gate function g_{j,i}, which learns the correlation between vision and text, dynamically adjusts the relationship between the two modal representations through back-propagation: when the image and the text are closely related, g_{j,i} is large and the visual emotion representation c_i contributes more to the multi-modal representation; conversely, when the image-text correspondence is weak, g_{j,i} is small, ensuring that the current multi-modal representation depends more on the sentence representation s_i of the text itself;
step six, the multi-modal emotion representation d_i from step five is input into the self-attention mechanism of the multi-modal complementary fusion module MCFM, and effective multi-modal fusion is performed to obtain the multi-modal representation d relevant to the emotion classification task; in addition, the input of the pre-training model BERT is a sequence whose first token is the classification token [CLS]; the final hidden state corresponding to [CLS] learns global information and is often used as the aggregated sequence representation in classification tasks; therefore, d_cls is input into a fully connected layer and a softmax function to obtain the final emotion prediction φ:
k_i = W_k(W_d d_i + b_d)
φ = softmax(W_c d_cls + b_c)
where W_d, W_k and b_d are the weights and bias of the MLP, and δ_i reflects the contribution that the emotion representation of each modality makes to the final multi-modal representation;
step seven, the above describes the multi-modal emotion prediction process; the model is trained in an end-to-end fashion by minimizing the cross-entropy loss function between the prediction φ and the true label l of document d.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110703330.7A CN113435496B (en) | 2021-06-24 | 2021-06-24 | Self-adaptive fusion multi-mode emotion classification method based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110703330.7A CN113435496B (en) | 2021-06-24 | 2021-06-24 | Self-adaptive fusion multi-mode emotion classification method based on attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113435496A true CN113435496A (en) | 2021-09-24 |
CN113435496B CN113435496B (en) | 2022-09-02 |
Family
ID=77753844
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110703330.7A Active CN113435496B (en) | 2021-06-24 | 2021-06-24 | Self-adaptive fusion multi-mode emotion classification method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113435496B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815903A (en) * | 2019-01-24 | 2019-05-28 | 同济大学 | A kind of video feeling classification method based on adaptive converged network |
CN110874411A (en) * | 2019-11-20 | 2020-03-10 | 福州大学 | Cross-domain emotion classification system based on attention mechanism fusion |
CN111275085A (en) * | 2020-01-15 | 2020-06-12 | 重庆邮电大学 | Online short video multi-modal emotion recognition method based on attention fusion |
CN111753549A (en) * | 2020-05-22 | 2020-10-09 | 江苏大学 | Multi-mode emotion feature learning and recognition method based on attention mechanism |
CN112348075A (en) * | 2020-11-02 | 2021-02-09 | 大连理工大学 | Multi-mode emotion recognition method based on contextual attention neural network |
CN112559683A (en) * | 2020-12-11 | 2021-03-26 | 苏州元启创人工智能科技有限公司 | Multi-mode data and multi-interaction memory network-based aspect-level emotion analysis method |
CN112860888A (en) * | 2021-01-26 | 2021-05-28 | 中山大学 | Attention mechanism-based bimodal emotion analysis method |
Non-Patent Citations (2)
Title |
---|
FEIRAN HUANG 等: "Image-text sentiment analysis via deep multimodal attention fusion", 《KNOWLEDGE-BASED SYSTEM》 * |
吴良庆等: "基于情感信息辅助的多模态情绪识别", 《北京大学学报(自然科学版)》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113850842A (en) * | 2021-09-26 | 2021-12-28 | 北京理工大学 | Anti-occlusion target tracking method based on attention mask |
CN114170460A (en) * | 2021-11-24 | 2022-03-11 | 北京化工大学 | Multi-mode fusion-based artwork classification method and system |
CN114169450A (en) * | 2021-12-10 | 2022-03-11 | 同济大学 | Social media data multi-modal attitude analysis method |
CN114626441A (en) * | 2022-02-23 | 2022-06-14 | 苏州大学 | Implicit multi-mode matching method and system based on visual contrast attention |
CN115034202A (en) * | 2022-04-13 | 2022-09-09 | 天津大学 | Deep learning text matching method based on enhancement mode fusion grammar information |
CN115083005A (en) * | 2022-06-13 | 2022-09-20 | 广东省人民医院 | ROP image classification system and method based on deep learning |
CN114969458A (en) * | 2022-06-28 | 2022-08-30 | 昆明理工大学 | Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance |
CN114969458B (en) * | 2022-06-28 | 2024-04-26 | 昆明理工大学 | Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion |
CN115019237A (en) * | 2022-06-30 | 2022-09-06 | 中国电信股份有限公司 | Multi-modal emotion analysis method and device, electronic equipment and storage medium |
CN115019237B (en) * | 2022-06-30 | 2023-12-08 | 中国电信股份有限公司 | Multi-mode emotion analysis method and device, electronic equipment and storage medium |
CN115730153A (en) * | 2022-08-30 | 2023-03-03 | 郑州轻工业大学 | Multi-mode emotion analysis method based on emotion correlation and emotion label generation |
CN115730153B (en) * | 2022-08-30 | 2023-05-26 | 郑州轻工业大学 | Multi-mode emotion analysis method based on emotion association and emotion label generation |
CN116719930A (en) * | 2023-04-28 | 2023-09-08 | 西安工程大学 | Multi-mode emotion analysis method based on visual attention |
CN117033733A (en) * | 2023-10-09 | 2023-11-10 | 北京民谐文化传播有限公司 | Intelligent automatic classification and label generation system and method for library resources |
CN117033733B (en) * | 2023-10-09 | 2023-12-22 | 北京民谐文化传播有限公司 | Intelligent automatic classification and label generation system and method for library resources |
Also Published As
Publication number | Publication date |
---|---|
CN113435496B (en) | 2022-09-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113435496B (en) | Self-adaptive fusion multi-mode emotion classification method based on attention mechanism | |
CN109933664B (en) | Fine-grained emotion analysis improvement method based on emotion word embedding | |
KR102222451B1 (en) | An apparatus for predicting the status of user's psychology and a method thereof | |
CN111246256B (en) | Video recommendation method based on multi-mode video content and multi-task learning | |
US11227108B2 (en) | Convolutional neural network architecture with adaptive filters | |
Bilquise et al. | Emotionally intelligent chatbots: a systematic literature review | |
CN109344404B (en) | Context-aware dual-attention natural language reasoning method | |
CN109284506A (en) | A kind of user comment sentiment analysis system and method based on attention convolutional neural networks | |
CN107862087A (en) | Sentiment analysis method, apparatus and storage medium based on big data and deep learning | |
CN112199956A (en) | Entity emotion analysis method based on deep representation learning | |
CN112131469A (en) | Deep learning recommendation method based on comment text | |
Shen et al. | A voice of the customer real-time strategy: An integrated quality function deployment approach | |
CN112989033A (en) | Microblog emotion classification method based on emotion category description | |
CN111274396B (en) | Visual angle level text emotion classification method and system based on external knowledge | |
CN115630145A (en) | Multi-granularity emotion-based conversation recommendation method and system | |
CN112307755A (en) | Multi-feature and deep learning-based spam comment identification method | |
CN113255360A (en) | Document rating method and device based on hierarchical self-attention network | |
CN113268592B (en) | Short text object emotion classification method based on multi-level interactive attention mechanism | |
Das | A multimodal approach to sarcasm detection on social media | |
CN113570154A (en) | Multi-granularity interactive recommendation method and system fusing dynamic interests of users | |
CN115659990A (en) | Tobacco emotion analysis method, device and medium | |
Lin et al. | Social media popularity prediction based on multi-modal self-attention mechanisms | |
Suddul et al. | A Smart Virtual Tutor with Facial Emotion Recognition for Online Learning | |
Shi et al. | Product feature extraction from Chinese online reviews: Application to product improvement | |
CN115309894A (en) | Text emotion classification method and device based on confrontation training and TF-IDF |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||