CN114969458A - Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance - Google Patents
- Publication number: CN114969458A (application CN202210743773.3A, filed 2022-06-27)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/90335 - Query processing (information retrieval; database structures)
- G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/25 - Fusion techniques (pattern recognition)
- G06N3/044 - Recurrent networks, e.g. Hopfield networks
- G06N3/048 - Activation functions
- G06N3/08 - Learning methods (neural networks)
- Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a text-guided, hierarchically adaptive fusion method for multimodal sentiment analysis, and belongs to the field of natural language processing. The method comprises the following steps: first, features of the three modalities (text, speech and vision) are extracted separately. A cross-modal attention mechanism, using the text modality as guidance, then models the pairwise interactions between modalities, yielding speech and visual features that are closely related to the text. Next, a multimodal adaptive gating mechanism uses these modality-related features to filter the three unimodal features, obtaining the specific features of each modality. A multimodal hierarchical fusion strategy then combines the multimodal features with the modality-importance information, and finally a linear transformation over the output predicts sentiment polarity. The model is trained on the public CMU-MOSI dataset. Experimental results show that the method effectively improves the performance of multimodal sentiment analysis.
Description
Technical Field
The invention relates to a text-guided, hierarchically adaptive fusion method for multimodal sentiment analysis, and belongs to the field of natural language processing.
Background
With the development of internet technology, short-video social media platforms such as Douyin (TikTok) and Kuaishou have grown rapidly in recent years. More and more users choose to express their opinions and emotions through video, which provides a large amount of multimodal data. Multimodal Sentiment Analysis (MSA) has therefore received increasing attention, and related research has been widely applied in fields such as social media public-opinion monitoring and personalized recommendation. Multimodal sentiment analysis thus has important research significance and application value.
Multimodal sentiment analysis must not only fully represent the unimodal information but also model the interaction and fusion among different modal features. Zadeh et al. proposed the Tensor Fusion Network (TFN) and the Memory Fusion Network (MFN), the latter using LSTMs to learn view-specific interactions. Tsai et al. proposed a cross-modal Transformer that learns cross-modal attention to reinforce the target modality. Yu et al. introduced unimodal subtasks to aid modality representation learning.
Although these methods have achieved some success in multimodal sentiment analysis, conventional approaches usually treat the three modal features as equally important: they focus on fusing the multimodal features while ignoring the different contributions of individual modalities to the final sentiment analysis result, and thus make insufficient use of modality-importance information. This can cause important in-modality information to be lost and degrade multimodal sentiment analysis performance.
Disclosure of Invention
The invention provides a text-guided hierarchical adaptive fusion method for multimodal sentiment analysis, which uses text modality information as guidance to realize hierarchical adaptive filtering and fusion of multimodal information, thereby improving multimodal sentiment analysis performance.
The technical scheme of the invention is as follows: the text-guided hierarchical adaptive fusion multimodal sentiment analysis method comprises the following specific steps:
Step1, preparing a dataset and preprocessing the public dataset;
Step2, inputting the processed data into the text-guided hierarchical adaptive fusion model: the feature representation module characterizes the information of the three modalities (text, speech and vision); the local cross-modal feature interaction module extracts modality-related features from the obtained text, speech and visual features; the global multimodal interaction module filters the modality-related features through a gating mechanism to obtain modality-specific features; and the local-global feature fusion module effectively fuses the modality-related features and the modality-specific features.
As a further aspect of the invention, Step1 specifically comprises:
step1.1, downloading the CMU-MOSI dataset, which contains 2199 short monologue video clips; each clip is manually annotated with a sentiment score in [-3, +3], representing sentiment intensity from negative to positive polarity; the CMU-MOSI training, validation and test sets contain 1284, 229 and 686 video clips, respectively; the data are then preprocessed into a pkl-format file.
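The preprocessing step can be sketched as follows. The dictionary layout of the .pkl file below is a hypothetical illustration (the patent does not specify it); the split sizes (1284/229/686) come from the text:

```python
import io
import pickle

# Hypothetical layout of the preprocessed .pkl file: one dict per split,
# each holding aligned text/audio/vision features and labels in [-3, +3].
splits = {"train": 1284, "valid": 229, "test": 686}
data = {name: {"text": [], "audio": [], "vision": [], "labels": []}
        for name in splits}

buf = io.BytesIO()
pickle.dump(data, buf)  # in practice: pickle.dump(data, open("mosi.pkl", "wb"))

total = sum(splits.values())  # 1284 + 229 + 686 = 2199 monologue clips
```

In a real pipeline each list would hold one feature array per clip, but the round-trippable structure above is enough to show the intended file format.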
As a further aspect of the present invention, in Step2, characterizing the information of the three modalities (text, speech and vision) through the feature representation module specifically comprises:
Step2.1, a multimodal language sequence involves three modalities: the text modality T, the speech modality A and the visual modality V. The input sequences are defined as F_{t,a,v} ∈ R^{l_{t,a,v} × d_{t,a,v}}, where l_{t,a,v} denotes the sequence length of each modality. Three independent sub-networks produce the feature representations of the three modalities. For the text modality, a pre-trained 12-layer BERT extracts the sentence representation, taking the first word vector of the last layer as the representation of the whole sentence:
H_t = BERT(F_t; θ_bert)
where H_t ∈ R^{l_t × d_t} is the text modality feature, l_t the sequence length of the text modality, d_t its feature dimension, and θ_bert the network parameters of the BERT model.
For the speech and visual modalities, a unidirectional LSTM captures the temporal features of each modality, using the hidden state of the last time step as the representation of the whole sequence. F_a and F_v are passed through the unidirectional LSTM to obtain the speech and visual modality feature representations:
H_a = sLSTM(F_a; θ_lstm)
H_v = sLSTM(F_v; θ_lstm)
where H_a ∈ R^{l_a × d_a} is the speech modality feature, H_v ∈ R^{l_v × d_v} the visual modality feature, l_a, l_v the sequence lengths of the speech and visual modalities, d_a, d_v their feature dimensions, and θ_lstm the network parameters of the LSTM model.
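As a concrete illustration of the sequence encoders, the following NumPy sketch implements a unidirectional LSTM that returns only the last hidden state, as the method does for the speech and visual modalities. The gate ordering (i, f, g, o) and parameter shapes are assumptions; the patent itself uses trained LSTMs, and BERT for text, which are not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(F, W, U, b):
    """Run a unidirectional LSTM over F (l x d_in) and return the last
    hidden state as the whole-sequence representation.
    W: (4h, d_in), U: (4h, h), b: (4h,); gate order i, f, g, o (assumed)."""
    h_dim = U.shape[1]
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    for x in F:                         # one time step per feature vector
        z = W @ x + U @ h + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)      # update cell state
        h = o * np.tanh(c)              # hidden state at this step
    return h                            # e.g. H_a for a speech clip
```

Because h = o * tanh(c) with o in (0, 1), every component of the returned representation lies strictly inside (-1, 1).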
As a further aspect of the present invention, in Step2, extracting the modality-related features from the obtained text, speech and visual features through the local cross-modal feature interaction module specifically comprises:
Step2.2, the correlation between the text modality and the non-text modalities is learned with a cross-modal attention mechanism. Given the visual modality V and the text modality T with feature representations H_v and H_t, the cross-modal attention (CM) from the text modality to the visual modality is expressed as:
CM_{t→v}(H_t, H_v) = softmax((H_v W_Q)(H_t W_K)^T / √d_k)(H_t W_V)
where W_Q, W_K, W_V are linear transformation weight matrices, d_k denotes the dimension of the Q and K vectors, and d_V the dimension of the V vector. Two cross-attention modules yield the text-to-speech and text-to-vision modal interaction features: the text modality feature H_t provides the K and V vectors, while the speech modality feature H_a and the visual modality feature H_v each provide the Q vectors. The cross-modal interaction process is expressed as:
H_{t→a} = CM_{t→a}(H_t, H_a)
H_{t→v} = CM_{t→v}(H_t, H_v)
The text modality feature H_t, the text-to-speech interaction feature H_{t→a} and the text-to-vision interaction feature H_{t→v} are then concatenated and mapped into a low-dimensional space:
H_m = ReLU(W_f [H_t; H_{t→a}; H_{t→v}] + b_f)
where d_t is the feature dimension of the text modality, d_a, d_v those of the speech and visual modalities, d_m the dimension of the low-dimensional space, ReLU is the activation function, and H_m is the modality-related feature of the three modalities.
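The text-guided cross-modal attention step can be sketched in NumPy as follows: the non-text modality supplies Q, while the text modality supplies K and V. The function and parameter names (cross_modal_attention, W_q, W_k, W_v) are illustrative, not the patent's.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(H_q, H_kv, W_q, W_k, W_v):
    """CM(text -> speech/vision): H_q is the non-text modality (provides Q),
    H_kv is the text modality (provides K and V)."""
    Q = H_q @ W_q                 # (l_q, d_k)
    K = H_kv @ W_k                # (l_t, d_k)
    V = H_kv @ W_v                # (l_t, d_v)
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V   # (l_q, d_v)
```

The output has one row per non-text time step, each a text-guided mixture of the text value vectors; running it once with H_a and once with H_v gives the two interaction features H_{t→a} and H_{t→v}.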
As a further aspect of the present invention, in Step2, filtering the modality-related features through a gating mechanism in the global multimodal interaction module to obtain the modality-specific features specifically comprises:
Step2.3, a global multimodal feature interaction module is designed with gating units to learn the specific features of the different modalities; under the guidance of the text-dominated modality-related feature, the gating mechanism extracts the specific features of the three modalities. Taking the speech modality as an example: the modality-related feature H_m output by the local cross-modal feature interaction module and the speech modality feature H_a output by the feature representation module are fed into two independent linear layers, whose outputs serve as the inputs of the gating unit, so that the multimodal related feature filters the specific feature of the single modality. The multimodal adaptive gating module is:
λ_a = sigmoid(W_m H_m + W_a H_a)
H_a^s = λ_a ⊙ H_a
where λ_a is the similarity weight between the multimodal related feature and the speech feature, W_m and W_a are parameter matrices, ⊙ denotes elementwise multiplication, and H_a^s is the specific feature of the speech modality.
Step2.3 is repeated to obtain the specific features of the text and visual modalities, denoted H_t^s and H_v^s respectively; d_t is the feature dimension of the text modality, d_a, d_v those of the speech and visual modalities, and l_t, l_a, l_v the sequence lengths of the text, speech and visual modalities.
The text-specific feature H_t^s, the speech-specific feature H_a^s and the vision-specific feature H_v^s are then concatenated and mapped to a low-dimensional space:
H^s = ReLU(W_g [H_t^s; H_a^s; H_v^s] + b_g)
where d_m is the low-dimensional space dimension, ReLU is the activation function, and H^s comprises the specific features of the different modalities.
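A minimal sketch of the adaptive gating step, assuming pooled (vector) features and that the specific feature is the elementwise product of the gate λ_a with the unimodal feature; the exact form of the gated output is not spelled out in the source, so that choice is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_gate(h_m, h_a, W_m, W_a):
    """Multimodal adaptive gate (speech example):
    h_m: pooled modality-related feature (d_m,)
    h_a: pooled speech feature (d_a,)
    W_m: (d_a, d_m), W_a: (d_a, d_a) gate parameter matrices."""
    lam = sigmoid(W_m @ h_m + W_a @ h_a)  # similarity weight lambda_a in (0, 1)
    return lam * h_a                      # gated speech-specific feature H_a^s
```

Since every component of λ_a lies in (0, 1), the gate can only attenuate the unimodal feature, which matches the filtering role described above; the same function applied to H_t and H_v yields the text- and vision-specific features.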
As a further aspect of the present invention, in Step2, effectively fusing the modality-related features and the modality-specific features through the local-global feature fusion module specifically comprises:
Step2.4, the modality-related feature H_m is obtained from the local cross-modal feature interaction module and the modality-specific feature H^s from the global multimodal interaction module; a Transformer-based local-global feature fusion module is then designed.
First, the modality-related feature and the modality-specific feature are stacked into the matrix M = [H_m; H^s]. The matrix M is then taken as the input of the Transformer so that, through multi-head self-attention, each vector learns from the other cross-modal representations, and the global multimodal features are jointly exploited for a comprehensive judgement of multimodal sentiment.
For the self-attention mechanism, define Q = K = V = M. The Transformer generates a new matrix M̄:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
head_i = Attention(Q W_i^q, K W_i^k, V W_i^v)
M̄ = [head_1; …; head_h] W^o
where W_i^q, W_i^k, W_i^v and W^o are linear transformation weight matrices, [;] denotes concatenation, and θ_att = {W^q, W^k, W^v, W^o}.
Finally, the outputs of the Transformer are obtained, the output vectors are concatenated and fed into a linear layer to produce the final prediction:
ŷ = W_out [H̄_m; H̄^s] + b_out
where H̄_m is the modality-related feature after the Transformer, H̄^s the modality-specific feature after the Transformer, d_m the low-dimensional space dimension, and b_out the bias term.
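A single-head, NumPy-only sketch of the local-global fusion and prediction step. The patent uses a multi-head Transformer; restricting it to one attention head and to the parameter names below is a simplification for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_and_predict(h_m, h_s, W_q, W_k, W_v, W_o, w_out, b_out):
    """Stack the related (h_m) and specific (h_s) features into M (2 x d_m),
    apply one self-attention layer (Q = K = V = M), then concatenate the two
    refined rows and map them to a scalar sentiment score."""
    M = np.stack([h_m, h_s])                     # (2, d_m)
    Q, K, V = M @ W_q, M @ W_k, M @ W_v          # self-attention projections
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (2, 2) attention weights
    M_bar = (A @ V) @ W_o                        # refined features (2, d_m)
    return float(M_bar.ravel() @ w_out + b_out)  # linear layer -> y_hat
```

Each of the two rows attends over both itself and the other, so the final score mixes the related and specific views, which is the point of the local-global fusion.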
The beneficial effects of the invention are:
1. For multimodal sentiment analysis, the invention takes modality-importance information into account, effectively explores inter-modality and intra-modality relationships, and improves the accuracy of multimodal sentiment analysis. A text-guided multimodal hierarchical adaptive fusion method is proposed, which uses the text modality as guidance to realize hierarchical adaptive filtering and fusion of multimodal information.
2. A cross-modal attention mechanism fully learns the modality-related features, and a multimodal adaptive gating mechanism filters and fuses the modality-specific features, which facilitates multimodal fusion and sentiment prediction.
3. Experiments on the CMU-MOSI and CMU-MOSEI datasets show that multimodal sentiment analysis performance is markedly improved.
Drawings
FIG. 1 is a graph of the results of a CMU-MOSI dataset modal importance ablation experiment of the present invention;
FIG. 2 is a schematic flow chart of a hierarchical adaptive fusion multi-modal emotion analysis method based on text guidance.
Detailed Description
Example 1: as shown in fig. 1-2, a hierarchical adaptive fusion multimodal emotion analysis method based on text guidance trains a model by taking a CMU-MOSI data set as an example, and the method specifically includes the following steps:
step1, preparing a data set, and preprocessing the CMU-MOSI data of the public data set;
step1.1, downloading the CMU-MOSI dataset, which contains 2199 short monologue video clips; each clip is manually annotated with a sentiment score in [-3, +3], representing sentiment intensity from negative to positive polarity; the CMU-MOSI training, validation and test sets contain 1284, 229 and 686 video clips, respectively; the data are then preprocessed into a pkl-format file.
Step2, inputting the processed data into the text-guided hierarchical adaptive fusion model: the feature representation module characterizes the information of the three modalities (text, speech and vision); the local cross-modal feature interaction module extracts modality-related features from the obtained text, speech and visual features; the global multimodal interaction module filters the modality-related features through a gating mechanism to obtain modality-specific features; and the local-global feature fusion module effectively fuses the modality-related features and the modality-specific features.
The specific steps of Step2 are as follows:
step2.1, a multimodal language sequence involves three modalities: the text modality T, the speech modality A and the visual modality V. The input sequences are defined as F_{t,a,v} ∈ R^{l_{t,a,v} × d_{t,a,v}}, where l_{t,a,v} denotes the sequence length of each modality. Three independent sub-networks produce the feature representations of the three modalities. For the text modality, a pre-trained 12-layer BERT extracts the sentence representation, taking the first word vector of the last layer as the representation of the whole sentence:
H_t = BERT(F_t; θ_bert)
where H_t ∈ R^{l_t × d_t} is the text modality feature, l_t the sequence length of the text modality, d_t its feature dimension, and θ_bert the network parameters of the BERT model.
For the speech and visual modalities, a unidirectional LSTM captures the temporal features of each modality, using the hidden state of the last time step as the representation of the whole sequence. F_a and F_v are passed through the unidirectional LSTM to obtain the speech and visual modality feature representations:
H_a = sLSTM(F_a; θ_lstm)
H_v = sLSTM(F_v; θ_lstm)
where H_a ∈ R^{l_a × d_a} is the speech modality feature, H_v ∈ R^{l_v × d_v} the visual modality feature, l_a, l_v the sequence lengths of the speech and visual modalities, d_a, d_v their feature dimensions, and θ_lstm the network parameters of the LSTM model.
Step2.2, the correlation between the text modality and the non-text modalities is learned with a cross-modal attention mechanism. Given the visual modality V and the text modality T with feature representations H_v and H_t, the cross-modal attention (CM) from the text modality to the visual modality is expressed as:
CM_{t→v}(H_t, H_v) = softmax((H_v W_Q)(H_t W_K)^T / √d_k)(H_t W_V)
where W_Q, W_K, W_V are linear transformation weight matrices, d_k denotes the dimension of the Q and K vectors, and d_V the dimension of the V vector. Two cross-attention modules yield the text-to-speech and text-to-vision modal interaction features: the text modality feature H_t provides the K and V vectors, while the speech modality feature H_a and the visual modality feature H_v each provide the Q vectors. The cross-modal interaction process is expressed as:
H_{t→a} = CM_{t→a}(H_t, H_a)
H_{t→v} = CM_{t→v}(H_t, H_v)
The text modality feature H_t, the text-to-speech interaction feature H_{t→a} and the text-to-vision interaction feature H_{t→v} are then concatenated and mapped into a low-dimensional space:
H_m = ReLU(W_f [H_t; H_{t→a}; H_{t→v}] + b_f)
where d_t is the feature dimension of the text modality, d_a, d_v those of the speech and visual modalities, d_m the dimension of the low-dimensional space, ReLU is the activation function, and H_m is the modality-related feature of the three modalities.
Step2.3, a global multimodal feature interaction module is designed with gating units to learn the specific features of the different modalities; under the guidance of the text-dominated modality-related feature, the gating mechanism extracts the specific features of the three modalities. Taking the speech modality as an example: the modality-related feature H_m output by the local cross-modal feature interaction module and the speech modality feature H_a output by the feature representation module are fed into two independent linear layers, whose outputs serve as the inputs of the gating unit, so that the multimodal related feature filters the specific feature of the single modality. The multimodal adaptive gating module is:
λ_a = sigmoid(W_m H_m + W_a H_a)
H_a^s = λ_a ⊙ H_a
where λ_a is the similarity weight between the multimodal related feature and the speech feature, W_m and W_a are parameter matrices, ⊙ denotes elementwise multiplication, and H_a^s is the specific feature of the speech modality.
Step2.3 is repeated to obtain the specific features of the text and visual modalities, denoted H_t^s and H_v^s respectively; d_t is the feature dimension of the text modality, d_a, d_v those of the speech and visual modalities, and l_t, l_a, l_v the sequence lengths of the text, speech and visual modalities.
The text-specific feature H_t^s, the speech-specific feature H_a^s and the vision-specific feature H_v^s are then concatenated and mapped to a low-dimensional space:
H^s = ReLU(W_g [H_t^s; H_a^s; H_v^s] + b_g)
where d_m is the low-dimensional space dimension, ReLU is the activation function, and H^s comprises the specific features of the different modalities.
Step2.4, the modality-related feature H_m is obtained from the local cross-modal feature interaction module and the modality-specific feature H^s from the global multimodal interaction module; a Transformer-based local-global feature fusion module is then designed.
First, the modality-related feature and the modality-specific feature are stacked into the matrix M = [H_m; H^s]. The matrix M is then taken as the input of the Transformer so that, through multi-head self-attention, each vector learns from the other cross-modal representations, and the global multimodal features are jointly exploited for a comprehensive judgement of multimodal sentiment.
For the self-attention mechanism, define Q = K = V = M. The Transformer generates a new matrix M̄:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
head_i = Attention(Q W_i^q, K W_i^k, V W_i^v)
M̄ = [head_1; …; head_h] W^o
where W_i^q, W_i^k, W_i^v and W^o are linear transformation weight matrices, [;] denotes concatenation, and θ_att = {W^q, W^k, W^v, W^o}.
Finally, the outputs of the Transformer are obtained, the output vectors are concatenated and fed into a linear layer to produce the final prediction:
ŷ = W_out [H̄_m; H̄^s] + b_out
where H̄_m is the modality-related feature after the Transformer, H̄^s the modality-specific feature after the Transformer, d_m the low-dimensional space dimension, and b_out the bias term.
In order to illustrate the effect of the invention, three groups of comparative experiments were set up. Group 1 presents the main experimental results, verifying the improvement in multimodal sentiment analysis performance by comparison with previous work in the field. Group 2 is a model ablation experiment, verifying the validity of the proposed model. Group 3 is a modality-importance ablation experiment, verifying the importance of the text modality.
(1) Results of the Main experiment
The CMU-MOSI dataset is used, as in most previous work. The training, validation and test sets contain 1284, 229 and 686 video clips, respectively. The parameter settings are shown in Table 1 below.
Table 1: parameter setting of model
Four evaluation indices are used to evaluate the sentiment analysis performance of the model: 1) mean absolute error (MAE); 2) correlation coefficient (Corr); 3) Acc_2, binary classification accuracy; 4) F1 score, the weighted counterpart of Acc_2. For all indices except MAE, a higher score indicates better performance. To fully verify the performance of the proposed model, several mainstream, high-performance multimodal sentiment analysis models were selected and compared under the same experimental environment and dataset using these four indices; the experimental results are shown in Table 2 below.
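The four indices can be computed as follows. The convention that a score of exactly zero counts as positive, and the per-class weighting used for the F1 score, are assumptions (the patent does not state them):

```python
import numpy as np

def mosi_metrics(y_true, y_pred):
    """MAE, Pearson correlation, binary accuracy (Acc_2) and weighted F1
    for sentiment scores in [-3, +3]; scores >= 0 count as positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = float(np.mean(np.abs(y_true - y_pred)))
    corr = float(np.corrcoef(y_true, y_pred)[0, 1])
    t, p = y_true >= 0, y_pred >= 0           # binarized labels/predictions
    acc2 = float(np.mean(t == p))
    f1s, weights = [], []
    for cls in (True, False):                 # F1 per class, support-weighted
        tp = np.sum((p == cls) & (t == cls))
        prec = tp / max(np.sum(p == cls), 1)
        rec = tp / max(np.sum(t == cls), 1)
        f1s.append(0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec))
        weights.append(float(np.mean(t == cls)))
    return mae, corr, acc2, float(np.dot(f1s, weights))
```

For example, predictions with the correct sign on every clip give Acc_2 = 1.0 and weighted F1 = 1.0 regardless of the residual MAE.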
TABLE 2 Experimental results of different models on CMU-MOSI data sets
Analysis of Table 2 shows that the proposed model outperforms the other comparison models on the CMU-MOSI dataset in both binary sentiment classification accuracy and F1 score. Compared with the other models, accuracy improves by 0.76% to 5.62% and the F1 value by 0.7% to 5.64%. Compared with the advanced Self-MM model, Acc_2 improves by 0.76% and the F1 value by 0.7%, because the proposed model considers the importance of the text modality and makes full use of text modality information to aid multimodal information fusion. Compared with the ICCN model, Acc_2 improves by 3.36% and F1 by 3.36%, because the model considers both the relevance and the difference of modal information alongside the importance of the text modality, making full use of the related and specific features of the three modalities and thereby improving model performance. The experimental results fully demonstrate the effectiveness and advancement of the proposed model on the multimodal sentiment classification task.
(2) Model ablation experiment
The performance of the complete model and its simplified variants is compared under the same training and test data as in Table 2; the experimental results are shown in Table 3 below:
1. (-) Cross-modal attention: the local cross-modal interaction module (guided by the text modality) is removed from the complete model.
2. (-) Gating unit: the global multimodal interaction module is removed from the complete model.
3. (-) Text gate, (-) Speech gate, (-) Visual gate: the text, speech and visual gates are removed in turn from the global multimodal interaction module.
4. Related-feature fusion: in the local-global feature fusion module, the modality-specific features are removed and only the modality-related features are used.
5. Specific-feature fusion: in the local-global feature fusion module, the modality-related features are removed and only the modality-specific features are used.
TABLE 3 CMU-MOSI dataset model ablation experimental results
1. When the single-modality interaction module is removed, the accuracy and the F1 score are reduced. The result shows that the local cross-modal interaction module effectively reduces the difference between different modalities, and the complementary features of the text modality are learned from the non-text modality.
2. When the global multi-modal interaction module, or the text, speech, or visual gating network, is removed, accuracy and F1 score both drop. This shows that the global multi-modal interaction module learns the specific features of the different modalities and provides additional information for emotion prediction, and that the multi-modal adaptive gating mechanism is very helpful for filtering out the specific information of the unimodal features.
3. In the local-global feature fusion module, when only the related features or only the specific features are fused, accuracy and F1 score both drop. This shows that removing either the modality-related or the modality-specific features hurts model performance; when the two kinds of features are fused, the model learns more feature information, which benefits emotion prediction.
(3) Modal importance ablation experiment
In order to verify that different modalities contribute to the final emotion analysis result to different degrees, the model is run three times, taking in turn the text modality (text-guided), the speech modality (audio-guided), and the visual modality (visual-guided) as the guiding modality, and the emotion analysis results are compared. The experimental results are shown in Fig. 1 below.
The experimental results in Fig. 1 show that the model performs best when the text modality is the guiding modality, while emotion analysis accuracy and F1 score both drop significantly when the speech or the visual modality is the guiding modality. This shows that, in the multimodal emotion analysis task, different modalities contribute to the final result to different degrees; the text modality contributes the most, confirming its importance.
The method extracts text, speech, and visual modality features; it then uses a cross-modal attention mechanism, guided by the text-modality information, to obtain pairwise cross-modal representations, yielding speech and visual features closely related to the text; next, a multi-modal adaptive gating mechanism filters the three unimodal features with the modality-related features to obtain the specific features of the three modalities; then a multi-modal hierarchical fusion strategy combines the multi-modal features with the modality-importance information; finally, a linear transformation on the output predicts the emotion polarity.
The above experiments show that introducing the local cross-modal interaction module alleviates insufficient fusion of information between modalities. The method takes the text modality, which contributes the most, as the guiding modality and the speech and visual modalities, which contribute less, as auxiliary modalities; it uses a cross-modal attention mechanism to represent pairwise importance information between modalities, then performs hierarchical adaptive multi-modal fusion, guided by the multi-modal importance information, via the multi-modal adaptive gating mechanism, and finally applies the modality-related and modality-specific features jointly to fully exploit inter-modality and intra-modality relations. Experiments show that the proposed method achieves better results than several baseline models. For the multimodal emotion analysis task, the proposed text-guided hierarchical self-adaptive fusion method effectively improves multimodal emotion analysis performance.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (6)
1. A multi-modal emotion analysis method based on text-guided hierarchical self-adaptive fusion, characterized by comprising the following specific steps:
step1, preparing a data set, and preprocessing the public data set data;
step2, inputting the processed data into a hierarchical self-adaptive fusion model based on text guidance, and characterizing the information of three modes, namely text, voice and vision, through a feature representation module; extracting modal related features from the obtained three features of text, voice and vision through a local cross-modal feature interaction module; filtering the relevant modal characteristics by a gating mechanism through a global multi-modal interaction module to obtain modal characteristic characteristics; and effectively fusing the modality-related features and the modality characteristic features through a local-global feature fusion module.
2. The method of claim 1, wherein the method comprises: the specific steps of Step1 are as follows:
step1.1, downloading a CMU-MOSI data set, wherein the CMU-MOSI data set comprises 2199 short uniwhite video clips, each video clip is annotated with emotion scores manually, the emotion scores are [ -3, +3], and the polarity representing the emotion intensity is from negative polarity to positive polarity; wherein the CMU-MOSI training, validation and test sets contain 1284, 229, 686 video segments, respectively; and then forming a pkl format file through preprocessing.
3. The method of claim 1, wherein the method comprises: in Step2, the characterizing information of three modalities, namely text, voice and vision, by a feature representation module specifically includes:
step2.1, a multimodal language sequence, involves three modalities: a text mode T, a voice mode A and a visual mode V, and an input sequence is defined asWherein l {t,a,v} A sequence length representing a modality; three independent sub-networks are adopted to obtain the feature representation of three modes; for the text modality, a pre-trained 12-layer BERT is used to extract sentence representations, and the first word vector in the last layer is taken as the representation of the whole sentence; obtaining a feature representation of a text modality by using BERT, wherein the feature representation of the text modality is as follows:
$H_t = \mathrm{BERT}(F_t, \theta_{bert})$
where $H_t \in \mathbb{R}^{l_t \times d_t}$ denotes the text-modality features, $l_t$ the sequence length of the text modality, $d_t$ the feature dimension of the text modality, and $\theta_{bert}$ the network parameters of the BERT model;
for the speech modality and the visual modality, a unidirectional LSTM is used to capture the temporal features of the two modalities, and the hidden state of the last time step is taken as the representation of the whole sequence; the speech features $F_a$ and the visual features $F_v$ are passed through the unidirectional LSTM to obtain the speech-modality and visual-modality feature representations, as follows:

$H_a = \mathrm{LSTM}(F_a, \theta_{lstm})$

$H_v = \mathrm{LSTM}(F_v, \theta_{lstm})$

where $H_a \in \mathbb{R}^{l_a \times d_a}$ denotes the speech-modality features, $H_v \in \mathbb{R}^{l_v \times d_v}$ the visual-modality features, $l_a$ and $l_v$ the sequence lengths of the speech and visual modalities, $d_a$ and $d_v$ the feature dimensions of the speech and visual modalities, and $\theta_{lstm}$ the network parameters of the LSTM model.
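The two pooling choices above, taking BERT's first last-layer token for text and the final hidden state of a unidirectional LSTM for speech/vision, can be sketched as follows; the dimensions, random weights, and hand-rolled LSTM cell are illustrative assumptions (a real system would use a trained BERT and, e.g., torch.nn.LSTM):

```python
import numpy as np

rng = np.random.default_rng(1)

# --- text modality: first token of BERT's last layer as the sentence vector
l_t, d_t = 24, 768                        # toy sequence length; BERT-base hidden size
last_hidden = rng.normal(size=(l_t, d_t)) # stand-in for the BERT last-layer output
H_t = last_hidden[0]                      # first word vector = whole-sentence representation

# --- speech/visual modality: unidirectional LSTM, last time-step hidden state
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(F, Wx, Wh, b):
    """Run a unidirectional LSTM over F (seq_len, d_in); return the final hidden state."""
    d = Wh.shape[0]
    h, c = np.zeros(d), np.zeros(d)
    for x in F:                           # left-to-right, one step per frame
        z = x @ Wx + h @ Wh + b           # all four gates at once, order [i, f, g, o]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

d_in, d_hid, l_a = 5, 16, 30              # toy speech-feature sizes (assumption)
F_a = rng.normal(size=(l_a, d_in))
H_a = lstm_last_hidden(
    F_a,
    rng.normal(size=(d_in, 4 * d_hid)) * 0.1,
    rng.normal(size=(d_hid, 4 * d_hid)) * 0.1,
    np.zeros(4 * d_hid),
)
```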
4. The method of claim 1, wherein the method comprises: in Step2, the extracting modal-related features of the obtained text, voice, and vision features by the local cross-modal feature interaction module specifically includes:
step2.2, learning the correlation between the text mode and the non-text mode by utilizing a cross-mode attention mechanism; when there are two visual modes V and text mode T, the character is expressed as H v 、H t The Cross-Modal Attention, CM, from text modality to visual modality is expressed as follows:
where $W^{q}$, $W^{k}$, and $W^{v}$ are linear-transformation weight matrices, $d_k$ denotes the dimension of the Q and K vectors, and $d_V$ the dimension of the V vector; two cross-attention modules are used to obtain the two groups of modal interaction features, text-to-speech and text-to-vision: the text-modality features $H_t$ provide the K and V vectors, while the speech-modality features $H_a$ and the visual-modality features $H_v$ each provide the Q vectors; the cross-modal interaction process is represented as follows:

$H_{ta} = \mathrm{CM}(H_t, H_a), \quad H_{tv} = \mathrm{CM}(H_t, H_v)$
the text-modality features $H_t$, the text-speech interaction features $H_{ta}$, and the text-vision interaction features $H_{tv}$ are then concatenated and mapped into a low-dimensional space through a linear layer, yielding the modality-related features $H_m$.
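A sketch of the text-guided cross-modal attention described above: the text modality supplies K and V, the non-text modality supplies Q, so the output keeps the non-text sequence length while being built from text-side values. Projection sizes and random weights are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_modal_attention(H_q, H_text, W_q, W_k, W_v):
    """CM(H_t, H_q): Q from the non-text modality, K and V from the text modality."""
    Q, K, V = H_q @ W_q, H_text @ W_k, H_text @ W_v
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # (l_q, l_t): attention over text positions
    return A @ V                          # (l_q, d_v): text values re-weighted per query

rng = np.random.default_rng(3)
d, d_k, d_v = 32, 16, 16                  # toy dimensions (assumption)
H_t = rng.normal(size=(20, d))            # text features
H_v = rng.normal(size=(40, d))            # visual features
H_tv = cross_modal_attention(
    H_v, H_t,
    rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)), rng.normal(size=(d, d_v)),
)
```

The text-to-speech features are obtained identically by passing the speech features as the query side.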
5. The method of claim 1, wherein the method comprises: in Step2, the filtering, by using a gating mechanism, the modality-related features by the global multi-modal interaction module to obtain the modality-specific features specifically includes:
step2.3, designing a global multi-modal feature interaction module by using a gate control unit, learning the unique features of different modes, and obtaining the unique features of three modes by using a gate control mechanism under the guidance of relevant features mainly including text modes; taking the voice mode as an example, the output mode related characteristics H of the local cross-mode characteristic interaction module are firstly compared m Output speech modal characteristic H of characteristic representation module a Two independent linear layers are respectively input, the outputs of the two linear layers are used as the input of a gate control unit, the special characteristics of a single mode are filtered by utilizing the multi-mode related characteristics, and the multi-mode self-adaptive gate is providedThe control module comprises the following components:
$\lambda_a = \mathrm{sigmoid}(W_m H_m + W_a H_a)$
where $\lambda_a$ is the similarity weight between the multi-modal related features and the speech features, $W_m$ and $W_a$ are parameter matrices, and $H_a^{s}$ denotes the resulting specific features of the speech modality;
step2.3 above is repeated to obtain the specific features of the text modality and the visual modality, denoted $H_t^{s} \in \mathbb{R}^{l_t \times d_t}$, $H_a^{s} \in \mathbb{R}^{l_a \times d_a}$, and $H_v^{s} \in \mathbb{R}^{l_v \times d_v}$, where $d_t$ is the feature dimension of the text modality, $d_a$ and $d_v$ are the feature dimensions of the speech and visual modalities, $l_t$ is the sequence length of the text modality, and $l_a$ and $l_v$ are the sequence lengths of the speech and visual modalities;
the text-specific features $H_t^{s}$, the speech-specific features $H_a^{s}$, and the vision-specific features $H_v^{s}$ are then concatenated and mapped into a low-dimensional space, yielding the modality-specific features $H_s$.
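A sketch of the adaptive gate of Step2.3. The λ_a line follows the claim's formula (applied row-wise); how λ_a is then used to produce the speech-specific features is not spelled out above, so the (1 − λ_a)-weighting in the last line is purely an illustrative assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
l_a, d = 30, 16                           # toy sequence length and hidden size
H_m = rng.normal(size=(l_a, d))           # modality-related features (aligned shapes assumed)
H_a = rng.normal(size=(l_a, d))           # speech-modality features
W_m = rng.normal(size=(d, d)) * 0.1       # the two independent linear layers of the claim
W_a = rng.normal(size=(d, d)) * 0.1

# similarity weight between the multi-modal related features and the speech features
lam_a = sigmoid(H_m @ W_m + H_a @ W_a)

# ASSUMPTION: keep the part of H_a that the related features do not already carry
H_a_specific = (1.0 - lam_a) * H_a
```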
6. The method of claim 1, wherein the method comprises: in Step2, the effective fusion of the modality-related features and the modality-specific features by the local-global feature fusion module specifically includes:
step2.4, obtaining the modal correlation characteristic H through a local cross-modal characteristic interaction module m Obtaining the special characteristics of the modality through a global multi-modality interaction moduleThen designing a local-global feature fusion module based on the Transformer;
first, the modality-related features and the modality-specific features are stacked into a matrix $M = [H_m; H_s]$; the matrix M is then taken as the input of the Transformer, so that each vector learns the other cross-modal representations through the multi-head self-attention mechanism, and the global multi-modal features are exploited jointly to reach a comprehensive multi-modal emotion judgment;
for the self-attention mechanism, $Q = K = V = M$ is defined; the Transformer generates a new matrix $M'$, and the process is represented as follows:
$head_i = \mathrm{Attention}(QW_i^{q}, KW_i^{k}, VW_i^{v})$

$M' = (head_1 \oplus head_2 \oplus \cdots \oplus head_h)W^{o}$
where $W^{o}$ is the linear-transformation weight matrix, $\oplus$ denotes concatenation, and $\theta_{att} = \{W^{q}, W^{k}, W^{v}, W^{o}\}$;
and finally, the output of the Transformer is obtained, and the output vectors are concatenated and fed into a linear layer to obtain the final prediction result.
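The fusion step with Q = K = V = M can be sketched as below; the head count, dimensions, and random weights are illustrative assumptions, and the final dot product stands in for the prediction linear layer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(M, heads, rng):
    """M' = (head_1 ⊕ ... ⊕ head_h) W_o with Q = K = V = M."""
    n, d = M.shape
    d_h = d // heads
    outs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.normal(size=(d, d_h)) / np.sqrt(d) for _ in range(3))
        Q, K, V = M @ Wq, M @ Wk, M @ Wv
        outs.append(softmax(Q @ K.T / np.sqrt(d_h)) @ V)   # one attention head
    Wo = rng.normal(size=(heads * d_h, d)) / np.sqrt(d)    # output projection
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(5)
H_m = rng.normal(size=(4, 32))            # stacked modality-related vectors (toy)
H_s = rng.normal(size=(3, 32))            # stacked modality-specific vectors (toy)
M = np.concatenate([H_m, H_s], axis=0)    # M = [H_m; H_s]
M_new = multi_head_self_attention(M, heads=4, rng=rng)

# concatenate the output vectors and feed a linear layer for the final score
w = rng.normal(size=M_new.size) / np.sqrt(M_new.size)
y_hat = float(M_new.ravel() @ w)
```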
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210743773.3A CN114969458B (en) | 2022-06-28 | 2022-06-28 | Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114969458A true CN114969458A (en) | 2022-08-30 |
CN114969458B CN114969458B (en) | 2024-04-26 |
Family
ID=82965492
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210743773.3A Active CN114969458B (en) | 2022-06-28 | 2022-06-28 | Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114969458B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115544279A (en) * | 2022-10-11 | 2022-12-30 | 合肥工业大学 | Multi-modal emotion classification method based on cooperative attention and application thereof |
CN115809438A (en) * | 2023-01-18 | 2023-03-17 | 中国科学技术大学 | Multi-modal emotion analysis method, system, device and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112528004A (en) * | 2020-12-24 | 2021-03-19 | 北京百度网讯科技有限公司 | Voice interaction method, voice interaction device, electronic equipment, medium and computer program product |
CN112651448A (en) * | 2020-12-29 | 2021-04-13 | 中山大学 | Multi-modal emotion analysis method for social platform expression package |
CN113420807A (en) * | 2021-06-22 | 2021-09-21 | 哈尔滨理工大学 | Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method |
CN113435496A (en) * | 2021-06-24 | 2021-09-24 | 湖南大学 | Self-adaptive fusion multi-mode emotion classification method based on attention mechanism |
CN113704552A (en) * | 2021-08-31 | 2021-11-26 | 哈尔滨工业大学 | Cross-modal automatic alignment and pre-training language model-based emotion analysis method, system and equipment |
US11281945B1 (en) * | 2021-02-26 | 2022-03-22 | Institute Of Automation, Chinese Academy Of Sciences | Multimodal dimensional emotion recognition method |
CN114463688A (en) * | 2022-04-12 | 2022-05-10 | 之江实验室 | Cross-modal context coding dialogue emotion recognition method and system |
Non-Patent Citations (2)
Title |
---|
LONG YING et al.: "Multi-level Multi-Modal Cross-Attention network for Fake news detection", IEEE ACCESS, 20 September 2021 (2021-09-20), pages 1 - 10 *
卢婵 (LU Chan) et al., Journal of Shandong University (Natural Science Edition), vol. 58, no. 12, 6 September 2023 (2023-09-06), pages 31 - 40 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115544279A (en) * | 2022-10-11 | 2022-12-30 | 合肥工业大学 | Multi-modal emotion classification method based on cooperative attention and application thereof |
CN115544279B (en) * | 2022-10-11 | 2024-01-26 | 合肥工业大学 | Multi-mode emotion classification method based on cooperative attention and application thereof |
CN115809438A (en) * | 2023-01-18 | 2023-03-17 | 中国科学技术大学 | Multi-modal emotion analysis method, system, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
De Rosa et al. | A survey on text generation using generative adversarial networks | |
Huan et al. | Video multimodal emotion recognition based on Bi-GRU and attention fusion | |
CN114969458B (en) | Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion | |
CN110765264A (en) | Text abstract generation method for enhancing semantic relevance | |
CN114529758A (en) | Multi-modal emotion analysis method based on contrast learning and multi-head self-attention mechanism | |
CN117391051B (en) | Emotion-fused common attention network multi-modal false news detection method | |
CN118114188B (en) | False news detection method based on multi-view and layered fusion | |
Lian et al. | A survey of deep learning-based multimodal emotion recognition: Speech, text, and face | |
US20240119716A1 (en) | Method for multimodal emotion classification based on modal space assimilation and contrastive learning | |
CN111563373A (en) | Attribute-level emotion classification method for focused attribute-related text | |
CN116304984A (en) | Multi-modal intention recognition method and system based on contrast learning | |
CN117371456A (en) | Multi-mode irony detection method and system based on feature fusion | |
CN115858728A (en) | Multi-mode data based emotion analysis method | |
CN116933051A (en) | Multi-mode emotion recognition method and system for modal missing scene | |
CN116975350A (en) | Image-text retrieval method, device, equipment and storage medium | |
Gandhi et al. | Multimodal sentiment analysis: review, application domains and future directions | |
CN115481679A (en) | Multi-modal emotion analysis method and system | |
Rani et al. | Deep learning with big data: an emerging trend | |
CN117765450B (en) | Video language understanding method, device, equipment and readable storage medium | |
Zeng et al. | Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities | |
Jia et al. | Semantic association enhancement transformer with relative position for image captioning | |
CN113807307A (en) | Multi-mode joint learning method for video multi-behavior recognition | |
Xue et al. | Intent-enhanced attentive Bert capsule network for zero-shot intention detection | |
CN117893948A (en) | Multi-mode emotion analysis method based on multi-granularity feature comparison and fusion framework | |
Liu et al. | TACFN: transformer-based adaptive cross-modal fusion network for multimodal emotion recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||