CN114969458A - Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance - Google Patents

Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance

Info

Publication number
CN114969458A
CN114969458A
Authority
CN
China
Prior art keywords
modal
text
modality
features
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210743773.3A
Other languages
Chinese (zh)
Other versions
CN114969458B (en)
Inventor
郭军军
卢婵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210743773.3A priority Critical patent/CN114969458B/en
Publication of CN114969458A publication Critical patent/CN114969458A/en
Application granted granted Critical
Publication of CN114969458B publication Critical patent/CN114969458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance, and belongs to the field of natural language processing. The invention comprises the following steps: first, the features of the three modalities, namely text, speech and vision, are extracted separately; then a cross-modal attention mechanism is adopted, with the text modality information as guidance, to realize pairwise cross-modal representation and obtain speech and visual features that are closely related to the text; next, a multi-modal adaptive gating mechanism uses the modality-related features to effectively screen the three unimodal features and obtain the modality-specific features of the three modalities; then a multi-modal hierarchical fusion strategy combines the multi-modal features with the modality importance information; finally, the output is passed through a linear transformation to predict the emotion polarity. The invention trains the model on the public CMU-MOSI dataset. Experimental results show that the method is effective in improving the performance of multi-modal emotion analysis.

Description

Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance
Technical Field
The invention relates to a hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance, and belongs to the field of natural language processing.
Background
With the development of Internet technology, short-video social media platforms such as Douyin (TikTok) and Kuaishou have developed rapidly in recent years. More and more users choose to express their opinions and emotions through videos, which provide a large amount of multimodal data. Multimodal Sentiment Analysis (MSA) has therefore received increasing attention, and related research has been widely applied in various fields such as social media public opinion monitoring and personalized recommendation. Multi-modal emotion analysis therefore has important research significance and application value.
Multi-modal sentiment analysis not only needs to fully represent the information of each single modality, but also needs to consider the interaction and fusion of features across modalities. Zadeh et al. proposed the Tensor Fusion Network (TFN) and the Memory Fusion Network (MFN), which uses LSTMs to learn view-specific interactions. Tsai et al. proposed a cross-modal Transformer that learns cross-modal attention to reinforce the target modality. Yu et al. introduced unimodal subtasks to assist modality representation learning.
Although these methods have achieved some success in the field of multimodal sentiment analysis, existing multi-modal fusion methods usually treat the three modal features as equally important: they focus on fusing the multi-modal features while ignoring the different contributions of the modalities to the final emotion analysis result, so that modality importance information is insufficiently exploited. This may cause important information within the modalities to be lost and degrade multi-modal emotion analysis performance.
Disclosure of Invention
The invention provides a text-guided hierarchical self-adaptive fusion multi-modal emotion analysis method, which takes the text modality information as guidance to realize hierarchical self-adaptive screening and fusion of the multi-modal information, thereby improving the performance of multi-modal emotion analysis.
The technical scheme of the invention is as follows: the multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion comprises the following specific steps:
Step1, preparing a data set and preprocessing the data of the public data set;
Step2, inputting the processed data into the text-guided hierarchical self-adaptive fusion model; the information of the three modalities, namely text, speech and vision, is represented by the feature representation module; the modality-related features are extracted from the obtained text, speech and visual features by the local cross-modal feature interaction module; the modality-related features are used by the global multi-modal interaction module to filter the unimodal features through a gating mechanism and obtain the modality-specific features; and the modality-related features and the modality-specific features are effectively fused by the local-global feature fusion module.
As a further scheme of the invention, the Step1 comprises the following specific steps:
Step1.1, downloading the CMU-MOSI data set, which contains 2199 short monologue video clips; each video clip is manually annotated with an emotion score in [-3, +3], whose polarity represents the emotion intensity from negative to positive; the CMU-MOSI training, validation and test sets contain 1284, 229 and 686 video clips, respectively; the data are then preprocessed into a pkl-format file.
As a further aspect of the present invention, in Step2, representing the information of the three modalities, namely text, speech and vision, through the feature representation module specifically comprises:
Step2.1, a multimodal language sequence involves three modalities: the text modality T, the speech modality A and the visual modality V, and the input sequences are defined as $F_{\{t,a,v\}} \in \mathbb{R}^{l_{\{t,a,v\}} \times d_{\{t,a,v\}}}$, where $l_{\{t,a,v\}}$ denotes the sequence length of each modality; three independent sub-networks are adopted to obtain the feature representations of the three modalities; for the text modality, a pre-trained 12-layer BERT is used to extract the sentence representation, and the first word vector of the last layer is taken as the representation of the whole sentence; the feature representation of the text modality obtained with BERT is:
$H_t = \mathrm{BERT}(F_t; \theta_{bert})$
where $H_t \in \mathbb{R}^{l_t \times d_t}$ denotes the text modality feature, $l_t$ the sequence length of the text modality, $d_t$ the feature dimension of the text modality, and $\theta_{bert}$ the network parameters of the BERT model;
for the speech modality and the visual modality, a unidirectional LSTM is used to capture the temporal features of the two modalities, and the hidden state at the last time step is adopted as the representation of the whole sequence; $F_a$ and $F_v$ are passed through the unidirectional LSTM to obtain the speech and visual modality feature representations:
$H_a = \mathrm{LSTM}(F_a; \theta_{lstm})$
$H_v = \mathrm{LSTM}(F_v; \theta_{lstm})$
where $H_a \in \mathbb{R}^{l_a \times d_a}$ denotes the speech modality feature, $H_v \in \mathbb{R}^{l_v \times d_v}$ denotes the visual modality feature, $l_a, l_v$ denote the sequence lengths of the speech and visual modalities, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, and $\theta_{lstm}$ denotes the network parameters of the LSTM model.
As a further aspect of the present invention, in Step2, extracting the modality-related features from the obtained text, speech and visual features through the local cross-modal feature interaction module specifically comprises:
Step2.2, the correlation between the text modality and the non-text modalities is learned with a cross-modal attention mechanism; given the visual modality V and the text modality T with feature representations $H_v$ and $H_t$, the cross-modal attention (CM) from the text modality to the visual modality is expressed as follows:
$\mathrm{CM}_{t \rightarrow v}(H_t, H_v) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \quad Q = H_v W_Q,\ K = H_t W_K,\ V = H_t W_V$
where $W_Q \in \mathbb{R}^{d_v \times d_k}$, $W_K \in \mathbb{R}^{d_t \times d_k}$ and $W_V \in \mathbb{R}^{d_t \times d_V}$ are linear transformation weight matrices, $d_k$ denotes the dimension of the Q and K vectors, and $d_V$ denotes the dimension of the V vector; two cross-attention modules are used to obtain the two groups of modal interaction features, text-to-speech and text-to-vision: the text modality feature $H_t$ provides the K and V vectors, while the speech modality feature $H_a$ and the visual modality feature $H_v$ each provide the Q vectors; the cross-modal interaction process is expressed as follows:
$H_{t \rightarrow a} = \mathrm{CM}_{t \rightarrow a}(H_t, H_a)$
$H_{t \rightarrow v} = \mathrm{CM}_{t \rightarrow v}(H_t, H_v)$
then the text modality feature $H_t$, the text-speech interaction feature $H_{t \rightarrow a}$ and the text-vision interaction feature $H_{t \rightarrow v}$ are concatenated and mapped into a low-dimensional space; the process is expressed as follows:
$H_m = \mathrm{ReLU}\big(W_1 [H_t; H_{t \rightarrow a}; H_{t \rightarrow v}]\big)$
where $W_1 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_m$ is the modality-related feature of the three modalities.
As a further aspect of the present invention, in Step2, using the modality-related features to filter the unimodal features through a gating mechanism in the global multi-modal interaction module and obtain the modality-specific features specifically comprises:
Step2.3, a global multi-modal feature interaction module is designed with gating units to learn the specific features of the different modalities; under the guidance of the text-dominated modality-related feature, a gating mechanism is used to obtain the specific features of the three modalities; taking the speech modality as an example, the modality-related feature $H_m$ output by the local cross-modal feature interaction module and the speech modality feature $H_a$ output by the feature representation module are first fed into two independent linear layers, the outputs of the two linear layers are used as the inputs of the gating unit, and the multi-modal related feature is used to filter the specific feature of the single modality; the multi-modal adaptive gating module is as follows:
$\lambda_a = \mathrm{sigmoid}(W_m H_m + W_a H_a)$
$H_a^{s} = \lambda_a \odot H_a$
where $\lambda_a$ is the similarity weight between the multi-modal related feature and the speech feature, $W_m$ and $W_a$ are parameter matrices, and $H_a^{s}$ is the specific feature of the speech modality;
Step2.3 is repeated to obtain the specific features of the text modality and the visual modality, which are denoted as $H_t^{s} \in \mathbb{R}^{l_t \times d_t}$ and $H_v^{s} \in \mathbb{R}^{l_v \times d_v}$ respectively, where $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $l_t$ denotes the sequence length of the text modality, and $l_a, l_v$ denote the sequence lengths of the speech and visual modalities;
then the text-specific feature $H_t^{s}$, the speech-specific feature $H_a^{s}$ and the vision-specific feature $H_v^{s}$ are concatenated and mapped into a low-dimensional space to obtain $H_s$; the process is expressed as follows:
$H_s = \mathrm{ReLU}\big(W_2 [H_t^{s}; H_a^{s}; H_v^{s}]\big)$
where $W_2 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_s$ contains the specific features of the different modalities.
As a further aspect of the present invention, in Step2, effectively fusing the modality-related features and the modality-specific features through the local-global feature fusion module specifically comprises:
Step2.4, the modality-related feature $H_m$ is obtained through the local cross-modal feature interaction module, and the modality-specific feature $H_s$ is obtained through the global multi-modal interaction module; a local-global feature fusion module is then designed based on the Transformer;
first, the modality-related feature and the modality-specific feature are stacked into the matrix $M = [H_m; H_s]$; the matrix M is then taken as the input of the Transformer, so that each vector learns the other cross-modal representation through the multi-head self-attention mechanism, and the global multi-modal features are comprehensively exploited to realize a comprehensive judgment of the multi-modal emotion;
for the self-attention mechanism, define $Q = M W^{q}$, $K = M W^{k}$, $V = M W^{v}$; the Transformer generates a new matrix $\bar{M}$, and the process is expressed as follows:
$\bar{M} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h) W^{o}$
$head_i = \mathrm{Attention}(Q W_i^{q}, K W_i^{k}, V W_i^{v})$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
where $W^{q}, W^{k}, W^{v}$ and $W^{o}$ are linear transformation weight matrices, $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, and $\theta_{att} = \{W^{q}, W^{k}, W^{v}, W^{o}\}$;
finally, the output of the Transformer is obtained, the output vectors are concatenated and fed into a linear layer to obtain the final prediction result; the process is expressed as follows:
$[\bar{H}_m; \bar{H}_s] = \bar{M}$
$\hat{y} = W_{out} [\bar{H}_m; \bar{H}_s] + b_{out}$
where $\bar{H}_m$ is the modality-related feature obtained after the Transformer, $\bar{H}_s$ is the modality-specific feature obtained after the Transformer, $W_{out}$ is the weight matrix of the linear output layer, $d_m$ is the dimension of the low-dimensional space, and $b_{out}$ is the bias term.
The invention has the beneficial effects that:
1. For multi-modal emotion analysis, the invention takes modality importance information into account, effectively explores the relationships between and within modalities, and improves the accuracy of multi-modal emotion analysis. A text-modality-guided hierarchical self-adaptive multi-modal fusion method is provided, which realizes hierarchical self-adaptive screening and fusion of the multi-modal information with the text modality as guidance.
2. The cross-modal attention mechanism is used to fully learn the modality-related features, and the modality-specific features are screened and fused through the multi-modal adaptive gating mechanism, which facilitates multi-modal fusion and emotion prediction.
3. Experiments are carried out on the CMU-MOSI and CMU-MOSEI data sets, and the results show that the multi-modal emotion analysis performance is remarkably improved.
Drawings
FIG. 1 is a graph of the results of a CMU-MOSI dataset modal importance ablation experiment of the present invention;
FIG. 2 is a schematic flow chart of a hierarchical adaptive fusion multi-modal emotion analysis method based on text guidance.
Detailed Description
Example 1: as shown in FIGS. 1-2, the text-guided hierarchical self-adaptive fusion multi-modal emotion analysis method trains a model taking the CMU-MOSI data set as an example; the method specifically comprises the following steps:
step1, preparing a data set, and preprocessing the CMU-MOSI data of the public data set;
Step1.1, downloading the CMU-MOSI data set, which contains 2199 short monologue video clips; each video clip is manually annotated with an emotion score in [-3, +3], whose polarity represents the emotion intensity from negative to positive; the CMU-MOSI training, validation and test sets contain 1284, 229 and 686 video clips, respectively; the data are then preprocessed into a pkl-format file.
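For illustration only, a minimal sketch of loading such a preprocessed pkl file is given below. The dictionary keys ('train', 'text', 'audio', 'vision', 'labels') and the file name are assumptions about a typical CMU-MOSI preprocessing layout, not a format specified by the invention.

```python
# Illustrative sketch only: the pkl layout (dictionary keys and array shapes)
# is an assumption, not a format specified by the patent.
import pickle

def load_mosi(path="mosi.pkl", split="train"):
    with open(path, "rb") as f:
        data = pickle.load(f)
    split_data = data[split]
    # typical preprocessed layouts store per-modality arrays plus labels in [-3, +3]
    return (split_data["text"], split_data["audio"],
            split_data["vision"], split_data["labels"])

if __name__ == "__main__":
    text, audio, vision, labels = load_mosi()
    print(len(labels), "training clips")
```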
Step2, inputting the processed data into the text-guided hierarchical self-adaptive fusion model; the information of the three modalities, namely text, speech and vision, is represented by the feature representation module; the modality-related features are extracted from the obtained text, speech and visual features by the local cross-modal feature interaction module; the modality-related features are used by the global multi-modal interaction module to filter the unimodal features through a gating mechanism and obtain the modality-specific features; and the modality-related features and the modality-specific features are effectively fused by the local-global feature fusion module.
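The overall data flow of Step2 can be sketched at the interface level as follows; this is a PyTorch-style sketch in which the class names, interfaces and wiring are placeholders chosen for illustration, not the exact implementation of the invention. The internals of the four sub-modules are detailed in Step2.1 to Step2.4 below.

```python
# Minimal PyTorch sketch of the Step2 pipeline; module names, interfaces and
# dimensions are illustrative assumptions, not the exact implementation.
import torch.nn as nn

class TextGuidedHierarchicalFusion(nn.Module):
    def __init__(self, encoder, cross_modal, gating, fusion):
        super().__init__()
        self.encoder = encoder          # feature representation module (BERT + LSTMs)
        self.cross_modal = cross_modal  # local cross-modal feature interaction module
        self.gating = gating            # global multi-modal interaction (adaptive gating)
        self.fusion = fusion            # local-global feature fusion module

    def forward(self, input_ids, attention_mask, F_a, F_v):
        # 1) represent each modality independently
        H_t, H_a, H_v = self.encoder(input_ids, attention_mask, F_a, F_v)
        # 2) text-guided cross-modal attention -> modality-related feature H_m
        H_m = self.cross_modal(H_t, H_a, H_v)
        # 3) adaptive gating screens each unimodal feature with H_m -> modality-specific feature H_s
        H_s = self.gating(H_m, H_t, H_a, H_v)
        # 4) Transformer-based local-global fusion -> predicted emotion polarity score
        return self.fusion(H_m, H_s)
```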
The specific steps of Step2 are as follows:
Step2.1, a multimodal language sequence involves three modalities: the text modality T, the speech modality A and the visual modality V, and the input sequences are defined as $F_{\{t,a,v\}} \in \mathbb{R}^{l_{\{t,a,v\}} \times d_{\{t,a,v\}}}$, where $l_{\{t,a,v\}}$ denotes the sequence length of each modality; three independent sub-networks are adopted to obtain the feature representations of the three modalities; for the text modality, a pre-trained 12-layer BERT is used to extract the sentence representation, and the first word vector of the last layer is taken as the representation of the whole sentence; the feature representation of the text modality obtained with BERT is:
$H_t = \mathrm{BERT}(F_t; \theta_{bert})$
where $H_t \in \mathbb{R}^{l_t \times d_t}$ denotes the text modality feature, $l_t$ the sequence length of the text modality, $d_t$ the feature dimension of the text modality, and $\theta_{bert}$ the network parameters of the BERT model;
for the speech modality and the visual modality, a unidirectional LSTM is used to capture the temporal features of the two modalities, and the hidden state at the last time step is adopted as the representation of the whole sequence; $F_a$ and $F_v$ are passed through the unidirectional LSTM to obtain the speech and visual modality feature representations:
$H_a = \mathrm{LSTM}(F_a; \theta_{lstm})$
$H_v = \mathrm{LSTM}(F_v; \theta_{lstm})$
where $H_a \in \mathbb{R}^{l_a \times d_a}$ denotes the speech modality feature, $H_v \in \mathbb{R}^{l_v \times d_v}$ denotes the visual modality feature, $l_a, l_v$ denote the sequence lengths of the speech and visual modalities, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, and $\theta_{lstm}$ denotes the network parameters of the LSTM model.
Step2.2, the correlation between the text modality and the non-text modalities is learned with a cross-modal attention mechanism; given the visual modality V and the text modality T with feature representations $H_v$ and $H_t$, the cross-modal attention (CM) from the text modality to the visual modality is expressed as follows:
$\mathrm{CM}_{t \rightarrow v}(H_t, H_v) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \quad Q = H_v W_Q,\ K = H_t W_K,\ V = H_t W_V$
where $W_Q \in \mathbb{R}^{d_v \times d_k}$, $W_K \in \mathbb{R}^{d_t \times d_k}$ and $W_V \in \mathbb{R}^{d_t \times d_V}$ are linear transformation weight matrices, $d_k$ denotes the dimension of the Q and K vectors, and $d_V$ denotes the dimension of the V vector; two cross-attention modules are used to obtain the two groups of modal interaction features, text-to-speech and text-to-vision: the text modality feature $H_t$ provides the K and V vectors, while the speech modality feature $H_a$ and the visual modality feature $H_v$ each provide the Q vectors; the cross-modal interaction process is expressed as follows:
$H_{t \rightarrow a} = \mathrm{CM}_{t \rightarrow a}(H_t, H_a)$
$H_{t \rightarrow v} = \mathrm{CM}_{t \rightarrow v}(H_t, H_v)$
then the text modality feature $H_t$, the text-speech interaction feature $H_{t \rightarrow a}$ and the text-vision interaction feature $H_{t \rightarrow v}$ are concatenated and mapped into a low-dimensional space; the process is expressed as follows:
$H_m = \mathrm{ReLU}\big(W_1 [H_t; H_{t \rightarrow a}; H_{t \rightarrow v}]\big)$
where $W_1 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_m$ is the modality-related feature of the three modalities.
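The text-guided cross-modal attention and the subsequent low-dimensional projection can be sketched as follows; the use of nn.MultiheadAttention as CM, the head count, and the time-mean pooling before concatenation (so that features with different sequence lengths can be concatenated) are assumptions of this sketch rather than details fixed by the invention.

```python
# Sketch of the local cross-modal feature interaction module (Step2.2), using
# nn.MultiheadAttention for CM_{t->a} and CM_{t->v}. The head count and the
# time-mean pooling before concatenation are assumptions of this sketch.
import torch
import torch.nn as nn

class LocalCrossModalInteraction(nn.Module):
    def __init__(self, d_t=768, d_a=32, d_v=32, d_m=128, n_heads=4):
        super().__init__()
        # queries come from the non-text modality, keys/values from the text modality
        self.cm_ta = nn.MultiheadAttention(d_a, n_heads, kdim=d_t, vdim=d_t, batch_first=True)
        self.cm_tv = nn.MultiheadAttention(d_v, n_heads, kdim=d_t, vdim=d_t, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(d_t + d_a + d_v, d_m), nn.ReLU())

    def forward(self, H_t, H_a, H_v):
        H_ta, _ = self.cm_ta(H_a, H_t, H_t)   # text-to-speech interaction feature
        H_tv, _ = self.cm_tv(H_v, H_t, H_t)   # text-to-vision interaction feature
        # pool over time so features with different sequence lengths can be concatenated
        pooled = torch.cat([H_t.mean(1), H_ta.mean(1), H_tv.mean(1)], dim=-1)
        return self.proj(pooled)              # modality-related feature H_m
```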
Step2.3, a global multi-modal feature interaction module is designed with gating units to learn the specific features of the different modalities; under the guidance of the text-dominated modality-related feature, a gating mechanism is used to obtain the specific features of the three modalities; taking the speech modality as an example, the modality-related feature $H_m$ output by the local cross-modal feature interaction module and the speech modality feature $H_a$ output by the feature representation module are first fed into two independent linear layers, the outputs of the two linear layers are used as the inputs of the gating unit, and the multi-modal related feature is used to filter the specific feature of the single modality; the multi-modal adaptive gating module is as follows:
$\lambda_a = \mathrm{sigmoid}(W_m H_m + W_a H_a)$
$H_a^{s} = \lambda_a \odot H_a$
where $\lambda_a$ is the similarity weight between the multi-modal related feature and the speech feature, $W_m$ and $W_a$ are parameter matrices, and $H_a^{s}$ is the specific feature of the speech modality;
Step2.3 is repeated to obtain the specific features of the text modality and the visual modality, which are denoted as $H_t^{s} \in \mathbb{R}^{l_t \times d_t}$ and $H_v^{s} \in \mathbb{R}^{l_v \times d_v}$ respectively, where $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $l_t$ denotes the sequence length of the text modality, and $l_a, l_v$ denote the sequence lengths of the speech and visual modalities;
then the text-specific feature $H_t^{s}$, the speech-specific feature $H_a^{s}$ and the vision-specific feature $H_v^{s}$ are concatenated and mapped into a low-dimensional space to obtain $H_s$; the process is expressed as follows:
$H_s = \mathrm{ReLU}\big(W_2 [H_t^{s}; H_a^{s}; H_v^{s}]\big)$
where $W_2 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_s$ contains the specific features of the different modalities.
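A sketch of the multi-modal adaptive gating follows; the elementwise product as the filtering operation, the mean pooling of each sequence to a single vector, and the projection sizes are assumptions of this sketch.

```python
# Sketch of the global multi-modal interaction module (Step2.3): the
# modality-related feature H_m gates each unimodal feature. The elementwise
# gate, the mean pooling of each sequence, and the projection sizes are
# assumptions of this sketch.
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    def __init__(self, d_m, d_x):
        super().__init__()
        self.W_m = nn.Linear(d_m, d_x, bias=False)  # linear layer on H_m
        self.W_x = nn.Linear(d_x, d_x, bias=False)  # linear layer on the unimodal feature

    def forward(self, H_m, H_x):
        lam = torch.sigmoid(self.W_m(H_m) + self.W_x(H_x))  # similarity weight lambda
        return lam * H_x                                    # filtered modality-specific feature

class GlobalMultimodalInteraction(nn.Module):
    def __init__(self, d_m=128, d_t=768, d_a=32, d_v=32):
        super().__init__()
        self.gate_t = AdaptiveGate(d_m, d_t)
        self.gate_a = AdaptiveGate(d_m, d_a)
        self.gate_v = AdaptiveGate(d_m, d_v)
        self.proj = nn.Sequential(nn.Linear(d_t + d_a + d_v, d_m), nn.ReLU())

    def forward(self, H_m, H_t, H_a, H_v):
        # pool each sequence to a single vector before gating (assumption of this sketch)
        H_t, H_a, H_v = H_t.mean(1), H_a.mean(1), H_v.mean(1)
        specific = torch.cat([self.gate_t(H_m, H_t),
                              self.gate_a(H_m, H_a),
                              self.gate_v(H_m, H_v)], dim=-1)
        return self.proj(specific)   # combined modality-specific feature H_s
```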
Step2.4, the modality-related feature $H_m$ is obtained through the local cross-modal feature interaction module, and the modality-specific feature $H_s$ is obtained through the global multi-modal interaction module; a local-global feature fusion module is then designed based on the Transformer;
first, the modality-related feature and the modality-specific feature are stacked into the matrix $M = [H_m; H_s]$; the matrix M is then taken as the input of the Transformer, so that each vector learns the other cross-modal representation through the multi-head self-attention mechanism, and the global multi-modal features are comprehensively exploited to realize a comprehensive judgment of the multi-modal emotion;
for the self-attention mechanism, define $Q = M W^{q}$, $K = M W^{k}$, $V = M W^{v}$; the Transformer generates a new matrix $\bar{M}$, and the process is expressed as follows:
$\bar{M} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h) W^{o}$
$head_i = \mathrm{Attention}(Q W_i^{q}, K W_i^{k}, V W_i^{v})$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
where $W^{q}, W^{k}, W^{v}$ and $W^{o}$ are linear transformation weight matrices, $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, and $\theta_{att} = \{W^{q}, W^{k}, W^{v}, W^{o}\}$;
finally, the output of the Transformer is obtained, the output vectors are concatenated and fed into a linear layer to obtain the final prediction result; the process is expressed as follows:
$[\bar{H}_m; \bar{H}_s] = \bar{M}$
$\hat{y} = W_{out} [\bar{H}_m; \bar{H}_s] + b_{out}$
where $\bar{H}_m$ is the modality-related feature obtained after the Transformer, $\bar{H}_s$ is the modality-specific feature obtained after the Transformer, $W_{out}$ is the weight matrix of the linear output layer, $d_m$ is the dimension of the low-dimensional space, and $b_{out}$ is the bias term.
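The local-global fusion can be sketched as follows; treating $H_m$ and $H_s$ as a two-token sequence for a standard nn.TransformerEncoder and using a single linear regression head are assumptions of this sketch.

```python
# Sketch of the local-global feature fusion module (Step2.4): H_m and H_s are
# stacked as a two-token sequence, passed through a Transformer encoder
# (multi-head self-attention), and the two output vectors are concatenated and
# fed to a linear layer that regresses the emotion score. Layer and head counts
# are assumptions of this sketch.
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    def __init__(self, d_m=128, n_heads=4, n_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_m, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(2 * d_m, 1)            # final linear prediction layer

    def forward(self, H_m, H_s):
        M = torch.stack([H_m, H_s], dim=1)          # (batch, 2, d_m)
        M_bar = self.transformer(M)                 # each vector attends to the other
        fused = M_bar.reshape(M_bar.size(0), -1)    # concatenate the two output vectors
        return self.out(fused)                      # predicted emotion polarity score
```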
In order to illustrate the effect of the invention, three groups of comparative experiments are set up. Group 1 presents the main experimental results, which are compared with previous work in this field to verify the improvement in multi-modal emotion analysis performance. Group 2 is a model ablation experiment that verifies the validity of the proposed model. Group 3 is a modality importance ablation experiment that verifies the importance of the text modality.
(1) Results of the Main experiment
The CMU-MOSI dataset is used as in most previous works. The training, validation and test sets contained 1284, 229, 686 video clips, respectively. The parameter settings are shown in table 1 below.
Table 1: parameter setting of model
(Table 1 is provided as an image in the original publication.)
Four evaluation indices are used to evaluate the emotion analysis performance of the model: 1) Mean Absolute Error (MAE); 2) correlation coefficient (Corr); 3) Acc-2, binary classification accuracy; 4) F1 score, computed for the binary classification. For all indices except MAE, a higher score indicates better performance; for MAE, lower is better. To fully verify the performance of the proposed model, several mainstream, high-performing multi-modal emotion analysis models are selected and compared under the same experimental environment and data set using these four indices; the experimental results are shown in Table 2 below.
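Before turning to Table 2, a small sketch of how the four indices can be computed is given below; binarizing the scores at zero for Acc-2 and the F1 score is an assumption here, since conventions for handling neutral (zero) labels vary across the literature.

```python
# Sketch of the four evaluation indices for predictions and labels in [-3, +3];
# the >= 0 binarization threshold is an assumption of this sketch.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate(preds, labels):
    preds, labels = np.asarray(preds, float), np.asarray(labels, float)
    mae = np.mean(np.abs(preds - labels))                  # Mean Absolute Error
    corr = np.corrcoef(preds, labels)[0, 1]                # correlation coefficient
    bin_pred, bin_true = preds >= 0, labels >= 0           # binary sentiment polarity
    acc2 = accuracy_score(bin_true, bin_pred)              # Acc-2
    f1 = f1_score(bin_true, bin_pred, average="weighted")  # weighted F1 score
    return {"MAE": mae, "Corr": corr, "Acc_2": acc2, "F1": f1}
```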
TABLE 2 Experimental results of different models on CMU-MOSI data sets
(Table 2 is provided as an image in the original publication.)
Analysis of Table 2 shows that the proposed model outperforms the other comparison models on the CMU-MOSI data set on both evaluation indices, binary emotion classification accuracy and F1 score. Compared with the other models, accuracy improves by 0.76%-5.62% and the F1 value improves by 0.7%-5.64%. Compared with the advanced Self-MM model, Acc-2 improves by 0.76% and the F1 value by 0.7%, because the proposed model considers the importance of the text modality and makes full use of the text modality information to help multi-modal information fusion. Compared with the ICCN model, Acc-2 improves by 3.36% and F1 by 3.36%, because the model considers the correlation and difference of the modality information while accounting for the importance of the text modality, and makes full use of the related features and the specific features of the three modalities, thereby improving model performance. The experimental results fully demonstrate the effectiveness and advancement of the proposed model on the multi-modal emotion classification task.
(2) Model ablation experiment
The invention tests the performance of the full model and its simplified variants on the same training and test data as in Table 2; the experimental results are shown in Table 3 below:
1. (-) Cross-modal attention: the local cross-modal interaction module (guided by the text modality) is removed from the complete model.
2. (-) Gating unit: the global multi-modal interaction module is removed from the complete model.
3. (-) Text gate, (-) speech gate, (-) visual gate: the text gate, the speech gate and the visual gate are removed in turn from the global multi-modal interaction module.
4. Related-feature fusion only: in the local-global feature fusion module, the modality-specific features are removed and only the modality-related features are used.
5. Specific-feature fusion only: in the local-global feature fusion module, the modality-related features are removed and only the modality-specific features are used.
TABLE 3 CMU-MOSI dataset model ablation experimental results
(Table 3 is provided as an image in the original publication.)
1. When the local cross-modal interaction module is removed, the accuracy and the F1 score decrease. The result shows that the local cross-modal interaction module effectively reduces the differences between modalities and learns features complementary to the text modality from the non-text modalities.
2. When the global multi-modal interaction module or the text, speech and visual gating networks are removed, the accuracy and the F1 score decrease. This shows that the global multi-modal interaction module learns the specific features of the different modalities and provides additional information for emotion prediction. The results also show that the multi-modal adaptive gating mechanism is very helpful for filtering the specific information of the unimodal features.
3. In the local-global feature fusion module, when only the modality-related features or only the modality-specific features are fused, the accuracy and the F1 score decrease. The result shows that removing either the related or the specific features affects model performance; when the two kinds of features are fused, the model can learn more feature information, which benefits emotion prediction.
(3) Modal importance ablation experiment
In order to verify that different modalities contribute to the final emotion analysis result to different degrees, the model takes the text modality, the speech modality and the visual modality in turn as the guiding modality, carries out an emotion analysis experiment for each setting, and compares the experimental results. The results are shown in FIG. 1.
The experimental results in FIG. 1 show that the model performs best when the text modality is the guiding modality, while the emotion analysis accuracy and the F1 score both drop significantly when the speech modality or the visual modality is the guiding modality. This shows that in the multi-modal emotion analysis task different modalities have different degrees of importance for the final emotion analysis result; the text modality contributes the most to the emotion analysis result, which reflects its importance.
The method extracts the features of the three modalities, namely text, speech and vision; a cross-modal attention mechanism then uses the text modality information as guidance to realize pairwise cross-modal representation and obtain speech and visual features closely related to the text; a multi-modal adaptive gating mechanism then uses the modality-related features to effectively screen the three unimodal features and obtain the modality-specific features of the three modalities; a multi-modal hierarchical fusion strategy then combines the multi-modal features with the modality importance information; finally, the output is passed through a linear transformation to predict the emotion polarity.
The above experiments prove that introducing the local cross-modal interaction module alleviates the problem of insufficient information fusion between modalities. The text modality, which contributes the most, is taken as the guiding modality, while the speech and visual modalities, which contribute less, serve as auxiliary modalities; the cross-modal attention mechanism realizes pairwise importance-aware representation between modalities, the multi-modal adaptive gating mechanism then realizes hierarchical self-adaptive multi-modal fusion guided by the multi-modal importance information, and finally the modality-related features and the modality-specific features are applied together to fully explore the relationships between and within modalities. Experiments show that the method of the invention achieves better results than multiple baseline models. For the multi-modal emotion analysis task, the text-guided hierarchical self-adaptive fusion multi-modal emotion analysis method provided by the invention is effective in improving the performance of multi-modal emotion analysis.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. A text-guided hierarchical self-adaptive fusion multi-modal emotion analysis method, characterized by comprising the following specific steps:
Step1, preparing a data set and preprocessing the data of the public data set;
Step2, inputting the processed data into the text-guided hierarchical self-adaptive fusion model; the information of the three modalities, namely text, speech and vision, is represented by the feature representation module; the modality-related features are extracted from the obtained text, speech and visual features by the local cross-modal feature interaction module; the modality-related features are used by the global multi-modal interaction module to filter the unimodal features through a gating mechanism and obtain the modality-specific features; and the modality-related features and the modality-specific features are effectively fused by the local-global feature fusion module.
2. The method according to claim 1, characterized in that the specific steps of Step1 are as follows:
Step1.1, downloading the CMU-MOSI data set, which contains 2199 short monologue video clips; each video clip is manually annotated with an emotion score in [-3, +3], whose polarity represents the emotion intensity from negative to positive; the CMU-MOSI training, validation and test sets contain 1284, 229 and 686 video clips, respectively; the data are then preprocessed into a pkl-format file.
3. The method according to claim 1, characterized in that in Step2, representing the information of the three modalities, namely text, speech and vision, through the feature representation module specifically comprises:
Step2.1, a multimodal language sequence involves three modalities: the text modality T, the speech modality A and the visual modality V, and the input sequences are defined as $F_{\{t,a,v\}} \in \mathbb{R}^{l_{\{t,a,v\}} \times d_{\{t,a,v\}}}$, where $l_{\{t,a,v\}}$ denotes the sequence length of each modality; three independent sub-networks are adopted to obtain the feature representations of the three modalities; for the text modality, a pre-trained 12-layer BERT is used to extract the sentence representation, and the first word vector of the last layer is taken as the representation of the whole sentence; the feature representation of the text modality obtained with BERT is:
$H_t = \mathrm{BERT}(F_t; \theta_{bert})$
where $H_t \in \mathbb{R}^{l_t \times d_t}$ denotes the text modality feature, $l_t$ the sequence length of the text modality, $d_t$ the feature dimension of the text modality, and $\theta_{bert}$ the network parameters of the BERT model;
for the speech modality and the visual modality, a unidirectional LSTM is used to capture the temporal features of the two modalities, and the hidden state at the last time step is adopted as the representation of the whole sequence; $F_a$ and $F_v$ are passed through the unidirectional LSTM to obtain the speech and visual modality feature representations:
$H_a = \mathrm{LSTM}(F_a; \theta_{lstm})$
$H_v = \mathrm{LSTM}(F_v; \theta_{lstm})$
where $H_a \in \mathbb{R}^{l_a \times d_a}$ denotes the speech modality feature, $H_v \in \mathbb{R}^{l_v \times d_v}$ denotes the visual modality feature, $l_a, l_v$ denote the sequence lengths of the speech and visual modalities, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, and $\theta_{lstm}$ denotes the network parameters of the LSTM model.
4. The method according to claim 1, characterized in that in Step2, extracting the modality-related features from the obtained text, speech and visual features through the local cross-modal feature interaction module specifically comprises:
Step2.2, the correlation between the text modality and the non-text modalities is learned with a cross-modal attention mechanism; given the visual modality V and the text modality T with feature representations $H_v$ and $H_t$, the cross-modal attention (CM) from the text modality to the visual modality is expressed as follows:
$\mathrm{CM}_{t \rightarrow v}(H_t, H_v) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \quad Q = H_v W_Q,\ K = H_t W_K,\ V = H_t W_V$
where $W_Q \in \mathbb{R}^{d_v \times d_k}$, $W_K \in \mathbb{R}^{d_t \times d_k}$ and $W_V \in \mathbb{R}^{d_t \times d_V}$ are linear transformation weight matrices, $d_k$ denotes the dimension of the Q and K vectors, and $d_V$ denotes the dimension of the V vector; two cross-attention modules are used to obtain the two groups of modal interaction features, text-to-speech and text-to-vision: the text modality feature $H_t$ provides the K and V vectors, while the speech modality feature $H_a$ and the visual modality feature $H_v$ each provide the Q vectors; the cross-modal interaction process is expressed as follows:
$H_{t \rightarrow a} = \mathrm{CM}_{t \rightarrow a}(H_t, H_a)$
$H_{t \rightarrow v} = \mathrm{CM}_{t \rightarrow v}(H_t, H_v)$
then the text modality feature $H_t$, the text-speech interaction feature $H_{t \rightarrow a}$ and the text-vision interaction feature $H_{t \rightarrow v}$ are concatenated and mapped into a low-dimensional space; the process is expressed as follows:
$H_m = \mathrm{ReLU}\big(W_1 [H_t; H_{t \rightarrow a}; H_{t \rightarrow v}]\big)$
where $W_1 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_m$ is the modality-related feature of the three modalities.
5. The method according to claim 1, characterized in that in Step2, using the modality-related features to filter the unimodal features through a gating mechanism in the global multi-modal interaction module and obtain the modality-specific features specifically comprises:
Step2.3, a global multi-modal feature interaction module is designed with gating units to learn the specific features of the different modalities; under the guidance of the text-dominated modality-related feature, a gating mechanism is used to obtain the specific features of the three modalities; taking the speech modality as an example, the modality-related feature $H_m$ output by the local cross-modal feature interaction module and the speech modality feature $H_a$ output by the feature representation module are first fed into two independent linear layers, the outputs of the two linear layers are used as the inputs of the gating unit, and the multi-modal related feature is used to filter the specific feature of the single modality; the multi-modal adaptive gating module is as follows:
$\lambda_a = \mathrm{sigmoid}(W_m H_m + W_a H_a)$
$H_a^{s} = \lambda_a \odot H_a$
where $\lambda_a$ is the similarity weight between the multi-modal related feature and the speech feature, $W_m$ and $W_a$ are parameter matrices, and $H_a^{s}$ is the specific feature of the speech modality;
Step2.3 is repeated to obtain the specific features of the text modality and the visual modality, which are denoted as $H_t^{s} \in \mathbb{R}^{l_t \times d_t}$ and $H_v^{s} \in \mathbb{R}^{l_v \times d_v}$ respectively, where $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $l_t$ denotes the sequence length of the text modality, and $l_a, l_v$ denote the sequence lengths of the speech and visual modalities;
then the text-specific feature $H_t^{s}$, the speech-specific feature $H_a^{s}$ and the vision-specific feature $H_v^{s}$ are concatenated and mapped into a low-dimensional space to obtain $H_s$; the process is expressed as follows:
$H_s = \mathrm{ReLU}\big(W_2 [H_t^{s}; H_a^{s}; H_v^{s}]\big)$
where $W_2 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_s$ contains the specific features of the different modalities.
6. The method according to claim 1, characterized in that in Step2, effectively fusing the modality-related features and the modality-specific features through the local-global feature fusion module specifically comprises:
Step2.4, the modality-related feature $H_m$ is obtained through the local cross-modal feature interaction module, and the modality-specific feature $H_s$ is obtained through the global multi-modal interaction module; a local-global feature fusion module is then designed based on the Transformer;
first, the modality-related feature and the modality-specific feature are stacked into the matrix $M = [H_m; H_s]$; the matrix M is then taken as the input of the Transformer, so that each vector learns the other cross-modal representation through the multi-head self-attention mechanism, and the global multi-modal features are comprehensively exploited to realize a comprehensive judgment of the multi-modal emotion;
for the self-attention mechanism, define $Q = M W^{q}$, $K = M W^{k}$, $V = M W^{v}$; the Transformer generates a new matrix $\bar{M}$, and the process is expressed as follows:
$\bar{M} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h) W^{o}$
$head_i = \mathrm{Attention}(Q W_i^{q}, K W_i^{k}, V W_i^{v})$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
where $W^{q}, W^{k}, W^{v}$ and $W^{o}$ are linear transformation weight matrices, $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, and $\theta_{att} = \{W^{q}, W^{k}, W^{v}, W^{o}\}$;
finally, the output of the Transformer is obtained, the output vectors are concatenated and fed into a linear layer to obtain the final prediction result; the process is expressed as follows:
$[\bar{H}_m; \bar{H}_s] = \bar{M}$
$\hat{y} = W_{out} [\bar{H}_m; \bar{H}_s] + b_{out}$
where $\bar{H}_m$ is the modality-related feature obtained after the Transformer, $\bar{H}_s$ is the modality-specific feature obtained after the Transformer, $W_{out}$ is the weight matrix of the linear output layer, $d_m$ is the dimension of the low-dimensional space, and $b_{out}$ is the bias term.
CN202210743773.3A 2022-06-28 2022-06-28 Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion Active CN114969458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210743773.3A CN114969458B (en) 2022-06-28 2022-06-28 Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210743773.3A CN114969458B (en) 2022-06-28 2022-06-28 Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion

Publications (2)

Publication Number Publication Date
CN114969458A true CN114969458A (en) 2022-08-30
CN114969458B CN114969458B (en) 2024-04-26

Family

ID=82965492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210743773.3A Active CN114969458B (en) 2022-06-28 2022-06-28 Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion

Country Status (1)

Country Link
CN (1) CN114969458B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544279A (en) * 2022-10-11 2022-12-30 合肥工业大学 Multi-modal emotion classification method based on cooperative attention and application thereof
CN115809438A (en) * 2023-01-18 2023-03-17 中国科学技术大学 Multi-modal emotion analysis method, system, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528004A (en) * 2020-12-24 2021-03-19 北京百度网讯科技有限公司 Voice interaction method, voice interaction device, electronic equipment, medium and computer program product
CN112651448A (en) * 2020-12-29 2021-04-13 中山大学 Multi-modal emotion analysis method for social platform expression package
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN113435496A (en) * 2021-06-24 2021-09-24 湖南大学 Self-adaptive fusion multi-mode emotion classification method based on attention mechanism
CN113704552A (en) * 2021-08-31 2021-11-26 哈尔滨工业大学 Cross-modal automatic alignment and pre-training language model-based emotion analysis method, system and equipment
US11281945B1 (en) * 2021-02-26 2022-03-22 Institute Of Automation, Chinese Academy Of Sciences Multimodal dimensional emotion recognition method
CN114463688A (en) * 2022-04-12 2022-05-10 之江实验室 Cross-modal context coding dialogue emotion recognition method and system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528004A (en) * 2020-12-24 2021-03-19 北京百度网讯科技有限公司 Voice interaction method, voice interaction device, electronic equipment, medium and computer program product
CN112651448A (en) * 2020-12-29 2021-04-13 中山大学 Multi-modal emotion analysis method for social platform expression package
US11281945B1 (en) * 2021-02-26 2022-03-22 Institute Of Automation, Chinese Academy Of Sciences Multimodal dimensional emotion recognition method
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN113435496A (en) * 2021-06-24 2021-09-24 湖南大学 Self-adaptive fusion multi-mode emotion classification method based on attention mechanism
CN113704552A (en) * 2021-08-31 2021-11-26 哈尔滨工业大学 Cross-modal automatic alignment and pre-training language model-based emotion analysis method, system and equipment
CN114463688A (en) * 2022-04-12 2022-05-10 之江实验室 Cross-modal context coding dialogue emotion recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LONG YING等: "Multi-level Multi-Modal Cross-Attention network for Fake news detection", IEEE ACCESS, 20 September 2021 (2021-09-20), pages 1 - 10 *
LU Chan et al., Journal of Shandong University (Natural Science), vol. 58, no. 12, 6 September 2023 (2023-09-06), pages 31 - 40 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544279A (en) * 2022-10-11 2022-12-30 合肥工业大学 Multi-modal emotion classification method based on cooperative attention and application thereof
CN115544279B (en) * 2022-10-11 2024-01-26 合肥工业大学 Multi-mode emotion classification method based on cooperative attention and application thereof
CN115809438A (en) * 2023-01-18 2023-03-17 中国科学技术大学 Multi-modal emotion analysis method, system, device and storage medium

Also Published As

Publication number Publication date
CN114969458B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
De Rosa et al. A survey on text generation using generative adversarial networks
Huan et al. Video multimodal emotion recognition based on Bi-GRU and attention fusion
CN114969458B (en) Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion
CN110765264A (en) Text abstract generation method for enhancing semantic relevance
CN114529758A (en) Multi-modal emotion analysis method based on contrast learning and multi-head self-attention mechanism
CN117391051B (en) Emotion-fused common attention network multi-modal false news detection method
CN118114188B (en) False news detection method based on multi-view and layered fusion
Lian et al. A survey of deep learning-based multimodal emotion recognition: Speech, text, and face
US20240119716A1 (en) Method for multimodal emotion classification based on modal space assimilation and contrastive learning
CN111563373A (en) Attribute-level emotion classification method for focused attribute-related text
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
CN117371456A (en) Multi-mode irony detection method and system based on feature fusion
CN115858728A (en) Multi-mode data based emotion analysis method
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
Gandhi et al. Multimodal sentiment analysis: review, application domains and future directions
CN115481679A (en) Multi-modal emotion analysis method and system
Rani et al. Deep learning with big data: an emerging trend
CN117765450B (en) Video language understanding method, device, equipment and readable storage medium
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
Jia et al. Semantic association enhancement transformer with relative position for image captioning
CN113807307A (en) Multi-mode joint learning method for video multi-behavior recognition
Xue et al. Intent-enhanced attentive Bert capsule network for zero-shot intention detection
CN117893948A (en) Multi-mode emotion analysis method based on multi-granularity feature comparison and fusion framework
Liu et al. TACFN: transformer-based adaptive cross-modal fusion network for multimodal emotion recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant