CN114969458A - Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance - Google Patents

Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance

Info

Publication number
CN114969458A
CN114969458A
Authority
CN
China
Prior art keywords
modal
text
modality
features
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210743773.3A
Other languages
Chinese (zh)
Other versions
CN114969458B (en)
Inventor
郭军军
卢婵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210743773.3A priority Critical patent/CN114969458B/en
Publication of CN114969458A publication Critical patent/CN114969458A/en
Application granted granted Critical
Publication of CN114969458B publication Critical patent/CN114969458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance, and belongs to the field of natural language processing. The invention comprises the following steps: first, the features of the three modalities, namely text, speech and vision, are extracted separately; then a cross-modal attention mechanism is adopted, with the text modality information as guidance, to realize pairwise cross-modal representation and obtain speech and visual features that are closely related to the text; next, a multi-modal adaptive gating mechanism uses the modality-related features to effectively screen the three unimodal features and obtain the modality-specific features of the three modalities; then a multi-modal hierarchical fusion strategy combines the multi-modal features with the modality importance information; finally, the output is passed through a linear transformation to predict the emotion polarity. The invention trains the model on the public CMU-MOSI dataset. Experimental results show that the method is effective in improving the performance of multi-modal emotion analysis.

Description

Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance
Technical Field
The invention relates to a hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance, and belongs to the field of natural language processing.
Background
With the development of Internet technology, short-video social media platforms such as Douyin (TikTok) and Kuaishou have developed rapidly in recent years. More and more users choose to express their opinions and emotions through videos, which provide a large amount of multimodal data. Multimodal Sentiment Analysis (MSA) has therefore received increasing attention, and related research has been widely applied in various fields such as social media public opinion monitoring and personalized recommendation. Multi-modal emotion analysis therefore has important research significance and application value.
Multi-modal sentiment analysis not only needs to fully represent the information of each single modality, but also needs to consider the interaction and fusion of features across modalities. Zadeh et al. proposed the Tensor Fusion Network (TFN) and the Memory Fusion Network (MFN), which uses LSTMs to learn view-specific interactions. Tsai et al. proposed a cross-modal Transformer that learns cross-modal attention to reinforce the target modality. Yu et al. introduced unimodal subtasks to assist modality representation learning.
Although these methods have achieved some success in the field of multimodal sentiment analysis, existing multi-modal fusion methods usually treat the three modal features as equally important: they focus on fusing the multi-modal features while ignoring the different contributions of the modalities to the final emotion analysis result, so that modality importance information is insufficiently exploited. This may cause important information within the modalities to be lost and degrade multi-modal emotion analysis performance.
Disclosure of Invention
The invention provides a text-guided hierarchical self-adaptive fusion multi-modal emotion analysis method, which takes the text modality information as guidance to realize hierarchical self-adaptive screening and fusion of the multi-modal information, thereby improving the performance of multi-modal emotion analysis.
The technical scheme of the invention is as follows: the multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion comprises the following specific steps:
Step1, preparing a data set and preprocessing the data of the public data set;
Step2, inputting the processed data into the text-guided hierarchical self-adaptive fusion model; the information of the three modalities, namely text, speech and vision, is represented by the feature representation module; the modality-related features are extracted from the obtained text, speech and visual features by the local cross-modal feature interaction module; the modality-related features are used by the global multi-modal interaction module to filter the unimodal features through a gating mechanism and obtain the modality-specific features; and the modality-related features and the modality-specific features are effectively fused by the local-global feature fusion module.
As a further scheme of the invention, the Step1 comprises the following specific steps:
Step1.1, downloading the CMU-MOSI data set, which contains 2199 short monologue video clips; each video clip is manually annotated with an emotion score in [-3, +3], whose polarity represents the emotion intensity from negative to positive; the CMU-MOSI training, validation and test sets contain 1284, 229 and 686 video clips, respectively; the data are then preprocessed into a pkl-format file.
As a further aspect of the present invention, in Step2, representing the information of the three modalities, namely text, speech and vision, through the feature representation module specifically comprises:
Step2.1, a multimodal language sequence involves three modalities: the text modality T, the speech modality A and the visual modality V, and the input sequences are defined as $F_{\{t,a,v\}} \in \mathbb{R}^{l_{\{t,a,v\}} \times d_{\{t,a,v\}}}$, where $l_{\{t,a,v\}}$ denotes the sequence length of each modality; three independent sub-networks are adopted to obtain the feature representations of the three modalities; for the text modality, a pre-trained 12-layer BERT is used to extract the sentence representation, and the first word vector of the last layer is taken as the representation of the whole sentence; the feature representation of the text modality obtained with BERT is:
$H_t = \mathrm{BERT}(F_t; \theta_{bert})$
where $H_t \in \mathbb{R}^{l_t \times d_t}$ denotes the text modality feature, $l_t$ the sequence length of the text modality, $d_t$ the feature dimension of the text modality, and $\theta_{bert}$ the network parameters of the BERT model;
for the speech modality and the visual modality, a unidirectional LSTM is used to capture the temporal features of the two modalities, and the hidden state at the last time step is adopted as the representation of the whole sequence; $F_a$ and $F_v$ are passed through the unidirectional LSTM to obtain the speech and visual modality feature representations:
$H_a = \mathrm{LSTM}(F_a; \theta_{lstm})$
$H_v = \mathrm{LSTM}(F_v; \theta_{lstm})$
where $H_a \in \mathbb{R}^{l_a \times d_a}$ denotes the speech modality feature, $H_v \in \mathbb{R}^{l_v \times d_v}$ denotes the visual modality feature, $l_a, l_v$ denote the sequence lengths of the speech and visual modalities, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, and $\theta_{lstm}$ denotes the network parameters of the LSTM model.
As a further aspect of the present invention, in Step2, extracting the modality-related features from the obtained text, speech and visual features through the local cross-modal feature interaction module specifically comprises:
Step2.2, the correlation between the text modality and the non-text modalities is learned with a cross-modal attention mechanism; given the visual modality V and the text modality T with feature representations $H_v$ and $H_t$, the cross-modal attention (CM) from the text modality to the visual modality is expressed as follows:
$\mathrm{CM}_{t \rightarrow v}(H_t, H_v) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \quad Q = H_v W_Q,\ K = H_t W_K,\ V = H_t W_V$
where $W_Q \in \mathbb{R}^{d_v \times d_k}$, $W_K \in \mathbb{R}^{d_t \times d_k}$ and $W_V \in \mathbb{R}^{d_t \times d_V}$ are linear transformation weight matrices, $d_k$ denotes the dimension of the Q and K vectors, and $d_V$ denotes the dimension of the V vector; two cross-attention modules are used to obtain the two groups of modal interaction features, text-to-speech and text-to-vision: the text modality feature $H_t$ provides the K and V vectors, while the speech modality feature $H_a$ and the visual modality feature $H_v$ each provide the Q vectors; the cross-modal interaction process is expressed as follows:
$H_{t \rightarrow a} = \mathrm{CM}_{t \rightarrow a}(H_t, H_a)$
$H_{t \rightarrow v} = \mathrm{CM}_{t \rightarrow v}(H_t, H_v)$
then the text modality feature $H_t$, the text-speech interaction feature $H_{t \rightarrow a}$ and the text-vision interaction feature $H_{t \rightarrow v}$ are concatenated and mapped into a low-dimensional space; the process is expressed as follows:
$H_m = \mathrm{ReLU}\big(W_1 [H_t; H_{t \rightarrow a}; H_{t \rightarrow v}]\big)$
where $W_1 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_m$ is the modality-related feature of the three modalities.
As a further aspect of the present invention, in Step2, using the modality-related features to filter the unimodal features through a gating mechanism in the global multi-modal interaction module and obtain the modality-specific features specifically comprises:
Step2.3, a global multi-modal feature interaction module is designed with gating units to learn the specific features of the different modalities; under the guidance of the text-dominated modality-related feature, a gating mechanism is used to obtain the specific features of the three modalities; taking the speech modality as an example, the modality-related feature $H_m$ output by the local cross-modal feature interaction module and the speech modality feature $H_a$ output by the feature representation module are first fed into two independent linear layers, the outputs of the two linear layers are used as the inputs of the gating unit, and the multi-modal related feature is used to filter the specific feature of the single modality; the multi-modal adaptive gating module is as follows:
$\lambda_a = \mathrm{sigmoid}(W_m H_m + W_a H_a)$
$H_a^{s} = \lambda_a \odot H_a$
where $\lambda_a$ is the similarity weight between the multi-modal related feature and the speech feature, $W_m$ and $W_a$ are parameter matrices, and $H_a^{s}$ is the specific feature of the speech modality;
Step2.3 is repeated to obtain the specific features of the text modality and the visual modality, which are denoted as $H_t^{s} \in \mathbb{R}^{l_t \times d_t}$ and $H_v^{s} \in \mathbb{R}^{l_v \times d_v}$ respectively, where $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $l_t$ denotes the sequence length of the text modality, and $l_a, l_v$ denote the sequence lengths of the speech and visual modalities;
then the text-specific feature $H_t^{s}$, the speech-specific feature $H_a^{s}$ and the vision-specific feature $H_v^{s}$ are concatenated and mapped into a low-dimensional space to obtain $H_s$; the process is expressed as follows:
$H_s = \mathrm{ReLU}\big(W_2 [H_t^{s}; H_a^{s}; H_v^{s}]\big)$
where $W_2 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_s$ contains the specific features of the different modalities.
As a further aspect of the present invention, in Step2, effectively fusing the modality-related features and the modality-specific features through the local-global feature fusion module specifically comprises:
Step2.4, the modality-related feature $H_m$ is obtained through the local cross-modal feature interaction module, and the modality-specific feature $H_s$ is obtained through the global multi-modal interaction module; a local-global feature fusion module is then designed based on the Transformer;
first, the modality-related feature and the modality-specific feature are stacked into the matrix $M = [H_m; H_s]$; the matrix M is then taken as the input of the Transformer, so that each vector learns the other cross-modal representation through the multi-head self-attention mechanism, and the global multi-modal features are comprehensively exploited to realize a comprehensive judgment of the multi-modal emotion;
for the self-attention mechanism, define $Q = M W^{q}$, $K = M W^{k}$, $V = M W^{v}$; the Transformer generates a new matrix $\bar{M}$, and the process is expressed as follows:
$\bar{M} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h) W^{o}$
$head_i = \mathrm{Attention}(Q W_i^{q}, K W_i^{k}, V W_i^{v})$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
where $W^{q}, W^{k}, W^{v}$ and $W^{o}$ are linear transformation weight matrices, $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, and $\theta_{att} = \{W^{q}, W^{k}, W^{v}, W^{o}\}$;
finally, the output of the Transformer is obtained, the output vectors are concatenated and fed into a linear layer to obtain the final prediction result; the process is expressed as follows:
$[\bar{H}_m; \bar{H}_s] = \bar{M}$
$\hat{y} = W_{out} [\bar{H}_m; \bar{H}_s] + b_{out}$
where $\bar{H}_m$ is the modality-related feature obtained after the Transformer, $\bar{H}_s$ is the modality-specific feature obtained after the Transformer, $W_{out}$ is the weight matrix of the linear output layer, $d_m$ is the dimension of the low-dimensional space, and $b_{out}$ is the bias term.
The invention has the beneficial effects that:
1. For multi-modal emotion analysis, the invention takes modality importance information into account, effectively explores the relationships between and within modalities, and improves the accuracy of multi-modal emotion analysis. A text-modality-guided hierarchical self-adaptive multi-modal fusion method is provided, which realizes hierarchical self-adaptive screening and fusion of the multi-modal information with the text modality as guidance.
2. The cross-modal attention mechanism is used to fully learn the modality-related features, and the modality-specific features are screened and fused through the multi-modal adaptive gating mechanism, which facilitates multi-modal fusion and emotion prediction.
3. Experiments are carried out on the CMU-MOSI and CMU-MOSEI data sets, and the results show that the multi-modal emotion analysis performance is remarkably improved.
Drawings
FIG. 1 is a graph of the results of a CMU-MOSI dataset modal importance ablation experiment of the present invention;
FIG. 2 is a schematic flow chart of a hierarchical adaptive fusion multi-modal emotion analysis method based on text guidance.
Detailed Description
Example 1: as shown in FIGS. 1-2, the text-guided hierarchical self-adaptive fusion multi-modal emotion analysis method trains a model taking the CMU-MOSI data set as an example; the method specifically comprises the following steps:
step1, preparing a data set, and preprocessing the CMU-MOSI data of the public data set;
Step1.1, downloading the CMU-MOSI data set, which contains 2199 short monologue video clips; each video clip is manually annotated with an emotion score in [-3, +3], whose polarity represents the emotion intensity from negative to positive; the CMU-MOSI training, validation and test sets contain 1284, 229 and 686 video clips, respectively; the data are then preprocessed into a pkl-format file.
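For illustration only, a minimal sketch of loading such a preprocessed pkl file is given below. The dictionary keys ('train', 'text', 'audio', 'vision', 'labels') and the file name are assumptions about a typical CMU-MOSI preprocessing layout, not a format specified by the invention.

```python
# Illustrative sketch only: the pkl layout (dictionary keys and array shapes)
# is an assumption, not a format specified by the patent.
import pickle

def load_mosi(path="mosi.pkl", split="train"):
    with open(path, "rb") as f:
        data = pickle.load(f)
    split_data = data[split]
    # typical preprocessed layouts store per-modality arrays plus labels in [-3, +3]
    return (split_data["text"], split_data["audio"],
            split_data["vision"], split_data["labels"])

if __name__ == "__main__":
    text, audio, vision, labels = load_mosi()
    print(len(labels), "training clips")
```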
Step2, inputting the processed data into the text-guided hierarchical self-adaptive fusion model; the information of the three modalities, namely text, speech and vision, is represented by the feature representation module; the modality-related features are extracted from the obtained text, speech and visual features by the local cross-modal feature interaction module; the modality-related features are used by the global multi-modal interaction module to filter the unimodal features through a gating mechanism and obtain the modality-specific features; and the modality-related features and the modality-specific features are effectively fused by the local-global feature fusion module.
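The overall data flow of Step2 can be sketched at the interface level as follows; this is a PyTorch-style sketch in which the class names, interfaces and wiring are placeholders chosen for illustration, not the exact implementation of the invention. The internals of the four sub-modules are detailed in Step2.1 to Step2.4 below.

```python
# Minimal PyTorch sketch of the Step2 pipeline; module names, interfaces and
# dimensions are illustrative assumptions, not the exact implementation.
import torch.nn as nn

class TextGuidedHierarchicalFusion(nn.Module):
    def __init__(self, encoder, cross_modal, gating, fusion):
        super().__init__()
        self.encoder = encoder          # feature representation module (BERT + LSTMs)
        self.cross_modal = cross_modal  # local cross-modal feature interaction module
        self.gating = gating            # global multi-modal interaction (adaptive gating)
        self.fusion = fusion            # local-global feature fusion module

    def forward(self, input_ids, attention_mask, F_a, F_v):
        # 1) represent each modality independently
        H_t, H_a, H_v = self.encoder(input_ids, attention_mask, F_a, F_v)
        # 2) text-guided cross-modal attention -> modality-related feature H_m
        H_m = self.cross_modal(H_t, H_a, H_v)
        # 3) adaptive gating screens each unimodal feature with H_m -> modality-specific feature H_s
        H_s = self.gating(H_m, H_t, H_a, H_v)
        # 4) Transformer-based local-global fusion -> predicted emotion polarity score
        return self.fusion(H_m, H_s)
```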
The specific steps of Step2 are as follows:
Step2.1, a multimodal language sequence involves three modalities: the text modality T, the speech modality A and the visual modality V, and the input sequences are defined as $F_{\{t,a,v\}} \in \mathbb{R}^{l_{\{t,a,v\}} \times d_{\{t,a,v\}}}$, where $l_{\{t,a,v\}}$ denotes the sequence length of each modality; three independent sub-networks are adopted to obtain the feature representations of the three modalities; for the text modality, a pre-trained 12-layer BERT is used to extract the sentence representation, and the first word vector of the last layer is taken as the representation of the whole sentence; the feature representation of the text modality obtained with BERT is:
$H_t = \mathrm{BERT}(F_t; \theta_{bert})$
where $H_t \in \mathbb{R}^{l_t \times d_t}$ denotes the text modality feature, $l_t$ the sequence length of the text modality, $d_t$ the feature dimension of the text modality, and $\theta_{bert}$ the network parameters of the BERT model;
for the speech modality and the visual modality, a unidirectional LSTM is used to capture the temporal features of the two modalities, and the hidden state at the last time step is adopted as the representation of the whole sequence; $F_a$ and $F_v$ are passed through the unidirectional LSTM to obtain the speech and visual modality feature representations:
$H_a = \mathrm{LSTM}(F_a; \theta_{lstm})$
$H_v = \mathrm{LSTM}(F_v; \theta_{lstm})$
where $H_a \in \mathbb{R}^{l_a \times d_a}$ denotes the speech modality feature, $H_v \in \mathbb{R}^{l_v \times d_v}$ denotes the visual modality feature, $l_a, l_v$ denote the sequence lengths of the speech and visual modalities, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, and $\theta_{lstm}$ denotes the network parameters of the LSTM model.
Step2.2, the correlation between the text modality and the non-text modalities is learned with a cross-modal attention mechanism; given the visual modality V and the text modality T with feature representations $H_v$ and $H_t$, the cross-modal attention (CM) from the text modality to the visual modality is expressed as follows:
$\mathrm{CM}_{t \rightarrow v}(H_t, H_v) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \quad Q = H_v W_Q,\ K = H_t W_K,\ V = H_t W_V$
where $W_Q \in \mathbb{R}^{d_v \times d_k}$, $W_K \in \mathbb{R}^{d_t \times d_k}$ and $W_V \in \mathbb{R}^{d_t \times d_V}$ are linear transformation weight matrices, $d_k$ denotes the dimension of the Q and K vectors, and $d_V$ denotes the dimension of the V vector; two cross-attention modules are used to obtain the two groups of modal interaction features, text-to-speech and text-to-vision: the text modality feature $H_t$ provides the K and V vectors, while the speech modality feature $H_a$ and the visual modality feature $H_v$ each provide the Q vectors; the cross-modal interaction process is expressed as follows:
$H_{t \rightarrow a} = \mathrm{CM}_{t \rightarrow a}(H_t, H_a)$
$H_{t \rightarrow v} = \mathrm{CM}_{t \rightarrow v}(H_t, H_v)$
then the text modality feature $H_t$, the text-speech interaction feature $H_{t \rightarrow a}$ and the text-vision interaction feature $H_{t \rightarrow v}$ are concatenated and mapped into a low-dimensional space; the process is expressed as follows:
$H_m = \mathrm{ReLU}\big(W_1 [H_t; H_{t \rightarrow a}; H_{t \rightarrow v}]\big)$
where $W_1 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_m$ is the modality-related feature of the three modalities.
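The text-guided cross-modal attention and the subsequent low-dimensional projection can be sketched as follows; the use of nn.MultiheadAttention as CM, the head count, and the time-mean pooling before concatenation (so that features with different sequence lengths can be concatenated) are assumptions of this sketch rather than details fixed by the invention.

```python
# Sketch of the local cross-modal feature interaction module (Step2.2), using
# nn.MultiheadAttention for CM_{t->a} and CM_{t->v}. The head count and the
# time-mean pooling before concatenation are assumptions of this sketch.
import torch
import torch.nn as nn

class LocalCrossModalInteraction(nn.Module):
    def __init__(self, d_t=768, d_a=32, d_v=32, d_m=128, n_heads=4):
        super().__init__()
        # queries come from the non-text modality, keys/values from the text modality
        self.cm_ta = nn.MultiheadAttention(d_a, n_heads, kdim=d_t, vdim=d_t, batch_first=True)
        self.cm_tv = nn.MultiheadAttention(d_v, n_heads, kdim=d_t, vdim=d_t, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(d_t + d_a + d_v, d_m), nn.ReLU())

    def forward(self, H_t, H_a, H_v):
        H_ta, _ = self.cm_ta(H_a, H_t, H_t)   # text-to-speech interaction feature
        H_tv, _ = self.cm_tv(H_v, H_t, H_t)   # text-to-vision interaction feature
        # pool over time so features with different sequence lengths can be concatenated
        pooled = torch.cat([H_t.mean(1), H_ta.mean(1), H_tv.mean(1)], dim=-1)
        return self.proj(pooled)              # modality-related feature H_m
```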
Step2.3, a global multi-modal feature interaction module is designed with gating units to learn the specific features of the different modalities; under the guidance of the text-dominated modality-related feature, a gating mechanism is used to obtain the specific features of the three modalities; taking the speech modality as an example, the modality-related feature $H_m$ output by the local cross-modal feature interaction module and the speech modality feature $H_a$ output by the feature representation module are first fed into two independent linear layers, the outputs of the two linear layers are used as the inputs of the gating unit, and the multi-modal related feature is used to filter the specific feature of the single modality; the multi-modal adaptive gating module is as follows:
$\lambda_a = \mathrm{sigmoid}(W_m H_m + W_a H_a)$
$H_a^{s} = \lambda_a \odot H_a$
where $\lambda_a$ is the similarity weight between the multi-modal related feature and the speech feature, $W_m$ and $W_a$ are parameter matrices, and $H_a^{s}$ is the specific feature of the speech modality;
Step2.3 is repeated to obtain the specific features of the text modality and the visual modality, which are denoted as $H_t^{s} \in \mathbb{R}^{l_t \times d_t}$ and $H_v^{s} \in \mathbb{R}^{l_v \times d_v}$ respectively, where $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $l_t$ denotes the sequence length of the text modality, and $l_a, l_v$ denote the sequence lengths of the speech and visual modalities;
then the text-specific feature $H_t^{s}$, the speech-specific feature $H_a^{s}$ and the vision-specific feature $H_v^{s}$ are concatenated and mapped into a low-dimensional space to obtain $H_s$; the process is expressed as follows:
$H_s = \mathrm{ReLU}\big(W_2 [H_t^{s}; H_a^{s}; H_v^{s}]\big)$
where $W_2 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_s$ contains the specific features of the different modalities.
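A sketch of the multi-modal adaptive gating follows; the elementwise product as the filtering operation, the mean pooling of each sequence to a single vector, and the projection sizes are assumptions of this sketch.

```python
# Sketch of the global multi-modal interaction module (Step2.3): the
# modality-related feature H_m gates each unimodal feature. The elementwise
# gate, the mean pooling of each sequence, and the projection sizes are
# assumptions of this sketch.
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    def __init__(self, d_m, d_x):
        super().__init__()
        self.W_m = nn.Linear(d_m, d_x, bias=False)  # linear layer on H_m
        self.W_x = nn.Linear(d_x, d_x, bias=False)  # linear layer on the unimodal feature

    def forward(self, H_m, H_x):
        lam = torch.sigmoid(self.W_m(H_m) + self.W_x(H_x))  # similarity weight lambda
        return lam * H_x                                    # filtered modality-specific feature

class GlobalMultimodalInteraction(nn.Module):
    def __init__(self, d_m=128, d_t=768, d_a=32, d_v=32):
        super().__init__()
        self.gate_t = AdaptiveGate(d_m, d_t)
        self.gate_a = AdaptiveGate(d_m, d_a)
        self.gate_v = AdaptiveGate(d_m, d_v)
        self.proj = nn.Sequential(nn.Linear(d_t + d_a + d_v, d_m), nn.ReLU())

    def forward(self, H_m, H_t, H_a, H_v):
        # pool each sequence to a single vector before gating (assumption of this sketch)
        H_t, H_a, H_v = H_t.mean(1), H_a.mean(1), H_v.mean(1)
        specific = torch.cat([self.gate_t(H_m, H_t),
                              self.gate_a(H_m, H_a),
                              self.gate_v(H_m, H_v)], dim=-1)
        return self.proj(specific)   # combined modality-specific feature H_s
```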
Step2.4, the modality-related feature $H_m$ is obtained through the local cross-modal feature interaction module, and the modality-specific feature $H_s$ is obtained through the global multi-modal interaction module; a local-global feature fusion module is then designed based on the Transformer;
first, the modality-related feature and the modality-specific feature are stacked into the matrix $M = [H_m; H_s]$; the matrix M is then taken as the input of the Transformer, so that each vector learns the other cross-modal representation through the multi-head self-attention mechanism, and the global multi-modal features are comprehensively exploited to realize a comprehensive judgment of the multi-modal emotion;
for the self-attention mechanism, define $Q = M W^{q}$, $K = M W^{k}$, $V = M W^{v}$; the Transformer generates a new matrix $\bar{M}$, and the process is expressed as follows:
$\bar{M} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h) W^{o}$
$head_i = \mathrm{Attention}(Q W_i^{q}, K W_i^{k}, V W_i^{v})$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
where $W^{q}, W^{k}, W^{v}$ and $W^{o}$ are linear transformation weight matrices, $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, and $\theta_{att} = \{W^{q}, W^{k}, W^{v}, W^{o}\}$;
finally, the output of the Transformer is obtained, the output vectors are concatenated and fed into a linear layer to obtain the final prediction result; the process is expressed as follows:
$[\bar{H}_m; \bar{H}_s] = \bar{M}$
$\hat{y} = W_{out} [\bar{H}_m; \bar{H}_s] + b_{out}$
where $\bar{H}_m$ is the modality-related feature obtained after the Transformer, $\bar{H}_s$ is the modality-specific feature obtained after the Transformer, $W_{out}$ is the weight matrix of the linear output layer, $d_m$ is the dimension of the low-dimensional space, and $b_{out}$ is the bias term.
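The local-global fusion can be sketched as follows; treating $H_m$ and $H_s$ as a two-token sequence for a standard nn.TransformerEncoder and using a single linear regression head are assumptions of this sketch.

```python
# Sketch of the local-global feature fusion module (Step2.4): H_m and H_s are
# stacked as a two-token sequence, passed through a Transformer encoder
# (multi-head self-attention), and the two output vectors are concatenated and
# fed to a linear layer that regresses the emotion score. Layer and head counts
# are assumptions of this sketch.
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    def __init__(self, d_m=128, n_heads=4, n_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_m, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(2 * d_m, 1)            # final linear prediction layer

    def forward(self, H_m, H_s):
        M = torch.stack([H_m, H_s], dim=1)          # (batch, 2, d_m)
        M_bar = self.transformer(M)                 # each vector attends to the other
        fused = M_bar.reshape(M_bar.size(0), -1)    # concatenate the two output vectors
        return self.out(fused)                      # predicted emotion polarity score
```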
In order to illustrate the effect of the invention, three groups of comparative experiments are set up. Group 1 presents the main experimental results, which are compared with previous work in this field to verify the improvement in multi-modal emotion analysis performance. Group 2 is a model ablation experiment that verifies the validity of the proposed model. Group 3 is a modality importance ablation experiment that verifies the importance of the text modality.
(1) Results of the Main experiment
The CMU-MOSI dataset is used as in most previous works. The training, validation and test sets contained 1284, 229, 686 video clips, respectively. The parameter settings are shown in table 1 below.
Table 1: parameter setting of model
(Table 1 is provided as an image in the original publication.)
Four evaluation indices are used to evaluate the emotion analysis performance of the model: 1) Mean Absolute Error (MAE); 2) correlation coefficient (Corr); 3) Acc-2, binary classification accuracy; 4) F1 score, computed for the binary classification. For all indices except MAE, a higher score indicates better performance; for MAE, lower is better. To fully verify the performance of the proposed model, several mainstream, high-performing multi-modal emotion analysis models are selected and compared under the same experimental environment and data set using these four indices; the experimental results are shown in Table 2 below.
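Before turning to Table 2, a small sketch of how the four indices can be computed is given below; binarizing the scores at zero for Acc-2 and the F1 score is an assumption here, since conventions for handling neutral (zero) labels vary across the literature.

```python
# Sketch of the four evaluation indices for predictions and labels in [-3, +3];
# the >= 0 binarization threshold is an assumption of this sketch.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate(preds, labels):
    preds, labels = np.asarray(preds, float), np.asarray(labels, float)
    mae = np.mean(np.abs(preds - labels))                  # Mean Absolute Error
    corr = np.corrcoef(preds, labels)[0, 1]                # correlation coefficient
    bin_pred, bin_true = preds >= 0, labels >= 0           # binary sentiment polarity
    acc2 = accuracy_score(bin_true, bin_pred)              # Acc-2
    f1 = f1_score(bin_true, bin_pred, average="weighted")  # weighted F1 score
    return {"MAE": mae, "Corr": corr, "Acc_2": acc2, "F1": f1}
```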
TABLE 2 Experimental results of different models on CMU-MOSI data sets
(Table 2 is provided as an image in the original publication.)
Analysis of Table 2 shows that the proposed model outperforms the other comparison models on the CMU-MOSI data set on both evaluation indices, binary emotion classification accuracy and F1 score. Compared with the other models, accuracy improves by 0.76%-5.62% and the F1 value improves by 0.7%-5.64%. Compared with the advanced Self-MM model, Acc-2 improves by 0.76% and the F1 value by 0.7%, because the proposed model considers the importance of the text modality and makes full use of the text modality information to help multi-modal information fusion. Compared with the ICCN model, Acc-2 improves by 3.36% and F1 by 3.36%, because the model considers the correlation and difference of the modality information while accounting for the importance of the text modality, and makes full use of the related features and the specific features of the three modalities, thereby improving model performance. The experimental results fully demonstrate the effectiveness and advancement of the proposed model on the multi-modal emotion classification task.
(2) Model ablation experiment
The invention tests the performance of the full model and its simplified variants on the same training and test data as in Table 2; the experimental results are shown in Table 3 below:
1. (-) Cross-modal attention: the local cross-modal interaction module (guided by the text modality) is removed from the complete model.
2. (-) Gating unit: the global multi-modal interaction module is removed from the complete model.
3. (-) Text gate, (-) speech gate, (-) visual gate: the text gate, the speech gate and the visual gate are removed in turn from the global multi-modal interaction module.
4. Related-feature fusion only: in the local-global feature fusion module, the modality-specific features are removed and only the modality-related features are used.
5. Specific-feature fusion only: in the local-global feature fusion module, the modality-related features are removed and only the modality-specific features are used.
TABLE 3 CMU-MOSI dataset model ablation experimental results
(Table 3 is provided as an image in the original publication.)
1. When the local cross-modal interaction module is removed, the accuracy and the F1 score decrease. The result shows that the local cross-modal interaction module effectively reduces the differences between modalities and learns features complementary to the text modality from the non-text modalities.
2. When the global multi-modal interaction module or the text, speech and visual gating networks are removed, the accuracy and the F1 score decrease. This shows that the global multi-modal interaction module learns the specific features of the different modalities and provides additional information for emotion prediction. The results also show that the multi-modal adaptive gating mechanism is very helpful for filtering the specific information of the unimodal features.
3. In the local-global feature fusion module, when only the modality-related features or only the modality-specific features are fused, the accuracy and the F1 score decrease. The result shows that removing either the related or the specific features affects model performance; when the two kinds of features are fused, the model can learn more feature information, which benefits emotion prediction.
(3) Modal importance ablation experiment
In order to verify that different modalities contribute to the final emotion analysis result to different degrees, the model takes the text modality, the speech modality and the visual modality in turn as the guiding modality, carries out an emotion analysis experiment for each setting, and compares the experimental results. The results are shown in FIG. 1.
The experimental results in FIG. 1 show that the model performs best when the text modality is the guiding modality, while the emotion analysis accuracy and the F1 score both drop significantly when the speech modality or the visual modality is the guiding modality. This shows that in the multi-modal emotion analysis task different modalities have different degrees of importance for the final emotion analysis result; the text modality contributes the most to the emotion analysis result, which reflects its importance.
The method extracts the features of the three modalities, namely text, speech and vision; a cross-modal attention mechanism then uses the text modality information as guidance to realize pairwise cross-modal representation and obtain speech and visual features closely related to the text; a multi-modal adaptive gating mechanism then uses the modality-related features to effectively screen the three unimodal features and obtain the modality-specific features of the three modalities; a multi-modal hierarchical fusion strategy then combines the multi-modal features with the modality importance information; finally, the output is passed through a linear transformation to predict the emotion polarity.
The above experiments prove that introducing the local cross-modal interaction module alleviates the problem of insufficient information fusion between modalities. The text modality, which contributes the most, is taken as the guiding modality, while the speech and visual modalities, which contribute less, serve as auxiliary modalities; the cross-modal attention mechanism realizes pairwise importance-aware representation between modalities, the multi-modal adaptive gating mechanism then realizes hierarchical self-adaptive multi-modal fusion guided by the multi-modal importance information, and finally the modality-related features and the modality-specific features are applied together to fully explore the relationships between and within modalities. Experiments show that the method of the invention achieves better results than multiple baseline models. For the multi-modal emotion analysis task, the text-guided hierarchical self-adaptive fusion multi-modal emotion analysis method provided by the invention is effective in improving the performance of multi-modal emotion analysis.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. A text-guided hierarchical self-adaptive fusion multi-modal emotion analysis method, characterized by comprising the following specific steps:
Step1, preparing a data set and preprocessing the data of the public data set;
Step2, inputting the processed data into the text-guided hierarchical self-adaptive fusion model; the information of the three modalities, namely text, speech and vision, is represented by the feature representation module; the modality-related features are extracted from the obtained text, speech and visual features by the local cross-modal feature interaction module; the modality-related features are used by the global multi-modal interaction module to filter the unimodal features through a gating mechanism and obtain the modality-specific features; and the modality-related features and the modality-specific features are effectively fused by the local-global feature fusion module.
2. The method according to claim 1, characterized in that the specific steps of Step1 are as follows:
Step1.1, downloading the CMU-MOSI data set, which contains 2199 short monologue video clips; each video clip is manually annotated with an emotion score in [-3, +3], whose polarity represents the emotion intensity from negative to positive; the CMU-MOSI training, validation and test sets contain 1284, 229 and 686 video clips, respectively; the data are then preprocessed into a pkl-format file.
3. The method according to claim 1, characterized in that in Step2, representing the information of the three modalities, namely text, speech and vision, through the feature representation module specifically comprises:
Step2.1, a multimodal language sequence involves three modalities: the text modality T, the speech modality A and the visual modality V, and the input sequences are defined as $F_{\{t,a,v\}} \in \mathbb{R}^{l_{\{t,a,v\}} \times d_{\{t,a,v\}}}$, where $l_{\{t,a,v\}}$ denotes the sequence length of each modality; three independent sub-networks are adopted to obtain the feature representations of the three modalities; for the text modality, a pre-trained 12-layer BERT is used to extract the sentence representation, and the first word vector of the last layer is taken as the representation of the whole sentence; the feature representation of the text modality obtained with BERT is:
$H_t = \mathrm{BERT}(F_t; \theta_{bert})$
where $H_t \in \mathbb{R}^{l_t \times d_t}$ denotes the text modality feature, $l_t$ the sequence length of the text modality, $d_t$ the feature dimension of the text modality, and $\theta_{bert}$ the network parameters of the BERT model;
for the speech modality and the visual modality, a unidirectional LSTM is used to capture the temporal features of the two modalities, and the hidden state at the last time step is adopted as the representation of the whole sequence; $F_a$ and $F_v$ are passed through the unidirectional LSTM to obtain the speech and visual modality feature representations:
$H_a = \mathrm{LSTM}(F_a; \theta_{lstm})$
$H_v = \mathrm{LSTM}(F_v; \theta_{lstm})$
where $H_a \in \mathbb{R}^{l_a \times d_a}$ denotes the speech modality feature, $H_v \in \mathbb{R}^{l_v \times d_v}$ denotes the visual modality feature, $l_a, l_v$ denote the sequence lengths of the speech and visual modalities, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, and $\theta_{lstm}$ denotes the network parameters of the LSTM model.
4. The method according to claim 1, characterized in that in Step2, extracting the modality-related features from the obtained text, speech and visual features through the local cross-modal feature interaction module specifically comprises:
Step2.2, the correlation between the text modality and the non-text modalities is learned with a cross-modal attention mechanism; given the visual modality V and the text modality T with feature representations $H_v$ and $H_t$, the cross-modal attention (CM) from the text modality to the visual modality is expressed as follows:
$\mathrm{CM}_{t \rightarrow v}(H_t, H_v) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \quad Q = H_v W_Q,\ K = H_t W_K,\ V = H_t W_V$
where $W_Q \in \mathbb{R}^{d_v \times d_k}$, $W_K \in \mathbb{R}^{d_t \times d_k}$ and $W_V \in \mathbb{R}^{d_t \times d_V}$ are linear transformation weight matrices, $d_k$ denotes the dimension of the Q and K vectors, and $d_V$ denotes the dimension of the V vector; two cross-attention modules are used to obtain the two groups of modal interaction features, text-to-speech and text-to-vision: the text modality feature $H_t$ provides the K and V vectors, while the speech modality feature $H_a$ and the visual modality feature $H_v$ each provide the Q vectors; the cross-modal interaction process is expressed as follows:
$H_{t \rightarrow a} = \mathrm{CM}_{t \rightarrow a}(H_t, H_a)$
$H_{t \rightarrow v} = \mathrm{CM}_{t \rightarrow v}(H_t, H_v)$
then the text modality feature $H_t$, the text-speech interaction feature $H_{t \rightarrow a}$ and the text-vision interaction feature $H_{t \rightarrow v}$ are concatenated and mapped into a low-dimensional space; the process is expressed as follows:
$H_m = \mathrm{ReLU}\big(W_1 [H_t; H_{t \rightarrow a}; H_{t \rightarrow v}]\big)$
where $W_1 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_m$ is the modality-related feature of the three modalities.
5. The method according to claim 1, characterized in that in Step2, using the modality-related features to filter the unimodal features through a gating mechanism in the global multi-modal interaction module and obtain the modality-specific features specifically comprises:
Step2.3, a global multi-modal feature interaction module is designed with gating units to learn the specific features of the different modalities; under the guidance of the text-dominated modality-related feature, a gating mechanism is used to obtain the specific features of the three modalities; taking the speech modality as an example, the modality-related feature $H_m$ output by the local cross-modal feature interaction module and the speech modality feature $H_a$ output by the feature representation module are first fed into two independent linear layers, the outputs of the two linear layers are used as the inputs of the gating unit, and the multi-modal related feature is used to filter the specific feature of the single modality; the multi-modal adaptive gating module is as follows:
$\lambda_a = \mathrm{sigmoid}(W_m H_m + W_a H_a)$
$H_a^{s} = \lambda_a \odot H_a$
where $\lambda_a$ is the similarity weight between the multi-modal related feature and the speech feature, $W_m$ and $W_a$ are parameter matrices, and $H_a^{s}$ is the specific feature of the speech modality;
Step2.3 is repeated to obtain the specific features of the text modality and the visual modality, which are denoted as $H_t^{s} \in \mathbb{R}^{l_t \times d_t}$ and $H_v^{s} \in \mathbb{R}^{l_v \times d_v}$ respectively, where $d_t$ denotes the feature dimension of the text modality, $d_a, d_v$ denote the feature dimensions of the speech and visual modalities, $l_t$ denotes the sequence length of the text modality, and $l_a, l_v$ denote the sequence lengths of the speech and visual modalities;
then the text-specific feature $H_t^{s}$, the speech-specific feature $H_a^{s}$ and the vision-specific feature $H_v^{s}$ are concatenated and mapped into a low-dimensional space to obtain $H_s$; the process is expressed as follows:
$H_s = \mathrm{ReLU}\big(W_2 [H_t^{s}; H_a^{s}; H_v^{s}]\big)$
where $W_2 \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$, $d_m$ denotes the dimension of the low-dimensional space, ReLU is the activation function, and $H_s$ contains the specific features of the different modalities.
6. The method according to claim 1, characterized in that in Step2, effectively fusing the modality-related features and the modality-specific features through the local-global feature fusion module specifically comprises:
Step2.4, the modality-related feature $H_m$ is obtained through the local cross-modal feature interaction module, and the modality-specific feature $H_s$ is obtained through the global multi-modal interaction module; a local-global feature fusion module is then designed based on the Transformer;
first, the modality-related feature and the modality-specific feature are stacked into the matrix $M = [H_m; H_s]$; the matrix M is then taken as the input of the Transformer, so that each vector learns the other cross-modal representation through the multi-head self-attention mechanism, and the global multi-modal features are comprehensively exploited to realize a comprehensive judgment of the multi-modal emotion;
for the self-attention mechanism, define $Q = M W^{q}$, $K = M W^{k}$, $V = M W^{v}$; the Transformer generates a new matrix $\bar{M}$, and the process is expressed as follows:
$\bar{M} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h) W^{o}$
$head_i = \mathrm{Attention}(Q W_i^{q}, K W_i^{k}, V W_i^{v})$
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
where $W^{q}, W^{k}, W^{v}$ and $W^{o}$ are linear transformation weight matrices, $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, and $\theta_{att} = \{W^{q}, W^{k}, W^{v}, W^{o}\}$;
finally, the output of the Transformer is obtained, the output vectors are concatenated and fed into a linear layer to obtain the final prediction result; the process is expressed as follows:
$[\bar{H}_m; \bar{H}_s] = \bar{M}$
$\hat{y} = W_{out} [\bar{H}_m; \bar{H}_s] + b_{out}$
where $\bar{H}_m$ is the modality-related feature obtained after the Transformer, $\bar{H}_s$ is the modality-specific feature obtained after the Transformer, $W_{out}$ is the weight matrix of the linear output layer, $d_m$ is the dimension of the low-dimensional space, and $b_{out}$ is the bias term.
CN202210743773.3A 2022-06-28 2022-06-28 Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion Active CN114969458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210743773.3A CN114969458B (en) 2022-06-28 2022-06-28 Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210743773.3A CN114969458B (en) 2022-06-28 2022-06-28 Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion

Publications (2)

Publication Number Publication Date
CN114969458A true CN114969458A (en) 2022-08-30
CN114969458B CN114969458B (en) 2024-04-26

Family

ID=82965492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210743773.3A Active CN114969458B (en) 2022-06-28 2022-06-28 Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion

Country Status (1)

Country Link
CN (1) CN114969458B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544279A (en) * 2022-10-11 2022-12-30 合肥工业大学 Multi-modal emotion classification method based on cooperative attention and application thereof
CN115809438A (en) * 2023-01-18 2023-03-17 中国科学技术大学 Multi-modal emotion analysis method, system, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528004A (en) * 2020-12-24 2021-03-19 北京百度网讯科技有限公司 Voice interaction method, voice interaction device, electronic equipment, medium and computer program product
CN112651448A (en) * 2020-12-29 2021-04-13 中山大学 Multi-modal emotion analysis method for social platform expression package
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN113435496A (en) * 2021-06-24 2021-09-24 湖南大学 Self-adaptive fusion multi-mode emotion classification method based on attention mechanism
CN113704552A (en) * 2021-08-31 2021-11-26 哈尔滨工业大学 Cross-modal automatic alignment and pre-training language model-based emotion analysis method, system and equipment
US11281945B1 (en) * 2021-02-26 2022-03-22 Institute Of Automation, Chinese Academy Of Sciences Multimodal dimensional emotion recognition method
CN114463688A (en) * 2022-04-12 2022-05-10 之江实验室 Cross-modal context coding dialogue emotion recognition method and system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528004A (en) * 2020-12-24 2021-03-19 北京百度网讯科技有限公司 Voice interaction method, voice interaction device, electronic equipment, medium and computer program product
CN112651448A (en) * 2020-12-29 2021-04-13 中山大学 Multi-modal emotion analysis method for social platform expression package
US11281945B1 (en) * 2021-02-26 2022-03-22 Institute Of Automation, Chinese Academy Of Sciences Multimodal dimensional emotion recognition method
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN113435496A (en) * 2021-06-24 2021-09-24 湖南大学 Self-adaptive fusion multi-mode emotion classification method based on attention mechanism
CN113704552A (en) * 2021-08-31 2021-11-26 哈尔滨工业大学 Cross-modal automatic alignment and pre-training language model-based emotion analysis method, system and equipment
CN114463688A (en) * 2022-04-12 2022-05-10 之江实验室 Cross-modal context coding dialogue emotion recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LONG YING等: "Multi-level Multi-Modal Cross-Attention network for Fake news detection", IEEE ACCESS, 20 September 2021 (2021-09-20), pages 1 - 10 *
LU Chan et al., Journal of Shandong University (Natural Science), vol. 58, no. 12, 6 September 2023 (2023-09-06), pages 31 - 40 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544279A (en) * 2022-10-11 2022-12-30 合肥工业大学 Multi-modal emotion classification method based on cooperative attention and application thereof
CN115544279B (en) * 2022-10-11 2024-01-26 合肥工业大学 Multi-mode emotion classification method based on cooperative attention and application thereof
CN115809438A (en) * 2023-01-18 2023-03-17 中国科学技术大学 Multi-modal emotion analysis method, system, device and storage medium

Also Published As

Publication number Publication date
CN114969458B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
De Rosa et al. A survey on text generation using generative adversarial networks
Huan et al. Video multimodal emotion recognition based on Bi-GRU and attention fusion
CN114969458B (en) Multi-modal emotion analysis method based on text guidance and hierarchical self-adaptive fusion
CN110765264A (en) Text abstract generation method for enhancing semantic relevance
CN114529758A (en) Multi-modal emotion analysis method based on contrast learning and multi-head self-attention mechanism
CN117391051B (en) Emotion-fused common attention network multi-modal false news detection method
CN118114188B (en) False news detection method based on multi-view and layered fusion
Lian et al. A survey of deep learning-based multimodal emotion recognition: Speech, text, and face
US20240119716A1 (en) Method for multimodal emotion classification based on modal space assimilation and contrastive learning
CN111563373A (en) Attribute-level emotion classification method for focused attribute-related text
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
CN117371456A (en) Multi-mode irony detection method and system based on feature fusion
CN115858728A (en) Multi-mode data based emotion analysis method
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
Gandhi et al. Multimodal sentiment analysis: review, application domains and future directions
CN115481679A (en) Multi-modal emotion analysis method and system
Rani et al. Deep learning with big data: an emerging trend
CN117765450B (en) Video language understanding method, device, equipment and readable storage medium
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
Jia et al. Semantic association enhancement transformer with relative position for image captioning
CN113807307A (en) Multi-mode joint learning method for video multi-behavior recognition
Xue et al. Intent-enhanced attentive Bert capsule network for zero-shot intention detection
CN117893948A (en) Multi-mode emotion analysis method based on multi-granularity feature comparison and fusion framework
Liu et al. TACFN: transformer-based adaptive cross-modal fusion network for multimodal emotion recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant