CN111680541B - Multi-modal emotion analysis method based on multi-dimensional attention fusion network - Google Patents

Multi-modal emotion analysis method based on multi-dimensional attention fusion network

Info

Publication number
CN111680541B
Authority
CN
China
Prior art keywords
fusion
autocorrelation
target
modal
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010292014.0A
Other languages
Chinese (zh)
Other versions
CN111680541A (en
Inventor
冯镔
付彦喆
王耀平
江子文
杭浩然
李瑞达
刘文予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202010292014.0A priority Critical patent/CN111680541B/en
Publication of CN111680541A publication Critical patent/CN111680541A/en
Application granted granted Critical
Publication of CN111680541B publication Critical patent/CN111680541B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention discloses a multi-modal emotion analysis method based on a multi-dimensional attention fusion network, comprising the following steps: for sample data containing the voice, video and text modalities, extract voice pre-processing features, video pre-processing features and text pre-processing features; then construct a multi-dimensional attention fusion network for each modality, extract first-level and second-level autocorrelation features with the autocorrelation feature extraction module in the network, and, combining the autocorrelation information of the three modalities, obtain the cross-modal fusion features of the three modalities with the cross-modal fusion module in the network; combine the second-level autocorrelation features with the cross-modal fusion features to obtain the multi-dimensional features of each modality; finally, splice the modal multi-dimensional features, determine the emotion score and perform emotion analysis. The method can effectively perform feature fusion on non-aligned multi-modal data and makes full use of the associated multi-modal information for emotion analysis.

Description

Multi-modal emotion analysis method based on multi-dimensional attention fusion network
Technical Field
The invention belongs to the field of multi-modal emotion calculation, and particularly relates to a multi-modal emotion analysis method based on a multi-dimensional attention fusion network.
Background
Emotion analysis has numerous applications in daily life. With the development of big data and multimedia technology, multi-modal emotion analysis techniques analyze the voice, video and text modalities of data to better mine the meaning behind the data. In a return-visit survey, for example, a user's satisfaction with a service or product can be determined through comprehensive analysis of the user's voice, face and speech content.
At present, the main difficulty of multi-modal emotion analysis lies in how to fuse multi-modal information effectively, since voice, video and text features are acquired in completely different ways. When the same content is described, the sequence lengths of the voice and video modalities differ greatly from that of the text in the time dimension, so the features of the three modalities cannot be put into one-to-one correspondence in time, which makes fusion between the modalities very difficult.
Two methods are currently common. One is based on modal integration: an intermediate result from the data layer, feature layer or decision layer of the emotion analysis system is selected and spliced, and emotion prediction is then performed. This method merely aggregates the results of the three modalities, does not consider the correlation information between modalities, and the resulting information redundancy easily causes the model to overfit. The other method is based on modal annotation alignment: during data annotation, the three modalities are forcibly aligned in the time dimension according to characters or phonemes so that their temporal correspondence is guaranteed, and modal fusion is then performed with recurrent neural networks, convolutional neural networks, attention mechanisms or the Seq2Seq framework; however, the annotation cost is high, which is unfavorable for real production and living environments.
Disclosure of Invention
The invention aims to provide a multi-modal emotion analysis method based on a multi-dimensional attention fusion network, which avoids both the overfitting caused by integration methods and the excessive annotation cost caused by modal-annotation alignment, and makes full use of the multi-dimensional information within and between modalities to obtain more accurate and reliable emotion analysis results.
The invention solves the above technical problem through the following steps:
Step one, a multi-modal emotion analysis database of size N is established; each sample in the database contains data of the three target modalities of voice, video and text; pre-processing features of the three target modalities are extracted in advance, and each sample is given an emotion label.
And step two, constructing a respective multi-dimensional attention fusion network for each of the three target modalities in step one.
The multi-dimensional attention fusion network of each of the three target modalities comprises an autocorrelation feature extraction module and a cross-modal fusion module, each built from a Transformer network.
And step three, the preprocessed features in the step one are respectively processed by the autocorrelation feature extraction module in the step two, and autocorrelation information of three modes, namely voice autocorrelation information, text autocorrelation information and video autocorrelation information, is extracted.
The voice autocorrelation feature extraction module, the text autocorrelation feature extraction module and the video autocorrelation feature extraction module respectively comprise a primary autocorrelation feature extractor and a secondary autocorrelation feature extractor.
The voice autocorrelation feature extraction module is configured to extract autocorrelation information of the input voice pre-processing features.
The text autocorrelation feature extraction module is configured to extract autocorrelation information of input text pre-processing features.
The video autocorrelation feature extraction module is configured to extract autocorrelation information of input video pre-processing features.
The autocorrelation information of the three modes comprises a first-level autocorrelation characteristic and a second-level autocorrelation characteristic.
And step four, selecting the first-level autocorrelation feature of any one target modality in step three as the target feature to be fused and the second-level autocorrelation features of the other two target modalities as auxiliary fusion features, and sending them, according to a certain grouping scheme, to the cross-modal fusion module of that target modality, to obtain the voice-based, text-based and video-based cross-modal fusion features respectively.
The cross-modal fusion module comprises two bimodal fusion devices and a weighted integration network.
And step five, adding the cross-modal fusion characteristics of each target mode and the secondary autocorrelation characteristics of the step three to obtain multi-dimensional fusion characteristics.
And step six, splicing the voice, text and video multi-dimensional fusion characteristics obtained in the step five to obtain full-scale multi-dimensional characteristics, and sending the full-scale multi-dimensional characteristics to a scoring module to obtain emotion scores.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the pre-processing features of the three target modalities in step one are extracted as follows: the voice pre-processing features are MFCC features extracted from the voice with the Kaldi speech recognition toolkit, the video pre-processing features are facial expression unit features extracted with Facet, and the text pre-processing features are word vector features extracted with word2vec.
And respectively carrying out feature dimension alignment on the voice preprocessing feature, the video preprocessing feature and the text preprocessing feature through linear transformation.
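As an illustration of this dimension-alignment step, a minimal PyTorch sketch is given below. The raw dimensions (39-dimensional MFCC, 35-dimensional Facet, 300-dimensional word2vec features) follow the embodiment described later; the module name ModalityAligner and the common dimension d_model = 64 are illustrative assumptions only, not values specified by the invention.

import torch.nn as nn

class ModalityAligner(nn.Module):
    # Maps each modality's pre-processing features to a common dimension
    # with one linear transformation per modality (illustrative sketch).
    def __init__(self, d_audio=39, d_video=35, d_text=300, d_model=64):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.video_proj = nn.Linear(d_video, d_model)
        self.text_proj = nn.Linear(d_text, d_model)

    def forward(self, audio, video, text):
        # Inputs: (batch, seq_len, raw_dim); the sequence lengths may differ
        # across modalities and are not aligned in time.
        return self.audio_proj(audio), self.video_proj(video), self.text_proj(text)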
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the emotion label in step one is a bounded continuous interval; the interior of the interval is divided continuously and equidistantly into M sub-intervals, and each sub-interval represents a range of emotion intensity.
The interval can be taken as the integer interval [-K, K], where scores greater than 0 are judged positive, equal to 0 neutral, and less than 0 negative; the sub-intervals can be further divided according to the required emotion granularity.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the multi-dimensional attention fusion networks in step two have the same structure for the three target modalities of voice, video and text.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the autocorrelation feature extraction module in step two is a Transformer network.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the first-level autocorrelation feature extractor and the second-level autocorrelation feature extractor in step three are cascaded.
The first-level autocorrelation feature extractor is configured to extract first-level autocorrelation features from the input target-modality pre-processing features based on the Transformer's multi-head self-attention mechanism.
The calculation formula of the multi-head self-attention mechanism is as follows:
Q_i = X W_i^Q;  K_i = X W_i^K;  V_i = X W_i^V
head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
MultiHead_X_1 = Concat(head_1, ..., head_n)
wherein X is the target-modality pre-processing feature in step two, W_i^Q is the query mapping weight of the i-th head, W_i^K is the key mapping weight of the i-th head, W_i^V is the value mapping weight of the i-th head, softmax is the weight normalization function, K_i^T is the transpose of K_i, d_k is the scaling factor, n is the number of heads, and MultiHead_X_1 is the obtained first-level autocorrelation feature.
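For illustration, the first-level autocorrelation feature extractor can be sketched as a standard multi-head self-attention layer in PyTorch, following the formula above; the class name, head count n = 4 and width d_model = 64 are assumed values, not taken from the patent.

import math
import torch
import torch.nn as nn

class FirstLevelAutocorrelation(nn.Module):
    # Multi-head self-attention over a single modality (illustrative sketch).
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)  # W^Q of all heads
        self.w_k = nn.Linear(d_model, d_model)  # W^K of all heads
        self.w_v = nn.Linear(d_model, d_model)  # W^V of all heads

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, t, _ = x.shape
        # Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V (all heads at once)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.n_heads, self.d_k).transpose(1, 2)
        # head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        heads = attn @ v
        # MultiHead_X_1 = Concat(head_1, ..., head_n)
        return heads.transpose(1, 2).reshape(b, t, -1)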
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the second-level autocorrelation feature extractor in step three is configured to extract second-level autocorrelation features from the input first-level autocorrelation features of the target modality based on a feedforward neural network.
The formula of the feedforward neural network is:
MultiHead_X_2 = max(0, MultiHead_X_1 · W_1 + b_1) · W_2 + b_2
wherein MultiHead_X_2 is the second-level autocorrelation feature of the target modality, W_1 is the weight applied to the first-level autocorrelation feature, W_2 is the hidden-layer weight of the network, b_1 is the bias of the first-level autocorrelation feature, and b_2 is the bias of the network hidden layer.
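A correspondingly small sketch of the second-level extractor, i.e. the feedforward network above, follows; the hidden width of 256 is an assumed value.

import torch.nn as nn

class SecondLevelAutocorrelation(nn.Module):
    # Position-wise feedforward network applied to the first-level feature.
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_hidden)  # . W_1 + b_1
        self.relu = nn.ReLU()                        # max(0, .)
        self.linear2 = nn.Linear(d_hidden, d_model)  # . W_2 + b_2

    def forward(self, multihead_x1):
        # multihead_x1: first-level autocorrelation feature (batch, seq, d_model)
        return self.linear2(self.relu(self.linear1(multihead_x1)))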
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the grouping scheme in step four is (X_0,main, X_1,aide), (X_0,main, X_2,aide) for subsequent grouped fusion, wherein X_0,main is the target feature to be fused in step four and X_1,aide, X_2,aide are the auxiliary fusion features of the other two modalities.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the bimodal fusion devices in step four are used for the grouped fusion and are configured to take the groups as input and produce two sets of cross-modal fusion features.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the bimodal fusion device in step four performs cross-modal fusion based on the Transformer's multi-head attention mechanism; the calculation method is as follows:
Q_m,main = X_m,main W^Q';  K = X_j,aide W^K';  V = X_j,aide W^V'
head_i = softmax(Q_m,main K^T / sqrt(d_k')) V
CrossFusion_X_aide→main = Concat(head_1, ..., head_n)
wherein X_m,main denotes the self-attention feature of the current target modality m, X_j,aide denotes an auxiliary fusion feature, CrossFusion_X_aide→main denotes the fusion result, W^Q' is the query mapping weight of target modality m, W^K' is the key mapping weight of target modality m, W^V' is the value mapping weight of target modality m, and d_k' is the scaling factor.
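For illustration, one bimodal fusion device can be sketched as multi-head cross-attention in which the query comes from the target modality's feature and the key and value come from an auxiliary modality's feature; the class name and parameter values are illustrative assumptions.

import math
import torch
import torch.nn as nn

class BimodalFuser(nn.Module):
    # Multi-head cross-attention: query from the target modality's first-level
    # feature, key/value from an auxiliary modality's second-level feature
    # (illustrative sketch).
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q'
        self.w_k = nn.Linear(d_model, d_model)   # W^K'
        self.w_v = nn.Linear(d_model, d_model)   # W^V'

    def forward(self, x_main, x_aide):
        b, t_main, _ = x_main.shape
        t_aide = x_aide.shape[1]
        q = self.w_q(x_main).view(b, t_main, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x_aide).view(b, t_aide, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x_aide).view(b, t_aide, self.n_heads, self.d_k).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        out = attn @ v                           # CrossFusion_X_aide->main
        return out.transpose(1, 2).reshape(b, t_main, -1)

Because the attention weights are computed between the two sequences, the sequence lengths of the target and auxiliary modalities need not be equal, which is why no prior time alignment of the modalities is required.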
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the bimodal fusion device performs cross-modal fusion based on the multi-head attention mechanism; the specific fusion process is as follows:
(1) Query mapping is performed on the target feature X_0,main to be fused in step four. The mapping is:
Q_0,main = X_0,main W^Q'
wherein W^Q' is the learned query mapping weight.
(2) Key and value mapping is performed on the auxiliary fusion features X_1,aide and X_2,aide described in step four. The mapping is:
For X_1,aide:  K_1 = X_1,aide W_1^K';  V_1 = X_1,aide W_1^V'
For X_2,aide:  K_2 = X_2,aide W_2^K';  V_2 = X_2,aide W_2^V'
wherein W_1^K' is the key mapping weight of auxiliary modality 1, W_1^V' is the value mapping weight of auxiliary modality 1, W_2^K' is the key mapping weight of auxiliary modality 2, W_2^V' is the value mapping weight of auxiliary modality 2, K_1 and V_1 are the key and value features mapped from the auxiliary fusion feature X_1,aide, and K_2 and V_2 are the key and value features mapped from the auxiliary fusion feature X_2,aide.
(3) Cross-modal fusion based on the multi-head attention mechanism is then performed on the mapping results.
For the group (X_0,main, X_1,aide), the fusion is:
head_i = softmax(Q_0,main K_1^T / sqrt(d_k')) V_1
CrossFusion_X_1→0 = Concat(head_1, ..., head_n)
For the group (X_0,main, X_2,aide), the fusion is:
head_i = softmax(Q_0,main K_2^T / sqrt(d_k')) V_2
CrossFusion_X_2→0 = Concat(head_1, ..., head_n)
wherein X_0,main denotes the self-attention feature of the current target modality, X_1,aide and X_2,aide denote the auxiliary fusion features of the remaining target modalities, and CrossFusion_X_aide→main denotes the fusion result of the target modality main based on the auxiliary modality aide.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the weighted integration network in step four is based on an adaptive weighted fusion algorithm and is configured to take the two sets of cross-modal fusion features as input and output the cross-modal fusion feature of the target modality.
The formula of the self-adaptive weighting fusion algorithm is as follows:
[formula images: the integration weights λ are computed adaptively from the hidden-layer weights W_j and biases b_j of the fusion sub-modules]
CrossFusion_X_m = λ · CrossFusion_X_1→0 + (1 − λ) · CrossFusion_X_2→0
wherein W_j and b_j are the hidden-layer network weight and bias of the j-th fusion sub-module, λ_n is the integration weight of each fusion sub-module, and CrossFusion_X_m is the cross-modal fusion feature obtained in step four.
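The exact gating formula is given only as an image in the source, so the sketch below makes an explicit assumption: each fusion sub-module's output is mean-pooled, scored by its own hidden layer (W_j, b_j), and the two scores are normalized with a softmax to give λ and 1 − λ. Module and variable names are illustrative.

import torch
import torch.nn as nn

class WeightedIntegration(nn.Module):
    # Adaptive weighting of the two bimodal fusion outputs (assumed form:
    # per-sub-module linear score followed by softmax normalization).
    def __init__(self, d_model=64):
        super().__init__()
        self.score1 = nn.Linear(d_model, 1)   # W_1, b_1 of fusion sub-module 1
        self.score2 = nn.Linear(d_model, 1)   # W_2, b_2 of fusion sub-module 2

    def forward(self, cross_1_to_0, cross_2_to_0):
        s1 = self.score1(cross_1_to_0.mean(dim=1))                 # (batch, 1)
        s2 = self.score2(cross_2_to_0.mean(dim=1))                 # (batch, 1)
        lam = torch.softmax(torch.cat([s1, s2], dim=-1), dim=-1)   # (batch, 2)
        lam1 = lam[:, 0].view(-1, 1, 1)
        lam2 = lam[:, 1].view(-1, 1, 1)                            # = 1 - lam1
        # CrossFusion_X_m = lambda * CrossFusion_X_{1->0} + (1 - lambda) * CrossFusion_X_{2->0}
        return lam1 * cross_1_to_0 + lam2 * cross_2_to_0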
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the scoring module in step six is configured to take the full-scale multi-dimensional feature as input and, based on a regression network, output the final emotion score; the calculation process is:
Score = W_out Relu(W_s Concat([CrossFusion_X_1 … CrossFusion_X_m]))
wherein Relu is the activation function, W_s is the weight applied to the full-scale multi-dimensional feature, W_out is the hidden-layer parameter of the last fully connected layer, and Concat is the matrix splicing (concatenation) operation.
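A minimal sketch of the scoring module follows; temporal mean-pooling of each multi-dimensional fusion feature before concatenation and the hidden width of 128 are assumptions, since the patent only specifies the concatenation, the ReLU-activated layer W_s and the output layer W_out.

import torch
import torch.nn as nn

class ScoringModule(nn.Module):
    # Score = W_out ReLU(W_s Concat([...])) over the three multi-dimensional
    # fusion features (illustrative sketch).
    def __init__(self, d_model=64, d_hidden=128):
        super().__init__()
        self.w_s = nn.Linear(3 * d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, 1)

    def forward(self, feat_voice, feat_video, feat_text):
        pooled = [f.mean(dim=1) for f in (feat_voice, feat_video, feat_text)]
        full_scale = torch.cat(pooled, dim=-1)   # full-scale multi-dimensional feature
        return self.w_out(torch.relu(self.w_s(full_scale))).squeeze(-1)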
Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects:
(1) In the invention, the Transformer-based multi-head attention mechanism is used to process the pre-processing features of the three modalities, so that, unlike traditional recurrent and convolutional neural networks, the multi-modal data do not need to be aligned in the time dimension in advance. This helps reduce the data annotation cost and suits real production environments better;
(2) In the invention, the Transformer-based autocorrelation feature extraction module and cross-modal fusion module consider both the intra-modal information that helps characterize emotion and the fusion information between modalities, avoiding the model overfitting caused by directly concatenating modal features;
(3) In the invention, when modal fusion is performed with the adaptive weighted fusion algorithm, different adaptive weights are assigned by learning the dependencies between modalities, so the inherent differences between modalities are considered better than in traditional methods.
Drawings
FIG. 1 is a flow chart of a multi-modal emotion analysis method based on a multi-dimensional attention fusion network according to the present invention;
FIG. 2 is a schematic structural diagram of a multidimensional attention fusion network based on an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a cross-mode fusion module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a multi-modal emotion analysis method based on a multi-dimensional attention fusion network; the specific flow is shown in fig. 1. In addition, fig. 2 is a structural schematic diagram of the multi-dimensional attention fusion network in an embodiment of the invention, and fig. 3 is a structural schematic diagram of the cross-modal fusion module in an embodiment of the invention. The implementation steps are as follows:
1. Processing the multi-modal emotion database and aligning the feature dimensions.
The experiments of the invention are based on the MOSEI multi-modal emotion database, which contains 23454 data samples. Each sample contains pre-processing features of the three modalities of voice, video and text: the video pre-processing features are 35-dimensional facial expression unit features extracted with Facet, the voice pre-processing features are 39-dimensional MFCC features extracted with the Kaldi speech recognition toolkit, and the text pre-processing features are 300-dimensional word vector features extracted with word2vec. Each data sample carries an emotion label score in the range [-3, 3], where (0, 3] is positive emotion and [-3, 0) is negative emotion. The defined emotion classes are determined by the chosen interval width; for example, with an interval of 1 the range is divided into [-3, -2, -1, 0, 1, 2, 3], i.e., 7 emotion classes.
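As a small worked example of the labelling scheme above, the following sketch maps a continuous score in [-3, 3] to one of the 7 classes and to a binary polarity; the function name is illustrative.

def score_to_labels(score: float):
    # Clip to the annotation range [-3, 3], round to the nearest integer
    # class, and derive the polarity ((0, 3] positive, [-3, 0) negative).
    clipped = max(-3.0, min(3.0, score))
    seven_class = int(round(clipped))          # one of -3, ..., 3
    if clipped > 0:
        polarity = "positive"
    elif clipped < 0:
        polarity = "negative"
    else:
        polarity = "neutral"
    return seven_class, polarity

print(score_to_labels(1.4))   # -> (1, 'positive')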
Because the three features are not distributed uniformly and have different feature dimensions, the three features are mapped into the same dimension through linear transformation in order to facilitate subsequent cross-modal fusion.
2. Extracting the autocorrelation information of the three modalities.
In this step, Transformer feature extractors are used to extract the autocorrelation information of the three modalities, i.e., the intra-modal information, helpful for emotion recognition, that the Transformer network extracts. The Transformer itself contains two important parts: a multi-head self-attention mechanism and a feedforward network. As shown in FIG. 2, the invention uses the first-level autocorrelation feature extractor, based on the multi-head self-attention mechanism, to extract the important intra-modal information from the pre-processing features and regards the result as the first-level autocorrelation feature; the feedforward network serves as the second-level autocorrelation feature extractor, which performs a nonlinear fit on the first-level autocorrelation feature, and the result is regarded as the second-level autocorrelation feature. Three sets of autocorrelation information, for video, voice and text, are thus obtained.
3. Extracting the multi-dimensional fusion features.
(3-1) Extracting the autocorrelation features.
Based on the autocorrelation information of the three modalities extracted in step 2 and the cross-modal fusion module shown in fig. 3, the first-level autocorrelation feature of one target modality is used as the target feature to be fused, and the second-level autocorrelation features of the other two modalities are used as the auxiliary fusion features. For example, the first-level autocorrelation feature of voice and the second-level autocorrelation features of video and text are sent to the cross-modal fusion module; the first-level autocorrelation feature of video and the second-level autocorrelation features of voice and text are sent to the cross-modal fusion module; the first-level autocorrelation feature of text and the second-level autocorrelation features of voice and video are sent to the cross-modal fusion module. The cross-modal fusion module comprises two bimodal fusion devices and a weighted integration network.
(3-2) Extracting the cross-modal fusion features.
The feature combinations from (3-1) are sent to the cross-modal fusion module shown in fig. 3, and the cross-modal fusion features are calculated based on the query-key-value concept of the attention mechanism in the Transformer. For example, the first-level autocorrelation feature of voice, X_0,main, is linearly mapped to obtain the query vector Q_0,main, and the second-level autocorrelation features of video (X_1,aide) and text (X_2,aide) are linearly mapped to obtain their respective key and value vectors; multi-modal fusion is then performed according to fig. 3 to obtain the video->voice and text->voice cross-modal fusion features respectively. The specific calculation process is as follows:
For X_0,main:  Q_0,main = X_0,main W^Q'
For X_1,aide:  K_1 = X_1,aide W_1^K';  V_1 = X_1,aide W_1^V'
For X_2,aide:  K_2 = X_2,aide W_2^K';  V_2 = X_2,aide W_2^V'
For the group (X_0,main, X_1,aide), the fusion is:
head_i = softmax(Q_0,main K_1^T / sqrt(d_k')) V_1
CrossFusion_X_1→0 = Concat(head_1, ..., head_n)
For the group (X_0,main, X_2,aide), the fusion is:
head_i = softmax(Q_0,main K_2^T / sqrt(d_k')) V_2
CrossFusion_X_2→0 = Concat(head_1, ..., head_n)
wherein X_0,main denotes the self-attention feature of the current target modality, X_1,aide and X_2,aide denote the auxiliary fusion features of the remaining target modalities, and CrossFusion_X_aide→main denotes the fusion result of the target modality main based on the auxiliary modality aide.
The two groups of features are sent to the weighted integration network shown in fig. 2 to obtain the (video, text)->voice cross-modal fusion feature. The specific calculation process is as follows:
[formula image: the integration weight λ is computed adaptively from the two cross-modal fusion features]
CrossFusion_X_m = λ · CrossFusion_X_1→0 + (1 − λ) · CrossFusion_X_2→0
The above process runs in parallel in the three multi-dimensional attention fusion networks shown in fig. 1, finally yielding the (video, text)->voice, (video, voice)->text and (voice, text)->video cross-modal fusion features.
(3-3) Extracting the multi-dimensional features of video, voice and text.
To take both kinds of features into account, in this embodiment the multi-dimensional fusion feature is obtained by fusing the autocorrelation feature with the cross-modal fusion feature. The specific fusion process is as follows:
and adding the secondary auto-correlation characteristics of the voice and the (voice and text) - > video cross-modal fusion characteristics to obtain the video multi-dimensional fusion characteristics.
And adding the two-stage autocorrelation characteristics of the voice and the (video, text) - > voice cross-modal fusion characteristics to obtain the voice multi-dimensional fusion characteristics.
And adding the secondary autocorrelation characteristics of the text and the (video and voice) -text cross-modal fusion characteristics to obtain the text multi-dimensional fusion characteristics.
4. An emotion score is calculated.
As shown in fig. 1, the obtained voice multidimensional fusion features, video multidimensional fusion features and text multidimensional fusion features are spliced, and then regression calculation is performed to obtain specific emotion scores, wherein the calculation process is as follows:
Score = W_out Relu(W_s Concat([CrossFusion_X_1, CrossFusion_X_2, CrossFusion_X_3]))
wherein CrossFusion_X_1, CrossFusion_X_2 and CrossFusion_X_3 respectively denote the video, voice and text multi-dimensional fusion features, and W_out is the hidden-layer parameter of the regression network.
By computing the emotion score of a sample and combining it with the emotion labels of the database samples, the emotion interval into which the score falls is determined, giving the final emotion class.
The effectiveness of the invention is demonstrated by the following experiments; the results show that the invention improves the recognition accuracy of emotion analysis.
The method is compared with 4 existing representative emotion analysis methods on the MOSEI data set. Table 1 shows the 2-class and 7-class accuracy (ACC) and the F1 index of the method and the 4 comparison methods on this data set; larger values indicate higher emotion analysis quality, and the improvement of the method (denoted Our Method in Table 1) is very clear.
Table 1: Performance of the ACC and F1 indices of different methods on the MOSEI data set
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A multi-modal emotion analysis method based on a multi-dimensional attention fusion network is characterized by comprising the following steps:
step one, constructing a multi-modal emotion analysis database, wherein each sample in the database comprises three target modal data of voice, video and text, preprocessing characteristics of the three target modals are extracted in advance, and emotion labeling is carried out on each sample; the emotion is marked as a limited continuous interval, the inside of the interval is continuously and equidistantly divided into M sub-intervals, and each sub-interval represents the degree range of the emotion;
step two, constructing respective multidimensional attention fusion networks for the three target modalities in the step one, wherein the respective multidimensional attention fusion networks for the three target modalities all comprise an autocorrelation feature extraction module and a cross-modal fusion module, the multidimensional attention fusion network for the voice target modality comprises a voice autocorrelation feature extraction module and a voice cross-modal fusion module, the multidimensional attention fusion network for the video target modality comprises a video autocorrelation feature extraction module and a video cross-modal fusion module, and the multidimensional attention fusion network for the text target modality comprises a text autocorrelation feature extraction module and a text cross-modal fusion module; the autocorrelation characteristic extraction module is a Transformer network;
step three, the pre-processing features of the three target modalities in step one are respectively passed through the autocorrelation feature extraction module corresponding to each target modality in step two, and the autocorrelation information of the three modalities, namely the voice autocorrelation information, the text autocorrelation information and the video autocorrelation information, is extracted; the voice autocorrelation feature extraction module, the text autocorrelation feature extraction module and the video autocorrelation feature extraction module each comprise a first-level autocorrelation feature extractor and a second-level autocorrelation feature extractor, and the autocorrelation information of the three target modalities comprises first-level autocorrelation features and second-level autocorrelation features; the first-level autocorrelation feature extractor and the second-level autocorrelation feature extractor are cascaded, wherein the first-level autocorrelation feature extractor is configured to take the pre-processing features of a target modality as input and extract first-level autocorrelation features based on the multi-head self-attention mechanism of the Transformer, and the second-level autocorrelation feature extractor is configured to take the first-level autocorrelation features of the target modality as input and extract second-level autocorrelation features based on the feedforward neural network of the Transformer;
step four, selecting the first-level autocorrelation characteristics of any one target mode in the step three as target characteristics to be fused, and the second-level autocorrelation characteristics of the other two target modes as auxiliary fusion characteristics, and sending the target characteristics to a cross-mode fusion module where the target modes are located according to a preset grouping mode to respectively obtain voice cross-mode fusion characteristics, text cross-mode fusion characteristics and video cross-mode fusion characteristics, wherein the cross-mode fusion module comprises two dual-mode fusion devices and a weighted integration network;
fifthly, adding the cross-modal fusion characteristics of each target mode in the fourth step and the secondary autocorrelation characteristics in the third step to obtain multi-dimensional fusion characteristics;
and step six, splicing the voice, text and video multi-dimensional fusion characteristics obtained in the step five to obtain full-scale multi-dimensional characteristics, and sending the full-scale multi-dimensional characteristics to a scoring module to obtain emotion scores.
2. The multi-modal emotion analysis method based on the multi-dimensional attention fusion network as claimed in claim 1, wherein the Transformer's multi-head self-attention mechanism is calculated by the following formula:
Q_i = X W_i^Q;  K_i = X W_i^K;  V_i = X W_i^V
head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
MultiHead_X_1 = Concat(head_1, ..., head_n)
wherein X is the target-modality pre-processing feature in step two, W_i^Q is the query mapping weight of the i-th head of the target-modality pre-processing feature, W_i^K is the key mapping weight of the i-th head, W_i^V is the value mapping weight of the i-th head, softmax is the weight normalization function, K_i^T is the transpose of K_i, d_k is the scaling factor, n is the number of heads, and MultiHead_X_1 is the obtained first-level autocorrelation feature.
3. The multi-modal emotion analysis method based on multi-dimensional attention fusion network as claimed in claim 1, wherein the feedforward neural network formula is:
MultiHead_X_2 = max(0, MultiHead_X_1 · W_1 + b_1) · W_2 + b_2
wherein MultiHead_X_2 is the second-level autocorrelation feature of the target modality, W_1 is the weight applied to the first-level autocorrelation feature, W_2 is the hidden-layer weight of the network, b_1 is the bias of the first-level autocorrelation feature, and b_2 is the bias of the network hidden layer.
4. The multi-modal emotion analysis method based on the multi-dimensional attention fusion network as claimed in claim 1, wherein the grouping scheme in step four is (X_0,main, X_1,aide), (X_0,main, X_2,aide) for subsequent grouped fusion, wherein X_0,main is the target feature to be fused in step four and X_1,aide, X_2,aide are the auxiliary fusion features of the other two modalities; (X_0,main, X_1,aide) is input into one bimodal fusion device and (X_0,main, X_2,aide) is input into the other bimodal fusion device.
5. The multi-modal emotion analysis method based on the multi-dimensional attention fusion network as claimed in claim 1, wherein the bimodal fusion device in step four performs cross-modal fusion based on the Transformer's multi-head attention mechanism, calculated as follows:
Q_m,main = X_m,main W^Q';  K = X_j,aide W^K';  V = X_j,aide W^V'
head_i = softmax(Q_m,main K^T / sqrt(d_k')) V
CrossFusion_X_aide→main = Concat(head_1, ..., head_n)
wherein X_m,main denotes the self-attention feature of the current target modality m, X_j,aide denotes an auxiliary fusion feature, CrossFusion_X_aide→main denotes the fusion result, W^Q' is the query mapping weight of target modality m, W^K' is the key mapping weight of target modality m, W^V' is the value mapping weight of target modality m, and d_k' is the scaling factor.
CN202010292014.0A 2020-04-14 2020-04-14 Multi-modal emotion analysis method based on multi-dimensional attention fusion network Active CN111680541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010292014.0A CN111680541B (en) 2020-04-14 2020-04-14 Multi-modal emotion analysis method based on multi-dimensional attention fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010292014.0A CN111680541B (en) 2020-04-14 2020-04-14 Multi-modal emotion analysis method based on multi-dimensional attention fusion network

Publications (2)

Publication Number Publication Date
CN111680541A CN111680541A (en) 2020-09-18
CN111680541B true CN111680541B (en) 2022-06-21

Family

ID=72433356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010292014.0A Active CN111680541B (en) 2020-04-14 2020-04-14 Multi-modal emotion analysis method based on multi-dimensional attention fusion network

Country Status (1)

Country Link
CN (1) CN111680541B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053690B (en) * 2020-09-22 2023-12-29 湖南大学 Cross-mode multi-feature fusion audio/video voice recognition method and system
CN112233698B (en) * 2020-10-09 2023-07-25 中国平安人寿保险股份有限公司 Character emotion recognition method, device, terminal equipment and storage medium
CN112489635B (en) * 2020-12-03 2022-11-11 杭州电子科技大学 Multi-mode emotion recognition method based on attention enhancement mechanism
CN112765323B (en) * 2021-01-24 2021-08-17 中国电子科技集团公司第十五研究所 Voice emotion recognition method based on multi-mode feature extraction and fusion
CN112819052B (en) * 2021-01-25 2021-12-24 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal fine-grained mixing method, system, device and storage medium
CN112560811B (en) * 2021-02-19 2021-07-02 中国科学院自动化研究所 End-to-end automatic detection research method for audio-video depression
CN112989977B (en) * 2021-03-03 2022-09-06 复旦大学 Audio-visual event positioning method and device based on cross-modal attention mechanism
CN113723166A (en) * 2021-03-26 2021-11-30 腾讯科技(北京)有限公司 Content identification method and device, computer equipment and storage medium
CN113807440B (en) * 2021-09-17 2022-08-26 北京百度网讯科技有限公司 Method, apparatus, and medium for processing multimodal data using neural networks
CN113806609B (en) * 2021-09-26 2022-07-12 郑州轻工业大学 Multi-modal emotion analysis method based on MIT and FSM
CN113723112B (en) * 2021-11-02 2022-02-22 天津海翼科技有限公司 Multi-modal emotion analysis prediction method, device, equipment and storage medium
CN114387997B (en) * 2022-01-21 2024-03-29 合肥工业大学 Voice emotion recognition method based on deep learning
CN116580257A (en) * 2022-01-24 2023-08-11 腾讯科技(深圳)有限公司 Feature fusion model training and sample retrieval method and device and computer equipment
CN114821385A (en) * 2022-03-08 2022-07-29 阿里巴巴(中国)有限公司 Multimedia information processing method, device, equipment and storage medium
CN115205179A (en) * 2022-07-15 2022-10-18 小米汽车科技有限公司 Image fusion method and device, vehicle and storage medium
CN116070169A (en) * 2023-01-28 2023-05-05 天翼云科技有限公司 Model training method and device, electronic equipment and storage medium
CN116189272B (en) * 2023-05-05 2023-07-07 南京邮电大学 Facial expression recognition method and system based on feature fusion and attention mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614895A (en) * 2018-10-29 2019-04-12 山东大学 A method of the multi-modal emotion recognition based on attention Fusion Features
CN110033029A (en) * 2019-03-22 2019-07-19 五邑大学 A kind of emotion identification method and device based on multi-modal emotion model
CN110188343A (en) * 2019-04-22 2019-08-30 浙江工业大学 Multi-modal emotion identification method based on fusion attention network
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 The multi-modal sensibility classification method merged based on text, voice and video
CN110399841A (en) * 2019-07-26 2019-11-01 北京达佳互联信息技术有限公司 A kind of video classification methods, device and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170001490A (en) * 2015-06-26 2017-01-04 삼성전자주식회사 The electronic apparatus and method for controlling function in the electronic apparatus using the bio-metric sensor
WO2019103484A1 (en) * 2017-11-24 2019-05-31 주식회사 제네시스랩 Multi-modal emotion recognition device, method and storage medium using artificial intelligence

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614895A (en) * 2018-10-29 2019-04-12 山东大学 A method of the multi-modal emotion recognition based on attention Fusion Features
CN110033029A (en) * 2019-03-22 2019-07-19 五邑大学 A kind of emotion identification method and device based on multi-modal emotion model
CN110188343A (en) * 2019-04-22 2019-08-30 浙江工业大学 Multi-modal emotion identification method based on fusion attention network
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 The multi-modal sensibility classification method merged based on text, voice and video
CN110399841A (en) * 2019-07-26 2019-11-01 北京达佳互联信息技术有限公司 A kind of video classification methods, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multimodal continuous emotion recognition with data augmentation using recurrent neural networks;Jian Huang等;《ACM》;20181231;第57-64页 *

Also Published As

Publication number Publication date
CN111680541A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN111680541B (en) Multi-modal emotion analysis method based on multi-dimensional attention fusion network
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111625641B (en) Dialog intention recognition method and system based on multi-dimensional semantic interaction representation model
Hazarika et al. Self-attentive feature-level fusion for multimodal emotion detection
CN111506732B (en) Text multi-level label classification method
CN116450796B (en) Intelligent question-answering model construction method and device
CN113268609B (en) Knowledge graph-based dialogue content recommendation method, device, equipment and medium
CN112562669B (en) Method and system for automatically abstracting intelligent digital newspaper and performing voice interaction chat
CN112287093B (en) Automatic question-answering system based on semi-supervised learning and Text-to-SQL model
CN115292461B (en) Man-machine interaction learning method and system based on voice recognition
CN110569869A (en) feature level fusion method for multi-modal emotion detection
CN116303977B (en) Question-answering method and system based on feature classification
CN114417097A (en) Emotion prediction method and system based on time convolution and self-attention
Khare et al. Multi-modal embeddings using multi-task learning for emotion recognition
CN113705315A (en) Video processing method, device, equipment and storage medium
CN115563290A (en) Intelligent emotion recognition method based on context modeling
CN116189039A (en) Multi-modal emotion classification method and system for modal sequence perception with global audio feature enhancement
CN111984780A (en) Multi-intention recognition model training method, multi-intention recognition method and related device
Zhao et al. Knowledge-aware bayesian co-attention for multimodal emotion recognition
Sun et al. Multi-classification speech emotion recognition based on two-stage bottleneck features selection and MCJD algorithm
CN116702091B (en) Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
CN113569553A (en) Sentence similarity judgment method based on improved Adaboost algorithm
KR102297480B1 (en) System and method for structured-paraphrasing the unstructured query or request sentence
CN112417125A (en) Open domain dialogue reply method and system based on deep reinforcement learning
CN115329776B (en) Semantic analysis method for network security co-processing based on less-sample learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant