CN115858726A - Multi-stage multi-modal emotion analysis method based on mutual information method representation - Google Patents

Multi-stage multi-modal emotion analysis method based on mutual information method representation Download PDF

Info

Publication number
CN115858726A
CN115858726A (Application CN202211465914.6A)
Authority
CN
China
Prior art keywords
modal
representation
mutual information
mode
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211465914.6A
Other languages
Chinese (zh)
Inventor
侯金鑫
李希城
徐明成
谢杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Electronic Commerce Co Ltd
Original Assignee
Tianyi Electronic Commerce Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Electronic Commerce Co Ltd filed Critical Tianyi Electronic Commerce Co Ltd
Priority to CN202211465914.6A priority Critical patent/CN115858726A/en
Publication of CN115858726A publication Critical patent/CN115858726A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a multi-stage multi-modal emotion analysis method based on mutual information method representation, and relates to the field of artificial intelligence. Text, visual and sound modal data with corresponding relations are acquired from original multi-modal data, and the original multi-modal data are feature-coded to obtain model input features; intra-modal high-dimensional features are extracted according to the characteristics of the different sound, language and visual modalities; multi-modal feature collaborative representation with a mutual-information-maximization method is performed on the sound, language and visual modal features to obtain feature representations of maximum correlation among the modalities; during feature fusion, a new fusion network structure is adopted for information fusion among the different modalities, realizing hierarchically adjustable modeling of the interactions among the unimodal, bimodal and trimodal combinations of sound, text and visual features. The problems of key information loss, noise interference and partial feature redundancy existing in each single stage are thereby mitigated, and the multi-modal emotion analysis effect is improved.

Description

Multi-stage multi-modal emotion analysis method based on mutual information method representation
Technical Field
The invention relates to the field of artificial intelligence, in particular to a multi-stage multi-modal emotion analysis method based on mutual information method representation.
Background
With the popularization of social media, image and video data on the network have been increasing, and the research task of emotion analysis has expanded from a single language modality to multi-modal emotion prediction. Much of the data on the network contains multi-modal information such as vision, language and sound; such data reflect the real attitude and emotional state of the user and have high application value in realistic scenarios such as box-office prediction, political elections and public-opinion supervision. Therefore, effectively fusing and representing multi-modal data to improve the accuracy of emotion analysis, so as to more truly reveal the emotion of the user, has become the main research problem of multi-modal emotion analysis at present.
Multi-modal data fusion strategies mainly comprise early fusion, late fusion and hybrid fusion, divided by fusion stage, and tensor-model-based fusion, time-series-model-based fusion and attention-model-based fusion, divided by fusion method. At present, the accuracy of the multi-modal emotion analysis task is improved through the selection of the fusion mode, but room for improvement remains: for example, key information loss and feature noise interference may occur during the fusion of multi-modal features and affect the prediction result.
Multi-modal representation learning can make up for the defects of a multi-modal fusion strategy to a certain extent, capture the relations among different modalities and eliminate the noise of modal features. Multi-modal representation learning work mainly comprises joint representation and collaborative representation, and the mutual-information-maximization representation method within structured collaborative representation can enhance the dependency of different modal features and strengthen the representation of common information among the modalities. However, most related work on multi-modal representation learning simply outputs the multi-modal sequence features by splicing or weighting, which may lead to insufficient interaction among the modalities and to feature redundancy.
Disclosure of Invention
The invention aims to provide a multi-stage multi-modal emotion analysis method based on mutual information method representation that addresses the shortcomings of using a single multi-modal fusion strategy or single multi-modal representation learning alone. On the basis of feature extraction, mutual-information-maximization representation learning is combined with a newly proposed multi-modal hierarchical fusion network, so that the problems of key information loss, noise interference and partial feature redundancy existing in each single stage compensate for one another, and the multi-modal emotion analysis effect is further improved.
The embodiment of the invention is realized by the following steps:
in a first aspect, an embodiment of the present application provides a multi-stage multi-modal sentiment analysis method based on mutual information method representation, which includes the following steps. Step (1): acquiring text, visual and sound modal data with corresponding relations from original multi-modal data, and performing feature coding on the original multi-modal data to obtain model input features; step (2): extracting intra-modal high-dimensional features according to the characteristics of the different sound, language and visual modalities; step (3): performing multi-modal feature collaborative representation with a mutual-information-maximization method on the sound, language and visual modal features to obtain feature representations of maximum correlation among the modalities; step (4): during feature fusion, adopting a new fusion network structure for information fusion among the different modalities, with hierarchically adjustable modeling of the interactions among the unimodal, bimodal and trimodal combinations of sound, text and visual features; step (5): performing repeated iterative training, and applying the model with the highest evaluation index to multi-modal emotion analysis.
In some embodiments of the present invention, in the step (1), MOSI and MOSEI emotion videos are selected as the original multi-modal data.
In some embodiments of the present invention, in the step (1), in the feature-coding process of the original multi-modal data, the visual modality uses Facet to capture the actions used for expressing human emotion information in the video; the sound modality uses COVAREP to collect features from the audio; for the text modality, a pre-trained BERT model trained on a large-scale corpus is adopted, and the output of BERT is used as the feature coding in the multi-modal emotion analysis task.
In some embodiments of the invention, the above-mentioned visual modality acquisition comprises any one or more of eye closure, neck muscles, head movements, hand movements and leg movements; the sound modality acquisition includes any one or more of intensity, pitch, audio peak slope, and voiced-unvoiced segment characteristics.
In some embodiments of the present invention, in the step (2), in the intra-modal high-dimensional feature extraction process, two independent LSTM models are used to extract temporal features of different modalities.
In some embodiments of the present invention, in the step (3), in the multi-modal feature collaborative representation process, a mutual-information-maximization representation method is adopted to learn projection representations among different modalities; in the specific calculation process, the mutual information objective is optimized through feedforward neural networks, which output the nonlinear projections that make the modalities maximally correlated; the loss function that maximizes the inter-modal mutual information is expressed as follows:

L_MI = -(1/N) Σ_{i=1}^{N} [ log q(y_i^{m_1} | x_i^{m_1}) + log q(y_i^{m_2} | x_i^{m_2}) ]

wherein q(y_i | x_i) is a multivariate Gaussian distribution, N is the batch size in training, and m_1, m_2 index the two target modality pairs whose log-likelihoods are summed; the representation optimized by the mutual information method is T_m = D_m(H_m), where the two-layer feedforward neural network D_m corresponding to each modality outputs the represented modal features.
In some embodiments of the present invention, in the step (4), in the feature fusion process, a multi-modal hierarchical fusion network is used to complete the fusion interaction between modal features; the calculation process of the multi-modal hierarchical fusion network is as follows:

T_[L,V] = D_{L,V}(T_L, T_V), T_[L,A] = D_{L,A}(T_L, T_A), T_[V,A] = D_{V,A}(T_V, T_A),
T_[L,V,A] = D_{L,V,A}(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A]),
Z = D_f(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A], T_[L,V,A]),

wherein T_L, T_V, T_A respectively denote the language, visual and sound modal features; T_[L,V], T_[L,A], T_[V,A] denote the bimodal features, each learned from the corresponding pair of modal features by an independent two-layer feedforward neural network D_{L,V}, D_{L,A}, D_{V,A}; T_[L,V,A] denotes the trimodal feature, learned from the unimodal and bimodal features by the two-layer feedforward neural network D_{L,V,A}; after layer-by-layer learning, the unimodal, bimodal and trimodal features are fused by D_f to obtain the multi-modal fusion feature Z.
In some embodiments of the present invention, the multi-modal hierarchical fusion network models interactions among single-modal, dual-modal, and tri-modal hierarchically, and dynamically adjusts an internal structure according to an interaction process.
Compared with the prior art, the embodiment of the invention has at least the following advantages or beneficial effects:
firstly, the invention adopts the mutual-information-maximization method in the multi-modal feature representation stage, which can capture the dependencies of different modalities, improve the expression of correlation among them, and to a great extent eliminate the noise of each modality's features;
secondly, the invention provides, in the feature fusion stage, a multi-modal hierarchical fusion network that fuses the interactions among different modalities layer by layer, reducing the feature-information redundancy caused by previously inefficient fusion;
thirdly, the invention adopts the idea of multi-stage modeling and effectively combines multi-modal representation learning with a multi-modal fusion method, which can to a great extent solve the problems of noise interference, loss of key emotion information and feature-information redundancy existing in any single stage.
Aiming at the shortcomings of a single multi-modal fusion strategy and of single multi-modal representation learning, the invention provides a multi-stage multi-modal emotion analysis method based on mutual information method representation. On the basis of feature extraction, mutual-information-maximization representation learning is combined with the newly proposed multi-modal hierarchical fusion network, so that the problems of key information loss, noise interference and partial feature redundancy existing in each single stage compensate for one another, and the multi-modal emotion analysis effect is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a flow chart of a multi-stage multi-modal sentiment analysis method based on mutual information method representation according to an embodiment of the present invention;
FIG. 2 is a model diagram of a multi-stage multi-modal sentiment analysis method represented based on a mutual information method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments and features of the embodiments described below can be combined with one another without conflict.
Examples
Referring to fig. 1 to 2, fig. 1 to 2 are schematic diagrams illustrating a multi-stage multi-modal sentiment analysis method based on mutual information method representation according to an embodiment of the present application. The multi-stage multi-modal emotion analysis method based on mutual information method representation comprises the following steps. Step (1): acquiring text, visual and sound modal data with corresponding relations from original multi-modal data, and performing feature coding on the original multi-modal data to obtain model input features; step (2): extracting intra-modal high-dimensional features according to the characteristics of the different sound, language and visual modalities; step (3): performing multi-modal feature collaborative representation with a mutual-information-maximization method on the sound, language and visual modal features to obtain feature representations of maximum correlation among the modalities; step (4): during feature fusion, adopting a new fusion network structure for information fusion among the different modalities, with hierarchically adjustable modeling of the interactions among the unimodal, bimodal and trimodal combinations of sound, text and visual features; step (5): performing repeated iterative training, and applying the model with the highest evaluation index to multi-modal emotion analysis.
The method adopts a mutual-information-maximization approach in the multi-modal feature representation stage, which can capture the dependencies of different modalities, improve the expression of correlation among them, and to a great extent eliminate the noise of each modality's features. In the feature fusion stage it introduces a multi-modal hierarchical fusion network that fuses the interactions among different modalities layer by layer, reducing the feature-information redundancy caused by previously inefficient fusion. By adopting the idea of multi-stage modeling, multi-modal representation learning and the multi-modal fusion method are effectively combined, which can to a great extent solve the problems of noise interference, loss of key emotion information and feature-information redundancy existing in any single stage of the prior art.
In some embodiments of the present invention, in the step (1), MOSI and MOSEI emotion videos are selected as the original multi-modal data.
In some embodiments of the present invention, in the step (1), in the feature-coding process of the original multi-modal data, the visual modality uses Facet to capture the actions used for expressing human emotion information in the video; the sound modality uses COVAREP to collect features from the audio; for the text modality, a pre-trained BERT model trained on a large-scale corpus is adopted, and the output of BERT is used as the feature coding in the multi-modal emotion analysis task.
In some embodiments of the invention, the above-mentioned visual modality acquisition comprises any one or more of eye closure, neck muscles, head movements, hand movements and leg movements; the sound modality acquisition includes any one or more of intensity, pitch, audio peak slope, and voiced-unvoiced segment characteristics.
In some embodiments of the present invention, in the step (2), in the intra-modal high-dimensional feature extraction process, two independent LSTM models are used to extract temporal features of different modalities.
In some embodiments of the present invention, in the step (3), in the multi-modal feature collaborative representation process, a mutual-information-maximization representation method is adopted to learn projection representations among different modalities; in the specific calculation process, the mutual information objective is optimized through feedforward neural networks, which output the nonlinear projections that make the modalities maximally correlated; the loss function that maximizes the inter-modal mutual information is expressed as follows:

L_MI = -(1/N) Σ_{i=1}^{N} [ log q(y_i^{m_1} | x_i^{m_1}) + log q(y_i^{m_2} | x_i^{m_2}) ]

wherein q(y_i | x_i) is a multivariate Gaussian distribution, N is the batch size in training, and m_1, m_2 index the two target modality pairs whose log-likelihoods are summed; the representation optimized by the mutual information method is T_m = D_m(H_m), where the two-layer feedforward neural network D_m corresponding to each modality outputs the represented modal features.
In some embodiments of the present invention, in the step (4), in the feature fusion process, a multi-modal hierarchical fusion network is used to complete the fusion interaction between modal features; the calculation process of the multi-modal hierarchical fusion network is as follows:

T_[L,V] = D_{L,V}(T_L, T_V), T_[L,A] = D_{L,A}(T_L, T_A), T_[V,A] = D_{V,A}(T_V, T_A),
T_[L,V,A] = D_{L,V,A}(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A]),
Z = D_f(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A], T_[L,V,A]),

wherein T_L, T_V, T_A respectively denote the language, visual and sound modal features; T_[L,V], T_[L,A], T_[V,A] denote the bimodal features, each learned from the corresponding pair of modal features by an independent two-layer feedforward neural network D_{L,V}, D_{L,A}, D_{V,A}; T_[L,V,A] denotes the trimodal feature, learned from the unimodal and bimodal features by the two-layer feedforward neural network D_{L,V,A}; after layer-by-layer learning, the unimodal, bimodal and trimodal features are fused by D_f to obtain the multi-modal fusion feature Z.
In some embodiments of the present invention, the multi-modal hierarchical fusion network models interactions among single-modal, dual-modal, and tri-modal hierarchically, and dynamically adjusts an internal structure according to an interaction process.
Referring to fig. 1, the specific implementation steps in application are as follows:
step 1, selecting and acquiring data in an original multi-modal form: the CMU-MOSI original video was taken as the initial data, containing 2199 individual self-describing video segments, each unit segment lasting about 10 seconds. The training set, the verification set and the test set are divided into 1284 video segments, 229 video segments and 686 video segments. CMU-MOSEI raw video data contains movie ratings video clips that are cast by thousands of video websites, for a total time up to 65 hours. 16265, 1869, and 4643 video segments were divided over the training set, validation set, and test set, respectively. Two classification tags of CMU-MOSI and CMU-MOSEI including negative emotion and positive emotion, and seven classification tags of labeled-3 (strongly negative emotion) to +3 (strongly positive emotion).
Step 2, performing feature coding on the sound and visual multi-modal original video: in the original multi-modal data feature-coding process, the visual modality adopts Facet to capture 35 actions that may express emotional information, such as eye closure, neck muscle and head movements, in the video. The video is divided into unit segments lasting about ten seconds, the emotional information implied by each frame is captured by the Facet system, and the per-frame features are averaged to obtain the unit visual feature code, whose initial size is 35. The sound modality uses COVAREP to collect intensity, pitch, audio peak slope, voiced-unvoiced segment features and the like from the audio. The segmentation of the audio segments is aligned with the video segments, the features of the audio frames over the ten-second span are averaged to obtain the feature code containing the emotion information of the sound modality, and the initial coding size is 74.
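The following is a minimal sketch of the per-segment pooling described above, assuming per-frame Facet (35-dimensional) and COVAREP (74-dimensional) features stored as NumPy arrays; the frame counts and function name are illustrative, not taken from the patent.

```python
# Hedged sketch of the per-segment feature pooling: per-frame visual (35-dim)
# and acoustic (74-dim) features are averaged over each ~10 s unit segment to
# give one visual and one acoustic code per segment.
import numpy as np

def pool_segment_features(frame_features: np.ndarray) -> np.ndarray:
    """Average per-frame features of shape (num_frames, dim) into one (dim,) vector."""
    return frame_features.mean(axis=0)

# Illustrative example: a 10-second clip with assumed frame counts.
visual_frames = np.random.randn(300, 35)     # Facet action descriptors per video frame
acoustic_frames = np.random.randn(1000, 74)  # COVAREP features per audio frame

visual_code = pool_segment_features(visual_frames)      # shape (35,)
acoustic_code = pool_segment_features(acoustic_frames)  # shape (74,)
```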
Step 3, text feature coding and feature extraction with the pre-trained BERT model: the text modality adopts a pre-trained BERT model to convert the MOSI and MOSEI original video subtitles into 768-dimensional vectors. The BERT model is built by stacking bidirectional Transformer encoders, and the trained position embeddings retain the position information that the attention mechanism depends on. Having been trained on a large-scale corpus, the output of BERT is adopted as the feature coding in the multi-modal emotion analysis task without excessive tuning.
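A hedged sketch of this text-coding step follows; the checkpoint name and the Hugging Face transformers API are assumptions made for illustration, since the patent only specifies that a pre-trained BERT model provides the 768-dimensional features.

```python
# Hedged sketch: a pre-trained BERT encodes each subtitle into 768-dimensional
# token vectors that serve as the language features H_L. The checkpoint name
# is an assumption, not stated in the patent.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_text(subtitle: str) -> torch.Tensor:
    inputs = tokenizer(subtitle, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state  # shape (1, seq_len, 768), used as H_L
```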
Step 4, extracting visual and sound modal coding features: the visual and sound features are temporally ordered in the emotion analysis task, so two independent LSTM models are further adopted, on the basis of the feature coding, to extract the temporal features of the visual and sound modalities. The feature extraction part is:

H_V = LSTM(X_V; θ_V), H_A = LSTM(X_A; θ_A).

The update process of the sound and visual features in the corresponding LSTM at each time step is:

i_t = σ(W_i [h_{t-1}, x_t] + b_i),
f_t = σ(W_f [h_{t-1}, x_t] + b_f),
o_t = σ(W_o [h_{t-1}, x_t] + b_o),
c̃_t = tanh(W_c [h_{t-1}, x_t] + b_c),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,
h_t = o_t ⊙ tanh(c_t),

wherein i_t, f_t, o_t are the input gate, forget gate and output gate at time t, respectively; W_i, W_f, W_o, W_c are the parameter matrices of the corresponding transformations; σ denotes the Sigmoid activation function, and ⊙ denotes the Hadamard product.
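Below is a minimal PyTorch sketch of the two independent unidirectional LSTMs used for intra-modal temporal feature extraction; the hidden size, the example sequence lengths, and the choice of using the final hidden state are assumptions made for illustration.

```python
# Hedged sketch: two independent unidirectional LSTMs, one per modality, turn
# the per-segment visual (35-dim) and acoustic (74-dim) frame sequences into
# hidden representations H_V and H_A.
import torch
import torch.nn as nn

class ModalityLSTM(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim); keep the last hidden state per sequence.
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]  # shape (batch, hidden_dim)

visual_lstm = ModalityLSTM(input_dim=35)
acoustic_lstm = ModalityLSTM(input_dim=74)

x_v = torch.randn(8, 300, 35)    # batch of visual frame sequences (assumed lengths)
x_a = torch.randn(8, 1000, 74)   # batch of acoustic frame sequences
h_v = visual_lstm(x_v)           # H_V
h_a = acoustic_lstm(x_a)         # H_A
```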
Step 5, the feature representation stage based on mutual information maximization: in the multi-modal feature collaborative representation process, a mutual-information-maximization representation method is adopted to learn projection representations among the different modalities so that the important information of the modalities is maximally correlated. Nonlinear projections that maximally correlate the sound and text features, and the text and visual features, are obtained through feedforward neural networks and the mutual information method. The mutual-information-maximization optimization objective is:

L_MI = -(1/N) Σ_{i=1}^{N} [ log q(y_i^{(l,v)} | x_i^{(l,v)}) + log q(y_i^{(l,a)} | x_i^{(l,a)}) ],

where q(y_i | x_i) is expressed as a multivariate Gaussian distribution, N is the batch size in training, and (l,v) and (l,a) denote the two target modality pairs whose log-likelihoods are summed. In the specific calculation of the mutual-information-maximization method, the language, visual and sound modal features are passed through the two-layer neural networks D_L, D_V, D_A to output the representation features T_L, T_V, T_A, expressed as T_L = D_L(H_L), T_V = D_V(H_V), T_A = D_A(H_A).
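A hedged PyTorch sketch of this representation stage follows: two-layer feedforward networks D_m project each modality, and a diagonal-Gaussian predictor q(y|x) supplies the log-likelihood terms summed over the (language, visual) and (language, acoustic) pairs. The layer sizes and the diagonal-Gaussian parameterisation are assumptions not fixed by the patent.

```python
# Hedged sketch of the mutual-information-maximization representation stage.
import torch
import torch.nn as nn

class TwoLayerFFN(nn.Module):
    """Two-layer feedforward network D_m mapping H_m to T_m."""
    def __init__(self, in_dim, out_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)

class GaussianPredictor(nn.Module):
    """Parameterises q(y | x) as a diagonal Gaussian with learned mean and log-variance."""
    def __init__(self, x_dim, y_dim, hidden=128):
        super().__init__()
        self.mu = TwoLayerFFN(x_dim, y_dim, hidden)
        self.logvar = TwoLayerFFN(x_dim, y_dim, hidden)

    def log_likelihood(self, x, y):
        mu, logvar = self.mu(x), self.logvar(x)
        # Gaussian log-density up to an additive constant, summed over feature dims.
        return (-0.5 * (logvar + (y - mu) ** 2 / logvar.exp())).sum(dim=-1)

def mi_loss(q_lv, q_la, t_l, t_v, t_a):
    # Negative mean log-likelihood over the (language, visual) and (language, acoustic) pairs.
    return -(q_lv.log_likelihood(t_l, t_v) + q_la.log_likelihood(t_l, t_a)).mean()
```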
Step 6, unimodal feature fusion by the multi-modal hierarchical fusion network: after the feature representation, the three modal features of language, vision and sound are combined pairwise and, through three independent feedforward neural networks, output the three bimodal features of sound-language, sound-vision and language-vision respectively. The unimodal feature-fusion calculation of the multi-modal hierarchical fusion network is:

T_[L,V] = D_{L,V}(T_L, T_V), T_[L,A] = D_{L,A}(T_L, T_A), T_[V,A] = D_{V,A}(T_V, T_A),

wherein T_L, T_V, T_A denote the language, visual and sound modal features, and T_[L,V], T_[L,A], T_[V,A] denote the bimodal features, each learned from the corresponding pair of modal features by an independent two-layer feedforward neural network D_{L,V}, D_{L,A}, D_{V,A}.
Step 7, bimodal feature fusion by the multi-modal hierarchical fusion network: the three bimodal features, together with the unimodal features, are input into a two-layer feedforward neural network to obtain the trimodal feature that fuses the information of the three modalities:

T_[L,V,A] = D_{L,V,A}(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A]),

wherein T_[L,V,A] denotes the trimodal feature, learned from the unimodal and bimodal features by the two-layer feedforward neural network D_{L,V,A}.
Step 8, finally, the unimodal, bimodal and trimodal features of the different layers are input together into a feedforward neural network to output the final fusion feature:

Z = D_f(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A], T_[L,V,A]).

That is, the unimodal, bimodal and trimodal features obtained by the layer-by-layer learning are fused by D_f into the multi-modal fusion feature Z.
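The sketch below assembles the multi-modal hierarchical fusion network described in steps 6 to 8 as a single PyTorch module; the feature dimension, the hidden sizes and the concatenation of inputs to each feedforward network are assumptions consistent with, but not dictated by, the text.

```python
# Hedged sketch of the multi-modal hierarchical fusion network: three
# independent two-layer FFNs produce the bimodal features, a further FFN gives
# the trimodal feature, and D_f fuses all levels into the final feature Z.
import torch
import torch.nn as nn

class FFN(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, *xs):
        # Concatenate all input features along the last dimension.
        return self.net(torch.cat(xs, dim=-1))

class HierarchicalFusion(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.d_lv = FFN(2 * dim, dim)
        self.d_la = FFN(2 * dim, dim)
        self.d_va = FFN(2 * dim, dim)
        self.d_lva = FFN(6 * dim, dim)  # three unimodal + three bimodal inputs
        self.d_f = FFN(7 * dim, dim)    # all unimodal, bimodal and trimodal features

    def forward(self, t_l, t_v, t_a):
        t_lv = self.d_lv(t_l, t_v)
        t_la = self.d_la(t_l, t_a)
        t_va = self.d_va(t_v, t_a)
        t_lva = self.d_lva(t_l, t_v, t_a, t_lv, t_la, t_va)
        z = self.d_f(t_l, t_v, t_a, t_lv, t_la, t_va, t_lva)
        return z  # multi-modal fusion feature Z
```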
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In summary, the embodiment of the present application provides a multi-stage multi-modal emotion analysis method based on mutual information method representation:
1. acquiring text, visual and sound modal data with corresponding relations, and performing feature coding on the original multi-modal data to obtain model input features: 1) Adopting original videos related to MOSI and MOSEI as original data of multi-modal emotion analysis; 2) Audio and video features are extracted using COVAREP and face 2 for encoding of visual and sound modalities, respectively. Feature encoding of the input text is done using a pre-trained Bert for the text modality.
2. Extracting intra-modal high-dimensional features according to the characteristics of the different sound, language and visual modalities, adopting a different feature-extraction approach for each modality. For the language modality, the pre-trained BERT is employed, consistent with the text feature coding. The visual and acoustic features are temporally ordered in the emotion analysis task, so two independent unidirectional LSTMs are used to capture the temporal features of these modalities:

H_L = BERT(X_L; θ_BERT), H_V = LSTM(X_V; θ_V), H_A = LSTM(X_A; θ_A).
3. Performing multi-modal feature collaborative representation on the sound, language and visual modal features to obtain the maximum-correlation representation among the modalities and highlight key emotion information: 1) adopting the mutual-information-maximization method for feature representation, increasing the dependency relationships among the different modal features according to the modal characteristics; the mutual information method finds the highest correlation between the modal vectors and filters out the uncorrelated noise of each modality; 2) computing, through two-layer feedforward neural networks, the nonlinear projections with maximum correlation among the sound, language and visual modalities respectively:

T_L = D_L(H_L), T_V = D_V(H_V), T_A = D_A(H_A).
4. Providing, in the feature fusion process, a new network structure, the multi-modal hierarchical fusion network, which takes the sound, text and visual features after the feature representation stage as input, models the interactions at the unimodal, bimodal and trimodal levels respectively, and dynamically adjusts its internal structure according to the interaction process: 1) the multi-modal hierarchical fusion network is implemented as:

T_[L,V] = D_{L,V}(T_L, T_V), T_[L,A] = D_{L,A}(T_L, T_A), T_[V,A] = D_{V,A}(T_V, T_A),
T_[L,V,A] = D_{L,V,A}(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A]),
Z = D_f(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A], T_[L,V,A]),

where T_L, T_V, T_A denote the language, visual and sound modal features; T_[L,V], T_[L,A], T_[V,A] denote the bimodal features, each learned from the corresponding pair of modal features by an independent two-layer feedforward neural network D_{L,V}, D_{L,A}, D_{V,A}; T_[L,V,A] denotes the trimodal feature, learned by the two-layer feedforward neural network D_{L,V,A}; after the final layer-by-layer learning, the unimodal, bimodal and trimodal features are fused by D_f to obtain the multi-modal fusion feature Z; 2) the multi-modal hierarchical fusion network structure is dynamically adjusted through multiple iterations to output the final prediction features.
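As a sketch of the iterative-training step (5), the loop below trains the full model for several epochs and keeps the checkpoint with the highest validation metric; the model interface, the weighting between the task loss and the mutual-information loss, and the evaluation function are all illustrative assumptions, not specified in the patent.

```python
# Hedged sketch: repeated iterative training with validation-based model selection,
# matching step (5) ("apply the model with the highest evaluation index").
import copy
import torch

def train_and_select(model, train_loader, val_loader, evaluate, epochs=20, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_score, best_state = float("-inf"), None
    for _ in range(epochs):
        model.train()
        for text, vision, audio, label in train_loader:
            # Assumed interface: the model returns the task loss and the MI loss.
            task_loss, mi_term = model(text, vision, audio, label)
            loss = task_loss + 0.1 * mi_term  # assumed weighting of the MI term
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        score = evaluate(model, val_loader)   # e.g. binary accuracy on CMU-MOSI
        if score > best_score:
            best_score, best_state = score, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```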
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. The multi-stage multi-modal emotion analysis method based on mutual information method representation is characterized by comprising the following steps,
step (1): acquiring text, visual and sound modal data with corresponding relations from original multi-modal data, and performing feature coding on the original multi-modal data to obtain model input features;
step (2): extracting intra-modal high-dimensional features according to the characteristics of the different sound, language and visual modalities;
step (3): performing multi-modal feature collaborative representation with a mutual-information-maximization method on the sound, language and visual modal features to obtain feature representations of maximum correlation among the modalities;
step (4): during feature fusion, adopting a new fusion network structure for information fusion among the different modalities, with hierarchically adjustable modeling of the interactions among the unimodal, bimodal and trimodal combinations of sound, text and visual features;
step (5): performing repeated iterative training, and applying the model with the highest evaluation index to multi-modal emotion analysis.
2. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 1, wherein in the step (1), MOSI and MOSEI emotion videos are selected as the original multi-modal data.
3. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 1, wherein in the step (1), in the feature-coding process of the original multi-modal data, the visual modality uses Facet to capture the actions used for expressing human emotion information in the video; the sound modality uses COVAREP to collect features from the audio; and for the text modality, a pre-trained BERT model trained on a large-scale corpus is adopted, and the output of BERT is used as the feature coding in the multi-modal emotion analysis task.
4. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 3, wherein the features acquired for the visual modality include any one or more of eye closure, neck muscles, head movements, hand movements and leg movements; and the features acquired for the sound modality include any one or more of intensity, pitch, audio peak slope, and voiced-unvoiced segment characteristics.
5. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 1, wherein in the step (2), in the intra-modal high-dimensional feature extraction process, two independent LSTM models are used to extract the temporal features of the different modalities.
6. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 1, wherein in the step (3), in the multi-modal feature collaborative representation process, a mutual-information-maximization representation method is adopted to learn projection representations among different modalities; in the specific calculation process, the mutual information objective is optimized through feedforward neural networks, which output the nonlinear projections that make the modalities maximally correlated; the loss function that maximizes the inter-modal mutual information is expressed as follows:

L_MI = -(1/N) Σ_{i=1}^{N} [ log q(y_i^{m_1} | x_i^{m_1}) + log q(y_i^{m_2} | x_i^{m_2}) ]

wherein q(y_i | x_i) is a multivariate Gaussian distribution, N is the batch size in training, and m_1, m_2 index the two target modality pairs whose log-likelihoods are summed; the representation optimized by the mutual information method is T_m = D_m(H_m), where the two-layer feedforward neural network D_m corresponding to each modality outputs the represented modal features.
7. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 1, wherein in the step (4), in the feature fusion process, a multi-modal hierarchical fusion network is used to complete the fusion interaction between modal features; the calculation process of the multi-modal hierarchical fusion network is as follows:

T_[L,V] = D_{L,V}(T_L, T_V), T_[L,A] = D_{L,A}(T_L, T_A), T_[V,A] = D_{V,A}(T_V, T_A),
T_[L,V,A] = D_{L,V,A}(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A]),
Z = D_f(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A], T_[L,V,A]),

wherein T_L, T_V, T_A respectively denote the language, visual and sound modal features; T_[L,V], T_[L,A], T_[V,A] denote the bimodal features, each learned from the corresponding pair of modal features by an independent two-layer feedforward neural network D_{L,V}, D_{L,A}, D_{V,A}; T_[L,V,A] denotes the trimodal feature, learned from the unimodal and bimodal features by the two-layer feedforward neural network D_{L,V,A}; and after layer-by-layer learning, the unimodal, bimodal and trimodal features are fused by D_f to obtain the multi-modal fusion feature Z.
8. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 7, wherein the multi-modal hierarchical fusion network models interactions among single-modal, bi-modal and tri-modal hierarchically, and dynamically adjusts an internal structure according to an interaction process.
CN202211465914.6A 2022-11-22 2022-11-22 Multi-stage multi-modal emotion analysis method based on mutual information method representation Pending CN115858726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211465914.6A CN115858726A (en) 2022-11-22 2022-11-22 Multi-stage multi-modal emotion analysis method based on mutual information method representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211465914.6A CN115858726A (en) 2022-11-22 2022-11-22 Multi-stage multi-modal emotion analysis method based on mutual information method representation

Publications (1)

Publication Number Publication Date
CN115858726A true CN115858726A (en) 2023-03-28

Family

ID=85664849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211465914.6A Pending CN115858726A (en) 2022-11-22 2022-11-22 Multi-stage multi-modal emotion analysis method based on mutual information method representation

Country Status (1)

Country Link
CN (1) CN115858726A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975776A (en) * 2023-07-14 2023-10-31 湖北楚天高速数字科技有限公司 Multi-mode data fusion method and device based on tensor and mutual information
CN117809229A (en) * 2024-02-29 2024-04-02 广东工业大学 Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance
CN117809229B (en) * 2024-02-29 2024-05-07 广东工业大学 Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance

Similar Documents

Publication Publication Date Title
CN111275085B (en) Online short video multi-modal emotion recognition method based on attention fusion
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN111507311B (en) Video character recognition method based on multi-mode feature fusion depth network
Beard et al. Multi-modal sequence fusion via recursive attention for emotion recognition
CN115329779B (en) Multi-person dialogue emotion recognition method
CN111723937A (en) Method, device, equipment and medium for generating description information of multimedia data
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
CN115858726A (en) Multi-stage multi-modal emotion analysis method based on mutual information method representation
CN113255755A (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN113033450B (en) Multi-mode continuous emotion recognition method, service inference method and system
Patilkulkarni Visual speech recognition for small scale dataset using VGG16 convolution neural network
CN112597841B (en) Emotion analysis method based on door mechanism multi-mode fusion
CN114021524B (en) Emotion recognition method, device, equipment and readable storage medium
Li et al. A deep reinforcement learning framework for Identifying funny scenes in movies
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN117251057A (en) AIGC-based method and system for constructing AI number wisdom
CN116129013A (en) Method, device and storage medium for generating virtual person animation video
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
Xue et al. LCSNet: End-to-end lipreading with channel-aware feature selection
CN116484885A (en) Visual language translation method and system based on contrast learning and word granularity weight
CN115858728A (en) Multi-mode data based emotion analysis method
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN115270917A (en) Two-stage processing multi-mode garment image generation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination