CN115858726A - Multi-stage multi-modal emotion analysis method based on mutual information method representation - Google Patents

Multi-stage multi-modal emotion analysis method based on mutual information method representation Download PDF

Info

Publication number
CN115858726A
CN115858726A (Application CN202211465914.6A)
Authority
CN
China
Prior art keywords
modal
representation
mutual information
mode
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211465914.6A
Other languages
Chinese (zh)
Inventor
侯金鑫
李希城
徐明成
谢杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Electronic Commerce Co Ltd
Original Assignee
Tianyi Electronic Commerce Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Electronic Commerce Co Ltd filed Critical Tianyi Electronic Commerce Co Ltd
Priority to CN202211465914.6A priority Critical patent/CN115858726A/en
Publication of CN115858726A publication Critical patent/CN115858726A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a multi-stage multi-modal emotion analysis method based on mutual information method representation, and relates to the field of artificial intelligence. Text, visual and sound modal data with corresponding relations are acquired from original multi-modal data, and the original multi-modal data are feature-coded to obtain model input features; intra-modal high-dimensional features are extracted according to the characteristics of the different sound, language and visual modalities; multi-modal feature collaborative representation with a mutual-information-maximization method is performed on the sound, language and visual modal features to obtain feature representations of maximum correlation among the modalities; during feature fusion, a new fusion network structure is adopted for information fusion among the different modalities, realizing hierarchically adjustable modeling of the interactions among the unimodal, bimodal and trimodal combinations of sound, text and visual features. The problems of key information loss, noise interference and partial feature redundancy existing in each single stage are thereby mitigated, and the multi-modal emotion analysis effect is improved.

Description

Multi-stage multi-modal emotion analysis method based on mutual information method representation
Technical Field
The invention relates to the field of artificial intelligence, in particular to a multi-stage multi-modal emotion analysis method based on mutual information method representation.
Background
With the popularization of social media, image and video data on the network have been increasing, and the research task of emotion analysis has expanded from a single language modality to multi-modal emotion prediction. Much of the data on the network contains multi-modal information such as vision, language and sound; such data reflect the real attitude and emotional state of the user and have high application value in realistic scenarios such as box-office prediction, political elections and public-opinion supervision. Therefore, effectively fusing and representing multi-modal data to improve the accuracy of emotion analysis, so as to more truly reveal the emotion of the user, has become the main research problem of multi-modal emotion analysis at present.
Multi-modal data fusion strategies mainly comprise early fusion, late fusion and hybrid fusion, divided by fusion stage, and tensor-model-based fusion, time-series-model-based fusion and attention-model-based fusion, divided by fusion method. At present, the accuracy of the multi-modal emotion analysis task is improved through the selection of the fusion mode, but room for improvement remains: for example, key information loss and feature noise interference may occur during the fusion of multi-modal features and affect the prediction result.
Multi-modal representation learning can make up for the defects of a multi-modal fusion strategy to a certain extent, capture the relations among different modalities and eliminate the noise of modal features. Multi-modal representation learning work mainly comprises joint representation and collaborative representation, and the mutual-information-maximization representation method within structured collaborative representation can enhance the dependency of different modal features and strengthen the representation of common information among the modalities. However, most related work on multi-modal representation learning simply outputs the multi-modal sequence features by splicing or weighting, which may lead to insufficient interaction among the modalities and to feature redundancy.
Disclosure of Invention
The invention aims to provide a multi-stage multi-modal emotion analysis method based on mutual information method representation that addresses the shortcomings of using a single multi-modal fusion strategy or single multi-modal representation learning alone. On the basis of feature extraction, mutual-information-maximization representation learning is combined with a newly proposed multi-modal hierarchical fusion network, so that the problems of key information loss, noise interference and partial feature redundancy existing in each single stage compensate for one another, and the multi-modal emotion analysis effect is further improved.
The embodiment of the invention is realized by the following steps:
in a first aspect, an embodiment of the present application provides a multi-stage multi-modal sentiment analysis method based on mutual information method representation, which includes the following steps. Step (1): acquiring text, visual and sound modal data with corresponding relations from original multi-modal data, and performing feature coding on the original multi-modal data to obtain model input features; step (2): extracting intra-modal high-dimensional features according to the characteristics of the different sound, language and visual modalities; step (3): performing multi-modal feature collaborative representation with a mutual-information-maximization method on the sound, language and visual modal features to obtain feature representations of maximum correlation among the modalities; step (4): during feature fusion, adopting a new fusion network structure for information fusion among the different modalities, with hierarchically adjustable modeling of the interactions among the unimodal, bimodal and trimodal combinations of sound, text and visual features; step (5): performing repeated iterative training, and applying the model with the highest evaluation index to multi-modal emotion analysis.
In some embodiments of the present invention, in the step (1), MOSI and MOSEI emotion videos are selected as the original multi-modal data.
In some embodiments of the present invention, in the step (1), in the feature-coding process of the original multi-modal data, the visual modality uses Facet to capture the actions used for expressing human emotion information in the video; the sound modality uses COVAREP to collect features from the audio; for the text modality, a pre-trained BERT model trained on a large-scale corpus is adopted, and the output of BERT is used as the feature coding in the multi-modal emotion analysis task.
In some embodiments of the invention, the above-mentioned visual modality acquisition comprises any one or more of eye closure, neck muscles, head movements, hand movements and leg movements; the sound modality acquisition includes any one or more of intensity, pitch, audio peak slope, and voiced-unvoiced segment characteristics.
In some embodiments of the present invention, in the step (2), in the intra-modal high-dimensional feature extraction process, two independent LSTM models are used to extract temporal features of different modalities.
In some embodiments of the present invention, in the step (3), in the multi-modal feature collaborative representation process, a mutual-information-maximization representation method is adopted to learn projection representations among different modalities; in the specific calculation process, the mutual information objective is optimized through feedforward neural networks, which output the nonlinear projections that make the modalities maximally correlated; the loss function that maximizes the inter-modal mutual information is expressed as follows:

L_MI = -(1/N) Σ_{i=1}^{N} [ log q(y_i^{m_1} | x_i^{m_1}) + log q(y_i^{m_2} | x_i^{m_2}) ]

wherein q(y_i | x_i) is a multivariate Gaussian distribution, N is the batch size in training, and m_1, m_2 index the two target modality pairs whose log-likelihoods are summed; the representation optimized by the mutual information method is T_m = D_m(H_m), where the two-layer feedforward neural network D_m corresponding to each modality outputs the represented modal features.
In some embodiments of the present invention, in the step (4), in the feature fusion process, a multi-modal hierarchical fusion network is used to complete the fusion interaction between modal features; the calculation process of the multi-modal hierarchical fusion network is as follows:

T_[L,V] = D_{L,V}(T_L, T_V), T_[L,A] = D_{L,A}(T_L, T_A), T_[V,A] = D_{V,A}(T_V, T_A),
T_[L,V,A] = D_{L,V,A}(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A]),
Z = D_f(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A], T_[L,V,A]),

wherein T_L, T_V, T_A respectively denote the language, visual and sound modal features; T_[L,V], T_[L,A], T_[V,A] denote the bimodal features, each learned from the corresponding pair of modal features by an independent two-layer feedforward neural network D_{L,V}, D_{L,A}, D_{V,A}; T_[L,V,A] denotes the trimodal feature, learned from the unimodal and bimodal features by the two-layer feedforward neural network D_{L,V,A}; after layer-by-layer learning, the unimodal, bimodal and trimodal features are fused by D_f to obtain the multi-modal fusion feature Z.
In some embodiments of the present invention, the multi-modal hierarchical fusion network models interactions among single-modal, dual-modal, and tri-modal hierarchically, and dynamically adjusts an internal structure according to an interaction process.
Compared with the prior art, the embodiment of the invention has at least the following advantages or beneficial effects:
firstly, the invention adopts the mutual-information-maximization method in the multi-modal feature representation stage, which can capture the dependencies of different modalities, improve the expression of correlation among them, and to a great extent eliminate the noise of each modality's features;
secondly, the invention provides, in the feature fusion stage, a multi-modal hierarchical fusion network that fuses the interactions among different modalities layer by layer, reducing the feature-information redundancy caused by previously inefficient fusion;
thirdly, the invention adopts the idea of multi-stage modeling and effectively combines multi-modal representation learning with a multi-modal fusion method, which can to a great extent solve the problems of noise interference, loss of key emotion information and feature-information redundancy existing in any single stage.
Aiming at the shortcomings of a single multi-modal fusion strategy and of single multi-modal representation learning, the invention provides a multi-stage multi-modal emotion analysis method based on mutual information method representation. On the basis of feature extraction, mutual-information-maximization representation learning is combined with the newly proposed multi-modal hierarchical fusion network, so that the problems of key information loss, noise interference and partial feature redundancy existing in each single stage compensate for one another, and the multi-modal emotion analysis effect is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a flow chart of a multi-stage multi-modal sentiment analysis method based on mutual information method representation according to an embodiment of the present invention;
FIG. 2 is a model diagram of a multi-stage multi-modal sentiment analysis method represented based on a mutual information method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments and features of the embodiments described below can be combined with one another without conflict.
Examples
Referring to fig. 1 to 2, fig. 1 to 2 are schematic diagrams illustrating a multi-stage multi-modal sentiment analysis method based on mutual information method representation according to an embodiment of the present application. The multi-stage multi-modal emotion analysis method based on mutual information method representation comprises the following steps. Step (1): acquiring text, visual and sound modal data with corresponding relations from original multi-modal data, and performing feature coding on the original multi-modal data to obtain model input features; step (2): extracting intra-modal high-dimensional features according to the characteristics of the different sound, language and visual modalities; step (3): performing multi-modal feature collaborative representation with a mutual-information-maximization method on the sound, language and visual modal features to obtain feature representations of maximum correlation among the modalities; step (4): during feature fusion, adopting a new fusion network structure for information fusion among the different modalities, with hierarchically adjustable modeling of the interactions among the unimodal, bimodal and trimodal combinations of sound, text and visual features; step (5): performing repeated iterative training, and applying the model with the highest evaluation index to multi-modal emotion analysis.
The method adopts a mutual-information-maximization approach in the multi-modal feature representation stage, which can capture the dependencies of different modalities, improve the expression of correlation among them, and to a great extent eliminate the noise of each modality's features. In the feature fusion stage it introduces a multi-modal hierarchical fusion network that fuses the interactions among different modalities layer by layer, reducing the feature-information redundancy caused by previously inefficient fusion. By adopting the idea of multi-stage modeling, multi-modal representation learning and the multi-modal fusion method are effectively combined, which can to a great extent solve the problems of noise interference, loss of key emotion information and feature-information redundancy existing in any single stage of the prior art.
In some embodiments of the present invention, in the step (1), MOSI and MOSEI emotion videos are selected as the original multi-modal data.
In some embodiments of the present invention, in the step (1), in the feature-coding process of the original multi-modal data, the visual modality uses Facet to capture the actions used for expressing human emotion information in the video; the sound modality uses COVAREP to collect features from the audio; for the text modality, a pre-trained BERT model trained on a large-scale corpus is adopted, and the output of BERT is used as the feature coding in the multi-modal emotion analysis task.
In some embodiments of the invention, the above-mentioned visual modality acquisition comprises any one or more of eye closure, neck muscles, head movements, hand movements and leg movements; the sound modality acquisition includes any one or more of intensity, pitch, audio peak slope, and voiced-unvoiced segment characteristics.
In some embodiments of the present invention, in the step (2), in the intra-modal high-dimensional feature extraction process, two independent LSTM models are used to extract temporal features of different modalities.
In some embodiments of the present invention, in the step (3), in the multi-modal feature collaborative representation process, a mutual-information-maximization representation method is adopted to learn projection representations among different modalities; in the specific calculation process, the mutual information objective is optimized through feedforward neural networks, which output the nonlinear projections that make the modalities maximally correlated; the loss function that maximizes the inter-modal mutual information is expressed as follows:

L_MI = -(1/N) Σ_{i=1}^{N} [ log q(y_i^{m_1} | x_i^{m_1}) + log q(y_i^{m_2} | x_i^{m_2}) ]

wherein q(y_i | x_i) is a multivariate Gaussian distribution, N is the batch size in training, and m_1, m_2 index the two target modality pairs whose log-likelihoods are summed; the representation optimized by the mutual information method is T_m = D_m(H_m), where the two-layer feedforward neural network D_m corresponding to each modality outputs the represented modal features.
In some embodiments of the present invention, in the step (4), in the feature fusion process, a multi-modal hierarchical fusion network is used to complete the fusion interaction between modal features; the calculation process of the multi-modal hierarchical fusion network is as follows:

T_[L,V] = D_{L,V}(T_L, T_V), T_[L,A] = D_{L,A}(T_L, T_A), T_[V,A] = D_{V,A}(T_V, T_A),
T_[L,V,A] = D_{L,V,A}(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A]),
Z = D_f(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A], T_[L,V,A]),

wherein T_L, T_V, T_A respectively denote the language, visual and sound modal features; T_[L,V], T_[L,A], T_[V,A] denote the bimodal features, each learned from the corresponding pair of modal features by an independent two-layer feedforward neural network D_{L,V}, D_{L,A}, D_{V,A}; T_[L,V,A] denotes the trimodal feature, learned from the unimodal and bimodal features by the two-layer feedforward neural network D_{L,V,A}; after layer-by-layer learning, the unimodal, bimodal and trimodal features are fused by D_f to obtain the multi-modal fusion feature Z.
In some embodiments of the present invention, the multi-modal hierarchical fusion network models interactions among single-modal, dual-modal, and tri-modal hierarchically, and dynamically adjusts an internal structure according to an interaction process.
Referring to fig. 1, the specific implementation steps in application are as follows:
step 1, selecting and acquiring data in an original multi-modal form: the CMU-MOSI original video was taken as the initial data, containing 2199 individual self-describing video segments, each unit segment lasting about 10 seconds. The training set, the verification set and the test set are divided into 1284 video segments, 229 video segments and 686 video segments. CMU-MOSEI raw video data contains movie ratings video clips that are cast by thousands of video websites, for a total time up to 65 hours. 16265, 1869, and 4643 video segments were divided over the training set, validation set, and test set, respectively. Two classification tags of CMU-MOSI and CMU-MOSEI including negative emotion and positive emotion, and seven classification tags of labeled-3 (strongly negative emotion) to +3 (strongly positive emotion).
Step 2, performing feature coding on the sound and visual multi-modal original video: in the original multi-modal data feature-coding process, the visual modality adopts Facet to capture 35 actions that may express emotional information, such as eye closure, neck muscle and head movements, in the video. The video is divided into unit segments lasting about ten seconds, the emotional information implied by each frame is captured by the Facet system, and the per-frame features are averaged to obtain the unit visual feature code, whose initial size is 35. The sound modality uses COVAREP to collect intensity, pitch, audio peak slope, voiced-unvoiced segment features and the like from the audio. The segmentation of the audio segments is aligned with the video segments, the features of the audio frames over the ten-second span are averaged to obtain the feature code containing the emotion information of the sound modality, and the initial coding size is 74.
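The following is a minimal sketch of the per-segment pooling described above, assuming per-frame Facet (35-dimensional) and COVAREP (74-dimensional) features stored as NumPy arrays; the frame counts and function name are illustrative, not taken from the patent.

```python
# Hedged sketch of the per-segment feature pooling: per-frame visual (35-dim)
# and acoustic (74-dim) features are averaged over each ~10 s unit segment to
# give one visual and one acoustic code per segment.
import numpy as np

def pool_segment_features(frame_features: np.ndarray) -> np.ndarray:
    """Average per-frame features of shape (num_frames, dim) into one (dim,) vector."""
    return frame_features.mean(axis=0)

# Illustrative example: a 10-second clip with assumed frame counts.
visual_frames = np.random.randn(300, 35)     # Facet action descriptors per video frame
acoustic_frames = np.random.randn(1000, 74)  # COVAREP features per audio frame

visual_code = pool_segment_features(visual_frames)      # shape (35,)
acoustic_code = pool_segment_features(acoustic_frames)  # shape (74,)
```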
Step 3, text feature coding and feature extraction with the pre-trained BERT model: the text modality adopts a pre-trained BERT model to convert the MOSI and MOSEI original video subtitles into 768-dimensional vectors. The BERT model is built by stacking bidirectional Transformer encoders, and the trained position embeddings retain the position information that the attention mechanism depends on. Having been trained on a large-scale corpus, the output of BERT is adopted as the feature coding in the multi-modal emotion analysis task without excessive tuning.
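A hedged sketch of this text-coding step follows; the checkpoint name and the Hugging Face transformers API are assumptions made for illustration, since the patent only specifies that a pre-trained BERT model provides the 768-dimensional features.

```python
# Hedged sketch: a pre-trained BERT encodes each subtitle into 768-dimensional
# token vectors that serve as the language features H_L. The checkpoint name
# is an assumption, not stated in the patent.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_text(subtitle: str) -> torch.Tensor:
    inputs = tokenizer(subtitle, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state  # shape (1, seq_len, 768), used as H_L
```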
Step 4, extracting visual and sound modal coding features: the visual and sound features are temporally ordered in the emotion analysis task, so two independent LSTM models are further adopted, on the basis of the feature coding, to extract the temporal features of the visual and sound modalities. The feature extraction part is:

H_V = LSTM(X_V; θ_V), H_A = LSTM(X_A; θ_A).

The update process of the sound and visual features in the corresponding LSTM at each time step is:

i_t = σ(W_i [h_{t-1}, x_t] + b_i),
f_t = σ(W_f [h_{t-1}, x_t] + b_f),
o_t = σ(W_o [h_{t-1}, x_t] + b_o),
c̃_t = tanh(W_c [h_{t-1}, x_t] + b_c),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,
h_t = o_t ⊙ tanh(c_t),

wherein i_t, f_t, o_t are the input gate, forget gate and output gate at time t, respectively; W_i, W_f, W_o, W_c are the parameter matrices of the corresponding transformations; σ denotes the Sigmoid activation function, and ⊙ denotes the Hadamard product.
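Below is a minimal PyTorch sketch of the two independent unidirectional LSTMs used for intra-modal temporal feature extraction; the hidden size, the example sequence lengths, and the choice of using the final hidden state are assumptions made for illustration.

```python
# Hedged sketch: two independent unidirectional LSTMs, one per modality, turn
# the per-segment visual (35-dim) and acoustic (74-dim) frame sequences into
# hidden representations H_V and H_A.
import torch
import torch.nn as nn

class ModalityLSTM(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim); keep the last hidden state per sequence.
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]  # shape (batch, hidden_dim)

visual_lstm = ModalityLSTM(input_dim=35)
acoustic_lstm = ModalityLSTM(input_dim=74)

x_v = torch.randn(8, 300, 35)    # batch of visual frame sequences (assumed lengths)
x_a = torch.randn(8, 1000, 74)   # batch of acoustic frame sequences
h_v = visual_lstm(x_v)           # H_V
h_a = acoustic_lstm(x_a)         # H_A
```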
Step 5, the feature representation stage based on mutual information maximization: in the multi-modal feature collaborative representation process, a mutual-information-maximization representation method is adopted to learn projection representations among the different modalities so that the important information of the modalities is maximally correlated. Nonlinear projections that maximally correlate the sound and text features, and the text and visual features, are obtained through feedforward neural networks and the mutual information method. The mutual-information-maximization optimization objective is:

L_MI = -(1/N) Σ_{i=1}^{N} [ log q(y_i^{(l,v)} | x_i^{(l,v)}) + log q(y_i^{(l,a)} | x_i^{(l,a)}) ],

where q(y_i | x_i) is expressed as a multivariate Gaussian distribution, N is the batch size in training, and (l,v) and (l,a) denote the two target modality pairs whose log-likelihoods are summed. In the specific calculation of the mutual-information-maximization method, the language, visual and sound modal features are passed through the two-layer neural networks D_L, D_V, D_A to output the representation features T_L, T_V, T_A, expressed as T_L = D_L(H_L), T_V = D_V(H_V), T_A = D_A(H_A).
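A hedged PyTorch sketch of this representation stage follows: two-layer feedforward networks D_m project each modality, and a diagonal-Gaussian predictor q(y|x) supplies the log-likelihood terms summed over the (language, visual) and (language, acoustic) pairs. The layer sizes and the diagonal-Gaussian parameterisation are assumptions not fixed by the patent.

```python
# Hedged sketch of the mutual-information-maximization representation stage.
import torch
import torch.nn as nn

class TwoLayerFFN(nn.Module):
    """Two-layer feedforward network D_m mapping H_m to T_m."""
    def __init__(self, in_dim, out_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)

class GaussianPredictor(nn.Module):
    """Parameterises q(y | x) as a diagonal Gaussian with learned mean and log-variance."""
    def __init__(self, x_dim, y_dim, hidden=128):
        super().__init__()
        self.mu = TwoLayerFFN(x_dim, y_dim, hidden)
        self.logvar = TwoLayerFFN(x_dim, y_dim, hidden)

    def log_likelihood(self, x, y):
        mu, logvar = self.mu(x), self.logvar(x)
        # Gaussian log-density up to an additive constant, summed over feature dims.
        return (-0.5 * (logvar + (y - mu) ** 2 / logvar.exp())).sum(dim=-1)

def mi_loss(q_lv, q_la, t_l, t_v, t_a):
    # Negative mean log-likelihood over the (language, visual) and (language, acoustic) pairs.
    return -(q_lv.log_likelihood(t_l, t_v) + q_la.log_likelihood(t_l, t_a)).mean()
```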
Step 6, unimodal feature fusion by the multi-modal hierarchical fusion network: after the feature representation, the three modal features of language, vision and sound are combined pairwise and, through three independent feedforward neural networks, output the three bimodal features of sound-language, sound-vision and language-vision respectively. The unimodal feature-fusion calculation of the multi-modal hierarchical fusion network is:

T_[L,V] = D_{L,V}(T_L, T_V), T_[L,A] = D_{L,A}(T_L, T_A), T_[V,A] = D_{V,A}(T_V, T_A),

wherein T_L, T_V, T_A denote the language, visual and sound modal features, and T_[L,V], T_[L,A], T_[V,A] denote the bimodal features, each learned from the corresponding pair of modal features by an independent two-layer feedforward neural network D_{L,V}, D_{L,A}, D_{V,A}.
Step 7, bimodal feature fusion by the multi-modal hierarchical fusion network: the three bimodal features, together with the unimodal features, are input into a two-layer feedforward neural network to obtain the trimodal feature that fuses the information of the three modalities:

T_[L,V,A] = D_{L,V,A}(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A]),

wherein T_[L,V,A] denotes the trimodal feature, learned from the unimodal and bimodal features by the two-layer feedforward neural network D_{L,V,A}.
Step 8, finally, the unimodal, bimodal and trimodal features of the different layers are input together into a feedforward neural network to output the final fusion feature:

Z = D_f(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A], T_[L,V,A]).

That is, the unimodal, bimodal and trimodal features obtained by the layer-by-layer learning are fused by D_f into the multi-modal fusion feature Z.
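The sketch below assembles the multi-modal hierarchical fusion network described in steps 6 to 8 as a single PyTorch module; the feature dimension, the hidden sizes and the concatenation of inputs to each feedforward network are assumptions consistent with, but not dictated by, the text.

```python
# Hedged sketch of the multi-modal hierarchical fusion network: three
# independent two-layer FFNs produce the bimodal features, a further FFN gives
# the trimodal feature, and D_f fuses all levels into the final feature Z.
import torch
import torch.nn as nn

class FFN(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, *xs):
        # Concatenate all input features along the last dimension.
        return self.net(torch.cat(xs, dim=-1))

class HierarchicalFusion(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.d_lv = FFN(2 * dim, dim)
        self.d_la = FFN(2 * dim, dim)
        self.d_va = FFN(2 * dim, dim)
        self.d_lva = FFN(6 * dim, dim)  # three unimodal + three bimodal inputs
        self.d_f = FFN(7 * dim, dim)    # all unimodal, bimodal and trimodal features

    def forward(self, t_l, t_v, t_a):
        t_lv = self.d_lv(t_l, t_v)
        t_la = self.d_la(t_l, t_a)
        t_va = self.d_va(t_v, t_a)
        t_lva = self.d_lva(t_l, t_v, t_a, t_lv, t_la, t_va)
        z = self.d_f(t_l, t_v, t_a, t_lv, t_la, t_va, t_lva)
        return z  # multi-modal fusion feature Z
```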
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In summary, the embodiment of the present application provides a multi-stage multi-modal emotion analysis method based on mutual information method representation:
1. acquiring text, visual and sound modal data with corresponding relations, and performing feature coding on the original multi-modal data to obtain model input features: 1) Adopting original videos related to MOSI and MOSEI as original data of multi-modal emotion analysis; 2) Audio and video features are extracted using COVAREP and face 2 for encoding of visual and sound modalities, respectively. Feature encoding of the input text is done using a pre-trained Bert for the text modality.
2. Extracting intra-modal high-dimensional features according to the characteristics of the different sound, language and visual modalities, adopting a different feature-extraction approach for each modality. For the language modality, the pre-trained BERT is employed, consistent with the text feature coding. The visual and acoustic features are temporally ordered in the emotion analysis task, so two independent unidirectional LSTMs are used to capture the temporal features of these modalities:

H_L = BERT(X_L; θ_BERT), H_V = LSTM(X_V; θ_V), H_A = LSTM(X_A; θ_A).
3. Performing multi-modal feature collaborative representation on the sound, language and visual modal features to obtain the maximum-correlation representation among the modalities and highlight key emotion information: 1) adopting the mutual-information-maximization method for feature representation, increasing the dependency relationships among the different modal features according to the modal characteristics; the mutual information method finds the highest correlation between the modal vectors and filters out the uncorrelated noise of each modality; 2) computing, through two-layer feedforward neural networks, the nonlinear projections with maximum correlation among the sound, language and visual modalities respectively:

T_L = D_L(H_L), T_V = D_V(H_V), T_A = D_A(H_A).
4. Providing, in the feature fusion process, a new network structure, the multi-modal hierarchical fusion network, which takes the sound, text and visual features after the feature representation stage as input, models the interactions at the unimodal, bimodal and trimodal levels respectively, and dynamically adjusts its internal structure according to the interaction process: 1) the multi-modal hierarchical fusion network is implemented as:

T_[L,V] = D_{L,V}(T_L, T_V), T_[L,A] = D_{L,A}(T_L, T_A), T_[V,A] = D_{V,A}(T_V, T_A),
T_[L,V,A] = D_{L,V,A}(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A]),
Z = D_f(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A], T_[L,V,A]),

where T_L, T_V, T_A denote the language, visual and sound modal features; T_[L,V], T_[L,A], T_[V,A] denote the bimodal features, each learned from the corresponding pair of modal features by an independent two-layer feedforward neural network D_{L,V}, D_{L,A}, D_{V,A}; T_[L,V,A] denotes the trimodal feature, learned by the two-layer feedforward neural network D_{L,V,A}; after the final layer-by-layer learning, the unimodal, bimodal and trimodal features are fused by D_f to obtain the multi-modal fusion feature Z; 2) the multi-modal hierarchical fusion network structure is dynamically adjusted through multiple iterations to output the final prediction features.
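As a sketch of the iterative-training step (5), the loop below trains the full model for several epochs and keeps the checkpoint with the highest validation metric; the model interface, the weighting between the task loss and the mutual-information loss, and the evaluation function are all illustrative assumptions, not specified in the patent.

```python
# Hedged sketch: repeated iterative training with validation-based model selection,
# matching step (5) ("apply the model with the highest evaluation index").
import copy
import torch

def train_and_select(model, train_loader, val_loader, evaluate, epochs=20, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_score, best_state = float("-inf"), None
    for _ in range(epochs):
        model.train()
        for text, vision, audio, label in train_loader:
            # Assumed interface: the model returns the task loss and the MI loss.
            task_loss, mi_term = model(text, vision, audio, label)
            loss = task_loss + 0.1 * mi_term  # assumed weighting of the MI term
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        score = evaluate(model, val_loader)   # e.g. binary accuracy on CMU-MOSI
        if score > best_score:
            best_score, best_state = score, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```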
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. The multi-stage multi-modal emotion analysis method based on mutual information method representation is characterized by comprising the following steps,
step (1): acquiring text, visual and sound modal data with corresponding relations from original multi-modal data, and performing feature coding on the original multi-modal data to obtain model input features;
step (2): extracting intra-modal high-dimensional features according to the characteristics of the different sound, language and visual modalities;
step (3): performing multi-modal feature collaborative representation with a mutual-information-maximization method on the sound, language and visual modal features to obtain feature representations of maximum correlation among the modalities;
step (4): during feature fusion, adopting a new fusion network structure for information fusion among the different modalities, with hierarchically adjustable modeling of the interactions among the unimodal, bimodal and trimodal combinations of sound, text and visual features;
step (5): performing repeated iterative training, and applying the model with the highest evaluation index to multi-modal emotion analysis.
2. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 1, wherein in the step (1), MOSI and MOSEI emotion videos are selected as the original multi-modal data.
3. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 1, wherein in the step (1), in the feature-coding process of the original multi-modal data, the visual modality uses Facet to capture the actions used for expressing human emotion information in the video; the sound modality uses COVAREP to collect features from the audio; and for the text modality, a pre-trained BERT model trained on a large-scale corpus is adopted, and the output of BERT is used as the feature coding in the multi-modal emotion analysis task.
4. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 3, wherein the features acquired for the visual modality include any one or more of eye closure, neck muscles, head movements, hand movements and leg movements; and the features acquired for the sound modality include any one or more of intensity, pitch, audio peak slope, and voiced-unvoiced segment characteristics.
5. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 1, wherein in the step (2), in the intra-modal high-dimensional feature extraction process, two independent LSTM models are used to extract the temporal features of the different modalities.
6. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 1, wherein in the step (3), in the multi-modal feature collaborative representation process, a mutual-information-maximization representation method is adopted to learn projection representations among different modalities; in the specific calculation process, the mutual information objective is optimized through feedforward neural networks, which output the nonlinear projections that make the modalities maximally correlated; the loss function that maximizes the inter-modal mutual information is expressed as follows:

L_MI = -(1/N) Σ_{i=1}^{N} [ log q(y_i^{m_1} | x_i^{m_1}) + log q(y_i^{m_2} | x_i^{m_2}) ]

wherein q(y_i | x_i) is a multivariate Gaussian distribution, N is the batch size in training, and m_1, m_2 index the two target modality pairs whose log-likelihoods are summed; the representation optimized by the mutual information method is T_m = D_m(H_m), where the two-layer feedforward neural network D_m corresponding to each modality outputs the represented modal features.
7. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 1, wherein in the step (4), in the feature fusion process, a multi-modal hierarchical fusion network is used to complete the fusion interaction between modal features; the calculation process of the multi-modal hierarchical fusion network is as follows:

T_[L,V] = D_{L,V}(T_L, T_V), T_[L,A] = D_{L,A}(T_L, T_A), T_[V,A] = D_{V,A}(T_V, T_A),
T_[L,V,A] = D_{L,V,A}(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A]),
Z = D_f(T_L, T_V, T_A, T_[L,V], T_[L,A], T_[V,A], T_[L,V,A]),

wherein T_L, T_V, T_A respectively denote the language, visual and sound modal features; T_[L,V], T_[L,A], T_[V,A] denote the bimodal features, each learned from the corresponding pair of modal features by an independent two-layer feedforward neural network D_{L,V}, D_{L,A}, D_{V,A}; T_[L,V,A] denotes the trimodal feature, learned from the unimodal and bimodal features by the two-layer feedforward neural network D_{L,V,A}; and after layer-by-layer learning, the unimodal, bimodal and trimodal features are fused by D_f to obtain the multi-modal fusion feature Z.
8. The multi-stage multi-modal sentiment analysis method based on mutual information method representation according to claim 7, wherein the multi-modal hierarchical fusion network models interactions among single-modal, bi-modal and tri-modal hierarchically, and dynamically adjusts an internal structure according to an interaction process.
CN202211465914.6A 2022-11-22 2022-11-22 Multi-stage multi-modal emotion analysis method based on mutual information method representation Pending CN115858726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211465914.6A CN115858726A (en) 2022-11-22 2022-11-22 Multi-stage multi-modal emotion analysis method based on mutual information method representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211465914.6A CN115858726A (en) 2022-11-22 2022-11-22 Multi-stage multi-modal emotion analysis method based on mutual information method representation

Publications (1)

Publication Number Publication Date
CN115858726A true CN115858726A (en) 2023-03-28

Family

ID=85664849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211465914.6A Pending CN115858726A (en) 2022-11-22 2022-11-22 Multi-stage multi-modal emotion analysis method based on mutual information method representation

Country Status (1)

Country Link
CN (1) CN115858726A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975776A (en) * 2023-07-14 2023-10-31 湖北楚天高速数字科技有限公司 Multi-mode data fusion method and device based on tensor and mutual information
CN117809229A (en) * 2024-02-29 2024-04-02 广东工业大学 Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance
CN117809229B (en) * 2024-02-29 2024-05-07 广东工业大学 Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance

Similar Documents

Publication Publication Date Title
CN111275085B (en) Online short video multi-modal emotion recognition method based on attention fusion
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN111507311B (en) Video character recognition method based on multi-mode feature fusion depth network
Beard et al. Multi-modal sequence fusion via recursive attention for emotion recognition
CN115329779B (en) Multi-person dialogue emotion recognition method
CN111723937A (en) Method, device, equipment and medium for generating description information of multimedia data
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
CN115858726A (en) Multi-stage multi-modal emotion analysis method based on mutual information method representation
CN113255755A (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN113033450B (en) Multi-mode continuous emotion recognition method, service inference method and system
Patilkulkarni Visual speech recognition for small scale dataset using VGG16 convolution neural network
CN112597841B (en) Emotion analysis method based on door mechanism multi-mode fusion
CN114021524B (en) Emotion recognition method, device, equipment and readable storage medium
Li et al. A deep reinforcement learning framework for Identifying funny scenes in movies
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN117251057A (en) AIGC-based method and system for constructing AI number wisdom
CN116129013A (en) Method, device and storage medium for generating virtual person animation video
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
Xue et al. LCSNet: End-to-end lipreading with channel-aware feature selection
CN116484885A (en) Visual language translation method and system based on contrast learning and word granularity weight
CN115858728A (en) Multi-mode data based emotion analysis method
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN115270917A (en) Two-stage processing multi-mode garment image generation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination