CN111680541B - Multi-modal emotion analysis method based on multi-dimensional attention fusion network - Google Patents

Multi-modal emotion analysis method based on multi-dimensional attention fusion network

Info

Publication number
CN111680541B
Authority
CN
China
Prior art keywords
fusion
autocorrelation
target
modal
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010292014.0A
Other languages
Chinese (zh)
Other versions
CN111680541A (en
Inventor
冯镔
付彦喆
王耀平
江子文
杭浩然
李瑞达
刘文予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202010292014.0A priority Critical patent/CN111680541B/en
Publication of CN111680541A publication Critical patent/CN111680541A/en
Application granted granted Critical
Publication of CN111680541B publication Critical patent/CN111680541B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention discloses a multi-modal emotion analysis method based on a multi-dimensional attention fusion network, comprising the following steps: for sample data containing the voice, video and text modalities, extract voice pre-processing features, video pre-processing features and text pre-processing features; then construct a multi-dimensional attention fusion network for each modality, extract first-level and second-level autocorrelation features with the autocorrelation feature extraction module in the network, and, combining the autocorrelation information of the three modalities, obtain the cross-modal fusion features of the three modalities with the cross-modal fusion module in the network; combine the second-level autocorrelation features with the cross-modal fusion features to obtain the multi-dimensional features of each modality; finally, splice the modal multi-dimensional features, determine the emotion score and perform emotion analysis. The method can effectively perform feature fusion on non-aligned multi-modal data and makes full use of the associated multi-modal information for emotion analysis.

Description

Multi-modal emotion analysis method based on multi-dimensional attention fusion network
Technical Field
The invention belongs to the field of multi-modal emotion calculation, and particularly relates to a multi-modal emotion analysis method based on a multi-dimensional attention fusion network.
Background
Emotion analysis has numerous applications in daily life. With the development of big data and multimedia technology, multi-modal emotion analysis techniques analyze the voice, video and text modalities of data to better mine the meaning behind the data. In a return-visit survey, for example, a user's satisfaction with a service or product can be determined through comprehensive analysis of the user's voice, face and speech content.
At present, the main difficulty of multi-modal emotion analysis lies in how to fuse multi-modal information effectively, since voice, video and text features are acquired in completely different ways. When the same content is described, the sequence lengths of the voice and video modalities differ greatly from that of the text in the time dimension, so the features of the three modalities cannot be put into one-to-one correspondence in time, which makes fusion between the modalities very difficult.
Two methods are currently common. One is based on modal integration: an intermediate result from the data layer, feature layer or decision layer of the emotion analysis system is selected and spliced, and emotion prediction is then performed. This method merely aggregates the results of the three modalities, does not consider the correlation information between modalities, and the resulting information redundancy easily causes the model to overfit. The other method is based on modal annotation alignment: during data annotation, the three modalities are forcibly aligned in the time dimension according to characters or phonemes so that their temporal correspondence is guaranteed, and modal fusion is then performed with recurrent neural networks, convolutional neural networks, attention mechanisms or the Seq2Seq framework; however, the annotation cost is high, which is unfavorable for real production and living environments.
Disclosure of Invention
The invention aims to provide a multi-modal emotion analysis method based on a multi-dimensional attention fusion network, which avoids both the overfitting caused by integration methods and the excessive annotation cost caused by modal-annotation alignment, and makes full use of the multi-dimensional information within and between modalities to obtain more accurate and reliable emotion analysis results.
The invention solves the above technical problem through the following steps:
Step one, a multi-modal emotion analysis database of size N is established; each sample in the database contains data of the three target modalities of voice, video and text; pre-processing features of the three target modalities are extracted in advance, and each sample is given an emotion label.
And step two, constructing a respective multi-dimensional attention fusion network for each of the three target modalities in step one.
The multi-dimensional attention fusion network of each of the three target modalities comprises an autocorrelation feature extraction module and a cross-modal fusion module, each built from a Transformer network.
And step three, the preprocessed features in the step one are respectively processed by the autocorrelation feature extraction module in the step two, and autocorrelation information of three modes, namely voice autocorrelation information, text autocorrelation information and video autocorrelation information, is extracted.
The voice autocorrelation feature extraction module, the text autocorrelation feature extraction module and the video autocorrelation feature extraction module respectively comprise a primary autocorrelation feature extractor and a secondary autocorrelation feature extractor.
The voice autocorrelation feature extraction module is configured to extract autocorrelation information of the input voice pre-processing features.
The text autocorrelation feature extraction module is configured to extract autocorrelation information of input text pre-processing features.
The video autocorrelation feature extraction module is configured to extract autocorrelation information of input video pre-processing features.
The autocorrelation information of the three modes comprises a first-level autocorrelation characteristic and a second-level autocorrelation characteristic.
And step four, selecting the first-level autocorrelation feature of any one target modality in step three as the target feature to be fused and the second-level autocorrelation features of the other two target modalities as auxiliary fusion features, and sending them, according to a certain grouping scheme, to the cross-modal fusion module of that target modality, to obtain the voice-based, text-based and video-based cross-modal fusion features respectively.
The cross-modal fusion module comprises two bimodal fusion devices and a weighted integration network.
And step five, adding the cross-modal fusion characteristics of each target mode and the secondary autocorrelation characteristics of the step three to obtain multi-dimensional fusion characteristics.
And step six, splicing the voice, text and video multi-dimensional fusion characteristics obtained in the step five to obtain full-scale multi-dimensional characteristics, and sending the full-scale multi-dimensional characteristics to a scoring module to obtain emotion scores.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the pre-processing features of the three target modalities in step one are extracted as follows: the voice pre-processing features are MFCC features extracted from the voice with the Kaldi speech recognition toolkit, the video pre-processing features are facial expression unit features extracted with Facet, and the text pre-processing features are word vector features extracted with word2vec.
And respectively carrying out feature dimension alignment on the voice preprocessing feature, the video preprocessing feature and the text preprocessing feature through linear transformation.
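As an illustration of this dimension-alignment step, a minimal PyTorch sketch is given below. The raw dimensions (39-dimensional MFCC, 35-dimensional Facet, 300-dimensional word2vec features) follow the embodiment described later; the module name ModalityAligner and the common dimension d_model = 64 are illustrative assumptions only, not values specified by the invention.

import torch.nn as nn

class ModalityAligner(nn.Module):
    # Maps each modality's pre-processing features to a common dimension
    # with one linear transformation per modality (illustrative sketch).
    def __init__(self, d_audio=39, d_video=35, d_text=300, d_model=64):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.video_proj = nn.Linear(d_video, d_model)
        self.text_proj = nn.Linear(d_text, d_model)

    def forward(self, audio, video, text):
        # Inputs: (batch, seq_len, raw_dim); the sequence lengths may differ
        # across modalities and are not aligned in time.
        return self.audio_proj(audio), self.video_proj(video), self.text_proj(text)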
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the emotion label in step one is a bounded continuous interval; the interior of the interval is divided continuously and equidistantly into M sub-intervals, and each sub-interval represents a range of emotion intensity.
The interval can be taken as the integer interval [-K, K], where scores greater than 0 are judged positive, equal to 0 neutral, and less than 0 negative; the sub-intervals can be further divided according to the required emotion granularity.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the multi-dimensional attention fusion networks in step two have the same structure for the three target modalities of voice, video and text.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the autocorrelation feature extraction module in step two is a Transformer network.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the first-level autocorrelation feature extractor and the second-level autocorrelation feature extractor in step three are cascaded.
The first-level autocorrelation feature extractor is configured to extract first-level autocorrelation features from the input target-modality pre-processing features based on the Transformer's multi-head self-attention mechanism.
The calculation formula of the multi-head self-attention mechanism is as follows:
Q_i = X W_i^Q;  K_i = X W_i^K;  V_i = X W_i^V
head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
MultiHead_X_1 = Concat(head_1, ..., head_n)
wherein X is the target-modality pre-processing feature in step two, W_i^Q is the query mapping weight of the i-th head, W_i^K is the key mapping weight of the i-th head, W_i^V is the value mapping weight of the i-th head, softmax is the weight normalization function, K_i^T is the transpose of K_i, d_k is the scaling factor, n is the number of heads, and MultiHead_X_1 is the obtained first-level autocorrelation feature.
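For illustration, the first-level autocorrelation feature extractor can be sketched as a standard multi-head self-attention layer in PyTorch, following the formula above; the class name, head count n = 4 and width d_model = 64 are assumed values, not taken from the patent.

import math
import torch
import torch.nn as nn

class FirstLevelAutocorrelation(nn.Module):
    # Multi-head self-attention over a single modality (illustrative sketch).
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)  # W^Q of all heads
        self.w_k = nn.Linear(d_model, d_model)  # W^K of all heads
        self.w_v = nn.Linear(d_model, d_model)  # W^V of all heads

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, t, _ = x.shape
        # Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V (all heads at once)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.n_heads, self.d_k).transpose(1, 2)
        # head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        heads = attn @ v
        # MultiHead_X_1 = Concat(head_1, ..., head_n)
        return heads.transpose(1, 2).reshape(b, t, -1)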
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the second-level autocorrelation feature extractor in step three is configured to extract second-level autocorrelation features from the input first-level autocorrelation features of the target modality based on a feedforward neural network.
The formula of the feedforward neural network is:
MultiHead_X_2 = max(0, MultiHead_X_1 · W_1 + b_1) · W_2 + b_2
wherein MultiHead_X_2 is the second-level autocorrelation feature of the target modality, W_1 is the weight applied to the first-level autocorrelation feature, W_2 is the hidden-layer weight of the network, b_1 is the bias of the first-level autocorrelation feature, and b_2 is the bias of the network hidden layer.
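A correspondingly small sketch of the second-level extractor, i.e. the feedforward network above, follows; the hidden width of 256 is an assumed value.

import torch.nn as nn

class SecondLevelAutocorrelation(nn.Module):
    # Position-wise feedforward network applied to the first-level feature.
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_hidden)  # . W_1 + b_1
        self.relu = nn.ReLU()                        # max(0, .)
        self.linear2 = nn.Linear(d_hidden, d_model)  # . W_2 + b_2

    def forward(self, multihead_x1):
        # multihead_x1: first-level autocorrelation feature (batch, seq, d_model)
        return self.linear2(self.relu(self.linear1(multihead_x1)))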
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the grouping scheme in step four is (X_0,main, X_1,aide), (X_0,main, X_2,aide) for subsequent grouped fusion, wherein X_0,main is the target feature to be fused in step four and X_1,aide, X_2,aide are the auxiliary fusion features of the other two modalities.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the bimodal fusion devices in step four are used for the grouped fusion and are configured to take the groups as input and produce two sets of cross-modal fusion features.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the bimodal fusion device in step four performs cross-modal fusion based on the Transformer's multi-head attention mechanism; the calculation method is as follows:
Q_m,main = X_m,main W^Q';  K = X_j,aide W^K';  V = X_j,aide W^V'
head_i = softmax(Q_m,main K^T / sqrt(d_k')) V
CrossFusion_X_aide→main = Concat(head_1, ..., head_n)
wherein X_m,main denotes the self-attention feature of the current target modality m, X_j,aide denotes an auxiliary fusion feature, CrossFusion_X_aide→main denotes the fusion result, W^Q' is the query mapping weight of target modality m, W^K' is the key mapping weight of target modality m, W^V' is the value mapping weight of target modality m, and d_k' is the scaling factor.
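For illustration, one bimodal fusion device can be sketched as multi-head cross-attention in which the query comes from the target modality's feature and the key and value come from an auxiliary modality's feature; the class name and parameter values are illustrative assumptions.

import math
import torch
import torch.nn as nn

class BimodalFuser(nn.Module):
    # Multi-head cross-attention: query from the target modality's first-level
    # feature, key/value from an auxiliary modality's second-level feature
    # (illustrative sketch).
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q'
        self.w_k = nn.Linear(d_model, d_model)   # W^K'
        self.w_v = nn.Linear(d_model, d_model)   # W^V'

    def forward(self, x_main, x_aide):
        b, t_main, _ = x_main.shape
        t_aide = x_aide.shape[1]
        q = self.w_q(x_main).view(b, t_main, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x_aide).view(b, t_aide, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x_aide).view(b, t_aide, self.n_heads, self.d_k).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        out = attn @ v                           # CrossFusion_X_aide->main
        return out.transpose(1, 2).reshape(b, t_main, -1)

Because the attention weights are computed between the two sequences, the sequence lengths of the target and auxiliary modalities need not be equal, which is why no prior time alignment of the modalities is required.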
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the bimodal fusion device performs cross-modal fusion based on the multi-head attention mechanism; the specific fusion process is as follows:
(1) Query mapping is performed on the target feature X_0,main to be fused in step four. The mapping is:
Q_0,main = X_0,main W^Q'
wherein W^Q' is the learned query mapping weight.
(2) Key and value mapping is performed on the auxiliary fusion features X_1,aide and X_2,aide described in step four. The mapping is:
For X_1,aide:  K_1 = X_1,aide W_1^K';  V_1 = X_1,aide W_1^V'
For X_2,aide:  K_2 = X_2,aide W_2^K';  V_2 = X_2,aide W_2^V'
wherein W_1^K' is the key mapping weight of auxiliary modality 1, W_1^V' is the value mapping weight of auxiliary modality 1, W_2^K' is the key mapping weight of auxiliary modality 2, W_2^V' is the value mapping weight of auxiliary modality 2, K_1 and V_1 are the key and value features mapped from the auxiliary fusion feature X_1,aide, and K_2 and V_2 are the key and value features mapped from the auxiliary fusion feature X_2,aide.
(3) Cross-modal fusion based on the multi-head attention mechanism is then performed on the mapping results.
For the group (X_0,main, X_1,aide), the fusion is:
head_i = softmax(Q_0,main K_1^T / sqrt(d_k')) V_1
CrossFusion_X_1→0 = Concat(head_1, ..., head_n)
For the group (X_0,main, X_2,aide), the fusion is:
head_i = softmax(Q_0,main K_2^T / sqrt(d_k')) V_2
CrossFusion_X_2→0 = Concat(head_1, ..., head_n)
wherein X_0,main denotes the self-attention feature of the current target modality, X_1,aide and X_2,aide denote the auxiliary fusion features of the remaining target modalities, and CrossFusion_X_aide→main denotes the fusion result of the target modality main based on the auxiliary modality aide.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the weighted integration network in step four is based on an adaptive weighted fusion algorithm and is configured to take the two sets of cross-modal fusion features as input and output the cross-modal fusion feature of the target modality.
The formula of the self-adaptive weighting fusion algorithm is as follows:
[formula images: the integration weights λ are computed adaptively from the hidden-layer weights W_j and biases b_j of the fusion sub-modules]
CrossFusion_X_m = λ · CrossFusion_X_1→0 + (1 − λ) · CrossFusion_X_2→0
wherein W_j and b_j are the hidden-layer network weight and bias of the j-th fusion sub-module, λ_n is the integration weight of each fusion sub-module, and CrossFusion_X_m is the cross-modal fusion feature obtained in step four.
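The exact gating formula is given only as an image in the source, so the sketch below makes an explicit assumption: each fusion sub-module's output is mean-pooled, scored by its own hidden layer (W_j, b_j), and the two scores are normalized with a softmax to give λ and 1 − λ. Module and variable names are illustrative.

import torch
import torch.nn as nn

class WeightedIntegration(nn.Module):
    # Adaptive weighting of the two bimodal fusion outputs (assumed form:
    # per-sub-module linear score followed by softmax normalization).
    def __init__(self, d_model=64):
        super().__init__()
        self.score1 = nn.Linear(d_model, 1)   # W_1, b_1 of fusion sub-module 1
        self.score2 = nn.Linear(d_model, 1)   # W_2, b_2 of fusion sub-module 2

    def forward(self, cross_1_to_0, cross_2_to_0):
        s1 = self.score1(cross_1_to_0.mean(dim=1))                 # (batch, 1)
        s2 = self.score2(cross_2_to_0.mean(dim=1))                 # (batch, 1)
        lam = torch.softmax(torch.cat([s1, s2], dim=-1), dim=-1)   # (batch, 2)
        lam1 = lam[:, 0].view(-1, 1, 1)
        lam2 = lam[:, 1].view(-1, 1, 1)                            # = 1 - lam1
        # CrossFusion_X_m = lambda * CrossFusion_X_{1->0} + (1 - lambda) * CrossFusion_X_{2->0}
        return lam1 * cross_1_to_0 + lam2 * cross_2_to_0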
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the scoring module in step six is configured to take the full-scale multi-dimensional feature as input and, based on a regression network, output the final emotion score; the calculation process is:
Score = W_out Relu(W_s Concat([CrossFusion_X_1 … CrossFusion_X_m]))
wherein Relu is the activation function, W_s is the weight applied to the full-scale multi-dimensional feature, W_out is the hidden-layer parameter of the last fully connected layer, and Concat is the matrix splicing (concatenation) operation.
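A minimal sketch of the scoring module follows; temporal mean-pooling of each multi-dimensional fusion feature before concatenation and the hidden width of 128 are assumptions, since the patent only specifies the concatenation, the ReLU-activated layer W_s and the output layer W_out.

import torch
import torch.nn as nn

class ScoringModule(nn.Module):
    # Score = W_out ReLU(W_s Concat([...])) over the three multi-dimensional
    # fusion features (illustrative sketch).
    def __init__(self, d_model=64, d_hidden=128):
        super().__init__()
        self.w_s = nn.Linear(3 * d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, 1)

    def forward(self, feat_voice, feat_video, feat_text):
        pooled = [f.mean(dim=1) for f in (feat_voice, feat_video, feat_text)]
        full_scale = torch.cat(pooled, dim=-1)   # full-scale multi-dimensional feature
        return self.w_out(torch.relu(self.w_s(full_scale))).squeeze(-1)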
Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects:
(1) In the invention, the Transformer-based multi-head attention mechanism is used to process the pre-processing features of the three modalities, so that, unlike traditional recurrent and convolutional neural networks, the multi-modal data do not need to be aligned in the time dimension in advance. This helps reduce the data annotation cost and suits real production environments better;
(2) In the invention, the Transformer-based autocorrelation feature extraction module and cross-modal fusion module consider both the intra-modal information that helps characterize emotion and the fusion information between modalities, avoiding the model overfitting caused by directly concatenating modal features;
(3) In the invention, when modal fusion is performed with the adaptive weighted fusion algorithm, different adaptive weights are assigned by learning the dependencies between modalities, so the inherent differences between modalities are considered better than in traditional methods.
Drawings
FIG. 1 is a flow chart of a multi-modal emotion analysis method based on a multi-dimensional attention fusion network according to the present invention;
FIG. 2 is a schematic structural diagram of a multidimensional attention fusion network based on an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a cross-mode fusion module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a multi-modal emotion analysis method based on a multi-dimensional attention fusion network; the specific flow is shown in fig. 1. In addition, fig. 2 is a structural schematic diagram of the multi-dimensional attention fusion network in an embodiment of the invention, and fig. 3 is a structural schematic diagram of the cross-modal fusion module in an embodiment of the invention. The implementation steps are as follows:
1. Processing the multi-modal emotion database and aligning the feature dimensions.
The experiments of the invention are based on the MOSEI multi-modal emotion database, which contains 23454 data samples. Each sample contains pre-processing features of the three modalities of voice, video and text: the video pre-processing features are 35-dimensional facial expression unit features extracted with Facet, the voice pre-processing features are 39-dimensional MFCC features extracted with the Kaldi speech recognition toolkit, and the text pre-processing features are 300-dimensional word vector features extracted with word2vec. Each data sample carries an emotion label score in the range [-3, 3], where (0, 3] is positive emotion and [-3, 0) is negative emotion. The defined emotion classes are determined by the chosen interval width; for example, with an interval of 1 the range is divided into [-3, -2, -1, 0, 1, 2, 3], i.e., 7 emotion classes.
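As a small worked example of the labelling scheme above, the following sketch maps a continuous score in [-3, 3] to one of the 7 classes and to a binary polarity; the function name is illustrative.

def score_to_labels(score: float):
    # Clip to the annotation range [-3, 3], round to the nearest integer
    # class, and derive the polarity ((0, 3] positive, [-3, 0) negative).
    clipped = max(-3.0, min(3.0, score))
    seven_class = int(round(clipped))          # one of -3, ..., 3
    if clipped > 0:
        polarity = "positive"
    elif clipped < 0:
        polarity = "negative"
    else:
        polarity = "neutral"
    return seven_class, polarity

print(score_to_labels(1.4))   # -> (1, 'positive')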
Because the three features are not distributed uniformly and have different feature dimensions, the three features are mapped into the same dimension through linear transformation in order to facilitate subsequent cross-modal fusion.
2. Extracting the autocorrelation information of the three modalities.
In this step, Transformer feature extractors are used to extract the autocorrelation information of the three modalities, i.e., the intra-modal information, helpful for emotion recognition, that the Transformer network extracts. The Transformer itself contains two important parts: a multi-head self-attention mechanism and a feedforward network. As shown in FIG. 2, the invention uses the first-level autocorrelation feature extractor, based on the multi-head self-attention mechanism, to extract the important intra-modal information from the pre-processing features and regards the result as the first-level autocorrelation feature; the feedforward network serves as the second-level autocorrelation feature extractor, which performs a nonlinear fit on the first-level autocorrelation feature, and the result is regarded as the second-level autocorrelation feature. Three sets of autocorrelation information, for video, voice and text, are thus obtained.
3. Extracting the multi-dimensional fusion features.
(3-1) Extracting the autocorrelation features.
Based on the autocorrelation information of the three modalities extracted in step 2 and the cross-modal fusion module shown in fig. 3, the first-level autocorrelation feature of one target modality is used as the target feature to be fused, and the second-level autocorrelation features of the other two modalities are used as the auxiliary fusion features. For example, the first-level autocorrelation feature of voice and the second-level autocorrelation features of video and text are sent to the cross-modal fusion module; the first-level autocorrelation feature of video and the second-level autocorrelation features of voice and text are sent to the cross-modal fusion module; the first-level autocorrelation feature of text and the second-level autocorrelation features of voice and video are sent to the cross-modal fusion module. The cross-modal fusion module comprises two bimodal fusion devices and a weighted integration network.
(3-2) Extracting the cross-modal fusion features.
The feature combinations from (3-1) are sent to the cross-modal fusion module shown in fig. 3, and the cross-modal fusion features are calculated based on the query-key-value concept of the attention mechanism in the Transformer. For example, the first-level autocorrelation feature of voice, X_0,main, is linearly mapped to obtain the query vector Q_0,main, and the second-level autocorrelation features of video (X_1,aide) and text (X_2,aide) are linearly mapped to obtain their respective key and value vectors; multi-modal fusion is then performed according to fig. 3 to obtain the video->voice and text->voice cross-modal fusion features respectively. The specific calculation process is as follows:
For X_0,main:  Q_0,main = X_0,main W^Q'
For X_1,aide:  K_1 = X_1,aide W_1^K';  V_1 = X_1,aide W_1^V'
For X_2,aide:  K_2 = X_2,aide W_2^K';  V_2 = X_2,aide W_2^V'
For the group (X_0,main, X_1,aide), the fusion is:
head_i = softmax(Q_0,main K_1^T / sqrt(d_k')) V_1
CrossFusion_X_1→0 = Concat(head_1, ..., head_n)
For the group (X_0,main, X_2,aide), the fusion is:
head_i = softmax(Q_0,main K_2^T / sqrt(d_k')) V_2
CrossFusion_X_2→0 = Concat(head_1, ..., head_n)
wherein X_0,main denotes the self-attention feature of the current target modality, X_1,aide and X_2,aide denote the auxiliary fusion features of the remaining target modalities, and CrossFusion_X_aide→main denotes the fusion result of the target modality main based on the auxiliary modality aide.
The two groups of features are sent to the weighted integration network shown in fig. 2 to obtain the (video, text)->voice cross-modal fusion feature. The specific calculation process is as follows:
[formula image: the integration weight λ is computed adaptively from the two cross-modal fusion features]
CrossFusion_X_m = λ · CrossFusion_X_1→0 + (1 − λ) · CrossFusion_X_2→0
The above process runs in parallel in the three multi-dimensional attention fusion networks shown in fig. 1, finally yielding the (video, text)->voice, (video, voice)->text and (voice, text)->video cross-modal fusion features.
(3-3) Extracting the multi-dimensional features of video, voice and text.
To take both kinds of features into account, in this embodiment the multi-dimensional fusion feature is obtained by fusing the autocorrelation feature with the cross-modal fusion feature. The specific fusion process is as follows:
and adding the secondary auto-correlation characteristics of the voice and the (voice and text) - > video cross-modal fusion characteristics to obtain the video multi-dimensional fusion characteristics.
And adding the two-stage autocorrelation characteristics of the voice and the (video, text) - > voice cross-modal fusion characteristics to obtain the voice multi-dimensional fusion characteristics.
And adding the secondary autocorrelation characteristics of the text and the (video and voice) -text cross-modal fusion characteristics to obtain the text multi-dimensional fusion characteristics.
4. An emotion score is calculated.
As shown in fig. 1, the obtained voice multidimensional fusion features, video multidimensional fusion features and text multidimensional fusion features are spliced, and then regression calculation is performed to obtain specific emotion scores, wherein the calculation process is as follows:
Score = W_out Relu(W_s Concat([CrossFusion_X_1, CrossFusion_X_2, CrossFusion_X_3]))
wherein CrossFusion_X_1, CrossFusion_X_2 and CrossFusion_X_3 respectively denote the video, voice and text multi-dimensional fusion features, and W_out is the hidden-layer parameter of the regression network.
By computing the emotion score of a sample and combining it with the emotion labels of the database samples, the emotion interval into which the score falls is determined, giving the final emotion class.
The effectiveness of the invention is demonstrated by the following experiments; the results show that the invention improves the recognition accuracy of emotion analysis.
The method is compared with 4 existing representative emotion analysis methods on the MOSEI data set. Table 1 shows the 2-class and 7-class accuracy (ACC) and the F1 index of the method and the 4 comparison methods on this data set; larger values indicate higher emotion analysis quality, and the improvement of the method (denoted Our Method in Table 1) is very clear.
Table 1: Performance of the ACC and F1 indices of different methods on the MOSEI data set
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A multi-modal emotion analysis method based on a multi-dimensional attention fusion network is characterized by comprising the following steps:
step one, constructing a multi-modal emotion analysis database, wherein each sample in the database comprises three target modal data of voice, video and text, preprocessing characteristics of the three target modals are extracted in advance, and emotion labeling is carried out on each sample; the emotion is marked as a limited continuous interval, the inside of the interval is continuously and equidistantly divided into M sub-intervals, and each sub-interval represents the degree range of the emotion;
step two, constructing respective multidimensional attention fusion networks for the three target modalities in the step one, wherein the respective multidimensional attention fusion networks for the three target modalities all comprise an autocorrelation feature extraction module and a cross-modal fusion module, the multidimensional attention fusion network for the voice target modality comprises a voice autocorrelation feature extraction module and a voice cross-modal fusion module, the multidimensional attention fusion network for the video target modality comprises a video autocorrelation feature extraction module and a video cross-modal fusion module, and the multidimensional attention fusion network for the text target modality comprises a text autocorrelation feature extraction module and a text cross-modal fusion module; the autocorrelation characteristic extraction module is a Transformer network;
step three, the pre-processing features of the three target modalities in step one are respectively passed through the autocorrelation feature extraction module corresponding to each target modality in step two, and the autocorrelation information of the three modalities, namely the voice autocorrelation information, the text autocorrelation information and the video autocorrelation information, is extracted; the voice autocorrelation feature extraction module, the text autocorrelation feature extraction module and the video autocorrelation feature extraction module each comprise a first-level autocorrelation feature extractor and a second-level autocorrelation feature extractor, and the autocorrelation information of the three target modalities comprises first-level autocorrelation features and second-level autocorrelation features; the first-level autocorrelation feature extractor and the second-level autocorrelation feature extractor are cascaded, wherein the first-level autocorrelation feature extractor is configured to take the pre-processing features of a target modality as input and extract first-level autocorrelation features based on the multi-head self-attention mechanism of the Transformer, and the second-level autocorrelation feature extractor is configured to take the first-level autocorrelation features of the target modality as input and extract second-level autocorrelation features based on the feedforward neural network of the Transformer;
step four, selecting the first-level autocorrelation characteristics of any one target mode in the step three as target characteristics to be fused, and the second-level autocorrelation characteristics of the other two target modes as auxiliary fusion characteristics, and sending the target characteristics to a cross-mode fusion module where the target modes are located according to a preset grouping mode to respectively obtain voice cross-mode fusion characteristics, text cross-mode fusion characteristics and video cross-mode fusion characteristics, wherein the cross-mode fusion module comprises two dual-mode fusion devices and a weighted integration network;
fifthly, adding the cross-modal fusion characteristics of each target mode in the fourth step and the secondary autocorrelation characteristics in the third step to obtain multi-dimensional fusion characteristics;
and step six, splicing the voice, text and video multi-dimensional fusion characteristics obtained in the step five to obtain full-scale multi-dimensional characteristics, and sending the full-scale multi-dimensional characteristics to a scoring module to obtain emotion scores.
2. The multi-modal emotion analysis method based on the multi-dimensional attention fusion network as claimed in claim 1, wherein the Transformer's multi-head self-attention mechanism is calculated by the following formula:
Q_i = X W_i^Q;  K_i = X W_i^K;  V_i = X W_i^V
head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
MultiHead_X_1 = Concat(head_1, ..., head_n)
wherein X is the target-modality pre-processing feature in step two, W_i^Q is the query mapping weight of the i-th head of the target-modality pre-processing feature, W_i^K is the key mapping weight of the i-th head, W_i^V is the value mapping weight of the i-th head, softmax is the weight normalization function, K_i^T is the transpose of K_i, d_k is the scaling factor, n is the number of heads, and MultiHead_X_1 is the obtained first-level autocorrelation feature.
3. The multi-modal emotion analysis method based on multi-dimensional attention fusion network as claimed in claim 1, wherein the feedforward neural network formula is:
MultiHead_X_2 = max(0, MultiHead_X_1 · W_1 + b_1) · W_2 + b_2
wherein MultiHead_X_2 is the second-level autocorrelation feature of the target modality, W_1 is the weight applied to the first-level autocorrelation feature, W_2 is the hidden-layer weight of the network, b_1 is the bias of the first-level autocorrelation feature, and b_2 is the bias of the network hidden layer.
4. The multi-modal emotion analysis method based on the multi-dimensional attention fusion network as claimed in claim 1, wherein the grouping scheme in step four is (X_0,main, X_1,aide), (X_0,main, X_2,aide) for subsequent grouped fusion, wherein X_0,main is the target feature to be fused in step four and X_1,aide, X_2,aide are the auxiliary fusion features of the other two modalities; (X_0,main, X_1,aide) is input into one bimodal fusion device and (X_0,main, X_2,aide) is input into the other bimodal fusion device.
5. The multi-modal emotion analysis method based on the multi-dimensional attention fusion network as claimed in claim 1, wherein the bimodal fusion device in step four performs cross-modal fusion based on the Transformer's multi-head attention mechanism, calculated as follows:
Q_m,main = X_m,main W^Q';  K = X_j,aide W^K';  V = X_j,aide W^V'
head_i = softmax(Q_m,main K^T / sqrt(d_k')) V
CrossFusion_X_aide→main = Concat(head_1, ..., head_n)
wherein X_m,main denotes the self-attention feature of the current target modality m, X_j,aide denotes an auxiliary fusion feature, CrossFusion_X_aide→main denotes the fusion result, W^Q' is the query mapping weight of target modality m, W^K' is the key mapping weight of target modality m, W^V' is the value mapping weight of target modality m, and d_k' is the scaling factor.
CN202010292014.0A 2020-04-14 2020-04-14 Multi-modal emotion analysis method based on multi-dimensional attention fusion network Active CN111680541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010292014.0A CN111680541B (en) 2020-04-14 2020-04-14 Multi-modal emotion analysis method based on multi-dimensional attention fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010292014.0A CN111680541B (en) 2020-04-14 2020-04-14 Multi-modal emotion analysis method based on multi-dimensional attention fusion network

Publications (2)

Publication Number Publication Date
CN111680541A CN111680541A (en) 2020-09-18
CN111680541B true CN111680541B (en) 2022-06-21

Family

ID=72433356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010292014.0A Active CN111680541B (en) 2020-04-14 2020-04-14 Multi-modal emotion analysis method based on multi-dimensional attention fusion network

Country Status (1)

Country Link
CN (1) CN111680541B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053690B (en) * 2020-09-22 2023-12-29 湖南大学 Cross-mode multi-feature fusion audio/video voice recognition method and system
CN112233698B (en) * 2020-10-09 2023-07-25 中国平安人寿保险股份有限公司 Character emotion recognition method, device, terminal equipment and storage medium
CN112489635B (en) * 2020-12-03 2022-11-11 杭州电子科技大学 Multi-mode emotion recognition method based on attention enhancement mechanism
CN112765323B (en) * 2021-01-24 2021-08-17 中国电子科技集团公司第十五研究所 Voice emotion recognition method based on multi-mode feature extraction and fusion
CN112819052B (en) * 2021-01-25 2021-12-24 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal fine-grained mixing method, system, device and storage medium
CN112560811B (en) * 2021-02-19 2021-07-02 中国科学院自动化研究所 End-to-end automatic detection research method for audio-video depression
CN112989977B (en) * 2021-03-03 2022-09-06 复旦大学 Audio-visual event positioning method and device based on cross-modal attention mechanism
CN113723166A (en) * 2021-03-26 2021-11-30 腾讯科技(北京)有限公司 Content identification method and device, computer equipment and storage medium
CN113807440B (en) * 2021-09-17 2022-08-26 北京百度网讯科技有限公司 Method, apparatus, and medium for processing multimodal data using neural networks
CN113806609B (en) * 2021-09-26 2022-07-12 郑州轻工业大学 Multi-modal emotion analysis method based on MIT and FSM
CN113723112B (en) * 2021-11-02 2022-02-22 天津海翼科技有限公司 Multi-modal emotion analysis prediction method, device, equipment and storage medium
CN114387997B (en) * 2022-01-21 2024-03-29 合肥工业大学 Voice emotion recognition method based on deep learning
CN116580257A (en) * 2022-01-24 2023-08-11 腾讯科技(深圳)有限公司 Feature fusion model training and sample retrieval method and device and computer equipment
CN114821385A (en) * 2022-03-08 2022-07-29 阿里巴巴(中国)有限公司 Multimedia information processing method, device, equipment and storage medium
CN115205179A (en) * 2022-07-15 2022-10-18 小米汽车科技有限公司 Image fusion method and device, vehicle and storage medium
CN116070169A (en) * 2023-01-28 2023-05-05 天翼云科技有限公司 Model training method and device, electronic equipment and storage medium
CN116189272B (en) * 2023-05-05 2023-07-07 南京邮电大学 Facial expression recognition method and system based on feature fusion and attention mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614895A (en) * 2018-10-29 2019-04-12 山东大学 A method of the multi-modal emotion recognition based on attention Fusion Features
CN110033029A (en) * 2019-03-22 2019-07-19 五邑大学 A kind of emotion identification method and device based on multi-modal emotion model
CN110188343A (en) * 2019-04-22 2019-08-30 浙江工业大学 Multi-modal emotion identification method based on fusion attention network
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 The multi-modal sensibility classification method merged based on text, voice and video
CN110399841A (en) * 2019-07-26 2019-11-01 北京达佳互联信息技术有限公司 A kind of video classification methods, device and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170001490A (en) * 2015-06-26 2017-01-04 삼성전자주식회사 The electronic apparatus and method for controlling function in the electronic apparatus using the bio-metric sensor
WO2019103484A1 (en) * 2017-11-24 2019-05-31 주식회사 제네시스랩 Multi-modal emotion recognition device, method and storage medium using artificial intelligence

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614895A (en) * 2018-10-29 2019-04-12 山东大学 A method of the multi-modal emotion recognition based on attention Fusion Features
CN110033029A (en) * 2019-03-22 2019-07-19 五邑大学 A kind of emotion identification method and device based on multi-modal emotion model
CN110188343A (en) * 2019-04-22 2019-08-30 浙江工业大学 Multi-modal emotion identification method based on fusion attention network
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 The multi-modal sensibility classification method merged based on text, voice and video
CN110399841A (en) * 2019-07-26 2019-11-01 北京达佳互联信息技术有限公司 A kind of video classification methods, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multimodal continuous emotion recognition with data augmentation using recurrent neural networks;Jian Huang等;《ACM》;20181231;第57-64页 *

Also Published As

Publication number Publication date
CN111680541A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN111680541B (en) Multi-modal emotion analysis method based on multi-dimensional attention fusion network
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111625641B (en) Dialog intention recognition method and system based on multi-dimensional semantic interaction representation model
Hazarika et al. Self-attentive feature-level fusion for multimodal emotion detection
CN111506732B (en) Text multi-level label classification method
CN116450796B (en) Intelligent question-answering model construction method and device
CN113268609B (en) Knowledge graph-based dialogue content recommendation method, device, equipment and medium
CN112562669B (en) Method and system for automatically abstracting intelligent digital newspaper and performing voice interaction chat
CN112287093B (en) Automatic question-answering system based on semi-supervised learning and Text-to-SQL model
CN115292461B (en) Man-machine interaction learning method and system based on voice recognition
CN110569869A (en) feature level fusion method for multi-modal emotion detection
CN116303977B (en) Question-answering method and system based on feature classification
CN114417097A (en) Emotion prediction method and system based on time convolution and self-attention
Khare et al. Multi-modal embeddings using multi-task learning for emotion recognition
CN113705315A (en) Video processing method, device, equipment and storage medium
CN115563290A (en) Intelligent emotion recognition method based on context modeling
CN116189039A (en) Multi-modal emotion classification method and system for modal sequence perception with global audio feature enhancement
CN111984780A (en) Multi-intention recognition model training method, multi-intention recognition method and related device
Zhao et al. Knowledge-aware bayesian co-attention for multimodal emotion recognition
Sun et al. Multi-classification speech emotion recognition based on two-stage bottleneck features selection and MCJD algorithm
CN116702091B (en) Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
CN113569553A (en) Sentence similarity judgment method based on improved Adaboost algorithm
KR102297480B1 (en) System and method for structured-paraphrasing the unstructured query or request sentence
CN112417125A (en) Open domain dialogue reply method and system based on deep reinforcement learning
CN115329776B (en) Semantic analysis method for network security co-processing based on less-sample learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant