CN111680541B - Multi-modal emotion analysis method based on multi-dimensional attention fusion network - Google Patents
- Publication number
- CN111680541B (application CN202010292014.0A)
- Authority
- CN
- China
- Prior art keywords
- fusion
- autocorrelation
- target
- modal
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The invention discloses a multi-modal emotion analysis method based on a multi-dimensional attention fusion network, comprising the following steps: for sample data containing multiple modalities (speech, video and text), extract speech, video and text preprocessing features; construct a multi-dimensional attention fusion network for each modality, extract first-level and second-level autocorrelation features with the autocorrelation feature extraction module in the network, and, combining the autocorrelation information of the three modalities, obtain the cross-modal fusion features of the three modalities with the cross-modal fusion module in the network; combine the second-level autocorrelation features and the cross-modal fusion features to obtain the modal multi-dimensional features; finally, splice the modal multi-dimensional features, determine the emotion score and perform emotion analysis. The method can effectively perform feature fusion in non-aligned multi-modal data scenarios and makes full use of multi-modal association information for emotion analysis.
Description
Technical Field
The invention belongs to the field of multi-modal emotion calculation, and particularly relates to a multi-modal emotion analysis method based on a multi-dimensional attention fusion network.
Background
Emotion analysis has numerous applications in daily life. With the development of big data and multimedia technology, the speech, video and text modalities of data can be analysed with multi-modal emotion analysis technology to better mine the meaning behind the data. In a return-visit survey, for example, a user's satisfaction with a service or product can be learned through comprehensive analysis of the user's voice, face and speech content.
At present, the difficulty of multi-modal emotion analysis lies in how to effectively fuse multi-modal information, since the acquisition modes of speech, video and text features are completely different. When describing the same content, the sequence lengths of the speech and video modalities differ greatly from that of the text in the time dimension, and the features of the three modalities are not in one-to-one correspondence in time, which makes fusion between the modalities very difficult.
At present there are two common methods. One is based on modality integration: a data layer, feature layer or decision layer is chosen in the overall emotion analysis system at which to splice intermediate results, and emotion prediction is then carried out. This method merely integrates the results of the three modalities, does not consider the correlation information between them, and the resulting information redundancy easily causes model overfitting. The other method is based on alignment of modality annotations: during data annotation, the three modalities are forcibly aligned in the time dimension by character or phoneme, guaranteeing their temporal correspondence, and modality fusion is then performed with a recurrent neural network, convolutional neural network, attention mechanism or Seq2Seq framework; the annotation cost of this approach is high, which is unfavourable for real production and living environments.
Disclosure of Invention
The invention aims to provide a multi-modal emotion analysis method based on a multi-dimensional attention fusion network, which can effectively avoid the problems of overfitting caused by an integration method and overlarge labeling cost caused by modal-based labeling alignment, and fully utilizes multi-dimensional information in and among the modes to obtain more accurate and reliable emotion analysis results.
The invention adopts the following steps to solve the technical problem:
step one, a multi-modal emotion analysis database is established, the size of the database is N, each sample in the database contains three target modal data of voice, video and text, preprocessing characteristics of the three target modalities are extracted in advance, and emotion labeling is carried out on each sample.
And step two, constructing respective multidimensional attention fusion networks by using the three target modes in the step one.
The multidimensional attention fusion network of each of the three target modalities comprises an autocorrelation feature extraction module and a cross-modal fusion module, both of which are built from Transformer networks.
And step three, the preprocessed features in the step one are respectively processed by the autocorrelation feature extraction module in the step two, and autocorrelation information of three modes, namely voice autocorrelation information, text autocorrelation information and video autocorrelation information, is extracted.
The voice autocorrelation feature extraction module, the text autocorrelation feature extraction module and the video autocorrelation feature extraction module respectively comprise a primary autocorrelation feature extractor and a secondary autocorrelation feature extractor.
The voice autocorrelation feature extraction module is configured to extract autocorrelation information of the input voice pre-processing features.
The text autocorrelation feature extraction module is configured to extract autocorrelation information of input text pre-processing features.
The video autocorrelation feature extraction module is configured to extract autocorrelation information of input video pre-processing features.
The autocorrelation information of the three modes comprises a first-level autocorrelation characteristic and a second-level autocorrelation characteristic.
And step four, selecting the first-level autocorrelation features of any one target modality in step three as the target features to be fused, and the second-level autocorrelation features of the other two target modalities as auxiliary fusion features, and sending them, according to a certain grouping mode, to the cross-modal fusion module of the target modality, to respectively obtain the speech-based, text-based and video-based cross-modal fusion features.
The cross-modal fusion module comprises two bimodal fusion devices and a weighted integration network.
And step five, adding the cross-modal fusion characteristics of each target mode and the secondary autocorrelation characteristics of the step three to obtain multi-dimensional fusion characteristics.
And step six, splicing the voice, text and video multi-dimensional fusion characteristics obtained in the step five to obtain full-scale multi-dimensional characteristics, and sending the full-scale multi-dimensional characteristics to a scoring module to obtain emotion scores.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the preprocessing feature extraction of the three target modalities in step one is as follows: speech preprocessing features are obtained by extracting MFCC features from the speech using the Kaldi speech recognition toolkit, video preprocessing features are obtained by extracting facial expression unit features with Facet, and text preprocessing features are obtained by extracting word vector features with word2vec.
And respectively carrying out feature dimension alignment on the voice preprocessing feature, the video preprocessing feature and the text preprocessing feature through linear transformation.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the emotion in the first step is marked as a limited continuous interval, the interior of the interval can be continuously and equidistantly divided into M sub-intervals, and each sub-interval represents the emotion degree range.
Wherein the interval can be taken as the integer interval [-K, K], where scores greater than 0 are judged positive, equal to 0 neutral, and less than 0 negative; the specific sub-intervals can be further divided according to the required emotion granularity.
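The mapping from a continuous score in [-K, K] to a discrete emotion class can be sketched as follows; the helper names and the K=3, step=1 defaults are illustrative assumptions, not part of the claims.

```python
def emotion_class(score: float, k: int = 3, step: float = 1.0) -> int:
    """Map a continuous emotion score in [-k, k] to a discrete class index.

    With k=3 and step=1 there are 7 classes, -3..3, mirroring the [-K, K]
    sub-interval scheme described above (hypothetical helper).
    """
    score = max(-k, min(k, score))  # clamp to the labelled interval
    return round(score / step)

def polarity(score: float) -> str:
    """>0 positive, ==0 neutral, <0 negative, as defined in step one."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```
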
The further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network is characterized in that the multi-dimensional attention fusion network in the step two has the same structure aiming at the three target modes of voice, video and text.
The further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network is characterized in that the self-correlation feature extraction module in the step two is a Transformer network.
The further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network is characterized in that the first-level autocorrelation feature extractor and the second-level autocorrelation feature extractor in step three are connected in a cascading mode.
The first-level autocorrelation feature extractor is configured to extract first-level autocorrelation features from the input target-modality preprocessing features based on the Transformer multi-head self-attention mechanism.
The calculation formula of the multi-head self-attention mechanism is as follows:
Q_i = X·W_i^Q; K_i = X·W_i^K; V_i = X·W_i^V
head_i = softmax(Q_i·K_i^T/√d_k)·V_i
MultiHead_X1 = Concat(head_1, …, head_n)
wherein X is the target-modality preprocessing feature from step two, W_i^Q, W_i^K and W_i^V are respectively the query, key and value mapping weights of the i-th head, softmax is the weight normalization function, K_i^T is the transposed matrix of K_i, d_k is the scaling factor, n is the number of heads, and MultiHead_X1 is the resulting first-level autocorrelation feature.
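The multi-head self-attention computation can be sketched in a few lines of numpy; the toy dimensions and random weights below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, n_heads):
    """First-level autocorrelation feature extractor sketch.

    X:  (T, d) preprocessed modality sequence
    Wq, Wk, Wv: lists of n_heads per-head projections, each (d, d_k)
    Returns Concat(head_1..head_n), shape (T, n_heads * d_k).
    """
    heads = []
    for i in range(n_heads):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)               # (T, T) attention logits
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
        heads.append(attn @ V)                        # (T, d_k)
    return np.concatenate(heads, axis=-1)

# toy example (sizes are illustrative)
rng = np.random.default_rng(0)
T, d, d_k, n = 4, 6, 3, 2
X = rng.standard_normal((T, d))
Wq = [rng.standard_normal((d, d_k)) for _ in range(n)]
Wk = [rng.standard_normal((d, d_k)) for _ in range(n)]
Wv = [rng.standard_normal((d, d_k)) for _ in range(n)]
out = multi_head_self_attention(X, Wq, Wk, Wv, n)     # (4, 6)
```
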
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the second-level autocorrelation feature extractor in step three is configured to extract second-level autocorrelation features from the first-level autocorrelation features of the input target modality based on a feedforward neural network.
Wherein the feed-forward neural network is configured to:
MultiHead_X2 = max(0, MultiHead_X1·W1 + b1)·W2 + b2
wherein MultiHead_X2 is the second-level autocorrelation feature of the target modality, W1 and b1 are the weight and bias applied to the first-level autocorrelation feature, and W2 and b2 are the weight and bias of the network hidden layer.
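A minimal sketch of this feedforward second-level extractor; the hand-picked toy weights are illustrative, not from the patent.

```python
import numpy as np

def second_level_features(multihead_x1, W1, b1, W2, b2):
    """Second-level autocorrelation extractor:
    MultiHead_X2 = max(0, MultiHead_X1·W1 + b1)·W2 + b2."""
    hidden = np.maximum(0.0, multihead_x1 @ W1 + b1)  # ReLU hidden layer
    return hidden @ W2 + b2

# toy check with identity weights: ReLU([1, -1]) = [1, 0], then + [0.5, 0.5]
x1 = np.array([[1.0, -1.0]])
I = np.eye(2)
out = second_level_features(x1, I, np.zeros(2), I, np.array([0.5, 0.5]))
```
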
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the grouping mode in step four is (X_0,main, X_1,aide) and (X_0,main, X_2,aide) for subsequent group fusion, wherein X_0,main is the target feature to be fused in step four, and X_1,aide, X_2,aide are the auxiliary fusion features of the other two modalities.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the dual-modal fusion device in the fourth step is used for the grouping fusion and is configured to input the grouping to obtain two groups of cross-modal fusion characteristics.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the bimodal fusion device in step four performs cross-modal fusion based on the Transformer multi-head attention mechanism, calculated as follows:
Q_main = X_m,main·W^Q'; K_j = X_j,aide·W_j^K'; V_j = X_j,aide·W_j^V'
head_i = softmax(Q_main·K_j^T/√d_k')·V_j
CrossFusion_X_aide→main = Concat(head_1, …, head_n)
wherein X_m,main is the self-attention feature of the current target modality m, X_j,aide is an auxiliary fusion feature, CrossFusion_X_aide→main is the fusion result, W^Q' is the query mapping weight of the target modality m, W_j^K' and W_j^V' are the key and value mapping weights of auxiliary modality j, and d_k' is the scaling factor.
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the cross-modal fusion of the dual-modal fusion device based on the multi-head attention mechanism is carried out, and the specific fusion process is as follows:
(1) performing query mapping on the target feature X_0,main to be fused in step four. The mapping method is as follows:
Q_0,main = X_0,main·W^Q'
wherein W^Q' is the learned query mapping weight.
(2) performing key and value mapping on the auxiliary fusion features X_1,aide and X_2,aide described in step four.
The mapping method is as follows:
K_1 = X_1,aide·W_1^K'; V_1 = X_1,aide·W_1^V'
K_2 = X_2,aide·W_2^K'; V_2 = X_2,aide·W_2^V'
wherein W_1^K' and W_1^V' are the key and value mapping weights of auxiliary modality 1, W_2^K' and W_2^V' are the key and value mapping weights of auxiliary modality 2, K_1 and V_1 are the key and value features mapped from X_1,aide, and K_2 and V_2 are the key and value features mapped from X_2,aide.
(3) performing cross-modal fusion based on the multi-head attention mechanism using the mapping results:
for the group (X_0,main, X_1,aide), the fusion is:
head_i = softmax(Q_0,main·K_1^T/√d_k')·V_1
CrossFusion_X_1→0 = Concat(head_1, …, head_n)
for the group (X_0,main, X_2,aide), the fusion is:
head_i = softmax(Q_0,main·K_2^T/√d_k')·V_2
CrossFusion_X_2→0 = Concat(head_1, …, head_n)
wherein X_0,main represents the self-attention feature of the current target modality, X_1,aide and X_2,aide represent the auxiliary fusion features of the remaining target modalities, and CrossFusion_X_aide→main represents the fusion result of the target modality based on the auxiliary modality.
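The bimodal fusion step — queries from the target modality, keys and values from an auxiliary modality — can be sketched with a single attention head (a simplification of the multi-head formulation); all dimensions and weights below are illustrative assumptions. Note that the two modalities may have different sequence lengths, which is why no prior time alignment is needed.

```python
import numpy as np

def cross_modal_attention(X_main, X_aide, Wq, Wk, Wv):
    """Single-head bimodal fusion sketch.

    X_main: (T_main, d) first-level autocorrelation features (target)
    X_aide: (T_aide, d) second-level autocorrelation features (auxiliary)
    Returns a (T_main, d_k) fusion result: the output keeps the target
    modality's sequence length even when T_main != T_aide.
    """
    Q = X_main @ Wq                                  # (T_main, d_k)
    K = X_aide @ Wk                                  # (T_aide, d_k)
    V = X_aide @ Wv                                  # (T_aide, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T_main, T_aide)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over aide steps
    return attn @ V

# toy example: 3 speech frames attend over 5 video frames
rng = np.random.default_rng(1)
d, d_k = 8, 4
X_speech = rng.standard_normal((3, d))
X_video = rng.standard_normal((5, d))
fused = cross_modal_attention(X_speech, X_video,
                              rng.standard_normal((d, d_k)),
                              rng.standard_normal((d, d_k)),
                              rng.standard_normal((d, d_k)))
```
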
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the weighted integration network in the fourth step is based on an adaptive weighted fusion algorithm and is configured to input the two sets of cross-modal fusion features to extract cross-modal fusion features of the target.
The formula of the adaptive weighted fusion algorithm is as follows:
CrossFusion_X_m = λ·CrossFusion_X_1→0 + (1−λ)·CrossFusion_X_2→0
wherein W_j and b_j are the hidden-layer network weight and bias of the j-th fusion submodule, λ is the adaptive integration weight over the fusion submodules, and CrossFusion_X_m is the cross-modal fusion feature obtained in step four.
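The weighted integration can be sketched as below. The patent specifies the weighted sum CrossFusion_X_m = λ·CrossFusion_X_1→0 + (1−λ)·CrossFusion_X_2→0; the particular gating network used here to produce λ (mean-pooled concatenation through a sigmoid) is an assumption for illustration only.

```python
import numpy as np

def weighted_integration(f1, f2, W, b):
    """Adaptive weighted integration sketch: a learned scalar gate lambda
    blends the two bimodal fusion results f1 and f2 (both (T, d)).
    The gate parametrisation is an illustrative assumption."""
    pooled = np.concatenate([f1.mean(axis=0), f2.mean(axis=0)])  # (2d,)
    lam = 1.0 / (1.0 + np.exp(-(pooled @ W + b)))  # sigmoid gate in (0, 1)
    return lam * f1 + (1.0 - lam) * f2

# with zero gate weights, lambda = 0.5, so the result is the plain average
f1 = np.ones((2, 3))
f2 = np.zeros((2, 3))
out = weighted_integration(f1, f2, np.zeros(6), 0.0)
```
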
As a further optimization scheme of the multi-modal emotion analysis method based on the multi-dimensional attention fusion network, the scoring module in the sixth step is configured to input the full-scale multi-dimensional features to obtain a final emotion score based on a regression network, and the calculation process is as follows:
Score = W_out·Relu(W_s·Concat([CrossFusion_X_1, …, CrossFusion_X_m]))
wherein Relu is the activation function, W_s is the weight applied to the full-scale multi-dimensional feature, W_out is the hidden-layer parameter of the last fully connected layer, and Concat is the matrix splicing operation.
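A minimal sketch of the scoring module; the toy dimensions and hand-picked weights are illustrative so the result can be checked by hand.

```python
import numpy as np

def emotion_score(features, Ws, Wout):
    """Scoring-module sketch: concatenate per-modality multi-dimensional
    features, apply one ReLU hidden layer, then a linear output producing
    a scalar regression score."""
    x = np.concatenate(features)      # full-scale multi-dimensional feature
    hidden = np.maximum(0.0, Ws @ x)  # ReLU hidden layer
    return float(Wout @ hidden)       # scalar emotion score

# toy example: x = [1, 2, 3]; Ws·x = [1, -1]; ReLU -> [1, 0]; Wout· -> 2.0
score = emotion_score(
    [np.array([1.0, 2.0]), np.array([3.0])],
    np.array([[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]),
    np.array([2.0, 5.0]),
)
```
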
Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects:
(1) In the invention, the Transformer-based multi-head attention mechanism is used to process the preprocessing features of the three modalities; compared with traditional recurrent and convolutional neural networks, the multi-modal data does not need to be aligned in the time dimension in advance, which helps reduce data annotation cost and suits real production environments;
(2) in the invention, the Transformer-based autocorrelation feature extraction module and cross-modal fusion module consider both the intra-modal information useful for depicting emotion and the fusion information between modalities, avoiding the model overfitting caused by directly concatenating modality features;
(3) in the invention, when modality fusion is performed with the adaptive weighted fusion algorithm, different adaptive weights are assigned during fusion by learning the dependency between modalities, which accounts for the inherent differences between modalities better than traditional methods.
Drawings
FIG. 1 is a flow chart of a multi-modal emotion analysis method based on a multi-dimensional attention fusion network according to the present invention;
FIG. 2 is a schematic structural diagram of a multidimensional attention fusion network based on an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a cross-mode fusion module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a multi-modal emotion analysis method based on a multi-dimensional attention fusion network, and the specific flow is as shown in fig. 1, in addition, fig. 2 is a structural schematic diagram of the multi-dimensional attention fusion network in the embodiment of the invention, and fig. 3 is a structural schematic diagram of a cross-modal fusion module in the embodiment of the invention. The method comprises the following implementation steps:
1. and processing the multi-mode emotion database, and aligning the feature dimensions.
The experiments of the invention are based on the MOSEI multi-modal emotion database, which comprises 23454 data samples. Each data sample comprises preprocessing features of the three modalities of speech, video and text, wherein the video preprocessing features are 35-dimensional facial expression unit features extracted with Facet, the speech preprocessing features are 39-dimensional MFCC features extracted with the Kaldi speech recognition toolkit, and the text preprocessing features are 300-dimensional word vector features extracted with word2vec. Each data sample includes an emotion label score in the range [-3, 3], where (0, 3] is positive emotion and [-3, 0) is negative emotion. The emotion classes are determined by the chosen interval size; for example, with an interval of 1 the range is divided into [-3, -2, -1, 0, 1, 2, 3], i.e. 7 emotion classes.
Because the three features are not distributed uniformly and have different feature dimensions, the three features are mapped into the same dimension through linear transformation in order to facilitate subsequent cross-modal fusion.
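The linear dimension alignment can be sketched as follows; the 35/39/300 input dimensions follow the embodiment, while d_model = 64 and the random projection weights are illustrative choices, not values specified by the patent.

```python
import numpy as np

def align_dims(feat, W, b):
    """Project one modality's preprocessed features (T, d_in) to the shared
    model dimension (T, d_model) with a learned linear transformation."""
    return feat @ W + b

rng = np.random.default_rng(0)
d_model = 64  # illustrative shared dimension
dims = {"video": 35, "audio": 39, "text": 300}  # per-modality feature sizes
proj = {m: (rng.standard_normal((d, d_model)) * 0.02, np.zeros(d_model))
        for m, d in dims.items()}

T = 12  # example sequence length (modality lengths may differ in practice)
aligned = {m: align_dims(rng.standard_normal((T, d)), *proj[m])
           for m, d in dims.items()}
```
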
2. And extracting autocorrelation information of three modes.
This step uses Transformer feature extractors to obtain the autocorrelation information of the three modalities, i.e. the intra-modal information that is helpful for emotion recognition. The Transformer contains two important parts: a multi-head self-attention mechanism and a feedforward network. As shown in fig. 2, the invention uses a first-level autocorrelation feature extractor based on the multi-head self-attention mechanism to extract important intra-modal information from the preprocessed features, regarding the result as the first-level autocorrelation features, and uses the feedforward network as a second-level autocorrelation feature extractor to perform nonlinear fitting on the first-level autocorrelation features, regarding the result as the second-level autocorrelation features. Three sets of autocorrelation information of video, speech and text are thus obtained.
3. And extracting multi-dimensional fusion features.
And (3-1) extracting the autocorrelation characteristics.
Based on the autocorrelation information of the three modalities extracted in step 2 and the cross-modal fusion module shown in fig. 3, the first-level autocorrelation features of one target modality are used as the target features to be fused, and the second-level autocorrelation features of the other two modalities are used as auxiliary fusion features. For example, the first-level autocorrelation features of speech together with the second-level autocorrelation features of video and text are sent to the cross-modal fusion module; the first-level autocorrelation features of video together with the second-level autocorrelation features of speech and text are sent to the cross-modal fusion module; and the first-level autocorrelation features of text together with the second-level autocorrelation features of speech and video are sent to the cross-modal fusion module. The cross-modal fusion module comprises two bimodal fusion devices and a weighted integration network.
And (3-2) extracting cross-modal fusion features.
The feature combinations in (3-1) are sent to the cross-modal fusion module shown in fig. 3, and the cross-modal fusion features are then calculated based on the query/key/value concept of the attention mechanism in the Transformer. For example, the first-level autocorrelation feature X_0,main of speech is linearly mapped to obtain the query vector Q_0,main, and the second-level autocorrelation features of video (X_1,aide) and text (X_2,aide) are linearly mapped to obtain their respective key and value vectors; multi-modal fusion is then carried out according to fig. 3 to respectively obtain the video->speech and text->speech cross-modal fusion features. The specific calculation process is as follows:
For X_0,main: Q_0,main = X_0,main·W^Q'
For the group (X_0,main, X_1,aide), the fusion is:
head_i = softmax(Q_0,main·K_1^T/√d_k')·V_1
CrossFusion_X_1→0 = Concat(head_1, …, head_n)
For the group (X_0,main, X_2,aide), the fusion is:
head_i = softmax(Q_0,main·K_2^T/√d_k')·V_2
CrossFusion_X_2→0 = Concat(head_1, …, head_n)
wherein X_0,main represents the self-attention feature of the current target modality, X_1,aide and X_2,aide represent the auxiliary fusion features of the remaining target modalities, and CrossFusion_X_aide→main represents the fusion result of the target modality main based on the auxiliary modality aide.
The two groups of features are sent to the weighted integration network shown in fig. 2 to obtain the (video, text)->speech cross-modal fusion feature. The specific calculation process is as follows:
CrossFusion_X_m = λ·CrossFusion_X_1→0 + (1−λ)·CrossFusion_X_2→0
The above processes are performed simultaneously in the three multidimensional attention fusion networks shown in fig. 1, finally obtaining the (video, text)->speech, (video, speech)->text and (speech, text)->video cross-modal fusion features.
And (3-3) extracting multi-dimensional characteristics of videos, voices and texts.
In order to take the characteristics of two features into consideration, in this embodiment, a multi-dimensional fusion feature is obtained by fusing an autocorrelation feature and a cross-modal feature, and the specific fusion process is as follows:
and adding the secondary auto-correlation characteristics of the voice and the (voice and text) - > video cross-modal fusion characteristics to obtain the video multi-dimensional fusion characteristics.
And adding the two-stage autocorrelation characteristics of the voice and the (video, text) - > voice cross-modal fusion characteristics to obtain the voice multi-dimensional fusion characteristics.
And adding the secondary autocorrelation characteristics of the text and the (video and voice) -text cross-modal fusion characteristics to obtain the text multi-dimensional fusion characteristics.
4. An emotion score is calculated.
As shown in fig. 1, the obtained voice multidimensional fusion features, video multidimensional fusion features and text multidimensional fusion features are spliced, and then regression calculation is performed to obtain specific emotion scores, wherein the calculation process is as follows:
Score = W_out·Relu(W_s·Concat([CrossFusion_X_1, CrossFusion_X_2, CrossFusion_X_3]))
wherein CrossFusion_X_1, CrossFusion_X_2 and CrossFusion_X_3 respectively represent the video, speech and text multi-dimensional fusion features, and W_out is the hidden-layer parameter of the regression network.
And determining the emotion interval in which the score specifically falls by calculating the emotion score of the sample and combining the emotion label of the database sample to obtain the final emotion grade.
The effectiveness of the invention is proved by the following experimental examples, and the experimental results prove that the invention can improve the recognition accuracy of emotion analysis.
The method is compared with 4 existing representative emotion analysis methods on the MOSEI data set. Table 1 reports the 2-class and 7-class accuracy (ACC) and the F1 index of the method and the 4 comparison methods on the data set; larger values indicate higher emotion analysis quality, and the improvement of the method (denoted "Our Method" in Table 1) is very obvious.
TABLE 1: Performance of ACC and F1 indices on the MOSEI data set for different methods
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (5)
1. A multi-modal emotion analysis method based on a multi-dimensional attention fusion network is characterized by comprising the following steps:
step one, constructing a multi-modal emotion analysis database, wherein each sample in the database comprises three target modal data of voice, video and text, preprocessing characteristics of the three target modals are extracted in advance, and emotion labeling is carried out on each sample; the emotion is marked as a limited continuous interval, the inside of the interval is continuously and equidistantly divided into M sub-intervals, and each sub-interval represents the degree range of the emotion;
step two, constructing respective multidimensional attention fusion networks for the three target modalities in the step one, wherein the respective multidimensional attention fusion networks for the three target modalities all comprise an autocorrelation feature extraction module and a cross-modal fusion module, the multidimensional attention fusion network for the voice target modality comprises a voice autocorrelation feature extraction module and a voice cross-modal fusion module, the multidimensional attention fusion network for the video target modality comprises a video autocorrelation feature extraction module and a video cross-modal fusion module, and the multidimensional attention fusion network for the text target modality comprises a text autocorrelation feature extraction module and a text cross-modal fusion module; the autocorrelation characteristic extraction module is a Transformer network;
step three, the preprocessing characteristics of the three target modes in the step one are respectively passed through the autocorrelation feature extraction module corresponding to each target modality in the step two, and the autocorrelation information of the three modes, namely the voice autocorrelation information, the text autocorrelation information and the video autocorrelation information, is extracted; the voice autocorrelation feature extraction module, the text autocorrelation feature extraction module and the video autocorrelation feature extraction module respectively comprise a primary autocorrelation feature extractor and a secondary autocorrelation feature extractor, and the autocorrelation information of the three target modes comprises primary autocorrelation features and secondary autocorrelation features; the first-stage autocorrelation feature extractor and the second-stage autocorrelation feature extractor adopt a cascading mode, wherein the first-stage autocorrelation feature extractor is configured to input preprocessing features of a target modality to extract first-stage autocorrelation features based on a multi-head self-attention mechanism of a Transformer, and the second-stage autocorrelation feature extractor is configured to input the first-stage autocorrelation features of the target modality to extract second-stage autocorrelation features based on a feedforward neural network of the Transformer;
step four, selecting the first-level autocorrelation features of any one target modality in step three as the target feature to be fused and the second-level autocorrelation features of the other two target modalities as auxiliary fusion features, and sending them to the cross-modal fusion module of that target modality according to a preset grouping manner to respectively obtain voice cross-modal fusion features, text cross-modal fusion features and video cross-modal fusion features, wherein the cross-modal fusion module comprises two bimodal fusers and a weighted integration network;
step five, adding the cross-modal fusion features of each target modality in step four to the second-level autocorrelation features in step three to obtain multi-dimensional fusion features;
step six, concatenating the voice, text and video multi-dimensional fusion features obtained in step five to obtain full-scale multi-dimensional features, and sending the full-scale multi-dimensional features to a scoring module to obtain an emotion score.
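Steps five and six above can be sketched as follows. This is a minimal NumPy illustration only: the feature width `d`, the random placeholder features, and the linear scoring head are assumptions for the sake of the example, not details specified by the claim.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # illustrative per-modality feature width (assumed, not from the claim)

# hypothetical per-modality features from steps three and four
cross_modal = {m: rng.normal(size=d) for m in ("voice", "text", "video")}
second_level = {m: rng.normal(size=d) for m in ("voice", "text", "video")}

# step five: element-wise addition -> multi-dimensional fusion features
fused = {m: cross_modal[m] + second_level[m] for m in cross_modal}

# step six: concatenation -> full-scale multi-dimensional features,
# followed by a scoring module (here assumed to be a single linear layer)
full_scale = np.concatenate([fused[m] for m in ("voice", "text", "video")])
W_score, b_score = rng.normal(size=(3 * d,)), 0.0
emotion_score = float(full_scale @ W_score + b_score)
print(full_scale.shape)  # (24,)
```

The residual-style addition in step five preserves each modality's own second-level features alongside the cross-modal information before the three streams are concatenated.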
2. The multi-modal emotion analysis method based on multi-dimensional attention fusion network as claimed in claim 1, wherein the Transformer's multi-head self-attention mechanism is calculated by the following formulas:
Q_i = X·W_i^Q;  K_i = X·W_i^K;  V_i = X·W_i^V

head_i = softmax(Q_i·K_i^T / √d_k)·V_i

MultiHead_X1 = Concat(head_1, ..., head_n)

wherein X is the target-modality preprocessing feature in step two, W_i^Q is the query mapping weight of the i-th head of the target-modality preprocessing feature, W_i^K is the key mapping weight of the i-th head, W_i^V is the value mapping weight of the i-th head, softmax is the weight normalization function, K_i^T is the transpose of K_i, d_k is the scaling factor, n is the number of heads, and MultiHead_X1 is the resulting first-level autocorrelation feature.
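The multi-head self-attention of claim 2 can be illustrated with a minimal NumPy sketch. The head count, sequence length, and feature widths below are toy assumptions; a loop over heads is used for clarity rather than the batched matrix form a production implementation would use.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax (weight normalization function)
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv):
    """First-level autocorrelation features: one attention head per
    (Wq[i], Wk[i], Wv[i]) weight triple, heads concatenated at the end."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        d_k = Q.shape[-1]                       # scaling factor d_k
        A = softmax(Q @ K.T / np.sqrt(d_k))     # softmax(Q·K^T / sqrt(d_k))
        heads.append(A @ V)                     # head_i
    return np.concatenate(heads, axis=-1)       # MultiHead_X1

# toy shapes: sequence of 4 steps, feature dim 8, n = 2 heads of dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq = [rng.normal(size=(8, 4)) for _ in range(2)]
Wk = [rng.normal(size=(8, 4)) for _ in range(2)]
Wv = [rng.normal(size=(8, 4)) for _ in range(2)]
out = multi_head_self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because each head projects to dimension d_k = 4 and the two heads are concatenated, the output recovers the input feature width.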
3. The multi-modal emotion analysis method based on multi-dimensional attention fusion network as claimed in claim 1, wherein the feedforward neural network formula is:
MultiHead_X2 = max(0, MultiHead_X1·W_1 + b_1)·W_2 + b_2

wherein MultiHead_X2 is the second-level autocorrelation feature of the target modality, W_1 is the weight applied to the first-level autocorrelation feature, W_2 is the hidden-layer weight of the network, b_1 is the bias for the first-level autocorrelation feature, and b_2 is the bias of the network hidden layer.
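The feedforward formula of claim 3 is a standard two-layer network with a ReLU (the max(0, ·)) between the layers. A minimal NumPy sketch, with all dimensions chosen as toy assumptions:

```python
import numpy as np

def feed_forward(M1, W1, b1, W2, b2):
    # MultiHead_X2 = max(0, MultiHead_X1·W1 + b1)·W2 + b2
    return np.maximum(0.0, M1 @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
M1 = rng.normal(size=(4, 8))                     # first-level autocorrelation features
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)  # expand to a hidden width (assumed 16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)   # project back to the feature width
M2 = feed_forward(M1, W1, b1, W2, b2)            # second-level features MultiHead_X2
print(M2.shape)  # (4, 8)
```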
4. The multi-modal emotion analysis method based on multi-dimensional attention fusion network as claimed in claim 1, wherein the grouping manner in step four is (X_{0,main}, X_{1,aide}) and (X_{0,main}, X_{2,aide}) for subsequent pairwise fusion, wherein X_{0,main} is the target feature to be fused in step four, and X_{1,aide}, X_{2,aide} are the auxiliary fusion features of the other two modalities; (X_{0,main}, X_{1,aide}) is input into one bimodal fuser and (X_{0,main}, X_{2,aide}) into the other bimodal fuser.
5. The multi-modal emotion analysis method based on the multi-dimensional attention fusion network as claimed in claim 1, wherein the bimodal fuser in step four performs cross-modal fusion based on the Transformer's multi-head attention mechanism, calculated as follows:
Q_{m,main} = X_{m,main}·W^{Q′};  K = X_{j,aide}·W^{K′};  V = X_{j,aide}·W^{V′}

head_i = softmax(Q_{m,main}·K^T / √d_k′)·V

CrossFusion_X_{aide→main} = Concat(head_1, ..., head_n)

wherein X_{m,main} denotes the self-attention feature of the current target modality m, X_{j,aide} denotes the auxiliary fusion feature, CrossFusion_X_{aide→main} denotes the fusion result, W^{Q′} is the query mapping weight of target modality m, W^{K′} is the key mapping weight of target modality m, W^{V′} is the value mapping weight of target modality m, and d_k′ is the scaling factor.
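The bimodal fuser of claim 5 differs from self-attention in that queries come from the target (main) modality while keys and values come from the auxiliary modality. A single-head NumPy sketch (the claim concatenates n such heads; shapes and weights below are toy assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bimodal_fuser(X_main, X_aide, Wq, Wk, Wv):
    """Cross-modal attention: queries from the target modality,
    keys and values from the auxiliary modality."""
    Q = X_main @ Wq          # Q_{m,main}
    K = X_aide @ Wk
    V = X_aide @ Wv
    d_k = Q.shape[-1]        # scaling factor d_k'
    # one head of CrossFusion_X_{aide->main}
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(2)
X_main = rng.normal(size=(4, 8))   # target-modality first-level features
X_aide = rng.normal(size=(6, 8))   # auxiliary-modality second-level features
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = bimodal_fuser(X_main, X_aide, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Note that the output keeps the target modality's sequence length (4) while aggregating information over the auxiliary modality's 6 positions.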
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010292014.0A CN111680541B (en) | 2020-04-14 | 2020-04-14 | Multi-modal emotion analysis method based on multi-dimensional attention fusion network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111680541A CN111680541A (en) | 2020-09-18 |
CN111680541B true CN111680541B (en) | 2022-06-21 |
Family
ID=72433356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010292014.0A Active CN111680541B (en) | 2020-04-14 | 2020-04-14 | Multi-modal emotion analysis method based on multi-dimensional attention fusion network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111680541B (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112053690B (en) * | 2020-09-22 | 2023-12-29 | 湖南大学 | Cross-mode multi-feature fusion audio/video voice recognition method and system |
CN112233698B (en) * | 2020-10-09 | 2023-07-25 | 中国平安人寿保险股份有限公司 | Character emotion recognition method, device, terminal equipment and storage medium |
CN112489635B (en) * | 2020-12-03 | 2022-11-11 | 杭州电子科技大学 | Multi-mode emotion recognition method based on attention enhancement mechanism |
CN112765323B (en) * | 2021-01-24 | 2021-08-17 | 中国电子科技集团公司第十五研究所 | Voice emotion recognition method based on multi-mode feature extraction and fusion |
CN112819052B (en) * | 2021-01-25 | 2021-12-24 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-modal fine-grained mixing method, system, device and storage medium |
CN112560811B (en) * | 2021-02-19 | 2021-07-02 | 中国科学院自动化研究所 | End-to-end automatic detection research method for audio-video depression |
CN112989977B (en) * | 2021-03-03 | 2022-09-06 | 复旦大学 | Audio-visual event positioning method and device based on cross-modal attention mechanism |
CN113723166A (en) * | 2021-03-26 | 2021-11-30 | 腾讯科技(北京)有限公司 | Content identification method and device, computer equipment and storage medium |
CN113807440B (en) * | 2021-09-17 | 2022-08-26 | 北京百度网讯科技有限公司 | Method, apparatus, and medium for processing multimodal data using neural networks |
CN113806609B (en) * | 2021-09-26 | 2022-07-12 | 郑州轻工业大学 | Multi-modal emotion analysis method based on MIT and FSM |
CN113723112B (en) * | 2021-11-02 | 2022-02-22 | 天津海翼科技有限公司 | Multi-modal emotion analysis prediction method, device, equipment and storage medium |
CN114387997B (en) * | 2022-01-21 | 2024-03-29 | 合肥工业大学 | Voice emotion recognition method based on deep learning |
CN116580257A (en) * | 2022-01-24 | 2023-08-11 | 腾讯科技(深圳)有限公司 | Feature fusion model training and sample retrieval method and device and computer equipment |
CN114821385A (en) * | 2022-03-08 | 2022-07-29 | 阿里巴巴(中国)有限公司 | Multimedia information processing method, device, equipment and storage medium |
CN115205179A (en) * | 2022-07-15 | 2022-10-18 | 小米汽车科技有限公司 | Image fusion method and device, vehicle and storage medium |
CN116070169A (en) * | 2023-01-28 | 2023-05-05 | 天翼云科技有限公司 | Model training method and device, electronic equipment and storage medium |
CN116189272B (en) * | 2023-05-05 | 2023-07-07 | 南京邮电大学 | Facial expression recognition method and system based on feature fusion and attention mechanism |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109614895A (en) * | 2018-10-29 | 2019-04-12 | 山东大学 | A method of the multi-modal emotion recognition based on attention Fusion Features |
CN110033029A (en) * | 2019-03-22 | 2019-07-19 | 五邑大学 | A kind of emotion identification method and device based on multi-modal emotion model |
CN110188343A (en) * | 2019-04-22 | 2019-08-30 | 浙江工业大学 | Multi-modal emotion identification method based on fusion attention network |
CN110287389A (en) * | 2019-05-31 | 2019-09-27 | 南京理工大学 | The multi-modal sensibility classification method merged based on text, voice and video |
CN110399841A (en) * | 2019-07-26 | 2019-11-01 | 北京达佳互联信息技术有限公司 | A kind of video classification methods, device and electronic equipment |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20170001490A (en) * | 2015-06-26 | 2017-01-04 | 삼성전자주식회사 | The electronic apparatus and method for controlling function in the electronic apparatus using the bio-metric sensor |
WO2019103484A1 (en) * | 2017-11-24 | 2019-05-31 | 주식회사 제네시스랩 | Multi-modal emotion recognition device, method and storage medium using artificial intelligence |
Non-Patent Citations (1)
Title |
---|
Multimodal continuous emotion recognition with data augmentation using recurrent neural networks; Jian Huang et al.; ACM; 2018-12-31; pp. 57-64 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111680541B (en) | Multi-modal emotion analysis method based on multi-dimensional attention fusion network | |
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
CN111625641B (en) | Dialog intention recognition method and system based on multi-dimensional semantic interaction representation model | |
Hazarika et al. | Self-attentive feature-level fusion for multimodal emotion detection | |
CN111506732B (en) | Text multi-level label classification method | |
CN116450796B (en) | Intelligent question-answering model construction method and device | |
CN113268609B (en) | Knowledge graph-based dialogue content recommendation method, device, equipment and medium | |
CN112562669B (en) | Method and system for automatically abstracting intelligent digital newspaper and performing voice interaction chat | |
CN112287093B (en) | Automatic question-answering system based on semi-supervised learning and Text-to-SQL model | |
CN115292461B (en) | Man-machine interaction learning method and system based on voice recognition | |
CN110569869A (en) | feature level fusion method for multi-modal emotion detection | |
CN116303977B (en) | Question-answering method and system based on feature classification | |
CN114417097A (en) | Emotion prediction method and system based on time convolution and self-attention | |
Khare et al. | Multi-modal embeddings using multi-task learning for emotion recognition | |
CN113705315A (en) | Video processing method, device, equipment and storage medium | |
CN115563290A (en) | Intelligent emotion recognition method based on context modeling | |
CN116189039A (en) | Multi-modal emotion classification method and system for modal sequence perception with global audio feature enhancement | |
CN111984780A (en) | Multi-intention recognition model training method, multi-intention recognition method and related device | |
Zhao et al. | Knowledge-aware bayesian co-attention for multimodal emotion recognition | |
Sun et al. | Multi-classification speech emotion recognition based on two-stage bottleneck features selection and MCJD algorithm | |
CN116702091B (en) | Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP | |
CN113569553A (en) | Sentence similarity judgment method based on improved Adaboost algorithm | |
KR102297480B1 (en) | System and method for structured-paraphrasing the unstructured query or request sentence | |
CN112417125A (en) | Open domain dialogue reply method and system based on deep reinforcement learning | |
CN115329776B (en) | Semantic analysis method for network security co-processing based on less-sample learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||