CN117809229A - Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance - Google Patents
Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance
- Publication number
- CN117809229A CN117809229A CN202410224455.5A CN202410224455A CN117809229A CN 117809229 A CN117809229 A CN 117809229A CN 202410224455 A CN202410224455 A CN 202410224455A CN 117809229 A CN117809229 A CN 117809229A
- Authority
- CN
- China
- Prior art keywords
- modality
- features
- acoustic
- commonality
- emotion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a multi-modal emotion analysis method based on personalized and commonality comparison staged guidance, which comprises the following steps: extracting language features, acoustic features and visual features of a video sample; preprocessing the language features, the acoustic features and the visual features, and then performing two-stage high-level semantic feature extraction to obtain first-stage extraction data and second-stage extraction data; refining the first-stage extraction data with a personality contrast loss function to obtain characterization data specific to each modality; refining the second-stage extraction data with a commonality contrast loss function to obtain characterization data of the features shared among the modalities; and inferring an emotion value of the video sample based on the modality-specific characterization data and the shared characterization data. The invention can comprehensively exploit the multiple descriptive aspects of the data to infer the emotion polarity of the character in a video clip.
Description
Technical Field
The invention belongs to the technical fields of natural language processing, speech signal processing and computer vision, and particularly relates to a multi-modal emotion analysis method based on personalized and commonality comparison staged guidance.
Background
Emotion analysis is a challenging task in natural language processing that requires judging a person's emotion from the information provided by text. Human emotions are often varied; sometimes text alone cannot fully describe a person's emotion, and it is often difficult for a machine to interpret it correctly. With the development of social networking platforms, the modalities through which people express opinions have become increasingly rich; in particular, the advent of short videos allows people to convey their views through text, speech and actions. This has led to explosive growth of multi-modal data, and the object of emotion analysis has expanded to multi-modal data rather than being limited to text. Compared with single-modality emotion analysis that targets only text, multi-modal (e.g., text, visual and audio) emotion analysis judges a character's emotion more comprehensively and generalizes better; how to process and relate the multi-modal information is its central problem.
When research involves only a single modality, related work is abundant and the single-modality models that have been created are almost countless; for applications that only require a single modality, these models can be used with little extra configuration. However, as the number of modalities increases, the number of models that can be directly reused drops sharply, mainly because reasoning with a multi-modal model requires a fusion architecture across several modalities, and designing a reasonable interaction mechanism means carefully weighing the many factors that affect modality fusion. In addition, from the perspective of bionics, attempts have been made to construct multi-modal fusion mechanisms by observing and mimicking the behavior of humans or animals. Facing such complex fusion mechanisms, designing a multi-modal model with outstanding performance is very challenging; every excellent model requires careful hand-crafting by researchers, which consumes considerable time. Moreover, these models are typically designed for a specific multi-modal task on a fixed number of modalities; adding or removing a modality may render the model unusable or increase its computational complexity, and such models are difficult to migrate once built. If the multi-modal network is instead designed as a generic model, problems such as performance degradation or training difficulty may arise.
Disclosure of Invention
The invention aims to provide a multi-modal emotion analysis method based on personality and commonality comparison staged guidance, so as to comprehensively exploit the multiple descriptive aspects of the data to infer the emotion polarity of characters in video clips.
In order to achieve the above object, the present invention provides a multi-modal emotion analysis method based on personality and commonality comparison staged guidance, comprising:
extracting language features, acoustic features and visual features of the video sample;
preprocessing the language features, the acoustic features and the visual features, and then extracting high-level semantic features in two stages to obtain first-stage extraction data and second-stage extraction data;
extracting the first-stage extraction data by utilizing a personality contrast loss function to obtain characterization data specific to each modality;
extracting the second-stage extraction data by utilizing a commonality contrast loss function to obtain characterization data of the features shared among the modalities;
inferring an emotion value of the video sample based on the characterization data specific to each modality and the characterization data of the features shared among the modalities.
Optionally, preprocessing the language features, acoustic features, and visual features includes:
performing alignment processing on the language features, the acoustic features and the visual features in a time dimension by taking the text as a reference, and removing stop word parts irrelevant to emotion in the language features, the acoustic features and the visual features;
additional contrast views are generated for the linguistic, acoustic, and visual features using random zero vector substitution.
Optionally, performing two-stage high-level semantic feature extraction includes:
extracting corresponding high-level semantic features from the preprocessed language features, acoustic features and visual features by using three Transformer encoders; wherein the three Transformer encoders are divided into two stages, and each stage comprises several Transformer encoding layers;
the two-stage calculation method comprises the following steps:
$$h_m^{i,(1)} = E_m^{(1)}\big(x_m^{i}\big), \qquad h_m^{i,(2)} = E_m^{(2)}\big(h_m^{i,(1)}\big)$$
wherein $E_m^{(1)}$ and $E_m^{(2)}$ are the Transformer encoders of the first stage and the second stage respectively, $m \in \{l, v, a\}$, $l$ is the language modality, $v$ is the visual modality, $a$ is the acoustic modality, and $x_m^{i}$ is the $i$-th input sample of modality $m$.
Optionally, the personality-contrast loss function is:
$$\mathcal{L}_{icm} = \sum_{m \in \{l,v,a\}} \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(z_m^{i,(1)} \cdot z_m^{p,(1)} / \tau\big)}{\sum_{j \in A(i)} \exp\big(z_m^{i,(1)} \cdot z_m^{j,(1)} / \tau\big)}$$
wherein $z_m^{i,(1)}$ is the vector extracted by the function $g$ from the feature $h_m^{i,(1)}$ output by the $i$-th sample of modality $m$ in the first stage of the network, $g$ represents a multi-layer perceptron, $\tau$ indicates a temperature coefficient with a value greater than zero, $A(i)$ indicates the index set with $i$ removed, and $P(i)$ indicates the set of indices whose labels are identical to that of the $i$-th sample and which belong to the set $A(i)$.
Optionally, the commonality contrast loss function is:
$$\mathcal{L}_{ccm} = \sum_{m \in \{l,v,a\}} \sum_{i=1}^{n} \frac{-1}{|M \setminus \{m\}|\,|V(i)|} \sum_{s \in M \setminus \{m\}} \sum_{j \in V(i)} \log \frac{\exp\big(z_m^{i,(2)} \cdot z_s^{j,(2)} / \tau\big)}{\sum_{q \in M \setminus \{m\}} \sum_{a=1}^{n} \exp\big(z_m^{i,(2)} \cdot z_q^{a,(2)} / \tau\big)}$$
wherein $z_m^{i,(2)}$ is the vector extracted by the function $g$ from the feature $h_m^{i,(2)}$ output by the $i$-th sample of modality $m$ in the second stage of the network, and $M \setminus \{m\}$ represents the set obtained by removing modality $m$ from the three modalities.
Optionally, inferring emotion values for the video samples includes:
converting the characterization data specific to each modality and the characterization data of the features shared among the modalities into a vector representation through maximum pooling, and obtaining the emotion value of the video sample after the vector representation is processed by two fully connected layers, a ReLU activation function and a Dropout layer;
and calculating emotion analysis loss by using root mean square error based on the emotion value.
Optionally, the method for calculating the emotion analysis loss includes:
$$H^{i,(1)} = \big[h_l^{i,(1)};\, h_v^{i,(1)};\, h_a^{i,(1)}\big], \qquad H^{i,(2)} = \big[h_l^{i,(2)};\, h_v^{i,(2)};\, h_a^{i,(2)}\big]$$
$$\hat{y}^{i} = \mathrm{FC}\Big(\mathrm{MaxPool}\big(\big[H^{i,(1)};\, H^{i,(2)}\big]\big)\Big), \qquad \mathcal{L}_{msa} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(\hat{y}^{i} - y^{i}\big)^{2}}$$
wherein $\mathrm{MaxPool}$ represents maximum pooling, $\mathrm{FC}$ represents the two fully connected layers with the ReLU activation function and the Dropout layer, $\hat{y}^{i}$ represents the emotion value, $y^{i}$ indicates the true emotion label value of the $i$-th sample, $H^{i,(1)}$ indicates the characterization obtained by aggregating the language modality feature $h_l^{i,(1)}$, the visual modality feature $h_v^{i,(1)}$ and the acoustic modality feature $h_a^{i,(1)}$ output by the $i$-th sample through the first stage of the network, $H^{i,(2)}$ indicates the characterization obtained by aggregating the corresponding second-stage features $h_l^{i,(2)}$, $h_v^{i,(2)}$ and $h_a^{i,(2)}$, $i$ indicates the index of a sample, and $n$ indicates the number of samples in a batch.
The invention has the following beneficial effects:
according to the invention, the first-stage extraction data are refined with the personality contrast loss function, and the resulting characterization data specific to each modality help to understand the emotional context and avoid ambiguity; the second-stage extraction data are refined with the commonality contrast loss function, and the resulting characterization data of the features shared among the modalities help to reinforce the emotional tone and improve the model's ability to reason about a character's emotion.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
FIG. 1 is a schematic diagram of an overall framework of a multi-modal emotion analysis method based on personality and commonality contrast staged guidance in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data preprocessing module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a personality comparison module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a commonality comparison module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an emotion prediction module according to an embodiment of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
As shown in fig. 1, the embodiment provides a multi-modal emotion analysis method based on personality and commonality comparison staged guidance, which includes the following steps:
extracting language features, acoustic features and visual features of the video sample;
more specifically, word embedding features of the transcribed text are extracted using a pre-trained BERT model, visual features are extracted using the OpenFace tool library, and acoustic features are extracted using the COVAREP tool library;
in this embodiment, the text content transcribed from a video clip is first converted into word embeddings as language features by the language model BERT pre-trained on a large dataset; information such as the positions of facial key points, facial action units and head pose changes of the character in the video clip is detected by the open-source facial behavior analysis tool library OpenFace as visual features; and information such as Mel-frequency cepstral coefficients, fundamental frequency, short-time jitter parameters and frame energy is extracted from the character's audio by the open-source speech processing tool library COVAREP as acoustic features. The whole process is shown in fig. 2.
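As a minimal sketch of the language-feature step only (assuming the visual and acoustic features are pre-extracted with OpenFace and COVAREP outside Python), the snippet below obtains BERT word embeddings for a transcribed utterance with the Hugging Face transformers library; the checkpoint name and the example sentence are illustrative choices, not values fixed by this embodiment:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # illustrative checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

transcript = "I really enjoyed this movie"                       # transcribed text of a video clip
inputs = tokenizer(transcript, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
language_features = outputs.last_hidden_state                    # (1, seq_len, 768) word embeddings
print(language_features.shape)
```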
Preprocessing the language features, the acoustic features and the visual features, and then extracting high-level semantic features in two stages to obtain first-stage extraction data and second-stage extraction data;
preprocessing the language features, acoustic features, and visual features includes:
performing alignment processing on the language features, the acoustic features and the visual features in a time dimension by taking the text as a reference, and removing stop word parts irrelevant to emotion in the language features, the acoustic features and the visual features;
additional contrast views are generated for the linguistic, acoustic, and visual features using random zero vector substitution.
More specifically, the embodiment aligns three modal data in the time dimension based on text, and removes stop word portions unrelated to emotion therein. All three modality data are time series data, expressed mathematically as matrix tensors.
To be able to calculate the contrastive loss, this embodiment generates, before the three modality data are input into the model, additional views for each sample of each modality as comparison objects with similar representations. A view is generated by randomly replacing some time steps in the sequence data with zero vectors. The additional views serve as positive samples for the comparison; they also prevent the denominator from becoming zero when the contrastive loss is computed over a small batch of samples and improve the quality of the features extracted by the encoders, and different numbers of views have different effects on model performance.
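A minimal sketch of the view-generation step described above, assuming each aligned modality sequence is stored as a (seq_len, dim) tensor; the 15% replacement ratio is an illustrative assumption rather than a value given by this embodiment:

```python
import torch

def make_contrast_view(seq: torch.Tensor, drop_ratio: float = 0.15) -> torch.Tensor:
    """Return an additional view of a (seq_len, dim) modality sequence by randomly
    replacing a fraction of its time steps with zero vectors."""
    view = seq.clone()
    num_steps = seq.size(0)
    num_drop = max(1, int(num_steps * drop_ratio))
    drop_idx = torch.randperm(num_steps)[:num_drop]   # time steps to zero out
    view[drop_idx] = 0.0
    return view

# e.g. one extra positive view per aligned language-feature sequence
text_seq = torch.randn(50, 768)
text_view = make_contrast_view(text_seq)
```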
The high-level semantic feature extraction of the two stages comprises:
extracting corresponding high-level semantic features from the preprocessed language features, acoustic features and visual features by using three Transformer encoders; wherein the three Transformer encoders are divided into two stages, and each stage comprises several Transformer encoding layers;
In this embodiment, three Transformer encoders are used to extract corresponding high-level semantic features from the data of each modality. In this process, the three Transformer encoders are divided into two stages, each of which is composed of multiple Transformer encoding layers, and the calculation process is as follows:
$$h_m^{i,(1)} = E_m^{(1)}\big(x_m^{i}\big), \qquad h_m^{i,(2)} = E_m^{(2)}\big(h_m^{i,(1)}\big)$$
wherein $E_m^{(1)}$ and $E_m^{(2)}$ are the Transformer encoders of the first stage and the second stage respectively, $m \in \{l, v, a\}$, $l$ is the language modality, $v$ is the visual modality, $a$ is the acoustic modality, and $x_m^{i}$ is the $i$-th input sample of modality $m$.
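The following is a minimal PyTorch sketch of the two-stage encoder for one modality, assuming standard nn.TransformerEncoder layers and that each modality's input has already been projected to a common dimension; the layer counts, hidden size and number of heads are illustrative assumptions rather than values fixed by this embodiment:

```python
import torch
import torch.nn as nn

class TwoStageEncoder(nn.Module):
    """One modality's Transformer encoder split into a first stage E1 and a second stage E2."""

    def __init__(self, dim: int = 128, heads: int = 4, layers_per_stage: int = 2):
        super().__init__()
        def make_stage():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=layers_per_stage)
        self.stage1 = make_stage()   # E^(1): guided by the personality contrast loss
        self.stage2 = make_stage()   # E^(2): guided by the commonality contrast loss

    def forward(self, x: torch.Tensor):
        h1 = self.stage1(x)          # first-stage high-level semantic features
        h2 = self.stage2(h1)         # second-stage high-level semantic features
        return h1, h2

# one encoder per modality (language, visual, acoustic)
encoder_language = TwoStageEncoder()
h1_l, h2_l = encoder_language(torch.randn(8, 50, 128))   # (batch, seq_len, dim)
```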
Extracting the first-stage extraction data by utilizing a personality contrast loss function to obtain characterization data specific to each modality;
extracting the second-stage extraction data by utilizing a commonality contrast loss function to obtain characterization data of the features shared among the modalities;
more specifically, the features extracted by each encoder in the first stage are input into the personality comparison module, which generates a personality contrast loss specific to each modality; this loss guides the update of the network parameters during back propagation, so that the features generated by the encoders in the first stage carry information specific to each modality. The features extracted by each encoder in the second stage are input into the commonality comparison module, which generates a commonality contrast loss over the three modalities; in the same way as the personality contrast loss, this commonality contrast loss guides the encoders to generate, in the second stage, characterizations with the features shared among the modalities.
In the personality comparison module shown in fig. 3, this embodiment receives the samples of every modality in the same batch. Multiple views belonging to the same sample and the same modality are required to have feature representations that are as close as possible in the mapping space, and are therefore treated as positive pairs; all views belonging to different samples but the same modality are required to have feature representations that stay relatively far apart in the mapping space, and are therefore treated as negative pairs. This goal can be learned with the contrastive loss function:
$$\ell_m^{i} = \frac{-1}{|V(i)|} \sum_{p \in V(i)} \log \frac{\exp\big(z_m^{i,(1)} \cdot z_m^{p,(1)} / \tau\big)}{\sum_{j \in A(i)} \exp\big(z_m^{i,(1)} \cdot z_m^{j,(1)} / \tau\big)}$$
wherein $z_m^{i,(1)} = g\big(h_m^{i,(1)}\big)$, $g$ is a multi-layer perceptron, $\tau$ is the temperature coefficient, $A(i)$ is the index set of all views in the batch with $i$ removed, and $V(i)$ is the index set of all views derived from the $i$-th sample. After learning, the feature representation of each modality of each sample is closely related to the nature of the modality itself. In addition, for the features extracted under the guidance of the personality comparison module, this embodiment also requires the information they contain to be highly relevant to the task; therefore, using the classification labels of the task, all views belonging to different samples but having the same label and the same modality are also treated as positive pairs, so that their features are sufficiently close to each other in the mapping space. This is effective because samples assigned task-related labels themselves contain information related to the label attributes, and this information should be similar when the labels are the same. This embodiment can learn this goal using the supervised contrastive loss function:
$$\mathcal{L}_{icm} = \sum_{m \in \{l,v,a\}} \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(z_m^{i,(1)} \cdot z_m^{p,(1)} / \tau\big)}{\sum_{j \in A(i)} \exp\big(z_m^{i,(1)} \cdot z_m^{j,(1)} / \tau\big)}$$
wherein $P(i)$ is the set of indices in $A(i)$ whose labels are identical to that of the $i$-th sample. It should be noted that, since emotion analysis is a regression task, there is no ready-made task label with which to classify samples, so this embodiment rounds the true regression value and uses it as a classification label. Because the range of the emotion regression value is generally [-3, 3], 7 categories are obtained (i.e., -3, -2, -1, 0, 1, 2, 3), which in practice correspond to the 7 degrees from extremely negative to extremely positive in emotion analysis. This embodiment applies the personality comparison module to the first stage of the encoders, so that the features extracted by the encoders in the first stage are highly correlated with the modalities and the task.
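A minimal sketch of a supervised contrastive loss of the kind the personality comparison module relies on, for one modality; it assumes the projected first-stage features z have already passed through the MLP g and been L2-normalized, and that labels are integers obtained by rounding the emotion value. The exact positive/negative bookkeeping of this embodiment may differ:

```python
import torch
import torch.nn.functional as F

def personality_contrast_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over one modality's projected first-stage features.

    z: (num_views, d) L2-normalized projections (outputs of the MLP g).
    labels: (num_views,) integer labels obtained by rounding the emotion value.
    """
    sim = z @ z.t() / tau                                        # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask   # same label, not itself
    sim = sim.masked_fill(self_mask, float("-inf"))              # denominator ranges over A(i)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()                      # anchors with at least one positive

labels = torch.randint(-3, 4, (16,))                             # rounded emotion values in [-3, 3]
z = F.normalize(torch.randn(16, 64), dim=1)                      # projected views of one modality
loss_icm = personality_contrast_loss(z, labels)
```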
As shown in FIG. 4, the commonality comparison module is used for refining information related to the commonalities of the modalities. Unlike the personality comparison module, the commonality comparison module needs to receive features from multiple modalities simultaneously, and therefore does not depend on whether a single modality has multiple views. The multiple modalities of the same sample are treated as positive pairs, whose feature representations are required to be sufficiently close in the mapping space; all modalities belonging to different samples are treated as negative pairs, whose feature representations are required to stay relatively far apart in the mapping space. Likewise, this goal can be learned with the contrastive loss function:
$$\mathcal{L}_{ccm} = \sum_{m \in \{l,v,a\}} \sum_{i=1}^{n} \frac{-1}{|M \setminus \{m\}|\,|V(i)|} \sum_{s \in M \setminus \{m\}} \sum_{j \in V(i)} \log \frac{\exp\big(z_m^{i,(2)} \cdot z_s^{j,(2)} / \tau\big)}{\sum_{q \in M \setminus \{m\}} \sum_{a=1}^{n} \exp\big(z_m^{i,(2)} \cdot z_q^{a,(2)} / \tau\big)}$$
wherein $z_m^{i,(2)} = g\big(h_m^{i,(2)}\big)$ and $M \setminus \{m\}$ is the set obtained by removing modality $m$ from the three modalities. It should be noted that, in the commonality comparison module, if a modality has multiple views, these additional views need not participate in the comparison of that modality, since their feature representations have already been pulled together in the first stage of the encoders under the guidance of the personality comparison module. This embodiment applies the commonality comparison module to the second stage of the encoders, so that the features extracted by the encoders in the second stage are related to the commonalities across the modalities.
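A minimal sketch of a cross-modal contrastive loss in the spirit of the commonality comparison module: for each sample, its projected second-stage features from different modalities are pulled together and features from other samples are pushed apart. The symmetric InfoNCE form over modality pairs used here is an assumption about the exact formulation:

```python
import torch
import torch.nn.functional as F

def commonality_contrast_loss(z_by_modality: dict, tau: float = 0.1) -> torch.Tensor:
    """Cross-modal contrastive loss over projected second-stage features.

    z_by_modality maps 'l', 'v', 'a' to (n, d) L2-normalized projections; row i of every
    tensor comes from the i-th sample, so matching rows across modalities are positives.
    """
    mods = list(z_by_modality)
    n = next(iter(z_by_modality.values())).size(0)
    target = torch.arange(n)                                     # positive = same sample index
    losses = []
    for m in mods:
        for s in mods:
            if s == m:
                continue
            sim = z_by_modality[m] @ z_by_modality[s].t() / tau  # (n, n) cross-modal similarities
            losses.append(F.cross_entropy(sim, target))
    return torch.stack(losses).mean()

z = {m: F.normalize(torch.randn(8, 64), dim=1) for m in ("l", "v", "a")}
loss_ccm = commonality_contrast_loss(z)
```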
Inferring an emotion value of the video sample based on the characterization data specific to each modality and the characterization data of the features shared among the modalities.
In this embodiment, the intra-modality features refined under the guidance of the personality contrast module help to understand the emotional context and avoid ambiguity, and the inter-modality features refined under the guidance of the commonality contrast module help to reinforce the emotional tone and improve the model's ability to reason about a character's emotion. The comparison modules designed in this embodiment are used only during training and are completely removed at inference time; they do not alter the structure of the original network, can be extended to any modality, and add very little computational complexity, which improves the flexibility of the model in processing multi-modal data.
In this embodiment, the features extracted by the three encoders in the first stage and the second stage are input to the emotion prediction module, as shown in fig. 5. In the emotion prediction module, the multiple features are converted into a single vector representation by maximum pooling and then passed through two fully connected layers, a ReLU activation function and a Dropout layer to obtain the emotion value $\hat{y}^{i}$ of the sample. Finally, the emotion analysis loss is calculated with the root mean square error:
$$\hat{y}^{i} = \mathrm{FC}\Big(\mathrm{MaxPool}\big(\big[H^{i,(1)};\, H^{i,(2)}\big]\big)\Big), \qquad \mathcal{L}_{msa} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(\hat{y}^{i} - y^{i}\big)^{2}}$$
wherein $\mathrm{MaxPool}$ represents maximum pooling, $\mathrm{FC}$ represents the two fully connected layers with the ReLU activation function and the Dropout layer, and $H^{i,(1)}$ and $H^{i,(2)}$ aggregate the first-stage and second-stage features of the three modalities for the $i$-th sample. The final loss function can be expressed as:
$$\mathcal{L} = \mathcal{L}_{msa} + \alpha\,\mathcal{L}_{icm} + \beta\,\mathcal{L}_{ccm}$$
wherein $\alpha$ and $\beta$ are hyperparameters that balance the different losses.
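A minimal sketch of the emotion prediction head and the combined training objective, assuming the stage-wise features of the three modalities have already been concatenated along the time dimension before max pooling; the hidden size, the balance weights and the stand-in loss values are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionHead(nn.Module):
    """Max pooling over time, then two fully connected layers with ReLU and Dropout."""

    def __init__(self, dim: int = 128, hidden: int = 64, p_drop: float = 0.2):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p_drop)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        pooled = feats.max(dim=1).values                 # max pooling over the time dimension
        return self.fc2(self.dropout(self.relu(self.fc1(pooled)))).squeeze(-1)

# stand-ins for the concatenated first- and second-stage features of the three modalities
feats = torch.randn(8, 300, 128)                         # (batch, total_len, dim)
y_true = torch.randn(8)                                  # true emotion values in [-3, 3]
loss_icm, loss_ccm = torch.tensor(0.5), torch.tensor(0.4)   # values from the contrast modules

head = EmotionHead()
y_hat = head(feats)
loss_msa = torch.sqrt(F.mse_loss(y_hat, y_true))         # root mean square error
alpha, beta = 0.1, 0.1                                   # illustrative balance weights
total_loss = loss_msa + alpha * loss_icm + beta * loss_ccm
```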
To illustrate the effect of the invention, this embodiment selects some existing methods for comparison and performs verification on the CMU-MOSI multi-modal benchmark dataset. The evaluation metrics are the correlation coefficient Corr, the classification accuracy Acc and the F1 score. The experimental results are shown in Table 1 below:
TABLE 1
The experimental results show that, on the CMU-MOSI test set, the invention achieves a considerable improvement over the baseline systems on all three metrics: the correlation coefficient, the classification accuracy and the F1 score.
The experimental results of this embodiment on the CMU-MOSI dataset show that the proposed method markedly improves the performance of multi-modal emotion analysis and is clearly superior to multiple baseline models.
In this embodiment, since emotion analysis based on text alone cannot cope with the complex mechanisms of emotion and the rich multi-modal data available today, an emotion analysis model that accepts multi-modal data is designed, so as to comprehensively exploit the multiple descriptive aspects of the data to infer the emotion polarity of characters in video clips. In addition, to address the difficulty of designing multi-modal interactions, a training scheme that requires no explicit interaction architecture is proposed. A personality comparison module and a commonality comparison module are introduced, and the comparison modules guide the single-modality encoders to refine their features, so that the features extracted by the encoders carry both single-modality personalities and multi-modality commonalities. This implicit interaction between modalities avoids the trouble of manually designing an explicit interaction process, so this embodiment can concentrate on the selection of single-modality networks, enjoy the advantages they bring, and accelerate training.
The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (7)
1. The multi-modal emotion analysis method based on personalized and commonality comparison staged guidance is characterized by comprising the following steps of:
extracting language features, acoustic features and visual features of the video sample;
preprocessing the language features, the acoustic features and the visual features, and then extracting high-level semantic features in two stages to obtain first-stage extraction data and second-stage extraction data;
extracting the first-stage extraction data by utilizing a personality contrast loss function to obtain characterization data specific to each modality;
extracting the second-stage extraction data by utilizing a commonality contrast loss function to obtain characterization data of the features shared among the modalities;
inferring an emotion value of the video sample based on the characterization data specific to each modality and the characterization data of the features shared among the modalities.
2. The method of multimodal emotion analysis based on staged guidance of personality and commonality contrast of claim 1, wherein preprocessing the linguistic, acoustic, and visual features comprises:
performing alignment processing on the language features, the acoustic features and the visual features in a time dimension by taking the text as a reference, and removing stop word parts irrelevant to emotion in the language features, the acoustic features and the visual features;
additional contrast views are generated for the linguistic, acoustic, and visual features using random zero vector substitution.
3. The multi-modal emotion analysis method based on personality and commonality contrast staged guidance of claim 1, wherein performing two stages of high level semantic feature extraction comprises:
extracting corresponding high-level semantic features from the preprocessed language features, acoustic features and visual features by using three Transformer encoders; wherein the three Transformer encoders are divided into two stages, and each stage comprises several Transformer encoding layers;
the two-stage calculation method comprises the following steps:
$$h_m^{i,(1)} = E_m^{(1)}\big(x_m^{i}\big), \qquad h_m^{i,(2)} = E_m^{(2)}\big(h_m^{i,(1)}\big)$$
wherein $E_m^{(1)}$ and $E_m^{(2)}$ are the Transformer encoders of the first stage and the second stage respectively, $m \in \{l, v, a\}$, $l$ is the language modality, $v$ is the visual modality, $a$ is the acoustic modality, and $x_m^{i}$ is the $i$-th input sample of modality $m$.
4. The multi-modal emotion analysis method based on personality and commonality contrast staged guidance of claim 1, wherein the personality contrast loss function is:
$$\mathcal{L}_{icm} = \sum_{m \in \{l,v,a\}} \sum_{i=1}^{kn} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(z_m^{i,(1)} \cdot z_m^{p,(1)} / \tau\big)}{\sum_{j \in A(i)} \exp\big(z_m^{i,(1)} \cdot z_m^{j,(1)} / \tau\big)}$$
wherein $z_m^{i,(1)}$ is the vector extracted by the function $g$ from the feature $h_m^{i,(1)}$ output by the $i$-th sample of modality $m$ in the first stage of the network, $g$ represents a multi-layer perceptron, $\tau$ indicates a temperature coefficient with a value greater than zero, $A(i)$ indicates the index set with $i$ removed, $P(i)$ indicates the set of indices whose labels are identical to that of the $i$-th sample and which belong to the set $A(i)$, $k$ represents the number of enhanced views, $n$ represents the number of samples in a batch, $a$ represents the acoustic modality, $p$ represents a sample in the set $P(i)$, and $icm$ represents the personality comparison module.
5. The multi-modal emotion analysis method based on personality and commonality contrast staged guidance of claim 1, wherein the commonality contrast loss function is:
$$\mathcal{L}_{ccm} = \sum_{m \in \{l,v,a\}} \sum_{i=1}^{n} \frac{-1}{|M \setminus \{m\}|\,|V(i)|} \sum_{s \in M \setminus \{m\}} \sum_{j \in V(i)} \log \frac{\exp\big(z_m^{i,(2)} \cdot z_s^{j,(2)} / \tau\big)}{\sum_{q \in M \setminus \{m\}} \sum_{a=1}^{n} \exp\big(z_m^{i,(2)} \cdot z_q^{a,(2)} / \tau\big)}$$
wherein $z_m^{i,(2)}$ is the vector extracted by the function $g$ from the feature $h_m^{i,(2)}$ output by the $i$-th sample of modality $m$ in the second stage of the network, $M \setminus \{m\}$ represents the set obtained by removing modality $m$ from the three modalities, $V(i)$ indicates the set of indices of all views derived from the $i$-th sample, $j$ represents an index in the set $V(i)$, $s$ represents an index in the set $M \setminus \{m\}$, $z_s^{j,(2)}$ is the vector extracted by the function $g$ from the feature $h_s^{j,(2)}$ output by the $j$-th sample of modality $s$ in the second stage of the network, $q$ represents an index in the set $M \setminus \{m\}$, $z_q^{a,(2)}$ is the vector extracted by the function $g$ from the feature $h_q^{a,(2)}$ output by the $a$-th sample of modality $q$ in the second stage of the network, $\tau$ indicates a temperature coefficient with a value greater than zero, and $ccm$ represents the commonality comparison module.
6. The method for multimodal emotion analysis based on personality and commonality contrast staged guidance of claim 1, wherein inferring emotion values for video samples comprises:
converting the characterization data specific to each modality and the characterization data of the features shared among the modalities into a vector representation through maximum pooling, and obtaining the emotion value of the video sample after the vector representation is processed by two fully connected layers, a ReLU activation function and a Dropout layer;
and calculating emotion analysis loss by using root mean square error based on the emotion value.
7. The multi-modal emotion analysis method based on personality and commonality contrast staged guidance of claim 6, wherein the emotion analysis loss calculation method is as follows:
$$H^{i,(1)} = \big[h_l^{i,(1)};\, h_v^{i,(1)};\, h_a^{i,(1)}\big], \qquad H^{i,(2)} = \big[h_l^{i,(2)};\, h_v^{i,(2)};\, h_a^{i,(2)}\big]$$
$$\hat{y}^{i} = \mathrm{FC}\Big(\mathrm{MaxPool}\big(\big[H^{i,(1)};\, H^{i,(2)}\big]\big)\Big), \qquad \mathcal{L}_{msa} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(\hat{y}^{i} - y^{i}\big)^{2}}$$
wherein $\mathrm{MaxPool}$ represents maximum pooling, $\mathrm{FC}$ represents the two fully connected layers with the ReLU activation function and the Dropout layer, $\hat{y}^{i}$ represents the emotion value, $y^{i}$ indicates the true emotion label value of the $i$-th sample, $H^{i,(1)}$ indicates the characterization obtained by aggregating the language modality feature $h_l^{i,(1)}$, the visual modality feature $h_v^{i,(1)}$ and the acoustic modality feature $h_a^{i,(1)}$ output by the $i$-th sample through the first stage of the network, $h_l^{i,(1)}$, $h_v^{i,(1)}$ and $h_a^{i,(1)}$ respectively represent the features output by the $i$-th sample of the language, visual and acoustic modalities through the first stage of the network, $H^{i,(2)}$ indicates the characterization obtained by aggregating the corresponding second-stage features $h_l^{i,(2)}$, $h_v^{i,(2)}$ and $h_a^{i,(2)}$, $i$ indicates the index of a sample, $n$ indicates the number of samples in a batch, and $msa$ represents multi-modal emotion analysis.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410224455.5A CN117809229B (en) | 2024-02-29 | 2024-02-29 | Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410224455.5A CN117809229B (en) | 2024-02-29 | 2024-02-29 | Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117809229A true CN117809229A (en) | 2024-04-02 |
CN117809229B CN117809229B (en) | 2024-05-07 |
Family
ID=90428079
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410224455.5A Active CN117809229B (en) | 2024-02-29 | 2024-02-29 | Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117809229B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023084348A1 (en) * | 2021-11-12 | 2023-05-19 | Sony Group Corporation | Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network |
CN114973062A (en) * | 2022-04-25 | 2022-08-30 | 西安电子科技大学 | Multi-modal emotion analysis method based on Transformer |
CN115858726A (en) * | 2022-11-22 | 2023-03-28 | 天翼电子商务有限公司 | Multi-stage multi-modal emotion analysis method based on mutual information method representation |
Non-Patent Citations (1)
Title |
---|
Chen Jun et al., "Speech Emotion Recognition Based on a Multi-modal Combination Model", Software, no. 12, 15 December 2019 (2019-12-15) *
Also Published As
Publication number | Publication date |
---|---|
CN117809229B (en) | 2024-05-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||