CN117809229A - Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance - Google Patents
Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance
- Publication number
- CN117809229A CN117809229A CN202410224455.5A CN202410224455A CN117809229A CN 117809229 A CN117809229 A CN 117809229A CN 202410224455 A CN202410224455 A CN 202410224455A CN 117809229 A CN117809229 A CN 117809229A
- Authority
- CN
- China
- Prior art keywords
- modality
- features
- acoustic
- commonality
- emotion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a multi-modal emotion analysis method based on personalized and commonality comparison staged guidance, which comprises the following steps: extracting language features, acoustic features and visual features of a video sample; preprocessing the language features, the acoustic features and the visual features, and then performing two-stage high-level semantic feature extraction to obtain first-stage extraction data and second-stage extraction data; refining the first-stage extraction data with a personality contrast loss function to obtain characterization data specific to each modality; refining the second-stage extraction data with a commonality contrast loss function to obtain characterization data of the features shared among the modalities; and inferring an emotion value of the video sample based on the modality-specific characterization data and the shared characterization data. The invention can comprehensively exploit the multiple descriptive aspects of the data to infer the emotion polarity of the character in a video clip.
Description
Technical Field
The invention belongs to the technical fields of natural language processing, speech signal processing and computer vision, and particularly relates to a multi-modal emotion analysis method based on personalized and commonality comparison staged guidance.
Background
Emotion analysis is a challenging task in natural language processing that requires judging a person's emotion from the information provided by text. Human emotions are often varied; sometimes text alone cannot fully describe a person's emotion, and it is often difficult for a machine to interpret it correctly. With the development of social networking platforms, the modalities through which people express opinions have become increasingly rich; in particular, the advent of short videos allows people to convey their views through text, speech and actions. This has led to explosive growth of multi-modal data, and the object of emotion analysis has expanded to multi-modal data rather than being limited to text. Compared with single-modality emotion analysis that targets only text, multi-modal (e.g., text, visual and audio) emotion analysis judges a character's emotion more comprehensively and generalizes better; how to process and relate the multi-modal information is its central problem.
When research involves only a single modality, related work is abundant and the single-modality models that have been created are almost countless; for applications that only require a single modality, these models can be used with little extra configuration. However, as the number of modalities increases, the number of models that can be directly reused drops sharply, mainly because reasoning with a multi-modal model requires a fusion architecture across several modalities, and designing a reasonable interaction mechanism means carefully weighing the many factors that affect modality fusion. In addition, from the perspective of bionics, attempts have been made to construct multi-modal fusion mechanisms by observing and mimicking the behavior of humans or animals. Facing such complex fusion mechanisms, designing a multi-modal model with outstanding performance is very challenging; every excellent model requires careful hand-crafting by researchers, which consumes considerable time. Moreover, these models are typically designed for a specific multi-modal task on a fixed number of modalities; adding or removing a modality may render the model unusable or increase its computational complexity, and such models are difficult to migrate once built. If the multi-modal network is instead designed as a generic model, problems such as performance degradation or training difficulty may arise.
Disclosure of Invention
The invention aims to provide a multi-modal emotion analysis method based on personality and commonality comparison staged guidance, so as to comprehensively exploit the multiple descriptive aspects of the data to infer the emotion polarity of characters in video clips.
In order to achieve the above object, the present invention provides a multi-modal emotion analysis method based on personality and commonality comparison staged guidance, comprising:
extracting language features, acoustic features and visual features of the video sample;
preprocessing the language features, the acoustic features and the visual features, and then extracting high-level semantic features in two stages to obtain first-stage extraction data and second-stage extraction data;
extracting the first-stage extraction data by utilizing a personality contrast loss function to obtain characterization data specific to each modality;
extracting the second-stage extraction data by utilizing a commonality contrast loss function to obtain characterization data of the features shared among the modalities;
inferring an emotion value of the video sample based on the characterization data specific to each modality and the characterization data of the features shared among the modalities.
Optionally, preprocessing the language features, acoustic features, and visual features includes:
performing alignment processing on the language features, the acoustic features and the visual features in a time dimension by taking the text as a reference, and removing stop word parts irrelevant to emotion in the language features, the acoustic features and the visual features;
additional contrast views are generated for the linguistic, acoustic, and visual features using random zero vector substitution.
Optionally, performing two-stage high-level semantic feature extraction includes:
extracting corresponding high-level semantic features from the preprocessed language features, acoustic features and visual features by using three Transformer encoders; wherein the three Transformer encoders are divided into two stages, and each stage comprises several Transformer encoding layers;
the two-stage calculation method comprises the following steps:
$$h_m^{i,(1)} = E_m^{(1)}\big(x_m^{i}\big), \qquad h_m^{i,(2)} = E_m^{(2)}\big(h_m^{i,(1)}\big)$$
wherein $E_m^{(1)}$ and $E_m^{(2)}$ are the Transformer encoders of the first stage and the second stage respectively, $m \in \{l, v, a\}$, $l$ is the language modality, $v$ is the visual modality, $a$ is the acoustic modality, and $x_m^{i}$ is the $i$-th input sample of modality $m$.
Optionally, the personality-contrast loss function is:
$$\mathcal{L}_{icm} = \sum_{m \in \{l,v,a\}} \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(z_m^{i,(1)} \cdot z_m^{p,(1)} / \tau\big)}{\sum_{j \in A(i)} \exp\big(z_m^{i,(1)} \cdot z_m^{j,(1)} / \tau\big)}$$
wherein $z_m^{i,(1)}$ is the vector extracted by the function $g$ from the feature $h_m^{i,(1)}$ output by the $i$-th sample of modality $m$ in the first stage of the network, $g$ represents a multi-layer perceptron, $\tau$ indicates a temperature coefficient with a value greater than zero, $A(i)$ indicates the index set with $i$ removed, and $P(i)$ indicates the set of indices whose labels are identical to that of the $i$-th sample and which belong to the set $A(i)$.
Optionally, the commonality contrast loss function is:
$$\mathcal{L}_{ccm} = \sum_{m \in \{l,v,a\}} \sum_{i=1}^{n} \frac{-1}{|M \setminus \{m\}|\,|V(i)|} \sum_{s \in M \setminus \{m\}} \sum_{j \in V(i)} \log \frac{\exp\big(z_m^{i,(2)} \cdot z_s^{j,(2)} / \tau\big)}{\sum_{q \in M \setminus \{m\}} \sum_{a=1}^{n} \exp\big(z_m^{i,(2)} \cdot z_q^{a,(2)} / \tau\big)}$$
wherein $z_m^{i,(2)}$ is the vector extracted by the function $g$ from the feature $h_m^{i,(2)}$ output by the $i$-th sample of modality $m$ in the second stage of the network, and $M \setminus \{m\}$ represents the set obtained by removing modality $m$ from the three modalities.
Optionally, inferring emotion values for the video samples includes:
converting the characterization data specific to each modality and the characterization data of the features shared among the modalities into a vector representation through maximum pooling, and obtaining the emotion value of the video sample after the vector representation is processed by two fully connected layers, a ReLU activation function and a Dropout layer;
and calculating emotion analysis loss by using root mean square error based on the emotion value.
Optionally, the method for calculating the emotion analysis loss includes:
$$H^{i,(1)} = \big[h_l^{i,(1)};\, h_v^{i,(1)};\, h_a^{i,(1)}\big], \qquad H^{i,(2)} = \big[h_l^{i,(2)};\, h_v^{i,(2)};\, h_a^{i,(2)}\big]$$
$$\hat{y}^{i} = \mathrm{FC}\Big(\mathrm{MaxPool}\big(\big[H^{i,(1)};\, H^{i,(2)}\big]\big)\Big), \qquad \mathcal{L}_{msa} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(\hat{y}^{i} - y^{i}\big)^{2}}$$
wherein $\mathrm{MaxPool}$ represents maximum pooling, $\mathrm{FC}$ represents the two fully connected layers with the ReLU activation function and the Dropout layer, $\hat{y}^{i}$ represents the emotion value, $y^{i}$ indicates the true emotion label value of the $i$-th sample, $H^{i,(1)}$ indicates the characterization obtained by aggregating the language modality feature $h_l^{i,(1)}$, the visual modality feature $h_v^{i,(1)}$ and the acoustic modality feature $h_a^{i,(1)}$ output by the $i$-th sample through the first stage of the network, $H^{i,(2)}$ indicates the characterization obtained by aggregating the corresponding second-stage features $h_l^{i,(2)}$, $h_v^{i,(2)}$ and $h_a^{i,(2)}$, $i$ indicates the index of a sample, and $n$ indicates the number of samples in a batch.
The invention has the following beneficial effects:
according to the invention, the first-stage extraction data are refined with the personality contrast loss function, and the resulting characterization data specific to each modality help to understand the emotional context and avoid ambiguity; the second-stage extraction data are refined with the commonality contrast loss function, and the resulting characterization data of the features shared among the modalities help to reinforce the emotional tone and improve the model's ability to reason about a character's emotion.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
FIG. 1 is a schematic diagram of an overall framework of a multi-modal emotion analysis method based on personality and commonality contrast staged guidance in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data preprocessing module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a personality comparison module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a commonality comparison module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an emotion prediction module according to an embodiment of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
As shown in fig. 1, the embodiment provides a multi-modal emotion analysis method based on personality and commonality comparison staged guidance, which includes the following steps:
extracting language features, acoustic features and visual features of the video sample;
more specifically, word embedding features of the transcribed text are extracted using a pre-trained BERT model, visual features are extracted using the OpenFace tool library, and acoustic features are extracted using the COVAREP tool library;
in this embodiment, the text content transcribed from a video clip is first converted into word embeddings as language features by the language model BERT pre-trained on a large dataset; information such as the positions of facial key points, facial action units and head pose changes of the character in the video clip is detected by the open-source facial behavior analysis tool library OpenFace as visual features; and information such as Mel-frequency cepstral coefficients, fundamental frequency, short-time jitter parameters and frame energy is extracted from the character's audio by the open-source speech processing tool library COVAREP as acoustic features. The whole process is shown in fig. 2.
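As a minimal sketch of the language-feature step only (assuming the visual and acoustic features are pre-extracted with OpenFace and COVAREP outside Python), the snippet below obtains BERT word embeddings for a transcribed utterance with the Hugging Face transformers library; the checkpoint name and the example sentence are illustrative choices, not values fixed by this embodiment:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # illustrative checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

transcript = "I really enjoyed this movie"                       # transcribed text of a video clip
inputs = tokenizer(transcript, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
language_features = outputs.last_hidden_state                    # (1, seq_len, 768) word embeddings
print(language_features.shape)
```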
Preprocessing the language features, the acoustic features and the visual features, and then extracting high-level semantic features in two stages to obtain first-stage extraction data and second-stage extraction data;
preprocessing the language features, acoustic features, and visual features includes:
performing alignment processing on the language features, the acoustic features and the visual features in a time dimension by taking the text as a reference, and removing stop word parts irrelevant to emotion in the language features, the acoustic features and the visual features;
additional contrast views are generated for the linguistic, acoustic, and visual features using random zero vector substitution.
More specifically, the embodiment aligns three modal data in the time dimension based on text, and removes stop word portions unrelated to emotion therein. All three modality data are time series data, expressed mathematically as matrix tensors.
To be able to calculate the contrastive loss, this embodiment generates, before the three modality data are input into the model, additional views for each sample of each modality as comparison objects with similar representations. A view is generated by randomly replacing some time steps in the sequence data with zero vectors. The additional views serve as positive samples for the comparison; they also prevent the denominator from becoming zero when the contrastive loss is computed over a small batch of samples and improve the quality of the features extracted by the encoders, and different numbers of views have different effects on model performance.
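A minimal sketch of the view-generation step described above, assuming each aligned modality sequence is stored as a (seq_len, dim) tensor; the 15% replacement ratio is an illustrative assumption rather than a value given by this embodiment:

```python
import torch

def make_contrast_view(seq: torch.Tensor, drop_ratio: float = 0.15) -> torch.Tensor:
    """Return an additional view of a (seq_len, dim) modality sequence by randomly
    replacing a fraction of its time steps with zero vectors."""
    view = seq.clone()
    num_steps = seq.size(0)
    num_drop = max(1, int(num_steps * drop_ratio))
    drop_idx = torch.randperm(num_steps)[:num_drop]   # time steps to zero out
    view[drop_idx] = 0.0
    return view

# e.g. one extra positive view per aligned language-feature sequence
text_seq = torch.randn(50, 768)
text_view = make_contrast_view(text_seq)
```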
The high-level semantic feature extraction of the two stages comprises:
extracting corresponding high-level semantic features from the preprocessed language features, acoustic features and visual features by using three Transformer encoders; wherein the three Transformer encoders are divided into two stages, and each stage comprises several Transformer encoding layers;
In this embodiment, three Transformer encoders are used to extract corresponding high-level semantic features from the data of each modality. In this process, the three Transformer encoders are divided into two stages, each of which is composed of multiple Transformer encoding layers, and the calculation process is as follows:
$$h_m^{i,(1)} = E_m^{(1)}\big(x_m^{i}\big), \qquad h_m^{i,(2)} = E_m^{(2)}\big(h_m^{i,(1)}\big)$$
wherein $E_m^{(1)}$ and $E_m^{(2)}$ are the Transformer encoders of the first stage and the second stage respectively, $m \in \{l, v, a\}$, $l$ is the language modality, $v$ is the visual modality, $a$ is the acoustic modality, and $x_m^{i}$ is the $i$-th input sample of modality $m$.
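The following is a minimal PyTorch sketch of the two-stage encoder for one modality, assuming standard nn.TransformerEncoder layers and that each modality's input has already been projected to a common dimension; the layer counts, hidden size and number of heads are illustrative assumptions rather than values fixed by this embodiment:

```python
import torch
import torch.nn as nn

class TwoStageEncoder(nn.Module):
    """One modality's Transformer encoder split into a first stage E1 and a second stage E2."""

    def __init__(self, dim: int = 128, heads: int = 4, layers_per_stage: int = 2):
        super().__init__()
        def make_stage():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=layers_per_stage)
        self.stage1 = make_stage()   # E^(1): guided by the personality contrast loss
        self.stage2 = make_stage()   # E^(2): guided by the commonality contrast loss

    def forward(self, x: torch.Tensor):
        h1 = self.stage1(x)          # first-stage high-level semantic features
        h2 = self.stage2(h1)         # second-stage high-level semantic features
        return h1, h2

# one encoder per modality (language, visual, acoustic)
encoder_language = TwoStageEncoder()
h1_l, h2_l = encoder_language(torch.randn(8, 50, 128))   # (batch, seq_len, dim)
```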
Extracting the first-stage extraction data by utilizing a personality contrast loss function to obtain characterization data specific to each modality;
extracting the second-stage extraction data by utilizing a commonality contrast loss function to obtain characterization data of the features shared among the modalities;
more specifically, the features extracted by each encoder in the first stage are input into the personality comparison module, which generates a personality contrast loss specific to each modality; this loss guides the update of the network parameters during back propagation, so that the features generated by the encoders in the first stage carry information specific to each modality. The features extracted by each encoder in the second stage are input into the commonality comparison module, which generates a commonality contrast loss over the three modalities; in the same way as the personality contrast loss, this commonality contrast loss guides the encoders to generate, in the second stage, characterizations with the features shared among the modalities.
In the personality comparison module shown in fig. 3, this embodiment receives the samples of every modality in the same batch. Multiple views belonging to the same sample and the same modality are required to have feature representations that are as close as possible in the mapping space, and are therefore treated as positive pairs; all views belonging to different samples but the same modality are required to have feature representations that stay relatively far apart in the mapping space, and are therefore treated as negative pairs. This goal can be learned with the contrastive loss function:
$$\ell_m^{i} = \frac{-1}{|V(i)|} \sum_{p \in V(i)} \log \frac{\exp\big(z_m^{i,(1)} \cdot z_m^{p,(1)} / \tau\big)}{\sum_{j \in A(i)} \exp\big(z_m^{i,(1)} \cdot z_m^{j,(1)} / \tau\big)}$$
wherein $z_m^{i,(1)} = g\big(h_m^{i,(1)}\big)$, $g$ is a multi-layer perceptron, $\tau$ is the temperature coefficient, $A(i)$ is the index set of all views in the batch with $i$ removed, and $V(i)$ is the index set of all views derived from the $i$-th sample. After learning, the feature representation of each modality of each sample is closely related to the nature of the modality itself. In addition, for the features extracted under the guidance of the personality comparison module, this embodiment also requires the information they contain to be highly relevant to the task; therefore, using the classification labels of the task, all views belonging to different samples but having the same label and the same modality are also treated as positive pairs, so that their features are sufficiently close to each other in the mapping space. This is effective because samples assigned task-related labels themselves contain information related to the label attributes, and this information should be similar when the labels are the same. This embodiment can learn this goal using the supervised contrastive loss function:
$$\mathcal{L}_{icm} = \sum_{m \in \{l,v,a\}} \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(z_m^{i,(1)} \cdot z_m^{p,(1)} / \tau\big)}{\sum_{j \in A(i)} \exp\big(z_m^{i,(1)} \cdot z_m^{j,(1)} / \tau\big)}$$
wherein $P(i)$ is the set of indices in $A(i)$ whose labels are identical to that of the $i$-th sample. It should be noted that, since emotion analysis is a regression task, there is no ready-made task label with which to classify samples, so this embodiment rounds the true regression value and uses it as a classification label. Because the range of the emotion regression value is generally [-3, 3], 7 categories are obtained (i.e., -3, -2, -1, 0, 1, 2, 3), which in practice correspond to the 7 degrees from extremely negative to extremely positive in emotion analysis. This embodiment applies the personality comparison module to the first stage of the encoders, so that the features extracted by the encoders in the first stage are highly correlated with the modalities and the task.
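A minimal sketch of a supervised contrastive loss of the kind the personality comparison module relies on, for one modality; it assumes the projected first-stage features z have already passed through the MLP g and been L2-normalized, and that labels are integers obtained by rounding the emotion value. The exact positive/negative bookkeeping of this embodiment may differ:

```python
import torch
import torch.nn.functional as F

def personality_contrast_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over one modality's projected first-stage features.

    z: (num_views, d) L2-normalized projections (outputs of the MLP g).
    labels: (num_views,) integer labels obtained by rounding the emotion value.
    """
    sim = z @ z.t() / tau                                        # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask   # same label, not itself
    sim = sim.masked_fill(self_mask, float("-inf"))              # denominator ranges over A(i)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()                      # anchors with at least one positive

labels = torch.randint(-3, 4, (16,))                             # rounded emotion values in [-3, 3]
z = F.normalize(torch.randn(16, 64), dim=1)                      # projected views of one modality
loss_icm = personality_contrast_loss(z, labels)
```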
As shown in FIG. 4, the commonality comparison module is used for refining information related to the commonalities of the modalities. Unlike the personality comparison module, the commonality comparison module needs to receive features from multiple modalities simultaneously, and therefore does not depend on whether a single modality has multiple views. The multiple modalities of the same sample are treated as positive pairs, whose feature representations are required to be sufficiently close in the mapping space; all modalities belonging to different samples are treated as negative pairs, whose feature representations are required to stay relatively far apart in the mapping space. Likewise, this goal can be learned with the contrastive loss function:
$$\mathcal{L}_{ccm} = \sum_{m \in \{l,v,a\}} \sum_{i=1}^{n} \frac{-1}{|M \setminus \{m\}|\,|V(i)|} \sum_{s \in M \setminus \{m\}} \sum_{j \in V(i)} \log \frac{\exp\big(z_m^{i,(2)} \cdot z_s^{j,(2)} / \tau\big)}{\sum_{q \in M \setminus \{m\}} \sum_{a=1}^{n} \exp\big(z_m^{i,(2)} \cdot z_q^{a,(2)} / \tau\big)}$$
wherein $z_m^{i,(2)} = g\big(h_m^{i,(2)}\big)$ and $M \setminus \{m\}$ is the set obtained by removing modality $m$ from the three modalities. It should be noted that, in the commonality comparison module, if a modality has multiple views, these additional views need not participate in the comparison of that modality, since their feature representations have already been pulled together in the first stage of the encoders under the guidance of the personality comparison module. This embodiment applies the commonality comparison module to the second stage of the encoders, so that the features extracted by the encoders in the second stage are related to the commonalities across the modalities.
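A minimal sketch of a cross-modal contrastive loss in the spirit of the commonality comparison module: for each sample, its projected second-stage features from different modalities are pulled together and features from other samples are pushed apart. The symmetric InfoNCE form over modality pairs used here is an assumption about the exact formulation:

```python
import torch
import torch.nn.functional as F

def commonality_contrast_loss(z_by_modality: dict, tau: float = 0.1) -> torch.Tensor:
    """Cross-modal contrastive loss over projected second-stage features.

    z_by_modality maps 'l', 'v', 'a' to (n, d) L2-normalized projections; row i of every
    tensor comes from the i-th sample, so matching rows across modalities are positives.
    """
    mods = list(z_by_modality)
    n = next(iter(z_by_modality.values())).size(0)
    target = torch.arange(n)                                     # positive = same sample index
    losses = []
    for m in mods:
        for s in mods:
            if s == m:
                continue
            sim = z_by_modality[m] @ z_by_modality[s].t() / tau  # (n, n) cross-modal similarities
            losses.append(F.cross_entropy(sim, target))
    return torch.stack(losses).mean()

z = {m: F.normalize(torch.randn(8, 64), dim=1) for m in ("l", "v", "a")}
loss_ccm = commonality_contrast_loss(z)
```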
Inferring an emotion value of the video sample based on the characterization data specific to each modality and the characterization data of the features shared among the modalities.
In this embodiment, the intra-modality features refined under the guidance of the personality contrast module help to understand the emotional context and avoid ambiguity, and the inter-modality features refined under the guidance of the commonality contrast module help to reinforce the emotional tone and improve the model's ability to reason about a character's emotion. The comparison modules designed in this embodiment are used only during training and are completely removed at inference time; they do not alter the structure of the original network, can be extended to any modality, and add very little computational complexity, which improves the flexibility of the model in processing multi-modal data.
In this embodiment, the features extracted by the three encoders in the first stage and the second stage are input to the emotion prediction module, as shown in fig. 5. In the emotion prediction module, the multiple features are converted into a single vector representation by maximum pooling and then passed through two fully connected layers, a ReLU activation function and a Dropout layer to obtain the emotion value $\hat{y}^{i}$ of the sample. Finally, the emotion analysis loss is calculated with the root mean square error:
$$\hat{y}^{i} = \mathrm{FC}\Big(\mathrm{MaxPool}\big(\big[H^{i,(1)};\, H^{i,(2)}\big]\big)\Big), \qquad \mathcal{L}_{msa} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(\hat{y}^{i} - y^{i}\big)^{2}}$$
wherein $\mathrm{MaxPool}$ represents maximum pooling, $\mathrm{FC}$ represents the two fully connected layers with the ReLU activation function and the Dropout layer, and $H^{i,(1)}$ and $H^{i,(2)}$ aggregate the first-stage and second-stage features of the three modalities for the $i$-th sample. The final loss function can be expressed as:
$$\mathcal{L} = \mathcal{L}_{msa} + \alpha\,\mathcal{L}_{icm} + \beta\,\mathcal{L}_{ccm}$$
wherein $\alpha$ and $\beta$ are hyperparameters that balance the different losses.
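A minimal sketch of the emotion prediction head and the combined training objective, assuming the stage-wise features of the three modalities have already been concatenated along the time dimension before max pooling; the hidden size, the balance weights and the stand-in loss values are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionHead(nn.Module):
    """Max pooling over time, then two fully connected layers with ReLU and Dropout."""

    def __init__(self, dim: int = 128, hidden: int = 64, p_drop: float = 0.2):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p_drop)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        pooled = feats.max(dim=1).values                 # max pooling over the time dimension
        return self.fc2(self.dropout(self.relu(self.fc1(pooled)))).squeeze(-1)

# stand-ins for the concatenated first- and second-stage features of the three modalities
feats = torch.randn(8, 300, 128)                         # (batch, total_len, dim)
y_true = torch.randn(8)                                  # true emotion values in [-3, 3]
loss_icm, loss_ccm = torch.tensor(0.5), torch.tensor(0.4)   # values from the contrast modules

head = EmotionHead()
y_hat = head(feats)
loss_msa = torch.sqrt(F.mse_loss(y_hat, y_true))         # root mean square error
alpha, beta = 0.1, 0.1                                   # illustrative balance weights
total_loss = loss_msa + alpha * loss_icm + beta * loss_ccm
```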
To illustrate the effect of the invention, this embodiment selects some existing methods for comparison and performs verification on the CMU-MOSI multi-modal benchmark dataset. The evaluation metrics are the correlation coefficient Corr, the classification accuracy Acc and the F1 score. The experimental results are shown in Table 1 below:
TABLE 1
The experimental results show that, on the CMU-MOSI test set, the invention achieves a considerable improvement over the baseline systems on all three metrics: the correlation coefficient, the classification accuracy and the F1 score.
The experimental results of this embodiment on the CMU-MOSI dataset show that the proposed method markedly improves the performance of multi-modal emotion analysis and is clearly superior to multiple baseline models.
In this embodiment, since emotion analysis based on text alone cannot cope with the complex mechanisms of emotion and the rich multi-modal data available today, an emotion analysis model that accepts multi-modal data is designed, so as to comprehensively exploit the multiple descriptive aspects of the data to infer the emotion polarity of characters in video clips. In addition, to address the difficulty of designing multi-modal interactions, a training scheme that requires no explicit interaction architecture is proposed. A personality comparison module and a commonality comparison module are introduced, and the comparison modules guide the single-modality encoders to refine their features, so that the features extracted by the encoders carry both single-modality personalities and multi-modality commonalities. This implicit interaction between modalities avoids the trouble of manually designing an explicit interaction process, so this embodiment can concentrate on the selection of single-modality networks, enjoy the advantages they bring, and accelerate training.
The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (7)
1. The multi-modal emotion analysis method based on personalized and commonality comparison staged guidance is characterized by comprising the following steps of:
extracting language features, acoustic features and visual features of the video sample;
preprocessing the language features, the acoustic features and the visual features, and then extracting high-level semantic features in two stages to obtain first-stage extraction data and second-stage extraction data;
extracting the first-stage extraction data by utilizing a personality contrast loss function to obtain characterization data specific to each modality;
extracting the second-stage extraction data by utilizing a commonality contrast loss function to obtain characterization data of the features shared among the modalities;
inferring an emotion value of the video sample based on the characterization data specific to each modality and the characterization data of the features shared among the modalities.
2. The method of multimodal emotion analysis based on staged guidance of personality and commonality contrast of claim 1, wherein preprocessing the linguistic, acoustic, and visual features comprises:
performing alignment processing on the language features, the acoustic features and the visual features in a time dimension by taking the text as a reference, and removing stop word parts irrelevant to emotion in the language features, the acoustic features and the visual features;
additional contrast views are generated for the linguistic, acoustic, and visual features using random zero vector substitution.
3. The multi-modal emotion analysis method based on personality and commonality contrast staged guidance of claim 1, wherein performing two stages of high level semantic feature extraction comprises:
extracting corresponding high-level semantic features from the preprocessed language features, acoustic features and visual features by using three Transformer encoders; wherein the three Transformer encoders are divided into two stages, and each stage comprises several Transformer encoding layers;
the two-stage calculation method comprises the following steps:
$$h_m^{i,(1)} = E_m^{(1)}\big(x_m^{i}\big), \qquad h_m^{i,(2)} = E_m^{(2)}\big(h_m^{i,(1)}\big)$$
wherein $E_m^{(1)}$ and $E_m^{(2)}$ are the Transformer encoders of the first stage and the second stage respectively, $m \in \{l, v, a\}$, $l$ is the language modality, $v$ is the visual modality, $a$ is the acoustic modality, and $x_m^{i}$ is the $i$-th input sample of modality $m$.
4. The multi-modal emotion analysis method based on personality and commonality contrast staged guidance of claim 1, wherein the personality contrast loss function is:
$$\mathcal{L}_{icm} = \sum_{m \in \{l,v,a\}} \sum_{i=1}^{kn} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(z_m^{i,(1)} \cdot z_m^{p,(1)} / \tau\big)}{\sum_{j \in A(i)} \exp\big(z_m^{i,(1)} \cdot z_m^{j,(1)} / \tau\big)}$$
wherein $z_m^{i,(1)}$ is the vector extracted by the function $g$ from the feature $h_m^{i,(1)}$ output by the $i$-th sample of modality $m$ in the first stage of the network, $g$ represents a multi-layer perceptron, $\tau$ indicates a temperature coefficient with a value greater than zero, $A(i)$ indicates the index set with $i$ removed, $P(i)$ indicates the set of indices whose labels are identical to that of the $i$-th sample and which belong to the set $A(i)$, $k$ represents the number of enhanced views, $n$ represents the number of samples in a batch, $a$ represents the acoustic modality, $p$ represents a sample in the set $P(i)$, and $icm$ represents the personality comparison module.
5. The multi-modal emotion analysis method based on personality and commonality contrast staged guidance of claim 1, wherein the commonality contrast loss function is:
$$\mathcal{L}_{ccm} = \sum_{m \in \{l,v,a\}} \sum_{i=1}^{n} \frac{-1}{|M \setminus \{m\}|\,|V(i)|} \sum_{s \in M \setminus \{m\}} \sum_{j \in V(i)} \log \frac{\exp\big(z_m^{i,(2)} \cdot z_s^{j,(2)} / \tau\big)}{\sum_{q \in M \setminus \{m\}} \sum_{a=1}^{n} \exp\big(z_m^{i,(2)} \cdot z_q^{a,(2)} / \tau\big)}$$
wherein $z_m^{i,(2)}$ is the vector extracted by the function $g$ from the feature $h_m^{i,(2)}$ output by the $i$-th sample of modality $m$ in the second stage of the network, $M \setminus \{m\}$ represents the set obtained by removing modality $m$ from the three modalities, $V(i)$ indicates the set of indices of all views derived from the $i$-th sample, $j$ represents an index in the set $V(i)$, $s$ represents an index in the set $M \setminus \{m\}$, $z_s^{j,(2)}$ is the vector extracted by the function $g$ from the feature $h_s^{j,(2)}$ output by the $j$-th sample of modality $s$ in the second stage of the network, $q$ represents an index in the set $M \setminus \{m\}$, $z_q^{a,(2)}$ is the vector extracted by the function $g$ from the feature $h_q^{a,(2)}$ output by the $a$-th sample of modality $q$ in the second stage of the network, $\tau$ indicates a temperature coefficient with a value greater than zero, and $ccm$ represents the commonality comparison module.
6. The method for multimodal emotion analysis based on personality and commonality contrast staged guidance of claim 1, wherein inferring emotion values for video samples comprises:
converting the characterization data specific to each modality and the characterization data of the features shared among the modalities into a vector representation through maximum pooling, and obtaining the emotion value of the video sample after the vector representation is processed by two fully connected layers, a ReLU activation function and a Dropout layer;
and calculating emotion analysis loss by using root mean square error based on the emotion value.
7. The multi-modal emotion analysis method based on personality and commonality contrast staged guidance of claim 6, wherein the emotion analysis loss calculation method is as follows:
$$H^{i,(1)} = \big[h_l^{i,(1)};\, h_v^{i,(1)};\, h_a^{i,(1)}\big], \qquad H^{i,(2)} = \big[h_l^{i,(2)};\, h_v^{i,(2)};\, h_a^{i,(2)}\big]$$
$$\hat{y}^{i} = \mathrm{FC}\Big(\mathrm{MaxPool}\big(\big[H^{i,(1)};\, H^{i,(2)}\big]\big)\Big), \qquad \mathcal{L}_{msa} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(\hat{y}^{i} - y^{i}\big)^{2}}$$
wherein $\mathrm{MaxPool}$ represents maximum pooling, $\mathrm{FC}$ represents the two fully connected layers with the ReLU activation function and the Dropout layer, $\hat{y}^{i}$ represents the emotion value, $y^{i}$ indicates the true emotion label value of the $i$-th sample, $H^{i,(1)}$ indicates the characterization obtained by aggregating the language modality feature $h_l^{i,(1)}$, the visual modality feature $h_v^{i,(1)}$ and the acoustic modality feature $h_a^{i,(1)}$ output by the $i$-th sample through the first stage of the network, $h_l^{i,(1)}$, $h_v^{i,(1)}$ and $h_a^{i,(1)}$ respectively represent the features output by the $i$-th sample of the language, visual and acoustic modalities through the first stage of the network, $H^{i,(2)}$ indicates the characterization obtained by aggregating the corresponding second-stage features $h_l^{i,(2)}$, $h_v^{i,(2)}$ and $h_a^{i,(2)}$, $i$ indicates the index of a sample, $n$ indicates the number of samples in a batch, and $msa$ represents multi-modal emotion analysis.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410224455.5A CN117809229B (en) | 2024-02-29 | 2024-02-29 | Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410224455.5A CN117809229B (en) | 2024-02-29 | 2024-02-29 | Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117809229A true CN117809229A (en) | 2024-04-02 |
CN117809229B CN117809229B (en) | 2024-05-07 |
Family
ID=90428079
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410224455.5A Active CN117809229B (en) | 2024-02-29 | 2024-02-29 | Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117809229B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023084348A1 (en) * | 2021-11-12 | 2023-05-19 | Sony Group Corporation | Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network |
CN114973062A (en) * | 2022-04-25 | 2022-08-30 | 西安电子科技大学 | Multi-modal emotion analysis method based on Transformer |
CN115858726A (en) * | 2022-11-22 | 2023-03-28 | 天翼电子商务有限公司 | Multi-stage multi-modal emotion analysis method based on mutual information method representation |
Non-Patent Citations (1)
Title |
---|
Chen Jun et al., "Speech Emotion Recognition Based on a Multi-modal Combination Model", Software, no. 12, 15 December 2019 (2019-12-15) *
Also Published As
Publication number | Publication date |
---|---|
CN117809229B (en) | 2024-05-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||