CN117809229B - Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance - Google Patents

Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance

Info

Publication number
CN117809229B
CN117809229B (application CN202410224455.5A)
Authority
CN
China
Prior art keywords
features
modality
stage
acoustic
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410224455.5A
Other languages
Chinese (zh)
Other versions
CN117809229A (en)
Inventor
杨振国
刘达煌
郭志玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202410224455.5A priority Critical patent/CN117809229B/en
Publication of CN117809229A publication Critical patent/CN117809229A/en
Application granted granted Critical
Publication of CN117809229B publication Critical patent/CN117809229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a multi-modal emotion analysis method based on personality and commonality contrast staged guidance, which comprises the following steps: extracting language features, acoustic features and visual features of a video sample; preprocessing the language, acoustic and visual features and then performing two-stage high-level semantic feature extraction to obtain first-stage extraction data and second-stage extraction data; refining the first-stage extraction data with a personality contrast loss function to obtain characterization data specific to each modality's features; refining the second-stage extraction data with a commonality contrast loss function to obtain characterization data of the features shared among modalities; and inferring an emotion value of the video sample based on the characterization data specific to each modality and the characterization data of the features shared among modalities. The invention can comprehensively exploit the multiple facets in which the data describe the character to infer the emotion polarity of the character in the video clip.

Description

Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance
Technical Field
The invention belongs to the technical fields of natural language processing, voice signal processing and computer vision, and particularly relates to a multi-modal emotion analysis method based on personalized and commonality comparison staged guidance.
Background
Emotion analysis is a challenging task in natural language processing that requires judging a person's emotion from the information provided by text. Human emotions are varied; sometimes text cannot fully describe a person's emotion, and it is often difficult for a machine to understand it correctly. With the development of social networking platforms, the modalities through which people express opinions have become increasingly rich; in particular, the advent of short videos allows people to describe their views through text, speech and actions. This has led to explosive growth of multi-modal data, and as such data become widespread, the object of emotion analysis expands to multi-modal data and is no longer limited to text. Compared with single-modal emotion analysis that considers only text, multi-modal (e.g., text, visual, audio) emotion analysis judges a character's emotion more comprehensively and generalizes better; how to process and relate multi-modal information is its main problem.
When research involves only a single modality, there is abundant prior work in the related fields and countless single-modality models have been created; these models can be applied without much configuration to tasks that require only one modality. However, when the number of modalities increases, such models can no longer be used directly, mainly because inference with a multi-modal model must consider the fusion architecture across multiple modalities, and designing a reasonable interaction mechanism requires carefully weighing the many factors that affect modality fusion. In addition, from the standpoint of bionics, attempts have been made to construct multi-modal fusion mechanisms by observing and mimicking the behavior of humans or animals. Facing such complex fusion mechanisms, designing a multi-modal model with outstanding performance is very challenging; each outstanding model requires careful manual design by researchers and therefore a significant amount of time. Moreover, these models are typically designed for specific multi-modal tasks with a fixed number of modalities; adding or removing a modality may render them unusable or increase computational complexity, and they are difficult to migrate once built. If a multi-modal network is instead designed as a generic model, problems such as performance degradation or training difficulty may arise.
Disclosure of Invention
The invention aims to provide a multi-modal emotion analysis method based on personality and commonality contrast staged guidance, so as to comprehensively exploit the multiple facets in which the data describe the character and infer the emotion polarity of characters in video clips.
In order to achieve the above object, the present invention provides a multi-modal emotion analysis method based on personality and commonality comparison staged guidance, comprising:
Extracting language features, acoustic features and visual features of the video sample;
Preprocessing the language features, the acoustic features and the visual features, and then extracting high-level semantic features in two stages to obtain first-stage extraction data and second-stage extraction data;
extracting the first-stage extraction data by utilizing a personality contrast loss function to obtain characterization data specific to each modal characteristic;
extracting the second-stage extraction data by utilizing a commonality contrast loss function to obtain characterization data of the sharing characteristics among the modes;
inferring an emotion value of the video sample based on the characterization data specific to each modality characteristic and the characterization data of the inter-modality sharing characteristic.
Optionally, preprocessing the language features, acoustic features, and visual features includes:
Performing alignment processing on the language features, the acoustic features and the visual features in a time dimension by taking the text as a reference, and removing stop word parts irrelevant to emotion in the language features, the acoustic features and the visual features;
additional contrast views are generated for the linguistic, acoustic, and visual features using random zero vector substitution.
Optionally, performing two-stage high-level semantic feature extraction includes:
Extracting corresponding high-level semantic features from the preprocessed language features, acoustic features and visual features by using three Transformer encoders; wherein the three Transformer encoders are divided into two stages, and each stage includes several Transformer encoding layers;
The two-stage calculation is:
$h_i^{m,1} = E_1^m(x_i^m)$, $h_i^{m,2} = E_2^m(h_i^{m,1})$
wherein $E_1^m$ and $E_2^m$ are the Transformer encoders of the first stage and the second stage respectively, $m \in \{l, v, a\}$, $l$ is the language modality, $v$ is the visual modality, $a$ is the acoustic modality, and $x_i^m$ is the input of the $i$-th sample of modality $m$.
Optionally, the personality contrast loss function is:
$$\mathcal{L}_{icm}=\sum_{m\in\{l,v,a\}}\sum_{i}\frac{-1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(z_i^{m,1}\cdot z_p^{m,1}/\tau\right)}{\sum_{j\in A(i)}\exp\left(z_i^{m,1}\cdot z_j^{m,1}/\tau\right)}$$
wherein $z_i^{m,1}=g(h_i^{m,1})$ denotes the vector extracted by the function $g$ from the features $h_i^{m,1}$ of the $i$-th sample of modality $m$ output in the first stage of the network, $g$ denotes a multi-layer perceptron, $\tau$ denotes a temperature coefficient with a value greater than zero, $A(i)$ denotes the index set with $i$ removed, and $P(i)$ denotes the set of indices whose samples have the same label as sample $i$ but belong to the set $A(i)$.
Optionally, the commonality contrast loss function is:
$$\mathcal{L}_{ccm}=\sum_{m\in\{l,v,a\}}\sum_{i=1}^{n}\sum_{s\in S(m)}-\log\frac{\exp\left(z_i^{m,2}\cdot z_i^{s,2}/\tau\right)}{\sum_{q\in S(m)}\sum_{j=1}^{n}\exp\left(z_i^{m,2}\cdot z_j^{q,2}/\tau\right)}$$
wherein $z_i^{m,2}=g(h_i^{m,2})$ denotes the vector extracted by the function $g$ from the features $h_i^{m,2}$ of the $i$-th sample of modality $m$ output in the second stage of the network, and $S(m)$ denotes the set of the three modalities with modality $m$ removed.
Optionally, inferring emotion values for the video samples includes:
Converting the characterization data specific to each modality's features and the characterization data of the features shared among modalities into a vector representation through maximum pooling, and obtaining the emotion value of the video sample after the vector representation is processed by two fully-connected layers, a ReLU activation function and a Dropout layer;
And calculating emotion analysis loss by using root mean square error based on the emotion value.
Optionally, the emotion analysis loss is calculated as:
$$\hat{y}_i=\mathrm{FC}\!\left(\left[\mathrm{MaxPool}\!\left(h_i^{l,1},h_i^{v,1},h_i^{a,1}\right);\ \mathrm{MaxPool}\!\left(h_i^{l,2},h_i^{v,2},h_i^{a,2}\right)\right]\right),\qquad \mathcal{L}_{msa}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^{2}}$$
wherein $\mathrm{MaxPool}(\cdot)$ denotes maximum pooling, $\mathrm{FC}(\cdot)$ denotes the two fully-connected layers with the ReLU activation function and the Dropout layer, $\hat{y}_i$ denotes the emotion value, $y_i$ denotes the true emotion label value of the $i$-th sample, $h_i^{l,1}$, $h_i^{v,1}$ and $h_i^{a,1}$ denote the features of the $i$-th sample of the language, visual and acoustic modalities output by the first stage of the network, $\mathrm{MaxPool}(h_i^{l,1},h_i^{v,1},h_i^{a,1})$ denotes the characterization obtained by aggregating these first-stage features through maximum pooling, $h_i^{l,2}$, $h_i^{v,2}$ and $h_i^{a,2}$ denote the features of the $i$-th sample of the language, visual and acoustic modalities output by the second stage of the network, $\mathrm{MaxPool}(h_i^{l,2},h_i^{v,2},h_i^{a,2})$ denotes the characterization obtained by aggregating these second-stage features through maximum pooling, $i$ denotes the index of a sample, and $n$ denotes the number of samples in a batch.
The invention has the following beneficial effects:
According to the invention, the personality contrast loss function is used to refine the data extracted in the first stage, and the resulting characterization data specific to each modality's features helps the model understand the emotional context and avoid ambiguity; the commonality contrast loss function is used to refine the data extracted in the second stage, and the resulting characterization data of the features shared among modalities helps reinforce the emotional tone and improves the model's ability to infer the character's emotion.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a schematic diagram of an overall framework of a multi-modal emotion analysis method based on personality and commonality contrast staged guidance in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data preprocessing module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a personality comparison module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a commonality comparison module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an emotion prediction module according to an embodiment of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
As shown in fig. 1, the embodiment provides a multi-modal emotion analysis method based on personality and commonality comparison staged guidance, which includes the following steps:
Extracting language features, acoustic features and visual features of the video sample;
more specifically, word embedding features of the transcribed text are extracted using a pre-trained BERT model, visual features are extracted using the OpenFace tool library, and acoustic features are extracted using the COVAREP tool library;
In this embodiment, the text content transcribed from the video segment is first converted into word embeddings as language features using the language model BERT pre-trained on large datasets; the positions of facial key points, facial action units, head pose changes and other information of the character in the video segment are detected as visual features using the open-source facial behavior analysis tool library OpenFace; and the Mel-frequency cepstral coefficients, fundamental frequency, short-time jitter parameters, frame energy and other information in the audio of the character's speech are extracted as acoustic features using the open-source speech processing tool library COVAREP. The whole process is shown in Fig. 2.
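By way of illustration only, a minimal sketch of the language-feature step is given below. It assumes the Hugging Face transformers library with the bert-base-uncased checkpoint (an assumed model name; the text only specifies a pre-trained BERT) and covers only the text branch, since OpenFace and COVAREP are standalone tools normally run outside Python.

```python
# Sketch: per-token BERT word embeddings for one transcribed utterance.
# Assumptions: Hugging Face transformers + PyTorch installed, English transcripts.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def extract_language_features(transcript: str) -> torch.Tensor:
    """Return per-token embeddings of shape (num_tokens, hidden_dim)."""
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    # last_hidden_state: (1, num_tokens, 768); drop the batch dimension
    return outputs.last_hidden_state.squeeze(0)

language_feats = extract_language_features("I really enjoyed this movie.")
print(language_feats.shape)  # (num_tokens, 768)
```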
Preprocessing the language features, the acoustic features and the visual features, and then extracting high-level semantic features in two stages to obtain first-stage extraction data and second-stage extraction data;
preprocessing the language features, acoustic features, and visual features includes:
Performing alignment processing on the language features, the acoustic features and the visual features in a time dimension by taking the text as a reference, and removing stop word parts irrelevant to emotion in the language features, the acoustic features and the visual features;
additional contrast views are generated for the linguistic, acoustic, and visual features using random zero vector substitution.
More specifically, the embodiment aligns three modal data in the time dimension based on text, and removes stop word portions unrelated to emotion therein. All three modality data are time series data, expressed mathematically as matrix tensors.
To enable the calculation of the contrastive loss, before the three modalities' data are input into the model, the present embodiment generates, for each sample of each modality, additional views with similar representations to serve as comparison objects. A view is generated by randomly replacing some time steps in the sequence data with zero vectors. The additional views serve as positive samples for contrast; they prevent the denominator from becoming zero when the contrastive loss is computed over a mini-batch and improve the quality of the features extracted by the encoder, and different numbers of views have different effects on model performance.
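A minimal sketch of this view-generation step is shown below; the tensor shape and the replacement ratio are illustrative assumptions, since the text does not specify how many time steps are replaced.

```python
# Sketch: build one additional contrast view by zeroing random time steps.
# Assumption: each modality sample is a tensor of shape (seq_len, feat_dim).
import torch

def make_contrast_view(x: torch.Tensor, drop_ratio: float = 0.15) -> torch.Tensor:
    """Randomly replace a fraction of time steps with zero vectors."""
    view = x.clone()
    seq_len = view.size(0)
    num_drop = max(1, int(seq_len * drop_ratio))
    drop_idx = torch.randperm(seq_len)[:num_drop]
    view[drop_idx] = 0.0
    return view

sample = torch.randn(50, 74)          # e.g. 50 time steps of 74-dim acoustic features
extra_view = make_contrast_view(sample)
```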
The high-level semantic feature extraction of the two stages comprises:
Extracting corresponding high-level semantic features from the preprocessed language features, acoustic features and visual features by using three Transformer encoders; wherein the three Transformer encoders are divided into two stages, and each stage includes several Transformer encoding layers;
In this embodiment, three Transformer encoders are used to extract corresponding high-level semantic features from each modality's data. In this process, the three Transformer encoders are divided into two stages, each of which is composed of multiple Transformer encoding layers, and the calculation process is as follows:
$h_i^{m,1} = E_1^m(x_i^m)$, $h_i^{m,2} = E_2^m(h_i^{m,1})$
wherein $E_1^m$ and $E_2^m$ are the Transformer encoders of the first stage and the second stage respectively, $m \in \{l, v, a\}$, $l$ is the language modality, $v$ is the visual modality, $a$ is the acoustic modality, and $x_i^m$ is the input of the $i$-th sample of modality $m$.
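The two-stage encoder of one modality can be sketched as follows with PyTorch's built-in Transformer encoder layers; the layer counts, model dimension and head count are illustrative assumptions rather than values disclosed above.

```python
# Sketch: a per-modality encoder split into two stages E1^m and E2^m.
# Assumptions: inputs are batch-first tensors of shape (batch, seq_len, d_model),
# already projected to a common dimension d_model.
import torch
import torch.nn as nn

class TwoStageEncoder(nn.Module):
    def __init__(self, d_model: int = 128, nhead: int = 4,
                 layers_stage1: int = 2, layers_stage2: int = 2):
        super().__init__()
        enc_layer1 = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        enc_layer2 = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.stage1 = nn.TransformerEncoder(enc_layer1, num_layers=layers_stage1)  # E1^m
        self.stage2 = nn.TransformerEncoder(enc_layer2, num_layers=layers_stage2)  # E2^m

    def forward(self, x: torch.Tensor):
        h1 = self.stage1(x)   # first-stage features, guided by the personality loss
        h2 = self.stage2(h1)  # second-stage features, guided by the commonality loss
        return h1, h2

encoder = TwoStageEncoder()
h1, h2 = encoder(torch.randn(8, 50, 128))  # one modality, batch of 8 samples
```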
Extracting the first-stage extraction data by utilizing a personality contrast loss function to obtain characterization data specific to each modal characteristic;
extracting the second-stage extraction data by utilizing a commonality contrast loss function to obtain characterization data of the sharing characteristics among the modes;
More specifically, the present embodiment inputs the features extracted by each encoder in the first stage into the personality contrast module, which produces a personality contrast loss specific to each modality; this loss guides the update of the network parameters during back-propagation, so that the features generated by the first-stage encoders carry information specific to each modality's own characteristics. The features extracted by each encoder in the second stage are input into the commonality contrast module, which produces a commonality contrast loss over the three modalities; in the same way as the personality contrast loss, this loss guides the second-stage encoders to generate characterizations with the characteristics shared among modalities.
In the personality contrast module shown in Fig. 3, the present embodiment receives the samples of each modality in the same batch. For multiple views belonging to the same sample and the same modality, their feature representations are required to be as close as possible in the mapping space, so they are treated as positive pairs; for views belonging to different samples but the same modality, their feature representations are required to be relatively far apart in the mapping space, so they are treated as negative pairs. This goal can be learned with the contrastive loss function:
$$-\frac{1}{|V(i)|}\sum_{v\in V(i)}\log\frac{\exp\left(z_i^{m,1}\cdot z_v^{m,1}/\tau\right)}{\sum_{j\in A(i)}\exp\left(z_i^{m,1}\cdot z_j^{m,1}/\tau\right)}$$
wherein $z_i^{m,1}=g(h_i^{m,1})$, $g$ is a multi-layer perceptron, $\tau$ is the temperature coefficient, $A(i)$ is the index set of all views in the batch with $i$ removed, and $V(i)$ is the index set of the other views derived from the $i$-th sample. After learning, the feature representation of each modality of each sample is closely related to the nature of the modality itself. In addition, for the features refined by the personality contrast module, the present embodiment also requires the information they contain to be highly relevant to the task; therefore the task's classification labels are used, and all views that belong to different samples but share the same label and modality are treated as positive pairs, so that their features are sufficiently close to each other in the mapping space. This is effective because samples assigned task-related labels themselves contain information related to the label attributes, and this information should be similar when the labels are the same. The present embodiment may learn this goal using a supervised contrastive loss function:
$$\mathcal{L}_{icm}=\sum_{m\in\{l,v,a\}}\sum_{i}\frac{-1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(z_i^{m,1}\cdot z_p^{m,1}/\tau\right)}{\sum_{j\in A(i)}\exp\left(z_i^{m,1}\cdot z_j^{m,1}/\tau\right)}$$
wherein $P(i)$ is the set of indices whose samples have the same label as sample $i$ but belong to $A(i)$. It should be noted that, since emotion analysis is a regression task, there is no ready-made task label to classify the samples, so the present embodiment rounds the true label value and uses the result as the classification label. Since the emotion regression values generally lie in the range [-3, 3], this yields 7 categories (i.e., -3, -2, -1, 0, 1, 2, 3), which in fact correspond to 7 degrees of emotion from extremely negative to extremely positive. The present embodiment applies the personality contrast module to the first stage of the encoder, so that the features extracted by the encoder in the first stage are highly correlated with the modality and the task.
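A minimal sketch of such a label-supervised contrastive loss for one modality is given below. It assumes PyTorch, dot-product similarities on the MLP-projected first-stage features of all views in a mini-batch, and the rounding of regression targets into the 7 categories described above; the exact similarity measure and batching details are assumptions.

```python
# Sketch: supervised contrastive loss over all views of one modality in a batch.
# z: (num_views, proj_dim) MLP-projected first-stage features;
# reg_labels: regression target of the sample each view comes from; tau: temperature.
import torch
import torch.nn.functional as F

def personality_contrast_loss(z: torch.Tensor, reg_labels: torch.Tensor,
                              tau: float = 0.1) -> torch.Tensor:
    labels = torch.round(reg_labels).long()              # [-3, 3] -> 7 classes
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                                 # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # log-softmax over all other views A(i), then average over the positives P(i)
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss.mean()
```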
As shown in Fig. 4, the commonality contrast module is used to refine information related to what the modalities have in common. Unlike the personality contrast module, the commonality contrast module needs to receive features from multiple modalities simultaneously, and therefore depends less on whether a single modality has multiple views. Multiple modalities of the same sample are treated as positive pairs, whose feature representations are required to be sufficiently close in the mapping space; modalities belonging to different samples are treated as negative pairs, whose feature representations are required to remain relatively far apart in the mapping space. Likewise, this goal can be learned with the contrastive loss function:
$$\mathcal{L}_{ccm}=\sum_{m\in\{l,v,a\}}\sum_{i=1}^{n}\sum_{s\in S(m)}-\log\frac{\exp\left(z_i^{m,2}\cdot z_i^{s,2}/\tau\right)}{\sum_{q\in S(m)}\sum_{j=1}^{n}\exp\left(z_i^{m,2}\cdot z_j^{q,2}/\tau\right)}$$
wherein $z_i^{m,2}=g(h_i^{m,2})$ and $S(m)$ is the set of the three modalities with modality $m$ removed. It is worth noting that in the commonality contrast module, if a modality has multiple views, these additional views need not participate in the contrast for that modality, because their feature representations have already been pulled close together in the first stage of the encoder under the guidance of the personality contrast module. The present embodiment applies the commonality contrast module to the second stage of the encoder, so that the features extracted by the encoder in the second stage are related to what the multiple modalities have in common.
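A minimal sketch of one cross-modal contrastive term in this spirit is given below. It assumes PyTorch and handles a single ordered pair of modalities, summing the term over every ordered pair of distinct modalities to form the full loss; this is a simplification of the formula above, in which the negatives are pooled over both other modalities.

```python
# Sketch: cross-modal contrastive term between two modalities of the same batch.
# z_m, z_s: (batch, proj_dim) MLP-projected second-stage features of two modalities.
import torch
import torch.nn.functional as F

def commonality_contrast_term(z_m: torch.Tensor, z_s: torch.Tensor,
                              tau: float = 0.1) -> torch.Tensor:
    z_m = F.normalize(z_m, dim=-1)
    z_s = F.normalize(z_s, dim=-1)
    sim = z_m @ z_s.t() / tau                # sim[i, j]: sample i of modality m vs j of s
    targets = torch.arange(z_m.size(0), device=z_m.device)
    # positive: the same sample index in the other modality; negatives: all other samples
    return F.cross_entropy(sim, targets)
```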
Inferring an emotion value of the video sample based on the characterization data specific to each modality characteristic and the characterization data of the inter-modality sharing characteristic.
In this embodiment, the intra-modality features refined under the guidance of the personality contrast module help the model understand the emotional context and avoid ambiguity, while the inter-modality features refined under the guidance of the commonality contrast module help reinforce the emotional tone and improve the model's ability to infer the character's emotion. The contrast modules designed in this embodiment are used only during training and are completely removed at inference time; they do not alter the structure of the original network, can be extended to any number of modalities, and add very little computational complexity, thereby improving the model's flexibility in handling multi-modal data.
In this embodiment, the features extracted by the three encoders in the first stage and the second stage are input into the emotion prediction module, as shown in Fig. 5. In the emotion prediction module, the multiple features are aggregated into a vector representation through maximum pooling, and the emotion value $\hat{y}_i$ of each sample is obtained through two fully-connected layers, a ReLU activation function and a Dropout layer. Finally, the emotion analysis loss is calculated with the root mean square error as follows:
$$\hat{y}_i=\mathrm{FC}\!\left(\left[\mathrm{MaxPool}\!\left(h_i^{l,1},h_i^{v,1},h_i^{a,1}\right);\ \mathrm{MaxPool}\!\left(h_i^{l,2},h_i^{v,2},h_i^{a,2}\right)\right]\right),\qquad \mathcal{L}_{msa}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^{2}}$$
wherein $\mathrm{MaxPool}(\cdot)$ denotes maximum pooling and $\mathrm{FC}(\cdot)$ denotes the two fully-connected layers with the ReLU activation function and the Dropout layer. The final loss function can be expressed as:
$$\mathcal{L}=\mathcal{L}_{msa}+\alpha\,\mathcal{L}_{icm}+\beta\,\mathcal{L}_{ccm}$$
wherein $\alpha$ and $\beta$ are hyperparameters that balance the different losses.
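A minimal sketch of the prediction head and the combined training objective is given below; the hidden sizes, the dropout rate and the weights alpha and beta are illustrative assumptions, and the two contrastive losses are taken from the sketches above.

```python
# Sketch: emotion prediction head (two FC layers + ReLU + Dropout) and total loss.
# pooled1 / pooled2: max-pooled first- and second-stage characterizations, (batch, 128) each.
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, in_dim: int = 256, hidden: int = 64, p_drop: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),   # first fully-connected layer
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),        # second fully-connected layer -> emotion value
        )

    def forward(self, pooled1: torch.Tensor, pooled2: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([pooled1, pooled2], dim=-1)).squeeze(-1)

def total_loss(y_pred, y_true, l_icm, l_ccm, alpha=0.1, beta=0.1):
    l_msa = torch.sqrt(torch.mean((y_pred - y_true) ** 2))  # root mean square error
    return l_msa + alpha * l_icm + beta * l_ccm
```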
To illustrate the effect of the present invention, this embodiment selects some existing methods for comparison and verifies on the CMU-MOSI multi-modal benchmark dataset. The evaluation indexes are the correlation coefficient (Corr), the classification accuracy (Acc) and the F1 score. The experimental results are shown in Table 1 below:
TABLE 1
The experimental results show that, on the CMU-MOSI test set, the invention achieves a large improvement over the baseline systems on all three indexes: the correlation coefficient, the classification accuracy and the F1 score.
The experimental results on the CMU-MOSI dataset show that the method provided by this embodiment significantly improves the performance of multi-modal emotion analysis and clearly outperforms multiple baseline models.
In this embodiment, considering that text-only emotion analysis currently cannot cope with complex emotion mechanisms and rich multi-modal data, an emotion analysis model that accepts multi-modal data is designed, so as to comprehensively exploit the multiple facets in which the data describe the character and infer the emotion polarity of characters in video clips. In addition, to address the difficulty of designing multi-modal interactions, a training scheme that requires no interaction architecture is proposed: a personality contrast module and a commonality contrast module are introduced, and these contrast modules guide the single-modality encoders to refine features, so that the features extracted by the encoders possess both single-modality individuality and multi-modality commonality. This implicit inter-modality interaction avoids the trouble of manually designing an explicit interaction process, so that the embodiment can concentrate on the choice of single-modality networks, enjoy the advantages they bring, and accelerate training.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (4)

1. The multi-modal emotion analysis method based on personalized and commonality comparison staged guidance is characterized by comprising the following steps of:
Extracting language features, acoustic features and visual features of the video sample;
Preprocessing the language features, the acoustic features and the visual features, and then extracting high-level semantic features in two stages to obtain first-stage extraction data and second-stage extraction data;
the high-level semantic feature extraction of the two stages comprises:
Extracting corresponding high-level semantic features from the preprocessed language features, acoustic features and visual features by using three Transformer encoders; wherein the three Transformer encoders are divided into two stages, and each stage includes several Transformer encoding layers;
The two-stage calculation is:
$h_i^{m,1} = E_1^m(x_i^m)$, $h_i^{m,2} = E_2^m(h_i^{m,1})$
wherein $E_1^m$ and $E_2^m$ are the Transformer encoders of the first stage and the second stage respectively, $m \in \{l, v, a\}$, $l$ is the language modality, $v$ is the visual modality, $a$ is the acoustic modality, and $x_i^m$ is the input of the $i$-th sample of modality $m$;
extracting the first-stage extraction data by utilizing a personality contrast loss function to obtain characterization data specific to each modal characteristic;
the personality contrast loss function is:
$$\mathcal{L}_{icm}=\sum_{m\in\{l,v,a\}}\sum_{i}\frac{-1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(z_i^{m,1}\cdot z_p^{m,1}/\tau\right)}{\sum_{j\in A(i)}\exp\left(z_i^{m,1}\cdot z_j^{m,1}/\tau\right)}$$
wherein $z_i^{m,1}=g(h_i^{m,1})$ denotes the vector extracted by the function $g$ from the features $h_i^{m,1}$ of the $i$-th sample of modality $m$ output in the first stage of the network, $g$ denotes a multi-layer perceptron, $\tau$ denotes a temperature coefficient with a value greater than zero, $A(i)$ denotes the index set with $i$ removed, $P(i)$ denotes the set of indices whose samples have the same label as sample $i$ but belong to the set $A(i)$, $p$ denotes an index in the set $P(i)$, the sum over $i$ runs over all $Kn$ views in a batch, $K$ denotes the number of enhancement views, $n$ denotes the number of samples in a batch, $a$ denotes the acoustic modality, and $icm$ denotes the personality contrast module;
extracting the second-stage extraction data by utilizing a commonality contrast loss function to obtain characterization data of the sharing characteristics among the modes;
the commonality contrast loss function is:
$$\mathcal{L}_{ccm}=\sum_{m\in\{l,v,a\}}\sum_{i=1}^{n}\sum_{s\in S(m)}-\log\frac{\exp\left(z_i^{m,2}\cdot z_i^{s,2}/\tau\right)}{\sum_{q\in S(m)}\sum_{j=1}^{n}\exp\left(z_i^{m,2}\cdot z_j^{q,2}/\tau\right)}$$
wherein $z_i^{m,2}=g(h_i^{m,2})$ denotes the vector extracted by the function $g$ from the features $h_i^{m,2}$ of the $i$-th sample of modality $m$ output in the second stage of the network, $S(m)$ denotes the set of the three modalities with modality $m$ removed, $s$ denotes a modality in the set $S(m)$, $z_i^{s,2}$ denotes the vector extracted by the function $g$ from the features $h_i^{s,2}$ of the $i$-th sample of modality $s$ output in the second stage of the network, $j$ denotes a sample index in the batch, $q$ denotes a modality in the set $S(m)$, $z_j^{q,2}$ denotes the vector extracted by the function $g$ from the features $h_j^{q,2}$ of the $j$-th sample of modality $q$ output in the second stage of the network, and $ccm$ denotes the commonality contrast module;
inferring an emotion value of the video sample based on the characterization data specific to each modality characteristic and the characterization data of the inter-modality sharing characteristic.
2. The method of multimodal emotion analysis based on staged guidance of personality and commonality contrast of claim 1, wherein preprocessing the linguistic, acoustic, and visual features comprises:
Performing alignment processing on the language features, the acoustic features and the visual features in a time dimension by taking the text as a reference, and removing stop word parts irrelevant to emotion in the language features, the acoustic features and the visual features;
additional contrast views are generated for the linguistic, acoustic, and visual features using random zero vector substitution.
3. The method for multimodal emotion analysis based on personality and commonality contrast staged guidance of claim 1, wherein inferring emotion values for video samples comprises:
Converting the characterization data specific to each modality's features and the characterization data of the features shared among modalities into a vector representation through maximum pooling, and obtaining the emotion value of the video sample after the vector representation is processed by two fully-connected layers, a ReLU activation function and a Dropout layer;
And calculating emotion analysis loss by using root mean square error based on the emotion value.
4. The multi-modal emotion analysis method based on personality and commonality contrast staged guidance of claim 3, wherein the emotion analysis loss is calculated as:
$$\hat{y}_i=\mathrm{FC}\!\left(\left[\mathrm{MaxPool}\!\left(h_i^{l,1},h_i^{v,1},h_i^{a,1}\right);\ \mathrm{MaxPool}\!\left(h_i^{l,2},h_i^{v,2},h_i^{a,2}\right)\right]\right),\qquad \mathcal{L}_{msa}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^{2}}$$
wherein $\mathrm{MaxPool}(\cdot)$ denotes maximum pooling, $\mathrm{FC}(\cdot)$ denotes the two fully-connected layers with the ReLU activation function and the Dropout layer, $\hat{y}_i$ denotes the emotion value, $y_i$ denotes the true emotion label value of the $i$-th sample, $h_i^{l,1}$, $h_i^{v,1}$ and $h_i^{a,1}$ denote the features of the $i$-th sample of the language, visual and acoustic modalities output by the first stage of the network, $\mathrm{MaxPool}(h_i^{l,1},h_i^{v,1},h_i^{a,1})$ denotes the characterization obtained by aggregating these first-stage features through maximum pooling, $h_i^{l,2}$, $h_i^{v,2}$ and $h_i^{a,2}$ denote the features of the $i$-th sample of the language, visual and acoustic modalities output by the second stage of the network, $\mathrm{MaxPool}(h_i^{l,2},h_i^{v,2},h_i^{a,2})$ denotes the characterization obtained by aggregating these second-stage features through maximum pooling, $i$ denotes the index of a sample, $n$ denotes the number of samples in a batch, and $msa$ denotes multi-modal emotion analysis.
CN202410224455.5A 2024-02-29 2024-02-29 Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance Active CN117809229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410224455.5A CN117809229B (en) 2024-02-29 2024-02-29 Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410224455.5A CN117809229B (en) 2024-02-29 2024-02-29 Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance

Publications (2)

Publication Number Publication Date
CN117809229A CN117809229A (en) 2024-04-02
CN117809229B (en) 2024-05-07

Family

ID=90428079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410224455.5A Active CN117809229B (en) 2024-02-29 2024-02-29 Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance

Country Status (1)

Country Link
CN (1) CN117809229B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973062A (en) * 2022-04-25 2022-08-30 西安电子科技大学 Multi-modal emotion analysis method based on Transformer
CN115858726A (en) * 2022-11-22 2023-03-28 天翼电子商务有限公司 Multi-stage multi-modal emotion analysis method based on mutual information method representation
WO2023084348A1 (en) * 2021-11-12 2023-05-19 Sony Group Corporation Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023084348A1 (en) * 2021-11-12 2023-05-19 Sony Group Corporation Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network
CN114973062A (en) * 2022-04-25 2022-08-30 西安电子科技大学 Multi-modal emotion analysis method based on Transformer
CN115858726A (en) * 2022-11-22 2023-03-28 天翼电子商务有限公司 Multi-stage multi-modal emotion analysis method based on mutual information method representation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Speech emotion recognition based on a multi-modal combination model; Chen Jun et al.; Software; 2019-12-15 (No. 12); full text *

Also Published As

Publication number Publication date
CN117809229A (en) 2024-04-02

Similar Documents

Publication Publication Date Title
Zhao et al. Exploring spatio-temporal representations by integrating attention-based bidirectional-LSTM-RNNs and FCNs for speech emotion recognition
US20170286397A1 (en) Predictive Embeddings
CN113420807A (en) Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN105122279B (en) Deep neural network is conservatively adapted in identifying system
CN113095357A (en) Multi-mode emotion recognition method and system based on attention mechanism and GMN
Liu et al. Speech expression multimodal emotion recognition based on deep belief network
CN108536735B (en) Multi-mode vocabulary representation method and system based on multi-channel self-encoder
CN109857865B (en) Text classification method and system
Shi et al. Chatgraph: Interpretable text classification by converting chatgpt knowledge to graphs
CN114417097A (en) Emotion prediction method and system based on time convolution and self-attention
CN114550057A (en) Video emotion recognition method based on multi-modal representation learning
CN113705315A (en) Video processing method, device, equipment and storage medium
CN113111190A (en) Knowledge-driven dialog generation method and device
Huang et al. TeFNA: Text-centered fusion network with crossmodal attention for multimodal sentiment analysis
Sun et al. Personality assessment based on multimodal attention network learning with category-based mean square error
Wang et al. Multimodal transformer augmented fusion for speech emotion recognition
CN117809229B (en) Multi-modal emotion analysis method based on personalized and commonality comparison staged guidance
CN116959424A (en) Speech recognition method, speech recognition system, computer device, and storage medium
CN111027215A (en) Character training system and method for virtual human
Zhang et al. Feature learning via deep belief network for Chinese speech emotion recognition
CN115687620A (en) User attribute detection method based on tri-modal characterization learning
Sun et al. A new view of multi-modal language analysis: Audio and video features as text “Styles”
WO2022216462A1 (en) Text to question-answer model system
Ragheb et al. Emotional Speech Recognition with Pre-trained Deep Visual Models
Ghorpade et al. ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant