CN113312530B - Multi-modal emotion classification method with text as the core - Google Patents

Multi-modal emotion classification method with text as the core

Info

Publication number
CN113312530B
Authority
CN
China
Prior art keywords
feature
sequence
text
acoustic
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110652703.2A
Other languages
Chinese (zh)
Other versions
CN113312530A (en)
Inventor
秦兵
吴洋
赵妍妍
胡晓毓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110652703.2A priority Critical patent/CN113312530B/en
Publication of CN113312530A publication Critical patent/CN113312530A/en
Application granted granted Critical
Publication of CN113312530B publication Critical patent/CN113312530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/906 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 - Clustering; Classification

Abstract

A multi-modal emotion classification method with text as the core relates to the technical field of natural language processing and aims to solve the problem that the prior art treats the semantic information of each modality as a whole, lacks the ability to model interaction between different modalities, and therefore classifies emotion inaccurately. The method uses a shared-private framework comprising two parts. One part is a cross-modal prediction model that takes text-modality features as input and outputs speech/image-modality features; judgment rules for shared and private features are designed on the basis of this model and are used to distinguish the shared features from the private features. The other part is an emotion prediction model that fuses the text-modality features with the shared and private features of speech/images using a cross-modal attention mechanism, finally obtaining multi-modal fusion features for emotion classification.

Description

Multi-modal emotion classification method with text as the core
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a multi-modal emotion classification method with text as the core.
Background
Multimodal emotion analysis is an emerging research field aimed at understanding people's emotions using textual and non-textual (visual, acoustic) data. The task has recently attracted increasing attention, because it has been recognized that non-textual cues help detect emotion and identify opinions in videos.
In multimodal emotion analysis there are two main lines of work. One focuses on multimodal feature fusion at the utterance level. Such methods use features of the entire utterance: visual or acoustic features are first extracted at the frame level and then averaged to obtain the final utterance-level features, while utterance-level textual features can be obtained by applying RNNs. The resulting utterance-level features are fed into a fusion model to obtain a multimodal representation, and several effective multimodal feature fusion models have been proposed (Zadeh et al., 2017; Liu et al., 2018; Mai et al., 2020). However, utterance-level features mainly contain global information and may fail to capture local information. Recent work has therefore focused primarily on word-level multimodal features. To extract word-level features, the first step is to obtain the timestamp of each word appearing in the video, including its start time and end time. The video is then segmented into clips based on these timestamps. Finally, word-level visual or acoustic features are obtained by averaging the frame-level features of each clip. Researchers have proposed many methods for word-level multimodal feature fusion (Zadeh et al., 2018; Wang et al., 2019; Tsai et al., 2019; Vaswani et al., 2017). In addition, related work (Pham et al., 2019) considers that a joint representation can be learned by translating a source modality into a target modality and proposes the Multimodal Cyclic Translation Network (MCTN) to learn joint multimodal representations.
Existing work has shown that, compared with traditional text sentiment analysis (Liu, 2012), adding non-textual data can improve the performance of emotion analysis (Chen et al., 2017; Zadeh et al., 2018; Sun et al., 2020). There are two reasons for this. The first is that the three modalities may convey some common semantics; in this case the non-textual data provide no additional information beyond the text, but the repeated information can still enhance the final performance. This is referred to as shared semantics. The other reason is that each modality also carries its own specific semantic information that differs from the other modalities. This information is modality-specific, difficult to predict from the text data alone, and is called private semantics. By combining the private semantic information, the final emotion can be detected more accurately.
Previous work generally does not distinguish between shared and private semantics, but rather treats the semantic information of each modality as a whole, and therefore lacks the ability to explore the interaction between different modalities.
Disclosure of Invention
The purpose of the invention is to provide a multi-modal emotion classification method with text as the core, aiming at the problem that the prior art treats the semantic information of each modality as a whole, lacks the ability to model interaction between different modalities, and therefore classifies emotion inaccurately.
The technical solution adopted by the invention to solve the above problem is as follows:
A multi-modal emotion classification method with text as the core comprises the following steps:
Step one: extract a text feature sequence, a visual feature sequence and an acoustic feature sequence from the data, then train cross-modal prediction model I with the text feature sequence and the visual feature sequence, and train cross-modal prediction model II with the text feature sequence and the acoustic feature sequence; training ends when the loss function values of cross-modal prediction model I and cross-modal prediction model II no longer decrease;
Step two: input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and then obtain the visual shared features and the visual private features from the output visual feature sequence;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and then obtain the acoustic shared features and the acoustic private features from the output acoustic feature sequence;
Step three: fuse the text feature sequence to be tested with the visual shared features and the acoustic shared features, and then fuse the result with the visual private features and the acoustic private features to obtain the final fusion result;
Step four: input the final fusion result into a classifier for classification;
the visual shared features and the acoustic shared features are features that contain no additional information relative to the text features, while the visual private features and the acoustic private features are features that contain information not present in the text features.
Further, the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the N text features with the largest attention weights; that feature in the visual feature sequence is then a shared feature of each of these N text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the N text features with the largest attention weights; that feature in the acoustic feature sequence is then a shared feature of each of these N text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained;
N is 3, 4 or 5.
Further, N is 5.
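The top-N rule can be sketched as follows. This is a non-authoritative illustration in PyTorch, assuming the cross-modal prediction model exposes an attention matrix of shape [target length, text length]; the function and tensor names are hypothetical and not taken from the patent.

```python
import torch

def shared_features_topN(attn: torch.Tensor, N: int = 5):
    """attn[j, i]: attention weight on text feature i when the cross-modal prediction
    model generates target (visual or acoustic) feature j.  For each generated feature j,
    the N most-attended text features are found, and j is recorded as a shared feature
    of each of them.  Returns {text index: [indices of its shared target features]}."""
    target_len, text_len = attn.shape
    shared = {i: [] for i in range(text_len)}
    top_text = torch.topk(attn, k=min(N, text_len), dim=-1).indices  # [target_len, N]
    for j in range(target_len):
        for i in top_text[j].tolist():
            shared[i].append(j)
    return shared

# toy usage: 8 generated visual features attending over 6 text features
attn = torch.softmax(torch.randn(8, 6), dim=-1)
visual_shared = shared_features_topN(attn, N=5)
```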
Further, the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the visual feature sequence is then a shared feature of each of these text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the acoustic feature sequence is then a shared feature of each of these text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained.
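The threshold rule differs only in how the text features are selected for each generated feature; a brief sketch under the same assumptions as the top-N sketch above (0.05 is the cut-off stated in this embodiment):

```python
import torch

def shared_features_threshold(attn: torch.Tensor, tau: float = 0.05):
    """Same convention as the top-N sketch: a generated target feature j is recorded as
    a shared feature of every text feature whose attention weight at step j exceeds tau."""
    target_len, text_len = attn.shape
    shared = {i: [] for i in range(text_len)}
    for j in range(target_len):
        for i in (attn[j] > tau).nonzero(as_tuple=True)[0].tolist():
            shared[i].append(j)
    return shared
```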
Further, the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the acoustic private features.
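A minimal sketch of the loss-based private-feature rule, assuming a per-step prediction loss is available (for example the mean squared error between the predicted and the true non-text feature at each time step); the helper name and the choice of MSE are illustrative assumptions:

```python
import torch

def private_features_topk(per_step_loss: torch.Tensor, k: int = 5):
    """per_step_loss[j]: prediction loss of the cross-modal prediction model for generated
    (visual or acoustic) feature j.  The k hardest-to-predict features are taken as private."""
    k = min(k, per_step_loss.numel())
    return torch.topk(per_step_loss, k).indices  # indices of the private features

# toy usage: per-step MSE over 12 predicted visual features of dimension 35
pred, true = torch.randn(12, 35), torch.randn(12, 35)
step_loss = ((pred - true) ** 2).mean(dim=-1)
visual_private = private_features_topk(step_loss, k=5)
```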
Further, the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the acoustic private features.
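The threshold variant of the private-feature rule, under the same assumptions (0.02 is the cut-off stated in this embodiment):

```python
import torch

def private_features_threshold(per_step_loss: torch.Tensor, tau: float = 0.02):
    """Every generated feature whose prediction loss exceeds tau is treated as private."""
    return (per_step_loss > tau).nonzero(as_tuple=True)[0]
```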
Further, cross-modal prediction model I and cross-modal prediction model II each comprise an encoder and a decoder.
Further, the encoder and the decoder are implemented with an LSTM or a Transformer.
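A minimal LSTM encoder-decoder sketch of one cross-modal prediction model (text to visual, or text to acoustic), returning the per-step attention weights consumed by the shared-feature rules and the per-step losses consumed by the private-feature rules. The hidden size, the dot-product attention, teacher forcing and the MSE objective are illustrative assumptions, not the patented configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalPredictor(nn.Module):
    """Text-feature sequence in, target-modality (visual or acoustic) feature sequence out."""
    def __init__(self, d_text, d_target, d_hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(d_text, d_hidden, batch_first=True)
        self.decoder = nn.LSTMCell(d_target, d_hidden)
        self.out = nn.Linear(2 * d_hidden, d_target)

    def forward(self, text, target):
        """text: [B, Lt, d_text], target: [B, Lv, d_target] (teacher forcing).
        Returns predictions, attention weights [B, Lv, Lt] and per-step losses [B, Lv]."""
        enc, (h, c) = self.encoder(text)                        # enc: [B, Lt, H]
        h, c = h[-1], c[-1]
        B, Lv, _ = target.shape
        prev = torch.zeros(B, target.size(-1), device=text.device)
        preds, attns = [], []
        for t in range(Lv):
            h, c = self.decoder(prev, (h, c))                   # one decoding step
            score = torch.bmm(enc, h.unsqueeze(-1)).squeeze(-1) # dot-product attention over text
            alpha = F.softmax(score, dim=-1)                    # [B, Lt]
            ctx = torch.bmm(alpha.unsqueeze(1), enc).squeeze(1) # attention context, [B, H]
            pred = self.out(torch.cat([h, ctx], dim=-1))        # predicted target feature
            preds.append(pred)
            attns.append(alpha)
            prev = target[:, t]                                 # teacher forcing
        preds, attns = torch.stack(preds, 1), torch.stack(attns, 1)
        step_loss = ((preds - target) ** 2).mean(dim=-1)        # per-step MSE, used by the private rule
        return preds, attns, step_loss

# toy usage: batch of 2, 6 text features (dim 300), 8 visual features (dim 35)
model = CrossModalPredictor(d_text=300, d_target=35)
text, visual = torch.randn(2, 6, 300), torch.randn(2, 8, 35)
preds, attn, step_loss = model(text, visual)
step_loss.mean().backward()   # training objective: predict the target-modality features
```

In this sketch the same class would be instantiated twice, once as cross-modal prediction model I (text to visual) and once as model II (text to acoustic), each trained until its loss stops decreasing as described in step one.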
Further, the fusion in step three comprises the following specific steps, which are also sketched in code below:
Step 3.1: input the visual feature sequence into a first LSTM to obtain a visual feature representation sequence, input the text feature sequence into a second LSTM to obtain a text feature representation sequence, and input the acoustic feature sequence into a third LSTM to obtain an acoustic feature representation sequence;
Step 3.2: use a cross-modal attention mechanism to fuse the text feature representation sequence with the visual feature representations corresponding to the visual shared features, obtaining a visual shared representation sequence; use a cross-modal attention mechanism to fuse the text feature representation sequence with the acoustic feature representations corresponding to the acoustic shared features, obtaining an acoustic shared representation sequence;
Step 3.3: concatenate the visual shared representation sequence, the acoustic shared representation sequence and the text feature representation sequence, feed the concatenation into a fourth LSTM to obtain a shared fusion representation, and transform the shared fusion representation with a self-attention mechanism to obtain the shared representation;
Step 3.4: fuse the visual feature representations corresponding to the visual private features with an attention mechanism to obtain the visual private representation, and fuse the acoustic feature representations corresponding to the acoustic private features with an attention mechanism to obtain the acoustic private representation;
Step 3.5: concatenate the shared representation, the visual private representation and the acoustic private representation to obtain the final fusion result.
Further, the classifier in step four is softmax, logistic regression or an SVM.
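A compact sketch of steps 3.1 to 3.5 together with the step-four classifier, using standard PyTorch modules. The use of nn.MultiheadAttention for the cross-modal and self-attention steps, the hidden sizes, the simple attention pooling over private positions, and the application of the shared/private selections as boolean masks over the non-text time steps are illustrative assumptions rather than the patented configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPrivateFusion(nn.Module):
    """Steps 3.1-3.5: encode each modality with an LSTM, fuse text with the *shared*
    visual/acoustic representations via cross-modal attention, fuse the result with a
    fourth LSTM plus self-attention, pool the *private* visual/acoustic representations
    with an attention mechanism, concatenate, and classify (softmax over 3 classes)."""
    def __init__(self, d_text, d_vis, d_ac, h=128, n_cls=3):
        super().__init__()
        self.lstm_v = nn.LSTM(d_vis, h, batch_first=True)              # step 3.1
        self.lstm_t = nn.LSTM(d_text, h, batch_first=True)
        self.lstm_a = nn.LSTM(d_ac, h, batch_first=True)
        self.xattn_v = nn.MultiheadAttention(h, 4, batch_first=True)   # step 3.2
        self.xattn_a = nn.MultiheadAttention(h, 4, batch_first=True)
        self.lstm_fuse = nn.LSTM(3 * h, h, batch_first=True)           # step 3.3
        self.self_attn = nn.MultiheadAttention(h, 4, batch_first=True)
        self.pool_v = nn.Linear(h, 1)                                  # step 3.4 (attention pooling)
        self.pool_a = nn.Linear(h, 1)
        self.cls = nn.Linear(3 * h, n_cls)                             # step four

    @staticmethod
    def _attn_pool(x, score_layer, keep):
        """Attention-pool only the positions selected by the boolean mask `keep`."""
        w = score_layer(x).squeeze(-1).masked_fill(~keep, float('-inf'))
        w = torch.softmax(w, dim=-1).unsqueeze(1)                      # [B, 1, L]
        return torch.bmm(w, x).squeeze(1)                              # [B, h]

    def forward(self, text, vis, ac, vis_shared, vis_private, ac_shared, ac_private):
        """The four masks are boolean selections over the visual/acoustic time steps
        produced by the shared-feature and private-feature rules."""
        Hv, _ = self.lstm_v(vis)
        Ht, _ = self.lstm_t(text)
        Ha, _ = self.lstm_a(ac)                                        # step 3.1
        # step 3.2: text attends only to its shared visual / acoustic representations
        Sv, _ = self.xattn_v(Ht, Hv, Hv, key_padding_mask=~vis_shared)
        Sa, _ = self.xattn_a(Ht, Ha, Ha, key_padding_mask=~ac_shared)
        # step 3.3: concatenate, fuse with an LSTM, refine with self-attention
        fused, _ = self.lstm_fuse(torch.cat([Sv, Sa, Ht], dim=-1))
        shared, _ = self.self_attn(fused, fused, fused)
        shared = shared[:, 0]                                          # output at the first position
        # step 3.4: attention-pool the private visual / acoustic representations
        pv = self._attn_pool(Hv, self.pool_v, vis_private)
        pa = self._attn_pool(Ha, self.pool_a, ac_private)
        # step 3.5 + step four: concatenate and classify
        return F.softmax(self.cls(torch.cat([shared, pv, pa], dim=-1)), dim=-1)

# toy usage
B, Lt, Lv, La = 2, 6, 8, 10
model = SharedPrivateFusion(d_text=300, d_vis=35, d_ac=74)
probs = model(torch.randn(B, Lt, 300), torch.randn(B, Lv, 35), torch.randn(B, La, 74),
              vis_shared=torch.rand(B, Lv) > 0.3, vis_private=torch.rand(B, Lv) > 0.3,
              ac_shared=torch.rand(B, La) > 0.3, ac_private=torch.rand(B, La) > 0.3)
```

Feeding the selections in as boolean masks matches the masking mechanism described later in the principle section, where the weight of every unselected position is set to 0.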
The invention has the beneficial effects that:
the shared private framework provided by the application has high accuracy in multi-modal emotion analysis. In addition, the shared and private features of the non-textual modalities derived from the cross-modality prediction task of the present application may provide more interpretable clues for interactions between different modalities. Thus, these non-textual shared-private features can be fused together with textual features to improve multimodal sentiment analysis. The shared mask in the application can enable the emotion regression model to obtain the characteristics of modal sharing, so that a more stable regression model is formed. Private masking concentrates the regression model on modality-specific features, which provides additional information for emotion prediction. With the help of sharing and private masks, the regression model in the sharing-private framework can independently fuse text features and two types of non-text features, and is more effective.
Drawings
FIG. 1 is a schematic overall structure of the present application;
FIG. 2 is a schematic diagram of the shared and private features of the present application;
fig. 3 is a schematic diagram of the shared features of the present application.
Detailed Description
It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.
Embodiment one: specifically, referring to FIG. 1, the multi-modal emotion classification method with text as the core according to this embodiment comprises the following steps:
Step one: extract a text feature sequence, a visual feature sequence and an acoustic feature sequence from the data, then train cross-modal prediction model I with the text feature sequence and the visual feature sequence, and train cross-modal prediction model II with the text feature sequence and the acoustic feature sequence; training ends when the loss function values of cross-modal prediction model I and cross-modal prediction model II no longer decrease;
Step two: input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and then obtain the visual shared features and the visual private features from the output visual feature sequence;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and then obtain the acoustic shared features and the acoustic private features from the output acoustic feature sequence;
Step three: fuse the text feature sequence to be tested with the visual shared features and the acoustic shared features, and then fuse the result with the visual private features and the acoustic private features to obtain the final fusion result;
Step four: input the final fusion result into a classifier for classification;
the visual shared features and the acoustic shared features are features that contain no additional information relative to the text features, while the visual private features and the acoustic private features are features that contain information not present in the text features.
To address the problems in the prior art, the present application proposes a shared-private framework for text-centered multimodal emotion analysis. In this framework the text modality is regarded as the core modality. The application first designs a cross-modal prediction task to distinguish shared and private semantics between the text modality and the non-text (visual and acoustic) modalities, and then provides an emotion regression model comprising a shared module and a private module, which fuses the text features and the two types of non-text features for emotion analysis.
Embodiment two: this embodiment is a further description of embodiment one; it differs from embodiment one in that the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the N text features with the largest attention weights; that feature in the visual feature sequence is then a shared feature of each of these N text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the N text features with the largest attention weights; that feature in the acoustic feature sequence is then a shared feature of each of these N text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained;
N is 3, 4 or 5.
Embodiment three: this embodiment is a further description of embodiment one; it differs in that N is 5.
Embodiment four: this embodiment is a further description of embodiment one; it differs from embodiment one in that the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the visual feature sequence is then a shared feature of each of these text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the acoustic feature sequence is then a shared feature of each of these text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained.
Embodiment five: this embodiment is a further description of embodiment three or embodiment four; it differs from them in that the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the acoustic private features.
Embodiment six: this embodiment is a further description of embodiment three or embodiment four; it differs from them in that the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the acoustic private features.
Embodiment seven: this embodiment is a further description of embodiment one; it differs from embodiment one in that cross-modal prediction model I and cross-modal prediction model II each comprise an encoder and a decoder.
Embodiment eight: this embodiment is a further description of embodiment seven; it differs from embodiment seven in that the encoder and the decoder are implemented with an LSTM or a Transformer.
Embodiment nine: this embodiment is a further description of embodiment eight; it differs from embodiment eight in that the fusion in step three comprises the following specific steps:
Step 3.1: input the visual feature sequence into a first LSTM to obtain a visual feature representation sequence, input the text feature sequence into a second LSTM to obtain a text feature representation sequence, and input the acoustic feature sequence into a third LSTM to obtain an acoustic feature representation sequence;
Step 3.2: use a cross-modal attention mechanism to fuse the text feature representation sequence with the visual feature representations corresponding to the visual shared features, obtaining a visual shared representation sequence; use a cross-modal attention mechanism to fuse the text feature representation sequence with the acoustic feature representations corresponding to the acoustic shared features, obtaining an acoustic shared representation sequence;
Step 3.3: concatenate the visual shared representation sequence, the acoustic shared representation sequence and the text feature representation sequence, feed the concatenation into a fourth LSTM to obtain a shared fusion representation, and transform the shared fusion representation with a self-attention mechanism to obtain the shared representation;
Step 3.4: fuse the visual feature representations corresponding to the visual private features with an attention mechanism to obtain the visual private representation, and fuse the acoustic feature representations corresponding to the acoustic private features with an attention mechanism to obtain the acoustic private representation;
Step 3.5: concatenate the shared representation, the visual private representation and the acoustic private representation to obtain the final fusion result.
Embodiment ten: this embodiment is a further description of embodiment nine; it differs from embodiment nine in that the classifier in step four is softmax, logistic regression or an SVM.
The principle is as follows:
The application is a shared-private framework that takes text as the core modality for multi-modal emotion analysis. The framework mainly comprises two parts. One part is a cross-modal prediction model, which takes text-modality features as input and outputs speech/image-modality features; judgment rules for shared and private features are designed on the basis of this model, and the rules are then used to distinguish the shared features from the private features. The other part is an emotion prediction model, which fuses the text-modality features with the shared and private features of speech/images using a cross-modal attention mechanism and finally obtains multi-modal fusion features for emotion classification.
The cross-modal prediction model consists of an encoder and a decoder, both implemented with LSTMs. The encoder takes the input text feature sequence and outputs an encoded text representation in which the information in the text features is modeled. The input to the decoder is the text representation output by the encoder; at each time step the decoder outputs a feature of the target modality, and the output of each step depends on the output of the previous time step and on the encoder input. The training objective of the cross-modal prediction model is to predict the image/audio features corresponding to the input text features.
To mine the relationship between the text modality and the speech/image features, the application defines shared and private features. Shared features contain no additional information relative to the text features but provide overlapping semantic information; such features can make the model's predictions more robust. The judgment rule for this type of feature is as follows: first obtain, from the cross-modal prediction model, the attention weights over the input text feature sequence when each target feature is generated, then keep the 5 text features with the largest attention weights for each generated target feature; the target feature is then called a shared feature of each of those text features. Private features contain information that is not available in the text features and is difficult to predict from them; the judgment rule for this type of feature is that a target feature is regarded as private if its prediction loss is high. With these two types of rules, the two kinds of information can be separated by the cross-modal prediction model and then sent to the emotion prediction model for emotion prediction.
The emotion prediction model consists of a feature input encoding module, a shared feature encoding module and a private feature encoding module. The feature input encoding module uses LSTMs to encode the input text, speech and image features and obtain feature representations with context information. The shared feature encoding module uses a cross-modal attention model: the text representation from the feature input encoding module performs cross-modal interaction with the speech/image feature representations shared with it, yielding shared representations of the non-text features. The text representation is concatenated with the speech and image shared representations, the concatenated representation is fused and encoded by an LSTM, a self-attention layer is applied for deeper feature interaction, and the output at the first position is taken as the multi-modal shared feature representation. The input to the private feature encoding module is the private feature representations of speech and images; it uses an attention mechanism to give higher weight to more important features, yielding the modality-private feature representations. The modality-shared feature representation and the modality-private feature representations are concatenated, and the result is fed into a classification layer to produce the final prediction. In the implementation, the selection of the private and shared features is realized by a masking mechanism, i.e. the weight of each unselected position is set to 0. The classification results include positive, negative and neutral (the shared and private features are illustrated in FIG. 2 and FIG. 3).
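The masking mechanism mentioned above can be illustrated with a minimal sketch; it assumes attention weights over the non-text time steps, and the renormalization after zeroing is an added assumption, since the description only states that unselected positions receive weight 0.

```python
import torch

def apply_mask(attn_weights: torch.Tensor, selected: torch.Tensor) -> torch.Tensor:
    """attn_weights: [..., L] attention weights over the non-text feature positions.
    selected: boolean mask (shape [L] or broadcastable) marking the positions kept by
    the shared or private rule.  Unselected positions get weight 0; the remaining
    weights are renormalized so they still sum to 1 (the renormalization is an assumption)."""
    masked = attn_weights * selected.to(attn_weights.dtype)
    return masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-12)

# toy usage: a shared mask keeping positions 1, 3 and 4 of a 6-step visual sequence
weights = torch.softmax(torch.randn(6), dim=-1)
shared_mask = torch.tensor([False, True, False, True, True, False])
print(apply_mask(weights, shared_mask))
```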
In summary, the method takes text as the core modality and uses a shared-private framework for multi-modal emotion analysis: shared and private features are mined from the image and speech features by a cross-modal prediction model implemented as an encoder-decoder, the text features are then fused with the shared and private features in the emotion prediction model, and the emotion label is finally predicted.
The application compares the proposed method with several baseline methods; the experimental results are shown in Table 1. The base model of the application does not obtain the best results on the Acc and F1 metrics of the MOSEI dataset and performs worse than RAVEN and MulT. However, with the help of the cross-modal prediction task, the text-centered shared-private framework (TCSP) of the application achieves the best performance and outperforms all baseline methods on both datasets. This demonstrates that the proposed shared-private framework is effective for multimodal emotion analysis. Furthermore, the shared and private features of the non-textual modalities derived from the cross-modal prediction task can provide more interpretable clues about the interactions between different modalities; these non-textual shared and private features can thus be fused with the textual features to improve multimodal emotion analysis. On the MOSI dataset there is a large gap between the performance of the complete model and the base model. The application attributes this to the small amount of data in the MOSI dataset, which is insufficient for training the base model; in the complete model, however, the model also benefits from the shared and private information.
TABLE 1 Experimental results on MOSI and MOSEI
TABLE 2 ablation experimental results on MOSI and MOSEI
Ablation experiments were performed to distinguish the contribution of each part. As shown in Table 2, ablating either the shared mask or the private mask harms the performance of the model, indicating that both parts are useful for emotion prediction. The shared mask enables the emotion regression model to obtain modality-shared features, producing a more robust regression model. The private mask concentrates the regression model on modality-specific features, which provide additional information for emotion prediction. With the help of the shared and private masks, the regression model in the shared-private framework can fuse the text features and the two types of non-text features independently, and is therefore more effective.
It should be noted that the detailed description is only for explaining and explaining the technical solution of the present invention, and the scope of protection of the claims is not limited thereby. It is intended that all such modifications and variations be included within the scope of the invention as defined in the following claims and the description.

Claims (8)

1. A multi-modal emotion classification method with text as the core, characterized by comprising the following steps:
Step one: extract a text feature sequence, a visual feature sequence and an acoustic feature sequence from the data, then train cross-modal prediction model I with the text feature sequence and the visual feature sequence, and train cross-modal prediction model II with the text feature sequence and the acoustic feature sequence; training ends when the loss function values of cross-modal prediction model I and cross-modal prediction model II no longer decrease;
Step two: input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and then obtain the visual shared features and the visual private features from the output visual feature sequence;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and then obtain the acoustic shared features and the acoustic private features from the output acoustic feature sequence;
Step three: fuse the text feature sequence to be tested with the visual shared features and the acoustic shared features, and then fuse the result with the visual private features and the acoustic private features to obtain the final fusion result;
Step four: input the final fusion result into a classifier for classification;
the visual shared features and the acoustic shared features are features that contain no additional information relative to the text features, while the visual private features and the acoustic private features are features that contain information not present in the text features;
the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the N text features with the largest attention weights; that feature in the visual feature sequence is then a shared feature of each of these N text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the N text features with the largest attention weights; that feature in the acoustic feature sequence is then a shared feature of each of these N text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained;
N is 3, 4 or 5;
the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the acoustic private features.
2. The method of claim 1, wherein N is 5.
3. The method of claim 1, wherein the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the visual feature sequence is then a shared feature of each of these text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the acoustic feature sequence is then a shared feature of each of these text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained.
4. The multi-modal emotion classification method with text as the core according to claim 2 or 3, wherein the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the acoustic private features.
5. The method of claim 1, wherein cross-modal prediction model I and cross-modal prediction model II each comprise an encoder and a decoder.
6. The method of claim 5, wherein the encoder and the decoder are implemented with an LSTM or a Transformer.
7. The method of claim 6, wherein the fusion in step three comprises the following specific steps:
Step 3.1: input the visual feature sequence into a first LSTM to obtain a visual feature representation sequence, input the text feature sequence into a second LSTM to obtain a text feature representation sequence, and input the acoustic feature sequence into a third LSTM to obtain an acoustic feature representation sequence;
Step 3.2: use a cross-modal attention mechanism to fuse the text feature representation sequence with the visual feature representations corresponding to the visual shared features, obtaining a visual shared representation sequence; use a cross-modal attention mechanism to fuse the text feature representation sequence with the acoustic feature representations corresponding to the acoustic shared features, obtaining an acoustic shared representation sequence;
Step 3.3: concatenate the visual shared representation sequence, the acoustic shared representation sequence and the text feature representation sequence, feed the concatenation into a fourth LSTM to obtain a shared fusion representation, and transform the shared fusion representation with a self-attention mechanism to obtain the shared representation;
Step 3.4: fuse the visual feature representations corresponding to the visual private features with an attention mechanism to obtain the visual private representation, and fuse the acoustic feature representations corresponding to the acoustic private features with an attention mechanism to obtain the acoustic private representation;
Step 3.5: concatenate the shared representation, the visual private representation and the acoustic private representation to obtain the final fusion result.
8. The method of claim 7, wherein the classifier in step four is softmax, logistic regression or an SVM.
CN202110652703.2A 2021-06-09 2021-06-09 Multi-mode emotion classification method taking text as core Active CN113312530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110652703.2A CN113312530B (en) 2021-06-09 2021-06-09 Multi-mode emotion classification method taking text as core

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110652703.2A CN113312530B (en) 2021-06-09 2021-06-09 Multi-mode emotion classification method taking text as core

Publications (2)

Publication Number Publication Date
CN113312530A CN113312530A (en) 2021-08-27
CN113312530B true CN113312530B (en) 2022-02-15

Family

ID=77378264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110652703.2A Active CN113312530B (en) 2021-06-09 2021-06-09 Multi-mode emotion classification method taking text as core

Country Status (1)

Country Link
CN (1) CN113312530B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089602B (en) * 2021-11-04 2024-05-03 腾讯科技(深圳)有限公司 Information processing method, apparatus, electronic device, storage medium, and program product
CN115186683B (en) * 2022-07-15 2023-05-23 哈尔滨工业大学 Attribute-level multi-modal emotion classification method based on cross-modal translation
CN115983280B (en) * 2023-01-31 2023-08-15 烟台大学 Multi-mode emotion analysis method and system for uncertain mode deletion
CN117636074B (en) * 2024-01-25 2024-04-26 山东建筑大学 Multi-mode image classification method and system based on feature interaction fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288609A (en) * 2019-05-30 2019-09-27 南京师范大学 A kind of multi-modal whole-heartedly dirty image partition method of attention mechanism guidance
CN111275085A (en) * 2020-01-15 2020-06-12 重庆邮电大学 Online short video multi-modal emotion recognition method based on attention fusion
CN111460223A (en) * 2020-02-25 2020-07-28 天津大学 Short video single-label classification method based on multi-mode feature fusion of deep network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753549B (en) * 2020-05-22 2023-07-21 江苏大学 Multi-mode emotion feature learning and identifying method based on attention mechanism
CN112489635B (en) * 2020-12-03 2022-11-11 杭州电子科技大学 Multi-mode emotion recognition method based on attention enhancement mechanism

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288609A (en) * 2019-05-30 2019-09-27 南京师范大学 A kind of multi-modal whole-heartedly dirty image partition method of attention mechanism guidance
CN111275085A (en) * 2020-01-15 2020-06-12 重庆邮电大学 Online short video multi-modal emotion recognition method based on attention fusion
CN111460223A (en) * 2020-02-25 2020-07-28 天津大学 Short video single-label classification method based on multi-mode feature fusion of deep network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis";Kaicheng Yang等;《Proceedings of the 28th ACM International Conference on Multimedia》;20201012;第521-528页 *
"基于深度多模态特征融合的短视频分类";张丽娟等;《北京航空航天大学学报》;20210315;第47卷(第3期);第478-485页 *
"面向多模态信息的情绪分类方法研究";吴良庆;《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》;20210215;全文 *

Also Published As

Publication number Publication date
CN113312530A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN113312530B (en) Multi-mode emotion classification method taking text as core
Niu et al. Multi-modal multi-scale deep learning for large-scale image annotation
Chandrasekaran et al. Multimodal sentimental analysis for social media applications: A comprehensive review
US20200134398A1 (en) Determining intent from multimodal content embedded in a common geometric space
Cheng et al. Aspect-based sentiment analysis with component focusing multi-head co-attention networks
CN112800184B (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN113343706A (en) Text depression tendency detection system based on multi-modal features and semantic rules
Qi et al. DuReadervis: A Chinese dataset for open-domain document visual question answering
Das et al. Hatemm: A multi-modal dataset for hate video classification
Nadeem et al. SSM: Stylometric and semantic similarity oriented multimodal fake news detection
Bölücü et al. Hate Speech and Offensive Content Identification with Graph Convolutional Networks.
CN117251791B (en) Multi-mode irony detection method based on global semantic perception of graph
CN116561592B (en) Training method of text emotion recognition model, text emotion recognition method and device
CN113361252A (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN113918710A (en) Text data processing method and device, electronic equipment and readable storage medium
Al-Tameemi et al. Multi-model fusion framework using deep learning for visual-textual sentiment classification
Hoek et al. Automatic coherence analysis of Dutch: Testing the subjectivity hypothesis on a larger scale
Chundi et al. SAEKCS: Sentiment analysis for English–Kannada code switchtext using deep learning techniques
Gupta et al. Dsc iit-ism at semeval-2020 task 8: Bi-fusion techniques for deep meme emotion analysis
Swamy et al. Nit-agartala-nlp-team at semeval-2020 task 8: Building multimodal classifiers to tackle internet humor
Mukherjee Extracting aspect specific sentiment expressions implying negative opinions
Tian et al. Emotion-aware multimodal pre-training for image-grounded emotional response generation
Xie et al. ReCoMIF: Reading comprehension based multi-source information fusion network for Chinese spoken language understanding
Wan et al. Emotion cause detection with a hierarchical network
Fu et al. Sentiment Analysis of Tourist Scenic Spots Internet Comments Based on LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant