CN113312530B - Multi-modal emotion classification method with text as the core - Google Patents

Multi-modal emotion classification method with text as the core

Info

Publication number
CN113312530B
Authority
CN
China
Prior art keywords
feature
sequence
text
acoustic
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110652703.2A
Other languages
Chinese (zh)
Other versions
CN113312530A (en)
Inventor
秦兵
吴洋
赵妍妍
胡晓毓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110652703.2A priority Critical patent/CN113312530B/en
Publication of CN113312530A publication Critical patent/CN113312530A/en
Application granted granted Critical
Publication of CN113312530B publication Critical patent/CN113312530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/906 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 - Clustering; Classification

Abstract

A multi-modal emotion classification method with text as the core relates to the technical field of natural language processing and aims to solve the problem that the prior art treats the semantic information of each modality as a whole, lacks the ability to model interaction between different modalities, and therefore classifies emotion inaccurately. The method uses a shared-private framework comprising two parts. One part is a cross-modal prediction model that takes text-modality features as input and outputs speech/image-modality features; judgment rules for shared and private features are designed on the basis of this model and are used to distinguish the shared features from the private features. The other part is an emotion prediction model that fuses the text-modality features with the shared and private features of speech/images using a cross-modal attention mechanism, finally obtaining multi-modal fusion features for emotion classification.

Description

Multi-modal emotion classification method with text as the core
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a multi-modal emotion classification method with text as the core.
Background
Multimodal emotion analysis is an emerging research field aimed at understanding people's emotions using textual and non-textual (visual, acoustic) data. The task has recently attracted increasing attention, because it has been recognized that non-textual cues help detect emotion and identify opinions in videos.
In multimodal emotion analysis there are two main lines of work. One focuses on multimodal feature fusion at the utterance level. Such methods use features of the entire utterance: visual or acoustic features are first extracted at the frame level and then averaged to obtain the final utterance-level features, while utterance-level textual features can be obtained by applying RNNs. The resulting utterance-level features are fed into a fusion model to obtain a multimodal representation, and several effective multimodal feature fusion models have been proposed (Zadeh et al., 2017; Liu et al., 2018; Mai et al., 2020). However, utterance-level features mainly contain global information and may fail to capture local information. Recent work has therefore focused primarily on word-level multimodal features. To extract word-level features, the first step is to obtain the timestamp of each word appearing in the video, including its start time and end time. The video is then segmented into clips based on these timestamps. Finally, word-level visual or acoustic features are obtained by averaging the frame-level features of each clip. Researchers have proposed many methods for word-level multimodal feature fusion (Zadeh et al., 2018; Wang et al., 2019; Tsai et al., 2019; Vaswani et al., 2017). In addition, related work (Pham et al., 2019) considers that a joint representation can be learned by translating a source modality into a target modality and proposes the Multimodal Cyclic Translation Network (MCTN) to learn joint multimodal representations.
Existing work has shown that, compared with traditional text sentiment analysis (Liu, 2012), adding non-textual data can improve the performance of emotion analysis (Chen et al., 2017; Zadeh et al., 2018; Sun et al., 2020). There are two reasons for this. The first is that the three modalities may convey some common semantics; in this case the non-textual data provide no additional information beyond the text, but the repeated information can still enhance the final performance. This is referred to as shared semantics. The other reason is that each modality also carries its own specific semantic information that differs from the other modalities. This information is modality-specific, difficult to predict from the text data alone, and is called private semantics. By combining the private semantic information, the final emotion can be detected more accurately.
Previous work generally does not distinguish between shared and private semantics, but rather treats the semantic information of each modality as a whole, and therefore lacks the ability to explore the interaction between different modalities.
Disclosure of Invention
The purpose of the invention is to provide a multi-modal emotion classification method with text as the core, aiming at the problem that the prior art treats the semantic information of each modality as a whole, lacks the ability to model interaction between different modalities, and therefore classifies emotion inaccurately.
The technical solution adopted by the invention to solve the above problem is as follows:
A multi-modal emotion classification method with text as the core comprises the following steps:
Step one: extract a text feature sequence, a visual feature sequence and an acoustic feature sequence from the data, then train cross-modal prediction model I with the text feature sequence and the visual feature sequence, and train cross-modal prediction model II with the text feature sequence and the acoustic feature sequence; training ends when the loss function values of cross-modal prediction model I and cross-modal prediction model II no longer decrease;
Step two: input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and then obtain the visual shared features and the visual private features from the output visual feature sequence;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and then obtain the acoustic shared features and the acoustic private features from the output acoustic feature sequence;
Step three: fuse the text feature sequence to be tested with the visual shared features and the acoustic shared features, and then fuse the result with the visual private features and the acoustic private features to obtain the final fusion result;
Step four: input the final fusion result into a classifier for classification;
the visual shared features and the acoustic shared features are features that contain no additional information relative to the text features, while the visual private features and the acoustic private features are features that contain information not present in the text features.
Further, the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the N text features with the largest attention weights; that feature in the visual feature sequence is then a shared feature of each of these N text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the N text features with the largest attention weights; that feature in the acoustic feature sequence is then a shared feature of each of these N text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained;
N is 3, 4 or 5.
Further, N is 5.
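The top-N rule can be sketched as follows. This is a non-authoritative illustration in PyTorch, assuming the cross-modal prediction model exposes an attention matrix of shape [target length, text length]; the function and tensor names are hypothetical and not taken from the patent.

```python
import torch

def shared_features_topN(attn: torch.Tensor, N: int = 5):
    """attn[j, i]: attention weight on text feature i when the cross-modal prediction
    model generates target (visual or acoustic) feature j.  For each generated feature j,
    the N most-attended text features are found, and j is recorded as a shared feature
    of each of them.  Returns {text index: [indices of its shared target features]}."""
    target_len, text_len = attn.shape
    shared = {i: [] for i in range(text_len)}
    top_text = torch.topk(attn, k=min(N, text_len), dim=-1).indices  # [target_len, N]
    for j in range(target_len):
        for i in top_text[j].tolist():
            shared[i].append(j)
    return shared

# toy usage: 8 generated visual features attending over 6 text features
attn = torch.softmax(torch.randn(8, 6), dim=-1)
visual_shared = shared_features_topN(attn, N=5)
```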
Further, the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the visual feature sequence is then a shared feature of each of these text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the acoustic feature sequence is then a shared feature of each of these text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained.
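The threshold rule differs only in how the text features are selected for each generated feature; a brief sketch under the same assumptions as the top-N sketch above (0.05 is the cut-off stated in this embodiment):

```python
import torch

def shared_features_threshold(attn: torch.Tensor, tau: float = 0.05):
    """Same convention as the top-N sketch: a generated target feature j is recorded as
    a shared feature of every text feature whose attention weight at step j exceeds tau."""
    target_len, text_len = attn.shape
    shared = {i: [] for i in range(text_len)}
    for j in range(target_len):
        for i in (attn[j] > tau).nonzero(as_tuple=True)[0].tolist():
            shared[i].append(j)
    return shared
```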
Further, the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the acoustic private features.
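A minimal sketch of the loss-based private-feature rule, assuming a per-step prediction loss is available (for example the mean squared error between the predicted and the true non-text feature at each time step); the helper name and the choice of MSE are illustrative assumptions:

```python
import torch

def private_features_topk(per_step_loss: torch.Tensor, k: int = 5):
    """per_step_loss[j]: prediction loss of the cross-modal prediction model for generated
    (visual or acoustic) feature j.  The k hardest-to-predict features are taken as private."""
    k = min(k, per_step_loss.numel())
    return torch.topk(per_step_loss, k).indices  # indices of the private features

# toy usage: per-step MSE over 12 predicted visual features of dimension 35
pred, true = torch.randn(12, 35), torch.randn(12, 35)
step_loss = ((pred - true) ** 2).mean(dim=-1)
visual_private = private_features_topk(step_loss, k=5)
```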
Further, the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the acoustic private features.
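The threshold variant of the private-feature rule, under the same assumptions (0.02 is the cut-off stated in this embodiment):

```python
import torch

def private_features_threshold(per_step_loss: torch.Tensor, tau: float = 0.02):
    """Every generated feature whose prediction loss exceeds tau is treated as private."""
    return (per_step_loss > tau).nonzero(as_tuple=True)[0]
```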
Further, cross-modal prediction model I and cross-modal prediction model II each comprise an encoder and a decoder.
Further, the encoder and the decoder are implemented with an LSTM or a Transformer.
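A minimal LSTM encoder-decoder sketch of one cross-modal prediction model (text to visual, or text to acoustic), returning the per-step attention weights consumed by the shared-feature rules and the per-step losses consumed by the private-feature rules. The hidden size, the dot-product attention, teacher forcing and the MSE objective are illustrative assumptions, not the patented configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalPredictor(nn.Module):
    """Text-feature sequence in, target-modality (visual or acoustic) feature sequence out."""
    def __init__(self, d_text, d_target, d_hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(d_text, d_hidden, batch_first=True)
        self.decoder = nn.LSTMCell(d_target, d_hidden)
        self.out = nn.Linear(2 * d_hidden, d_target)

    def forward(self, text, target):
        """text: [B, Lt, d_text], target: [B, Lv, d_target] (teacher forcing).
        Returns predictions, attention weights [B, Lv, Lt] and per-step losses [B, Lv]."""
        enc, (h, c) = self.encoder(text)                        # enc: [B, Lt, H]
        h, c = h[-1], c[-1]
        B, Lv, _ = target.shape
        prev = torch.zeros(B, target.size(-1), device=text.device)
        preds, attns = [], []
        for t in range(Lv):
            h, c = self.decoder(prev, (h, c))                   # one decoding step
            score = torch.bmm(enc, h.unsqueeze(-1)).squeeze(-1) # dot-product attention over text
            alpha = F.softmax(score, dim=-1)                    # [B, Lt]
            ctx = torch.bmm(alpha.unsqueeze(1), enc).squeeze(1) # attention context, [B, H]
            pred = self.out(torch.cat([h, ctx], dim=-1))        # predicted target feature
            preds.append(pred)
            attns.append(alpha)
            prev = target[:, t]                                 # teacher forcing
        preds, attns = torch.stack(preds, 1), torch.stack(attns, 1)
        step_loss = ((preds - target) ** 2).mean(dim=-1)        # per-step MSE, used by the private rule
        return preds, attns, step_loss

# toy usage: batch of 2, 6 text features (dim 300), 8 visual features (dim 35)
model = CrossModalPredictor(d_text=300, d_target=35)
text, visual = torch.randn(2, 6, 300), torch.randn(2, 8, 35)
preds, attn, step_loss = model(text, visual)
step_loss.mean().backward()   # training objective: predict the target-modality features
```

In this sketch the same class would be instantiated twice, once as cross-modal prediction model I (text to visual) and once as model II (text to acoustic), each trained until its loss stops decreasing as described in step one.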
Further, the fusion in step three comprises the following specific steps, which are also sketched in code below:
Step 3.1: input the visual feature sequence into a first LSTM to obtain a visual feature representation sequence, input the text feature sequence into a second LSTM to obtain a text feature representation sequence, and input the acoustic feature sequence into a third LSTM to obtain an acoustic feature representation sequence;
Step 3.2: use a cross-modal attention mechanism to fuse the text feature representation sequence with the visual feature representations corresponding to the visual shared features, obtaining a visual shared representation sequence; use a cross-modal attention mechanism to fuse the text feature representation sequence with the acoustic feature representations corresponding to the acoustic shared features, obtaining an acoustic shared representation sequence;
Step 3.3: concatenate the visual shared representation sequence, the acoustic shared representation sequence and the text feature representation sequence, feed the concatenation into a fourth LSTM to obtain a shared fusion representation, and transform the shared fusion representation with a self-attention mechanism to obtain the shared representation;
Step 3.4: fuse the visual feature representations corresponding to the visual private features with an attention mechanism to obtain the visual private representation, and fuse the acoustic feature representations corresponding to the acoustic private features with an attention mechanism to obtain the acoustic private representation;
Step 3.5: concatenate the shared representation, the visual private representation and the acoustic private representation to obtain the final fusion result.
Further, the classifier in step four is softmax, logistic regression or an SVM.
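A compact sketch of steps 3.1 to 3.5 together with the step-four classifier, using standard PyTorch modules. The use of nn.MultiheadAttention for the cross-modal and self-attention steps, the hidden sizes, the simple attention pooling over private positions, and the application of the shared/private selections as boolean masks over the non-text time steps are illustrative assumptions rather than the patented configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPrivateFusion(nn.Module):
    """Steps 3.1-3.5: encode each modality with an LSTM, fuse text with the *shared*
    visual/acoustic representations via cross-modal attention, fuse the result with a
    fourth LSTM plus self-attention, pool the *private* visual/acoustic representations
    with an attention mechanism, concatenate, and classify (softmax over 3 classes)."""
    def __init__(self, d_text, d_vis, d_ac, h=128, n_cls=3):
        super().__init__()
        self.lstm_v = nn.LSTM(d_vis, h, batch_first=True)              # step 3.1
        self.lstm_t = nn.LSTM(d_text, h, batch_first=True)
        self.lstm_a = nn.LSTM(d_ac, h, batch_first=True)
        self.xattn_v = nn.MultiheadAttention(h, 4, batch_first=True)   # step 3.2
        self.xattn_a = nn.MultiheadAttention(h, 4, batch_first=True)
        self.lstm_fuse = nn.LSTM(3 * h, h, batch_first=True)           # step 3.3
        self.self_attn = nn.MultiheadAttention(h, 4, batch_first=True)
        self.pool_v = nn.Linear(h, 1)                                  # step 3.4 (attention pooling)
        self.pool_a = nn.Linear(h, 1)
        self.cls = nn.Linear(3 * h, n_cls)                             # step four

    @staticmethod
    def _attn_pool(x, score_layer, keep):
        """Attention-pool only the positions selected by the boolean mask `keep`."""
        w = score_layer(x).squeeze(-1).masked_fill(~keep, float('-inf'))
        w = torch.softmax(w, dim=-1).unsqueeze(1)                      # [B, 1, L]
        return torch.bmm(w, x).squeeze(1)                              # [B, h]

    def forward(self, text, vis, ac, vis_shared, vis_private, ac_shared, ac_private):
        """The four masks are boolean selections over the visual/acoustic time steps
        produced by the shared-feature and private-feature rules."""
        Hv, _ = self.lstm_v(vis)
        Ht, _ = self.lstm_t(text)
        Ha, _ = self.lstm_a(ac)                                        # step 3.1
        # step 3.2: text attends only to its shared visual / acoustic representations
        Sv, _ = self.xattn_v(Ht, Hv, Hv, key_padding_mask=~vis_shared)
        Sa, _ = self.xattn_a(Ht, Ha, Ha, key_padding_mask=~ac_shared)
        # step 3.3: concatenate, fuse with an LSTM, refine with self-attention
        fused, _ = self.lstm_fuse(torch.cat([Sv, Sa, Ht], dim=-1))
        shared, _ = self.self_attn(fused, fused, fused)
        shared = shared[:, 0]                                          # output at the first position
        # step 3.4: attention-pool the private visual / acoustic representations
        pv = self._attn_pool(Hv, self.pool_v, vis_private)
        pa = self._attn_pool(Ha, self.pool_a, ac_private)
        # step 3.5 + step four: concatenate and classify
        return F.softmax(self.cls(torch.cat([shared, pv, pa], dim=-1)), dim=-1)

# toy usage
B, Lt, Lv, La = 2, 6, 8, 10
model = SharedPrivateFusion(d_text=300, d_vis=35, d_ac=74)
probs = model(torch.randn(B, Lt, 300), torch.randn(B, Lv, 35), torch.randn(B, La, 74),
              vis_shared=torch.rand(B, Lv) > 0.3, vis_private=torch.rand(B, Lv) > 0.3,
              ac_shared=torch.rand(B, La) > 0.3, ac_private=torch.rand(B, La) > 0.3)
```

Feeding the selections in as boolean masks matches the masking mechanism described later in the principle section, where the weight of every unselected position is set to 0.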
The invention has the beneficial effects that:
the shared private framework provided by the application has high accuracy in multi-modal emotion analysis. In addition, the shared and private features of the non-textual modalities derived from the cross-modality prediction task of the present application may provide more interpretable clues for interactions between different modalities. Thus, these non-textual shared-private features can be fused together with textual features to improve multimodal sentiment analysis. The shared mask in the application can enable the emotion regression model to obtain the characteristics of modal sharing, so that a more stable regression model is formed. Private masking concentrates the regression model on modality-specific features, which provides additional information for emotion prediction. With the help of sharing and private masks, the regression model in the sharing-private framework can independently fuse text features and two types of non-text features, and is more effective.
Drawings
FIG. 1 is a schematic overall structure of the present application;
FIG. 2 is a schematic diagram of the shared and private features of the present application;
fig. 3 is a schematic diagram of the shared features of the present application.
Detailed Description
It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.
Embodiment one: specifically, referring to FIG. 1, the multi-modal emotion classification method with text as the core according to this embodiment comprises the following steps:
Step one: extract a text feature sequence, a visual feature sequence and an acoustic feature sequence from the data, then train cross-modal prediction model I with the text feature sequence and the visual feature sequence, and train cross-modal prediction model II with the text feature sequence and the acoustic feature sequence; training ends when the loss function values of cross-modal prediction model I and cross-modal prediction model II no longer decrease;
Step two: input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and then obtain the visual shared features and the visual private features from the output visual feature sequence;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and then obtain the acoustic shared features and the acoustic private features from the output acoustic feature sequence;
Step three: fuse the text feature sequence to be tested with the visual shared features and the acoustic shared features, and then fuse the result with the visual private features and the acoustic private features to obtain the final fusion result;
Step four: input the final fusion result into a classifier for classification;
the visual shared features and the acoustic shared features are features that contain no additional information relative to the text features, while the visual private features and the acoustic private features are features that contain information not present in the text features.
To address the problems in the prior art, the present application proposes a shared-private framework for text-centered multimodal emotion analysis. In this framework the text modality is regarded as the core modality. The application first designs a cross-modal prediction task to distinguish shared and private semantics between the text modality and the non-text (visual and acoustic) modalities, and then provides an emotion regression model comprising a shared module and a private module, which fuses the text features and the two types of non-text features for emotion analysis.
Embodiment two: this embodiment is a further description of embodiment one; it differs from embodiment one in that the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the N text features with the largest attention weights; that feature in the visual feature sequence is then a shared feature of each of these N text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the N text features with the largest attention weights; that feature in the acoustic feature sequence is then a shared feature of each of these N text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained;
N is 3, 4 or 5.
Embodiment three: this embodiment is a further description of embodiment one; it differs in that N is 5.
Embodiment four: this embodiment is a further description of embodiment one; it differs from embodiment one in that the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the visual feature sequence is then a shared feature of each of these text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the acoustic feature sequence is then a shared feature of each of these text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained.
Embodiment five: this embodiment is a further description of embodiment three or embodiment four; it differs from them in that the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the acoustic private features.
Embodiment six: this embodiment is a further description of embodiment three or embodiment four; it differs from them in that the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the acoustic private features.
Embodiment seven: this embodiment is a further description of embodiment one; it differs from embodiment one in that cross-modal prediction model I and cross-modal prediction model II each comprise an encoder and a decoder.
Embodiment eight: this embodiment is a further description of embodiment seven; it differs from embodiment seven in that the encoder and the decoder are implemented with an LSTM or a Transformer.
Embodiment nine: this embodiment is a further description of embodiment eight; it differs from embodiment eight in that the fusion in step three comprises the following specific steps:
Step 3.1: input the visual feature sequence into a first LSTM to obtain a visual feature representation sequence, input the text feature sequence into a second LSTM to obtain a text feature representation sequence, and input the acoustic feature sequence into a third LSTM to obtain an acoustic feature representation sequence;
Step 3.2: use a cross-modal attention mechanism to fuse the text feature representation sequence with the visual feature representations corresponding to the visual shared features, obtaining a visual shared representation sequence; use a cross-modal attention mechanism to fuse the text feature representation sequence with the acoustic feature representations corresponding to the acoustic shared features, obtaining an acoustic shared representation sequence;
Step 3.3: concatenate the visual shared representation sequence, the acoustic shared representation sequence and the text feature representation sequence, feed the concatenation into a fourth LSTM to obtain a shared fusion representation, and transform the shared fusion representation with a self-attention mechanism to obtain the shared representation;
Step 3.4: fuse the visual feature representations corresponding to the visual private features with an attention mechanism to obtain the visual private representation, and fuse the acoustic feature representations corresponding to the acoustic private features with an attention mechanism to obtain the acoustic private representation;
Step 3.5: concatenate the shared representation, the visual private representation and the acoustic private representation to obtain the final fusion result.
Embodiment ten: this embodiment is a further description of embodiment nine; it differs from embodiment nine in that the classifier in step four is softmax, logistic regression or an SVM.
The principle is as follows:
The application is a shared-private framework that takes text as the core modality for multi-modal emotion analysis. The framework mainly comprises two parts. One part is a cross-modal prediction model, which takes text-modality features as input and outputs speech/image-modality features; judgment rules for shared and private features are designed on the basis of this model, and the rules are then used to distinguish the shared features from the private features. The other part is an emotion prediction model, which fuses the text-modality features with the shared and private features of speech/images using a cross-modal attention mechanism and finally obtains multi-modal fusion features for emotion classification.
The cross-modal prediction model consists of an encoder and a decoder, both implemented with LSTMs. The encoder takes the input text feature sequence and outputs an encoded text representation in which the information in the text features is modeled. The input to the decoder is the text representation output by the encoder; at each time step the decoder outputs a feature of the target modality, and the output of each step depends on the output of the previous time step and on the encoder input. The training objective of the cross-modal prediction model is to predict the image/audio features corresponding to the input text features.
To mine the relationship between the text modality and the speech/image features, the application defines shared and private features. Shared features contain no additional information relative to the text features but provide overlapping semantic information; such features can make the model's predictions more robust. The judgment rule for this type of feature is as follows: first obtain, from the cross-modal prediction model, the attention weights over the input text feature sequence when each target feature is generated, then keep the 5 text features with the largest attention weights for each generated target feature; the target feature is then called a shared feature of each of those text features. Private features contain information that is not available in the text features and is difficult to predict from them; the judgment rule for this type of feature is that a target feature is regarded as private if its prediction loss is high. With these two types of rules, the two kinds of information can be separated by the cross-modal prediction model and then sent to the emotion prediction model for emotion prediction.
The emotion prediction model consists of a feature input encoding module, a shared feature encoding module and a private feature encoding module. The feature input encoding module uses LSTMs to encode the input text, speech and image features and obtain feature representations with context information. The shared feature encoding module uses a cross-modal attention model: the text representation from the feature input encoding module performs cross-modal interaction with the speech/image feature representations shared with it, yielding shared representations of the non-text features. The text representation is concatenated with the speech and image shared representations, the concatenated representation is fused and encoded by an LSTM, a self-attention layer is applied for deeper feature interaction, and the output at the first position is taken as the multi-modal shared feature representation. The input to the private feature encoding module is the private feature representations of speech and images; it uses an attention mechanism to give higher weight to more important features, yielding the modality-private feature representations. The modality-shared feature representation and the modality-private feature representations are concatenated, and the result is fed into a classification layer to produce the final prediction. In the implementation, the selection of the private and shared features is realized by a masking mechanism, i.e. the weight of each unselected position is set to 0. The classification results include positive, negative and neutral (the shared and private features are illustrated in FIG. 2 and FIG. 3).
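The masking mechanism mentioned above can be illustrated with a minimal sketch; it assumes attention weights over the non-text time steps, and the renormalization after zeroing is an added assumption, since the description only states that unselected positions receive weight 0.

```python
import torch

def apply_mask(attn_weights: torch.Tensor, selected: torch.Tensor) -> torch.Tensor:
    """attn_weights: [..., L] attention weights over the non-text feature positions.
    selected: boolean mask (shape [L] or broadcastable) marking the positions kept by
    the shared or private rule.  Unselected positions get weight 0; the remaining
    weights are renormalized so they still sum to 1 (the renormalization is an assumption)."""
    masked = attn_weights * selected.to(attn_weights.dtype)
    return masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-12)

# toy usage: a shared mask keeping positions 1, 3 and 4 of a 6-step visual sequence
weights = torch.softmax(torch.randn(6), dim=-1)
shared_mask = torch.tensor([False, True, False, True, True, False])
print(apply_mask(weights, shared_mask))
```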
In summary, the method takes text as the core modality and uses a shared-private framework for multi-modal emotion analysis: shared and private features are mined from the image and speech features by a cross-modal prediction model implemented as an encoder-decoder, the text features are then fused with the shared and private features in the emotion prediction model, and the emotion label is finally predicted.
The application compares the proposed method with several baseline methods; the experimental results are shown in Table 1. The base model of the application does not obtain the best results on the Acc and F1 metrics of the MOSEI dataset and performs worse than RAVEN and MulT. However, with the help of the cross-modal prediction task, the text-centered shared-private framework (TCSP) of the application achieves the best performance and outperforms all baseline methods on both datasets. This demonstrates that the proposed shared-private framework is effective for multimodal emotion analysis. Furthermore, the shared and private features of the non-textual modalities derived from the cross-modal prediction task can provide more interpretable clues about the interactions between different modalities; these non-textual shared and private features can thus be fused with the textual features to improve multimodal emotion analysis. On the MOSI dataset there is a large gap between the performance of the complete model and the base model. The application attributes this to the small amount of data in the MOSI dataset, which is insufficient for training the base model; in the complete model, however, the model also benefits from the shared and private information.
TABLE 1 Experimental results on MOSI and MOSEI
TABLE 2 ablation experimental results on MOSI and MOSEI
Ablation experiments were performed to distinguish the contribution of each part. As shown in Table 2, ablating either the shared mask or the private mask harms the performance of the model, indicating that both parts are useful for emotion prediction. The shared mask enables the emotion regression model to obtain modality-shared features, producing a more robust regression model. The private mask concentrates the regression model on modality-specific features, which provide additional information for emotion prediction. With the help of the shared and private masks, the regression model in the shared-private framework can fuse the text features and the two types of non-text features independently, and is therefore more effective.
It should be noted that the detailed description is only for explaining and explaining the technical solution of the present invention, and the scope of protection of the claims is not limited thereby. It is intended that all such modifications and variations be included within the scope of the invention as defined in the following claims and the description.

Claims (8)

1. A multi-modal emotion classification method with text as the core, characterized by comprising the following steps:
Step one: extract a text feature sequence, a visual feature sequence and an acoustic feature sequence from the data, then train cross-modal prediction model I with the text feature sequence and the visual feature sequence, and train cross-modal prediction model II with the text feature sequence and the acoustic feature sequence; training ends when the loss function values of cross-modal prediction model I and cross-modal prediction model II no longer decrease;
Step two: input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and then obtain the visual shared features and the visual private features from the output visual feature sequence;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and then obtain the acoustic shared features and the acoustic private features from the output acoustic feature sequence;
Step three: fuse the text feature sequence to be tested with the visual shared features and the acoustic shared features, and then fuse the result with the visual private features and the acoustic private features to obtain the final fusion result;
Step four: input the final fusion result into a classifier for classification;
the visual shared features and the acoustic shared features are features that contain no additional information relative to the text features, while the visual private features and the acoustic private features are features that contain information not present in the text features;
the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the N text features with the largest attention weights; that feature in the visual feature sequence is then a shared feature of each of these N text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the N text features with the largest attention weights; that feature in the acoustic feature sequence is then a shared feature of each of these N text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained;
N is 3, 4 or 5;
the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features corresponding to the five largest loss function values as the private features, namely the acoustic private features.
2. The method of claim 1, wherein N is 5.
3. The method of claim 1, wherein the visual shared features and the acoustic shared features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, and for each feature in the output visual feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the visual feature sequence is then a shared feature of each of these text features; this is done for every feature in the visual feature sequence until all shared features corresponding to each text feature, namely the visual shared features, are obtained;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, and for each feature in the output acoustic feature sequence obtain the text features whose attention weight is greater than 0.05; that feature in the acoustic feature sequence is then a shared feature of each of these text features; this is done for every feature in the acoustic feature sequence until all shared features corresponding to each text feature, namely the acoustic shared features, are obtained.
4. The multi-modal emotion classification method with text as the core according to claim 2 or 3, wherein the private features in step two are obtained through the following steps:
input the text feature sequence to be tested into cross-modal prediction model I to obtain an output visual feature sequence, then obtain the loss function value of each feature in the output visual feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the visual private features;
input the text feature sequence to be tested into cross-modal prediction model II to obtain an output acoustic feature sequence, then obtain the loss function value of each feature in the output acoustic feature sequence, and take the features whose loss function value is greater than 0.02 as the private features, namely the acoustic private features.
5. The method of claim 1, wherein cross-modal prediction model I and cross-modal prediction model II each comprise an encoder and a decoder.
6. The method of claim 5, wherein the encoder and the decoder are implemented with an LSTM or a Transformer.
7. The method of claim 6, wherein the fusion in step three comprises the following specific steps:
Step 3.1: input the visual feature sequence into a first LSTM to obtain a visual feature representation sequence, input the text feature sequence into a second LSTM to obtain a text feature representation sequence, and input the acoustic feature sequence into a third LSTM to obtain an acoustic feature representation sequence;
Step 3.2: use a cross-modal attention mechanism to fuse the text feature representation sequence with the visual feature representations corresponding to the visual shared features, obtaining a visual shared representation sequence; use a cross-modal attention mechanism to fuse the text feature representation sequence with the acoustic feature representations corresponding to the acoustic shared features, obtaining an acoustic shared representation sequence;
Step 3.3: concatenate the visual shared representation sequence, the acoustic shared representation sequence and the text feature representation sequence, feed the concatenation into a fourth LSTM to obtain a shared fusion representation, and transform the shared fusion representation with a self-attention mechanism to obtain the shared representation;
Step 3.4: fuse the visual feature representations corresponding to the visual private features with an attention mechanism to obtain the visual private representation, and fuse the acoustic feature representations corresponding to the acoustic private features with an attention mechanism to obtain the acoustic private representation;
Step 3.5: concatenate the shared representation, the visual private representation and the acoustic private representation to obtain the final fusion result.
8. The method of claim 7, wherein the classifier in step four is softmax, logistic regression or an SVM.
CN202110652703.2A 2021-06-09 2021-06-09 Multi-mode emotion classification method taking text as core Active CN113312530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110652703.2A CN113312530B (en) 2021-06-09 2021-06-09 Multi-mode emotion classification method taking text as core

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110652703.2A CN113312530B (en) 2021-06-09 2021-06-09 Multi-mode emotion classification method taking text as core

Publications (2)

Publication Number Publication Date
CN113312530A CN113312530A (en) 2021-08-27
CN113312530B true CN113312530B (en) 2022-02-15

Family

ID=77378264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110652703.2A Active CN113312530B (en) 2021-06-09 2021-06-09 Multi-mode emotion classification method taking text as core

Country Status (1)

Country Link
CN (1) CN113312530B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089602B (en) * 2021-11-04 2024-05-03 腾讯科技(深圳)有限公司 Information processing method, apparatus, electronic device, storage medium, and program product
CN115186683B (en) * 2022-07-15 2023-05-23 哈尔滨工业大学 Attribute-level multi-modal emotion classification method based on cross-modal translation
CN115983280B (en) * 2023-01-31 2023-08-15 烟台大学 Multi-mode emotion analysis method and system for uncertain mode deletion
CN117636074B (en) * 2024-01-25 2024-04-26 山东建筑大学 Multi-mode image classification method and system based on feature interaction fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288609A (en) * 2019-05-30 2019-09-27 南京师范大学 A kind of multi-modal whole-heartedly dirty image partition method of attention mechanism guidance
CN111275085A (en) * 2020-01-15 2020-06-12 重庆邮电大学 Online short video multi-modal emotion recognition method based on attention fusion
CN111460223A (en) * 2020-02-25 2020-07-28 天津大学 Short video single-label classification method based on multi-mode feature fusion of deep network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753549B (en) * 2020-05-22 2023-07-21 江苏大学 Multi-mode emotion feature learning and identifying method based on attention mechanism
CN112489635B (en) * 2020-12-03 2022-11-11 杭州电子科技大学 Multi-mode emotion recognition method based on attention enhancement mechanism

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288609A (en) * 2019-05-30 2019-09-27 南京师范大学 A kind of multi-modal whole-heartedly dirty image partition method of attention mechanism guidance
CN111275085A (en) * 2020-01-15 2020-06-12 重庆邮电大学 Online short video multi-modal emotion recognition method based on attention fusion
CN111460223A (en) * 2020-02-25 2020-07-28 天津大学 Short video single-label classification method based on multi-mode feature fusion of deep network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis";Kaicheng Yang等;《Proceedings of the 28th ACM International Conference on Multimedia》;20201012;第521-528页 *
"基于深度多模态特征融合的短视频分类";张丽娟等;《北京航空航天大学学报》;20210315;第47卷(第3期);第478-485页 *
"面向多模态信息的情绪分类方法研究";吴良庆;《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》;20210215;全文 *

Also Published As

Publication number Publication date
CN113312530A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN113312530B (en) Multi-mode emotion classification method taking text as core
Niu et al. Multi-modal multi-scale deep learning for large-scale image annotation
Chandrasekaran et al. Multimodal sentimental analysis for social media applications: A comprehensive review
US20200134398A1 (en) Determining intent from multimodal content embedded in a common geometric space
Cheng et al. Aspect-based sentiment analysis with component focusing multi-head co-attention networks
CN112800184B (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN113343706A (en) Text depression tendency detection system based on multi-modal features and semantic rules
Qi et al. DuReadervis: A Chinese dataset for open-domain document visual question answering
Das et al. Hatemm: A multi-modal dataset for hate video classification
Nadeem et al. SSM: Stylometric and semantic similarity oriented multimodal fake news detection
Bölücü et al. Hate Speech and Offensive Content Identification with Graph Convolutional Networks.
CN117251791B (en) Multi-mode irony detection method based on global semantic perception of graph
CN116561592B (en) Training method of text emotion recognition model, text emotion recognition method and device
CN113361252A (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN113918710A (en) Text data processing method and device, electronic equipment and readable storage medium
Al-Tameemi et al. Multi-model fusion framework using deep learning for visual-textual sentiment classification
Hoek et al. Automatic coherence analysis of Dutch: Testing the subjectivity hypothesis on a larger scale
Chundi et al. SAEKCS: Sentiment analysis for English–Kannada code switchtext using deep learning techniques
Gupta et al. Dsc iit-ism at semeval-2020 task 8: Bi-fusion techniques for deep meme emotion analysis
Swamy et al. Nit-agartala-nlp-team at semeval-2020 task 8: Building multimodal classifiers to tackle internet humor
Mukherjee Extracting aspect specific sentiment expressions implying negative opinions
Tian et al. Emotion-aware multimodal pre-training for image-grounded emotional response generation
Xie et al. ReCoMIF: Reading comprehension based multi-source information fusion network for Chinese spoken language understanding
Wan et al. Emotion cause detection with a hierarchical network
Fu et al. Sentiment Analysis of Tourist Scenic Spots Internet Comments Based on LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant