CN113705703A - Image-text multi-modal emotion recognition method based on BiLSTM and attention mechanism - Google Patents
- Publication number
- CN113705703A (application number CN202111021378.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- vector
- picture
- data
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an image-text multi-modal emotion recognition method based on BiLSTM and an attention mechanism, which comprises the following steps: collecting text data and picture data; vector preprocessing, in which the text and the picture are each given an independent vector representation; jointly training the text vector and the picture vector with an attention mechanism and a GRU model; and combining the text and picture vectors to identify the final comprehensive result through a softmax function. The method preprocesses the text and the picture with WORD2VEC and CNN respectively to obtain preliminary vector representations, performs cross training with BiLSTM, GRU and the attention mechanism, and fuses the results into a softmax layer for final supervised label recognition. In the experiment, model training and analysis were carried out on more than 19,000 data items (each comprising a text and a picture), and the results show that machine learning that fuses pictures and text performs better.
Description
Technical Field
The invention relates to the technical field of machine learning and multi-modal emotion analysis, and in particular to an image-text multi-modal emotion recognition method based on BiLSTM and an attention mechanism.
Background
Nowadays, social network platforms are developing rapidly, and people express themselves on them in increasingly rich forms, including video, text, pictures and audio; in particular, many people express their opinions and moods through pictures and text. How to analyze emotion in multi-modal data is an important topic in the current field of machine learning.
Compared with single-modal data, multiple modalities contain more effective information, and the information from different modalities can supplement each other. For example, consider a Moments post that only says "the weather today" with a laughing picture attached below: no emotional viewpoint can be identified from the text alone, but from the laughing picture it can essentially be determined that the emotion is negative. Different modalities therefore complement each other in emotional expression, and analysis and recognition that combines them performs better. In addition, current emotion analysis methods mainly focus on a single modality; for example, text emotion analysis mainly recognizes and analyzes the emotion in text, while picture or audio-video analysis is involved far less. The present invention is therefore directed at image-text emotion recognition and analysis for multi-modal data comprising pictures and text.
Disclosure of Invention
Based on the technical problems in the background art, the invention provides an image-text multi-modal emotion recognition method based on BiLSTM and an attention mechanism, which performs emotion recognition and analysis with a deep network model combining BiLSTM, GRU and the attention mechanism, and solves the problem that single-modal data yields little effective information and thereby limits emotion recognition accuracy.
The invention provides the following technical solution: the image-text multi-modal emotion recognition method based on BiLSTM and the attention mechanism comprises the following steps:
S1, data collection: collect text data and picture data over a period of time;
S2, vector preprocessing: the text and the picture are each given an independent vector representation, and the text vector and the picture vector undergo feature tuning training through a bidirectional LSTM model;
S3, the text vector and the picture vector are each trained jointly with an attention mechanism (attention) and a GRU model, and the text training result and the picture training result cross-influence each other to obtain implicit expression vectors of the text and the picture respectively;
S4, the text and picture vectors are combined, and the final comprehensive result is identified through a softmax function.
Preferably, the data crawled in step S1 undergo data cleaning, data integration and manual labeling to form a training set, a verification set and a test set, and the result labels of the labeling are: positive, negative, neutral.
Preferably, the data set proportion in step S1 is: training set : verification set : test set = 7:2:1, and different models are obtained by training with different parameter configurations.
Preferably, the vector representation of the text in step S2 is an embedding vector formed by the word vectors W1, W2, …, Wk; the vector representation of the picture is the vector obtained by preprocessing the pixel representation of the picture with a convolutional neural network.
The invention provides an image-text multi-modal emotion recognition method based on BiLSTM and an attention mechanism, which preprocesses the text and the picture with WORD2VEC and CNN respectively to obtain preliminary vector representations, then performs cross training with BiLSTM, GRU and the attention mechanism, and fuses the results into a softmax layer for final supervised label recognition. In the experiment, model training and analysis were carried out on more than 19,000 data items (each comprising a text and a picture), and the results show that machine learning that fuses pictures and text performs better.
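For orientation only, the four steps can be pictured as a single forward pass. The following is a minimal, hypothetical PyTorch skeleton of such a pipeline; the module choices, dimensions and the three-class output are assumptions of this sketch (consistent with the positive/negative/neutral labels above), not the concrete implementation claimed by the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextImageEmotionNet(nn.Module):
    """Hypothetical end-to-end sketch: BiLSTM tuning -> GRU per modality -> fused softmax."""
    def __init__(self, emb_dim=100, img_dim=100, hid=64, n_classes=3):
        super().__init__()
        # S2: independent feature tuning of the text / picture vectors
        self.text_bilstm = nn.LSTM(emb_dim, hid, batch_first=True, bidirectional=True)
        self.img_bilstm = nn.LSTM(img_dim, hid, batch_first=True, bidirectional=True)
        # S3: per-modality GRUs standing in for the ATT + GRU cross training
        self.text_gru = nn.GRU(2 * hid, hid, batch_first=True)
        self.img_gru = nn.GRU(2 * hid, hid, batch_first=True)
        # S4: fused image-text feature -> softmax over positive / negative / neutral
        self.classifier = nn.Linear(2 * hid, n_classes)

    def forward(self, text_emb, img_feat):        # (B, Lt, emb_dim), (B, Lv, img_dim)
        t, _ = self.text_bilstm(text_emb)
        v, _ = self.img_bilstm(img_feat)
        _, ht = self.text_gru(t)                  # implicit (hidden) text vector
        _, hv = self.img_gru(v)                   # implicit (hidden) image vector
        fused = torch.cat([ht[-1], hv[-1]], dim=-1)
        return F.softmax(self.classifier(fused), dim=-1)
```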
Drawings
FIG. 1 is a schematic diagram of the image-text multi-modal processing scheme of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, in order to analyze data of multiple modalities and realize information complementation between them, this project adopts the interactive deep learning framework shown in the figure for image-text multi-modal emotion recognition. The model allows the picture and the text either to express emotion independently (each serving as its own vector representation) or to act as auxiliary information for each other and supplement the other's emotional expression (vector interaction); the vectors are then combined as the comprehensive image-text feature, and finally the comprehensive emotion result is identified through a softmax function.
The invention provides the following technical solution: the image-text multi-modal emotion recognition method based on BiLSTM and the attention mechanism comprises the following steps:
S1, data collection: collect text data and picture data over a period of time. The crawled data undergo data cleaning, data integration and manual labeling to form a training set, a verification set and a test set, and the result labels of the labeling are: positive, negative, neutral.
The data set proportion for model training is: training set : verification set : test set = 7:2:1, and different models are obtained by training with different parameter configurations.
S2, vector preprocessing: the text and the picture are each given an independent vector representation, and the text vector and the picture vector undergo feature tuning training through a bidirectional LSTM model;
the vector representation of the text is an embedding vector formed by the word vectors W1, W2, …, Wk; the vector representation of the picture is the vector obtained by preprocessing the pixel representation of the picture with a convolutional neural network.
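As an illustration of step S2, the sketch below shows one way the preliminary vectors and the BiLSTM tuning could be produced. The library choices (gensim, PyTorch), the toy corpus and all dimensions are assumptions of this example rather than details fixed by the method.

```python
# Hypothetical sketch of step S2: preliminary text / picture vectors and BiLSTM tuning.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec   # gensim >= 4

# --- text: WORD2VEC embedding W1, W2, ..., Wk for a tokenized post (toy corpus) ---
corpus = [["疫情", "终于", "好转", "了"], ["今天", "天气"]]
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)
sentence = corpus[0]
text_vec = torch.tensor(np.stack([w2v.wv[w] for w in sentence])).unsqueeze(0)  # (1, k, 100)

# --- picture: pixel representation passed through a small CNN ---
class PictureCNN(nn.Module):
    def __init__(self, out_dim=100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7),
        )
        self.fc = nn.Linear(32 * 7 * 7, out_dim)

    def forward(self, img):                      # img: (B, 3, H, W)
        h = self.conv(img).flatten(1)
        return self.fc(h).unsqueeze(1)           # (B, 1, 100) preliminary picture vector

img_vec = PictureCNN()(torch.rand(1, 3, 224, 224))

# --- feature tuning of both preliminary vectors with a bidirectional LSTM ---
# (a shared BiLSTM is used here for brevity; separate BiLSTMs per modality are equally possible)
bilstm = nn.LSTM(input_size=100, hidden_size=64, batch_first=True, bidirectional=True)
tuned_text, _ = bilstm(text_vec)                 # (1, k, 128)
tuned_img, _ = bilstm(img_vec)                   # (1, 1, 128)
```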
S3, the text vector and the picture vector are each trained jointly with attention (ATT) and GRU models, and the text training result and the picture training result cross-influence each other, yielding implicit expression vectors of the text and the picture respectively, namely a hidden text vector and a hidden image vector.
As shown in fig. 1, for the picture on the right, the result obtained by passing the text on the left through ATT + GRU is blended, with a lower weight, into the ATT + GRU result of the picture and then fed into the next ATT + GRU layer; the text on the left is processed similarly, with the result obtained from the picture on the right merged, with a lower weight, into the next ATT + GRU layer of the text analysis.
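A minimal sketch of this cross influence, assuming PyTorch, a simple scaled dot-product attention and a fixed low blending weight (all assumptions of this example, not values specified by the invention), could look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttGRULayer(nn.Module):
    """One ATT + GRU layer: scaled dot-product self-attention followed by a GRU."""
    def __init__(self, dim=128):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):                                    # x: (B, L, dim)
        scores = torch.bmm(x, x.transpose(1, 2)) / x.size(-1) ** 0.5
        attended = torch.bmm(F.softmax(scores, dim=-1), x)   # attention-weighted features
        out, _ = self.gru(attended)
        return out

class CrossModalBlock(nn.Module):
    """Text and picture ATT+GRU stacks whose first-layer outputs cross-influence
    each other with a small weight before the second ATT+GRU layer."""
    def __init__(self, dim=128, cross_weight=0.2):
        super().__init__()
        self.text_l1, self.text_l2 = AttGRULayer(dim), AttGRULayer(dim)
        self.img_l1, self.img_l2 = AttGRULayer(dim), AttGRULayer(dim)
        self.w = cross_weight                                 # "lower weight" for the other modality

    def forward(self, text, img):                             # (B, Lt, dim), (B, Lv, dim)
        t1, v1 = self.text_l1(text), self.img_l1(img)
        # blend the other modality's pooled result into each stream, then apply layer 2
        t_in = t1 + self.w * v1.mean(dim=1, keepdim=True)
        v_in = v1 + self.w * t1.mean(dim=1, keepdim=True)
        hidden_text = self.text_l2(t_in).mean(dim=1)          # hidden text vector
        hidden_image = self.img_l2(v_in).mean(dim=1)          # hidden image vector
        return hidden_text, hidden_image

# BiLSTM-tuned text / picture vectors from S2 could be fed in like this:
ht, hv = CrossModalBlock()(torch.rand(1, 6, 128), torch.rand(1, 1, 128))
```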
S4, the text and picture vectors are combined, and the final comprehensive result is identified through a softmax function.
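Continuing the sketch, the fusion and softmax recognition of S4 could then be as simple as the following; the hidden-vector size and the class ordering are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ht, hv = torch.rand(1, 128), torch.rand(1, 128)   # stand-ins for the hidden text / image vectors from S3
classifier = nn.Linear(2 * 128, 3)                 # fused image-text feature -> 3 emotion classes
probs = F.softmax(classifier(torch.cat([ht, hv], dim=-1)), dim=-1)
label = ["positive", "negative", "neutral"][probs.argmax(dim=-1).item()]
```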
Example:
First, data crawling. Against the background of monitoring online sentiment about the novel coronavirus, a Python data-acquisition script is used to crawl different network sites such as news portals, microblogs, forums and WeChat official accounts. The main crawler keywords are: novel coronavirus, pneumonia, epidemic, coronavirus, etc. The crawled data undergo data cleaning, data integration and manual labeling (the result labels are positive, negative and neutral) to form the training set, verification set and test set.
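A hypothetical sketch of this acquisition and labeling step is given below; the CSS selector, file layout and keyword handling are placeholders and do not describe the actual crawler used in the experiment.

```python
# Hypothetical sketch of the acquisition / cleaning step (placeholders only).
import csv
import requests
from bs4 import BeautifulSoup

KEYWORDS = ["新冠", "肺炎", "疫情", "冠状病毒"]   # novel coronavirus, pneumonia, epidemic, coronavirus

def crawl_page(url):
    """Fetch one page and keep image-text posts that mention a keyword."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for item in soup.select(".post"):             # placeholder selector
        text = item.get_text(strip=True)
        img = item.find("img")
        if img is not None and any(k in text for k in KEYWORDS):
            posts.append({"text": text, "image_url": img.get("src"), "label": ""})
    return posts

def save_for_labeling(posts, path="raw_posts.csv"):
    # data cleaning / integration happens before this step; the label column
    # (positive / negative / neutral) is filled in manually afterwards
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "image_url", "label"])
        writer.writeheader()
        writer.writerows(posts)
```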
Second, parameter configuration. The deep learning framework exposes parameters for configuration, such as the dimensionality of the embeddings, the number of CNN convolution kernels, the number of LSTM units, the number of ATT and GRU layers, the softmax regularization and the network learning rate.
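For illustration, such a configuration could be collected in a single dictionary; the concrete values below are assumptions that only indicate what is tuned, not the settings used in the experiment.

```python
# Illustrative parameter configuration (values are placeholders, not reported settings).
CONFIG = {
    "embedding_dim": 100,       # dimensionality of the word embeddings
    "cnn_filters": 32,          # number of CNN convolution kernels for the picture branch
    "lstm_units": 64,           # hidden units of the bidirectional LSTM
    "att_gru_layers": 2,        # number of stacked ATT + GRU layers
    "softmax_l2": 1e-4,         # regularization on the softmax layer
    "learning_rate": 1e-3,      # network learning rate
    "batch_size": 32,
    "epochs": 20,
}
```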
Third, vector preprocessing. Word-vector training is carried out on the text with WORD2VEC, and the picture is processed with a CNN to obtain preliminary feature vectors.
Fourth, model training. The data set proportion during the experiment is: training set : verification set : test set = 7:2:1, and different models are obtained by training with different parameter configurations.
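A minimal sketch of the 7:2:1 split with scikit-learn is shown below; the placeholder samples and labels stand in for the cleaned, manually labeled image-text pairs.

```python
from sklearn.model_selection import train_test_split

# placeholder data standing in for the labeled image-text pairs
samples = [f"pair_{i}" for i in range(100)]
labels = [["positive", "negative", "neutral"][i % 3] for i in range(100)]

train, rest, y_train, y_rest = train_test_split(
    samples, labels, test_size=0.3, stratify=labels, random_state=42)
val, test, y_val, y_test = train_test_split(
    rest, y_rest, test_size=1/3, stratify=y_rest, random_state=42)   # 0.3 * 1/3 = 0.1
# -> 70% training, 20% verification, 10% test; each parameter configuration
#    is then trained on the same split to obtain a different model.
```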
Fifth, result evaluation. The training results of the different models are compared, and the accuracy and other performance indicators of the models are analyzed comprehensively to obtain a model design scheme suitable for industrial application.
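A small sketch of this comparison, assuming scikit-learn metrics and placeholder predictions for the differently configured models, could be:

```python
from sklearn.metrics import accuracy_score, f1_score

# y_test stands in for the manually labeled test set; the per-model predictions are
# placeholders for the outputs of the differently configured trained models.
y_test = ["positive", "negative", "neutral", "neutral", "positive", "negative"]
predictions = {
    "config_a": ["positive", "negative", "neutral", "positive", "positive", "negative"],
    "config_b": ["negative", "negative", "neutral", "neutral", "positive", "positive"],
}

results = {
    name: {"accuracy": accuracy_score(y_test, y_pred),
           "macro_f1": f1_score(y_test, y_pred, average="macro")}
    for name, y_pred in predictions.items()
}
best = max(results, key=lambda n: results[n]["macro_f1"])
print("model selected for industrial application:", best, results[best])
```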
The invention performs emotion recognition and analysis on image-text multi-modal data; to avoid the limitations of a single modality, it fuses the complementary image and text information found in online public opinion and develops an image-text multi-modal emotion recognition model framework.
The method preprocesses the text and the picture with WORD2VEC and CNN respectively to obtain preliminary vector representations, performs cross training with BiLSTM, GRU and the attention mechanism, and fuses the results into a softmax layer for final supervised label recognition. In the experiment, model training and analysis were carried out on more than 19,000 data items (each comprising a text and a picture), and the results show that machine learning that fuses pictures and text performs well.
By studying this image-text multi-modal emotion recognition method, the invention realizes integrated emotion recognition of images and text in online public opinion, and provides a technical basis for the automatic cognitive recognition of social emotion.
The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art according to the technical solution and inventive concept of the present invention, within the technical scope disclosed by the present invention, shall fall within the protection scope of the present invention.
Claims (4)
1. An image-text multi-modal emotion recognition method based on BiLSTM and an attention mechanism, characterized by comprising the following steps:
S1, data collection: collect text data and picture data over a period of time;
S2, vector preprocessing: the text and the picture are each given an independent vector representation, and the text vector and the picture vector undergo feature tuning training through a bidirectional LSTM model;
S3, the text vector and the picture vector are each trained jointly with an attention mechanism (attention) and a GRU model, and the text training result and the picture training result cross-influence each other to obtain implicit expression vectors of the text and the picture respectively;
S4, the text and picture vectors are combined, and the final comprehensive result is identified through a softmax function.
2. The image-text multi-modal emotion recognition method based on BiLSTM and the attention mechanism according to claim 1, characterized in that: the data crawled in step S1 undergo data cleaning, data integration and manual labeling to form a training set, a verification set and a test set, and the result labels of the labeling are: positive, negative, neutral.
3. The image-text multi-modal emotion recognition method based on BiLSTM and the attention mechanism according to claim 2, characterized in that: the data set proportion in step S1 is: training set : verification set : test set = 7:2:1, and different models are obtained by training with different parameter configurations.
4. The image-text multi-modal emotion recognition method based on BiLSTM and the attention mechanism according to claim 1, characterized in that: the vector representation of the text in step S2 is an embedding vector formed by the word vectors W1, W2, …, Wk; the vector representation of the picture is the vector obtained by preprocessing the pixel representation of the picture with a convolutional neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111021378.6A | 2021-09-01 | 2021-09-01 | Image-text multi-modal emotion recognition method based on BiLSTM and attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111021378.6A | 2021-09-01 | 2021-09-01 | Image-text multi-modal emotion recognition method based on BiLSTM and attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113705703A | 2021-11-26 |
Family
ID=78658820
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111021378.6A (pending) | 2021-09-01 | 2021-09-01 | Image-text multi-modal emotion recognition method based on BiLSTM and attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN | CN113705703A |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116049743A * | 2022-12-14 | 2023-05-02 | 深圳市仰和技术有限公司 | Cognitive recognition method based on multi-modal data, computer equipment and storage medium |
CN116049743B * | 2022-12-14 | 2023-10-31 | 深圳市仰和技术有限公司 | Cognitive recognition method based on multi-modal data, computer equipment and storage medium |
Similar Documents
Publication | Title |
---|---|
CN107463609B | Method for solving video question-answering by using layered space-time attention codec network mechanism |
CN111488931B | Article quality evaluation method, article recommendation method and corresponding devices |
CN101634996A | Individualized video sequencing method based on comprehensive consideration |
CN111311364B | Commodity recommendation method and system based on multi-mode commodity comment analysis |
CN110705490B | Visual emotion recognition method |
CN113297370A | End-to-end multi-modal question-answering method and system based on multi-interaction attention |
CN111325571A | Method, device and system for automatically generating commodity comment labels for multitask learning |
Zhao et al. | Flexible presentation of videos based on affective content analysis |
CN117891940B | Multi-modal irony detection method, apparatus, computer device, and storage medium |
CN111563373A | Attribute-level emotion classification method for focused attribute-related text |
CN114170411A | Picture emotion recognition method integrating multi-scale information |
CN113987167A | Dependency perception graph convolutional network-based aspect-level emotion classification method and system |
Lan et al. | Image aesthetics assessment based on hypernetwork of emotion fusion |
CN113705703A | Image-text multi-modal emotion recognition method based on BiLSTM and attention mechanism |
CN118468882A | Deep multi-mode emotion analysis method based on image-text interaction information and multi-mode emotion influence factors |
CN117150320B | Dialog digital human emotion style similarity evaluation method and system |
CN111859925B | Emotion analysis system and method based on probability emotion dictionary |
CN114219514A | Illegal advertisement identification method and device and electronic equipment |
CN118115781A | Label identification method, system, equipment and storage medium based on multi-mode model |
CN112132075A | Method and medium for processing image-text content |
CN117114745A | Method and device for predicting intent vehicle model |
CN114297390B | Aspect category identification method and system in long tail distribution scene |
CN111340329A | Actor assessment method and device and electronic equipment |
Deng et al. | Multimodal Sentiment Analysis Based on a Cross-Modal Multihead Attention Mechanism |
Zhao et al. | Supplementing Missing Visions via Dialog for Scene Graph Generations |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |