CN113705703A - Image-text multi-modal emotion recognition method based on BiLSTM and attention mechanism - Google Patents
- Publication number
- CN113705703A (application number CN202111021378.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- vector
- picture
- data
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an image-text multi-modal emotion recognition method based on BiLSTM and an attention mechanism, which comprises the following steps: collecting text data and picture data; vector preprocessing, in which the text and the picture are each given an independent vector representation; jointly training the text vector and the picture vector with an attention mechanism and a GRU model; and combining the text and picture vectors to identify the final comprehensive result through a softmax function. The method preprocesses the text and the picture with WORD2VEC and CNN respectively to obtain preliminary vector representations, performs cross training with BiLSTM, GRU and the attention mechanism, and fuses the results into a softmax layer for final supervised label recognition. In the experiment, model training and analysis were carried out on more than 19,000 data items (each comprising a text and a picture), and the results show that machine learning that fuses pictures and text performs better.
Description
Technical Field
The invention relates to the technical field of machine learning and multi-modal emotion analysis, and in particular to an image-text multi-modal emotion recognition method based on BiLSTM and an attention mechanism.
Background
Nowadays, social network platforms are developing rapidly, and people express themselves on them in increasingly rich forms, including video, text, pictures and audio; in particular, many people express their opinions and moods through pictures and text. How to analyze emotion in multi-modal data is an important topic in the current field of machine learning.
Compared with single-modal data, multiple modalities contain more effective information, and the information from different modalities can supplement each other. For example, consider a Moments post that only says "the weather today" with a laughing picture attached below: no emotional viewpoint can be identified from the text alone, but from the laughing picture it can essentially be determined that the emotion is negative. Different modalities therefore complement each other in emotional expression, and analysis and recognition that combines them performs better. In addition, current emotion analysis methods mainly focus on a single modality; for example, text emotion analysis mainly recognizes and analyzes the emotion in text, while picture or audio-video analysis is involved far less. The present invention is therefore directed at image-text emotion recognition and analysis for multi-modal data comprising pictures and text.
Disclosure of Invention
Based on the technical problems in the background art, the invention provides an image-text multi-modal emotion recognition method based on BiLSTM and an attention mechanism, which performs emotion recognition and analysis with a deep network model combining BiLSTM, GRU and the attention mechanism, and solves the problem that single-modal data yields little effective information and thereby limits emotion recognition accuracy.
The invention provides the following technical solution: the image-text multi-modal emotion recognition method based on BiLSTM and the attention mechanism comprises the following steps:
S1, data collection: collect text data and picture data over a period of time;
S2, vector preprocessing: the text and the picture are each given an independent vector representation, and the text vector and the picture vector undergo feature tuning training through a bidirectional LSTM model;
S3, the text vector and the picture vector are each trained jointly with an attention mechanism (attention) and a GRU model, and the text training result and the picture training result cross-influence each other to obtain implicit expression vectors of the text and the picture respectively;
S4, the text and picture vectors are combined, and the final comprehensive result is identified through a softmax function.
Preferably, the data crawled in step S1 undergo data cleaning, data integration and manual labeling to form a training set, a verification set and a test set, and the result labels of the labeling are: positive, negative, neutral.
Preferably, the data set proportion in step S1 is: training set : verification set : test set = 7:2:1, and different models are obtained by training with different parameter configurations.
Preferably, the vector representation of the text in step S2 is an embedding vector formed by the word vectors W1, W2, …, Wk; the vector representation of the picture is the vector obtained by preprocessing the pixel representation of the picture with a convolutional neural network.
The invention provides an image-text multi-modal emotion recognition method based on BiLSTM and an attention mechanism, which preprocesses the text and the picture with WORD2VEC and CNN respectively to obtain preliminary vector representations, then performs cross training with BiLSTM, GRU and the attention mechanism, and fuses the results into a softmax layer for final supervised label recognition. In the experiment, model training and analysis were carried out on more than 19,000 data items (each comprising a text and a picture), and the results show that machine learning that fuses pictures and text performs better.
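For orientation only, the four steps can be pictured as a single forward pass. The following is a minimal, hypothetical PyTorch skeleton of such a pipeline; the module choices, dimensions and the three-class output are assumptions of this sketch (consistent with the positive/negative/neutral labels above), not the concrete implementation claimed by the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextImageEmotionNet(nn.Module):
    """Hypothetical end-to-end sketch: BiLSTM tuning -> GRU per modality -> fused softmax."""
    def __init__(self, emb_dim=100, img_dim=100, hid=64, n_classes=3):
        super().__init__()
        # S2: independent feature tuning of the text / picture vectors
        self.text_bilstm = nn.LSTM(emb_dim, hid, batch_first=True, bidirectional=True)
        self.img_bilstm = nn.LSTM(img_dim, hid, batch_first=True, bidirectional=True)
        # S3: per-modality GRUs standing in for the ATT + GRU cross training
        self.text_gru = nn.GRU(2 * hid, hid, batch_first=True)
        self.img_gru = nn.GRU(2 * hid, hid, batch_first=True)
        # S4: fused image-text feature -> softmax over positive / negative / neutral
        self.classifier = nn.Linear(2 * hid, n_classes)

    def forward(self, text_emb, img_feat):        # (B, Lt, emb_dim), (B, Lv, img_dim)
        t, _ = self.text_bilstm(text_emb)
        v, _ = self.img_bilstm(img_feat)
        _, ht = self.text_gru(t)                  # implicit (hidden) text vector
        _, hv = self.img_gru(v)                   # implicit (hidden) image vector
        fused = torch.cat([ht[-1], hv[-1]], dim=-1)
        return F.softmax(self.classifier(fused), dim=-1)
```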
Drawings
FIG. 1 is a schematic diagram of the image-text multi-modal processing scheme of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, in order to analyze data of multiple modalities and realize information complementation between them, this project adopts the interactive deep learning framework shown in the figure for image-text multi-modal emotion recognition. The model allows the picture and the text either to express emotion independently (each serving as its own vector representation) or to act as auxiliary information for each other and supplement the other's emotional expression (vector interaction); the vectors are then combined as the comprehensive image-text feature, and finally the comprehensive emotion result is identified through a softmax function.
The invention provides the following technical solution: the image-text multi-modal emotion recognition method based on BiLSTM and the attention mechanism comprises the following steps:
S1, data collection: collect text data and picture data over a period of time. The crawled data undergo data cleaning, data integration and manual labeling to form a training set, a verification set and a test set, and the result labels of the labeling are: positive, negative, neutral.
The data set proportion for model training is: training set : verification set : test set = 7:2:1, and different models are obtained by training with different parameter configurations.
S2, vector preprocessing: the text and the picture are each given an independent vector representation, and the text vector and the picture vector undergo feature tuning training through a bidirectional LSTM model;
the vector representation of the text is an embedding vector formed by the word vectors W1, W2, …, Wk; the vector representation of the picture is the vector obtained by preprocessing the pixel representation of the picture with a convolutional neural network.
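As an illustration of step S2, the sketch below shows one way the preliminary vectors and the BiLSTM tuning could be produced. The library choices (gensim, PyTorch), the toy corpus and all dimensions are assumptions of this example rather than details fixed by the method.

```python
# Hypothetical sketch of step S2: preliminary text / picture vectors and BiLSTM tuning.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec   # gensim >= 4

# --- text: WORD2VEC embedding W1, W2, ..., Wk for a tokenized post (toy corpus) ---
corpus = [["疫情", "终于", "好转", "了"], ["今天", "天气"]]
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)
sentence = corpus[0]
text_vec = torch.tensor(np.stack([w2v.wv[w] for w in sentence])).unsqueeze(0)  # (1, k, 100)

# --- picture: pixel representation passed through a small CNN ---
class PictureCNN(nn.Module):
    def __init__(self, out_dim=100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7),
        )
        self.fc = nn.Linear(32 * 7 * 7, out_dim)

    def forward(self, img):                      # img: (B, 3, H, W)
        h = self.conv(img).flatten(1)
        return self.fc(h).unsqueeze(1)           # (B, 1, 100) preliminary picture vector

img_vec = PictureCNN()(torch.rand(1, 3, 224, 224))

# --- feature tuning of both preliminary vectors with a bidirectional LSTM ---
# (a shared BiLSTM is used here for brevity; separate BiLSTMs per modality are equally possible)
bilstm = nn.LSTM(input_size=100, hidden_size=64, batch_first=True, bidirectional=True)
tuned_text, _ = bilstm(text_vec)                 # (1, k, 128)
tuned_img, _ = bilstm(img_vec)                   # (1, 1, 128)
```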
S3, the text vector and the picture vector are each trained jointly with attention (ATT) and GRU models, and the text training result and the picture training result cross-influence each other, yielding implicit expression vectors of the text and the picture respectively, namely a hidden text vector and a hidden image vector.
As shown in fig. 1, for the picture on the right, the result obtained by passing the text on the left through ATT + GRU is blended, with a lower weight, into the ATT + GRU result of the picture and then fed into the next ATT + GRU layer; the text on the left is processed similarly, with the result obtained from the picture on the right merged, with a lower weight, into the next ATT + GRU layer of the text analysis.
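A minimal sketch of this cross influence, assuming PyTorch, a simple scaled dot-product attention and a fixed low blending weight (all assumptions of this example, not values specified by the invention), could look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttGRULayer(nn.Module):
    """One ATT + GRU layer: scaled dot-product self-attention followed by a GRU."""
    def __init__(self, dim=128):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):                                    # x: (B, L, dim)
        scores = torch.bmm(x, x.transpose(1, 2)) / x.size(-1) ** 0.5
        attended = torch.bmm(F.softmax(scores, dim=-1), x)   # attention-weighted features
        out, _ = self.gru(attended)
        return out

class CrossModalBlock(nn.Module):
    """Text and picture ATT+GRU stacks whose first-layer outputs cross-influence
    each other with a small weight before the second ATT+GRU layer."""
    def __init__(self, dim=128, cross_weight=0.2):
        super().__init__()
        self.text_l1, self.text_l2 = AttGRULayer(dim), AttGRULayer(dim)
        self.img_l1, self.img_l2 = AttGRULayer(dim), AttGRULayer(dim)
        self.w = cross_weight                                 # "lower weight" for the other modality

    def forward(self, text, img):                             # (B, Lt, dim), (B, Lv, dim)
        t1, v1 = self.text_l1(text), self.img_l1(img)
        # blend the other modality's pooled result into each stream, then apply layer 2
        t_in = t1 + self.w * v1.mean(dim=1, keepdim=True)
        v_in = v1 + self.w * t1.mean(dim=1, keepdim=True)
        hidden_text = self.text_l2(t_in).mean(dim=1)          # hidden text vector
        hidden_image = self.img_l2(v_in).mean(dim=1)          # hidden image vector
        return hidden_text, hidden_image

# BiLSTM-tuned text / picture vectors from S2 could be fed in like this:
ht, hv = CrossModalBlock()(torch.rand(1, 6, 128), torch.rand(1, 1, 128))
```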
S4, the text and picture vectors are combined, and the final comprehensive result is identified through a softmax function.
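Continuing the sketch, the fusion and softmax recognition of S4 could then be as simple as the following; the hidden-vector size and the class ordering are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ht, hv = torch.rand(1, 128), torch.rand(1, 128)   # stand-ins for the hidden text / image vectors from S3
classifier = nn.Linear(2 * 128, 3)                 # fused image-text feature -> 3 emotion classes
probs = F.softmax(classifier(torch.cat([ht, hv], dim=-1)), dim=-1)
label = ["positive", "negative", "neutral"][probs.argmax(dim=-1).item()]
```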
Example:
First, data crawling. Against the background of monitoring online sentiment about the novel coronavirus, a Python data-acquisition script is used to crawl different network sites such as news portals, microblogs, forums and WeChat official accounts. The main crawler keywords are: novel coronavirus, pneumonia, epidemic, coronavirus, etc. The crawled data undergo data cleaning, data integration and manual labeling (the result labels are positive, negative and neutral) to form the training set, verification set and test set.
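A hypothetical sketch of this acquisition and labeling step is given below; the CSS selector, file layout and keyword handling are placeholders and do not describe the actual crawler used in the experiment.

```python
# Hypothetical sketch of the acquisition / cleaning step (placeholders only).
import csv
import requests
from bs4 import BeautifulSoup

KEYWORDS = ["新冠", "肺炎", "疫情", "冠状病毒"]   # novel coronavirus, pneumonia, epidemic, coronavirus

def crawl_page(url):
    """Fetch one page and keep image-text posts that mention a keyword."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for item in soup.select(".post"):             # placeholder selector
        text = item.get_text(strip=True)
        img = item.find("img")
        if img is not None and any(k in text for k in KEYWORDS):
            posts.append({"text": text, "image_url": img.get("src"), "label": ""})
    return posts

def save_for_labeling(posts, path="raw_posts.csv"):
    # data cleaning / integration happens before this step; the label column
    # (positive / negative / neutral) is filled in manually afterwards
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "image_url", "label"])
        writer.writeheader()
        writer.writerows(posts)
```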
Second, parameter configuration. The deep learning framework exposes parameters for configuration, such as the dimensionality of the embeddings, the number of CNN convolution kernels, the number of LSTM units, the number of ATT and GRU layers, the softmax regularization and the network learning rate.
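For illustration, such a configuration could be collected in a single dictionary; the concrete values below are assumptions that only indicate what is tuned, not the settings used in the experiment.

```python
# Illustrative parameter configuration (values are placeholders, not reported settings).
CONFIG = {
    "embedding_dim": 100,       # dimensionality of the word embeddings
    "cnn_filters": 32,          # number of CNN convolution kernels for the picture branch
    "lstm_units": 64,           # hidden units of the bidirectional LSTM
    "att_gru_layers": 2,        # number of stacked ATT + GRU layers
    "softmax_l2": 1e-4,         # regularization on the softmax layer
    "learning_rate": 1e-3,      # network learning rate
    "batch_size": 32,
    "epochs": 20,
}
```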
Third, vector preprocessing. Word-vector training is carried out on the text with WORD2VEC, and the picture is processed with a CNN to obtain preliminary feature vectors.
Fourth, model training. The data set proportion during the experiment is: training set : verification set : test set = 7:2:1, and different models are obtained by training with different parameter configurations.
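A minimal sketch of the 7:2:1 split with scikit-learn is shown below; the placeholder samples and labels stand in for the cleaned, manually labeled image-text pairs.

```python
from sklearn.model_selection import train_test_split

# placeholder data standing in for the labeled image-text pairs
samples = [f"pair_{i}" for i in range(100)]
labels = [["positive", "negative", "neutral"][i % 3] for i in range(100)]

train, rest, y_train, y_rest = train_test_split(
    samples, labels, test_size=0.3, stratify=labels, random_state=42)
val, test, y_val, y_test = train_test_split(
    rest, y_rest, test_size=1/3, stratify=y_rest, random_state=42)   # 0.3 * 1/3 = 0.1
# -> 70% training, 20% verification, 10% test; each parameter configuration
#    is then trained on the same split to obtain a different model.
```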
Fifth, result evaluation. The training results of the different models are compared, and the accuracy and other performance indicators of the models are analyzed comprehensively to obtain a model design scheme suitable for industrial application.
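A small sketch of this comparison, assuming scikit-learn metrics and placeholder predictions for the differently configured models, could be:

```python
from sklearn.metrics import accuracy_score, f1_score

# y_test stands in for the manually labeled test set; the per-model predictions are
# placeholders for the outputs of the differently configured trained models.
y_test = ["positive", "negative", "neutral", "neutral", "positive", "negative"]
predictions = {
    "config_a": ["positive", "negative", "neutral", "positive", "positive", "negative"],
    "config_b": ["negative", "negative", "neutral", "neutral", "positive", "positive"],
}

results = {
    name: {"accuracy": accuracy_score(y_test, y_pred),
           "macro_f1": f1_score(y_test, y_pred, average="macro")}
    for name, y_pred in predictions.items()
}
best = max(results, key=lambda n: results[n]["macro_f1"])
print("model selected for industrial application:", best, results[best])
```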
The invention performs emotion recognition and analysis on image-text multi-modal data; to avoid the limitations of a single modality, it fuses the complementary image and text information found in online public opinion and develops an image-text multi-modal emotion recognition model framework.
The method preprocesses the text and the picture with WORD2VEC and CNN respectively to obtain preliminary vector representations, performs cross training with BiLSTM, GRU and the attention mechanism, and fuses the results into a softmax layer for final supervised label recognition. In the experiment, model training and analysis were carried out on more than 19,000 data items (each comprising a text and a picture), and the results show that machine learning that fuses pictures and text performs well.
By studying this image-text multi-modal emotion recognition method, the invention realizes integrated emotion recognition of images and text in online public opinion, and provides a technical basis for the automatic cognitive recognition of social emotion.
The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art according to the technical solution and inventive concept of the present invention, within the technical scope disclosed by the present invention, shall fall within the protection scope of the present invention.
Claims (4)
1. An image-text multi-modal emotion recognition method based on BiLSTM and an attention mechanism, characterized by comprising the following steps:
S1, data collection: collect text data and picture data over a period of time;
S2, vector preprocessing: the text and the picture are each given an independent vector representation, and the text vector and the picture vector undergo feature tuning training through a bidirectional LSTM model;
S3, the text vector and the picture vector are each trained jointly with an attention mechanism (attention) and a GRU model, and the text training result and the picture training result cross-influence each other to obtain implicit expression vectors of the text and the picture respectively;
S4, the text and picture vectors are combined, and the final comprehensive result is identified through a softmax function.
2. The image-text multi-modal emotion recognition method based on BiLSTM and the attention mechanism according to claim 1, characterized in that: the data crawled in step S1 undergo data cleaning, data integration and manual labeling to form a training set, a verification set and a test set, and the result labels of the labeling are: positive, negative, neutral.
3. The image-text multi-modal emotion recognition method based on BiLSTM and the attention mechanism according to claim 2, characterized in that: the data set proportion in step S1 is: training set : verification set : test set = 7:2:1, and different models are obtained by training with different parameter configurations.
4. The image-text multi-modal emotion recognition method based on BiLSTM and the attention mechanism according to claim 1, characterized in that: the vector representation of the text in step S2 is an embedding vector formed by the word vectors W1, W2, …, Wk; the vector representation of the picture is the vector obtained by preprocessing the pixel representation of the picture with a convolutional neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111021378.6A | 2021-09-01 | 2021-09-01 | Image-text multi-modal emotion recognition method based on BiLSTM and attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111021378.6A | 2021-09-01 | 2021-09-01 | Image-text multi-modal emotion recognition method based on BiLSTM and attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113705703A | 2021-11-26 |
Family
ID=78658820
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111021378.6A (pending) | 2021-09-01 | 2021-09-01 | Image-text multi-modal emotion recognition method based on BiLSTM and attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN | CN113705703A |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116049743A * | 2022-12-14 | 2023-05-02 | 深圳市仰和技术有限公司 | Cognitive recognition method based on multi-modal data, computer equipment and storage medium |
CN116049743B * | 2022-12-14 | 2023-10-31 | 深圳市仰和技术有限公司 | Cognitive recognition method based on multi-modal data, computer equipment and storage medium |
Similar Documents
Publication | Title |
---|---|
CN107463609B | Method for solving video question-answering by using layered space-time attention codec network mechanism |
CN111488931B | Article quality evaluation method, article recommendation method and corresponding devices |
CN101634996A | Individualized video sequencing method based on comprehensive consideration |
CN111311364B | Commodity recommendation method and system based on multi-mode commodity comment analysis |
CN110705490B | Visual emotion recognition method |
CN113297370A | End-to-end multi-modal question-answering method and system based on multi-interaction attention |
CN111325571A | Method, device and system for automatically generating commodity comment labels for multitask learning |
Zhao et al. | Flexible presentation of videos based on affective content analysis |
CN117891940B | Multi-modal irony detection method, apparatus, computer device, and storage medium |
CN111563373A | Attribute-level emotion classification method for focused attribute-related text |
CN114170411A | Picture emotion recognition method integrating multi-scale information |
CN113987167A | Dependency perception graph convolutional network-based aspect-level emotion classification method and system |
Lan et al. | Image aesthetics assessment based on hypernetwork of emotion fusion |
CN113705703A | Image-text multi-modal emotion recognition method based on BiLSTM and attention mechanism |
CN118468882A | Deep multi-mode emotion analysis method based on image-text interaction information and multi-mode emotion influence factors |
CN117150320B | Dialog digital human emotion style similarity evaluation method and system |
CN111859925B | Emotion analysis system and method based on probability emotion dictionary |
CN114219514A | Illegal advertisement identification method and device and electronic equipment |
CN118115781A | Label identification method, system, equipment and storage medium based on multi-mode model |
CN112132075A | Method and medium for processing image-text content |
CN117114745A | Method and device for predicting intent vehicle model |
CN114297390B | Aspect category identification method and system in long tail distribution scene |
CN111340329A | Actor assessment method and device and electronic equipment |
Deng et al. | Multimodal Sentiment Analysis Based on a Cross-Modal Multihead Attention Mechanism |
Zhao et al. | Supplementing Missing Visions via Dialog for Scene Graph Generations |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |