CN106886580B - Image emotion polarity analysis method based on deep learning - Google Patents


Info

Publication number
CN106886580B
Authority
CN
China
Prior art keywords
emotion
picture
words
search
polarity
Prior art date
Legal status
Active
Application number
CN201710059051.5A
Other languages
Chinese (zh)
Other versions
CN106886580A (en)
Inventor
毋立芳
刘爽
祁铭超
张磊
简萌
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201710059051.5A
Publication of CN106886580A
Application granted
Publication of CN106886580B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A picture emotion polarity analysis method based on deep learning, relating to the technical fields of image content understanding and big data analysis. Traditional picture emotion analysis methods achieve poor prediction accuracy because their models and features are simple, while current deep learning methods train on large-scale training sets whose excessive noise limits final performance. The invention acquires data directly from the network, so the data scale is large. Only the emotion polarity information of common words, needed during data preparation, may require manual labeling; the subsequent picture acquisition and cleaning are completed automatically, so the labor cost is low. Two data cleaning passes are introduced in the data acquisition stage, eliminating a large portion of pictures whose content is inconsistent with their labels. The method filters the training set with prior knowledge, reducing its noise, and improves picture emotion prediction accuracy with an improved network structure.

Description

Image emotion polarity analysis method based on deep learning
Technical Field
The invention relates to the technical field of image content understanding and big data analysis, in particular to a picture emotion analysis method.
Background
With the development of the internet and the popularization of smartphones, social networks occupy an irreplaceable position in people's daily lives. More and more people express their opinions through social networking platforms, and a large amount of user-generated data is produced accordingly.
User Generated Content (UGC) refers to original content uploaded by users; it originates from users and ultimately serves users. In the Web 2.0 era, users no longer passively consume internet content but participate in it as subjects: besides being consumers, they have also become producers and propagators of content.
Faced with this huge volume of user-generated data, how to use it effectively has become an urgent problem. Research on opinion mining and sentiment analysis over such data has become a hotspot: UGC data are analyzed for public opinion monitoring, gauging public reactions to events, predicting box office revenue, predicting stock trends, and so on.
But these studies and methods are currently based on textual information, while user data in social networks are diverse, including not only text but also pictures, videos, and more.
People from different regions and backgrounds may understand the same text differently, but their reactions to pictures are often consistent. Moreover, devices for graphics computing are becoming cheaper and more powerful, making large-scale image computation feasible.
At present, supervised learning is generally adopted for picture emotion analysis: first a labeled picture set is collected, then a model is trained with machine learning, and finally the trained model performs emotion analysis on new pictures.
Early methods used manually collected picture sets and simple classifiers. For example, the article "Sentribute: Image Sentiment Analysis from a Mid-level Perspective" published by Jianbo Yuan in 2013 uses the manually annotated SUN dataset of 14,340 images and performs emotion analysis with an SVM, aided by facial expression recognition.
As machine learning models grew more complex, small-scale datasets could no longer meet training requirements, and recent work commonly collects datasets from the network. For example, the article "Analyzing and Predicting Sentiment of Images on the Social Web" published by Stefan Siersdorfer in 2010 uses the 1,000 words with the strongest positive and negative sentiment intensities in the SentiWordNet dictionary as search terms on Flickr, obtaining 586,000 images for training sentiment analysis models; the article "Large-scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs" published by Damian Borth in 2013 searches Flickr with 1,200 adjective-noun pairs and organizes the results into the large-scale sentiment analysis dataset SentiBank. SentiBank is widely used, but because its pictures are stored directly as retrieved from the network, the noise is large and severely restricts subsequent sentiment analysis precision.
Some of the latest methods use deep learning. For example, the article "Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks" published by Quanzeng You in 2015 constructs a PCNN from the SentiBank dataset with a self-training idea to improve the deep network, which resists the noise in web datasets to some extent.
In summary, traditional picture emotion analysis methods need only small datasets, but their simple models and features yield unsatisfactory prediction accuracy, while current deep learning methods train on large-scale sets whose excessive noise limits final performance. The invention provides a picture emotion polarity analysis method based on deep learning: prior knowledge is used to filter the training set, reducing its noise, and an improved network structure raises the picture emotion prediction accuracy.
Disclosure of Invention
The invention aims to provide a picture emotion polarity analysis method based on deep learning; the framework of the method is shown in FIG. 1.
The method comprises three stages, namely data acquisition, deep learning model training and picture emotion polarity analysis.
First, emotion vocabularies are used as search terms to retrieve related pictures from a picture website, and the emotion polarity of each search term serves as the label of the retrieved pictures, yielding an initial dataset. The dataset is then filtered by checking the consistency among the search term's emotion polarity, the picture's label, and the picture's descriptive text, yielding a purer dataset. Next, a CNN model is trained on the resulting dataset with deep learning to obtain an emotion polarity classification model. Finally, the trained CNN model performs emotion polarity analysis on pictures.
The picture emotion analysis method specifically comprises the following steps:
1. Data acquisition
The method can be applied to most picture social networking sites with a picture search function. Since such websites limit the maximum number of results returned per query, and to ensure the richness and balance of the data, the method retrieves pictures with a large number of search terms.
1.1. A priori knowledge preparation
To ensure the emotion polarity accuracy of the search terms, an emotion dictionary of word emotion polarities is prepared before data acquisition. In this method, a dictionary of the dominant emotion polarity of emotion vocabulary provides the dominant polarity of common words. The dominant emotion polarity of a word is the polarity the word expresses in its usual context. The dictionary is constructed by manual labeling or taken from an existing public dictionary; its entries take the form (word, emotion intensity), where the intensity ranges over [-1, 1]: the closer the intensity is to 1, the more positive the word's emotion polarity, and the closer to -1, the more negative. Specific examples:
(remorse, -0.9)
(violent anger, -0.9)
(composed, 0.7)
(deserving a thousand cuts, 0.7)
(delighted, 0.5)
(deeply admiring, 0.5)
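
Purely as an illustrative sketch (Python is assumed here; the words and intensities are placeholders, not the invention's actual dictionary), the (word, emotion intensity) structure can be represented as a plain mapping:

    # Sketch of the (word, emotion intensity) dictionary described above.
    # Entries are placeholders; a real dictionary is manually labeled or
    # taken from an existing public lexicon, with intensities in [-1, 1].
    emotion_dict = {
        "remorse": -0.9,
        "violent anger": -0.9,
        "composed": 0.7,
        "delighted": 0.5,
    }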
1.2. Search term selection
To acquire data from the network, search terms must first be prepared; in this method, a strategy of collecting search terms from the network is adopted. The specific steps are as follows:
1.2.1 Words with definite emotion polarity (such as happy and sad) are used as initial search terms on the picture website; the search results are collected and their descriptive text is extracted. Descriptive text refers to description information related to a picture, such as its tags, caption, and surrounding context text.
1.2.2 A word segmentation tool segments the descriptive text and removes stop words; part-of-speech analysis is performed on the individual words, and the nouns and adjectives among them are extracted. The nouns and adjectives are then paired one by one (taking the Cartesian product), and the resulting (adjective, noun) pairs are stored as the initial search lexicon.
1.2.3 The initial search lexicon obtained in 1.2.2 is cleaned once; the aim is to remove pairs whose adjective and noun have conflicting emotion polarities. Using the emotion dictionary obtained in 1.1, the polarity relation of each adjective-noun pair in the lexicon is analyzed and conflicts are removed. For any (adjective, noun) pair in the search lexicon, the rule is formalized as:
f1(A, N) = Sen(A) + Sen(N)    (1)
where A is the adjective in the pair and N is the noun. Sen(x) returns the emotion polarity of word x from the emotion dictionary obtained in 1.1: if the emotion intensity lies in (0, 1], Sen() returns 1; if it lies in [-1, 0), Sen() returns -1; and if x is not in the dictionary, x is considered to contain no emotion and Sen() returns 0. If f1 is 0, the adjective and noun conflict or contain no emotion, and the pair is removed; if f1 is non-zero, there is no conflict and the pair is kept.
1.2.4 The filtered search lexicon is emotion-labeled with the emotion dictionary obtained in 1.1, producing the final search lexicon. The emotion label of each (adjective, noun) pair is the sum of the adjective's and the noun's emotion intensities. A specific example (a code sketch follows):
Adjective: miserable, -0.9
Noun: groan, -0.5
Emotion label: -0.9 + (-0.5) = -1.4
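
A minimal sketch of steps 1.2.2 through 1.2.4 under the rules above (Python assumed; emotion_dict is the (word, intensity) mapping from 1.1):

    from itertools import product

    def sen(word, emotion_dict):
        # Sen() from rule (1): +1 for intensity in (0, 1], -1 for [-1, 0),
        # and 0 when the word is absent (treated as containing no emotion).
        s = emotion_dict.get(word, 0)
        return 0 if s == 0 else (1 if s > 0 else -1)

    def build_search_lexicon(adjectives, nouns, emotion_dict):
        # Pair adjectives and nouns (Cartesian product), drop pairs where
        # f1(A, N) = Sen(A) + Sen(N) is 0 (conflict or no emotion), and
        # label each survivor with the sum of the two intensities (1.2.4).
        lexicon = {}
        for adj, noun in product(adjectives, nouns):
            if sen(adj, emotion_dict) + sen(noun, emotion_dict) == 0:
                continue  # removed by rule (1)
            lexicon[(adj, noun)] = (emotion_dict.get(adj, 0)
                                    + emotion_dict.get(noun, 0))
        return lexicon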
1.3. Search using search term
Image retrieval is carried out with the search lexicon obtained in 1.2.4. The specific steps (see the sketch after this list) are:
(1) Take a pair of emotion words out of the search lexicon.
(2) Search the website and obtain the results.
(3) Extract the pictures and corresponding descriptive text from the results; descriptive text refers to description information related to a picture, such as its tags, caption, and surrounding context text.
(4) Segment the descriptive text with a word segmentation tool, remove stop words, and take the remaining individual words as the description information.
(5) Use the emotion label of the emotion-word pair used for the retrieval as the label of the extracted pictures.
(6) Store (picture, description information, label) as a triple in the database.
(7) Repeat steps (1)-(6) until all pairs in the search lexicon have been used.
So far, we obtain an emotion picture database.
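
The retrieval loop of steps (1)-(7) can be sketched as follows; the search callable, tokenize helper, and STOP_WORDS list are hypothetical stand-ins for the picture site's search API, the word segmentation tool, and the stop-word list:

    STOP_WORDS = {"a", "an", "the", "of", "and"}  # placeholder stop-word list

    def tokenize(text):
        # Stand-in for the word segmentation tool of step (4).
        return text.lower().split()

    def build_emotion_database(lexicon, search, db):
        # Steps (1)-(7): search(query) yields (picture, descriptive_text)
        # pairs from the picture website; db is any list-like triple store.
        for (adj, noun), label in lexicon.items():             # step (1)
            for picture, text in search(f"{adj} {noun}"):      # steps (2)-(3)
                words = [w for w in tokenize(text)
                         if w not in STOP_WORDS]               # step (4)
                db.append((picture, words, label))             # steps (5)-(6)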
1.4. Data set cleansing
Because internet data is very noisy, data cleaning is essential. This method removes potentially noisy images by checking the consistency between the emotion polarity of a picture's description words and the picture's label. The specific steps (see the sketch after this list) are:
(1) Take a triple out of the emotion picture database obtained in 1.3.
(2) Judge the polarity of the description words one by one with the emotion dictionary obtained in 1.1.
(3) Check the consistency between the polarities obtained in (2) and the polarity of the triple's label; if they conflict, the triple is considered a noise element and is deleted from the database. For any (picture, description information, label) triple in the emotion picture database, the rule is formalized as:
f2(Label, Tag) = Σ_i not(sgn(Label) + Sen(Tag_i))    (2)
sgn(x) = 1 if x > 0; 0 if x = 0; -1 if x < 0    (3)
where Label is the label in the triple, Tag is the description information, and Tag_i is the i-th individual word of the description information. sgn(x) is the sign function given in (3). not(x) is a logical negation function: not(x) = 1 if x = 0, and not(x) = 0 otherwise. If f2 is greater than 0, the picture's label conflicts with its description information and the triple is deleted from the database; if f2 is 0, there is no conflict and the triple is kept.
(4) Repeat (1)-(3) until all pictures in the database have been analyzed.
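
A sketch of cleaning rule (2), reusing sen() from the lexicon sketch above (Python assumed):

    def sgn(x):
        # Sign function of equation (3).
        return (x > 0) - (x < 0)

    def f2(label, words, emotion_dict):
        # Equation (2): count description words for which
        # sgn(Label) + Sen(Tag_i) equals 0, i.e. a polarity conflict.
        return sum(1 for w in words
                   if sgn(label) + sen(w, emotion_dict) == 0)

    def clean_database(db, emotion_dict):
        # Steps (1)-(4): keep a triple only when f2 == 0 (no conflict).
        return [(pic, words, label) for pic, words, label in db
                if f2(label, words, emotion_dict) == 0]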
2. Deep learning model training
After the dataset is obtained, model training can be performed. This method adopts an improved CNN model: an auxiliary loss layer is added to the traditional CNN model, giving it better performance.
2.1. Designing deep Convolutional Neural Network (CNN)
Fig. 2 shows the CNN model framework used in this method. The network consists of 5 convolutional layers, 3 fully-connected layers, and 1 softmax layer. The neuron activation function is ReLU, and pooling layers follow the first, second, and fifth convolutional layers. The convolutional layers, pooling layers, and first two fully-connected layers are configured exactly as in AlexNet; the last fully-connected layer is resized to 2 outputs and named fc8_s. The softmax layer outputs the picture's emotion polarity (positive or negative). During model training, a Euclidean loss layer and a corresponding fully-connected layer fc8_e are added; this head outputs the picture's emotion intensity and measures the prediction error at the real-number level. Pictures are normalized to 256 × 256 RGB images before being input to the CNN.
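
The two-headed network can be sketched as follows. This is illustrative only: the patent's implementation is in Caffe, while PyTorch and the torchvision AlexNet weights are assumptions made here for a self-contained example.

    import torch
    import torch.nn as nn
    from torchvision import models

    class EmotionCNN(nn.Module):
        # AlexNet-style body (5 conv layers with pooling, fc6/fc7 as in
        # AlexNet) with two heads: fc8_s for 2-way polarity (Softmax loss)
        # and the auxiliary fc8_e for real-valued intensity (Euclidean loss).
        def __init__(self):
            super().__init__()
            alexnet = models.alexnet(
                weights=models.AlexNet_Weights.IMAGENET1K_V1)
            self.features = alexnet.features
            self.avgpool = alexnet.avgpool
            # keep fc6 and fc7, drop AlexNet's original 1000-way fc8:
            self.fc67 = nn.Sequential(*list(alexnet.classifier.children())[:-1])
            self.fc8_s = nn.Linear(4096, 2)   # polarity head, size 2
            self.fc8_e = nn.Linear(4096, 1)   # auxiliary intensity head

        def forward(self, x):
            # x: a batch of 256 x 256 RGB images, shape (N, 3, 256, 256)
            h = self.avgpool(self.features(x)).flatten(1)
            h = self.fc67(h)
            return self.fc8_s(h), self.fc8_e(h)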
2.2. Training CNN model
The CNN model is trained with the dataset obtained in step 1. First, each element of the dataset is re-stored in the triple form (picture, real-valued label, binarized label), where binarization quantizes (0, 1] to 1 and [-1, 0] to 0. The picture in the triple is the input; the real-valued label is the real-number-level supervision signal, measured by the Euclidean loss; and the binarized label is the emotion polarity supervision signal, measured by the Softmax loss.
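
The joint supervision of 2.2 can be sketched as below, building on the EmotionCNN sketch above; the equal weighting of the two losses is an assumption, since the patent does not state loss weights:

    import torch.nn.functional as F

    def binarize(intensity):
        # (0, 1] is quantized to class 1 (positive), [-1, 0] to class 0.
        return (intensity > 0).long()

    def joint_loss(model, images, intensities, aux_weight=1.0):
        # Softmax (cross-entropy) loss on fc8_s from the binarized label,
        # Euclidean (MSE) loss on fc8_e from the real-valued label.
        logits, pred_intensity = model(images)
        loss_s = F.cross_entropy(logits, binarize(intensities))
        loss_e = F.mse_loss(pred_intensity.squeeze(1), intensities)
        return loss_s + aux_weight * loss_e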
3. Sentiment analysis of pictures
The CNN model trained in step 2 is used as the emotion classifier: the picture is normalized to 256 × 256 and input to the model, which generates the emotion polarity prediction output.
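
At test time only the polarity head is used; a minimal sketch in the same PyTorch terms as above (preprocessing beyond resizing to 256 × 256 is omitted, matching the text):

    import torch
    from PIL import Image
    from torchvision import transforms

    preprocess = transforms.Compose([
        transforms.Resize((256, 256)),  # normalize to 256 x 256 RGB
        transforms.ToTensor(),
    ])

    def predict_polarity(model, path):
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        model.eval()
        with torch.no_grad():
            logits, _ = model(x)       # intensity head ignored at test time
        return "positive" if logits.argmax(1).item() == 1 else "negative"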
Compared with the prior art, the method has the following advantages:
1. the available data size is large
The invention acquires data directly from the network, yielding a data scale far larger than earlier manually collected datasets.
2. The labor cost is low
In the invention, only the emotion polarity information of common words, needed during data preparation, may require manual labeling. The subsequent picture acquisition and cleaning are completed automatically, so the labor cost is low.
3. Low data noise
Two data cleaning passes are introduced in the data acquisition stage, eliminating a large portion of pictures inconsistent with their labels; compared with traditional methods that use network-collected datasets directly, the data noise is lower.
4. The prediction precision is high
When trained on the same dataset, the model provided by the invention achieves higher accuracy than the traditional CNN model.
Drawings
FIG. 1 is the picture emotion analysis framework designed by the invention;
FIG. 2 is the CNN model framework used by the invention;
FIG. 3 shows examples of emotion words used in practicing the invention;
FIG. 4 shows examples of search terms removed in practicing the invention;
FIG. 5 shows examples of noise images removed in practicing the invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and examples. The invention provides a picture emotion polarity analysis method based on deep learning; the framework of the method is shown in FIG. 1.
The invention has the following implementation steps:
1. Data acquisition
The method can be applied to most picture social networking sites with a picture search function. In this implementation, we collect data from the photo social networking site Flickr, which currently returns at most 2,000 pictures per retrieval request.
1.1. A priori knowledge preparation
In this implementation an English emotion dictionary is selected; for convenience, an existing vocabulary dictionary is used, provided by the 2014 article "Sentiment analysis on Twitter through topic-based lexicon expansion" by Zhou Zhixin. It contains 21,000 individual words, each with a dominant emotion annotation whose intensity ranges over [-1, 1]: the closer the intensity is to 1, the more positive the word's emotion polarity, and the closer to -1, the more negative. Fig. 3 lists some words randomly selected from the dictionary with their annotation information.
1.2. Search term selection
To acquire data from the network, search terms must first be prepared; here we adopt a strategy of collecting search terms from the network. The specific steps are as follows:
1.2.1 We use words with definite emotion polarity as initial search terms on the Flickr website, collect the search results, and extract each picture's descriptive text (description and tag information). In this implementation we choose anger, disgust, fear, sadness, happiness, excitement, awe, and amusement as the initial search terms.
1.2.2 A word segmentation tool segments the descriptive text and removes stop words; part-of-speech analysis is performed on the individual words, and the nouns and adjectives among them are extracted. The nouns and adjectives are then paired one by one (taking the Cartesian product), and the resulting (adjective, noun) pairs are stored as the initial search lexicon.
1.2.3 The initial search lexicon obtained in 1.2.2 is cleaned once; the aim is to remove pairs whose adjective and noun have conflicting emotion polarities. Using the emotion dictionary obtained in 1.1, the polarity relation of each adjective-noun pair in the lexicon is analyzed and conflicts are removed. For any (adjective, noun) pair in the search lexicon, the rule is formalized as:
f1(A, N) = Sen(A) + Sen(N)    (1)
where A is the adjective in the pair and N is the noun. Sen(x) returns the emotion polarity of word x from the emotion dictionary obtained in 1.1: if the emotion intensity lies in (0, 1], Sen() returns 1; if it lies in [-1, 0), Sen() returns -1; and if x is not in the dictionary, x is considered to contain no emotion and Sen() returns 0. If f1 is 0, the adjective and noun conflict or contain no emotion, and the pair is removed; if f1 is non-zero, there is no conflict and the pair is kept.
1.2.4 The filtered search lexicon is emotion-labeled with the emotion dictionary obtained in 1.1, producing the final search lexicon. The emotion label of each (adjective, noun) pair is the sum of the adjective's and the noun's emotion intensities.
1.3. Search using search term
Image retrieval is carried out with the search lexicon obtained in 1.2.4. The specific steps are:
(1) Take a pair of emotion words out of the search lexicon.
(2) Search the website and obtain the results.
(3) Extract the pictures and corresponding descriptive text from the results; descriptive text refers to description information related to a picture, such as its tags, caption, and surrounding context text.
(4) Segment the descriptive text with a word segmentation tool, remove stop words, and take the remaining individual words as the description information.
(5) Use the emotion label of the emotion-word pair used for the retrieval as the label of the extracted pictures.
(6) Store (picture, description information, label) as a triple in the database.
(7) Repeat steps (1)-(6) until all pairs in the search lexicon have been used.
So far, we obtain an emotion picture database.
1.4. Data set cleansing
Because internet data is very noisy, data cleaning is essential. This method removes potentially noisy images by checking the consistency between the emotion polarity of a picture's description words and the picture's label. The specific steps are:
(1) Take a triple out of the emotion picture database obtained in 1.3.
(2) Judge the polarity of the description words one by one with the emotion dictionary obtained in 1.1.
(3) Check the consistency between the polarities obtained in (2) and the polarity of the triple's label; if they conflict, the triple is considered a noise element and is deleted from the database. For any (picture, description information, label) triple in the emotion picture database, the rule is formalized as:
f2(Label, Tag) = Σ_i not(sgn(Label) + Sen(Tag_i))    (2)
where Label is the label in the triple, Tag is the description information, and Tag_i is the i-th individual word of the description information. sgn(x) is the sign function given in formula (3). not(x) is a logical negation function: not(x) = 1 if x = 0, and not(x) = 0 otherwise. If f2 is greater than 0, the picture's label conflicts with its description information and the triple is deleted from the database; if f2 is 0, there is no conflict and the triple is kept.
(4) Repeat (1)-(3) until all pictures in the database have been analyzed.
Figure 5 lists some of the noise images removed using this rule.
2. Deep learning model training phase
After the dataset is obtained, model training can be performed. This method adopts an improved CNN model: an auxiliary loss layer is added to the traditional CNN model, giving it better performance.
2.1. Designing deep Convolutional Neural Network (CNN)
Fig. 2 shows the CNN model framework used in this method. The network consists of 5 convolutional layers, 3 fully-connected layers, and 1 softmax layer. The neuron activation function is ReLU, and pooling layers follow the first, second, and fifth convolutional layers. The convolutional layers, pooling layers, and first two fully-connected layers are configured exactly as in AlexNet; the last fully-connected layer is resized to 2 outputs and named fc8_s. The softmax layer outputs the picture's emotion polarity (positive or negative). During model training, a Euclidean loss layer and a corresponding fully-connected layer (named fc8_e) are added; this head outputs the picture's emotion intensity and measures the prediction error at the real-number level. Pictures are normalized to 256 × 256 RGB images before being input to the CNN.
2.2. Training CNN model
The CNN model is trained with the dataset obtained in step 1. First, each element of the dataset is re-stored in the triple form (picture, real-valued label, binarized label), where binarization quantizes (0, 1] to 1 and [-1, 0] to 0. The picture in the triple is the input; the real-valued label is the real-number-level supervision signal, measured by the Euclidean loss; and the binarized label is the emotion polarity supervision signal, measured by the Softmax loss.
Deep learning model training is carried out under the Caffe framework. Since the CNN used here matches the first seven layers of AlexNet exactly, the AlexNet model pretrained on ImageNet can be fine-tuned on the obtained dataset. The learning-rate multiplier of the first seven layers is set to 1, and that of the two fully-connected layers fc8_s and fc8_e is set to 10. The base learning rate and iteration count are determined by the data scale and the model's learning behavior. A sketch of this layer-wise setting appears below.
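
In the PyTorch terms of the earlier EmotionCNN sketch, the layer-wise multipliers correspond to optimizer parameter groups; the base learning rate and momentum below are assumptions, to be set per data scale as stated above:

    base_lr = 1e-3  # assumption: chosen per data scale and learning behavior
    optimizer = torch.optim.SGD(
        [
            # pretrained AlexNet layers: multiplier 1
            {"params": list(model.features.parameters())
                       + list(model.fc67.parameters()), "lr": base_lr},
            # newly added heads fc8_s and fc8_e: multiplier 10
            {"params": list(model.fc8_s.parameters())
                       + list(model.fc8_e.parameters()), "lr": base_lr * 10},
        ],
        momentum=0.9,
    )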
3. Sentiment analysis of pictures
The CNN model trained in step 2 is used as the emotion classifier: the picture is normalized to 256 × 256 and input to the model, which generates the emotion polarity prediction output.
4. Model evaluation
The data cleaning method provided by the invention is applied to the SentiBank image library; the cleaned dataset is used to train the deep learning model provided by the invention, which is then tested on the Twitter picture emotion dataset (published in 2015 with Quanzeng You's article "Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks"; the five-agree subset is used here). The prediction accuracy reaches 81.95%, an improvement of more than 4% over the traditional deep learning method.

Claims (1)

1. A picture emotion polarity analysis method based on deep learning, divided into three stages, namely a data acquisition stage, a deep learning model training stage, and a picture emotion polarity analysis stage; characterized by comprising the following specific steps:
the data acquisition comprises the following specific steps:
1.1. a priori knowledge preparation
An emotion dictionary of the dominant emotion polarities of emotion vocabulary needs to be prepared; the emotion dictionary is constructed by manual labeling or taken from an existing public dictionary, and its entries take the form (word, emotion intensity);
1.2. search term selection
Selecting a strategy for collecting search terms from a network; the method comprises the following specific steps:
1.2.1 using words with definite emotion polarity as initial search terms to search a picture website, collecting the search results and extracting the descriptive text therein, wherein descriptive text refers to description information related to a picture, comprising its tags, caption, and context text information;
1.2.2 using a word segmentation tool to segment the descriptive text and remove stop words, performing part-of-speech analysis on the individual words therein, and extracting the nouns and adjectives; pairing the nouns and adjectives one by one; storing the paired results in (adjective, noun) form as the initial search lexicon;
1.2.3 cleaning the initial search lexicon obtained in 1.2.2 once; using the emotion dictionary obtained in 1.1, analyzing the polarity relation of the adjective-noun pairs in the search lexicon and removing conflicts, the rule being formalized, for any (adjective, noun) pair in the search lexicon, as:
f1(A, N) = Sen(A) + Sen(N)    (1)
wherein A represents an adjective in the word pair, and N represents a noun in the word pair;the Sen (x) function represents the emotion polarity of word x obtained from the emotion dictionary obtained in 1.1, i.e., if the emotion intensity is (0, 1)]Then the Sen () function returns 1, if the emotion intensity is [ -1,0) then the Sen () function returns-1, if there is no word x in the emotion dictionary then x is considered not to contain emotion, the function returns 0; if f is1If the value is 0, the conflict exists between the adjective nouns or the emotions are not contained, and the emotions are removed; if f is1If not, it indicates that there is no conflict and should be reserved;
1.2.4 performing emotion marking on the screened search word bank by using the emotion dictionary obtained in the step 1.1 and generating a final search word bank; the emotion label of each (adjective, noun) pair in the search word library is obtained by adding the emotion intensity of the adjective and the noun;
1.3. search using search term
Image retrieval is carried out with the search lexicon obtained in 1.2.4, the specific steps being:
(1) taking a pair of emotion words out of the search lexicon;
(2) searching the website to obtain search results;
(3) extracting pictures and corresponding descriptive text from the results, wherein descriptive text refers to description information related to a picture, comprising its tags, caption, and context text information;
(4) segmenting the descriptive text with a word segmentation tool, removing stop words, and taking the remaining individual words as the description information;
(5) taking the emotion label corresponding to the emotion words used for the retrieval as the label of the extracted pictures;
(6) storing (picture, description information, label) as a triple in the database;
(7) repeating steps (1) to (6) until all pairs in the search lexicon have been used;
thus, an emotion picture database is obtained;
1.4. data set cleansing
The method comprises the following specific steps:
1) taking a triple out of the emotion picture database obtained in 1.3;
2) judging the polarity of the description words one by one with the emotion dictionary obtained in 1.1;
3) checking the consistency between the polarities obtained in 2) and the polarity of the triple's label; if they conflict, the triple is considered a noise element and is deleted from the database; for any (picture, description information, label) triple in the emotion picture database, the rule is formalized as:
f2(Label, Tag) = Σ_i not(sgn(Label) + Sen(Tag_i))    (2)
sgn(x) = 1 if x > 0; 0 if x = 0; -1 if x < 0    (3)
wherein Label represents the label in the triple, Tag represents the description information in the triple, and Tag_i represents the i-th individual word of the description information; sgn(x) is the sign function, whose expression is given in formula (3); the not(x) function is a logical negation function: not(x) = 1 if x = 0, otherwise not(x) = 0; if the result of f2 is greater than 0, the picture's label conflicts with its description information and the triple is deleted from the database; if the result of f2 is 0, there is no conflict and the triple is kept;
4) repeating 1) to 3) until all pictures in the database have been analyzed;
the deep learning model training comprises the following specific steps:
2.1 designing deep convolutional neural networks
The network consists of 5 convolutional layers, 3 fully-connected layers, and 1 softmax layer; the neuron activation function is ReLU; pooling layers follow the first, second, and fifth convolutional layers, and the convolutional layers, pooling layers, and first two fully-connected layers are configured exactly as in AlexNet; the last fully-connected layer is resized to 2 outputs and named fc8_s; the softmax layer outputs the picture's emotion polarity; a Euclidean loss layer and a corresponding fully-connected layer are added during model training, outputting the picture's emotion intensity, which measures the prediction error at the real-number level; pictures are normalized to 256 × 256 RGB images before being input to the CNN;
2.2 training the CNN model
Training the CNN model with the dataset obtained by the data acquisition: first, each element of the dataset is re-stored in the triple form (picture, real-valued label, binarized label); binarization quantizes (0, 1] to 1 and [-1, 0] to 0; the picture in the triple is the input, the real-valued label serves as the real-number-level supervision signal, measured by the Euclidean loss, and the binarized label serves as the emotion polarity supervision signal, measured by the Softmax loss;
The picture emotion polarity analysis comprises:
Using the trained CNN model as the emotion classifier, the picture is first normalized to 256 × 256 and then input into the model to generate the emotion polarity prediction output.
CN201710059051.5A 2017-01-23 2017-01-23 Image emotion polarity analysis method based on deep learning Active CN106886580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710059051.5A CN106886580B (en) 2017-01-23 2017-01-23 Image emotion polarity analysis method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710059051.5A CN106886580B (en) 2017-01-23 2017-01-23 Image emotion polarity analysis method based on deep learning

Publications (2)

Publication Number Publication Date
CN106886580A CN106886580A (en) 2017-06-23
CN106886580B true CN106886580B (en) 2020-01-17

Family

ID=59175439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710059051.5A Active CN106886580B (en) 2017-01-23 2017-01-23 Image emotion polarity analysis method based on deep learning

Country Status (1)

Country Link
CN (1) CN106886580B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330463B (en) * 2017-06-29 2020-12-08 南京信息工程大学 Vehicle type identification method based on CNN multi-feature union and multi-kernel sparse representation
CN107357889B (en) * 2017-07-11 2020-07-17 北京工业大学 Cross-social platform picture recommendation algorithm based on content or emotion similarity
CN107491433A (en) * 2017-07-24 2017-12-19 成都知数科技有限公司 Electric business exception financial products recognition methods based on deep learning
CN107679580B (en) * 2017-10-21 2020-12-01 桂林电子科技大学 Heterogeneous migration image emotion polarity analysis method based on multi-mode depth potential correlation
CN107908720A (en) * 2017-11-14 2018-04-13 河北工程大学 A kind of patent data cleaning method and system based on AdaBoost algorithms
CN108170811B (en) * 2017-12-29 2022-07-15 北京大生在线科技有限公司 Deep learning sample labeling method based on online education big data
CN108875821A (en) 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing
US11210335B2 (en) * 2018-09-27 2021-12-28 Optim Corporation System and method for judging situation of object
CN110083726B (en) * 2019-03-11 2021-10-22 北京比速信息科技有限公司 Destination image perception method based on UGC picture data
CN110046253B (en) * 2019-04-10 2022-01-04 广州大学 Language conflict prediction method
CN111259141A (en) * 2020-01-13 2020-06-09 北京工业大学 Social media corpus emotion analysis method based on multi-model fusion
CN111832573B (en) * 2020-06-12 2022-04-15 桂林电子科技大学 Image emotion classification method based on class activation mapping and visual saliency
CN112307757B (en) * 2020-10-28 2023-07-28 中国平安人寿保险股份有限公司 Emotion analysis method, device, equipment and storage medium based on auxiliary task

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894102A (en) * 2010-07-16 2010-11-24 浙江工商大学 Method and device for analyzing emotion tendentiousness of subjective text
CN102200969A (en) * 2010-03-25 2011-09-28 日电(中国)有限公司 Text sentiment polarity classification system and method based on sentence sequence
CN102663139A (en) * 2012-05-07 2012-09-12 苏州大学 Method and system for constructing emotional dictionary
CN104200804A (en) * 2014-09-19 2014-12-10 合肥工业大学 Various-information coupling emotion recognition method for human-computer interaction
CN105893582A (en) * 2016-04-01 2016-08-24 深圳市未来媒体技术研究院 Social network user emotion distinguishing method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120253792A1 (en) * 2011-03-30 2012-10-04 Nec Laboratories America, Inc. Sentiment Classification Based on Supervised Latent N-Gram Analysis

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200969A (en) * 2010-03-25 2011-09-28 日电(中国)有限公司 Text sentiment polarity classification system and method based on sentence sequence
CN101894102A (en) * 2010-07-16 2010-11-24 浙江工商大学 Method and device for analyzing emotion tendentiousness of subjective text
CN102663139A (en) * 2012-05-07 2012-09-12 苏州大学 Method and system for constructing emotional dictionary
CN104200804A (en) * 2014-09-19 2014-12-10 合肥工业大学 Various-information coupling emotion recognition method for human-computer interaction
CN105893582A (en) * 2016-04-01 2016-08-24 深圳市未来媒体技术研究院 Social network user emotion distinguishing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Image Retargeting by combining fast Seam Carving with Neighboring Probability (FSc_Neip) and Scaling; Lifang Wu et al.; IEEE International Conference on Multimedia and Expo; 2015-12-31; full text *

Also Published As

Publication number Publication date
CN106886580A (en) 2017-06-23

Similar Documents

Publication Publication Date Title
CN106886580B (en) Image emotion polarity analysis method based on deep learning
Kumar et al. Aspect-based sentiment analysis using deep networks and stochastic optimization
CN107066446B (en) Logic rule embedded cyclic neural network text emotion analysis method
CN108090070B (en) Chinese entity attribute extraction method
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN110705206B (en) Text information processing method and related device
CN110347894A (en) Knowledge mapping processing method, device, computer equipment and storage medium based on crawler
CN108563638B (en) Microblog emotion analysis method based on topic identification and integrated learning
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN107679110A (en) The method and device of knowledge mapping is improved with reference to text classification and picture attribute extraction
CN110750648A (en) Text emotion classification method based on deep learning and feature fusion
CN110704890A (en) Automatic text causal relationship extraction method fusing convolutional neural network and cyclic neural network
CN106874397B (en) Automatic semantic annotation method for Internet of things equipment
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN106599824A (en) GIF cartoon emotion identification method based on emotion pairs
Eke et al. The significance of global vectors representation in sarcasm analysis
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
Samih et al. Enhanced sentiment analysis based on improved word embeddings and XGboost.
CN114547303A (en) Text multi-feature classification method and device based on Bert-LSTM
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN111859955A (en) Public opinion data analysis model based on deep learning
Baniata et al. Sentence representation network for Arabic sentiment analysis
CN111414755A (en) Network emotion analysis method based on fine-grained emotion dictionary
Rahman et al. A dynamic strategy for classifying sentiment from Bengali text by utilizing Word2vector model
CN114817533A (en) Bullet screen emotion analysis method based on time characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant