CN112712127A - Image emotion polarity classification method combined with graph convolution neural network - Google Patents

Image emotion polarity classification method combined with graph convolution neural network

Info

Publication number
CN112712127A
CN112712127A (application CN202110019810.1A)
Authority
CN
China
Prior art keywords
emotion
model
words
image
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110019810.1A
Other languages
Chinese (zh)
Inventor
毋立芳
张恒
邓斯诺
石戈
简萌
相叶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110019810.1A
Publication of CN112712127A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

An image emotion polarity classification method combined with a graph convolution neural network relates to the technical fields of intelligent media computing and computer vision. First, object information is extracted from the training samples, and a graph model is built for each picture from the object information and its visual features. Second, the object interaction information contained in the graph model is extracted with a graph convolution network and fused with the features of a convolutional neural network. The training samples are then preprocessed and fed into the network, and the parameters of the model are updated iteratively with a loss function and an optimizer until convergence, which completes training. Finally, the test data are fed into the network to obtain the model's predictions and classification accuracy on the test data. By extracting the interaction features of the objects in an image in the emotion space, the classification features better match the emotional characteristics of the objects and the human emotion-triggering mechanism, and high-level semantic features are added on top of the visual features, which helps improve the performance of the emotion classification algorithm in practical application scenarios.

Description

Image emotion polarity classification method combined with graph convolution neural network
Technical Field
The invention belongs to the technical field of computer vision, and relates to an image emotion polarity classification method combined with a graph convolution neural network.
Background
With the rapid development of image-based social networks, more and more people like to express their moods with pictures. Images that carry emotional information play an important role in enhancing content delivery and effectively influencing viewers. Faced with massive picture data containing user emotions, analyzing the emotional information in these pictures can greatly promote the development of social media, with wide applications in education, advertising, entertainment, and other fields. Image emotion analysis has therefore become one of the recent research hotspots.
Earlier image emotion analysis methods mainly used statistical characteristics of images, such as hand-crafted color, texture, and line features, to classify image emotions; however, because a large semantic gap and emotion gap exist between low-level image features and human emotions, these methods did not achieve good emotion polarity analysis results. In recent years, with the rapid development of social media and computing hardware, deep learning has advanced rapidly in computer vision and achieved good results in image classification, object detection, target tracking, and related tasks. The effectiveness of high-level visual features extracted by convolutional neural networks has also led researchers to apply them to image emotion classification. In 2015, You et al. designed an image emotion classification algorithm based on a convolutional neural network and obtained better classification results than traditional methods, although the gains were limited by the inherent limitations of the underlying learning framework. With deeper research into convolutional neural networks and the principles of human emotion, Yang et al. proposed in 2018 an image emotion classification algorithm combined with image instance segmentation: by incorporating a strong instance segmentation algorithm to extract regions rich in semantic and emotional content, they enhanced the visual features used for image emotion classification and improved its accuracy. In 2019, Wu et al. designed a convolutional neural network combining image saliency detection with image emotion classification; motivated by the connection between the human attention mechanism and human emotion, a saliency detection algorithm simulates the regions of an image that attract the most human attention, and error back-propagation is used for feedback adjustment of the polarity prediction, further improving the accuracy of image emotion classification.
Inspired by these recent results and by research on the principles of human emotion, existing image emotion classification algorithms improve classification performance by acquiring and enhancing the visual features of some of the objects in an image. However, existing research has certain limitations: it enriches and enhances object visual features on the basis of object segmentation, saliency detection, and similar techniques, but the use of object features remains limited to visual features, and the interaction of the objects in the emotion space is not exploited, so the designed models are constrained and their performance gains are limited. On this basis, the present method uses a graph convolution model to design an emotion classification method that combines the emotional relationships among objects with visual features; it uses the visual features of the image while also considering the mutual influence of all objects in the image in the emotion space, fully mining the image's features at the emotional-semantic level and improving the accuracy of image emotion classification.
Disclosure of Invention
The invention aims to design an image emotion polarity classification method combined with a graph convolution neural network, and a frame diagram of the image emotion polarity classification method is shown in figure 1.
Aiming at the problems of existing research methods, a model that exploits the emotional relationship features of objects in an image is designed and combined with a graph convolution neural network, so that the model can simultaneously acquire the relationship features of the objects in the emotion space and the visual features of the image. By combining the open-source panoramic segmentation framework Detectron2, a method is designed for building a corresponding graph model from the segmentation result of each picture; a graph convolution model is then used to represent the interaction features, i.e. the relationship features, of the objects in the emotion space while enriching the visual features of the image, and is combined with a basic convolutional neural network, thereby enriching the high-level semantic features of the image and improving the accuracy of image emotion polarity analysis.
The method comprises the following specific steps:
step 1, obtaining image object information: perform panoramic segmentation on each picture in the data set with a panoramic segmentation model to obtain the category, position, area, and other information of the objects in the picture, and label the emotion polarity and intensity of the object category words obtained by panoramic segmentation according to the word annotations in SentiWordNet;
step 2, establishing a graph model: build a corresponding graph model by taking the objects as nodes, the reciprocal of the emotion-space distance between an object word and the other object words as the edge weights, and the brightness and texture features of the region corresponding to each object as the node features;
step 3, establishing a deep network model: use the basic convolutional neural network model VGG-16 and the graph convolution model GCN, merge the outputs of the two models, and feed the merged output into a fully connected layer whose output dimensionality equals the number of classes to be classified, replacing the final classification layer of the original model;
step 4, training the model: preprocess the images by scaling, random flipping, and similar operations, input them into the network model, optimize with stochastic gradient descent, evaluate model performance with a cross-entropy function, and learn the model parameters;
step 5, obtaining the emotion category of an image to be detected: apply the same preprocessing as in step 4 to the images in the database, then input them into the model trained in step 4 to obtain the corresponding emotion categories.
Compared with the prior art, the invention has the following obvious and prominent substantive characteristics and remarkable technical progress:
the invention provides a novel image emotion classification algorithm, which combines an image convolution network with a basic deep convolution neural network and adds a high-level semantic feature of object emotion relation features on the basis of high-level visual features. Aiming at the problem that object information contained in a picture is not labeled in an existing public emotion data set, a design method is used for converting the picture into a graph model by utilizing research results in the field of panorama segmentation, the graph model is updated and enhanced by utilizing a graph convolution network and is fused with visual features, emotion features in the picture are accurately extracted by utilizing an emotion triggering mechanism, and a better emotion classification effect is obtained.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is an architecture diagram of a convolutional neural network for training image emotion classification based on the method.
FIG. 2 is an overall flow chart of emotion image classification based on the method.
Detailed Description
The invention provides an image emotion polarity classification method combined with a graph convolution neural network. The overall structure of the invention is shown in fig. 1. The invention was simulated in a Windows 10 and PyCharm environment. The specific implementation flow of the invention is shown in fig. 2, and the specific implementation steps are as follows:
step 1, obtaining image object information: perform panoramic segmentation on each picture in the data set with a panoramic segmentation model to obtain the category, position, area, and other information of the objects in the picture, and label the emotion polarity and intensity of the object category words obtained by panoramic segmentation according to the word annotations in SentiWordNet;
step 2, establishing a graph model: build a corresponding graph model by taking the objects as nodes, the reciprocal of the emotion-space distance between an object word and the other object words as the edge weights, and the brightness and texture features of the region corresponding to each object as the node features;
step 3, establishing a deep network model: use the basic convolutional neural network model VGG-16 and the graph convolution model GCN, merge the outputs of the two models, and feed the merged output into a fully connected layer whose output dimensionality equals the number of classes to be classified, replacing the final classification layer of the original model;
step 4, training the model: preprocess the images by scaling, random flipping, and similar operations, input them into the network model, optimize with stochastic gradient descent, evaluate model performance with a cross-entropy function, and learn the model parameters;
step 5, obtaining the emotion category of an image to be detected: apply the same preprocessing as in step 4 to the images in the database, then input them into the model trained in step 4 to obtain the corresponding emotion categories.
In step 1, a method for acquiring image object information by using a panoramic segmentation model is designed:
the method can be used for emotion classification of images in a large real social network, so that a universal public emotion data set Flickr and Instagram (hereinafter referred to as FI data set) which is extracted and arranged from Flickr and Instagram is selected in the example, the data set has the characteristics of large data scale and accurate emotion marking, and is more in line with a real network environment.
Using the panoramic segmentation model in Detectron2, each picture in the FI data set is segmented, and the category, position, pixel locations, and other information of each object are saved. For the category words of the objects, a word emotion labeling method based on the emotion dictionary SentiWordNet is designed; the specific calculation is as follows:
S_p = \frac{1}{m}\sum_{i=1}^{m} S'_{ip}, \qquad S_n = \frac{1}{m}\sum_{i=1}^{m} S'_{in}

S_W = S_p - S_n

where S_p and S_n are the positive and negative emotion intensities of the object word W (the subscripts p and n denoting positive and negative); S'_i is the i-th noun or adjective sense of the current word W contained in SentiWordNet, and m is the number of such senses S'; S'_{ip} and S'_{in} are the positive and negative emotion intensities annotated for S'_i in SentiWordNet, each taking a value between 0 and 1. The emotion value of the current word W is calculated by this sum-and-average method, and the emotion intensity S_W of the word is represented by the difference between its positive and negative emotion intensities.
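For illustration only, the following Python sketch shows one possible implementation of this word-level emotion labeling, assuming NLTK's SentiWordNet corpus is available; the function name word_emotion and the example label list are placeholders, not part of the patented method.

```python
# Assumes nltk.download('wordnet') and nltk.download('sentiwordnet') have been run.
from nltk.corpus import sentiwordnet as swn

def word_emotion(word):
    """Average positive/negative scores over the noun ('n') and adjective ('a')
    senses of `word` and return S_W = S_p - S_n as in the formula above."""
    senses = list(swn.senti_synsets(word, pos='n')) + list(swn.senti_synsets(word, pos='a'))
    if not senses:                     # word not covered by SentiWordNet
        return 0.0
    s_p = sum(s.pos_score() for s in senses) / len(senses)
    s_n = sum(s.neg_score() for s in senses) / len(senses)
    return s_p - s_n

# emotion values for the object-category words produced by segmentation (example labels)
labels = ['dog', 'grave', 'flower']
emotions = {w: word_emotion(w) for w in labels}
```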
In step 2, a method for establishing a graph model by using image object information is designed:
and using object information obtained by panoramic segmentation, using objects contained in the picture as nodes of the graph model, and using the reciprocal of the distance of the words in the emotion space as the edge weight between the nodes in the graph model. The distance of the words in the emotion space is calculated by adopting the following formula:
D_{ij} = \begin{cases} |S_i - S_j| + 1, & W_i, W_j \text{ of opposite emotion polarity} \\ |S_i - S_j|, & W_i, W_j \text{ of the same emotion polarity} \\ 0.5, & S_i = S_j = 0 \end{cases}

where S_i and S_j are the emotion values of the two words W_i and W_j. When the emotion polarities of the two words are opposite, their distance in the emotion space is the absolute value of the difference of their emotion intensities plus 1; when the polarities are the same, the absolute value of the difference of the emotion intensities is used directly as the emotion distance. In particular, when the emotion intensities of the two words are both 0, the emotion distance is defined as 0.5; since the emotion value of a word lies between 0 and 1, the value 0.5 distinguishes the case of two neutral words from that of two words of the same polarity.
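A minimal sketch of this emotion-space distance and the resulting reciprocal edge weights follows; numpy is assumed, and the guard against a zero distance between two distinct words of the same polarity is an added assumption not specified in the text.

```python
import numpy as np

def emotion_distance(s_i, s_j):
    if s_i == 0.0 and s_j == 0.0:       # two neutral words
        return 0.5
    if s_i * s_j < 0:                   # opposite emotion polarities
        return abs(s_i - s_j) + 1.0
    return abs(s_i - s_j)               # same polarity

def edge_weight_matrix(words, emotions):
    n = len(words)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = emotion_distance(emotions[words[i]], emotions[words[j]])
            A[i, j] = 1.0 / d if d > 0 else 0.0    # reciprocal distance as edge weight
    return A
```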
The brightness feature and the texture feature are used as the node features in the graph model. The image region where each object is located is obtained from the object position information of step 1. The brightness histogram of the pixels is taken as the brightness feature: the RGB values of the pixels in the object's image region are converted to the hue, saturation, and intensity of the HSI space, the intensity values are quantized to the range 0-255, and the distribution curve of the quantized intensity values is taken as the brightness feature, finally yielding a 256-dimensional feature vector.
Meanwhile, for the texture feature, the region where each object is located is processed with the gray-level co-occurrence matrix method. The calculation is performed in the 45° or 135° direction, because the feature quality obtained in the 0° or 90° directions is lower; the result is then quantized into a 256-dimensional feature vector.
Finally, the brightness feature and the texture feature are concatenated into a 512-dimensional feature vector that serves as the feature of each node in the graph model, and the graph model is built with the edge weight matrix A as its edges.
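The node features described above could be assembled as in the following sketch, which assumes scikit-image for the color conversion and the gray-level co-occurrence matrix (named greycomatrix in older scikit-image versions); using the HSV value channel in place of HSI intensity and summing the co-occurrence matrix along one axis to obtain 256 dimensions are simplifying assumptions.

```python
import numpy as np
from skimage.color import rgb2hsv
from skimage.feature import graycomatrix

def node_feature(region_rgb):
    # brightness part: intensity quantised to 0-255, then a 256-bin histogram
    value = (rgb2hsv(region_rgb)[..., 2] * 255).astype(np.uint8)
    lum_hist, _ = np.histogram(value, bins=256, range=(0, 255))

    # texture part: grey-level co-occurrence matrix in the 45-degree direction,
    # reduced to a 256-dimensional vector
    glcm = graycomatrix(value, distances=[1], angles=[np.pi / 4], levels=256)
    tex = glcm[:, :, 0, 0].sum(axis=0)

    return np.concatenate([lum_hist, tex]).astype(np.float32)   # 512-dim node feature
```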
In step 3, a deep network model is established:
and extracting the relation characteristics by using a GCN model. The method is realized by using a structure of stacked GCN, wherein a two-layer GCN structure is used in the method, and the input characteristic H of the current layer kkIs output from the previous layer, and the output result H is calculated in the following wayk+1
Figure BDA0002888097200000061
Wherein
Figure BDA0002888097200000062
The result of adding the edge weight A obtained by the calculation of the adjacent matrix according to the emotion value of the object and the unit matrix is obtained; wkThe weight matrix of the current convolutional layer k is obtained by random initialization, and training and adjustment are carried out in the training process according to a loss function until the training is finished; a is a non-linear activation function,
Figure BDA0002888097200000063
calculated from the following formula:
Figure RE-GDA0002987101290000064
wherein
Figure RE-GDA0002987101290000065
Is composed of
Figure RE-GDA0002987101290000066
Wherein i represents
Figure RE-GDA0002987101290000067
The coordinates of (a) are (b),
Figure RE-GDA0002987101290000068
is composed of
Figure RE-GDA0002987101290000069
Is an element of (i), ij is
Figure RE-GDA00029871012900000610
Coordinates of (2).
The GCN model thus yields a 2048-dimensional relational feature vector.
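A minimal PyTorch sketch of the two-layer GCN (512 to 1024 to 2048) with the symmetric normalization above is given below; class and function names are illustrative, and mean-pooling the node outputs into a single 2048-dimensional vector is an assumption about how the per-node features are aggregated into the image-level relational feature.

```python
import torch
import torch.nn as nn

def normalize_adjacency(A):
    A_tilde = A + torch.eye(A.size(0))              # add self-loops: A + I
    deg = A_tilde.sum(dim=1)                        # node degrees of A + I
    d_inv_sqrt = torch.diag(deg.pow(-0.5))          # D^{-1/2}
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt        # D^{-1/2} (A + I) D^{-1/2}

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim=512, hid_dim=1024, out_dim=2048):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)    # W^0
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)   # W^1
        self.act = nn.ReLU()

    def forward(self, H, A_hat):
        H = self.act(A_hat @ self.w1(H))    # H^1 = sigma(A_hat H^0 W^0)
        H = self.act(A_hat @ self.w2(H))    # H^2 = sigma(A_hat H^1 W^1)
        return H.mean(dim=0)                # pool nodes into a 2048-dim relational feature

# usage: A_hat = normalize_adjacency(torch.tensor(A, dtype=torch.float32))
#        rel_feat = TwoLayerGCN()(node_features, A_hat)   # node_features: [n_nodes, 512]
```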
For the visual features, a VGG16 model pre-trained on ImageNet is adopted; after loading, the final classification layer with input dimension 2048 × 1024 and output dimension 1 × 1024 is removed, and the output dimension of the classification layer is adjusted to 2048 × 1 to serve as the visual feature. The GCN network contains two graph convolution layers in total: the first has input dimension 512 × 1 and output 1024 × 1, and the second has input dimension 1024 × 1 and output 2048 × 1, which is the final relational feature vector. The visual feature and the relational feature are concatenated into a 4096 × 1 feature vector and input into the final classification layer, whose output dimension is 1 × 2, i.e. the predicted probability of each category; the emotion category corresponding to the position of the largest entry is taken as the emotion category output for the image;
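One possible way to fuse the visual and relational branches is sketched below with torchvision's VGG-16 (weights API of torchvision 0.13 or later); the 2048-dimensional visual head and the per-image graph handling are assumptions for illustration, and the layer dimensions of the stock torchvision VGG-16 differ from those stated above.

```python
import torch
import torch.nn as nn
from torchvision import models

class EmotionClassifier(nn.Module):
    def __init__(self, gcn):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.backbone = vgg.features
        self.pool = vgg.avgpool
        # drop VGG-16's last classification layer and map its 4096-dim output to 2048 dims
        self.visual_head = nn.Sequential(*list(vgg.classifier.children())[:-1],
                                         nn.Linear(4096, 2048))
        self.gcn = gcn                              # e.g. the TwoLayerGCN sketched above
        self.classifier = nn.Linear(4096, 2)        # two emotion polarities

    def forward(self, images, node_feats_list, adj_list):
        x = self.pool(self.backbone(images)).flatten(1)        # [B, 25088]
        visual = self.visual_head(x)                            # [B, 2048]
        relational = torch.stack([self.gcn(h, a)                # one graph per image
                                  for h, a in zip(node_feats_list, adj_list)])
        fused = torch.cat([visual, relational], dim=1)          # [B, 4096]
        return self.classifier(fused)                           # [B, 2] class scores
```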
in the step 4, the training of the model is realized through operations such as data preprocessing, data input, calculation of a loss function and the like:
the images are pre-processed by scaling, random flipping and the like, in this example, the parameter of random cropping is set to 224 × 224, and the probability of random flipping is set to 0.5. And inputting the batch with the fixed size into the network model, and taking the sample of the batch with the fixed size as a batch. The fixed batch size setting will improve the training effect of the model to a certain extent as much as possible, but due to the limitation of the experimental platform, it is recommended to select 8, 16 or 32, in this example, the fixed batch size setting is 16. And automatically comparing the output prediction result with the input training set label through the final classification layer, and counting the proportion of the correct number of samples in the whole training sample as the accuracy of the training set in the round. And meanwhile, when an output vector is obtained, a loss value of the current model can be calculated by using a loss function shown below, and the loss value is fed back to the optimizer for processing and then carrying out back propagation to update each parameter in the model.
In the calculation of the loss function, the cross-entropy loss shown below is used, with the purpose of maintaining the separation between classes so that images of different emotion classes lie farther apart:

L = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{w_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{N} e^{w_{j}^{T}x_i + b_j}}

where m is the number of images in each batch during training, N is the number of emotion classes in the data set, and x_i is the feature of the i-th picture in the batch obtained, before the classification layer, from the basic backbone network described in step 3; w and b are the weight and bias parameters of the classification layer, the subscript y_i indicates the class assigned to the i-th picture after the classification layer, j denotes the class index corresponding to a prediction result, and b_{y_i} is the value of the classification-layer bias parameter when the i-th picture of the batch is judged to be of class y_i.
Considering convergence speed and quality, the optimizer of this method is stochastic gradient descent. Its parameters mainly comprise the initial learning rate and the momentum. The initial learning rate is generally chosen from values such as 0.1, 0.01, 0.0001, and 0.00001 according to the convergence behavior of the model; this embodiment recommends 0.01, at which convergence is more stable. The momentum should in principle lie between 0 and 1; here the default value of 0.9 in stochastic gradient descent is preferred. Because a fixed learning rate hinders the deep network from finding better parameters in the second half of training, the method adds a strategy of decaying the learning rate after a fixed number of rounds. It is recommended to decay the learning rate 1-2 times within 20-30 rounds, with a total of 50-80 training rounds. In this example, the optimizer decays the learning rate every 20 and every 30 rounds, and the model parameters are trained for 80 rounds to ensure effective convergence; too few rounds may fail to converge, while too many rounds increase training time without improving the result.
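A minimal training-loop sketch matching these settings (stochastic gradient descent, learning rate 0.01, momentum 0.9, learning rate divided by 10 every 20 rounds, cross-entropy loss, 80 rounds) is shown below; model and train_loader are assumed to exist, and the loader is assumed to use a custom collate function that keeps each image's graph tensors in Python lists.

```python
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = StepLR(optimizer, step_size=20, gamma=0.1)   # decay lr every 20 epochs

for epoch in range(80):
    model.train()
    correct, total = 0, 0
    for images, node_feats_list, adj_list, labels in train_loader:   # batch size e.g. 16
        optimizer.zero_grad()
        logits = model(images, node_feats_list, adj_list)
        loss = criterion(logits, labels)       # cross-entropy over the emotion classes
        loss.backward()                        # back-propagate the loss
        optimizer.step()                       # update all model parameters
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    scheduler.step()
    print(f"epoch {epoch}: training accuracy {correct / total:.4f}")
```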
After each round of training, the model parameters are fixed, the validation-set data of the FI data set are scaled and cropped to a fixed size and fed into the network model (the cropping parameter is set to 224 × 224 in this example), the output of the model is compared with the sample labels, and the proportion of correct samples, i.e. the validation-set accuracy, is counted. If the validation accuracy of the current round is higher than the previous best, the new best validation accuracy is recorded and the model trained up to the current round is saved. After all training rounds are finished, the saved model corresponding to the highest validation accuracy is the final, optimal trained model;
in the step 5, obtaining the emotion type of the image to be detected:
and (3) cutting the test set data or any image in the FI data set according to a fixed-size scaling center like the image in the verification set in the synchronization step 4, and then inputting the test set data or any image into the model one by one or in batches by a fixed quantity. In the example, the parameter of the fixed-size zoom center cropping is set to 224 × 224, and in order to improve the processing efficiency under the same experimental conditions, the test set data in the example recommends that 16 is the batch size, and test images are output to the model according to batches for testing. And after model processing, comparing the output result after the classification layer with the label of the sample, and counting the proportion of the correct sample, namely the accuracy of the test set. And the emotion type corresponding to the output result is the image emotion type judged by the model change.
Testing the model of this example on the test set of the FI data set gives an accuracy of 0.8808, which is higher than the best result among current comparable methods: the accuracy of 0.8635 reported in "Visual Sentiment Prediction Based on Automatic Discovery of Affective Regions", published in 2018 in the journal IEEE Transactions on Multimedia.

Claims (5)

1. The image emotion polarity classification method combined with the graph convolution neural network is characterized by comprising the following steps of:
step 1, obtaining image object information: carrying out panoramic segmentation processing on each picture in the data set by using a panoramic segmentation model to obtain the category and position information of an object in the picture, and carrying out emotional polarity and intensity labeling on the object category words obtained by panoramic segmentation in the data set according to the labeling result of the words in SentiWordNet;
step 2, establishing a graph model: establishing a corresponding graph model by taking an object as a node, taking the reciprocal of the distance of the object words in the emotion space as an edge weight and taking the brightness and texture characteristics of a region corresponding to the object as node characteristics;
step 3, establishing a deep network model: using a basic convolution neural network model VGG-16 and a graph convolution model GCN, merging the outputs of the two models, inputting the merged outputs into a full connection layer, and replacing the final classification layer of the original model by using the number of classes to be classified as the output dimensionality of the full connection layer;
step 4, training a model: preprocessing the images, inputting them into the network model, optimizing with stochastic gradient descent, evaluating model performance with a cross-entropy function, and learning the model parameters;
step 5, obtaining the emotion types of the images to be detected: and (4) preprocessing the images in the database, and inputting the preprocessed images into the model trained in the step (4) to obtain the corresponding emotion types.
2. The method of claim 1, wherein: in the step 1, the object information of the picture is identified by using a panorama segmentation algorithm, and a word emotion labeling method of an emotion dictionary SentiWordNet is used, wherein the specific calculation method is as follows:
S_p = \frac{1}{m}\sum_{i=1}^{m} S'_{ip}, \qquad S_n = \frac{1}{m}\sum_{i=1}^{m} S'_{in}

S_W = S_p - S_n

wherein S_p and S_n are the positive and negative emotion intensities of the object word W (the subscripts p and n denoting positive and negative); S'_i is the i-th noun or adjective sense of the current word W contained in SentiWordNet, and m is the number of such senses S'; S'_{ip} and S'_{in} are the positive and negative emotion intensities of S'_i annotated in SentiWordNet, each with a value between 0 and 1; the emotion value of the current word W is calculated by the sum-and-average method, and the emotion intensity S_W of the current word is represented by the difference between its positive and negative emotion intensities.
3. The method of claim 1, wherein: in step 2, using the object information obtained by the panoramic segmentation algorithm, the objects contained in the picture are taken as the nodes of the graph model, and the reciprocal of the distance between the corresponding words in the emotion space is taken as the edge weight between nodes; for the object words W_1, W_2, ... contained in the picture, the distance between two words W_i and W_j in the emotion space is calculated by the following formula:

D_{ij} = \begin{cases} |S_i - S_j| + 1, & W_i, W_j \text{ of opposite emotion polarity} \\ |S_i - S_j|, & W_i, W_j \text{ of the same emotion polarity} \\ 0.5, & S_i = S_j = 0 \end{cases}

wherein S_i and S_j are the emotion values of the words W_i and W_j calculated by the method in step 1; when the emotion polarities of the two words are opposite, the distance in the emotion space is the absolute value of the difference of their emotion intensities plus 1; when the polarities are the same, the absolute value of the difference of the emotion intensities is used as the emotion distance; when the emotion intensities of the two words are both 0, the emotion distance is defined as 0.5 to distinguish the case of two neutral words from that of two words of the same polarity;

finally, the reciprocal of the emotion distance between two words W_i and W_j is used as the edge weight A_{ij} between the corresponding nodes in the graph model; the above steps are repeated for all the words W_1, W_2, ..., obtaining the edge weight matrix A of each picture;
taking the brightness characteristic and the texture characteristic as node characteristics in the graph model; obtaining an image area where each object is located by using object position information obtained in a panoramic segmentation algorithm; taking a brightness histogram of pixels in a picture as a brightness characteristic, namely converting an RGB value of the pixels in an image area where an object is located into hue, saturation and brightness of an HSI space, quantizing the brightness value, taking a quantized brightness value distribution curve as the brightness characteristic, quantizing the brightness into 0-255, and finally obtaining a 256-dimensional characteristic vector;
meanwhile, for texture features, calculating the area where each object is located by using a gray level co-occurrence matrix method; calculating in the direction of 45 degrees or 135 degrees, and quantizing the calculated result into a 256-dimensional characteristic vector;
and finally, splicing the brightness features and the texture features to obtain 512-dimensional feature vectors as features corresponding to each node in the graph model, and establishing the graph model by using the edge weight matrix A as an edge of the graph model.
4. The method of claim 1, wherein: in step 3, the relational features in the graph model are extracted with a GCN model, implemented as a stack of graph convolution layers using a two-layer GCN structure, where the input feature H^k of the current layer k is the output of the previous layer and the output H^{k+1} is calculated as

H^{k+1} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{k}\,W^{k}\right)

wherein \tilde{A} = A + I is the result of adding the identity matrix to the edge weight (adjacency) matrix A computed from the emotion values of the objects; W^k is the weight matrix of the current graph convolution layer k, which is randomly initialized and adjusted during training according to the loss function until training finishes; \sigma is a nonlinear activation function; and \tilde{D} is the diagonal degree matrix of \tilde{A}, calculated from the following formula:

\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}

where i and j index the rows and columns of \tilde{A};
calculating by a GCN model to obtain characteristic vectors with 2048 x 1 dimensions of relational characteristics;
for visual features, a VGG16 is adopted to obtain a VGG16 model which is pre-trained on ImageNet, the last classification layer with input dimension of 2048 x 1024 and output dimension of 1024 x 1 is removed after loading, and the output dimension of the classification layer is adjusted to 2048 x 1 and serves as the visual features;
and (5) splicing the relational features and the visual features to obtain 4096 x 1 feature vectors and inputting the feature vectors into the full-connection layer to realize the classification of the image emotions.
5. The method of claim 1, wherein: in said step 4, the loss function uses cross-entropy loss to make a basic loss measure; the specific loss function is as follows:
L = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{w_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{N} e^{w_{j}^{T}x_i + b_j}}

wherein m is the number of images in each batch during training, N is the number of emotion classes in the data set, and x_i is the feature of the i-th picture in the batch obtained from the basic backbone network of step 3 before the classification layer; w and b are the weight and bias parameter values in the classification layer, which are obtained by random initialization and are trained and adjusted according to the loss function during training until training finishes; the subscript y_i represents the class assigned after the classification layer, j represents the class index corresponding to a prediction result, and b_{y_i} is the value of the classification-layer bias parameter when the i-th picture of the batch is judged to be of class y_i;
in the training process, 0.01 is used as an initial learning rate, the learning rate is reduced to one tenth of the current learning rate every 20 rounds, and the training is finished after the training times reach more than 80.
CN202110019810.1A 2021-01-07 2021-01-07 Image emotion polarity classification method combined with graph convolution neural network Pending CN112712127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110019810.1A CN112712127A (en) 2021-01-07 2021-01-07 Image emotion polarity classification method combined with graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110019810.1A CN112712127A (en) 2021-01-07 2021-01-07 Image emotion polarity classification method combined with graph convolution neural network

Publications (1)

Publication Number Publication Date
CN112712127A true CN112712127A (en) 2021-04-27

Family

ID=75548482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110019810.1A Pending CN112712127A (en) 2021-01-07 2021-01-07 Image emotion polarity classification method combined with graph convolution neural network

Country Status (1)

Country Link
CN (1) CN112712127A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297936A (en) * 2021-05-17 2021-08-24 北京工业大学 Volleyball group behavior identification method based on local graph convolution network
CN113392781A (en) * 2021-06-18 2021-09-14 山东浪潮科学研究院有限公司 Video emotion semantic analysis method based on graph neural network
CN113449640A (en) * 2021-06-29 2021-09-28 中国地质大学(武汉) Remote sensing image building semantic segmentation edge optimization method based on multitask CNN + GCN
CN116385029A (en) * 2023-04-20 2023-07-04 深圳市天下房仓科技有限公司 Hotel bill detection method, system, electronic equipment and storage medium
CN116721284A (en) * 2023-05-25 2023-09-08 上海蜜度信息技术有限公司 Image classification method, device, equipment and medium based on image enhancement

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679580A (en) * 2017-10-21 2018-02-09 桂林电子科技大学 A kind of isomery shift image feeling polarities analysis method based on the potential association of multi-modal depth
CN108416397A (en) * 2018-03-30 2018-08-17 华南理工大学 A kind of Image emotional semantic classification method based on ResNet-GCN networks
CN111563164A (en) * 2020-05-07 2020-08-21 成都信息工程大学 Specific target emotion classification method based on graph neural network
CN112001186A (en) * 2020-08-26 2020-11-27 重庆理工大学 Emotion classification method using graph convolution neural network and Chinese syntax

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679580A (en) * 2017-10-21 2018-02-09 桂林电子科技大学 A kind of isomery shift image feeling polarities analysis method based on the potential association of multi-modal depth
CN108416397A (en) * 2018-03-30 2018-08-17 华南理工大学 A kind of Image emotional semantic classification method based on ResNet-GCN networks
CN111563164A (en) * 2020-05-07 2020-08-21 成都信息工程大学 Specific target emotion classification method based on graph neural network
CN112001186A (en) * 2020-08-26 2020-11-27 重庆理工大学 Emotion classification method using graph convolution neural network and Chinese syntax

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHEN ZHAOMIN et al.: "Multi-Label Image Recognition With Graph Convolutional Networks", IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9 January 2020 (2020-01-09), pages 5172-5181 *
YANG JUFENG et al.: "Visual Sentiment Prediction Based on Automatic Discovery of Affective Regions", IEEE Transactions on Multimedia, vol. 20, 7 February 2018 (2018-02-07), page 2513 *
ZHAO PINLONG et al.: "Modeling sentiment dependencies with graph convolutional networks for aspect-level sentiment classification", Knowledge-Based Systems, vol. 193, 28 December 2019 (2019-12-28), pages 1-10, XP086080506, DOI: 10.1016/j.knosys.2019.105443 *
梁宁 (LIANG Ning): "Research on Text Sentiment Analysis Based on Attention Mechanism and Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 1, 15 January 2020 (2020-01-15) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297936A (en) * 2021-05-17 2021-08-24 北京工业大学 Volleyball group behavior identification method based on local graph convolution network
CN113392781A (en) * 2021-06-18 2021-09-14 山东浪潮科学研究院有限公司 Video emotion semantic analysis method based on graph neural network
WO2022262098A1 (en) * 2021-06-18 2022-12-22 山东浪潮科学研究院有限公司 Video emotion semantic analysis method based on graph neural network
CN113449640A (en) * 2021-06-29 2021-09-28 中国地质大学(武汉) Remote sensing image building semantic segmentation edge optimization method based on multitask CNN + GCN
CN113449640B (en) * 2021-06-29 2022-02-11 中国地质大学(武汉) Remote sensing image building semantic segmentation edge optimization method based on multitask CNN + GCN
CN116385029A (en) * 2023-04-20 2023-07-04 深圳市天下房仓科技有限公司 Hotel bill detection method, system, electronic equipment and storage medium
CN116385029B (en) * 2023-04-20 2024-01-30 深圳市天下房仓科技有限公司 Hotel bill detection method, system, electronic equipment and storage medium
CN116721284A (en) * 2023-05-25 2023-09-08 上海蜜度信息技术有限公司 Image classification method, device, equipment and medium based on image enhancement


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination