CN112613552A

CN112613552A - Convolutional neural network emotion image classification method combining emotion category attention loss

Info

Publication number: CN112613552A
Application number: CN202011506810.6A
Authority: CN
Inventors: 毋立芳; 邓斯诺; 张恒; 石戈; 简萌; 相叶
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2021-04-06
Anticipated expiration: 2040-12-18
Also published as: CN112613552B

Abstract

A convolutional neural network emotion image classification method combining emotion category attention loss relates to the technical field of intelligent media calculation and computer vision; firstly, carrying out category weight calculation on a training sample to obtain an emotion category attention weight vector; secondly, modifying the final classification layer and the loss function of the convolutional neural network according to the emotion category number and the emotion category attention loss; then preprocessing the training sample and transmitting the preprocessed training sample into a network, so that the network can achieve convergence after iterative updating of parameters of a loss function and an optimizer, and training is completed; and finally, sending the preprocessed test image into a network, and calculating the emotion image classification accuracy of the obtained model and the prediction category of the model on the test emotion image. According to the invention, when the emotion images are classified into emotion categories through the convolutional neural network, the classification result more conforming to the sample distribution characteristics of the data set can be obtained in a self-adaptive manner, and the training and the use of the emotion classification algorithm in different practical application scenes are facilitated.

Description

Convolutional neural network emotion image classification method combining emotion category attention loss

Technical Field

The invention belongs to the technical field of computer vision, and relates to a convolutional neural network emotion image classification method combining emotion category attention loss.

Background

With the development of social media, people tend to record and share their moods by publishing image information on the internet. These pictures containing emotional information often also contain the emotional tendency and attitude of the publisher to a certain kind of things. The emotional tendency of people is obtained from the massive images to know the attitude of the user group, and the method plays an important role in commodity recommendation, public opinion analysis and social media management. Therefore, how to efficiently and automatically identify and analyze a large number of images containing emotional information by using a computer algorithm is a problem which needs to be solved urgently at present.

Early emotion analysis methods adopted the use of bottom-layer manual features for emotion classification, such as color, line, texture, or the use of noun adjectives to extract middle-layer representations in the images for the detector Sentibank. The method benefits from the strong feature extraction capability of deep learning, the effect obtained by using the algorithm based on the convolutional neural network in the image classification task is better and better, and the emotion image classification task is also newly developed. For example, You et al designed a deep convolutional neural network for image emotion classification in 2016, and added a filtering function in its feedback mechanism, by which the erroneous labeled data in the training set is filtered out, thereby effectively improving the image emotion classification capability. She et al designed a weakly supervised coupled convolutional network based on a deep convolutional neural network ResNet101 for image emotion classification in 2019, captured regions causing emotion through class activation maps, and performed feedback adjustment by using an error back propagation mechanism, thereby further improving the accuracy of image emotion classification.

However, in a general image classification task, images of different categories are relatively clearly distinguished, and different emotion images have no clear boundary between emotion categories. Therefore, the loss function needs to be calculated by adopting different measurement modes for the similarity distance of the sample characteristics, so as to improve the discrimination of different categories. In this respect, Zhang et al propose a face recognition technology combining a deep convolutional neural network and central loss in 2017, and by using two loss functions, namely a traditional loss function and a central loss function, as supervision signals in a transfer learning process, extracted features are aggregated in different classes and dispersed, so that the discrimination capability of the model for outputting face features is improved. Yang et al propose a method for training a convolutional neural network by using triple constraints in 2018 to effectively position images at emotion levels, and achieve multitask processing of emotion image retrieval and classification by calculating correlation among features by considering the relation of different emotion polarities. Although these studies take into account the distance within the binding class, there is no targeted loss function design for emotion classes with an unbalanced sample number, which is not enough in a real social media environment. In a real social media environment, the number of emotion images cannot be uniformly distributed according to categories. This may cause the generally trained model to fail to specifically consider the individual emotion classes, resulting in a loss of classification performance.

Under the initiation, a method combining cross entropy loss and emotion category attention center loss is designed, the distance between emotion image categories is expanded, and meanwhile, the distance in a convergence category is different according to different categories, so that the accuracy of emotion image classification is improved.

Disclosure of Invention

In order to solve the problems, the emotion image classification algorithm is invented, attention center loss of emotion classes and cross entropy loss are combined, so that the feature distance of samples of different classes is farther, and the distance between samples of the same class is closer. The chance that portions of the samples are distributed at the interface between classes that are further from the center within the class due to the lack of intra-class constraints is reduced, thereby reducing the likelihood that these samples will be misclassified. Meanwhile, different binding strengths are adopted according to different sample categories, so that when the number of the samples in each category is unbalanced, corresponding adjustment can be obtained, and the network model obtained by the method has a better emotion image classification effect.

The method comprises the following specific steps:

step 1, establishing an emotion category weight vector of an image: sorting and dividing the labeled emotion image data set, and regarding each type of image in the training set part as a one-dimensional image to obtain an emotion category weight vector W;

step 2, establishing a deep network model: selecting a deep network model such as ResNet-101, and replacing the final classification layer of the original model by using the number of classes to be classified as the dimension of an output vector;

step 3, designing a loss function: in order to solve the problems of unbalanced proportion of heterogeneous samples and convergence of inter-class and intra-class distances by utilizing emotion class weight information of an image, a loss function comprises calculation of the inter-class distance and the intra-class distance and consideration of weight proportions of different emotion classes;

step 4, training a model: preprocessing the image divided in the step 1 in modes of scaling, random overturning and the like, inputting the preprocessed image into the network model in the step 2, optimizing the preprocessed image by using a random gradient descent method, and learning model parameters by calculating loss through the loss function in the step 3;

step 5, obtaining the emotion types of the images to be detected: and (4) after the preprocessing steps of fixed-size scaling and center cutting are carried out on the images in the data set, inputting the images into the model trained in the step (4) to obtain the corresponding emotion categories.

Compared with the prior art, the invention has the following obvious and prominent substantive characteristics and remarkable technical progress:

the invention provides a convolutional neural network emotion image classification method combining emotion category attention loss. Parameters of the convolutional neural network are updated in a feedback mode by calculating the combination of cross entropy loss and category attention center loss, so that the distance between samples of the same category is closer while the characteristic distance of samples of different categories is farther. The chance that portions of the samples are distributed at the interface between classes that are further from the center within the class due to the lack of intra-class constraints is reduced, thereby reducing the likelihood that these samples will be misclassified. Meanwhile, different bundling strengths are adopted according to different sample categories, so that when the number of samples in each category is unbalanced, corresponding adjustment can be obtained, and the network model obtained by the method has a better emotion classification effect in the test of the emotion image data set with unbalanced categories.

Drawings

The invention is described in further detail below with reference to the following figures and detailed description:

FIG. 1 is an architecture diagram of a convolutional neural network for training image emotion classification based on the method.

FIG. 2 is an overall flow chart of emotion image classification based on the method.

Detailed Description

The invention provides a convolutional neural network emotion image classification method combining emotion category attention loss. The overall structure of the present invention is shown in fig. 1. The embodiment of the invention simulates in win10 and JupyterNotebook environment, and the FI data set is used for training by the method of the invention, so that the image emotion classification model capable of realizing high accuracy is obtained. After the model is obtained, the test image can be input into the model to obtain the emotion classification result of the image. The specific implementation flow of the invention is shown in fig. 2, and the specific implementation steps are as follows:

step 2, establishing a deep network model: selecting ResNet-101 as a backbone network of the deep network model, and replacing the last classification layer of the original model by using the number of classes to be classified as an output vector dimension to obtain a deep network model to be used;

step 4, training the model: preprocessing the image divided in the step 1 in modes of scaling, random overturning and the like, inputting the preprocessed image into the network model in the step 2, optimizing the preprocessed image by using a random gradient descent method, and learning model parameters by calculating loss through the loss function in the step 3;

In step 1, an emotion category weight vector W of the image is established:

the method can be used for emotion classification of images in a large real social network, so that in the example, a universal public emotion data set Flickr & Instagram (hereinafter referred to as FI data set) formed by arranging Flickr and Instagram is selected, and compared with a traditional emotion data set, the data set has the characteristics of large data scale and unbalanced emotion types, and is more in line with a real network environment.

Sorting training sets in the FI data set according to the original 8 classes of labels, and obtaining the weight w of each class through weight calculation_iThe calculation method is as follows:

wherein N is the number of categories, N_iIn this example, N is set to 8 for each emotion type of the FI data set, as the number of training samples for each type. After the weight coefficients of all the classes are obtained, the weight coefficients of all the classes are connected in parallel to obtain an emotion class weight vector W ═ W₁,w₂,…,w_N]；

In step 2, a deep network model is established:

in the embodiment, ResNet-101 is adopted, a ResNet-101 model which is pre-trained on ImageNet is obtained, a classification layer with the final input dimension of 2048 x 1024 and the output dimension of 1 x 1024 is removed after loading, the number of categories to be classified is 8 and is used as the dimension of an output vector for replacement, the input dimension of the final classification layer is 2048 x 8, the output dimension is 1 x 8, namely the prediction probability of each category, and the corresponding emotion category of the position to which the maximum item belongs is taken as the emotion category of image output;

in step 3, a loss function is designed:

in order to solve the problems of unbalanced proportion of heterogeneous samples and convergence of inter-class distance and intra-class distance by using emotion category weight information of an image, a loss function comprises calculation of inter-class distance and intra-class distance and consideration of weight proportions of different emotion categories, and the designed loss function comprises two items of emotion category attention center loss and cross entropy loss:

emotional category attention center loss: the central penalty of adding emotion class weight is used for constraint to implement a mechanism for attention control according to emotion class. Weighted center losses are constructed for discriminative intra-class bundling for different classes. In order to solve the problem of class imbalance of data, the weighting center loss is specially added with a weighting part to reduce the influence caused by the class imbalance of the samples. The specific loss function is as follows:

wherein m is the number of images in each batch during training, and is generally an integer power of 2, such as 16 or 32, according to the video memory size of the experimental platform, and is set to 16 in this example, W is the weight vector obtained in step 1, and f is the weight vector obtained in step 1_iFeatures of the image obtained before the classification layer via the basic backbone network in basic step 2,

is of the category y_iThe characteristic center is composed of the mean values of the characteristics obtained from the images of the same category in each dimension in the current batchA feature vector.

Cross entropy loss: the basic measure of loss is made using cross-entropy loss, which aims to preserve inter-class distance. The constructed cross-entropy loss is used to make the images farther apart between different emotion classes. The specific loss function is as follows:

wherein m is the number of images per batch and L_cM in (1) is the same, N is the number of emotion classes in the data set, and in this example, N is 8, x according to the FI data set_iFor the features of the ith picture in the batch obtained from the basic backbone network in step 2 before the classification layer, w and b are the values of weight and bias parameters in the classification layer, and the subscript y_iAnd j is represented as a category after the classification level, e.g., where

Indicates that the ith picture of the batch is judged as y_iClass is the value of the bias parameter in the classification layer.

The total loss function is:

where α is the hyperparameter of the quantization loss function, set to 0.6 in this example.

In step 4, training of the model is performed:

the images were pre-processed by scaling, random flipping, etc., with the parameters for random cropping set to 448 x 448 in this example, and the probability of random flipping set to 0.5. And inputting the batch with the fixed size into the network model, and taking the sample of the batch with the fixed size as a batch. The fixed batch size setting is as large as possible to improve the training effect of the model to a certain extent, but due to the limitation of the experimental platform, it is recommended to select 8, 16 or 32, in this example, the fixed batch size is set to 16. And through the final classification layer, the output result is automatically compared with the input training set label, and the proportion of the number of the correct samples in the whole training samples is counted and recorded as the accuracy of the training set in the round. And meanwhile, when the output vector is obtained, the loss value of the current round is also obtained according to the loss function calculated in the step 3, and the obtained loss value is fed back to the optimizer for processing and then carrying out back propagation to update each parameter in the model.

In consideration of the convergence speed and the convergence effect, the optimizer in the method selects a random gradient descent method as an optimization method. The parameter setting of the optimizer mainly comprises two items of initial learning rate (learning rate) and momentum (momentum), wherein the initial learning rate is generally selected according to the convergence condition of the model in the equivalence of 0.1, 0.01, 0.0001 and 0.00001, and the convergence effect is more stable at the initial value by adopting 0.0001 in the embodiment. The momentum is in principle between 0 and 1, in this example a default value of 0.9 in the stochastic gradient descent method is recommended. Because the setting of the fixed learning rate is not beneficial to the deep network to find better parameters in the second half of training, the method increases the strategy of reducing the learning rate in a fixed turn in the training process. Wherein the reduced rounds are recommended to be reduced for 1-2 times in 14-20 rounds, and the total training rounds are recommended to be 50-80 rounds until the loss function value oscillates and does not obviously decrease any more, namely the model is determined to be converged, and the training can be finished. In this example, the optimizer is set to reduce the learning rate by half every 14 rounds and every 20 rounds, and 60 rounds of training learning are performed on the model parameters to ensure effective convergence of the training effect, and if the number of the set rounds is too small, the training time is increased, but the effect is not improved.

After each round of training of the training samples, parameters of the model are fixed, the FI data set is adopted for fixed-size scaling cutting, the parameters are transmitted into the network model, in the example, the cutting parameters are set to 448 x 448, model output and labels of the samples are compared, the proportion occupied by correct samples, namely the accuracy rate of the verification set, is counted, if the accuracy rate of the verification set of the current round number is higher than the accuracy rate of the previous highest verification set, the current accuracy rate is saved as the highest accuracy rate of the verification set, and the model trained by the current round number is saved. After all rounds of training are finished, the model under the highest verification set accuracy rate is finally stored, namely the trained optimal model;

in the step 5, obtaining the emotion type of the image to be detected:

and (4) cutting the test set data or any image in the FI data set according to a fixed-size scaling center as the verification set image in the synchronization step 4, and then inputting the test set data or any image into the model one by one or in batches according to a fixed number. In this example, the parameter of the fixed-size zoom center cropping is set to 448 x 448, in order to improve the processing efficiency under the same experimental conditions, the test set data in this example recommends that 16 be the batch size, and test images are output to the model in batches for testing. And after model processing, comparing the output result after the classification layer with the label of the sample, and counting the proportion of the correct sample, namely the accuracy of the test set. And the emotion type corresponding to the output result is the image emotion type judged by the model.

The test set in the FI data set is subjected to model test in the example, the accuracy result is 0.7087, which is higher than the best effect in the research content of the current similar method: published in the high-level journal IEEE TRANSACTIONS MULTIMEDIA of this yearAdaptive Deep Metric Learning for Affective Image Retrieval and Classification0.6837, and is also higher than the best result of multi-classification of the data set by only using the label information as reference: published in the high-level journal IEEE TRANSACTIONS MULTIMEDIA of this yearWSCNet: Weakly Supervised Coupled Networks for Visual Sentiment Classification and DetectionAccuracy 0.7007.

Claims

1. The convolutional neural network emotion image classification method combining emotion classification attention loss is characterized by comprising the following steps of:

step 1, establishing an emotion category weight vector of an image: regarding each type of image in the data set as a one-dimensional image to obtain an emotion category weight vector W;

step 2, establishing a deep network model: selecting a deep network model ResNet-101, using the ResNet-101 as a basic backbone network, and replacing the last classification layer of the original model by using the number of classes to be classified as an output vector dimension to generate emotion classes;

step 4, training a model: preprocessing an image, inputting the image into a network model, optimizing by using a random gradient descent method, and learning model parameters;

step 5, obtaining the emotion types of the images to be detected: and (4) after the images in the database are subjected to the same preprocessing step as the step 4, inputting the images into the model trained in the step 4 to obtain the corresponding emotion types.

2. The method of claim 1, wherein: in the step 1, a specific method for establishing the emotion category weight of the image is as follows: regarding the same class image in the image training set with the emotion class number N as a dimension C_iI ═ 1,2, …, N; calculating and counting the number n of samples in different dimensions_iI ═ 1,2, …, N; obtaining each class weight through calculation, combining to form a weight vector W ═ W₁,w₂,…,w_N]Each w_iThe calculation method is as follows:

3. the method of claim 2, wherein: in the step 3, the loss function comprises two losses, namely emotion category attention center loss and cross entropy loss;

4.1 emotional Category attention center loss

The emotion class attention center loss particularly increases the weight part to reduce the influence caused by the imbalance of the sample classes; the specific loss function is as follows:

wherein m is the number of images in each batch during training, W is the obtained weight vector, f_iFeatures derived from images passing through the underlying backbone network before the classification level, C_yiIs of the category y_iThe characteristic center is a characteristic vector formed by the mean values of the characteristics in all dimensions obtained from the same type of images in the current batch;

4.2 Cross entropy loss

A basic measure of loss is made using cross-entropy loss, with the aim of preserving inter-class distance; the constructed cross entropy loss is used for enabling images among different emotion categories to be farther; the specific loss function is as follows:

wherein m is the number of images in each batch during training, N is the number of emotion classes in the data set, and x_iFor the ith picture in the batch, obtaining the characteristics from the ResNet-101 basic backbone network before the classification layer, wherein w and b are the weight and bias parameter values in the classification layer, and the subscript y_iAnd j is represented as a class after the classification level,

indicates that the ith picture of the batch is judged as y_iA value of a bias parameter in a classification layer at the time of classification;

the total loss function is:

where α is the hyperparameter of the quantization loss function, set to 0.6.

4. The method of claim 1, wherein: and performing data preprocessing on the picture to be trained in a random center cutting and random horizontal turning mode, training the model in a random gradient descending mode, and storing the trained model.