CN111026898A - Weak supervision image emotion classification and positioning method based on cross space pooling strategy - Google Patents

Weak supervision image emotion classification and positioning method based on cross space pooling strategy

Info

Publication number
CN111026898A
Authority
CN
China
Prior art keywords
emotion
image
representing
classification
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911259699.2A
Other languages
Chinese (zh)
Inventor
徐丹
彭国琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN201911259699.2A priority Critical patent/CN111026898A/en
Publication of CN111026898A publication Critical patent/CN111026898A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/5866 Retrieval characterised by using metadata, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a weakly supervised image emotion classification and localization method based on a cross-space pooling strategy. Building on ResNet-101, a cross-space pooling strategy is adopted to produce a feature vector whose dimensionality equals the number of emotion classes; the regions of the image that induce emotion are then captured by aggregating the feature maps of all emotion classes. By means of the cross-space pooling strategy, the convolutional neural network learns discriminative information for each emotion class, which improves emotion classification performance and greatly increases classification accuracy.

Description

Weak supervision image emotion classification and positioning method based on cross space pooling strategy
Technical Field
The invention relates to the technical field of image processing, in particular to a weak supervision image emotion classification and positioning method based on a cross space pooling strategy.
Background
People increasingly express their emotions by uploading pictures to social media such as Twitter and Weibo, while fields such as security, surveillance and education increasingly need to understand people's emotions. Visual emotion analysis of images has therefore attracted growing attention, and as artificial intelligence reaches into more domains, expectations for computers to understand the emotions conveyed by images keep rising. Deep learning has achieved good results in visual recognition tasks such as image classification [1-3], object recognition [4-6] and semantic segmentation [7-9], so deep learning methods have also been applied to image emotion analysis [10-13]. In image emotion analysis, machine learning based on deep features outperforms traditional methods built on hand-crafted features [14-17]; the traditional hand-crafted features mainly consider color, texture, principal components, and the like.
Borth et al. [18-19] defined a visual sentiment ontology for describing images with adjective-noun pairs (ANPs) as its elements, and proposed SentiBank to detect the sentiment descriptions of images from low-level visual features, building visual sentiment concepts into classification. Ali et al. [20] proposed taking two kinds of high-level concepts, objects and scenes, into account in emotion analysis: the emotional semantics of an image are related to both high-level semantics and low-level features, and different emotion categories are related to different high-level semantic concepts, so the relationship between high-level semantics and emotion is modeled first and emotion prediction is then realized with Support Vector Regression (SVR). Kosti et al. [21] proposed analyzing the emotions of people in an image by considering context in emotion prediction: two convolutional neural networks are trained and their features are finally fused. Peng et al. [17] provided the EmotionROI dataset, which annotates the regions that induce emotion in each image, and used a fully convolutional network with Euclidean loss (FCNEL) to predict the emotion stimuli map of an image. These methods based on high-level semantics all try to learn features from the emotion-related factors in the image to improve classification performance, so the selection of the emotion-related factors becomes the key.
With the success of deep learning in large-scale object recognition, many weakly supervised convolutional neural network methods use Multiple Instance Learning (MIL) algorithms [22] to achieve object detection. MIL defines an image as a set of regions and assumes that an image labeled positive contains at least one object instance of a given category, while an image labeled negative contains no object of that category [23]. Emotion, however, is more subjective, and the assumption that an instance (object) belongs to only one category is suboptimal for emotion detection. Zhou et al. [24] proposed Class Activation Maps (CAMs), which modify the network into a fully convolutional structure and use a global average pooling layer after the topmost convolutional layer to aggregate the activation maps of a specific class. Selvaraju et al. [25] proposed gradient-weighted class activation mapping (Grad-CAM), which computes gradients by back-propagation and combines them with the feature maps to obtain the activation map of a specific class; it can be applied to any layer but is usually computed at the last convolutional layer. Durand et al. [26] proposed WILDCAT, which learns multiple class-related modalities (such as the head or the legs of a dog) and explicitly designs local features related to different class-specific patterns into the model; the model can perform image classification as well as weakly supervised object localization and segmentation, and considers objective object information. Zhu et al. [27] proposed soft proposal networks for weakly supervised object localization. All of these methods are aimed at general classification tasks: they detect regions related to specific objects, tend to mark the foreground object regions of an image, and are in essence a recognition problem (recognizing the cat or the dog in the image).
When a person views an image, emotion is aroused, and different regions contribute differently to inducing that emotion. Automatically locating the emotion-arousing regions of an image is more challenging than object localization, because the emotional semantics of an image are related not only to the salient object (foreground) regions but also to the overall semantic information the image conveys. Yang et al. [28] proposed the WSCNet architecture, which trains two branches to perform emotion detection and classification and uses the detection result of the first branch during classification. Fan et al. [29] used eye-tracking data to locate the regions of human attention in an image and designed a convolutional neural network to predict emotional saliency, in which a sub-network learns the semantic and spatial information of the image scene.
[1] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe: ACM Press, 2012: 1097-1105.
[2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[C]//Proceedings of the 3rd International Conference on Learning Representations. San Diego: ACM Press, 2015.
[3] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE Press, 2016: 770-778.
[4] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE Press, 2014: 580-587.
[5] Girshick R. Fast R-CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Washington: IEEE Press, 2015: 1440-1448.
[6] Dai J, Li Y, He K, Sun J. R-FCN: Object detection via region-based fully convolutional networks[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: IEEE Press, 2016: 379-387.
[7] Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[C]//Proceedings of the 3rd International Conference on Learning Representations. San Diego, 2015: 357-361.
[8] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE Press, 2015: 3431-3440.
[9] Dai J, He K, Li Y, et al. Instance-sensitive fully convolutional networks[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam: Springer, Cham, 2016: 534-549.
[10] Peng K C, Chen T, Sadovnik A, et al. A mixed bag of emotions: model, predict, and transfer emotion distributions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE Press, 2015: 860-868.
[11] You Q, Luo J, Jin H, et al. Building a large scale dataset for image emotion recognition: the fine print and the benchmark[C]//Proceedings of the 30th Conference on Artificial Intelligence. Phoenix: ACM Press, 2016: 308-314.
[12] You Q, Luo J, Jin H, Yang J. Robust image sentiment analysis using progressively trained and domain transferred deep networks[C]//Proceedings of the 29th Conference on Artificial Intelligence. Austin: ACM Press, 2015: 381-388.
[13] Campos V, Jou B, Giró-i-Nieto X. From pixels to sentiment: fine-tuning CNNs for visual sentiment prediction[J]. Image and Vision Computing, 2017(65): 15-22.
[14] Yanulevskaya V, van Gemert J C, Roth K, et al. Emotional valence categorization using holistic image features[C]//Proceedings of the 2008 IEEE International Conference on Image Processing. San Diego: IEEE Press, 2008: 101-104.
[15] Zhao S, Gao Y, Jiang X, et al. Exploring principles-of-art features for image emotion recognition[C]//Proceedings of the 2014 ACM International Conference on Multimedia. Orlando: ACM Press, 2014: 47-56.
[16] Machajdik J, Hanbury A. Affective image classification using features inspired by psychology and art theory[C]//Proceedings of the 2010 ACM International Conference on Multimedia. Firenze: ACM Press, 2010: 83-92.
[17] Peng K C, Sadovnik A, Gallagher A, et al. Where do emotions come from? Predicting the emotion stimuli map[C]//Proceedings of the 2016 IEEE International Conference on Image Processing. Phoenix: IEEE Press, 2016: 614-618.
[18] Borth D, Ji R, Chen T, et al. Large-scale visual sentiment ontology and detectors using adjective noun pairs[C]//Proceedings of the 2013 ACM Multimedia Conference. Barcelona: ACM Press, 2013: 223-232.
[19] Chen T, Borth D, Darrell T, et al. DeepSentiBank: visual sentiment concept classification with deep convolutional neural networks[J]. Computer Science, 2014.
[20] Ali A R, Shahid U, Ali M, et al. High-level concepts for affective understanding of images[C]//Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision. Santa Rosa: IEEE Press, 2017: 678-687.
[21] Kosti R, Alvarez J M, Recasens A, et al. Emotion recognition in context[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE Press, 2017: 1960-1968.
[22] Bilen H, Vedaldi A. Weakly supervised deep detection networks[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE Press, 2016: 2846-2854.
[23] Cinbis R G, Verbeek J, Schmid C. Weakly supervised object localization with multi-fold multiple instance learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(1): 189-203.
[24] Zhou B, Khosla A, Lapedriza A, et al. Learning deep features for discriminative localization[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE Press, 2016: 2921-2929.
[25] Selvaraju R R, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice: IEEE Press, 2017: 618-626.
[26] Durand T, Mordan T, Thome N, et al. WILDCAT: weakly supervised learning of deep ConvNets for image classification, pointwise localization and segmentation[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE Press, 2017: 5957-5966.
[27] Zhu Y, Zhou Y, Ye Q, et al. Soft proposal networks for weakly supervised object localization[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice: IEEE Press, 2017: 1859-1868.
[28] Yang J F, She D Y, Lai Y K, et al. Weakly supervised coupled networks for visual sentiment analysis[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE Press, 2018: 7584-7592.
[29] Fan S J, Shen Z Q, Jiang M, et al. Emotional attention: a study of image sentiment and visual attention[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE Press, 2018: 7521-7531.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a weakly supervised image emotion classification and localization method, based on a cross-space pooling strategy, that achieves high image emotion classification accuracy.
A weak supervision image emotion classification and positioning method based on a cross space pooling strategy comprises the following steps:
Step 1: based on the fully convolutional network ResNet-101, deleting its pooling layer and fully connected layer, performing a 1 × 1 convolution on the feature maps generated by conv5 of ResNet-101, and generating a specific number (k) of feature maps for each category;
Step 2: extracting the global information of each feature map using global average pooling;
Step 3: finding the feature map with the maximum response through a max pooling operation, and finally generating a feature vector whose dimensionality equals the number of categories, the value of whose c-th element is denoted S_c (a code sketch illustrating Steps 1 to 3 is given after this step list):

S_c = max_{1 ≤ j ≤ k} G_ave(F'_{c,j}),  c = 1, …, C

where F'_{c,j} denotes the feature map of the j-th channel of class c in F', F' being the features after the 1 × 1 convolution; k denotes the number of feature channels generated for each category; c denotes the c-th emotion, the total number of emotion categories being C; and G_ave denotes global average pooling;
Step 4: initializing the convolutional neural network with the weights of a model pre-trained for visual recognition on ImageNet, and setting the learning rates of the fully convolutional layers and of the cross-space pooling strategy to 0.0001 and 0.001, respectively; training the whole model iteratively for 30 epochs, reducing the learning rate by a factor of 10 every 10 epochs, with weight decay set to 0.005 and momentum set to 0.9; using random horizontal flipping and cropping to augment the data during training and reduce overfitting, the model input image size finally being 448 × 448;
Step 5: in the forward pass of each batch, computing the cross-entropy loss

L = -(1/N) Σ_{i=1}^{N} log p_{y_i},

where

p_l = exp(S_l) / Σ_{c=1}^{C} exp(S_c),

N is the batch size, i.e., the number of samples in one training iteration; y_i denotes the ground-truth emotion label of the i-th training sample; and S_l is the value of the l-th element of the feature vector defined in Step 3, representing the network's score for the l-th category;
Step 6: updating the weight parameters using stochastic gradient descent in the backward pass, according to the computed loss value;
Step 7: repeating Steps 5 to 6 until one training epoch is completed, and evaluating the model on the test data set;
Step 8: repeating Step 7 until the model reaches its optimum or the total number of training epochs is completed;
Step 9: generating the image emotion activation map by combining the class-aware response maps of all emotion classes, weighted by their class scores:

M = Σ_{c=1}^{C} S_c · A_c,

where A_c denotes the response map of the c-th emotion class obtained from its feature maps F'_c.
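As an illustration of Steps 1 to 3 above, the following is a minimal sketch of the cross-space pooling head written in PyTorch. The framework, the helper names, and the values C = 8 and k = 4 are illustrative assumptions for this sketch only; the patent does not name a deep learning library or fix these values here.

import torch
import torch.nn as nn
import torchvision.models as models

class CrossSpacePooling(nn.Module):
    """Cross-space pooling head: 1x1 convolution, global average pooling, max pooling."""
    def __init__(self, in_channels: int, num_classes: int, k: int):
        super().__init__()
        self.num_classes = num_classes
        self.k = k
        # Step 1: the 1x1 convolution integrates information across channels and
        # produces k feature maps for each of the C emotion classes (F').
        self.conv1x1 = nn.Conv2d(in_channels, num_classes * k, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: conv5 feature maps of ResNet-101, shape (N, 2048, H, W)
        f_prime = self.conv1x1(feats)                       # (N, C*k, H, W)
        # Step 2: global average pooling extracts the global response of each map,
        # i.e. G_ave(F'_{c,j}).
        pooled = f_prime.mean(dim=(2, 3))                   # (N, C*k)
        pooled = pooled.view(-1, self.num_classes, self.k)  # (N, C, k)
        # Step 3: max pooling over the k channels of each class keeps the strongest
        # response, yielding the score vector S with one S_c per emotion class.
        scores, _ = pooled.max(dim=2)                       # (N, C)
        return scores

# ResNet-101 backbone with its global pooling and fully connected layers removed.
backbone = nn.Sequential(*list(models.resnet101(pretrained=True).children())[:-2])
head = CrossSpacePooling(in_channels=2048, num_classes=8, k=4)  # C = 8, k = 4 are illustrative

x = torch.randn(2, 3, 448, 448)   # input size given in Step 4
S = head(backbone(x))             # class scores S_c, shape (2, 8)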
Advantageous effects:
In the weakly supervised image emotion classification and localization method based on the cross-space pooling strategy, the 1 × 1 convolution kernel, the global average pooling operation and the max pooling operation enable the convolutional neural network to learn discriminative information for each emotion class, which improves emotion classification performance and greatly increases classification accuracy.
Under a simple convolutional neural network architecture and using only image-level annotation, the proposed cross-space pooling strategy lets the convolutional neural network learn more discriminative information, improving the performance of image emotion classification, understanding emotion from the semantics of the image, better localizing the emotion-related regions, and marking the influence and contribution of each pixel in the image to the induced emotion.
Drawings
FIG. 1 is a model for generating an emotional activation map;
FIG. 2 is a comparison graph of emotional region localization performance;
FIG. 3 is a comparison of the emotion-region localization results of several object localization methods and of the method of the present invention;
FIG. 4 is a comparison graph of the positioning performance of the WSCNET method and the method of the present invention in the emotional region;
FIG. 5 is a comparison of the generated emotion activation map and the emotion feature map of the predicted emotion classification.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described clearly and completely below, and it is obvious that the described embodiments are some, not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method for classifying image emotion and localizing emotion regions, so as to address the low accuracy of existing image emotion classification and the fact that existing research does not cover emotion-region localization.
The technical scheme adopted by the invention comprises the following two parts:
First, the cross-space pooling strategy: based on the fully convolutional network ResNet-101, the last two layers of ResNet-101 (the global pooling layer and the fully connected layer) are deleted and replaced by the cross-space pooling proposed in this application. A 1 × 1 convolution kernel first integrates information across channels, reduces the number of feature channels, and generates a specific number of feature maps for each class; global average pooling then extracts the global information of each feature map; a max pooling operation then finds the feature map with the maximum response; and finally a feature vector whose dimensionality equals the number of classes is generated, the value of whose c-th element is denoted S_c.
Second, the emotion activation map: a class-aware response map is first generated for each emotion category, so that with C emotion categories there are C response maps; these maps are then combined, weighted by their corresponding class scores S_c, to obtain comprehensive localization information, rather than deriving the localization of the emotion region from the maximum-response feature map of a single class only.
The method realizes image emotion classification and emotion-region localization in a unified framework and generates an emotion activation map that represents the regions related to the induced emotion. Requiring only image-level labels, it yields fine-grained, pixel-level annotation of the image, expressing the contribution of each pixel to the image emotion classification. The invention further illustrates the relationship between the image emotion activation map and the predicted emotion category: the emotion feature maps that are closer to the finally generated emotion activation map contribute more to the classification and thus play the leading role in emotion classification.
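The weighted aggregation described in the second part can be sketched as follows, under the same PyTorch assumptions as the sketch after the step list above; taking the class-aware response map A_c as the mean of the k channels of class c is an illustrative choice, not a detail fixed by this description.

import torch
import torch.nn.functional as F

def emotion_activation_map(f_prime: torch.Tensor, scores: torch.Tensor,
                           num_classes: int, k: int, out_size=(448, 448)) -> torch.Tensor:
    # f_prime: features after the 1x1 convolution, shape (N, C*k, H, W)
    # scores:  class score vector S, shape (N, C)
    n, _, h, w = f_prime.shape
    # Class-aware response map A_c, here the mean of the k channels of class c.
    a = f_prime.view(n, num_classes, k, h, w).mean(dim=2)                 # (N, C, H, W)
    # Weighted combination over all emotion classes: M = sum_c S_c * A_c.
    m = (scores.view(n, num_classes, 1, 1) * a).sum(dim=1, keepdim=True)  # (N, 1, H, W)
    # Upsample to the input resolution so each pixel receives a contribution value.
    return F.interpolate(m, size=out_size, mode='bilinear', align_corners=False)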
The weakly supervised image emotion classification and localization method based on the cross-space pooling strategy comprises the following steps:
Step 1: based on ResNet-101, the last two layers of the network are deleted and replaced by the cross-space pooling strategy proposed in this application. That is, the feature maps generated by conv5 of ResNet-101 are first convolved with a 1 × 1 convolution kernel, which integrates information across channels, reduces the number of feature channels, and generates a specific number (k) of feature maps for each class; the features after the 1 × 1 convolution are denoted F'. Global average pooling then extracts the global information of each feature map, a max pooling operation then finds the feature map with the maximum response, and finally a feature vector whose dimensionality equals the number of categories is generated, the value of whose c-th element is denoted S_c:

S_c = max_{1 ≤ j ≤ k} G_ave(F'_{c,j}),  c = 1, …, C

where F'_{c,j} denotes the feature map of the j-th channel of class c in F', k denotes the number of feature channels generated for each category, c denotes the c-th emotion, the total number of emotion categories being C, and G_ave denotes global average pooling.
Step 2: the depth model proposed in Step 1 is initialized. The convolutional neural network is initialized with the weights of a model pre-trained for visual recognition on ImageNet, and the learning rates of the fully convolutional layers and of the cross-space pooling strategy are set to 0.0001 and 0.001, respectively. The whole model is trained iteratively for 30 epochs, with the learning rate reduced by a factor of 10 every 10 epochs, weight decay set to 0.005 and momentum set to 0.9; random horizontal flipping and cropping are used to augment the data during training and reduce overfitting, and the model input image size is finally 448 × 448.
Step 3: in the forward pass of each batch, the cross-entropy loss is computed as

L = -(1/N) Σ_{i=1}^{N} log p_{y_i},

where

p_l = exp(S_l) / Σ_{c=1}^{C} exp(S_c),

N is the batch size, i.e., the number of samples in one training iteration; y_i denotes the ground-truth emotion label of the i-th training sample; and S_l is the value of the l-th element of the feature vector defined in Step 1, representing the network's score for the l-th category.
Step 4: the weight parameters are updated in the backward pass using stochastic gradient descent (SGD), according to the computed loss value.
Step 5: Steps 3 to 4 are repeated until one training epoch is finished, and the model is evaluated on the test data set.
Step 6: Step 5 is repeated until the model reaches its optimum or the total number of training epochs is completed (a training-loop sketch covering Steps 2 to 6 is given after this step list).
Step 7: the image emotion activation map is generated by combining the class-aware response maps of all emotion classes, weighted by their class scores:

M = Σ_{c=1}^{C} S_c · A_c,

where A_c denotes the response map of the c-th emotion class obtained from its feature maps F'_c.
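A minimal training-loop sketch covering Steps 2 to 6 follows, reusing the backbone and head objects from the earlier sketch. PyTorch is again an assumed framework, and train_loader stands for an assumed data pipeline that yields 448 × 448 crops with random horizontal flips; the test-time evaluation is only indicated by a comment.

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # softmax followed by cross-entropy over the scores S
optimizer = torch.optim.SGD(
    [{"params": backbone.parameters(), "lr": 1e-4},   # fully convolutional layers: 0.0001
     {"params": head.parameters(), "lr": 1e-3}],      # cross-space pooling head: 0.001
    momentum=0.9, weight_decay=0.005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):                        # 30 training epochs
    for images, labels in train_loader:        # one batch of N samples (assumed loader)
        scores = head(backbone(images))        # forward pass: S, shape (N, C)
        loss = criterion(scores, labels)       # cross-entropy loss of the batch
        optimizer.zero_grad()
        loss.backward()                        # backward pass
        optimizer.step()                       # SGD weight update
    scheduler.step()                           # divide the learning rate by 10 every 10 epochs
    # evaluate on the test set here and keep the best model (Steps 5 and 6)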
Experimental example 1:
The cross-space pooling strategy provided by the invention enables the convolutional neural network to learn more discriminative information for each emotion class and improves emotion classification performance. As shown in Table 1, compared with other methods, the classification accuracy of the proposed method is greatly improved.
TABLE 1 Classification accuracy (%) comparison
Experimental example 2:
In the cross-space pooling strategy of the invention, compared with ordinary average pooling, the global average pooling operation enlarges the receptive field of the convolution kernel, captures the global semantic information of the image better, and is more robust to spatial transformations. An emotion vector for emotion classification is then generated by the max pooling operation, so the relationship between each element of the vector and the feature maps of the convolutional layer, that is, between a category and its feature maps, is more direct, as in the correspondence shown in FIG. 1. The proposed cross-space pooling strategy can replace the pooling layer and the fully connected layer of the original network architecture, avoiding the loss of the spatial information of objects in the image caused by the original fully connected layer of ResNet-101. In the feature maps extracted by the CNN, each feature map represents part of the features of the whole, and the cross-space pooling strategy makes better use of the object information and semantic information in different feature maps. FIG. 2 compares emotion-region localization performance in terms of Mean Absolute Error (MAE), precision, recall and F1: a smaller MAE is better, while larger precision, recall and F1 are better. The values in FIG. 2 show that, on these evaluation metrics, the method of the invention achieves the best localization performance among the weakly supervised learning methods.
FIG. 3 compares the results of several object localization methods applied to emotion-region localization with the result of the method of the present invention, with the values of the evaluation metrics marked on the heat maps. It shows that the method of the invention locates more emotion-related regions and obtains the highest recall, meaning that more of the ground-truth emotion regions are located. FIG. 4 compares the emotion-region localization of the WSCNet method with that of the present invention: WSCNet obtains precision = 0.94 but recall = 0.15, whereas the present method obtains precision = 0.82 and recall = 0.85. Since precision and recall are usually in tension, F1 = 2 · precision · recall / (precision + recall) is used as a comprehensive evaluation index: the F1 of the present method is 0.83, while that of the WSCNet method is 0.26. The markedly higher F1 indicates that more of the regions in the ground-truth annotation are located as emotion regions, i.e., the method localizes the emotion-inducing regions better.
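A quick arithmetic check of the F1 values quoted above, computed from the stated precision and recall:

F1(WSCNet) = 2 × 0.94 × 0.15 / (0.94 + 0.15) = 0.282 / 1.09 ≈ 0.26
F1(present method) = 2 × 0.82 × 0.85 / (0.82 + 0.85) = 1.394 / 1.67 ≈ 0.83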
FIG. 5 compares the emotion feature maps used for emotion classification with the generated emotion activation map; 5c marks the predicted emotion class and its probability, and 5d marks the precision (p) and recall (r) of the image. The highlighted areas in the emotion activation map are the regions that contribute most to the emotion classification, and these regions directly influence the classification result.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (1)

1. A weak supervision image emotion classification and positioning method based on a cross space pooling strategy is characterized by comprising the following steps:
Step 1: based on the fully convolutional network ResNet-101, deleting its pooling layer and fully connected layer, performing a 1 × 1 convolution on the feature maps generated by conv5 of ResNet-101, and generating a specific number (k) of feature maps for each category;
Step 2: extracting the global information of each feature map using global average pooling;
Step 3: finding the feature map with the maximum response through a max pooling operation, and finally generating a feature vector whose dimensionality equals the number of categories, the value of whose c-th element is denoted S_c:

S_c = max_{1 ≤ j ≤ k} G_ave(F'_{c,j}),  c = 1, …, C

where F'_{c,j} denotes the feature map of the j-th channel of class c in F', F' being the features after the 1 × 1 convolution; k denotes the number of feature channels generated for each category; c denotes the c-th emotion, the total number of emotion categories being C; and G_ave denotes global average pooling;
Step 4: initializing the convolutional neural network weight parameters with the weights of a model pre-trained for visual recognition on ImageNet, and setting the learning rates of the fully convolutional layers and of the cross-space pooling strategy to 0.0001 and 0.001, respectively; training the whole model iteratively for 30 epochs, reducing the learning rate by a factor of 10 every 10 epochs, with weight decay set to 0.005 and momentum set to 0.9; using random horizontal flipping and cropping to augment the data during training and reduce overfitting, the model input image size finally being 448 × 448;
Step 5: in the forward pass of each batch, computing the cross-entropy loss

L = -(1/N) Σ_{i=1}^{N} log p_{y_i},

where

p_l = exp(S_l) / Σ_{c=1}^{C} exp(S_c),

N is the batch size, i.e., the number of samples in one training iteration; y_i denotes the ground-truth emotion label of the i-th training sample; and S_l is the value of the l-th element of the feature vector defined in Step 3, representing the network's score for the l-th category;
Step 6: updating the weight parameters using stochastic gradient descent in the backward pass, according to the computed loss value;
Step 7: repeating Steps 5 to 6 until one training epoch is completed, and evaluating the model on the test data set;
Step 8: repeating Step 7 until the model reaches its optimum or the total number of training epochs is completed;
Step 9: generating the image emotion activation map by combining the class-aware response maps of all emotion classes, weighted by their class scores:

M = Σ_{c=1}^{C} S_c · A_c,

where A_c denotes the response map of the c-th emotion class obtained from its feature maps F'_c.
CN201911259699.2A 2019-12-10 2019-12-10 Weak supervision image emotion classification and positioning method based on cross space pooling strategy Pending CN111026898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911259699.2A CN111026898A (en) 2019-12-10 2019-12-10 Weak supervision image emotion classification and positioning method based on cross space pooling strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911259699.2A CN111026898A (en) 2019-12-10 2019-12-10 Weak supervision image emotion classification and positioning method based on cross space pooling strategy

Publications (1)

Publication Number Publication Date
CN111026898A true CN111026898A (en) 2020-04-17

Family

ID=70205332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911259699.2A Pending CN111026898A (en) 2019-12-10 2019-12-10 Weak supervision image emotion classification and positioning method based on cross space pooling strategy

Country Status (1)

Country Link
CN (1) CN111026898A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797936A (en) * 2020-07-13 2020-10-20 长沙理工大学 Image emotion classification method and device based on significance detection and multi-level feature fusion
CN112329680A (en) * 2020-11-13 2021-02-05 重庆邮电大学 Semi-supervised remote sensing image target detection and segmentation method based on class activation graph
CN113191381A (en) * 2020-12-04 2021-07-30 云南大学 Image zero-order classification model based on cross knowledge and classification method thereof
CN113408511A (en) * 2021-08-23 2021-09-17 南开大学 Method, system, equipment and storage medium for determining gazing target

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399406A (en) * 2018-01-15 2018-08-14 中山大学 The method and system of Weakly supervised conspicuousness object detection based on deep learning
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108960140A (en) * 2018-07-04 2018-12-07 国家新闻出版广电总局广播科学研究院 The pedestrian's recognition methods again extracted and merged based on multi-region feature
CN109165692A (en) * 2018-09-06 2019-01-08 中国矿业大学 A kind of user's personality prediction meanss and method based on Weakly supervised study
CN110119688A (en) * 2019-04-18 2019-08-13 南开大学 A kind of Image emotional semantic classification method using visual attention contract network
CN110322509A (en) * 2019-06-26 2019-10-11 重庆邮电大学 Object localization method, system and computer equipment based on level Class Activation figure
CN110334584A (en) * 2019-05-20 2019-10-15 广东工业大学 A kind of gesture identification method based on the full convolutional network in region

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399406A (en) * 2018-01-15 2018-08-14 中山大学 The method and system of Weakly supervised conspicuousness object detection based on deep learning
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108960140A (en) * 2018-07-04 2018-12-07 国家新闻出版广电总局广播科学研究院 The pedestrian's recognition methods again extracted and merged based on multi-region feature
CN109165692A (en) * 2018-09-06 2019-01-08 中国矿业大学 A kind of user's personality prediction meanss and method based on Weakly supervised study
CN110119688A (en) * 2019-04-18 2019-08-13 南开大学 A kind of Image emotional semantic classification method using visual attention contract network
CN110334584A (en) * 2019-05-20 2019-10-15 广东工业大学 A kind of gesture identification method based on the full convolutional network in region
CN110322509A (en) * 2019-06-26 2019-10-11 重庆邮电大学 Object localization method, system and computer equipment based on level Class Activation figure

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Zhang Jinglian et al.: "Research on Malicious Code Classification Based on Feature Fusion", Computer Engineering (《计算机工程》) *
Yang Ke et al.: "Research on a Machine-Learning-Based Credit Risk Assessment Model for Investors in Distributed Photovoltaic Power Stations", Credit Reference (《征信》) *
Wang Zhongke et al.: "Research and Implementation of a PE File Feature Extraction Method", Proceedings of the 10th Academic Annual Conference of the China Institute of Communications (Youth Working Committee session) (《第十届中国通信学会学术年会论文集》) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797936A (en) * 2020-07-13 2020-10-20 长沙理工大学 Image emotion classification method and device based on significance detection and multi-level feature fusion
CN111797936B (en) * 2020-07-13 2023-08-08 长沙理工大学 Image emotion classification method and device based on saliency detection and multi-level feature fusion
CN112329680A (en) * 2020-11-13 2021-02-05 重庆邮电大学 Semi-supervised remote sensing image target detection and segmentation method based on class activation graph
CN112329680B (en) * 2020-11-13 2022-05-03 重庆邮电大学 Semi-supervised remote sensing image target detection and segmentation method based on class activation graph
CN113191381A (en) * 2020-12-04 2021-07-30 云南大学 Image zero-order classification model based on cross knowledge and classification method thereof
CN113408511A (en) * 2021-08-23 2021-09-17 南开大学 Method, system, equipment and storage medium for determining gazing target

Similar Documents

Publication Publication Date Title
US11195051B2 (en) Method for person re-identification based on deep model with multi-loss fusion training strategy
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
CN109344736B (en) Static image crowd counting method based on joint learning
CN111026898A (en) Weak supervision image emotion classification and positioning method based on cross space pooling strategy
CN109410168B (en) Modeling method of convolutional neural network for determining sub-tile classes in an image
Pan et al. Image aesthetic assessment assisted by attributes through adversarial learning
CN109614921B (en) Cell segmentation method based on semi-supervised learning of confrontation generation network
CN108399406A (en) The method and system of Weakly supervised conspicuousness object detection based on deep learning
CN103984959A (en) Data-driven and task-driven image classification method
CN109783666A (en) A kind of image scene map generation method based on iteration fining
CN106682696A (en) Multi-example detection network based on refining of online example classifier and training method thereof
CN110827304B (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolution network and level set method
Gao et al. An end-to-end broad learning system for event-based object classification
CN104966052A (en) Attributive characteristic representation-based group behavior identification method
Fan Research and realization of video target detection system based on deep learning
WO2020119624A1 (en) Class-sensitive edge detection method based on deep learning
CN110751005A (en) Pedestrian detection method integrating depth perception features and kernel extreme learning machine
Wang et al. Single shot multibox detector with deconvolutional region magnification procedure
Zhu et al. NAGNet: A novel framework for real‐time students' sentiment analysis in the wisdom classroom
Siam et al. Automated student review system with computer vision and convolutional neural network
Li et al. Image aesthetic quality evaluation using convolution neural network embedded learning
Aghera et al. MnasNet based lightweight CNN for facial expression recognition
Lai et al. Robust text line detection in equipment nameplate images
Karim et al. Bangla Sign Language Recognition using YOLOv5
Chen et al. Saliency detection via topological feature modulated deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200417

RJ01 Rejection of invention patent application after publication