CN111026898A - Weak supervision image emotion classification and positioning method based on cross space pooling strategy - Google Patents
Weak supervision image emotion classification and positioning method based on cross space pooling strategy
- Publication number: CN111026898A
- Application number: CN201911259699.2A
- Authority: CN (China)
- Prior art keywords: emotion, image, classification, pooling
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/55 — Information retrieval; database structures therefor; file system structures therefor of still image data; clustering; classification
- G06F16/5866 — Retrieval of still image data characterised by using metadata generated manually, e.g. tags, keywords, comments, manually generated location and time information
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture; combinations of networks
Abstract
The invention provides a weakly supervised image emotion classification and localization method based on a cross-spatial pooling strategy. Building on ResNet-101, the cross-spatial pooling strategy produces a feature vector whose dimensionality equals the number of emotion classes; the regions of the image that induce emotion are then captured by aggregating the feature maps of all the emotion classes. The cross-spatial pooling strategy enables the convolutional neural network to learn discriminative information for each emotion class, which improves emotion classification performance and greatly increases classification accuracy.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a weakly supervised image emotion classification and localization method based on a cross-spatial pooling strategy.
Background
People increasingly express emotions by uploading pictures to social media such as Twitter and Weibo, and in fields such as security, surveillance and education there is a growing need to understand people's emotions. Visual emotion analysis of images has therefore attracted more and more attention, and as artificial intelligence penetrates various fields, expectations for computers to understand the emotions conveyed by images keep rising [1-3]. Since deep learning performs well on visual recognition tasks such as object recognition [4-6] and semantic segmentation [7-9], deep methods have also been applied to image emotion analysis [10-13]. In image emotion analysis, learning based on deep features outperforms the traditional approach of hand-designed features [14-17], which mainly considers color, texture, principal components and the like.
Borth et al. [18-19] defined a visual sentiment ontology that describes images with adjective-noun pairs (ANPs) as elements, and proposed SentiBank to detect descriptions of emotion in images from low-level visual features, constructing visual sentiment concepts for classification. Ali et al. [20] proposed considering two high-level concepts, objects and scenes, in emotion analysis: the emotion of an image is related to both high-level semantics and low-level features, and different emotion categories relate to different high-level semantic concepts; the relationship between high-level semantics and emotion is constructed first, and emotion prediction is then realized through Support Vector Regression (SVR). Kosti et al. [21] proposed analyzing the emotion of people in an image by taking context into account in emotion prediction, training two convolutional neural networks and finally fusing the features of the two networks. Peng et al. [17] provided the EmotionROI dataset, which labels the regions that induce emotion in an image, and used a fully convolutional network with Euclidean loss (FCNEL) to predict the emotion stimuli map. These methods based on high-level semantics all attempt to learn features from emotion-related factors in the image to improve classification performance, so the selection of the emotion-related factors of the image becomes key.
With the success of deep learning in large-scale object recognition, many weakly supervised convolutional neural network methods use a multiple instance learning (MIL) algorithm [22] to achieve object detection. MIL defines an image as a set of regions and assumes that an image labeled positive contains at least one object instance of a certain category, while an image labeled negative contains no object of the category of interest [23]. Emotion, however, is more subjective, so the assumption that one instance (object) appears in only one category is suboptimal for emotion detection. Zhou et al. [24] proposed class activation maps (CAMs), which modify the network into a fully convolutional structure and use a global average pooling layer after the topmost convolutional layer to aggregate the activation maps of a particular class. Selvaraju et al. [25] proposed gradient-weighted class activation maps (Grad-CAM), which compute gradients through back-propagation and merge them with the feature maps to obtain the activation map of a particular class; this can be applied to any layer but is usually computed at the last convolutional layer. Durand et al. [26] proposed the WILDCAT method, which learns multiple category-related morphological cues (such as the head or legs of a dog) and explicitly designs local features related to different class-specific patterns in the model; the proposed model can perform image classification as well as weakly supervised object localization and segmentation, and takes objective object information into account. Zhu et al. [27] proposed soft proposal networks for weakly supervised object localization. These methods are aimed at general classification tasks: they detect regions related to specific objects in an image, tend to mark foreground object regions, and are in fact a recognition problem (recognizing cats or dogs in an image).
Human emotion is aroused when observing an image, and different regions contribute differently to inducing that emotion. Automatically locating the regions that arouse emotion in an image is more challenging than object region localization, because the emotional semantics of an image relate not only to the salient object (foreground) regions but also to the overall semantic information the image conveys. Yang et al. [28] proposed the WSCNet architecture, which completes emotion detection and classification by training two branches, with the emotion detection result of the first branch used during classification. Fan et al. [29] used eye-movement data to locate the human attention regions in images and designed a convolutional neural network for emotional saliency prediction, in which a sub-network learns the semantic and spatial information of the image scene.
[1] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe: ACM Press, 2012: 1097-1105.
[2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[C]//Proceedings of the 3rd International Conference on Learning Representations. San Diego: ACM Press, 2015.
[3] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE Press, 2016: 770-778.
[4] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE Press, 2014: 580-587.
[5] Girshick R. Fast R-CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Washington: IEEE Press, 2015: 1440-1448.
[6] Dai J, Li Y, He K, Sun J. R-FCN: Object detection via region-based fully convolutional networks[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: IEEE Press, 2016: 379-387.
[7] Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[C]//Proceedings of the 3rd International Conference on Learning Representations. San Diego, 2015: 357-361.
[8] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE Press, 2015: 3431-3440.
[9] Dai J, He K, Li Y, et al. Instance-sensitive fully convolutional networks[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam: Springer, Cham, 2016: 534-549.
[10] Peng K C, Chen T, Sadovnik A, et al. A mixed bag of emotions: model, predict, and transfer emotion distributions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE Press, 2015: 860-868.
[11] You Q, Luo J, Jin H, et al. Building a large scale dataset for image emotion recognition: the fine print and the benchmark[C]//Proceedings of the 30th Conference on Artificial Intelligence. Phoenix: ACM Press, 2016: 308-314.
[12] You Q, Luo J, Jin H, et al. Robust image sentiment analysis using progressively trained and domain transferred deep networks[C]//Proceedings of the 29th Conference on Artificial Intelligence. Austin: ACM Press, 2015: 381-388.
[13] Víctor C, Brendan J, Xavier Giró-i-Nieto. From pixels to sentiment: fine-tuning CNNs for visual sentiment prediction[J]. Image and Vision Computing, 2017(65): 15-22.
[14] Yanulevskaya V, Gemert J C V, Roth K, et al. Emotional valence categorization using holistic image features[C]//Proceedings of the 2008 IEEE International Conference on Image Processing. San Diego: IEEE Press, 2008: 101-104.
[15] Zhao S, Gao Y, Jiang X, et al. Exploring principles-of-art features for image emotion recognition[C]//Proceedings of the 2014 ACM International Conference on Multimedia. Orlando: ACM Press, 2014: 47-56.
[16] Machajdik J, Hanbury A. Affective image classification using features inspired by psychology and art theory[C]//Proceedings of the 2010 ACM International Conference on Multimedia. Firenze: ACM Press, 2010: 83-92.
[17] Peng K C, Sadovnik A, Gallagher A, et al. Where do emotions come from? Predicting the emotion stimuli map[C]//Proceedings of the 2016 IEEE International Conference on Image Processing. Phoenix: IEEE Press, 2016: 614-618.
[18] Borth D, Ji R, Chen T, et al. Large-scale visual sentiment ontology and detectors using adjective noun pairs[C]//Proceedings of the 2013 ACM Multimedia Conference. Barcelona: ACM Press, 2013: 223-232.
[19] Chen T, Borth D, Darrell T, et al. DeepSentiBank: visual sentiment concept classification with deep convolutional neural networks[J]. Computer Science, 2014.
[20] Ali A R, Shahid U, Ali M, et al. High-level concepts for affective understanding of images[C]//Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision. Santa Rosa: IEEE Press, 2017: 678-687.
[21] Kosti R, Alvarez J M, Recasens A, et al. Emotion recognition in context[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE Press, 2017: 1960-1968.
[22] Bilen H, Vedaldi A. Weakly supervised deep detection networks[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE Press, 2016: 2846-2854.
[23] Cinbis R G, Verbeek J, Schmid C. Weakly supervised object localization with multi-fold multiple instance learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(1): 189-203.
[24] Zhou B, Khosla A, Lapedriza A, et al. Learning deep features for discriminative localization[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE Press, 2016: 2921-2929.
[25] Selvaraju R R, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice: IEEE Press, 2017: 618-626.
[26] Durand T, Mordan T, Thome N, et al. WILDCAT: weakly supervised learning of deep ConvNets for image classification, pointwise localization and segmentation[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE Press, 2017: 5957-5966.
[27] Zhu Y, Zhou Y, Ye Q, et al. Soft proposal networks for weakly supervised object localization[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice: IEEE Press, 2017: 1859-1868.
[28] Yang J F, She D Y, Lai Y K, et al. Weakly supervised coupled networks for visual sentiment analysis[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE Press, 2018: 7584-7592.
[29] Fan S J, Shen Z Q, Jiang M, et al. Emotional attention: a study of image sentiment and visual attention[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE Press, 2018: 7521-7531.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a weakly supervised image emotion classification and localization method based on a cross-spatial pooling strategy with high image emotion classification accuracy.
A weakly supervised image emotion classification and localization method based on a cross-spatial pooling strategy comprises the following steps:
Step 1: based on the fully convolutional network ResNet-101, delete its pooling layer and fully connected layer, then convolve the feature maps generated by conv5 of ResNet-101 with a 1 × 1 convolution kernel so that a specific number (k) of feature maps is generated for each category; the features after the 1 × 1 convolution are denoted F';
Step 2: extract the global information of each feature map using global average pooling;
Step 3: find the feature map with the maximum response through a max pooling operation, finally generating a feature vector whose dimensionality equals the number of categories; the value of the c-th element of this vector is denoted S_c and is computed as
S_c = max_{1≤j≤k} G_ave(F'_c^j)
where F'_c^j represents the feature map of the j-th channel of class c in F', k represents the number of feature channels generated per class, c denotes the c-th emotion with C emotion classes in total, and G_ave denotes global average pooling (a code sketch of this pooling head is given after step 9);
Step 4: initialize the convolutional neural network weights with a model pre-trained for visual recognition on ImageNet, and set the learning rates of the fully convolutional layers and of the cross-spatial pooling branch to 0.0001 and 0.001 respectively. The whole model is trained iteratively for 30 epochs, the learning rate is divided by 10 every 10 epochs, the weight decay is set to 0.005 and the momentum to 0.9; random horizontal flipping and cropping are used to augment the data during training to reduce overfitting, and the images fed to the model are finally of size 448 × 448;
Step 5: in the forward pass of each batch, compute the cross-entropy loss
L = -(1/N) Σ_{i=1}^{N} log( exp(S_{y_i}) / Σ_{l=1}^{C} exp(S_l) )
where N is the batch size, i.e. the number of samples trained in one pass, y_i denotes the ground-truth emotion label of the i-th training sample, and S_l is the value of the l-th element of the feature vector defined in step 3, representing the score of the l-th category in the network;
Step 6: update the weight parameters in the backward pass using stochastic gradient descent according to the computed loss value;
Step 7: repeat steps 5 to 6 until one epoch of training is completed, then test the model on the test dataset;
Step 8: repeat step 7 until the model reaches its optimum or the total number of training epochs is completed;
Step 9: generate the image emotion activation map by combining the class-aware response maps weighted by the class scores S_c.
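For illustration only, the following is a minimal PyTorch-style sketch of the cross-spatial pooling head described in steps 1-3 and of the loss in step 5. The class and parameter names (CrossSpatialPooling, num_classes, k) and the default values are assumptions made here, not details taken from the patented implementation; the sketch assumes the 2048-channel conv5 feature maps of ResNet-101 as input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossSpatialPooling(nn.Module):
    """Sketch of the cross-spatial pooling head: a 1x1 convolution producing k maps
    per class, global average pooling of each map, then a max over the k maps of
    each class to obtain the class score S_c."""
    def __init__(self, in_channels=2048, num_classes=8, k=4):
        super().__init__()
        self.num_classes, self.k = num_classes, k
        # 1x1 convolution: cross-channel integration, C*k output feature maps (F')
        self.conv1x1 = nn.Conv2d(in_channels, num_classes * k, kernel_size=1)

    def forward(self, x):                        # x: (N, 2048, H, W) conv5 features
        f_prime = self.conv1x1(x)                # (N, C*k, H, W)
        g = F.adaptive_avg_pool2d(f_prime, 1)    # global average pooling, (N, C*k, 1, 1)
        g = g.view(x.size(0), self.num_classes, self.k)
        scores = g.max(dim=2).values             # max over the k maps of each class -> S_c
        return scores, f_prime                   # class scores and response maps F'

# Step 5, softmax cross-entropy over the class scores (assuming integer labels y):
#   scores, _ = CrossSpatialPooling()(conv5_features)
#   loss = F.cross_entropy(scores, y)
```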
Beneficial effects:
In the weakly supervised image emotion classification and localization method based on the cross-spatial pooling strategy, the 1 × 1 convolution kernel, the global average pooling operation and the max pooling operation enable the convolutional neural network to learn discriminative information for each emotion class, thereby improving emotion classification performance and greatly increasing classification accuracy.
Under a simple convolutional neural network architecture and using only image-level label information, the proposed cross-spatial pooling strategy enables the convolutional neural network to learn more discriminative information, improves image emotion classification performance, understands emotion from the image semantics, better localizes the emotion-related regions, and marks the influence and contribution of each pixel of the image to inducing the image emotion.
Drawings
FIG. 1 shows the model for generating the emotion activation map;
FIG. 2 compares emotion region localization performance;
FIG. 3 compares the emotion region localization results of several object localization methods with that of the method of the present invention;
FIG. 4 compares the emotion region localization performance of the WSCNet method with that of the method of the present invention;
FIG. 5 compares the generated emotion activation map with the emotion feature maps used for predicting the emotion class.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described clearly and completely below, and it is obvious that the described embodiments are some, not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method for classifying image emotion and localizing emotion regions, so as to solve the problems that existing image emotion classification has low accuracy and that existing research does not address emotion region localization.
The technical scheme adopted by the invention comprises the following two parts:
First, the cross-spatial pooling strategy: based on the fully convolutional network ResNet-101, the last two layers of ResNet-101 (the global pooling layer and the fully connected layer) are deleted and replaced with the cross-spatial pooling proposed in this application. A 1 × 1 convolution kernel first performs cross-channel information integration and reduces the number of feature channels, generating a specific number of feature maps for each class; global average pooling then extracts the global information of each feature map; a max pooling operation then finds the feature map with the maximum response; finally a feature vector whose dimensionality equals the number of classes is generated, the value of each element being denoted S_c;
Second, the emotion activation map: a class-aware response feature map is first generated for each emotion category, so if the number of emotion categories is C there are C response feature maps; these maps are then combined, weighted by the corresponding scores S_c, to obtain comprehensive localization information, rather than taking the localization information of the emotion region only from the maximum-response feature map of a specific class.
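The following is a minimal sketch of how such class-aware response maps could be combined into an emotion activation map. It assumes that each class map is obtained by averaging its k channels and that the combination weights the maps by the softmax-normalized class scores S_c; the invention describes the aggregation only at the level of the paragraph above, so this is an illustration under those assumptions rather than the definitive formula.

```python
import torch

def emotion_activation_map(f_prime, scores, num_classes, k):
    """f_prime: (N, C*k, H, W) response maps F'; scores: (N, C) class scores S_c.
    Returns a per-pixel emotion activation map of shape (N, H, W)."""
    n, _, h, w = f_prime.shape
    # one class-aware response map per emotion class (mean over its k channels)
    class_maps = f_prime.view(n, num_classes, k, h, w).mean(dim=2)     # (N, C, H, W)
    # combine all class maps weighted by the class scores, instead of keeping
    # only the maximum-response map of a single class
    weights = torch.softmax(scores, dim=1).view(n, num_classes, 1, 1)
    return (weights * class_maps).sum(dim=1)                           # (N, H, W)
```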
The method realizes image emotion classification and emotion region localization in a unified framework and generates an emotion activation map that represents the regions related to the induced emotion; requiring only image-level labels, it obtains fine-grained, pixel-level annotations that represent the contribution of each pixel to image emotion classification. The invention further illustrates the relationship between the image emotion activation map and the predicted image emotion category: the emotion feature maps closer to the finally generated emotion activation map contribute more to classification and thus play a leading role in emotion classification.
The weakly supervised image emotion classification and localization method based on the cross-spatial pooling strategy comprises the following steps:
Step 1: based on ResNet-101, the last two layers of the network are deleted and replaced with the cross-spatial pooling strategy proposed in this application. The feature maps generated by conv5 of ResNet-101 are first convolved with a 1 × 1 kernel, which realizes cross-channel information integration, reduces the number of feature channels and generates a specific number (k) of feature maps for each class. The features after the 1 × 1 convolution are denoted F'. Global average pooling then extracts the global information of each feature map, a max pooling operation finds the feature map with the maximum response, and finally a feature vector whose dimensionality equals the number of categories is generated, the value of its c-th element being denoted S_c:
S_c = max_{1≤j≤k} G_ave(F'_c^j)
where F'_c^j represents the feature map of the j-th channel of class c in F', k represents the number of feature channels generated per class, c denotes the c-th emotion with C emotion classes in total, and G_ave denotes global average pooling;
Step 2: the deep model proposed in step 1 is initialized. The convolutional neural network weights are initialized from a model pre-trained for visual recognition on ImageNet, and the learning rates of the fully convolutional layers and of the cross-spatial pooling branch are set to 0.0001 and 0.001 respectively. The whole model is trained iteratively for 30 epochs, the learning rate is divided by 10 every 10 epochs, the weight decay is set to 0.005 and the momentum to 0.9; random horizontal flipping and cropping are used to augment the data during training to reduce overfitting, and the input images are finally of size 448 × 448 (a training-loop sketch under these settings is given after step 7 below).
Step 3: in the forward pass of each batch, compute the cross-entropy loss
L = -(1/N) Σ_{i=1}^{N} log( exp(S_{y_i}) / Σ_{l=1}^{C} exp(S_l) )
where N is the batch size, i.e. the number of samples trained in one pass, y_i denotes the ground-truth emotion label of the i-th training sample, and S_l is the value of the l-th element of the feature vector defined in step 1, representing the score of the l-th category in the network;
Step 4: the weight parameters are updated in the backward pass using stochastic gradient descent (SGD) according to the computed loss value.
Step 5: repeat steps 3 to 4 until one epoch of training is finished, then test the model on the test dataset.
Step 6: repeat step 5 until the model reaches its optimum or the total number of training epochs is completed.
Step 7: generate the image emotion activation map by combining the class-aware response maps weighted by the class scores S_c.
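As a reference for steps 2-6, the following is a training-loop sketch under the stated settings (SGD with momentum 0.9 and weight decay 0.005, backbone learning rate 0.0001 and pooling-head learning rate 0.001, 30 epochs with the learning rate divided by 10 every 10 epochs, random horizontal flips and 448 × 448 crops). The model attributes (backbone, pooling_head), the resize size before cropping and the evaluate() routine are placeholders assumed for illustration; the parameter grouping is one plausible way to realize the two learning rates described above.

```python
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from torchvision import transforms

# data augmentation used to reduce overfitting (step 2); the Resize(480) value is an assumption
train_tf = transforms.Compose([
    transforms.Resize(480),
    transforms.RandomCrop(448),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def train(model, train_loader, test_loader, evaluate, epochs=30):
    # two parameter groups: pre-trained backbone vs. cross-spatial pooling head
    optimizer = SGD([
        {"params": model.backbone.parameters(), "lr": 1e-4},
        {"params": model.pooling_head.parameters(), "lr": 1e-3},
    ], momentum=0.9, weight_decay=0.005)
    scheduler = StepLR(optimizer, step_size=10, gamma=0.1)   # divide lr by 10 every 10 epochs

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:                  # steps 3-4: forward, loss, backward
            scores, _ = model(images)
            loss = F.cross_entropy(scores, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
        evaluate(model, test_loader)                          # step 5: test after each epoch
```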
Experimental example 1:
The cross-spatial pooling strategy proposed by the invention enables the convolutional neural network to learn more discriminative information for each emotion class and improves emotion classification performance. As shown in Table 1, compared with other methods, the method of the invention greatly improves classification accuracy.
TABLE 1 Classification accuracy (%) comparison
Experimental example 2:
In the cross-spatial pooling strategy of the invention, compared with ordinary average pooling, the global average pooling operation enlarges the receptive field of the convolution kernel, better captures the global semantic information in the image and is more robust to spatial transformations. The subsequent max pooling operation then generates an emotion vector for emotion classification, so the relationship between each element of the vector and the feature maps of the convolutional layer, i.e. between a category and its feature maps, is more direct, as in the correspondence shown in FIG. 1. The proposed cross-spatial pooling strategy can replace the pooling layer and fully connected layer of the original architecture, avoiding the way the original fully connected layer of ResNet-101 ignores the spatial information of objects in the image; in the feature maps extracted by the CNN, each map represents a partial pattern of the whole, and the object and semantic information in different feature maps can be better exploited through the cross-spatial pooling strategy. FIG. 2 compares emotion region localization performance in terms of mean absolute error (MAE), precision, recall and F1; a smaller MAE is better, while larger precision, recall and F1 are better. The numerical values for these evaluation metrics in FIG. 2 show that the method of the invention achieves the best localization performance among the weakly supervised learning methods.
FIG. 3 compares the results of several object localization methods applied to emotion region localization with the result of the method of the invention, with the values of several evaluation metrics marked on the heat maps; the method of the invention localizes more emotion-related regions, and its recall is the highest, meaning that more of the truly labeled emotion regions are localized. FIG. 4 compares the emotion region localization of the WSCNet method with that of the method of the invention: WSCNet attains precision 0.94 but recall only 0.15, whereas the present method attains precision 0.82 and recall 0.85. Since precision and recall are usually in tension, F1 = 2 × precision × recall / (precision + recall) is used as a comprehensive evaluation index; the F1 value of the present method is 0.83 while that of the WSCNet method is 0.26, clearly higher, indicating that more of the truly labeled regions are localized as emotion regions and that the method localizes the emotion-inducing regions better.
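For reference, the following sketch shows one way the region-localization metrics discussed above (MAE, precision, recall and F1) can be computed from a predicted emotion activation map and a ground-truth emotion-region mask. Binarizing the activation map with a fixed threshold of 0.5 is an assumption made here for illustration, not a detail taken from the invention.

```python
import numpy as np

def localization_metrics(pred_map, gt_mask, threshold=0.5):
    """pred_map: predicted activation map with values in [0, 1];
    gt_mask: binary ground-truth emotion-region mask of the same shape."""
    pred_map = np.asarray(pred_map, dtype=float)
    gt_mask = np.asarray(gt_mask, dtype=float)
    mae = np.abs(pred_map - gt_mask).mean()            # mean absolute error (smaller is better)

    pred_bin = (pred_map >= threshold).astype(float)    # binarize the predicted map
    tp = (pred_bin * gt_mask).sum()                      # true-positive pixels
    precision = tp / max(pred_bin.sum(), 1e-8)
    recall = tp / max(gt_mask.sum(), 1e-8)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return mae, precision, recall, f1
```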
FIG. 5 compares the emotion feature maps used for emotion classification with the emotion activation map; FIG. 5c marks the predicted emotion and its probability, and FIG. 5d marks the precision (p) and recall (r) for the image. The highlighted areas in the emotion activation map are the regions that contribute most to emotion classification, and these regions directly influence the classification result.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (1)
1. A weakly supervised image emotion classification and localization method based on a cross-spatial pooling strategy, characterized by comprising the following steps:
Step 1: based on the fully convolutional network ResNet-101, delete its pooling layer and fully connected layer, then convolve the feature maps generated by conv5 of ResNet-101 with a 1 × 1 convolution kernel so that a specific number (k) of feature maps is generated for each category;
Step 2: extract the global information of each feature map using global average pooling;
Step 3: find the feature map with the maximum response through a max pooling operation, finally generating a feature vector whose dimensionality equals the number of categories; the value of the c-th element of this vector is denoted S_c and is computed as
S_c = max_{1≤j≤k} G_ave(F'_c^j)
where F'_c^j represents the feature map of the j-th channel of class c in F', F' being the features after the 1 × 1 convolution; k represents the number of feature channels generated per class, c denotes the c-th emotion with C emotion classes in total, and G_ave denotes global average pooling;
Step 4: initialize the convolutional neural network weight parameters with a model pre-trained for visual recognition on ImageNet, and set the learning rates of the fully convolutional layers and of the cross-spatial pooling branch to 0.0001 and 0.001 respectively; the whole model is trained iteratively for 30 epochs, the learning rate is divided by 10 every 10 epochs, the weight decay is set to 0.005 and the momentum to 0.9; random horizontal flipping and cropping are used to augment the data during training to reduce overfitting, and the images fed to the model are finally of size 448 × 448;
Step 5: in the forward pass of each batch, compute the cross-entropy loss
L = -(1/N) Σ_{i=1}^{N} log( exp(S_{y_i}) / Σ_{l=1}^{C} exp(S_l) )
where N is the batch size, i.e. the number of samples trained in one pass, y_i denotes the ground-truth emotion label of the i-th training sample, and S_l is the value of the l-th element of the feature vector defined in step 3, representing the score of the l-th category in the network;
Step 6: update the weight parameters in the backward pass using stochastic gradient descent according to the computed loss value;
Step 7: repeat steps 5 to 6 until one epoch of training is completed, then test the model on the test dataset;
Step 8: repeat step 7 until the model reaches its optimum or the total number of training epochs is completed;
Step 9: generate the image emotion activation map.
Priority Applications (1)
- CN201911259699.2A (priority date 2019-12-10, filing date 2019-12-10): Weak supervision image emotion classification and positioning method based on cross space pooling strategy
Publications (1)
- CN111026898A, published 2020-04-17
Family
- ID=70205332
- 2019-12-10: application CN201911259699.2A filed; status of CN111026898A: pending
Patent Citations (7)
- CN108399406A (priority 2018-01-15, published 2018-08-14): Method and system for weakly supervised salient object detection based on deep learning
- CN108399380A (priority 2018-02-12, published 2018-08-14): Video action detection method based on 3D convolution and Faster RCNN
- CN108960140A (priority 2018-07-04, published 2018-12-07): Pedestrian re-identification method based on multi-region feature extraction and fusion
- CN109165692A (priority 2018-09-06, published 2019-01-08): User personality prediction device and method based on weakly supervised learning
- CN110119688A (priority 2019-04-18, published 2019-08-13): Image emotion classification method using a visual attention network
- CN110334584A (priority 2019-05-20, published 2019-10-15): Gesture recognition method based on a region-based fully convolutional network
- CN110322509A (priority 2019-06-26, published 2019-10-11): Object localization method, system and computer equipment based on hierarchical class activation maps
Non-Patent Citations (3)
- Zhang Jinglian et al., "Research on malicious code classification based on feature fusion", Computer Engineering
- Yang Ke et al., "Research on a machine-learning-based credit risk assessment model for investors in distributed photovoltaic power plants", Credit Reference
- Wang Zhongke et al., "Research and implementation of a PE file feature extraction method", Proceedings of the 10th Academic Annual Conference of the China Institute of Communications (Youth Working Committee session)
Cited By (6)
- CN111797936A (priority 2020-07-13, published 2020-10-20): Image emotion classification method and device based on saliency detection and multi-level feature fusion
- CN111797936B (published 2023-08-08): Image emotion classification method and device based on saliency detection and multi-level feature fusion
- CN112329680A (priority 2020-11-13, published 2021-02-05): Semi-supervised remote sensing image object detection and segmentation method based on class activation maps
- CN112329680B (published 2022-05-03): Semi-supervised remote sensing image object detection and segmentation method based on class activation maps
- CN113191381A (priority 2020-12-04, published 2021-07-30): Zero-shot image classification model based on cross knowledge and classification method thereof
- CN113408511A (priority 2021-08-23, published 2021-09-17): Method, system, device and storage medium for determining a gaze target
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- RJ01: Rejection of invention patent application after publication (application publication date: 2020-04-17)