CN117593593B - Image emotion classification method for multi-scale semantic fusion under emotion gain - Google Patents
- Publication number: CN117593593B (application CN202410071984.6A)
- Authority: CN (China)
- Legal status: Active
Classifications
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of computer vision and discloses an image emotion classification method based on multi-scale semantic fusion under emotion gain. The original image is cut into small blocks, and feature recalibration is performed on the 64 blocks to obtain the emotion gain of different regions of the image; features of different scales are spliced and fused through a multi-scale fusion network, reducing the information lost during convolution; the emotion gain result is then combined with the multi-scale semantic fusion features by decision fusion. A PAFPN pyramid network fuses image features of different scales and an ECA network strengthens the connections between channels, greatly reducing information loss; a coordinate attention mechanism recalibrates the processed image and the feature maps of different scales, and the resulting attention features are fused so that emotion features are mined from both the whole image and its parts, effectively improving the robustness of the model.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an image emotion classification method for multi-scale semantic fusion under emotion gain.
Background
With the rapid development of the internet, more and more images appear in people's field of view: photographs, artistic drawings, social images, daily snapshots and many other kinds of image. These images convey information and express emotion, enriching people's lives. Research on the emotion contained in images has wide applications: social platforms analyze users' emotions to push content and to monitor public opinion; adding an emotion module to a network model improves human-computer interaction; and applying such analysis in the classroom can help teachers follow students' emotional states in real time. Image emotion classification means that a computer analyzes and extracts emotion features from an image, processes them with pattern recognition and machine learning methods, and thereby understands human emotion to classify the image's emotion. Because human cognition is subjective while machine learning is objective, a semantic gap remains when computers simulate human judgement of image emotion, which makes image emotion classification challenging.
In the prior art, deep learning methods are mostly adopted for image emotion analysis: object detection is combined so that the main subject of the image is selected by a bounding box to assist emotion analysis; attention mechanisms combined with visual attention locate the emotionally salient regions of the image so that emotion features are extracted more accurately; and pyramid networks extract features of the image at different scales to mine richer emotion information.
Most of these methods use the last layer of the convolutional network as the extracted feature map, usually a high-level feature map rich in semantic features; however, low-level features such as color, texture and shape, which have been shown to be relevant to emotional responses, are lost through the layer-by-layer convolution. Other methods that use multi-scale networks do not address the information loss caused by the inconsistency of scales when fusing features of different scales. Furthermore, some researchers improve model performance by extracting salient features of the image as class activation maps, but different image regions contribute differently to the evoked emotion: these methods only attend to salient regions and ignore regions that are not salient yet still contribute to emotion.
Through the above analysis, the problems and defects existing in the prior art are as follows:
1) The prior art only attends to salient regions of the image and ignores regions that are not salient but contribute to emotion;
2) when acquiring emotion features from an image, only the salient regions are noticed, while non-salient but emotion-contributing regions are ignored;
3) when acquiring more representative semantic features of the image, the information loss caused by multi-scale feature fusion is not addressed.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides an image emotion classification method for multi-scale semantic fusion under emotion gain.
The technical scheme for solving the technical problems is as follows:
an image emotion classification method for multi-scale semantic fusion under emotion gain comprises the following steps:
s1: image preprocessing, namely dividing an FI image emotion data set P into a training set T and a test set M, randomly scaling and cutting images of the training set T, and finally obtaining an image set T' with the size of 448 multiplied by 3;
s2: establishing a depth network model, selecting a resnet-50 as a basic network, constructing an emotion gain network, and obtaining emotion gain feature diagrams of different areas and different scales through a dicing operation and feature recalibration; constructing a multi-scale semantic fusion network, wherein the multi-scale semantic fusion network is combined with a pyramid network and an ASEF network to obtain a deep multi-scale semantic fusion feature map; then, the two feature images are subjected to channel splicing, and finally, a multi-scale semantic fusion feature image under emotion gain is obtained;
s3: the loss function is designed, so that the image distance between different emotion categories is further caused by the constructed cross entropy loss for keeping the distance between the categories;
S4: training a model, namely inputting an image set T' in the S1 into a depth network model of the S2, training the model by using an SGD optimizer, and learning model parameters by calculating loss through a loss function of the S3;
s5: and (3) obtaining emotion types of the image data set P to be tested, and inputting the test images of the test set M in the S1 into the model trained in the S4 after the processing steps of fixed-size scaling and center cutting to obtain the corresponding emotion types.
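As an illustration of the preprocessing in S1 (and the test-time preprocessing in S5), the following is a minimal sketch assuming a PyTorch/torchvision pipeline; the dataset path, the normalization statistics and the resize value used before center cropping are illustrative assumptions, not values fixed by the method.

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader, random_split

# Training-time preprocessing: random scaling and cropping to 448x448x3 (S1).
train_tf = T.Compose([
    T.RandomResizedCrop(448),          # random scale + crop
    T.RandomHorizontalFlip(p=0.5),     # random flip used during training
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Test-time preprocessing: fixed-size scaling followed by center cropping (S5).
# The intermediate resize value (512) is an assumption.
test_tf = T.Compose([
    T.Resize(512),
    T.CenterCrop(448),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Hypothetical folder layout: one sub-directory per emotion category.
# In practice the train and test splits would each use their own transform.
dataset = ImageFolder("FI_dataset", transform=train_tf)
n_train = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
```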
On the basis of the technical scheme, the invention can be improved as follows.
Preferably, the FI image emotion data set P, the training set T and the test set M in S1 each cover eight emotions: amusement, awe, contentment, excitement, anger, disgust, fear and sadness.
Preferably, the base network in S2 is obtained by transferring the convolutional layer groups of a resnet-50 pre-trained on the large-scale ImageNet dataset; after an image is input, four feature maps c2, c3, c4 and c5 of different scales are obtained, with spatial sizes 112×112, 56×56, 28×28 and 14×14 respectively;
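To make the role of the base network concrete, the sketch below shows one way (an assumption, not the patented implementation) to expose the four intermediate resnet-50 feature maps c2 to c5 in PyTorch.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ResNet50Backbone(nn.Module):
    """Exposes the four stage outputs c2..c5 of an ImageNet-pretrained resnet-50."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1")  # transferred convolutional layer groups
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)          # 448x448 input -> 112x112
        c2 = self.layer1(x)       # 256 x 112 x 112
        c3 = self.layer2(c2)      # 512 x 56 x 56
        c4 = self.layer3(c3)      # 1024 x 28 x 28
        c5 = self.layer4(c4)      # 2048 x 14 x 14
        return c2, c3, c4, c5

feats = ResNet50Backbone()(torch.randn(1, 3, 448, 448))
print([f.shape for f in feats])
```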
the implementation steps of the emotion gain network design in the S2 are as follows:
S2.1.1: cut each picture of the preprocessed image set T' into 64 blocks of size 56×56, stack the blocks, and perform feature recalibration to obtain the region-wise emotion weight feature map F_loc of each picture;
S2.1.2: perform feature recalibration on the three feature maps c3, c4 and c5 of different scales from resnet-50, and fuse the results to obtain the overall emotion weight feature map F_glo;
S2.1.3: fuse the region-wise emotion weight feature map F_loc obtained in S2.1.1 with the multi-scale emotion weight feature map F_glo obtained in S2.1.2 to obtain the final emotion gain feature map F, as in formula (1):

F = Fuse(F_loc, F_glo)   (1)

where Fuse(·) denotes the fusion operation applied to the two feature maps.
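A minimal sketch of the dicing operation in S2.1.1, assuming the 448×448 input is split into an 8×8 grid of non-overlapping 56×56 patches; the tensor layout used for stacking is an assumption.

```python
import torch

def dice_into_patches(img, patch=56):
    """Cut a (B, 3, 448, 448) image batch into 64 non-overlapping 56x56 patches per image."""
    b, c, h, w = img.shape
    patches = img.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, 8, 8, 56, 56)
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()        # (B, 8, 8, C, 56, 56)
    return patches.view(b, -1, c, patch, patch)                     # (B, 64, C, 56, 56)

x = torch.randn(2, 3, 448, 448)
print(dice_into_patches(x).shape)  # torch.Size([2, 64, 3, 56, 56])
```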
preferably, in S2.1.1, feature recalibration is performed using a CA attention mechanism, specifically as follows:
S2.1.1.1: to capture long-range spatial interactions with precise positional information, the input of size C×H×W is average-pooled along the W direction and along the H direction, generating feature maps of sizes C×H×1 and C×1×W respectively:

z_c^h(h) = (1/W) · Σ_{0≤i<W} x_c(h, i)   (2)

z_c^w(w) = (1/H) · Σ_{0≤j<H} x_c(j, w)   (3)

where x denotes the input of size C×H×W, C the number of channels, H the height and W the width; x_c(h, i) and x_c(j, w) denote single pixels of channel c; z_c^h(h) is the average pooling of the input along the W direction, i.e. the sum of the inputs over the W direction divided by W, and z_c^w(w) is the average pooling along the H direction, i.e. the sum of the inputs over the H direction divided by H;

S2.1.1.2: concatenate z^h and z^w, then apply a 1×1 convolution for dimensionality reduction followed by an activation to generate the feature map f:

f = σ(F_1([z^h, z^w]))   (4)

where σ denotes the sigmoid activation function, F_1 a 1×1 convolution, and [z^h, z^w] the concatenation of the two pooling results of S2.1.1.1;

S2.1.1.3: split f along the spatial dimension into f^h and f^w, raise their dimensionality with separate 1×1 convolutions, and apply the sigmoid activation to obtain the final H-direction attention vector g^h and W-direction attention vector g^w:

g^h = σ(F_h(f^h))   (5)

g^w = σ(F_w(f^w))   (6)

where σ denotes the sigmoid activation function, f^h and f^w the results of the split along the H and W directions, and F_h and F_w the corresponding 1×1 convolutions in the H and W directions;

S2.1.1.4: apply the attention vectors of S2.1.1.3 to the input to obtain the region-wise emotion weight feature map F_loc:

F_loc(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)   (7)

where x_c(i, j) denotes the input at position (i, j), g_c^h(i) the attention vector at position i in the H direction, g_c^w(j) the attention vector at position j in the W direction, and F_loc collects the recalibrated responses over all positions of the image.
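The following PyTorch sketch of the coordinate-attention recalibration in S2.1.1.1 to S2.1.1.4 is illustrative only; the channel-reduction ratio and the use of sigmoid after the 1×1 reduction follow the text above but remain assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Feature recalibration per formulas (2)-(7): directional pooling, 1x1 reduce,
    split, 1x1 expand, then reweight the input with the two attention vectors."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over W -> C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over H -> C x 1 x W
        self.reduce = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.Sigmoid())  # formula (4)
        self.expand_h = nn.Conv2d(mid, channels, 1)     # F_h in formula (5)
        self.expand_w = nn.Conv2d(mid, channels, 1)     # F_w in formula (6)

    def forward(self, x):
        b, c, h, w = x.shape
        zh = self.pool_h(x)                             # (B, C, H, 1)
        zw = self.pool_w(x).permute(0, 1, 3, 2)         # (B, C, W, 1)
        f = self.reduce(torch.cat([zh, zw], dim=2))     # (B, mid, H+W, 1)
        fh, fw = torch.split(f, [h, w], dim=2)          # split along the spatial dim
        gh = torch.sigmoid(self.expand_h(fh))           # (B, C, H, 1)
        gw = torch.sigmoid(self.expand_w(fw)).permute(0, 1, 3, 2)  # (B, C, 1, W)
        return x * gh * gw                              # formula (7)

print(CoordinateAttention(512)(torch.randn(1, 512, 56, 56)).shape)
```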
Preferably, the multi-scale semantic fusion network design step in S2 is as follows:
S2.2.1: use the four feature maps c2, c3, c4 and c5 of different scales from resnet-50 as inputs to the pyramid network and perform lateral connections along a top-down and then a bottom-up path (a simplified sketch is given after this list), as follows:
S2.2.1.1: adjust the number of channels of c5 to 256 with a 1×1 convolution; the result is named p5;
S2.2.1.2: scale p5 to 28×28 by upsampling, then fuse c4 with p5 to obtain p4;
S2.2.1.3: fuse c3 with p4 to obtain p3, and fuse c2 with p3 to obtain p2;
S2.2.1.4: adjust the number of channels of p2 to 256 with a 1×1 convolution; the result is named n2;
S2.2.1.5: scale n2 to 56×56 by downsampling, then fuse p3 with n2 to obtain n3;
S2.2.1.6: fuse p4 with n3 to obtain n4, and fuse p5 with n4 to obtain n5;
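A simplified sketch of the top-down/bottom-up fusion in S2.2.1.1 to S2.2.1.6, assuming each fusion is an element-wise addition after a 1×1 lateral convolution; the actual fusion operator is not fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplePAFPN(nn.Module):
    """Top-down then bottom-up fusion of c2..c5 into n2..n5 with 256 channels each."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.down = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
                                  for _ in range(3))

    def forward(self, c2, c3, c4, c5):
        l2, l3, l4, l5 = (lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5)))
        # top-down path: p5 -> p2
        p5 = l5
        p4 = l4 + F.interpolate(p5, size=l4.shape[-2:], mode="nearest")
        p3 = l3 + F.interpolate(p4, size=l3.shape[-2:], mode="nearest")
        p2 = l2 + F.interpolate(p3, size=l2.shape[-2:], mode="nearest")
        # bottom-up path: n2 -> n5
        n2 = p2
        n3 = p3 + self.down[0](n2)
        n4 = p4 + self.down[1](n3)
        n5 = p5 + self.down[2](n4)
        return n2, n3, n4, n5

c2, c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in
                  [(256, 112), (512, 56), (1024, 28), (2048, 14)])
print([n.shape for n in SimplePAFPN()(c2, c3, c4, c5)])
```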
S2.2.2: the outputs of the pyramid network first pass through an ECA attention mechanism that strengthens the connections between feature-map channels; the adaptive spatial fusion network ASFF then adaptively adjusts the fusion ratios of the feature maps of different scales, finally yielding the multi-scale semantic fusion feature map. The specific steps are as follows:
S2.2.2.1: n3, n4 and n5 obtained in S2.2.1 are passed through a global average pooling layer and then a fast one-dimensional convolution of kernel size k to perform local cross-channel interaction; the attention weights are obtained through a sigmoid activation, finally giving e3, e4 and e5 (a sketch of this step follows below). The formulas are:

g(x) = (1 / (W·H)) · Σ_{i=1..W} Σ_{j=1..H} x_{ij}   (8)

w = σ(C1D_k(g(x)))   (9)

y(x) = w ⊗ x   (10)

where g(x) denotes global average pooling over the W and H directions, i.e. the sum of the input x over all positions divided by W·H; σ denotes the sigmoid activation function; C1D_k denotes a one-dimensional convolution with kernel size k; w denotes the attention weights; and y(x) denotes the feature enhancement map obtained by multiplying the input features with the attention weights;
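A sketch of the ECA channel attention of S2.2.2.1 (formulas (8) to (10)); the kernel size k = 3 is an assumed default, not a value given in the text.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: global average pooling, fast 1-D convolution of
    size k across channels, sigmoid weights, then channel-wise reweighting."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)  # C1D_k
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        g = x.mean(dim=(2, 3))                      # formula (8): global average pooling
        w = self.conv(g.unsqueeze(1))               # local cross-channel interaction, (B, 1, C)
        w = self.sigmoid(w).view(b, c, 1, 1)        # formula (9): attention weights
        return x * w                                # formula (10): feature enhancement map

e3 = ECA()(torch.randn(1, 256, 56, 56))
print(e3.shape)
```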
S2.2.2.2: feature fusion is performed on e3, e4 and e5 obtained in S2.2.2.1, taking each of them in turn as the reference scale, finally obtaining the adaptive spatial fusion features F_1, F_2 and F_3 (see the sketch after this step). Rename e3, e4 and e5 as x1, x2 and x3. To fuse at the scale of x1, x3 must be adjusted to match x1: a 1×1 convolution first matches its number of channels to that of x1, and interpolation then rescales it to the same spatial size; the same operation is applied to x2, so that x1, x2 and x3 share the same channel number and scale. The features are then fused according to:

F^l = α^l · x^{1→l} + β^l · x^{2→l} + γ^l · x^{3→l}   (11)

where l denotes the selected reference scale and takes the values 1, 2 and 3; x^{1→l}, x^{2→l} and x^{3→l} denote x1, x2 and x3 after being rescaled and resampled to the scale of x^l; α^l, β^l and γ^l denote their respective weight parameters; and F^l is the result of multiplying the features of the different scales by their corresponding weights and summing;
the weight parameters α^l, β^l and γ^l are obtained from the rescaled features x^{1→l} to x^{3→l} by 1×1 convolutions; after concatenation they are normalized by a softmax activation so that each lies in [0, 1] and they sum to 1;
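The sketch below illustrates the adaptive spatial feature fusion of S2.2.2.2 at one reference level; bilinear interpolation for resizing and 1×1 branches for the weights are assumptions consistent with the description, and the channel count is a placeholder (the inputs are assumed to already share 256 channels after the pyramid network).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFLevel(nn.Module):
    """Fuse three same-channel feature maps at the scale of a chosen reference level:
    F^l = alpha * x1->l + beta * x2->l + gamma * x3->l, with softmax-normalized weights."""
    def __init__(self, channels=256):
        super().__init__()
        self.weight_convs = nn.ModuleList(nn.Conv2d(channels, 1, 1) for _ in range(3))

    def forward(self, x1, x2, x3, ref):
        size = ref.shape[-2:]
        resized = [F.interpolate(x, size=size, mode="bilinear", align_corners=False)
                   for x in (x1, x2, x3)]                    # x1->l, x2->l, x3->l
        logits = torch.cat([conv(x) for conv, x in zip(self.weight_convs, resized)], dim=1)
        weights = torch.softmax(logits, dim=1)               # alpha, beta, gamma in [0,1], sum 1
        return sum(weights[:, i:i + 1] * resized[i] for i in range(3))  # formula (11)

x1, x2, x3 = (torch.randn(1, 256, s, s) for s in (56, 28, 14))
fused_at_x3 = ASFFLevel()(x1, x2, x3, ref=x3)
print(fused_at_x3.shape)  # torch.Size([1, 256, 14, 14])
```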
S2.2.2.3: the three adaptive spatial fusion features F_1, F_2 and F_3 obtained in S2.2.2.2 are spliced and fused: taking the dimensions of F_3 as the reference, F_1 and F_2 are resampled to that size, and the three features are concatenated along the channel dimension to obtain the final multi-scale semantic fusion feature map:

F_mul = Concat(F_1, F_2, F_3)   (12)

where F^l denotes the adaptive spatial fusion feature obtained by weighting the rescaled features with α^l, β^l and γ^l, and Concat(F_1, F_2, F_3) denotes the channel-wise concatenation of the three adaptive spatial fusion features, which is the multi-scale semantic fusion feature map F_mul.
Preferably, the multi-scale semantic fusion feature map fusion method under the emotion gain in the S2 is designed as follows:
the fusion of the emotion gain feature map and the multi-scale semantic fusion feature map is realized by channel-wise concatenation;
the final output of S2 is given by:

F_out = Concat(F, F_mul)   (13)

where F is the emotion gain feature map, F_mul is the multi-scale semantic fusion feature map, and F_out is the multi-scale semantic fusion feature map under emotion gain.
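To make formula (13) concrete, here is a hedged sketch of the decision fusion followed by a classification head; the global-pooling-plus-fully-connected classifier and the channel counts are assumptions, since the classifier itself is not detailed in the text, and the two maps are assumed to share the same spatial size.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate the emotion gain map F and the multi-scale map F_mul along the
    channel dimension (formula (13)) and map the result to 8 emotion logits."""
    def __init__(self, gain_channels, mul_channels, num_classes=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(gain_channels + mul_channels, num_classes)

    def forward(self, f_gain, f_mul):
        f_out = torch.cat([f_gain, f_mul], dim=1)     # multi-scale semantics under emotion gain
        return self.fc(self.pool(f_out).flatten(1))   # (B, num_classes)

logits = FusionHead(256, 768)(torch.randn(2, 256, 14, 14), torch.randn(2, 768, 14, 14))
print(logits.shape)  # torch.Size([2, 8])
```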
Preferably, the cross entropy loss function constructed in S3 is formulated as follows:
L_fwcls = -(1/N) · Σ_{i=1..N} Σ_{j=1..C} 1(y_i = j) · log p_{i,j}   (14)

where 1(y_i = j) = 1 when the i-th image is correctly classified into class j and 1(y_i = j) = 0 otherwise; N denotes the number of pictures in the dataset; C denotes the emotion categories involved; p_{i,j} denotes the probability that the i-th image is judged to belong to class j; and L_fwcls is the cross-entropy loss.
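A direct rendering of formula (14) for reference; for a softmax classifier it coincides with the standard cross-entropy, as the check at the end illustrates.

```python
import torch
import torch.nn.functional as F

def emotion_cross_entropy(logits, labels):
    """L = -(1/N) * sum_i sum_j 1(y_i = j) * log p_{i,j}, formula (14)."""
    log_probs = F.log_softmax(logits, dim=1)                      # log p_{i,j}
    picked = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # log p_{i, y_i}
    return -picked.mean()

logits = torch.randn(4, 8)
labels = torch.tensor([0, 3, 7, 2])
print(torch.allclose(emotion_cross_entropy(logits, labels),
                     F.cross_entropy(logits, labels)))  # True
```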
Preferably, an image emotion classification system for multi-scale semantic fusion under emotion gain, which is applied to the image emotion classification method for multi-scale semantic fusion under emotion gain, comprises: the system comprises an image preprocessing module, a depth network model building module, a loss function design module, a model training module and an emotion type obtaining module;
the image preprocessing module is used to randomly scale and crop the images of the emotion data set, finally obtaining images of size 224×224×3;
the deep network model building module is used to select resnet-50 as the base network, to construct an emotion gain network that obtains the emotion weights of different regions through a dicing operation and feature recalibration, and to construct a multi-scale semantic fusion network that combines a pyramid network with an ASEF network to acquire deep multi-scale semantic fusion features;
the loss function design module is used to keep the categories apart: the constructed cross-entropy loss pushes images of different emotion categories further from each other;
the model training module is used to preprocess the divided images by scaling, random flipping and similar operations, to input them into the network model, to optimize with a stochastic gradient descent method, and to learn the model parameters by computing the loss with the loss function;
the emotion category obtaining module is used to input the images of the data set, after the fixed-size scaling and center-cropping preprocessing steps, into the trained model to obtain the corresponding emotion category.
Compared with the prior art, the technical scheme of the application has the following beneficial technical effects:
Firstly, the invention provides an image emotion classification method based on multi-scale semantic fusion under emotion gain. On one hand, the original image is first cut into small blocks, and feature recalibration is performed on the 64 blocks to obtain the emotion feature maps of different regions of the image; at the same time, feature recalibration is performed on the three post-convolution feature maps of scales 56×56, 28×28 and 14×14 to obtain the overall emotion feature map, and the local and overall feature maps are combined to obtain the final emotion gain. On the other hand, the feature maps of different scales are adaptively fused through a multi-scale fusion network, reducing the information lost during convolution. Finally, the emotion gain result and the multi-scale semantic fusion feature map are combined by decision fusion.
According to the invention, through random scaling and cutting, the model can process images with different sizes and proportions, and the generalization capability of the model is enhanced.
The invention ensures that the input image size is unified to 448×448×3, providing stable and consistent input data for the subsequent deep learning model.
The invention adopts the resnet-50 as a basic network, and accelerates and optimizes the training process by means of the weight of the pre-training.
According to the emotion gain network, the multiscale and region specific emotion weights of the image can be obtained through the dicing operation and the feature recalibration, so that the emotion of the image can be more accurately classified.
The multi-scale semantic fusion network can capture multi-scale characteristics of images, and has higher accuracy for emotion classification.
The invention adopts cross entropy loss to ensure that images among different emotion categories have larger inter-category distances in the training process of the model, and improves the classification capability of the model.
Preprocessing such as scaling and random flipping of the images enhances the generalization capability of the model, so that it can better handle real-world data.
The present invention optimizes the model using a random gradient descent method, an optimization technique that has proven to be effective in deep learning.
The invention uses the preprocessing mode of fixed size scaling and center cutting to ensure the consistency of the image to be tested and the training image, and improves the accuracy of emotion classification.
The method combines emotion gain and multi-scale semantic fusion in the aspect of image emotion classification, so that the method has high efficiency and accuracy in processing real world data. Furthermore, the use of deep learning techniques and specific loss function designs also provides them with powerful learning capabilities.
Secondly, on one hand, the PAFPN pyramid network fuses image features of different scales, the ECA network strengthens the connections between channels, and the ASFF network adaptively fuses features of different scales to resolve the scale inconsistency in the fusion process, greatly reducing information loss; on the other hand, feature recalibration is performed on the processed image and the feature maps of different scales through a coordinate attention mechanism, the resulting attention feature maps are fused, and emotion features are mined from both the whole image and its parts, effectively improving the robustness of the model.
Thirdly, the expected benefits and commercial value after the technical scheme of the invention is converted are as follows: the invention can be applied to the fields of public opinion monitoring, education assistance, artistic learning and the like.
The technical scheme of the invention fills a technical gap in the industry at home and abroad. For the problem that only salient regions are noticed while non-salient but emotion-contributing regions are ignored when acquiring the emotion features of an image, the invention mines emotion features both locally and globally, effectively improving the robustness of the model. For the problem that the information loss caused by multi-scale feature fusion is not addressed when acquiring more representative semantic features of an image, the invention adaptively adjusts the fusion weights of features of different scales from both the spatial and the channel perspective, resolving the scale inconsistency in the fusion process.
Drawings
FIG. 1 is a flowchart of an image emotion classification method for multi-scale semantic fusion under emotion gain provided by an embodiment of the invention;
FIG. 2 is a diagram of an overall framework provided by an embodiment of the present invention;
FIG. 3 is a diagram of a multi-scale semantic fusion network provided by an embodiment of the present invention;
FIG. 4 is a graph of an emotion gain network provided by an embodiment of the present invention;
FIG. 5 is a block flow diagram provided by an embodiment of the present invention;
FIG. 6 is a confusion matrix heat map of the various modules provided by embodiments of the present invention; wherein, (a) resnet-50, (b) resnet-50+MPCA, (c) resnet-50+MASEF, (d) resnet-50+all;
FIG. 7 is a graph of visualization results provided by an embodiment of the present invention; wherein, (a) original images, (b) resnet-50, (c) resnet-50+MASEF, (d) resnet-50+MPCA, and (e) resnet-50+all.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the embodiment, shown in fig. 1-7, an image emotion classification method for multi-scale semantic fusion under emotion gain includes the following steps:
s1: image preprocessing, namely dividing an FI image emotion data set P into a training set T and a test set M, randomly scaling and cutting images of the training set T, and finally obtaining an image set T' with the size of 448 multiplied by 3;
s2: establishing a depth network model, selecting a resnet-50 as a basic network, constructing an emotion gain network, and obtaining emotion gain feature diagrams of different areas and different scales through a dicing operation and feature recalibration; constructing a multi-scale semantic fusion network, wherein the multi-scale semantic fusion network is combined with a pyramid network and an ASEF network to obtain a deep multi-scale semantic fusion feature map; then, the two feature images are subjected to channel splicing, and finally, a multi-scale semantic fusion feature image under emotion gain is obtained;
S3: the loss function is designed, so that the image distance between different emotion categories is further caused by the constructed cross entropy loss for keeping the distance between the categories;
s4: training a model, namely inputting an image set T' in the S1 into a depth network model of the S2, training the model by using an SGD optimizer, and learning model parameters by calculating loss through a loss function of the S3;
s5: and (3) obtaining emotion types of the image data set P to be tested, and inputting the test images of the test set M in the S1 into the model trained in the S4 after the processing steps of fixed-size scaling and center cutting to obtain the corresponding emotion types.
Each of the steps described above may be modified intelligently by using appropriate algorithms and techniques. For example, more advanced image preprocessing techniques may be used, or more advanced deep learning models and optimization algorithms may be used. In addition, automated tools and frameworks can be used to simplify and accelerate the process, such as deep learning frameworks using TensorFlow or PyTorch, and the like.
The overall structure of the image emotion classification method for multi-scale semantic fusion under emotion gain is shown in fig. 2. The embodiment of the invention was simulated in a Windows 10 and Python environment, and the model was trained on the FI data set with the method of the invention, yielding an image emotion classification model with high accuracy. After the model is obtained, a test image can be input into the model to obtain the emotion classification result of the image.
The FI image emotion data set P, the training set T and the test set M in S1 each cover eight emotions: amusement, awe, contentment, excitement, anger, disgust, fear and sadness.
The base network in S2 is obtained by transferring the convolutional layer groups of a resnet-50 pre-trained on the large-scale ImageNet dataset; after an image is input, four feature maps c2, c3, c4 and c5 of different scales are obtained, with spatial sizes 112×112, 56×56, 28×28 and 14×14 respectively;
the implementation steps of the emotion gain network design in the S2 are as follows:
S2.1.1: cut each picture of the preprocessed image set T' into 64 blocks of size 56×56, stack the blocks, and perform feature recalibration to obtain the region-wise emotion weight feature map F_loc of each picture;
S2.1.2: perform feature recalibration on the three feature maps c3, c4 and c5 of different scales from resnet-50, and fuse the results to obtain the overall emotion weight feature map F_glo;
S2.1.3: fuse the region-wise emotion weight feature map F_loc obtained in S2.1.1 with the multi-scale emotion weight feature map F_glo obtained in S2.1.2 to obtain the final emotion gain feature map F, as in formula (1):

F = Fuse(F_loc, F_glo)   (1)

where Fuse(·) denotes the fusion operation applied to the two feature maps.
The steps S2.1.1, S2.1.2 and S2.1.3 are the design steps of the emotion gain network.
S2.1.1, feature recalibration is performed using a CA attention mechanism, specifically as follows:
S2.1.1.1: to capture long-range spatial interactions with precise positional information, the input of size C×H×W is average-pooled along the W direction and along the H direction, generating feature maps of sizes C×H×1 and C×1×W respectively:

z_c^h(h) = (1/W) · Σ_{0≤i<W} x_c(h, i)   (2)

z_c^w(w) = (1/H) · Σ_{0≤j<H} x_c(j, w)   (3)

where x denotes the input of size C×H×W, C the number of channels, H the height and W the width; x_c(h, i) and x_c(j, w) denote single pixels of channel c; z_c^h(h) is the average pooling of the input along the W direction, i.e. the sum of the inputs over the W direction divided by W, and z_c^w(w) is the average pooling along the H direction, i.e. the sum of the inputs over the H direction divided by H;

S2.1.1.2: concatenate z^h and z^w, then apply a 1×1 convolution for dimensionality reduction followed by an activation to generate the feature map f:

f = σ(F_1([z^h, z^w]))   (4)

where σ denotes the sigmoid activation function, F_1 a 1×1 convolution, and [z^h, z^w] the concatenation of the two pooling results of S2.1.1.1;

S2.1.1.3: split f along the spatial dimension into f^h and f^w, raise their dimensionality with separate 1×1 convolutions, and apply the sigmoid activation to obtain the final H-direction attention vector g^h and W-direction attention vector g^w:

g^h = σ(F_h(f^h))   (5)

g^w = σ(F_w(f^w))   (6)

where σ denotes the sigmoid activation function, f^h and f^w the results of the split along the H and W directions, and F_h and F_w the corresponding 1×1 convolutions in the H and W directions;

S2.1.1.4: apply the attention vectors of S2.1.1.3 to the input to obtain the region-wise emotion weight feature map F_loc:

F_loc(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)   (7)

where x_c(i, j) denotes the input at position (i, j), g_c^h(i) the attention vector at position i in the H direction, g_c^w(j) the attention vector at position j in the W direction, and F_loc collects the recalibrated responses over all positions of the image.
As shown in fig. 3, the multi-scale semantic fusion network design step in S2 is as follows:
S2.2.1: use the four feature maps c2, c3, c4 and c5 of different scales from resnet-50 as inputs to the pyramid network and perform lateral connections along a top-down and then a bottom-up path, as follows:
S2.2.1.1: adjust the number of channels of c5 to 256 with a 1×1 convolution; the result is named p5;
S2.2.1.2: scale p5 to 28×28 by upsampling, then fuse c4 with p5 to obtain p4;
S2.2.1.3: fuse c3 with p4 to obtain p3, and fuse c2 with p3 to obtain p2;
S2.2.1.4: adjust the number of channels of p2 to 256 with a 1×1 convolution; the result is named n2;
S2.2.1.5: scale n2 to 56×56 by downsampling, then fuse p3 with n2 to obtain n3;
S2.2.1.6: fuse p4 with n3 to obtain n4, and fuse p5 with n4 to obtain n5;
S2.2.2: the outputs of the pyramid network first pass through an ECA attention mechanism that strengthens the connections between feature-map channels; the adaptive spatial fusion network ASFF then adaptively adjusts the fusion ratios of the feature maps of different scales, finally yielding the multi-scale semantic fusion feature map. The specific steps are as follows:
S2.2.2.1: n3, n4 and n5 obtained in S2.2.1 are passed through a global average pooling layer and then a fast one-dimensional convolution of kernel size k to perform local cross-channel interaction; the attention weights are obtained through a sigmoid activation, finally giving e3, e4 and e5. The formulas are:

g(x) = (1 / (W·H)) · Σ_{i=1..W} Σ_{j=1..H} x_{ij}   (8)

w = σ(C1D_k(g(x)))   (9)

y(x) = w ⊗ x   (10)

where g(x) denotes global average pooling over the W and H directions, i.e. the sum of the input x over all positions divided by W·H; σ denotes the sigmoid activation function; C1D_k denotes a one-dimensional convolution with kernel size k; w denotes the attention weights; and y(x) denotes the feature enhancement map obtained by multiplying the input features with the attention weights;
S2.2.2.2: feature fusion is performed on e3, e4 and e5 obtained in S2.2.2.1, taking each of them in turn as the reference scale, finally obtaining the adaptive spatial fusion features F_1, F_2 and F_3. Rename e3, e4 and e5 as x1, x2 and x3. To fuse at the scale of x1, x3 must be adjusted to match x1: a 1×1 convolution first matches its number of channels to that of x1, and interpolation then rescales it to the same spatial size; the same operation is applied to x2, so that x1, x2 and x3 share the same channel number and scale. The features are then fused according to:

F^l = α^l · x^{1→l} + β^l · x^{2→l} + γ^l · x^{3→l}   (11)

where l denotes the selected reference scale and takes the values 1, 2 and 3; x^{1→l}, x^{2→l} and x^{3→l} denote x1, x2 and x3 after being rescaled and resampled to the scale of x^l; α^l, β^l and γ^l denote their respective weight parameters; and F^l is the result of multiplying the features of the different scales by their corresponding weights and summing;
the weight parameters α^l, β^l and γ^l are obtained from the rescaled features x^{1→l} to x^{3→l} by 1×1 convolutions; after concatenation they are normalized by a softmax activation so that each lies in [0, 1] and they sum to 1;
S2.2.2.3: the three adaptive spatial fusion features F_1, F_2 and F_3 obtained in S2.2.2.2 are spliced and fused: taking the dimensions of F_3 as the reference, F_1 and F_2 are resampled to that size, and the three features are concatenated along the channel dimension to obtain the final multi-scale semantic fusion feature map:

F_mul = Concat(F_1, F_2, F_3)   (12)

where F^l denotes the adaptive spatial fusion feature obtained by weighting the rescaled features with α^l, β^l and γ^l, and Concat(F_1, F_2, F_3) denotes the channel-wise concatenation of the three adaptive spatial fusion features, which is the multi-scale semantic fusion feature map F_mul.
Steps S2.2.1.1 to S2.2.2.3 constitute the design of the multi-scale semantic fusion network.
The multi-scale semantic fusion feature map fusion method under the emotion gain in S2 is designed as follows:
the fusion of the emotion gain feature map and the multi-scale semantic fusion feature map is realized by channel-wise concatenation;
the final output of S2 is given by:

F_out = Concat(F, F_mul)   (13)

where F is the emotion gain feature map, F_mul is the multi-scale semantic fusion feature map, and F_out is the multi-scale semantic fusion feature map under emotion gain.
The cross entropy loss function formula constructed in S3 is as follows:
L_fwcls = -(1/N) · Σ_{i=1..N} Σ_{j=1..C} 1(y_i = j) · log p_{i,j}   (14)

where 1(y_i = j) = 1 when the i-th image is correctly classified into class j and 1(y_i = j) = 0 otherwise; N denotes the number of pictures in the dataset; C denotes the emotion categories involved; p_{i,j} denotes the probability that the i-th image is judged to belong to class j; and L_fwcls is the cross-entropy loss.
An image emotion classification system for multi-scale semantic fusion under emotion gain, applied to the image emotion classification method for multi-scale semantic fusion under emotion gain, comprising: the system comprises an image preprocessing module, a depth network model building module, a loss function design module, a model training module and an emotion type obtaining module;
the image preprocessing module is used to randomly scale and crop the images of the emotion data set, finally obtaining images of size 224×224×3;
the deep network model building module is used to select resnet-50 as the base network, to construct an emotion gain network that obtains the emotion weights of different regions through a dicing operation and feature recalibration, and to construct a multi-scale semantic fusion network that combines a pyramid network with an ASEF network to acquire deep multi-scale semantic fusion features;
the loss function design module is used to keep the categories apart: the constructed cross-entropy loss pushes images of different emotion categories further from each other;
the model training module is used to preprocess the divided images by scaling, random flipping and similar operations, to input them into the network model, to optimize with a stochastic gradient descent method, and to learn the model parameters by computing the loss with the loss function;
the emotion category obtaining module is used to input the images of the data set, after the fixed-size scaling and center-cropping preprocessing steps, into the trained model to obtain the corresponding emotion category.
The entire experiment was set up as follows:
In the embodiment of the invention, the random cropping parameter is set to 448×448 and the probability of random flipping is set to 0.5. The ratio of the training set to the test set is 8:2.
The optimizer in the embodiment of the invention is the stochastic gradient descent method. The learning rate is set to lr = 0.0005; it is reduced by a factor of 10 at the 30th epoch and then by a factor of 10 every 20 epochs, and the total number of training rounds is 100 epochs. The cropping parameter is set to 448×448. The model output is compared with the sample labels and the proportion of correct samples, i.e. the test-set accuracy, is counted; if the test-set accuracy of the current round is higher than the previous best, the new best accuracy is recorded and the model of the current round is saved. After all training rounds are finished, the saved model with the highest test-set accuracy is the trained optimal model.
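A sketch of the optimization schedule described above, assuming PyTorch's SGD and MultiStepLR; milestones [30, 50, 70, 90] realize the stated 10-fold reductions at epoch 30 and every 20 epochs thereafter, while the momentum value and the helper names (model, train_loader, test_loader, evaluate) are assumptions.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import MultiStepLR

def train(model, train_loader, test_loader, evaluate, epochs=100, device="cuda"):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0005, momentum=0.9)
    scheduler = MultiStepLR(optimizer, milestones=[30, 50, 70, 90], gamma=0.1)
    best_acc = 0.0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
        acc = evaluate(model, test_loader, device)   # test-set accuracy for this epoch
        if acc > best_acc:                           # keep the best model seen so far
            best_acc = acc
            torch.save(model.state_dict(), "best_model.pth")
    return best_acc
```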
S5, in the experiment, the emotion type of the image to be detected is obtained:
The test-set data, or any image of the FI data set, can be input into the model one by one or in fixed-size batches after the same preprocessing as for the test images in S1, namely fixed-size scaling followed by center cropping. After processing by the model, the output of the classification layer is compared with the sample labels and the proportion of correct samples, i.e. the test-set accuracy, is counted. The emotion category corresponding to the output is the image emotion category judged by the model. The final accuracy result was 0.6843.
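A hedged sketch of the accuracy computation used to report the test-set result; it assumes a test loader built with the fixed-size scaling and center-cropping transform sketched for S1.

```python
import torch

@torch.no_grad()
def evaluate(model, test_loader, device="cuda"):
    """Return the proportion of test images whose predicted emotion matches the label.
    The loader is assumed to apply the test-time resize + 448x448 center-crop transform."""
    model.eval()
    correct, total = 0, 0
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total
```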
In order to prove the inventive and technical value of the technical solution of the present invention, this section is an application example on specific products or related technologies of the claim technical solution.
(1) Emotion photo album
An emotion photo album is feasible. Firstly, people experience both good and bad moods and are often in an emotional state without realizing it, so some outside help is useful; secondly, people habitually take casual photos and save interesting images from social networks, most of which relate to their own emotions, so it is feasible to infer a user's emotion from these images.
Specifically, the emotion photo album recognizes the emotion of the images the user stores in the album and then classifies them; at the end of each day an algorithm estimates how the day's mood was, and long-term tracking can also be selected to analyze the mood of a day, a month or even a year. In addition, the user can actively label the emotion of some images, and with these labels the algorithm can recognize emotions more accurately. In summary, the emotion album performs emotion recognition on the photos the user takes and the images the user saves in order to infer the user's emotion, and thereby helps the user regulate their mood.
(2) Student mental health auxiliary system
In the usual teaching mode, one teacher faces many students, so the teacher cannot attend to every student. Many tragedies are caused by such brief negligence, a structural disadvantage that requires external assistance for the teacher to complete the care of the students.
The student mental health auxiliary system can track students' classroom performance over the long term, keeping an individual record for each student. It infers a student's mental state through emotion recognition on images of facial expressions, body postures and the like; once an abnormality is found, it is immediately flagged and sent to the teacher, who can then comfort the student concerned.
(3) Teenager art auxiliary learning system
This interactive platform for the artistic cultivation of teenagers differs from traditional platforms that score pictures comprehensively according to color, modeling, lines and other elements; it pays more attention to whether teenagers understand the mood, meaning and emotion of a picture.
The platform targets teenagers: after registration, works can be uploaded, and the platform uses the algorithm to analyze the emotion of each work, helping teenagers further understand its spirit and meaning. Different users can communicate with each other, and the platform can also analyze these works to support learning.
The application embodiment of the invention provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the image emotion classification method of multi-scale semantic fusion under emotion gain.
The embodiment of the application of the invention provides an information data processing terminal which is used for realizing an image emotion classification system for multi-scale semantic fusion under emotion gain.
The embodiment of the invention showed clear advantages during research, development and use, as described below in combination with the data and charts from the testing process.
To address the facts that non-salient features of the image are ignored in current image emotion classification and that the information loss caused by multi-scale feature fusion remains unsolved, on one hand the invention performs feature recalibration on the processed image and the feature maps of different scales through a coordinate attention mechanism, fuses the resulting attention feature maps, and mines the emotion features of the image both globally and locally, effectively improving the robustness of the model; on the other hand, the PAFPN pyramid network fuses image features of different scales, the ECA network strengthens the connections between channels, and the ASFF network adaptively fuses features of different scales to resolve the scale inconsistency in the fusion process, greatly reducing information loss.
Comparative experiments provided by the embodiment of the invention: the invention was compared with prior methods, both traditional and deep-learning based; the experimental results are shown in Table 1. Compared with traditional methods based on low-level features and their combinations, the invention achieves higher accuracy; compared with commonly used depth models such as AlexNet and VGG, the improvement is significant, and the invention also performs well against other depth models of recent years.
Table 1 Comparative experiments
Table 2 Ablation experiments
From the ablation experiments in Table 2 it can be seen that both the MPCA and the MASEF module improve accuracy over the base network resnet-50, and the best result is obtained with the full model, verifying the effectiveness of the model.
Furthermore, fig. 6 shows the confusion matrices of the different modules of the invention. Compared with the plain resnet-50, the invention improves the recognition accuracy of the two emotions anger and fear very markedly, on the one hand because the multi-scale semantic fusion module captures the image information more comprehensively, and on the other hand because the emotion gain module mines non-salient features. In everyday life anger and fear appear relatively rarely and most people perceive these two emotions rather vaguely; the corresponding images mostly lack a clear subject, so foreground and background are easily confused, and the non-salient feature mining of the emotion gain module can capture exactly this kind of emotion information.
Visualization experiment:
As shown in fig. 7, a visualization experiment was performed to further verify the effectiveness of the invention. The result images show that the emotion features extracted by the base network cannot accurately represent the emotion of the image, whereas the feature map output after the multi-scale semantic network more accurately attends to the roller-coaster track in the image; in addition, to acquire the emotion features of the non-salient regions, the emotion gain network pays more attention to background regions such as the sky and the house in the foreground. Finally, the outputs of the two networks are superimposed to obtain the final emotion feature map.
Based on the image emotion classification method with multi-scale semantic fusion under emotion gain, the following two application embodiments and their implementation schemes are constructed:
example 1: emotion classification in movie episodes
Background: to help movie producers understand viewers' emotional responses to a film, the method can be used to classify the emotions conveyed by its stills.
1. Image preprocessing: thousands of stills are randomly selected from films, and each image is randomly scaled and cropped to a size of 448 × 448 × 3.
2. Depth network model:
the published pre-trained resnet-50 was used as the base model.
According to the characteristics of movie stills, emotion gain processing is applied to attributes such as darkness/brightness and color to enhance the emotional information in the images.
The pyramid network and the ASEF network are combined to capture emotional features at different scales in the stills.
3. Loss function design: a cross entropy loss is designed to ensure clear inter-class distances between stills of different emotions.
4. Model training: the preprocessed stills are fed into the model, which is optimized with stochastic gradient descent; a minimal training sketch is given after this list.
5. Emotion classification: new stills are classified by emotion, providing producers with a reference for the audience's emotional response.
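A minimal PyTorch sketch of steps 1, 3 and 4 of this embodiment follows; the pretrained resnet-50 from torchvision stands in for the full model (the emotion gain and multi-scale fusion branches are omitted), and the dataset folder name "stills/" is a hypothetical placeholder with one sub-folder per emotion class.

```python
# A minimal training sketch for this embodiment (assumptions noted above).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(448),   # step 1: random scaling and cropping to 448 x 448
    transforms.ToTensor(),               # 448 x 448 x 3 image tensor
])

train_set = datasets.ImageFolder("stills/", transform=preprocess)
loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)  # step 2: base model
model.fc = nn.Linear(model.fc.in_features, 8)   # eight emotion classes

criterion = nn.CrossEntropyLoss()                                        # step 3: cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # step 4: SGD

model.train()
for epoch in range(30):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```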
Example 2: social media image emotion analysis
Background: images on social media contain a large amount of emotional information. To help brands or businesses understand consumers' emotional feedback on their products or services, the method can be applied as follows.
1. Image preprocessing: images published by relevant users are downloaded from the social media platform and randomly scaled and cropped to a size of 448 × 448 × 3.
2. Depth network model:
the published pre-trained resnet-50 was used as the base model.
Emotion gain processing is applied to the social media images to mine the emotional information in them.
The pyramid network and the ASEF network are used to capture multi-scale emotional features across different social scenes.
3. Loss function design: a cross entropy loss is designed to ensure clear inter-class distances between social media images of different emotions.
4. Model training: the preprocessed social media images are fed into the model, which is optimized with stochastic gradient descent.
5. Emotion classification: new social media images are classified by emotion, providing brands or businesses with feedback on consumer sentiment.
These two embodiments illustrate the practical application value of the emotion classification method in different scenarios.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of the two. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special-purpose hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer-executable instructions and/or embodied in processor control code, provided for example on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices, by software executed by various types of processors, or by a combination of such hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto; any modification, equivalent replacement, or improvement that a person skilled in the art can readily conceive within the spirit and principles of the present invention shall fall within the scope of the present invention.
Claims (8)
1. An image emotion classification method for multi-scale semantic fusion under emotion gain, characterized by comprising the following steps:
S1: image preprocessing: dividing an FI image emotion data set P into a training set T and a test set M, and randomly scaling and cropping the images of the training set T to finally obtain an image set T' of size 448 × 448 × 3;
S2: establishing a depth network model: selecting resnet-50 as the base network, taking the different-scale outputs of the resnet-50 as the inputs of the subsequent networks, and then constructing an emotion gain network and a multi-scale semantic fusion network in parallel; the emotion gain network obtains emotion gain feature maps of different regions and different scales through a dicing operation and feature recalibration; the multi-scale semantic fusion network obtains a deep multi-scale semantic fusion feature map by combining a pyramid network with an ASEF network, wherein the ASEF network is composed of an ECA network and an ASFF network and aims to strengthen the connections between channels before features of different scales are fused, thereby resolving the scale inconsistency in the fusion process; the two feature maps are then concatenated along the channel dimension to finally obtain the multi-scale semantic fusion feature map under emotion gain;
S3: designing a loss function: the constructed cross entropy loss keeps the distance between categories, so that images of different emotion categories are pushed further apart;
S4: training the model: inputting the image set T' of S1 into the depth network model of S2, training the model with an SGD optimizer, and learning the model parameters by computing the loss with the loss function of S3;
S5: obtaining the emotion categories of the image data set P to be tested: the test images of the test set M in S1 are scaled to a fixed size, center-cropped, and then input into the model trained in S4 to obtain the corresponding emotion categories.
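As an illustrative, non-limiting sketch of the structure described in S2, the following PyTorch code shows a resnet-50 backbone exposing its multi-scale outputs, two parallel branches, and the channel concatenation of their feature maps; EmotionGainNet and MultiScaleFusionNet are hypothetical placeholders (simple 1×1 projections of c5), not the networks detailed in claims 3 to 5.

```python
# Minimal structural sketch of S2 (placeholders stand in for claims 3-5).
import torch
import torch.nn as nn
from torchvision import models

class EmotionGainNet(nn.Module):
    """Placeholder for the emotion gain branch (claims 3-4)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(2048, 256, kernel_size=1)
    def forward(self, x, c3, c4, c5):
        return self.proj(c5)                         # stand-in for F

class MultiScaleFusionNet(nn.Module):
    """Placeholder for the multi-scale semantic fusion branch (claim 5)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(2048, 256, kernel_size=1)
    def forward(self, c2, c3, c4, c5):
        return self.proj(c5)                         # stand-in for F_mul

class EmotionClassifier(nn.Module):
    def __init__(self, num_classes: int = 8):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        self.emotion_gain = EmotionGainNet()
        self.multi_scale = MultiScaleFusionNet()
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(512, num_classes))

    def forward(self, x):                            # x: B x 3 x 448 x 448
        c2 = self.layer1(self.stem(x))               # B x 256 x 112 x 112
        c3 = self.layer2(c2)                         # B x 512 x 56 x 56
        c4 = self.layer3(c3)                         # B x 1024 x 28 x 28
        c5 = self.layer4(c4)                         # B x 2048 x 14 x 14
        f_gain = self.emotion_gain(x, c3, c4, c5)    # emotion gain feature map F
        f_mul = self.multi_scale(c2, c3, c4, c5)     # multi-scale fusion feature map F_mul
        f_out = torch.cat([f_gain, f_mul], dim=1)    # channel concatenation -> F_out
        return self.head(f_out)
```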
2. The image emotion classification method of multi-scale semantic fusion under emotion gain according to claim 1, wherein the FI image emotion data set P, the training set T, and the test set M in S1 all cover eight emotions, namely amusement, awe, contentment, excitement, anger, disgust, fear, and sadness.
3. The image emotion classification method of multi-scale semantic fusion under emotion gain according to claim 1, wherein the base network in S2 is obtained by transferring the convolutional layer groups of a resnet-50 pre-trained on the large-scale dataset ImageNet; after an image is input, four feature maps c2, c3, c4 and c5 of different scales are obtained, with spatial sizes of 112×112, 56×56, 28×28 and 14×14 respectively;
The emotion gain network in S2 is designed through the following steps:
S2.1.1: cutting each picture of the preprocessed image set T' into 64 blocks of size 56×56, stacking the blocks, and performing feature recalibration to obtain the region-level emotion weight feature map F_loc for each picture of the image set T';
S2.1.2: performing feature recalibration on the three different-scale feature maps c3, c4 and c5 of the resnet-50, and fusing the results to obtain the global emotion weight feature map F_glo;
S2.1.3: fusing the region-level emotion weight feature map F_loc obtained in S2.1.1 with the multi-scale emotion weight feature map F_glo obtained in S2.1.2 to finally obtain the emotion gain feature map F, with the formula:

F = F_{loc} \oplus F_{glo} \qquad (1)

where \oplus denotes the fusion of the two feature maps.
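As an illustrative, non-limiting sketch of the dicing and fusion above, the code below cuts a 448×448 input into the 64 blocks of 56×56 used in S2.1.1 and combines the region-level and global maps; the element-wise addition for formula (1) and the recalibrate / fuse_scales callables are assumptions, since the claim does not fix the fusion operator.

```python
# Minimal sketch of S2.1.1-S2.1.3 under the assumptions stated above.
import torch

def dice_into_blocks(images: torch.Tensor, block: int = 56) -> torch.Tensor:
    """images: B x 3 x 448 x 448 -> B x 64 x 3 x 56 x 56 (the 64 stacked cut blocks)."""
    b, c, _, _ = images.shape
    patches = images.unfold(2, block, block).unfold(3, block, block)  # B x C x 8 x 8 x 56 x 56
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()          # B x 8 x 8 x C x 56 x 56
    return patches.view(b, -1, c, block, block)                       # B x 64 x C x 56 x 56

def emotion_gain_feature(images, c3, c4, c5, recalibrate, fuse_scales):
    """S2.1.1-S2.1.3: region-level and global emotion weight maps, then fusion."""
    blocks = dice_into_blocks(images)                  # S2.1.1: dicing
    f_loc = recalibrate(blocks)                        # S2.1.1: region-level recalibration
    f_glo = fuse_scales(recalibrate(c3), recalibrate(c4), recalibrate(c5))  # S2.1.2
    return f_loc + f_glo                               # S2.1.3: formula (1), assumed addition
```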
4. the image emotion classification method of multi-scale semantic fusion under emotion gain of claim 3, wherein in S2.1.1, feature recalibration is performed by using a CA attention mechanism, and the specific steps are as follows:
S2.1.1.1: in order to capture long-range spatial interactions with precise positional information, an input of size C×H×W is average-pooled along the W direction and the H direction respectively, generating feature maps of sizes C×H×1 and C×1×W, with the formulas:

z_c^h(h) = \frac{1}{W} \sum_{0 \le j < W} x_c(h, j) \qquad (2)

z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \qquad (3)

where x denotes the input of size C×H×W, C denotes the number of channels, H the height and W the width; x_c(h, j) denotes the pixel of channel c at height h and width j, and x_c(j, w) denotes the pixel of channel c at height j and width w; z_c^h(h) denotes the average pooling of the input x along the W direction, i.e. the inputs along the W direction summed and divided by W, and z_c^w(w) denotes the average pooling along the H direction, i.e. the inputs along the H direction summed and divided by H;
S2.1.1.2: concatenating z^h and z^w, then applying a 1×1 convolution for dimensionality reduction followed by an activation to generate the feature map f, with the formula:

f = \sigma\bigl(F_1([z^h, z^w])\bigr) \qquad (4)

where σ denotes the sigmoid activation function, F_1 denotes a 1×1 convolution, and [z^h, z^w] denotes the concatenation of the average pooling results along the W and H directions obtained in S2.1.1.1;
S2.1.1.3: splitting f along the spatial dimension into f^h and f^w, applying a 1×1 convolution to each for dimensionality expansion, and combining with a sigmoid activation to obtain the final H-direction attention vector g^h and W-direction attention vector g^w, with the formulas:

g^h = \sigma\bigl(F_h(f^h)\bigr) \qquad (5)

g^w = \sigma\bigl(F_w(f^w)\bigr) \qquad (6)

where σ denotes the sigmoid activation function; f^h and f^w denote the results of the split and convolution operations in the H and W directions respectively; F_h and F_w denote the same convolution operation, with the subscripts h and w indicating the H and W directions;
S2.1.1.4: applying the attention vectors of S2.1.1.3 to the input to obtain the region-level emotion weight feature map F_loc, with the formula:

y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \qquad (7)

where x_c(i, j) denotes the input at position (i, j) of channel c, g_c^h(i) denotes the attention vector at position i in the H direction, and g_c^w(j) denotes the attention vector at position j in the W direction; applying this weighting at every position of the image yields F_loc.
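The recalibration of claim 4 follows the coordinate attention pattern of formulas (2)-(7); the sketch below is an illustrative, non-limiting implementation, assuming a channel reduction ratio of 32 in the intermediate 1×1 convolution, which the claim does not specify.

```python
# Minimal coordinate-attention sketch of formulas (2)-(7).
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)    # F_1: reduce dims, eq. (4)
        self.act = nn.Sigmoid()                                 # sigma in eq. (4)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)   # F_h, eq. (5)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)   # F_w, eq. (6)

    def forward(self, x):                                       # x: B x C x H x W
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # eq. (2): pool along W -> B x C x H x 1
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # eq. (3): pool along H -> B x C x W x 1
        f = self.act(self.conv1(torch.cat([z_h, z_w], dim=2)))  # eq. (4): concat, 1x1 conv, activation
        f_h, f_w = torch.split(f, [h, w], dim=2)                # split along the spatial dimension
        g_h = torch.sigmoid(self.conv_h(f_h))                   # eq. (5): B x C x H x 1
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # eq. (6): B x C x 1 x W
        return x * g_h * g_w                                    # eq. (7): recalibrated features
```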
5. The image emotion classification method of multi-scale semantic fusion under emotion gain according to claim 1, wherein the multi-scale semantic fusion network design step in S2 is as follows:
S2.2.1: using the four different-scale feature maps c2, c3, c4 and c5 of the resnet-50 as the inputs of a pyramid network, and performing top-down and bottom-up fusion with lateral connections, specifically comprising the following steps:
S2.2.1.1: adjusting the number of channels of c5 to 256 by a 1×1 convolution, the result being named p5;
S2.2.1.2: scaling p5 to 28×28 by upsampling, then fusing c4 and p5 to obtain p4;
S2.2.1.3: fusing c3 and p4 to obtain p3, and fusing c2 and p3 to obtain p2;
S2.2.1.4: adjusting the number of channels of p2 to 256 by a 1×1 convolution, the result being named n2;
S2.2.1.5: scaling n2 to 56×56 by downsampling, then fusing p3 and n2 to obtain n3;
S2.2.1.6: fusing p4 and n3 to obtain n4, and fusing p5 and n4 to obtain n5;
S2.2.2: the outputs of the pyramid network first pass through an ECA attention mechanism to strengthen the connections between feature map channels, then the fusion weights of the different-scale feature maps are adaptively adjusted by the adaptive spatial feature fusion network ASFF, finally yielding the multi-scale semantic fusion feature map; the specific steps are as follows:
S2.2.2.1: passing n3, n4 and n5 obtained in S2.2.1 through a global average pooling layer followed by a fast one-dimensional convolution of kernel size k to perform local cross-channel interaction, obtaining the attention weights through a sigmoid activation, and finally obtaining e3, e4 and e5; the specific formulas are:

g(x) = \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} x_{ij} \qquad (8)

w = \sigma\bigl(C1D_k(g(x))\bigr) \qquad (9)

y(x) = w \cdot x \qquad (10)

where g(x) denotes global average pooling over the W and H directions, i.e. the sum of the input x over all positions divided by WH; σ denotes the sigmoid activation function; C1D_k denotes a one-dimensional convolution with kernel size k; w denotes the attention weights; and y(x) denotes the feature enhancement map obtained by multiplying the input features by the attention weights;
S2.2.2.2: performing feature fusion on e3, e4 and e5 obtained in S2.2.2.1, taking each of them in turn as the reference, to finally obtain the adaptive spatial fusion features F_1, F_2 and F_3; e3, e4 and e5 are renamed x1, x2 and x3; when x1 is the reference, x3 must be adjusted to match x1: a 1×1 convolution first adjusts its number of channels to that of x1, and interpolation then scales it to the same size; the same operation is applied to x2, so that x1, x2 and x3 have the same number of channels and the same scale, after which feature fusion is performed with the formula:

F_l = \alpha_l \cdot x^{1 \to l} + \beta_l \cdot x^{2 \to l} + \gamma_l \cdot x^{3 \to l} \qquad (11)

where l denotes the selected reference level and takes the values 1, 2 and 3; x^{1→l}, x^{2→l} and x^{3→l} denote the results of scaling and resampling x1, x2 and x3 to the same scale as x^l; α_l, β_l and γ_l denote the weight parameters of x^{1→l}, x^{2→l} and x^{3→l} respectively; and F_l denotes the result of multiplying the different-scale features by their corresponding weights α_l, β_l and γ_l and summing them;

the weight parameters α_l, β_l and γ_l are obtained from the rescaled features x1 to x3 by 1×1 convolutions; after concatenation they are passed through a softmax activation function so that they all lie in the range [0, 1] and sum to 1;
S2.2.2.3: splicing and fusing the three adaptive spatial fusion features F_1, F_2 and F_3 obtained in S2.2.2.2; specifically, taking the scale of F_3 as the reference, F_1 and F_2 are upsampled so that F_1, F_2 and F_3 have the same scale, and the three features are concatenated along the channel dimension to obtain the final multi-scale semantic fusion feature map, with the formula:

F_{mul} = \mathrm{Concat}(F_1, F_2, F_3) \qquad (12)

where F_l denotes the adaptive spatial fusion feature obtained by multiplying the different-scale features by their corresponding weights α_l, β_l and γ_l, and Concat(·) denotes the channel-wise concatenation of the three adaptive spatial fusion features F_1, F_2 and F_3, the result being the multi-scale semantic fusion feature map F_mul.
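As an illustrative, non-limiting sketch of S2.2.2, the code below implements the ECA weighting of formulas (8)-(10) and a single-level adaptive fusion corresponding to formula (11); a kernel size k = 3 and channel-matched 256-channel inputs are assumptions. Applying ASFFLevel three times, once with each of e3, e4 and e5 as the reference, yields the F_1, F_2 and F_3 that formula (12) concatenates.

```python
# Minimal sketch of the ECA attention (formulas (8)-(10)) and ASFF-style fusion (formula (11)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECA(nn.Module):
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                        # x: B x C x H x W
        g = x.mean(dim=(2, 3))                                   # eq. (8): global average pooling -> B x C
        w = torch.sigmoid(self.conv(g.unsqueeze(1))).squeeze(1)  # eq. (9): 1D conv + sigmoid
        return x * w.unsqueeze(-1).unsqueeze(-1)                 # eq. (10): channel-weighted features

class ASFFLevel(nn.Module):
    """Adaptive spatial feature fusion at one reference level (formula (11))."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.weight_convs = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(3)])

    def forward(self, x1, x2, x3):                               # inputs already channel-matched
        ref = x1.shape[-2:]                                      # fuse at the scale of x1
        feats = [x1,
                 F.interpolate(x2, size=ref, mode="nearest"),
                 F.interpolate(x3, size=ref, mode="nearest")]
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1)
        alpha = torch.softmax(logits, dim=1)                     # weights in [0, 1], summing to 1
        return sum(alpha[:, i:i + 1] * feats[i] for i in range(3))  # eq. (11)
```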
6. The image emotion classification method for multi-scale semantic fusion under emotion gain according to claim 1, wherein the multi-scale semantic fusion feature map fusion method under emotion gain in S2 is designed as follows:
fusion of the emotion gain feature map and the multi-scale semantic fusion feature map is realized by a fusion method of channel splicing;
s2, final output, wherein the formula is as follows:
F_{out} = \mathrm{Concat}(F, F_{mul}) \qquad (13)

where F is the emotion gain feature map, F_mul is the multi-scale semantic fusion feature map, and F_out is the multi-scale semantic fusion feature map under emotion gain.
7. The image emotion classification method of multi-scale semantic fusion under emotion gain according to claim 1, wherein a cross entropy loss function formula constructed in S3 is as follows:
L_{fwcls} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} \mathbb{1}(y_i = j) \log p_{ij} \qquad (14)

where 1(y_i = j) = 1 when the i-th image is correctly classified into class j and 1(y_i = j) = 0 otherwise; N denotes the number of pictures in the dataset; C denotes the emotion categories involved; p_{ij} denotes the probability that the i-th image is judged to belong to class j; and L_fwcls is the cross entropy loss.
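As an illustrative sketch of formula (14), the function below computes the mean cross entropy over N images, assuming the model outputs raw logits from which the probabilities p_ij are obtained by a softmax.

```python
# Minimal sketch of formula (14): mean cross entropy over N images and C emotion classes.
import torch
import torch.nn.functional as F

def emotion_cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: N x C raw scores, labels: N integer class indices in [0, C)."""
    log_p = F.log_softmax(logits, dim=1)                       # log p_ij
    picked = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)   # sum_j 1(y_i = j) log p_ij
    return -picked.mean()                                      # L_fwcls = -(1/N) * sum_i (...)
```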
8. An image emotion classification system for multi-scale semantic fusion under emotion gain, characterized by comprising computer means configured to perform the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410071984.6A CN117593593B (en) | 2024-01-18 | 2024-01-18 | Image emotion classification method for multi-scale semantic fusion under emotion gain |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117593593A CN117593593A (en) | 2024-02-23 |
CN117593593B true CN117593593B (en) | 2024-04-09 |
Family
ID=89918694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410071984.6A Active CN117593593B (en) | 2024-01-18 | 2024-01-18 | Image emotion classification method for multi-scale semantic fusion under emotion gain |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117593593B (en) |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113017630B (en) * | 2021-03-02 | 2022-06-24 | 贵阳像树岭科技有限公司 | Visual perception emotion recognition method |
- 2024-01-18 CN CN202410071984.6A patent/CN117593593B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112163602A (en) * | 2020-09-14 | 2021-01-01 | 湖北工业大学 | Target detection method based on deep neural network |
CN114170657A (en) * | 2021-11-30 | 2022-03-11 | 西安理工大学 | Facial emotion recognition method integrating attention mechanism and high-order feature representation |
CN114170411A (en) * | 2021-12-06 | 2022-03-11 | 国能大渡河大岗山发电有限公司 | Picture emotion recognition method integrating multi-scale information |
CN114625908A (en) * | 2022-03-24 | 2022-06-14 | 电子科技大学成都学院 | Text expression package emotion analysis method and system based on multi-channel attention mechanism |
CN114722202A (en) * | 2022-04-08 | 2022-07-08 | 湖北工业大学 | Multi-modal emotion classification method and system based on bidirectional double-layer attention LSTM network |
CN115410254A (en) * | 2022-08-26 | 2022-11-29 | 大连民族大学 | Multi-feature expression recognition method based on deep learning |
CN115995029A (en) * | 2022-12-27 | 2023-04-21 | 杭州电子科技大学 | Image emotion analysis method based on bidirectional connection |
CN115966010A (en) * | 2023-02-07 | 2023-04-14 | 南京邮电大学 | Expression recognition method based on attention and multi-scale feature fusion |
DE202023102803U1 (en) * | 2023-05-22 | 2023-07-17 | Pradeep Bedi | System for emotion detection and mood analysis through machine learning |
Non-Patent Citations (2)
Title |
---|
《Multi-scale features enhanced sentiment region discovery for visual sentiment analysis》; Haiwei Wu et al.; 《ICGIP 2021》; 20220216; full text *
《Video Emotion Recognition Based on Deep Learning (基于深度学习的视频情感识别)》; Hao Peijun; 《China Master's Theses Full-text Database, Information Science and Technology》; 20220615; full text *
Also Published As
Publication number | Publication date |
---|---|
CN117593593A (en) | 2024-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110163299B (en) | Visual question-answering method based on bottom-up attention mechanism and memory network | |
CN108229478A (en) | Image, semantic segmentation and training method and device, electronic equipment, storage medium and program | |
CN108960036A (en) | 3 D human body attitude prediction method, apparatus, medium and equipment | |
CN110851760B (en) | Human-computer interaction system for integrating visual question answering in web3D environment | |
CN110728219A (en) | 3D face generation method based on multi-column multi-scale graph convolution neural network | |
Ludl et al. | Enhancing data-driven algorithms for human pose estimation and action recognition through simulation | |
CN113239916B (en) | Expression recognition and classroom state evaluation method, device and medium | |
CN111062329B (en) | Unsupervised pedestrian re-identification method based on augmented network | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
CN111767883A (en) | Title correction method and device | |
CN113486700A (en) | Facial expression analysis method based on attention mechanism in teaching scene | |
Ververas et al. | Slidergan: Synthesizing expressive face images by sliding 3d blendshape parameters | |
CN116229319A (en) | Multi-scale feature fusion class behavior detection method and system | |
CN111652864A (en) | Casting defect image generation method for generating countermeasure network based on conditional expression | |
Pérez-Benito et al. | Smoothing vs. sharpening of colour images: Together or separated | |
CN114898284B (en) | Crowd counting method based on feature pyramid local difference attention mechanism | |
CN105913377A (en) | Image splicing method for reserving image correlation information | |
CN115240259A (en) | Face detection method and face detection system based on YOLO deep network in classroom environment | |
Yang et al. | Student Classroom Behavior Detection Based on YOLOv7+ BRA and Multi-model Fusion | |
CN117593593B (en) | Image emotion classification method for multi-scale semantic fusion under emotion gain | |
CN113066074A (en) | Visual saliency prediction method based on binocular parallax offset fusion | |
Malavath et al. | Natya Shastra: Deep Learning for Automatic Classification of Hand Mudra in Indian Classical Dance Videos. | |
Sra et al. | Deepspace: Mood-based image texture generation for virtual reality from music | |
CN113743315A (en) | Handwritten elementary mathematical formula recognition method based on structure enhancement | |
CN114220135A (en) | Method, system, medium and device for recognizing attention and expression of human face in teaching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |