CN111985532A - Scene-level context-aware emotion recognition deep network method - Google Patents
Scene-level context-aware emotion recognition deep network method
- Publication number
- CN111985532A (application CN202010664287.3A)
- Authority
- CN
- China
- Prior art keywords
- body part
- emotion
- context
- network
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a scene-level context-aware emotion recognition deep network method. A training sample set X_in is read and, from its body annotation values and original emotion annotation values, a body part image set X_B is obtained. After normalization, the body part images are fed into an upper-layer convolutional neural network and the context images into a lower-layer convolutional neural network to extract the body part emotional features T_F and the contextual emotional features T_C. T_F and T_C are then fed into the upper and lower adaptive layers to obtain the fusion weights λ_F and λ_C, and T_F, T_C, λ_F and λ_C are fused into the emotion fusion feature T_A. T_A is linearly mapped by a fully connected layer to initial predicted values of arousal and valence; the loss between these predictions and the original emotion annotation values is measured and gradually converges, and training is completed to obtain the network model. The test sample set is processed and fed into the network model to obtain the predicted label values of the test sample set X_tn. When fusing features, the proposed method takes into account how strongly features of different attributes influence human emotion, enriching image-based emotion recognition research while improving the prediction performance of the model.
Description
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a scene-level context-aware emotion recognition deep network method.
Background
Emotion is an essential form through which a person expresses his or her feelings. Understanding and recognizing a person's emotions from the actual scene in which the person is situated in daily life helps to perceive their mental state, predict their behavior, and interact effectively. In the 1990s, the concept of affective computing was proposed by the MIT Media Lab, and scientists have since worked on converting complex human emotions into numerical information recognizable by computers so as to better realize human-computer interaction and make computers more intelligent; this has become one of the key problems to be solved in the era of artificial intelligence.
Traditionally, emotion recognition for static images has mainly been studied on face images: emotional features are extracted with predefined feature extraction methods and fed into a classifier (or regressor) for model training, finally yielding an emotion prediction. However, emotion recognition based on face images is easily affected by the natural environment and by sample characteristics such as pose, illumination, and inter-face differences.
According to psychological research, facial expressions convey only about 55% of the emotional information in visual communication. In everyday emotional communication, a person's inner emotion can be estimated not only from the facial expression of the target person but also from a series of rich contextual cues such as body movement, interaction with other people, and the surrounding scene; even in the extreme case where the face cannot be detected, the emotion of the subject can still be estimated from the abundant context information.
In recent years, complex emotion recognition methods based on deep convolutional networks have attracted attention: the network learns and analyzes emotional features by itself instead of relying on traditionally hand-crafted definitions. However, current deep learning methods mainly perform emotion analysis on face images, lack a comprehensive treatment of people in the complex situations of natural scenes, and do not consider the influence of scene-level context information on recognizing a person's emotion. Meanwhile, the way in which features of different attributes are fused has not been studied sufficiently, and existing models ignore the differing contributions of such features to emotional state recognition.
Disclosure of Invention
The invention aims to provide a scene-level context-aware emotion recognition deep network method that addresses the limitations of the prior art: the scope of emotion analysis on static images is narrow, only face images are targeted, and emotion recognition is carried out by directly concatenating features of different attributes.
The technical scheme adopted by the invention is a scene-level context-aware emotion recognition deep network method, which specifically comprises the following steps:
Step 1, collect images and determine a training sample set X_in and a test sample set X_tn;
Step 2, read the training sample set X_in and extract the body part of each sample according to its body annotation value to obtain a body part image set X_B;
Step 3, normalize the training sample set X_in within the set to obtain a context emotion image set X_im; normalize the body part image set X_B within the set to obtain a normalized body part image set X_body;
Step 4, feed the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional features T_F, and feed the context emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional features T_C;
Step 5, feed the body part emotional features T_F and the scene-level contextual emotional features T_C into the upper and lower adaptive layers, respectively, for adaptive feature learning; the upper adaptive layer outputs the body part fusion weight λ_F and the lower adaptive layer outputs the context fusion weight λ_C;
Step 6, perform weighted fusion of the body part emotional features T_F and the scene-level contextual emotional features T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain the emotion fusion feature T_A that incorporates the context information; T_A is then linearly mapped by a fully connected layer to initial predicted values of arousal and valence. A KL divergence loss function measures the loss between these initial predictions and the corresponding original emotion annotation values; the loss is back-propagated through the network and, over multiple iterations, the network weights are updated so that the loss gradually decreases and the algorithm converges, completing training and yielding the network model;
Step 7, extract the body part of each test sample in the test sample set X_tn according to Step 2 to obtain a test body part image set X_tB; then, following Step 3, normalize the test sample set X_tn and the test body part image set X_tB, feed them into the network model obtained in Step 6, and finally obtain the predicted label values of the test sample set X_tn.
The present invention is also characterized in that,
The specific steps for extracting the body part from the training sample set X_in in Step 2 are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corners of the region containing the body part, and compute the position and size parameters of the body part by formula (1):
B_w = B_x2 − B_x1,  B_h = B_y2 − B_y1        (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image;
Step 2.2, according to the position and size parameters obtained in Step 2.1, crop each sample of the training sample set X_in to obtain the body part image set X_B.
The formula for the intra-set normalization of the training sample set X_in in Step 3 is as follows:
X_im = (X_in − x_mean) / σ        (2)
In formula (2), X_in is the training sample set, X_im is the context emotion image set, σ is the standard deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) Σ_{i=1}^{n} x_i        (3)
σ = sqrt( (1/n) Σ_{i=1}^{n} (x_i − x_mean)^2 )        (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
The formula for the intra-set normalization of the body part image set X_B in Step 3 is as follows:
X_body = (X_B − x'_mean) / σ'        (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) Σ_{i=1}^{n} x'_i        (6)
σ' = sqrt( (1/n) Σ_{i=1}^{n} (x'_i − x'_mean)^2 )        (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
In Step 4, the upper-layer and lower-layer convolutional neural networks have identical structural parameters, and both adopt the VGG16 architecture.
In Step 4, the body part emotional features T_F and the contextual emotional features T_C are computed as follows:
T_F = F(X_body, W_F)        (8)
T_C = F(X_im, W_C)        (9)
In formula (8), W_F represents all parameters of the convolutional and pooling layers of the upper-layer convolutional neural network; in formula (9), W_C represents all parameters of the convolutional and pooling layers of the lower-layer convolutional neural network; F denotes the convolution and pooling operations of the feature extraction network.
In Step 5, the body part fusion weight λ_F and the context fusion weight λ_C are computed as follows:
λ_F = F(T_F, W_D)        (10)
λ_C = F(T_C, W_E)        (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer; and λ_F + λ_C = 1.
In Step 5, the network structures of the upper adaptive layer and the lower adaptive layer are completely the same, with the following specific network architecture parameters:
the upper adaptive layer and the lower adaptive layer respectively comprise a maximum pooling layer, two convolution layers and a Softmax layer.
In Step 6, the formula for the weighted fusion of the body part emotional features T_F and the scene-level contextual features T_C with the body part fusion weight λ_F and the context fusion weight λ_C is as follows:
T_A = (λ_F ⊗ T_F) ∏ (λ_C ⊗ T_C)        (12)
In formula (12), T_A is the emotion fusion feature obtained after weighting, ∏ represents the connection (concatenation) operator that splices the weighted body part emotional features and the weighted scene-level contextual emotional features, and ⊗ represents the convolution operation between each attribute feature and its fusion weight.
The invention has the following beneficial effects. The invention discloses a scene-level context-aware emotion recognition deep network method and provides a two-stage context-aware emotion recognition network. On the one hand, adopting the two-stage context emotion recognition network addresses the practical shortcoming that existing image-based emotion recognition tasks mainly target face image data; on the other hand, the degree to which features of different attributes influence human emotion is fully considered during feature fusion, enriching image-based emotion recognition research while improving the prediction performance of the model.
Drawings
FIG. 1 is an overall flow diagram of a scene level context aware emotion recognition deep network method of the present invention;
FIG. 2 is a diagram showing a complex emotion image and emotion dimension labeling information thereof;
FIG. 3 is a schematic diagram of a convolution operation;
FIG. 4 is a schematic view of the expansion of the receptive field by small convolution kernel stacking;
FIG. 5 is a schematic view of a pooling operation.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses a scene level context-aware emotion recognition deep network method, which has the specific process shown in figure 1 and specifically comprises the following steps:
Step 1, collect images and determine a training sample set X_in and a test sample set X_tn;
Each training sample and each test sample has a corresponding original emotion annotation value and body annotation value.
For the training sample set X_in, the original emotion annotation is the n × 2-dimensional vector y = [(a_1, v_1), (a_2, v_2), ..., (a_n, v_n)], where (a_1, v_1) are the arousal and valence labels of the 1st sample in the training sample set X_in and (a_n, v_n) are the arousal and valence labels of the n-th sample; the body part annotation is an n × 4-dimensional vector whose i-th entry is the body annotation (B_x1, B_y1, B_x2, B_y2) of the i-th sample, from which the body part image set X_B is obtained.
For the test sample set X_tn, the original emotion annotation value is the m × 2-dimensional vector ty = [(ta_1, tv_1), (ta_2, tv_2), ..., (ta_m, tv_m)], the body part annotation is an m × 4-dimensional vector, and m represents the number of test samples.
Step 2, read the training sample set X_in and extract the body part of each sample according to its body annotation value to obtain a body part image set X_B;
The specific steps for extracting the body part from the training sample set X_in are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corners of the region containing the body part, and compute the position and size parameters of the body part by formula (1):
B_w = B_x2 − B_x1,  B_h = B_y2 − B_y1        (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image.
Step 2.2, according to the position and size parameters obtained in Step 2.1, crop each training sample of the training sample set X_in to obtain the body part image set X_B.
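For illustration only, the cropping in Step 2 can be sketched in Python as follows; the function name, the corner ordering, and the clamping to the image bounds are assumptions of this example rather than details taken from the patent.

```python
# Minimal sketch of the Step 2 body-part cropping, assuming the body annotation
# (Bx1, By1, Bx2, By2) gives two diagonally opposite corners in pixel coordinates.
import numpy as np

def crop_body_part(image: np.ndarray, box):
    """Crop the body region from an H x W x C image given (Bx1, By1, Bx2, By2)."""
    bx1, by1, bx2, by2 = box
    # Order the corners so the crop works regardless of which corner comes first.
    x1, x2 = sorted((int(bx1), int(bx2)))
    y1, y2 = sorted((int(by1), int(by2)))
    # Clamp to the image bounds to avoid empty or out-of-range slices (assumption).
    h, w = image.shape[:2]
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    return image[y1:y2, x1:x2]  # width B_w = x2 - x1, height B_h = y2 - y1

# Building X_B from the training images and their body annotations:
# X_B = [crop_body_part(x, b) for x, b in zip(X_in_images, body_boxes)]
```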
Step 3, normalize the training sample set X_in within the set to obtain a context emotion image set X_im; normalize the body part image set X_B within the set to obtain a normalized body part image set X_body;
The formula for the intra-set normalization of the training sample set X_in is as follows:
X_im = (X_in − x_mean) / σ        (2)
In formula (2), X_in is the training sample set, X_im is the context emotion image set, σ is the standard deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) Σ_{i=1}^{n} x_i        (3)
σ = sqrt( (1/n) Σ_{i=1}^{n} (x_i − x_mean)^2 )        (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
The formula for the intra-set normalization of the body part image set X_B is as follows:
X_body = (X_B − x'_mean) / σ'        (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) Σ_{i=1}^{n} x'_i        (6)
σ' = sqrt( (1/n) Σ_{i=1}^{n} (x'_i − x'_mean)^2 )        (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
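The per-set normalization of formulas (2)–(7) can be sketched as follows, assuming each image set has already been resized to a common size and stacked into one float array; the small epsilon added to the denominator is an assumption of the example to avoid division by zero.

```python
# Minimal sketch of the intra-set normalization of Step 3 for a stacked image
# set of shape (n, H, W, C); the stacking and eps term are assumptions.
import numpy as np

def normalize_set(images: np.ndarray, eps: float = 1e-8):
    """Return (normalized set, mean image, std image) for a stacked image set."""
    x_mean = images.mean(axis=0)                      # mean image, formula (3)/(6)
    sigma = images.std(axis=0)                        # standard-deviation image, formula (4)/(7)
    normalized = (images - x_mean) / (sigma + eps)    # formula (2)/(5)
    return normalized, x_mean, sigma

# X_im,   mean_in, std_in = normalize_set(X_in_stacked)
# X_body, mean_b,  std_b  = normalize_set(X_B_stacked)
```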
Step 4, feed the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional features T_F, and feed the context emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional features T_C;
Step 4.1, initialize the parameters of the whole network architecture, including all convolutional layers, pooling layers and fully connected layers in the network; the weights of each layer are initialized from a Gaussian distribution with mean 0 and standard deviation 1, and the bias terms are uniformly initialized to 0.001;
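A sketch of this initialization in PyTorch might look as follows; the use of PyTorch and of nn.Conv2d/nn.Linear modules is an assumption of the example, not stated in the patent.

```python
# Sketch of the Step 4.1 initialization: Gaussian(0, 1) weights, bias 0.001.
import torch.nn as nn

def init_weights(module: nn.Module):
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=1.0)   # Gaussian, mean 0, std 1
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.001)            # biases uniformly set to 0.001

# model.apply(init_weights)
```

In practice a standard deviation of 1 is unusually large for deep networks; the value above simply mirrors the initialization stated in Step 4.1.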
Step 4.2, feed the body part image set X_body into the upper-layer convolutional neural network and feed the context emotion image set X_im into the lower-layer convolutional neural network; the upper-layer and lower-layer convolutional neural network models have the same structure and both adopt the VGG16 network architecture, whose parameters are shown in Table 1 below:
Table 1: Emotion feature extraction convolutional network architecture parameters
As can be seen from the network architecture parameters in Table 1, the five convolutional layers C1, C2, C3, C4, and C5 in the network structure have 64, 128, 256, 512, and 512 feature maps, respectively. Each feature map is formed by convolving the input image or the output X_m of the previous layer with the corresponding number of convolution templates K_uv and adding a bias term b_v; the convolution process is shown in FIG. 3, and the feature map is computed as:
X_v = Σ_m X_m ⊛ K_uv + b_v        (13)
In formula (13), u takes values in {1, 2, 3, 4, 5} and indexes the convolutional layers, v indexes the convolution templates of each layer (64, 128, 256, 512, and 512, respectively), and ⊛ denotes a convolution operation with step size 1. The convolution kernels are all of size 3 × 3; stacking small convolution kernels enlarges the receptive field of the convolutional layers while effectively reducing the number of parameters, and a schematic diagram of the receptive field is shown in FIG. 4.
The pooling layers S1, S2, S3, and S4 down-sample the outputs of the corresponding convolutional layers by max pooling; in the present invention the pooling region size is 2 × 2 with a step size of 2, and the pooling process is shown in FIG. 5. For example, a 2 × 2 region of the 1st feature map X_m of convolutional layer C1 is sampled to produce the first output O_1 of the 1st feature map of pooling layer S1, where the sampling takes the maximum value in the 2 × 2 region; the other outputs are obtained similarly, and the horizontal and vertical spatial resolutions after sampling become 1/2 of the original.
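A small NumPy illustration of 2 × 2 max pooling with step size 2 on a single 4 × 4 feature map, matching the description above; the sample values are invented purely for the example.

```python
# Each 2 x 2 region is replaced by its maximum, halving the spatial resolution.
import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 2],
                        [2, 2, 3, 4]], dtype=float)

pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[4. 2.]
                #  [2. 5.]]
```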
Step 4.3, after the normalized body part image set X_body and the context emotion image set X_im pass through the upper-layer and lower-layer convolutional neural networks, respectively, the body part emotional features T_F and the scene-level contextual emotional features T_C are obtained; the calculation process can be expressed by the following formulas:
T_F = F(X_body, W_F)        (8)
T_C = F(X_im, W_C)        (9)
In formula (8), W_F represents the network parameters of the upper layer related to body part emotional feature extraction; in formula (9), W_C represents the network parameters of the lower layer related to scene-level contextual feature extraction; F denotes the convolution and pooling operations of the feature extraction network;
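A minimal two-stream sketch of Step 4 in PyTorch, using torchvision's VGG16 convolutional trunk for both streams; the choice of torchvision and of a 224 × 224 input size are assumptions of this example.

```python
# Two-stream feature extraction corresponding to formulas (8) and (9).
import torch
import torch.nn as nn
from torchvision.models import vgg16

class TwoStreamExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # Upper stream: body-part emotional features T_F (parameters W_F).
        self.upper = vgg16().features
        # Lower stream: scene-level contextual features T_C (parameters W_C).
        self.lower = vgg16().features

    def forward(self, x_body, x_im):
        t_f = self.upper(x_body)   # T_F = F(X_body, W_F), formula (8)
        t_c = self.lower(x_im)     # T_C = F(X_im, W_C), formula (9)
        return t_f, t_c

# t_f, t_c = TwoStreamExtractor()(torch.randn(1, 3, 224, 224),
#                                 torch.randn(1, 3, 224, 224))
```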
step 5, emotional characteristics T of the body partFContext and context sentiment feature T at scene levelCRespectively sending into the upper adaptive layer and the lower adaptive layer for adaptive weight learning, and outputting the fusion weight lambda of the body part by the upper adaptive layerFThe adaptive layer output context fusion weight λ of the lower layerC;
The adaptive layer network structures of the upper and lower layers are completely the same; each comprises a max pooling layer, two convolutional layers and a Softmax layer, and the overall structural parameters are shown in Table 2 below:
Table 2: Adaptive fusion network architecture parameters
Finally, the body part fusion weight λ_F and the context fusion weight λ_C are output through the Softmax layer; the calculation process is as follows:
λ_F = F(T_F, W_D)        (10)
λ_C = F(T_C, W_E)        (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer. The final Softmax layer of the adaptive network constrains the fusion weights so that λ_F + λ_C = 1.
Step 6, perform weighted fusion of the body part emotional features T_F and the scene-level contextual emotional features T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain the emotion fusion feature T_A that incorporates the context information; T_A is then linearly mapped by a fully connected layer to initial predicted values of arousal and valence. A KL divergence loss function measures the loss between these initial predictions and the corresponding original emotion annotation values; the loss is back-propagated through the network and, over multiple iterations, the network weights are updated so that the loss gradually decreases and the algorithm converges, completing training and yielding the network model.
Step 6.1, perform weighted fusion of the body part emotional features T_F and the scene-level contextual emotional features T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain the emotion fusion feature T_A; the expression is as follows:
T_A = (λ_F ⊗ T_F) ∏ (λ_C ⊗ T_C)        (12)
In formula (12), ∏ represents the connection (concatenation) operator, meaning that the weighted body part emotional features and the weighted scene-level contextual emotional features are spliced together, and ⊗ represents the convolution operation between each attribute feature and its fusion weight;
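The adaptive weighting of Step 5 and the fusion of Step 6.1 can be sketched as follows. Each branch follows the structure named above (one max-pooling layer, two convolutional layers, a Softmax constraint), but the kernel sizes, channel counts, intermediate activation, and the reduction of each branch to a scalar score are assumptions of this example; the shared Softmax enforces λ_F + λ_C = 1, and the weighted features are concatenated as in formula (12).

```python
# Sketch of the adaptive fusion: branch scores -> softmax weights -> weighted concat.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.MaxPool2d(2),                                     # max pooling layer
                nn.Conv2d(channels, 128, kernel_size=3, padding=1),  # first conv layer
                nn.ReLU(inplace=True),                               # activation (assumption)
                nn.Conv2d(128, 1, kernel_size=1),                    # second conv layer -> score map
                nn.AdaptiveAvgPool2d(1),                             # reduce to one score (assumption)
            )
        self.upper = branch()   # produces the body-part score (parameters W_D)
        self.lower = branch()   # produces the context score (parameters W_E)

    def forward(self, t_f, t_c):
        scores = torch.cat([self.upper(t_f).flatten(1),
                            self.lower(t_c).flatten(1)], dim=1)      # shape (N, 2)
        lam = torch.softmax(scores, dim=1)                           # lambda_F + lambda_C = 1
        lam_f, lam_c = lam[:, 0:1, None, None], lam[:, 1:2, None, None]
        # Weight each feature map and splice the two streams together (formula (12)).
        return torch.cat([lam_f * t_f, lam_c * t_c], dim=1)

# t_a = AdaptiveFusion()(t_f, t_c)
```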
Step 6.2, send the fused feature T_A to the fully connected layers for processing; because the predicted values are continuous, the last fully connected layer uses a linear activation function. The fully connected layer parameters are listed in the following table:
Fully connected layer parameter table
Step 6.3, linearly map the final 256-dimensional emotion feature to the 2-dimensional predicted label values (arousal and valence) through the fully connected layer Fc10; KL divergence is adopted as the loss function to measure the loss between the predicted label values and the original label values, the loss is back-propagated through the network over 80 iterations, the network weights are updated, the loss gradually decreases, the algorithm gradually converges, and training is completed.
The adopted loss function is the KL divergence, which is defined as follows:
KL(p ∥ q) = Σ_{i=1}^{n} p(y_i) log( p(y_i) / q(ly_i) )        (14)
In formula (14), p(y_i) represents the true distribution of the original emotion labels y, q(ly_i) represents the distribution of the model-predicted label values ly, and n represents the total number of training samples.
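A sketch of the regression head and KL-divergence loss of Steps 6.2–6.3 in PyTorch is shown below; the hidden size and the use of a softmax to turn the two continuous outputs and labels into distributions for KLDivLoss are assumptions of this example.

```python
# Fused feature -> fully connected layers -> 2-dim (arousal, valence); KL loss.
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, in_features, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),        # linear activation: continuous arousal / valence
        )

    def forward(self, t_a):
        return self.fc(torch.flatten(t_a, 1))

kl_loss = nn.KLDivLoss(reduction="batchmean")

def training_loss(pred, target):
    # KLDivLoss expects log-probabilities for the prediction and probabilities
    # for the target; converting the 2-dim outputs with (log-)softmax is an
    # assumption of this example.
    return kl_loss(torch.log_softmax(pred, dim=1), torch.softmax(target, dim=1))
```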
The back propagation of the convolutional neural network employed by the present invention includes three cases:
(1) When the pooling layer is followed by a fully connected layer, the error is back-propagated from the fully connected layer into the down-sampling layers, and the gradient of each pixel in the feature map must be obtained.
As shown in formula (15), the bias gradient δ_j^l of the l-th layer is obtained as
δ_j^l = ( δ_j^{l+1} ⊛ rot180(W_j^{l+1}) ) ⊙ f′(u_j^l)        (15)
where f′(u_j^l) is the derivative of the activation function of the l-th layer, j indexes the feature maps of the current layer, and δ_j^{l+1} is the bias gradient of the (l+1)-th layer: the (l+1)-th layer weight matrix W_j^{l+1} is rotated by 180 degrees, the neighborhood around δ_j^{l+1} is zero-padded, and the two are convolved, where ⊙ denotes the element-wise (dot) product of two matrices. After the bias gradients of the corresponding elements in the current-layer feature map are obtained, the bias gradient and the weight gradient of the down-sampling layer are given by formula (16), where d_j^l = downsample(x_j^{l-1}) is the down-sampling result of the j-th feature map of the (l−1)-th layer.
(2) When the convolutional layer is connected after the pooling layer, the solution of the bias and the weight gradient is the same as in the case (1).
(3) When the layer following the convolutional layer is a pooling layer, the feature maps are in one-to-one correspondence. Similarly, the bias gradient δ_j^l of each pixel in the current-layer feature map is first solved:
δ_j^l = w_j^{l+1} ( f′(u_j^l) × upsample(δ_j^{l+1}) )        (17)
In formula (17), upsample(δ_j^{l+1}) denotes up-sampling δ_j^{l+1}: the j-th result of the (l+1)-th layer down-sampling is up-sampled and restored to the same size as the convolution feature map so that it can be element-wise multiplied with the matrix f′(u_j^l); the bias gradient and the weight gradient of the convolutional layer are then given by formulas (18) and (19).
In formulas (18) and (19), w_j^l is the convolution kernel corresponding to the j-th feature map x_j^l of the l-th layer, and p_j^l is the result obtained by convolving the j-th feature map x_j^{l-1} of the (l−1)-th layer with the convolution kernel w_j^l.
Step 7, extract the body part of each test sample in the test sample set X_tn according to Step 2 to obtain a test body part image set X_tB; then, following Step 3, normalize the test sample set X_tn and the test body part image set X_tB, feed them into the network model obtained in Step 6, and finally obtain the predicted label values of the test sample set X_tn.
The specific process of step 7 is as follows:
Step 7.1, read the body annotation (tB_x1, tB_y1, tB_x2, tB_y2) of each sample in the test sample set X_tn and calculate the position and size parameters by formula (1);
Step 7.2, according to the position and size parameters obtained in Step 7.1, crop the test sample set X_tn to obtain the test body part image set X_tB.
Step 7.3, referring to Step 3, normalize the test sample set X_tn and the test body part image set X_tB within their respective sets to obtain the corresponding test context emotion image set X_tm and the normalized test body part image set X_tbody;
Step 7.4, feed the normalized test body part image set X_tbody into the upper-layer structure of the network model obtained in Step 6 and feed the test context emotion image set X_tm into the lower-layer structure of the network model; the predicted label values of the test sample set X_tn are then obtained through model prediction.
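Putting the pieces together, the test-time flow of Step 7 can be sketched as follows, reusing the hypothetical helpers from the earlier examples; the composition below is an illustration under those assumptions, not the literal patented implementation.

```python
# Inference sketch: upper/lower streams -> adaptive fusion -> (arousal, valence).
import torch

@torch.no_grad()
def predict(extractor, fusion, head, x_tbody, x_tm):
    t_f, t_c = extractor(x_tbody, x_tm)   # upper / lower streams (Step 7.4)
    t_a = fusion(t_f, t_c)                # adaptive weighted fusion
    return head(t_a)                      # predicted (arousal, valence) per test sample
```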
Examples
The experiments of the invention are carried out on the EMOTIC database. The EMOTIC dataset provides rich emotional images in complex scenes; the images contain not only the subject to be analyzed but also abundant scene-level context information such as the surrounding environment and other factors. The dataset has 23554 annotated samples, divided into 17077 training samples, 2088 validation samples, and 4389 test samples. The annotation information comprises both discrete labels and continuous dimensional labels, as well as a body part annotation for the subject in each image, which makes scene-level context research convenient; some of the complex emotion images and their annotations are shown in FIG. 2.
The experimental results are compared as follows:
1) Influence of different feature fusion modes on emotion recognition
Because features extracted from different network structures often have different attributes, directly concatenating the two kinds of features, namely the body part emotional features and the scene-level contextual emotional features, does not provide optimal discrimination. Therefore, to verify the effectiveness of the adaptive fusion network, under the same experimental setup the features output by the two convolutional neural networks were fused either by direct concatenation or by the adaptive fusion network, and the experimental results are shown in Table 3 below:
Table 3: Influence of different feature fusion modes on emotion recognition
As can be seen from the data in the table, for fusing emotional features the adaptive fusion network designed by the invention is superior to directly concatenating the two kinds of attribute features. This verifies the effectiveness of introducing an adaptive fusion network into the contextual emotion recognition network structure.
Claims (9)
1. A scene-level context-aware emotion recognition deep network method, characterized by specifically comprising the following steps:
Step 1, collect images and determine a training sample set X_in and a test sample set X_tn;
Step 2, read the training sample set X_in and extract the body part of each sample according to its body annotation value to obtain a body part image set X_B;
Step 3, normalize the training sample set X_in within the set to obtain a context emotion image set X_im; normalize the body part image set X_B within the set to obtain a normalized body part image set X_body;
Step 4, feed the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional features T_F, and feed the context emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional features T_C;
Step 5, feed the body part emotional features T_F and the scene-level contextual emotional features T_C into the upper and lower adaptive layers, respectively, for adaptive feature learning; the upper adaptive layer outputs the body part fusion weight λ_F and the lower adaptive layer outputs the context fusion weight λ_C;
Step 6, perform weighted fusion of the body part emotional features T_F and the scene-level contextual emotional features T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain the emotion fusion feature T_A that incorporates the context information; T_A is then linearly mapped by a fully connected layer to initial predicted values of arousal and valence; a KL divergence loss function measures the loss between these initial predictions and the corresponding original emotion annotation values, the loss is back-propagated through the network and, over multiple iterations, the network weights are updated so that the loss gradually decreases and the algorithm converges, completing training and yielding the network model;
Step 7, extract the body part of each test sample in the test sample set X_tn according to Step 2 to obtain a test body part image set X_tB; then, following Step 3, normalize the test sample set X_tn and the test body part image set X_tB, feed them into the network model obtained in Step 6, and finally obtain the predicted label values of the test sample set X_tn.
2. The method as claimed in claim 1, wherein the specific steps for extracting the body part from the training sample set X_in in Step 2 are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corners of the region containing the body part, and compute the position and size parameters of the body part by formula (1):
B_w = B_x2 − B_x1,  B_h = B_y2 − B_y1        (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image;
3. The method as claimed in claim 1, wherein the formula for the intra-set normalization of the training sample set X_in in Step 3 is as follows:
X_im = (X_in − x_mean) / σ        (2)
In formula (2), X_in is the training sample set, X_im is the context emotion image set, σ is the standard deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) Σ_{i=1}^{n} x_i        (3)
σ = sqrt( (1/n) Σ_{i=1}^{n} (x_i − x_mean)^2 )        (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
4. The method as claimed in claim 1, wherein the formula for the intra-set normalization of the body part image set X_B in Step 3 is as follows:
X_body = (X_B − x'_mean) / σ'        (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) Σ_{i=1}^{n} x'_i        (6)
σ' = sqrt( (1/n) Σ_{i=1}^{n} (x'_i − x'_mean)^2 )        (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
5. The method for emotion recognition depth network based on scene-level context awareness, according to claim 1, wherein in step 4, the convolutional neural network at the upper layer and the convolutional neural network at the lower layer have the same structural parameters, and both adopt VGG16 architecture.
6. The method as claimed in claim 1, wherein the body part emotional features T_F and the contextual emotional features T_C in Step 4 are computed as follows:
T_F = F(X_body, W_F)        (8)
T_C = F(X_im, W_C)        (9)
In formula (8), W_F represents all parameters of the convolutional and pooling layers of the upper-layer convolutional neural network; in formula (9), W_C represents all parameters of the convolutional and pooling layers of the lower-layer convolutional neural network; F denotes the convolution and pooling operations of the feature extraction network.
7. The method as claimed in claim 1, wherein the body part fusion weight λ_F and the context fusion weight λ_C in Step 5 are computed as follows:
λ_F = F(T_F, W_D)        (10)
λ_C = F(T_C, W_E)        (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer; and λ_F + λ_C = 1.
8. The method as claimed in claim 1, wherein in Step 5 the network structures of the upper adaptive layer and the lower adaptive layer are completely the same, with the following specific network architecture parameters:
the upper adaptive layer and the lower adaptive layer respectively comprise a maximum pooling layer, two convolution layers and a Softmax layer.
9. The method as claimed in claim 1, wherein in Step 6 the weighted fusion of the body part emotional features T_F and the scene-level contextual features T_C with the body part fusion weight λ_F and the context fusion weight λ_C is computed as follows:
T_A = (λ_F ⊗ T_F) ∏ (λ_C ⊗ T_C)        (12)
In formula (12), T_A is the emotion fusion feature obtained after weighting, ∏ represents the connection (concatenation) operator that splices the weighted body part emotional features and the weighted scene-level contextual emotional features, and ⊗ represents the convolution operation between each attribute feature and its fusion weight.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010664287.3A CN111985532B (en) | 2020-07-10 | 2020-07-10 | Scene-level context-aware emotion recognition deep network method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010664287.3A CN111985532B (en) | 2020-07-10 | 2020-07-10 | Scene-level context-aware emotion recognition deep network method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111985532A (en) | 2020-11-24
CN111985532B CN111985532B (en) | 2021-11-09 |
Family
ID=73439067
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010664287.3A Active CN111985532B (en) | 2020-07-10 | 2020-07-10 | Scene-level context-aware emotion recognition deep network method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111985532B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105512680A (en) * | 2015-12-02 | 2016-04-20 | 北京航空航天大学 | Multi-view SAR image target recognition method based on depth neural network |
WO2019174376A1 (en) * | 2018-03-14 | 2019-09-19 | 大连理工大学 | Lung texture recognition method for extracting appearance and geometrical feature based on deep neural network |
CN108830296A (en) * | 2018-05-18 | 2018-11-16 | 河海大学 | A kind of improved high score Remote Image Classification based on deep learning |
CN109977413A (en) * | 2019-03-29 | 2019-07-05 | 南京邮电大学 | A kind of sentiment analysis method based on improvement CNN-LDA |
CN110399490A (en) * | 2019-07-17 | 2019-11-01 | 武汉斗鱼网络科技有限公司 | A kind of barrage file classification method, device, equipment and storage medium |
CN110472245A (en) * | 2019-08-15 | 2019-11-19 | 东北大学 | A kind of multiple labeling emotional intensity prediction technique based on stratification convolutional neural networks |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114764906A (en) * | 2021-01-13 | 2022-07-19 | 长沙中车智驭新能源科技有限公司 | Multi-sensor post-fusion method for automatic driving, electronic equipment and vehicle |
CN112733756A (en) * | 2021-01-15 | 2021-04-30 | 成都大学 | Remote sensing image semantic segmentation method based on W divergence countermeasure network |
CN113011504A (en) * | 2021-03-23 | 2021-06-22 | 华南理工大学 | Virtual reality scene emotion recognition method based on visual angle weight and feature fusion |
CN113011504B (en) * | 2021-03-23 | 2023-08-22 | 华南理工大学 | Virtual reality scene emotion recognition method based on visual angle weight and feature fusion |
CN113076905A (en) * | 2021-04-16 | 2021-07-06 | 华南理工大学 | Emotion recognition method based on context interaction relationship |
CN117636426A (en) * | 2023-11-20 | 2024-03-01 | 北京理工大学珠海学院 | Attention mechanism-based facial and scene emotion recognition method |
Also Published As
Publication number | Publication date |
---|---|
CN111985532B (en) | 2021-11-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111985532B (en) | Scene-level context-aware emotion recognition deep network method | |
CN108460338B (en) | Human body posture estimation method and apparatus, electronic device, storage medium, and program | |
CN110532900B (en) | Facial expression recognition method based on U-Net and LS-CNN | |
CN112990054B (en) | Compact linguistics-free facial expression embedding and novel triple training scheme | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
CN111028319B (en) | Three-dimensional non-photorealistic expression generation method based on facial motion unit | |
CN114220035A (en) | Rapid pest detection method based on improved YOLO V4 | |
CN107066583A (en) | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity | |
CN108830237B (en) | Facial expression recognition method | |
CN109740686A (en) | A kind of deep learning image multiple labeling classification method based on pool area and Fusion Features | |
CN112949622B (en) | Bimodal character classification method and device for fusing text and image | |
CN112949740B (en) | Small sample image classification method based on multilevel measurement | |
CN113780249B (en) | Expression recognition model processing method, device, equipment, medium and program product | |
CN112541529A (en) | Expression and posture fusion bimodal teaching evaluation method, device and storage medium | |
CN114936623A (en) | Multi-modal data fused aspect-level emotion analysis method | |
CN110111365B (en) | Training method and device based on deep learning and target tracking method and device | |
CN108182475A (en) | It is a kind of based on automatic coding machine-the multi-dimensional data characteristic recognition method of the learning machine that transfinites | |
CN111832573A (en) | Image emotion classification method based on class activation mapping and visual saliency | |
Zhai et al. | Face verification across aging based on deep convolutional networks and local binary patterns | |
CN112819510A (en) | Fashion trend prediction method, system and equipment based on clothing multi-attribute recognition | |
Zheng et al. | Facial expression recognition based on texture and shape | |
CN117576248A (en) | Image generation method and device based on gesture guidance | |
CN114155560B (en) | Light weight method of high-resolution human body posture estimation model based on space dimension reduction | |
CN113780350A (en) | Image description method based on ViLBERT and BilSTM | |
Wen | Research on Modern Book Packaging Design Based on Aesthetic Evaluation Based on a Deep Learning Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |