CN111985532B - Scene-level context-aware emotion recognition deep network method - Google Patents

Scene-level context-aware emotion recognition deep network method

Info

Publication number
CN111985532B
Authority
CN
China
Prior art keywords
body part
emotion
context
network
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010664287.3A
Other languages
Chinese (zh)
Other versions
CN111985532A (en)
Inventor
孙强
张龙涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202010664287.3A priority Critical patent/CN111985532B/en
Publication of CN111985532A publication Critical patent/CN111985532A/en
Application granted granted Critical
Publication of CN111985532B publication Critical patent/CN111985532B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene-level context-aware emotion recognition deep network method. A training sample set X_in is read together with its body annotation values and original emotion annotation values to obtain a body part image set X_B. After normalization, X_B and X_in are fed into the upper-layer and lower-layer convolutional neural networks, respectively, to extract the body part emotional feature T_F and the scene-level contextual emotional feature T_C. T_F and T_C are then fed into the upper and lower adaptive layers to obtain the fusion weights λ_F and λ_C, and T_F, T_C, λ_F, and λ_C are fused to obtain the emotion fusion feature T_A. T_A is linearly mapped by a fully connected layer to initial predicted values of arousal and valence; the loss between these initial predictions and the original emotion annotation values is measured, the network gradually converges, training is completed, and the network model is obtained. The test sample set is processed and fed into the network model to obtain the predicted label values of the test sample set X_tn. When fusing features, the method considers the degree to which features of different attributes influence human emotion, and it improves the prediction performance of the model while enriching image-based emotion recognition research.

Description

Scene-level context-aware emotion recognition deep network method
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a scene-level context-aware emotion recognition deep network method.
Background
Emotion is an essential form through which a person expresses his or her feelings. Understanding and recognizing a person's emotion from the actual scene in which that person is located in daily life helps to perceive his or her mental state, predict behavior, and interact effectively. In the 1990s, the concept of affective computing was proposed by the MIT Media Lab, and scientists have since worked on converting complex human emotions into numerical information recognizable by computers in order to better realize human-computer interaction and make computers intelligent; this has become one of the key problems to be solved in the era of artificial intelligence.
Traditionally, emotion recognition for static images has mainly been studied on human face images. Emotional features are extracted from the face image with predefined feature extraction methods and fed into a classifier (or regressor) for model training, so that emotion prediction is finally realized. However, emotion recognition based on face images is easily affected by the natural environment and by sample-specific factors such as pose, illumination, and differences between faces.
According to psychological research, facial expressions account for only about 55% of the emotional information conveyed in visual communication. In daily emotional communication, a person's emotion is judged not only from the facial expression of the target person but also from a series of rich contextual cues such as body movement, interaction with other people, and the surrounding scene; even in the extreme case where the face cannot be detected, the emotion of the subject can still be estimated from this large amount of contextual information.
In recent years, complex emotion recognition methods based on deep convolutional networks have attracted attention; the network learns and analyses emotional features by itself instead of relying on traditional hand-crafted definitions. However, current deep learning analysis methods mainly perform emotion analysis on face images, lack comprehensive consideration of how people express emotion in the complex situations of natural scenes, and do not consider the influence of scene-level context information on emotion recognition. Meanwhile, the fusion of features with different attributes has not been studied sufficiently, and existing models ignore the different degrees to which such features contribute to the recognition of emotional states.
Disclosure of Invention
The invention aims to provide a scene-level context-aware emotion recognition deep network method, which addresses the limitations of the prior art that the scope of emotion analysis based on static images is narrow, that only face images are targeted, and that emotion recognition is carried out by directly concatenating features of different attributes.
The technical scheme adopted by the invention is a scene-level context-aware emotion recognition deep network method, which specifically comprises the following steps:
Step 1, collecting images and determining a training sample set X_in and a test sample set X_tn;
Step 2, reading the body annotation value and the original emotion annotation value of each sample in the training sample set X_in, and extracting the body part of each sample according to the body annotation value to obtain a body part image set X_B;
Step 3, performing within-set normalization on the training sample set X_in to obtain a contextual emotion image set X_im, and performing within-set normalization on the body part image set X_B to obtain a normalized body part image set X_body;
Step 4, feeding the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional feature T_F, and feeding the contextual emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional feature T_C;
Step 5, feeding the body part emotional feature T_F and the scene-level contextual emotional feature T_C into the upper and lower adaptive layers, respectively, for adaptive feature learning, the upper adaptive layer outputting the body part fusion weight λ_F and the lower adaptive layer outputting the context fusion weight λ_C;
Step 6, performing weighted fusion of the body part emotional feature T_F and the scene-level contextual emotional feature T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain an emotion fusion feature T_A combined with context information; T_A is then linearly mapped by a fully connected layer to initial predicted values of arousal and valence, a KL divergence loss function is adopted to measure the loss between these initial predicted values and the corresponding original emotion annotation values, and through network back-propagation and multiple iterations the network weights are updated and the loss gradually decreases, so that the algorithm gradually converges, training is completed, and the network model is obtained;
Step 7, extracting, according to step 2, the body part of each test sample in the test sample set X_tn to obtain a test body part image set X_tB; then, according to step 3, normalizing the test sample set X_tn and the test body part image set X_tB, respectively, and feeding them into the network model obtained in step 6 to obtain the predicted label values of the test sample set X_tn.
The present invention is further characterized in that:
The specific steps of extracting the body part from the training sample set X_in in step 2 are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corner points of the region where the body part is located, and calculate the position and size parameter set (B_x1, B_y1, B_w, B_h) by formula (1), where:
B_w = B_x2 - B_x1,  B_h = B_y2 - B_y1   (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image;
Step 2.2, crop each sample of the training sample set X_in according to the parameter set (B_x1, B_y1, B_w, B_h) obtained in step 2.1 to obtain the body part image set X_B.
The formula for the within-set normalization of the training sample set X_in in step 3 is as follows:
X_im = (X_in - x_mean) / σ   (2)
In formula (2), X_in is the training sample set, X_im is the contextual emotion image set, σ is the standard-deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) · Σ_{i=1}^{n} x_i   (3)
σ = √( (1/n) · Σ_{i=1}^{n} (x_i - x_mean)² )   (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
The formula for the within-set normalization of the body part image set X_B in step 3 is as follows:
X_body = (X_B - x'_mean) / σ'   (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard-deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) · Σ_{i=1}^{n} x'_i   (6)
σ' = √( (1/n) · Σ_{i=1}^{n} (x'_i - x'_mean)² )   (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
In step 4, the upper-layer and lower-layer convolutional neural networks have the same structural parameters, and both adopt the VGG16 architecture.
The body part emotional feature T_F and the contextual emotional feature T_C in step 4 are calculated as follows:
T_F = F(X_body, W_F)   (8)
T_C = F(X_im, W_C)   (9)
In formula (8), W_F represents all parameters of all convolutional and pooling layers of the upper-layer convolutional neural network; in formula (9), W_C represents all parameters of all convolutional and pooling layers of the lower-layer convolutional neural network; F denotes the convolution and pooling computations in the feature extraction network.
The body part fusion weight λ_F and the context fusion weight λ_C in step 5 are calculated as follows:
λ_F = F(T_F, W_D)   (10)
λ_C = F(T_C, W_E)   (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer; and λ_F + λ_C = 1.
In step 5, the network structures of the upper adaptive layer and the lower adaptive layer are completely the same, each consisting of one max-pooling layer, two convolutional layers, and one Softmax layer; the specific architecture parameters are listed in Table 2 of the detailed description.
The weighted fusion of the body part emotional feature T_F and the scene-level contextual feature T_C with the body part fusion weight λ_F and the context fusion weight λ_C in step 6 is calculated as follows:
T_A = (T_F ⊗ λ_F) ∏ (T_C ⊗ λ_C)   (12)
In formula (12), T_A is the emotion fusion feature, ∏ denotes the concatenation operator that splices the weight-fused body part emotional feature and the weight-fused scene-level contextual emotional feature, and ⊗ denotes the convolution operation between the different features and their fusion weights.
The beneficial effects of the invention are as follows: the invention discloses a scene-level context-aware emotion recognition deep network method and proposes a two-stage context-aware emotion recognition network. On the one hand, the two-stage contextual emotion recognition network overcomes the limitation that existing image-based emotion recognition tasks mainly target face image data and therefore lack realism; on the other hand, the degree to which features of different attributes influence human emotion is fully considered during feature fusion, which improves the prediction performance of the model while enriching image-based emotion recognition research.
Drawings
FIG. 1 is an overall flow diagram of a scene level context aware emotion recognition deep network method of the present invention;
FIG. 2 is a diagram showing a complex emotion image and emotion dimension labeling information thereof;
FIG. 3 is a schematic diagram of a convolution operation;
FIG. 4 is a schematic view of the expansion of the receptive field by small convolution kernel stacking;
FIG. 5 is a schematic view of a pooling operation.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses a scene-level context-aware emotion recognition deep network method; its overall flow is shown in FIG. 1, and it specifically comprises the following steps:
Step 1, collecting images and determining a training sample set X_in and a test sample set X_tn.
Each training sample and each test sample has a corresponding original emotion annotation value and body annotation value.
For the training sample set X_in, the original emotion annotation is an n × 2-dimensional vector y = [(a_1, v_1), (a_2, v_2), ..., (a_n, v_n)], where (a_1, v_1) are the arousal and valence labels of the 1st sample in X_in and (a_n, v_n) are the arousal and valence labels of the n-th sample. The body part annotation is an n × 4-dimensional vector B = [(B_x1^1, B_y1^1, B_x2^1, B_y2^1), ..., (B_x1^n, B_y1^n, B_x2^n, B_y2^n)], whose 1st entry is the body annotation of the 1st sample used to build the body part image set X_B and whose n-th entry is the body annotation of the n-th sample.
For the test sample set X_tn, the original emotion annotation is an m × 2-dimensional vector ty = [(ta_1, tv_1), (ta_2, tv_2), ..., (ta_m, tv_m)] and the body part annotation is an m × 4-dimensional vector tB = [(tB_x1^1, tB_y1^1, tB_x2^1, tB_y2^1), ..., (tB_x1^m, tB_y1^m, tB_x2^m, tB_y2^m)], where m represents the number of test samples.
Step 2, reading the body annotation value of each sample in the training sample set X_in, and extracting the body part of each sample according to the body annotation value to obtain the body part image set X_B.
The specific steps of extracting the body part from the training sample set X_in are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corner points of the region where the body part is located, and calculate the position and size parameter set (B_x1, B_y1, B_w, B_h) by formula (1), where:
B_w = B_x2 - B_x1,  B_h = B_y2 - B_y1   (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image.
Step 2.2, crop each training sample of the training sample set X_in according to the parameter set (B_x1, B_y1, B_w, B_h) obtained in step 2.1 to obtain the body part image set X_B.
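As a concrete illustration of the cropping in step 2, the following minimal sketch assumes each sample is loaded as a NumPy array and the annotation supplies two opposite corner coordinates; the function and variable names are illustrative, not part of the patent.

```python
import numpy as np

def crop_body_part(image, body_box):
    """Crop the body region of one training sample.

    image:    H x W x 3 array (one sample of X_in)
    body_box: (B_x1, B_y1, B_x2, B_y2), two opposite corners of the body region
    returns:  the cropped body part image used to build X_B
    """
    b_x1, b_y1, b_x2, b_y2 = body_box
    # Position and size parameters as in formula (1)
    b_w = abs(b_x2 - b_x1)
    b_h = abs(b_y2 - b_y1)
    x0, y0 = min(b_x1, b_x2), min(b_y1, b_y2)
    return image[int(y0):int(y0 + b_h), int(x0):int(x0 + b_w), :]
```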
Step 3, performing within-set normalization on the training sample set X_in to obtain the contextual emotion image set X_im, and performing within-set normalization on the body part image set X_B to obtain the normalized body part image set X_body.
Wherein, the formula for the within-set normalization of the training sample set X_in is as follows:
X_im = (X_in - x_mean) / σ   (2)
In formula (2), X_in is the training sample set, X_im is the contextual emotion image set, σ is the standard-deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) · Σ_{i=1}^{n} x_i   (3)
σ = √( (1/n) · Σ_{i=1}^{n} (x_i - x_mean)² )   (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
For the body part image set X_B, the formula for the within-set normalization is as follows:
X_body = (X_B - x'_mean) / σ'   (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard-deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) · Σ_{i=1}^{n} x'_i   (6)
σ' = √( (1/n) · Σ_{i=1}^{n} (x'_i - x'_mean)² )   (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
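The within-set normalization of formulas (2) to (7) can be written directly with NumPy. The sketch below assumes the image sets are stacked into arrays of shape (n, H, W, 3); the small epsilon is an implementation detail added to avoid division by zero and is not part of the patent.

```python
import numpy as np

def normalize_set(images, eps=1e-8):
    """Normalize an image set by its own mean and standard-deviation images.

    images: array of shape (n, H, W, 3), e.g. X_in or X_B
    returns: the normalized set (X_im or X_body) plus the mean and std images
    """
    x_mean = images.mean(axis=0)        # formula (3) / (6): mean image
    sigma = images.std(axis=0) + eps    # formula (4) / (7): standard-deviation image
    return (images - x_mean) / sigma, x_mean, sigma   # formula (2) / (5)
```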
Step 4, feeding the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional feature T_F, and feeding the contextual emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional feature T_C.
Step 4.1, initialize the parameters of the whole network architecture, including all convolutional layers, pooling layers, and fully connected layers in the network; the weights of each layer are initialized to follow a Gaussian distribution with mean 0 and standard deviation 1, and the bias terms are uniformly initialized to 0.001;
Step 4.2, feed the body part image set X_body into the upper-layer convolutional neural network and the contextual emotion image set X_im into the lower-layer convolutional neural network; the upper-layer and lower-layer convolutional neural network models have the same structure and both adopt the VGG16 network architecture, whose parameters are shown in Table 1 below:
TABLE 1 Emotion feature extraction convolutional network architecture parameter table
(Table 1 is provided as an image in the original publication and is not reproduced here.)
As can be seen from network architecture parameter Table 1, the five convolutional layers C1, C2, C3, C4, and C5 in the network structure have 64, 128, 256, 512, and 512 feature maps, respectively. Each feature map is obtained by convolving the input image, or the output X_m of the previous layer, with the corresponding convolution templates K_uv and adding a bias term b_v; the convolution process is shown in FIG. 3, and the feature map is calculated as:
X_v = Σ_m ( X_m ⊛ K_uv ) + b_v   (13)
In formula (13), u takes values in {1, 2, 3, 4, 5} and indexes the convolutional layer, v ranges over the number of convolution templates of each layer, namely 64, 128, 256, 512, and 512, and ⊛ denotes the convolution operation with stride 1. The convolution kernels are all of size 3 × 3; stacking small convolution kernels enlarges the receptive field of the convolutional layers while effectively reducing the number of parameters, and a schematic diagram of the receptive field is shown in FIG. 4.
For the pooling layers S1, S2, S3, and S4, the outputs of the corresponding convolutional layers are down-sampled by max pooling. The pooled sampling region used in the invention is 2 × 2 with stride 2, and the pooling process is shown in FIG. 5. For example, a 2 × 2 region of the 1st feature map X_m of convolutional layer C1 is sampled to give the first output O_1 of the 1st feature map of pooling layer S1, where the sampling takes the maximum value in the 2 × 2 region; the other outputs are obtained similarly, and the horizontal and vertical spatial resolutions after sampling become 1/2 of the original.
Step 4.3, after the normalized body part image set X_body and the contextual emotion image set X_im pass through the iterative computations of the upper-layer and lower-layer convolutional neural networks, respectively, the body part emotional feature T_F and the scene-level contextual emotional feature T_C are obtained. The calculation process can be expressed as:
T_F = F(X_body, W_F)   (8)
T_C = F(X_im, W_C)   (9)
In formula (8), W_F represents the network parameters of the upper layer related to body part emotional feature extraction; in formula (9), W_C represents the network parameters of the lower layer related to scene-level contextual feature extraction; F denotes the convolution and pooling computations in the feature extraction network;
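A possible sketch of the two-stream feature extractor of step 4 in PyTorch, reusing the VGG16 convolutional stack from torchvision for both branches. The Gaussian initialization follows step 4.1; the class name, the use of torchvision, and other details are assumptions for illustration rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

def init_weights(module):
    # Step 4.1: Gaussian weights (mean 0, std 1, as stated in the text) and bias 0.001
    if isinstance(module, (nn.Conv2d, nn.Linear)) and module.bias is not None:
        nn.init.normal_(module.weight, mean=0.0, std=1.0)
        nn.init.constant_(module.bias, 0.001)

class TwoStreamExtractor(nn.Module):
    """Upper branch: body part images X_body -> T_F (parameters W_F).
       Lower branch: context images  X_im   -> T_C (parameters W_C)."""
    def __init__(self):
        super().__init__()
        self.body_net = vgg16(weights=None).features     # conv + max-pool stack
        self.context_net = vgg16(weights=None).features  # same VGG16 architecture
        self.apply(init_weights)

    def forward(self, x_body, x_im):
        t_f = self.body_net(x_body)   # formula (8): T_F = F(X_body, W_F)
        t_c = self.context_net(x_im)  # formula (9): T_C = F(X_im, W_C)
        return t_f, t_c
```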
Step 5, feeding the body part emotional feature T_F and the scene-level contextual emotional feature T_C into the upper and lower adaptive layers, respectively, for adaptive weight learning; the upper adaptive layer outputs the body part fusion weight λ_F and the lower adaptive layer outputs the context fusion weight λ_C.
For the adaptive layers, the network structures of the upper layer and the lower layer are completely the same; each of the two networks consists of one max-pooling layer, two convolutional layers, and one Softmax layer, and the overall structural parameters are shown in Table 2 below:
Table 2 Adaptive fusion network architecture parameter table
(Table 2 is provided as an image in the original publication and is not reproduced here.)
Finally, the body part fusion weight λ_F and the context fusion weight λ_C are output through the Softmax layer; the calculation process is as follows:
λ_F = F(T_F, W_D)   (10)
λ_C = F(T_C, W_E)   (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer. The final Softmax layer of the adaptive network imposes a constraint on the fusion weights, ensuring that λ_F + λ_C = 1.
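The adaptive fusion weights of step 5 could be sketched as below. Because the exact layer sizes of Table 2 are not reproduced in the text, the pooling and convolution dimensions here are assumptions; what the sketch preserves is the stated structure of one max-pooling layer, two convolutions, and a Softmax that enforces λ_F + λ_C = 1.

```python
import torch
import torch.nn as nn

class AdaptiveBranch(nn.Module):
    """One adaptive layer: max pooling followed by two convolutions, reduced
    to a single score per sample (channel and kernel sizes are assumed)."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.pool = nn.MaxPool2d(2, stride=2)
        self.conv1 = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(128, 1, kernel_size=3, padding=1)

    def forward(self, t):
        s = self.conv2(torch.relu(self.conv1(self.pool(t))))
        return s.mean(dim=(1, 2, 3))          # one scalar score per sample

class FusionWeights(nn.Module):
    """Outputs λ_F and λ_C with λ_F + λ_C = 1 via a shared Softmax."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.upper = AdaptiveBranch(in_channels)   # parameters W_D, input T_F
        self.lower = AdaptiveBranch(in_channels)   # parameters W_E, input T_C

    def forward(self, t_f, t_c):
        scores = torch.stack([self.upper(t_f), self.lower(t_c)], dim=1)
        lam = torch.softmax(scores, dim=1)         # constraint: weights sum to 1
        return lam[:, 0], lam[:, 1]                # λ_F, λ_C
```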
Step 6, emotion characteristics T of body partFScene level contextual emotional characteristics TCFusing weight lambda with body partFContext fusion weight λCCarrying out weighted fusion to obtain emotion fusion characteristic T combined with context informationAThen T is addedAObtaining initial predicted values of arousal and value through linear mapping of a full connection layer, adopting a KL divergence loss function to measure loss between the initial predicted values of arousal and value and corresponding original emotion marking values, carrying out backward propagation through a network, carrying out multiple iterations, updating network weight, gradually reducing loss, enabling the algorithm to gradually converge, completing training, and obtaining a network model.
Step 6.1, carrying out emotional characteristic T on the body partFScene level contextual emotional characteristics TCFusing weights λ with body partsFContext fusion weight λCCarrying out weighted fusion to obtain emotion fusion characteristics TAThe expression is as follows:
T_A = (T_F ⊗ λ_F) ∏ (T_C ⊗ λ_C)   (12)
In formula (12), ∏ denotes the concatenation operator, which splices the weight-fused body part emotional feature and the weight-fused scene-level contextual emotional feature, and ⊗ denotes the convolution operation between the different features and their fusion weights;
Step 6.2, the fusion feature T_A is sent to the fully connected layers for processing; because the predicted values are continuous, the last fully connected layer uses a linear activation function. The parameter structure of the fully connected layers is as follows:
Fully connected layer parameter table
(The fully connected layer parameter table is provided as an image in the original publication and is not reproduced here.)
Step 6.3, the final 256-dimensional emotional feature is linearly mapped by the fully connected layer Fc10 into the 2-dimensional predicted label values arousal and valence. KL divergence is adopted as the loss function to measure the loss between the predicted label values and the original label values; the network is back-propagated and iterated 80 times, the network weights are updated, the loss gradually decreases, the algorithm gradually converges, and training is completed.
The adopted loss function is the KL divergence, which is specifically defined as follows:
KL(y ∥ ly) = Σ_{i=1}^{n} p(y_i) · log( p(y_i) / q(ly_i) )   (14)
In formula (14), p(y_i) represents the true distribution of the original emotion label y, q(ly_i) represents the distribution of the model-predicted label values ly, and n represents the total number of training samples.
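The fusion, the fully connected head, and the loss of step 6 can be sketched as follows. The hidden size of 256, the assumed 512 × 7 × 7 VGG16 feature maps, and the softmax used to turn the two continuous dimensions into distributions for the KL term are illustrative choices, not the patent's exact configuration (the original fully connected parameter table is an image).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Weighted fusion (formula (12)) followed by fully connected layers with
    a linear output for the continuous arousal/valence predictions."""
    def __init__(self, feat_dim=512 * 7 * 7):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),           # linear mapping to (arousal, valence)
        )

    def forward(self, t_f, t_c, lam_f, lam_c):
        # Weight each stream by its fusion weight, then concatenate (formula (12))
        t_f = lam_f.view(-1, 1) * t_f.flatten(1)
        t_c = lam_c.view(-1, 1) * t_c.flatten(1)
        return self.fc(torch.cat([t_f, t_c], dim=1))

def kl_loss(pred, target):
    """KL divergence between label distributions in the spirit of formula (14);
    the softmax over the two dimensions is one possible reading, not the
    patent's exact recipe."""
    p = F.softmax(target, dim=1)          # "true" distribution p(y_i)
    log_q = F.log_softmax(pred, dim=1)    # predicted distribution q(ly_i), in log space
    return F.kl_div(log_q, p, reduction="batchmean")
```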
The back propagation of the convolutional neural network employed by the present invention includes three cases:
(1) When a pooling layer is followed by a fully connected layer, the error is back-propagated from the fully connected layer into the down-sampling (pooling) layer, and the gradient of each pixel in the feature map must be obtained:
δ_l^j = f'(u_l^j) ⊙ ( pad(δ_{l+1}^j) ⊛ rot180(W_{l+1}^j) )   (15)
As shown in formula (15), f'(u_l^j) is the derivative of the activation function of the l-th layer, j indexes the feature maps of the current layer, and δ_{l+1}^j is the gradient of the bias of layer l+1. The weight matrix W_{l+1}^j of layer l+1 is first rotated by 180 degrees, the neighborhood around δ_{l+1}^j is zero-padded (written pad(·) above), and the padded δ_{l+1}^j is convolved with the rotated weight matrix rot180(W_{l+1}^j); ⊙ denotes the element-wise product of two matrices. After the bias gradient of the corresponding elements in the current-layer feature map is obtained, the bias gradient and the weight gradient of the down-sampling layer are given by the following formulas:
∂E/∂b_l^j = Σ_{u,v} ( δ_l^j )_{u,v} ,   ∂E/∂w_l^j = Σ_{u,v} ( δ_l^j ⊙ d_l^j )_{u,v}   (16)
where d_l^j = downsample(x_{l-1}^j) is the down-sampling result of the j-th feature map of layer l-1.
(2) When a convolutional layer follows the pooling layer, the bias and weight gradients are solved in the same way as in case (1).
(3) When the layer after the convolutional layer is a pooling layer, the feature maps correspond one to one. Similarly, the bias gradient δ_l^j of each pixel in the current-layer feature map is first obtained:
δ_l^j = w_{l+1}^j ( f'(u_l^j) × upsample(δ_{l+1}^j) )   (17)
In formula (17), upsample(δ_{l+1}^j) denotes up-sampling δ_{l+1}^j: the j-th result of the down-sampling at layer l+1 is restored to the same size as the convolution feature map so that it can conveniently be element-wise multiplied with f'(u_l^j). The bias gradient and the weight gradient of the convolutional layer are given by formulas (18) and (19).
∂E/∂b_l^j = Σ_{u,v} ( δ_l^j )_{u,v}   (18)
∂E/∂w_l^j = Σ_{u,v} ( δ_l^j )_{u,v} · ( p_l^j )_{u,v}   (19)
In formulas (18) and (19), w_l^j is the convolution kernel corresponding to the j-th feature map x_l^j of layer l, and p_l^j is the result obtained by convolving the j-th feature map x_{l-1}^j of layer l-1 with the convolution kernel w_l^j.
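The three back-propagation cases above are derived analytically; in a framework with automatic differentiation the same gradient flow through convolution, pooling, and fully connected layers is produced without manual rotation or up-sampling. A minimal illustrative check with toy layer sizes (not the patent's network):

```python
import torch
import torch.nn as nn

# Toy conv + max-pool + linear stack; autograd reproduces the gradient chain
# described in cases (1)-(3) automatically.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(2, 2), nn.Flatten(), nn.Linear(8 * 16 * 16, 2))
x = torch.randn(4, 3, 32, 32)
loss = net(x).sum()
loss.backward()                      # fills .grad for every weight and bias
print(net[0].weight.grad.shape)      # gradients of the first convolution kernels
```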
Step 7, testing sample set XtnExtracting a test sample set X according to the step 2tnObtaining a test body part image set X of the body part of each test sampletBThen, according to step 3, respectively testing sample sets XtnAnd testing the body part image set XtBAfter normalization processing, sending the normalized result into the network model obtained in the step 6, and finally obtaining a test sample set XtnPredict tag values.
The specific process of step 7 is as follows:
Step 7.1, read the body annotation (tB_x1, tB_y1, tB_x2, tB_y2) of each sample in the test sample set X_tn, and calculate the position and size parameter set (tB_x1, tB_y1, tB_w, tB_h) by the following formula:
tB_w = tB_x2 - tB_x1,  tB_h = tB_y2 - tB_y1
Step 7.2, crop the test sample set X_tn according to the parameter set (tB_x1, tB_y1, tB_w, tB_h) obtained in step 7.1 to obtain the test body part image set X_tB;
Step 7.3, referring to step 3, perform within-set normalization on the test sample set X_tn and the test body part image set X_tB, respectively, to obtain the corresponding test contextual emotion image set X_tm and the normalized test body part image set X_tbody;
Step 7.4, feed the normalized test body part image set X_tbody into the upper-layer structure of the network model obtained in step 6 and the test contextual emotion image set X_tm into the lower-layer structure of that model; the predicted label values of the test sample set X_tn are obtained through model prediction.
Examples
The experiments of the invention are carried out on the EMOTIC database. The EMOTIC data set provides rich emotional images in complex scenes; the images contain not only the subject to be assessed but also a large amount of scene-level context information such as the surrounding environment and other factors. The data set contains 23554 samples, divided into 17077 training samples, 2088 validation samples, and 4389 test samples. The annotation information includes not only discrete labels and continuous dimensional labels but also the body part annotation of the subject in each image, which facilitates scene-level context research; some of the complex emotion images and their annotations are shown in FIG. 2.
The experimental results are compared as follows:
1) influence of different feature fusion modes on emotion recognition
Because features extracted from different network structures often have different attributes, directly concatenating the two kinds of features with different attributes, namely the body part emotional feature and the scene-level contextual emotional feature, cannot provide optimal discriminative performance. Therefore, in order to verify the effectiveness of the adaptive fusion network, the same experimental setup is adopted and the features output by the two convolutional neural network streams are compared under a direct-concatenation fusion mode and an adaptive-network fusion mode; the experimental results are shown in Table 3 below:
TABLE 3 influence of different feature fusion modes on emotion recognition
(Table 3 is provided as an image in the original publication and is not reproduced here.)
As can be seen from the data in the table, in terms of the fusion of emotional features, the adaptive fusion network designed by the invention is superior to directly concatenating the two kinds of features with different attributes. This verifies the effectiveness of introducing an adaptive fusion network into the contextual emotion recognition network structure.

Claims (9)

1. A scene-level context-aware emotion recognition deep network method, characterized by specifically comprising the following steps:
Step 1, collecting images and determining a training sample set X_in and a test sample set X_tn;
Step 2, reading the body annotation value of each sample in the training sample set X_in, and extracting the body part of each sample according to the body annotation value to obtain a body part image set X_B;
Step 3, performing within-set normalization on the training sample set X_in to obtain a contextual emotion image set X_im, and performing within-set normalization on the body part image set X_B to obtain a normalized body part image set X_body;
Step 4, feeding the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional feature T_F, and feeding the contextual emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional feature T_C;
Step 5, feeding the body part emotional feature T_F and the scene-level contextual emotional feature T_C into the upper and lower adaptive layers, respectively, for adaptive feature learning, the upper adaptive layer outputting the body part fusion weight λ_F and the lower adaptive layer outputting the context fusion weight λ_C;
Step 6, performing weighted fusion of the body part emotional feature T_F and the scene-level contextual emotional feature T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain an emotion fusion feature T_A combined with context information; T_A is then linearly mapped by a fully connected layer to initial predicted values of arousal and valence, a KL divergence loss function is adopted to measure the loss between these initial predicted values and the corresponding original emotion annotation values, and through network back-propagation and multiple iterations the network weights are updated and the loss gradually decreases, so that the algorithm gradually converges, training is completed, and the network model is obtained;
Step 7, extracting, according to step 2, the body part of each test sample in the test sample set X_tn to obtain a test body part image set X_tB; then, according to step 3, normalizing the test sample set X_tn and the test body part image set X_tB, respectively, and feeding them into the network model obtained in step 6 to finally obtain the predicted label values of the test sample set X_tn.
2. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the specific steps of extracting the body part from the training sample set X_in in step 2 are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corner points of the region where the body part is located, and calculate the position and size parameter set (B_x1, B_y1, B_w, B_h) by formula (1), where:
B_w = B_x2 - B_x1,  B_h = B_y2 - B_y1   (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image;
Step 2.2, crop each sample of the training sample set X_in according to the parameter set (B_x1, B_y1, B_w, B_h) obtained in step 2.1 to obtain the body part image set X_B.
3. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the formula for the within-set normalization of the training sample set X_in in step 3 is as follows:
X_im = (X_in - x_mean) / σ   (2)
In formula (2), X_in is the training sample set, X_im is the contextual emotion image set, σ is the standard-deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) · Σ_{i=1}^{n} x_i   (3)
σ = √( (1/n) · Σ_{i=1}^{n} (x_i - x_mean)² )   (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
4. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the formula for the within-set normalization of the body part image set X_B in step 3 is as follows:
X_body = (X_B - x'_mean) / σ'   (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard-deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) · Σ_{i=1}^{n} x'_i   (6)
σ' = √( (1/n) · Σ_{i=1}^{n} (x'_i - x'_mean)² )   (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
5. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein in step 4 the upper-layer and lower-layer convolutional neural networks have the same structural parameters and both adopt the VGG16 architecture.
6. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the body part emotional feature T_F and the contextual emotional feature T_C in step 4 are calculated as follows:
T_F = F(X_body, W_F)   (8)
T_C = F(X_im, W_C)   (9)
In formula (8), W_F represents all parameters of all convolutional and pooling layers of the upper-layer convolutional neural network; in formula (9), W_C represents all parameters of all convolutional and pooling layers of the lower-layer convolutional neural network; F denotes the convolution and pooling computations in the feature extraction network.
7. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the body part fusion weight λ_F and the context fusion weight λ_C in step 5 are calculated as follows:
λ_F = F(T_F, W_D)   (10)
λ_C = F(T_C, W_E)   (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer; and λ_F + λ_C = 1.
8. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein in step 5 the network structures of the upper adaptive layer and the lower adaptive layer are completely the same, the upper adaptive layer and the lower adaptive layer each comprising one max-pooling layer, two convolutional layers, and one Softmax layer.
9. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the weighted fusion of the body part emotional feature T_F and the scene-level contextual feature T_C with the body part fusion weight λ_F and the context fusion weight λ_C in step 6 is calculated as follows:
T_A = (T_F ⊗ λ_F) ∏ (T_C ⊗ λ_C)   (12)
In formula (12), T_A is the emotion fusion feature, ∏ denotes the concatenation operator that splices the weight-fused body part emotional feature and the weight-fused scene-level contextual emotional feature, and ⊗ denotes the convolution operation between the different features and their fusion weights.
CN202010664287.3A 2020-07-10 2020-07-10 Scene-level context-aware emotion recognition deep network method Active CN111985532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010664287.3A CN111985532B (en) 2020-07-10 2020-07-10 Scene-level context-aware emotion recognition deep network method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010664287.3A CN111985532B (en) 2020-07-10 2020-07-10 Scene-level context-aware emotion recognition deep network method

Publications (2)

Publication Number Publication Date
CN111985532A CN111985532A (en) 2020-11-24
CN111985532B true CN111985532B (en) 2021-11-09

Family

ID=73439067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010664287.3A Active CN111985532B (en) 2020-07-10 2020-07-10 Scene-level context-aware emotion recognition deep network method

Country Status (1)

Country Link
CN (1) CN111985532B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733756B (en) * 2021-01-15 2023-01-20 成都大学 Remote sensing image semantic segmentation method based on W divergence countermeasure network
CN113011504B (en) * 2021-03-23 2023-08-22 华南理工大学 Virtual reality scene emotion recognition method based on visual angle weight and feature fusion
CN113076905B (en) * 2021-04-16 2022-12-16 华南理工大学 Emotion recognition method based on context interaction relation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512680A (en) * 2015-12-02 2016-04-20 北京航空航天大学 Multi-view SAR image target recognition method based on depth neural network
CN108830296A (en) * 2018-05-18 2018-11-16 河海大学 A kind of improved high score Remote Image Classification based on deep learning
CN109977413A (en) * 2019-03-29 2019-07-05 南京邮电大学 A kind of sentiment analysis method based on improvement CNN-LDA
WO2019174376A1 (en) * 2018-03-14 2019-09-19 大连理工大学 Lung texture recognition method for extracting appearance and geometrical feature based on deep neural network
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 A kind of barrage file classification method, device, equipment and storage medium
CN110472245A (en) * 2019-08-15 2019-11-19 东北大学 A kind of multiple labeling emotional intensity prediction technique based on stratification convolutional neural networks

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512680A (en) * 2015-12-02 2016-04-20 北京航空航天大学 Multi-view SAR image target recognition method based on depth neural network
WO2019174376A1 (en) * 2018-03-14 2019-09-19 大连理工大学 Lung texture recognition method for extracting appearance and geometrical feature based on deep neural network
CN108830296A (en) * 2018-05-18 2018-11-16 河海大学 A kind of improved high score Remote Image Classification based on deep learning
CN109977413A (en) * 2019-03-29 2019-07-05 南京邮电大学 A kind of sentiment analysis method based on improvement CNN-LDA
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 A kind of barrage file classification method, device, equipment and storage medium
CN110472245A (en) * 2019-08-15 2019-11-19 东北大学 A kind of multiple labeling emotional intensity prediction technique based on stratification convolutional neural networks

Also Published As

Publication number Publication date
CN111985532A (en) 2020-11-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant