CN111985532B - Scene-level context-aware emotion recognition deep network method - Google Patents

Scene-level context-aware emotion recognition deep network method

Info

Publication number
CN111985532B
Authority
CN
China
Prior art keywords
body part
emotion
context
network
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010664287.3A
Other languages
Chinese (zh)
Other versions
CN111985532A (en)
Inventor
孙强
张龙涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202010664287.3A priority Critical patent/CN111985532B/en
Publication of CN111985532A publication Critical patent/CN111985532A/en
Application granted granted Critical
Publication of CN111985532B publication Critical patent/CN111985532B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene-level context-aware emotion recognition deep network method. A training sample set X_in is read together with its body annotation values and original emotion annotation values to obtain a body part image set X_B. After normalization, X_B and X_in are fed into the upper-layer and lower-layer convolutional neural networks, respectively, to extract the body part emotional feature T_F and the scene-level contextual emotional feature T_C. T_F and T_C are then fed into the upper and lower adaptive layers to obtain the fusion weights λ_F and λ_C, and T_F, T_C, λ_F, and λ_C are fused to obtain the emotion fusion feature T_A. T_A is linearly mapped by a fully connected layer to initial predicted values of arousal and valence; the loss between these initial predictions and the original emotion annotation values is measured, the network gradually converges, training is completed, and the network model is obtained. The test sample set is processed and fed into the network model to obtain the predicted label values of the test sample set X_tn. When fusing features, the method considers the degree to which features of different attributes influence human emotion, and it improves the prediction performance of the model while enriching image-based emotion recognition research.

Description

Scene-level context-aware emotion recognition deep network method
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a scene-level context-aware emotion recognition deep network method.
Background
Emotion is an essential form through which a person expresses his or her feelings. Understanding and recognizing a person's emotion from the actual scene in which that person is located in daily life helps to perceive his or her mental state, predict behavior, and interact effectively. In the 1990s, the concept of affective computing was proposed by the MIT Media Lab, and scientists have since worked on converting complex human emotions into numerical information recognizable by computers in order to better realize human-computer interaction and make computers intelligent; this has become one of the key problems to be solved in the era of artificial intelligence.
Traditionally, emotion recognition for static images has mainly been studied on human face images. Emotional features are extracted from the face image with predefined feature extraction methods and fed into a classifier (or regressor) for model training, so that emotion prediction is finally realized. However, emotion recognition based on face images is easily affected by the natural environment and by sample-specific factors such as pose, illumination, and differences between faces.
According to psychological research, facial expressions account for only about 55% of the emotional information conveyed in visual communication. In daily emotional communication, a person's emotion is judged not only from the facial expression of the target person but also from a series of rich contextual cues such as body movement, interaction with other people, and the surrounding scene; even in the extreme case where the face cannot be detected, the emotion of the subject can still be estimated from this large amount of contextual information.
In recent years, complex emotion recognition methods based on deep convolutional networks have attracted attention; the network learns and analyses emotional features by itself instead of relying on traditional hand-crafted definitions. However, current deep learning analysis methods mainly perform emotion analysis on face images, lack comprehensive consideration of how people express emotion in the complex situations of natural scenes, and do not consider the influence of scene-level context information on emotion recognition. Meanwhile, the fusion of features with different attributes has not been studied sufficiently, and existing models ignore the different degrees to which such features contribute to the recognition of emotional states.
Disclosure of Invention
The invention aims to provide a scene-level context-aware emotion recognition deep network method, which addresses the limitations of the prior art that the scope of emotion analysis based on static images is narrow, that only face images are targeted, and that emotion recognition is carried out by directly concatenating features of different attributes.
The technical scheme adopted by the invention is a scene-level context-aware emotion recognition deep network method, which specifically comprises the following steps:
Step 1, collecting images and determining a training sample set X_in and a test sample set X_tn;
Step 2, reading the body annotation value and the original emotion annotation value of each sample in the training sample set X_in, and extracting the body part of each sample according to the body annotation value to obtain a body part image set X_B;
Step 3, performing within-set normalization on the training sample set X_in to obtain a contextual emotion image set X_im, and performing within-set normalization on the body part image set X_B to obtain a normalized body part image set X_body;
Step 4, feeding the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional feature T_F, and feeding the contextual emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional feature T_C;
Step 5, feeding the body part emotional feature T_F and the scene-level contextual emotional feature T_C into the upper and lower adaptive layers, respectively, for adaptive feature learning, the upper adaptive layer outputting the body part fusion weight λ_F and the lower adaptive layer outputting the context fusion weight λ_C;
Step 6, performing weighted fusion of the body part emotional feature T_F and the scene-level contextual emotional feature T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain an emotion fusion feature T_A combined with context information; T_A is then linearly mapped by a fully connected layer to initial predicted values of arousal and valence, a KL divergence loss function is adopted to measure the loss between these initial predicted values and the corresponding original emotion annotation values, and through network back-propagation and multiple iterations the network weights are updated and the loss gradually decreases, so that the algorithm gradually converges, training is completed, and the network model is obtained;
Step 7, extracting, according to step 2, the body part of each test sample in the test sample set X_tn to obtain a test body part image set X_tB; then, according to step 3, normalizing the test sample set X_tn and the test body part image set X_tB, respectively, and feeding them into the network model obtained in step 6 to obtain the predicted label values of the test sample set X_tn.
The present invention is further characterized in that:
The specific steps of extracting the body part from the training sample set X_in in step 2 are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corner points of the region where the body part is located, and calculate the position and size parameter set (B_x1, B_y1, B_w, B_h) by formula (1), where:
B_w = B_x2 - B_x1,  B_h = B_y2 - B_y1   (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image;
Step 2.2, crop each sample of the training sample set X_in according to the parameter set (B_x1, B_y1, B_w, B_h) obtained in step 2.1 to obtain the body part image set X_B.
The formula for the within-set normalization of the training sample set X_in in step 3 is as follows:
X_im = (X_in - x_mean) / σ   (2)
In formula (2), X_in is the training sample set, X_im is the contextual emotion image set, σ is the standard-deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) · Σ_{i=1}^{n} x_i   (3)
σ = √( (1/n) · Σ_{i=1}^{n} (x_i - x_mean)² )   (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
The formula for the within-set normalization of the body part image set X_B in step 3 is as follows:
X_body = (X_B - x'_mean) / σ'   (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard-deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) · Σ_{i=1}^{n} x'_i   (6)
σ' = √( (1/n) · Σ_{i=1}^{n} (x'_i - x'_mean)² )   (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
In step 4, the upper-layer and lower-layer convolutional neural networks have the same structural parameters, and both adopt the VGG16 architecture.
The body part emotional feature T_F and the contextual emotional feature T_C in step 4 are calculated as follows:
T_F = F(X_body, W_F)   (8)
T_C = F(X_im, W_C)   (9)
In formula (8), W_F represents all parameters of all convolutional and pooling layers of the upper-layer convolutional neural network; in formula (9), W_C represents all parameters of all convolutional and pooling layers of the lower-layer convolutional neural network; F denotes the convolution and pooling computations in the feature extraction network.
The body part fusion weight λ_F and the context fusion weight λ_C in step 5 are calculated as follows:
λ_F = F(T_F, W_D)   (10)
λ_C = F(T_C, W_E)   (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer; and λ_F + λ_C = 1.
In step 5, the network structures of the upper adaptive layer and the lower adaptive layer are completely the same, each consisting of one max-pooling layer, two convolutional layers, and one Softmax layer; the specific architecture parameters are listed in Table 2 of the detailed description.
The weighted fusion of the body part emotional feature T_F and the scene-level contextual feature T_C with the body part fusion weight λ_F and the context fusion weight λ_C in step 6 is calculated as follows:
T_A = (T_F ⊗ λ_F) ∏ (T_C ⊗ λ_C)   (12)
In formula (12), T_A is the emotion fusion feature, ∏ denotes the concatenation operator that splices the weight-fused body part emotional feature and the weight-fused scene-level contextual emotional feature, and ⊗ denotes the convolution operation between the different features and their fusion weights.
The beneficial effects of the invention are as follows: the invention discloses a scene-level context-aware emotion recognition deep network method and proposes a two-stage context-aware emotion recognition network. On the one hand, the two-stage contextual emotion recognition network overcomes the limitation that existing image-based emotion recognition tasks mainly target face image data and therefore lack realism; on the other hand, the degree to which features of different attributes influence human emotion is fully considered during feature fusion, which improves the prediction performance of the model while enriching image-based emotion recognition research.
Drawings
FIG. 1 is an overall flow diagram of a scene level context aware emotion recognition deep network method of the present invention;
FIG. 2 is a diagram showing a complex emotion image and emotion dimension labeling information thereof;
FIG. 3 is a schematic diagram of a convolution operation;
FIG. 4 is a schematic view of the expansion of the receptive field by small convolution kernel stacking;
FIG. 5 is a schematic view of a pooling operation.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses a scene-level context-aware emotion recognition deep network method; its overall flow is shown in FIG. 1, and it specifically comprises the following steps:
Step 1, collecting images and determining a training sample set X_in and a test sample set X_tn.
Each training sample and each test sample has a corresponding original emotion annotation value and body annotation value.
For the training sample set X_in, the original emotion annotation is an n × 2-dimensional vector y = [(a_1, v_1), (a_2, v_2), ..., (a_n, v_n)], where (a_1, v_1) are the arousal and valence labels of the 1st sample in X_in and (a_n, v_n) are the arousal and valence labels of the n-th sample. The body part annotation is an n × 4-dimensional vector B = [(B_x1^1, B_y1^1, B_x2^1, B_y2^1), ..., (B_x1^n, B_y1^n, B_x2^n, B_y2^n)], whose 1st entry is the body annotation of the 1st sample used to build the body part image set X_B and whose n-th entry is the body annotation of the n-th sample.
For the test sample set X_tn, the original emotion annotation is an m × 2-dimensional vector ty = [(ta_1, tv_1), (ta_2, tv_2), ..., (ta_m, tv_m)] and the body part annotation is an m × 4-dimensional vector tB = [(tB_x1^1, tB_y1^1, tB_x2^1, tB_y2^1), ..., (tB_x1^m, tB_y1^m, tB_x2^m, tB_y2^m)], where m represents the number of test samples.
Step 2, reading the body annotation value of each sample in the training sample set X_in, and extracting the body part of each sample according to the body annotation value to obtain the body part image set X_B.
The specific steps of extracting the body part from the training sample set X_in are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corner points of the region where the body part is located, and calculate the position and size parameter set (B_x1, B_y1, B_w, B_h) by formula (1), where:
B_w = B_x2 - B_x1,  B_h = B_y2 - B_y1   (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image.
Step 2.2, crop each training sample of the training sample set X_in according to the parameter set (B_x1, B_y1, B_w, B_h) obtained in step 2.1 to obtain the body part image set X_B.
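As a concrete illustration of the cropping in step 2, the following minimal sketch assumes each sample is loaded as a NumPy array and the annotation supplies two opposite corner coordinates; the function and variable names are illustrative, not part of the patent.

```python
import numpy as np

def crop_body_part(image, body_box):
    """Crop the body region of one training sample.

    image:    H x W x 3 array (one sample of X_in)
    body_box: (B_x1, B_y1, B_x2, B_y2), two opposite corners of the body region
    returns:  the cropped body part image used to build X_B
    """
    b_x1, b_y1, b_x2, b_y2 = body_box
    # Position and size parameters as in formula (1)
    b_w = abs(b_x2 - b_x1)
    b_h = abs(b_y2 - b_y1)
    x0, y0 = min(b_x1, b_x2), min(b_y1, b_y2)
    return image[int(y0):int(y0 + b_h), int(x0):int(x0 + b_w), :]
```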
Step 3, performing within-set normalization on the training sample set X_in to obtain the contextual emotion image set X_im, and performing within-set normalization on the body part image set X_B to obtain the normalized body part image set X_body.
Wherein, the formula for the within-set normalization of the training sample set X_in is as follows:
X_im = (X_in - x_mean) / σ   (2)
In formula (2), X_in is the training sample set, X_im is the contextual emotion image set, σ is the standard-deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) · Σ_{i=1}^{n} x_i   (3)
σ = √( (1/n) · Σ_{i=1}^{n} (x_i - x_mean)² )   (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
For the body part image set X_B, the formula for the within-set normalization is as follows:
X_body = (X_B - x'_mean) / σ'   (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard-deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) · Σ_{i=1}^{n} x'_i   (6)
σ' = √( (1/n) · Σ_{i=1}^{n} (x'_i - x'_mean)² )   (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
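The within-set normalization of formulas (2) to (7) can be written directly with NumPy. The sketch below assumes the image sets are stacked into arrays of shape (n, H, W, 3); the small epsilon is an implementation detail added to avoid division by zero and is not part of the patent.

```python
import numpy as np

def normalize_set(images, eps=1e-8):
    """Normalize an image set by its own mean and standard-deviation images.

    images: array of shape (n, H, W, 3), e.g. X_in or X_B
    returns: the normalized set (X_im or X_body) plus the mean and std images
    """
    x_mean = images.mean(axis=0)        # formula (3) / (6): mean image
    sigma = images.std(axis=0) + eps    # formula (4) / (7): standard-deviation image
    return (images - x_mean) / sigma, x_mean, sigma   # formula (2) / (5)
```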
Step 4, feeding the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional feature T_F, and feeding the contextual emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional feature T_C.
Step 4.1, initialize the parameters of the whole network architecture, including all convolutional layers, pooling layers, and fully connected layers in the network; the weights of each layer are initialized to follow a Gaussian distribution with mean 0 and standard deviation 1, and the bias terms are uniformly initialized to 0.001;
Step 4.2, feed the body part image set X_body into the upper-layer convolutional neural network and the contextual emotion image set X_im into the lower-layer convolutional neural network; the upper-layer and lower-layer convolutional neural network models have the same structure and both adopt the VGG16 network architecture, whose parameters are shown in Table 1 below:
TABLE 1 Emotion feature extraction convolutional network architecture parameter table
(Table 1 is provided as an image in the original publication and is not reproduced here.)
As can be seen from network architecture parameter Table 1, the five convolutional layers C1, C2, C3, C4, and C5 in the network structure have 64, 128, 256, 512, and 512 feature maps, respectively. Each feature map is obtained by convolving the input image, or the output X_m of the previous layer, with the corresponding convolution templates K_uv and adding a bias term b_v; the convolution process is shown in FIG. 3, and the feature map is calculated as:
X_v = Σ_m ( X_m ⊛ K_uv ) + b_v   (13)
In formula (13), u takes values in {1, 2, 3, 4, 5} and indexes the convolutional layer, v ranges over the number of convolution templates of each layer, namely 64, 128, 256, 512, and 512, and ⊛ denotes the convolution operation with stride 1. The convolution kernels are all of size 3 × 3; stacking small convolution kernels enlarges the receptive field of the convolutional layers while effectively reducing the number of parameters, and a schematic diagram of the receptive field is shown in FIG. 4.
For the pooling layers S1, S2, S3, and S4, the outputs of the corresponding convolutional layers are down-sampled by max pooling. The pooled sampling region used in the invention is 2 × 2 with stride 2, and the pooling process is shown in FIG. 5. For example, a 2 × 2 region of the 1st feature map X_m of convolutional layer C1 is sampled to give the first output O_1 of the 1st feature map of pooling layer S1, where the sampling takes the maximum value in the 2 × 2 region; the other outputs are obtained similarly, and the horizontal and vertical spatial resolutions after sampling become 1/2 of the original.
Step 4.3, after the normalized body part image set X_body and the contextual emotion image set X_im pass through the iterative computations of the upper-layer and lower-layer convolutional neural networks, respectively, the body part emotional feature T_F and the scene-level contextual emotional feature T_C are obtained. The calculation process can be expressed as:
T_F = F(X_body, W_F)   (8)
T_C = F(X_im, W_C)   (9)
In formula (8), W_F represents the network parameters of the upper layer related to body part emotional feature extraction; in formula (9), W_C represents the network parameters of the lower layer related to scene-level contextual feature extraction; F denotes the convolution and pooling computations in the feature extraction network;
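A possible sketch of the two-stream feature extractor of step 4 in PyTorch, reusing the VGG16 convolutional stack from torchvision for both branches. The Gaussian initialization follows step 4.1; the class name, the use of torchvision, and other details are assumptions for illustration rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

def init_weights(module):
    # Step 4.1: Gaussian weights (mean 0, std 1, as stated in the text) and bias 0.001
    if isinstance(module, (nn.Conv2d, nn.Linear)) and module.bias is not None:
        nn.init.normal_(module.weight, mean=0.0, std=1.0)
        nn.init.constant_(module.bias, 0.001)

class TwoStreamExtractor(nn.Module):
    """Upper branch: body part images X_body -> T_F (parameters W_F).
       Lower branch: context images  X_im   -> T_C (parameters W_C)."""
    def __init__(self):
        super().__init__()
        self.body_net = vgg16(weights=None).features     # conv + max-pool stack
        self.context_net = vgg16(weights=None).features  # same VGG16 architecture
        self.apply(init_weights)

    def forward(self, x_body, x_im):
        t_f = self.body_net(x_body)   # formula (8): T_F = F(X_body, W_F)
        t_c = self.context_net(x_im)  # formula (9): T_C = F(X_im, W_C)
        return t_f, t_c
```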
Step 5, feeding the body part emotional feature T_F and the scene-level contextual emotional feature T_C into the upper and lower adaptive layers, respectively, for adaptive weight learning; the upper adaptive layer outputs the body part fusion weight λ_F and the lower adaptive layer outputs the context fusion weight λ_C.
For the adaptive layers, the network structures of the upper layer and the lower layer are completely the same; each of the two networks consists of one max-pooling layer, two convolutional layers, and one Softmax layer, and the overall structural parameters are shown in Table 2 below:
Table 2 Adaptive fusion network architecture parameter table
(Table 2 is provided as an image in the original publication and is not reproduced here.)
Finally, the body part fusion weight λ_F and the context fusion weight λ_C are output through the Softmax layer; the calculation process is as follows:
λ_F = F(T_F, W_D)   (10)
λ_C = F(T_C, W_E)   (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer. The final Softmax layer of the adaptive network imposes a constraint on the fusion weights, ensuring that λ_F + λ_C = 1.
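The adaptive fusion weights of step 5 could be sketched as below. Because the exact layer sizes of Table 2 are not reproduced in the text, the pooling and convolution dimensions here are assumptions; what the sketch preserves is the stated structure of one max-pooling layer, two convolutions, and a Softmax that enforces λ_F + λ_C = 1.

```python
import torch
import torch.nn as nn

class AdaptiveBranch(nn.Module):
    """One adaptive layer: max pooling followed by two convolutions, reduced
    to a single score per sample (channel and kernel sizes are assumed)."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.pool = nn.MaxPool2d(2, stride=2)
        self.conv1 = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(128, 1, kernel_size=3, padding=1)

    def forward(self, t):
        s = self.conv2(torch.relu(self.conv1(self.pool(t))))
        return s.mean(dim=(1, 2, 3))          # one scalar score per sample

class FusionWeights(nn.Module):
    """Outputs λ_F and λ_C with λ_F + λ_C = 1 via a shared Softmax."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.upper = AdaptiveBranch(in_channels)   # parameters W_D, input T_F
        self.lower = AdaptiveBranch(in_channels)   # parameters W_E, input T_C

    def forward(self, t_f, t_c):
        scores = torch.stack([self.upper(t_f), self.lower(t_c)], dim=1)
        lam = torch.softmax(scores, dim=1)         # constraint: weights sum to 1
        return lam[:, 0], lam[:, 1]                # λ_F, λ_C
```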
Step 6, emotion characteristics T of body partFScene level contextual emotional characteristics TCFusing weight lambda with body partFContext fusion weight λCCarrying out weighted fusion to obtain emotion fusion characteristic T combined with context informationAThen T is addedAObtaining initial predicted values of arousal and value through linear mapping of a full connection layer, adopting a KL divergence loss function to measure loss between the initial predicted values of arousal and value and corresponding original emotion marking values, carrying out backward propagation through a network, carrying out multiple iterations, updating network weight, gradually reducing loss, enabling the algorithm to gradually converge, completing training, and obtaining a network model.
Step 6.1, carrying out emotional characteristic T on the body partFScene level contextual emotional characteristics TCFusing weights λ with body partsFContext fusion weight λCCarrying out weighted fusion to obtain emotion fusion characteristics TAThe expression is as follows:
T_A = (T_F ⊗ λ_F) ∏ (T_C ⊗ λ_C)   (12)
In formula (12), ∏ denotes the concatenation operator, which splices the weight-fused body part emotional feature and the weight-fused scene-level contextual emotional feature, and ⊗ denotes the convolution operation between the different features and their fusion weights;
Step 6.2, the fusion feature T_A is sent to the fully connected layers for processing; because the predicted values are continuous, the last fully connected layer uses a linear activation function. The parameter structure of the fully connected layers is as follows:
Fully connected layer parameter table
(The fully connected layer parameter table is provided as an image in the original publication and is not reproduced here.)
Step 6.3, the final 256-dimensional emotional feature is linearly mapped by the fully connected layer Fc10 into the 2-dimensional predicted label values arousal and valence. KL divergence is adopted as the loss function to measure the loss between the predicted label values and the original label values; the network is back-propagated and iterated 80 times, the network weights are updated, the loss gradually decreases, the algorithm gradually converges, and training is completed.
The adopted loss function is the KL divergence, which is specifically defined as follows:
KL(y ∥ ly) = Σ_{i=1}^{n} p(y_i) · log( p(y_i) / q(ly_i) )   (14)
In formula (14), p(y_i) represents the true distribution of the original emotion label y, q(ly_i) represents the distribution of the model-predicted label values ly, and n represents the total number of training samples.
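The fusion, the fully connected head, and the loss of step 6 can be sketched as follows. The hidden size of 256, the assumed 512 × 7 × 7 VGG16 feature maps, and the softmax used to turn the two continuous dimensions into distributions for the KL term are illustrative choices, not the patent's exact configuration (the original fully connected parameter table is an image).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Weighted fusion (formula (12)) followed by fully connected layers with
    a linear output for the continuous arousal/valence predictions."""
    def __init__(self, feat_dim=512 * 7 * 7):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),           # linear mapping to (arousal, valence)
        )

    def forward(self, t_f, t_c, lam_f, lam_c):
        # Weight each stream by its fusion weight, then concatenate (formula (12))
        t_f = lam_f.view(-1, 1) * t_f.flatten(1)
        t_c = lam_c.view(-1, 1) * t_c.flatten(1)
        return self.fc(torch.cat([t_f, t_c], dim=1))

def kl_loss(pred, target):
    """KL divergence between label distributions in the spirit of formula (14);
    the softmax over the two dimensions is one possible reading, not the
    patent's exact recipe."""
    p = F.softmax(target, dim=1)          # "true" distribution p(y_i)
    log_q = F.log_softmax(pred, dim=1)    # predicted distribution q(ly_i), in log space
    return F.kl_div(log_q, p, reduction="batchmean")
```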
The back propagation of the convolutional neural network employed by the present invention includes three cases:
(1) When a pooling layer is followed by a fully connected layer, the error is back-propagated from the fully connected layer into the down-sampling (pooling) layer, and the gradient of each pixel in the feature map must be obtained:
δ_l^j = f'(u_l^j) ⊙ ( pad(δ_{l+1}^j) ⊛ rot180(W_{l+1}^j) )   (15)
As shown in formula (15), f'(u_l^j) is the derivative of the activation function of the l-th layer, j indexes the feature maps of the current layer, and δ_{l+1}^j is the gradient of the bias of layer l+1. The weight matrix W_{l+1}^j of layer l+1 is first rotated by 180 degrees, the neighborhood around δ_{l+1}^j is zero-padded (written pad(·) above), and the padded δ_{l+1}^j is convolved with the rotated weight matrix rot180(W_{l+1}^j); ⊙ denotes the element-wise product of two matrices. After the bias gradient of the corresponding elements in the current-layer feature map is obtained, the bias gradient and the weight gradient of the down-sampling layer are given by the following formulas:
∂E/∂b_l^j = Σ_{u,v} ( δ_l^j )_{u,v} ,   ∂E/∂w_l^j = Σ_{u,v} ( δ_l^j ⊙ d_l^j )_{u,v}   (16)
where d_l^j = downsample(x_{l-1}^j) is the down-sampling result of the j-th feature map of layer l-1.
(2) When a convolutional layer follows the pooling layer, the bias and weight gradients are solved in the same way as in case (1).
(3) When the layer after the convolutional layer is a pooling layer, the feature maps correspond one to one. Similarly, the bias gradient δ_l^j of each pixel in the current-layer feature map is first obtained:
δ_l^j = w_{l+1}^j ( f'(u_l^j) × upsample(δ_{l+1}^j) )   (17)
In formula (17), upsample(δ_{l+1}^j) denotes up-sampling δ_{l+1}^j: the j-th result of the down-sampling at layer l+1 is restored to the same size as the convolution feature map so that it can conveniently be element-wise multiplied with f'(u_l^j). The bias gradient and the weight gradient of the convolutional layer are given by formulas (18) and (19).
∂E/∂b_l^j = Σ_{u,v} ( δ_l^j )_{u,v}   (18)
∂E/∂w_l^j = Σ_{u,v} ( δ_l^j )_{u,v} · ( p_l^j )_{u,v}   (19)
In formulas (18) and (19), w_l^j is the convolution kernel corresponding to the j-th feature map x_l^j of layer l, and p_l^j is the result obtained by convolving the j-th feature map x_{l-1}^j of layer l-1 with the convolution kernel w_l^j.
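The three back-propagation cases above are derived analytically; in a framework with automatic differentiation the same gradient flow through convolution, pooling, and fully connected layers is produced without manual rotation or up-sampling. A minimal illustrative check with toy layer sizes (not the patent's network):

```python
import torch
import torch.nn as nn

# Toy conv + max-pool + linear stack; autograd reproduces the gradient chain
# described in cases (1)-(3) automatically.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(2, 2), nn.Flatten(), nn.Linear(8 * 16 * 16, 2))
x = torch.randn(4, 3, 32, 32)
loss = net(x).sum()
loss.backward()                      # fills .grad for every weight and bias
print(net[0].weight.grad.shape)      # gradients of the first convolution kernels
```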
Step 7, testing sample set XtnExtracting a test sample set X according to the step 2tnObtaining a test body part image set X of the body part of each test sampletBThen, according to step 3, respectively testing sample sets XtnAnd testing the body part image set XtBAfter normalization processing, sending the normalized result into the network model obtained in the step 6, and finally obtaining a test sample set XtnPredict tag values.
The specific process of step 7 is as follows:
Step 7.1, read the body annotation (tB_x1, tB_y1, tB_x2, tB_y2) of each sample in the test sample set X_tn, and calculate the position and size parameter set (tB_x1, tB_y1, tB_w, tB_h) by the following formula:
tB_w = tB_x2 - tB_x1,  tB_h = tB_y2 - tB_y1
Step 7.2, crop the test sample set X_tn according to the parameter set (tB_x1, tB_y1, tB_w, tB_h) obtained in step 7.1 to obtain the test body part image set X_tB;
Step 7.3, referring to step 3, perform within-set normalization on the test sample set X_tn and the test body part image set X_tB, respectively, to obtain the corresponding test contextual emotion image set X_tm and the normalized test body part image set X_tbody;
Step 7.4, feed the normalized test body part image set X_tbody into the upper-layer structure of the network model obtained in step 6 and the test contextual emotion image set X_tm into the lower-layer structure of that model; the predicted label values of the test sample set X_tn are obtained through model prediction.
Examples
The experiments of the invention are carried out on the EMOTIC database. The EMOTIC data set provides rich emotional images in complex scenes; the images contain not only the subject to be assessed but also a large amount of scene-level context information such as the surrounding environment and other factors. The data set contains 23554 samples, divided into 17077 training samples, 2088 validation samples, and 4389 test samples. The annotation information includes not only discrete labels and continuous dimensional labels but also the body part annotation of the subject in each image, which facilitates scene-level context research; some of the complex emotion images and their annotations are shown in FIG. 2.
The experimental results are compared as follows:
1) influence of different feature fusion modes on emotion recognition
Because features extracted from different network structures often have different attributes, directly concatenating the two kinds of features with different attributes, namely the body part emotional feature and the scene-level contextual emotional feature, cannot provide optimal discriminative performance. Therefore, in order to verify the effectiveness of the adaptive fusion network, the same experimental setup is adopted and the features output by the two convolutional neural network streams are compared under a direct-concatenation fusion mode and an adaptive-network fusion mode; the experimental results are shown in Table 3 below:
TABLE 3 influence of different feature fusion modes on emotion recognition
(Table 3 is provided as an image in the original publication and is not reproduced here.)
As can be seen from the data in the table, in terms of the fusion of emotional features, the adaptive fusion network designed by the invention is superior to directly concatenating the two kinds of features with different attributes. This verifies the effectiveness of introducing an adaptive fusion network into the contextual emotion recognition network structure.

Claims (9)

1. A scene-level context-aware emotion recognition deep network method, characterized by specifically comprising the following steps:
Step 1, collecting images and determining a training sample set X_in and a test sample set X_tn;
Step 2, reading the body annotation value of each sample in the training sample set X_in, and extracting the body part of each sample according to the body annotation value to obtain a body part image set X_B;
Step 3, performing within-set normalization on the training sample set X_in to obtain a contextual emotion image set X_im, and performing within-set normalization on the body part image set X_B to obtain a normalized body part image set X_body;
Step 4, feeding the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional feature T_F, and feeding the contextual emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional feature T_C;
Step 5, feeding the body part emotional feature T_F and the scene-level contextual emotional feature T_C into the upper and lower adaptive layers, respectively, for adaptive feature learning, the upper adaptive layer outputting the body part fusion weight λ_F and the lower adaptive layer outputting the context fusion weight λ_C;
Step 6, performing weighted fusion of the body part emotional feature T_F and the scene-level contextual emotional feature T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain an emotion fusion feature T_A combined with context information; T_A is then linearly mapped by a fully connected layer to initial predicted values of arousal and valence, a KL divergence loss function is adopted to measure the loss between these initial predicted values and the corresponding original emotion annotation values, and through network back-propagation and multiple iterations the network weights are updated and the loss gradually decreases, so that the algorithm gradually converges, training is completed, and the network model is obtained;
Step 7, extracting, according to step 2, the body part of each test sample in the test sample set X_tn to obtain a test body part image set X_tB; then, according to step 3, normalizing the test sample set X_tn and the test body part image set X_tB, respectively, and feeding them into the network model obtained in step 6 to finally obtain the predicted label values of the test sample set X_tn.
2. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the specific steps of extracting the body part from the training sample set X_in in step 2 are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corner points of the region where the body part is located, and calculate the position and size parameter set (B_x1, B_y1, B_w, B_h) by formula (1), where:
B_w = B_x2 - B_x1,  B_h = B_y2 - B_y1   (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image;
Step 2.2, crop each sample of the training sample set X_in according to the parameter set (B_x1, B_y1, B_w, B_h) obtained in step 2.1 to obtain the body part image set X_B.
3. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the formula for the within-set normalization of the training sample set X_in in step 3 is as follows:
X_im = (X_in - x_mean) / σ   (2)
In formula (2), X_in is the training sample set, X_im is the contextual emotion image set, σ is the standard-deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) · Σ_{i=1}^{n} x_i   (3)
σ = √( (1/n) · Σ_{i=1}^{n} (x_i - x_mean)² )   (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
4. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the formula for the within-set normalization of the body part image set X_B in step 3 is as follows:
X_body = (X_B - x'_mean) / σ'   (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard-deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) · Σ_{i=1}^{n} x'_i   (6)
σ' = √( (1/n) · Σ_{i=1}^{n} (x'_i - x'_mean)² )   (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
5. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein in step 4 the upper-layer and lower-layer convolutional neural networks have the same structural parameters and both adopt the VGG16 architecture.
6. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the body part emotional feature T_F and the contextual emotional feature T_C in step 4 are calculated as follows:
T_F = F(X_body, W_F)   (8)
T_C = F(X_im, W_C)   (9)
In formula (8), W_F represents all parameters of all convolutional and pooling layers of the upper-layer convolutional neural network; in formula (9), W_C represents all parameters of all convolutional and pooling layers of the lower-layer convolutional neural network; F denotes the convolution and pooling computations in the feature extraction network.
7. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the body part fusion weight λ_F and the context fusion weight λ_C in step 5 are calculated as follows:
λ_F = F(T_F, W_D)   (10)
λ_C = F(T_C, W_E)   (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer; and λ_F + λ_C = 1.
8. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein in step 5 the network structures of the upper adaptive layer and the lower adaptive layer are completely the same, the upper adaptive layer and the lower adaptive layer each comprising one max-pooling layer, two convolutional layers, and one Softmax layer.
9. The scene-level context-aware emotion recognition deep network method as claimed in claim 1, wherein the weighted fusion of the body part emotional feature T_F and the scene-level contextual feature T_C with the body part fusion weight λ_F and the context fusion weight λ_C in step 6 is calculated as follows:
T_A = (T_F ⊗ λ_F) ∏ (T_C ⊗ λ_C)   (12)
In formula (12), T_A is the emotion fusion feature, ∏ denotes the concatenation operator that splices the weight-fused body part emotional feature and the weight-fused scene-level contextual emotional feature, and ⊗ denotes the convolution operation between the different features and their fusion weights.
CN202010664287.3A 2020-07-10 2020-07-10 Scene-level context-aware emotion recognition deep network method Active CN111985532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010664287.3A CN111985532B (en) 2020-07-10 2020-07-10 Scene-level context-aware emotion recognition deep network method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010664287.3A CN111985532B (en) 2020-07-10 2020-07-10 Scene-level context-aware emotion recognition deep network method

Publications (2)

Publication Number Publication Date
CN111985532A CN111985532A (en) 2020-11-24
CN111985532B true CN111985532B (en) 2021-11-09

Family

ID=73439067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010664287.3A Active CN111985532B (en) 2020-07-10 2020-07-10 Scene-level context-aware emotion recognition deep network method

Country Status (1)

Country Link
CN (1) CN111985532B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733756B (en) * 2021-01-15 2023-01-20 成都大学 Remote sensing image semantic segmentation method based on W divergence countermeasure network
CN113011504B (en) * 2021-03-23 2023-08-22 华南理工大学 Virtual reality scene emotion recognition method based on visual angle weight and feature fusion
CN113076905B (en) * 2021-04-16 2022-12-16 华南理工大学 Emotion recognition method based on context interaction relation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512680A (en) * 2015-12-02 2016-04-20 北京航空航天大学 Multi-view SAR image target recognition method based on depth neural network
CN108830296A (en) * 2018-05-18 2018-11-16 河海大学 A kind of improved high score Remote Image Classification based on deep learning
CN109977413A (en) * 2019-03-29 2019-07-05 南京邮电大学 A kind of sentiment analysis method based on improvement CNN-LDA
WO2019174376A1 (en) * 2018-03-14 2019-09-19 大连理工大学 Lung texture recognition method for extracting appearance and geometrical feature based on deep neural network
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 A kind of barrage file classification method, device, equipment and storage medium
CN110472245A (en) * 2019-08-15 2019-11-19 东北大学 A kind of multiple labeling emotional intensity prediction technique based on stratification convolutional neural networks

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512680A (en) * 2015-12-02 2016-04-20 北京航空航天大学 Multi-view SAR image target recognition method based on depth neural network
WO2019174376A1 (en) * 2018-03-14 2019-09-19 大连理工大学 Lung texture recognition method for extracting appearance and geometrical feature based on deep neural network
CN108830296A (en) * 2018-05-18 2018-11-16 河海大学 A kind of improved high score Remote Image Classification based on deep learning
CN109977413A (en) * 2019-03-29 2019-07-05 南京邮电大学 A kind of sentiment analysis method based on improvement CNN-LDA
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 A kind of barrage file classification method, device, equipment and storage medium
CN110472245A (en) * 2019-08-15 2019-11-19 东北大学 A kind of multiple labeling emotional intensity prediction technique based on stratification convolutional neural networks

Also Published As

Publication number Publication date
CN111985532A (en) 2020-11-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant