CN116563908A - Face analysis and emotion recognition method based on multitasking cooperative network - Google Patents

Face analysis and emotion recognition method based on multitasking cooperative network Download PDF

Info

Publication number
CN116563908A
CN116563908A (application number CN202310204150.3A)
Authority
CN
China
Prior art keywords
face
feature
detail
boundary
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310204150.3A
Other languages
Chinese (zh)
Inventor
宋海裕
王浩宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Finance and Economics
Original Assignee
Zhejiang University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Finance and Economics filed Critical Zhejiang University of Finance and Economics
Priority to CN202310204150.3A priority Critical patent/CN116563908A/en
Publication of CN116563908A publication Critical patent/CN116563908A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 - Facial expression recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a face analysis and emotion recognition method based on a multi-task collaborative network. The method comprises the following steps: step 1, preprocessing the experimental data; step 2, constructing the MPNET network model; step 2.1, adopting ResNet18 as the backbone network of the encoder to extract semantic information from the input picture; step 2.2, constructing an edge perception branch and adding a detail perception module DPM and a feature fusion module FFM into it; step 2.3, constructing a segmentation branch for outputting the face analysis result and supervising face analysis; step 2.4, constructing a classification branch for recognizing facial emotion; step 3, training MPNET with the inter-task consistency learning loss function and the intra-task loss function; step 4, performing experiments with the trained MPNET network model and verifying its effect on the CelebAMask_HQ dataset. The invention integrates face analysis and facial emotion recognition into one network, has high real-time performance and accuracy, and can be deployed on mobile terminals and other devices.

Description

Face analysis and emotion recognition method based on multitasking cooperative network
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a face analysis and emotion recognition method based on a multitasking collaborative network.
Background
Face analysis is a fine-grained semantic segmentation task, often applied in photo retouching, photo beautification and the like. Facial emotion recognition is a classification task that can be applied in fields such as human-computer interaction and mental health assessment. The present method develops a new multi-task collaborative network that realizes face analysis and emotion recognition simultaneously. Compared with other methods, the inference speed and accuracy are significantly improved, and the method can be deployed on mobile terminal devices such as mobile phones.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a face analysis and emotion recognition method based on a multi-task collaborative network, which realizes face analysis and emotion recognition simultaneously. The invention provides a deep learning model named MPNET; the specific steps are as follows:
step 1, preprocessing experimental data;
step 2, constructing an MPNET network model;
step 3, training an MPNET network model;
and 4, carrying out experiments on the face analysis data sets by adopting a trained MPNET network model, and evaluating the experimental results.
The step 1 specifically comprises the following steps:
step 1.1, normalizing the image in order to improve the generalization capability of the model;
step 1.2, cropping the normalized image to a size of 512 × 512;
step 1.3, performing data enhancement on the cropped image, specifically by random rotation and random scaling;
step 1.4, dividing the data into a training set, a validation set and a test set (a minimal preprocessing sketch is given below).
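For illustration only, the following is a minimal preprocessing sketch in PyTorch/torchvision covering steps 1.1 to 1.4. The normalization statistics, rotation range, scaling range and the 8:1:1 split ratio are assumptions; the original text does not specify them.

```python
import random
import torchvision.transforms as T

# Image pipeline for steps 1.1-1.3 (ordering and parameter values are assumptions).
train_transform = T.Compose([
    T.RandomRotation(degrees=15),                  # step 1.3: random rotation
    T.RandomResizedCrop(512, scale=(0.8, 1.0)),    # steps 1.2-1.3: random scaling, crop to 512 x 512
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],        # step 1.1: normalization (ImageNet statistics assumed)
                std=[0.229, 0.224, 0.225]),
])

# Step 1.4: split the samples into training / validation / test sets (8:1:1 assumed).
def split_dataset(samples, seed=0):
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)
    n = len(samples)
    return samples[:int(0.8 * n)], samples[int(0.8 * n):int(0.9 * n)], samples[int(0.9 * n):]
```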
The step 2 comprises the following steps:
step 2.1, adopting ResNet18 as a backbone network of an encoder, and extracting semantic information of an input picture;
step 2.2, constructing an edge perception branch, and adding a Detail Perception Module (DPM) and a Feature Fusion Module (FFM) into the edge perception branch.
The second-layer features of ResNet18 first pass through a detail perception module DPM; the output of the DPM and the third-layer features of ResNet18 after 2× upsampling are fused by a feature fusion module FFM to obtain fusion feature I.
Further, fusion feature I passes through a detail perception module DPM again, and its output is fused with the fourth-layer features of ResNet18 after 4× upsampling by a feature fusion module FFM to obtain fusion feature II. After fusion feature II passes through a DPM again and is upsampled by 4×, it is fed into two Detail Heads to obtain a facial binary-classification boundary map and a facial multi-class boundary map respectively. Each Detail Head consists of a 3×3 convolution layer, a BatchNorm layer, a ReLU activation function, and a 1×1 convolution.
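For illustration, a minimal PyTorch sketch of the Detail Head described above; the channel counts and the number of boundary classes are assumptions not given in the text.

```python
import torch
import torch.nn as nn

class DetailHead(nn.Module):
    """Detail Head: 3x3 conv -> BatchNorm -> ReLU -> 1x1 conv, as described above."""
    def __init__(self, in_channels: int, mid_channels: int, num_classes: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, num_classes, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# Hypothetical usage: one head for the binary boundary map, one for the multi-class boundary map.
binary_head = DetailHead(in_channels=128, mid_channels=64, num_classes=2)
multi_head = DetailHead(in_channels=128, mid_channels=64, num_classes=19)  # class count assumed
```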
The main structure of the detail perception module DPM is as follows:
For an input feature X, spatial attention map I is first obtained through a global max pooling layer and two 1×1 convolution layers; spatial attention map II is obtained by passing the input feature X through a global average pooling layer and two 1×1 convolution layers. Spatial attention map I is added to spatial attention map II, and the final spatial attention map is obtained through a softmax function. The spatial attention map is multiplied with the input feature X to obtain the output feature y.
Further, the output feature y is taken as a new input feature: channel attention map I is obtained through a global max pooling layer, and channel attention map II is obtained through a global average pooling layer. Channel attention map I is added to channel attention map II, the final channel attention map is then obtained through a 1×1 convolution layer and a softmax function, and the channel attention map is multiplied with the input feature to obtain the feature that finally passes through the detail perception module.
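The following is a minimal PyTorch sketch of one possible reading of the DPM. The pooling axes (channel-wise pooling for the spatial attention map, spatial pooling for the channel attention map), the softmax normalization axes and the channel counts are assumptions, since the text does not fully specify them.

```python
import torch
import torch.nn as nn

class DPM(nn.Module):
    """Detail Perception Module: spatial attention followed by channel attention (a sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        # Two 1x1 conv stacks producing the two spatial attention maps from pooled 1-channel maps.
        self.spatial_max = nn.Sequential(nn.Conv2d(1, 1, 1), nn.Conv2d(1, 1, 1))
        self.spatial_avg = nn.Sequential(nn.Conv2d(1, 1, 1), nn.Conv2d(1, 1, 1))
        # 1x1 conv applied to the summed channel attention vector.
        self.channel_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Spatial attention: pool over the channel dimension (assumption), then two 1x1 convs.
        sa1 = self.spatial_max(x.max(dim=1, keepdim=True).values)
        sa2 = self.spatial_avg(x.mean(dim=1, keepdim=True))
        sa = torch.softmax((sa1 + sa2).flatten(2), dim=-1).view_as(sa1)  # softmax over spatial positions
        y = x * sa
        # Channel attention: global max / average pooling over the spatial dimensions.
        ca1 = torch.amax(y, dim=(2, 3), keepdim=True)
        ca2 = torch.mean(y, dim=(2, 3), keepdim=True)
        ca = torch.softmax(self.channel_conv(ca1 + ca2), dim=1)          # softmax over channels
        return y * ca
```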
The main structure of the feature fusion module is as follows:
input feature Z 1 and Z2 . The two features are spliced, and the branching attention pattern is obtained through a global average pooling layer, a convolution layer of 1 multiplied by 1 and a softmax function. Expanding the branch attention according to the channel dimension, and respectively matching the branch attention with Z according to the set dimension index 1 and Z2 And multiplying to obtain the final fused output characteristic Z. For example: the dimension of the branch attention pattern is 512, and the weight value corresponding to the expanded number 0-255 channel is equal to Z 1 Multiplying the weight value corresponding to the 256-511 channel with Z 2 The multiplication is performed and,
and 2.3, constructing a segmentation branch for outputting a face analysis result and supervising the face analysis.
For the fifth-layer features of ResNet18, a decoder with five identically structured layers is designed; each decoder layer consists of a 3×3 convolution layer and an upsampling operation. The input features are restored to the original resolution by this five-layer decoder, yielding the supervised face analysis result.
For the fifth-layer features of the encoder ResNet18, 8× upsampling is first performed to obtain feature Y2. The feature Y1 output by the last detail perception module DPM in the edge perception branch and the 8×-upsampled feature Y2 are then fed into the dual-graph adaptive learning module DGALM, and the final face analysis result is obtained by passing the output of the DGALM through a Seg Head. The Seg Head consists of a 3×3 convolution layer, a BatchNorm layer, a ReLU activation function, and a 1×1 convolution.
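For illustration, a sketch of the five-layer decoder used for the supervised face analysis output; the channel schedule and the 19-class output are assumptions, and 2× upsampling per layer is assumed so that five layers recover the 32× stride of the ResNet18 fifth-layer features.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Five identically structured decoder layers, each a 3x3 conv followed by 2x upsampling (a sketch)."""
    def __init__(self, in_channels: int = 512, num_classes: int = 19):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(5):                                   # five layers, 2x up each -> 32x in total
            out_c = max(c // 2, num_classes)
            layers += [
                nn.Conv2d(c, out_c, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            ]
            c = out_c
        self.decode = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(x)                                # supervised face parsing logits
```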
The main structure of the dual-graph adaptive learning module is as follows. First, feature Y1 and feature Y2 are spliced, and the spliced features are passed through two separate 1×1 convolutions to obtain the semantic feature map Z_semantic and the detail feature map Z_detail. The binary face boundary map obtained in the boundary perception branch is scaled to 1/4 of its original size, and the binary face boundary is then used to divide Z_semantic and Z_detail into boundary pixels and non-boundary pixels. The specific formulas are as follows:
[Z_detail_edge, Z_detail_noneedge] = Z_detail ⊙ [Mask, A − Mask]
[Z_semantic_edge, Z_semantic_noneedge] = Z_semantic ⊙ [Mask, A − Mask]
where ⊙ denotes the matrix dot product; Z_detail_noneedge is the detail feature map that does not contain boundary pixels, and Z_detail_edge is the detail feature map that contains boundary pixels; Z_semantic_noneedge is the semantic feature map that does not contain boundary pixels, and Z_semantic_edge is the semantic feature map that contains boundary pixels; A is a matrix whose elements are all 1; argmax_{dim=2} denotes the index of the maximum value taken along the second dimension of the feature.
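A small PyTorch sketch of the boundary/non-boundary split described by the formulas above. Deriving Mask via an argmax over the two boundary channels, and the bilinear rescaling, are assumptions based on the text.

```python
import torch
import torch.nn.functional as F

def split_by_boundary(z: torch.Tensor, boundary_logits: torch.Tensor):
    """Split a feature map z (B, C, H, W) into boundary / non-boundary parts using a binary boundary map."""
    # Scale the binary boundary prediction to the spatial size of z (1/4 of the original in the text).
    boundary_logits = F.interpolate(boundary_logits, size=z.shape[-2:], mode="bilinear", align_corners=False)
    # Mask: 1 at boundary pixels, 0 elsewhere (argmax over the 2-class boundary channels, an assumption).
    mask = boundary_logits.argmax(dim=1, keepdim=True).float()
    ones = torch.ones_like(mask)                       # the all-ones matrix A
    z_edge = z * mask                                  # Z_edge     = Z ⊙ Mask
    z_noneedge = z * (ones - mask)                     # Z_noneedge = Z ⊙ (A − Mask)
    return z_edge, z_noneedge
```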
Further, the supervised face analysis result obtained in the segmentation branch is scaled to 1/4 of its original size, and the Top-k elements are selected from it as face components to represent the vertices of the graph,
where Z_graph_semantic is the face semantic component, Z_graph_detail is the face detail component, Z_semantic_noneedge is the semantic feature map that does not contain boundaries, Z_detail_edge is the detail feature map that contains boundaries, and C is the number of channels of the feature.
Further, graph reasoning is performed through one layer of graph convolution, and long-range interactions among the pixels of different face components are established by graph neural message passing, yielding the updated semantic and detail graph features.
Further, mapping matrices P1 and P2 are constructed to map the features into the original geometric space.
further, the transposed mapping matrix is multiplied by the features after graph reasoning, the features are mapped back to the original geometric space, and the final feature output result is X out
wherein ,representing a semantic feature map mapped back to the original geometric space; />Representing a detail feature map mapped back to the original geometric space; />Representing a feature stitching operation.
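For illustration only, a compact sketch of the project, graph-convolve, re-project pattern described in the paragraphs above. The way the mapping matrix is produced, the number of graph nodes, the single learnable adjacency and the residual addition are all assumptions.

```python
import torch
import torch.nn as nn

class GraphReasoning(nn.Module):
    """Project pixels to k graph nodes, apply one graph-convolution layer, project back (a sketch)."""
    def __init__(self, channels: int, num_nodes: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(channels, num_nodes, 1)   # produces the mapping matrix P (pixels -> nodes)
        self.gcn = nn.Linear(channels, channels)        # node feature transform of a single GCN layer
        self.adj = nn.Parameter(torch.eye(num_nodes))   # learnable adjacency for message passing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = torch.softmax(self.proj(x).flatten(2), dim=-1)         # (B, K, HW) mapping matrix
        nodes = torch.bmm(p, x.flatten(2).transpose(1, 2))         # (B, K, C) graph vertices
        nodes = torch.relu(self.gcn(torch.matmul(self.adj, nodes)))  # message passing + transform
        out = torch.bmm(p.transpose(1, 2), nodes)                  # map back with the transposed matrix
        return out.transpose(1, 2).view(b, c, h, w) + x            # residual add (an assumption)
```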
Step 2.4, constructing a classification branch for facial emotion recognition.
For the last-layer (fifth-layer) features S = [s_1, s_2, ..., s_C] output by the encoder ResNet18, each s_i is regarded as an image patch input to a Transformer layer; the patches are fed into the Transformer layer, and the output features finally pass through an MLP layer to obtain the facial emotion recognition result.
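A minimal PyTorch sketch of such a classification branch, treating each of the C channel maps as one token. The token embedding size (16×16 spatial positions for a 512×512 input at stride 32), head count, mean pooling before the MLP, and the number of emotion classes are assumptions.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Classification branch: channel maps as tokens -> Transformer layer -> MLP (a sketch)."""
    def __init__(self, spatial: int = 16 * 16, num_emotions: int = 7):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=spatial, nhead=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(spatial, 128), nn.ReLU(inplace=True), nn.Linear(128, num_emotions))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape                  # last-layer encoder feature S
        tokens = feat.flatten(2)                 # (B, C, H*W): each s_i is one token
        tokens = self.encoder(tokens)            # one Transformer layer over the C tokens
        return self.mlp(tokens.mean(dim=1))      # pool the tokens, then MLP -> emotion logits
```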
The step 3 comprises the following steps:
and 3.1, constructing an intra-task loss function.
First, the loss function of the segmentation branch mainly consists of the loss of supervised face analysis and the loss of the output face analysis; the cross entropy loss function is used, specifically as follows:
further, constructing a loss function of the boundary-aware branch, we use a cross entropy loss function, specifically as follows:
further, a loss function of facial emotion recognition is constructed, and a cross entropy loss function is used, specifically as follows:
further, the total intra-task loss function is:
and 3.2, constructing a consistency loss function among tasks.
Keep the No. 0 channel of the multi-class boundary map unchanged and merge the remaining channels to obtain Seg_2-joint-3, which represents the face boundary reduced to two classes.
The task consistency loss between the binary-classification boundary task and the multi-classification boundary task is then calculated using the Dice coefficient.
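A small sketch of a Dice-based consistency term between two boundary predictions; the soft-Dice form and the smoothing constant are assumptions.

```python
import torch

def dice_consistency(pred_a: torch.Tensor, pred_b: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """1 - Dice coefficient between two (B, H, W) boundary probability maps (a sketch)."""
    inter = (pred_a * pred_b).sum(dim=(1, 2))
    union = pred_a.sum(dim=(1, 2)) + pred_b.sum(dim=(1, 2))
    dice = (2 * inter + eps) / (union + eps)
    return (1 - dice).mean()          # small when the two tasks agree on the boundary
```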
Further, the consistency loss among the binary-classification boundary task, the multi-classification boundary task and the face analysis task is calculated. First, the index of the maximum value is taken along the second dimension of the face analysis result to obtain the analysis mask; a boundary localization algorithm is then used to assign 1 to the pixels located on the boundary of the analysis mask and 0 to the other non-boundary pixels, and the two resulting maps are multiplied together.
Then, the task consistency loss between the analysis task and the binary-classification boundary task and the consistency loss between the analysis task and the multi-classification boundary task are calculated respectively using the Dice coefficient.
Further, the overall inter-task consistency loss function is formed by combining the above consistency terms.
The step 4 specifically comprises the following steps:
Step 4.1: the F1 coefficient is introduced to evaluate the effects of face analysis and emotion recognition; it is defined as F1 = 2 × Precision × Recall / (Precision + Recall).
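A small sketch of the per-class F1 computation such an evaluation typically uses; the counting and averaging scheme over classes is an assumption.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R), computed from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```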
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention realizes face analysis and facial emotion recognition by establishing the MPNET deep learning model. A boundary perception branch is added so that the face analysis result is more refined, and a dual-graph adaptive learning module is added to establish the dependency relationships among different face components. At the same time, the FPS of MPNET on an RTX 3090 reaches 92.9 and the model has only 11.63M parameters, so it has high real-time performance and can be deployed on devices such as mobile terminals.
Drawings
Fig. 1 is a diagram of the network architecture of MPENet.
Fig. 2 is an example of the effect of MPENet compared with other models.
Fig. 3 is an example of the effect of the MPENet ablation experiments.
Detailed Description
The invention will be further described with reference to the drawings and the specific examples.
In order to solve the problems encountered in face analysis and facial expression recognition, the invention designs a novel multi-task collaborative learning network for face analysis and facial emotion recognition. Specifically, MPENet consists of one shared encoder and three downstream branches (a classification branch, a segmentation branch and an edge perception branch). In the classification branch, a Transformer module converts the features extracted by the shared encoder into embedding-level features for facial expression recognition. In the edge perception branch, multi-class face boundaries and binary-class boundaries are used to extract face boundary information, helping the face analysis task to better locate face boundaries. In the segmentation branch, a dual-graph adaptive learning module fuses the edge information and semantic information of the image to infer the relations between different feature regions and capture more context, while an additional decoder is designed as the supervised output of face analysis to obtain a finer analysis map. Finally, a consistency learning loss function is designed between tasks so that the tasks reinforce each other and the overall accuracy of the model is improved.
Example 1: preprocessing of the experimental data.
(1) Normalize the data.
(2) Crop the picture to a size of 512 × 512.
(3) Perform data enhancement on the cropped image by random rotation and random scaling.
(4) The data set is divided into a training set, a validation set and a test set.
Example 2: constructing the MPENet network model.
(1) ResNet18 is adopted as the backbone network of the encoder to extract semantic information.
(2) The boundary perception branch is constructed. The second-layer features of ResNet18 first pass through the DPM, and then pass through the FFM together with the 2×-upsampled third-layer features. Further, the fused features pass through the DPM again and then through the FFM together with the 4×-upsampled fourth-layer features of ResNet18. Finally, after the DPM and 4× upsampling, the fused features are fed into two Detail Heads to obtain the facial binary-classification boundary map and the facial multi-class boundary map respectively.
(3) For the fifth-layer features of the encoder ResNet18, a five-layer decoder structure is designed; each decoder layer consists of a 3×3 convolution layer and upsampling, through which the input features are restored to the original resolution to obtain the supervised face analysis result.
(4) For the fifth-layer features of the encoder ResNet18, 8× upsampling is first performed; the feature X of the last DPM in the edge perception branch and the 8×-upsampled feature Y are then fed together into the DGALM, and the final face analysis result is obtained by passing the DGALM output through a Seg Head.
(5) The last-layer features output by the encoder ResNet18 are passed through a Transformer layer and then an MLP layer to obtain the final facial emotion classification result.
Example 3: training the MPENet network model.
(1) SGD is adopted as the optimization method.
(2) The ResNet18 weights of the MPENet encoder are initialized with weights pre-trained on the ImageNet dataset (a training-setup sketch follows below).
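A minimal sketch of this training setup using torchvision; the learning rate, momentum and weight decay values are assumptions not given in the text.

```python
import torch
import torchvision

# Encoder backbone: ResNet18 pre-trained on ImageNet, with the classification head removed.
backbone = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()

# SGD optimization; hyperparameter values are assumptions.
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
```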
Example 4: experiments were performed on the public face dataset CelebAMask_HQ using the trained MPENet network model, and the experimental effect was evaluated.
(1) Table 1 below compares MPENet with current mainstream semantic segmentation frameworks on the CelebAMask_HQ dataset. The mean F1 coefficient of the model reaches 85.9%, and the mean F1 for facial emotion is 80.04%. See the comparison of MPENet with other methods in Table 1.
Table 1. Comparison of MPENet with results from other models
(2) Table 2 below shows an ablation experiment of MPENet on the CelebAMask_HQ dataset; it can be seen that each module of MPENet improves the model accuracy.
Table 2. Ablation experiments of MPENet
(3) Table 3 below shows the performance comparison of MPNET with other models; it can be seen that MPNET is at a leading level in both inference speed and accuracy, with an FPS of 92.9 and only 11.63M model parameters.
Table 3. Comparison of the performance of MPNET with other models

Claims (10)

1. A face analysis and emotion recognition method based on a multitasking cooperative network is characterized by comprising the following steps:
step 1, preprocessing experimental data;
step 2, constructing an MPNET network model;
step 3, training an MPNET network model;
and 4, carrying out experiments on the face analysis data sets by adopting a trained MPNET network model, and evaluating the experimental results.
2. The face parsing and emotion recognition method based on the multitasking collaborative network according to claim 1, wherein the step 2 includes the steps of:
step 2.1, adopting ResNet18 as a backbone network of an encoder, and extracting semantic information of an input picture;
step 2.2, constructing an edge perception branch, and adding a detail perception module DPM and a feature fusion module FFM into the edge perception branch;
step 2.3, constructing a segmentation branch for outputting a face analysis result and supervising the face analysis;
and 2.4, constructing a classification branch for face emotion recognition.
3. The face parsing and emotion recognition method based on the multitasking collaborative network according to claim 2, wherein the step 2.2 is specifically implemented as follows:
the second layer of features of the ResNet18 firstly pass through a detail perception module DPM, and the output of the detail perception module DPM and the features of the third layer of features of the ResNet18 which are subjected to 2 times up-sampling are subjected to feature fusion together through a feature fusion module FFM to obtain fusion features I; the fusion feature I passes through the detail perception module DPM again, and the output of the fusion feature I and the feature of the ResNet18 fourth layer which is subjected to 4 times of upsampling are fused together through the feature fusion module FFM to obtain a fusion feature II; after the final fusion feature II is up-sampled again by the detail perception module DPM and 4 times, the final fusion feature II is respectively sent into two Detai heads to obtain a facial bi-classification boundary mapAnd face multi-class boundary map->
4. A face parsing and emotion recognition method based on a multitasking collaborative network according to claim 2 or 3, characterized in that the detail perception module DPM has the following structure:
for the input feature X, spatial attention map I is first obtained through a global max pooling layer and two 1×1 convolution layers; spatial attention map II is obtained by passing the input feature X through a global average pooling layer and two 1×1 convolution layers; spatial attention map I is added to spatial attention map II, and the final spatial attention map is obtained through a softmax function; the spatial attention map is multiplied with the input feature X to obtain the output feature y;
the output feature y is taken as a new input feature: channel attention map I is obtained through a global max pooling layer, and channel attention map II is obtained through a global average pooling layer; channel attention map I is added to channel attention map II, the final channel attention map is then obtained through a 1×1 convolution layer and a softmax function, and the channel attention map is multiplied with the input feature to obtain the feature that finally passes through the detail perception module.
5. A face parsing and emotion recognition method based on a multitasking collaborative network according to claim 2 or 3, characterized in that the feature fusion module has the following structure:
for input feature Z 1 and Z2 Firstly, splicing two features, and then obtaining a branch attention pattern through a global average pooling layer, a convolution layer of 1 multiplied by 1 and a softmax function; expanding the branch attention according to the channel dimension, and respectively matching the branch attention with Z according to the set dimension index 1 and Z2 And multiplying to obtain the final fused output characteristic Z.
6. The face parsing and emotion recognition method based on the multitasking collaborative network according to claim 2, wherein the step 2.3 is specifically implemented as follows:
for the fifth layer of the ResNet18, a five-layer decoder with the same structure is designed, each decoder is composed of a 3×3 convolution layer and upsampling, and the input features are restored to the original resolution through the five-layer decoder with the same structure, so as to obtain the supervised face analysis result
For the fifth layer of characteristics of the ResNet18 of the encoder, 8 times of up-sampling is performed to obtain a characteristic Y2, then the characteristic Y1 of the last detail sensing module DPM in the edge sensing branch and the characteristic Y2 after 8 times of up-sampling are sent to a double-image self-adaptive learning module DGALM, and a final face analysis result can be obtained after the characteristics of the DGALM and a Seg Head.
7. The face parsing and emotion recognition method based on the multi-task cooperative network as claimed in claim 6, wherein the structure of the dual-graph adaptive learning module is as follows:
(1) feature Y1 and feature Y2 are spliced, and the spliced features are passed through two separate 1×1 convolutions to obtain the semantic feature map Z_semantic and the detail feature map Z_detail; the binary face boundary map obtained in the boundary perception branch is scaled to 1/4 of its original size, and the binary face boundary is used to divide Z_semantic and Z_detail into boundary pixels and non-boundary pixels, with the specific formulas as follows:
[Z_detail_edge, Z_detail_noneedge] = Z_detail ⊙ [Mask, A − Mask]
[Z_semantic_edge, Z_semantic_noneedge] = Z_semantic ⊙ [Mask, A − Mask]
where ⊙ denotes the matrix dot product; Z_detail_noneedge is the detail feature map that does not contain boundary pixels, and Z_detail_edge is the detail feature map that contains boundary pixels; Z_semantic_noneedge is the semantic feature map that does not contain boundary pixels, and Z_semantic_edge is the semantic feature map that contains boundary pixels; A is a matrix whose elements are all 1; argmax_{dim=2} denotes the index of the maximum value taken along the second dimension of the feature;
(2) the supervised face analysis result obtained in the segmentation branch is scaled to 1/4 of its original size, and the Top-k elements are selected from it as face components to represent the vertices of the graph,
where Z_graph_semantic is the face semantic component, Z_graph_detail is the face detail component, Z_semantic_noneedge is the semantic feature map that does not contain boundaries, Z_detail_edge is the detail feature map that contains boundaries, and C is the number of channels of the feature;
(3) graph reasoning is performed through one layer of graph convolution to obtain the updated semantic and detail graph features;
(4) mapping matrices P1 and P2 are constructed to map the features into the original geometric space;
(5) the transposed mapping matrices are multiplied with the features after graph reasoning to map the features back to the original geometric space, and the final feature output is X_out,
where the semantic feature map mapped back to the original geometric space and the detail feature map mapped back to the original geometric space are combined by a feature splicing operation to form X_out.
8. The face parsing and emotion recognition method based on the multitasking collaborative network according to claim 7, wherein the step 2.4 is specifically implemented as follows:
for the last-layer features S = [s_1, s_2, ..., s_C] output by the encoder ResNet18, each s_i is regarded as an image patch input to a Transformer layer; the patches are fed into the Transformer layer, and the output features finally pass through an MLP layer to obtain the facial emotion recognition result.
9. The face analysis and emotion recognition method based on the multi-task cooperative network as set forth in claim 7, wherein the step 3 includes constructing an intra-task loss function, specifically implemented as follows:
3-1-1. The loss function of the segmentation branch mainly consists of the loss Seg_True of supervised face analysis and the loss Seg_Pre of the output face analysis; the cross entropy loss function is used, specifically as follows:
3-1-2. Constructing a loss function of the boundary sensing branch, and using a cross entropy loss function, wherein the loss function is specifically as follows:
3-1-3, constructing a loss function of facial emotion recognition, and using a cross entropy loss function, wherein the loss function is specifically as follows:
3-1-4. The total intra-task loss function is:
where λ_0, λ_1 and λ_2 are scaling factors.
10. The face parsing and emotion recognition method based on a multitasking collaborative network according to claim 8 or 9, wherein step 3 includes constructing a task-to-task consistency loss function, specifically implemented as follows:
3-2-1. First, keep the No. 0 channel of the multi-class boundary map unchanged and merge the remaining channels to obtain Seg_2-joint-3, which represents the face boundary reduced to two classes;
3-2-2, calculating a task consistency loss function between the classified boundary tasks and the multi-classified boundary tasks by using the dice coefficient;
3-2-3, calculating a consistency loss function among the two-classification boundary task, the multi-classification boundary task and the face analysis task;
(1) first, the index of the maximum value is taken along the second dimension of the face analysis result to obtain the analysis mask;
(2) a boundary localization algorithm is then used to assign 1 to the pixels located on the boundary of the analysis mask and 0 to the other non-boundary pixels, and the two resulting maps are multiplied together;
(3) the quantities required for the consistency terms are then calculated;
(4) the task consistency loss between the analysis task and the binary-classification boundary task and the consistency loss between the analysis task and the multi-classification boundary task are calculated respectively using the Dice coefficient, specifically as follows:
(5) The overall inter-task consistency loss function is:
CN202310204150.3A 2023-03-06 2023-03-06 Face analysis and emotion recognition method based on multitasking cooperative network Pending CN116563908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310204150.3A CN116563908A (en) 2023-03-06 2023-03-06 Face analysis and emotion recognition method based on multitasking cooperative network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310204150.3A CN116563908A (en) 2023-03-06 2023-03-06 Face analysis and emotion recognition method based on multitasking cooperative network

Publications (1)

Publication Number Publication Date
CN116563908A true CN116563908A (en) 2023-08-08

Family

ID=87492218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310204150.3A Pending CN116563908A (en) 2023-03-06 2023-03-06 Face analysis and emotion recognition method based on multitasking cooperative network

Country Status (1)

Country Link
CN (1) CN116563908A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274608A (en) * 2023-11-23 2023-12-22 太原科技大学 Remote sensing image semantic segmentation method based on space detail perception and attention guidance
CN117274608B (en) * 2023-11-23 2024-02-06 太原科技大学 Remote sensing image semantic segmentation method based on space detail perception and attention guidance


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination