LU503919B1 - Facial expression recognition method based on attention-modulated contextual spatial information

Facial expression recognition method based on attention-modulated contextual spatial information

Info

Publication number
LU503919B1
LU503919B1 (application LU503919A)
Authority
LU
Luxembourg
Prior art keywords
contextual
attention
convolution
facial expression
features
Prior art date
Application number
LU503919A
Other languages
French (fr)
Inventor
Datong Xu
Mingyan Cui
Huawei Tao
Yajun Fan
Xue Li
Chunhua Zhu
Weiliang Han
Xinying Guo
Shuguang Zou
Jing Yang
Hongliang Fu
Original Assignee
Univ Henan Technology
Priority date
Filing date
Publication date
Application filed by Univ Henan Technology filed Critical Univ Henan Technology
Application granted granted Critical
Publication of LU503919B1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial expression recognition method based on attention-modulated contextual spatial information. The specific steps are as follows: S1: Acquire a public data set of natural scene facial expression images to be trained, and pre-process the facial expression images; S2: Construct an attention-modulated contextual spatial information network model for facial expression recognition in natural scenes; S3: Train the attention-modulated contextual spatial information (ACSI) network model by using the pre-processed facial expression images; S4: Repeat the model training in step S3 until the set training times are reached, obtain the trained deep residual network model, and use the trained deep residual network model to recognize facial expressions. The combination of contextual convolution and coordinated attention can significantly improve the performance of facial expression recognition. Compared with similar algorithms, ACSI has higher recognition performance on public expression datasets.

Description

DESCRIPTION
FACIAL EXPRESSION RECOGNITION METHOD BASED ON
ATTENTION-MODULATED CONTEXTUAL SPATIAL INFORMATION
TECHNICAL FIELD
The invention relates to the technical field of automatic expression recognition. Specifically, it relates to an expression recognition algorithm, and in particular to a facial expression recognition method based on attention-modulated contextual spatial information.
BACKGROUND
Facial expression is rich in characteristic information, and facial expression recognition has been widely used in human-computer interaction, mental health assessment, and so on.
Traditional facial expression recognition methods can be divided into two categories. One is based on facial Action Units (AUs), which usually transforms the task of facial expression recognition (FER) into an AU detection task; an AU is a small but recognizable muscle action related to expression. However, it is difficult to detect local changes in the face using this method, and factors such as illumination or changes in posture will also reduce the performance of AU detection. The other is to realize facial expression recognition by artificially designing features to represent face images and training expression classifiers. However, in natural scenes, due to uncontrollable factors, the performance of facial expression recognition methods based on artificially designed features is limited. In recent years, deep learning-based facial expression recognition has become a research hotspot. Related work has shifted from controlled laboratory scenes to natural scenes, and some progress has been made. The convolutional neural network (CNN) is the main model for facial expression recognition, and CNN has strong generalization ability in this task. Since then, various improved methods have appeared. Among these methods, on the one hand, to solve the problem of incomplete expression features, Zhao et al. designed a symmetric structure to learn the multiscale features in the residual block and keep the facial expression information at the granularity level. Li et al. proposed a slide-patch (SP) scheme, which slides a window on each feature map to extract the global features of facial expressions. Fan et al. proposed a hierarchical scale network (HSNet) for facial expression recognition, in which an expansion starting block is added to enhance the kernel scale information extraction. Liang et al. adopted a two-branch network for facial expression recognition, in which one branch used a CNN to capture local edge information and the other branch used a visual Transformer to obtain a better global representation. Mao et al. proposed to use convolution kernels of different sizes to form pyramidal convolution units to extract expression features and improve the nonlinear expression ability of the model. However, the above methods improve the completeness of the extracted expression features by adding an auxiliary network layer or adopting a branching structure. On the other hand, to solve the problem of fuzzy classification boundaries between expression classes, Xie et al. proposed a module called the Salient Expressional Region Descriptor (SERD) to highlight the salient features related to expressions and improve the feature representation ability. Gera et al. proposed a new spatial-channel attention network (SCAN) to obtain the local and global attention of each channel and each spatial position, and to process the expression features in the spatial and channel dimensions instead of directly performing feature dimension reduction and compression. Wang et al. designed an attention branch using a U-Net-like architecture to highlight subtle local expression information. After extracting multiscale features, Song Yugin et al. used the CBAM attention mechanism to screen the expression features and improve the representation of effective expression features. The above methods extract more subtle deep facial expression features by adding a network auxiliary layer or using a branch structure, thus improving the performance of the model.
However, these methods ignore the potential contextual relationship between local regions of the human face, and the complex network structure is not conducive to a lightweight model.
Chinese patent document (application number: 202010537198.2) discloses a facial expression recognition method based on a deep residual network. Firstly, the multiscale features of the enlarged facial expression image are extracted through the deep residual network model, then the extracted features are reduced in dimension and compressed, and the processed features are used for expression classification. There are three defects in this method: (1) the standard convolution kernel with a fixed receptive field is used in the residual network, which cannot obtain a wide range of facial expression information; (2) redundant information is removed by the feature dimension reduction and compression scheme, and some important information related to expression is lost; (3) it performs well on laboratory-controlled data sets, but its recognition performance on uncontrolled data sets needs to be verified. The above points limit the completeness of the expression features extracted by this method, and the representation ability of the features needs to be improved.
Chinese patent document (application number: 202110133950.1) discloses a dynamic facial expression recognition method and system based on a representation stream embedding network.
A differentiable representation stream layer is embedded in a convolutional neural network to extract dynamic expression features from a video sequence, and spatial attention weights are used to weight the output features. This method has two defects: (1) only spatial attention is used and feature optimization is not carried out from the channel dimension; (2) it involves the collection and processing of video data, and the working steps are complicated, resulting in high operating cost.
The existing methods have the following shortcomings: 1) in the feature extraction stage, only the global or local features of facial expressions are considered, which limits the completeness of features; 2) in the feature processing stage, the feature is reduced in dimension and compressed, which leads to fuzzy classification boundaries between classes.
SUMMARY
The invention provides a facial expression recognition method based on attention-modulated contextual spatial information and proposes a new natural-scene facial expression recognition model called the attention-modulated contextual spatial information (ACSI) model, wherein contextual convolution is used to replace the standard convolutions in the residual network, and CoResNet18 and CoResNet50 are constructed to extract multiscale features and obtain more subtle expression information without increasing network complexity.
In CoResNet, coordinate attention is embedded in each residual block to pay attention to salient features, enhance the useful information related to expressions, and suppress redundant information in the input feature map, thereby effectively reducing the sensitivity of deep convolution to face occlusion and posture changes.
To solve the technical problems, the technical scheme adopted by the invention is that the facial expression recognition method based on attention-modulated contextual spatial information specifically comprises the following steps:
S1: Acquire a public data set of a natural scene facial expression image to be trained, and pre-process the facial expression image;
S2: Construct an attention-modulated contextual spatial information network model for facial expression recognition in natural scenes;
S3: Train the attention-modulated contextual spatial information (ACSI) network model by using the pre-processed facial expression image;
S4: Repeat the model training in step S3 until the set training times are reached, obtain the trained deep residual network model, and use the trained deep residual network model to recognize facial expression.
By adopting the technical scheme, a facial expression recognition model is constructed based on attention-modulated contextual spatial information. Firstly, convolution kernels with a low expansion rate are used to capture local contextual information; secondly, convolution kernels with a high expansion rate are used to merge global contextual information, so as to extract discriminative local features and related global features of faces and ensure the complementarity of expression feature information; finally, attention weights are assigned to the extracted features by using a coordinate attention mechanism. The differences in features between expression classes increase, and the ability to represent features is strengthened. Experiments on the AffectNet-7 and RAF-DB data sets verify the effectiveness of the ACSI model, and the proposed model has better recognition performance than similar models.
As the preferred technical scheme of the invention, Step S2 comprises the following specific steps:
S21: Replace the middle convolution layer of the residual block with a contextual convolution block to form a contextual convolution residual module and construct a contextual convolution residual network;
S22: Use the coordinated attention (CA) to construct a coordinated attention module, so as to assign attention weights to the multiscale features extracted from the CoResNet to strengthen the feature representation ability.
By adopting the technical scheme, firstly, the contextual convolution is used to replace the standard convolution in the convolution residual block, the contextual convolution residual network (CoResNet) is constructed as the feature extraction part, and convolution kernels with different expansion rates are used to capture local and merge global contextual information; secondly, the coordinated attention module is embedded in CoResNet as the feature processing part, and attention weights are assigned to the extracted features, which highlights the salient features and increases the feature differences between expression classes. Finally, the ACSI model is formed for facial expression recognition.
As the preferred technical scheme of the invention, Step S21 is specifically as follows:
S211: The contextual convolution block receives the input feature map $M^{in}$ and applies convolution kernels $D = \{d_1, d_2, d_3, \ldots, d_n\}$ with different expansion rates at different levels $L = \{1, 2, 3, \ldots, n\}$; that is, the convolution kernel on level i has an expansion rate $d_i, \forall i \in L$;
S212: At the different levels of contextual convolution, the contextual convolution outputs a plurality of feature maps $M^{out}$, and each map has a width $W^{out}$ and a height $H^{out}$ for all $i \in L$;
S213: Maintain the residual structure and combine the correlation between layers to obtain the contextual convolution residual module;
S214: Adjust the level of the contextual convolution block in each layer according to the size of the feature map to construct the contextual convolution residual network. The contextual convolution residual network (CoResNet) constructed in step S2 includes CoResNet18 and CoResNet50. In CoResNet18, each contextual residual module consists of a contextual convolution residual module and a 1*1 standard convolution layer; in CoResNet50, each contextual residual module consists of a contextual convolution residual module and two 1*1 standard convolution layers. The contextual convolution residual module is used for multiscale feature extraction, and the 1*1 standard convolution layer is used for channel transformation.
According to the size of the input feature map, contextual convolution blocks of different levels are used in each contextual residual module. In the first contextual convolution residual module, a contextual convolution block with level 4 (level=4) is used; in the second contextual convolution residual module, the level is equal to 3; in the third contextual convolution residual module, the level is equal to 2; and in the last contextual convolution residual module, the level is equal to 1.
Here, when level=n, the contextual convolution block contains convolution kernels with expansion rates $d_i = i, i = 1, \ldots, n-1, n$.
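For illustration, a minimal PyTorch sketch of such a contextual convolution block is given below. The class name CoConvBlock, the even split of output channels across levels, and the concatenation of the per-level outputs are assumptions made for readability; the description above only fixes the use of parallel 3x3 kernels with expansion rates $d_i = i$.

```python
import torch
import torch.nn as nn

class CoConvBlock(nn.Module):
    """Sketch of a contextual convolution (CoConv) block.

    One 3x3 kernel per level, with dilation rate d_i = i. Lower dilation rates
    capture local facial detail; higher dilation rates merge wider context.
    How the per-level outputs are combined is not fixed by the description;
    here the output channels are split evenly and the outputs are concatenated.
    """

    def __init__(self, in_channels: int, out_channels: int, level: int):
        super().__init__()
        assert out_channels % level == 0, "out_channels must split evenly across levels"
        per_level = out_channels // level
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, per_level, kernel_size=3,
                      padding=d, dilation=d, bias=False)   # padding=d keeps H x W unchanged
            for d in range(1, level + 1)                    # dilation rates 1..level
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every branch sees the same input; all outputs share the spatial size,
        # so they can be concatenated along the channel dimension.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: a level-4 block as used in the first network layer (illustrative shapes).
if __name__ == "__main__":
    block = CoConvBlock(in_channels=64, out_channels=64, level=4)
    y = block(torch.randn(1, 64, 56, 56))
    print(y.shape)  # torch.Size([1, 64, 56, 56])
```

Because each dilated kernel uses padding equal to its dilation rate, all branches keep the input spatial size, so the per-level feature maps can be merged directly.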
As the preferred technical scheme of the invention, in step S21, the learnable parameters of contextual convolution and the number of floating-point operations are calculated by formulas (1) and (2):
$params = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out}$ (1);
$FLOPs = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out} \cdot W^{out} \cdot H^{out}$ (2);
where $M^{in}$ and $M^{out}$ represent the numbers of input and output feature maps, $K^{w}$ and $K^{h}$ represent the width and height of the convolution kernel, and $W^{out}$ and $H^{out}$ represent the width and height of the output feature map.
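A small helper following formulas (1) and (2) makes the counts concrete; the function names and the example numbers below are illustrative assumptions, not values taken from the patent.

```python
def conv_params(m_in: int, k_w: int, k_h: int, m_out: int) -> int:
    """Formula (1): learnable weights of a (contextual) convolution layer."""
    return m_in * k_w * k_h * m_out

def conv_flops(m_in: int, k_w: int, k_h: int, m_out: int, w_out: int, h_out: int) -> int:
    """Formula (2): floating-point operations for one forward pass."""
    return m_in * k_w * k_h * m_out * w_out * h_out

# A 3x3 layer with 64 input and 64 output maps on a 56x56 output (illustrative values):
print(conv_params(64, 3, 3, 64))          # 36864
print(conv_flops(64, 3, 3, 64, 56, 56))   # 115605504
```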
As the preferred technical scheme of the invention, Step S22 is specifically as follows:
S221: Write the feature extracted by CoResNet as X. First, code each channel along the horizontal and vertical coordinate directions by using average pooling kernels of sizes (H, 1) and (1, W). The coded output $y_c^h(h)$ of the c-th channel at height h is calculated by formula (3):
$y_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$ (3);
The coded output $y_c^w(w)$ of the c-th channel at width w is calculated by formula (4):
$y_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$ (4);
S222: Perform feature aggregation on the two transformations in step S221 along the two spatial directions and return a pair of direction-aware attention maps;
S223: Concatenate the pair of direction-aware attention maps generated in step S222 and send them into a 1×1 convolution transformation function F:
$f = \delta(F([y^h, y^w]))$ (5);
where $[\cdot, \cdot]$ represents the concatenation operation along the spatial dimension, $\delta$ is a nonlinear sigmoid activation function, and $f \in \mathbb{R}^{C/r \times (H+W)}$ is the intermediate feature map encoding spatial information in the horizontal and vertical directions; to reduce the complexity of the model, an appropriate reduction rate r is adopted to reduce the number of channels of f;
S224: Decompose f into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$ along the spatial dimension, and use two 1×1 convolution transformations $F_h$ and $F_w$ to transform $f^h$ and $f^w$ into tensors with the same number of channels, according to formulas (6) and (7):
$m^h = \sigma(F_h(f^h))$ (6);
$m^w = \sigma(F_w(f^w))$ (7);
where $\sigma$ is the sigmoid function, and the outputs $m^h$ and $m^w$ are taken as attention weights; finally, the output Z of the coordinated attention module is shown in formula (8):
$z_c(i, j) = x_c(i, j) \times m_c^h(i) \times m_c^w(j)$ (8);
where $z_c(i, j)$ is the output, $x_c(i, j)$ is the input, and $m_c^h(i)$ and $m_c^w(j)$ are the attention weights.
This technical scheme pays attention to the salient features and enhances the feature differences between expression classes: the coordinated attention mechanism is adopted, and the coordinated attention (CA) module is embedded in the contextual convolution residual network for feature processing to enhance the expression-related information in the input feature map and suppress redundant information. By embedding coordinated attention in the network, the long-distance dependence between input features can be captured in one spatial direction while the position information of the expression-related face region is maintained in the other spatial direction; the obtained feature map is then encoded into a pair of direction-aware and position-sensitive attention maps, which are applied to the input feature map to enhance the subtle expression information. A CA module is added after each contextual convolution block and after CoResNet to screen key scale features and emphasize salient face regions, strengthening the feature representation ability and thus improving the recognition performance.
As the preferred technical scheme of the invention, Step S1 is specifically as follows: First, adjust the size of the input image to 256x256, then crop its upper, lower, left, right, and central parts to obtain five 224x224 face images with the same expression tag, and then horizontally flip them with a probability of 0.5.
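A possible torchvision sketch of this pre-processing is shown below; torchvision's FiveCrop takes the four corner crops plus the centre crop, which is assumed here to correspond to the five described regions, and stacking the five crops into one tensor (with the label repeated per crop) is an assumption about how the crops are batched.

```python
import torch
from torchvision import transforms

# Resize to 256x256, take four corner crops plus the centre crop at 224x224,
# flip each crop horizontally with probability 0.5, and convert the crops to tensors.
# All five crops keep the same expression label, so the label is repeated per crop.
flip = transforms.RandomHorizontalFlip(p=0.5)
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.FiveCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(flip(crop)) for crop in crops])),
])
```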
As the preferred technical scheme of the invention, Step S3 is specifically as follows:
S31: Carry out multiscale feature extraction and contextual spatial information integration on the input facial expression image through the contextual convolution residual network (CoResNet);
S32: Embed an attention module in each contextual convolution residual module to pay attention to the salient scale features, and use coordinated attention to weight the extracted features for CoResNet output features, so as to capture the correlation of expression information in two spatial directions and keep the key area information of the face;
S33: Down-sample the attention-weighted features and classify the down-sampled features.
As the preferred technical scheme of the invention, in step S3, the attention-modulated contextual spatial information network model (ACSI) includes a convolution layer, a bn layer, a relu layer, a maxpool layer, four contextual residual modules, a coordinated attention (CA) module, a global average pooling layer, an fc layer, and a softmax classification layer, which are sequentially connected. The convolution layer performs a 3*3 standard convolution operation on the input facial expression image to extract features; the bn layer normalizes the extracted features in batches to prevent the gradient from vanishing or exploding; then the relu layer performs nonlinear activation on them; the maxpool layer is used for feature dimension reduction; the four contextual convolution modules are used to extract multiscale face features from the reduced-dimension features; the coordinated attention (CA) module embedded in each contextual convolution module is used to pay attention to features at different scales; the CA module behind the CoResNet output feature layer performs attention weighting on the output features; the global average pooling layer and the fc layer perform down-sampling, and the down-sampled facial expression features are classified by the softmax classifier.
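The following PyTorch skeleton sketches this layer ordering only; the four contextual residual stages and the CA blocks are stood in for by nn.Identity placeholders, and the stem width and strides are assumptions, so this illustrates the data flow rather than the full ACSI implementation.

```python
import torch
import torch.nn as nn

class ACSISkeleton(nn.Module):
    """Layer ordering described above: conv -> bn -> relu -> maxpool -> four contextual
    residual stages (each with an embedded CA module) -> CA -> global average pooling ->
    fc -> softmax. The stages and CA modules are placeholders here."""

    def __init__(self, num_classes: int = 7, stem_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(3, stem_channels, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(stem_channels)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.stages = nn.ModuleList([nn.Identity() for _ in range(4)])  # contextual residual stages
        self.ca_out = nn.Identity()             # CA module after the CoResNet output feature layer
        self.avgpool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.fc = nn.Linear(stem_channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.maxpool(self.relu(self.bn(self.conv(x))))
        for stage in self.stages:
            x = stage(x)
        x = self.ca_out(x)
        x = self.avgpool(x).flatten(1)
        logits = self.fc(x)
        # Formula (9); during training, nn.CrossEntropyLoss is usually applied to the raw logits.
        return torch.softmax(logits, dim=1)
```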
As the preferred technical scheme of the invention, the input of the softmax classifier is an arbitrary real-number vector, and the output is a vector in which the value of each element lies in (0, 1) and the elements sum to 1; the calculation formula of softmax is formula (9):
$softmax(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$ (9);
where $x_i$ represents the i-th element, $softmax(x_i)$ represents the output value of the i-th element, and n represents the number of elements, that is, the number of classified categories. Through the softmax function, the multi-class output values are converted into a probability distribution in the range [0, 1] that sums to 1.
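As a quick numerical illustration of formula (9) (the input scores below are arbitrary):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Formula (9): exponentiate each element and normalise so the outputs sum to 1."""
    e = np.exp(x - x.max())          # subtracting the max improves numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659, 0.242, 0.099], sums to 1
```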
As the preferred technical scheme of the invention, in step S3, before training the attention-modulated contextual spatial information network model (ACSI) with the facial expression datasets, the large-scale facial expression data set MS-CELEB-1M (including 10 million face images of nearly 100,000 subjects), with more than 10 million samples, is used as a training set to pretrain ACSI; then the facial expression data sets AffectNet-7 and RAF-DB are respectively input into the pretrained ACSI model, the output value is obtained through forward propagation, and the loss value of the ACSI model is calculated from the output value (the predicted category probability) using a cross-entropy loss function. The calculation formula of the cross-entropy loss function is shown in formula (10):
$loss = -\frac{1}{N} \sum_{i=1}^{N} p(x_i) \log(q(x_i))$ (10);
where $p(x_i)$ refers to the real class probability and $q(x_i)$ is the predicted class probability of the model;
In step S4, according to the loss value of the ACSI model calculated by formula (10), the network weights are updated by backpropagation, and the training is repeated until the set training times are reached to obtain the trained attention-modulated contextual spatial information network model, namely the ACSI model.
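A minimal training-loop sketch of steps S3-S4 is given below; the function and variable names (train_acsi, train_loader) are assumptions, and torch.nn.CrossEntropyLoss is used as the cross-entropy of formula (10) applied to hard labels.

```python
import torch
import torch.nn as nn

def train_acsi(model: nn.Module, train_loader, epochs: int, device: str = "cuda") -> nn.Module:
    """Sketch of steps S3-S4: forward propagation, cross-entropy loss (formula (10)),
    backpropagation, and weight updates, repeated for the set number of training epochs."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()  # cross-entropy loss; expects raw class scores (logits)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)            # forward propagation
            loss = criterion(outputs, labels)  # loss value of the ACSI model
            optimizer.zero_grad()
            loss.backward()                    # backpropagation
            optimizer.step()                   # update the network weights
    return model
```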
Compared with the prior methods, the facial expression recognition method based on attention-modulated contextual spatial information has the following beneficial effects: (1) Contextual convolution blocks built from convolution kernels with different expansion rates replace some convolution layers in the residual network, and contextual spatial information of face images is accessed at multiple network layers to extract more robust multiscale expression features; at the same time, the parameters and calculation cost remain similar to those of standard convolution layers of the same size; (2) A new attention mechanism, namely coordinated attention, is used, which can capture the dependence relationship between discriminative local features in one spatial direction while keeping the accurate location information of key face regions along the other spatial direction, thus reducing the sensitivity of the deep network to occlusion and posture changes and strengthening the feature representation ability; (3) The effectiveness and reliability of the constructed model for facial expression recognition in uncontrolled environments are verified on two large-scale natural-environment facial expression image data sets.
BRIEF DESCRIPTION OF THE FIGURES
Fig. 1 is a flow chart of a facial expression recognition method based on attention-modulated contextual spatial information of the present invention;
Fig. 2 is a block diagram of an attention-modulated contextual spatial information network (ACSI) model in the facial expression recognition method based on attention-modulated contextual spatial information of the present invention;
Fig. 3 is a schematic diagram of a contextual convolution block in the facial expression recognition method based on attention-modulated contextual spatial information of the present invention;
Fig. 4 is a schematic structural diagram of a coordinated attention module in the facial expression recognition method based on attention-modulated contextual spatial information of the present invention;
Fig. 5 shows the t-SNE visualization results of the features extracted by the baseline method and the ACSI50 model on the AffectNet-7 dataset; (a) is a t-SNE visualization of the features extracted by the baseline method on the AffectNet-7 data set; (b) is a t-SNE visualization of the features extracted by the ACSI50 model on the AffectNet-7 data set;
Fig. 6 shows the t-SNE visualization results of features extracted by the baseline method and the ACSI50 model on RAF-DB in the facial expression recognition method based on attention-modulated contextual spatial information of the present invention; (a) is a t-SNE visualization of the features extracted by the baseline method on RAF-DB; (b) is a t-SNE visualization of the features extracted by the ACSI50 model on RAF-DB;
Fig. 7 is a schematic diagram of attention visualization results on example expression images in the RAF-DB dataset in the facial expression recognition method based on attention-modulated contextual spatial information of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
In the following, the technical scheme in the embodiment of the invention will be clearly and completely described with the figures attached in the embodiment of the invention.
Embodiment: As shown in Fig. 1, the facial expression recognition method based on attention-modulated contextual spatial information specifically comprises the following steps.
S1: Acquire a public data set of a natural scene facial expression image to be trained, and pre-process the facial expression image;
S1 is specifically as follows: First, adjust the size of the input image to 256x256, then crop its upper, lower, left, right, and central parts to obtain five 224x224 face images with the same expression tag, and then horizontally flip them with a probability of 0.5.
S2: Construct an attention-modulated contextual spatial information network model for facial expression recognition in natural scenes; firstly, the contextual convolution is used to replace the standard convolution in the convolution residual block, and the contextual convolution residual network (CoResNet) is constructed as the feature extraction part, and the convolution kernels with different expansion rates are used to capture local and merge global contextual information; secondly, the coordinated attention module is embedded in CoResNet as the feature processing part, and the attention weight is assigned to the extracted features, which highlights the salient features and increases the feature differences between expression classes.
Finally, the ACSI model is formed for facial expression recognition.
Step S2 comprises the following specific steps: LU503919
S21: Replace the middle convolution layer of the residual block with a contextual convolution block to form a contextual convolution residual module and construct a contextual convolution residual network. In the task of deep facial expression recognition, multiscale features are very important; they can capture richer local details while describing global semantic information. Contextual convolution blocks contain convolution kernels with different expansion rates, and multiscale features can be extracted through receptive fields of different sizes. In a CNN, standard convolution only uses a convolution kernel with a fixed receptive field, and its kernel size is usually 3x3, because increasing the kernel size will increase the parameter quantity and calculation time. The learnable parameters (weights) of standard convolution and the number of floating-point operations can be calculated by formulas (1) and (2). Like the standard convolution layer, all convolution kernels in the contextual convolution block are independent and allowed to be executed in parallel. Unlike the standard convolution layer, a contextual convolution of the same size can integrate contextual information while maintaining a similar number of parameters and calculation cost. Therefore, the contextual convolution block can be used as a direct substitute for the standard convolution layer to better complete feature extraction;
As shown in Fig. 2, step S21 is specifically as follows:
S211: The contextual convolution block receives the input feature map $M^{in}$ and applies convolution kernels $D = \{d_1, d_2, d_3, \ldots, d_n\}$ with different expansion rates at different levels $L = \{1, 2, 3, \ldots, n\}$; that is, the convolution kernel on level i has an expansion rate $d_i, \forall i \in L$. The expansion rate increases from level 1 to level n, so that more and more contextual information can be extracted. Among them, the convolution kernels with lower expansion rates are responsible for capturing the local details of the face from the input feature map, while the convolution kernels with higher expansion rates are responsible for merging the global contextual information, thus helping the whole facial expression recognition process.
S212: At the different levels of contextual convolution, the contextual convolution outputs a plurality of feature maps $M^{out}$, and each map has a width $W^{out}$ and a height $H^{out}$ for all $i \in L$;
S213: Maintain the residual structure and combine the correlation between layers to obtain the contextual convolution residual module;
S214: Adjust the level of the contextual convolution block in each layer according to the size of the feature map to construct the contextual convolution residual network. The contextual convolution residual network (CoResNet) constructed in step S2 includes CoResNet18 and CoResNet50. In CoResNet18, each contextual residual module consists of a contextual convolution residual module and a 1*1 standard convolution layer; in CoResNet50, each contextual residual module consists of a contextual convolution residual module and two 1*1 standard convolution layers. The contextual convolution residual module is used for multiscale feature extraction, and the 1*1 standard convolution layer is used for channel transformation.
According to the size of the input feature map, contextual convolution blocks of different levels are used in each contextual residual module. The schematic diagram of the contextual convolution block is shown in Fig. 3. In the first contextual convolution residual module, a contextual convolution block with level 4 (level=4) is used; in the second contextual convolution residual module, the level is equal to 3; in the third contextual convolution residual module, the level is equal to 2; and in the last contextual convolution residual module, the level is equal to 1. Here, when level=n, the contextual convolution block contains convolution kernels with expansion rates $d_i = i, i = 1, \ldots, n-1, n$. Different from previous work on network cascades, this technical scheme directly integrates the contextual convolution into the widely used residual network and improves the residual blocks in ResNet18 and ResNet50, respectively, to obtain the corresponding CoResNet18 and CoResNet50. CoResNet is mainly composed of four network layers, and each layer has contextual convolution residual blocks of a different level. Because the size of the feature map decreases as the network layer gets farther from the input, the level of the contextual convolution block in each layer is adjusted according to the size of the feature map. In the first layer, CoConv4 is adopted, that is, the contextual convolution block of level=4; in the second layer, it is CoConv3; in the third layer, it is CoConv2. However, since the resolution of the feature map in the last layer has been reduced to 7x7, it is no longer reasonable to use contextual convolution there, so only one standard convolution is used, which is also marked as CoConv1. The convolution parameters of the different CoConv levels are shown in Table 1.
Table 1 Convolution parameters of the contextual convolution residual block
level=4: 3x3, 16, d1=1; 3x3, 16, d2=2; 3x3, 16, d3=3; 3x3, 16, d4=4 (output size 56x56)
level=3: 3x3, 64, d1=1; 3x3, 32, d2=2; 3x3, 32, d3=3 (output size 28x28)
level=2: 3x3, 128, d2=2 (output size 14x14)
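The per-layer levels can be summarised in a small configuration sketch; the dictionary keys and the helper name below are illustrative assumptions, while the level and feature-map values follow the paragraph and Table 1 above.

```python
# Level of the contextual convolution block in each network layer, with the
# corresponding feature-map size reported for CoResNet.
COCONV_LEVELS = {
    "layer1": {"level": 4, "feature_size": 56},   # CoConv4
    "layer2": {"level": 3, "feature_size": 28},   # CoConv3
    "layer3": {"level": 2, "feature_size": 14},   # CoConv2
    "layer4": {"level": 1, "feature_size": 7},    # CoConv1 (plain 3x3 convolution)
}

def dilation_rates(level: int) -> list:
    """When level = n, the block holds kernels with dilation rates d_i = i for i = 1..n."""
    return list(range(1, level + 1))

print(dilation_rates(COCONV_LEVELS["layer1"]["level"]))  # [1, 2, 3, 4]
```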
As the preferred technical scheme of the invention, in step S21, the learnable parameters of contextual convolution and the number of floating-point operations are calculated by formulas (1) and (2):
$params = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out}$ (1);
$FLOPs = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out} \cdot W^{out} \cdot H^{out}$ (2);
where $M^{in}$ and $M^{out}$ represent the numbers of input and output feature maps, $K^{w}$ and $K^{h}$ represent the width and height of the convolution kernel, and $W^{out}$ and $H^{out}$ represent the width and height of the output feature map.
S22: Use the coordinated attention (CA) to construct a coordinated attention module (its structure is shown in Fig. 4) to assign attention weights to the multiscale features extracted from CoResNet and strengthen the feature representation ability;
And step S22 is specifically as follows:
S221: Write the feature extracted by CoResNet as X. First, code each channel along the horizontal and vertical coordinate directions by using average pooling kernels of sizes (H, 1) and (1, W). The coded output $y_c^h(h)$ of the c-th channel at height h is calculated by formula (3):
$y_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$ (3);
The coded output $y_c^w(w)$ of the c-th channel at width w is calculated by formula (4):
$y_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$ (4);
S222: Perform feature aggregation on the two transformations in step S221 along the two spatial directions and return a pair of direction-aware attention maps;
S223: Concatenate the pair of direction-aware attention maps generated in step S222 and send them into a 1×1 convolution transformation function F:
$f = \delta(F([y^h, y^w]))$ (5);
where $[\cdot, \cdot]$ represents the concatenation operation along the spatial dimension, $\delta$ is a nonlinear sigmoid activation function, and $f \in \mathbb{R}^{C/r \times (H+W)}$ is the intermediate feature map encoding spatial information in the horizontal and vertical directions; to reduce the complexity of the model, an appropriate reduction rate r is adopted to reduce the number of channels of f;
S224: Decompose f into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$ along the spatial dimension, and use two 1×1 convolution transformations $F_h$ and $F_w$ to transform $f^h$ and $f^w$ into tensors with the same number of channels, according to formulas (6) and (7):
$m^h = \sigma(F_h(f^h))$ (6);
$m^w = \sigma(F_w(f^w))$ (7);
where $\sigma$ is the sigmoid function, and the outputs $m^h$ and $m^w$ are taken as attention weights; finally, the output Z of the coordinated attention module is shown in formula (8):
$z_c(i, j) = x_c(i, j) \times m_c^h(i) \times m_c^w(j)$ (8);
where $z_c(i, j)$ is the output, $x_c(i, j)$ is the input, and $m_c^h(i)$ and $m_c^w(j)$ are the attention weights.
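A PyTorch sketch of the coordinated attention module following formulas (3)-(8) is given below. The batch normalization before the non-linearity and the default reduction rate are assumptions borrowed from common coordinate-attention implementations rather than details fixed by this description.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Sketch of the coordinated attention (CA) module, formulas (3)-(8)."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)                  # reduced channel count C/r
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))        # formula (3): average over the width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))        # formula (4): average over the height
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1) # shared 1x1 transformation F
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Sigmoid()                              # delta in formula (5)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h in formula (6)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w in formula (7)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        y_h = self.pool_h(x)                        # (n, c, h, 1)
        y_w = self.pool_w(x).permute(0, 1, 3, 2)    # (n, c, w, 1)
        f = self.act(self.bn1(self.conv1(torch.cat([y_h, y_w], dim=2))))  # formula (5)
        f_h, f_w = torch.split(f, [h, w], dim=2)    # decompose along the spatial dimension
        m_h = torch.sigmoid(self.conv_h(f_h))                       # formula (6): (n, c, h, 1)
        m_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))   # formula (7): (n, c, 1, w)
        return x * m_h * m_w                        # formula (8), broadcast over H and W

# Quick shape check with illustrative sizes.
if __name__ == "__main__":
    ca = CoordAttention(channels=64, reduction=16)
    print(ca(torch.randn(2, 64, 56, 56)).shape)  # torch.Size([2, 64, 56, 56])
```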
To pay attention to the salient features and enhance the feature differences between expression classes, the coordinated attention mechanism is adopted, and the coordinated attention (CA) module is embedded in the contextual convolution residual network for feature processing, so as to enhance the expression-related information in the input feature map and suppress redundant information. As shown in the figures, by embedding coordinated attention in the network, the long-distance dependence between input features can be captured in one spatial direction while the position information of the expression-related face region is maintained in the other spatial direction; the obtained feature map is then encoded into a pair of direction-aware and position-sensitive attention maps, which are applied to the input feature map to enhance the subtle expression information. A CA module is added after each contextual convolution block and after CoResNet to screen key scale features and emphasize salient face regions, strengthening the feature representation ability and thus improving the recognition performance.
S3: Train the attention-modulated contextual spatial information (ACSI) network by using the pre-processed facial expression images. In step S3, the contextual spatial information network model (ACSI) includes a convolution layer, a bn layer, a relu layer, a maxpool layer, four contextual residual modules, a coordinated attention (CA) module, a global average pooling layer, an fc layer, and a softmax classification layer, which are sequentially connected. The convolution layer performs a 3*3 standard convolution operation on the input facial expression image to extract features; the bn layer normalizes the extracted features in batches to prevent the gradient from vanishing or exploding; then the relu layer performs nonlinear activation on them; the maxpool layer is used for feature dimension reduction; the four contextual convolution modules are used to extract multiscale face features from the reduced-dimension features; the coordinated attention (CA) module embedded in each contextual convolution module is used to pay attention to features at different scales; the CA module behind the CoResNet output feature layer performs attention weighting on the output features; the global average pooling layer and the fc layer perform down-sampling, and the down-sampled facial expression features are classified by the softmax classifier.
Step S3 is specifically as follows:
S31: Carry out multiscale feature extraction and contextual spatial information integration on the input facial expression image through the contextual convolution residual network (CoResNet),
S32: Embed an attention module in each contextual convolution residual module to pay attention to the salient scale features, and use coordinated attention to weight the extracted features for CoResNet output features, so as to capture the correlation of expression information in two spatial directions and keep the key area information of the face;
S33: Down-sample the attention-weighted features and classify the down-sampled features with the softmax classifier.
The input of the softmax classifier is an arbitrary real-number vector, and the output is a vector in which the value of each element lies in (0,1) and the elements sum to 1; the calculation formula of softmax is formula (9):
$softmax(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$ (9);
where $x_i$ represents the i-th element, $softmax(x_i)$ represents the output value of the i-th element, and n represents the number of elements, that is, the number of classified categories. Through the softmax function, the multi-class output values are converted into a probability distribution in the range [0,1] that sums to 1.
In step S3, before training the attention-modulated contextual spatial information network model (ACSI) with the facial expression datasets, the large-scale facial expression data set MS-CELEB-1M (including 10 million face images of nearly 100,000 subjects), with more than 10 million samples, is used as a training set to pre-train ACSI. Then the facial expression data sets AffectNet-7 and RAF-DB are respectively input into the pre-trained ACSI model, the output value is obtained through forward propagation, and the loss value of the ACSI model is calculated from the output value (the predicted category probability) using a cross-entropy loss function. The calculation formula of the cross-entropy loss function is shown in formula (10):
$loss = -\frac{1}{N} \sum_{i=1}^{N} p(x_i) \log(q(x_i))$ (10);
where $p(x_i)$ refers to the real class probability and $q(x_i)$ is the predicted class probability of the model;
S4: Repeat the model training in step S3 until the set training times are reached, obtain the trained deep residual network model, and use the trained deep residual network model to recognize facial expression. In step S4, according to the loss value of the ACSI model calculated by formula (10), the network weights are updated by backpropagation, and the training is repeated until the set training times are reached to obtain the trained attention-modulated contextual spatial information network, namely ACSI model.
Specific Application Embodiment: With the above technical scheme, to verify the effectiveness of the ACSI model proposed in this paper, experiments are carried out on two public facial expression databases, AffectNet and RAF-DB, both of which provide face images in natural scenes. Among them, the AffectNet database is one of the largest databases in the research field of facial emotion computing, with about 440,000 face images, including AffectNet-7 and AffectNet-8 (adding the "contempt" category); the RAF-DB database includes 7 basic facial expressions and 12 compound facial expressions, with a total of about 30,000 face images. As shown in Table 2, the face images of the 7 basic facial expressions (happiness, surprise, sadness, anger, disgust, fear, and neutrality) in the AffectNet-7 and RAF-DB databases are used as training sets. Because the test set is not available, testing is performed on the corresponding validation set to evaluate the performance of the proposed model.
In the image pre-processing stage in step S1, first, adjust the size of the input image to 256 x 256, then crop its upper, lower, left, right, and central parts to obtain five 224 x 224 face images with the same expression tag, and then horizontally flip them with a probability of 0.5. The model is implemented in PyTorch and trained on an NVIDIA GeForce GTX 1650 GPU. During the training process, the SGD algorithm is used for optimization.
The momentum is set to 0.9, and the initial learning rate is 0.01. The learning rate is multiplied by 0.1 every 20 iterations, the total number of iterations is 60, and the batch size is 16.
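A sketch of this training configuration in PyTorch is shown below; the helper name and the data-loader settings other than the batch size are assumptions, and the 20-iteration decay is read as a step decay of the learning rate by a factor of 0.1.

```python
import torch
from torch.utils.data import DataLoader, Dataset

def make_training_setup(model: torch.nn.Module, train_dataset: Dataset):
    """Optimizer, learning-rate schedule, and data loader matching the stated settings:
    SGD with momentum 0.9, initial learning rate 0.01, learning rate scaled by 0.1
    every 20 epochs, batch size 16; training then runs for 60 epochs in total."""
    loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=4)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    return loader, optimizer, scheduler  # call scheduler.step() once per epoch
```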
Table 2 Detailed information on the experimental data sets, including expression categories, number of training samples, and number of test samples.
The experimental results of this facial expression recognition method based on attention-modulated contextual spatial information on the AffectNet-7 and RAF-DB validation sets are shown in Table 3, in which CoResNet18 and CoResNet50 (the baseline models in this article) are contextual convolution residual networks. CoResNet18_CA_a and CoResNet50_CA_a embed the coordinated attention module behind the feature output layers of CoResNet18 and CoResNet50, respectively. CoResNet18_CA_b and CoResNet50_CA_b embed the coordinated attention module in each contextual convolution residual block of the corresponding CoResNet;
Table 3 The recognition accuracy of the ACSI model on the AffectNet-7 and RAF-DB validation sets
As can be seen from Table 3, on the AffectNet-7 validation set, the facial expression recognition accuracy of ACSI18 is increased by 1.70% compared to CoResNet18, and by 1.36% and 1.30% compared to CoResNet18_CA_a and CoResNet18_CA_b, respectively. The accuracy of facial expression recognition of ACSI50 is increased by 2.03% compared with CoResNet50, and by 0.80% and 0.25% compared with CoResNet50_CA_a and CoResNet50_CA_b, respectively. On the RAF-DB validation set, the accuracy of facial expression recognition of ACSI18 is increased by 1.89% compared with CoResNet18, and by 1.23% and 1.14% compared with CoResNet18_CA_a and CoResNet18_CA_b, respectively. The accuracy of facial expression recognition of ACSI50 is increased by 1.79% compared with CoResNet50, and by 0.35% and 0.06% compared with CoResNet50_CA_a and CoResNet50_CA_b, respectively. The above experimental results show the effectiveness and generalization of the proposed algorithm.
To further illustrate the effectiveness of the contextual spatial information (ACSI) network model constructed in the facial expression recognition method based on attention-modulated contextual spatial information, the performance of the constructed contextual spatial information (ACSI) network model is compared with other similar models in recent years on data sets
AffectNet-7 and RAF-DB, as shown in Table 4 and Table 5. As can be seen from Table 4, on AffectNet-7 the accuracy of ACSI50 increases by 1.61% compared to FMPN, by 0.97% compared to OADN, by 0.75% compared with Ensemble CNN, and by 0.52% compared to the DDA-Loss method. As can be seen from Table 5, on RAF-DB the ACSI50 proposed in this paper is increased by 2.5% compared with FSN, by 0.91% compared with CNN, by 0.76% compared with DLP-CNN, and by 0.33% compared with pACNN.
The results show that the recognition accuracy of the model proposed in this paper has been improved on AffectNet-7 and RAF-DB, and it is competitive compared to similar models.
Because these models cannot solve the problem of limited feature completeness or fuzzy classification boundaries between classes, the recognition performance is low. The model proposed in this paper can extract multiscale facial expression features by using contextual convolution. Embedding the coordinated attention module in the network can make the network pay attention to more discriminating expression features, and the correlation between layers can be better combined through the residual structure, which finally improves the recognition performance.
Table 4 Comparison of performance of models on AffectNet-7
Table 5 Comparison of performance of models on RAF-DB
To prove the inter-class differences of the expression features extracted by the ACSI model, in this section t-SNE visualization is performed on the features extracted by the ACSI model on the AffectNet-7 and RAF-DB validation sets, and the results are shown in Figs. 5 and 6. Both figures show the 7 basic facial expression classes, namely anger, disgust, fear, happiness, sadness, surprise, and neutral. As can be seen from the figures, compared to the baseline model, the features extracted by the ACSI50 model show relative dispersion between classes and relative aggregation within classes.
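A sketch of how such a t-SNE plot can be produced with scikit-learn is given below; the function name and plotting choices are assumptions, and `features`/`labels` stand for the extracted deep expression features and their class labels.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """Project extracted expression features (N, D) to 2-D with t-SNE and colour the
    points by expression class (N,), as done for Figs. 5 and 6."""
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 5))
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.colorbar(label="expression class")
    plt.savefig(title.replace(" ", "_") + ".png", dpi=200)
```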
To further study the function of the attention module in the model, the class activation map (CAM) method is used to visualize the attention maps generated by the attention mechanism used in this paper.
The class activation map method is used to visualize the activated parts for different expressions and map the weights of the output layer onto the convolution feature map to identify the importance of different regions of the face image. Specifically, the facial activation region of the proposed ACSI network is visualized through CAM to obtain the attention map. To display the attention region on the original image, the attention map is generally adjusted to the same size as the input image and visualized on the original image through COLORMAP_JET color mapping. When this technical scheme is used, the specific steps are as follows: first, the visual attention map is adjusted to the same size as the input image, and then the attention map is visualized on the original image through color mapping. Fig. 7 shows the attention maps of different expression images in RAF-DB. There are 7 columns in the diagram, and each column shows one of the seven expressions; from left to right, they are anger, disgust, fear, happiness, sadness, surprise, and neutral. It can be clearly seen from Fig. 7 that the attention module used in this paper makes the network focus on the more discriminative face regions in the presence of occlusion and posture changes. The results show that the combination of contextual convolution and coordinated attention can significantly improve the performance of facial expression recognition. Compared with similar algorithms, ACSI has higher recognition performance on public expression datasets.
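A sketch of the described overlay step with OpenCV is shown below; the function name and the blending weight are assumptions, while the resizing and COLORMAP_JET mapping follow the description above.

```python
import cv2
import numpy as np

def overlay_attention(image_bgr: np.ndarray, attention_map: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Resize an attention/class-activation map to the input image size, map it through
    COLORMAP_JET, and blend it with the original image, as described for Fig. 7.
    `attention_map` is a 2-D array of non-negative activations."""
    h, w = image_bgr.shape[:2]
    att = cv2.resize(attention_map.astype(np.float32), (w, h))
    att = (att - att.min()) / (att.max() - att.min() + 1e-8)          # normalise to [0, 1]
    heatmap = cv2.applyColorMap(np.uint8(255 * att), cv2.COLORMAP_JET)
    return cv2.addWeighted(heatmap, alpha, image_bgr, 1 - alpha, 0)
```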
The above is only the preferred embodiment of the invention, and it is not used to limit the invention. Any modification, equivalent substitution, improvement, etc. made within the spirit and principle of the invention should be included in the protection scope of the invention.

Claims (10)

CLAIMS
1. A facial expression recognition method based on attention-modulated contextual spatial information, wherein it comprises the following specific steps: S1: acquire a public data set of natural scene facial expression images to be trained, and pre-process the facial expression images; S2: construct an attention-modulated contextual spatial information network model ACSI for facial expression recognition in natural scenes; S3: train the contextual spatial information network model ACSI by using the pre-processed facial expression images; S4: repeat the model training in step S3 until the set training times are reached, obtain the trained deep residual network model, and use the trained deep residual network model to recognize facial expressions.
2. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 1, wherein step S2 comprises the following specific steps: S21: replace the middle convolution layer of the residual block with a contextual convolution block to form a contextual convolution residual module and construct a contextual convolution residual network; S22: use the coordinate attention to construct a coordinate attention CA module, to assign attention weights to the multiscale features extracted from the contextual convolution residual network CoResNet constructed in step S21 to strengthen the feature representation ability.
3. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 2, wherein step S21 is specifically as follows: S211: the contextual convolution block receives the input feature map $M^{in}$ and applies convolution kernels $D = \{d_1, d_2, d_3, \ldots, d_n\}$ with different expansion rates at different levels $L = \{1, 2, 3, \ldots, n\}$, that is, the convolution kernel on level i has an expansion rate $d_i, \forall i \in L$; S212: at the different levels of contextual convolution, the contextual convolution outputs a plurality of feature maps $M^{out}$, and each map has a width $W^{out}$ and a height $H^{out}$ for all $i \in L$; S213: maintain the residual structure and combine the correlation between layers to obtain the contextual convolution residual module; S214: adjust the level of the contextual convolution block in each layer according to the size of the feature map to construct the contextual convolution residual network.
4. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 3, wherein, in step S21, the learnable parameters of contextual convolution and the number of floating-point operations are calculated by formulas (1) and (2): $params = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out}$ (1); $FLOPs = M^{in} \cdot K^{w} \cdot K^{h} \cdot M^{out} \cdot W^{out} \cdot H^{out}$ (2); where $M^{in}$ and $M^{out}$ represent the numbers of input and output feature maps, $K^{w}$ and $K^{h}$ represent the width and height of the convolution kernel, and $W^{out}$ and $H^{out}$ represent the width and height of the output feature map.
5. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 2, wherein step S22 is specifically as follows: S221: write the feature extracted by CoResNet as X; first, code each channel along the horizontal and vertical coordinate directions by using average pooling kernels of sizes (H, 1) and (1, W); the coded output $y_c^h(h)$ of the c-th channel at height h is calculated by formula (3): $y_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$ (3); formula (3) calculates the coded output of the c-th channel when the height in the horizontal coordinate direction is h, summing the input features along the width i; the coded output $y_c^w(w)$ of the c-th channel at width w is calculated by formula (4): $y_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$ (4); formula (4) calculates the coded output of the c-th channel when the width in the vertical coordinate direction is w, summing the input features along the height j; S222: perform feature aggregation on the two transformations in step S221 along the two spatial directions, and return a pair of direction-aware attention maps; S223: concatenate the pair of direction-aware attention maps generated in step S222 and send them into a 1×1 convolution transformation function F: $f = \delta(F([y^h, y^w]))$ (5); where $[\cdot, \cdot]$ represents the concatenation operation along the spatial dimension, $\delta$ is a nonlinear sigmoid activation function, and $f \in \mathbb{R}^{C/r \times (H+W)}$ is the intermediate feature map encoding spatial information in the horizontal and vertical directions; S224: decompose f into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$ along the spatial dimension, and use two 1×1 convolution transformations $F_h$ and $F_w$ to transform $f^h$ and $f^w$ into tensors with the same number of channels, according to formulas (6) and (7): $m^h = \sigma(F_h(f^h))$ (6); $m^w = \sigma(F_w(f^w))$ (7); where $\sigma$ is the sigmoid function, and the outputs $m^h$ and $m^w$ are taken as attention weights; finally, the output Z of the coordinated attention module is shown in formula (8): $z_c(i, j) = x_c(i, j) \times m_c^h(i) \times m_c^w(j)$ (8); where $z_c(i, j)$ is the output, $x_c(i, j)$ is the input, and $m_c^h(i)$ and $m_c^w(j)$ are the attention weights.
6. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 2, wherein step S1 is specifically as follows: first, adjust the size of the input image to 256 x 256, then crop its upper, lower, left, right, and central parts to obtain five 224 x 224 face images with the same expression tag, and then horizontally flip them with a probability of 0.5.
7. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 2, wherein step S3 is specifically as follows: S31: carry out multiscale feature extraction and contextual spatial information integration on the input facial expression image through the contextual convolution residual network CoResNet; S32: embed an attention module in each contextual convolution residual module to pay attention to the salient scale features, and use coordinate attention to weight the extracted features for the CoResNet output features to capture the correlation of expression information in two spatial directions and keep the key area information of the face; S33: down-sample the attention-weighted features and classify the down-sampled features.
8. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 7, wherein, in step S3, the attention-modulated contextual spatial information (ACSI) network model comprises a convolution layer, a BN layer, a ReLU layer, a max-pooling layer, four contextual convolution residual modules, a coordinate attention (CA) module, a global average pooling layer, an FC layer, and a softmax classification layer which are sequentially connected; the convolution layer performs a 3×3 standard convolution on the input facial expression image to extract features; the BN layer batch-normalizes the extracted features to prevent the gradient from vanishing or exploding; the ReLU layer then applies nonlinear activation; the max-pooling layer performs feature dimension reduction; the four contextual convolution residual modules extract multiscale facial features from the reduced-dimension features; the CA module embedded in each contextual convolution module attends to features at different scales; the CA module behind the CoResNet output feature layer performs attention weighting on the output features; the global average pooling layer and the FC layer downsample the attention-weighted features, and the downsampled facial expression features are classified by the softmax classifier.
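A minimal end-to-end sketch of the layer sequence recited in claim 8, reusing the two module sketches above; the channel width, the use of a single block per stage, and num_classes=7 are assumptions, not the patented configuration.

```python
import torch.nn as nn

class ACSI(nn.Module):
    """Sketch of the ACSI layer sequence of claim 8: conv -> BN -> ReLU ->
    max-pool -> four contextual residual modules -> CA -> global average
    pooling -> FC -> softmax."""
    def __init__(self, num_classes=7, channels=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1, bias=False),  # 3x3 convolution layer
            nn.BatchNorm2d(channels),                                      # BN layer
            nn.ReLU(inplace=True),                                         # ReLU layer
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),              # max-pooling layer
        )
        self.stages = nn.Sequential(
            *[ContextualResidualBlock(channels) for _ in range(4)]         # four contextual modules
        )
        self.ca = CoordinateAttention(channels)     # CA module on the CoResNet output features
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling layer
        self.fc = nn.Linear(channels, num_classes)  # FC layer

    def forward(self, x):
        x = self.stem(x)
        x = self.stages(x)
        x = self.ca(x)
        x = self.gap(x).flatten(1)
        return self.fc(x).softmax(dim=1)            # softmax classification layer
```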
9. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 8, wherein, the input of the softmax classifier is an arbitrary real-valued vector and the output is a vector in which the value of each element lies in (0, 1) and the elements sum to 1; the calculation formula of softmax is formula (9):
softmax(x_i) = e^{x_i} / Σ_{j=1}^{I} e^{x_j} (9);
where x_i represents the i-th element, softmax(x_i) represents the output value of the i-th element, and I represents the number of elements, that is, the number of classification categories; through the softmax function, the multi-class output values are converted into a probability distribution with range [0, 1] and sum 1.
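Formula (9) in a few lines of NumPy; the logits are made-up values for illustration, and the max-subtraction is a standard numerical-stability step not recited in the claim.

```python
import numpy as np

def softmax(x):
    """Formula (9): exponentiate each element and normalize so the outputs sum to 1."""
    e = np.exp(x - np.max(x))   # max subtraction only for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0, 0.0, 1.2, -0.3, 0.8])  # illustrative 7-class scores
probs = softmax(logits)
print(probs.round(3), probs.sum())  # each element in (0, 1); the sum is 1.0
```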
10. The facial expression recognition method based on attention-modulated contextual spatial information as claimed in claim 8, wherein, in step S3, before training the attention-modulated contextual spatial information network with the facial expression datasets, the large-scale face dataset MS-CELEB-1M, containing more than 10 million images, is used as a training set to pre-train ACSI; then the facial expression datasets AffectNet-7 and RAF-DB are respectively input into the pre-trained ACSI model, the output values are obtained through forward propagation, and the loss value of the ACSI model is calculated from the output values by using the cross-entropy loss function; the calculation formula of the cross-entropy loss function is shown in formula (10):
loss = -(1/N) · Σ_{i=1}^{N} p(x_i) · log(q(x_i)) (10);
where p(x_i) refers to the real class probability and q(x_i) is the predicted class probability of the model; in step S4, according to the loss value of the ACSI model calculated by formula (10), the network weights are updated by back propagation, and the training is repeated until the set number of training iterations is reached, to obtain the trained attention-modulated contextual spatial information network, namely the ACSI model.
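A hedged training-loop sketch for claim 10 and step S4: forward propagation, cross-entropy loss per formula (10), back propagation, and weight updates repeated for a set number of epochs. The Adam optimizer, learning rate, epoch counts, and loader names are assumptions; NLLLoss on log-probabilities is used because the ACSI sketch above already ends in a softmax layer.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs, lr=1e-3, device="cpu"):
    """Sketch of the training procedure of claim 10 / step S4."""
    model.to(device)
    criterion = nn.NLLLoss()   # on log-probabilities this matches formula (10) for one-hot p(x)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                      # repeat until the set training times are reached
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            probs = model(images)                # forward propagation -> predicted q(x)
            loss = criterion(torch.log(probs + 1e-12), labels)   # formula (10)
            optimizer.zero_grad()
            loss.backward()                      # back propagation
            optimizer.step()                     # update the network weights
    return model

# Usage sketch (pretrain_loader and affectnet_loader are hypothetical DataLoaders):
# model = train(ACSI(num_classes=7), pretrain_loader, epochs=10)   # pre-training
# model = train(model, affectnet_loader, epochs=40)                # fine-tuning
```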
LU503919A 2022-03-29 2023-02-01 Facial expression recognition method based on attention-modulated contextual spatial information LU503919B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210317680.4A CN114758383A (en) 2022-03-29 2022-03-29 Expression recognition method based on attention modulation context spatial information

Publications (1)

Publication Number Publication Date
LU503919B1 true LU503919B1 (en) 2023-10-06

Family

ID=82326864

Family Applications (1)

Application Number Title Priority Date Filing Date
LU503919A LU503919B1 (en) 2022-03-29 2023-02-01 Facial expression recognition method based on attention-modulated contextual spatial information

Country Status (3)

Country Link
CN (1) CN114758383A (en)
LU (1) LU503919B1 (en)
WO (1) WO2023185243A1 (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758383A (en) * 2022-03-29 2022-07-15 河南工业大学 Expression recognition method based on attention modulation context spatial information
CN116311105B (en) * 2023-05-15 2023-09-19 山东交通学院 Vehicle re-identification method based on inter-sample context guidance network
CN116758621B (en) * 2023-08-21 2023-12-05 宁波沃尔斯软件有限公司 Self-attention mechanism-based face expression depth convolution identification method for shielding people
CN117041601B (en) * 2023-10-09 2024-01-12 海克斯康制造智能技术(青岛)有限公司 Image processing method based on ISP neural network model
CN117055740A (en) * 2023-10-13 2023-11-14 福建省东南造物科技有限公司 Digital screen glasses adopting air non-inductive interaction technology and application method of digital screen glasses
CN117523267A (en) * 2023-10-26 2024-02-06 北京新数科技有限公司 Small target detection system and method based on improved YOLOv5
CN117437519B (en) * 2023-11-06 2024-04-12 北京市智慧水务发展研究院 Water level identification method and device for water-free ruler
CN117496243B (en) * 2023-11-06 2024-05-31 南宁师范大学 Small sample classification method and system based on contrast learning
CN117197727B (en) * 2023-11-07 2024-02-02 浙江大学 Global space-time feature learning-based behavior detection method and system
CN117235604A (en) * 2023-11-09 2023-12-15 江苏云幕智造科技有限公司 Deep learning-based humanoid robot emotion recognition and facial expression generation method
CN117252488B (en) * 2023-11-16 2024-02-09 国网吉林省电力有限公司经济技术研究院 Industrial cluster energy efficiency optimization method and system based on big data
CN117649579A (en) * 2023-11-20 2024-03-05 南京工业大学 Multi-mode fusion ground stain recognition method and system based on attention mechanism
CN117612024B (en) * 2023-11-23 2024-06-07 国网江苏省电力有限公司扬州供电分公司 Remote sensing image roof recognition method based on multi-scale attention
CN117671357B (en) * 2023-12-01 2024-07-05 广东技术师范大学 Pyramid algorithm-based prostate cancer ultrasonic video classification method and system
CN117423020B (en) * 2023-12-19 2024-02-27 临沂大学 Dynamic characteristic and context enhancement method for detecting small target of unmanned aerial vehicle
CN117746503B (en) * 2023-12-20 2024-07-09 大湾区大学(筹) Face action unit detection method, electronic equipment and storage medium
CN117576765B (en) * 2024-01-15 2024-03-29 华中科技大学 Facial action unit detection model construction method based on layered feature alignment
CN117668669B (en) * 2024-02-01 2024-04-19 齐鲁工业大学(山东省科学院) Pipeline safety monitoring method and system based on improvement YOLOv (YOLOv)
CN117676149B (en) * 2024-02-02 2024-05-17 中国科学技术大学 Image compression method based on frequency domain decomposition
CN117809318B (en) * 2024-03-01 2024-05-28 微山同在电子信息科技有限公司 Oracle identification method and system based on machine vision
CN117894058B (en) * 2024-03-14 2024-05-24 山东远桥信息科技有限公司 Smart city camera face recognition method based on attention enhancement
CN117893975B (en) * 2024-03-18 2024-05-28 南京邮电大学 Multi-precision residual error quantization method in power monitoring and identification scene
CN117912086B (en) * 2024-03-19 2024-05-31 中国科学技术大学 Face recognition method, system, equipment and medium based on broadcast-cut effect driving
CN117935060B (en) * 2024-03-21 2024-05-28 成都信息工程大学 Flood area detection method based on deep learning
CN117934338B (en) * 2024-03-22 2024-07-09 四川轻化工大学 Image restoration method and system
CN118015687B (en) * 2024-04-10 2024-06-25 齐鲁工业大学(山东省科学院) Improved expression recognition method and device for multi-scale attention residual relation perception
CN118135496A (en) * 2024-05-06 2024-06-04 武汉纺织大学 Classroom behavior identification method based on double-flow convolutional neural network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6788264B2 (en) * 2016-09-29 2020-11-25 国立大学法人神戸大学 Facial expression recognition method, facial expression recognition device, computer program and advertisement management system
CN111325108B (en) * 2020-01-22 2023-05-26 中能国际高新科技研究院有限公司 Multitasking network model, using method, device and storage medium
CN111797683A (en) * 2020-05-21 2020-10-20 台州学院 Video expression recognition method based on depth residual error attention network
CN113627376B (en) * 2021-08-18 2024-02-09 北京工业大学 Facial expression recognition method based on multi-scale dense connection depth separable network
CN114758383A (en) * 2022-03-29 2022-07-15 河南工业大学 Expression recognition method based on attention modulation context spatial information

Also Published As

Publication number Publication date
WO2023185243A1 (en) 2023-10-05
CN114758383A (en) 2022-07-15

Similar Documents

Publication Publication Date Title
LU503919B1 (en) Facial expression recognition method based on attention-modulated contextual spatial information
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
CN112784798A (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN108830157A (en) Human bodys&#39; response method based on attention mechanism and 3D convolutional neural networks
Wang et al. NAS-guided lightweight multiscale attention fusion network for hyperspectral image classification
CN113239784B (en) Pedestrian re-identification system and method based on space sequence feature learning
CN110378208B (en) Behavior identification method based on deep residual error network
CN105913053B (en) A kind of facial expression recognizing method for singly drilling multiple features based on sparse fusion
CN113989890A (en) Face expression recognition method based on multi-channel fusion and lightweight neural network
CN111460980A (en) Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
Ming et al. 3D-TDC: A 3D temporal dilation convolution framework for video action recognition
CN115171052B (en) Crowded crowd attitude estimation method based on high-resolution context network
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Tur et al. Evaluation of hidden markov models using deep cnn features in isolated sign recognition
CN115546500A (en) Infrared image small target detection method
CN115965864A (en) Lightweight attention mechanism network for crop disease identification
Liu et al. Lightweight ViT model for micro-expression recognition enhanced by transfer learning
Yawalkar et al. Automatic handwritten character recognition of Devanagari language: a hybrid training algorithm for neural network
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
Zhang et al. A novel CapsNet neural network based on MobileNetV2 structure for robot image classification
CN112348007B (en) Optical character recognition method based on neural network
CN111898479B (en) Mask wearing recognition method and device based on full convolution single-step target detection algorithm
Yu et al. Multimodal co-attention mechanism for one-stage visual grounding
CN115223220B (en) Face detection method based on key point regression
CN110569928A (en) Micro Doppler radar human body action classification method of convolutional neural network

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20231006