LU102496B1 - Facial expression recognition method based on attention mechanism - Google Patents

Facial expression recognition method based on attention mechanism

Info

Publication number
LU102496B1
Authority
LU
Luxembourg
Prior art keywords
attention
feature map
self
channel
facial expression
Prior art date
Application number
LU102496A
Other languages
German (de)
Inventor
Daihong Jiang
Yuanzheng Hu
Zhongdong Huang
Original Assignee
Xuzhou Inst Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xuzhou Inst Technology
Application granted
Publication of LU102496B1

Classifications

    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174: Facial expression recognition
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/045: Combinations of networks
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06N 3/08: Learning methods

Abstract

A facial expression recognition method based on an attention mechanism, applicable to the field of image recognition. First, a facial expression recognition model is built, and a converged facial expression prediction result is obtained in an end-to-end manner. A self-attention mechanism and a channel attention mechanism are added on the basis of a residual network to improve sensitivity to useful information in an input image and to suppress useless information. Self-attention computes a weighted average over all position pixels of the input facial expression feature map to measure the relative importance of key positions in the map; the self-attention and channel attention mechanisms are then merged to strengthen the ability of the facial expression recognition model to extract these key positions as globally important features, and finally the optimal recognition result is output.

Description

FACIAL EXPRESSION RECOGNITION METHOD BASED ON ATTENTION MECHANISM
TECHNICAL FIELD The present invention relates to a facial expression recognition method based on an attention mechanism, and is particularly applicable to rapid facial expression recognition.
BACKGROUND In daily human communication, a facial expression represents a person's current emotional state, can often convey more accurate information than language, and plays an indispensable role in human emotional communication. In the 1970s, the psychologists Ekman and Friesen defined six basic emotions: happy, angry, surprise, fear, disgust and sad. Later, contempt was added, and these seven emotions became the basis for research on expression recognition.
As a research direction in the field of computer vision, facial expression recognition is closely related to face detection and recognition. It is gradually being applied to daily life in an array of applications, such as driver fatigue detection, criminal investigation, entertainment and other fields. At present, research on facial expression recognition is mainly divided into two directions: artificial feature extraction based on traditional methods, and deep learning. Andrew et al. combined PCA and LDA to classify expressions. Feng et al. proposed an expression recognition method based on a combination of LBP features and SVM classification, and made corresponding improvements to the model for the low-resolution situations that may occur in practical applications. Metaxas et al. proposed a multi-task sparse learning method based on LBP features, which converts the expression recognition problem into a multi-task sparse learning problem and achieves better results on multiple data sets.
Since 2013, deep learning has gradually been applied to facial expression recognition. Matsugu et al. used Convolutional Neural Networks (CNN) to address translation, rotation and scale invariance of expression pictures. Bo Sun et al. learned facial expression features through a region-based CNN. Yao et al. proposed HoloNet, a network model designed specifically for expression recognition, which uses CReLU instead of ReLU and combines a residual module with CReLU to build the intermediate layers, producing good results. Zhao et al. designed a feature extraction network by adding a feature selection mechanism to AlexNet. Cai et al. proposed a new loss function that maximizes the between-class distance while optimizing the within-class distance of the expressions, so that the network can learn more discriminative features. Jun He et al. used an improved deep residual network to increase network depth and, at the same time, introduced transfer learning to address the problem that current expression recognition data sets are too small, achieving an accuracy rate of 91.33% on the CK+ data set. One cited document uses a pairwise random forest method to handle the problem of face pose variation in facial expression recognition.
The above deep-learning-based expression recognition methods show that convolutional neural networks can achieve better recognition results. However, the convolution operation on which they depend is a spatially local operation, and dependency between long-range features can be captured only by repeatedly stacking convolutional layers, which is too inefficient. It is therefore highly desirable to design a reasonable model structure that prevents gradient disappearance caused by a large number of network layers.
SUMMARY OF INVENTION Inventive Purpose: with respect to the shortcomings of the above techniques, a facial expression recognition method based on an attention mechanism is provided, having a simple structure, high recognition efficiency and high recognition accuracy.
In order to achieve the above technical purpose, the facial expression recognition method based on an attention mechanism of the present invention first builds a facial expression recognition model whose structure, following the image input sequence, is: a convolution module, a maximum pooling module, an attention residual module, a maximum pooling module, an attention residual module, a maximum pooling module, two fully connected layers, and a softmax function, wherein a converged facial expression prediction result is obtained in an end-to-end manner. The attention residual module is a module in which a self-attention module is introduced on the basis of a residual network; it calculates the relative importance of key positions in a facial expression feature map by computing a weighted average over all position pixels of the input facial expression feature map, where the key positions are positions that are important for recognizing the expression in the feature map, such as the mouth and eyes. Channel attention is then introduced to learn different features in the channel domain, so as to learn interaction features across channels; a channel of the feature map can thus detect a target and be located to the key positions of the feature map, thereby improving robustness. Finally, the self-attention mechanism and the channel attention mechanism are merged to strengthen the ability of the facial expression recognition model to extract the key positions in the facial expression feature map as globally important features, and the repeated maximum pooling and attention residual modules are used to reduce error and output an optimal recognition result through end-to-end learning. In the building process of the facial expression recognition model, the self-attention mechanism is introduced on the basis of the residual network y = F(x, {W_i}) + x, wherein x and y respectively denote the input and output information of the residual network, and F(x, {W_i}) denotes the residual mapping.
The self-attention module uses non-local operations to attend to all signals related to the current expression, and obtains a relevance weight representing the relevance between every other position and the current position being calculated, which is defined as follows:
y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j)

wherein i denotes any position in the output feature map, j is an index over all possible positions in the feature map, x is the input feature map, y is the output feature map (its pixel values have changed, but its dimension is the same as that of the input feature map), f is a function calculating the relevance between any two points, g is a one-variable function for information transformation, and C(x) is a normalization function. Since f and g are both general formulas, a specific form needs to be chosen in conjunction with the neural network. First, g is a one-variable output, implemented by a 1×1 convolution: g(x_j) = W_g x_j. For f, any two points are mapped into an embedding space, and the calculation formula is f(x_i, x_j) = e^{θ(x_i)^T φ(x_j)}, wherein θ(x_i) = W_θ x_i, φ(x_j) = W_φ x_j, and the normalization parameter C(x) = Σ_{∀j} f(x_i, x_j). For a given position i, (1/C(x)) f(x_i, x_j) becomes a softmax over all positions j, and the obtained output of the self-attention layer is: y = softmax(x^T W_θ^T W_φ x) g(x).
Supposing the input feature map of the self-attention network is F^{H×W×C}, it is transformed into two embedding spaces by the two convolution weights W_θ and W_φ to obtain F^{H×W×C'} and F^{H×W×C'}, generally with C' < C; the purpose here is to reduce the number of channels and thus the amount of calculation. Second, a reshape operation turns each output feature map into F^{HW×C'}; subsequently, the matrix obtained through the W_θ transformation is transposed and a matrix multiplication is performed to calculate similarities, yielding a similarity matrix F^{HW×HW}. A softmax operation is then applied along the last dimensionality, which corresponds to the normalized relevance of each pixel with all other position pixels in the current feature map. Finally, g is processed with dimensionality reduction followed by the reshape operation and is multiplied with the matrix F^{HW×HW}; the attention mechanism is thereby applied to all channels of the feature map, and the channels are finally restored by a 1×1 convolution, ensuring that the input and output dimensions are exactly the same. From a mathematical point of view, assuming the feature map of the previous layer of the self-attention network is x ∈ R^{C×N}, it is first mapped into two feature spaces f and g, wherein f = W_f x, g = W_g x, and

β_{ij} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}), with s_{ij} = θ(x_i)^T φ(x_j)

in the formula, β_{ij} denotes the contribution of the i-th position to the j-th region when the j-th region of the feature map is synthesized, C denotes the number of channels of the feature map of the previous layer, and N denotes the number of pixels of the feature map of the previous layer. Thus, the output of the self-attention layer is o = (o_1, o_2, ..., o_j, ..., o_N), wherein:

o_j = v( Σ_{i=1}^{N} β_{ij} h(x_i) ), with h(x_i) = W_h x_i and v(x) = W_v x

in the formula, W_f ∈ R^{C'×C}, W_g ∈ R^{C'×C}, W_h ∈ R^{C'×C} and W_v ∈ R^{C×C'} are the weights of convolution kernels, C' is a hyperparameter, and C' < C. A residual connection is then introduced, and the final output of the self-attention module is: y_i = γ o_i + x_i, wherein γ is a learnable parameter initialized to 0, whose weight gradually increases during training.
The channel attention module acts as a feature detector: channel attention is introduced to learn a weight distribution between the channels, strengthening channels that are useful for the expression recognition task while weakening channels that are irrelevant to it. For each channel of the intermediate feature map, the feature map is compressed into two different spaces through global average pooling and global maximum pooling operations over height and width, respectively, obtaining two feature maps. The two feature maps are then input to two networks using the same set of parameters, i.e., a fully connected neural network with shared parameters; the output vectors of the fully connected layer are summed element-wise, the features of the two spaces are merged, and the final channel weight is obtained through a sigmoid activation function. Specifically: assuming the input feature map is F^{H×W×C}, wherein H, W, C are respectively the height, the width and the number of channels of the feature map, a maximum pooling feature map F_max ∈ R^{1×1×C} and a global average pooling feature map F_avg ∈ R^{1×1×C} are respectively obtained through the pooling; the two feature maps are then sent to the fully connected neural network, which contains only one hidden layer, and the calculation process is as follows:

M_c = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))

Further:

M_c = sigmoid(W_1(W_0(F_avg)) + W_1(W_0(F_max)))

wherein W_0 and W_1 are the shared weights of the fully connected layer, W_0 ∈ R^{(C/r)×C} and W_1 ∈ R^{C×(C/r)}. The self-attention mechanism and the channel attention mechanism are added on the basis of a residual module to constitute an attention residual module, which enhances the feature extraction ability of the facial expression recognition network, captures dependency between long-range features, improves the sensitivity of the model to useful information, and suppresses useless information. The adding manner is divided into a serial manner and a parallel manner, wherein the serial manner is divided into performing self-attention followed by channel attention, and performing channel attention followed by self-attention.
The adding manner of self-attention followed by channel attention in the serial manner is specifically: the previous layer is convolved to obtain a feature map F_in as the input; a self-attention map F_mid is first obtained through the action of the self-attention M_a merged with the input feature map, this result is then used as the input of the channel attention M_c, and finally the feature map obtained through the M_c action is merged with F_mid to obtain the final output of the attention module. The mathematical description is as follows:

F_mid = M_a(F_in) ⊗ F_in
F_out = M_c(F_mid) ⊗ F_mid

The adding manner of channel attention followed by self-attention in the serial manner is specifically: the previous layer is convolved to obtain a feature map F_in as the input; a channel attention map F_mid is first obtained through the action of the channel attention M_c merged with the input feature map, this result is then used as the input of the self-attention M_a, and finally the feature map obtained through the M_a action is merged with F_mid to obtain the final attention map output F_out. The mathematical description is as follows:

F_mid = M_c(F_in) ⊗ F_in
F_out = M_a(F_mid) ⊗ F_mid

wherein ⊗ denotes that the corresponding elements are multiplied.
Steps of the parallel adding manner are: the previous layer is convolved to obtain a feature map F_in as the input; respective feature maps are obtained through the actions of the self-attention M_a and the channel attention M_c; a multiplying operation of the corresponding elements is performed between each obtained feature map and the input feature map to respectively obtain a self-attention map F^a_mid and a channel attention map F^c_mid; finally, the corresponding elements of the two obtained attention maps are added to obtain the final output F_out. The mathematical description is as follows:

F^a_mid = M_a(F_in) ⊗ F_in
F^c_mid = M_c(F_in) ⊗ F_in
F_out = F^a_mid ⊕ F^c_mid

in the formula, ⊕ denotes that the corresponding elements are added, and ⊗ denotes that the corresponding elements are multiplied.
Channel attention and self-attention are provided in a residual module to form an attention residual module, which is specifically divided into three structures: separately using the self-attention mechanism, separately using the channel attention mechanism, and simultaneously using both the self-attention and channel attention mechanisms. The attention residual module is a module in which the attention mechanism is added on the basis of the initial residual module.
Advantageous Effects: the facial expression recognition model with the attention mechanism provided by the present invention introduces a self-attention mechanism on the basis of a residual network, overcomes the limitations of the locality of the convolution operation, and improves the ability of the model to capture long-range correlation features. Considering the relevance between the channels of the feature map, channel attention is introduced to learn the weight distribution between the channels. The attention-based facial expression recognition model used by the present invention has a rapid recognition speed and high recognition accuracy. The network is trained in an end-to-end manner: only one facial expression image needs to be input, and the expression classification is directly output, without performing a large amount of repeated training in advance.
BRIEF DESCRIPTION OF DRAWINGS FIG. 1 is a block diagram of a facial expression recognition method based on an attention mechanism of the present invention;
FIG. 2 is a block diagram of a residual module of the present invention;
FIG. 3 is a block diagram of a self-attention module of the present invention;
FIG. 4 is a block diagram of a channel attention module of the present invention;
FIG. 5 is a block diagram of a mode of self-attention followed by channel attention;
FIG. 6 is a block diagram of a mode of channel attention followed by self-attention;
FIG. 7 is a block diagram of a parallel mode of channel attention and self-attention of the present invention;
FIG. 8(a) is a block diagram of separately using a self-attention mechanism;
FIG. 8(b) is a block diagram of separately using a channel attention mechanism;
FIG. 8(c) is a block diagram of simultaneously using self-attention and channel attention mechanisms;
FIG. 9 is a curve graph of training on FER2013;
FIG. 10 is a curve graph of training on CK+;
FIG. 11 is a confusion matrix on the FER2013 data set; and
FIG. 12 is a confusion matrix on the CK+ data set.
DETAILED DESCRIPTION OF THE EXEMPLARY EMBODIMENTS Below, the exemplary embodiments of the present invention are further described with reference to the accompanying drawings. A facial expression recognition method based on an attention mechanism of the present invention first builds a facial expression recognition model whose structure, following the image input sequence, is: a convolution module, a maximum pooling module, an attention residual module, a maximum pooling module, an attention residual module, a maximum pooling module, two fully connected layers FC1 and FC2, and a softmax function, wherein a converged facial expression prediction result is obtained in an end-to-end manner. The attention residual module is a module in which a self-attention mechanism and a channel attention mechanism are added on the basis of a residual network to improve sensitivity to useful information in the input image and suppress useless information. The adding manner is divided into a serial manner and a parallel manner, wherein the serial manner is divided into performing self-attention followed by channel attention and performing channel attention followed by self-attention, and the parallel manner performs self-attention and channel attention in parallel. Self-attention computes a weighted average over all position pixels of the input facial expression feature map to calculate the relative importance of key positions in the map, the key positions being positions that are important for recognizing the expression, such as the mouth and eyes. Channel attention is then introduced to learn different features in the channel domain, so as to learn interaction features across channels; a channel of the feature map can thus detect a target and be located to the key positions of the feature map, thereby improving robustness. Finally, the self-attention mechanism and the channel attention mechanism are merged to strengthen the ability of the facial expression recognition model to extract the key positions in the facial expression feature map as globally important features, and the repeated maximum pooling and attention residual modules are used to reduce error and output an optimal recognition result through end-to-end learning.
FIG. 1 shows the overall framework of the attention mechanism model. The front portion uses downsampling to perform feature extraction and obtain an expression feature map. The feature map is then input to the attention residual module for feature conversion to improve model performance. Finally, expression classification is implemented by the fully connected layers. The attention residual module includes a self-attention module and a channel attention module.
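As a concrete illustration of this pipeline, a minimal sketch in TensorFlow/Keras (the framework used in the experiments below) might look as follows. The channel counts, kernel sizes and fully connected widths are illustrative assumptions, as is the attention_residual_block callable, which is sketched later in this description; the patent fixes only the ordering of the modules.

```python
# Minimal sketch of the FIG. 1 pipeline, assuming TensorFlow/Keras.
# Channel counts and dense-layer widths are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_model(attention_residual_block, num_classes=7):
    inputs = tf.keras.Input(shape=(48, 48, 1))                # 48x48 grayscale input
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)  # convolution module
    x = layers.MaxPooling2D(2)(x)                             # maximum pooling module
    x = attention_residual_block(x, filters=64)               # attention residual module
    x = layers.MaxPooling2D(2)(x)
    x = attention_residual_block(x, filters=64)               # second attention residual module
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)               # fully connected layer FC1
    x = layers.Dense(128, activation="relu")(x)               # fully connected layer FC2
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # softmax classifier
    return tf.keras.Model(inputs, outputs)
```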
Residual Network: in deep learning, model performance is often improved by increasing the model scale, but as the number of network layers increases, the problem of gradient disappearance occurs, which makes model training difficult. To resolve this problem, the residual network uses short-circuit connections that allow the network to transfer earlier information directly to a module's output layer.
As shown in FIG. 2, the residual module establishes a connection between an input and an output through an identity mapping, so that the convolution layers learn the residual between the input and the output. With F(x, {W_i}) denoting the residual mapping, the output of the residual module is: y = F(x, {W_i}) + x, wherein x and y respectively denote the input and output information of the module.
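For illustration, a minimal sketch of such a residual block follows; the two-layer form of F and the 3×3 kernels are common choices assumed here, not mandated by the text.

```python
# Minimal sketch of the residual module of FIG. 2: y = F(x, {W_i}) + x.
# The two-layer residual mapping F and the 3x3 kernels are assumed choices.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                          # identity mapping of the input
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)      # residual mapping F(x, {W_i})
    y = layers.Add()([y, shortcut])                       # y = F(x, {W_i}) + x
    return layers.Activation("relu")(y)                   # assumes x already has `filters` channels
```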
As shown in FIG. 3, the self-attention module: in a convolutional neural network, the size of a convolution kernel is generally less than seven due to limited calculation resources. Each convolution operation therefore covers only a very small neighborhood around a pixel, and it is difficult to capture features over relatively long distances, such as correlation features between the two eyes. To capture dependency between long-range pixels, convolution operations must be stacked repeatedly and the dependency obtained through back propagation, but this easily causes gradient disappearance and slow convergence. Since the network is deep, a reasonable network structure must be designed that does not hinder the propagation of the gradient. Different from the local calculation of convolution, the core idea of a non-local operation is that, when calculating the output of each position of the feature map, the calculation is no longer performed only with pixels of a local neighborhood; instead, all signals related to the current expression are attended to, and a relevance weight is obtained representing the relevance between every other position and the current position being calculated, defined as follows:

y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j)

wherein i denotes a position in the output feature map, j is an index over all possible positions in the feature map, x is the input feature map, y is the output feature map (its pixel values have changed, but its size is the same as that of the input feature map), f is a function calculating the relevance between any two points, g is a one-variable function for information transformation, and C(x) is a normalization function. Since f and g are both general formulas, a specific form needs to be chosen in conjunction with the neural network. First, g is a one-variable output, implemented by a 1×1 convolution: g(x_j) = W_g x_j. For the function f that calculates the relevance between two positions, the present text computes similarities in an embedding space, with the mathematical expression f(x_i, x_j) = e^{θ(x_i)^T φ(x_j)}, wherein θ(x_i) = W_θ x_i, φ(x_j) = W_φ x_j, and the normalization parameter C(x) = Σ_{∀j} f(x_i, x_j). For a given position i, (1/C(x)) f(x_i, x_j) becomes a softmax over all positions j, and the obtained output of the self-attention layer is y = softmax(x^T W_θ^T W_φ x) g(x). Supposing an input F^{H×W×C} of the network is transformed into two embedding spaces through the two convolution weights W_θ and W_φ to obtain F^{H×W×C'} and F^{H×W×C'}, generally C' < C; the purpose here is to reduce the number of channels and thus the amount of calculation. Second, a reshape operation transforms each output feature map into F^{HW×C'}; subsequently one matrix is transposed and a matrix multiplication is performed to calculate similarities, yielding a similarity matrix F^{HW×HW}, and a softmax operation is applied along the last dimension, corresponding to the normalized relevance of each pixel with all other position pixels in the current feature map. Finally, the same operations are performed for g: it is processed with dimensionality reduction followed by the reshape operation, multiplied with the matrix F^{HW×HW}, the attention mechanism is applied to all channels of the feature map, and the channels are finally restored by a 1×1 convolution to ensure that the input and output dimensions are exactly the same.
From a mathematical point of view, assuming the feature map of the previous layer is x ∈ R^{C×N}, it is first mapped into two feature spaces f and g, wherein f = W_f x, g = W_g x, and

β_{ij} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}), with s_{ij} = θ(x_i)^T φ(x_j)

wherein β_{ij} denotes the contribution of the i-th position to the j-th region when the j-th region of the feature map is synthesized, C denotes the number of channels of the feature map of the previous layer, and N denotes the number of pixels of the feature map of the previous layer. Thus, the output of the self-attention layer is o = (o_1, o_2, ..., o_j, ..., o_N), wherein

o_j = v( Σ_{i=1}^{N} β_{ij} h(x_i) ), with h(x_i) = W_h x_i and v(x) = W_v x

wherein W_f ∈ R^{C'×C}, W_g ∈ R^{C'×C}, W_h ∈ R^{C'×C} and W_v ∈ R^{C×C'} are the weights of convolution kernels, C' is a hyperparameter, and C' < C. In addition, in order to better perform gradient back propagation, a residual connection is introduced, so that the final output of the attention module is: y_i = γ o_i + x_i, wherein γ is a learnable parameter initialized to 0, whose weight gradually increases during training.
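The computation above can be summarized in code. The following is a minimal TensorFlow sketch of the self-attention module; the reduction C' = C // 8 is an assumed value, and the layer follows the β, o and y_i = γ o_i + x_i formulas just given.

```python
# Minimal TensorFlow sketch of the self-attention module of FIG. 3.
# The reduction C' = C // 8 is an assumed value, not fixed by the text.
import tensorflow as tf
from tensorflow.keras import layers

class SelfAttention(layers.Layer):
    def __init__(self, channels, reduction=8):
        super().__init__()
        c_red = max(channels // reduction, 1)           # C' < C to cut computation
        self.theta = layers.Conv2D(c_red, 1)            # W_theta, 1x1 convolution
        self.phi = layers.Conv2D(c_red, 1)              # W_phi, 1x1 convolution
        self.h = layers.Conv2D(c_red, 1)                # W_h, 1x1 convolution
        self.v = layers.Conv2D(channels, 1)             # W_v restores the C channels
        self.gamma = self.add_weight(name="gamma", shape=(), initializer="zeros")

    def call(self, x):
        b = tf.shape(x)[0]
        hgt, wid = tf.shape(x)[1], tf.shape(x)[2]
        n = hgt * wid                                   # N = H * W positions
        f = tf.reshape(self.theta(x), (b, n, -1))       # reshape to (B, N, C')
        g = tf.reshape(self.phi(x), (b, n, -1))
        h = tf.reshape(self.h(x), (b, n, -1))
        s = tf.matmul(f, g, transpose_b=True)           # s_ij = theta(x_i)^T phi(x_j)
        beta = tf.nn.softmax(s, axis=-1)                # relevance weights over all positions
        o = tf.matmul(beta, h)                          # weighted sum of h(x_j) per position
        o = tf.reshape(o, (b, hgt, wid, -1))
        o = self.v(o)                                   # 1x1 conv back to C channels
        return self.gamma * o + x                       # residual: y_i = gamma * o_i + x_i
```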
As shown in FIG. 4, the channel attention module: each channel of the feature map plays the role of a feature detector, and thus each channel is concerned with which features are useful for the task. In a general convolutional neural network, however, no degree of importance is distinguished between the channels; each channel is treated equally, ignoring that the contribution of each channel to the task differs. In view of this, the present invention introduces channel attention to learn a weight distribution between the channels, strengthening channels that are useful for the expression recognition task while weakening channels that are irrelevant to it. To compute channel attention efficiently, for each channel of an intermediate feature map, the feature map is compressed into two different spaces through global average pooling and global maximum pooling operations over height and width; the two obtained feature maps are then input to a fully connected network with shared parameters, the output vectors of the fully connected layer are summed element-wise, the features of the two spaces are merged, and the final channel weight is obtained through a sigmoid activation function. The detailed structure is shown in FIG. 4.
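A minimal sketch of this channel-attention computation follows; the reduction ratio r (here 16) is an assumed hyperparameter, and the layer returns the channel weights M_c, which the caller multiplies with the feature map.

```python
# Minimal TensorFlow sketch of the channel attention module of FIG. 4.
# The reduction ratio r = 16 is an assumed hyperparameter.
import tensorflow as tf
from tensorflow.keras import layers

class ChannelAttention(layers.Layer):
    def __init__(self, channels, r=16):
        super().__init__()
        # One shared hidden layer (W_0, then W_1) applied to both descriptors.
        self.w0 = layers.Dense(max(channels // r, 1), activation="relu")  # W_0
        self.w1 = layers.Dense(channels)                                  # W_1

    def call(self, x):
        f_avg = tf.reduce_mean(x, axis=[1, 2])     # global average pooling -> (B, C)
        f_max = tf.reduce_max(x, axis=[1, 2])      # global maximum pooling -> (B, C)
        m_c = tf.sigmoid(self.w1(self.w0(f_avg)) + self.w1(self.w0(f_max)))
        return m_c[:, None, None, :]               # M_c reshaped to (B, 1, 1, C)
```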
Assuming the input feature map is F^{H×W×C}, wherein H, W, C are respectively the height, the width and the number of channels of the feature map, a maximum pooling feature map F_max ∈ R^{1×1×C} and a global average pooling feature map F_avg ∈ R^{1×1×C} are respectively obtained through the pooling; the two feature maps are then sent to the fully connected network, which contains only one hidden layer, and the calculation process is as follows:

M_c = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))

Further:

M_c = sigmoid(W_1(W_0(F_avg)) + W_1(W_0(F_max)))

wherein W_0 and W_1 are the shared weights of the fully connected layer, W_0 ∈ R^{(C/r)×C} and W_1 ∈ R^{C×(C/r)}. Attention Merging: in order to enhance the ability of the network model to extract features and capture the dependency between long-range features, the present invention adds the self-attention mechanism and the channel attention mechanism on the basis of the residual module to constitute an attention residual module, improving the sensitivity of the model to useful information and suppressing useless information. The adding manner is divided into a serial manner and a parallel manner, wherein the serial manner is divided into performing self-attention followed by channel attention, and performing channel attention followed by self-attention. Self-attention Followed by Channel Attention: this serial mode is shown in FIG. 5. The previous layer is convolved to obtain a feature map F_in as the input; a self-attention map F_mid is first obtained through the action of the self-attention M_a merged with the input feature map, this result is used as the input of the channel attention M_c, and finally the feature map obtained through the M_c action is merged with F_mid to obtain the final output of the attention module. The formal description of the entire process is:

F_mid = M_a(F_in) ⊗ F_in
F_out = M_c(F_mid) ⊗ F_mid

Channel Attention Followed by Self-attention: this serial mode is shown in FIG. 6. The previous layer is convolved to obtain a feature map F_in as the input; a channel attention map F_mid is first obtained through the action of the channel attention M_c merged with the input feature map, this result is used as the input of the self-attention M_a, and finally the feature map obtained through the M_a action is merged with F_mid to obtain the final attention map output F_out. The formal description of the entire process is:

F_mid = M_c(F_in) ⊗ F_in
F_out = M_a(F_mid) ⊗ F_mid

wherein ⊗ denotes that the corresponding elements are multiplied.
Parallel Manner: the parallel connecting manner is shown in FIG. 7. The previous layer is convolved to obtain a feature map F_in as the input; respective feature maps are first obtained through the actions of the self-attention M_a and the channel attention M_c; a multiplying operation of the corresponding elements is then performed between each obtained feature map and the input feature map to respectively obtain a self-attention map F^a_mid and a channel attention map F^c_mid; finally, the corresponding elements of the two obtained attention maps are added to obtain the final output F_out. The formal description of the entire process is:

F^a_mid = M_a(F_in) ⊗ F_in
F^c_mid = M_c(F_in) ⊗ F_in
F_out = F^a_mid ⊕ F^c_mid

wherein ⊕ denotes that the corresponding elements are added, and ⊗ denotes that the corresponding elements are multiplied.
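The three merging manners just described can be sketched as follows, assuming the SelfAttention and ChannelAttention layers sketched earlier: SelfAttention already contains its residual term, so applying it plays the role of M_a here, while ChannelAttention returns weights that are multiplied element-wise with the feature map.

```python
# Sketch of the three merging manners, under the assumptions stated above.

def serial_self_then_channel(f_in, self_att, channel_att):
    f_mid = self_att(f_in)                   # F_mid, role of M_a(F_in) (x) F_in
    return channel_att(f_mid) * f_mid        # F_out = M_c(F_mid) (x) F_mid

def serial_channel_then_self(f_in, self_att, channel_att):
    f_mid = channel_att(f_in) * f_in         # F_mid = M_c(F_in) (x) F_in
    return self_att(f_mid)                   # F_out, role of M_a(F_mid) (x) F_mid

def parallel_merge(f_in, self_att, channel_att):
    f_mid_a = self_att(f_in)                 # F^a_mid, role of M_a(F_in) (x) F_in
    f_mid_c = channel_att(f_in) * f_in       # F^c_mid = M_c(F_in) (x) F_in
    return f_mid_a + f_mid_c                 # F_out = F^a_mid (+) F^c_mid
```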
Attention Residual: in order to make better use of the channel attention and self-attention designed above, the present text inserts them into the residual module. This is specifically divided into three structural designs: separately using the self-attention mechanism, separately using the channel attention mechanism, and simultaneously using both the self-attention and channel attention mechanisms. The attention residual module is a module in which the attention mechanism is added on the basis of the initial residual module; the specific structures are shown in FIGS. 8(a), 8(b) and 8(c).
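For illustration, the FIG. 8(c) variant, i.e. the attention residual module using both mechanisms (here in the channel-then-self order that the ablation experiments below select), might be assembled as follows; the exact placement of the attention layers inside the block is an assumption, and the ChannelAttention and SelfAttention layers are the sketches given earlier.

```python
# Sketch of the attention residual module of FIG. 8(c); layer placement
# inside the block is an assumption, not fixed by the text.
import tensorflow as tf
from tensorflow.keras import layers

def attention_residual_block(x, filters):
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = ChannelAttention(filters)(y) * y     # channel attention first: M_c(F) (x) F
    y = SelfAttention(filters)(y)            # then self-attention (residual inside)
    y = layers.Add()([y, shortcut])          # outer residual connection
    return layers.Activation("relu")(y)
```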
In order to verify the effectiveness of the model of the present text, experiments are carried out on two data sets, FER2013 and CK+. The experiments are based on the TensorFlow framework, and the experimental platform is: Intel Core i7-6850 six-core CPU, 64 GB memory, GTX1080Ti graphics card, Ubuntu 16.04. All experiments are single-card trainings. Embodiment: the FER2013 data set has 35888 facial expression images in total, containing faces under different illumination and postures; there are 28709 images in the training set, and 3589 images in each of the public test set and the private test set. The images are grayscale with a size of 48×48 and cover seven classes in total: angry, disgust, fear, happy, surprise, sad and neutral.
The CK+ data set is also commonly used for facial expression recognition. It includes 593 image sequences of 123 subjects in total, and exhibits the progression of a subject's expression from a natural state to the expression peak. 327 of the sequences are marked with expression labels covering eight kinds of expressions: natural, disgust, contempt, fear, happy, sad, surprise and angry. In the experiment of the present text, 981 images of seven kinds of expressions are selected, and the images are pre-processed to a size of 48×48.
Since the two data sets are relatively small, the present text augments them using data augmentation, primarily random rotation, random adjustment of brightness, random graying and so on. The CK+ data set is augmented to about 29000 images, and FER2013 to about 63000 images. The data augmentation effectively improves the model accuracy rate while preventing the occurrence of overfitting.
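A minimal sketch of such an augmentation pipeline, assuming a recent TensorFlow/Keras with preprocessing layers, is given below; the rotation and brightness ranges are illustrative values, not those used in the experiments.

```python
# Sketch of the augmentation pipeline; ranges are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomRotation(0.05),    # random rotation, roughly +/-18 degrees
    layers.RandomBrightness(0.2),   # random brightness adjustment
    # Random graying applies to colour sources; the 48x48 inputs used here
    # are already grayscale, so that step is omitted from this sketch.
])
```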
1. Ablation Experiment The effectiveness of the self-attention mechanism and the channel attention mechanism is verified through experiments. For the ablation experiment, the FER2013 and CK+ data sets are used, and a residual module is used as the basic module to build a reference model. In the FER2013 experiment, the official data set split is used: 28709 images for training, 3589 images for validating the model, and 3589 images for testing the accuracy rate of the final model. For the CK+ data set, we divide the augmented data into a training set, a validation set and a test set in the proportion 7:2:1.
During training, Adam is selected as the optimizer, the learning rate is set to 0.0001, training runs for 50 epochs in total, and the batch_size is set to 64; a minimal sketch of this configuration is given below, and the experimental results are shown in Table 1.
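In the sketch, only the optimizer, learning rate, epoch count and batch size come from the text; the loss function and the data array names are assumptions.

```python
# Sketch of the stated training configuration (Adam, lr 0.0001, 50 epochs,
# batch_size 64); loss choice and data arrays are assumptions.
import tensorflow as tf

model = build_model(attention_residual_block)   # pipeline and block sketched above
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",     # assumed; labels as integer classes
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, batch_size=64)
```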
Table 1 Ablation Experimental Results

Model                              FER2013(%)   CK+(%)
Baseline                           69.57        94.37
Baseline+Self                      72.32        94.89
Baseline+Channel                   69.63        95.28
Baseline+Channel+Self              74.15        97.89
Baseline+Self+Channel              71.20        98.54
Baseline+Self&Channel (Parallel)   72.01        97.48

The following conclusions are obtained from Table 1: (1) On both FER2013 and CK+, the performance of the reference model is clearly inferior to the models with an attention mechanism added, regardless of which kind of attention and which adding manner is used; this indicates that the attention mechanism improves the feature extraction ability of the neural network and helps the performance of the facial expression recognition model. (2) Among the models with attention added, mixed attention is clearly better than a single attention manner, indicating that increasing the non-linear mapping of the model is effective for the facial expression recognition task. (3) For the mixed attention models, on the FER2013 data set, performing channel attention followed by self-attention has the best effect; the accuracy rate is respectively increased by 3.98% and 2.89% in comparison with the parallel manner and with performing self-attention followed by channel attention. On the CK+ data set, performing the self-attention mechanism first has the best effect, and the accuracy rate is increased by 0.66% and 1.48% in comparison with the parallel manner and with channel attention followed by self-attention.
2. Solution Selection From the analysis of the ablation experiment in the previous section, the combination mode of first performing channel attention and then performing self-attention has the best overall performance and achieves the higher accuracy rate on both the FER2013 and CK+ data sets. The present text therefore selects this model as the final model. In order to verify its effectiveness, a comparative experiment is made between this model and other current methods; the experimental results are shown in Table 2 and Table 3.

Table 2 FER2013 Comparative Experiment

Model                   FER2013(%)
CNN+SVM                 69.30
LBP+SVM                 70.86
CPC                     71.35
Inception               66.40
Model of Present Text   74.15

Table 3 CK+ Comparative Experiment

Model                   CK+(%)
Inception               93.20
FRR-CNN                 92.06
Document                97.25
3DCNN                   85.90
Model of Present Text   97.89

The following conclusions can be obtained from the experimental data in Table 2 and Table 3: (1) Compared with the three traditional facial expression recognition methods listed first, the methods using deep learning clearly improve the accuracy rate of expression recognition; the features extracted by the convolutional neural network describe expressions better than artificial feature operators. (2) Compared with the current mainstream methods based on deep learning, the above self-attention mechanism model achieves the higher accuracy rate on both data sets. (3) The accuracy rate on the FER2013 data set is clearly lower than on the CK+ data set, indicating that the quality of the data set has a certain influence on the experimental results.
The scales of the FER2013 and CK+ data sets are both relatively small, and the FER2013 data set also contains erroneous labels and non-face images; all of these interfere with the training of the model, thereby affecting its performance.
FIG. 9 and FIG. 10 show the training loss curves and accuracy rate curves of the model of the present text on the FER2013 and CK+ data sets.
It can be seen from the graphs that the training process on the FER2013 data set is less stable than on the CK+ data set, which relates to the data sets themselves. Inspecting the two data sets reveals that the expression images of FER2013 show greater variation, low image resolution, and uneven image quality; this brings a certain interference to the training process, and the accuracy rate finally stabilizes at about 75%. The image quality of the CK+ data set is better and its distribution is uniform, so training on this data set is more stable, the final accuracy rate is higher, and the accuracy rate on the training and validation sets is about 98%. FIG. 11 is the confusion matrix obtained from the experiment on the FER2013 data set, displaying the classification accuracy rate for the seven kinds of expressions; the horizontal axis represents predicted labels, and the vertical axis represents real labels. It can be seen from the matrix that the model of the present text, with the self-attention unit added, has an increased accuracy rate on every expression; the expression "sad" has the greatest improvement, of 13%, indicating that adding the self-attention unit makes the expression classification more accurate. However, there are certain differences between the accuracy rates of the seven expressions: "happy", the most accurate, reaches 92%, while "sad", "fear" and "angry" reach only 49%, 53% and 64%, respectively. On one hand, the data amount for these three expressions is relatively small, and the imbalanced samples bring certain negative effects to network training; on the other hand, the three expressions have certain similarities, their feature differences are not obvious, and they are not easily distinguished.
FIG. 12 shows the confusion matrix obtained from the CK+ test. It can be seen that the accuracy rates of most facial expressions have been improved, consistent with the FER2013 results. Since the data amount for angry, sad and contempt is relatively small, and the feature differences between these expressions are not very obvious, their recognition rates are slightly lower than those of expressions such as fear, happiness and surprise.

Claims (9)

CLAIMS
What is claimed is:
1. A facial expression recognition method based on an attention mechanism, characterized in: first building a facial expression recognition model, of which a structure according to an image input sequence is: a convolution module, a maximum pooling module, an attention residual module, a maximum pooling module, an attention residual module, a maximum pooling module and two fully connected layers, and a softmax function, wherein a converged facial expression prediction result is obtained through an end-to-end manner; the attention residual module is a module in which a self-attention mechanism and a channel attention mechanism are added on a basis of a residual network to improve sensitivity to useful information in an input image and suppress useless information; an adding manner is divided into a serial manner and a parallel manner, wherein the serial manner is divided into performing self-attention followed by channel attention, and performing channel attention followed by self-attention, and the parallel manner is performing self-attention and channel attention in parallel; the self-attention is used to calculate a weighted average value of all position pixels in an input facial expression feature map to calculate relative importance of key positions in the facial expression feature map, and the key positions are positions that are important for recognizing the expression in the feature map, including a mouth and eyes; then using the channel attention to learn different features on a channel domain so as to learn interaction features in different channels, so that a channel of the feature map can detect a target and be located to the key positions of the feature map, thereby improving robustness; and finally merging the self-attention mechanism and the channel attention mechanism to strengthen an ability of the facial expression recognition model to extract the key positions in the facial expression feature map as global important features, and using the repeated maximum pooling module and attention residual module to reduce error and output an optimal recognition result through an end-to-end learning manner.
2. The facial expression recognition method based on the attention mechanism of claim 1, characterized in that, in a building process of the facial expression recognition model, the self-attention mechanism is introduced on the basis of the used residual network y = F(x, {W_i}) + x, wherein x and y respectively denote input and output information of the residual network, and F(x, {W_i}) denotes residual mapping.
3. The facial expression recognition method based on the attention mechanism of claim 1, characterized in that, a self-attention module uses non-local operations, while calculating an output of each position in the feature map, to attend to all signals related to a current expression, and obtains a relevance weight to represent relevance between other positions and a current position being calculated, which is defined as follows:

y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j)

wherein i denotes any position in an output feature map, j is an index of all possible positions in the feature map, x is an input feature map, y is the output feature map, wherein pixel values of the output feature map have changed and a dimension is the same as the input feature map, f is a function for calculating relevance between any two points, g is a one-variable function for information transformation, and C(x) is a normalization function; since f and g both are general formulas, a specific form needs to be considered in conjunction with a neural network: first, g is a one-variable output, which is replaced by a 1×1 convolution of the form g(x_j) = W_g x_j; any two points represented by f are substituted into two positions in an embedding space, with the calculation formula f(x_i, x_j) = e^{θ(x_i)^T φ(x_j)}, wherein θ(x_i) = W_θ x_i, φ(x_j) = W_φ x_j, and the normalization parameter C(x) = Σ_{∀j} f(x_i, x_j); as for a given position i, (1/C(x)) f(x_i, x_j) becomes softmax for calculating all positions j, and the obtained output of a self-attention layer is:

y = softmax(x^T W_θ^T W_φ x) g(x)

supposing an input feature map of a self-attention network is F^{H×W×C}, it is transformed into two embedding spaces through two convolution weights W_θ and W_φ to obtain F^{H×W×C'} and F^{H×W×C'}, generally C' < C, a purpose here being to reduce a number of channels and reduce a calculation amount; second, a reshape operation is performed on this output feature map to transform it into F^{HW×C'}; subsequently a transpose operation is performed on the matrix obtained through the W_θ transformation and a matrix multiplication is performed to calculate similarities and obtain a similarity matrix F^{HW×HW}, and a softmax operation is performed on a last dimensionality, thereby obtaining normalized relevance of each pixel with other position pixels in the current feature map; finally, g is processed with dimensionality reduction and then the reshape operation, and is multiplied with the matrix F^{HW×HW}; the attention mechanism is applied to all channels of the feature map, and the channels are finally restored through a convolution of 1×1, to ensure that input and output dimensions are exactly the same; from a mathematical point of view, assuming that the feature map of a previous layer of the self-attention network is x ∈ R^{C×N}, it is first mapped into two feature spaces f and g, wherein f = W_f x, g = W_g x, and

β_{ij} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}), s_{ij} = θ(x_i)^T φ(x_j)

in the formula, β_{ij} denotes contribution of an i-th position to a j-th region when the j-th region of the feature map is synthesized, C denotes the number of channels of the feature map of the previous layer, and N denotes the number of pixels of the feature map of the previous layer; thus, the output of the self-attention layer is o = (o_1, o_2, ..., o_j, ..., o_N), wherein:

o_j = v( Σ_{i=1}^{N} β_{ij} h(x_i) ), with h(x_i) = W_h x_i and v(x) = W_v x

in the formula, W_f ∈ R^{C'×C}, W_g ∈ R^{C'×C}, W_h ∈ R^{C'×C} and W_v ∈ R^{C×C'} are weights of a convolution kernel, C' is a hyperparameter, and C' < C; then a residual connection is introduced, and the final output of the self-attention module is: y_i = γ o_i + x_i, wherein γ is a learnable parameter initialized to 0, and a weight is gradually increased in a training process.
4. The facial expression recognition method based on the attention mechanism of claim 1, characterized in that, a channel attention module is used to act as a feature detector, introducing the channel attention to learn a weight distribution between the channels to strengthen channels that are useful for an expression recognition task while weakening channels that are irrelevant to the task; as for each channel of the intermediate feature map, the feature map is compressed into two different spaces through global average pooling and global maximum pooling operations based on height and width to obtain two feature maps; then the obtained two feature maps are input to two networks using a same set of parameters, i.e., into a fully connected neural network having shared parameters; output vectors of the fully connected layer are summed according to the corresponding elements, the features of the two spaces are merged, and finally a final channel weight is obtained through a sigmoid activation function; specifically as follows: assuming that the input feature map is F^{H×W×C}, wherein H, W, C are respectively a height, a width and the number of channels of the feature map, a maximum pooling feature map F_max ∈ R^{1×1×C} and a global average pooling feature map F_avg ∈ R^{1×1×C} are respectively obtained through the pooling; then the two feature maps are sent to the fully connected neural network which only contains one hidden layer, and a calculation process is as follows:

M_c = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))

Further:

M_c = sigmoid(W_1(W_0(F_avg)) + W_1(W_0(F_max)))

wherein W_0 and W_1 are shared weights of the fully connected layer, and W_0 ∈ R^{(C/r)×C}, W_1 ∈ R^{C×(C/r)}.
5. The facial expression recognition method based on the attention mechanism of claim 4, characterized in that, the self-attention mechanism and the channel attention mechanism are added on the basis of the residual module to constitute an attention residual module for enhancing a feature extracting ability of the facial expression recognition model network, capturing dependency between long-range features, improving sensitivity of the model to useful information, and suppressing useless information; and the adding manner is divided into a serial manner and a parallel manner, wherein the serial manner is divided into performing self-attention followed by channel attention, and performing channel attention followed by self-attention.
6. The facial expression recognition method based on the attention mechanism of claim 5, characterized in that, the adding manner of self-attention followed by channel attention in the serial manner is specifically: the previous layer is subjected to convolution to obtain a feature map F_in as an input; a self-attention map F_mid is obtained through the action of the self-attention M_a merged with the input feature map, and is used as the input of the channel attention M_c; finally the feature map obtained through the M_c action is merged with F_mid to obtain the final output of the attention module, and mathematical description is as follows:

F_mid = M_a(F_in) ⊗ F_in
F_out = M_c(F_mid) ⊗ F_mid
7. The facial expression recognition method based on the attention mechanism of claim 5, characterized in that, the adding manner of channel attention followed by self-attention in the serial manner is specifically: the previous layer is subjected to convolution to obtain a feature map F_in as an input; a channel attention map F_mid is obtained through the action of the channel attention M_c merged with the input feature map, and is used as the input of the self-attention M_a; finally the feature map obtained through the M_a action is merged with F_mid to obtain the final attention map output F_out, and mathematical description is as follows:

F_mid = M_c(F_in) ⊗ F_in
F_out = M_a(F_mid) ⊗ F_mid

wherein ⊗ denotes that the corresponding elements are multiplied.
8. The facial expression recognition method based on the attention mechanism of claim 5, characterized in that, steps of the parallel adding manner are: the previous layer is subjected to convolution to obtain a feature map F_in as an input; respective feature maps are obtained through the actions of the self-attention M_a and the channel attention M_c; a multiplying operation of the corresponding elements is made on each obtained feature map and the input feature map to respectively obtain a self-attention map F^a_mid and a channel attention map F^c_mid; finally the corresponding elements of the two obtained attention maps are added to obtain the final output F_out, and mathematical description is as follows:

F^a_mid = M_a(F_in) ⊗ F_in
F^c_mid = M_c(F_in) ⊗ F_in
F_out = F^a_mid ⊕ F^c_mid

in the formula, ⊕ denotes that the corresponding elements are added, and ⊗ denotes that the corresponding elements are multiplied.
9. The facial expression recognition method based on the attention mechanism of claim 1, characterized in that, channel attention and self-attention are provided in a residual module to form an attention residual module, which is specifically divided into three structures: separately using a self-attention mechanism, separately using a channel attention mechanism, and simultaneously using the self-attention and channel attention mechanisms; and the attention residual module is a module in which the attention mechanism is added on the basis of an initial residual module.
LU102496A 2020-11-03 2021-02-08 Facial expression recognition method based on attention mechanism LU102496B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011207089.0A CN112257647A (en) 2020-11-03 2020-11-03 Human face expression recognition method based on attention mechanism

Publications (1)

Publication Number Publication Date
LU102496B1 true LU102496B1 (en) 2021-08-09

Family

ID=74268108

Family Applications (1)

Application Number Title Priority Date Filing Date
LU102496A LU102496B1 (en) 2020-11-03 2021-02-08 Facial expression recognition method based on attention mechanism

Country Status (2)

Country Link
CN (1) CN112257647A (en)
LU (1) LU102496B1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784764B (en) * 2021-01-27 2022-07-12 南京邮电大学 Expression recognition method and system based on local and global attention mechanism
CN113033310A (en) * 2021-02-25 2021-06-25 北京工业大学 Expression recognition method based on visual self-attention network
CN113076890B (en) * 2021-04-09 2022-07-29 南京邮电大学 Facial expression recognition method and system based on improved channel attention mechanism
CN113111779A (en) * 2021-04-13 2021-07-13 东南大学 Expression recognition method based on attention mechanism
CN113255530B (en) * 2021-05-31 2024-03-29 合肥工业大学 Attention-based multichannel data fusion network architecture and data processing method
CN113223181B (en) * 2021-06-02 2022-12-23 广东工业大学 Weak texture object pose estimation method
CN113486744B (en) * 2021-06-24 2023-02-14 中国科学院西安光学精密机械研究所 Student learning state evaluation system and method based on eye movement and facial expression paradigm
CN113570035B (en) * 2021-07-07 2024-04-16 浙江工业大学 Attention mechanism method utilizing multi-layer convolution layer information
CN113688204B (en) * 2021-08-16 2023-04-25 南京信息工程大学 Multi-person session emotion prediction method utilizing similar scenes and mixed attention
CN115439912A (en) * 2022-09-20 2022-12-06 支付宝(杭州)信息技术有限公司 Method, device, equipment and medium for recognizing expression
CN115294483A (en) * 2022-09-28 2022-11-04 山东大学 Small target identification method and system for complex scene of power transmission line
CN116152890B (en) * 2022-12-28 2024-01-26 北京融威众邦电子技术有限公司 Medical fee self-service payment system
CN116311192B (en) * 2023-05-15 2023-08-22 中国科学院长春光学精密机械与物理研究所 System and method for space target positioning, regional super-resolution reconstruction and type identification
CN116645716B (en) * 2023-05-31 2024-01-19 南京林业大学 Expression recognition method based on local features and global features
CN116740795B (en) * 2023-08-16 2023-11-24 天津师范大学 Expression recognition method, model and model training method based on attention mechanism

Also Published As

Publication number Publication date
CN112257647A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
LU102496B1 (en) Facial expression recognition method based on attention mechanism
Storey et al. 3DPalsyNet: A facial palsy grading and motion recognition framework using fully 3D convolutional neural networks
Liu et al. The research of virtual face based on Deep Convolutional Generative Adversarial Networks using TensorFlow
CN111754637B (en) Large-scale three-dimensional face synthesis system with suppressed sample similarity
Zhao et al. Joint face alignment and segmentation via deep multi-task learning
Zheng et al. Efficient face detection and tracking in video sequences based on deep learning
Paul et al. Extraction of facial feature points using cumulative histogram
Raut Facial emotion recognition using machine learning
Daihong et al. Facial expression recognition based on attention mechanism
Nemani et al. Deep learning based holistic speaker independent visual speech recognition
Sharma et al. An improved technique for face age progression and enhanced super-resolution with generative adversarial networks
Ge et al. Masked face recognition with convolutional visual self-attention network
Pande et al. Development and deployment of a generative model-based framework for text to photorealistic image generation
Zhi et al. Micro-expression recognition with supervised contrastive learning
Yu et al. SegNet: a network for detecting deepfake facial videos
Guo Impact on biometric identification systems of COVID-19
Yang et al. Hsi: Human saliency imitator for benchmarking saliency-based model explanations
Jadhav et al. HDL-PI: hybrid DeepLearning technique for person identification using multimodal finger print, iris and face biometric features
Muthukumar et al. Vision based hand gesture recognition for Indian sign languages using local binary patterns with support vector machine classifier
Huang et al. Expression-targeted feature learning for effective facial expression recognition
Zheng et al. BLAN: Bi-directional ladder attentive network for facial attribute prediction
Khellat-Kihel et al. Gender and ethnicity recognition based on visual attention-driven deep architectures
Kakani et al. Segmentation-based ID preserving iris synthesis using generative adversarial networks
Liu et al. Face expression recognition based on improved convolutional neural network
Wang et al. A fixed-point rotation-based feature selection method for micro-expression recognition

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20210809