CN114170657A - Facial emotion recognition method integrating attention mechanism and high-order feature representation - Google Patents

Facial emotion recognition method integrating attention mechanism and high-order feature representation

Info

Publication number
CN114170657A
Authority
CN
China
Prior art keywords
output
network
image
attention
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111439715.3A
Other languages
Chinese (zh)
Inventor
孙强
梁乐
梅路洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202111439715.3A priority Critical patent/CN114170657A/en
Publication of CN114170657A publication Critical patent/CN114170657A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a facial emotion recognition method that integrates an attention mechanism with high-order feature representation. Target images are first collected and divided into a training sample set and a test sample set. The original emotion annotation value of each sample image in the training set is then read, and each sample image is sent into a multi-task cascaded convolutional neural network to obtain preprocessed output images; the preprocessed images are then input into a residual attention network to obtain attention output feature maps. Finally, the output feature maps are sent into a channel-based global second-order pooling network and a spatial-position-based global second-order pooling network, respectively, to obtain emotion-related output features, and emotional state values are obtained through a regressor. The invention addresses the problems that existing attention-based facial emotion recognition methods struggle to model long-distance dependencies among emotional features and have insufficient nonlinear network characterization capability.

Description

Facial emotion recognition method integrating attention mechanism and high-order feature representation
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a facial emotion recognition method integrating an attention mechanism and high-order feature representation.
Background
With the continuous development of society and the growing demand for fast, effective automatic emotion recognition in many areas, biometric recognition technology has advanced rapidly over the last decade. In everyday life, people must recognize the emotions of others and respond with appropriate behavior in order to communicate and interact normally.
Aaron Sloman raised the study of emotion in artificial intelligence as early as 1981. In 1985, Marvin Minsky, one of the founders of artificial intelligence, raised the question of computers and emotion. Picard formally proposed the concept of affective computing in 1995 and, in Affective Computing (1997), defined it as computing that is related to, arises from, or is capable of influencing emotion.
As one of the important branches of the recognition field, emotion recognition has been a major research topic for many scholars in recent years. For static face images, conventional recognition methods usually have researchers extract emotional features by hand (for example, the 68 facial landmarks provided by Dlib) and then feed these features into a pre-designed classifier (SVM, decision tree, random forest, etc.) to obtain the final emotion prediction. However, manually selecting appropriate emotional features in practice requires extensive experience and often a great deal of time and effort. Moreover, different features contribute to the final result to different degrees, and traditional methods do not distinguish these contributions well.
In recent years, with the continuous growth of computing power, deep learning has re-emerged. Owing to its strong feature learning ability and high performance, it has gradually replaced traditional machine learning and become the mainstream approach in the recognition field. Most current deep-learning-based emotion analysis methods incorporate an attention mechanism to screen effective features, but they lack effective modeling of the long-distance dependencies among emotional semantic features. Moreover, conventional approaches train separate models for different tasks, failing to exploit the similarity between related tasks, and the nonlinear characterization capability of the deep network remains insufficient.
Disclosure of Invention
The invention aims to provide a facial emotion recognition method that integrates an attention mechanism with high-order feature representation, solving the problems that existing attention-based facial emotion recognition methods struggle to model long-distance dependencies between emotional features and have insufficient nonlinear network characterization capability.
The invention adopts the technical scheme that a facial emotion recognition method integrating an attention mechanism and high-order feature representation is implemented according to the following steps:
Step 1, collect target images and divide them into a training sample set x_train and a test sample set x_test;
Step 2, read the original emotion annotation value of each sample image in the training sample set x_train, send each sample image in x_train into the multi-task cascaded convolutional neural network MTCNN, and complete face alignment based on face detection and 5 key feature points to obtain the output images x_input = [x_1, x_2, ..., x_n], where x_n denotes the n-th output image and n denotes the total number of output images, i.e. the total number of images in the training sample set x_train;
Step 3, input the preprocessed images x_input into a residual attention network; the trunk branch extracts features M_i,c(x) of different receptive fields, and the mask branch learns attention weights T_i,c(x); finally, the trunk-branch output and the mask-branch output are combined through a dot-product operation to obtain the attention output feature map H_i,c(x);
Step 4, send the attention output feature map H_i,c(x) obtained in step 3 into a channel-based global second-order pooling (GSoP) network and a spatial-position-based global second-order pooling network, respectively; the channel-based network outputs the dependency Z_trans between feature maps, and the spatial-position-based network outputs the dependency Z_non-local between spatial positions in the feature map;
Step 5, fuse the dependency Z_trans between feature maps and the dependency Z_non-local between spatial positions within the feature map to obtain the output feature Z_fusion;
Step 6, send the output feature Z_fusion obtained in step 5 into a two-stage multi-task learning network, and obtain the emotional state values Arousal and Valence using a linear regressor.
The present invention is also characterized in that,
in the step 1, the method comprises the following steps of,
for training sample set xtrainThe sample image is a tensor x with dimensions of n x h x wtrain=[(h1,w1),(h2,w2),...,(hn,wn)]Wherein n represents the total number of samples of the training sample set, h and w represent the length and width of each sample image respectively, and the original emotion marking value of the sample is a vector y with dimension of n multiplied by 2a,v=[(a1,v1),(a2,v2),...,(an,vn)]Wherein (a)n,vn) Respectively representing a training sample set xtrainArousal and Valence labels for the nth sample image;
for test sample set xtestThe sample image is a tensor x with dimensions of m x h x Wtest=[(h1,w1),(h2,w2),...,(hm,wm)]Where m represents the total number of samples of the test sample set, and h and w represent the length and width, respectively, of each sample image; the original emotion annotation value of the sample is a m multiplied by 2 dimensional vector ya,v=[(a1,v1),(a2,v2),...,(am,vm)]Wherein (a)m,vm) Respectively representing a set of test samples xtestArousal and Valence labels of the mth sample picture.
The step 2 is as follows:
Step 2.1, read each sample image (h_l, w_l) in the training sample set x_train, where l = 1, 2, ..., n and n denotes the total number of images in x_train; each input sample image is then scaled at different ratios to generate a series of detection boxes of different sizes, which are used to construct an image pyramid [x_1, x_2, ..., x_k], where x_k denotes the k-th image in the image pyramid and k denotes the total number of images in the pyramid, so as to adapt to face detection at different scales; the detection process consists of three network structures: the Proposal Network, the Refine Network, and the Output Network;
Step 2.2, send the image pyramid [x_1, x_2, ..., x_k] obtained in step 2.1 into the first-layer network, the Proposal Network, which performs feature extraction and face-box calibration; whether a region is a face is then judged through three convolution layers, a face classifier, bounding-box regression, and a facial key-point locator, and a set of images that may contain faces, [x_1, x_2, ..., x_g], is finally output, where x_g denotes the g-th image and g denotes the total number of images that may contain faces; the generated candidate boxes are filtered by the non-maximum suppression NMS algorithm;
Step 2.3, send the output of step 2.2 into the second-layer network, the Refine Network; the candidate boxes remaining after screening are further located with respect to key feature points and face regions through a key-feature-point locator and bounding-box regression, and optimized with the non-maximum suppression algorithm, yielding the high-precision screening and face-region refinement output [x_1, x_2, ..., x_j], where x_j denotes the j-th image in the output and j denotes the total number of output images;
Step 2.4, send the output obtained in step 2.3 into the third-layer network, the Output Network; the image features [x_1, x_2, ..., x_j] pass through four convolution layers and a fully connected layer and, with simultaneous face-region bounding-box regression and facial key-point localization, the final MTCNN output images x_input = [x_1, x_2, ..., x_n] are obtained, where x_n denotes the n-th output image and n denotes the total number of output images.
The step 3 is as follows:
Step 3.1, initialize the parameters of the entire network architecture, i.e. the weights and biases, covering all convolution layers, pooling layers, and fully connected layers in the network;
Step 3.2, send the preprocessed images into the trunk branch and the mask branch, respectively; the trunk branch outputs the features M_i,c(x) of different receptive fields, and the mask branch outputs the learned attention weights T_i,c(x); the specific process is as follows:
Trunk branch: the image features x_input pass through two convolution layers with 3 × 3 kernels, and the output results are normalized to obtain the features M_i,c(x) of different receptive fields;
let the convolution-layer output of the l-th layer be z_l; the final output o_l is obtained through the normalization operation BN and the activation function ReLU, calculated as:
o_l = ReLU(BN(z_l)) = ReLU(BN(W·o_(l-1) + b))   (1)
where W and b denote the weights and bias, respectively, l = 1, 2, z_l denotes the convolution-layer output of the l-th layer, and o_(l-1) denotes the final output of the (l-1)-th layer;
BN normalizes the result of the convolution layer, calculated as:
x_BN = (x_i − μ_l) / σ_l   (2)
in equation (2), x_i denotes any individual sample image feature of the image features x_input, x_BN is the output feature after normalization, σ_l is the standard-deviation image of the l-th layer image features, and μ_l is the mean image of the l-th layer image features;
σ_l and μ_l in equation (2) are defined as:
σ_l = sqrt( (1/k) Σ_k (x_k − μ_l)² )   (3)
μ_l = (1/k) Σ_k x_k   (4)
in equations (3) and (4), x_k denotes a sample image feature of x_input, k denotes the number of samples in each mini-batch, and k ≥ 1;
Mask branch: the image features x_input pass through two pooling operations and two upsampling operations, and the output is mapped into the range 0 to 1 by a sigmoid function, giving the learned attention weights T_i,c(x);
Step 3.3, take the dot product of the trunk-branch output M_i,c(x) and the mask-branch output T_i,c(x), introducing a residual mechanism in the process to obtain the residual attention output, calculated as:
H_i,c(x) = (1 + M_i,c(x)) * T_i,c(x)   (5)
the step 4 is as follows:
Step 4.1, first feed the residual attention feature H_i,c(x) obtained in step 3 into the channel-based global second-order pooling network to learn the dependency Z_trans between feature maps; let the output feature map of the residual attention network have size h' × w' × c'; the feature map input to the GSoP network first undergoes channel dimensionality reduction through a 1 × 1 convolution, yielding a feature map of size h' × w' × c, where h', w', c', and c are, respectively, the height, width, number of input channels, and number of channels after dimensionality reduction of the feature map;
a channel-wise covariance matrix of size c × c is then obtained through the second-order pooling operation; the i-th row of the covariance matrix represents the correlation, or dependency, between the i-th feature channel of the residual attention network output feature map and all channels;
the c × c two-dimensional covariance matrix is reshaped into a three-dimensional tensor of size 1 × c × c, and row-by-row convolution is performed on the reshaped covariance matrix, i.e. each row of the covariance matrix is treated as a group in a grouped convolution, with output size 1 × c × 4c; a 1 × 1 convolution is then applied, with output size 1 × 1 × c', followed by a sigmoid activation layer, giving a weight vector of size 1 × c';
each feature channel of the feature map input to the GSoP network is multiplied by the element at the corresponding position of the 1 × c' weight vector, i.e. each feature channel output by the residual attention network is given a different degree of attention;
Step 4.2, then feed the residual attention feature H_i,c(x) obtained in step 3 into the spatial-position-based GSoP network model to learn the dependency Z_non-local between spatial positions in the feature map; let the output feature map of the residual attention network have size h' × w' × c'; the feature map input to the GSoP network first undergoes channel dimensionality reduction through a 1 × 1 convolution, yielding a feature map of size h' × w' × c, where h', w', c', and c are, respectively, the height, width, number of input channels, and number of channels after dimensionality reduction of the feature map;
the channel-reduced feature map is downsampled, reducing its size to h × w × c;
a position-wise covariance matrix of size hw × hw is then obtained through the second-order pooling operation; the i-th row of the covariance matrix represents the correlation, or dependency, between the i-th spatial position and all spatial positions in the residual attention network output feature map;
the hw × hw two-dimensional covariance matrix is reshaped into a three-dimensional tensor of size 1 × hw × hw, and row-by-row convolution is performed with each row of the reshaped covariance matrix as a group, the grouped convolution giving an output of size 1 × hw × 4hw; a 1 × 1 convolution and a sigmoid function are then applied, giving an output of size 1 × 1 × hw, which is reshaped into a new weight matrix of size h × w × 1;
the h × w × 1 weight matrix is reshaped into a weight matrix of size h' × w' × 1 through upsampling;
the feature map input to the GSoP network is multiplied by the corresponding spatial-position weights in the weight matrix, emphasizing or suppressing the spatial-position features in the residual attention network output feature map.
The step 6 is as follows:
Step 6.1, adopt hard parameter sharing in the multi-task learning; in the first stage of multi-task learning, a shared module first extracts general low-level features from the output feature Z_fusion obtained in step 5, and two branches are then used, respectively, to learn the classification representation Z_class and the dimensional representation Z_dim of the image; the two learned output features, classification and dimensional, are then concatenated into the feature Z_mtl-1;
Step 6.2, in the second stage of multi-task learning, apply linear regression to the feature Z_mtl-1 to obtain the output values Arousal and Valence;
the loss function adopted by the linear regressor is Tukey's biweight loss, defined as:
Loss = (c²/6) · [1 − (1 − (r_i/c)²)³]  if |r_i| ≤ c,  and  Loss = c²/6  otherwise   (6)
in equation (6), Loss denotes the loss value, c = 4.685 is a hyperparameter, r_i = y_i − ŷ_i is the residual, y_i denotes the real label value, and ŷ_i denotes the predicted label value.
Compared with the prior art, the beneficial effect of the invention is that, by fusing an attention mechanism with high-order feature representation, it effectively solves the problems that existing attention-based facial emotion recognition methods struggle to model long-distance dependencies between emotional features and have insufficient nonlinear network characterization capability.
Drawings
FIG. 1 is an overall network model architecture of a facial emotion recognition method integrating attention mechanism and high-order feature representation according to the invention.
Fig. 2 is a residual attention network model structure.
Fig. 3 is a channel-based GSoP network model structure.
Fig. 4 is a GSoP network model structure based on spatial location.
FIG. 5 is an emotional state value prediction module that integrates depth characterization.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a facial emotion recognition method integrating an attention mechanism and high-order feature representation, which is implemented, with reference to Figs. 1 to 5, according to the following steps:
step 1, collecting a target image, and dividing the target image into a training sample set xtrainAnd test sample set xtest
In step 1,
for the training sample set x_train, the sample images form an n × h × w tensor x_train = [(h_1,w_1), (h_2,w_2), ..., (h_n,w_n)], where n denotes the total number of samples in the training set and h and w denote the height and width of each sample image, respectively; the original emotion annotation values of the samples form an n × 2 vector y_a,v = [(a_1,v_1), (a_2,v_2), ..., (a_n,v_n)], where (a_n,v_n) denote the Arousal and Valence labels of the n-th sample image in the training set x_train;
for the test sample set x_test, the sample images form an m × h × w tensor x_test = [(h_1,w_1), (h_2,w_2), ..., (h_m,w_m)], where m denotes the total number of samples in the test set and h and w denote the height and width of each sample image, respectively; the original emotion annotation values of the samples form an m × 2 vector y_a,v = [(a_1,v_1), (a_2,v_2), ..., (a_m,v_m)], where (a_m,v_m) denote the Arousal and Valence labels of the m-th sample image in the test set x_test.
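By way of illustration only (not forming part of the claimed method), the data layout described in step 1 can be sketched as a small PyTorch dataset wrapper; the class name and assumed tensor shapes are hypothetical.

```python
import torch
from torch.utils.data import Dataset

class AffectDataset(Dataset):
    """Hypothetical wrapper for the layout of step 1: images of shape (h, w)
    and two-dimensional (Arousal, Valence) annotation vectors."""

    def __init__(self, images, labels):
        # images: array-like of shape (n, h, w); labels: array-like of shape (n, 2)
        self.images = torch.as_tensor(images, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.float32)

    def __len__(self):
        return self.images.shape[0]

    def __getitem__(self, idx):
        # add a channel dimension so downstream CNNs receive (1, h, w)
        return self.images[idx].unsqueeze(0), self.labels[idx]
```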
Step 2, read the original emotion annotation value of each sample image in the training sample set x_train, send each sample image in x_train into the multi-task cascaded convolutional neural network MTCNN, and complete face alignment based on face detection and 5 key feature points to obtain the output images x_input = [x_1, x_2, ..., x_n], where x_n denotes the n-th output image and n denotes the total number of output images, i.e. the total number of images in the training sample set x_train;
the step 2 is as follows:
Step 2.1, read each sample image (h_l, w_l) in the training sample set x_train, where l = 1, 2, ..., n and n denotes the total number of images in x_train; each input sample image is then scaled at different ratios to generate a series of detection boxes of different sizes, which are used to construct an image pyramid [x_1, x_2, ..., x_k], where x_k denotes the k-th image in the image pyramid and k denotes the total number of images in the pyramid, so as to adapt to face detection at different scales; the detection process consists of three network structures: the Proposal Network, the Refine Network, and the Output Network;
Step 2.2, send the image pyramid [x_1, x_2, ..., x_k] obtained in step 2.1 into the first-layer network, the Proposal Network, which performs feature extraction and face-box calibration; whether a region is a face is then judged through three convolution layers, a face classifier, bounding-box regression, and a facial key-point locator, and a set of images that may contain faces, [x_1, x_2, ..., x_g], is finally output, where x_g denotes the g-th image and g denotes the total number of images that may contain faces; the generated candidate boxes are filtered by the Non-Maximum Suppression (NMS) algorithm;
Step 2.3, send the output of step 2.2 into the second-layer network, the Refine Network; compared with the Proposal Network, this network has one additional fully connected layer, so its feature screening is stricter and most of the poor candidate images can be filtered out; the candidate boxes remaining after screening are then further located with respect to key feature points and face regions through a key-feature-point locator and bounding-box regression, and optimized with the non-maximum suppression algorithm, yielding the high-precision screening and face-region refinement output [x_1, x_2, ..., x_j], where x_j denotes the j-th image in the output and j denotes the total number of output images;
Step 2.4, send the output obtained in step 2.3 into the third-layer network, the Output Network; the image features [x_1, x_2, ..., x_j] pass through four convolution layers and a fully connected layer and, with simultaneous face-region bounding-box regression and facial key-point localization, the final MTCNN output images x_input = [x_1, x_2, ..., x_n] are obtained, where x_n denotes the n-th output image and n denotes the total number of output images.
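As an illustrative sketch (not the implementation of the invention), the MTCNN preprocessing of step 2 can be approximated with the third-party facenet-pytorch package, whose MTCNN class provides a comparable P-Net/R-Net/O-Net cascade with NMS filtering and landmark-based alignment; the package choice and the 224 × 224 crop size are assumptions.

```python
from PIL import Image
from facenet_pytorch import MTCNN  # assumed third-party package, not the patented code

# image_size=224 is an assumed crop size; MTCNN performs detection, NMS filtering
# and 5-landmark alignment internally, broadly matching steps 2.1-2.4.
mtcnn = MTCNN(image_size=224, margin=0, post_process=True)

def preprocess(image_paths):
    """Detect, align and crop one face per image; images with no detected face
    are skipped (a simplification of the described pipeline)."""
    faces = []
    for path in image_paths:
        face = mtcnn(Image.open(path).convert("RGB"))  # tensor (3, 224, 224) or None
        if face is not None:
            faces.append(face)
    return faces
```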
Step 3, input the preprocessed images x_input into a residual attention network; the trunk branch extracts features M_i,c(x) of different receptive fields, and the mask branch learns attention weights T_i,c(x); finally, the trunk-branch output and the mask-branch output are combined through a dot-product operation to obtain the attention output feature map H_i,c(x); the network structure is shown in Fig. 2, and the network architecture parameters are shown in Table 1 below.
Table 1. Residual attention network model parameters
[The contents of Table 1 are provided only as an image in the original publication and are not reproduced here.]
The step 3 is as follows:
Step 3.1, initialize the parameters of the entire network architecture, i.e. the weights and biases, covering all convolution layers, pooling layers, and fully connected layers in the network;
Step 3.2, send the preprocessed images into the trunk branch and the mask branch, respectively; the trunk branch outputs the features M_i,c(x) of different receptive fields, and the mask branch outputs the learned attention weights T_i,c(x); the specific process is as follows:
Trunk branch: the image features x_input pass through two convolution layers with 3 × 3 kernels, and the output results are normalized to obtain the features M_i,c(x) of different receptive fields;
let the convolution-layer output of the l-th layer be z_l; the final output o_l is obtained through the normalization operation BN and the activation function ReLU, calculated as:
o_l = ReLU(BN(z_l)) = ReLU(BN(W·o_(l-1) + b))   (1)
where W and b denote the weights and bias, respectively, l = 1, 2, z_l denotes the convolution-layer output of the l-th layer, and o_(l-1) denotes the final output of the (l-1)-th layer;
BN normalizes the result of the convolution layer, calculated as:
x_BN = (x_i − μ_l) / σ_l   (2)
in equation (2), x_i denotes any individual sample image feature of the image features x_input, x_BN is the output feature after normalization, σ_l is the standard-deviation image of the l-th layer image features, and μ_l is the mean image of the l-th layer image features;
σ_l and μ_l in equation (2) are defined as:
σ_l = sqrt( (1/k) Σ_k (x_k − μ_l)² )   (3)
μ_l = (1/k) Σ_k x_k   (4)
in equations (3) and (4), x_k denotes a sample image feature of x_input, k denotes the number of samples in each mini-batch, and k ≥ 1;
Mask branch: the image features x_input pass through two pooling operations and two upsampling operations, and the output is mapped into the range 0 to 1 by a sigmoid function, giving the learned attention weights T_i,c(x);
Step 3.3, take the dot product of the trunk-branch output M_i,c(x) and the mask-branch output T_i,c(x), introducing a residual mechanism in the process to obtain the residual attention output, calculated as:
H_i,c(x) = (1 + M_i,c(x)) * T_i,c(x)   (5)
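A minimal sketch of equations (1)-(5) in PyTorch is given below; the channel width, the use of max pooling in the mask branch, and bilinear upsampling are assumptions not fixed by the description, and the input spatial size is assumed to be divisible by 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAttentionBlock(nn.Module):
    """Sketch of equations (1)-(5): a trunk branch of two 3x3 conv + BN + ReLU
    layers and a mask branch with two poolings, two upsamplings and a sigmoid."""

    def __init__(self, channels=64):
        super().__init__()

        def conv_bn_relu(c):
            return nn.Sequential(
                nn.Conv2d(c, c, kernel_size=3, padding=1),  # eq. (1): W*o + b
                nn.BatchNorm2d(c),                          # eq. (2)-(4): BN
                nn.ReLU(inplace=True))

        self.trunk = nn.Sequential(conv_bn_relu(channels), conv_bn_relu(channels))
        self.mask = nn.Sequential(conv_bn_relu(channels), conv_bn_relu(channels))

    def forward(self, x):                  # x: (B, C, H, W), H and W divisible by 4
        m = self.trunk(x)                  # M_i,c(x): trunk features
        t = F.max_pool2d(x, 2)             # first pooling
        t = self.mask(t)
        t = F.max_pool2d(t, 2)             # second pooling
        t = F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        t = F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        t = torch.sigmoid(t)               # T_i,c(x): attention weights in [0, 1]
        return (1 + m) * t                 # H_i,c(x) = (1 + M_i,c(x)) * T_i,c(x), eq. (5)
```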
step 4, outputting the attention output characteristic diagram H obtained in the step 3i,c(x) Respectively sending the data into a global second-order pooling network GSoP based on a channel and a global second-order pooling network based on a space position, and outputting a dependency relationship Z between characteristic graphs by the global second-order pooling network based on the channeltransSpatial position-based dependency Z between spatial positions in a global second-order pooling network output profilenon-local(ii) a The network architecture parameters are shown in table 2 below.
Table 2. GSoP network model parameters
[The contents of Table 2 are provided only as an image in the original publication and are not reproduced here.]
The step 4 is as follows:
Step 4.1, first feed the residual attention feature H_i,c(x) obtained in step 3 into the channel-based global second-order pooling network to learn the dependency Z_trans between feature maps; the network structure is shown in Fig. 3.
As shown in Fig. 3, the output feature map of the residual attention network has size h' × w' × c'; the feature map input to the GSoP network first undergoes channel dimensionality reduction through a 1 × 1 convolution, yielding a feature map of size h' × w' × c, where h', w', c', and c are, respectively, the height, width, number of input channels, and number of channels after dimensionality reduction of the feature map;
a channel-wise covariance matrix of size c × c is then obtained through the second-order pooling operation; the i-th row of the covariance matrix represents the correlation, or dependency, between the i-th feature channel of the residual attention network output feature map and all channels;
the c × c two-dimensional covariance matrix is reshaped into a three-dimensional tensor of size 1 × c × c, and row-by-row convolution is performed on the reshaped covariance matrix, i.e. each row of the covariance matrix is treated as a group in a grouped convolution, with output size 1 × c × 4c; a 1 × 1 convolution is then applied, with output size 1 × 1 × c', followed by a sigmoid activation layer, giving a weight vector of size 1 × c';
each feature channel of the feature map input to the GSoP network is multiplied by the element at the corresponding position of the 1 × c' weight vector, i.e. each feature channel output by the residual attention network is given a different degree of attention.
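A rough PyTorch sketch of such a channel-based GSoP block follows; the reduced channel count and the grouped-convolution layout are assumptions, so this is an approximation rather than the patented configuration.

```python
import torch
import torch.nn as nn

class ChannelGSoP(nn.Module):
    """Sketch of step 4.1: 1x1 reduction, channel covariance, row-wise grouped
    convolution, 1x1 expansion, sigmoid, and channel re-weighting."""

    def __init__(self, in_channels, reduced=128):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)      # c' -> c
        # row-by-row grouped convolution over the c x c covariance, 4 outputs per row
        self.row_conv = nn.Conv2d(reduced, 4 * reduced,
                                  kernel_size=(reduced, 1), groups=reduced)
        self.expand = nn.Conv2d(4 * reduced, in_channels, kernel_size=1)  # back to c'

    def forward(self, x):                          # x: (B, c', h', w')
        b = x.shape[0]
        z = self.reduce(x)                         # (B, c, h', w')
        c = z.shape[1]
        z = z.flatten(2)                           # (B, c, h'*w')
        z = z - z.mean(dim=2, keepdim=True)        # center each channel
        cov = torch.bmm(z, z.transpose(1, 2)) / z.shape[2]   # channel covariance (B, c, c)
        y = self.row_conv(cov.view(b, c, c, 1))    # (B, 4c, 1, 1)
        weights = torch.sigmoid(self.expand(y))    # (B, c', 1, 1) channel attention
        return x * weights                         # re-weight each input channel
```

For example, ChannelGSoP(in_channels=256) applied to a tensor of shape (B, 256, 14, 14) returns a tensor of the same shape with re-weighted channels.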
Step 4.2, then feed the residual attention feature H_i,c(x) obtained in step 3 into the spatial-position-based GSoP network model to learn the dependency Z_non-local between spatial positions in the feature map; the network structure is shown in Fig. 4.
As shown in Fig. 4, the output feature map of the residual attention network has size h' × w' × c'; the feature map input to the GSoP network first undergoes channel dimensionality reduction through a 1 × 1 convolution, yielding a feature map of size h' × w' × c, where h', w', c', and c are, respectively, the height, width, number of input channels, and number of channels after dimensionality reduction of the feature map;
the channel-reduced feature map is downsampled, reducing its size to h × w × c;
a position-wise covariance matrix of size hw × hw is then obtained through the second-order pooling operation; the i-th row of the covariance matrix represents the correlation, or dependency, between the i-th spatial position and all spatial positions in the residual attention network output feature map;
the hw × hw two-dimensional covariance matrix is reshaped into a three-dimensional tensor of size 1 × hw × hw, which facilitates the subsequent row-by-row convolution; row-by-row convolution is performed with each row of the reshaped covariance matrix as a group, the grouped convolution giving an output of size 1 × hw × 4hw; a 1 × 1 convolution and a sigmoid function are then applied, giving an output of size 1 × 1 × hw, which is reshaped into a new weight matrix of size h × w × 1;
the h × w × 1 weight matrix is reshaped into a weight matrix of size h' × w' × 1 through upsampling;
the feature map input to the GSoP network is multiplied by the corresponding spatial-position weights in the weight matrix, emphasizing or suppressing the spatial-position features in the residual attention network output feature map.
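Similarly, the spatial-position-based GSoP block of step 4.2 can be sketched as follows; the reduced channel count and the down-sampled spatial size (size × size, standing in for h × w) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGSoP(nn.Module):
    """Sketch of step 4.2: 1x1 reduction, spatial down-sampling, position-wise
    covariance, row-wise grouped convolution, sigmoid, upsampling, re-weighting."""

    def __init__(self, in_channels, reduced=128, size=8):
        super().__init__()
        self.size = size
        hw = size * size
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)  # c' -> c
        # row-by-row grouped convolution over the hw x hw covariance
        self.row_conv = nn.Conv2d(hw, 4 * hw, kernel_size=(hw, 1), groups=hw)
        self.expand = nn.Conv2d(4 * hw, hw, kernel_size=1)

    def forward(self, x):                              # x: (B, c', h', w')
        b, _, h, w = x.shape
        hw = self.size * self.size
        z = self.reduce(x)                             # (B, c, h', w')
        z = F.adaptive_avg_pool2d(z, self.size)        # down-sample to (B, c, h, w)
        z = z.flatten(2).transpose(1, 2)               # (B, hw, c): one row per position
        z = z - z.mean(dim=2, keepdim=True)            # center each position
        cov = torch.bmm(z, z.transpose(1, 2)) / z.shape[2]   # position covariance (B, hw, hw)
        y = self.row_conv(cov.view(b, hw, hw, 1))      # (B, 4hw, 1, 1)
        weights = torch.sigmoid(self.expand(y))        # (B, hw, 1, 1) position attention
        weights = weights.view(b, 1, self.size, self.size)
        weights = F.interpolate(weights, size=(h, w),  # upsample to h' x w'
                                mode="bilinear", align_corners=False)
        return x * weights                             # emphasize / suppress positions
```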
Step 5, fuse the dependency Z_trans between feature maps and the dependency Z_non-local between spatial positions within the feature map to obtain the output feature Z_fusion;
Step 6, send the output feature Z_fusion obtained in step 5 into a two-stage multi-task learning network, and obtain the emotional state values Arousal and Valence using a linear regressor.
The step 6 is as follows:
the structure of the multitask learning diagram is shown in fig. 5.
Step 6.1, adopt hard parameter sharing in the multi-task learning; in the first stage of multi-task learning, a shared module first extracts general low-level features from the output feature Z_fusion obtained in step 5, and two branches are then used, respectively, to learn the classification representation Z_class and the dimensional representation Z_dim of the image; the two learned output features, classification and dimensional, are then concatenated into the feature Z_mtl-1;
Step 6.2, in the second stage of multi-task learning, apply linear regression to the feature Z_mtl-1 to obtain the output values Arousal and Valence;
the loss function adopted by the linear regressor is Tukey's biweight loss, defined as:
Loss = (c²/6) · [1 − (1 − (r_i/c)²)³]  if |r_i| ≤ c,  and  Loss = c²/6  otherwise   (6)
in equation (6), Loss denotes the loss value, c = 4.685 is a hyperparameter, r_i = y_i − ŷ_i is the residual, y_i denotes the real label value, and ŷ_i denotes the predicted label value.
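Equation (6) can be sketched directly as a loss function; averaging element-wise over the (Arousal, Valence) outputs is an assumption, and any residual rescaling that the original training procedure may apply is omitted.

```python
import torch

def tukey_biweight_loss(y_pred, y_true, c=4.685):
    """Sketch of equation (6); y_pred and y_true hold (Arousal, Valence) values."""
    r = y_true - y_pred                               # residuals y_i - y_hat_i
    inlier = (c ** 2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    outlier = torch.full_like(r, c ** 2 / 6.0)        # constant penalty for |r| > c
    return torch.where(r.abs() <= c, inlier, outlier).mean()
```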
Examples
The experiments of the invention are conducted on the AffectNet dataset. The Root Mean Square Error (RMSE) and the Concordance Correlation Coefficient (CCC) are computed from the label values predicted by the model and the original label values, and the results are then compared with existing methods to evaluate and analyze the performance of the invention.
The results of the experiment are shown in table 3:
Table 3. Performance comparison of different network models
Method      RMSE (Arousal)   RMSE (Valence)   CCC (Arousal)   CCC (Valence)
SVR         0.513            0.384            0.182           0.372
CNN         0.410            0.370            0.340           0.600
Proposed    0.366            0.317            0.556           0.603
As can be seen from Table 3, under the Root Mean Square Error (RMSE) metric, the method of the invention obtains 0.366 for Arousal, lower than the 0.513 and 0.410 obtained by the conventional SVR and CNN methods, respectively, and 0.317 for Valence, a reduction compared with the 0.384 and 0.370 obtained by the same two conventional methods. Under the Concordance Correlation Coefficient (CCC) metric, the proposed method obtains 0.556 for Arousal, an improvement over the 0.182 and 0.340 obtained by the above conventional methods, and 0.603 for Valence, an improvement over the corresponding 0.372 and 0.600.
This analysis shows that the facial emotion recognition method combining an attention mechanism with high-order feature representation outperforms the traditional methods, verifying the effectiveness of modeling the dependencies between long-distance features and the improvement in nonlinear characterization capability.
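For reference, the two reported metrics can be computed as sketched below (an illustrative NumPy sketch, not the evaluation code used for Table 3).

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root Mean Square Error between predicted and original label values."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def ccc(y_pred, y_true):
    """Concordance Correlation Coefficient between predictions and labels."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    mu_p, mu_t = y_pred.mean(), y_true.mean()
    var_p, var_t = y_pred.var(), y_true.var()
    cov = np.mean((y_pred - mu_p) * (y_true - mu_t))
    return float(2 * cov / (var_p + var_t + (mu_p - mu_t) ** 2))
```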

Claims (6)

1. A facial emotion recognition method integrating an attention mechanism and high-order feature representation, characterized by comprising the following steps:
Step 1, collect target images and divide them into a training sample set x_train and a test sample set x_test;
Step 2, read the original emotion annotation value of each sample image in the training sample set x_train, send each sample image in x_train into the multi-task cascaded convolutional neural network MTCNN, and complete face alignment based on face detection and 5 key feature points to obtain the output images x_input = [x_1, x_2, ..., x_n], where x_n denotes the n-th output image and n denotes the total number of output images, i.e. the total number of images in the training sample set x_train;
Step 3, input the preprocessed images x_input into a residual attention network; the trunk branch extracts features M_i,c(x) of different receptive fields, and the mask branch learns attention weights T_i,c(x); finally, the trunk-branch output and the mask-branch output are combined through a dot-product operation to obtain the attention output feature map H_i,c(x);
Step 4, send the attention output feature map H_i,c(x) obtained in step 3 into a channel-based global second-order pooling (GSoP) network and a spatial-position-based global second-order pooling network, respectively; the channel-based network outputs the dependency Z_trans between feature maps, and the spatial-position-based network outputs the dependency Z_non-local between spatial positions in the feature map;
Step 5, fuse the dependency Z_trans between feature maps and the dependency Z_non-local between spatial positions within the feature map to obtain the output feature Z_fusion;
Step 6, send the output feature Z_fusion obtained in step 5 into a two-stage multi-task learning network, and obtain the emotional state values Arousal and Valence using a linear regressor.
2. The facial emotion recognition method integrating an attention mechanism and high-order feature representation according to claim 1, wherein in step 1,
for the training sample set x_train, the sample images form an n × h × w tensor x_train = [(h_1,w_1), (h_2,w_2), ..., (h_n,w_n)], where n denotes the total number of samples in the training set and h and w denote the height and width of each sample image, respectively; the original emotion annotation values of the samples form an n × 2 vector y_a,v = [(a_1,v_1), (a_2,v_2), ..., (a_n,v_n)], where (a_n,v_n) denote the Arousal and Valence labels of the n-th sample image in the training set x_train;
for the test sample set x_test, the sample images form an m × h × w tensor x_test = [(h_1,w_1), (h_2,w_2), ..., (h_m,w_m)], where m denotes the total number of samples in the test set and h and w denote the height and width of each sample image, respectively; the original emotion annotation values of the samples form an m × 2 vector y_a,v = [(a_1,v_1), (a_2,v_2), ..., (a_m,v_m)], where (a_m,v_m) denote the Arousal and Valence labels of the m-th sample image in the test set x_test.
3. The facial emotion recognition method integrating an attention mechanism and high-order feature representation according to claim 2, wherein step 2 is specifically as follows:
Step 2.1, read each sample image (h_l, w_l) in the training sample set x_train, where l = 1, 2, ..., n and n denotes the total number of images in x_train; each input sample image is then scaled at different ratios to generate a series of detection boxes of different sizes, which are used to construct an image pyramid [x_1, x_2, ..., x_k], where x_k denotes the k-th image in the image pyramid and k denotes the total number of images in the pyramid, so as to adapt to face detection at different scales; the detection process consists of three network structures: the Proposal Network, the Refine Network, and the Output Network;
Step 2.2, send the image pyramid [x_1, x_2, ..., x_k] obtained in step 2.1 into the first-layer network, the Proposal Network, which performs feature extraction and face-box calibration; whether a region is a face is then judged through three convolution layers, a face classifier, bounding-box regression, and a facial key-point locator, and a set of images that may contain faces, [x_1, x_2, ..., x_g], is finally output, where x_g denotes the g-th image and g denotes the total number of images that may contain faces; the generated candidate boxes are filtered by the non-maximum suppression NMS algorithm;
Step 2.3, send the output of step 2.2 into the second-layer network, the Refine Network; the candidate boxes remaining after screening are further located with respect to key feature points and face regions through a key-feature-point locator and bounding-box regression, and optimized with the non-maximum suppression algorithm, yielding the high-precision screening and face-region refinement output [x_1, x_2, ..., x_j], where x_j denotes the j-th image in the output and j denotes the total number of output images;
Step 2.4, send the output obtained in step 2.3 into the third-layer network, the Output Network; the image features [x_1, x_2, ..., x_j] pass through four convolution layers and a fully connected layer and, with simultaneous face-region bounding-box regression and facial key-point localization, the final MTCNN output images x_input = [x_1, x_2, ..., x_n] are obtained, where x_n denotes the n-th output image and n denotes the total number of output images.
4. The facial emotion recognition method integrating an attention mechanism and high-order feature representation according to claim 3, wherein step 3 is specifically as follows:
Step 3.1, initialize the parameters of the entire network architecture, i.e. the weights and biases, covering all convolution layers, pooling layers, and fully connected layers in the network;
Step 3.2, send the preprocessed images into the trunk branch and the mask branch, respectively; the trunk branch outputs the features M_i,c(x) of different receptive fields, and the mask branch outputs the learned attention weights T_i,c(x); the specific process is as follows:
Trunk branch: the image features x_input pass through two convolution layers with 3 × 3 kernels, and the output results are normalized to obtain the features M_i,c(x) of different receptive fields;
let the convolution-layer output of the l-th layer be z_l; the final output o_l is obtained through the normalization operation BN and the activation function ReLU, calculated as:
o_l = ReLU(BN(z_l)) = ReLU(BN(W·o_(l-1) + b))   (1)
where W and b denote the weights and bias, respectively, l = 1, 2, z_l denotes the convolution-layer output of the l-th layer, and o_(l-1) denotes the final output of the (l-1)-th layer;
BN normalizes the result of the convolution layer, calculated as:
x_BN = (x_i − μ_l) / σ_l   (2)
in equation (2), x_i denotes any individual sample image feature of the image features x_input, x_BN is the output feature after normalization, σ_l is the standard-deviation image of the l-th layer image features, and μ_l is the mean image of the l-th layer image features;
σ_l and μ_l in equation (2) are defined as:
σ_l = sqrt( (1/k) Σ_k (x_k − μ_l)² )   (3)
μ_l = (1/k) Σ_k x_k   (4)
in equations (3) and (4), x_k denotes a sample image feature of x_input, k denotes the number of samples in each mini-batch, and k ≥ 1;
Mask branch: the image features x_input pass through two pooling operations and two upsampling operations, and the output is mapped into the range 0 to 1 by a sigmoid function, giving the learned attention weights T_i,c(x);
Step 3.3, take the dot product of the trunk-branch output M_i,c(x) and the mask-branch output T_i,c(x), introducing a residual mechanism in the process to obtain the residual attention output, calculated as:
H_i,c(x) = (1 + M_i,c(x)) * T_i,c(x)   (5).
5. the method for facial emotion recognition based on attention mechanism and high-order feature representation fusion as claimed in claim 4, wherein the step 4 is specifically as follows:
step 4.1, firstly, the residual attention characteristic H obtained in the step 3i,c(x) Feeding into a channel-based global second-order pooling network to learn the dependency Z between feature mapstransLet the output characteristic diagram size of the residual attention network be h '× w' × c ', firstly perform channel dimensionality reduction on the characteristic diagram of the input GSoP network through a 1 × 1 convolution, and then obtain the characteristic diagram with the size of h' × w '× c', wherein h ', w', c are respectively the height, width and input of the characteristic diagramThe number of channels entering and the number of channels after dimensionality reduction;
obtaining a covariance matrix with the channel-by-channel size of c x c through second-order pooling operation, wherein the ith row of the covariance matrix represents the correlation or the dependency relationship between the ith characteristic channel of the residual attention network output characteristic diagram and all channels;
reconstructing the cxc two-dimensional covariance matrix into a three-dimensional tensor with the size of 1 xcxcxcxc, and performing line-by-line convolution on the reconstructed covariance matrix, namely performing grouping convolution on each line of the covariance matrix as a group, wherein the output size is 1 xc 4 c; then, performing 1 × 1 convolution, wherein the output size is 1 × 1 × c ', and then, performing sigmoid activation function layer to obtain a weight vector with the size of 1 × c';
multiplying each characteristic channel in the characteristic diagram of the input GSoP network by a corresponding position element in a weight vector of 1 × c', namely giving different attention degrees to each characteristic channel output by the residual attention network;
step 4.2, then the residual attention characteristic H obtained in the step 3i,c(x) Feeding into GSoP network model based on spatial position to learn dependence Z between spatial positions in characteristic diagramnon-localSetting the size of an output characteristic diagram of the residual attention network as h '× w' × c ', firstly, performing channel dimensionality reduction on the characteristic diagram of the input GSoP network through a 1 × 1 convolution, and then obtaining the characteristic diagram with the size of h' × w '× c', wherein h ', w', c are respectively the height, width, input channel number and channel number after dimensionality reduction of the characteristic diagram;
down-sampling the feature map subjected to channel dimension reduction, wherein the size of the feature map is reduced to h multiplied by w multiplied by c;
obtaining a covariance matrix with the position-by-position size of hw multiplied by hw' through second-order pooling operation, wherein the ith row of the covariance matrix represents the correlation or the dependency relationship between the ith spatial position and all spatial positions in the output characteristic diagram of the residual attention network;
reconstructing the two-dimensional covariance matrix of hw '× hw' into a three-dimensional tensor with the size of 1 × hw '× hw', performing row-by-row convolution by taking each row of the reconstructed covariance matrix as a group, and performing packet convolution to obtain the output size of 1 × hw '× 4 hw'; then, a 1 × 1 convolution and sigmoid function are carried out, the output size is 1 × hw × 4hw, and a new weight matrix with h × w × 1 output is obtained through reconstruction;
reconstructing a weight matrix with the size of h multiplied by w multiplied by 1 into a weight matrix of h 'multipliedby w' × 1 through upsampling;
and multiplying the input feature map of the GSoP by the corresponding spatial position feature in the weight matrix, and emphasizing or suppressing the spatial position feature in the output feature map of the residual attention network.
6. The facial emotion recognition method integrating an attention mechanism and high-order feature representation according to claim 5, wherein step 6 is as follows:
Step 6.1, adopt hard parameter sharing in the multi-task learning; in the first stage of multi-task learning, a shared module first extracts general low-level features from the output feature Z_fusion obtained in step 5, and two branches are then used, respectively, to learn the classification representation Z_class and the dimensional representation Z_dim of the image; the two learned output features, classification and dimensional, are then concatenated into the feature Z_mtl-1;
Step 6.2, in the second stage of multi-task learning, apply linear regression to the feature Z_mtl-1 to obtain the output values Arousal and Valence;
the loss function adopted by the linear regressor is Tukey's biweight loss, defined as:
Loss = (c²/6) · [1 − (1 − (r_i/c)²)³]  if |r_i| ≤ c,  and  Loss = c²/6  otherwise   (6)
in equation (6), Loss denotes the loss value, c = 4.685 is a hyperparameter, r_i = y_i − ŷ_i is the residual, y_i denotes the real label value, and ŷ_i denotes the predicted label value.
CN202111439715.3A 2021-11-30 2021-11-30 Facial emotion recognition method integrating attention mechanism and high-order feature representation Pending CN114170657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111439715.3A CN114170657A (en) 2021-11-30 2021-11-30 Facial emotion recognition method integrating attention mechanism and high-order feature representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111439715.3A CN114170657A (en) 2021-11-30 2021-11-30 Facial emotion recognition method integrating attention mechanism and high-order feature representation

Publications (1)

Publication Number Publication Date
CN114170657A true CN114170657A (en) 2022-03-11

Family

ID=80481645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111439715.3A Pending CN114170657A (en) 2021-11-30 2021-11-30 Facial emotion recognition method integrating attention mechanism and high-order feature representation

Country Status (1)

Country Link
CN (1) CN114170657A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115276784A (en) * 2022-07-26 2022-11-01 西安电子科技大学 Deep learning-based orbital angular momentum modal identification method
CN115276784B (en) * 2022-07-26 2024-01-23 西安电子科技大学 Deep learning-based orbital angular momentum modal identification method
CN117593593A (en) * 2024-01-18 2024-02-23 湖北工业大学 Image emotion classification method for multi-scale semantic fusion under emotion gain
CN117593593B (en) * 2024-01-18 2024-04-09 湖北工业大学 Image emotion classification method for multi-scale semantic fusion under emotion gain

Similar Documents

Publication Publication Date Title
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
CN111091045B (en) Sign language identification method based on space-time attention mechanism
CN108615010B (en) Facial expression recognition method based on parallel convolution neural network feature map fusion
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
US20190228268A1 (en) Method and system for cell image segmentation using multi-stage convolutional neural networks
CN107636691A (en) Method and apparatus for identifying the text in image
CN111696101A (en) Light-weight solanaceae disease identification method based on SE-Inception
CN106909938B (en) Visual angle independence behavior identification method based on deep learning network
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN112818764A (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN112070768A (en) Anchor-Free based real-time instance segmentation method
CN109977394A (en) Text model training method, text analyzing method, apparatus, equipment and medium
CN112766283A (en) Two-phase flow pattern identification method based on multi-scale convolution network
CN111008570B (en) Video understanding method based on compression-excitation pseudo-three-dimensional network
CN112668486A (en) Method, device and carrier for identifying facial expressions of pre-activated residual depth separable convolutional network
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
CN109508640A (en) A kind of crowd's sentiment analysis method, apparatus and storage medium
Tereikovskyi et al. The method of semantic image segmentation using neural networks
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
CN114170659A (en) Facial emotion recognition method based on attention mechanism
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN110210380A (en) The analysis method of personality is generated based on Expression Recognition and psychology test
CN112560668A (en) Human behavior identification method based on scene prior knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination