CN116189272B - Facial expression recognition method and system based on feature fusion and attention mechanism - Google Patents

Facial expression recognition method and system based on feature fusion and attention mechanism Download PDF

Info

Publication number
CN116189272B
CN116189272B (Application CN202310493454.6A)
Authority
CN
China
Prior art keywords
feature
facial expression
neural network
output result
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310493454.6A
Other languages
Chinese (zh)
Other versions
CN116189272A (en)
Inventor
陈昌红
卢妍菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310493454.6A priority Critical patent/CN116189272B/en
Publication of CN116189272A publication Critical patent/CN116189272A/en
Application granted granted Critical
Publication of CN116189272B publication Critical patent/CN116189272B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a facial expression recognition method and system based on feature fusion and an attention mechanism. The method comprises the following steps: (1) preprocessing an acquired facial expression dataset; (2) constructing a facial expression recognition neural network model; (3) extracting two middle-layer features and the final-layer features of a ResNet50 convolutional neural network; (4) splicing the feature maps output by the two middle layers to obtain a weighted feature vector; (5) performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively; (6) performing secondary splicing on the output result I and the output result II in a Transformer network model; (7) sending the result of the secondary splicing into a full-connection layer, and inputting the full-connection layer output into a softmax classifier for classification to obtain the expression classification result. The method can improve the accuracy of facial expression recognition.

Description

Facial expression recognition method and system based on feature fusion and attention mechanism
Technical Field
The invention relates to a facial expression recognition method, and belongs to the technical field of image processing.
Background
Facial expression is, besides language, an important carrier for expressing inner emotion. In recent years, facial expression recognition (FER) has been widely applied in fields such as the Internet of Things, artificial intelligence and mental health assessment, and has attracted broad attention across many sectors.
However, existing expression recognition mainly relies on hand-crafted features and traditional machine learning methods, which have the following drawbacks: hand-crafted features inevitably introduce artifacts and errors, require human intervention to extract useful recognition features from the original image, and make it difficult to obtain deep, high-level semantic features and depth features from the original image.
To obtain deep, high-level semantic features, the number of convolutional layers can be increased, but enhancing a network's learning ability simply by stacking more layers is not always feasible: once the network reaches a certain depth, adding further layers causes the vanishing-gradient problem and the accuracy of the network degrades. The conventional remedy is to use data initialization and regularization methods, which alleviate the vanishing-gradient problem but do not improve the network's accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is: how to acquire deep, high-level semantic features in the facial expression recognition process, and thereby obtain a better facial expression recognition result.
To solve this technical problem, the invention provides a facial expression recognition method based on feature fusion and an attention mechanism, comprising the following steps:
(1) Preprocessing the acquired facial expression data set;
(2) Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
(3) Extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
(4) Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize fusion of features of different layers;
(5) Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
(6) In the Transformer network model, performing secondary splicing on the output result I and the output result II;
(7) Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification to obtain the expression classification result.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (1), preprocessing an acquired facial expression dataset, including the following steps:
creating a PIL object, so that the operations of all images in the facial expression data set are based on the PIL object;
adjusting the facial expression image to 224×224, and randomly horizontally flipping the input data according to a given flip probability p;
carrying out normalization processing on the input data after horizontal overturning;
and loading the normalized data set image to a facial expression recognition neural network model.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (2), the ResNet50 convolutional neural network structure includes seven parts:
the first part is used for filling parameters into the input image;
the second part does not contain residual blocks and is used for carrying out convolution, regularization, activation function and maximum pooling calculation on the input image data in sequence;
the third part, the fourth part, the fifth part and the sixth part respectively comprise a plurality of residual blocks, wherein each residual block has three convolutional layers;
the seventh part comprises an average pooling layer and a full-connection layer, and the image data output by the sixth part sequentially passes through the average pooling layer and the full-connection layer and then outputs a result characteristic diagram.
In the aforementioned facial expression recognition method based on feature fusion and attention mechanism, in step (4), the feature maps output by the two middle layers are spliced along the channel dimension; the feature maps output by the two middle layers are 512×60×60 and 1024×60×60 respectively, and after splicing, a weighted feature vector of size 1536×60×60 is obtained.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (5), the final-layer features and the weighted feature vector each further pass through two convolution layers, namely a 1×1 convolution layer and a 3×3 convolution layer; the 1×1 convolution layer compresses the original channel number 2048 into 256, and the 3×3 convolution layer performs feature fusion, yielding the output result I and the output result II respectively.
In the aforementioned facial expression recognition method based on feature fusion and attention mechanism, in step (6), Q, K and V respectively represent a query vector, a key vector and a value vector, wherein the key vector and the value vector appear as a pair; the output result I and the output result II from step (5) are input into an RKTM module and are respectively used as the query vector and the key vector;

in the ordinary differential equation, the Euler formula is expressed as:

y_{t+1} = y_t + f(y_t, θ_t)    (1)

the residual connection employed by the ResNet50 convolutional neural network is denoted as:

y_{t+1} = f(y_t, θ_t) + y_t    (2)

solving the Transformer network model by using the second-order Runge-Kutta formula gives:

y_{t+1} = y_t + (1/2)·[f_1(y_t, θ_t) + f_2(y_t + f_1(y_t, θ_t), θ_t)]    (3)

wherein t represents time, f(y_t, θ_t) represents the Transformer network model, θ_t represents the model parameters from the ResNet50 convolutional neural network, and f_1, f_2 respectively represent attention sub-module I and attention sub-module II in the RKTM module;

for an input image img ∈ R^(3×H×W), feature extraction using the ResNet50 convolutional neural network first yields feat ∈ R^(c×h×ω), where feat is the feature, R is the set of real numbers, and c, h, ω respectively represent the number of channels, the length and the width; after dimension reduction of the multidimensional data, feat ∈ R^(c×hω); letting parameter one n = hω, there is feat ∈ R^(c×n); the size of the feature is denoted as feat ∈ R^(b×c×n), wherein b represents the sample size of each training batch;

the Transformer network model f(y_t, θ_t) is calculated as follows:

let the head number (head) of the attention mechanism be h1, and reshape the feature feat into feat ∈ R^(b×h1×d_k×n), wherein parameter two d_k = c/h1; exchanging the two channels d_k and n gives feat ∈ R^(b×h1×n×d_k); setting matrix one W^Q, matrix two W^K and matrix three W^V as learnable parameters, we then obtain

Q = feat·W^Q, K = feat·W^K, V = feat·W^V    (4)

the query vector Q is multiplied by the transposed matrix K^T of the key vector K, and a Softmax operation is performed on the final dimension to obtain the attention score matrix A ∈ R^(b×h1×n×n); the operation process is:

A = Softmax(Q·K^T)    (5)

the attention score matrix is then multiplied by the value vector V to obtain the output

y = A·V    (6)

the output y has shape b×c×n, and the output result of formula (6) is substituted into the second-order Runge-Kutta formula to obtain the expression of the Transformer network model f(y_t, θ_t).
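For illustration only (not part of the claimed method), the attention computation of formulas (4) to (6) can be sketched in PyTorch as follows, assuming the (b, c, n) feature layout described above; the class name AttentionSubModule and the use of nn.Linear layers for matrix one, matrix two and matrix three are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class AttentionSubModule(nn.Module):
    """Hedged sketch of formulas (4)-(6): multi-head attention on a (b, c, n) feature."""
    def __init__(self, channels, num_heads):
        super().__init__()
        assert channels % num_heads == 0
        self.h1 = num_heads                                      # head number h1
        self.d_k = channels // num_heads                         # parameter two d_k = c / h1
        self.W_q = nn.Linear(self.d_k, self.d_k, bias=False)     # matrix one
        self.W_k = nn.Linear(self.d_k, self.d_k, bias=False)     # matrix two
        self.W_v = nn.Linear(self.d_k, self.d_k, bias=False)     # matrix three

    def forward(self, feat):                                     # feat: (b, c, n)
        b, c, n = feat.shape
        x = feat.view(b, self.h1, self.d_k, n).transpose(2, 3)   # (b, h1, n, d_k)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)          # formula (4)
        A = torch.softmax(Q @ K.transpose(-2, -1), dim=-1)       # formula (5): scores (b, h1, n, n)
        y = A @ V                                                # formula (6)
        return y.transpose(2, 3).reshape(b, c, n)                # reshape back to (b, c, n)
```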
A facial expression recognition system based on feature fusion and attention mechanism, comprising the following modules:
and a pretreatment module: preprocessing the acquired facial expression data set;
the neural network model building module: Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
and the information extraction module is used for: extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
and a primary splicing module: Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, thereby realizing feature fusion;
and the convolution operation module is used for: Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
and a secondary splicing module: In the Transformer network model, performing secondary splicing on the output result I and the output result II;
and a classification module: Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification, thereby obtaining the expression classification result.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a facial expression recognition method based on feature fusion and attention mechanisms as described above.
An embedded device configured with a trusted execution environment, the trusted execution environment comprising:
a memory for storing instructions;
and the processor is used for executing the instructions to enable the embedded device to execute the facial expression recognition method based on the feature fusion and the attention mechanism.
The invention has the following beneficial effects. The facial expression recognition method based on feature fusion and an attention mechanism is built on the ResNet50 neural network, whose residual modules alleviate the gradient problem; the depth of the ResNet50 network also yields more expressive features and stronger detection or classification performance, while reducing the number of parameters and, to a certain extent, the amount of computation. The defining characteristics of the Transformer model are the introduction of a self-attention mechanism (Self-Attention) and a residual connection structure (Residual Connection); compared with traditional sequence models, it can globally take into account the information at all positions of the input sequence, so a deeper network can be trained effectively, the recognition accuracy is improved substantially overall, and the training process is accelerated at the same time.
Meanwhile, by solving the second-order Runge-Kutta formula, the method obtains a Transformer network model f(y_t, θ_t) with stronger generalization ability, i.e. a model that also classifies well on data other than the training examples of this embodiment.
Drawings
Fig. 1 is a schematic diagram of the overall network structure of embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a ResNet50 convolutional neural network architecture;
FIG. 3 is a schematic structural view of the RKTM module;
FIG. 4 is a schematic diagram of recognition accuracy of the method of the present invention;
Fig. 5 is a schematic diagram of the recognition accuracy of a directly trained ResNet50 convolutional neural network.
Detailed Description
The technical scheme of the invention is further described below with reference to the attached drawings and specific embodiments.
Example 1
In this embodiment, the public facial expression dataset FER2013 is used. The dataset consists of 35886 different facial expression images; each image is a grayscale image with a fixed size of 48×48, and there are 7 expression classes, namely anger, disgust, fear, happiness, sadness, surprise and neutral. The official release stores the expression data in a csv file, which can be converted and then saved as image data.
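For reference, a minimal sketch of this csv-to-image conversion is given below; it assumes the standard fer2013.csv layout with emotion, pixels and Usage columns, and the output folder naming is purely illustrative.

```python
import csv
import os
import numpy as np
from PIL import Image

def fer2013_csv_to_images(csv_path, out_dir):
    """Convert fer2013.csv rows into 48x48 grayscale PNG files grouped by subset and label."""
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            # "pixels" is a space-separated string of 48*48 grayscale values
            pixels = np.array(row["pixels"].split(), dtype=np.uint8).reshape(48, 48)
            subset = row.get("Usage", "all")            # Training / PublicTest / PrivateTest
            folder = os.path.join(out_dir, subset, row["emotion"])
            os.makedirs(folder, exist_ok=True)
            Image.fromarray(pixels, mode="L").save(os.path.join(folder, f"{i}.png"))
```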
A facial expression recognition method based on feature fusion and attention mechanism comprises the following steps:
1) Preprocessing the acquired facial expression data set, comprising the following steps of:
creating a PIL object, so that the operations of all images in the facial expression data set are based on the PIL object;
adjusting the facial expression image to 224×224, and randomly horizontally flipping the input data with the default flip probability p = 0.5;
normalizing the horizontally flipped input data with mean [0.485, 0.456, 0.406] and standard deviation (std) [0.229, 0.224, 0.225];
and loading the normalized data set image to a facial expression recognition neural network model, and enhancing the data in the data set through preprocessing to enrich training data.
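A minimal torchvision sketch of this preprocessing pipeline might look as follows; the grayscale-to-three-channel step is an assumption added so that the three-channel mean and std above can be applied to the single-channel FER2013 images.

```python
import torchvision.transforms as T

preprocess = T.Compose([
    T.Grayscale(num_output_channels=3),       # assumption: replicate the gray channel three times
    T.Resize((224, 224)),                     # adjust the facial expression image to 224x224
    T.RandomHorizontalFlip(p=0.5),            # random horizontal flip with the default probability
    T.ToTensor(),                             # PIL image -> float tensor in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],   # normalization values given in the embodiment
                std=[0.229, 0.224, 0.225]),
])
```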
2) Constructing a facial expression recognition neural network model; Fig. 1 is a schematic diagram of the overall neural network structure of embodiment 1, which comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism. Compared with a convolutional neural network, the Transformer model has a global receptive field: the effective distance between any two pixels is the same, so the relationships between vectors across the whole feature map can be measured. Here the RKTM serves as the main encoding functional module of the encoder.
As shown in Fig. 2, the ResNet50 convolutional neural network structure includes seven parts:
the first part (stage 0) pads the input image, with padding parameters (3, 3);
the second part (stage 1) contains no residual blocks and sequentially performs convolution, regularization, activation function and max-pooling calculations on the input image data;
the third, fourth, fifth and sixth parts contain 3, 4, 6 and 3 residual blocks respectively; each residual block has three convolutional layers with kernel sizes 1×1, 3×3 and 1×1, the stride of the convolution operations is 1, and the second convolution has padding parameter (1, 1), i.e. a ring of zeros is added around the input data;
the seventh part comprises an average pooling layer and a full-connection layer (fc); the image data output by the sixth part passes through the average pooling layer and the full-connection layer in sequence and then outputs the result, so that an input image of size 224×224 becomes a feature map of size 56×56, greatly reducing the storage space.
3) The output features of the fourth and fifth parts of the ResNet50 convolutional neural network contain rich image structure information and are called the middle layers; the sixth part is the last layer of the ResNet50 convolutional neural network that contains convolution operations, and its output features contain rich semantic information, so it is called the last layer. Because the ResNet50 convolutional neural network is pretrained on ImageNet, which corresponds to a classification task, the last-layer output of the feature extractor consists of semantic features;
4) Splicing the feature maps output by the two middle layers along the channel dimension: the feature maps output by the two middle layers are 512×60×60 and 1024×60×60 respectively, and after splicing a weighted feature vector of size 1536×60×60 is obtained, thereby realizing the fusion of features from different layers;
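A hedged sketch of steps 3) and 4) is shown below; it taps layer2, layer3 and layer4 of a torchvision ResNet50 with create_feature_extractor, and the bilinear resize that aligns the two middle feature maps before concatenation is an assumption made so that the channel-wise splice is well defined (the embodiment reports both maps at a common 60×60 resolution).

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Tap the two middle layers (layer2, layer3) and the last convolutional stage (layer4)
# of an ImageNet-pretrained ResNet50.
backbone = resnet50(weights="IMAGENET1K_V1")
extractor = create_feature_extractor(
    backbone, return_nodes={"layer2": "mid1", "layer3": "mid2", "layer4": "last"})

def extract_and_fuse(images):                      # images: (b, 3, H, W)
    feats = extractor(images)
    mid1, mid2, last = feats["mid1"], feats["mid2"], feats["last"]   # 512, 1024, 2048 channels
    # assumption: resize the deeper map so the two middle maps share a spatial size
    mid2 = F.interpolate(mid2, size=mid1.shape[-2:], mode="bilinear", align_corners=False)
    weighted = torch.cat([mid1, mid2], dim=1)      # (b, 1536, h, w) weighted feature vector
    return last, weighted
```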
5) The final-layer features and the weighted feature vector each further pass through two convolution layers, namely a 1×1 convolution layer and a 3×3 convolution layer; the 1×1 convolution layer compresses the original channel number 2048 into 256, and the 3×3 convolution layer performs feature fusion, yielding the output result I and the output result II respectively. This step ensures that the two output results can be fed into the following Transformer network model, a deep learning model built on self-attention mechanisms.
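The two convolution layers of step 5) might be sketched as follows; the 1536-to-256 projection for the weighted branch is an assumption, since the embodiment only states the 2048-to-256 compression explicitly for the last-layer branch.

```python
import torch.nn as nn

class FusionHead(nn.Module):
    """Hedged sketch of step 5): 1x1 channel compression followed by 3x3 feature fusion."""
    def __init__(self, in_channels):
        super().__init__()
        self.compress = nn.Conv2d(in_channels, 256, kernel_size=1)   # compress channels to 256
        self.fuse = nn.Conv2d(256, 256, kernel_size=3, padding=1)    # 3x3 feature fusion

    def forward(self, x):
        return self.fuse(self.compress(x))

head_last = FusionHead(2048)      # last-layer branch -> output result I
head_weighted = FusionHead(1536)  # weighted-vector branch -> output result II (1536 is an assumption)
```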
6) In the Transformer network model, performing secondary splicing on the output result I and the output result II;
7) Downsampling the spliced features in order to condense them, sending them into a full-connection layer to obtain the final feature vector, and finally inputting it into a softmax classifier to compute the class probabilities and obtain the expression classification result;
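A sketch of the classification head in step 7) is given below; the (b, c, n) layout of the spliced Transformer output and the use of average pooling as the downsampling step are assumptions made for illustration.

```python
import torch.nn as nn

class ExpressionClassifier(nn.Module):
    """Hedged sketch of step 7): downsample, full-connection layer, softmax over 7 classes."""
    def __init__(self, in_channels, num_classes=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)             # downsampling of the spliced features
        self.fc = nn.Linear(in_channels, num_classes)   # full-connection layer

    def forward(self, x):                               # x: (b, c, n)
        x = self.pool(x).flatten(1)                     # (b, c)
        return self.fc(x).softmax(dim=-1)               # expression class probabilities
```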
8) Verification is carried out on the public FER2013 facial expression dataset. As shown in Fig. 4, the recognition accuracy of the method reaches 65%, whereas the directly trained ResNet50 network only reaches 57%, as shown in Fig. 5; that is, through feature fusion and the embedded improved attention mechanism, this embodiment improves the expression recognition accuracy on this dataset by 8 percentage points. A large number of research results show that the depth features extracted by convolutional neural networks are robust to deformations such as translation, rotation and scaling, and that different convolutional layers extract features of different levels, so the local and global characteristics of an image can be represented effectively; the model of this embodiment therefore has better robustness.
Example 2
A facial expression recognition method based on feature fusion and attention mechanism comprises the following steps:
(1) Preprocessing the acquired facial expression data set;
(2) Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
(3) Extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
(4) Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize fusion of features of different layers;
(5) Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
(6) In the Transformer network model, performing secondary splicing on the output result I and the output result II;
(7) Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification, thereby obtaining the expression classification result.
In step 6), Fig. 3 is a structural diagram of the RKTM module, i.e. the multi-head self-attention module serving as the encoder in the Transformer model, wherein Q, K and V respectively represent the query vector (query), the key vector (key) and the value vector (value); the key vector and the value vector appear as a pair and depend on the input (input).

The output result I and the output result II in step 5) are input into the RKTM module and are respectively used as the query vector and the key vector;

in the ordinary differential equation, the Euler formula is expressed as:

y_{t+1} = y_t + f(y_t, θ_t)    (1)

the residual connection employed by the ResNet50 convolutional neural network is denoted as:

y_{t+1} = f(y_t, θ_t) + y_t    (2)

the Euler formula is the first-order form of the Runge-Kutta formula, and solving the Transformer network model with the second-order Runge-Kutta formula gives:

y_{t+1} = y_t + (1/2)·[f_1(y_t, θ_t) + f_2(y_t + f_1(y_t, θ_t), θ_t)]    (3)

wherein t represents time, f(y_t, θ_t) represents the Transformer network model, θ_t represents the model parameters from the ResNet50 convolutional neural network, and f_1, f_2 respectively represent attention sub-module I and attention sub-module II in the RKTM module.

For an input image img ∈ R^(3×H×W), feature extraction using the ResNet50 convolutional neural network first yields feat ∈ R^(c×h×ω), where feat is the feature, R is the set of real numbers, and c, h, ω respectively represent the number of channels, the length and the width; after dimension reduction of the multidimensional data, feat ∈ R^(c×hω); letting parameter one n = hω, there is feat ∈ R^(c×n). Since deep learning uses mini-batch training, the size of the feature is denoted as feat ∈ R^(b×c×n), wherein b is the batch_size, i.e. the sample size of each training batch;

the Transformer network model f(y_t, θ_t) is calculated as follows:

let the head number (head) of the attention mechanism be h1, and reshape the feature feat into feat ∈ R^(b×h1×d_k×n), wherein parameter two d_k = c/h1; exchanging the two channels d_k and n gives feat ∈ R^(b×h1×n×d_k); setting matrix one W^Q, matrix two W^K and matrix three W^V as learnable parameters, we then obtain

Q = feat·W^Q, K = feat·W^K, V = feat·W^V    (4)

the query vector Q is multiplied by the transposed matrix K^T of the key vector K, i.e. a dot-product calculation, and a Softmax operation is performed on the final dimension to obtain the attention score matrix A ∈ R^(b×h1×n×n); the operation process is:

A = Softmax(Q·K^T)    (5)

the attention score measures the similarity between every two features; the attention score matrix is then multiplied by the value vector V to obtain the output

y = A·V    (6)

the output y has shape b×c×n. It can be seen that the spatial dimensions of the output y and the input feature feat are kept consistent, so the output result of formula (6) can be substituted into the second-order Runge-Kutta formula to obtain the expression of the Transformer network model f(y_t, θ_t). This step obtains the concrete form of the Transformer network model.
The output result I and the output result II from step 5) are each processed by the Transformer network model to obtain two output features, and a secondary splicing operation is performed on these two output features, namely, the two output features of sizes 128×7 and 64×7 are spliced together, realizing the secondary splicing of the features.
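A minimal sketch of the RKTM update of formula (3) follows; the 1/2 weighting comes from the standard second-order Runge-Kutta (Heun) scheme and is an assumption here, and any modules mapping (b, c, n) features to the same shape, such as the AttentionSubModule sketched earlier, can be plugged in as the two attention sub-modules.

```python
import torch.nn as nn

class RKTMBlock(nn.Module):
    """Hedged sketch of formula (3): two attention sub-modules combined by a second-order
    Runge-Kutta step on top of the residual stream."""
    def __init__(self, f1: nn.Module, f2: nn.Module):
        super().__init__()
        self.f1, self.f2 = f1, f2          # attention sub-module I and attention sub-module II

    def forward(self, y_t):
        k1 = self.f1(y_t)                  # first stage
        k2 = self.f2(y_t + k1)             # second stage, evaluated at the Euler estimate
        return y_t + 0.5 * (k1 + k2)       # y_{t+1}, second-order Runge-Kutta combination
```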
A facial expression recognition system based on feature fusion and attention mechanism, comprising the following modules:
and a pretreatment module: preprocessing the acquired facial expression data set;
the neural network model building module: Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
and the information extraction module is used for: extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
and a primary splicing module: Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, thereby realizing feature fusion;
and the convolution operation module is used for: Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
and a secondary splicing module: In the Transformer network model, performing secondary splicing on the output result I and the output result II;
and a classification module: Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification, thereby obtaining the expression classification result.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a facial expression recognition method based on feature fusion and attention mechanisms as described above.
An embedded device configured with a trusted execution environment, the trusted execution environment comprising:
a memory for storing instructions;
and the processor is used for executing the instructions to enable the embedded device to execute the facial expression recognition method based on the feature fusion and the attention mechanism.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (9)

1. The facial expression recognition method based on the feature fusion and the attention mechanism is characterized by comprising the following steps of:
(1) Preprocessing the acquired facial expression data set;
(2) Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
(3) Extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
(4) Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize fusion of features of different layers;
(5) Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
(6) In the Transformer network model, performing secondary splicing on the output result I and the output result II;
(7) Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification to obtain the expression classification result;
in step (6), Q, K and V respectively represent a query vector, a key vector and a value vector, and the output result I and the output result II in step (5) are input into an RKTM module and are respectively used as the query vector and the key vector;
in the ordinary differential equation, the Euler formula is expressed as:
y_{t+1} = y_t + f(y_t, θ_t)    (1)
the residual connection employed by the ResNet50 convolutional neural network is denoted as:
y_{t+1} = f(y_t, θ_t) + y_t    (2)
solving the Transformer network model by using the second-order Runge-Kutta formula obtains the following formula:
y_{t+1} = y_t + (1/2)·[f_1(y_t, θ_t) + f_2(y_t + f_1(y_t, θ_t), θ_t)]    (3)
wherein t represents time, f(y_t, θ_t) represents the Transformer network model, θ_t represents the model parameters from the ResNet50 convolutional neural network, and f_1, f_2 respectively represent attention sub-module I and attention sub-module II in the RKTM module;
for an input image img ∈ R^(3×H×W), feature extraction is first performed by using the ResNet50 convolutional neural network to obtain feat ∈ R^(c×h×ω), wherein feat is the feature, R is the set of real numbers, and c, h and ω respectively represent the number of channels, the length and the width; after dimension reduction of the multidimensional data, feat ∈ R^(c×hω); letting the parameter n = hω, there is feat ∈ R^(c×n); the size of the feature is denoted as feat ∈ R^(b×c×n), wherein b represents the sample size of each training batch;
the Transformer network model f(y_t, θ_t) is calculated as follows:
let the head number of the attention mechanism be h1, and deform the feature feat into feat ∈ R^(b×h1×d_k×n), wherein the parameter d_k = c/h1; exchange the two channels d_k and n to obtain feat ∈ R^(b×h1×n×d_k); set matrix one W^Q, matrix two W^K and matrix three W^V as learnable parameters, then obtain
Q = feat·W^Q, K = feat·W^K, V = feat·W^V    (4)
multiply the query vector Q by the transposed matrix K^T of the key vector K and perform a Softmax operation in the final dimension to obtain the attention score matrix A ∈ R^(b×h1×n×n), the operation process being:
A = Softmax(Q·K^T)    (5)
multiply the attention score matrix by the value vector V to obtain the output
y = A·V    (6)
the shape of the output y is b×c×n, and the output result of formula (6) is substituted into the second-order Runge-Kutta formula to obtain the expression of the Transformer network model f(y_t, θ_t).
2. The facial expression recognition method based on feature fusion and attention mechanism according to claim 1, wherein in step (1), the obtained facial expression data set is preprocessed, comprising the steps of:
creating a PIL object, so that the operations of all images in the facial expression data set are based on the PIL object;
adjusting the facial expression image to 224×224, and randomly horizontally flipping the input data according to a given flip probability p;
carrying out normalization processing on the input data after horizontal overturning;
and loading the normalized data set image to a facial expression recognition neural network model.
3. The facial expression recognition method based on feature fusion and attention mechanisms of claim 1, wherein in step (2), the ResNet50 convolutional neural network structure comprises seven parts:
the first part is used for filling parameters into the input image;
the second part does not contain residual blocks and is used for carrying out convolution, regularization, activation function and maximum pooling calculation on the input image data in sequence;
the third part, the fourth part, the fifth part and the sixth part respectively comprise a plurality of residual blocks, wherein each residual block has three convolutional layers;
the seventh part comprises an average pooling layer and a full-connection layer, and the image data output by the sixth part sequentially passes through the average pooling layer and the full-connection layer and then outputs a result characteristic diagram.
4. The facial expression recognition method based on feature fusion and attention mechanism according to claim 1, wherein in step (4), the feature maps output by the two middle layers are spliced along the channel dimension; the feature maps output by the two middle layers are 512×60×60 and 1024×60×60 respectively, and a weighted feature vector of size 1536×60×60 is obtained after the splicing.
5. The facial expression recognition method based on feature fusion and attention mechanism according to claim 1, wherein in step (5), the final-layer features and the weighted feature vector each further pass through two convolution layers, namely a 1×1 convolution layer and a 3×3 convolution layer; the 1×1 convolution layer is used to compress the original channel number 2048 into 256, and the 3×3 convolution layer is used for feature fusion, so as to obtain the output result I and the output result II respectively.
6. A facial expression recognition system based on feature fusion and attention mechanism, comprising the following modules:
and a pretreatment module: preprocessing the acquired facial expression data set;
the neural network model building module: Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
and the information extraction module is used for: extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
and a primary splicing module: Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize fusion of features of different layers;
and the convolution operation module is used for: Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
and a secondary splicing module: In the Transformer network model, performing secondary splicing on the output result I and the output result II;
and a classification module: Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification, thereby obtaining the expression classification result;
in the secondary splicing module, Q, K and V respectively represent a query vector, a key vector and a value vector, and the output result I and the output result II in step (5) are input into an RKTM module and are respectively used as the query vector and the key vector;
in the ordinary differential equation, the Euler formula is expressed as:
y_{t+1} = y_t + f(y_t, θ_t)    (1)
the residual connection employed by the ResNet50 convolutional neural network is denoted as:
y_{t+1} = f(y_t, θ_t) + y_t    (2)
solving the Transformer network model by using the second-order Runge-Kutta formula obtains the following formula:
y_{t+1} = y_t + (1/2)·[f_1(y_t, θ_t) + f_2(y_t + f_1(y_t, θ_t), θ_t)]    (3)
wherein t represents time, f(y_t, θ_t) represents the Transformer network model, θ_t represents the model parameters from the ResNet50 convolutional neural network, and f_1, f_2 respectively represent attention sub-module I and attention sub-module II in the RKTM module;
for an input image img ∈ R^(3×H×W), feature extraction is first performed by using the ResNet50 convolutional neural network to obtain feat ∈ R^(c×h×ω), wherein feat is the feature, R is the set of real numbers, and c, h and ω respectively represent the number of channels, the length and the width; after dimension reduction of the multidimensional data, feat ∈ R^(c×hω); letting the parameter n = hω, there is feat ∈ R^(c×n); the size of the feature is denoted as feat ∈ R^(b×c×n), wherein b represents the sample size of each training batch;
the Transformer network model f(y_t, θ_t) is calculated as follows:
let the head number of the attention mechanism be h1, and deform the feature feat into feat ∈ R^(b×h1×d_k×n), wherein the parameter d_k = c/h1; exchange the two channels d_k and n to obtain feat ∈ R^(b×h1×n×d_k); set matrix one W^Q, matrix two W^K and matrix three W^V as learnable parameters, then obtain
Q = feat·W^Q, K = feat·W^K, V = feat·W^V    (4)
multiply the query vector Q by the transposed matrix K^T of the key vector K and perform a Softmax operation in the final dimension to obtain the attention score matrix A ∈ R^(b×h1×n×n), the operation process being:
A = Softmax(Q·K^T)    (5)
multiply the attention score matrix by the value vector V to obtain the output
y = A·V    (6)
the shape of the output y is b×c×n, and the output result of formula (6) is substituted into the second-order Runge-Kutta formula to obtain the expression of the Transformer network model f(y_t, θ_t).
7. The facial expression recognition system based on feature fusion and attention mechanisms of claim 6, wherein in the neural network model building module, the ResNet50 convolutional neural network structure comprises seven parts:
the first part is used for filling parameters into the input image;
the second part does not contain residual blocks and is used for carrying out convolution, regularization, activation function and maximum pooling calculation on the input image data in sequence;
the third part, the fourth part, the fifth part and the sixth part respectively comprise a plurality of residual blocks, wherein each residual block has three convolutional layers;
the seventh part comprises an average pooling layer and a full-connection layer, and the image data output by the sixth part sequentially passes through the average pooling layer and the full-connection layer and then outputs a result characteristic diagram.
8. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the facial expression recognition method based on feature fusion and attention mechanism of any one of claims 1 to 5.
9. An embedded device configured with a trusted execution environment, the trusted execution environment comprising:
a memory for storing instructions;
a processor configured to execute the instructions, so that the embedded device performs the facial expression recognition method based on feature fusion and attention mechanism according to any one of claims 1 to 5.
CN202310493454.6A 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism Active CN116189272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310493454.6A CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310493454.6A CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Publications (2)

Publication Number Publication Date
CN116189272A CN116189272A (en) 2023-05-30
CN116189272B true CN116189272B (en) 2023-07-07

Family

ID=86433105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310493454.6A Active CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Country Status (1)

Country Link
CN (1) CN116189272B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424313A (en) * 2022-07-20 2022-12-02 河海大学常州校区 Expression recognition method and device based on deep and shallow layer multi-feature fusion
CN115862091A (en) * 2022-11-09 2023-03-28 暨南大学 Facial expression recognition method, device, equipment and medium based on Emo-ResNet
CN115984930A (en) * 2022-12-26 2023-04-18 中国电信股份有限公司 Micro expression recognition method and device and micro expression recognition model training method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110081881B (en) * 2019-04-19 2022-05-10 成都飞机工业(集团)有限责任公司 Carrier landing guiding method based on unmanned aerial vehicle multi-sensor information fusion technology
CN111680541B (en) * 2020-04-14 2022-06-21 华中科技大学 Multi-modal emotion analysis method based on multi-dimensional attention fusion network
CN112418095B (en) * 2020-11-24 2023-06-30 华中师范大学 Facial expression recognition method and system combined with attention mechanism
CN112541409B (en) * 2020-11-30 2021-09-14 北京建筑大学 Attention-integrated residual network expression recognition method
CN114764941A (en) * 2022-04-25 2022-07-19 深圳技术大学 Expression recognition method and device and electronic equipment

Also Published As

Publication number Publication date
CN116189272A (en) 2023-05-30


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant