CN116189272B - Facial expression recognition method and system based on feature fusion and attention mechanism - Google Patents

Facial expression recognition method and system based on feature fusion and attention mechanism Download PDF

Info

Publication number
CN116189272B
CN116189272B (Application CN202310493454.6A)
Authority
CN
China
Prior art keywords
feature
facial expression
neural network
output result
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310493454.6A
Other languages
Chinese (zh)
Other versions
CN116189272A (en)
Inventor
陈昌红
卢妍菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310493454.6A priority Critical patent/CN116189272B/en
Publication of CN116189272A publication Critical patent/CN116189272A/en
Application granted granted Critical
Publication of CN116189272B publication Critical patent/CN116189272B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a facial expression recognition method and system based on feature fusion and an attention mechanism. The method comprises the following steps: (1) preprocessing an acquired facial expression dataset; (2) constructing a facial expression recognition neural network model; (3) extracting two middle-layer features and the final-layer features of a ResNet50 convolutional neural network; (4) splicing the feature maps output by the two middle layers to obtain a weighted feature vector; (5) performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively; (6) performing secondary splicing on the output result I and the output result II in a Transformer network model; (7) sending the result of the secondary splicing into a full-connection layer, and inputting the full-connection layer output into a softmax classifier for classification to obtain the expression classification result. The method can improve the accuracy of facial expression recognition.

Description

Facial expression recognition method and system based on feature fusion and attention mechanism
Technical Field
The invention relates to a facial expression recognition method, and belongs to the technical field of image processing.
Background
Facial expression is, besides language, an important carrier for expressing inner emotion. In recent years, facial expression recognition (FER) has been widely applied in fields such as the Internet of Things, artificial intelligence and mental health assessment, and has attracted broad attention across many sectors.
However, existing expression recognition mainly relies on hand-crafted features and traditional machine learning methods, which have the following drawbacks: hand-crafted features inevitably introduce artifacts and errors, require human intervention to extract useful recognition features from the original image, and make it difficult to obtain deep, high-level semantic features and depth features from the original image.
To obtain deep, high-level semantic features, the number of convolutional layers can be increased, but enhancing a network's learning ability simply by stacking more layers is not always feasible: once the network reaches a certain depth, adding further layers causes the vanishing-gradient problem and the accuracy of the network degrades. The conventional remedy is to use data initialization and regularization methods, which alleviate the vanishing-gradient problem but do not improve the network's accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is: how to acquire deep, high-level semantic features in the facial expression recognition process, and thereby obtain a better facial expression recognition result.
To solve this technical problem, the invention provides a facial expression recognition method based on feature fusion and an attention mechanism, comprising the following steps:
(1) Preprocessing the acquired facial expression data set;
(2) Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
(3) Extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
(4) Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize fusion of features of different layers;
(5) Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
(6) In the Transformer network model, performing secondary splicing on the output result I and the output result II;
(7) Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification to obtain the expression classification result.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (1), preprocessing an acquired facial expression dataset, including the following steps:
creating a PIL object, so that the operations of all images in the facial expression data set are based on the PIL object;
adjusting the facial expression image to 224×224, and randomly horizontally flipping the input data according to a given flip probability p;
carrying out normalization processing on the input data after horizontal overturning;
and loading the normalized data set image to a facial expression recognition neural network model.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (2), the ResNet50 convolutional neural network structure includes seven parts:
the first part is used for filling parameters into the input image;
the second part does not contain residual blocks and is used for carrying out convolution, regularization, activation function and maximum pooling calculation on the input image data in sequence;
the third part, the fourth part, the fifth part and the sixth part respectively comprise a plurality of residual blocks, wherein each residual block has three convolutional layers;
the seventh part comprises an average pooling layer and a full-connection layer, and the image data output by the sixth part sequentially passes through the average pooling layer and the full-connection layer and then outputs a result characteristic diagram.
In the aforementioned facial expression recognition method based on feature fusion and attention mechanism, in step (4), the feature maps output by the two middle layers are spliced along the channel dimension; the feature maps output by the two middle layers are 512×60×60 and 1024×60×60 respectively, and after splicing, a weighted feature vector of size 1536×60×60 is obtained.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (5), the final-layer features and the weighted feature vector each further pass through two convolution layers, namely a 1×1 convolution layer and a 3×3 convolution layer; the 1×1 convolution layer compresses the original channel number 2048 into 256, and the 3×3 convolution layer performs feature fusion, yielding the output result I and the output result II respectively.
In the aforementioned facial expression recognition method based on feature fusion and attention mechanism, in step (6), Q, K and V respectively represent a query vector, a key vector and a value vector, wherein the key vector and the value vector appear as a pair; the output result I and the output result II from step (5) are input into an RKTM module and are respectively used as the query vector and the key vector;

in the ordinary differential equation, the Euler formula is expressed as:

y_{t+1} = y_t + f(y_t, θ_t)    (1)

the residual connection employed by the ResNet50 convolutional neural network is denoted as:

y_{t+1} = f(y_t, θ_t) + y_t    (2)

solving the Transformer network model by using the second-order Runge-Kutta formula gives:

y_{t+1} = y_t + (1/2)·[f_1(y_t, θ_t) + f_2(y_t + f_1(y_t, θ_t), θ_t)]    (3)

wherein t represents time, f(y_t, θ_t) represents the Transformer network model, θ_t represents the model parameters from the ResNet50 convolutional neural network, and f_1, f_2 respectively represent attention sub-module I and attention sub-module II in the RKTM module;

for an input image img ∈ R^(3×H×W), feature extraction using the ResNet50 convolutional neural network first yields feat ∈ R^(c×h×ω), where feat is the feature, R is the set of real numbers, and c, h, ω respectively represent the number of channels, the length and the width; after dimension reduction of the multidimensional data, feat ∈ R^(c×hω); letting parameter one n = hω, there is feat ∈ R^(c×n); the size of the feature is denoted as feat ∈ R^(b×c×n), wherein b represents the sample size of each training batch;

the Transformer network model f(y_t, θ_t) is calculated as follows:

let the head number (head) of the attention mechanism be h1, and reshape the feature feat into feat ∈ R^(b×h1×d_k×n), wherein parameter two d_k = c/h1; exchanging the two channels d_k and n gives feat ∈ R^(b×h1×n×d_k); setting matrix one W^Q, matrix two W^K and matrix three W^V as learnable parameters, we then obtain

Q = feat·W^Q, K = feat·W^K, V = feat·W^V    (4)

the query vector Q is multiplied by the transposed matrix K^T of the key vector K, and a Softmax operation is performed on the final dimension to obtain the attention score matrix A ∈ R^(b×h1×n×n); the operation process is:

A = Softmax(Q·K^T)    (5)

the attention score matrix is then multiplied by the value vector V to obtain the output

y = A·V    (6)

the output y has shape b×c×n, and the output result of formula (6) is substituted into the second-order Runge-Kutta formula to obtain the expression of the Transformer network model f(y_t, θ_t).
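For illustration only (not part of the claimed method), the attention computation of formulas (4) to (6) can be sketched in PyTorch as follows, assuming the (b, c, n) feature layout described above; the class name AttentionSubModule and the use of nn.Linear layers for matrix one, matrix two and matrix three are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class AttentionSubModule(nn.Module):
    """Hedged sketch of formulas (4)-(6): multi-head attention on a (b, c, n) feature."""
    def __init__(self, channels, num_heads):
        super().__init__()
        assert channels % num_heads == 0
        self.h1 = num_heads                                      # head number h1
        self.d_k = channels // num_heads                         # parameter two d_k = c / h1
        self.W_q = nn.Linear(self.d_k, self.d_k, bias=False)     # matrix one
        self.W_k = nn.Linear(self.d_k, self.d_k, bias=False)     # matrix two
        self.W_v = nn.Linear(self.d_k, self.d_k, bias=False)     # matrix three

    def forward(self, feat):                                     # feat: (b, c, n)
        b, c, n = feat.shape
        x = feat.view(b, self.h1, self.d_k, n).transpose(2, 3)   # (b, h1, n, d_k)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)          # formula (4)
        A = torch.softmax(Q @ K.transpose(-2, -1), dim=-1)       # formula (5): scores (b, h1, n, n)
        y = A @ V                                                # formula (6)
        return y.transpose(2, 3).reshape(b, c, n)                # reshape back to (b, c, n)
```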
A facial expression recognition system based on feature fusion and attention mechanism, comprising the following modules:
and a pretreatment module: preprocessing the acquired facial expression data set;
the neural network model building module: Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
and the information extraction module is used for: extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
and a primary splicing module: Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, thereby realizing feature fusion;
and the convolution operation module is used for: Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
and a secondary splicing module: In the Transformer network model, performing secondary splicing on the output result I and the output result II;
and a classification module: Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification, thereby obtaining the expression classification result.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a facial expression recognition method based on feature fusion and attention mechanisms as described above.
An embedded device configured with a trusted execution environment, the trusted execution environment comprising:
a memory for storing instructions;
and the processor is used for executing the instructions to enable the embedded device to execute the facial expression recognition method based on the feature fusion and the attention mechanism.
The invention has the following beneficial effects. The facial expression recognition method based on feature fusion and an attention mechanism is built on the ResNet50 neural network, whose residual modules alleviate the gradient problem; the depth of the ResNet50 network also yields more expressive features and stronger detection or classification performance, while reducing the number of parameters and, to a certain extent, the amount of computation. The defining characteristics of the Transformer model are the introduction of a self-attention mechanism (Self-Attention) and a residual connection structure (Residual Connection); compared with traditional sequence models, it can globally take into account the information at all positions of the input sequence, so a deeper network can be trained effectively, the recognition accuracy is improved substantially overall, and the training process is accelerated at the same time.
Meanwhile, by solving the second-order Runge-Kutta formula, the method obtains a Transformer network model f(y_t, θ_t) with stronger generalization ability, i.e. a model that also classifies well on data other than the training examples of this embodiment.
Drawings
Fig. 1 is a schematic diagram of the overall network structure of embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a ResNet50 convolutional neural network architecture;
FIG. 3 is a schematic structural view of the RKTM module;
FIG. 4 is a schematic diagram of recognition accuracy of the method of the present invention;
Fig. 5 is a schematic diagram of the recognition accuracy of a directly trained ResNet50 convolutional neural network.
Detailed Description
The technical scheme of the invention is further described below with reference to the attached drawings and specific embodiments.
Example 1
In this embodiment, the public facial expression dataset FER2013 is used. The dataset consists of 35886 different facial expression images; each image is a grayscale image with a fixed size of 48×48, and there are 7 expression classes, namely anger, disgust, fear, happiness, sadness, surprise and neutral. The official release stores the expression data in a csv file, which can be converted and then saved as image data.
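For reference, a minimal sketch of this csv-to-image conversion is given below; it assumes the standard fer2013.csv layout with emotion, pixels and Usage columns, and the output folder naming is purely illustrative.

```python
import csv
import os
import numpy as np
from PIL import Image

def fer2013_csv_to_images(csv_path, out_dir):
    """Convert fer2013.csv rows into 48x48 grayscale PNG files grouped by subset and label."""
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            # "pixels" is a space-separated string of 48*48 grayscale values
            pixels = np.array(row["pixels"].split(), dtype=np.uint8).reshape(48, 48)
            subset = row.get("Usage", "all")            # Training / PublicTest / PrivateTest
            folder = os.path.join(out_dir, subset, row["emotion"])
            os.makedirs(folder, exist_ok=True)
            Image.fromarray(pixels, mode="L").save(os.path.join(folder, f"{i}.png"))
```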
A facial expression recognition method based on feature fusion and attention mechanism comprises the following steps:
1) Preprocessing the acquired facial expression data set, comprising the following steps of:
creating a PIL object, so that the operations of all images in the facial expression data set are based on the PIL object;
adjusting the facial expression image to 224×224, and randomly horizontally flipping the input data with the default flip probability p = 0.5;
normalizing the horizontally flipped input data with mean [0.485, 0.456, 0.406] and standard deviation (std) [0.229, 0.224, 0.225];
and loading the normalized data set image to a facial expression recognition neural network model, and enhancing the data in the data set through preprocessing to enrich training data.
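A minimal torchvision sketch of this preprocessing pipeline might look as follows; the grayscale-to-three-channel step is an assumption added so that the three-channel mean and std above can be applied to the single-channel FER2013 images.

```python
import torchvision.transforms as T

preprocess = T.Compose([
    T.Grayscale(num_output_channels=3),       # assumption: replicate the gray channel three times
    T.Resize((224, 224)),                     # adjust the facial expression image to 224x224
    T.RandomHorizontalFlip(p=0.5),            # random horizontal flip with the default probability
    T.ToTensor(),                             # PIL image -> float tensor in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],   # normalization values given in the embodiment
                std=[0.229, 0.224, 0.225]),
])
```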
2) Constructing a facial expression recognition neural network model; Fig. 1 is a schematic diagram of the overall neural network structure of embodiment 1, which comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism. Compared with a convolutional neural network, the Transformer model has a global receptive field: the effective distance between any two pixels is the same, so the relationships between vectors across the whole feature map can be measured. Here the RKTM serves as the main encoding functional module of the encoder.
As shown in Fig. 2, the ResNet50 convolutional neural network structure includes seven parts:
the first part (stage 0) pads the input image, with padding parameters (3, 3);
the second part (stage 1) contains no residual blocks and sequentially performs convolution, regularization, activation function and max-pooling calculations on the input image data;
the third, fourth, fifth and sixth parts contain 3, 4, 6 and 3 residual blocks respectively; each residual block has three convolutional layers with kernel sizes 1×1, 3×3 and 1×1, the stride of the convolution operations is 1, and the second convolution has padding parameter (1, 1), i.e. a ring of zeros is added around the input data;
the seventh part comprises an average pooling layer and a full-connection layer (fc); the image data output by the sixth part passes through the average pooling layer and the full-connection layer in sequence and then outputs the result, so that an input image of size 224×224 becomes a feature map of size 56×56, greatly reducing the storage space.
3) The output features of the fourth and fifth parts of the ResNet50 convolutional neural network contain rich image structure information and are called the middle layers; the sixth part is the last layer of the ResNet50 convolutional neural network that contains convolution operations, and its output features contain rich semantic information, so it is called the last layer. Because the ResNet50 convolutional neural network is pretrained on ImageNet, which corresponds to a classification task, the last-layer output of the feature extractor consists of semantic features;
4) Splicing the feature maps output by the two middle layers along the channel dimension: the feature maps output by the two middle layers are 512×60×60 and 1024×60×60 respectively, and after splicing a weighted feature vector of size 1536×60×60 is obtained, thereby realizing the fusion of features from different layers;
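A hedged sketch of steps 3) and 4) is shown below; it taps layer2, layer3 and layer4 of a torchvision ResNet50 with create_feature_extractor, and the bilinear resize that aligns the two middle feature maps before concatenation is an assumption made so that the channel-wise splice is well defined (the embodiment reports both maps at a common 60×60 resolution).

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Tap the two middle layers (layer2, layer3) and the last convolutional stage (layer4)
# of an ImageNet-pretrained ResNet50.
backbone = resnet50(weights="IMAGENET1K_V1")
extractor = create_feature_extractor(
    backbone, return_nodes={"layer2": "mid1", "layer3": "mid2", "layer4": "last"})

def extract_and_fuse(images):                      # images: (b, 3, H, W)
    feats = extractor(images)
    mid1, mid2, last = feats["mid1"], feats["mid2"], feats["last"]   # 512, 1024, 2048 channels
    # assumption: resize the deeper map so the two middle maps share a spatial size
    mid2 = F.interpolate(mid2, size=mid1.shape[-2:], mode="bilinear", align_corners=False)
    weighted = torch.cat([mid1, mid2], dim=1)      # (b, 1536, h, w) weighted feature vector
    return last, weighted
```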
5) The final-layer features and the weighted feature vector each further pass through two convolution layers, namely a 1×1 convolution layer and a 3×3 convolution layer; the 1×1 convolution layer compresses the original channel number 2048 into 256, and the 3×3 convolution layer performs feature fusion, yielding the output result I and the output result II respectively. This step ensures that the two output results can be fed into the following Transformer network model, a deep learning model built on self-attention mechanisms.
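The two convolution layers of step 5) might be sketched as follows; the 1536-to-256 projection for the weighted branch is an assumption, since the embodiment only states the 2048-to-256 compression explicitly for the last-layer branch.

```python
import torch.nn as nn

class FusionHead(nn.Module):
    """Hedged sketch of step 5): 1x1 channel compression followed by 3x3 feature fusion."""
    def __init__(self, in_channels):
        super().__init__()
        self.compress = nn.Conv2d(in_channels, 256, kernel_size=1)   # compress channels to 256
        self.fuse = nn.Conv2d(256, 256, kernel_size=3, padding=1)    # 3x3 feature fusion

    def forward(self, x):
        return self.fuse(self.compress(x))

head_last = FusionHead(2048)      # last-layer branch -> output result I
head_weighted = FusionHead(1536)  # weighted-vector branch -> output result II (1536 is an assumption)
```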
6) In the Transformer network model, performing secondary splicing on the output result I and the output result II;
7) Downsampling the spliced features in order to condense them, sending them into a full-connection layer to obtain the final feature vector, and finally inputting it into a softmax classifier to compute the class probabilities and obtain the expression classification result;
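A sketch of the classification head in step 7) is given below; the (b, c, n) layout of the spliced Transformer output and the use of average pooling as the downsampling step are assumptions made for illustration.

```python
import torch.nn as nn

class ExpressionClassifier(nn.Module):
    """Hedged sketch of step 7): downsample, full-connection layer, softmax over 7 classes."""
    def __init__(self, in_channels, num_classes=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)             # downsampling of the spliced features
        self.fc = nn.Linear(in_channels, num_classes)   # full-connection layer

    def forward(self, x):                               # x: (b, c, n)
        x = self.pool(x).flatten(1)                     # (b, c)
        return self.fc(x).softmax(dim=-1)               # expression class probabilities
```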
8) Verification is carried out on the public FER2013 facial expression dataset. As shown in Fig. 4, the recognition accuracy of the method reaches 65%, whereas the directly trained ResNet50 network only reaches 57%, as shown in Fig. 5; that is, through feature fusion and the embedded improved attention mechanism, this embodiment improves the expression recognition accuracy on this dataset by 8 percentage points. A large number of research results show that the depth features extracted by convolutional neural networks are robust to deformations such as translation, rotation and scaling, and that different convolutional layers extract features of different levels, so the local and global characteristics of an image can be represented effectively; the model of this embodiment therefore has better robustness.
Example 2
A facial expression recognition method based on feature fusion and attention mechanism comprises the following steps:
(1) Preprocessing the acquired facial expression data set;
(2) Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
(3) Extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
(4) Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize fusion of features of different layers;
(5) Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
(6) In the Transformer network model, performing secondary splicing on the output result I and the output result II;
(7) Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification, thereby obtaining the expression classification result.
In step 6), Fig. 3 is a structural diagram of the RKTM module, i.e. the multi-head self-attention module serving as the encoder in the Transformer model, wherein Q, K and V respectively represent the query vector (query), the key vector (key) and the value vector (value); the key vector and the value vector appear as a pair and depend on the input (input).

The output result I and the output result II in step 5) are input into the RKTM module and are respectively used as the query vector and the key vector;

in the ordinary differential equation, the Euler formula is expressed as:

y_{t+1} = y_t + f(y_t, θ_t)    (1)

the residual connection employed by the ResNet50 convolutional neural network is denoted as:

y_{t+1} = f(y_t, θ_t) + y_t    (2)

the Euler formula is the first-order form of the Runge-Kutta formula, and solving the Transformer network model with the second-order Runge-Kutta formula gives:

y_{t+1} = y_t + (1/2)·[f_1(y_t, θ_t) + f_2(y_t + f_1(y_t, θ_t), θ_t)]    (3)

wherein t represents time, f(y_t, θ_t) represents the Transformer network model, θ_t represents the model parameters from the ResNet50 convolutional neural network, and f_1, f_2 respectively represent attention sub-module I and attention sub-module II in the RKTM module.

For an input image img ∈ R^(3×H×W), feature extraction using the ResNet50 convolutional neural network first yields feat ∈ R^(c×h×ω), where feat is the feature, R is the set of real numbers, and c, h, ω respectively represent the number of channels, the length and the width; after dimension reduction of the multidimensional data, feat ∈ R^(c×hω); letting parameter one n = hω, there is feat ∈ R^(c×n). Since deep learning uses mini-batch training, the size of the feature is denoted as feat ∈ R^(b×c×n), wherein b is the batch_size, i.e. the sample size of each training batch;

the Transformer network model f(y_t, θ_t) is calculated as follows:

let the head number (head) of the attention mechanism be h1, and reshape the feature feat into feat ∈ R^(b×h1×d_k×n), wherein parameter two d_k = c/h1; exchanging the two channels d_k and n gives feat ∈ R^(b×h1×n×d_k); setting matrix one W^Q, matrix two W^K and matrix three W^V as learnable parameters, we then obtain

Q = feat·W^Q, K = feat·W^K, V = feat·W^V    (4)

the query vector Q is multiplied by the transposed matrix K^T of the key vector K, i.e. a dot-product calculation, and a Softmax operation is performed on the final dimension to obtain the attention score matrix A ∈ R^(b×h1×n×n); the operation process is:

A = Softmax(Q·K^T)    (5)

the attention score measures the similarity between every two features; the attention score matrix is then multiplied by the value vector V to obtain the output

y = A·V    (6)

the output y has shape b×c×n. It can be seen that the spatial dimensions of the output y and the input feature feat are kept consistent, so the output result of formula (6) can be substituted into the second-order Runge-Kutta formula to obtain the expression of the Transformer network model f(y_t, θ_t). This step obtains the concrete form of the Transformer network model.
The output result I and the output result II from step 5) are each processed by the Transformer network model to obtain two output features, and a secondary splicing operation is performed on these two output features, namely, the two output features of sizes 128×7 and 64×7 are spliced together, realizing the secondary splicing of the features.
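A minimal sketch of the RKTM update of formula (3) follows; the 1/2 weighting comes from the standard second-order Runge-Kutta (Heun) scheme and is an assumption here, and any modules mapping (b, c, n) features to the same shape, such as the AttentionSubModule sketched earlier, can be plugged in as the two attention sub-modules.

```python
import torch.nn as nn

class RKTMBlock(nn.Module):
    """Hedged sketch of formula (3): two attention sub-modules combined by a second-order
    Runge-Kutta step on top of the residual stream."""
    def __init__(self, f1: nn.Module, f2: nn.Module):
        super().__init__()
        self.f1, self.f2 = f1, f2          # attention sub-module I and attention sub-module II

    def forward(self, y_t):
        k1 = self.f1(y_t)                  # first stage
        k2 = self.f2(y_t + k1)             # second stage, evaluated at the Euler estimate
        return y_t + 0.5 * (k1 + k2)       # y_{t+1}, second-order Runge-Kutta combination
```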
A facial expression recognition system based on feature fusion and attention mechanism, comprising the following modules:
and a pretreatment module: preprocessing the acquired facial expression data set;
the neural network model building module: Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
and the information extraction module is used for: extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
and a primary splicing module: Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, thereby realizing feature fusion;
and the convolution operation module is used for: Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
and a secondary splicing module: In the Transformer network model, performing secondary splicing on the output result I and the output result II;
and a classification module: Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification, thereby obtaining the expression classification result.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a facial expression recognition method based on feature fusion and attention mechanisms as described above.
An embedded device configured with a trusted execution environment, the trusted execution environment comprising:
a memory for storing instructions;
and the processor is used for executing the instructions to enable the embedded device to execute the facial expression recognition method based on the feature fusion and the attention mechanism.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (9)

1. The facial expression recognition method based on the feature fusion and the attention mechanism is characterized by comprising the following steps of:
(1) Preprocessing the acquired facial expression data set;
(2) Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
(3) Extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
(4) Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize fusion of features of different layers;
(5) Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
(6) In the Transformer network model, performing secondary splicing on the output result I and the output result II;
(7) Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification to obtain the expression classification result;
in step (6), Q, K and V respectively represent a query vector, a key vector and a value vector, and the output result I and the output result II in step (5) are input into an RKTM module and are respectively used as the query vector and the key vector;
in the ordinary differential equation, the Euler formula is expressed as:
y_{t+1} = y_t + f(y_t, θ_t)    (1)
the residual connection employed by the ResNet50 convolutional neural network is denoted as:
y_{t+1} = f(y_t, θ_t) + y_t    (2)
solving the Transformer network model by using the second-order Runge-Kutta formula obtains the following formula:
y_{t+1} = y_t + (1/2)·[f_1(y_t, θ_t) + f_2(y_t + f_1(y_t, θ_t), θ_t)]    (3)
wherein t represents time, f(y_t, θ_t) represents the Transformer network model, θ_t represents the model parameters from the ResNet50 convolutional neural network, and f_1, f_2 respectively represent attention sub-module I and attention sub-module II in the RKTM module;
for an input image img ∈ R^(3×H×W), feature extraction is first performed by using the ResNet50 convolutional neural network to obtain feat ∈ R^(c×h×ω), wherein feat is the feature, R is the set of real numbers, and c, h and ω respectively represent the number of channels, the length and the width; after dimension reduction of the multidimensional data, feat ∈ R^(c×hω); letting the parameter n = hω, there is feat ∈ R^(c×n); the size of the feature is denoted as feat ∈ R^(b×c×n), wherein b represents the sample size of each training batch;
the Transformer network model f(y_t, θ_t) is calculated as follows:
let the head number of the attention mechanism be h1, and deform the feature feat into feat ∈ R^(b×h1×d_k×n), wherein the parameter d_k = c/h1; exchange the two channels d_k and n to obtain feat ∈ R^(b×h1×n×d_k); set matrix one W^Q, matrix two W^K and matrix three W^V as learnable parameters, then obtain
Q = feat·W^Q, K = feat·W^K, V = feat·W^V    (4)
multiply the query vector Q by the transposed matrix K^T of the key vector K and perform a Softmax operation in the final dimension to obtain the attention score matrix A ∈ R^(b×h1×n×n), the operation process being:
A = Softmax(Q·K^T)    (5)
multiply the attention score matrix by the value vector V to obtain the output
y = A·V    (6)
the shape of the output y is b×c×n, and the output result of formula (6) is substituted into the second-order Runge-Kutta formula to obtain the expression of the Transformer network model f(y_t, θ_t).
2. The facial expression recognition method based on feature fusion and attention mechanism according to claim 1, wherein in step (1), the obtained facial expression data set is preprocessed, comprising the steps of:
creating a PIL object, so that the operations of all images in the facial expression data set are based on the PIL object;
adjusting the facial expression image to 224×224, and randomly horizontally flipping the input data according to a given flip probability p;
carrying out normalization processing on the input data after horizontal overturning;
and loading the normalized data set image to a facial expression recognition neural network model.
3. The facial expression recognition method based on feature fusion and attention mechanisms of claim 1, wherein in step (2), the ResNet50 convolutional neural network structure comprises seven parts:
the first part is used for filling parameters into the input image;
the second part does not contain residual blocks and is used for carrying out convolution, regularization, activation function and maximum pooling calculation on the input image data in sequence;
the third part, the fourth part, the fifth part and the sixth part respectively comprise a plurality of residual blocks, wherein each residual block has three convolutional layers;
the seventh part comprises an average pooling layer and a full-connection layer, and the image data output by the sixth part sequentially passes through the average pooling layer and the full-connection layer and then outputs a result characteristic diagram.
4. The facial expression recognition method based on feature fusion and attention mechanism according to claim 1, wherein in step (4), the feature maps output by the two middle layers are spliced along the channel dimension; the feature maps output by the two middle layers are 512×60×60 and 1024×60×60 respectively, and a weighted feature vector of size 1536×60×60 is obtained after the splicing.
5. The facial expression recognition method based on feature fusion and attention mechanism according to claim 1, wherein in step (5), the final-layer features and the weighted feature vector each further pass through two convolution layers, namely a 1×1 convolution layer and a 3×3 convolution layer; the 1×1 convolution layer is used to compress the original channel number 2048 into 256, and the 3×3 convolution layer is used for feature fusion, so as to obtain the output result I and the output result II respectively.
6. A facial expression recognition system based on feature fusion and attention mechanism, comprising the following modules:
and a pretreatment module: preprocessing the acquired facial expression data set;
the neural network model building module: Constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
and the information extraction module is used for: extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information, and the end layer features comprise semantic features;
and a primary splicing module: Splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize fusion of features of different layers;
and the convolution operation module is used for: Performing convolution operations on the final-layer features and the weighted feature vector simultaneously to obtain an output result I and an output result II respectively, and inputting the output result I and the output result II into a Transformer network model;
and a secondary splicing module: In the Transformer network model, performing secondary splicing on the output result I and the output result II;
and a classification module: Downsampling the result of the secondary splicing, sending the downsampled result into a full-connection layer, and finally inputting it into a softmax classifier for classification, thereby obtaining the expression classification result;
in the secondary splicing module, Q, K and V respectively represent a query vector, a key vector and a value vector, and the output result I and the output result II in step (5) are input into an RKTM module and are respectively used as the query vector and the key vector;
in the ordinary differential equation, the Euler formula is expressed as:
y_{t+1} = y_t + f(y_t, θ_t)    (1)
the residual connection employed by the ResNet50 convolutional neural network is denoted as:
y_{t+1} = f(y_t, θ_t) + y_t    (2)
solving the Transformer network model by using the second-order Runge-Kutta formula obtains the following formula:
y_{t+1} = y_t + (1/2)·[f_1(y_t, θ_t) + f_2(y_t + f_1(y_t, θ_t), θ_t)]    (3)
wherein t represents time, f(y_t, θ_t) represents the Transformer network model, θ_t represents the model parameters from the ResNet50 convolutional neural network, and f_1, f_2 respectively represent attention sub-module I and attention sub-module II in the RKTM module;
for an input image img ∈ R^(3×H×W), feature extraction is first performed by using the ResNet50 convolutional neural network to obtain feat ∈ R^(c×h×ω), wherein feat is the feature, R is the set of real numbers, and c, h and ω respectively represent the number of channels, the length and the width; after dimension reduction of the multidimensional data, feat ∈ R^(c×hω); letting the parameter n = hω, there is feat ∈ R^(c×n); the size of the feature is denoted as feat ∈ R^(b×c×n), wherein b represents the sample size of each training batch;
the Transformer network model f(y_t, θ_t) is calculated as follows:
let the head number of the attention mechanism be h1, and deform the feature feat into feat ∈ R^(b×h1×d_k×n), wherein the parameter d_k = c/h1; exchange the two channels d_k and n to obtain feat ∈ R^(b×h1×n×d_k); set matrix one W^Q, matrix two W^K and matrix three W^V as learnable parameters, then obtain
Q = feat·W^Q, K = feat·W^K, V = feat·W^V    (4)
multiply the query vector Q by the transposed matrix K^T of the key vector K and perform a Softmax operation in the final dimension to obtain the attention score matrix A ∈ R^(b×h1×n×n), the operation process being:
A = Softmax(Q·K^T)    (5)
multiply the attention score matrix by the value vector V to obtain the output
y = A·V    (6)
the shape of the output y is b×c×n, and the output result of formula (6) is substituted into the second-order Runge-Kutta formula to obtain the expression of the Transformer network model f(y_t, θ_t).
7. The facial expression recognition system based on feature fusion and attention mechanisms of claim 6, wherein in the neural network model building module, the ResNet50 convolutional neural network structure comprises seven parts:
the first part is used for filling parameters into the input image;
the second part does not contain residual blocks and is used for carrying out convolution, regularization, activation function and maximum pooling calculation on the input image data in sequence;
the third part, the fourth part, the fifth part and the sixth part respectively comprise a plurality of residual blocks, wherein each residual block has three convolutional layers;
the seventh part comprises an average pooling layer and a full-connection layer, and the image data output by the sixth part sequentially passes through the average pooling layer and the full-connection layer and then outputs a result characteristic diagram.
8. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the facial expression recognition method based on feature fusion and attention mechanism of any one of claims 1 to 5.
9. An embedded device configured with a trusted execution environment, the trusted execution environment comprising:
a memory for storing instructions;
a processor configured to execute the instructions, so that the embedded device performs the facial expression recognition method based on feature fusion and attention mechanism according to any one of claims 1 to 5.
CN202310493454.6A 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism Active CN116189272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310493454.6A CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310493454.6A CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Publications (2)

Publication Number Publication Date
CN116189272A CN116189272A (en) 2023-05-30
CN116189272B true CN116189272B (en) 2023-07-07

Family

ID=86433105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310493454.6A Active CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Country Status (1)

Country Link
CN (1) CN116189272B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424313A (en) * 2022-07-20 2022-12-02 河海大学常州校区 Expression recognition method and device based on deep and shallow layer multi-feature fusion
CN115862091A (en) * 2022-11-09 2023-03-28 暨南大学 Facial expression recognition method, device, equipment and medium based on Emo-ResNet
CN115984930A (en) * 2022-12-26 2023-04-18 中国电信股份有限公司 Micro expression recognition method and device and micro expression recognition model training method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110081881B (en) * 2019-04-19 2022-05-10 成都飞机工业(集团)有限责任公司 Carrier landing guiding method based on unmanned aerial vehicle multi-sensor information fusion technology
CN111680541B (en) * 2020-04-14 2022-06-21 华中科技大学 Multi-modal emotion analysis method based on multi-dimensional attention fusion network
CN112418095B (en) * 2020-11-24 2023-06-30 华中师范大学 Facial expression recognition method and system combined with attention mechanism
CN112541409B (en) * 2020-11-30 2021-09-14 北京建筑大学 Attention-integrated residual network expression recognition method
CN114764941A (en) * 2022-04-25 2022-07-19 深圳技术大学 Expression recognition method and device and electronic equipment

Also Published As

Publication number Publication date
CN116189272A (en) 2023-05-30


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant