CN116189272A - Facial expression recognition method and system based on feature fusion and attention mechanism - Google Patents
Info
- Publication number
- CN116189272A (application number CN202310493454.6A)
- Authority
- CN
- China
- Prior art keywords
- facial expression
- neural network
- layer
- result
- output result
- Prior art date: 2023-05-05
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V40/161—Human faces, e.g. facial parts, sketches or expressions: Detection; Localisation; Normalisation
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning: using classification, e.g. of video objects
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level: of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning: using neural networks
- G06V40/168—Human faces: Feature extraction; Face representation
- G06V40/172—Human faces: Classification, e.g. identification
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a facial expression recognition method and system based on feature fusion and an attention mechanism, wherein the method comprises the following steps: (1) preprocessing an acquired facial expression data set; (2) constructing a facial expression recognition neural network model; (3) extracting two middle layer features and end layer features of the ResNet50 convolutional neural network; (4) splicing the feature maps output by the two middle layers to obtain a weighted feature vector; (5) simultaneously carrying out convolution operations on the end layer features and the weighted feature vector to obtain output result one and output result two respectively; (6) performing secondary splicing on output result one and output result two in a Transformer network model; (7) sending the result after the secondary splicing into a fully-connected layer, and inputting it into a softmax classifier for classification to obtain an expression classification result. The method can improve the accuracy of facial expression recognition.
Description
Technical Field
The invention relates to a facial expression recognition method, and belongs to the technical field of image processing.
Background
Facial expression is, besides language, an important carrier for expressing inner emotion. In recent years, facial expression recognition (FER) has been widely used in fields such as the Internet of Things, artificial intelligence and mental health evaluation, and has attracted wide attention and application across many sectors of society.
However, existing expression recognition is mainly based on hand-crafted features and machine learning methods, which mainly have the following defects: hand-crafted features often introduce unavoidable artifacts and errors, require human intervention to help extract useful recognition features from the original image, and make it difficult to obtain deep high-level semantic features and depth features from the original image.
In order to obtain deep high-level semantic features, the number of convolution layers can be increased, but enhancing the learning ability of a network by increasing its depth is not always feasible: after the network reaches a certain depth, adding more layers causes the vanishing-gradient problem, and the accuracy of the network decreases. To address this, the conventional approach is to use data initialization and regularization methods, which alleviate the vanishing-gradient problem but do not improve the accuracy of the network.
Disclosure of Invention
The technical problem the invention aims to solve is: in the facial expression recognition process, how to acquire deep high-level semantic features and thereby obtain a better facial expression recognition effect.
In order to solve the above technical problem, the invention provides a facial expression recognition method based on feature fusion and an attention mechanism, comprising the following steps:
(1) preprocessing the acquired facial expression data set;
(2) constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
(3) extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information and the end layer features comprise semantic features;
(4) splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize the fusion of features of different layers;
(5) simultaneously carrying out convolution operations on the end layer features and the weighted feature vector to obtain output result one and output result two respectively, and inputting output result one and output result two into the Transformer network model;
(6) in the Transformer network model, performing secondary splicing on output result one and output result two;
(7) downsampling the result after the secondary splicing, sending the downsampled result into a fully-connected layer, and finally inputting it into a softmax classifier for classification to obtain an expression classification result.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (1), preprocessing the acquired facial expression data set comprises the following steps:
creating a PIL object, so that all operations on the images in the facial expression data set are based on the PIL object;
adjusting the facial expression image to 224×224, and randomly horizontally flipping the input data with a given flip probability p;
normalizing the horizontally flipped input data;
and loading the normalized data set images into the facial expression recognition neural network model.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (2), the ResNet50 convolutional neural network structure comprises seven parts:
the first part is used for padding the input image;
the second part does not contain residual blocks and is used for sequentially carrying out convolution, regularization, activation function and maximum pooling calculations on the input image data;
the third part, the fourth part, the fifth part and the sixth part each comprise a plurality of residual blocks, wherein each residual block has three convolution layers;
the seventh part comprises an average pooling layer and a fully-connected layer; the image data output by the sixth part sequentially passes through the average pooling layer and the fully-connected layer and then outputs the resulting feature map.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (4), the feature maps output by the two middle layers are spliced along the channel dimension; the feature maps output by the two middle layers have sizes 512×60×60 and 1024×60×60 respectively, and after splicing, a weighted feature vector of size 1536×60×60 is obtained.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (5), the end layer features and the weighted feature vector each further pass through two convolution layers, namely a 1×1 convolution layer and a 3×3 convolution layer; the 1×1 convolution layer is used to compress the original channel number 2048 to 256, and the 3×3 convolution layer is used for feature fusion, so as to obtain output result one and output result two respectively.
In the foregoing facial expression recognition method based on feature fusion and attention mechanism, in step (6), $Q$, $K$, $V$ respectively represent a query vector, a key vector and a value vector, wherein the key vector and the value vector appear in pairs; output result one and output result two from step (5) are input into the RKTM module and respectively used as the query vector and the key vector.

In the ordinary differential equation, the Euler method is expressed as:

$x_{t+1} = x_t + f(x_t, t)$ (1)

The residual connection employed by the ResNet50 convolutional neural network is denoted as:

$x_{t+1} = x_t + F(x_t, \theta_t)$ (2)

Solving the Transformer network model with the second-order Runge-Kutta formula yields:

$x_{t+1} = x_t + \frac{1}{2}\left(F_1(x_t, \theta) + F_2\left(x_t + F_1(x_t, \theta), \theta\right)\right)$ (3)

wherein $t$ represents time, $F$ represents the Transformer network model, $\theta$ represents the model parameters from the ResNet50 convolutional neural network, and $F_1$, $F_2$ respectively represent attention sub-module I and attention sub-module II in the RKTM module.

For an input image $I$, feature extraction with the ResNet50 convolutional neural network first yields $X \in \mathbb{R}^{C \times H \times W}$, wherein $X$ is the feature, $\mathbb{R}$ is the set of real numbers, and $C$, $H$, $W$ respectively represent the number of channels, the length and the width. After dimension reduction of the multidimensional data, $X \in \mathbb{R}^{C \times N}$ is obtained; letting the parameter $N$ satisfy $N = H \times W$, the size of the feature is recorded as $(b, C, N)$, wherein $b$ represents the sample size of each training batch.

Exchanging the two channels $C$ and $N$ gives $X \in \mathbb{R}^{b \times N \times C}$. Setting matrix one $W_Q$, matrix two $W_K$ and matrix three $W_V$ as learnable parameters, one then obtains:

$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$ (4)

Multiplying the query vector $Q$ by the transposed key matrix $K^{\top}$ and performing a $\mathrm{softmax}$ operation in the final dimension yields the attention score matrix $A$:

$A = \mathrm{softmax}(Q K^{\top})$ (5)

Multiplying the attention score matrix by the value vector $V$ gives the output $O = AV$ (6); the output $O$ has shape $(b, N, C)$. Substituting the output of formula (6) into the second-order Runge-Kutta formula (3) yields the expression of the Transformer network model $F$.
A facial expression recognition system based on feature fusion and an attention mechanism comprises the following modules:
a preprocessing module: preprocessing the acquired facial expression data set;
a neural network model building module: constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
an information extraction module: extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information and the end layer features comprise semantic features;
a primary splicing module: splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, thereby realizing feature fusion;
a convolution operation module: simultaneously carrying out convolution operations on the end layer features and the weighted feature vector to obtain output result one and output result two respectively, and inputting output result one and output result two into the Transformer network model;
a secondary splicing module: in the Transformer network model, performing secondary splicing on output result one and output result two;
a classification module: downsampling the result after the secondary splicing, sending the downsampled result into a fully-connected layer, and finally inputting it into a softmax classifier for classification, thereby obtaining an expression classification result.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a facial expression recognition method based on feature fusion and attention mechanisms as described above.
An embedded device configured with a trusted execution environment, the trusted execution environment comprising:
a memory for storing instructions;
and the processor is used for executing the instructions to enable the embedded device to execute the facial expression recognition method based on the feature fusion and the attention mechanism.
The invention has the following beneficial effects: the facial expression recognition method based on feature fusion and an attention mechanism is built on the ResNet50 neural network, whose residual modules address the gradient problem; the depth of the ResNet50 network also yields more expressive features and correspondingly stronger detection and classification performance, while reducing the number of parameters and, to a certain extent, the amount of computation. The defining characteristics of the Transformer model are the introduction of a self-attention mechanism (Self-Attention) and a residual connection structure (Residual Connection); compared with a traditional sequence model, it can globally take the information of all positions in the input sequence into full consideration, so a deeper network can be trained effectively. Overall, this improves the recognition accuracy while accelerating the training process.
Drawings
Fig. 1 is a schematic diagram of the overall network structure of embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a ResNet50 convolutional neural network architecture;
FIG. 3 is a schematic structural view of the RKTM module;
FIG. 4 is a schematic diagram of recognition accuracy of the method of the present invention;
fig. 5 is a schematic diagram of the accuracy of directly training the ResNet50 convolutional neural network.
Detailed Description
The technical scheme of the invention is further described below with reference to the attached drawings and specific embodiments.
Example 1
In this embodiment, the public facial expression data set FER2013 is used. The data set consists of 35886 different facial expression images; each image is a grayscale image with a fixed size of 48×48, and there are 7 classes of expressions, namely anger, disgust, fear, happiness, sadness, surprise and neutral. Officially, the expression data are stored in a csv file, which can be converted and then stored as image data.
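The csv-to-image conversion mentioned above can be sketched as follows; this is a minimal, hypothetical sketch assuming the standard fer2013.csv layout (columns emotion, pixels, Usage), with the file and output directory names chosen purely for illustration.

```python
# Hypothetical sketch: convert the official fer2013.csv into 48x48 grayscale
# image files. File and directory names are illustrative assumptions.
import os
import numpy as np
import pandas as pd
from PIL import Image

df = pd.read_csv("fer2013.csv")
os.makedirs("fer2013_images", exist_ok=True)
for i, row in df.iterrows():
    # Each row stores one 48x48 face as space-separated pixel intensities.
    pixels = np.asarray(row["pixels"].split(), dtype=np.uint8).reshape(48, 48)
    Image.fromarray(pixels, mode="L").save(f"fer2013_images/{row['emotion']}_{i}.png")
```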
A facial expression recognition method based on feature fusion and attention mechanism comprises the following steps:
1) Preprocessing the acquired facial expression data set, comprising the following steps of:
creating a PIL object, so that the operations of all images in the facial expression data set are based on the PIL object;
adjusting the facial expression image to 224×224, and randomly horizontally flipping the input data with the default flip probability p = 0.5;
normalizing the horizontally flipped input data, with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225];
and loading the normalized data set images into the facial expression recognition neural network model; the preprocessing augments the data in the data set and enriches the training data.
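A minimal preprocessing sketch using torchvision, assuming the pipeline described above (PIL-based operations, 224×224 resize, random horizontal flip with p = 0.5, and the quoted mean and standard deviation):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),            # adjust images to 224x224
    transforms.RandomHorizontalFlip(p=0.5),   # random flip with default p = 0.5
    transforms.ToTensor(),                    # PIL image -> tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```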
2) Constructing a facial expression recognition neural network model. Fig. 1 is a schematic diagram of the overall neural network structure of embodiment 1, which comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism. Compared with a convolutional neural network, the Transformer model has a global receptive field: the distance between any two pixels is the same, and the relations between vectors over the whole feature map can be measured. The RKTM serves as the main encoding functional module of the encoder.
As shown in fig. 2, the res net50 convolutional neural network structure includes seven parts:
the first part (stage 0) pads the input image, with padding parameters (3, 3);
the second part (stage 1) does not contain residual blocks and sequentially carries out convolution, regularization, activation function and maximum pooling calculations on the input image data;
the third part, the fourth part, the fifth part and the sixth part comprise 3, 4, 6 and 3 residual blocks respectively; each residual block has three convolution layers, whose kernels are 1×1, 3×3 and 1×1, the stride of the convolution operations is 1, and the second convolution uses padding (1, 1), i.e. one circle of zeros is added around the input image data;
the seventh part comprises an average pooling layer and a fully-connected layer (fc); the image data output by the sixth part sequentially passes through the average pooling layer and the fully-connected layer and then outputs the result, so that the 224×224 input image becomes a 56×56 feature map, greatly reducing the storage space.
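For illustration, the seven parts can be mapped onto a standard torchvision ResNet50 roughly as follows; this correspondence is an assumption, since the patent does not name the implementation:

```python
from torchvision.models import resnet50

net = resnet50(weights="IMAGENET1K_V1")  # pretrained on ImageNet, as the text notes
stem = [net.conv1, net.bn1, net.relu, net.maxpool]            # parts 1-2: padding/conv/BN/ReLU/max-pool
stages = [net.layer1, net.layer2, net.layer3, net.layer4]     # parts 3-6: 3, 4, 6, 3 residual blocks
head = [net.avgpool, net.fc]                                  # part 7: average pooling + fully-connected
```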
3) The output features of the fourth part and the fifth part of the ResNet50 convolutional neural network contain rich image structure information and are called the middle layers; the sixth part is the last layer of the ResNet50 convolutional neural network containing convolution operations, and its output features contain rich semantic features, so it is called the end layer. Because the ResNet50 convolutional neural network is pretrained on ImageNet for a classification task, the end-layer output of the feature extractor consists of semantic features;
4) Splicing the feature maps output by the two middle layers along the channel dimension; the feature maps output by the two middle layers have sizes 512×60×60 and 1024×60×60 respectively, and after splicing, a weighted feature vector of size 1536×60×60 is obtained, thereby realizing the fusion of features of different layers, as sketched below;
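A sketch of the middle-layer extraction and first splicing; which torchvision stages play the role of the two middle layers, and the bilinear resize to the 60×60 grid implied by the quoted shapes, are assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

net = resnet50(weights="IMAGENET1K_V1").eval()
feats = {}
net.layer2.register_forward_hook(lambda m, i, o: feats.update(mid1=o))  # 512 channels
net.layer3.register_forward_hook(lambda m, i, o: feats.update(mid2=o))  # 1024 channels

with torch.no_grad():
    net(torch.randn(1, 3, 224, 224))

# Bring both middle feature maps to a common 60x60 grid, then splice along the
# channel dimension: 512 + 1024 = 1536 channels, the weighted feature vector.
mid1 = F.interpolate(feats["mid1"], size=(60, 60), mode="bilinear", align_corners=False)
mid2 = F.interpolate(feats["mid2"], size=(60, 60), mode="bilinear", align_corners=False)
weighted = torch.cat([mid1, mid2], dim=1)   # shape (1, 1536, 60, 60)
```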
5) The end layer features and the weighted feature vector each further pass through two convolution layers, namely a 1×1 convolution layer and a 3×3 convolution layer; the 1×1 convolution layer compresses the original channel number 2048 to 256, and the 3×3 convolution layer performs feature fusion, yielding output result one and output result two respectively. This step ensures that the two output results can be successfully input into the following Transformer network model, a deep learning model using a self-attention mechanism.
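A sketch of the two convolution layers of step 5); applying a parallel 1×1 + 3×3 pair with 1536 input channels to the weighted vector is an assumption, since the text only quotes the 2048-to-256 compression for the end-layer branch, and the spatial sizes are illustrative:

```python
import torch
import torch.nn as nn

last = torch.randn(1, 2048, 7, 7)        # end layer features (spatial size assumed)
weighted = torch.randn(1, 1536, 60, 60)  # weighted feature vector from step 4)

branch_last = nn.Sequential(
    nn.Conv2d(2048, 256, kernel_size=1),            # 1x1: compress 2048 -> 256
    nn.Conv2d(256, 256, kernel_size=3, padding=1),  # 3x3: feature fusion
)
branch_mid = nn.Sequential(
    nn.Conv2d(1536, 256, kernel_size=1),
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
)
out1 = branch_last(last)      # output result one
out2 = branch_mid(weighted)   # output result two
```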
6) In the Transformer network model, performing secondary splicing on output result one and output result two;
7) Downsampling the spliced features in order to further condense them, sending them into a fully-connected layer to obtain the final feature vector, and finally inputting it into a softmax classifier for calculation, outputting the class probabilities and obtaining the expression classification result;
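A minimal sketch of step 7); the 192-channel 7×7 input follows the shapes quoted in Example 2 and is an assumption here:

```python
import torch
import torch.nn as nn

fused = torch.randn(1, 192, 7, 7)               # result of the secondary splicing
pooled = nn.AdaptiveAvgPool2d(1)(fused)         # downsampling to condense the features
logits = nn.Linear(192, 7)(pooled.flatten(1))   # fully-connected layer, 7 expression classes
probs = torch.softmax(logits, dim=1)            # softmax classifier: class probabilities
pred = probs.argmax(dim=1)                      # expression classification result
```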
8) Verification is carried out on the FER2013 public facial expression data set. As shown in fig. 4, the recognition accuracy of the method reaches 65%, while directly training the ResNet50 network only reaches a recognition rate of 57%, as shown in fig. 5; through feature fusion and the embedding of an improved attention mechanism, this embodiment improves the facial expression recognition accuracy on this data set by 8%. A large number of research results show that the deep features extracted by a convolutional neural network are robust to deformations such as translation, rotation and scaling, and that different convolutional layers extract features of different levels, which can effectively represent the local and global characteristics of an image, so the model of this embodiment has better robustness.
Example 2
A facial expression recognition method based on feature fusion and an attention mechanism comprises the following steps:
(1) preprocessing the acquired facial expression data set;
(2) constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
(3) extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information and the end layer features comprise semantic features;
(4) splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize the fusion of features of different layers;
(5) simultaneously carrying out convolution operations on the end layer features and the weighted feature vector to obtain output result one and output result two respectively, and inputting output result one and output result two into the Transformer network model;
(6) in the Transformer network model, performing secondary splicing on output result one and output result two;
(7) downsampling the result after the secondary splicing, sending the downsampled result into a fully-connected layer, and finally inputting it into a softmax classifier for classification, thereby obtaining an expression classification result.
In step 6), fig. 3 is a structural diagram of the RKTM module, i.e., the multi-head self-attention module, which serves as the encoder in the Transformer model; $Q$, $K$, $V$ respectively represent the query vector (query), the key vector (key) and the value vector (value), wherein the key vector and the value vector appear in pairs and depend on the input value (input).
Output result one and output result two from step 5) are input into the RKTM module and respectively used as the query vector and the key vector;
in the ordinary differential equation, the Euler method is expressed as:

$x_{t+1} = x_t + f(x_t, t)$ (1)

The residual connection employed by the ResNet50 convolutional neural network is denoted as:

$x_{t+1} = x_t + F(x_t, \theta_t)$ (2)

The Euler method is the first-order form of the Runge-Kutta formula; solving the Transformer network model with the second-order Runge-Kutta formula yields:

$x_{t+1} = x_t + \frac{1}{2}\left(F_1(x_t, \theta) + F_2\left(x_t + F_1(x_t, \theta), \theta\right)\right)$ (3)

wherein $t$ represents time, $F$ represents the Transformer network model, $\theta$ represents the model parameters from the ResNet50 convolutional neural network, and $F_1$, $F_2$ respectively represent attention sub-module I and attention sub-module II in the RKTM module.
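A minimal sketch of the second-order Runge-Kutta update of formula (3): two sub-modules are combined in Heun's fashion. The sub-modules here are placeholders; any shape-preserving modules can stand in for attention sub-module I and attention sub-module II.

```python
import torch
import torch.nn as nn

class RK2Block(nn.Module):
    """Combine two sub-modules per formula (3): x + (F1(x) + F2(x + F1(x))) / 2."""
    def __init__(self, f1: nn.Module, f2: nn.Module):
        super().__init__()
        self.f1, self.f2 = f1, f2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k1 = self.f1(x)             # F1(x_t, theta)
        k2 = self.f2(x + k1)        # F2(x_t + F1(x_t, theta), theta)
        return x + 0.5 * (k1 + k2)  # x_{t+1}

block = RK2Block(nn.Identity(), nn.Identity())  # placeholder sub-modules
y = block(torch.randn(2, 49, 256))
```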
For an input image $I$, feature extraction with the ResNet50 convolutional neural network first yields $X \in \mathbb{R}^{C \times H \times W}$, wherein $X$ is the feature, $\mathbb{R}$ is the set of real numbers, and $C$, $H$, $W$ respectively represent the number of channels, the length and the width. After dimension reduction of the multidimensional data, $X \in \mathbb{R}^{C \times N}$ is obtained, with the parameter $N$ satisfying $N = H \times W$. Since deep learning uses mini-batch training, the size of the feature is recorded as $(b, C, N)$, wherein $b$ is the batch_size, i.e. the sample size of each training batch.

Exchanging the two channels $C$ and $N$ gives $X \in \mathbb{R}^{b \times N \times C}$. Setting matrix one $W_Q$, matrix two $W_K$ and matrix three $W_V$ as learnable parameters, one then obtains:

$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$ (4)

Multiplying the query vector $Q$ by the transposed key matrix $K^{\top}$, i.e. a dot-product calculation, and performing a $\mathrm{softmax}$ operation in the final dimension yields the attention score matrix $A$:

$A = \mathrm{softmax}(Q K^{\top})$ (5)

The attention score measures the similarity between every two features; multiplying the attention score matrix by the value vector $V$ then gives the output:

$O = AV$ (6)

The output $O$ has shape $(b, N, C)$; it can be seen that the spatial dimensions of $O$ and $X$ are kept consistent, so the output of formula (6) is substituted into the second-order Runge-Kutta formula (3) to obtain the expression of the Transformer network model $F$. This step yields the specific form of the Transformer network model.
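A sketch of formulas (4)-(6): flatten the (b, C, H, W) feature to (b, C, N), exchange the C and N channels, project with the learnable matrices, and compute O = softmax(QK^T)V; the concrete sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

b, C, H, W = 2, 256, 7, 7
feat = torch.randn(b, C, H, W)
X = feat.flatten(2).transpose(1, 2)   # (b, N, C), with N = H * W

W_q = nn.Linear(C, C, bias=False)     # matrix one, W_Q
W_k = nn.Linear(C, C, bias=False)     # matrix two, W_K
W_v = nn.Linear(C, C, bias=False)     # matrix three, W_V
Q, K, V = W_q(X), W_k(X), W_v(X)      # formula (4)

A = torch.softmax(Q @ K.transpose(1, 2), dim=-1)  # formula (5), softmax on last dim
O = A @ V                             # formula (6), shape (b, N, C), same as X
```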
Output result one and output result two from step 5) are processed by the Transformer network model respectively, yielding two output features of sizes 128×7×7 and 64×7×7; a secondary splicing operation is then performed on these two output features, realizing the secondary splicing of the features.
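A short sketch of the secondary splicing, assuming the two Transformer outputs carry the channel-first shapes quoted above and are joined along the channel dimension:

```python
import torch

out_a = torch.randn(1, 128, 7, 7)          # first Transformer output feature
out_b = torch.randn(1, 64, 7, 7)           # second Transformer output feature
fused = torch.cat([out_a, out_b], dim=1)   # secondary splice: (1, 192, 7, 7)
```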
A facial expression recognition system based on feature fusion and an attention mechanism comprises the following modules:
a preprocessing module: preprocessing the acquired facial expression data set;
a neural network model building module: constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
an information extraction module: extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information and the end layer features comprise semantic features;
a primary splicing module: splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, thereby realizing feature fusion;
a convolution operation module: simultaneously carrying out convolution operations on the end layer features and the weighted feature vector to obtain output result one and output result two respectively, and inputting output result one and output result two into the Transformer network model;
a secondary splicing module: in the Transformer network model, performing secondary splicing on output result one and output result two;
a classification module: downsampling the result after the secondary splicing, sending the downsampled result into a fully-connected layer, and finally inputting it into a softmax classifier for classification, thereby obtaining an expression classification result.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a facial expression recognition method based on feature fusion and attention mechanisms as described above.
An embedded device configured with a trusted execution environment, the trusted execution environment comprising:
a memory for storing instructions;
and the processor is used for executing the instructions to enable the embedded device to execute the facial expression recognition method based on the feature fusion and the attention mechanism.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (10)
1. A facial expression recognition method based on feature fusion and an attention mechanism, characterized by comprising the following steps:
(1) preprocessing the acquired facial expression data set;
(2) constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
(3) extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information and the end layer features comprise semantic features;
(4) splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize the fusion of features of different layers;
(5) simultaneously carrying out convolution operations on the end layer features and the weighted feature vector to obtain output result one and output result two respectively, and inputting output result one and output result two into the Transformer network model;
(6) in the Transformer network model, performing secondary splicing on output result one and output result two;
(7) downsampling the result after the secondary splicing, sending the downsampled result into a fully-connected layer, and finally inputting it into a softmax classifier for classification to obtain an expression classification result.
2. The facial expression recognition method based on feature fusion and attention mechanism according to claim 1, wherein in step (1), the obtained facial expression data set is preprocessed, comprising the steps of:
creating a PIL object, so that the operations of all images in the facial expression data set are based on the PIL object;
adjusting the facial expression image to 224×224, and randomly horizontally flipping the input data with a given flip probability p;
normalizing the horizontally flipped input data;
and loading the normalized data set image to a facial expression recognition neural network model.
3. The facial expression recognition method based on feature fusion and attention mechanism of claim 1, wherein in step (2), the ResNet50 convolutional neural network structure comprises seven parts:
the first part is used for padding the input image;
the second part does not contain residual blocks and is used for sequentially carrying out convolution, regularization, activation function and maximum pooling calculations on the input image data;
the third part, the fourth part, the fifth part and the sixth part each comprise a plurality of residual blocks, wherein each residual block has three convolution layers;
the seventh part comprises an average pooling layer and a fully-connected layer, and the image data output by the sixth part sequentially passes through the average pooling layer and the fully-connected layer and then outputs the resulting feature map.
4. The facial expression recognition method based on feature fusion and attention mechanism according to claim 1, wherein in step (4), the feature maps output by the two middle layers are spliced along the channel dimension; the feature maps output by the two middle layers have sizes 512×60×60 and 1024×60×60 respectively, and after splicing, a weighted feature vector of size 1536×60×60 is obtained.
5. The facial expression recognition method based on feature fusion and attention mechanism according to claim 1, wherein in step (5), the end layer features and the weighted feature vector each further pass through two convolution layers, namely a 1×1 convolution layer and a 3×3 convolution layer; the 1×1 convolution layer is used to compress the original channel number 2048 to 256, and the 3×3 convolution layer is used for feature fusion, so as to obtain output result one and output result two respectively.
6. The facial expression recognition method based on feature fusion and attention mechanism of claim 1, characterized in that in step (6), $Q$, $K$, $V$ respectively represent a query vector, a key vector and a value vector, and output result one and output result two from step (5) are input into the RKTM module and respectively used as the query vector and the key vector;

in the ordinary differential equation, the Euler method is expressed as:

$x_{t+1} = x_t + f(x_t, t)$ (1)

the residual connection employed by the ResNet50 convolutional neural network is denoted as:

$x_{t+1} = x_t + F(x_t, \theta_t)$ (2)

solving the Transformer network model with the second-order Runge-Kutta formula yields:

$x_{t+1} = x_t + \frac{1}{2}\left(F_1(x_t, \theta) + F_2\left(x_t + F_1(x_t, \theta), \theta\right)\right)$ (3)

wherein $t$ represents time, $F$ represents the Transformer network model, $\theta$ represents the model parameters from the ResNet50 convolutional neural network, and $F_1$, $F_2$ respectively represent attention sub-module I and attention sub-module II in the RKTM module;

for an input image $I$, feature extraction with the ResNet50 convolutional neural network first yields $X \in \mathbb{R}^{C \times H \times W}$, wherein $X$ is the feature, $\mathbb{R}$ is the set of real numbers, and $C$, $H$, $W$ respectively represent the number of channels, the length and the width; after dimension reduction of the multidimensional data, $X \in \mathbb{R}^{C \times N}$ is obtained; letting the parameter $N$ satisfy $N = H \times W$, the size of the feature is recorded as $(b, C, N)$, wherein $b$ represents the sample size of each training batch;

exchanging the two channels $C$ and $N$ gives $X \in \mathbb{R}^{b \times N \times C}$; setting matrix one $W_Q$, matrix two $W_K$ and matrix three $W_V$ as learnable parameters, one then obtains:

$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$ (4)

multiplying the query vector $Q$ by the transposed key matrix $K^{\top}$ and performing a $\mathrm{softmax}$ operation in the final dimension yields the attention score matrix $A$:

$A = \mathrm{softmax}(Q K^{\top})$ (5)
7. A facial expression recognition system based on feature fusion and an attention mechanism, characterized by comprising the following modules:
a preprocessing module: preprocessing the acquired facial expression data set;
a neural network model building module: constructing a facial expression recognition neural network model, wherein the facial expression recognition neural network model comprises a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
an information extraction module: extracting two middle layer features and end layer features of the ResNet50 convolutional neural network, wherein the middle layer features comprise image structure information and the end layer features comprise semantic features;
a primary splicing module: splicing the feature maps output by the two middle layers along the channel dimension to obtain a weighted feature vector, so as to realize the fusion of features of different layers;
a convolution operation module: simultaneously carrying out convolution operations on the end layer features and the weighted feature vector to obtain output result one and output result two respectively, and inputting output result one and output result two into the Transformer network model;
a secondary splicing module: in the Transformer network model, performing secondary splicing on output result one and output result two;
a classification module: downsampling the result after the secondary splicing, sending the downsampled result into a fully-connected layer, and finally inputting it into a softmax classifier for classification, thereby obtaining an expression classification result.
8. The facial expression recognition system based on feature fusion and attention mechanism of claim 7, wherein in the neural network model building module, the ResNet50 convolutional neural network structure comprises seven parts:
the first part is used for padding the input image;
the second part does not contain residual blocks and is used for sequentially carrying out convolution, regularization, activation function and maximum pooling calculations on the input image data;
the third part, the fourth part, the fifth part and the sixth part each comprise a plurality of residual blocks, wherein each residual block has three convolution layers;
the seventh part comprises an average pooling layer and a fully-connected layer, and the image data output by the sixth part sequentially passes through the average pooling layer and the fully-connected layer and then outputs the resulting feature map.
9. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the facial expression recognition method based on feature fusion and attention mechanism of any one of claims 1 to 6.
10. An embedded device configured with a trusted execution environment, the trusted execution environment comprising:
a memory for storing instructions;
and the processor is used for executing the instructions to enable the embedded device to execute the facial expression recognition method based on the feature fusion and attention mechanism according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310493454.6A CN116189272B (en) | 2023-05-05 | 2023-05-05 | Facial expression recognition method and system based on feature fusion and attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116189272A true CN116189272A (en) | 2023-05-30 |
CN116189272B CN116189272B (en) | 2023-07-07 |
Family
ID=86433105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310493454.6A Active CN116189272B (en) | 2023-05-05 | 2023-05-05 | Facial expression recognition method and system based on feature fusion and attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116189272B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118365974A (en) * | 2024-06-20 | 2024-07-19 | 山东省水利科学研究院 | Water quality class detection method, system and equipment based on hybrid neural network |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110081881A (en) * | 2019-04-19 | 2019-08-02 | 成都飞机工业(集团)有限责任公司 | It is a kind of based on unmanned plane multi-sensor information fusion technology warship bootstrap technique |
CN111680541A (en) * | 2020-04-14 | 2020-09-18 | 华中科技大学 | Multi-modal emotion analysis method based on multi-dimensional attention fusion network |
CN112418095A (en) * | 2020-11-24 | 2021-02-26 | 华中师范大学 | Facial expression recognition method and system combined with attention mechanism |
CN112541409A (en) * | 2020-11-30 | 2021-03-23 | 北京建筑大学 | Attention-integrated residual network expression recognition method |
CN114764941A (en) * | 2022-04-25 | 2022-07-19 | 深圳技术大学 | Expression recognition method and device and electronic equipment |
CN115424313A (en) * | 2022-07-20 | 2022-12-02 | 河海大学常州校区 | Expression recognition method and device based on deep and shallow layer multi-feature fusion |
CN115862091A (en) * | 2022-11-09 | 2023-03-28 | 暨南大学 | Facial expression recognition method, device, equipment and medium based on Emo-ResNet |
CN115984930A (en) * | 2022-12-26 | 2023-04-18 | 中国电信股份有限公司 | Micro expression recognition method and device and micro expression recognition model training method |
Also Published As
Publication number | Publication date |
---|---|
CN116189272B (en) | 2023-07-07 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |