CN116189272A - Facial expression recognition method and system based on feature fusion and attention mechanism - Google Patents

Facial expression recognition method and system based on feature fusion and attention mechanism

Info

Publication number
CN116189272A
Authority
CN
China
Prior art keywords
facial expression
feature
output
neural network
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310493454.6A
Other languages
Chinese (zh)
Other versions
CN116189272B (en)
Inventor
陈昌红
卢妍菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Easyvision Cuizhi Technology Co ltd
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310493454.6A priority Critical patent/CN116189272B/en
Publication of CN116189272A publication Critical patent/CN116189272A/en
Application granted granted Critical
Publication of CN116189272B publication Critical patent/CN116189272B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a facial expression recognition method and system based on feature fusion and an attention mechanism. The method comprises the following steps: (1) preprocessing an acquired facial expression data set; (2) constructing a facial expression recognition neural network model; (3) extracting two intermediate-layer features and the last-layer features of a ResNet50 convolutional neural network; (4) concatenating the feature maps output by the two intermediate layers to obtain a weighted feature vector; (5) performing convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively; (6) performing a secondary concatenation of output result one and output result two in a Transformer network model; and (7) sending the secondarily concatenated result to a fully connected layer, whose output is input into a softmax classifier for classification to obtain an expression classification result. The method can improve the accuracy of facial expression recognition.

Description

Facial expression recognition method and system based on feature fusion and attention mechanism

Technical Field

The invention relates to a facial expression recognition method and belongs to the technical field of image processing.

Background Art

Besides language, facial expressions are an important carrier for expressing inner emotions. In recent years, facial expression recognition (FER) has flourished in fields such as the Internet of Things, artificial intelligence, and mental health assessment, and has received wide attention and application from all walks of life.

However, existing expression recognition is mainly based on hand-crafted features and machine learning methods, which suffer from the following main defects: hand-crafted features usually introduce unavoidable human factors and errors; human intervention is required to extract useful recognition features from the original image; and it is difficult to obtain deep, high-level semantic features and deep features from the original image.

To obtain deep high-level semantic features, networks use ever more convolutional layers, but enhancing a network's learning ability by adding layers is not always feasible: once the network reaches a certain depth, adding further layers causes the vanishing-gradient problem in stochastic training and also degrades the network's accuracy. The traditional remedy is data initialization and regularization, which solves the vanishing-gradient problem but does not improve the network's accuracy.

Summary of the Invention

The technical problem to be solved by the present invention: in the process of facial expression recognition, how to obtain deep, high-level semantic features and thereby achieve a better facial expression recognition effect.

To solve the above technical problem, the present invention provides a facial expression recognition method based on feature fusion and an attention mechanism, comprising the following steps:

(1) preprocessing the acquired facial expression data set;

(2) constructing a facial expression recognition neural network model, comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;

(3) extracting two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;

(4) concatenating the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, thereby achieving the fusion of features at different levels;

(5) performing convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively, and inputting output result one and output result two into the Transformer network model;

(6) in the Transformer network model, performing a secondary concatenation of output result one and output result two;

(7) downsampling the secondarily concatenated result, sending it to a fully connected layer, and finally inputting it into a softmax classifier for classification to obtain the expression classification result.

In the aforementioned facial expression recognition method based on feature fusion and an attention mechanism, in step (1), preprocessing the acquired facial expression data set comprises the following steps:

creating a PIL object, so that all operations on images in the facial expression data set are based on the PIL object;

resizing the facial expression images to 224×224, and randomly flipping the input data horizontally according to a given flip probability p;

normalizing the horizontally flipped input data;

loading the normalized data set images into the facial expression recognition neural network model.

In the aforementioned facial expression recognition method based on feature fusion and an attention mechanism, in step (2), the ResNet50 convolutional neural network structure comprises seven parts:

the first part is used to pad the input image;

the second part contains no residual block and is used to perform convolution, normalization, activation function, and max-pooling computations on the input image data in sequence;

the third, fourth, fifth, and sixth parts each contain several residual blocks, and each residual block has three convolutional layers;

the seventh part comprises an average pooling layer and a fully connected layer; the image data output by the sixth part passes through the average pooling layer and the fully connected layer in turn, and the resulting feature map is output.

In the aforementioned facial expression recognition method based on feature fusion and an attention mechanism, in step (4), the feature maps output by the two intermediate layers are concatenated along the channel dimension; the feature tensors output by the two intermediate layers are of size 512×60×60 and 1024×60×60 respectively, and after concatenation a weighted feature vector of size 1536×60×60 is obtained.

In the aforementioned facial expression recognition method based on feature fusion and an attention mechanism, in step (5), the last-layer features and the weighted feature vector are each further passed through two convolutional layers, namely a 1×1 convolutional layer and a 3×3 convolutional layer; the 1×1 convolutional layer is used to compress the original channel number 2048 to 256, and the 3×3 convolutional layer is used for feature fusion, yielding output result one and output result two, respectively.

In the aforementioned facial expression recognition method based on feature fusion and an attention mechanism, in step (6), Q, K, and V denote the query vector, key vector, and value vector, respectively, where key and value vectors appear in pairs; output result one and output result two from step (5) are input into the RKTM module as the query vector and key vector, respectively.

In ordinary differential equations, the Euler formula is expressed as:

y_{n+1} = y_n + h f(t_n, y_n),

the residual connection adopted by the ResNet50 convolutional neural network is expressed as:

x_{l+1} = x_l + f(x_l, W_l),

and solving the Transformer network model with the second-order Runge-Kutta formula yields:

x_{l+1} = x_l + (1/2) (g_1(x_l, W) + g_2(x_l + g_1(x_l, W), W)),

where t denotes time, g denotes the Transformer network model, W denotes the model parameters from the ResNet50 convolutional neural network, and g_1 and g_2 denote attention submodule one and attention submodule two in the RKTM module, respectively.

For an input image I ∈ R^{3×H×W}, features are first extracted with the ResNet50 convolutional neural network to obtain F ∈ R^{C×H×W}, where F is the feature, R is the set of real numbers, and C, H, W denote the number of channels, height, and width, respectively; after dimensionality reduction of the multidimensional data, letting parameter one N = H×W, we have F ∈ R^{C×N}, and the feature size is written as b×C×N, where b denotes the number of samples in each training batch.

The computation process of the Transformer network model g is as follows:

Let the number of attention heads be n; reshape the feature F ∈ R^{b×C×N} into F ∈ R^{b×n×d×N}, where

parameter two d = C/n;

after swapping the two channels corresponding to parameter two d and parameter one N, F ∈ R^{b×n×N×d} is obtained; setting matrix one W_Q, matrix two W_K, and matrix three W_V as learnable parameters, we obtain

Q = F W_Q, K = F W_K, V = F W_V,

the query vector Q is multiplied by the transpose K^T of the key vector K, and a softmax operation is applied along the last dimension to obtain the attention score matrix A; the computation is as follows:

A = softmax(Q K^T),

the attention score matrix is then multiplied by the value vector V to obtain the output

F_out = A V, (6)

the shape of the output F_out is b×n×N×d; substituting the output of equation (6) into the second-order Runge-Kutta formula yields the expression of the Transformer network model g.

A facial expression recognition system based on feature fusion and an attention mechanism comprises the following modules:

a preprocessing module, which preprocesses the acquired facial expression data set;

a neural network model construction module, which constructs a facial expression recognition neural network model comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;

an information extraction module, which extracts two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;

a primary concatenation module, which concatenates the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, thereby achieving feature fusion;

a convolution operation module, which performs convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively, and inputs output result one and output result two into the Transformer network model;

a secondary concatenation module, which performs a secondary concatenation of output result one and output result two in the Transformer network model;

a classification module, which downsamples the secondarily concatenated result, sends it to a fully connected layer, and finally inputs it into a softmax classifier for classification to obtain the expression classification result.

A computer-readable storage medium stores a computer program which, when executed by a processor, implements the facial expression recognition method based on feature fusion and an attention mechanism as described above.

An embedded device is configured with a trusted execution environment, the trusted execution environment comprising:

a memory for storing instructions;

a processor for executing the instructions, so that the embedded device implements the facial expression recognition method based on feature fusion and an attention mechanism as described above.

Beneficial effects achieved by the present invention: the facial expression recognition method based on feature fusion and an attention mechanism of the present invention is built on the ResNet50 neural network, whose residual modules solve the gradient problem and whose depth yields more expressive features and correspondingly stronger detection and classification performance, while reducing the number of parameters and, to a certain extent, the amount of computation. The defining characteristics of the Transformer model are its self-attention mechanism (Self-Attention) and residual connection structure (Residual Connection); compared with traditional sequence models, it can fully consider the information of all positions in the input sequence from a global perspective, so deeper networks can be trained effectively, achieving a twofold improvement in recognition accuracy while accelerating the training process.

Meanwhile, by solving the second-order Runge-Kutta formula, the present invention obtains a model g with stronger generalization ability, that is, a model that also classifies well on training data other than those of this embodiment.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the overall network structure of Embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of the ResNet50 convolutional neural network structure;

FIG. 3 is a schematic diagram of the structure of the RKTM module;

FIG. 4 is a schematic diagram of the recognition accuracy of the method of the present invention;

FIG. 5 is a schematic diagram of the accuracy of directly training the ResNet50 convolutional neural network.

Detailed Description

The technical solution of the present invention is further described below with reference to the accompanying drawings and specific embodiments.

Embodiment 1

This embodiment uses the public facial expression data set FER2013, which consists of 35,886 images of different facial expressions; each image is a grayscale image with a fixed size of 48×48, and there are 7 expression categories: anger, disgust, fear, happiness, sadness, surprise, and neutral. The official release stores the expression data in a csv file, which can be converted to and stored as image data, for example as sketched below.
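
The conversion from the official csv to image files might look like the following minimal sketch, assuming the standard fer2013.csv layout with emotion, pixels, and Usage columns, where pixels holds 48×48 space-separated grayscale values; the output directory layout is an illustrative choice:

```python
# Sketch: convert fer2013.csv rows into 48x48 grayscale PNG files,
# grouped by data split (Usage) and expression label.
import os
import numpy as np
import pandas as pd
from PIL import Image

LABELS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def csv_to_images(csv_path: str, out_dir: str) -> None:
    df = pd.read_csv(csv_path)
    for i, row in df.iterrows():
        # "pixels" is a string of 48*48 space-separated grayscale values.
        pixels = np.asarray(row["pixels"].split(), dtype=np.uint8).reshape(48, 48)
        label_dir = os.path.join(out_dir, row["Usage"], LABELS[row["emotion"]])
        os.makedirs(label_dir, exist_ok=True)
        Image.fromarray(pixels, mode="L").save(os.path.join(label_dir, f"{i}.png"))

# csv_to_images("fer2013.csv", "fer2013_images")
```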

A facial expression recognition method based on feature fusion and an attention mechanism comprises the following steps:

1) Preprocess the acquired facial expression data set, comprising the following steps (a sketch follows this list):

create a PIL object, so that all operations on images in the facial expression data set are based on the PIL object;

resize the facial expression images to 224×224, and randomly flip the input data horizontally with the default flip probability p = 0.5;

normalize the horizontally flipped input data, with mean [0.485, 0.456, 0.406] and standard deviation (std) [0.229, 0.224, 0.225];

load the normalized data set images into the facial expression recognition neural network model; through preprocessing, the data in the data set are augmented to enrich the training data.
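
These preprocessing steps map directly onto a torchvision transform pipeline; the following is a minimal sketch, where the image directory and loader settings are illustrative assumptions rather than the patent's setup:

```python
# Sketch: PIL-based loading, 224x224 resize, random horizontal flip (p=0.5),
# and normalization with the mean/std listed above.
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

train_transform = T.Compose([
    T.Resize((224, 224)),                  # resize to 224x224
    T.RandomHorizontalFlip(p=0.5),         # default flip probability
    T.ToTensor(),                          # PIL image -> float tensor in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# Hypothetical folder produced by the csv-to-image conversion sketched earlier;
# ImageFolder loads images as RGB PIL objects by default.
train_set = ImageFolder("fer2013_images/Training", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```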

2) Construct the facial expression recognition neural network model. FIG. 1 is a schematic diagram of the overall neural network structure of Embodiment 1, comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism. Compared with a convolutional neural network, the Transformer model has a global receptive field: the distance between any two pixels is the same, and the relationships between vectors across the entire feature map can be measured. The RKTM serves as the main encoding functional module of the encoder.

As shown in FIG. 2, the ResNet50 convolutional neural network structure comprises seven parts:

the first part (stage0) pads the input image, with padding parameters (3, 3);

the second part (stage1) contains no residual block and performs convolution, normalization, activation function, and max-pooling computations on the input image data in sequence;

the third, fourth, fifth, and sixth parts each contain several residual blocks, numbering 3, 4, 6, and 3 respectively; each residual block has three convolutional layers with kernel sizes 1×1, 3×3, and 1×1, all with stride 1 during the convolution operation, except that the second convolution uses padding (1, 1), that is, a ring of zeros is added around the input image data;

the seventh part comprises an average pooling layer and a fully connected layer; the image data output by the sixth part passes through the average pooling layer and the fully connected layer (avgpool, fc) in turn and then outputs the result, turning the 224×224 input image into a 56×56 feature map and greatly reducing storage space. The stage layout can be checked against a stock ResNet50, as in the snippet below.
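
As a quick sanity check of the block counts, torchvision's stock ResNet50 exposes the four residual stages directly (the layer1 to layer4 naming is torchvision's, not the patent's):

```python
# Sketch: confirm that ResNet50's four residual stages hold 3, 4, 6, 3
# bottleneck blocks, matching the third to sixth parts described above.
from torchvision.models import resnet50

net = resnet50()
for name in ["layer1", "layer2", "layer3", "layer4"]:
    print(name, len(getattr(net, name)), "blocks")
# layer1 3 blocks, layer2 4 blocks, layer3 6 blocks, layer4 3 blocks
```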

3) The output features of the fourth and fifth parts of the ResNet50 convolutional neural network contain rich image structure information, and both are called intermediate layers. The sixth part is the last part of the ResNet50 network that contains convolution operations; its output features contain rich semantic features, and it is called the last layer. Since the ResNet50 network is trained on ImageNet, which corresponds to a classification task, the last-layer output of the feature extractor consists of semantic features.

4) Concatenate the feature maps output by the two intermediate layers along the channel dimension. The feature tensors output by the two intermediate layers are of size 512×60×60 and 1024×60×60 respectively; after concatenation, a weighted feature vector of size 1536×60×60 is obtained, realizing the fusion of features at different levels. A sketch of steps 3) and 4) follows.
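
A minimal sketch of the two taps and the channel concatenation, assuming the patent's fourth, fifth, and sixth parts correspond to torchvision's layer2, layer3, and layer4; note that torchvision's default strides give 28×28 and 14×14 maps for a 224×224 input rather than the 60×60 maps stated above, so the smaller map is upsampled here before concatenation:

```python
# Sketch: tap two intermediate stages and the last conv stage of ResNet50,
# then concatenate the two intermediate feature maps along the channel axis.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

backbone = resnet50(weights="IMAGENET1K_V1")

def extract_features(x: torch.Tensor):
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    x = backbone.layer1(x)
    mid1 = backbone.layer2(x)     # intermediate features, 512 channels
    mid2 = backbone.layer3(mid1)  # intermediate features, 1024 channels
    last = backbone.layer4(mid2)  # last conv features, 2048 channels
    # Align spatial sizes before channel concatenation.
    mid2_up = F.interpolate(mid2, size=mid1.shape[-2:],
                            mode="bilinear", align_corners=False)
    fused = torch.cat([mid1, mid2_up], dim=1)  # 512 + 1024 = 1536 channels
    return fused, last

fused, last = extract_features(torch.randn(2, 3, 224, 224))
print(fused.shape, last.shape)  # [2, 1536, 28, 28] and [2, 2048, 7, 7]
```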

5) Pass the last-layer features and the weighted feature vector each through two further convolutional layers, namely a 1×1 convolutional layer and a 3×3 convolutional layer; the 1×1 convolutional layer compresses the original channel number 2048 to 256, and the 3×3 convolutional layer performs feature fusion, yielding output result one and output result two, respectively. This step ensures that both output results can be fed smoothly into the following Transformer network model, a deep learning model that uses a self-attention mechanism; a sketch of the two branches follows.
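
The two projection branches might look like the following minimal sketch; applying the same 1×1 then 3×3 pattern to the 1536-channel fused branch is an assumption, since the patent states the channel compression explicitly only for the 2048-channel last-layer branch:

```python
# Sketch: each branch is a 1x1 conv (channel compression) followed by
# a 3x3 conv (feature fusion).
import torch
import torch.nn as nn

def projection(in_channels: int, out_channels: int = 256) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1),             # compress channels
        nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1), # fuse features
    )

proj_last = projection(2048)  # last-layer branch: 2048 -> 256
proj_mid = projection(1536)   # fused intermediate branch: 1536 -> 256

out1 = proj_last(torch.randn(2, 2048, 7, 7))   # output result one
out2 = proj_mid(torch.randn(2, 1536, 28, 28))  # output result two
print(out1.shape, out2.shape)
```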

6) In the Transformer network model, perform a secondary concatenation of output result one and output result two.

7) Downsample the concatenated features (the purpose is feature extraction), send them to the fully connected layer to obtain the final feature vector, and finally input them into the softmax classifier, which outputs class probabilities, giving the expression classification result.

8) Validation is performed on the public FER2013 facial expression data set. As shown in FIG. 4, the recognition accuracy of the method of the present invention reaches 65%, while directly training the ResNet50 network only reaches 57%, as shown in FIG. 5; through feature fusion and the embedded improved attention mechanism, this embodiment raises facial expression recognition accuracy on this data set by 8 percentage points. Extensive research results show that deep features extracted by convolutional neural networks are robust to deformations such as translation, rotation, and scaling, and that different convolutional layers extract features at different levels, effectively characterizing both the local and global properties of the image; the model of this embodiment therefore has better robustness.

Embodiment 2

A facial expression recognition method based on feature fusion and an attention mechanism comprises the following steps:

(1) preprocessing the acquired facial expression data set;

(2) constructing a facial expression recognition neural network model, comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;

(3) extracting two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;

(4) concatenating the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, thereby achieving the fusion of features at different levels;

(5) performing convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively, and inputting output result one and output result two into the Transformer network model;

(6) in the Transformer network model, performing a secondary concatenation of output result one and output result two;

(7) downsampling the secondarily concatenated result, sending it to a fully connected layer, and finally inputting it into a softmax classifier for classification to obtain the expression classification result.

In step 6), FIG. 3 shows the structure of the RKTM module, i.e., the multi-head self-attention module serving as the encoder of the Transformer model, where Q, K, and V denote the query vector, key vector, and value vector, respectively; the key and value vectors appear in pairs, and both depend on the input.

Input output result one and output result two from step 5) into the RKTM module as the query vector and key vector, respectively;

In ordinary differential equations, the Euler formula is expressed as:

y_{n+1} = y_n + h f(t_n, y_n),

the residual connection adopted by the ResNet50 convolutional neural network is expressed as:

x_{l+1} = x_l + f(x_l, W_l),

The Euler formula is the first-order form of the Runge-Kutta formula; solving the Transformer network model with the second-order Runge-Kutta formula yields:

x_{l+1} = x_l + (1/2) (g_1(x_l, W) + g_2(x_l + g_1(x_l, W), W)),

where t denotes time, g denotes the Transformer network model, W denotes the model parameters from the ResNet50 convolutional neural network, and g_1 and g_2 denote attention submodule one and attention submodule two in the RKTM module, respectively.

For an input image I ∈ R^{3×H×W}, features are first extracted with the ResNet50 convolutional neural network to obtain F ∈ R^{C×H×W}, where F is the feature, R is the set of real numbers, and C, H, W denote the number of channels, height, and width, respectively; after dimensionality reduction of the multidimensional data, letting parameter one N = H×W, we have F ∈ R^{C×N}. Since deep learning uses mini-batch training, the feature size is written as b×C×N, where b, the batch_size, is the number of samples in each training batch;

The computation process of the Transformer network model g is as follows:

Let the number of attention heads be n; reshape the feature F ∈ R^{b×C×N} into F ∈ R^{b×n×d×N}, where

parameter two d = C/n;

after swapping the two channels corresponding to parameter two d and parameter one N, F ∈ R^{b×n×N×d} is obtained; setting matrix one W_Q, matrix two W_K, and matrix three W_V as learnable parameters, we obtain

Q = F W_Q, K = F W_K, V = F W_V,

the query vector Q is multiplied by the transpose K^T of the key vector K (a dot-product computation), and a softmax operation is applied along the last dimension to obtain the attention score matrix A; the computation is as follows:

A = softmax(Q K^T),

the attention scores measure the pairwise similarity between features; the attention score matrix is then multiplied by the value vector V to obtain the output

F_out = A V, (6)

the shape of the output F_out is b×n×N×d, and it can be seen that F_out and F keep the same spatial dimensions, so the output of equation (6) is substituted into the second-order Runge-Kutta formula to obtain the expression of the Transformer network model g. This step yields the concrete form of the Transformer network model; a sketch follows.

After output result one and output result two from step 5) are each processed by the Transformer network model, two output features are obtained, and a secondary concatenation operation is applied to them: 64×7×7 concatenated with 64×7×7 becomes 128×7×7, realizing the secondary concatenation of features. A sketch of this concatenation and the classification head follows.
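
The secondary concatenation together with the downsampling, fully connected layer, and softmax classifier of step (7) might look like the following minimal sketch; global average pooling as the downsampling operation and the layer sizes beyond the stated 64×7×7 and 128×7×7 shapes are assumptions, and the 7 classes follow the FER2013 categories:

```python
# Sketch: concatenate the two 64x7x7 Transformer outputs into 128x7x7,
# downsample, pass through a fully connected layer, and apply softmax.
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)    # downsampling (assumed: global average pooling)
        self.fc = nn.Linear(128, num_classes)  # fully connected layer

    def forward(self, out1: torch.Tensor, out2: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([out1, out2], dim=1)  # 64x7x7 + 64x7x7 -> 128x7x7
        logits = self.fc(self.pool(fused).flatten(1))
        return torch.softmax(logits, dim=1)     # class probabilities

probs = ClassifierHead()(torch.randn(2, 64, 7, 7), torch.randn(2, 64, 7, 7))
print(probs.shape)  # torch.Size([2, 7])
```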

A facial expression recognition system based on feature fusion and an attention mechanism comprises the following modules:

a preprocessing module, which preprocesses the acquired facial expression data set;

a neural network model construction module, which constructs a facial expression recognition neural network model comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;

an information extraction module, which extracts two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;

a primary concatenation module, which concatenates the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, thereby achieving feature fusion;

a convolution operation module, which performs convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively, and inputs output result one and output result two into the Transformer network model;

a secondary concatenation module, which performs a secondary concatenation of output result one and output result two in the Transformer network model;

a classification module, which downsamples the secondarily concatenated result, sends it to a fully connected layer, and finally inputs it into a softmax classifier for classification to obtain the expression classification result.

A computer-readable storage medium stores a computer program which, when executed by a processor, implements the facial expression recognition method based on feature fusion and an attention mechanism as described above.

An embedded device is configured with a trusted execution environment, the trusted execution environment comprising:

a memory for storing instructions;

a processor for executing the instructions, so that the embedded device implements the facial expression recognition method based on feature fusion and an attention mechanism as described above.

Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the technical principles of the present invention, and these improvements and modifications should also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A facial expression recognition method based on feature fusion and an attention mechanism, characterized by comprising the following steps:

(1) preprocessing the acquired facial expression data set;

(2) constructing a facial expression recognition neural network model, comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;

(3) extracting two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;

(4) concatenating the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, achieving the fusion of features at different levels;

(5) performing convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively, and inputting output result one and output result two into the Transformer network model;

(6) in the Transformer network model, performing a secondary concatenation of output result one and output result two;

(7) downsampling the secondarily concatenated result, sending it to a fully connected layer, and finally inputting it into a softmax classifier for classification to obtain the expression classification result.

2. The facial expression recognition method based on feature fusion and an attention mechanism according to claim 1, characterized in that, in step (1), preprocessing the acquired facial expression data set comprises the following steps:

creating a PIL object, so that all operations on images in the facial expression data set are based on the PIL object;

resizing the facial expression images to 224×224, and randomly flipping the input data horizontally according to a given flip probability p;

normalizing the horizontally flipped input data;

loading the normalized data set images into the facial expression recognition neural network model.

3. The facial expression recognition method based on feature fusion and an attention mechanism according to claim 1, characterized in that, in step (2), the ResNet50 convolutional neural network structure comprises seven parts:

the first part is used to pad the input image;

the second part contains no residual block and is used to perform convolution, normalization, activation function, and max-pooling computations on the input image data in sequence;

the third, fourth, fifth, and sixth parts each contain several residual blocks, each residual block having three convolutional layers;

the seventh part comprises an average pooling layer and a fully connected layer; the image data output by the sixth part passes through the average pooling layer and the fully connected layer in turn, and the resulting feature map is output.

4. The facial expression recognition method based on feature fusion and an attention mechanism according to claim 1, characterized in that, in step (4), the feature maps output by the two intermediate layers are concatenated along the channel dimension; the feature tensors output by the two intermediate layers are of size 512×60×60 and 1024×60×60 respectively, and after concatenation a weighted feature vector of size 1536×60×60 is obtained.

5. The facial expression recognition method based on feature fusion and an attention mechanism according to claim 1, characterized in that, in step (5), the last-layer features and the weighted feature vector are each further passed through two convolutional layers, namely a 1×1 convolutional layer and a 3×3 convolutional layer; the 1×1 convolutional layer is used to compress the original channel number 2048 to 256, and the 3×3 convolutional layer is used for feature fusion, yielding output result one and output result two, respectively.

6. The facial expression recognition method based on feature fusion and an attention mechanism according to claim 1, characterized in that, in step (6), Q, K, and V denote the query vector, key vector, and value vector, respectively; output result one and output result two from step (5) are input into the RKTM module as the query vector and key vector, respectively;

In ordinary differential equations, the Euler formula is expressed as:
y_{n+1} = y_n + h f(t_n, y_n),

the residual connection adopted by the ResNet50 convolutional neural network is expressed as:

x_{l+1} = x_l + f(x_l, W_l),

and solving the Transformer network model with the second-order Runge-Kutta formula yields:

x_{l+1} = x_l + (1/2) (g_1(x_l, W) + g_2(x_l + g_1(x_l, W), W)),

where t denotes time, g denotes the Transformer network model, W denotes the model parameters from the ResNet50 convolutional neural network, and g_1 and g_2 denote attention submodule one and attention submodule two in the RKTM module, respectively;

for an input image I ∈ R^{3×H×W}, features are first extracted with the ResNet50 convolutional neural network to obtain F ∈ R^{C×H×W}, where F is the feature, R is the set of real numbers, and C, H, W denote the number of channels, height, and width, respectively; after dimensionality reduction of the multidimensional data, letting parameter one N = H×W, we have F ∈ R^{C×N}, and the feature size is written as b×C×N, where b denotes the number of samples in each training batch;

the computation process of the Transformer network model g is:

let the number of attention heads be n, and reshape the feature F into F ∈ R^{b×n×d×N}, where

parameter two d = C/n;

after swapping the two channels corresponding to parameter two d and parameter one N, F ∈ R^{b×n×N×d} is obtained; setting matrix one W_Q, matrix two W_K, and matrix three W_V as learnable parameters gives

Q = F W_Q, K = F W_K, V = F W_V,

the query vector Q is multiplied by the transpose K^T of the key vector K, and a softmax operation is applied along the last dimension to obtain the attention score matrix A, computed as:

A = softmax(Q K^T),

the attention score matrix is then multiplied by the value vector V to obtain the output

F_out = A V, (6)

the shape of the output F_out is b×n×N×d, and substituting the output of equation (6) into the second-order Runge-Kutta formula yields the expression of the Transformer network model g.
7. A facial expression recognition system based on feature fusion and attention mechanism, characterized by comprising the following modules:
a preprocessing module, for preprocessing the acquired facial expression data set;
a neural network model construction module, for constructing a facial expression recognition neural network model comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
an information extraction module, for extracting two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;
a first concatenation module, for concatenating the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, thereby fusing features of different levels;
a convolution module, for simultaneously performing convolution on the last-layer features and the weighted feature vector to obtain output result one and output result two respectively, and feeding output result one and output result two into the Transformer network model;
a second concatenation module, for concatenating output result one and output result two a second time within the Transformer network model;
a classification module, for downsampling the result of the second concatenation, feeding it into a fully connected layer, and finally into a softmax classifier to obtain the expression classification result.
8. The facial expression recognition system based on feature fusion and attention mechanism according to claim 7, characterized in that, in the neural network model construction module, the ResNet50 convolutional neural network structure comprises seven parts:
the first part pads the input image with parameters;
the second part contains no residual block and performs convolution, regularization, an activation function, and max pooling on the input image data in sequence;
the third, fourth, fifth, and sixth parts each contain several residual blocks, each residual block having three convolutional layers;
the seventh part comprises an average pooling layer and a fully connected layer, and the image data output by the sixth part passes through the average pooling layer and the fully connected layer in turn to output the resulting feature map.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the facial expression recognition method based on feature fusion and attention mechanism according to any one of claims 1 to 6.
10. An embedded device, the embedded device being configured with a trusted execution environment, the trusted execution environment comprising:
a memory, for storing instructions;
a processor, for executing the instructions so that the embedded device implements the facial expression recognition method based on feature fusion and attention mechanism according to any one of claims 1 to 6.
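Below is a minimal PyTorch sketch of the system of claims 7 and 8. It assumes layer2 and layer3 of torchvision's ResNet50 as the "two intermediate layers", 1x1 convolutions as the convolution module, and mean pooling over tokens as the downsampling step; the embedding dimension, head count, and layer choices are illustrative assumptions, not values fixed by the claims.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class FeatureFusionFER(nn.Module):
    # Sketch of the claimed pipeline: two intermediate feature maps are
    # concatenated along the channel dimension, the fused vector and the
    # last-layer features are each convolved, both results are fused in a
    # Transformer, concatenated again, downsampled, and classified.
    def __init__(self, num_classes: int = 7, dim: int = 256):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1 = backbone.layer1
        self.layer2 = backbone.layer2   # 512-channel intermediate features
        self.layer3 = backbone.layer3   # 1024-channel intermediate features
        self.layer4 = backbone.layer4   # 2048-channel last-layer features
        self.conv_mid = nn.Conv2d(512 + 1024, dim, kernel_size=1)
        self.conv_last = nn.Conv2d(2048, dim, kernel_size=1)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                         batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)
        x = self.layer1(x)
        f2 = self.layer2(x)          # structure information (intermediate)
        f3 = self.layer3(f2)         # structure information (intermediate)
        f4 = self.layer4(f3)         # semantic features (last layer)
        # First concatenation: align spatial size, join channel dimension.
        f2 = F.adaptive_avg_pool2d(f2, f3.shape[-2:])
        fused = torch.cat([f2, f3], dim=1)
        out1 = self.conv_last(f4)    # output result one
        out2 = self.conv_mid(fused)  # output result two
        # Flatten feature maps into token sequences for the Transformer.
        t1 = out1.flatten(2).transpose(1, 2)
        t2 = out2.flatten(2).transpose(1, 2)
        tokens = self.transformer(torch.cat([t1, t2], dim=1))  # second concatenation
        pooled = tokens.mean(dim=1)  # downsampling
        return self.fc(pooled)       # softmax is applied by the classifier/loss

With a 224x224 input, out1 is 7x7 and out2 is 14x14, so the Transformer attends jointly over 49 + 196 = 245 tokens carrying both structural (mid-level) and semantic (last-layer) information, which is the fusion the claims describe.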
CN202310493454.6A 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism Active CN116189272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310493454.6A CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310493454.6A CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Publications (2)

Publication Number Publication Date
CN116189272A (en) 2023-05-30
CN116189272B (en) 2023-07-07

Family

ID=86433105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310493454.6A Active CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Country Status (1)

Country Link
CN (1) CN116189272B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110081881A (en) * 2019-04-19 2019-08-02 成都飞机工业(集团)有限责任公司 It is a kind of based on unmanned plane multi-sensor information fusion technology warship bootstrap technique
CN111680541A (en) * 2020-04-14 2020-09-18 华中科技大学 A Multimodal Sentiment Analysis Method Based on Multidimensional Attention Fusion Network
CN112418095A (en) * 2020-11-24 2021-02-26 华中师范大学 A Facial Expression Recognition Method and System Combined with Attention Mechanism
CN112541409A (en) * 2020-11-30 2021-03-23 北京建筑大学 Attention-integrated residual network expression recognition method
CN114764941A (en) * 2022-04-25 2022-07-19 深圳技术大学 Expression recognition method and device and electronic equipment
CN115424313A (en) * 2022-07-20 2022-12-02 河海大学常州校区 Expression recognition method and device based on deep and shallow layer multi-feature fusion
CN115862091A (en) * 2022-11-09 2023-03-28 暨南大学 Facial expression recognition method, device, equipment and medium based on Emo-ResNet
CN115984930A (en) * 2022-12-26 2023-04-18 中国电信股份有限公司 Micro expression recognition method and device and micro expression recognition model training method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118365974A (en) * 2024-06-20 2024-07-19 山东省水利科学研究院 Water quality class detection method, system and equipment based on hybrid neural network

Also Published As

Publication number Publication date
CN116189272B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
Zhang et al. Depth-wise separable convolutions and multi-level pooling for an efficient spatial CNN-based steganalysis
CN109615582B (en) A Face Image Super-resolution Reconstruction Method Based on Attribute Description Generative Adversarial Network
CN109948691B (en) Image description generation method and device based on depth residual error network and attention
CN111626300B (en) Image segmentation method and modeling method of image semantic segmentation model based on context perception
CN108596024B (en) Portrait generation method based on face structure information
CN110796111B (en) Image processing method, device, equipment and storage medium
CN110689599B (en) 3D visual saliency prediction method based on non-local enhancement generation countermeasure network
CN109829959B (en) Facial analysis-based expression editing method and device
CN107437096A (en) Image classification method based on the efficient depth residual error network model of parameter
CN115565238B (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
CN110175248B (en) A face image retrieval method and device based on deep learning and hash coding
CN111984772A (en) Medical image question-answering method and system based on deep learning
CN113255788B (en) Method and system for generating confrontation network face correction based on two-stage mask guidance
CN115393933A (en) A video face emotion recognition method based on frame attention mechanism
CN116580278A (en) Lip language identification method, equipment and storage medium based on multi-attention mechanism
CN115830666A (en) A video expression recognition method and application based on spatio-temporal feature decoupling
CN117876793A (en) A hyperspectral image tree species classification method and device
CN116189272A (en) Facial expression recognition method and system based on feature fusion and attention mechanism
CN114821770B (en) Text-to-image cross-modal person re-identification methods, systems, media and devices
CN115984700A (en) Remote sensing image change detection method based on improved Transformer twin network
CN117036368A (en) Image data processing method, device, computer equipment and storage medium
CN118918336A (en) Image change description method based on visual language model
CN116778577A (en) Lip language identification method based on deep learning
CN116311455A (en) An Expression Recognition Method Based on Improved Mobile-former
CN114782995A (en) A Human Interaction Behavior Detection Method Based on Self-Attention Mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20241218
Address after: Room 401, 4th Floor, Building 6, No. 6 Fengxin Road, Yuhuatai District, Nanjing City, Jiangsu Province, 210012
Patentee after: Nanjing EasyVision Cuizhi Technology Co.,Ltd.
Country or region after: China
Address before: No. 9, Wenyuan Road, Qixia District, Nanjing, Jiangsu 210033
Patentee before: NANJING University OF POSTS AND TELECOMMUNICATIONS
Country or region before: China