CN116189272A - Facial expression recognition method and system based on feature fusion and attention mechanism - Google Patents

Facial expression recognition method and system based on feature fusion and attention mechanism

Info

Publication number
CN116189272A
Authority
CN
China
Prior art keywords
facial expression
feature
output
neural network
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310493454.6A
Other languages
Chinese (zh)
Other versions
CN116189272B (en)
Inventor
陈昌红
卢妍菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Easyvision Cuizhi Technology Co ltd
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310493454.6A priority Critical patent/CN116189272B/en
Publication of CN116189272A publication Critical patent/CN116189272A/en
Application granted granted Critical
Publication of CN116189272B publication Critical patent/CN116189272B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a facial expression recognition method and system based on feature fusion and an attention mechanism. The method comprises the following steps: (1) preprocessing an acquired facial expression data set; (2) constructing a facial expression recognition neural network model; (3) extracting two intermediate-layer features and the last-layer features of a ResNet50 convolutional neural network; (4) concatenating the feature maps output by the two intermediate layers to obtain a weighted feature vector; (5) performing convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively; (6) performing a secondary concatenation of output result one and output result two in a Transformer network model; and (7) sending the secondarily concatenated result to a fully connected layer, whose output is input into a softmax classifier for classification to obtain an expression classification result. The method can improve the accuracy of facial expression recognition.

Description

Facial expression recognition method and system based on feature fusion and attention mechanism

Technical Field

The invention relates to a facial expression recognition method and belongs to the technical field of image processing.

Background Art

Besides language, facial expressions are an important carrier for expressing inner emotions. In recent years, facial expression recognition (FER) has flourished in fields such as the Internet of Things, artificial intelligence, and mental health assessment, and has received wide attention and application from all walks of life.

However, existing expression recognition is mainly based on hand-crafted features and machine learning methods, which suffer from the following main defects: hand-crafted features usually introduce unavoidable human factors and errors; human intervention is required to extract useful recognition features from the original image; and it is difficult to obtain deep, high-level semantic features and deep features from the original image.

To obtain deep high-level semantic features, networks use ever more convolutional layers, but enhancing a network's learning ability by adding layers is not always feasible: once the network reaches a certain depth, adding further layers causes the vanishing-gradient problem in stochastic training and also degrades the network's accuracy. The traditional remedy is data initialization and regularization, which solves the vanishing-gradient problem but does not improve the network's accuracy.

Summary of the Invention

The technical problem to be solved by the present invention: in the process of facial expression recognition, how to obtain deep, high-level semantic features and thereby achieve a better facial expression recognition effect.

To solve the above technical problem, the present invention provides a facial expression recognition method based on feature fusion and an attention mechanism, comprising the following steps:

(1) preprocessing the acquired facial expression data set;

(2) constructing a facial expression recognition neural network model, comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;

(3) extracting two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;

(4) concatenating the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, thereby achieving the fusion of features at different levels;

(5) performing convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively, and inputting output result one and output result two into the Transformer network model;

(6) in the Transformer network model, performing a secondary concatenation of output result one and output result two;

(7) downsampling the secondarily concatenated result, sending it to a fully connected layer, and finally inputting it into a softmax classifier for classification to obtain the expression classification result.

In the aforementioned facial expression recognition method based on feature fusion and an attention mechanism, in step (1), preprocessing the acquired facial expression data set comprises the following steps:

creating a PIL object, so that all operations on images in the facial expression data set are based on the PIL object;

resizing the facial expression images to 224×224, and randomly flipping the input data horizontally according to a given flip probability p;

normalizing the horizontally flipped input data;

loading the normalized data set images into the facial expression recognition neural network model.

In the aforementioned facial expression recognition method based on feature fusion and an attention mechanism, in step (2), the ResNet50 convolutional neural network structure comprises seven parts:

the first part is used to pad the input image;

the second part contains no residual block and is used to perform convolution, normalization, activation function, and max-pooling computations on the input image data in sequence;

the third, fourth, fifth, and sixth parts each contain several residual blocks, and each residual block has three convolutional layers;

the seventh part comprises an average pooling layer and a fully connected layer; the image data output by the sixth part passes through the average pooling layer and the fully connected layer in turn, and the resulting feature map is output.

In the aforementioned facial expression recognition method based on feature fusion and an attention mechanism, in step (4), the feature maps output by the two intermediate layers are concatenated along the channel dimension; the feature tensors output by the two intermediate layers are of size 512×60×60 and 1024×60×60 respectively, and after concatenation a weighted feature vector of size 1536×60×60 is obtained.

In the aforementioned facial expression recognition method based on feature fusion and an attention mechanism, in step (5), the last-layer features and the weighted feature vector are each further passed through two convolutional layers, namely a 1×1 convolutional layer and a 3×3 convolutional layer; the 1×1 convolutional layer is used to compress the original channel number 2048 to 256, and the 3×3 convolutional layer is used for feature fusion, yielding output result one and output result two, respectively.

In the aforementioned facial expression recognition method based on feature fusion and an attention mechanism, in step (6), Q, K, and V denote the query vector, key vector, and value vector, respectively, where key and value vectors appear in pairs; output result one and output result two from step (5) are input into the RKTM module as the query vector and key vector, respectively.

In ordinary differential equations, the Euler formula is expressed as:

y_{n+1} = y_n + h f(t_n, y_n),

the residual connection adopted by the ResNet50 convolutional neural network is expressed as:

x_{l+1} = x_l + f(x_l, W_l),

and solving the Transformer network model with the second-order Runge-Kutta formula yields:

x_{l+1} = x_l + (1/2) (g_1(x_l, W) + g_2(x_l + g_1(x_l, W), W)),

where t denotes time, g denotes the Transformer network model, W denotes the model parameters from the ResNet50 convolutional neural network, and g_1 and g_2 denote attention submodule one and attention submodule two in the RKTM module, respectively.

For an input image I ∈ R^{3×H×W}, features are first extracted with the ResNet50 convolutional neural network to obtain F ∈ R^{C×H×W}, where F is the feature, R is the set of real numbers, and C, H, W denote the number of channels, height, and width, respectively; after dimensionality reduction of the multidimensional data, letting parameter one N = H×W, we have F ∈ R^{C×N}, and the feature size is written as b×C×N, where b denotes the number of samples in each training batch.

The computation process of the Transformer network model g is as follows:

Let the number of attention heads be n; reshape the feature F ∈ R^{b×C×N} into F ∈ R^{b×n×d×N}, where

parameter two d = C/n;

after swapping the two channels corresponding to parameter two d and parameter one N, F ∈ R^{b×n×N×d} is obtained; setting matrix one W_Q, matrix two W_K, and matrix three W_V as learnable parameters, we obtain

Q = F W_Q, K = F W_K, V = F W_V,

the query vector Q is multiplied by the transpose K^T of the key vector K, and a softmax operation is applied along the last dimension to obtain the attention score matrix A; the computation is as follows:

A = softmax(Q K^T),

the attention score matrix is then multiplied by the value vector V to obtain the output

F_out = A V, (6)

the shape of the output F_out is b×n×N×d; substituting the output of equation (6) into the second-order Runge-Kutta formula yields the expression of the Transformer network model g.

A facial expression recognition system based on feature fusion and an attention mechanism comprises the following modules:

a preprocessing module, which preprocesses the acquired facial expression data set;

a neural network model construction module, which constructs a facial expression recognition neural network model comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;

an information extraction module, which extracts two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;

a primary concatenation module, which concatenates the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, thereby achieving feature fusion;

a convolution operation module, which performs convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively, and inputs output result one and output result two into the Transformer network model;

a secondary concatenation module, which performs a secondary concatenation of output result one and output result two in the Transformer network model;

a classification module, which downsamples the secondarily concatenated result, sends it to a fully connected layer, and finally inputs it into a softmax classifier for classification to obtain the expression classification result.

A computer-readable storage medium stores a computer program which, when executed by a processor, implements the facial expression recognition method based on feature fusion and an attention mechanism as described above.

An embedded device is configured with a trusted execution environment, the trusted execution environment comprising:

a memory for storing instructions;

a processor for executing the instructions, so that the embedded device implements the facial expression recognition method based on feature fusion and an attention mechanism as described above.

Beneficial effects achieved by the present invention: the facial expression recognition method based on feature fusion and an attention mechanism of the present invention is built on the ResNet50 neural network, whose residual modules solve the gradient problem and whose depth yields more expressive features and correspondingly stronger detection and classification performance, while reducing the number of parameters and, to a certain extent, the amount of computation. The defining characteristics of the Transformer model are its self-attention mechanism (Self-Attention) and residual connection structure (Residual Connection); compared with traditional sequence models, it can fully consider the information of all positions in the input sequence from a global perspective, so deeper networks can be trained effectively, achieving a twofold improvement in recognition accuracy while accelerating the training process.

Meanwhile, by solving the second-order Runge-Kutta formula, the present invention obtains a model g with stronger generalization ability, that is, a model that also classifies well on training data other than those of this embodiment.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the overall network structure of Embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of the ResNet50 convolutional neural network structure;

FIG. 3 is a schematic diagram of the structure of the RKTM module;

FIG. 4 is a schematic diagram of the recognition accuracy of the method of the present invention;

FIG. 5 is a schematic diagram of the accuracy of directly training the ResNet50 convolutional neural network.

Detailed Description

The technical solution of the present invention is further described below with reference to the accompanying drawings and specific embodiments.

Embodiment 1

This embodiment uses the public facial expression data set FER2013, which consists of 35,886 images of different facial expressions; each image is a grayscale image with a fixed size of 48×48, and there are 7 expression categories: anger, disgust, fear, happiness, sadness, surprise, and neutral. The official release stores the expression data in a csv file, which can be converted to and stored as image data, for example as sketched below.
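
The conversion from the official csv to image files might look like the following minimal sketch, assuming the standard fer2013.csv layout with emotion, pixels, and Usage columns, where pixels holds 48×48 space-separated grayscale values; the output directory layout is an illustrative choice:

```python
# Sketch: convert fer2013.csv rows into 48x48 grayscale PNG files,
# grouped by data split (Usage) and expression label.
import os
import numpy as np
import pandas as pd
from PIL import Image

LABELS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def csv_to_images(csv_path: str, out_dir: str) -> None:
    df = pd.read_csv(csv_path)
    for i, row in df.iterrows():
        # "pixels" is a string of 48*48 space-separated grayscale values.
        pixels = np.asarray(row["pixels"].split(), dtype=np.uint8).reshape(48, 48)
        label_dir = os.path.join(out_dir, row["Usage"], LABELS[row["emotion"]])
        os.makedirs(label_dir, exist_ok=True)
        Image.fromarray(pixels, mode="L").save(os.path.join(label_dir, f"{i}.png"))

# csv_to_images("fer2013.csv", "fer2013_images")
```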

A facial expression recognition method based on feature fusion and an attention mechanism comprises the following steps:

1) Preprocess the acquired facial expression data set, comprising the following steps (a sketch follows this list):

create a PIL object, so that all operations on images in the facial expression data set are based on the PIL object;

resize the facial expression images to 224×224, and randomly flip the input data horizontally with the default flip probability p = 0.5;

normalize the horizontally flipped input data, with mean [0.485, 0.456, 0.406] and standard deviation (std) [0.229, 0.224, 0.225];

load the normalized data set images into the facial expression recognition neural network model; through preprocessing, the data in the data set are augmented to enrich the training data.
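
These preprocessing steps map directly onto a torchvision transform pipeline; the following is a minimal sketch, where the image directory and loader settings are illustrative assumptions rather than the patent's setup:

```python
# Sketch: PIL-based loading, 224x224 resize, random horizontal flip (p=0.5),
# and normalization with the mean/std listed above.
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

train_transform = T.Compose([
    T.Resize((224, 224)),                  # resize to 224x224
    T.RandomHorizontalFlip(p=0.5),         # default flip probability
    T.ToTensor(),                          # PIL image -> float tensor in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# Hypothetical folder produced by the csv-to-image conversion sketched earlier;
# ImageFolder loads images as RGB PIL objects by default.
train_set = ImageFolder("fer2013_images/Training", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```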

2) Construct the facial expression recognition neural network model. FIG. 1 is a schematic diagram of the overall neural network structure of Embodiment 1, comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism. Compared with a convolutional neural network, the Transformer model has a global receptive field: the distance between any two pixels is the same, and the relationships between vectors across the entire feature map can be measured. The RKTM serves as the main encoding functional module of the encoder.

As shown in FIG. 2, the ResNet50 convolutional neural network structure comprises seven parts:

the first part (stage0) pads the input image, with padding parameters (3, 3);

the second part (stage1) contains no residual block and performs convolution, normalization, activation function, and max-pooling computations on the input image data in sequence;

the third, fourth, fifth, and sixth parts each contain several residual blocks, numbering 3, 4, 6, and 3 respectively; each residual block has three convolutional layers with kernel sizes 1×1, 3×3, and 1×1, all with stride 1 during the convolution operation, except that the second convolution uses padding (1, 1), that is, a ring of zeros is added around the input image data;

the seventh part comprises an average pooling layer and a fully connected layer; the image data output by the sixth part passes through the average pooling layer and the fully connected layer (avgpool, fc) in turn and then outputs the result, turning the 224×224 input image into a 56×56 feature map and greatly reducing storage space. The stage layout can be checked against a stock ResNet50, as in the snippet below.
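
As a quick sanity check of the block counts, torchvision's stock ResNet50 exposes the four residual stages directly (the layer1 to layer4 naming is torchvision's, not the patent's):

```python
# Sketch: confirm that ResNet50's four residual stages hold 3, 4, 6, 3
# bottleneck blocks, matching the third to sixth parts described above.
from torchvision.models import resnet50

net = resnet50()
for name in ["layer1", "layer2", "layer3", "layer4"]:
    print(name, len(getattr(net, name)), "blocks")
# layer1 3 blocks, layer2 4 blocks, layer3 6 blocks, layer4 3 blocks
```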

3) The output features of the fourth and fifth parts of the ResNet50 convolutional neural network contain rich image structure information, and both are called intermediate layers. The sixth part is the last part of the ResNet50 network that contains convolution operations; its output features contain rich semantic features, and it is called the last layer. Since the ResNet50 network is trained on ImageNet, which corresponds to a classification task, the last-layer output of the feature extractor consists of semantic features.

4) Concatenate the feature maps output by the two intermediate layers along the channel dimension. The feature tensors output by the two intermediate layers are of size 512×60×60 and 1024×60×60 respectively; after concatenation, a weighted feature vector of size 1536×60×60 is obtained, realizing the fusion of features at different levels. A sketch of steps 3) and 4) follows.
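
A minimal sketch of the two taps and the channel concatenation, assuming the patent's fourth, fifth, and sixth parts correspond to torchvision's layer2, layer3, and layer4; note that torchvision's default strides give 28×28 and 14×14 maps for a 224×224 input rather than the 60×60 maps stated above, so the smaller map is upsampled here before concatenation:

```python
# Sketch: tap two intermediate stages and the last conv stage of ResNet50,
# then concatenate the two intermediate feature maps along the channel axis.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

backbone = resnet50(weights="IMAGENET1K_V1")

def extract_features(x: torch.Tensor):
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    x = backbone.layer1(x)
    mid1 = backbone.layer2(x)     # intermediate features, 512 channels
    mid2 = backbone.layer3(mid1)  # intermediate features, 1024 channels
    last = backbone.layer4(mid2)  # last conv features, 2048 channels
    # Align spatial sizes before channel concatenation.
    mid2_up = F.interpolate(mid2, size=mid1.shape[-2:],
                            mode="bilinear", align_corners=False)
    fused = torch.cat([mid1, mid2_up], dim=1)  # 512 + 1024 = 1536 channels
    return fused, last

fused, last = extract_features(torch.randn(2, 3, 224, 224))
print(fused.shape, last.shape)  # [2, 1536, 28, 28] and [2, 2048, 7, 7]
```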

5) Pass the last-layer features and the weighted feature vector each through two further convolutional layers, namely a 1×1 convolutional layer and a 3×3 convolutional layer; the 1×1 convolutional layer compresses the original channel number 2048 to 256, and the 3×3 convolutional layer performs feature fusion, yielding output result one and output result two, respectively. This step ensures that both output results can be fed smoothly into the following Transformer network model, a deep learning model that uses a self-attention mechanism; a sketch of the two branches follows.
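
The two projection branches might look like the following minimal sketch; applying the same 1×1 then 3×3 pattern to the 1536-channel fused branch is an assumption, since the patent states the channel compression explicitly only for the 2048-channel last-layer branch:

```python
# Sketch: each branch is a 1x1 conv (channel compression) followed by
# a 3x3 conv (feature fusion).
import torch
import torch.nn as nn

def projection(in_channels: int, out_channels: int = 256) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1),             # compress channels
        nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1), # fuse features
    )

proj_last = projection(2048)  # last-layer branch: 2048 -> 256
proj_mid = projection(1536)   # fused intermediate branch: 1536 -> 256

out1 = proj_last(torch.randn(2, 2048, 7, 7))   # output result one
out2 = proj_mid(torch.randn(2, 1536, 28, 28))  # output result two
print(out1.shape, out2.shape)
```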

6) In the Transformer network model, perform a secondary concatenation of output result one and output result two.

7) Downsample the concatenated features (the purpose is feature extraction), send them to the fully connected layer to obtain the final feature vector, and finally input them into the softmax classifier, which outputs class probabilities, giving the expression classification result.

8) Validation is performed on the public FER2013 facial expression data set. As shown in FIG. 4, the recognition accuracy of the method of the present invention reaches 65%, while directly training the ResNet50 network only reaches 57%, as shown in FIG. 5; through feature fusion and the embedded improved attention mechanism, this embodiment raises facial expression recognition accuracy on this data set by 8 percentage points. Extensive research results show that deep features extracted by convolutional neural networks are robust to deformations such as translation, rotation, and scaling, and that different convolutional layers extract features at different levels, effectively characterizing both the local and global properties of the image; the model of this embodiment therefore has better robustness.

Embodiment 2

A facial expression recognition method based on feature fusion and an attention mechanism comprises the following steps:

(1) preprocessing the acquired facial expression data set;

(2) constructing a facial expression recognition neural network model, comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;

(3) extracting two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;

(4) concatenating the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, thereby achieving the fusion of features at different levels;

(5) performing convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively, and inputting output result one and output result two into the Transformer network model;

(6) in the Transformer network model, performing a secondary concatenation of output result one and output result two;

(7) downsampling the secondarily concatenated result, sending it to a fully connected layer, and finally inputting it into a softmax classifier for classification to obtain the expression classification result.

In step 6), FIG. 3 shows the structure of the RKTM module, i.e., the multi-head self-attention module serving as the encoder of the Transformer model, where Q, K, and V denote the query vector, key vector, and value vector, respectively; the key and value vectors appear in pairs, and both depend on the input.

Input output result one and output result two from step 5) into the RKTM module as the query vector and key vector, respectively;

In ordinary differential equations, the Euler formula is expressed as:

y_{n+1} = y_n + h f(t_n, y_n),

the residual connection adopted by the ResNet50 convolutional neural network is expressed as:

x_{l+1} = x_l + f(x_l, W_l),

The Euler formula is the first-order form of the Runge-Kutta formula; solving the Transformer network model with the second-order Runge-Kutta formula yields:

x_{l+1} = x_l + (1/2) (g_1(x_l, W) + g_2(x_l + g_1(x_l, W), W)),

where t denotes time, g denotes the Transformer network model, W denotes the model parameters from the ResNet50 convolutional neural network, and g_1 and g_2 denote attention submodule one and attention submodule two in the RKTM module, respectively.

For an input image I ∈ R^{3×H×W}, features are first extracted with the ResNet50 convolutional neural network to obtain F ∈ R^{C×H×W}, where F is the feature, R is the set of real numbers, and C, H, W denote the number of channels, height, and width, respectively; after dimensionality reduction of the multidimensional data, letting parameter one N = H×W, we have F ∈ R^{C×N}. Since deep learning uses mini-batch training, the feature size is written as b×C×N, where b, the batch_size, is the number of samples in each training batch;

The computation process of the Transformer network model g is as follows:

Let the number of attention heads be n; reshape the feature F ∈ R^{b×C×N} into F ∈ R^{b×n×d×N}, where

parameter two d = C/n;

after swapping the two channels corresponding to parameter two d and parameter one N, F ∈ R^{b×n×N×d} is obtained; setting matrix one W_Q, matrix two W_K, and matrix three W_V as learnable parameters, we obtain

Q = F W_Q, K = F W_K, V = F W_V,

the query vector Q is multiplied by the transpose K^T of the key vector K (a dot-product computation), and a softmax operation is applied along the last dimension to obtain the attention score matrix A; the computation is as follows:

A = softmax(Q K^T),

the attention scores measure the pairwise similarity between features; the attention score matrix is then multiplied by the value vector V to obtain the output

F_out = A V, (6)

the shape of the output F_out is b×n×N×d, and it can be seen that F_out and F keep the same spatial dimensions, so the output of equation (6) is substituted into the second-order Runge-Kutta formula to obtain the expression of the Transformer network model g. This step yields the concrete form of the Transformer network model; a sketch follows.

After output result one and output result two from step 5) are each processed by the Transformer network model, two output features are obtained, and a secondary concatenation operation is applied to them: 64×7×7 concatenated with 64×7×7 becomes 128×7×7, realizing the secondary concatenation of features. A sketch of this concatenation and the classification head follows.
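
The secondary concatenation together with the downsampling, fully connected layer, and softmax classifier of step (7) might look like the following minimal sketch; global average pooling as the downsampling operation and the layer sizes beyond the stated 64×7×7 and 128×7×7 shapes are assumptions, and the 7 classes follow the FER2013 categories:

```python
# Sketch: concatenate the two 64x7x7 Transformer outputs into 128x7x7,
# downsample, pass through a fully connected layer, and apply softmax.
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)    # downsampling (assumed: global average pooling)
        self.fc = nn.Linear(128, num_classes)  # fully connected layer

    def forward(self, out1: torch.Tensor, out2: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([out1, out2], dim=1)  # 64x7x7 + 64x7x7 -> 128x7x7
        logits = self.fc(self.pool(fused).flatten(1))
        return torch.softmax(logits, dim=1)     # class probabilities

probs = ClassifierHead()(torch.randn(2, 64, 7, 7), torch.randn(2, 64, 7, 7))
print(probs.shape)  # torch.Size([2, 7])
```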

A facial expression recognition system based on feature fusion and an attention mechanism comprises the following modules:

a preprocessing module, which preprocesses the acquired facial expression data set;

a neural network model construction module, which constructs a facial expression recognition neural network model comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;

an information extraction module, which extracts two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;

a primary concatenation module, which concatenates the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, thereby achieving feature fusion;

a convolution operation module, which performs convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively, and inputs output result one and output result two into the Transformer network model;

a secondary concatenation module, which performs a secondary concatenation of output result one and output result two in the Transformer network model;

a classification module, which downsamples the secondarily concatenated result, sends it to a fully connected layer, and finally inputs it into a softmax classifier for classification to obtain the expression classification result.

A computer-readable storage medium stores a computer program which, when executed by a processor, implements the facial expression recognition method based on feature fusion and an attention mechanism as described above.

An embedded device is configured with a trusted execution environment, the trusted execution environment comprising:

a memory for storing instructions;

a processor for executing the instructions, so that the embedded device implements the facial expression recognition method based on feature fusion and an attention mechanism as described above.

Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the technical principles of the present invention, and these improvements and modifications should also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A facial expression recognition method based on feature fusion and an attention mechanism, characterized by comprising the following steps:

(1) preprocessing the acquired facial expression data set;

(2) constructing a facial expression recognition neural network model, comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;

(3) extracting two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;

(4) concatenating the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, achieving the fusion of features at different levels;

(5) performing convolution operations on the last-layer features and the weighted feature vector simultaneously to obtain output result one and output result two, respectively, and inputting output result one and output result two into the Transformer network model;

(6) in the Transformer network model, performing a secondary concatenation of output result one and output result two;

(7) downsampling the secondarily concatenated result, sending it to a fully connected layer, and finally inputting it into a softmax classifier for classification to obtain the expression classification result.

2. The facial expression recognition method based on feature fusion and an attention mechanism according to claim 1, characterized in that, in step (1), preprocessing the acquired facial expression data set comprises the following steps:

creating a PIL object, so that all operations on images in the facial expression data set are based on the PIL object;

resizing the facial expression images to 224×224, and randomly flipping the input data horizontally according to a given flip probability p;

normalizing the horizontally flipped input data;

loading the normalized data set images into the facial expression recognition neural network model.

3. The facial expression recognition method based on feature fusion and an attention mechanism according to claim 1, characterized in that, in step (2), the ResNet50 convolutional neural network structure comprises seven parts:

the first part is used to pad the input image;

the second part contains no residual block and is used to perform convolution, normalization, activation function, and max-pooling computations on the input image data in sequence;

the third, fourth, fifth, and sixth parts each contain several residual blocks, each residual block having three convolutional layers;

the seventh part comprises an average pooling layer and a fully connected layer; the image data output by the sixth part passes through the average pooling layer and the fully connected layer in turn, and the resulting feature map is output.

4. The facial expression recognition method based on feature fusion and an attention mechanism according to claim 1, characterized in that, in step (4), the feature maps output by the two intermediate layers are concatenated along the channel dimension; the feature tensors output by the two intermediate layers are of size 512×60×60 and 1024×60×60 respectively, and after concatenation a weighted feature vector of size 1536×60×60 is obtained.

5. The facial expression recognition method based on feature fusion and an attention mechanism according to claim 1, characterized in that, in step (5), the last-layer features and the weighted feature vector are each further passed through two convolutional layers, namely a 1×1 convolutional layer and a 3×3 convolutional layer; the 1×1 convolutional layer is used to compress the original channel number 2048 to 256, and the 3×3 convolutional layer is used for feature fusion, yielding output result one and output result two, respectively.

6. The facial expression recognition method based on feature fusion and an attention mechanism according to claim 1, characterized in that, in step (6), Q, K, and V denote the query vector, key vector, and value vector, respectively; output result one and output result two from step (5) are input into the RKTM module as the query vector and key vector, respectively;

In ordinary differential equations, the Euler formula is expressed as:
y_{n+1} = y_n + h f(t_n, y_n),

the residual connection adopted by the ResNet50 convolutional neural network is expressed as:

x_{l+1} = x_l + f(x_l, W_l),

and solving the Transformer network model with the second-order Runge-Kutta formula yields:

x_{l+1} = x_l + (1/2) (g_1(x_l, W) + g_2(x_l + g_1(x_l, W), W)),

where t denotes time, g denotes the Transformer network model, W denotes the model parameters from the ResNet50 convolutional neural network, and g_1 and g_2 denote attention submodule one and attention submodule two in the RKTM module, respectively;

for an input image I ∈ R^{3×H×W}, features are first extracted with the ResNet50 convolutional neural network to obtain F ∈ R^{C×H×W}, where F is the feature, R is the set of real numbers, and C, H, W denote the number of channels, height, and width, respectively; after dimensionality reduction of the multidimensional data, letting parameter one N = H×W, we have F ∈ R^{C×N}, and the feature size is written as b×C×N, where b denotes the number of samples in each training batch;

the computation process of the Transformer network model g is:

let the number of attention heads be n, and reshape the feature F into F ∈ R^{b×n×d×N}, where

parameter two d = C/n;

after swapping the two channels corresponding to parameter two d and parameter one N, F ∈ R^{b×n×N×d} is obtained; setting matrix one W_Q, matrix two W_K, and matrix three W_V as learnable parameters gives

Q = F W_Q, K = F W_K, V = F W_V,

the query vector Q is multiplied by the transpose K^T of the key vector K, and a softmax operation is applied along the last dimension to obtain the attention score matrix A, computed as:

A = softmax(Q K^T),

the attention score matrix is then multiplied by the value vector V to obtain the output

F_out = A V, (6)

the shape of the output F_out is b×n×N×d, and substituting the output of equation (6) into the second-order Runge-Kutta formula yields the expression of the Transformer network model g.
7. A facial expression recognition system based on feature fusion and attention mechanism, characterized by comprising the following modules:
a preprocessing module, for preprocessing the acquired facial expression data set;
a neural network model construction module, for constructing a facial expression recognition neural network model comprising a ResNet50 convolutional neural network and a Transformer model with a multi-head self-attention mechanism;
an information extraction module, for extracting two intermediate-layer features and the last-layer features of the ResNet50 convolutional neural network, wherein the intermediate-layer features contain image structure information and the last-layer features contain semantic features;
a first concatenation module, for concatenating the feature maps output by the two intermediate layers along the channel dimension to obtain a weighted feature vector, thereby fusing features of different levels;
a convolution module, for simultaneously performing convolution on the last-layer features and the weighted feature vector to obtain output result one and output result two respectively, and feeding output result one and output result two into the Transformer network model;
a second concatenation module, for concatenating output result one and output result two a second time within the Transformer network model;
a classification module, for downsampling the result of the second concatenation, feeding it into a fully connected layer, and finally into a softmax classifier to obtain the expression classification result.
8. The facial expression recognition system based on feature fusion and attention mechanism according to claim 7, characterized in that, in the neural network model construction module, the ResNet50 convolutional neural network structure comprises seven parts:
the first part pads the input image with parameters;
the second part contains no residual block and performs convolution, regularization, an activation function, and max pooling on the input image data in sequence;
the third, fourth, fifth, and sixth parts each contain several residual blocks, each residual block having three convolutional layers;
the seventh part comprises an average pooling layer and a fully connected layer, and the image data output by the sixth part passes through the average pooling layer and the fully connected layer in turn to output the resulting feature map.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the facial expression recognition method based on feature fusion and attention mechanism according to any one of claims 1 to 6.
10. An embedded device, the embedded device being configured with a trusted execution environment, the trusted execution environment comprising:
a memory, for storing instructions;
a processor, for executing the instructions so that the embedded device implements the facial expression recognition method based on feature fusion and attention mechanism according to any one of claims 1 to 6.
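Below is a minimal PyTorch sketch of the system of claims 7 and 8. It assumes layer2 and layer3 of torchvision's ResNet50 as the "two intermediate layers", 1x1 convolutions as the convolution module, and mean pooling over tokens as the downsampling step; the embedding dimension, head count, and layer choices are illustrative assumptions, not values fixed by the claims.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class FeatureFusionFER(nn.Module):
    # Sketch of the claimed pipeline: two intermediate feature maps are
    # concatenated along the channel dimension, the fused vector and the
    # last-layer features are each convolved, both results are fused in a
    # Transformer, concatenated again, downsampled, and classified.
    def __init__(self, num_classes: int = 7, dim: int = 256):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1 = backbone.layer1
        self.layer2 = backbone.layer2   # 512-channel intermediate features
        self.layer3 = backbone.layer3   # 1024-channel intermediate features
        self.layer4 = backbone.layer4   # 2048-channel last-layer features
        self.conv_mid = nn.Conv2d(512 + 1024, dim, kernel_size=1)
        self.conv_last = nn.Conv2d(2048, dim, kernel_size=1)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                         batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)
        x = self.layer1(x)
        f2 = self.layer2(x)          # structure information (intermediate)
        f3 = self.layer3(f2)         # structure information (intermediate)
        f4 = self.layer4(f3)         # semantic features (last layer)
        # First concatenation: align spatial size, join channel dimension.
        f2 = F.adaptive_avg_pool2d(f2, f3.shape[-2:])
        fused = torch.cat([f2, f3], dim=1)
        out1 = self.conv_last(f4)    # output result one
        out2 = self.conv_mid(fused)  # output result two
        # Flatten feature maps into token sequences for the Transformer.
        t1 = out1.flatten(2).transpose(1, 2)
        t2 = out2.flatten(2).transpose(1, 2)
        tokens = self.transformer(torch.cat([t1, t2], dim=1))  # second concatenation
        pooled = tokens.mean(dim=1)  # downsampling
        return self.fc(pooled)       # softmax is applied by the classifier/loss

With a 224x224 input, out1 is 7x7 and out2 is 14x14, so the Transformer attends jointly over 49 + 196 = 245 tokens carrying both structural (mid-level) and semantic (last-layer) information, which is the fusion the claims describe.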
CN202310493454.6A 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism Active CN116189272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310493454.6A CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310493454.6A CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Publications (2)

Publication Number Publication Date
CN116189272A (en) 2023-05-30
CN116189272B (en) 2023-07-07

Family

ID=86433105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310493454.6A Active CN116189272B (en) 2023-05-05 2023-05-05 Facial expression recognition method and system based on feature fusion and attention mechanism

Country Status (1)

Country Link
CN (1) CN116189272B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110081881A (en) * 2019-04-19 2019-08-02 成都飞机工业(集团)有限责任公司 It is a kind of based on unmanned plane multi-sensor information fusion technology warship bootstrap technique
CN111680541A (en) * 2020-04-14 2020-09-18 华中科技大学 A Multimodal Sentiment Analysis Method Based on Multidimensional Attention Fusion Network
CN112418095A (en) * 2020-11-24 2021-02-26 华中师范大学 A Facial Expression Recognition Method and System Combined with Attention Mechanism
CN112541409A (en) * 2020-11-30 2021-03-23 北京建筑大学 Attention-integrated residual network expression recognition method
CN114764941A (en) * 2022-04-25 2022-07-19 深圳技术大学 Expression recognition method and device and electronic equipment
CN115424313A (en) * 2022-07-20 2022-12-02 河海大学常州校区 Expression recognition method and device based on deep and shallow layer multi-feature fusion
CN115862091A (en) * 2022-11-09 2023-03-28 暨南大学 Facial expression recognition method, device, equipment and medium based on Emo-ResNet
CN115984930A (en) * 2022-12-26 2023-04-18 中国电信股份有限公司 Micro expression recognition method and device and micro expression recognition model training method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118365974A (en) * 2024-06-20 2024-07-19 山东省水利科学研究院 Water quality class detection method, system and equipment based on hybrid neural network

Also Published As

Publication number Publication date
CN116189272B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
Zhang et al. Depth-wise separable convolutions and multi-level pooling for an efficient spatial CNN-based steganalysis
CN109615582B (en) A Face Image Super-resolution Reconstruction Method Based on Attribute Description Generative Adversarial Network
CN109948691B (en) Image description generation method and device based on depth residual error network and attention
CN111626300B (en) Image segmentation method and modeling method of image semantic segmentation model based on context perception
CN108596024B (en) Portrait generation method based on face structure information
CN110796111B (en) Image processing method, device, equipment and storage medium
CN110689599B (en) 3D visual saliency prediction method based on non-local enhancement generation countermeasure network
CN109829959B (en) Facial analysis-based expression editing method and device
CN107437096A (en) Image classification method based on the efficient depth residual error network model of parameter
CN115565238B (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
CN110175248B (en) A face image retrieval method and device based on deep learning and hash coding
CN111984772A (en) Medical image question-answering method and system based on deep learning
CN113255788B (en) Method and system for generating confrontation network face correction based on two-stage mask guidance
CN115393933A (en) A video face emotion recognition method based on frame attention mechanism
CN116580278A (en) Lip language identification method, equipment and storage medium based on multi-attention mechanism
CN115830666A (en) A video expression recognition method and application based on spatio-temporal feature decoupling
CN117876793A (en) A hyperspectral image tree species classification method and device
CN116189272A (en) Facial expression recognition method and system based on feature fusion and attention mechanism
CN114821770B (en) Text-to-image cross-modal person re-identification methods, systems, media and devices
CN115984700A (en) Remote sensing image change detection method based on improved Transformer twin network
CN117036368A (en) Image data processing method, device, computer equipment and storage medium
CN118918336A (en) Image change description method based on visual language model
CN116778577A (en) Lip language identification method based on deep learning
CN116311455A (en) An Expression Recognition Method Based on Improved Mobile-former
CN114782995A (en) A Human Interaction Behavior Detection Method Based on Self-Attention Mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20241218
Address after: Room 401, 4th Floor, Building 6, No. 6 Fengxin Road, Yuhuatai District, Nanjing City, Jiangsu Province, 210012
Patentee after: Nanjing EasyVision Cuizhi Technology Co.,Ltd.
Country or region after: China
Address before: No. 9, Wenyuan Road, Qixia District, Nanjing, Jiangsu 210033
Patentee before: NANJING University OF POSTS AND TELECOMMUNICATIONS
Country or region before: China