CN114842542A

CN114842542A - Facial action unit identification method and device based on self-adaptive attention and space-time correlation

Info

Publication number: CN114842542A
Application number: CN202210606040.5A
Authority: CN
Inventors: 邵志文; 周勇; 陈浩; 于清
Original assignee: China University of Mining and Technology CUMT
Current assignee: China University of Mining and Technology CUMT
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2022-08-02
Anticipated expiration: 2042-05-31
Also published as: CN114842542B

Abstract

The invention discloses a facial action unit recognition method and device based on adaptive attention and space-time correlation. The original continuous image frames required for model training are first extracted from video data to form a training data set, and then the original image frames are pre-processed. The amplified image frame sequence is obtained by processing, and then the convolutional neural network module I is constructed to extract the hierarchical multi-scale regional features of the amplified image frame, and then the convolutional neural network module II is constructed to perform adaptive attention regression learning of facial action units, and then construct The adaptive spatiotemporal graph convolutional neural network module III learns the spatiotemporal association of facial action units, and finally a fully connected layer module IV is constructed for facial action unit recognition. The invention adopts an end-to-end deep learning framework to learn action unit recognition, and utilizes the interdependence and temporal and spatial correlation between facial action units to effectively identify the movement of facial muscles in a two-dimensional image, and realize the construction of a facial action unit recognition system. .

Description

Facial action unit recognition method and device based on adaptive attention and spatiotemporal correlation

技术领域technical field

本发明涉及一种基于自适应注意力与时空关联的面部动作单元识别方法及装置，属于计算机视觉技术。The invention relates to a facial action unit recognition method and device based on adaptive attention and space-time correlation, which belong to computer vision technology.

背景技术Background technique

为了更精细地研究人类面部表情，美国著名情绪心理学家Ekman等于1978年首次提出了面部动作编码系统(Facial Action Coding System，FACS)，又于2002年作了重要改进。面部动作编码系统根据人脸的解剖学特点划分成若干既相互独立又相互联系的面部动作单元，通过这些面部动作单元的动作特征及其所控制的主要区域可以反映出面部表情。In order to study human facial expressions more carefully, the famous American emotional psychologist Ekman first proposed the Facial Action Coding System (FACS) in 1978, and made important improvements in 2002. The facial action coding system is divided into several facial action units that are both independent and interconnected according to the anatomical characteristics of the face. The facial expression can be reflected by the action features of these facial action units and the main areas they control.

随着计算机技术和信息技术的发展，深度学习技术得到了广泛的应用。在AU(面部动作单元)识别领域，基于深度学习模型研究AU识别已成为主流。目前，AU识别主要分成了两条研究路线：区域学习与AU关联学习。若不考虑AU之间的关联，一般来说仅有其对应面部肌肉所在的几块稀疏区域对它的识别是有贡献的，其他区域则不需要过多关注，因此找到那些需要关注的区域并加以重点学习才能更好地进行AU识别，专注于这一问题的解决方案一般被称为区域学习(Region Learning，RL)。此外，AU是在面部肌肉解剖学的基础上定义的，描述了一块或几块肌肉的运动，某些肌肉在运动过程中会牵动几个AU同时出现，因此AU之间存在一定程度的相关性，显然，AU之间的关联性信息会有助于模型识别性能的提升，因此如何挖掘AU之间的关联并基于相关性提升AU模型识别性能的解决方案一般被称为AU关联学习。With the development of computer technology and information technology, deep learning technology has been widely used. In the field of AU (Facial Action Unit) recognition, research on AU recognition based on deep learning models has become mainstream. At present, AU identification is mainly divided into two research routes: regional learning and AU association learning. If the association between AUs is not considered, generally only a few sparse areas where the corresponding facial muscles are located contribute to its recognition, and other areas do not need too much attention, so find those areas that need attention and Focused learning can be used for better AU recognition, and the solution to this problem is generally called region learning (RL). In addition, AU is defined on the basis of facial muscle anatomy, describing the movement of one or several muscles, some muscles will affect several AUs at the same time during the movement, so there is a certain degree of correlation between AUs , obviously, the correlation information between AUs will help to improve the recognition performance of the model, so the solution of how to mine the correlation between AUs and improve the recognition performance of the AU model based on the correlation is generally called AU correlation learning.

尽管面部动作单元的自动识别取得了令人印象深刻的进展，但目前基于区域学习的AU检测方法，由于AU没有明显的轮廓和纹理，而且可能随人和表情变化而变化，所以仅利用AU标签来监督神经网络自适应地学习隐含的注意力时常会捕捉到不相关的区域。而基于关系推理的AU检测方法，在推理时所有AU共享参数，忽视了每个AU的特异性和动态性，使得其识别准确率还不高，有进一步提升的空间。Although impressive progress has been made in the automatic recognition of facial action units, current AU detection methods based on region learning only utilize AU labels because AUs do not have obvious contours and textures and may vary with people and expressions. To supervised neural networks adaptively learn implicit attention that often captures irrelevant regions. However, in the AU detection method based on relational reasoning, all AUs share parameters during inference, ignoring the specificity and dynamics of each AU, so that the recognition accuracy is not high, and there is room for further improvement.

发明内容SUMMARY OF THE INVENTION

发明目的：为了克服现有技术中存在的不足，本发明提供一种基于自适应注意力与时空关联的面部动作单元识别方法及装置，能够适应非受控场景在光照、遮挡、姿态等方面随机多样变化的不同样本，预期在保持较高识别精度的同时具有较强的鲁棒性。Purpose of the invention: In order to overcome the deficiencies in the prior art, the present invention provides a facial action unit recognition method and device based on adaptive attention and space-time correlation, which can adapt to the randomness of illumination, occlusion, posture, etc. in uncontrolled scenes. Different samples with various changes are expected to have strong robustness while maintaining high recognition accuracy.

技术方案：为实现上述目的，本发明采用的技术方案为：Technical scheme: In order to realize the above-mentioned purpose, the technical scheme adopted in the present invention is:

一种基于自适应注意力与时空关联的面部动作单元识别方法，包括如下步骤：A facial action unit recognition method based on adaptive attention and spatiotemporal association, comprising the following steps:

S01：从任意视频中抽取训练所需要的原始连续图像帧组成训练数据集；针对视频序列，原始连续图像帧的数量可以为48帧；S01: Extract the original continuous image frames required for training from any video to form a training data set; for video sequences, the number of original continuous image frames can be 48 frames;

S02：对原始连续图像帧进行预处理，获得扩增图像帧序列；对原始连续图像帧进行预处理的方式包括随机平移、随机旋转、随机缩放、随机水平翻转或者随机裁剪等，对图像进行预处理能在一定程度上提高模型的泛化能力；S02: Preprocess the original continuous image frames to obtain an amplified image frame sequence; the methods of preprocessing the original continuous image frames include random translation, random rotation, random scaling, random horizontal flipping or random cropping, etc. Processing can improve the generalization ability of the model to a certain extent;

S03：构建卷积神经网络模块I提取扩增图像帧序列中各帧的分层多尺度区域特征；S03: constructing the convolutional neural network module 1 to extract the hierarchical multi-scale regional features of each frame in the amplified image frame sequence;

S04：利用步骤S03提取的分层多尺度区域特征，构建卷积神经网络模块II进行AU的全局注意力图回归并提取AU特征，并通过AU检测损失对卷积神经网络模块II进行监督；AU表示面部动作单元；S04: Using the hierarchical multi-scale regional features extracted in step S03, construct convolutional neural network module II to perform global attention map regression of AU and extract AU features, and supervise convolutional neural network module II through AU detection loss; AU represents facial action unit;

S05：利用步骤S04提取的AU特征，构建自适应时空图卷积神经网络模块III，推理每个AU的特定模式、不同AU之间的时空关联性(如共现和互斥)，从而学习每个AU的时空关联特征；S05: Use the AU features extracted in step S04 to construct an adaptive spatiotemporal graph convolutional neural network module III, infer the specific pattern of each AU and the spatiotemporal correlation (such as co-occurrence and mutual exclusion) between different AUs, so as to learn each AU The spatiotemporal correlation characteristics of each AU;

S06：利用步骤S05提取的各AU的时空关联特征，构建一个全连接模块IV实现AU识别；S06: utilize the spatiotemporal correlation feature of each AU extracted in step S05 to construct a fully connected module IV to realize AU identification;

S07：使用训练数据集对由卷积神经网络模块I、卷积神经网络模块II、自适应时空图卷积神经网络模块III和全连接模块IV构成的整体AU识别网络模型进行训练，以基于梯度的优化方法对整体AU识别网络模型的参数进行更新；S07: Use the training data set to train the overall AU recognition network model composed of the convolutional neural network module I, the convolutional neural network module II, the adaptive spatiotemporal graph convolutional neural network module III, and the fully connected module IV to be based on the gradient The optimization method updates the parameters of the overall AU recognition network model;

S08：将给定任意帧数的视频序列输入到训练完成的整体AU识别网络模型，预测AU的出现概率。S08: Input a video sequence with a given arbitrary number of frames into the trained overall AU recognition network model to predict the occurrence probability of AU.

具体的，所述步骤S03中，由于不同局部块的AU具有不同的面部结构和纹理信息，因而需要对每个局部块进行独立的滤波处理，且不同局部块使用不同的滤波权值；为了获得多尺度区域特征，采用卷积神经网络模块I来学习不同尺度下每个局部块的特征，卷积神经网络模块I包括两层结构相同的分层多尺度区域层，卷积神经网络模块I的输入作为第一层分层多尺度区域层的输入，第一层分层多尺度区域层的输出经过最大池化运算后作为第二层分层多尺度区域层的输入，第二层分层多尺度区域层的输出经过最大池化运算后作为卷积神经网络模块I的输出；将扩增图像帧序列的各帧图像分别输入到卷积神经网络模块I，输出为各帧图像的分层多尺度区域特征。Specifically, in the step S03, since the AUs of different local blocks have different facial structures and texture information, it is necessary to perform independent filtering processing on each local block, and different local blocks use different filtering weights; in order to obtain Multi-scale regional features, the convolutional neural network module I is used to learn the features of each local block at different scales. The convolutional neural network module I includes two hierarchical multi-scale regional layers with the same structure. The input is used as the input of the first layer of hierarchical multi-scale region layer, and the output of the first layer of hierarchical multi-scale region layer is used as the input of the second layer of hierarchical multi-scale region layer after the maximum pooling operation. The output of the scale area layer is used as the output of the convolutional neural network module I after the maximum pooling operation; each frame image of the amplified image frame sequence is input to the convolutional neural network module I respectively, and the output is a hierarchical multi-layered image of each frame image. Scale area features.

每层分层多尺度区域层均包括卷积层I-I、卷积层I-II-I、卷积层I-II-II和卷积层I-II-III，在卷积层I-I内，对输入整体进行一次卷积，将卷积结果作为卷积层I-I的输出；将卷积层I-I的输出作为卷积层I-II-I的输入，在卷积层I-II-I内，先将输入均匀划分为8×8尺度的局部块分别进行卷积，再对所有卷积结果进行拼接形成卷积层I-II-I的输出；将卷积层I-II-I的输出作为卷积层I-II-II的输入，在卷积层I-II-II内，先将输入均匀划分为4×4尺度的局部块分别进行卷积，再对所有卷积结果进行拼接形成卷积层I-II-II的输出；将卷积层I-II-II的输出作为卷积层I-II-III的输入，在卷积层I-II-III内，先将输入均匀划分为2×2尺度的局部块分别进行卷积，再对所有卷积结果进行拼接形成卷积层I-II-III的输出；对卷积层I-II-I、卷积层I-II-II和卷积层I-II-III的输出进行通道级串联后(通道级串联后输出的通道数与卷积层I-I输出的通道数相同)与卷积层I-I的输出进行加和，结果作为分层多尺度区域层的输出。Each hierarchical multi-scale region layer includes convolutional layer I-I, convolutional layer I-II-I, convolutional layer I-II-II and convolutional layer I-II-III. The whole input is convolved once, and the convolution result is used as the output of the convolutional layer I-I; the output of the convolutional layer I-I is used as the input of the convolutional layer I-II-I. In the convolutional layer I-II-I, the first The input is evenly divided into 8 × 8 local blocks for convolution respectively, and then all convolution results are spliced to form the output of convolutional layer I-II-I; the output of convolutional layer I-II-I is used as the volume For the input of layer I-II-II, in the convolution layer I-II-II, the input is firstly divided into 4×4 scale local blocks for convolution respectively, and then all convolution results are spliced to form convolution The output of layer I-II-II; the output of the convolutional layer I-II-II is used as the input of the convolutional layer I-II-III. In the convolutional layer I-II-III, the input is firstly divided into 2 The local blocks of ×2 scale are convolved separately, and then all convolution results are spliced to form the output of convolutional layers I-II-III; convolutional layers I-II-I, convolutional layers I-II-II and The outputs of the convolutional layers I-II-III are concatenated at the channel level (the number of channels output after the convolutional concatenation is the same as the number of channels output by the convolutional layers I-I) and the outputs of the convolutional layers I-I are added, and the result is used as a hierarchical The output of the multi-scale region layer.

具体的，所述步骤S04中，卷积神经网络模块II作为自适应注意力学习模块，其输入为图像的分层多尺度区域特征，经过卷积神经网络模块II得到预测的每张图像上各AU的全局注意力图，并进行AU特征提取和AU预测，整个过程在预定义的注意力图和AU检测损失的监督下进行。包括如下步骤：Specifically, in the step S04, the convolutional neural network module II is used as an adaptive attention learning module, and its input is the layered multi-scale regional features of the image, and the prediction is obtained through the convolutional neural network module II on each image. The global attention map of AU, and AU feature extraction and AU prediction are performed, and the whole process is carried out under the supervision of a predefined attention map and AU detection loss. It includes the following steps:

(41)生成每个AU的预测注意力图：分层多尺度区域特征被输入到自适应注意力学习模块，每帧图片上AU的个数为m，每个AU对应一个独立的分支，采用四层卷积层学习AU的全局注意力图M_ij并提取AU特征。(41) Generate the predicted attention map of each AU: hierarchical multi-scale regional features are input to the adaptive attention learning module, the number of AUs on each frame is m, and each AU corresponds to an independent branch, using four Layer convolutional layers learn the global attention map M _ij of AUs and extract AU features.

(42)生成每个AU的真实注意力图：每个AU都有两个中心，由两个相关的人脸特征点指定；真实注意力图由高斯分布生成，其中心点即AU中心点坐标，例如AU的中心点坐标为

则注意力图上位置坐标为(a,b)处的真实注意力权值为：(42) Generating the true attention map of each AU: each AU has two centers, specified by two related face feature points; the true attention map is generated by a Gaussian distribution, and its center point is the AU center point coordinates, such as The coordinates of the center point of AU are

Then the real attention weight at the position coordinate (a, b) on the attention map is:

然后，在每个AU的两个中心位置中选择注意权重中较大的一个，来合并两个AU中心的预定义注意图，即

并采用注意回归损失来鼓励M_ij接近M_ij：Then, the larger one of the attention weights is selected among the two center positions of each AU to merge the predefined attention maps of the two AU centers, i.e.

and employ an attentional regression loss to encourage Mi _ij to approach Mi _ij :

其中：L_a表示全局注意力图回归的损失函数L_a；t表示扩增图像帧序列的长度；m表示每帧图像中AU的个数；l/4×l/4表示全局注意力图的大小；M_ijab表示第i帧图像的第j个AU在坐标位置(a,b)处的真实注意力权值；M_ijab表示第i帧图像的第j个AU在坐标位置(a,b)处的预测注意力权值。Among them: _La represents the loss function _La of the global attention map regression; t represents the length of the augmented image frame sequence; m represents the number of AUs in each frame of image; 1/4×1/4 represents the size of the global attention map; M _ijab represents the real attention weight of the jth AU of the ith frame image at the coordinate position (a, b); _Mijab represents the jth AU of the ith frame image at the coordinate position (a, b) Predict the attention weights.

(43)提取AU特征并进行AU检测：将预测的全局注意力图M_ij与第四个卷积层II-II得到的人脸特征图进行按元素相乘，便可对注意力权重较大区域的特征进行加强；然后将得到的输出特征输入到卷积层II-III，再经过全局平均池化层，从而提取到AU特征；采用AU检测交叉熵损失促进注意力图自适应训练，将学到的AU特征输入到一个一维全连通层，再用Sigmoid函数δ(x)＝1/(1+e^-x)预测各AU出现的概率。(43) Extract AU features and perform AU detection: Multiply the predicted global attention map M _ij with the face feature map obtained by the fourth convolution layer II-II element-wise, and then the attention weight can be increased to the area Then the obtained output features are input to the convolutional layers II-III, and then the global average pooling layer is used to extract the AU features; AU detection cross-entropy loss is used to promote adaptive training of attention maps, and the learned The AU features are input into a one-dimensional fully connected layer, and the Sigmoid function δ(x)=1/(1+e ^-x ) is used to predict the probability of each AU.

采用的AU识别的加权交叉熵损失函数为：The weighted cross-entropy loss function used for AU identification is:

其中：

表示AU识别的加权交叉熵损失函数；p_ij表示第i帧图像的第j个AU出现的真实概率；

表示第i帧图像的第j个AU出现的预测概率；ω_j表示第j个AU的权重；v_j表示第j个AU出现时的权重(使用于交叉熵的第一项即出现项)。in:

Represents the weighted cross-entropy loss function of AU identification; p _ij represents the true probability of the occurrence of the j-th AU in the i-th frame image;

Represents the predicted probability of the occurrence of the jth AU in the ith frame image; ωj represents the weight of the _jth AU; _vj represents the weight when the jth AU appears (the first item used for cross entropy is the appearance item).

(44)卷积神经网络模块I和卷积神经网络模块II整体的总损失函数：结合注意力图损失和AU检测损失可得到总体损失函数。(44) The overall loss function of the convolutional neural network module I and the convolutional neural network module II as a whole: the overall loss function can be obtained by combining the attention map loss and the AU detection loss.

其中：L_AA表示卷积神经网络模块I和卷积神经网络模块II整体的总损失函数。Where: L _AA represents the total loss function of the convolutional neural network module I and the convolutional neural network module II as a whole.

具体的，所述步骤S05中，采用自适应时空图卷积神经网络模块III提取每个AU的特定模式和不同AU之间的时空关联性，自适应时空图卷积神经网络模块III包含两层结构相同的时空图卷积层，将t帧图像、每帧图像上m个12c的AU特征拼接为t×m×12c的整体特征，作为时空图卷积层III-I的输入，时空图卷积层III-I得到的输出作为时空图卷积层III-II的输入，经过时空图卷积层III-II得到输出特征，此特征包含了AU的特定模式、不同AU之间的时空关联性信息。Specifically, in the step S05, the adaptive spatiotemporal graph convolutional neural network module III is used to extract the specific pattern of each AU and the spatiotemporal correlation between different AUs, and the adaptive spatiotemporal graph convolutional neural network module III includes two layers The spatiotemporal graph convolution layer with the same structure splices t frames of images and m 12c AU features on each frame image into an overall feature of t×m×12c as the input of the spatiotemporal graph convolutional layer III-I, the spatiotemporal graph volume The output obtained by the layer III-I is used as the input of the spatiotemporal graph convolution layer III-II, and the output feature is obtained through the spatiotemporal graph convolution layer III-II. This feature includes the specific pattern of AU and the spatiotemporal correlation between different AUs information.

两层时空图卷积层的参数各自独立学习，每层时空图卷积层均由空间图卷积和门控循环单元结合构成，时空卷积层的定义为：The parameters of the two layers of spatio-temporal graph convolution layers are learned independently. Each spatio-temporal graph convolution layer is composed of a combination of spatial graph convolution and gated recurrent units. The definition of the spatio-temporal convolution layer is:

其中：

表示T时刻的输入，h_T表示T时刻的最终隐藏状态(T时刻的输出)，

表示T时刻的初始隐藏状态；z_T用于决定T时刻需要保留的h_T-1的数量，r_T用于决定T时刻

与h_T-1的组合方式。in:

represents the input at time T, h _T represents the final hidden state at time T (the output at time T),

Represents the initial hidden state at time T; z _T is used to determine the number of h _T-1 to be retained at time T, and r _T is used to determine time T

Combination with h _T-1 .

表示单位矩阵，m是每帧图像中AU的个数；

表示自适应学习针对AU关系图的矩阵，U^T表示U的转置；

表示自适应学习的解离矩阵，c^e是为Q设定的列数；

和

分别表示自适应学习针对z_T、r_T和

的权重矩阵，c'表示AU特征的维数，c'＝12c，c是与整体AU识别网络模型相关的设定参数；对于第i帧图像的第j个AU，利用Q中的第i行分量

可以分别从W_z、W_r和

中解离出一个大小为2c'×c'的参数。

Represents the identity matrix, m is the number of AUs in each frame of image;

Represents the matrix of adaptive learning for the AU relationship graph, U ^T represents the transpose of U;

represents the dissociation matrix of adaptive learning, c ^e is the number of columns set for Q;

and

represent adaptive learning for z _T , r _T and

The weight matrix of , c' represents the dimension of the AU feature, c'=12c, c is the setting parameter related to the overall AU recognition network model; for the jth AU of the ith frame image, use the ith row in Q weight

can be obtained from W _z , W _r and

A parameter of size 2c'×c' is dissociated in .

R(X)表示对二维矩阵X中每一索引位置处的元素进行去负数处理，去负数处理后，二维矩阵X中(a,b)索引位置处的元素X_ab更新为X_ab＝max(0,X_ab)。R(X) indicates that the element at each index position in the two-dimensional matrix X is subjected to de-negative processing. After the de-negative processing, the element X _ab at the (a, b) index position in the two-dimensional matrix X is updated to X _ab = max(0,X _ab ).

N(X)表示对二维矩阵X中每一索引位置处的元素进行归一化处理，归一化处理后，二维矩阵X中(a,b)索引位置处的元素X_ab更新为

N(X) means to normalize the elements at each index position in the two-dimensional matrix X. After normalization, the element X _ab at the (a, b) index position in the two-dimensional matrix X is updated to

Z＝X⊙Y表示对二维矩阵X和三维矩阵Y进行函数运算得到二维矩阵Z，二维矩阵Z中(a,b)索引位置处的元素

X_ak表示二维矩阵X中(a,k)索引位置处的元素，Y_akb表示三维矩阵Y中(a,k,b)索引位置处的元素。Z=X⊙Y means that a two-dimensional matrix Z is obtained by performing a function operation on a two-dimensional matrix X and a three-dimensional matrix Y, and the element at the (a, b) index position in the two-dimensional matrix Z

X _ak represents the element at the (a, k) index position in the two-dimensional matrix X, and Y _akb represents the element at the (a, k, b) index position in the three-dimensional matrix Y.

“★”表示元素级相乘，C(·)表示拼接操作，σ(·)表示Sigmoid函数，tanh(·)表示双曲正切激活函数。"★" means element-wise multiplication, C(·) means concatenation operation, σ(·) means Sigmoid function, and tanh(·) means hyperbolic tangent activation function.

具体的，所述步骤S06中，采用一个全连接模块IV采用一维全连接层，对每帧图像的AU进行识别，对时空图卷积层输出的t帧图像包含的所有AU的时空关联特征

进行逐帧、逐AU分解，得到维度为12c的AU特征向量，将其中包含第i帧图像的第j个AU的时空关联特征

输入到全连接模块IV，全连接模块IV对

采用第j个一维全连接层后紧跟一个Sigmoid激活函数的方式，预测第i帧图像的第j个AU的最终出现概率，全连接模块IV对不同帧图像的同一个AU采用相同的全连接层；由于得到的特征具有AU间的时空关联信息，有利于最终的AU识别，AU识别采用的损失函数为：Specifically, in the step S06, a fully connected module IV is used to use a one-dimensional fully connected layer to identify the AUs of each frame of images, and to identify the spatiotemporal correlation features of all AUs included in the t frames of images output from the spatiotemporal graph convolutional layer.

Carry out frame-by-frame and AU-by-AU decomposition to obtain an AU feature vector with a dimension of 12c, which contains the spatiotemporal correlation feature of the jth AU of the ith frame image

Input to fully connected module IV, fully connected module IV pair

The jth one-dimensional fully connected layer is followed by a sigmoid activation function to predict the final occurrence probability of the jth AU of the ith frame image. The fully connected module IV uses the same full connection for the same AU of different frame images Connection layer; since the obtained features have spatiotemporal correlation information between AUs, which is beneficial to the final AU identification, the loss function used for AU identification is:

其中：

表示AU识别的损失函数；t表示扩增图像帧序列的长度；p_ij表示第i帧图像的第j个AU出现的真实概率；

表示第i帧图像的第j个AU出现的最终预测概率；ω_j表示第j个AU的权重；v_j表示第j个AU出现时的权重(使用于交叉熵的第一项即出现项)。in:

Represents the loss function of AU recognition; t represents the length of the augmented image frame sequence; p _ij represents the true probability of the occurrence of the jth AU in the ith frame image;

Represents the final prediction probability of the occurrence of the jth AU of the i-th frame image; ω _j represents the weight of the j-th AU; v _j represents the weight of the j-th AU when it appears (the first item used for cross entropy is the appearance item) .

具体的，所述步骤S07中，通过端到端的方法训练由卷积神经网络模块I、卷积神经网络模块II、自适应时空图卷积神经网络模块III和全连接模块IV构成的整体AU识别网络模型；首先训练卷积神经网络模型，提取精确的AU特征作为图卷积神经网络的输入，然后训练图卷积神经网络学习AU的特定模式与时空关联性特征，利用AU间的时空关联性特征促进面部动作单元的识别。Specifically, in the step S07, the overall AU recognition system composed of the convolutional neural network module I, the convolutional neural network module II, the adaptive spatiotemporal graph convolutional neural network module III and the fully connected module IV is trained by an end-to-end method. Network model; first train the convolutional neural network model, extract accurate AU features as the input of the graph convolutional neural network, and then train the graph convolutional neural network to learn the specific patterns and spatiotemporal correlation features of AUs, and use the spatiotemporal correlation between AUs Features facilitate the recognition of facial action units.

一种用于实现上述基于自适应注意力与时空关联的面部动作单元识别装置，包括图像帧序列获取单元、分层多尺度区域学习单元、自适应注意力回归与特征提取单元、自适应时空图卷积学习单元、AU识别单元和参数优化单元；A device for realizing the above-mentioned facial action unit recognition based on adaptive attention and spatiotemporal association, including an image frame sequence acquisition unit, a hierarchical multi-scale region learning unit, an adaptive attention regression and feature extraction unit, and an adaptive spatiotemporal map Convolution learning unit, AU identification unit and parameter optimization unit;

所述图像帧序列获取单元，用于从任意视频数据中抽取训练所需要的大量原始连续图像组成训练数据集，并对原始连续图像帧进行预处理，获得扩增图像帧序列；The image frame sequence acquisition unit is used to extract a large number of original continuous images required for training from any video data to form a training data set, and preprocess the original continuous image frames to obtain an augmented image frame sequence;

所述分层多尺度区域学习单元，包括卷积神经网络模块I，采用分层多尺度区域层来学习每帧输入图像不同尺度下每个局部块的特征，并对每个局部块进行独立滤波；The layered multi-scale region learning unit, including the convolutional neural network module I, adopts the layered multi-scale region layer to learn the features of each local block under different scales of the input image of each frame, and independently filter each local block. ;

所述自适应注意力回归与特征提取单元，包括卷积神经网络模块II，用于生成图像的全局注意力图，并在预定义的注意力图与AU检测损失的监督下进行自适应回归，同时精确地提取AU特征。The adaptive attention regression and feature extraction unit, including the convolutional neural network module II, is used to generate a global attention map of the image, and perform adaptive regression under the supervision of a predefined attention map and AU detection loss, while accurate to extract AU features.

所述自适应时空图卷积学习单元，包括自适应时空图卷积神经网络模块III，进行每个AU的特定模式学习、不同AU间的时空关联性学习，并提取每个AU的时空关联特征；The adaptive spatiotemporal graph convolution learning unit includes an adaptive spatiotemporal graph convolutional neural network module III, which performs the specific pattern learning of each AU, the spatiotemporal correlation learning between different AUs, and extracts the spatiotemporal correlation features of each AU ;

所述AU识别单元，包括全连接模块IV，利用每个AU的时空关联特征，可以有效地进行AU识别；The AU identification unit, including the fully connected module IV, utilizes the space-time correlation feature of each AU to effectively carry out AU identification;

所述参数优化单元，计算由卷积神经网络模块I、卷积神经网络模块II、自适应时空图卷积神经网络模块III和全连接模块IV构成的整体AU识别网络模型的参数和损失函数值，并基于梯度优化方法对参数进行更新。The parameter optimization unit calculates the parameters and loss function values of the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatiotemporal graph convolutional neural network module III and the fully connected module IV , and update the parameters based on the gradient optimization method.

将输入图像帧序列中每一帧图像分别输入到卷积神经网络模块I、卷积神经网络模块II，得到每一帧的m个AU特征，到自适应时空图卷积神经网络模块III时，才开始把所有帧的特征拼起来作为输入；也就是卷积神经网络模块I、卷积神经网络模块II是对单张图像做处理的，不涉及时间，可以看成对t帧分别做了处理，而自适应时空图卷积神经网络模块III是对t帧同时做处理；全连接模块IV是对单张图像的m个AU的时空关联特征做处理，不涉及时间。Input each frame of image in the input image frame sequence into the convolutional neural network module I and the convolutional neural network module II, respectively, to obtain m AU features of each frame, and when reaching the adaptive spatiotemporal graph convolutional neural network module III, The features of all the frames have only been put together as input; that is, the convolutional neural network module I and the convolutional neural network module II are processing a single image, which does not involve time, and can be regarded as processing t frames separately. , while the adaptive spatiotemporal graph convolutional neural network module III processes t frames at the same time; the fully connected module IV processes the spatiotemporal correlation features of m AUs in a single image, without involving time.

有益效果：本发明提供的基于自适应注意力与时空关联的面部动作单元识别方法和装置，相对于现有技术，具有如下优势：(1)在自适应注意力回归神经网络中，通过AU检测损失促进注意力自适应回归，可以准确地捕捉与AU关联的局部特征，利用位置先验来对注意力分布进行优化，有利于提高对非受控场景的鲁棒性；(2)在自适应时空图卷积神经网络中，每个时空图卷积层在单帧的空间域充分地学习AU关系；而在时域上，进行通用的卷积运算，挖掘帧间关联性，促进各帧的人脸AU识别，最终网络输出各帧每个AU出现的概率；(3)由于采用自适应学习而不是基于先验知识预定义的AU关系，本发明能够适应非受控场景在光照、遮挡、姿态等方面随机多样变化的不同样本，预期在保持较高识别精度的同时具有较强的鲁棒性。Beneficial effects: The facial action unit recognition method and device based on adaptive attention and space-time correlation provided by the present invention have the following advantages compared with the prior art: (1) In the adaptive attention regression neural network, the AU detection The loss promotes the adaptive regression of attention, which can accurately capture the local features associated with the AU, and use the position prior to optimize the attention distribution, which is beneficial to improve the robustness to uncontrolled scenes; (2) In adaptive In the spatiotemporal graph convolutional neural network, each spatiotemporal graph convolutional layer fully learns the AU relationship in the spatial domain of a single frame; while in the time domain, a general convolution operation is performed to mine the correlation between frames and promote the understanding of each frame. Face AU recognition, and finally the network outputs the probability of each AU appearing in each frame; (3) Due to the use of adaptive learning rather than the pre-defined AU relationship based on prior knowledge, the present invention can adapt to uncontrolled scenes in lighting, occlusion, Different samples with random and diverse changes in attitude and other aspects are expected to have strong robustness while maintaining high recognition accuracy.

附图说明Description of drawings

图1为本发明方法的实施流程示意图；Fig. 1 is the implementation flow schematic diagram of the method of the present invention;

图2为分层多尺度区域层的结构示意图；Figure 2 is a schematic structural diagram of a hierarchical multi-scale region layer;

图3为卷积神经网络模块II的结构示意图；Fig. 3 is the structural representation of convolutional neural network module II;

图4为时空图卷积层的结构示意图；FIG. 4 is a schematic structural diagram of a spatiotemporal graph convolution layer;

图5为整个自适应注意力回归神经网络、自适应时空图卷积神经网络模型的结构示意图。Figure 5 is a schematic structural diagram of the entire adaptive attention regression neural network and adaptive spatiotemporal graph convolutional neural network model.

具体实施方式Detailed ways

以下结合附图和具体实施例对本发明作具体的介绍。The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

在本发明的描述中，需要理解的是，术语“中心”、“纵向”、“横向”、“上”、“下”、“前”、“后”、“左”、“右”、“竖直”、“水平”、“顶”、“底”、“内”、“外”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本发明的限制。此外，术语“第一”、“第二”仅用于描述目的，而不能理解为指示或暗示相对重要性。In the description of the present invention, it should be understood that the terms "center", "portrait", "horizontal", "top", "bottom", "front", "rear", "left", "right", " The orientation or positional relationship indicated by vertical, horizontal, top, bottom, inner, outer, etc. is based on the orientation or positional relationship shown in the drawings, and is only for the convenience of describing the present invention and The description is simplified rather than indicating or implying that the device or element referred to must have a particular orientation, be constructed and operate in a particular orientation, and therefore should not be construed as limiting the invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed to indicate or imply relative importance.

本发明提供一种基于自适应注意力与时空关联的面部动作单元识别方法及装置。在自适应注意力回归神经网络中，通过预定义注意力和AU检测的约束，同时捕获由人脸特征点指定的AU强相关区域和分布在全局的弱相关区域，由于学到了各区域的相关性分布，再进行特征提取能精确地提取每个AU的有用信息。在自适应时空图卷积神经网络中，每个时空图卷积层在单帧的空间域充分地学习AU关系；而在时域上，进行通用的卷积运算挖掘帧间关联性，促进各帧的人脸AU识别，最终网络输出各帧每个AU出现的概率。由于采用自适应学习，本发明能够适应非受控场景在光照、遮挡、姿态等方面随机多样变化的不同样本，预期在保持较高识别精度的同时具有较强的鲁棒性。The present invention provides a facial action unit recognition method and device based on adaptive attention and space-time correlation. In the adaptive attention regression neural network, by pre-defining the constraints of attention and AU detection, the AU strong correlation regions specified by the face feature points and the weak correlation regions distributed globally are captured at the same time, because the correlation of each region is learned. Then, feature extraction can accurately extract the useful information of each AU. In the adaptive spatiotemporal graph convolutional neural network, each spatiotemporal graph convolutional layer fully learns the AU relationship in the spatial domain of a single frame; while in the temporal domain, a general convolution operation is performed to mine the inter-frame correlation and promote the Frame face AU recognition, and finally the network outputs the probability of each AU appearing in each frame. Due to the adaptive learning, the present invention can adapt to different samples with random and diverse changes in illumination, occlusion, posture, etc. of uncontrolled scenes, and is expected to have strong robustness while maintaining high recognition accuracy.

如图1所示为基于联合学习与光流估计的面部动作单元识别方法的流程示意图，下面就各个具体步骤加以说明。Figure 1 is a schematic flowchart of a facial action unit recognition method based on joint learning and optical flow estimation, and each specific step is described below.

S01：从任意视频中抽取训练所需要的原始连续图像帧组成训练数据集，抽取的连续图像帧序列长度为48。S01: Extract the original continuous image frames required for training from any video to form a training data set, and the length of the extracted continuous image frame sequence is 48.

针对视频序列，为了避免采集帧数过少导致难以学习正确的动作单元时空关联，或采集帧数过多使得学习时长过久的情况，视频帧序列长度选择为48。For video sequences, in order to avoid the situation where it is difficult to learn the correct spatiotemporal association of action units due to too few captured frames, or the learning time is too long due to too many captured frames, the video frame sequence length is selected as 48.

S02：对原始连续图像帧进行预处理，获得扩增图像帧序列。S02: Preprocess the original continuous image frames to obtain an amplified image frame sequence.

对原始图像进行预处理的方式包括随机平移、随机旋转、随机缩放、随机水平翻转或者随机裁剪等，对图像进行预处理能在一定程度上提高模型的泛化能力。The methods of preprocessing the original image include random translation, random rotation, random scaling, random horizontal flipping or random cropping, etc. Preprocessing the image can improve the generalization ability of the model to a certain extent.

S03：构建卷积神经网络模块I提取扩增图像帧序列的分层多尺度区域特征。S03: Constructing the convolutional neural network module I to extract the hierarchical multi-scale regional features of the augmented image frame sequence.

由于不同局部块的面部动作单元有不同的面部结构和纹理信息，因而需要对每个局部块进行独立的滤波处理，且不同局部块使用不同的滤波权值。Since the facial action units of different local blocks have different facial structures and texture information, each local block needs to be filtered independently, and different local blocks use different filter weights.

由于不同局部块的AU具有不同的面部结构和纹理信息，因而需要对每个局部块进行独立的滤波处理，且不同局部块使用不同的滤波权值；为了获得多尺度区域特征，采用卷积神经网络模块I来学习不同尺度下每个局部块的特征，卷积神经网络模块I包括两层结构相同的分层多尺度区域层，卷积神经网络模块I的输入作为第一层分层多尺度区域层的输入，第一层分层多尺度区域层的输出经过最大池化运算后作为第二层分层多尺度区域层的输入，第二层分层多尺度区域层的输出经过最大池化运算后作为卷积神经网络模块I的输出；将扩增图像帧序列的各帧图像进行通道级串联后作为卷积神经网络模块I的输入，卷积神经网络模块I的输出即为扩增图像帧序列的分层多尺度区域特征。Since the AUs of different local blocks have different facial structure and texture information, each local block needs to be filtered independently, and different local blocks use different filter weights; in order to obtain multi-scale regional features, the convolutional neural network is used. The network module I is used to learn the features of each local block at different scales. The convolutional neural network module I includes two hierarchical multi-scale regional layers with the same structure. The input of the convolutional neural network module I is used as the first layer of hierarchical multi-scale. The input of the regional layer, the output of the first layer of hierarchical multi-scale regional layer after the maximum pooling operation is used as the input of the second layer of hierarchical multi-scale regional layer, the output of the second layer of hierarchical multi-scale regional layer is subjected to maximum pooling After the operation, as the output of the convolutional neural network module 1; each frame image of the amplified image frame sequence is used as the input of the convolutional neural network module 1 after channel-level series connection, and the output of the convolutional neural network module 1 is the amplified image Hierarchical multi-scale region features for frame sequences.

如图2所示，每层分层多尺度区域层包括卷积层I-I、卷积层I-II-I、卷积层I-II-II和卷积层I-II-III，在卷积层I-I内，对输入整体进行一次卷积，将卷积结果作为卷积层I-I的输出；将卷积层I-I的输出作为卷积层I-II-I的输入，在卷积层I-II-I内，先将输入均匀划分为8×8尺度的局部块分别进行卷积，再对所有卷积结果进行拼接形成卷积层I-II-I的输出；将卷积层I-II-I的输出作为卷积层I-II-II的输入，在卷积层I-II-II内，先将输入均匀划分为4×4尺度的局部块分别进行卷积，再对所有卷积结果进行拼接形成卷积层I-II-II的输出；将卷积层I-II-II的输出作为卷积层I-II-III的输入，在卷积层I-II-III内，先将输入均匀划分为2×2尺度的局部块分别进行卷积，再对所有卷积结果进行拼接形成卷积层I-II-III的输出；对卷积层I-II-I、卷积层I-II-II和卷积层I-II-III的输出进行通道级串联后(通道级串联后输出的通道数与卷积层I-I输出的通道数相同)与卷积层I-I的输出进行加和，结果作为分层多尺度区域层的输出。As shown in Figure 2, each hierarchical multi-scale region layer includes convolutional layer I-I, convolutional layer I-II-I, convolutional layer I-II-II and convolutional layer I-II-III. In layer I-I, a convolution is performed on the whole input, and the convolution result is used as the output of convolution layer I-I; the output of convolution layer I-I is used as the input of convolution layer I-II-I. In -I, the input is evenly divided into 8 × 8 local blocks for convolution respectively, and then all convolution results are spliced to form the output of convolutional layer I-II-I; the convolutional layer I-II-I The output of I is used as the input of the convolutional layer I-II-II. In the convolutional layer I-II-II, the input is evenly divided into 4×4 scale local blocks for convolution respectively, and then all convolution results are Splicing to form the output of the convolutional layer I-II-II; the output of the convolutional layer I-II-II is used as the input of the convolutional layer I-II-III. In the convolutional layer I-II-III, the first The input is evenly divided into local blocks of 2×2 scale, and convolution is performed respectively, and all convolution results are spliced to form the output of convolutional layers I-II-III; convolutional layers I-II-I and convolutional layers I - The outputs of II-II and convolutional layers I-II-III are concatenated at the channel level (the number of channels output after channel-level concatenation is the same as the number of channels output by the convolutional layer I-I) and the output of the convolutional layer I-I is summed , and the result as the output of the hierarchical multi-scale region layer.

卷积神经网络模块I中每层分层多尺度区域层后均有一层最大池化层，每层最大池化层的池化核大小为2×2，步长为2；第一层分层多尺度区域层中卷积层I-I、卷积层I-II-I、卷积层I-II-II和卷积层I-II-III的通道数分别为32、16、8、8，第一层分层多尺度区域层中卷积层I-I、卷积层I-II-I、卷积层I-II-II和卷积层I-II-III的滤波器个数分别为32×1、16×8×8、8×4×4、8×2×2；第二层分层多尺度区域层中卷积层I-I、卷积层I-II-I、卷积层I-II-II和卷积层I-II-III的通道数分别为64、32、16、16，第二层分层多尺度区域层中卷积层I-I、卷积层I-II-I、卷积层I-II-II和卷积层I-II-III的滤波器个数分别为64×1、32×8×8、16×4×4、16×2×2；卷积层中的滤波器大小均为3×3，步长均为1。In the convolutional neural network module I, there is a maximum pooling layer after each layered multi-scale region layer. The size of the pooling kernel of each maximum pooling layer is 2×2, and the stride is 2; the first layer is hierarchical The number of channels of convolutional layer I-I, convolutional layer I-II-I, convolutional layer I-II-II and convolutional layer I-II-III in the multi-scale region layer are 32, 16, 8, and 8, respectively. The number of filters of convolutional layer I-I, convolutional layer I-II-I, convolutional layer I-II-II and convolutional layer I-II-III in one layered multi-scale region layer is 32×1 respectively , 16×8×8, 8×4×4, 8×2×2; convolutional layer I-I, convolutional layer I-II-I, convolutional layer I-II- The number of channels of II and convolutional layers I-II-III are 64, 32, 16, and 16, respectively. In the second layer of hierarchical multi-scale region layers, convolutional layers I-I, convolutional layers I-II-I, convolutional layers The number of filters in I-II-II and convolutional layers I-II-III are 64×1, 32×8×8, 16×4×4, 16×2×2, respectively; the filters in the convolutional layer The size is all 3×3, and the stride is all 1.

S04：利用步骤S03提取的分层多尺度区域特征，构建卷积神经网络模块II进行AU的全局注意力图回归并提取AU特征，并通过AU检测损失对卷积神经网络模块II进行监督；AU表示面部动作单元。S04: Using the hierarchical multi-scale regional features extracted in step S03, construct convolutional neural network module II to perform global attention map regression of AU and extract AU features, and supervise convolutional neural network module II through AU detection loss; AU represents Facial Action Unit.

如图3所示，卷积神经网络模块II是一个包含m个分支的多层卷积层，每个分支对应一种AU，同时进行自适应的全局注意力图回归和AU预测。每个卷积层的滤波器大小均为3×3，步长均为1。As shown in Figure 3, the convolutional neural network module II is a multi-layer convolutional layer containing m branches, each branch corresponds to a kind of AU, and performs adaptive global attention map regression and AU prediction at the same time. The filter size of each convolutional layer is 3 × 3, and the stride is 1.

(41)生成每个AU的预测注意力图：分层多尺度区域特征为卷积神经网络模块II的输入，卷积神经网络模块II包含m个分支，每个分支对应一种AU，采用四层卷积层学习AU的全局注意力图M_ij并提取AU特征。(41) Generate the predicted attention map of each AU: the hierarchical multi-scale region features are the input of the convolutional neural network module II. The convolutional neural network module II contains m branches, each branch corresponds to a kind of AU, using four layers The convolutional layer learns the global attention map M _ij of AUs and extracts AU features.

其中：

由于训练数据集中不同AU的出现率有显著的差异，并且大多数AU的出现率远低于未出现率，为了抑制这两种数据不平衡问题，将ω_j和v_j定义为：Since the occurrence rates of different AUs in the training dataset are significantly different, and the occurrence rate of most AUs is much lower than the non-occurrence rate, in order to suppress these two data imbalance problems, ω _j and v _j are defined as:

其中：n和

分别是训练集的样本总数和出现第i帧图像的第j个AU样本的总数，第i帧图像的第j个AU的发生率可以表示为

where: n and

are the total number of samples in the training set and the total number of the jth AU samples in the ith frame image, respectively. The occurrence rate of the jth AU in the ith frame image can be expressed as

S05：利用步骤S04提取的AU特征，构建自适应时空图卷积神经网络模块III，推理每个AU的特定模式、不同AU之间的时空关联性(如共现和互斥)，从而学习每个AU的时空关联特征。S05: Use the AU features extracted in step S04 to construct an adaptive spatiotemporal graph convolutional neural network module III, infer the specific pattern of each AU and the spatiotemporal correlation (such as co-occurrence and mutual exclusion) between different AUs, so as to learn each AU The spatiotemporal correlation characteristics of each AU.

自适应时空图卷积神经网络模块III由两层结构相同的时空图卷积层构成，将各帧的m个12c的AU特征拼接为t×m×12c的整体特征，作为时空图卷积层III-I的输入，时空图卷积层III-I得到的输出作为时空图卷积层III-II的输入，经过时空图卷积层III-II得到输出特征，此特征包含了AU的特定模式、AU间的时空关联性。The adaptive spatiotemporal graph convolutional neural network module III is composed of two spatiotemporal graph convolution layers with the same structure. The m 12c AU features of each frame are spliced into t×m×12c overall features as the spatiotemporal graph convolution layer. The input of III-I, the output obtained by the spatiotemporal graph convolution layer III-I is used as the input of the spatiotemporal graph convolutional layer III-II, and the output feature is obtained through the spatiotemporal graph convolutional layer III-II. This feature contains the specific pattern of AU , the spatiotemporal correlation between AUs.

两个时空图卷积层的参数分别独立学习，每一层时空图卷积层由空间图卷积和门控循环单元结合构成，每个时空卷积层的结构图如图4所示，具体包含如下步骤：The parameters of the two spatio-temporal graph convolution layers are learned independently. Each spatio-temporal graph convolution layer is composed of a combination of spatial graph convolution and gated cyclic units. The structure diagram of each spatio-temporal convolution layer is shown in Figure 4. Contains the following steps:

(51)推理每个AU的特定模式：(51) Infer specific patterns for each AU:

典型的图卷积是在谱域中计算的，且用一阶切比雪夫多项式展开式能很好地逼近：A typical graph convolution is computed in the spectral domain and is well approximated by a first-order Chebyshev polynomial expansion:

其中：

和

分别是图卷积层的输入和输出，

是单位矩阵；

是对称的加权邻接矩阵，表示结点间连接边的强度；

是度矩阵，有

为参数矩阵。图卷积本质上是通过学习所有AU共享的参数矩阵A和Θ⁽⁰⁾将输入的

转换为

in:

and

are the input and output of the graph convolutional layer, respectively,

is the identity matrix;

is a symmetric weighted adjacency matrix, representing the strength of connecting edges between nodes;

is the degree matrix, we have

is the parameter matrix. Graph convolution essentially takes the input by learning the parameter matrix A and Θ ⁽⁰⁾ shared by all AUs

convert to

尽管上式能够学习AU的相互关系，但忽视了每个AU的特定模式，为此，用共享的参数矩阵A推理AU间关联，再对每个AU采用一个独立的参数

得到图卷积运算为：Although the above formula can learn the correlation of AUs, it ignores the specific pattern of each AU. To this end, the shared parameter matrix A is used to infer the correlation between AUs, and an independent parameter is used for each AU.

The graph convolution operation is obtained as:

其中：Z＝X⊙Y表示对二维矩阵X和三维矩阵Y进行函数运算得到二维矩阵Z，二维矩阵Z中(a,b)索引位置处的元素

X_ak表示二维矩阵X中(a,k)索引位置处的元素，Y_akb表示三维矩阵Y中(a,k,b)索引位置处的元素。Among them: Z=X⊙Y indicates that the two-dimensional matrix Z is obtained by performing a function operation on the two-dimensional matrix X and the three-dimensional matrix Y, and the element at the (a, b) index position in the two-dimensional matrix Z

为了减少Θ矩阵的参数量，引入一个特征分解矩阵

和一个共享的参数矩阵

将图卷积重新表示为：In order to reduce the number of parameters of the Θ matrix, an eigendecomposition matrix is introduced

and a shared parameter matrix

Re-express the graph convolution as:

其中：有Θ⁽¹⁾＝QW，此新公式计算的中间维度c^e通常小于m，对于第i个AU，参数

可以通过特征分离矩阵

由共享参数矩阵W提取得到，使用矩阵Q和W有利于推理AU的特定模式。where: Θ ⁽¹⁾ = QW, the intermediate dimension c ^e calculated by this new formula is usually less than m, and for the i-th AU, the parameter

The matrix can be separated by feature

Extracted from the shared parameter matrix W, the use of matrices Q and W is beneficial to infer specific modes of AU.

(52)推理空间域上AU之间的相互关系：(52) Interrelationships between AUs in the inference space domain:

采用直接学习一个矩阵

来减少计算量，而不是先学习矩阵A再进一步计算归一化邻接矩阵

即

其中，R(X)为修正线性单元(Rectified Linear Unit，ReLU)激活函数，N(X)为标准化函数，可以自适应地编码AU之间的依赖关系，如共现和互斥。然后将图卷积重新表示为：using direct learning of a matrix

to reduce the amount of computation, instead of learning the matrix A first and then further computing the normalized adjacency matrix

which is

Among them, R(X) is the Rectified Linear Unit (ReLU) activation function, and N(X) is the normalization function, which can adaptively encode the dependencies between AUs, such as co-occurrence and mutual exclusion. The graph convolution is then re-expressed as:

F^out＝(I+N(R(UU^T)))Fⁱⁿ⊙QWF ^out = (I+N(R(UUT )))F ⁱⁿ ^⊙QW

其中，矩阵U和矩阵W为所有AU共享的参数矩阵，从而推理在空间域上AU间的依赖关系。Among them, matrix U and matrix W are parameter matrices shared by all AUs, so as to reason about the dependencies between AUs in the spatial domain.

(53)推理空间域上各帧之间的关系：门控循环单元(GRU)是一种流行的建模时间动态性的方法。一个GRU单元由更新门z和重置门r组成，其中时间步骤τ时刻的门控机制定义如下：(53) Inferring relationships between frames on the spatial domain: Gated Recurrent Units (GRUs) are a popular method for modeling temporal dynamics. A GRU unit consists of an update gate z and a reset gate r, where the gating mechanism at time step τ is defined as follows:

z_T＝σ(W_zC(h_T-1,x_T))z _T =σ(W _z C(h _T-1 ,x _T ))

r_T＝σ(W_rC(h_T-1,x_T))r _T =σ(W _r C(h _T-1 ,x _T ))

其中：h_T表示T时刻的最终隐藏状态(T时刻的输出)，

表示T时刻的初始隐藏状态，z_T用于决定T时刻需要保留的h_T-1的数量，r_T用于决定T时刻

与h_T-1的组合方式，“★”表示元素级相乘，C(·)表示拼接操作，σ(·)表示Sigmoid函数，tanh(·)表示双曲正切激活函数。where: h _T represents the final hidden state at time T (the output at time T),

Represents the initial hidden state at time T, z _T is used to determine the number of h _T-1 to be retained at time T, and r _T is used to determine time T

With the combination of h _T-1 , "★" means element-wise multiplication, C(·) means concatenation operation, σ(·) means Sigmoid function, and tanh(·) means hyperbolic tangent activation function.

由上面的过程可得每一层时空图卷积层的最终定义为：From the above process, the final definition of each spatiotemporal graph convolutional layer is:

其中：

表示T时刻的输入，h_T表示T时刻的最终隐藏状态(T时刻的输出)，且

和

分别表示自适应学习针对z_T、r_T和

的权重矩阵；其中输入矩阵

输出矩阵

是各图卷积层的输出结果h₁,h₂,…,h_t'在维度t上的拼接；t'为输入到时空图卷积层的帧总数，t'＝t＝48。in:

represents the input at time T, h _T represents the final hidden state at time T (the output at time T), and

and

represent adaptive learning for z _T , r _T and

The weight matrix of ; where the input matrix

output matrix

is the concatenation of the output results h ₁ , h ₂ ,...,h _t' of each graph convolutional layer in dimension t; t' is the total number of frames input to the spatiotemporal graph convolutional layer, t'=t=48.

S06：利用步骤S05提取的各AU的时空关联特征，构建一个全连接模块IV实现AU识别。S06: Using the spatiotemporal correlation feature of each AU extracted in step S05, construct a fully connected module IV to realize AU identification.

全连接模块IV采用一维全连接层后跟Sigmoid激活函数构成。将由步骤S05得到的维度为t×m×12c的整体特征进行逐帧逐AU分解，得到对应于每帧图像上每个AU的维度为12c的特征向量，第i帧图像的第j个AU的特征向量被输入到第j个全连接层，全连接层后再跟一个Sigmoid激活函数，进行AU出现概率预测，全连接模块IV对不同帧图像的同一个AU采用相同的全连接层。由于学习到的特征具有AU的时空关联信息，有利于最终的AU检测，并采用如下损失函数指导时空图卷积参数矩阵的学习：The fully connected module IV is composed of a one-dimensional fully connected layer followed by a Sigmoid activation function. The overall feature obtained in step S05 with a dimension of t×m×12c is decomposed frame by frame and AU, to obtain a feature vector corresponding to each AU on each frame image with a dimension of 12c, and the jth AU of the i frame image is obtained. The feature vector is input to the jth fully connected layer, followed by a Sigmoid activation function to predict the occurrence probability of AU. The fully connected module IV uses the same fully connected layer for the same AU in different frames of images. Since the learned features have the spatiotemporal correlation information of AUs, it is beneficial to the final AU detection, and the following loss function is used to guide the learning of the spatiotemporal graph convolution parameter matrix:

其中：

S07：使用训练数据集对由卷积神经网络模块I、卷积神经网络模块II、自适应时空图卷积神经网络模块III和全连接模块IV构成的整体AU识别网络模型进行训练，以基于梯度的优化方法对整体AU识别网络模型的参数进行更新。S07: Use the training data set to train the overall AU recognition network model composed of the convolutional neural network module I, the convolutional neural network module II, the adaptive spatiotemporal graph convolutional neural network module III, and the fully connected module IV to be based on the gradient The optimization method updates the parameters of the overall AU recognition network model.

通过端到端的方法训练整个卷积神经网络、图卷积神经网络模型(如图5)。首先训练卷积神经网络模型，进行AU特征提取并作为图卷积神经网络的输入，然后训练图卷积神经网络学习AU的特定模式与时空关联性，利用AU间的时空关联性促进面部动作单元的识别。The entire convolutional neural network and graph convolutional neural network model is trained through an end-to-end method (see Figure 5). First, the convolutional neural network model is trained, AU features are extracted and used as the input of the graph convolutional neural network, and then the graph convolutional neural network is trained to learn the specific pattern and spatiotemporal correlation of AUs, and the spatiotemporal correlation between AUs is used to promote facial action units. identification.

在进行预测时直接输出面部动作单元的预测结果。Directly output the prediction results of facial action units when making predictions.

本发明方法可以完全通过计算机实现，无需人工辅助处理；这表明，本案可以实现批量化自动处理，能够大大提高处理效率、降低人工成本。The method of the present invention can be completely realized by computer without manual auxiliary processing; this shows that batch automatic processing can be realized in this case, which can greatly improve processing efficiency and reduce labor costs.

在本发明的描述中，需要说明的是，除非另有明确的规定和限定，术语“安装”、“相连”、“连接”应做广义理解，例如，可以是固定连接，也可以是可拆卸连接，或一体地连接；可以是机械连接，也可以是电连接；可以是直接相连，也可以通过中间媒介间接相连，可以是两个元件内部的连通。对于本领域的普通技术人员而言，可以具体情况理解上述术语在本发明中的具体含义。In the description of the present invention, it should be noted that the terms "installed", "connected" and "connected" should be understood in a broad sense, unless otherwise expressly specified and limited, for example, it may be a fixed connection or a detachable connection Connection, or integral connection; can be mechanical connection, can also be electrical connection; can be directly connected, can also be indirectly connected through an intermediate medium, can be internal communication between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood in specific situations.

在本说明书的描述中，参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明的至少一个实施例或示例中。在本说明书中，对上述术语的示意性表述不一定指的是相同的实施例或示例。而且，描述的具体特征、结构、材料或者特点可以在任何的一个或多个实施例或示例中以合适的方式结合。In the description of this specification, description with reference to the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples", etc., mean specific features described in connection with the embodiment or example , structure, material or feature is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

以上显示和描述了本发明的基本原理、主要特征和优点。本行业的技术人员应该了解，上述实施例不以任何形式限制本发明，凡采用等同替换或等效变换的方式所获得的技术方案，均落在本发明的保护范围内。The foregoing has shown and described the basic principles, main features and advantages of the present invention. Those skilled in the art should understand that the above-mentioned embodiments do not limit the present invention in any form, and all technical solutions obtained by means of equivalent replacement or equivalent transformation fall within the protection scope of the present invention.

Claims

1. a facial action unit identification method based on self-adaptive attention and space-time association, is characterized in that: comprise the steps:

S01: Extract the original continuous image frames required for training from the video to form a training data set;

S02: Preprocess the original continuous image frames to obtain an amplified image frame sequence;

S03: constructing the convolutional neural network module 1 to extract the hierarchical multi-scale regional features of each frame in the amplified image frame sequence;

S04: Using the hierarchical multi-scale regional features extracted in step S03, construct convolutional neural network module II to perform global attention map regression of AU and extract AU features, and supervise convolutional neural network module II through AU detection loss; AU represents facial action unit;

S05: use the AU features extracted in step S04 to construct an adaptive spatiotemporal graph convolutional neural network module III, infer the specific pattern of each AU and the spatiotemporal correlation between different AUs, thereby learning the spatiotemporal correlation feature of each AU;

S06: utilize the spatiotemporal correlation feature of each AU extracted in step S05 to construct a fully connected module IV to realize AU identification;

S07: Use the training data set to train the overall AU recognition network model composed of the convolutional neural network module I, the convolutional neural network module II, the adaptive spatiotemporal graph convolutional neural network module III, and the fully connected module IV to be based on the gradient The optimization method updates the parameters of the overall AU recognition network model;

S08: Input a video sequence with a given arbitrary number of frames into the trained overall AU recognition network model to predict the occurrence probability of AU.

2. the facial action unit identification method based on adaptive attention and space-time association according to claim 1, is characterized in that: in described step S03, adopt convolutional neural network module 1 to learn each local block under different scales The features of the convolutional neural network module I include two hierarchical multi-scale regional layers with the same structure, and the input of the convolutional neural network module I is used as the input of the first layer of hierarchical multi-scale regional layers, and the first layer of hierarchical multi-scale regional layers. The output of the regional layer is used as the input of the second layer of hierarchical multi-scale regional layer after the maximum pooling operation, and the output of the second layer of hierarchical multi-scale regional layer is used as the output of the convolutional neural network module I after the maximum pooling operation; The output of the convolutional neural network module 1 is the hierarchical multi-scale regional feature of the amplified image frame sequence;

Each hierarchical multi-scale region layer includes convolutional layer I-I, convolutional layer I-II-I, convolutional layer I-II-II and convolutional layer I-II-III. The whole input is convolved once, and the convolution result is used as the output of the convolutional layer I-I; the output of the convolutional layer I-I is used as the input of the convolutional layer I-II-I. In the convolutional layer I-II-I, the first The input is evenly divided into 8 × 8 local blocks for convolution respectively, and then all convolution results are spliced to form the output of convolutional layer I-II-I; the output of convolutional layer I-II-I is used as the volume For the input of layer I-II-II, in the convolution layer I-II-II, the input is firstly divided into 4×4 scale local blocks for convolution respectively, and then all convolution results are spliced to form convolution The output of layer I-II-II; the output of the convolutional layer I-II-II is used as the input of the convolutional layer I-II-III. In the convolutional layer I-II-III, the input is firstly divided into 2 The local blocks of ×2 scale are convolved separately, and then all convolution results are spliced to form the output of convolutional layers I-II-III; convolutional layers I-II-I, convolutional layers I-II-II and The outputs of the convolutional layers I-II-III are channel-level concatenated and then summed with the outputs of the convolutional layers I-I, and the result is used as the output of the hierarchical multi-scale region layer.

3. the facial action unit identification method based on adaptive attention and space-time association according to claim 1, is characterized in that: in described step S04, adopt convolutional neural network module II to predict the global attention map and AU of each AU The occurrence probability of AU, the global attention map of AU is adaptively regressed to the predefined attention map under the supervision of AU detection loss, and AU features are extracted at the same time; the loss function used by AU detection loss is:

where: L _a represents the loss function of global attention map regression,

Represents the weighted cross-entropy loss function of AU recognition, L _AA represents the total loss function of the convolutional neural network module I and the convolutional neural network module II as a whole; λ _a represents the weight of the global attention map regression loss; t represents the augmented image frame sequence m is the number of AUs in each frame of image; l/4×l/4 is the size of the global attention map; M _ijab is the jth AU of the i-th frame image at the coordinate position (a, b) The real attention weight; Mi _ijab represents the predicted attention weight of the jth AU of the ith frame image at the coordinate position (a, b); p _ij represents the true probability of the jth AU of the ith frame image appearing ;

Represents the predicted probability of the occurrence of the jth AU in the ith frame image; ω _j represents the weight of the jth AU; _vj represents the weight of the jth AU appearing.

4. the facial action unit identification method based on adaptive attention and space-time association according to claim 1, it is characterized in that: in described step S05, adopt adaptive space-time graph convolutional neural network module III to extract each AU's The spatiotemporal correlation between specific patterns and different AUs, so as to learn the spatiotemporal correlation features of each AU; the input of the adaptive spatiotemporal graph convolutional neural network module III is all AU features of all t frame images extracted in step S04, each frame The image contains m AU features, a total of t × m AU features, the dimension of each AU feature is c', c'=12c, c is a setting parameter related to the overall AU recognition network model; adaptive spatiotemporal map volume The product neural network module III contains two spatiotemporal graph convolution layers with the same structure. The parameters of the two spatiotemporal graph convolution layers are learned independently. Each spatiotemporal graph convolution layer is composed of a combination of spatial graph convolution and gated recurrent units. , the spatiotemporal convolutional layer is defined as:

in:

represents the input at time T, h _T represents the final hidden state at time T,

Combination with h _T-1 ;

Represents the identity matrix, m is the number of AUs in each frame of image;

and

represent adaptive learning for z _T , r _T and

The weight matrix of , c' represents the dimension of the AU feature, c'=12c, c is the setting parameter related to the overall AU recognition network model;

R(X) indicates that the element at each index position in the two-dimensional matrix X is subjected to de-negative processing. After the de-negative processing, the element X _ab at the (a, b) index position in the two-dimensional matrix X is updated to X _ab = max(0,X _ab );

Z=X⊙Y means that a two-dimensional matrix Z is obtained by performing a function operation on a two-dimensional matrix X and a three-dimensional matrix Y, and the element at the (a, b) index position in the two-dimensional matrix Z

X _ak represents the element at the (a, k) index position in the two-dimensional matrix X, and Y _akb represents the element at the (a, k, b) index position in the three-dimensional matrix Y;

"★" means element-wise multiplication, C(·) means concatenation operation, σ(·) means Sigmoid function, and tanh(·) means hyperbolic tangent activation function.

5. the facial action unit recognition method based on adaptive attention and space-time association according to claim 1, is characterized in that: in described step S06, adopt a fully connected module IV to identify the AU of each frame of image, to The spatiotemporal correlation features of all AUs included in the t frame image output in step S05

The spatiotemporal correlation feature of the jth AU of the image in the ith frame is

Input to fully connected module IV, fully connected module IV pair

The jth one-dimensional fully connected layer is followed by a sigmoid activation function to predict the final occurrence probability of the jth AU of the ith frame image. The fully connected module IV uses the same full connection for the same AU of different frame images Connection layer; the loss function used for AU recognition is:

in:

Represents the final prediction probability of the occurrence of the jth AU in the ith frame image; ω _j represents the weight of the jth AU; _vj represents the weight of the jth AU when it appears.

6. A device for recognizing any one of the facial action units based on adaptive attention and spatiotemporal association described in 1 to 5, characterized in that it comprises an image frame sequence acquisition unit, a hierarchical multi-scale area learning unit, an automatic Adaptive attention regression and feature extraction unit, adaptive spatiotemporal graph convolution learning unit, AU recognition unit and parameter optimization unit;

The image frame sequence acquisition unit is used for extracting the original continuous images required for training from the video data to form a training data set, and preprocessing the original continuous image frames to obtain the augmented image frame sequence;

The layered multi-scale region learning unit, including the convolutional neural network module I, adopts the layered multi-scale region layer to learn the features of each local block under different scales of the input image of each frame, and independently filter each local block. ;

The adaptive attention regression and feature extraction unit, including the convolutional neural network module II, is used to generate a global attention map of the image, and perform adaptive regression under the supervision of a predefined attention map and AU detection loss, while accurate to extract AU features.

The adaptive spatiotemporal graph convolution learning unit includes an adaptive spatiotemporal graph convolutional neural network module III, which performs the specific pattern learning of each AU, the spatiotemporal correlation learning between different AUs, and extracts the spatiotemporal correlation features of each AU ;

The AU identification unit, including the fully connected module IV, utilizes the space-time correlation feature of each AU to effectively carry out AU identification;

The parameter optimization unit calculates the parameters and loss function values of the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatiotemporal graph convolutional neural network module III and the fully connected module IV , and update the parameters based on the gradient optimization method.