CN112633153A - Facial expression motion unit identification method based on space-time graph convolutional network - Google Patents

Facial expression motion unit identification method based on space-time graph convolutional network

Info

Publication number
CN112633153A
CN112633153A
Authority
CN
China
Prior art keywords
time
space
graph
relation
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011528440.6A
Other languages
Chinese (zh)
Inventor
刘志磊
张庆阳
董威龙
陈浩阳
都景舜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202011528440.6A
Publication of CN112633153A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial expression motion unit identification method based on a space-time graph convolutional network. The invention applies a space-time graph convolutional network to identify facial motion units, models the spatio-temporal dependencies between AUs with an undirected spatio-temporal graph model, and learns AU depth representation features with the space-time graph convolutional network, thereby improving the accuracy of AU identification. The method can effectively alleviate problems such as poor robustness and low accuracy of AU detection models, and can be widely applied to expression analysis, affective computing and human-computer interaction.

Description

Facial expression motion unit identification method based on space-time graph convolutional network
Technical Field
The invention relates to the technical field of computer vision and affective computing, and in particular to human facial expression motion unit (AU) recognition based on a space-time graph convolutional network (ST-GCN).
Background
Facial expressions can reveal a person's mental activities, psychological states and outwardly communicated social behaviors. With the development of artificial intelligence, human-centered facial expression recognition has gradually attracted widespread attention from industry and academia. Expression analysis using the Facial Action Coding System is one of the common approaches to facial expression recognition.
The Facial Action Coding System (FACS) anatomically divides the human face into 44 facial motion units according to muscle movement, each representing the motion of the muscles in a different facial region. For example, AU7 indicates whether the eyelids are tightened and AU23 indicates whether the lips are tightened. Compared with the six basic expressions (anger, disgust, fear, happiness, sadness and surprise), the AU-based description of facial expression is more objective and finer-grained, and avoids the annotation ambiguity introduced by the subjective judgment of observers. Computer-based automatic AU detection can analyze facial expressions accurately and thereby help understand individual emotion, and has good application prospects in driver fatigue detection, patient pain estimation, psychological research and other fields.
With the development of deep-learning theory and techniques, AU detection has also made remarkable progress. However, AU detection still faces many challenges. AU data in practical application scenarios are highly complex, and factors such as head pose, occlusion and complex illumination significantly degrade the performance of AU recognition models. In addition, individual differences in ethnicity, skin color, age and gender introduce large intra-class variation, which also significantly affects recognition accuracy. Meanwhile, AU annotation can only be completed by trained experts at considerable time cost, so the datasets available for training are far from covering every population in complex scenes; data samples are small and overfitting occurs easily.
Existing AU detection models consider only the associations between different AUs at the same time point, and ignore the correlations between AUs across space and time.
Disclosure of Invention
To overcome the deficiencies of the prior art, the present invention aims to apply a space-time graph convolutional network (ST-GCN) to AU recognition.
The invention discloses a facial expression motion unit identification method based on a space-time graph convolutional network.
The method comprises the following specific steps:
First, feature extraction is performed on each AU local area by an autoencoder. For each frame image in the image frame sequence, the center position of each AU is obtained from the facial key point (landmark) information, and a region of size n x n around that center is taken as the local region of the corresponding AU. All local regions of interest (ROIs) where AUs are located are input into an autoencoder (AE) specific to each AU for encoding, thereby obtaining a d0-dimensional depth representation that fully contains the AU information.
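By way of a non-limiting illustration (not part of the original disclosure), the ROI extraction and per-AU encoding step may be sketched as follows; the region size n, the encoder architecture and the landmark-derived AU center are illustrative assumptions:

import torch
import torch.nn as nn

class AUAutoencoder(nn.Module):
    """Per-AU convolutional autoencoder: encodes an n x n ROI into a d0-dimensional vector."""
    def __init__(self, n: int = 32, d0: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # n -> n/2
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # n/2 -> n/4
            nn.Flatten(),
            nn.Linear(32 * (n // 4) ** 2, d0),
        )
        self.decoder = nn.Sequential(
            nn.Linear(d0, 32 * (n // 4) ** 2), nn.ReLU(),
            nn.Unflatten(1, (32, n // 4, n // 4)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, roi):
        z = self.encoder(roi)        # d0-dimensional depth representation of the AU ROI
        recon = self.decoder(z)      # reconstruction used by the pixel-level loss L_R
        return z, recon

def crop_au_roi(frame: torch.Tensor, au_center, n: int = 32) -> torch.Tensor:
    """Crop an n x n region around the AU center (cx, cy) computed from facial landmarks.
    frame: (3, H, W) image tensor; the center is assumed to lie at least n/2 pixels from the border."""
    cx, cy = au_center
    half = n // 2
    return frame[:, cy - half:cy + half, cx - half:cx + half]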
Next, an undirected spatio-temporal relationship graph of the AU sequence is constructed, thereby modeling the spatio-temporal relationships between AUs. Each node in the AU spatio-temporal relationship graph consists of the depth representation vector of one AU extracted in step (1), and the nodes in the AU relationship graph are connected according to how closely they are related.
Construction of the spatial relationship: a relationship matrix M representing the closeness of the association between AUs is constructed by counting the co-occurrence probabilities of the AUs in the training set. A threshold h is then set, and AUs whose association closeness is greater than h are connected, thereby modeling the spatial adjacency between AUs.
Construction of the temporal relationship: a time threshold τ is set, and nodes that belong to the same AU and whose time interval in the image frame sequence does not exceed τ image frames are connected, thereby modeling the temporal relations between AU nodes.
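One possible, non-limiting sketch of this graph construction is given below; the binary label format, the concrete thresholds h and τ, and the block layout of the adjacency matrix are illustrative assumptions rather than requirements of the method:

import numpy as np

def cooccurrence_matrix(labels: np.ndarray) -> np.ndarray:
    """labels: (num_samples, num_AUs) binary AU annotations from the training set.
    Returns M[i, j] = P(AU_j active | AU_i active), a measure of association closeness."""
    counts = labels.T @ labels                   # joint occurrence counts
    occ = np.clip(labels.sum(axis=0), 1, None)   # per-AU occurrence counts (avoid division by zero)
    return counts / occ[:, None]

def build_st_adjacency(M: np.ndarray, T: int, h: float = 0.4, tau: int = 2) -> np.ndarray:
    """Undirected spatio-temporal adjacency for T frames, each with C = M.shape[0] AU nodes.
    Spatial edges: AU pairs whose association closeness exceeds h (within one frame).
    Temporal edges: the same AU in frames at most tau apart."""
    C = M.shape[0]
    A = np.zeros((T * C, T * C))
    spatial = (np.maximum(M, M.T) > h).astype(float)   # symmetrized for an undirected graph
    for t in range(T):
        A[t * C:(t + 1) * C, t * C:(t + 1) * C] = spatial
        for dt in range(1, tau + 1):
            if t + dt < T:
                idx = np.arange(C)
                A[t * C + idx, (t + dt) * C + idx] = 1.0   # same AU, nearby frames
                A[(t + dt) * C + idx, t * C + idx] = 1.0
    np.fill_diagonal(A, 1.0)                               # self-connections
    return A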
Finally, AU recognition is performed based on the ST-GCN (space-time graph convolution model). Multiple graph convolution operations are applied by the ST-GCN to the AU sequence spatio-temporal relationship graph constructed in step (2), yielding a depth AU feature representation that contains both spatial and temporal information. Finally, the depth AU feature representation is classified by a fully connected neural network to obtain the AU identification result.
In step (1), specifically, the input video is split into frames and the AU local areas (ROIs) on each frame image are extracted. First, taking each AU key point on each frame image as the center, a region of size n x n is extracted as the local region where that AU is located. Then each extracted AU local area is fed into a separate autoencoder (AE) for encoding, yielding a feature vector that contains the information relevant to that specific AU. During the autoencoder training for each AU local region, the following two loss functions are used as constraints.
The first is the pixel-level reconstruction loss function L_R:

L_R = (1/(n×n)) · Σ_{l=1..n} Σ_{m=1..n} ‖I_GT(l, m) − I_R(l, m)‖²

where n is the size of each AU ROI, I_GT is the ground-truth AU ROI, and I_R is the reconstructed AU ROI image.
The second is the ROI-level multi-label AU detection loss function:

L_ROI_softmax = −(1/R) · Σ_{i=1..R} Σ_{j=1..C} [ Y_ROI(i, j)·log Ŷ_ROI(i, j) + (1 − Y_ROI(i, j))·log(1 − Ŷ_ROI(i, j)) ]

where C is the number of AU categories, R is the number of ROIs obtained in the previous step, Y_ROI ∈ {0, 1}^{R×C} is the ground-truth AU label matrix, and Ŷ_ROI(i, j) is the predicted probability that AU j is active in ROI i. Y_ROI(i, j) = 0 indicates that AU j is not active in AU ROI i, and Y_ROI(i, j) = 1 indicates that AU j is active in AU ROI i. This loss measures whether the current ROI contains a particular AU.
Finally, the two loss functions are combined using a trade-off parameter λ1, and the loss function ultimately used for AU depth representation extraction is obtained as:

L_ROI = L_ROI_softmax + λ1·L_R
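A minimal sketch of this combined loss is given below, assuming a mean-squared reconstruction term and a multi-label binary cross-entropy term as plausible instantiations of L_R and L_ROI_softmax (the text above defines both only by their roles):

import torch
import torch.nn.functional as F

def au_roi_loss(recon: torch.Tensor, roi_gt: torch.Tensor,
                logits: torch.Tensor, y_roi: torch.Tensor, lambda1: float = 1.0) -> torch.Tensor:
    """Combined loss L_ROI = L_ROI_softmax + lambda1 * L_R.

    recon, roi_gt : (R, 3, n, n) reconstructed and ground-truth AU ROIs
    logits, y_roi : (R, C) per-ROI AU prediction logits and binary labels (float tensors)
    """
    l_r = F.mse_loss(recon, roi_gt)                                  # pixel-level reconstruction term
    l_softmax = F.binary_cross_entropy_with_logits(logits, y_roi)    # ROI-level multi-label AU term
    return l_softmax + lambda1 * l_r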
In step (2), an undirected spatio-temporal relationship graph of the AU sequence is further constructed. The neighbor relations in the AU sequence spatio-temporal relationship graph fall into three categories: the relation of an AU node to itself, spatial neighbor relations and temporal neighbor relations. For a node v_ti, its neighbor set B(v_ti) is defined as follows:

B(v_ti) = { v_qj | d(v_tj, v_ti) ≥ K, |q − t| ≤ Γ }

where v_ti is the i-th node of frame t, B(v_ti) is the neighbor set of node v_ti, d(x, y) denotes the co-occurrence probability between two nodes in the same frame, |x − y| is the time interval between two nodes, and K and Γ are the thresholds on the co-occurrence probability and the time distance, respectively. K and Γ serve as hyper-parameters of the model, and suitable values are selected through training. Finally, a spatio-temporal adjacency matrix of the AUs is obtained; each node in the relationship graph represents an AU feature vector, and adjacent nodes represent AUs that are closely related to the current AU node in time or space. The spatio-temporal relationship graph of the AUs can be represented by this AU spatio-temporal adjacency matrix.
In step (3), specifically, graph convolution operations are performed on the AU spatio-temporal relationship graph constructed in step (2) using the ST-GCN model; that is, a graph convolution operation is carried out on each node of the spatio-temporal relationship graph. In this step, to facilitate the convolution operation, the neighbor set is divided into different subsets, and a single-frame mapping function l is defined first:

l: B(v_ti) → {0, 1, …, K−1}

The mapping function l maps the neighbor set B of v_ti into K different subsets, each corresponding to a label number. To handle the subsets in the time dimension at the same time, a mapping function l_ST is defined:

l_ST(v_ti, v_qj) = l(v_tj) + |q − t|

that is, the time interval distance is added on top of the original single-frame subset label; the two arguments of l_ST are the current node and the neighbor node to be mapped, respectively.
The formula for the graph convolution operation is as follows:

f_out(v_ti) = Σ_{v_qj ∈ B(v_ti)} (1/Z_ti(v_qj)) · f_in(v_qj) · w(l_ST(v_ti, v_qj))

where f_in(·) and f_out(·) are the input and output feature values of a node before and after the convolution, respectively; Z_ti(v_tj) = |{ v_tk | l_ti(v_tk) = l_ti(v_tj) }| is the cardinality of the subset containing v_tj, used here as a normalization term to balance the influence of different subset sizes on the result; and w(·) is the weight function owned by each neighbor subset, obtained through learning.
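A minimal sketch of such a subset-partitioned graph convolution is shown below; the per-subset adjacency slices, the degree-based normalization standing in for Z, and the tensor layout are illustrative assumptions, and in practice an off-the-shelf ST-GCN implementation could be used instead:

import torch
import torch.nn as nn

class PartitionedGraphConv(nn.Module):
    """Graph convolution over the AU spatio-temporal graph whose neighbor set is split into
    K subsets; each subset has its own learned weight w(.) and is normalized by a term that
    plays the role of Z (here folded into a row-normalized adjacency slice)."""
    def __init__(self, in_dim: int, out_dim: int, A_subsets: torch.Tensor):
        super().__init__()
        # A_subsets: (K, N, N), one adjacency slice per neighbor subset
        self.register_buffer("A", self._row_normalize(A_subsets))
        self.weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(A_subsets.shape[0])]
        )

    @staticmethod
    def _row_normalize(A: torch.Tensor) -> torch.Tensor:
        deg = A.sum(dim=-1, keepdim=True).clamp(min=1.0)   # subset cardinality per node
        return A / deg

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, in_dim) AU node features; returns (batch, N, out_dim)
        out = 0
        for k, w in enumerate(self.weights):
            out = out + torch.einsum("nm,bmc->bnc", self.A[k], w(x))
        return torch.relu(out)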
The above formula is then applied as a convolution over the adjacency matrix of the spatio-temporal relationship graph; after the graph convolution operation has been applied repeatedly, the depth feature vectors can fully represent the relations among all AUs as well as the features of the AUs themselves.
Finally, the depth feature vectors are classified using a fully connected neural network to obtain the AU identification result. The beneficial effects of the invention are that it adopts a space-time graph convolutional network to identify facial motion units, models the spatio-temporal dependencies between AUs with an undirected spatio-temporal graph model, and learns AU depth representation features with the space-time graph convolutional network, thereby improving the accuracy of AU identification. The method can effectively alleviate problems such as poor robustness and low accuracy of AU detection models, and can be widely applied to expression analysis, affective computing, human-computer interaction and other fields.
1. The main innovation of the method is that a spatio-temporal graph convolutional network model is applied to expression motion unit (AU) recognition, so that AU recognition can be carried out on an image frame sequence. Compared with single-frame AU recognition methods, the method takes the spatio-temporal relations between AUs into account and therefore has important research and application value;
2. The AU recognition algorithm, built with deep learning, a frontier technique of artificial intelligence, realizes AU detection by modeling the spatio-temporal relations of facial motion units, and provides an important theoretical basis and technical platform for facial expression recognition in the field of artificial intelligence.
3. The method can be applied simultaneously to research on affective computing, human-computer interaction and facial expression recognition.
Drawings
FIG. 1 is a diagram illustrating the steps of the present invention.
FIG. 2 is a diagram of the depth autoencoder (AE) model.
Fig. 3 is a diagram of the final effect of the present invention.
Detailed Description
The invention extracts features of the facial motion unit (AU) regions with an autoencoder, then constructs a spatio-temporal relationship graph of the AU sequence based on the spatio-temporal relations between AUs, and finally performs graph convolution on the AU spatio-temporal relationship graph with a space-time graph convolutional network model and uses a fully connected network for AU identification, so as to detect the occurrence and intensity of each AU.
The method comprises the following specific steps:
First, the input image frame sequence is split into frames and the AU local areas (ROIs) in each frame image are extracted. Depth features are then extracted from the key AU regions (ROIs) of the face using an autoencoder.
Then, taking the AU depth representation vectors extracted in the previous step as nodes, an undirected spatio-temporal relationship graph of the AU sequence is constructed. The nodes are connected in space and in time according to how closely they are related, thereby modeling the spatio-temporal relations between them. A relationship matrix M expressing the closeness of the association between AUs is constructed by counting the conditional probabilities of co-occurrence between AUs in the training set. A threshold h is set, and AU nodes whose association closeness is greater than h are connected. A time threshold τ is set, and nodes that belong to the same AU and are at most τ frames apart are connected.
Finally, AU recognition is performed based on the ST-GCN (space-time graph convolution model). Multiple graph convolution operations are applied by the ST-GCN to the undirected spatio-temporal relationship graph of the AU sequence to obtain a depth AU feature representation containing spatial and temporal information, and the depth AU feature representation is classified by a fully connected neural network to obtain the AU identification result.
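A compact, non-limiting sketch of how these three stages may be chained is given below; the layer count, the hidden dimension and the sigmoid output head are illustrative assumptions, and AUAutoencoder and PartitionedGraphConv refer to the earlier sketches:

import torch
import torch.nn as nn

class AURecognizer(nn.Module):
    """Per-frame AU depth features -> stacked graph convolutions -> fully connected AU classifier."""
    def __init__(self, d0: int, num_aus: int, A_subsets: torch.Tensor, hidden: int = 128):
        super().__init__()
        self.gcn1 = PartitionedGraphConv(d0, hidden, A_subsets)
        self.gcn2 = PartitionedGraphConv(hidden, hidden, A_subsets)
        self.classifier = nn.Linear(hidden, num_aus)

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        # node_feats: (batch, T * num_aus, d0) AU depth representations stacked over T frames
        h = self.gcn2(self.gcn1(node_feats))
        return torch.sigmoid(self.classifier(h))   # per-node AU activation scores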
The present invention will be described in further detail with reference to the accompanying drawings and specific examples.
The specific implementation steps of the invention are shown in FIG. 1 and mainly comprise the following three steps:
(1) AU local region depth representation feature extraction based on a convolutional autoencoder:
The input image frame sequence is split into frames and the AU local regions (ROIs) on each frame image are extracted. First, taking each AU key point on each frame image as the center, a region of size n x n is extracted as the local region where that AU is located. Then each extracted AU local area is fed into a separate autoencoder (AE) for encoding to obtain a feature vector containing the information relevant to that specific AU. During the autoencoder training for each AU local region, the following two loss functions are used as constraints.
The first is a pixel-level reconstruction loss function L_R:

L_R = (1/(n×n)) · Σ_{l=1..n} Σ_{m=1..n} ‖I_GT(l, m) − I_R(l, m)‖²

where n is the size of each AU ROI, I_GT is the ground-truth AU ROI, and I_R is the reconstructed AU ROI image.
The second is a multi-label AU detection loss function at the ROI level:

L_ROI_softmax = −(1/R) · Σ_{i=1..R} Σ_{j=1..C} [ Y_ROI(i, j)·log Ŷ_ROI(i, j) + (1 − Y_ROI(i, j))·log(1 − Ŷ_ROI(i, j)) ]

where C is the number of AU categories, R is the number of ROIs obtained in the previous step, Y_ROI ∈ {0, 1}^{R×C} is the ground-truth AU label matrix, and Ŷ_ROI(i, j) is the predicted probability that AU j is active in ROI i. Y_ROI(i, j) = 0 indicates that AU j is not active in AU ROI i, and Y_ROI(i, j) = 1 indicates that AU j is active in AU ROI i. This loss measures whether the current ROI contains a particular AU.
Finally, the two loss functions are combined using a trade-off parameter λ1, and the loss function ultimately used for AU depth representation extraction is obtained as:

L_ROI = L_ROI_softmax + λ1·L_R
(2) Construction of the AU spatio-temporal relationship graph model:
The spatio-temporal relations of the AUs are modeled with an undirected graph model, where each node of the AU spatio-temporal relationship graph consists of one AU depth representation vector from (1). The neighbor relations in the spatio-temporal relationship graph of the AU sequence fall into three categories: the relation of an AU node to itself, spatial neighbor relations and temporal neighbor relations. For a node v_ti, its neighbor set B(v_ti) is defined as follows:

B(v_ti) = { v_qj | d(v_tj, v_ti) ≥ K, |q − t| ≤ Γ }

where v_ti is the i-th node of frame t, B(v_ti) is the neighbor set of node v_ti, d(x, y) denotes the co-occurrence probability between two nodes in the same frame, |x − y| is the time interval between two nodes, and K and Γ are the thresholds on the co-occurrence probability and the time distance, respectively. K and Γ serve as hyper-parameters of the model, and suitable values are selected through training. Finally, a spatio-temporal adjacency matrix of the AUs is obtained; each node in the relationship graph represents an AU feature vector, and adjacent nodes represent AUs that are closely related to the current AU node in time or space. The spatio-temporal relationship graph of the AUs can be represented by this AU spatio-temporal adjacency matrix.
(3) AU identification based on space-time graph convolutional network
A graph convolution operation is performed on the AU spatio-temporal relationship graph constructed in the previous step through the space-time graph convolution model; that is, a graph convolution operation is carried out on each node of the spatio-temporal relationship graph. In this step, to facilitate the convolution operation, the neighbor set is divided into different subsets, and a single-frame mapping function l is defined first:

l: B(v_ti) → {0, 1, …, K−1}

The mapping function l maps the neighbor set B of v_ti into K different subsets, each corresponding to a label number. To handle the subsets in the time dimension at the same time, a mapping function l_ST is defined:

l_ST(v_ti, v_qj) = l(v_tj) + |q − t|

that is, the time interval distance is added on top of the original single-frame subset label; the two arguments of l_ST are the current node and the neighbor node to be mapped, respectively.
The formula for the graph convolution operation is as follows:

f_out(v_ti) = Σ_{v_qj ∈ B(v_ti)} (1/Z_ti(v_qj)) · f_in(v_qj) · w(l_ST(v_ti, v_qj))

where f_in(·) and f_out(·) are the input and output feature values of a node before and after the convolution, respectively; Z_ti(v_tj) = |{ v_tk | l_ti(v_tk) = l_ti(v_tj) }| is the cardinality of the subset containing v_tj, used here as a normalization term to balance the influence of different subset sizes on the result; and w(·) is the weight function owned by each neighbor subset, obtained through learning.
Convolution is then applied over the adjacency matrix of the spatio-temporal relationship graph using the above formula; after the graph convolution operation has been applied repeatedly, the depth feature vectors can fully represent the connections among AUs as well as the features of the AUs themselves.
Finally, the depth feature vectors are classified using a fully connected neural network to obtain the classification result of AU identification.
To summarize:
the invention provides a detection and extraction method of a facial expression unit based on space-time image convolution, which can be used for application of emotion recognition and the like by using AU (AU). The method comprises the steps of obtaining an AU center through a face key point, constructing a spatiotemporal relation graph of an AU sequence, and then identifying and extracting intensity values of the AU by using a spatiotemporal graph convolution model, so that the fast identification and detection of the AU can be realized, and the problems of emotion identification by using the AU and the like are solved. The method can be widely applied to the emotion recognition of people by machines in different scenes, so that different interactions can be made according to the emotion types of people, and the method has important value in popularization and application of emotion recognition interaction based on facial expressions.

Claims (5)

1. A facial expression motion unit identification method based on a space-time graph convolutional network, characterized in that feature extraction is carried out on the facial motion unit (AU) regions by a convolutional autoencoder, a spatio-temporal relationship graph of the AU sequence is then constructed according to the closeness of the spatio-temporal relations between AUs, and finally AU identification is performed based on an ST-GCN.
2. The facial expression motion unit identification method based on the space-time graph convolutional network as claimed in claim 1, characterized in that the specific steps are as follows:
1) feature extraction of AU local areas by an autoencoder:
acquiring the center position of each AU for each frame image in the image frame sequence based on the facial key point information, and dividing an n x n area according to the center position of the AU as the local area where the corresponding AU is located;
all local areas where AUs are located are input into an autoencoder specific to each AU for encoding, so that a d0-dimensional depth representation that fully contains the AU information is obtained;
2) constructing an undirected spatio-temporal relationship graph of AU sequences, thereby modeling the spatio-temporal relationship between AUs: each node in the AU space-time relationship graph is composed of depth expression vectors of one AU extracted in 1), and the nodes in the AU relationship graph are connected according to the degree of closeness of mutual connection;
constructing a spatial relationship: constructing a relationship matrix M to represent the association affinity degree between AUs by counting the co-occurrence probability of AUs in a training set, setting a threshold value h, and connecting AUs with the association affinity degree larger than h, thereby modeling the spatial adjacency relation of the AUs;
constructing a time relation: setting a time threshold tau, and connecting nodes which have time intervals not exceeding tau image frames and belong to the same AU in an image frame sequence, so as to model the relation of AU nodes in time;
3) AU identification is carried out based on the ST-GCN space-time graph convolution model: multiple graph convolution operations are performed with the ST-GCN on the AU sequence spatio-temporal relationship graph constructed in step 2) to obtain a depth AU feature representation containing spatial and temporal information, and the depth AU feature representation is classified by a fully connected neural network to obtain the AU identification result.
3. The method of recognizing facial expression motion units of a space-time graph convolutional network as claimed in claim 2, wherein step 1) specifically divides the input video to extract AU local regions on each frame image:
firstly, taking each AU key point on each frame image as a center, and extracting an n x n area as a local area where the AU is located;
then, each extracted AU local area is sent to a separate autoencoder for encoding so as to obtain a feature vector containing the information relevant to that specific AU, and in the autoencoder learning process of each AU local area, the following two loss functions are used as constraints:
the first is the pixel-level reconstruction loss function L_R:

L_R = (1/(n×n)) · Σ_{l=1..n} Σ_{m=1..n} ‖I_GT(l, m) − I_R(l, m)‖²

where n is the size of each AU ROI, I_GT is the ground-truth AU ROI, and I_R is the reconstructed AU ROI image;
the second is the ROI-level multi-label AU detection loss function:

L_ROI_softmax = −(1/R) · Σ_{i=1..R} Σ_{j=1..C} [ Y_ROI(i, j)·log Ŷ_ROI(i, j) + (1 − Y_ROI(i, j))·log(1 − Ŷ_ROI(i, j)) ]

wherein: C is the number of AU categories, R is the number of ROIs obtained in the previous step, Y_ROI ∈ {0, 1}^{R×C} is the ground-truth AU label matrix, and Ŷ_ROI(i, j) is the predicted probability that AU j is active in ROI i; Y_ROI(i, j) = 0 indicates that AU j is not active in AU ROI i, and Y_ROI(i, j) = 1 indicates that AU j is active in AU ROI i, which is used for measuring whether the current ROI contains a specific AU;
finally, the two loss functions are combined using a trade-off parameter λ1, and the loss function finally used for AU depth representation extraction is obtained as:

L_ROI = L_ROI_softmax + λ1·L_R
4. the method for recognizing facial expression motion units of a space-time graph convolutional network as claimed in claim 2, wherein the step 2) is specifically: the neighbor relations in the AU sequence spatio-temporal relation graph are divided into three categories: the neighbor relation between the AU node and the AU node, the spatial neighbor relation and the temporal neighbor relation;
for a certain node vtiIts neighbor set B (v)ti) The definition is as follows:
Figure FDA0002851518920000026
vtiis the ith node of the t frame, B (v)ti) Is v istiA neighbor set of nodes, d (x, y) refers to the co-occurrence probability between two nodes in the same frame, | x-y | is the spacing distance between two nodes in time, and K and Γ are the threshold values of the co-occurrence probability and the time distance respectively;
k, gamma is used as a hyper-parameter of the model, and a proper value is selected through training;
finally, a space-time relation adjacency matrix of the AU is obtained, each node in the relation graph represents an AU characteristic vector, adjacent nodes represent the AU which has close relation with the current AU node in time or space, and the space-time relation graph of the AU can be represented through the AU space-time relation adjacency matrix.
5. The method for recognizing facial expression motion units of a space-time graph convolutional network as claimed in claim 2, wherein step 3) is specifically: the ST-GCN model is used to perform graph convolution operations on the AU spatio-temporal relationship graph constructed in step 2), that is, a graph convolution operation is carried out on each node of the spatio-temporal relationship graph; the neighbor set is divided into different subsets, and a single-frame mapping function l is defined first:

l: B(v_ti) → {0, 1, …, K−1}

the mapping function l maps the neighbor set B of v_ti into K different subsets, each corresponding to a label number;
to handle the subsets in the time dimension at the same time, a mapping function l_ST is defined:

l_ST(v_ti, v_qj) = l(v_tj) + |q − t|

that is, the time interval distance is added on top of the original single-frame subset label; the two arguments of l_ST are the current node and the neighbor node to be mapped, respectively;
the formula for the graph convolution operation is as follows:

f_out(v_ti) = Σ_{v_qj ∈ B(v_ti)} (1/Z_ti(v_qj)) · f_in(v_qj) · w(l_ST(v_ti, v_qj))

where f_in(·) and f_out(·) are the input and output feature values of a node before and after the convolution, respectively; Z_ti(v_tj) = |{ v_tk | l_ti(v_tk) = l_ti(v_tj) }| is the cardinality of the subset containing v_tj, used here as a normalization term to balance the influence of different subset sizes on the result; and w(·) is the weight function owned by each neighbor subset, obtained through learning;
convolution is then applied over the adjacency matrix of the spatio-temporal relationship graph using the above formula, and after the graph convolution operation has been applied repeatedly, the depth feature vectors fully represent the relations among AUs and the features of the AUs;
finally, the depth feature vectors are classified using a fully connected neural network to obtain the classification result of AU identification.
CN202011528440.6A 2020-12-22 2020-12-22 Facial expression motion unit identification method based on space-time graph convolutional network Pending CN112633153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011528440.6A CN112633153A (en) 2020-12-22 2020-12-22 Facial expression motion unit identification method based on space-time graph convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011528440.6A CN112633153A (en) 2020-12-22 2020-12-22 Facial expression motion unit identification method based on space-time graph convolutional network

Publications (1)

Publication Number Publication Date
CN112633153A true CN112633153A (en) 2021-04-09

Family

ID=75320901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011528440.6A Pending CN112633153A (en) 2020-12-22 2020-12-22 Facial expression motion unit identification method based on space-time graph convolutional network

Country Status (1)

Country Link
CN (1) CN112633153A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496217A (en) * 2021-07-08 2021-10-12 河北工业大学 Method for identifying human face micro expression in video image sequence
CN114842542A (en) * 2022-05-31 2022-08-02 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN116071809A (en) * 2023-03-22 2023-05-05 鹏城实验室 Face space-time representation generation method based on multi-class representation space-time interaction

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111652124A (en) * 2020-06-02 2020-09-11 电子科技大学 Construction method of human behavior recognition model based on graph convolution network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111652124A (en) * 2020-06-02 2020-09-11 电子科技大学 Construction method of human behavior recognition model based on graph convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHILEI LIU et al.: "Relation modeling with graph convolutional networks for facial action unit detection", International Conference on Multimedia Modeling *
LI Conghui: "Research on facial expression analysis based on saliency features and graph convolution", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496217A (en) * 2021-07-08 2021-10-12 河北工业大学 Method for identifying human face micro expression in video image sequence
CN113496217B (en) * 2021-07-08 2022-06-21 河北工业大学 Method for identifying human face micro expression in video image sequence
CN114842542A (en) * 2022-05-31 2022-08-02 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN116071809A (en) * 2023-03-22 2023-05-05 鹏城实验室 Face space-time representation generation method based on multi-class representation space-time interaction

Similar Documents

Publication Publication Date Title
Kim et al. Deep generative-contrastive networks for facial expression recognition
Miao et al. Recognizing facial expressions using a shallow convolutional neural network
Li et al. Semantic relationships guided representation learning for facial action unit recognition
Mei et al. Unsupervised spatial–spectral feature learning by 3D convolutional autoencoder for hyperspectral classification
Tu et al. Edge-guided non-local fully convolutional network for salient object detection
Lo et al. MER-GCN: Micro-expression recognition based on relation modeling with graph convolutional networks
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
CN113496217B (en) Method for identifying human face micro expression in video image sequence
CN112633153A (en) Facial expression motion unit identification method based on space-time graph convolutional network
CN104992223B (en) Intensive population estimation method based on deep learning
Taheri et al. Structure-preserving sparse decomposition for facial expression analysis
CN108280397B (en) Human body image hair detection method based on deep convolutional neural network
Liu et al. Facial expression recognition using hybrid features of pixel and geometry
CN109902565B (en) Multi-feature fusion human behavior recognition method
CN105740915B (en) A kind of collaboration dividing method merging perception information
CN111028319B (en) Three-dimensional non-photorealistic expression generation method based on facial motion unit
Martinsson et al. Semantic segmentation of fashion images using feature pyramid networks
CN110827304B (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolution network and level set method
Li et al. Deep representation of facial geometric and photometric attributes for automatic 3d facial expression recognition
CN115862120A (en) Separable variation self-encoder decoupled face action unit identification method and equipment
CN112990340B (en) Self-learning migration method based on feature sharing
CN112800979B (en) Dynamic expression recognition method and system based on characterization flow embedded network
Almowallad et al. Human emotion distribution learning from face images using CNN and LBC features
Jia et al. An action unit co-occurrence constraint 3DCNN based action unit recognition approach
Dembani et al. UNSUPERVISED FACIAL EXPRESSION DETECTION USING GENETIC ALGORITHM.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210409