CN114842542B - Facial action unit identification method and device based on self-adaptive attention and space-time correlation - Google Patents

Facial action unit identification method and device based on self-adaptive attention and space-time correlation

Info

Publication number
CN114842542B
Authority
CN
China
Prior art keywords
neural network
space
convolutional neural
network module
convolution
Prior art date
Legal status
Active
Application number
CN202210606040.5A
Other languages
Chinese (zh)
Other versions
CN114842542A (en)
Inventor
邵志文
周勇
陈浩
于清
Current Assignee
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT
Priority to CN202210606040.5A
Publication of CN114842542A
Application granted
Publication of CN114842542B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial action unit recognition method and device based on adaptive attention and spatio-temporal correlation. An end-to-end deep learning framework is adopted to learn action unit recognition, and the interdependencies and spatio-temporal correlations between facial action units are exploited so that the movement of facial muscles in two-dimensional images can be recognized effectively, thereby enabling the construction of a facial action unit recognition system.

Description

Facial action unit identification method and device based on self-adaptive attention and space-time correlation
Technical Field
The invention relates to a facial action unit recognition method and device based on adaptive attention and spatio-temporal correlation, and belongs to the field of computer vision.
Background
To study human facial expressions at a finer granularity, the Facial Action Coding System (FACS) was first proposed by the renowned American psychologist Ekman et al. in 1978 and substantially revised in 2002. According to the anatomical characteristics of the human face, the system divides facial activity into a number of facial action units that are mutually independent yet interconnected; facial expressions can be described through the motion characteristics of these action units and the main facial regions they control.
With the development of computer and information technology, deep learning has been widely applied. In facial action unit (AU) recognition, methods based on deep learning models have become mainstream. Current AU recognition research largely follows two routes: region learning and AU relation learning. If the associations between AUs are not considered, usually only a few sparse regions corresponding to the relevant facial muscles contribute to recognition, and other regions require little attention; finding and focusing on the regions that matter therefore leads to better AU recognition, and approaches addressing this problem are generally called region learning (RL). Furthermore, AUs are defined on the basis of facial muscle anatomy and describe the movement of one or several muscles; some muscles drive several AUs to appear simultaneously, so there is a certain degree of correlation between AUs. Clearly, such correlation information helps improve recognition performance, and the study of how to mine the correlations between AUs and exploit them to improve AU recognition is generally called AU relation learning.
Although automatic recognition of facial action units has made impressive progress, current region-learning-based AU detection methods, which use only AU labels to supervise a neural network in adaptively learning implicit attention, often capture irrelevant regions, since AUs have no obvious contours or textures and may change with identity and expression. In relation-reasoning-based AU detection methods, all AUs share parameters during reasoning, and the specificity and dynamics of each AU are ignored, so recognition accuracy remains limited and there is room for further improvement.
Disclosure of Invention
The purpose of the invention: to overcome the defects of the prior art, the invention provides a facial action unit recognition method and device based on adaptive attention and spatio-temporal correlation, which can adapt to samples from uncontrolled scenes with random and diverse variations in illumination, occlusion, pose, and the like, and is expected to achieve stronger robustness while maintaining high recognition accuracy.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme.
a facial action unit recognition method based on adaptive attention and space-time correlation, comprising the steps of:
S01: extracting original continuous image frames required by training from any video to form a training data set; for a video sequence, the number of original consecutive image frames may be 48 frames;
s02: preprocessing original continuous image frames to obtain an amplified image frame sequence; the method for preprocessing the original continuous image frames comprises random translation, random rotation, random scaling, random horizontal overturning or random cutting and the like, and the generalization capability of the model can be improved to a certain extent by preprocessing the images;
s03: constructing a convolutional neural network module I to extract layered multi-scale region features of each frame in the amplified image frame sequence;
s04: constructing a convolutional neural network module II by utilizing the layered multi-scale region features extracted in the step S03, carrying out global attention map merging on AU to extract AU features, and supervising the convolutional neural network module II through AU detection loss; AU denotes a face action unit;
s05: constructing a self-adaptive space-time diagram convolutional neural network module III by utilizing the AU characteristics extracted in the step S04, and reasoning the specific mode of each AU and the space-time correlation (such as co-occurrence and mutual exclusion) among different AUs so as to learn the space-time correlation characteristics of each AU;
S06: constructing a full connection module IV to realize AU identification by utilizing the space-time correlation characteristics of all AUs extracted in the step S05;
s07: training an integral AU recognition network model formed by a convolutional neural network module I, a convolutional neural network module II, a self-adaptive space-time diagram convolutional neural network module III and a full-connection module IV by using a training data set so as to update parameters of the integral AU recognition network model by using a gradient-based optimization method;
s08: and inputting the video sequence with any given frame number into the trained integral AU recognition network model, and predicting the occurrence probability of the AU.
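Purely for orientation, the following minimal PyTorch sketch shows how the four modules of steps S03-S06 could be composed into one overall model; the class names, argument names and tensor shapes in the comments are illustrative assumptions, not part of the disclosed implementation.

    import torch
    import torch.nn as nn

    class AURecognitionModel(nn.Module):
        """Illustrative composition of modules I-IV (all names are assumptions)."""
        def __init__(self, region_module, attention_module, st_graph_module, fc_module):
            super().__init__()
            self.region_module = region_module        # module I : hierarchical multi-scale region features
            self.attention_module = attention_module  # module II: adaptive attention + AU features
            self.st_graph_module = st_graph_module    # module III: adaptive spatio-temporal graph convolution
            self.fc_module = fc_module                # module IV: per-AU fully connected classifiers

        def forward(self, frames):                    # frames: (t, 3, l, l) augmented image sequence
            feats = self.region_module(frames)        # (t, C, l/4, l/4) multi-scale region features
            au_feats, att_maps = self.attention_module(feats)  # (t, m, d) AU features, (t, m, l/4, l/4) attention
            st_feats = self.st_graph_module(au_feats) # (t, m, d) spatio-temporal correlation features
            return self.fc_module(st_feats), att_maps # (t, m) AU occurrence probabilities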
Specifically, in step S03, since the AUs of different local blocks have different facial structures and texture information, each local block needs to be filtered independently, and different local blocks use different filtering weights. To obtain multi-scale region features, the convolutional neural network module I is adopted to learn the features of each local block at different scales. The convolutional neural network module I comprises two hierarchical multi-scale region layers with the same structure: the input of the convolutional neural network module I serves as the input of the first hierarchical multi-scale region layer, the output of the first hierarchical multi-scale region layer serves, after a max-pooling operation, as the input of the second hierarchical multi-scale region layer, and the output of the second hierarchical multi-scale region layer serves, after a max-pooling operation, as the output of the convolutional neural network module I. Each frame of the augmented image frame sequence is input into the convolutional neural network module I separately, and the output is the hierarchical multi-scale region feature of that frame.
Each hierarchical multi-scale region layer comprises a convolution layer I-I, a convolution layer I-II-I, a convolution layer I-II-II and a convolution layer I-II-III. In convolution layer I-I, the input is convolved once as a whole, and the convolution result is taken as the output of convolution layer I-I. The output of convolution layer I-I is taken as the input of convolution layer I-II-I; in convolution layer I-II-I the input is uniformly divided into local blocks on an 8×8 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-I. The output of convolution layer I-II-I is taken as the input of convolution layer I-II-II; in convolution layer I-II-II the input is uniformly divided into local blocks on a 4×4 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-II. The output of convolution layer I-II-II is taken as the input of convolution layer I-II-III; in convolution layer I-II-III the input is uniformly divided into local blocks on a 2×2 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-III. The outputs of convolution layers I-II-I, I-II-II and I-II-III are concatenated at the channel level (the number of channels after channel-level concatenation equals the number of channels output by convolution layer I-I) and summed with the output of convolution layer I-I, and the result is taken as the output of the hierarchical multi-scale region layer.
Specifically, in step S04, the convolutional neural network module II serves as an adaptive attention learning module. Its input is the hierarchical multi-scale region feature of an image; through the convolutional neural network module II, the global attention map of each AU on each image is predicted, AU features are extracted and AU prediction is performed, and the whole process is supervised by a predefined attention map and an AU detection loss. The details are as follows:
(41) Generating a predicted attention map for each AU: the hierarchical multi-scale region features are input into the adaptive attention learning module; with m AUs per frame, each AU corresponds to an independent branch, and four convolution layers are used to learn the global attention map $\hat{M}_{ij}$ of each AU and to extract AU features.
(42) Generating a true attention map for each AU: each AU has two centers, specified by two related facial feature points, and the true attention map is generated by a Gaussian distribution centered at the AU center points. For the $k$-th center ($k = 1, 2$) of the $j$-th AU of the $i$-th frame, with coordinates $(a^{(k)}_{ij}, b^{(k)}_{ij})$, the true attention weight at position $(a, b)$ of the attention map is
$$M^{(k)}_{ijab} = \exp\Big(-\frac{(a - a^{(k)}_{ij})^2 + (b - b^{(k)}_{ij})^2}{2\sigma_0^2}\Big),$$
where $\sigma_0$ denotes the standard deviation of the Gaussian. Then, at each position the larger of the two attention weights is selected, merging the predefined attention maps of the two AU centers:
$$M_{ijab} = \max\big(M^{(1)}_{ijab}, M^{(2)}_{ijab}\big),$$
and an attention regression loss is employed to encourage the predicted map $\hat{M}_{ij}$ to approach $M_{ij}$:
$$L_a = \frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\sum_{a=1}^{l/4}\sum_{b=1}^{l/4}\big(M_{ijab} - \hat{M}_{ijab}\big)^2,$$
where $L_a$ denotes the loss function of global attention map regression; $t$ denotes the length of the augmented image frame sequence; $m$ denotes the number of AUs in each frame; $l/4 \times l/4$ denotes the size of the global attention map; $M_{ijab}$ denotes the true attention weight of the $j$-th AU of the $i$-th frame at coordinate position $(a, b)$; and $\hat{M}_{ijab}$ denotes the predicted attention weight of the $j$-th AU of the $i$-th frame at coordinate position $(a, b)$.
(43) Extracting AU features and performing AU detection: the predicted global attention map $\hat{M}_{ij}$ is multiplied element-wise with the facial feature map obtained by the fourth convolution layer II-II, which enhances the features of regions with larger attention weights; the resulting output features are input into convolution layer II-III, and the AU features are extracted through a global average pooling layer. An AU detection cross-entropy loss is adopted to promote adaptive training of the attention maps: the learned AU features are fed into a one-dimensional fully connected layer, and the Sigmoid function $\delta(x) = 1/(1+e^{-x})$ then predicts the occurrence probability of each AU.
The weighted cross-entropy loss function adopted for AU recognition is
$$E = -\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_j\big[v_j\,p_{ij}\log \hat{p}_{ij} + (1-p_{ij})\log(1-\hat{p}_{ij})\big],$$
where $E$ denotes the weighted cross-entropy loss function of AU recognition; $p_{ij}$ denotes the true occurrence probability of the $j$-th AU of the $i$-th frame; $\hat{p}_{ij}$ denotes the predicted occurrence probability of the $j$-th AU of the $i$-th frame; $\omega_j$ denotes the weight of the $j$-th AU; and $v_j$ denotes the weight of the occurrence of the $j$-th AU (applied to the first, occurrence term of the cross entropy).
(44) Overall loss function of the convolutional neural network module I and the convolutional neural network module II as a whole: the overall loss function is obtained by combining the attention regression loss and the AU detection loss:
$$L_{AA} = E + \lambda_a L_a,$$
where $L_{AA}$ denotes the overall loss function of the convolutional neural network module I and the convolutional neural network module II as a whole, and $\lambda_a$ denotes the weight of the global attention map regression loss.
Specifically, in step S05, the adaptive spatio-temporal graph convolutional neural network module III is used to extract the specific pattern of each AU and the spatio-temporal correlations between different AUs. The adaptive spatio-temporal graph convolutional neural network module III comprises two spatio-temporal graph convolution layers with the same structure. The m AU features of dimension 12c on each of the t frames are spliced into an overall feature of size t×m×12c, which is used as the input of spatio-temporal graph convolution layer III-I; the output of spatio-temporal graph convolution layer III-I is used as the input of spatio-temporal graph convolution layer III-II, and the output features obtained from spatio-temporal graph convolution layer III-II contain the specific pattern of each AU and the spatio-temporal correlation information between different AUs.
The parameters of the two spatio-temporal graph convolution layers are learned independently. Each spatio-temporal graph convolution layer is formed by combining a spatial graph convolution unit with a gated recurrent unit, and is defined as
$$z_T = \sigma\big((I + N(R(UU^T)))\,C(h_{T-1}, x_T) \circledast QW_z\big),$$
$$r_T = \sigma\big((I + N(R(UU^T)))\,C(h_{T-1}, x_T) \circledast QW_r\big),$$
$$\tilde{h}_T = \tanh\big((I + N(R(UU^T)))\,C(r_T \odot h_{T-1}, x_T) \circledast QW_{\tilde{h}}\big),$$
$$h_T = z_T \odot h_{T-1} + (1 - z_T) \odot \tilde{h}_T,$$
where $x_T$ denotes the input at time $T$, $h_T$ denotes the final hidden state at time $T$ (the output at time $T$), and $\tilde{h}_T$ denotes the candidate (initial) hidden state at time $T$; $z_T$ determines how much of $h_{T-1}$ is retained at time $T$, and $r_T$ determines how $h_{T-1}$ is combined with $x_T$ when forming $\tilde{h}_T$. $I \in \mathbb{R}^{m \times m}$ is the identity matrix, with $m$ the number of AUs in each frame; $U$ is the adaptively learned matrix of the AU relation graph, and $U^T$ denotes the transpose of $U$; $Q$ is the adaptively learned dissociation matrix, with $c_e$ the set number of columns of $Q$; $W_z$, $W_r$ and $W_{\tilde{h}}$ are the adaptively learned weight matrices for $z_T$, $r_T$ and $\tilde{h}_T$, respectively; $c'$ denotes the dimension of an AU feature, with $c' = 12c$, where $c$ is a set parameter of the overall AU recognition network model. For the $j$-th AU, the $j$-th row of $Q$ dissociates from each of $W_z$, $W_r$ and $W_{\tilde{h}}$ a parameter of size $2c' \times c'$.
$R(X)$ denotes setting the negative elements of the two-dimensional matrix $X$ to zero: after the operation, the element $X_{ab}$ at index position $(a, b)$ is updated to $X_{ab} = \max(0, X_{ab})$.
$N(X)$ denotes normalizing the elements at each index position of the two-dimensional matrix $X$.
$Z = X \circledast Y$ denotes the operation between a two-dimensional matrix $X$ and a three-dimensional tensor $Y$ whose result is the two-dimensional matrix $Z$ with elements $Z_{ab} = \sum_k X_{ak} Y_{akb}$, where $X_{ak}$ is the element at index position $(a, k)$ of $X$ and $Y_{akb}$ is the element at index position $(a, k, b)$ of $Y$.
$\odot$ denotes element-wise multiplication, $C(\cdot,\cdot)$ denotes the concatenation operation, $\sigma(\cdot)$ denotes the Sigmoid function, and $\tanh(\cdot)$ denotes the hyperbolic tangent activation function.
Specifically, in step S06, the fully connected module IV performs AU recognition for each frame with one-dimensional fully connected layers. The spatio-temporal correlation features of all AUs in the t frames output by the spatio-temporal graph convolution layers, of size t×m×12c, are decomposed frame by frame and AU by AU into AU feature vectors of dimension 12c; the spatio-temporal correlation feature of the $j$-th AU of the $i$-th frame is input into the fully connected module IV, which predicts the final occurrence probability of the $j$-th AU of the $i$-th frame by passing this feature vector through the $j$-th one-dimensional fully connected layer followed by a Sigmoid activation function, and the fully connected module IV uses the same fully connected layer for the same AU in different frames. Because the obtained features carry the spatio-temporal correlation information between AUs, final AU recognition is facilitated. The loss function adopted for AU recognition is
$$E_{final} = -\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_j\big[v_j\,p_{ij}\log \tilde{p}_{ij} + (1-p_{ij})\log(1-\tilde{p}_{ij})\big],$$
where $E_{final}$ denotes the loss function of AU recognition; $t$ denotes the length of the augmented image frame sequence; $p_{ij}$ denotes the true occurrence probability of the $j$-th AU of the $i$-th frame; $\tilde{p}_{ij}$ denotes the final predicted occurrence probability of the $j$-th AU of the $i$-th frame; $\omega_j$ denotes the weight of the $j$-th AU; and $v_j$ denotes the weight of the occurrence of the $j$-th AU (applied to the first, occurrence term of the cross entropy).
Specifically, in step S07, the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatio-temporal graph convolutional neural network module III and the fully connected module IV is trained in an end-to-end manner: the convolutional neural network part is trained first so that accurate AU features can be extracted as the input of the graph convolutional neural network, and the graph convolutional neural network is then trained to learn the specific pattern and spatio-temporal correlation features of the AUs, with the spatio-temporal correlations between AUs used to promote facial action unit recognition.
A facial action unit recognition device based on adaptive attention and spatio-temporal correlation comprises an image frame sequence acquisition unit, a hierarchical multi-scale region learning unit, an adaptive attention regression and feature extraction unit, an adaptive spatio-temporal graph convolution learning unit, an AU recognition unit and a parameter optimization unit.
The image frame sequence acquisition unit is used for extracting the original continuous images required for training from any video data to form a training data set, and for preprocessing the original continuous image frames to obtain an augmented image frame sequence.
The hierarchical multi-scale region learning unit comprises the convolutional neural network module I, which uses hierarchical multi-scale region layers to learn the features of each local block of each input frame at different scales, filtering each local block independently.
The adaptive attention regression and feature extraction unit comprises the convolutional neural network module II, which generates the global attention map of each image and performs adaptive regression under the supervision of a predefined attention map and an AU detection loss while accurately extracting AU features.
The adaptive spatio-temporal graph convolution learning unit comprises the adaptive spatio-temporal graph convolutional neural network module III, which learns the specific pattern of each AU and the spatio-temporal correlations between different AUs and extracts the spatio-temporal correlation features of each AU.
The AU recognition unit comprises the fully connected module IV, which uses the spatio-temporal correlation features of each AU to perform AU recognition effectively.
The parameter optimization unit computes the parameters and loss function values of the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatio-temporal graph convolutional neural network module III and the fully connected module IV, and updates the parameters with a gradient-based optimization method.
Each frame of the input image frame sequence is passed through the convolutional neural network module I and the convolutional neural network module II separately to obtain the m AU features of that frame; only at the adaptive spatio-temporal graph convolutional neural network module III are the features of all frames spliced together as input. That is, the convolutional neural network module I and the convolutional neural network module II process single images without involving time and can treat the t frames separately, the adaptive spatio-temporal graph convolutional neural network module III processes the t frames simultaneously, and the fully connected module IV processes the spatio-temporal correlation features of the m AUs of a single image without involving time.
Beneficial effects: compared with the prior art, the facial action unit recognition method and device based on adaptive attention and spatio-temporal correlation provided by the invention have the following advantages. (1) In the adaptive attention regression neural network, adaptive attention regression is promoted by the AU detection loss, so that the local features associated with each AU can be captured accurately, and the attention distribution is optimized with a position prior, which improves robustness in uncontrolled scenes. (2) In the adaptive spatio-temporal graph convolutional neural network, each spatio-temporal graph convolution layer fully learns the AU relations in the spatial domain of a single frame, and performs a general convolution operation in the temporal domain to mine the correlations between frames, promoting recognition of the facial AUs of each frame; the network finally outputs the probability of each AU for each frame. (3) Because adaptive learning is used instead of AU relations predefined from prior knowledge, the invention can adapt to samples from uncontrolled scenes with random and diverse variations in illumination, occlusion, pose, and the like, and is expected to achieve stronger robustness while maintaining high recognition accuracy.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the structure of a hierarchical multi-scale region layer;
FIG. 3 is a schematic diagram of the structure of the convolutional neural network module II;
FIG. 4 is a schematic diagram of the structure of a spatio-temporal graph convolution layer;
FIG. 5 is a schematic diagram of the structure of the overall adaptive attention regression neural network and adaptive spatio-temporal graph convolutional neural network model.
Detailed Description
The invention is described in detail below with reference to the drawings and the specific embodiments.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The invention provides a facial action unit recognition method and device based on adaptive attention and spatio-temporal correlation. In the adaptive attention regression neural network, the constraints of the predefined attention maps and the AU detection loss make it possible to capture simultaneously the strongly correlated region of each AU specified by facial feature points and the weakly correlated regions distributed globally; by learning the correlation distribution of each region before feature extraction, the useful information of each AU can be extracted accurately. In the adaptive spatio-temporal graph convolutional neural network, each spatio-temporal graph convolution layer fully learns the AU relations in the spatial domain of a single frame and performs a general convolution operation in the temporal domain to mine the correlations between frames, promoting recognition of the facial AUs of each frame; the network finally outputs the probability of each AU for each frame. Because adaptive learning is used, the method can adapt to samples from uncontrolled scenes with random and diverse variations in illumination, occlusion, pose, and the like, and is expected to achieve stronger robustness while maintaining high recognition accuracy.
FIG. 1 is a flowchart of the facial action unit recognition method based on adaptive attention and spatio-temporal correlation; each specific step is explained below.
S01: the original continuous image frames required for training are extracted from any video to form a training data set, and the length of the extracted continuous image frame sequence is 48.
For a video sequence, a frame sequence length of 48 is selected to avoid the situations where too few frames make it difficult to learn the correct spatio-temporal correlations of the action units, or where too many frames make the learning time too long.
S02: the original continuous image frames are preprocessed to obtain an augmented image frame sequence.
The preprocessing of the original images includes random translation, random rotation, random scaling, random horizontal flipping, random cropping, and the like; preprocessing the images improves the generalization ability of the model to a certain extent.
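For illustration only, a preprocessing pipeline of this kind could be written with torchvision as sketched below; the concrete parameter values (angles, sizes, probabilities) are assumptions rather than values prescribed by the invention, and in practice the same randomly drawn transform should be applied to every frame of one sequence.

    from torchvision import transforms

    # Illustrative per-frame augmentation (parameter values are assumptions).
    augment = transforms.Compose([
        transforms.Resize(200),                                                    # assumed working size
        transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),  # translation/rotation/scaling
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomCrop(size=176),                                           # assumed crop size
        transforms.ToTensor(),
    ])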
S03: the convolutional neural network module I is constructed to extract the hierarchical multi-scale region features of the augmented image frame sequence.
Since the facial action units of different local blocks have different facial structures and texture information, each local block needs to be filtered independently, and different local blocks use different filtering weights. To obtain multi-scale region features, the convolutional neural network module I is adopted to learn the features of each local block at different scales. The convolutional neural network module I comprises two hierarchical multi-scale region layers with the same structure: the input of the convolutional neural network module I serves as the input of the first hierarchical multi-scale region layer, the output of the first hierarchical multi-scale region layer serves, after a max-pooling operation, as the input of the second hierarchical multi-scale region layer, and the output of the second hierarchical multi-scale region layer serves, after a max-pooling operation, as the output of the convolutional neural network module I. Each frame image of the augmented image frame sequence is input into the convolutional neural network module I, and the output of the convolutional neural network module I is the hierarchical multi-scale region features of the augmented image frame sequence.
As shown in FIG. 2, each hierarchical multi-scale region layer comprises a convolution layer I-I, a convolution layer I-II-I, a convolution layer I-II-II and a convolution layer I-II-III. In convolution layer I-I, the input is convolved once as a whole, and the convolution result is taken as the output of convolution layer I-I. The output of convolution layer I-I is taken as the input of convolution layer I-II-I; in convolution layer I-II-I the input is uniformly divided into local blocks on an 8×8 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-I. The output of convolution layer I-II-I is taken as the input of convolution layer I-II-II; in convolution layer I-II-II the input is uniformly divided into local blocks on a 4×4 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-II. The output of convolution layer I-II-II is taken as the input of convolution layer I-II-III; in convolution layer I-II-III the input is uniformly divided into local blocks on a 2×2 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-III. The outputs of convolution layers I-II-I, I-II-II and I-II-III are concatenated at the channel level (the number of channels after channel-level concatenation equals the number of channels output by convolution layer I-I) and summed with the output of convolution layer I-I, and the result is taken as the output of the hierarchical multi-scale region layer.
A max-pooling layer follows each hierarchical multi-scale region layer in the convolutional neural network module I; the pooling kernel size of each max-pooling layer is 2×2 and the stride is 2. The numbers of channels of convolution layers I-I, I-II-I, I-II-II and I-II-III in the first hierarchical multi-scale region layer are 32, 16, 8 and 8, respectively, and the numbers of filters of convolution layers I-I, I-II-I, I-II-II and I-II-III in the first hierarchical multi-scale region layer are 32×1, 16×8×8, 8×4×4 and 8×2×2, respectively; the numbers of channels of convolution layers I-I, I-II-I, I-II-II and I-II-III in the second hierarchical multi-scale region layer are 64, 32, 16 and 16, respectively, and the numbers of filters are 64×1, 32×8×8, 16×4×4 and 16×2×2, respectively. The filter size in every convolution layer is 3×3 and the stride is 1.
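The following is a minimal PyTorch sketch of one hierarchical multi-scale region layer as described above, assuming the first-layer channel sizes (32/16/8/8) and an input whose height and width are divisible by 8; activation and normalization layers, which the text does not detail, are omitted, and all class names are illustrative.

    import torch
    import torch.nn as nn

    class PatchwiseConv(nn.Module):
        """Convolve each cell of a g x g grid with its own 3x3 filters (independent weights per local block)."""
        def __init__(self, in_ch, out_ch, grid):
            super().__init__()
            self.grid = grid
            self.convs = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 3, padding=1) for _ in range(grid * grid)])

        def forward(self, x):
            n, _, h, w = x.shape
            ph, pw = h // self.grid, w // self.grid
            rows = []
            for r in range(self.grid):
                cols = []
                for c in range(self.grid):
                    patch = x[:, :, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
                    cols.append(self.convs[r * self.grid + c](patch))
                rows.append(torch.cat(cols, dim=3))
            return torch.cat(rows, dim=2)          # stitched back to (n, out_ch, h, w)

    class HierarchicalMultiScaleRegionLayer(nn.Module):
        """Sketch of one hierarchical multi-scale region layer (first-layer channel sizes assumed 32/16/8/8)."""
        def __init__(self, in_ch=3):
            super().__init__()
            self.whole = nn.Conv2d(in_ch, 32, 3, padding=1)   # convolution layer I-I
            self.scale8 = PatchwiseConv(32, 16, grid=8)       # I-II-I : 8x8 local blocks
            self.scale4 = PatchwiseConv(16, 8, grid=4)        # I-II-II: 4x4 local blocks
            self.scale2 = PatchwiseConv(8, 8, grid=2)         # I-II-III: 2x2 local blocks

        def forward(self, x):
            y = self.whole(x)
            s8 = self.scale8(y)
            s4 = self.scale4(s8)
            s2 = self.scale2(s4)
            # channel-level concatenation (16 + 8 + 8 = 32 channels) summed with the whole-image branch
            return y + torch.cat([s8, s4, s2], dim=1)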
S04: the convolutional neural network module II is constructed using the hierarchical multi-scale region features extracted in step S03 to learn a global attention map for each AU and to extract AU features, and the convolutional neural network module II is supervised through an AU detection loss; AU denotes a facial action unit.
As shown in FIG. 3, the convolutional neural network module II is a multi-layer convolutional network containing m branches, each branch corresponding to one AU, and it performs adaptive global attention map regression and AU prediction simultaneously. The filter size of each convolution layer is 3×3 and the stride is 1.
(41) Generating a predicted attention map for each AU: the hierarchical multi-scale region features are the input of the convolutional neural network module II, which comprises m branches, each branch corresponding to one AU; four convolution layers are used to learn the global attention map $\hat{M}_{ij}$ of each AU and to extract AU features.
(42) Generating a true attention map for each AU: each AU has two centers, specified by two related facial feature points, and the true attention map is generated by a Gaussian distribution centered at the AU center points. For the $k$-th center ($k = 1, 2$) of the $j$-th AU of the $i$-th frame, with coordinates $(a^{(k)}_{ij}, b^{(k)}_{ij})$, the true attention weight at position $(a, b)$ of the attention map is
$$M^{(k)}_{ijab} = \exp\Big(-\frac{(a - a^{(k)}_{ij})^2 + (b - b^{(k)}_{ij})^2}{2\sigma_0^2}\Big),$$
where $\sigma_0$ denotes the standard deviation of the Gaussian. Then, at each position the larger of the two attention weights is selected, merging the predefined attention maps of the two AU centers:
$$M_{ijab} = \max\big(M^{(1)}_{ijab}, M^{(2)}_{ijab}\big),$$
and an attention regression loss is employed to encourage the predicted map $\hat{M}_{ij}$ to approach $M_{ij}$:
$$L_a = \frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\sum_{a=1}^{l/4}\sum_{b=1}^{l/4}\big(M_{ijab} - \hat{M}_{ijab}\big)^2,$$
where $L_a$ denotes the loss function of global attention map regression; $t$ denotes the length of the augmented image frame sequence; $m$ denotes the number of AUs in each frame; $l/4 \times l/4$ denotes the size of the global attention map; $M_{ijab}$ denotes the true attention weight of the $j$-th AU of the $i$-th frame at coordinate position $(a, b)$; and $\hat{M}_{ijab}$ denotes the predicted attention weight of the $j$-th AU of the $i$-th frame at coordinate position $(a, b)$.
(43) Extracting AU features and performing AU detection: the predicted global attention map $\hat{M}_{ij}$ is multiplied element-wise with the facial feature map obtained by the fourth convolution layer II-II, which enhances the features of regions with larger attention weights; the resulting output features are input into convolution layer II-III, and the AU features are extracted through a global average pooling layer. An AU detection cross-entropy loss is adopted to promote adaptive training of the attention maps: the learned AU features are fed into a one-dimensional fully connected layer, and the Sigmoid function $\delta(x) = 1/(1+e^{-x})$ then predicts the occurrence probability of each AU.
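As a rough illustration only (the exact numbers of layers and channel widths of module II are not fully specified above), one per-AU branch of this kind could look as follows in PyTorch; all sizes and names are assumptions.

    import torch
    import torch.nn as nn

    class AUAttentionBranch(nn.Module):
        """Illustrative single-AU branch of module II: predicts a global attention map,
        re-weights the facial feature map with it, and extracts one AU feature vector."""
        def __init__(self, in_ch=64, mid_ch=64, feat_dim=64):
            super().__init__()
            self.feature_convs = nn.Sequential(                    # roughly convolution layers II-I / II-II
                nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU())
            self.attention_head = nn.Sequential(                   # predicts the global attention map
                nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(mid_ch, 1, 3, padding=1), nn.Sigmoid())
            self.refine = nn.Conv2d(mid_ch, feat_dim, 3, padding=1)  # roughly convolution layer II-III
            self.classifier = nn.Linear(feat_dim, 1)

        def forward(self, region_feats):                # region_feats: (n, in_ch, l/4, l/4)
            fmap = self.feature_convs(region_feats)
            att = self.attention_head(fmap)             # (n, 1, l/4, l/4) predicted attention map
            weighted = fmap * att                       # element-wise enhancement of attended regions
            au_feat = self.refine(weighted).mean(dim=(2, 3))             # global average pooling -> (n, feat_dim)
            prob = torch.sigmoid(self.classifier(au_feat)).squeeze(-1)   # AU occurrence probability
            return au_feat, att.squeeze(1), prob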
The weighted cross-entropy loss function adopted for AU recognition is
$$E = -\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_j\big[v_j\,p_{ij}\log \hat{p}_{ij} + (1-p_{ij})\log(1-\hat{p}_{ij})\big],$$
where $E$ denotes the weighted cross-entropy loss function of AU recognition; $p_{ij}$ denotes the true occurrence probability of the $j$-th AU of the $i$-th frame; $\hat{p}_{ij}$ denotes the predicted occurrence probability of the $j$-th AU of the $i$-th frame; $\omega_j$ denotes the weight of the $j$-th AU; and $v_j$ denotes the weight of the occurrence of the $j$-th AU (applied to the first, occurrence term of the cross entropy).
Since the occurrence rates of different AUs in the training data set differ significantly, and the occurrence rate of most AUs is much lower than their non-occurrence rate, $\omega_j$ and $v_j$ are defined to suppress these two data imbalance problems: they are determined by the occurrence rate of the $j$-th AU in the training data set (the fraction of frames in which the $j$-th AU occurs), so that AUs with lower occurrence rates receive larger weights.
(44) Overall loss function of the convolutional neural network module I and the convolutional neural network module II as a whole: the overall loss function is obtained by combining the attention regression loss and the AU detection loss:
$$L_{AA} = E + \lambda_a L_a,$$
where $L_{AA}$ denotes the overall loss function of the convolutional neural network module I and the convolutional neural network module II as a whole, and $\lambda_a$ denotes the weight of the global attention map regression loss.
S05: the adaptive spatio-temporal graph convolutional neural network module III is constructed using the AU features extracted in step S04 to reason about the specific pattern of each AU and the spatio-temporal correlations (such as co-occurrence and mutual exclusion) between different AUs, so as to learn the spatio-temporal correlation features of each AU.
The adaptive spatio-temporal graph convolutional neural network module III consists of two spatio-temporal graph convolution layers with the same structure. The m AU features of dimension 12c of each frame are spliced into an overall feature of size t×m×12c, which is used as the input of spatio-temporal graph convolution layer III-I; the output of spatio-temporal graph convolution layer III-I is used as the input of spatio-temporal graph convolution layer III-II, and the output features obtained from spatio-temporal graph convolution layer III-II contain the specific pattern of each AU and the spatio-temporal correlations between different AUs.
The parameters of the two spatio-temporal graph convolution layers are learned independently. Each spatio-temporal graph convolution layer is formed by combining a spatial graph convolution unit with a gated recurrent unit; the structure of each spatio-temporal graph convolution layer is shown in FIG. 4. The details are as follows:
(51) Inferring the specific pattern of each AU:
A typical graph convolution is computed in the spectral domain and is well approximated by a first-order Chebyshev polynomial expansion:
$$F_{out} = \hat{D}^{-\frac{1}{2}}(A+I)\hat{D}^{-\frac{1}{2}}\,F_{in}\,\Theta^{(0)},$$
where $F_{in}$ and $F_{out}$ are the input and output of the graph convolution layer, respectively; $I \in \mathbb{R}^{m \times m}$ is the identity matrix; $A \in \mathbb{R}^{m \times m}$ is a symmetric weighted adjacency matrix indicating the strength of the edges between nodes; $\hat{D}$ is the degree matrix with $\hat{D}_{aa} = \sum_b (A+I)_{ab}$; and $\Theta^{(0)}$ is a parameter matrix. The graph convolution essentially transforms the input $F_{in}$ into the output $F_{out}$ by learning the matrices $A$ and $\Theta^{(0)}$, which are shared by all AUs.
Although the above formula can learn the interrelationships of the AUs, it ignores the specific pattern of each AU. For this reason, the associations between AUs are still inferred with the shared adjacency, while an independent parameter $\Theta^{(1)}_j$ is adopted for each AU, giving the convolution operation
$$F_{out} = \hat{D}^{-\frac{1}{2}}(A+I)\hat{D}^{-\frac{1}{2}}\,F_{in} \circledast \Theta^{(1)},$$
where $Z = X \circledast Y$ denotes the operation between a two-dimensional matrix $X$ and a three-dimensional tensor $Y$ whose result is the two-dimensional matrix $Z$ with elements $Z_{ab} = \sum_k X_{ak} Y_{akb}$, $X_{ak}$ being the element at index position $(a, k)$ of $X$ and $Y_{akb}$ the element at index position $(a, k, b)$ of $Y$.
To reduce the number of parameters of the $\Theta$ tensor, a feature dissociation matrix $Q \in \mathbb{R}^{m \times c_e}$ and a shared parameter matrix $W$ are introduced, and the graph convolution is re-expressed as
$$F_{out} = \hat{D}^{-\frac{1}{2}}(A+I)\hat{D}^{-\frac{1}{2}}\,F_{in} \circledast QW,$$
with $\Theta^{(1)} = QW$. The intermediate dimension $c_e$ of this new formulation is typically smaller than $m$; for the $j$-th AU, its parameters $\Theta^{(1)}_j$ can be dissociated from the shared parameter matrix $W$ by the feature dissociation matrix $Q$, and the use of the matrices $Q$ and $W$ facilitates reasoning about the specific pattern of each AU.
(52) Inferring the interrelationships between AUs in the spatial domain:
To reduce the amount of computation, a matrix $U$ is learned directly instead of first learning the adjacency matrix $A$ and then computing the normalized adjacency matrix $\hat{A}$, i.e.
$$\hat{A} = N(R(UU^T)),$$
where $R(\cdot)$ is the Rectified Linear Unit (ReLU) activation function and $N(\cdot)$ is a normalization function; in this way, dependencies between AUs, such as co-occurrence and mutual exclusion, can be encoded adaptively. The graph convolution is then re-expressed as
$$F_{out} = (I + N(R(UU^T)))\,F_{in} \circledast QW.$$
The matrices $U$ and $W$ are parameter matrices shared by all AUs, and in this way the dependencies between AUs in the spatial domain are inferred.
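A compact PyTorch sketch of the adaptive AU graph convolution described in (51) and (52) is given below; the initialization scale, the exact normalization N(·) (row normalization is assumed) and the tensor shapes are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptiveAUGraphConv(nn.Module):
        """Sketch of F_out = (I + N(R(U U^T))) F_in (*) QW with a learned relation matrix U
        and per-AU parameters dissociated from shared Q and W."""
        def __init__(self, m, c_in, c_out, c_e, c_u=8):
            super().__init__()
            self.U = nn.Parameter(torch.randn(m, c_u) * 0.01)             # adaptive AU relation embedding
            self.Q = nn.Parameter(torch.randn(m, c_e) * 0.01)             # feature dissociation matrix
            self.W = nn.Parameter(torch.randn(c_e, c_in * c_out) * 0.01)  # shared parameter matrix
            self.c_in, self.c_out = c_in, c_out

        def adjacency(self):
            A = F.relu(self.U @ self.U.T)                                 # R(.) : keep non-negative relations
            A = A / A.sum(dim=1, keepdim=True).clamp_min(1e-8)            # N(.) : row normalisation (assumed)
            return torch.eye(self.U.shape[0], device=A.device) + A

        def forward(self, x):                                             # x: (batch, m, c_in) AU features
            x = torch.einsum('ab,nbc->nac', self.adjacency(), x)          # mix features over related AUs
            theta = (self.Q @ self.W).view(-1, self.c_in, self.c_out)     # per-AU parameters Theta_j = Q_j W
            return torch.einsum('nak,akb->nab', x, theta)                 # per-AU contraction (the (*) operation)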
(53) Inferring the relationships between frames in the temporal domain:
The gated recurrent unit (GRU) is a popular approach to modeling temporal dynamics. A GRU cell consists of an update gate $z$ and a reset gate $r$, and its gating mechanism at time step $T$ is defined as follows:
$$z_T = \sigma\big(W_z\,C(h_{T-1}, x_T)\big),$$
$$r_T = \sigma\big(W_r\,C(h_{T-1}, x_T)\big),$$
$$\tilde{h}_T = \tanh\big(W_{\tilde{h}}\,C(r_T \odot h_{T-1}, x_T)\big),$$
$$h_T = z_T \odot h_{T-1} + (1 - z_T) \odot \tilde{h}_T,$$
where $h_T$ denotes the final hidden state at time $T$ (the output at time $T$), $\tilde{h}_T$ denotes the candidate (initial) hidden state at time $T$, $z_T$ determines how much of $h_{T-1}$ is retained at time $T$, $r_T$ determines how $h_{T-1}$ is combined with $x_T$ when forming $\tilde{h}_T$, $\odot$ denotes element-wise multiplication, $C(\cdot,\cdot)$ denotes the concatenation operation, $\sigma(\cdot)$ denotes the Sigmoid function, and $\tanh(\cdot)$ denotes the hyperbolic tangent activation function.
Combining the above steps gives the final definition of each spatio-temporal graph convolution layer:
$$z_T = \sigma\big((I + N(R(UU^T)))\,C(h_{T-1}, x_T) \circledast QW_z\big),$$
$$r_T = \sigma\big((I + N(R(UU^T)))\,C(h_{T-1}, x_T) \circledast QW_r\big),$$
$$\tilde{h}_T = \tanh\big((I + N(R(UU^T)))\,C(r_T \odot h_{T-1}, x_T) \circledast QW_{\tilde{h}}\big),$$
$$h_T = z_T \odot h_{T-1} + (1 - z_T) \odot \tilde{h}_T,$$
where $x_T$ denotes the input at time $T$, $h_T$ denotes the final hidden state at time $T$ (the output at time $T$), and $W_z$, $W_r$ and $W_{\tilde{h}}$ denote the adaptively learned weight matrices for $z_T$, $r_T$ and $\tilde{h}_T$, respectively. The input of each layer is the matrix of spliced AU features of size $t \times m \times 12c$, and the output matrix is obtained by splicing the outputs $h_1, h_2, \ldots, h_{t'}$ of the graph convolution layer along the dimension $t$, where $t'$ is the total number of frames input to the spatio-temporal graph convolution layer and $t' = t = 48$.
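The following sketch combines the graph convolution with the GRU-style recurrence to form one spatio-temporal graph convolution layer consistent with the gating equations above; the batch dimension is omitted, and the normalization and initialization choices are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatioTemporalGraphConvLayer(nn.Module):
        """GRU-style recurrence whose gate transforms are the adaptive AU graph convolution."""
        def __init__(self, m, c, c_e, c_u=8):
            super().__init__()
            self.U = nn.Parameter(torch.randn(m, c_u) * 0.01)             # adaptive AU relation embedding
            self.Q = nn.Parameter(torch.randn(m, c_e) * 0.01)             # feature dissociation matrix
            # shared gate parameters; row Q_j dissociates a (2c x c) block for AU j from each of them
            self.Wz = nn.Parameter(torch.randn(c_e, 2 * c * c) * 0.01)
            self.Wr = nn.Parameter(torch.randn(c_e, 2 * c * c) * 0.01)
            self.Wh = nn.Parameter(torch.randn(c_e, 2 * c * c) * 0.01)
            self.c = c

        def graph_conv(self, x, W):                        # x: (m, 2c) concatenated state/input per AU
            A = F.relu(self.U @ self.U.T)
            A = A / A.sum(dim=1, keepdim=True).clamp_min(1e-8)            # assumed row normalisation
            x = (torch.eye(x.shape[0], device=x.device) + A) @ x          # mix over related AUs
            theta = (self.Q @ W).view(-1, 2 * self.c, self.c)             # per-AU dissociated parameters
            return torch.einsum('ak,akb->ab', x, theta)                   # (m, c)

        def forward(self, x_seq):                          # x_seq: (t, m, c) AU features per frame
            h = torch.zeros_like(x_seq[0])
            outputs = []
            for x in x_seq:                                # iterate over time steps
                cat = torch.cat([h, x], dim=1)             # C(h_{T-1}, x_T): (m, 2c)
                z = torch.sigmoid(self.graph_conv(cat, self.Wz))          # update gate
                r = torch.sigmoid(self.graph_conv(cat, self.Wr))          # reset gate
                h_tilde = torch.tanh(self.graph_conv(torch.cat([r * h, x], dim=1), self.Wh))
                h = z * h + (1 - z) * h_tilde              # z retains part of h_{T-1}, as in the text
                outputs.append(h)
            return torch.stack(outputs)                    # (t, m, c) spatio-temporal correlation features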
S06: the fully connected module IV is constructed using the spatio-temporal correlation features of all AUs extracted in step S05 to perform AU recognition.
The fully connected module IV consists of one-dimensional fully connected layers, each followed by a Sigmoid activation function. The overall feature of size t×m×12c obtained in step S05 is decomposed frame by frame and AU by AU to obtain the feature vector of dimension 12c corresponding to each AU on each frame; the feature vector of the $j$-th AU of the $i$-th frame is input into the $j$-th fully connected layer, which, followed by a Sigmoid activation function, predicts the occurrence probability of that AU, and the fully connected module IV uses the same fully connected layer for the same AU in different frames. Because the learned features carry the spatio-temporal correlation information of the AUs, final AU detection is facilitated, and the following loss function is adopted to guide the learning of the spatio-temporal graph convolution parameter matrices:
$$E_{final} = -\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_j\big[v_j\,p_{ij}\log \tilde{p}_{ij} + (1-p_{ij})\log(1-\tilde{p}_{ij})\big],$$
where $E_{final}$ denotes the loss function of AU recognition; $t$ denotes the length of the augmented image frame sequence; $p_{ij}$ denotes the true occurrence probability of the $j$-th AU of the $i$-th frame; $\tilde{p}_{ij}$ denotes the final predicted occurrence probability of the $j$-th AU of the $i$-th frame; $\omega_j$ denotes the weight of the $j$-th AU; and $v_j$ denotes the weight of the occurrence of the $j$-th AU (applied to the first, occurrence term of the cross entropy).
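A minimal sketch of the fully connected module IV, with one independent linear head per AU shared across all frames, could look as follows; the feature dimension is left as a parameter and the names are illustrative.

    import torch
    import torch.nn as nn

    class PerAUClassifier(nn.Module):
        """Module IV sketch: one independent 1-D fully connected layer per AU, shared across frames,
        each followed by a Sigmoid to give the final occurrence probability."""
        def __init__(self, m, feat_dim):
            super().__init__()
            self.heads = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(m)])

        def forward(self, st_feats):                       # st_feats: (t, m, feat_dim)
            probs = [torch.sigmoid(self.heads[j](st_feats[:, j, :])) for j in range(len(self.heads))]
            return torch.cat(probs, dim=1)                 # (t, m) final AU occurrence probabilities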
S07: the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatio-temporal graph convolutional neural network module III and the fully connected module IV is trained on the training data set, and the parameters of the overall AU recognition network model are updated with a gradient-based optimization method.
The whole convolutional neural network and graph convolutional neural network model (FIG. 5) is trained in an end-to-end manner: the convolutional neural network part is trained first to extract AU features that serve as the input of the graph convolutional neural network, and the graph convolutional neural network is then trained to learn the specific pattern and spatio-temporal correlations of the AUs, with the spatio-temporal correlations between AUs used to promote facial action unit recognition.
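The end-to-end, gradient-based parameter update of step S07 can be sketched as below; the toy stand-in model, the plain BCE criterion and all hyper-parameter values are assumptions used only to keep the example self-contained and runnable.

    import torch
    import torch.nn as nn

    t, m, l = 48, 12, 176                                   # assumed sequence length, AU count, image size
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * l * l, m), nn.Sigmoid())   # toy stand-in for modules I-IV
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)                    # gradient-based optimisation
    bce = nn.BCELoss()

    # dummy loader yielding (frames, au_labels) with shapes (t, 3, l, l) and (t, m)
    loader = [(torch.rand(t, 3, l, l), torch.randint(0, 2, (t, m)).float()) for _ in range(2)]

    for epoch in range(2):                                  # a real training schedule would be much longer
        for frames, labels in loader:
            probs = model(frames)                           # (t, m) predicted AU occurrence probabilities
            loss = bce(probs, labels)                       # stand-in for L_AA plus the module-IV loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()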
S08: a video sequence with any given number of frames is input into the trained overall AU recognition network model, and the occurrence probability of each AU is predicted.
At prediction time, the recognition results of the facial action units are output directly.
The method can be carried out entirely by a computer without manual assistance, which means that batch automatic processing can be realized, greatly improving processing efficiency and reducing labor cost.
A facial action unit recognition device based on adaptive attention and spatio-temporal correlation comprises an image frame sequence acquisition unit, a hierarchical multi-scale region learning unit, an adaptive attention regression and feature extraction unit, an adaptive spatio-temporal graph convolution learning unit, an AU recognition unit and a parameter optimization unit.
The image frame sequence acquisition unit is used for extracting the original continuous images required for training from any video data to form a training data set, and for preprocessing the original continuous image frames to obtain an augmented image frame sequence.
The hierarchical multi-scale region learning unit comprises the convolutional neural network module I, which uses hierarchical multi-scale region layers to learn the features of each local block of each input frame at different scales, filtering each local block independently.
The adaptive attention regression and feature extraction unit comprises the convolutional neural network module II, which generates the global attention map of each image and performs adaptive regression under the supervision of a predefined attention map and an AU detection loss while accurately extracting AU features.
The adaptive spatio-temporal graph convolution learning unit comprises the adaptive spatio-temporal graph convolutional neural network module III, which learns the specific pattern of each AU and the spatio-temporal correlations between different AUs and extracts the spatio-temporal correlation features of each AU.
The AU recognition unit comprises the fully connected module IV, which uses the spatio-temporal correlation features of each AU to perform AU recognition effectively.
The parameter optimization unit computes the parameters and loss function values of the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatio-temporal graph convolutional neural network module III and the fully connected module IV, and updates the parameters with a gradient-based optimization method.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be appreciated by persons skilled in the art that the above embodiments are not intended to limit the invention in any way, and that all technical solutions obtained by means of equivalent substitutions or equivalent transformations fall within the scope of the invention.

Claims (2)

1. A facial action unit identification method based on self-adaptive attention and space-time association is characterized in that: the method comprises the following steps:
s01: extracting original continuous image frames required by training from the video to form a training data set;
S02: preprocessing original continuous image frames to obtain an amplified image frame sequence;
S03: constructing a convolutional neural network module I to extract hierarchical multi-scale region features of each frame in the amplified image frame sequence;
the features of each local block at different scales are learned by the convolutional neural network module I, which comprises two hierarchical multi-scale region layers of identical structure: the input of the convolutional neural network module I serves as the input of the first hierarchical multi-scale region layer, the output of the first hierarchical multi-scale region layer, after a max-pooling operation, serves as the input of the second hierarchical multi-scale region layer, and the output of the second hierarchical multi-scale region layer, after a max-pooling operation, serves as the output of the convolutional neural network module I; the output of the convolutional neural network module I is the hierarchical multi-scale region features of the amplified image frame sequence;
each hierarchical multi-scale region layer comprises a convolution layer I-I, a convolution layer I-II-I, a convolution layer I-II-II and a convolution layer I-II-III; in the convolution layer I-I, the input is convolved once as a whole, and the convolution result serves as the output of the convolution layer I-I; the output of the convolution layer I-I serves as the input of the convolution layer I-II-I, in which the input is uniformly divided into local blocks at the 8×8 scale that are convolved separately, and all convolution results are spliced to form the output of the convolution layer I-II-I; the output of the convolution layer I-II-I serves as the input of the convolution layer I-II-II, in which the input is uniformly divided into local blocks at the 4×4 scale that are convolved separately, and all convolution results are spliced to form the output of the convolution layer I-II-II; the output of the convolution layer I-II-II serves as the input of the convolution layer I-II-III, in which the input is uniformly divided into local blocks at the 2×2 scale that are convolved separately, and all convolution results are spliced to form the output of the convolution layer I-II-III; the outputs of the convolution layers I-II-I, I-II-II and I-II-III are concatenated at the channel level and then added to the output of the convolution layer I-I, and the result serves as the output of the hierarchical multi-scale region layer;
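The block-wise filtering described above can be sketched as follows in PyTorch. It assumes that the "8×8", "4×4" and "2×2" scales correspond to uniform grids of local blocks, that the input spatial size is divisible by 8, that 3×3 kernels are used, and that each block level outputs one third of the channels of convolution layer I-I so that the channel-level concatenation can be added to it; these choices are assumptions, not values taken from the text.

```python
import torch
import torch.nn as nn

class BlockwiseConv(nn.Module):
    """Splits a feature map into a grid x grid set of local blocks and filters each block independently."""
    def __init__(self, in_ch, out_ch, grid):
        super().__init__()
        self.grid = grid
        self.convs = nn.ModuleList(nn.Conv2d(in_ch, out_ch, 3, padding=1) for _ in range(grid * grid))

    def forward(self, x):
        n, _, h, w = x.shape
        bh, bw = h // self.grid, w // self.grid
        rows, k = [], 0
        for i in range(self.grid):
            cols = []
            for j in range(self.grid):
                block = x[:, :, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
                cols.append(self.convs[k](block))        # independent filter for this block
                k += 1
            rows.append(torch.cat(cols, dim=3))
        return torch.cat(rows, dim=2)                    # splice the filtered blocks back together

class HierarchicalMultiScaleRegionLayer(nn.Module):
    def __init__(self, in_ch, out_ch=24):                # out_ch assumed divisible by 3
        super().__init__()
        self.conv_whole = nn.Conv2d(in_ch, out_ch, 3, padding=1)        # convolution layer I-I: whole-image convolution
        self.level1 = BlockwiseConv(out_ch, out_ch // 3, grid=8)        # convolution layer I-II-I
        self.level2 = BlockwiseConv(out_ch // 3, out_ch // 3, grid=4)   # convolution layer I-II-II
        self.level3 = BlockwiseConv(out_ch // 3, out_ch // 3, grid=2)   # convolution layer I-II-III

    def forward(self, x):
        whole = self.conv_whole(x)
        b1 = self.level1(whole)
        b2 = self.level2(b1)
        b3 = self.level3(b2)
        return whole + torch.cat([b1, b2, b3], dim=1)    # channel-level concatenation, then residual add

# Usage: out = HierarchicalMultiScaleRegionLayer(3)(torch.randn(1, 3, 176, 176))
```

Module I would then stack two such layers, each followed by max pooling, as described in the preceding paragraph.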
S04: constructing a convolutional neural network module II by utilizing the hierarchical multi-scale region features extracted in the step S03, regressing a global attention map for each AU to extract AU features, and supervising the convolutional neural network module II through the AU detection loss; AU denotes a facial action unit;
the convolutional neural network module II is adopted to predict the global attention map of each AU and the occurrence probability of each AU, to adaptively regress the global attention map of each AU toward a predefined attention map under the supervision of the AU detection loss, and to extract the AU features; the loss function used for the AU detection loss is:
$$L_a = \frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\sum_{a=1}^{l/4}\sum_{b=1}^{l/4}\left(M_{ijab}-\hat{M}_{ijab}\right)^{2}$$

$$L_{AA} = -\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_{j}\left[v_{j}\,p_{ij}\log\hat{p}_{ij}+\left(1-p_{ij}\right)\log\left(1-\hat{p}_{ij}\right)\right]+\lambda_{a}L_{a}$$
wherein: l (L) a A loss function representing global attention seeking to regress,
Figure QLYQS_4
weighted cross entropy loss function representing AU recognition, L AA Representing the total loss function of the convolution neural network module I and the convolution neural network module II; lambda (lambda) a Representing global attention stricken returnsWeight of loss; t represents the length of the amplified image frame sequence; m represents the number of AUs in each frame of image; l/4×l/4 represents the size of the global attention map; m is M ijab Representing a true attention weight of a jth AU of the ith frame image at the coordinate position (a, b); m is M ijab A predicted attention weight representing a jth AU of the ith frame image at the coordinate position (a, b); p is p ij Representing the true probability of occurrence of the jth AU of the ith frame image; />
Figure QLYQS_5
Representing a prediction probability of occurrence of a j-th AU of the i-th frame image; omega j A weight indicating the jth AU; v j Represents the weight at which the jth AU appears;
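A sketch of how such a combined loss could be computed is given below. The mean-squared-error form of the attention regression term, the normalization constants, and the way v_j enters the cross-entropy are reconstructions from the variable definitions above rather than verbatim formulas.

```python
import torch

def attention_and_detection_loss(att_pred, att_true, p_pred, p_true, omega, v, lambda_a=1.0, eps=1e-8):
    """Reconstructed L_AA: weighted cross-entropy AU detection loss plus lambda_a times the attention regression loss.

    att_pred, att_true: (t, m, l/4, l/4) predicted / predefined attention maps
    p_pred, p_true:     (t, m) predicted / ground-truth AU occurrence probabilities
    omega, v:           (m,) per-AU weights omega_j and v_j
    """
    l_a = ((att_pred - att_true) ** 2).mean()                  # global attention map regression loss
    ce = -(omega * (v * p_true * torch.log(p_pred + eps)
                    + (1.0 - p_true) * torch.log(1.0 - p_pred + eps))).mean()
    return ce + lambda_a * l_a

# Usage with random tensors (t=4 frames, m=12 AUs, 44x44 attention maps):
# loss = attention_and_detection_loss(torch.rand(4, 12, 44, 44), torch.rand(4, 12, 44, 44),
#                                     torch.rand(4, 12), torch.randint(0, 2, (4, 12)).float(),
#                                     torch.ones(12), torch.ones(12))
```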
S05: constructing a self-adaptive space-time graph convolutional neural network module III by utilizing the AU features extracted in the step S04, and reasoning about the specific pattern of each AU and the space-time correlations between different AUs so as to learn the space-time correlation feature of each AU;
the self-adaptive space-time graph convolutional neural network module III is adopted to extract the specific pattern of each AU and the space-time correlations between different AUs, so as to learn the space-time correlation feature of each AU; the input of the self-adaptive space-time graph convolutional neural network module III is all the AU features of the t frame images extracted in the step S04, where each frame image comprises m AU features, giving t × m AU features in total, and the dimension of each AU feature is c', with c' = 12c and c a set parameter of the overall AU recognition network model; the self-adaptive space-time graph convolutional neural network module III comprises two space-time graph convolutional layers of identical structure whose parameters are learned independently, and each space-time graph convolutional layer combines a spatial graph convolution unit with a gated recurrent unit; the space-time graph convolutional layer is defined as follows:
$$z_T = \sigma\!\left(\left(N\!\left(R\!\left(UU^{\mathsf{T}}\right)\right)+E\right)\,C\!\left(\tilde{X}_T,\,h_{T-1}\right)\circledast W_z\right)$$

$$r_T = \sigma\!\left(\left(N\!\left(R\!\left(UU^{\mathsf{T}}\right)\right)+E\right)\,C\!\left(\tilde{X}_T,\,h_{T-1}\right)\circledast W_r\right)$$

$$\hat{h}_T = \tanh\!\left(\left(N\!\left(R\!\left(UU^{\mathsf{T}}\right)\right)+E\right)\,C\!\left(\tilde{X}_T,\,r_T\odot h_{T-1}\right)\circledast W_{\hat{h}}\right)$$

$$h_T = z_T\odot h_{T-1}+\left(1-z_T\right)\odot\hat{h}_T$$
wherein: $\tilde{X}_T$ denotes the input at time $T$; $h_T$ denotes the final hidden state at time $T$; $\hat{h}_T$ denotes the initial (candidate) hidden state at time $T$; $z_T$ is used to determine how much of $h_{T-1}$ is retained at time $T$; $r_T$ is used to determine how $\tilde{X}_T$ and $h_{T-1}$ are combined at time $T$; $E \in \mathbb{R}^{m \times m}$ denotes the identity matrix, where $m$ is the number of AUs in each frame image; $U$ denotes the adaptively learned matrix of the AU relationship graph, and $U^{\mathsf{T}}$ denotes the transpose of $U$; $Q$ denotes the adaptively learned dissociation matrix, and $c_e$ is the number of columns set for $Q$; $W_z$, $W_r$ and $W_{\hat{h}}$ denote the adaptively learned weights for $z_T$, $r_T$ and $\hat{h}_T$, respectively; $c'$ denotes the dimension of each AU feature, $c' = 12c$, where $c$ is a set parameter of the overall AU recognition network model; $R(X)$ denotes rectification of the element at each index position of the two-dimensional matrix $X$, after which the element $X_{ab}$ at the index position $(a, b)$ is updated to $X_{ab} = \max(0, X_{ab})$; $N(X)$ denotes normalization of the element at each index position of the two-dimensional matrix $X$, after which the element $X_{ab}$ at the index position $(a, b)$ is updated to $X_{ab} = X_{ab} / \sum_{k} X_{ak}$; $Z = X \circledast Y$ denotes the operation between a two-dimensional matrix $X$ and a three-dimensional matrix $Y$ whose result $Z$ is a two-dimensional matrix with the element $Z_{ab} = \sum_{k} X_{ak} Y_{akb}$ at the index position $(a, b)$, where $X_{ak}$ denotes the element at the index position $(a, k)$ of $X$ and $Y_{akb}$ denotes the element at the index position $(a, k, b)$ of $Y$; $\odot$ denotes element-level multiplication, $C(\cdot)$ denotes the concatenation operation, $\sigma(\cdot)$ denotes the Sigmoid function, and $\tanh(\cdot)$ denotes the hyperbolic tangent activation function;
S06: constructing a fully connected module IV to perform AU identification by utilizing the space-time correlation features of all AUs extracted in the step S05;
the fully connected module IV is adopted to identify the AUs of each frame image: the space-time correlation features of all AUs contained in the t frame images output in the step S05, including the space-time correlation feature of the j-th AU of the i-th frame image, are input to the fully connected module IV, and for each such feature the fully connected module IV predicts the final occurrence probability of the j-th AU of the i-th frame image by applying the j-th one-dimensional fully connected layer followed by a Sigmoid activation function, wherein the fully connected module IV adopts the same fully connected layer for the same AU of different frame images; the loss function adopted for AU identification is:
$$-\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_{j}\left[v_{j}\,p_{ij}\log\hat{p}_{ij}+\left(1-p_{ij}\right)\log\left(1-\hat{p}_{ij}\right)\right]$$
wherein: $t$ denotes the length of the amplified image frame sequence; $m$ denotes the number of AUs in each frame image; $p_{ij}$ denotes the ground-truth probability of occurrence of the $j$-th AU of the $i$-th frame image; $\hat{p}_{ij}$ denotes the final predicted probability of occurrence of the $j$-th AU of the $i$-th frame image; $\omega_j$ denotes the weight of the $j$-th AU; $v_j$ denotes the weight applied when the $j$-th AU occurs;
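A minimal sketch of module IV and the AU identification loss follows. It assumes the space-time correlation features arrive as a (t, m, c') tensor and that v_j weights the positive term of the cross-entropy; the loss form is reconstructed from the variable definitions, not copied from the source.

```python
import torch
import torch.nn as nn

class PerAUClassifier(nn.Module):
    """Module IV sketch: the j-th one-dimensional FC layer plus Sigmoid predicts the j-th AU, shared across frames."""
    def __init__(self, num_aus, feat_dim):
        super().__init__()
        self.fcs = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(num_aus))

    def forward(self, feats):                                  # feats: (t, m, c') space-time correlation features
        probs = [torch.sigmoid(fc(feats[:, j])) for j, fc in enumerate(self.fcs)]
        return torch.cat(probs, dim=-1)                        # (t, m) final AU occurrence probabilities

def au_identification_loss(p_pred, p_true, omega, v, eps=1e-8):
    """Weighted cross-entropy over all frames and AUs (reconstructed form)."""
    return -(omega * (v * p_true * torch.log(p_pred + eps)
                      + (1.0 - p_true) * torch.log(1.0 - p_pred + eps))).mean()

# Usage: probs = PerAUClassifier(12, 384)(torch.randn(5, 12, 384))
```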
S07: training the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the self-adaptive space-time graph convolutional neural network module III and the fully connected module IV on the training data set, so as to update the parameters of the overall AU recognition network model by a gradient-based optimization method;
S08: inputting a video sequence with any given number of frames into the trained overall AU recognition network model to predict the occurrence probability of each AU.
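Steps S07 and S08 amount to standard end-to-end training followed by inference. The sketch below uses a toy stand-in for the overall AU recognition network and a plain BCE loss in place of the combined loss defined above, purely to illustrate the gradient-based parameter update and the arbitrary-length inference.

```python
import torch
import torch.nn as nn

# Toy stand-in for the overall AU recognition network (modules I-IV); shapes are assumptions.
model = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 64 * 64, 12), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(frames, au_labels):
    # frames: (t, 3, 64, 64) amplified frame sequence, au_labels: (t, 12) ground-truth AU occurrences
    optimizer.zero_grad()
    probs = model(frames)
    loss = bce(probs, au_labels)      # stand-in for the total training loss
    loss.backward()                   # back-propagate through the whole model
    optimizer.step()                  # gradient-based parameter update (step S07)
    return loss.item()

# Step S08: after training, a frame sequence of any length can be fed straight through.
with torch.no_grad():
    predictions = model(torch.randn(7, 3, 64, 64))   # (7, 12) AU occurrence probabilities
```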
2. A facial action unit recognition device based on self-adaptive attention and space-time correlation, for implementing the method as claimed in claim 1, characterized in that: the device comprises an image frame sequence acquisition unit, a hierarchical multi-scale region learning unit, an adaptive attention regression and feature extraction unit, a self-adaptive space-time graph convolution learning unit, an AU identification unit and a parameter optimization unit;
the image frame sequence acquisition unit is used for extracting the original continuous image frames required for training from video data to form a training data set, and for preprocessing the original continuous image frames to obtain an amplified image frame sequence;
the hierarchical multi-scale region learning unit comprises the convolutional neural network module I, in which a hierarchical multi-scale region layer is adopted to learn the features of each local block of each input image frame at different scales, with each local block filtered independently;
the adaptive attention regression and feature extraction unit comprises the convolutional neural network module II, a self-adaptive regression module and an AU feature extraction module, wherein the convolutional neural network module II is used for generating a global attention map of the image, performing self-adaptive regression under the supervision of a predefined attention map and the AU detection loss, and accurately extracting the AU features;
the self-adaptive space-time graph convolution learning unit comprises the self-adaptive space-time graph convolutional neural network module III, which learns the specific pattern of each AU and the space-time correlations among different AUs, and extracts the space-time correlation feature of each AU;
the AU identification unit comprises the fully connected module IV, which performs AU identification effectively by utilizing the space-time correlation feature of each AU;
the parameter optimization unit calculates the parameters and the loss function value of the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the self-adaptive space-time graph convolutional neural network module III and the fully connected module IV, and updates the parameters by a gradient-based optimization method.
CN202210606040.5A 2022-05-31 2022-05-31 Facial action unit identification method and device based on self-adaptive attention and space-time correlation Active CN114842542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210606040.5A CN114842542B (en) 2022-05-31 2022-05-31 Facial action unit identification method and device based on self-adaptive attention and space-time correlation

Publications (2)

Publication Number Publication Date
CN114842542A CN114842542A (en) 2022-08-02
CN114842542B true CN114842542B (en) 2023-06-13

Family

ID=82572471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210606040.5A Active CN114842542B (en) 2022-05-31 2022-05-31 Facial action unit identification method and device based on self-adaptive attention and space-time correlation

Country Status (1)

Country Link
CN (1) CN114842542B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071809B (en) * 2023-03-22 2023-07-14 鹏城实验室 Face space-time representation generation method based on multi-class representation space-time interaction
CN116416667B (en) * 2023-04-25 2023-10-24 天津大学 Facial action unit detection method based on dynamic association information embedding
CN118277607A (en) * 2024-04-12 2024-07-02 山东万高电子科技有限公司 Video monitoring data storage device and method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633207B (en) * 2017-08-17 2018-10-12 平安科技(深圳)有限公司 AU characteristic recognition methods, device and storage medium
CN110363156A (en) * 2019-07-17 2019-10-22 北京师范大学 A kind of Facial action unit recognition methods that posture is unrelated
CN111597884A (en) * 2020-04-03 2020-08-28 平安科技(深圳)有限公司 Facial action unit identification method and device, electronic equipment and storage medium
CN112633153A (en) * 2020-12-22 2021-04-09 天津大学 Facial expression motion unit identification method based on space-time graph convolutional network
CN112990077B (en) * 2021-04-02 2021-10-01 中国矿业大学 Face action unit identification method and device based on joint learning and optical flow estimation
CN113496217B (en) * 2021-07-08 2022-06-21 河北工业大学 Method for identifying human face micro expression in video image sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant