CN114842542A - Facial action unit identification method and device based on self-adaptive attention and space-time correlation - Google Patents


Info

Publication number
CN114842542A
Authority
CN
China
Prior art keywords
neural network
space
layer
network module
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210606040.5A
Other languages
Chinese (zh)
Other versions
CN114842542B (en)
Inventor
邵志文
周勇
陈浩
于清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202210606040.5A priority Critical patent/CN114842542B/en
Publication of CN114842542A publication Critical patent/CN114842542A/en
Application granted granted Critical
Publication of CN114842542B publication Critical patent/CN114842542B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 40/171: Local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
    • Y02T 10/40: Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial action unit identification method and device based on adaptive attention and spatio-temporal correlation. It adopts an end-to-end deep learning framework for action unit identification and can effectively identify the motion of facial muscles in two-dimensional images by exploiting the interdependencies and spatio-temporal correlations among facial action units, thereby enabling the construction of a facial action unit identification system.

Description

Facial action unit identification method and device based on self-adaptive attention and space-time correlation
Technical Field
The invention relates to a facial action unit identification method and device based on adaptive attention and spatio-temporal correlation, and belongs to the field of computer vision.
Background
To study human facial expressions at a finer granularity, the Facial Action Coding System (FACS) was first proposed in 1978 by the American emotion psychologist Ekman and was substantially revised in 2002. According to the anatomical characteristics of the human face, the facial action coding system divides the face into a number of facial action units that are independent of yet related to one another; facial expressions can be characterized by the motion of these facial action units and the main facial regions they control.
With the development of computer and information technology, deep learning has been widely applied. In the field of facial action unit (AU) recognition, approaches based on deep learning models have become mainstream. Current AU recognition research mainly follows two routes: region learning and AU relation learning. Without considering the associations between AUs, generally only a few sparse regions, where the corresponding facial muscles are located, contribute to the recognition of a given AU, while other regions require little attention; finding the regions that need attention and learning them intensively therefore improves AU recognition, and solutions focusing on this problem are generally called Region Learning (RL). In addition, AUs are defined on the basis of facial muscle anatomy and describe the movement of one or several muscles; some muscles drive several AUs to appear simultaneously during motion, so a certain degree of correlation exists between AUs. Such correlation information can clearly help improve recognition performance, and solutions that mine the correlations between AUs and use them to improve AU recognition are generally called AU relation learning.
Although automatic recognition of facial action units has made impressive progress, current region-learning-based AU detection methods, which use AU labels alone to supervise a neural network to adaptively learn implicit attention, often capture irrelevant regions, since AUs have no obvious contours or textures and vary across persons and expressions. AU detection methods based on relational inference share parameters across all AUs during inference and therefore ignore the specificity and dynamics of each AU, so recognition accuracy remains limited and there is room for further improvement.
Disclosure of Invention
The purpose of the invention is as follows: to overcome the defects of the prior art, the invention provides a facial action unit identification method and device based on adaptive attention and spatio-temporal correlation, which can adapt to samples exhibiting random and diverse variations in illumination, occlusion, pose and other factors in uncontrolled scenes, and is expected to offer stronger robustness while maintaining high identification accuracy.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme:
a facial action unit identification method based on adaptive attention and spatiotemporal correlation comprises the following steps:
s01: extracting original continuous image frames required by training from any video to form a training data set; for a video sequence, the number of original consecutive image frames may be 48 frames;
s02: preprocessing the original continuous image frames to obtain an amplified image frame sequence; the preprocessing of the original continuous image frames includes random translation, random rotation, random scaling, random horizontal flipping, random cropping, and the like, and preprocessing the images improves the generalization ability of the model to a certain extent;
s03: constructing a convolutional neural network module I to extract the hierarchical multi-scale regional characteristics of each frame in the amplified image frame sequence;
s04: constructing a convolutional neural network module II to carry out global attention map regression of AU and extract AU characteristics by using the hierarchical multi-scale region characteristics extracted in the step S03, and supervising the convolutional neural network module II through AU detection loss; AU represents a face action unit;
s05: constructing a self-adaptive space-time graph convolutional neural network module III by utilizing the AU characteristics extracted in the step S04, and reasoning the specific mode of each AU and the space-time relevance (such as co-occurrence and mutual exclusion) among different AUs so as to learn the space-time relevance characteristics of each AU;
s06: constructing a full-connection module IV to realize AU identification by utilizing the space-time correlation characteristics of the AUs extracted in the step S05;
s07: training an integral AU recognition network model formed by a convolutional neural network module I, a convolutional neural network module II, a self-adaptive space-time graph convolutional neural network module III and a full-connection module IV by using a training data set, and updating parameters of the integral AU recognition network model by using a gradient-based optimization method;
s08: and inputting the video sequence with any given frame number into the trained integral AU identification network model to predict the occurrence probability of AUs.
Specifically, in step S03, since the AUs of different local blocks involve different facial structures and texture information, each local block needs to be filtered independently, with different local blocks using different filtering weights. To obtain multi-scale regional features, convolutional neural network module I is adopted to learn the features of each local block at different scales. Convolutional neural network module I comprises two hierarchical multi-scale region layers with the same structure: the input of convolutional neural network module I serves as the input of the first hierarchical multi-scale region layer; the output of the first hierarchical multi-scale region layer, after a maximum pooling operation, serves as the input of the second hierarchical multi-scale region layer; and the output of the second hierarchical multi-scale region layer, after a maximum pooling operation, serves as the output of convolutional neural network module I. Each frame image of the amplified image frame sequence is input to convolutional neural network module I separately, and the output is the hierarchical multi-scale regional feature of that frame image.
Each hierarchical multi-scale region layer comprises a convolutional layer I-I, a convolutional layer I-II-I, a convolutional layer I-II-II and a convolutional layer I-II-III. In convolutional layer I-I, the whole input is convolved once, and the convolution result is used as the output of convolutional layer I-I. The output of convolutional layer I-I serves as the input of convolutional layer I-II-I: in convolutional layer I-II-I, the input is first uniformly divided into 8 × 8 local blocks that are convolved separately, and all convolution results are then spliced to form the output of convolutional layer I-II-I. The output of convolutional layer I-II-I serves as the input of convolutional layer I-II-II: in convolutional layer I-II-II, the input is first uniformly divided into 4 × 4 local blocks that are convolved separately, and all convolution results are then spliced to form the output of convolutional layer I-II-II. The output of convolutional layer I-II-II serves as the input of convolutional layer I-II-III: in convolutional layer I-II-III, the input is first uniformly divided into 2 × 2 local blocks that are convolved separately, and all convolution results are then spliced to form the output of convolutional layer I-II-III. The outputs of convolutional layers I-II-I, I-II-II and I-II-III are concatenated channel-wise (the number of channels after concatenation equals the number of output channels of convolutional layer I-I) and summed with the output of convolutional layer I-I, and the result is used as the output of the hierarchical multi-scale region layer.
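For illustration, a minimal PyTorch sketch of one hierarchical multi-scale region layer is given below. It assumes the 8 × 8, 4 × 4 and 2 × 2 partitions are regular grids of equal-sized blocks and that the input spatial size is divisible by 8; the channel widths and class names are illustrative choices, not the reference implementation of the invention.

```python
# A minimal sketch of one hierarchical multi-scale region layer: a whole-image convolution
# (layer I-I) followed by block-wise convolutions on 8x8, 4x4 and 2x2 grids (layers I-II-I,
# I-II-II, I-II-III), whose channel-wise concatenation is summed with the output of I-I.
import torch
import torch.nn as nn

class BlockwiseConv(nn.Module):
    """Splits the input into a grid x grid arrangement of local blocks, each with its own 3x3 conv."""
    def __init__(self, in_ch: int, out_ch: int, grid: int):
        super().__init__()
        self.grid = grid
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=1) for _ in range(grid * grid)]
        )

    def forward(self, x):
        n, _, h, w = x.shape                      # assumes h and w divisible by the grid size
        bh, bw = h // self.grid, w // self.grid
        rows = []
        for i in range(self.grid):
            cols = []
            for j in range(self.grid):
                block = x[:, :, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
                cols.append(self.convs[i * self.grid + j](block))
            rows.append(torch.cat(cols, dim=3))
        return torch.cat(rows, dim=2)             # splice the convolved blocks back together

class HierarchicalMultiScaleRegionLayer(nn.Module):
    def __init__(self, in_ch: int = 3, out_ch: int = 32):
        super().__init__()
        self.conv_whole = nn.Conv2d(in_ch, out_ch, 3, padding=1)         # layer I-I
        self.conv_8x8 = BlockwiseConv(out_ch, out_ch // 2, grid=8)        # layer I-II-I
        self.conv_4x4 = BlockwiseConv(out_ch // 2, out_ch // 4, grid=4)   # layer I-II-II
        self.conv_2x2 = BlockwiseConv(out_ch // 4, out_ch // 4, grid=2)   # layer I-II-III

    def forward(self, x):
        y0 = self.conv_whole(x)
        y1 = self.conv_8x8(y0)
        y2 = self.conv_4x4(y1)
        y3 = self.conv_2x2(y2)
        multi_scale = torch.cat([y1, y2, y3], dim=1)   # channel-wise concatenation (= out_ch)
        return y0 + multi_scale                        # summed with the output of layer I-I
```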
Specifically, in step S04, convolutional neural network module II serves as an adaptive attention learning module whose input is the hierarchical multi-scale region features of the image. It predicts a global attention map for each AU of each image and performs AU feature extraction and AU prediction, the whole process being supervised by the predefined attention maps and the AU detection loss. The method comprises the following steps:
(41) Generate a predicted attention map for each AU: the hierarchical multi-scale region features are input into the adaptive attention learning module; the number of AUs in each frame image is m, and each AU corresponds to an independent branch; four convolutional layers are used to learn the global attention map M̂_ij of each AU and to extract AU features.
(42) Generate a true attention map for each AU: each AU has two centers, specified by two related facial feature points. The true attention map is generated from a Gaussian distribution around each center point; if the coordinates of an AU center are (a_c, b_c), then the true attention weight at location (a, b) on the attention map is

M_ijab = exp(-((a - a_c)² + (b - b_c)²) / (2σ²))

where σ controls the spread of the Gaussian. The larger of the two attention weights is then selected at each location to merge the predefined attention maps of the two AU centers, i.e.

M_ijab = max(M_ijab^(1), M_ijab^(2))

and an attention regression loss is adopted to encourage the predicted map M̂_ij to approach M_ij:

L_a = (1 / (t·m)) Σ_{i=1}^{t} Σ_{j=1}^{m} (1 / (l/4)²) Σ_{a=1}^{l/4} Σ_{b=1}^{l/4} (M_ijab - M̂_ijab)²

wherein: L_a is the loss function of the global attention map regression; t is the length of the amplified image frame sequence; m is the number of AUs in each frame image; l/4 × l/4 is the size of the global attention map; M_ijab is the true attention weight of the jth AU of the ith frame image at coordinate position (a, b); and M̂_ijab is the predicted attention weight of the jth AU of the ith frame image at coordinate position (a, b).
(43) Extract AU features and perform AU detection: the predicted global attention map M̂_ij is multiplied element-wise with the facial feature map obtained by the fourth convolutional layer II-II, so as to strengthen the features of the regions with larger attention weights; the resulting output features are input to convolutional layers II-III, and AU features are then extracted through a global average pooling layer. A cross-entropy detection loss is adopted to promote adaptive training of the attention maps: the learned AU features are input to a one-dimensional fully-connected layer, and a Sigmoid function δ(x) = 1/(1 + e^{-x}) is then used to predict the occurrence probability of each AU.
The weighted cross-entropy loss function adopted for AU recognition is:

L_au = -(1/(t·m)) Σ_{i=1}^{t} Σ_{j=1}^{m} ω_j [ v_j p_ij log p̂_ij + (1 - p_ij) log(1 - p̂_ij) ]

wherein: L_au is the weighted cross-entropy loss of AU recognition; p_ij is the true probability that the jth AU of the ith frame image occurs; p̂_ij is the predicted probability that the jth AU of the ith frame image occurs; ω_j is the weight of the jth AU; and v_j is the weight on the occurrence term (the first term of the cross entropy) for the jth AU.
(44) Overall loss function of convolutional neural network module I and convolutional neural network module II: combining the attention map regression loss and the AU detection loss gives the overall loss function

L_AA = L_au + λ_a · L_a

wherein: L_AA is the overall loss function of convolutional neural network module I and convolutional neural network module II, and λ_a balances the attention regression loss against the AU detection loss.
Specifically, in step S05, an adaptive space-time graph convolutional neural network module III is used to infer the specific pattern of each AU and the spatio-temporal correlations between different AUs. The adaptive space-time graph convolutional neural network module III comprises two space-time graph convolutional layers with the same structure. The m AU features of dimension 12c from each of the t frame images are spliced into an overall feature of size t × m × 12c, which serves as the input of space-time graph convolutional layer III-I; the output of space-time graph convolutional layer III-I serves as the input of space-time graph convolutional layer III-II, and the output feature obtained from space-time graph convolutional layer III-II contains the specific pattern of each AU and the spatio-temporal correlation information between different AUs.
The parameters of the two space-time graph convolutional layers are learned independently. Each space-time graph convolutional layer is formed by combining a spatial graph convolution with a gated recurrent unit, and is defined as follows:
z_T = σ((I + N(R(U U^T))) C(h_{T-1}, x_T) ⊙ (Q W_z))
r_T = σ((I + N(R(U U^T))) C(h_{T-1}, x_T) ⊙ (Q W_r))
h̃_T = tanh((I + N(R(U U^T))) C(r_T ∘ h_{T-1}, x_T) ⊙ (Q W_h̃))
h_T = z_T ∘ h_{T-1} + (1 - z_T) ∘ h̃_T

wherein: x_T is the input at time T; h_T is the final hidden state at time T (the output at time T); h̃_T is the initial (candidate) hidden state at time T; z_T decides how much of h_{T-1} is retained at time T, and r_T determines how x_T and h_{T-1} are combined at time T. I ∈ R^{m×m} is the identity matrix, where m is the number of AUs in each frame image. U ∈ R^{m×c_e} is the adaptively learned matrix of the AU relation graph, and U^T is its transpose. Q ∈ R^{m×c_e} is the adaptively learned dissociation matrix, and c_e is the number of columns set for Q. W_z, W_r and W_h̃ ∈ R^{c_e×2c'×c'} are the adaptively learned weight matrices for z_T, r_T and h̃_T, respectively; c' is the dimension of the AU feature, c' = 12c, where c is a configuration parameter of the whole AU recognition network model. For the jth AU of the ith frame image, the jth row component Q_j of Q can separate from each of W_z, W_r and W_h̃ a parameter of size 2c' × c' specific to that AU.

R(X) denotes rectification of the two-dimensional matrix X: after rectification, the element X_ab at index position (a, b) of X is updated to X_ab = max(0, X_ab).

N(X) denotes normalization of the two-dimensional matrix X: after normalization, the element X_ab at index position (a, b) of X is updated to X_ab = X_ab / Σ_k X_ak.

Z = X ⊙ Y denotes the operation between a two-dimensional matrix X and a three-dimensional matrix Y that yields a two-dimensional matrix Z, in which the element at index position (a, b) of Z is Z_ab = Σ_k X_ak Y_akb; X_ak is the element at index position (a, k) of X, and Y_akb is the element at index position (a, k, b) of Y.

∘ denotes element-wise multiplication, C(·) denotes the splicing (concatenation) operation, σ(·) denotes the Sigmoid function, and tanh(·) denotes the hyperbolic tangent activation function.
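A minimal PyTorch sketch of one such space-time graph convolutional layer is given below, following the equations as reconstructed above. The zero initial hidden state, the parameter initialization and the class and argument names are assumptions for illustration, not the reference implementation.

```python
# Sketch of one space-time graph convolutional layer: an adaptive graph convolution
# (I + N(R(U U^T))) ... ⊙ (Q W_*) inside GRU-style gates, applied over t frames of m AU features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceTimeGraphConvLayer(nn.Module):
    def __init__(self, m: int, c_prime: int, c_e: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(m, c_e) * 0.01)   # adaptive AU relation matrix
        self.Q = nn.Parameter(torch.randn(m, c_e) * 0.01)   # dissociation (per-AU) matrix
        self.W_z = nn.Parameter(torch.randn(c_e, 2 * c_prime, c_prime) * 0.01)
        self.W_r = nn.Parameter(torch.randn(c_e, 2 * c_prime, c_prime) * 0.01)
        self.W_h = nn.Parameter(torch.randn(c_e, 2 * c_prime, c_prime) * 0.01)

    def adjacency(self):
        # I + N(R(U U^T)): rectified, row-normalized adaptive adjacency plus self-connections
        A = F.relu(self.U @ self.U.t())
        A = A / A.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.eye(A.size(0), device=A.device) + A

    def gate(self, A, x_cat, W):
        # ((A x_cat) ⊙ (Q W)): per-AU parameters separated from the shared tensor W by Q
        per_au_W = torch.einsum('me,eio->mio', self.Q, W)           # (m, 2c', c')
        return torch.einsum('mn,bni,mio->bmo', A, x_cat, per_au_W)

    def forward(self, x_seq):
        # x_seq: (batch, t, m, c') sequence of per-frame AU features
        b, t, m, c = x_seq.shape
        h = x_seq.new_zeros(b, m, c)
        A = self.adjacency()
        outputs = []
        for step in range(t):
            x_t = x_seq[:, step]
            cat = torch.cat([h, x_t], dim=-1)                       # C(h_{T-1}, x_T)
            z = torch.sigmoid(self.gate(A, cat, self.W_z))
            r = torch.sigmoid(self.gate(A, cat, self.W_r))
            cat_r = torch.cat([r * h, x_t], dim=-1)                 # C(r_T ∘ h_{T-1}, x_T)
            h_tilde = torch.tanh(self.gate(A, cat_r, self.W_h))
            h = z * h + (1 - z) * h_tilde                           # h_T
            outputs.append(h)
        return torch.stack(outputs, dim=1)                          # (batch, t, m, c')
```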
Specifically, in step S06, full-connection module IV identifies the AUs of each frame image using one-dimensional fully-connected layers. The spatio-temporal correlation features of all AUs in the t frame images output by the space-time graph convolutional layers are decomposed frame by frame and AU by AU into AU feature vectors of dimension 12c. The spatio-temporal correlation feature of the jth AU of the ith frame image is input to full-connection module IV, which predicts the final occurrence probability of the jth AU of the ith frame image by passing the feature through the jth one-dimensional fully-connected layer followed by a Sigmoid activation function; full-connection module IV uses the same fully-connected layer for the same AU across different frame images. Since the obtained features carry spatio-temporal correlation information among AUs, they benefit the final AU recognition. The loss function adopted for AU recognition is:
L_final = -(1/(t·m)) Σ_{i=1}^{t} Σ_{j=1}^{m} ω_j [ v_j p_ij log p̃_ij + (1 - p_ij) log(1 - p̃_ij) ]

wherein: L_final is the loss function of the final AU recognition; t is the length of the amplified image frame sequence; p_ij is the true probability that the jth AU of the ith frame image occurs; p̃_ij is the final predicted probability that the jth AU of the ith frame image occurs; ω_j is the weight of the jth AU; and v_j is the weight on the occurrence term (the first term of the cross entropy) for the jth AU.
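A minimal PyTorch sketch of full-connection module IV is given below: one one-dimensional fully-connected layer per AU, reused across frames, followed by a Sigmoid. The dimensions and the class name are illustrative assumptions.

```python
# Sketch of full-connection module IV: per-AU linear layers shared across frames.
import torch
import torch.nn as nn

class AUClassifierHead(nn.Module):
    def __init__(self, num_aus: int, feat_dim: int):
        super().__init__()
        # the j-th linear layer is reused for the j-th AU of every frame image
        self.fcs = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(num_aus)])

    def forward(self, features):
        # features: (batch, t, m, feat_dim) spatio-temporal correlation features of the AUs
        probs = [torch.sigmoid(fc(features[:, :, j, :])) for j, fc in enumerate(self.fcs)]
        return torch.cat(probs, dim=-1)          # (batch, t, m) predicted occurrence probabilities

head = AUClassifierHead(num_aus=12, feat_dim=96)   # e.g. m = 12 AUs, 12c = 96 (assumed)
p_final = head(torch.randn(2, 48, 12, 96))         # t = 48 frames
```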
Specifically, in step S07, the entire AU recognition network model composed of convolutional neural network module I, convolutional neural network module II, adaptive space-time graph convolutional neural network module III and full-connection module IV is trained in an end-to-end manner: the convolutional neural network modules are first trained to extract accurate AU features, which serve as the input of the graph convolutional neural network; the graph convolutional neural network is then trained to learn the specific patterns and spatio-temporal correlation features of the AUs, and the spatio-temporal correlations among AUs are exploited to promote the recognition of facial action units.
A facial action unit recognition device implementing the above adaptive attention and spatio-temporal correlation method comprises an image frame sequence acquisition unit, a hierarchical multi-scale region learning unit, an adaptive attention regression and feature extraction unit, an adaptive space-time graph convolution learning unit, an AU recognition unit and a parameter optimization unit;
the image frame sequence acquisition unit is used for extracting a large number of original continuous images required by training from any video data to form a training data set, and preprocessing the original continuous image frames to obtain an amplified image frame sequence;
the hierarchical multi-scale area learning unit comprises a convolutional neural network module I, learns the characteristics of each local block under different scales of each frame of input image by adopting a hierarchical multi-scale area layer, and independently filters each local block;
the self-adaptive attention regression and feature extraction unit comprises a convolution neural network module II, and is used for generating a global attention diagram of an image and performing self-adaptive regression under the supervision of a predefined attention diagram and AU detection loss, and simultaneously extracting AU features accurately.
The self-adaptive space-time graph convolution learning unit comprises a self-adaptive space-time graph convolution neural network module III, and is used for learning a specific mode of each AU and learning the space-time correlation among different AUs and extracting the space-time correlation characteristics of each AU;
the AU identification unit comprises a full connection module IV, and can effectively identify AUs by utilizing the space-time association characteristics of each AU;
the parameter optimization unit calculates parameters and loss function values of an overall AU identification network model formed by the convolutional neural network module I, the convolutional neural network module II, the self-adaptive space-time graph convolutional neural network module III and the full-connection module IV, and updates the parameters based on a gradient optimization method.
Each frame image in the input image frame sequence is input to convolutional neural network module I and convolutional neural network module II respectively to obtain the m AU features of that frame; only at the adaptive space-time graph convolutional neural network module III are the features of all frames spliced together as input. Convolutional neural network module I and convolutional neural network module II process single images and do not involve time, so the t frames can be regarded as processed separately, whereas the adaptive space-time graph convolutional neural network module III processes the t frames simultaneously; full-connection module IV processes the spatio-temporal correlation features of the m AUs of a single image and likewise does not involve time.
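For clarity, the following comment sketch traces the tensor shapes through the modules under assumed dimensions (t = 48 frames, m = 12 AUs, 12c = 96, input images of size l × l); the concrete numbers are illustrative only.

```python
# Illustrative shape trace of the pipeline described above (all sizes are assumptions):
#   input frames:       (t, 3, l, l)        each frame processed independently
#   module I output:    (t, C, l/4, l/4)    hierarchical multi-scale regional features
#   module II output:   (t, m, 12c)         one AU feature per AU per frame, plus attention maps
#   module III input:   (t, m, 12c)         all frames spliced together, processed jointly
#   module III output:  (t, m, 12c)         spatio-temporal correlation features of the AUs
#   module IV output:   (t, m)              per-frame, per-AU occurrence probabilities
```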
Beneficial effects: compared with the prior art, the facial action unit identification method and device based on adaptive attention and spatio-temporal correlation have the following advantages: (1) in the adaptive attention regression neural network, attention is adaptively regressed under the AU detection loss, so local features related to each AU can be accurately captured, the attention distribution is optimized using position priors, and robustness to uncontrolled scenes is improved; (2) in the adaptive space-time graph convolutional neural network, each space-time graph convolutional layer fully learns the AU relations within the spatial domain of a single frame and applies a general convolution operation in the temporal domain to mine inter-frame correlations, which promotes face AU recognition in every frame; the network finally outputs the probability of each AU for each frame; (3) because the AU relations are learned adaptively instead of being predefined from prior knowledge, the method can adapt to samples exhibiting random and diverse variations in illumination, occlusion, pose and other factors in uncontrolled scenes, and is expected to offer strong robustness while maintaining high recognition accuracy.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic structural diagram of a hierarchical multi-scale regional layer;
FIG. 3 is a schematic structural diagram of a convolutional neural network module II;
FIG. 4 is a schematic structural diagram of a space-time graph convolutional layer;
FIG. 5 is a schematic structural diagram of the entire model formed by the adaptive attention regression neural network and the adaptive space-time graph convolutional neural network.
Detailed Description
The invention is described in detail below with reference to the figures and the embodiments.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The invention provides a facial action unit identification method and device based on adaptive attention and spatio-temporal correlation. In the adaptive attention regression neural network, under the constraints of the predefined attention maps and the AU detection loss, both the strongly correlated AU regions specified by facial feature points and the weakly correlated regions distributed globally are captured; because the correlation distribution of each region is learned, feature extraction can accurately pick up the useful information of each AU. In the adaptive space-time graph convolutional neural network, each space-time graph convolutional layer fully learns the AU relations within the spatial domain of a single frame and applies a general convolution operation in the temporal domain to mine inter-frame correlations, promoting face AU recognition in every frame; finally the network outputs the probability of each AU for each frame. Because adaptive learning is adopted, the method can adapt to samples exhibiting random and diverse variations in illumination, occlusion, pose and other factors in uncontrolled scenes, and is expected to offer stronger robustness while maintaining high recognition accuracy.
Fig. 1 is a flow chart of the facial action unit identification method based on adaptive attention and spatio-temporal correlation; the steps are described below.
S01: the original continuous image frames required for training are extracted from any video to form a training data set, and the length of the extracted continuous image frame sequence is 48.
For video sequences, the frame sequence length is set to 48 to avoid collecting so few frames that the correct spatio-temporal associations between action units are difficult to learn, or so many frames that training takes too long.
S02: the original continuous image frames are preprocessed to obtain an amplified image frame sequence.
The preprocessing of the original images includes random translation, random rotation, random scaling, random horizontal flipping, random cropping, and the like; preprocessing the images improves the generalization ability of the model to a certain extent.
S03: and constructing a convolutional neural network module I to extract the characteristics of the hierarchical multi-scale region of the amplified image frame sequence.
Since the face action units of different local blocks have different face structures and texture information, each local block needs to be subjected to independent filtering processing, and different local blocks use different filtering weights.
Since the AUs of different local blocks involve different facial structures and texture information, each local block needs to be filtered independently, with different local blocks using different filtering weights. To obtain multi-scale regional features, convolutional neural network module I is adopted to learn the features of each local block at different scales. Convolutional neural network module I comprises two hierarchical multi-scale region layers with the same structure: the input of convolutional neural network module I serves as the input of the first hierarchical multi-scale region layer; the output of the first hierarchical multi-scale region layer, after a maximum pooling operation, serves as the input of the second hierarchical multi-scale region layer; and the output of the second hierarchical multi-scale region layer, after a maximum pooling operation, serves as the output of convolutional neural network module I. The frame images of the amplified image frame sequence are concatenated at the channel level to serve as the input of convolutional neural network module I, and the output of convolutional neural network module I is the hierarchical multi-scale regional features of the amplified image frame sequence.
As shown in fig. 2, each hierarchical multi-scale region layer comprises a convolutional layer I-I, a convolutional layer I-II-I, a convolutional layer I-II-II and a convolutional layer I-II-III. In convolutional layer I-I, the whole input is convolved once, and the convolution result is used as the output of convolutional layer I-I. The output of convolutional layer I-I serves as the input of convolutional layer I-II-I: in convolutional layer I-II-I, the input is first uniformly divided into 8 × 8 local blocks that are convolved separately, and all convolution results are then spliced to form the output of convolutional layer I-II-I. The output of convolutional layer I-II-I serves as the input of convolutional layer I-II-II: in convolutional layer I-II-II, the input is first uniformly divided into 4 × 4 local blocks that are convolved separately, and all convolution results are then spliced to form the output of convolutional layer I-II-II. The output of convolutional layer I-II-II serves as the input of convolutional layer I-II-III: in convolutional layer I-II-III, the input is first uniformly divided into 2 × 2 local blocks that are convolved separately, and all convolution results are then spliced to form the output of convolutional layer I-II-III. The outputs of convolutional layers I-II-I, I-II-II and I-II-III are concatenated channel-wise (the number of channels after concatenation equals the number of output channels of convolutional layer I-I) and summed with the output of convolutional layer I-I, and the result is used as the output of the hierarchical multi-scale region layer.
In convolutional neural network module I, each hierarchical multi-scale region layer is followed by a maximum pooling layer with a pooling kernel size of 2 × 2 and a stride of 2. The numbers of channels of convolutional layers I-I, I-II-I, I-II-II and I-II-III in the first hierarchical multi-scale region layer are 32, 16, 8 and 8, respectively, and their numbers of filters are 32 × 1, 16 × 8, 8 × 4 and 8 × 2, respectively; the numbers of channels of convolutional layers I-I, I-II-I, I-II-II and I-II-III in the second hierarchical multi-scale region layer are 64, 32, 16 and 16, respectively, and their numbers of filters are 64 × 1, 32 × 8, 16 × 4 and 16 × 2, respectively. The filter size in every convolutional layer is 3 × 3 and the stride is 1.
S04: constructing a convolutional neural network module II to carry out global attention map regression of AU and extract AU characteristics by using the hierarchical multi-scale region characteristics extracted in the step S03, and supervising the convolutional neural network module II through AU detection loss; AU denotes a face action unit.
As shown in fig. 3, convolutional neural network module II is a multi-layer convolutional network comprising m branches, each branch corresponding to one AU, and it performs adaptive global attention map regression and AU prediction simultaneously. The filter size of each convolutional layer is 3 × 3, and the stride is 1.
(41) Generate a predicted attention map for each AU: the hierarchical multi-scale region features are input to convolutional neural network module II, which comprises m branches, each corresponding to one AU; four convolutional layers are used to learn the global attention map M̂_ij of each AU and to extract AU features.
(42) Generate a true attention map for each AU: each AU has two centers, specified by two related facial feature points. The true attention map is generated from a Gaussian distribution around each center point; if the coordinates of an AU center are (a_c, b_c), then the true attention weight at location (a, b) on the attention map is

M_ijab = exp(-((a - a_c)² + (b - b_c)²) / (2σ²))

where σ controls the spread of the Gaussian. The larger of the two attention weights is then selected at each location to merge the predefined attention maps of the two AU centers, i.e.

M_ijab = max(M_ijab^(1), M_ijab^(2))

and an attention regression loss is adopted to encourage the predicted map M̂_ij to approach M_ij:

L_a = (1 / (t·m)) Σ_{i=1}^{t} Σ_{j=1}^{m} (1 / (l/4)²) Σ_{a=1}^{l/4} Σ_{b=1}^{l/4} (M_ijab - M̂_ijab)²

wherein: L_a is the loss function of the global attention map regression; t is the length of the amplified image frame sequence; m is the number of AUs in each frame image; l/4 × l/4 is the size of the global attention map; M_ijab is the true attention weight of the jth AU of the ith frame image at coordinate position (a, b); and M̂_ijab is the predicted attention weight of the jth AU of the ith frame image at coordinate position (a, b).
(43) Extract AU features and perform AU detection: the predicted global attention map M̂_ij is multiplied element-wise with the facial feature map obtained by the fourth convolutional layer II-II, so as to strengthen the features of the regions with larger attention weights; the resulting output features are input to convolutional layers II-III, and AU features are then extracted through a global average pooling layer. A cross-entropy detection loss is adopted to promote adaptive training of the attention maps: the learned AU features are input to a one-dimensional fully-connected layer, and a Sigmoid function δ(x) = 1/(1 + e^{-x}) is then used to predict the occurrence probability of each AU.
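A hedged PyTorch sketch of one AU branch covering steps (41) to (43) is given below. For simplicity the predicted attention map re-weights the branch input directly, whereas in the module described above it re-weights the feature map of the fourth convolutional layer II-II; the channel widths and class names are illustrative assumptions.

```python
# Sketch of one AU branch of module II: attention regression, element-wise re-weighting,
# feature convolution, global average pooling, and a one-dimensional FC with Sigmoid.
import torch
import torch.nn as nn

class AUAttentionBranch(nn.Module):
    def __init__(self, in_channels: int = 64, feat_channels: int = 64, au_dim: int = 96):
        super().__init__()
        # four convolutional layers regress the global attention map of this AU
        self.attention_net = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, 1, 3, padding=1), nn.Sigmoid(),
        )
        self.feature_conv = nn.Conv2d(in_channels, au_dim, 3, padding=1)   # layers II-III
        self.pool = nn.AdaptiveAvgPool2d(1)                                # global average pooling
        self.classifier = nn.Linear(au_dim, 1)                             # one-dimensional FC

    def forward(self, face_feat):
        # face_feat: (batch, in_channels, l/4, l/4) feature map from the preceding layers
        attn = self.attention_net(face_feat)              # predicted global attention map
        attended = face_feat * attn                       # element-wise re-weighting
        au_feat = self.pool(self.feature_conv(attended)).flatten(1)
        prob = torch.sigmoid(self.classifier(au_feat))    # occurrence probability of this AU
        return attn.squeeze(1), au_feat, prob
```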
The weighted cross-entropy loss function adopted for AU recognition is:

L_au = -(1/(t·m)) Σ_{i=1}^{t} Σ_{j=1}^{m} ω_j [ v_j p_ij log p̂_ij + (1 - p_ij) log(1 - p̂_ij) ]

wherein: L_au is the weighted cross-entropy loss of AU recognition; p_ij is the true probability that the jth AU of the ith frame image occurs; p̂_ij is the predicted probability that the jth AU of the ith frame image occurs; ω_j is the weight of the jth AU; and v_j is the weight on the occurrence term (the first term of the cross entropy) for the jth AU.
Since the occurrence rates of different AUs in the training data set differ significantly, and for most AUs the occurrence rate is much lower than the non-occurrence rate, ω_j and v_j are defined to suppress these two data imbalance problems as:

ω_j = m (1/r_j) / Σ_{k=1}^{m} (1/r_k),   v_j = 1/r_j

where n and n_j are, respectively, the total number of training samples and the number of samples in which the jth AU occurs, and the occurrence rate of the jth AU can be expressed as r_j = n_j / n.
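A small NumPy sketch of these class-imbalance weights follows; it assumes the inverse-occurrence-rate weighting reconstructed above (the exact normalization is an assumption) and an illustrative binary label matrix.

```python
# Sketch of the per-AU imbalance weights: occurrence rates r_j = n_j / n, inverse-rate AU
# weights omega_j normalized over the m AUs, and occurrence-term weights v_j.
import numpy as np

def imbalance_weights(au_labels):
    # au_labels: (n, m) binary occurrence labels over the training set
    n, m = au_labels.shape
    r = np.clip(au_labels.sum(axis=0) / n, 1e-6, None)    # occurrence rate r_j = n_j / n
    omega = (1.0 / r) * m / np.sum(1.0 / r)               # rarer AUs receive larger weights
    v = 1.0 / r                                           # extra weight on the occurrence term
    return omega, v

omega, v = imbalance_weights(np.random.randint(0, 2, size=(1000, 12)))
```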
(44) Overall loss function of convolutional neural network module I and convolutional neural network module II: the overall loss function is obtained by combining the attention map regression loss and the AU detection loss,

L_AA = L_au + λ_a · L_a

wherein: L_AA is the overall loss function of convolutional neural network module I and convolutional neural network module II, and λ_a balances the attention regression loss against the AU detection loss.
S05: construct an adaptive space-time graph convolutional neural network module III by using the AU features extracted in step S04, and reason about the specific pattern of each AU and the spatio-temporal correlations (such as co-occurrence and mutual exclusion) among different AUs, so as to learn the spatio-temporal correlation features of each AU.
The adaptive space-time graph convolutional neural network module III is composed of two space-time graph convolutional layers with the same structure. The m AU features of dimension 12c from each frame are spliced into an overall feature of size t × m × 12c, which serves as the input of space-time graph convolutional layer III-I; the output of space-time graph convolutional layer III-I serves as the input of space-time graph convolutional layer III-II, whose output features contain the specific patterns of the AUs and the spatio-temporal correlations among AUs.
The parameters of the two space-time graph convolutional layers are learned independently. Each space-time graph convolutional layer is formed by combining a spatial graph convolution unit with a gated recurrent unit; the structure of each space-time graph convolutional layer is shown in fig. 4, and it is built up through the following steps:
(51) reasoning about the specific mode of each AU:
(51) Reason about the specific pattern of each AU:
A typical graph convolution is computed in the spectral domain and is well approximated by a first-order Chebyshev polynomial expansion:

F_out = D̃^{-1/2} (A + I) D̃^{-1/2} F_in Θ^{(0)}

wherein: F_in and F_out are, respectively, the input and output of the graph convolutional layer; I ∈ R^{m×m} is the identity matrix; A ∈ R^{m×m} is a symmetric weighted adjacency matrix representing the strength of the connecting edges between nodes; D̃ is the degree matrix, with D̃_aa = Σ_b (A + I)_ab; and Θ^{(0)} is a parameter matrix. Graph convolution essentially learns a parameter matrix A and a matrix Θ^{(0)} shared by all AUs to transform the input F_in into F_out.
Although the above equation is able to learn the interrelations of AUs, it ignores the specific pattern of each AU. For this purpose, a shared parameter matrix A is used to infer the inter-AU relationships, while an independent parameter Θ_j^{(1)} is used for each AU (stacked into a three-dimensional matrix Θ^{(1)}), and the resulting graph convolution operation is:

F_out = D̃^{-1/2} (A + I) D̃^{-1/2} F_in ⊙ Θ^{(1)}

wherein: Z = X ⊙ Y denotes the operation between a two-dimensional matrix X and a three-dimensional matrix Y that yields a two-dimensional matrix Z, in which the element at index position (a, b) of Z is Z_ab = Σ_k X_ak Y_akb; X_ak is the element at index position (a, k) of X, and Y_akb is the element at index position (a, k, b) of Y.
To reduce the number of parameters in the Θ matrix, a feature decomposition matrix Q ∈ R^{m×c_e} and a shared parameter matrix W are introduced, and the graph convolution is re-expressed as:

F_out = D̃^{-1/2} (A + I) D̃^{-1/2} F_in ⊙ (QW)

wherein: Θ^{(1)} = QW, and the intermediate dimension c_e is usually smaller than m. For the jth AU, its parameter Θ_j^{(1)} can be separated from the shared parameter matrix W by the corresponding row of the feature decomposition matrix Q; the use of the matrices Q and W thus facilitates reasoning about the specific pattern of each AU.
(52) Infer the interrelations between AUs in the spatial domain: to reduce the amount of computation, a matrix U ∈ R^{m×c_e} is learned directly, rather than learning the matrix A and then further computing the normalized adjacency matrix D̃^{-1/2}(A + I)D̃^{-1/2}; that is, the normalized adjacency matrix is taken as

I + N(R(U U^T))

where R(·) is the Rectified Linear Unit (ReLU) activation function and N(·) is a normalization function, so that the dependencies between AUs, such as co-occurrence and mutual exclusion, can be adaptively encoded. The graph convolution is then re-expressed as:

F_out = (I + N(R(U U^T))) F_in ⊙ (QW)

Since the matrix U and the matrix W are parameter matrices shared by all AUs, the dependency relationships between AUs in the spatial domain are thereby inferred.
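A small NumPy sketch of this adaptive spatial graph convolution, F_out = (I + N(R(U U^T))) F_in ⊙ (QW), is given below under the shapes used above (m AUs, feature dimension c', intermediate dimension c_e); the concrete sizes and function names are illustrative assumptions.

```python
# Sketch of the adaptive spatial graph convolution with per-AU parameters Θ = Q W.
import numpy as np

def adaptive_graph_conv(F_in, U, Q, W):
    # F_in: (m, c'), U: (m, c_e), Q: (m, c_e), W: (c_e, c', c')
    m = F_in.shape[0]
    A = np.maximum(U @ U.T, 0.0)                                    # R(U U^T): rectification
    A = A / np.clip(A.sum(axis=1, keepdims=True), 1e-6, None)       # N(.): row normalization
    A_hat = np.eye(m) + A                                           # I + N(R(U U^T))
    theta = np.einsum('me,eio->mio', Q, W)                          # per-AU parameters Θ = Q W
    return np.einsum('mn,ni,mio->mo', A_hat, F_in, theta)           # (A_hat F_in) ⊙ Θ

# Example shapes: m = 12 AUs, c' = 96, c_e = 4 (assumed)
F_out = adaptive_graph_conv(np.random.randn(12, 96), np.random.randn(12, 4),
                            np.random.randn(12, 4), np.random.randn(4, 96, 96))
```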
(53) Reason about the relations between frames in the temporal domain: the gated recurrent unit (GRU) is a popular method for modeling temporal dynamics. A GRU consists of an update gate z and a reset gate r, where the gating mechanism at time step T is defined as follows:

z_T = σ(W_z C(h_{T-1}, x_T))
r_T = σ(W_r C(h_{T-1}, x_T))
h̃_T = tanh(W_h̃ C(r_T ∘ h_{T-1}, x_T))
h_T = z_T ∘ h_{T-1} + (1 - z_T) ∘ h̃_T

wherein: h_T is the final hidden state at time T (the output at time T) and h̃_T is the initial (candidate) hidden state at time T; z_T decides how much of h_{T-1} is retained at time T, and r_T determines how x_T and h_{T-1} are combined at time T; ∘ denotes element-wise multiplication, C(·) denotes the splicing operation, σ(·) denotes the Sigmoid function, and tanh(·) denotes the hyperbolic tangent activation function.
The final definition of each space-time graph convolutional layer obtained from the above process is:

z_T = σ((I + N(R(U U^T))) C(h_{T-1}, x_T) ⊙ (Q W_z))
r_T = σ((I + N(R(U U^T))) C(h_{T-1}, x_T) ⊙ (Q W_r))
h̃_T = tanh((I + N(R(U U^T))) C(r_T ∘ h_{T-1}, x_T) ⊙ (Q W_h̃))
h_T = z_T ∘ h_{T-1} + (1 - z_T) ∘ h̃_T

wherein: x_T is the input at time T; h_T is the final hidden state at time T (the output at time T); and W_z, W_r and W_h̃ are the adaptively learned weight matrices for z_T, r_T and h̃_T, respectively. The input matrix of the layer is formed by the inputs x_1, x_2, …, x_{t'} of all time steps, and the output matrix is obtained by splicing the per-step outputs h_1, h_2, …, h_{t'} of the graph convolutional layer along the temporal dimension t; t' is the total number of frames input to the space-time graph convolutional layer, and t' = 48.
S06: construct full-connection module IV to realize AU recognition by utilizing the spatio-temporal correlation features of the AUs extracted in step S05.
Full-connection module IV is formed by one-dimensional fully-connected layers, each followed by a Sigmoid activation function. The overall feature of dimension t × m × 12c obtained in step S05 is decomposed frame by frame and AU by AU into a feature vector of dimension 12c for each AU of each frame image; the feature vector of the jth AU of the ith frame image is input to the jth fully-connected layer, whose output is passed through a Sigmoid activation function to predict the AU occurrence probability, and full-connection module IV uses the same fully-connected layer for the same AU across different frame images. Since the learned features carry the spatio-temporal correlation information of the AUs, they benefit the final AU detection, and the following loss function is adopted to guide the learning of the space-time graph convolution parameter matrices:
L_final = -(1/(t·m)) Σ_{i=1}^{t} Σ_{j=1}^{m} ω_j [ v_j p_ij log p̃_ij + (1 - p_ij) log(1 - p̃_ij) ]

wherein: L_final is the loss function of the final AU recognition; t is the length of the amplified image frame sequence; p_ij is the true probability that the jth AU of the ith frame image occurs; p̃_ij is the final predicted probability that the jth AU of the ith frame image occurs; ω_j is the weight of the jth AU; and v_j is the weight on the occurrence term (the first term of the cross entropy) for the jth AU.
S07: and training the whole AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the self-adaptive space-time graph convolutional neural network module III and the full-connection module IV by using a training data set, and updating the parameters of the whole AU recognition network model by using a gradient-based optimization method.
The whole model, consisting of the convolutional neural networks and the graph convolutional neural network (see FIG. 5), is trained in an end-to-end manner: the convolutional neural network modules are first trained to extract AU features, which serve as the input of the graph convolutional neural network; the graph convolutional neural network is then trained to learn the specific patterns and spatio-temporal correlations of the AUs, and the spatio-temporal correlations among AUs are exploited to promote the recognition of facial action units.
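A hedged sketch of such an end-to-end training loop is given below. It assumes PyTorch, a single `model` combining the modules sketched earlier and returning the predicted attention maps plus the branch and final AU probabilities, and Adam as the gradient-based optimizer; the optimizer choice, learning rate and loss weighting are assumptions, and the per-AU imbalance weights sketched earlier are omitted for brevity.

```python
# Sketch of end-to-end training with a gradient-based optimizer.
import torch

def binary_ce(pred, target, eps=1e-6):
    return -(target * torch.log(pred + eps) + (1 - target) * torch.log(1 - pred + eps)).mean()

def train(model, loader, epochs=10, lambda_a=1.0, lr=1e-4, device="cpu"):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for frames, au_labels, attn_targets in loader:
            # frames: (b, t, 3, l, l); au_labels: (b, t, m); attn_targets: (b, t, m, l/4, l/4)
            frames, au_labels = frames.to(device), au_labels.to(device)
            attn_targets = attn_targets.to(device)
            attn_pred, p_branch, p_final = model(frames)
            loss = (binary_ce(p_branch, au_labels)                            # module II detection loss
                    + lambda_a * ((attn_pred - attn_targets) ** 2).mean()     # attention regression loss
                    + binary_ce(p_final, au_labels))                          # final recognition loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```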
S08: and inputting the video sequence with any given frame number into the trained integral AU identification network model to predict the occurrence probability of AUs.
At prediction time, the recognition results of the facial action units are output directly.
The method can be fully implemented by a computer without manual assistance; it supports automatic batch processing, which greatly improves processing efficiency and reduces labor cost.
A facial action unit recognition device implementing the above adaptive attention and spatio-temporal correlation method comprises an image frame sequence acquisition unit, a hierarchical multi-scale region learning unit, an adaptive attention regression and feature extraction unit, an adaptive space-time graph convolution learning unit, an AU recognition unit and a parameter optimization unit;
the image frame sequence acquisition unit is used for extracting a large number of original continuous images required by training from any video data to form a training data set, and preprocessing the original continuous image frames to obtain an amplified image frame sequence;
the hierarchical multi-scale area learning unit comprises a convolutional neural network module I, learns the characteristics of each local block under different scales of each frame of input image by adopting a hierarchical multi-scale area layer, and independently filters each local block;
the self-adaptive attention regression and feature extraction unit comprises a convolution neural network module II, and is used for generating a global attention diagram of an image and performing self-adaptive regression under the supervision of a predefined attention diagram and AU detection loss, and simultaneously extracting AU features accurately.
The self-adaptive space-time graph convolution learning unit comprises a self-adaptive space-time graph convolution neural network module III, and is used for learning a specific mode of each AU and learning the space-time correlation among different AUs and extracting the space-time correlation characteristics of each AU;
the AU identification unit comprises a full connection module IV, and can effectively identify AUs by utilizing the space-time association characteristics of each AU;
the parameter optimization unit calculates parameters and loss function values of an overall AU identification network model formed by the convolutional neural network module I, the convolutional neural network module II, the self-adaptive space-time graph convolutional neural network module III and the full-connection module IV, and updates the parameters based on a gradient optimization method.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It should be understood by those skilled in the art that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by using equivalent alternatives or equivalent variations fall within the scope of the present invention.

Claims (6)

1. A facial action unit recognition method based on adaptive attention and spatio-temporal correlation, characterized in that the method comprises the following steps:
S01: extracting the original continuous image frames required for training from a video to form a training data set;
S02: preprocessing the original continuous image frames to obtain an amplified image frame sequence;
S03: constructing a convolutional neural network module I to extract the hierarchical multi-scale region features of each frame in the amplified image frame sequence;
S04: constructing a convolutional neural network module II that uses the hierarchical multi-scale region features extracted in step S03 to perform global attention map regression of the AUs and to extract AU features, the convolutional neural network module II being supervised through the AU detection loss; AU denotes a facial action unit;
S05: constructing an adaptive spatio-temporal graph convolutional neural network module III that uses the AU features extracted in step S04 to infer the specific pattern of each AU and the spatio-temporal correlations among different AUs, so as to learn the spatio-temporal correlation features of each AU;
S06: constructing a full-connection module IV that uses the spatio-temporal correlation features of the AUs extracted in step S05 to carry out AU recognition;
S07: training the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatio-temporal graph convolutional neural network module III and the full-connection module IV on the training data set, and updating the parameters of the overall AU recognition network model with a gradient-based optimization method;
S08: inputting a video sequence with any given number of frames into the trained overall AU recognition network model to predict the occurrence probabilities of the AUs.
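For illustration, the chain of steps S03-S06 and the prediction of step S08 can be sketched in PyTorch as follows; the function and argument names are assumptions, and modules I-IV are assumed to be callable network modules (possible forms of modules I, III and IV are sketched after claims 2, 4 and 5 below).

import torch

@torch.no_grad()
def predict_au_probabilities(frames, module1, module2, module3, module4):
    # frames:  preprocessed, amplified image frame sequence of shape (t, 3, H, W)
    # module1: convolutional neural network module I  -> hierarchical multi-scale region features
    # module2: convolutional neural network module II -> global attention maps and per-frame AU features
    # module3: adaptive spatio-temporal graph convolutional neural network module III
    # module4: full-connection module IV              -> AU occurrence probabilities
    region_feats = module1(frames)               # step S03
    att_maps, au_feats = module2(region_feats)   # step S04, au_feats: (t, m, c')
    st_feats = module3(au_feats)                 # step S05, spatio-temporal correlation features
    return module4(st_feats)                     # steps S06/S08, (t, m) occurrence probabilities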
2. The facial action unit recognition method based on adaptive attention and spatio-temporal correlation according to claim 1, characterized in that: in step S03, the features of each local block at different scales are learned by the convolutional neural network module I; the convolutional neural network module I comprises two hierarchical multi-scale region layers with the same structure, the input of the convolutional neural network module I serves as the input of the first hierarchical multi-scale region layer, the output of the first hierarchical multi-scale region layer, after a max-pooling operation, serves as the input of the second hierarchical multi-scale region layer, and the output of the second hierarchical multi-scale region layer, after a max-pooling operation, serves as the output of the convolutional neural network module I; the output of the convolutional neural network module I is the hierarchical multi-scale region features of the amplified image frame sequence;
each hierarchical multi-scale region layer comprises a convolutional layer I-I, a convolutional layer I-II-I, a convolutional layer I-II-II and a convolutional layer I-II-III; in the convolutional layer I-I, the whole input is convolved once, and the convolution result serves as the output of the convolutional layer I-I; the output of the convolutional layer I-I serves as the input of the convolutional layer I-II-I, in which the input is first uniformly divided into 8 × 8 local blocks that are convolved separately, and all convolution results are then stitched together to form the output of the convolutional layer I-II-I; the output of the convolutional layer I-II-I serves as the input of the convolutional layer I-II-II, in which the input is first uniformly divided into 4 × 4 local blocks that are convolved separately, and all convolution results are then stitched together to form the output of the convolutional layer I-II-II; the output of the convolutional layer I-II-II serves as the input of the convolutional layer I-II-III, in which the input is first uniformly divided into 2 × 2 local blocks that are convolved separately, and all convolution results are then stitched together to form the output of the convolutional layer I-II-III; the outputs of the convolutional layers I-II-I, I-II-II and I-II-III are concatenated channel-wise and then summed with the output of the convolutional layer I-I, and the result serves as the output of the hierarchical multi-scale region layer.
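A minimal PyTorch sketch of one possible hierarchical multi-scale region layer follows; it assumes that "8 × 8 local blocks" means an 8 × 8 grid of equally sized blocks, that each block has its own convolution filters, and that the output channels are split evenly across the three scales. The class names, channel split and kernel sizes are illustrative assumptions, not the patented implementation.

import torch
import torch.nn as nn

class PatchwiseConv(nn.Module):
    # Splits the feature map into a grid x grid set of local blocks, applies an
    # independent 3x3 convolution to every block, and stitches the results back together.
    def __init__(self, in_ch, out_ch, grid):
        super().__init__()
        self.grid = grid
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=1) for _ in range(grid * grid)]
        )

    def forward(self, x):
        # spatial size is assumed to be divisible by the grid size
        n, _, h, w = x.shape
        bh, bw = h // self.grid, w // self.grid
        rows = []
        for i in range(self.grid):
            cols = []
            for j in range(self.grid):
                block = x[:, :, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
                cols.append(self.convs[i * self.grid + j](block))
            rows.append(torch.cat(cols, dim=3))
        return torch.cat(rows, dim=2)

class HierarchicalMultiScaleRegionLayer(nn.Module):
    # Convolution I-I over the whole map, followed by patch-wise convolutions on
    # 8x8, 4x4 and 2x2 grids (layers I-II-I, I-II-II, I-II-III); the three patch-wise
    # outputs are concatenated channel-wise and summed with the output of I-I.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        assert out_ch % 3 == 0  # assumed even channel split across the three scales
        self.conv_whole = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.scale8 = PatchwiseConv(out_ch, out_ch // 3, grid=8)
        self.scale4 = PatchwiseConv(out_ch // 3, out_ch // 3, grid=4)
        self.scale2 = PatchwiseConv(out_ch // 3, out_ch // 3, grid=2)

    def forward(self, x):
        y = self.conv_whole(x)   # convolutional layer I-I
        y8 = self.scale8(y)      # convolutional layer I-II-I
        y4 = self.scale4(y8)     # convolutional layer I-II-II
        y2 = self.scale2(y4)     # convolutional layer I-II-III
        return torch.cat([y8, y4, y2], dim=1) + y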
3. The facial action unit recognition method based on adaptive attention and spatio-temporal correlation according to claim 1, characterized in that: in step S04, the convolutional neural network module II is used to predict the global attention map and the occurrence probability of each AU, the global attention map of each AU being adaptively regressed toward a predefined attention map under the supervision of the AU detection loss, while the AU features are extracted; the loss functions used for the AU detection loss are:

$$L_{a}=\frac{1}{T\,m}\sum_{i=1}^{T}\sum_{j=1}^{m}\sum_{a=1}^{l/4}\sum_{b=1}^{l/4}\left(M_{ijab}-\hat{M}_{ijab}\right)^{2}$$

$$L_{au}=-\frac{1}{T\,m}\sum_{i=1}^{T}\sum_{j=1}^{m}\omega_{j}\left[v_{j}\,p_{ij}\log\hat{p}_{ij}+\left(1-p_{ij}\right)\log\left(1-\hat{p}_{ij}\right)\right]$$

$$L_{AA}=L_{au}+\lambda_{a}L_{a}$$

wherein: $L_{a}$ denotes the loss function of the global attention map regression; $L_{au}$ denotes the weighted cross-entropy loss function of AU recognition; $L_{AA}$ denotes the overall loss function of the convolutional neural network module I and the convolutional neural network module II; $\lambda_{a}$ denotes the weight of the global attention map regression loss; $T$ denotes the length of the amplified image frame sequence; $m$ denotes the number of AUs in each frame image; $l/4\times l/4$ denotes the size of the global attention map; $M_{ijab}$ denotes the true attention weight of the $j$th AU of the $i$th frame image at coordinate position $(a,b)$; $\hat{M}_{ijab}$ denotes the predicted attention weight of the $j$th AU of the $i$th frame image at coordinate position $(a,b)$; $p_{ij}$ denotes the true probability of occurrence of the $j$th AU of the $i$th frame image; $\hat{p}_{ij}$ denotes the predicted probability of occurrence of the $j$th AU of the $i$th frame image; $\omega_{j}$ denotes the weight of the $j$th AU; $v_{j}$ denotes the weight of the occurrence of the $j$th AU.
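A minimal PyTorch sketch of loss functions of this shape follows, assuming a mean-squared-error attention map regression term and a weighted cross-entropy detection term combined with the weight lambda_a; the tensor shapes and the function name are illustrative assumptions, not the patented implementation.

import torch

def au_detection_loss(att_pred, att_true, p_pred, p_true,
                      au_weight, occ_weight, lambda_a=1.0, eps=1e-8):
    # att_pred / att_true: (T, m, l/4, l/4) predicted and predefined attention maps
    # p_pred / p_true:     (T, m) predicted and ground-truth AU occurrence probabilities
    # au_weight (omega_j) and occ_weight (v_j): per-AU weights, shape (m,)
    # L_a: mean squared error between predicted and predefined attention maps
    loss_att = ((att_pred - att_true) ** 2).mean()
    # L_au: weighted cross-entropy over frames and AUs
    ce = -(occ_weight * p_true * torch.log(p_pred + eps)
           + (1.0 - p_true) * torch.log(1.0 - p_pred + eps))
    loss_au = (au_weight * ce).mean()
    # L_AA: overall loss of modules I and II
    return loss_au + lambda_a * loss_att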
4. The facial action unit recognition method based on adaptive attention and spatio-temporal correlation according to claim 1, characterized in that: in step S05, the adaptive spatio-temporal graph convolutional neural network module III is used to extract the specific pattern of each AU and the spatio-temporal correlations among different AUs, so as to learn the spatio-temporal correlation features of each AU; the input of the adaptive spatio-temporal graph convolutional neural network module III is all the AU features of the t frame images extracted in step S04, each frame image containing m AU features, i.e. t × m AU features in total, the dimension of each AU feature being c′, with c′ = 12c, where c is a configuration parameter of the overall AU recognition network model; the adaptive spatio-temporal graph convolutional neural network module III comprises two spatio-temporal graph convolutional layers with the same structure whose parameters are learned independently, each spatio-temporal graph convolutional layer combining a spatial graph convolution with a gated recurrent unit and being defined as:

$$z_{T}=\sigma\!\left(N\!\left(R\!\left(E+UU^{\top}-QQ^{\top}\right)\right)\,C\!\left(\tilde{x}_{T},h_{T-1}\right)\circledast W_{z}\right)$$

$$r_{T}=\sigma\!\left(N\!\left(R\!\left(E+UU^{\top}-QQ^{\top}\right)\right)\,C\!\left(\tilde{x}_{T},h_{T-1}\right)\circledast W_{r}\right)$$

$$\tilde{h}_{T}=\tanh\!\left(N\!\left(R\!\left(E+UU^{\top}-QQ^{\top}\right)\right)\,C\!\left(\tilde{x}_{T},r_{T}\circ h_{T-1}\right)\circledast W_{h}\right)$$

$$h_{T}=z_{T}\circ h_{T-1}+\left(1-z_{T}\right)\circ\tilde{h}_{T}$$

wherein: $\tilde{x}_{T}$ denotes the input at time T, $h_{T}$ denotes the final hidden state at time T, and $\tilde{h}_{T}$ denotes the initial hidden state at time T; $z_{T}$ is used to decide how much of $h_{T-1}$ is retained at time T, and $r_{T}$ is used to decide the combination of $\tilde{x}_{T}$ and $h_{T-1}$; $E\in\mathbb{R}^{m\times m}$ denotes the identity matrix, where m is the number of AUs in each frame image; $U$ denotes the adaptively learned matrix of the AU relation graph, and $U^{\top}$ denotes the transpose of $U$; $Q\in\mathbb{R}^{m\times c_{e}}$ denotes the adaptively learned dissociation matrix, where $c_{e}$ is the number of columns set for $Q$; $W_{z}$, $W_{r}$ and $W_{h}$ denote the adaptively learned three-dimensional weight tensors for computing $z_{T}$, $r_{T}$ and $\tilde{h}_{T}$, respectively; $c'$ denotes the dimension of the AU features, c′ = 12c, where c is a configuration parameter of the overall AU recognition network model;

$R(X)$ denotes removing the negative values of the two-dimensional matrix $X$: after this processing, the element $X_{ab}$ at index position $(a,b)$ is updated to $X_{ab}=\max(0,X_{ab})$;

$N(X)$ denotes normalizing the two-dimensional matrix $X$: after this processing, the element $X_{ab}$ at index position $(a,b)$ is updated to $X_{ab}=X_{ab}/\sum_{k}X_{ak}$;

$Z=X\circledast Y$ denotes the operation on a two-dimensional matrix $X$ and a three-dimensional matrix $Y$ that yields the two-dimensional matrix $Z$ whose element at index position $(a,b)$ is $Z_{ab}=\sum_{k}X_{ak}Y_{akb}$, where $X_{ak}$ denotes the element at index position $(a,k)$ of $X$ and $Y_{akb}$ denotes the element at index position $(a,k,b)$ of $Y$;

$\circ$ denotes element-wise multiplication, $C(\cdot)$ denotes the concatenation operation, $\sigma(\cdot)$ denotes the Sigmoid function, and $\tanh(\cdot)$ denotes the hyperbolic tangent activation function.
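A hedged PyTorch sketch of one spatio-temporal graph convolutional layer follows; the way E, U and Q are combined into the AU relation graph, the row normalization in N(·), the shapes of the per-AU three-dimensional weight tensors, and all names and initializations are assumptions consistent with the definitions above, not a definitive implementation.

import torch
import torch.nn as nn

class AdaptiveSpatioTemporalGraphConvCell(nn.Module):
    # One spatio-temporal graph convolutional layer: a spatial graph convolution over
    # the m AU nodes combined with a gated recurrent unit over time.
    def __init__(self, m, c_feat, c_e):
        super().__init__()
        self.U = nn.Parameter(torch.randn(m, c_e) * 0.01)  # adaptive relation embedding
        self.Q = nn.Parameter(torch.randn(m, c_e) * 0.01)  # adaptive dissociation embedding
        # per-AU weight tensors applied with the product Z_ab = sum_k X_ak Y_akb
        self.W_z = nn.Parameter(torch.randn(m, 2 * c_feat, c_feat) * 0.01)
        self.W_r = nn.Parameter(torch.randn(m, 2 * c_feat, c_feat) * 0.01)
        self.W_h = nn.Parameter(torch.randn(m, 2 * c_feat, c_feat) * 0.01)

    def relation_graph(self):
        # assumed composition of the AU relation graph from E, U and Q
        m = self.U.shape[0]
        eye = torch.eye(m, device=self.U.device)
        a = torch.relu(eye + self.U @ self.U.t() - self.Q @ self.Q.t())  # R(.)
        return a / a.sum(dim=1, keepdim=True).clamp_min(1e-8)            # N(.), row-normalized

    def forward(self, x_t, h_prev):
        # x_t, h_prev: (m, c_feat) AU features of the current frame / previous hidden state
        a = self.relation_graph()

        def gconv(inp, w):
            # graph propagation followed by the per-AU weight product
            prop = a @ inp                              # (m, 2*c_feat)
            return torch.einsum('ak,akb->ab', prop, w)  # (m, c_feat)

        z = torch.sigmoid(gconv(torch.cat([x_t, h_prev], dim=1), self.W_z))
        r = torch.sigmoid(gconv(torch.cat([x_t, h_prev], dim=1), self.W_r))
        h_cand = torch.tanh(gconv(torch.cat([x_t, r * h_prev], dim=1), self.W_h))
        return z * h_prev + (1.0 - z) * h_cand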
5. The facial action unit recognition method based on adaptive attention and spatio-temporal correlation according to claim 1, characterized in that: in step S06, the full-connection module IV is used to carry out AU recognition for each frame image; among the spatio-temporal correlation features of all the AUs contained in the t frame images output in step S05, the spatio-temporal correlation feature $g_{ij}$ of the $j$th AU of the $i$th frame image is input into the full-connection module IV, and the full-connection module IV predicts the final occurrence probability of the $j$th AU of the $i$th frame image from $g_{ij}$ by applying the $j$th one-dimensional fully connected layer followed by a Sigmoid activation function, the full-connection module IV using the same fully connected layer for the same AU in different frame images; the loss function used for AU recognition is:

$$\hat{L}_{au}=-\frac{1}{T\,m}\sum_{i=1}^{T}\sum_{j=1}^{m}\omega_{j}\left[v_{j}\,p_{ij}\log\hat{p}_{ij}+\left(1-p_{ij}\right)\log\left(1-\hat{p}_{ij}\right)\right]$$

wherein: $\hat{L}_{au}$ denotes the loss function of AU recognition; $T$ denotes the length of the amplified image frame sequence; $m$ denotes the number of AUs in each frame image; $p_{ij}$ denotes the true probability of occurrence of the $j$th AU of the $i$th frame image; $\hat{p}_{ij}$ denotes the final predicted probability of occurrence of the $j$th AU of the $i$th frame image; $\omega_{j}$ denotes the weight of the $j$th AU; $v_{j}$ denotes the weight of the occurrence of the $j$th AU.
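As an illustrative PyTorch sketch of the full-connection module IV: one one-dimensional fully connected layer per AU followed by a Sigmoid, with the same layer reused for that AU in every frame. The class name and tensor layout are assumptions made for this sketch.

import torch
import torch.nn as nn

class AUClassificationHead(nn.Module):
    # Full-connection module IV: the j-th AU has its own fully connected layer
    # followed by a Sigmoid; the same layer is shared across all frames for that AU.
    def __init__(self, m, c_feat):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(c_feat, 1) for _ in range(m)])

    def forward(self, g):
        # g: (t, m, c_feat) spatio-temporal correlation features of all AUs
        probs = [torch.sigmoid(self.heads[j](g[:, j, :])) for j in range(len(self.heads))]
        return torch.cat(probs, dim=1)  # (t, m) predicted occurrence probabilities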
6. A facial action unit recognition device for implementing the facial action unit recognition method based on adaptive attention and spatio-temporal correlation according to any one of claims 1-5, characterized in that: the device comprises an image frame sequence acquisition unit, a hierarchical multi-scale region learning unit, an adaptive attention regression and feature extraction unit, an adaptive spatio-temporal graph convolution learning unit, an AU recognition unit and a parameter optimization unit;
the image frame sequence acquisition unit is used for extracting the original continuous image frames required for training from the video data to form a training data set, and for preprocessing the original continuous image frames to obtain an amplified image frame sequence;
the hierarchical multi-scale region learning unit comprises the convolutional neural network module I, which learns the features of each local block of every input image frame at different scales by means of hierarchical multi-scale region layers and filters each local block independently;
the adaptive attention regression and feature extraction unit comprises the convolutional neural network module II, which generates a global attention map for each image and adaptively regresses it under the supervision of a predefined attention map and the AU detection loss, while accurately extracting AU features;
the adaptive spatio-temporal graph convolution learning unit comprises the adaptive spatio-temporal graph convolutional neural network module III, which learns the specific pattern of each AU and the spatio-temporal correlations among different AUs, and extracts the spatio-temporal correlation features of each AU;
the AU recognition unit comprises the full-connection module IV, which recognizes AUs effectively by using the spatio-temporal correlation features of each AU;
the parameter optimization unit computes the parameters and loss function values of the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatio-temporal graph convolutional neural network module III and the full-connection module IV, and updates the parameters with a gradient-based optimization method.
CN202210606040.5A 2022-05-31 2022-05-31 Facial action unit identification method and device based on self-adaptive attention and space-time correlation Active CN114842542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210606040.5A CN114842542B (en) 2022-05-31 2022-05-31 Facial action unit identification method and device based on self-adaptive attention and space-time correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210606040.5A CN114842542B (en) 2022-05-31 2022-05-31 Facial action unit identification method and device based on self-adaptive attention and space-time correlation

Publications (2)

Publication Number Publication Date
CN114842542A true CN114842542A (en) 2022-08-02
CN114842542B CN114842542B (en) 2023-06-13

Family

ID=82572471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210606040.5A Active CN114842542B (en) 2022-05-31 2022-05-31 Facial action unit identification method and device based on self-adaptive attention and space-time correlation

Country Status (1)

Country Link
CN (1) CN114842542B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190228211A1 (en) * 2017-08-17 2019-07-25 Ping An Technology (Shenzhen) Co., Ltd. Au feature recognition method and device, and storage medium
CN110363156A (en) * 2019-07-17 2019-10-22 北京师范大学 A kind of Facial action unit recognition methods that posture is unrelated
WO2021196389A1 (en) * 2020-04-03 2021-10-07 平安科技(深圳)有限公司 Facial action unit recognition method and apparatus, electronic device, and storage medium
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution
CN112633153A (en) * 2020-12-22 2021-04-09 天津大学 Facial expression motion unit identification method based on space-time graph convolutional network
CN112990077A (en) * 2021-04-02 2021-06-18 中国矿业大学 Face action unit identification method and device based on joint learning and optical flow estimation
CN113496217A (en) * 2021-07-08 2021-10-12 河北工业大学 Method for identifying human face micro expression in video image sequence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yong Li et al.: "Self-supervised representation learning from videos for facial action unit detection", pages 10916-10925 *
He Qiang (贺强): "深度神经网络在视频行为识别中的应用研究" [Research on the Application of Deep Neural Networks in Video Action Recognition], vol. 2020, no. 1 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071809A (en) * 2023-03-22 2023-05-05 鹏城实验室 Face space-time representation generation method based on multi-class representation space-time interaction
CN116416667A (en) * 2023-04-25 2023-07-11 天津大学 Facial action unit detection method based on dynamic association information embedding
CN116416667B (en) * 2023-04-25 2023-10-24 天津大学 Facial action unit detection method based on dynamic association information embedding
CN118277607A (en) * 2024-04-12 2024-07-02 山东万高电子科技有限公司 Video monitoring data storage device and method

Also Published As

Publication number Publication date
CN114842542B (en) 2023-06-13

Similar Documents

Publication Publication Date Title
CN105095862B (en) A kind of human motion recognition method based on depth convolution condition random field
CN107492121B (en) Two-dimensional human body bone point positioning method of monocular depth video
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
CN114842542A (en) Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN104268594B (en) A kind of video accident detection method and device
CN111462191B (en) Non-local filter unsupervised optical flow estimation method based on deep learning
CN109949255A (en) Image rebuilding method and equipment
CN110111366A (en) A kind of end-to-end light stream estimation method based on multistage loss amount
CN112990077B (en) Face action unit identification method and device based on joint learning and optical flow estimation
CN111832592B (en) RGBD significance detection method and related device
CN112434608B (en) Human behavior identification method and system based on double-current combined network
CN104537684A (en) Real-time moving object extraction method in static scene
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN112580545B (en) Crowd counting method and system based on multi-scale self-adaptive context network
CN113361549A (en) Model updating method and related device
CN116189281B (en) End-to-end human behavior classification method and system based on space-time self-adaptive fusion
US10643092B2 (en) Segmenting irregular shapes in images using deep region growing with an image pyramid
CN117854135A (en) Micro expression recognition method based on quaternary supercomplex network
US10776923B2 (en) Segmenting irregular shapes in images using deep region growing
CN109615640B (en) Related filtering target tracking method and device
CN116665300A (en) Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network
WO2019243910A1 (en) Segmenting irregular shapes in images using deep region growing
CN109583584A (en) The CNN with full articulamentum can be made to receive the method and system of indefinite shape input
CN113449193A (en) Information recommendation method and device based on multi-classification images
CN113673411A (en) Attention mechanism-based lightweight shift graph convolution behavior identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant