CN114842542B - Facial action unit identification method and device based on self-adaptive attention and space-time correlation - Google Patents

Facial action unit identification method and device based on self-adaptive attention and space-time correlation

Info

Publication number
CN114842542B
Authority
CN
China
Prior art keywords
neural network
space
convolutional neural
network module
convolution
Prior art date
Legal status
Active
Application number
CN202210606040.5A
Other languages
Chinese (zh)
Other versions
CN114842542A (en)
Inventor
邵志文
周勇
陈浩
于清
Current Assignee
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT
Priority to CN202210606040.5A
Publication of CN114842542A
Application granted
Publication of CN114842542B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial action unit recognition method and device based on adaptive attention and spatio-temporal correlation. An end-to-end deep learning framework is adopted to learn action unit recognition, and the interdependencies and spatio-temporal correlations between facial action units are exploited so that the movement of facial muscles in two-dimensional images can be recognized effectively, thereby enabling the construction of a facial action unit recognition system.

Description

Facial action unit identification method and device based on self-adaptive attention and space-time correlation
Technical Field
The invention relates to a facial action unit recognition method and device based on adaptive attention and spatio-temporal correlation, and belongs to the field of computer vision.
Background
To study human facial expressions at a finer granularity, the Facial Action Coding System (FACS) was first proposed by the renowned American psychologist Ekman et al. in 1978 and substantially revised in 2002. According to the anatomical characteristics of the human face, the system divides facial activity into a number of facial action units that are mutually independent yet interconnected; facial expressions can be described through the motion characteristics of these action units and the main facial regions they control.
With the development of computer and information technology, deep learning has been widely applied. In facial action unit (AU) recognition, methods based on deep learning models have become mainstream. Current AU recognition research largely follows two routes: region learning and AU relation learning. If the associations between AUs are not considered, usually only a few sparse regions corresponding to the relevant facial muscles contribute to recognition, and other regions require little attention; finding and focusing on the regions that matter therefore leads to better AU recognition, and approaches addressing this problem are generally called region learning (RL). Furthermore, AUs are defined on the basis of facial muscle anatomy and describe the movement of one or several muscles; some muscles drive several AUs to appear simultaneously, so there is a certain degree of correlation between AUs. Clearly, such correlation information helps improve recognition performance, and the study of how to mine the correlations between AUs and exploit them to improve AU recognition is generally called AU relation learning.
Although automatic recognition of facial action units has made impressive progress, current region-learning-based AU detection methods, which use only AU labels to supervise a neural network in adaptively learning implicit attention, often capture irrelevant regions, since AUs have no obvious contours or textures and may change with identity and expression. In relation-reasoning-based AU detection methods, all AUs share parameters during reasoning, and the specificity and dynamics of each AU are ignored, so recognition accuracy remains limited and there is room for further improvement.
Disclosure of Invention
The purpose of the invention: to overcome the defects of the prior art, the invention provides a facial action unit recognition method and device based on adaptive attention and spatio-temporal correlation, which can adapt to samples from uncontrolled scenes with random and diverse variations in illumination, occlusion, pose, and the like, and is expected to achieve stronger robustness while maintaining high recognition accuracy.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme.
a facial action unit recognition method based on adaptive attention and space-time correlation, comprising the steps of:
S01: extracting original continuous image frames required by training from any video to form a training data set; for a video sequence, the number of original consecutive image frames may be 48 frames;
s02: preprocessing original continuous image frames to obtain an amplified image frame sequence; the method for preprocessing the original continuous image frames comprises random translation, random rotation, random scaling, random horizontal overturning or random cutting and the like, and the generalization capability of the model can be improved to a certain extent by preprocessing the images;
s03: constructing a convolutional neural network module I to extract layered multi-scale region features of each frame in the amplified image frame sequence;
s04: constructing a convolutional neural network module II by utilizing the layered multi-scale region features extracted in the step S03, carrying out global attention map merging on AU to extract AU features, and supervising the convolutional neural network module II through AU detection loss; AU denotes a face action unit;
s05: constructing a self-adaptive space-time diagram convolutional neural network module III by utilizing the AU characteristics extracted in the step S04, and reasoning the specific mode of each AU and the space-time correlation (such as co-occurrence and mutual exclusion) among different AUs so as to learn the space-time correlation characteristics of each AU;
S06: constructing a full connection module IV to realize AU identification by utilizing the space-time correlation characteristics of all AUs extracted in the step S05;
s07: training an integral AU recognition network model formed by a convolutional neural network module I, a convolutional neural network module II, a self-adaptive space-time diagram convolutional neural network module III and a full-connection module IV by using a training data set so as to update parameters of the integral AU recognition network model by using a gradient-based optimization method;
s08: and inputting the video sequence with any given frame number into the trained integral AU recognition network model, and predicting the occurrence probability of the AU.
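Purely for orientation, the following minimal PyTorch sketch shows how the four modules of steps S03-S06 could be composed into one overall model; the class names, argument names and tensor shapes in the comments are illustrative assumptions, not part of the disclosed implementation.

    import torch
    import torch.nn as nn

    class AURecognitionModel(nn.Module):
        """Illustrative composition of modules I-IV (all names are assumptions)."""
        def __init__(self, region_module, attention_module, st_graph_module, fc_module):
            super().__init__()
            self.region_module = region_module        # module I : hierarchical multi-scale region features
            self.attention_module = attention_module  # module II: adaptive attention + AU features
            self.st_graph_module = st_graph_module    # module III: adaptive spatio-temporal graph convolution
            self.fc_module = fc_module                # module IV: per-AU fully connected classifiers

        def forward(self, frames):                    # frames: (t, 3, l, l) augmented image sequence
            feats = self.region_module(frames)        # (t, C, l/4, l/4) multi-scale region features
            au_feats, att_maps = self.attention_module(feats)  # (t, m, d) AU features, (t, m, l/4, l/4) attention
            st_feats = self.st_graph_module(au_feats) # (t, m, d) spatio-temporal correlation features
            return self.fc_module(st_feats), att_maps # (t, m) AU occurrence probabilities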
Specifically, in step S03, since the AUs of different local blocks have different facial structures and texture information, each local block needs to be filtered independently, and different local blocks use different filtering weights. To obtain multi-scale region features, the convolutional neural network module I is adopted to learn the features of each local block at different scales. The convolutional neural network module I comprises two hierarchical multi-scale region layers with the same structure: the input of the convolutional neural network module I serves as the input of the first hierarchical multi-scale region layer, the output of the first hierarchical multi-scale region layer serves, after a max-pooling operation, as the input of the second hierarchical multi-scale region layer, and the output of the second hierarchical multi-scale region layer serves, after a max-pooling operation, as the output of the convolutional neural network module I. Each frame of the augmented image frame sequence is input into the convolutional neural network module I separately, and the output is the hierarchical multi-scale region feature of that frame.
Each hierarchical multi-scale region layer comprises a convolution layer I-I, a convolution layer I-II-I, a convolution layer I-II-II and a convolution layer I-II-III. In convolution layer I-I, the input is convolved once as a whole, and the convolution result is taken as the output of convolution layer I-I. The output of convolution layer I-I is taken as the input of convolution layer I-II-I; in convolution layer I-II-I the input is uniformly divided into local blocks on an 8×8 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-I. The output of convolution layer I-II-I is taken as the input of convolution layer I-II-II; in convolution layer I-II-II the input is uniformly divided into local blocks on a 4×4 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-II. The output of convolution layer I-II-II is taken as the input of convolution layer I-II-III; in convolution layer I-II-III the input is uniformly divided into local blocks on a 2×2 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-III. The outputs of convolution layers I-II-I, I-II-II and I-II-III are concatenated at the channel level (the number of channels after channel-level concatenation equals the number of channels output by convolution layer I-I) and summed with the output of convolution layer I-I, and the result is taken as the output of the hierarchical multi-scale region layer.
Specifically, in step S04, the convolutional neural network module II serves as an adaptive attention learning module. Its input is the hierarchical multi-scale region feature of an image; through the convolutional neural network module II, the global attention map of each AU on each image is predicted, AU features are extracted and AU prediction is performed, and the whole process is supervised by a predefined attention map and an AU detection loss. The details are as follows:
(41) Generating a predicted attention map for each AU: the hierarchical multi-scale region features are input into the adaptive attention learning module; with m AUs per frame, each AU corresponds to an independent branch, and four convolution layers are used to learn the global attention map $\hat{M}_{ij}$ of each AU and to extract AU features.
(42) Generating a true attention map for each AU: each AU has two centers, specified by two related facial feature points, and the true attention map is generated by a Gaussian distribution centered at the AU center points. For the $k$-th center ($k = 1, 2$) of the $j$-th AU of the $i$-th frame, with coordinates $(a^{(k)}_{ij}, b^{(k)}_{ij})$, the true attention weight at position $(a, b)$ of the attention map is
$$M^{(k)}_{ijab} = \exp\Big(-\frac{(a - a^{(k)}_{ij})^2 + (b - b^{(k)}_{ij})^2}{2\sigma_0^2}\Big),$$
where $\sigma_0$ denotes the standard deviation of the Gaussian. Then, at each position the larger of the two attention weights is selected, merging the predefined attention maps of the two AU centers:
$$M_{ijab} = \max\big(M^{(1)}_{ijab}, M^{(2)}_{ijab}\big),$$
and an attention regression loss is employed to encourage the predicted map $\hat{M}_{ij}$ to approach $M_{ij}$:
$$L_a = \frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\sum_{a=1}^{l/4}\sum_{b=1}^{l/4}\big(M_{ijab} - \hat{M}_{ijab}\big)^2,$$
where $L_a$ denotes the loss function of global attention map regression; $t$ denotes the length of the augmented image frame sequence; $m$ denotes the number of AUs in each frame; $l/4 \times l/4$ denotes the size of the global attention map; $M_{ijab}$ denotes the true attention weight of the $j$-th AU of the $i$-th frame at coordinate position $(a, b)$; and $\hat{M}_{ijab}$ denotes the predicted attention weight of the $j$-th AU of the $i$-th frame at coordinate position $(a, b)$.
(43) Extracting AU features and performing AU detection: the predicted global attention map $\hat{M}_{ij}$ is multiplied element-wise with the facial feature map obtained by the fourth convolution layer II-II, which enhances the features of regions with larger attention weights; the resulting output features are input into convolution layer II-III, and the AU features are extracted through a global average pooling layer. An AU detection cross-entropy loss is adopted to promote adaptive training of the attention maps: the learned AU features are fed into a one-dimensional fully connected layer, and the Sigmoid function $\delta(x) = 1/(1+e^{-x})$ then predicts the occurrence probability of each AU.
The weighted cross-entropy loss function adopted for AU recognition is
$$E = -\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_j\big[v_j\,p_{ij}\log \hat{p}_{ij} + (1-p_{ij})\log(1-\hat{p}_{ij})\big],$$
where $E$ denotes the weighted cross-entropy loss function of AU recognition; $p_{ij}$ denotes the true occurrence probability of the $j$-th AU of the $i$-th frame; $\hat{p}_{ij}$ denotes the predicted occurrence probability of the $j$-th AU of the $i$-th frame; $\omega_j$ denotes the weight of the $j$-th AU; and $v_j$ denotes the weight of the occurrence of the $j$-th AU (applied to the first, occurrence term of the cross entropy).
(44) Overall loss function of the convolutional neural network module I and the convolutional neural network module II as a whole: the overall loss function is obtained by combining the attention regression loss and the AU detection loss:
$$L_{AA} = E + \lambda_a L_a,$$
where $L_{AA}$ denotes the overall loss function of the convolutional neural network module I and the convolutional neural network module II as a whole, and $\lambda_a$ denotes the weight of the global attention map regression loss.
Specifically, in step S05, the adaptive spatio-temporal graph convolutional neural network module III is used to extract the specific pattern of each AU and the spatio-temporal correlations between different AUs. The adaptive spatio-temporal graph convolutional neural network module III comprises two spatio-temporal graph convolution layers with the same structure. The m AU features of dimension 12c on each of the t frames are spliced into an overall feature of size t×m×12c, which is used as the input of spatio-temporal graph convolution layer III-I; the output of spatio-temporal graph convolution layer III-I is used as the input of spatio-temporal graph convolution layer III-II, and the output features obtained from spatio-temporal graph convolution layer III-II contain the specific pattern of each AU and the spatio-temporal correlation information between different AUs.
The parameters of the two spatio-temporal graph convolution layers are learned independently. Each spatio-temporal graph convolution layer is formed by combining a spatial graph convolution unit with a gated recurrent unit, and is defined as
$$z_T = \sigma\big((I + N(R(UU^T)))\,C(h_{T-1}, x_T) \circledast QW_z\big),$$
$$r_T = \sigma\big((I + N(R(UU^T)))\,C(h_{T-1}, x_T) \circledast QW_r\big),$$
$$\tilde{h}_T = \tanh\big((I + N(R(UU^T)))\,C(r_T \odot h_{T-1}, x_T) \circledast QW_{\tilde{h}}\big),$$
$$h_T = z_T \odot h_{T-1} + (1 - z_T) \odot \tilde{h}_T,$$
where $x_T$ denotes the input at time $T$, $h_T$ denotes the final hidden state at time $T$ (the output at time $T$), and $\tilde{h}_T$ denotes the candidate (initial) hidden state at time $T$; $z_T$ determines how much of $h_{T-1}$ is retained at time $T$, and $r_T$ determines how $h_{T-1}$ is combined with $x_T$ when forming $\tilde{h}_T$. $I \in \mathbb{R}^{m \times m}$ is the identity matrix, with $m$ the number of AUs in each frame; $U$ is the adaptively learned matrix of the AU relation graph, and $U^T$ denotes the transpose of $U$; $Q$ is the adaptively learned dissociation matrix, with $c_e$ the set number of columns of $Q$; $W_z$, $W_r$ and $W_{\tilde{h}}$ are the adaptively learned weight matrices for $z_T$, $r_T$ and $\tilde{h}_T$, respectively; $c'$ denotes the dimension of an AU feature, with $c' = 12c$, where $c$ is a set parameter of the overall AU recognition network model. For the $j$-th AU, the $j$-th row of $Q$ dissociates from each of $W_z$, $W_r$ and $W_{\tilde{h}}$ a parameter of size $2c' \times c'$.
$R(X)$ denotes setting the negative elements of the two-dimensional matrix $X$ to zero: after the operation, the element $X_{ab}$ at index position $(a, b)$ is updated to $X_{ab} = \max(0, X_{ab})$.
$N(X)$ denotes normalizing the elements at each index position of the two-dimensional matrix $X$.
$Z = X \circledast Y$ denotes the operation between a two-dimensional matrix $X$ and a three-dimensional tensor $Y$ whose result is the two-dimensional matrix $Z$ with elements $Z_{ab} = \sum_k X_{ak} Y_{akb}$, where $X_{ak}$ is the element at index position $(a, k)$ of $X$ and $Y_{akb}$ is the element at index position $(a, k, b)$ of $Y$.
$\odot$ denotes element-wise multiplication, $C(\cdot,\cdot)$ denotes the concatenation operation, $\sigma(\cdot)$ denotes the Sigmoid function, and $\tanh(\cdot)$ denotes the hyperbolic tangent activation function.
Specifically, in step S06, the fully connected module IV performs AU recognition for each frame with one-dimensional fully connected layers. The spatio-temporal correlation features of all AUs in the t frames output by the spatio-temporal graph convolution layers, of size t×m×12c, are decomposed frame by frame and AU by AU into AU feature vectors of dimension 12c; the spatio-temporal correlation feature of the $j$-th AU of the $i$-th frame is input into the fully connected module IV, which predicts the final occurrence probability of the $j$-th AU of the $i$-th frame by passing this feature vector through the $j$-th one-dimensional fully connected layer followed by a Sigmoid activation function, and the fully connected module IV uses the same fully connected layer for the same AU in different frames. Because the obtained features carry the spatio-temporal correlation information between AUs, final AU recognition is facilitated. The loss function adopted for AU recognition is
$$E_{final} = -\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_j\big[v_j\,p_{ij}\log \tilde{p}_{ij} + (1-p_{ij})\log(1-\tilde{p}_{ij})\big],$$
where $E_{final}$ denotes the loss function of AU recognition; $t$ denotes the length of the augmented image frame sequence; $p_{ij}$ denotes the true occurrence probability of the $j$-th AU of the $i$-th frame; $\tilde{p}_{ij}$ denotes the final predicted occurrence probability of the $j$-th AU of the $i$-th frame; $\omega_j$ denotes the weight of the $j$-th AU; and $v_j$ denotes the weight of the occurrence of the $j$-th AU (applied to the first, occurrence term of the cross entropy).
Specifically, in step S07, the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatio-temporal graph convolutional neural network module III and the fully connected module IV is trained in an end-to-end manner: the convolutional neural network part is trained first so that accurate AU features can be extracted as the input of the graph convolutional neural network, and the graph convolutional neural network is then trained to learn the specific pattern and spatio-temporal correlation features of the AUs, with the spatio-temporal correlations between AUs used to promote facial action unit recognition.
A facial action unit recognition device based on adaptive attention and spatio-temporal correlation comprises an image frame sequence acquisition unit, a hierarchical multi-scale region learning unit, an adaptive attention regression and feature extraction unit, an adaptive spatio-temporal graph convolution learning unit, an AU recognition unit and a parameter optimization unit.
The image frame sequence acquisition unit is used for extracting the original continuous images required for training from any video data to form a training data set, and for preprocessing the original continuous image frames to obtain an augmented image frame sequence.
The hierarchical multi-scale region learning unit comprises the convolutional neural network module I, which uses hierarchical multi-scale region layers to learn the features of each local block of each input frame at different scales, filtering each local block independently.
The adaptive attention regression and feature extraction unit comprises the convolutional neural network module II, which generates the global attention map of each image and performs adaptive regression under the supervision of a predefined attention map and an AU detection loss while accurately extracting AU features.
The adaptive spatio-temporal graph convolution learning unit comprises the adaptive spatio-temporal graph convolutional neural network module III, which learns the specific pattern of each AU and the spatio-temporal correlations between different AUs and extracts the spatio-temporal correlation features of each AU.
The AU recognition unit comprises the fully connected module IV, which uses the spatio-temporal correlation features of each AU to perform AU recognition effectively.
The parameter optimization unit computes the parameters and loss function values of the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatio-temporal graph convolutional neural network module III and the fully connected module IV, and updates the parameters with a gradient-based optimization method.
Each frame of the input image frame sequence is passed through the convolutional neural network module I and the convolutional neural network module II separately to obtain the m AU features of that frame; only at the adaptive spatio-temporal graph convolutional neural network module III are the features of all frames spliced together as input. That is, the convolutional neural network module I and the convolutional neural network module II process single images without involving time and can treat the t frames separately, the adaptive spatio-temporal graph convolutional neural network module III processes the t frames simultaneously, and the fully connected module IV processes the spatio-temporal correlation features of the m AUs of a single image without involving time.
Beneficial effects: compared with the prior art, the facial action unit recognition method and device based on adaptive attention and spatio-temporal correlation provided by the invention have the following advantages. (1) In the adaptive attention regression neural network, adaptive attention regression is promoted by the AU detection loss, so that the local features associated with each AU can be captured accurately, and the attention distribution is optimized with a position prior, which improves robustness in uncontrolled scenes. (2) In the adaptive spatio-temporal graph convolutional neural network, each spatio-temporal graph convolution layer fully learns the AU relations in the spatial domain of a single frame, and performs a general convolution operation in the temporal domain to mine the correlations between frames, promoting recognition of the facial AUs of each frame; the network finally outputs the probability of each AU for each frame. (3) Because adaptive learning is used instead of AU relations predefined from prior knowledge, the invention can adapt to samples from uncontrolled scenes with random and diverse variations in illumination, occlusion, pose, and the like, and is expected to achieve stronger robustness while maintaining high recognition accuracy.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the structure of a hierarchical multi-scale region layer;
FIG. 3 is a schematic diagram of the structure of the convolutional neural network module II;
FIG. 4 is a schematic diagram of the structure of a spatio-temporal graph convolution layer;
FIG. 5 is a schematic diagram of the structure of the overall adaptive attention regression neural network and adaptive spatio-temporal graph convolutional neural network model.
Detailed Description
The invention is described in detail below with reference to the drawings and the specific embodiments.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The invention provides a facial action unit recognition method and device based on adaptive attention and spatio-temporal correlation. In the adaptive attention regression neural network, the constraints of the predefined attention maps and the AU detection loss make it possible to capture simultaneously the strongly correlated region of each AU specified by facial feature points and the weakly correlated regions distributed globally; by learning the correlation distribution of each region before feature extraction, the useful information of each AU can be extracted accurately. In the adaptive spatio-temporal graph convolutional neural network, each spatio-temporal graph convolution layer fully learns the AU relations in the spatial domain of a single frame and performs a general convolution operation in the temporal domain to mine the correlations between frames, promoting recognition of the facial AUs of each frame; the network finally outputs the probability of each AU for each frame. Because adaptive learning is used, the method can adapt to samples from uncontrolled scenes with random and diverse variations in illumination, occlusion, pose, and the like, and is expected to achieve stronger robustness while maintaining high recognition accuracy.
FIG. 1 is a flowchart of the facial action unit recognition method based on adaptive attention and spatio-temporal correlation; each specific step is explained below.
S01: the original continuous image frames required for training are extracted from any video to form a training data set, and the length of the extracted continuous image frame sequence is 48.
For a video sequence, a frame sequence length of 48 is selected to avoid the situations where too few frames make it difficult to learn the correct spatio-temporal correlations of the action units, or where too many frames make the learning time too long.
S02: the original continuous image frames are preprocessed to obtain an augmented image frame sequence.
The preprocessing of the original images includes random translation, random rotation, random scaling, random horizontal flipping, random cropping, and the like; preprocessing the images improves the generalization ability of the model to a certain extent.
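For illustration only, a preprocessing pipeline of this kind could be written with torchvision as sketched below; the concrete parameter values (angles, sizes, probabilities) are assumptions rather than values prescribed by the invention, and in practice the same randomly drawn transform should be applied to every frame of one sequence.

    from torchvision import transforms

    # Illustrative per-frame augmentation (parameter values are assumptions).
    augment = transforms.Compose([
        transforms.Resize(200),                                                    # assumed working size
        transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),  # translation/rotation/scaling
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomCrop(size=176),                                           # assumed crop size
        transforms.ToTensor(),
    ])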
S03: the convolutional neural network module I is constructed to extract the hierarchical multi-scale region features of the augmented image frame sequence.
Since the facial action units of different local blocks have different facial structures and texture information, each local block needs to be filtered independently, and different local blocks use different filtering weights. To obtain multi-scale region features, the convolutional neural network module I is adopted to learn the features of each local block at different scales. The convolutional neural network module I comprises two hierarchical multi-scale region layers with the same structure: the input of the convolutional neural network module I serves as the input of the first hierarchical multi-scale region layer, the output of the first hierarchical multi-scale region layer serves, after a max-pooling operation, as the input of the second hierarchical multi-scale region layer, and the output of the second hierarchical multi-scale region layer serves, after a max-pooling operation, as the output of the convolutional neural network module I. Each frame image of the augmented image frame sequence is input into the convolutional neural network module I, and the output of the convolutional neural network module I is the hierarchical multi-scale region features of the augmented image frame sequence.
As shown in FIG. 2, each hierarchical multi-scale region layer comprises a convolution layer I-I, a convolution layer I-II-I, a convolution layer I-II-II and a convolution layer I-II-III. In convolution layer I-I, the input is convolved once as a whole, and the convolution result is taken as the output of convolution layer I-I. The output of convolution layer I-I is taken as the input of convolution layer I-II-I; in convolution layer I-II-I the input is uniformly divided into local blocks on an 8×8 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-I. The output of convolution layer I-II-I is taken as the input of convolution layer I-II-II; in convolution layer I-II-II the input is uniformly divided into local blocks on a 4×4 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-II. The output of convolution layer I-II-II is taken as the input of convolution layer I-II-III; in convolution layer I-II-III the input is uniformly divided into local blocks on a 2×2 grid, each local block is convolved separately, and all convolution results are stitched together to form the output of convolution layer I-II-III. The outputs of convolution layers I-II-I, I-II-II and I-II-III are concatenated at the channel level (the number of channels after channel-level concatenation equals the number of channels output by convolution layer I-I) and summed with the output of convolution layer I-I, and the result is taken as the output of the hierarchical multi-scale region layer.
A max-pooling layer follows each hierarchical multi-scale region layer in the convolutional neural network module I; the pooling kernel size of each max-pooling layer is 2×2 and the stride is 2. The numbers of channels of convolution layers I-I, I-II-I, I-II-II and I-II-III in the first hierarchical multi-scale region layer are 32, 16, 8 and 8, respectively, and the numbers of filters of convolution layers I-I, I-II-I, I-II-II and I-II-III in the first hierarchical multi-scale region layer are 32×1, 16×8×8, 8×4×4 and 8×2×2, respectively; the numbers of channels of convolution layers I-I, I-II-I, I-II-II and I-II-III in the second hierarchical multi-scale region layer are 64, 32, 16 and 16, respectively, and the numbers of filters are 64×1, 32×8×8, 16×4×4 and 16×2×2, respectively. The filter size in every convolution layer is 3×3 and the stride is 1.
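The following is a minimal PyTorch sketch of one hierarchical multi-scale region layer as described above, assuming the first-layer channel sizes (32/16/8/8) and an input whose height and width are divisible by 8; activation and normalization layers, which the text does not detail, are omitted, and all class names are illustrative.

    import torch
    import torch.nn as nn

    class PatchwiseConv(nn.Module):
        """Convolve each cell of a g x g grid with its own 3x3 filters (independent weights per local block)."""
        def __init__(self, in_ch, out_ch, grid):
            super().__init__()
            self.grid = grid
            self.convs = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 3, padding=1) for _ in range(grid * grid)])

        def forward(self, x):
            n, _, h, w = x.shape
            ph, pw = h // self.grid, w // self.grid
            rows = []
            for r in range(self.grid):
                cols = []
                for c in range(self.grid):
                    patch = x[:, :, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
                    cols.append(self.convs[r * self.grid + c](patch))
                rows.append(torch.cat(cols, dim=3))
            return torch.cat(rows, dim=2)          # stitched back to (n, out_ch, h, w)

    class HierarchicalMultiScaleRegionLayer(nn.Module):
        """Sketch of one hierarchical multi-scale region layer (first-layer channel sizes assumed 32/16/8/8)."""
        def __init__(self, in_ch=3):
            super().__init__()
            self.whole = nn.Conv2d(in_ch, 32, 3, padding=1)   # convolution layer I-I
            self.scale8 = PatchwiseConv(32, 16, grid=8)       # I-II-I : 8x8 local blocks
            self.scale4 = PatchwiseConv(16, 8, grid=4)        # I-II-II: 4x4 local blocks
            self.scale2 = PatchwiseConv(8, 8, grid=2)         # I-II-III: 2x2 local blocks

        def forward(self, x):
            y = self.whole(x)
            s8 = self.scale8(y)
            s4 = self.scale4(s8)
            s2 = self.scale2(s4)
            # channel-level concatenation (16 + 8 + 8 = 32 channels) summed with the whole-image branch
            return y + torch.cat([s8, s4, s2], dim=1)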
S04: the convolutional neural network module II is constructed using the hierarchical multi-scale region features extracted in step S03 to learn a global attention map for each AU and to extract AU features, and the convolutional neural network module II is supervised through an AU detection loss; AU denotes a facial action unit.
As shown in FIG. 3, the convolutional neural network module II is a multi-layer convolutional network containing m branches, each branch corresponding to one AU, and it performs adaptive global attention map regression and AU prediction simultaneously. The filter size of each convolution layer is 3×3 and the stride is 1.
(41) Generating a predicted attention map for each AU: the hierarchical multi-scale region features are the input of the convolutional neural network module II, which comprises m branches, each branch corresponding to one AU; four convolution layers are used to learn the global attention map $\hat{M}_{ij}$ of each AU and to extract AU features.
(42) Generating a true attention map for each AU: each AU has two centers, specified by two related facial feature points, and the true attention map is generated by a Gaussian distribution centered at the AU center points. For the $k$-th center ($k = 1, 2$) of the $j$-th AU of the $i$-th frame, with coordinates $(a^{(k)}_{ij}, b^{(k)}_{ij})$, the true attention weight at position $(a, b)$ of the attention map is
$$M^{(k)}_{ijab} = \exp\Big(-\frac{(a - a^{(k)}_{ij})^2 + (b - b^{(k)}_{ij})^2}{2\sigma_0^2}\Big),$$
where $\sigma_0$ denotes the standard deviation of the Gaussian. Then, at each position the larger of the two attention weights is selected, merging the predefined attention maps of the two AU centers:
$$M_{ijab} = \max\big(M^{(1)}_{ijab}, M^{(2)}_{ijab}\big),$$
and an attention regression loss is employed to encourage the predicted map $\hat{M}_{ij}$ to approach $M_{ij}$:
$$L_a = \frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\sum_{a=1}^{l/4}\sum_{b=1}^{l/4}\big(M_{ijab} - \hat{M}_{ijab}\big)^2,$$
where $L_a$ denotes the loss function of global attention map regression; $t$ denotes the length of the augmented image frame sequence; $m$ denotes the number of AUs in each frame; $l/4 \times l/4$ denotes the size of the global attention map; $M_{ijab}$ denotes the true attention weight of the $j$-th AU of the $i$-th frame at coordinate position $(a, b)$; and $\hat{M}_{ijab}$ denotes the predicted attention weight of the $j$-th AU of the $i$-th frame at coordinate position $(a, b)$.
(43) Extracting AU features and performing AU detection: the predicted global attention map $\hat{M}_{ij}$ is multiplied element-wise with the facial feature map obtained by the fourth convolution layer II-II, which enhances the features of regions with larger attention weights; the resulting output features are input into convolution layer II-III, and the AU features are extracted through a global average pooling layer. An AU detection cross-entropy loss is adopted to promote adaptive training of the attention maps: the learned AU features are fed into a one-dimensional fully connected layer, and the Sigmoid function $\delta(x) = 1/(1+e^{-x})$ then predicts the occurrence probability of each AU.
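As a rough illustration only (the exact numbers of layers and channel widths of module II are not fully specified above), one per-AU branch of this kind could look as follows in PyTorch; all sizes and names are assumptions.

    import torch
    import torch.nn as nn

    class AUAttentionBranch(nn.Module):
        """Illustrative single-AU branch of module II: predicts a global attention map,
        re-weights the facial feature map with it, and extracts one AU feature vector."""
        def __init__(self, in_ch=64, mid_ch=64, feat_dim=64):
            super().__init__()
            self.feature_convs = nn.Sequential(                    # roughly convolution layers II-I / II-II
                nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU())
            self.attention_head = nn.Sequential(                   # predicts the global attention map
                nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(mid_ch, 1, 3, padding=1), nn.Sigmoid())
            self.refine = nn.Conv2d(mid_ch, feat_dim, 3, padding=1)  # roughly convolution layer II-III
            self.classifier = nn.Linear(feat_dim, 1)

        def forward(self, region_feats):                # region_feats: (n, in_ch, l/4, l/4)
            fmap = self.feature_convs(region_feats)
            att = self.attention_head(fmap)             # (n, 1, l/4, l/4) predicted attention map
            weighted = fmap * att                       # element-wise enhancement of attended regions
            au_feat = self.refine(weighted).mean(dim=(2, 3))             # global average pooling -> (n, feat_dim)
            prob = torch.sigmoid(self.classifier(au_feat)).squeeze(-1)   # AU occurrence probability
            return au_feat, att.squeeze(1), prob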
The weighted cross-entropy loss function adopted for AU recognition is
$$E = -\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_j\big[v_j\,p_{ij}\log \hat{p}_{ij} + (1-p_{ij})\log(1-\hat{p}_{ij})\big],$$
where $E$ denotes the weighted cross-entropy loss function of AU recognition; $p_{ij}$ denotes the true occurrence probability of the $j$-th AU of the $i$-th frame; $\hat{p}_{ij}$ denotes the predicted occurrence probability of the $j$-th AU of the $i$-th frame; $\omega_j$ denotes the weight of the $j$-th AU; and $v_j$ denotes the weight of the occurrence of the $j$-th AU (applied to the first, occurrence term of the cross entropy).
Since the occurrence rates of different AUs in the training data set differ significantly, and the occurrence rate of most AUs is much lower than their non-occurrence rate, $\omega_j$ and $v_j$ are defined to suppress these two data imbalance problems: they are determined by the occurrence rate of the $j$-th AU in the training data set (the fraction of frames in which the $j$-th AU occurs), so that AUs with lower occurrence rates receive larger weights.
(44) Overall loss function of the convolutional neural network module I and the convolutional neural network module II as a whole: the overall loss function is obtained by combining the attention regression loss and the AU detection loss:
$$L_{AA} = E + \lambda_a L_a,$$
where $L_{AA}$ denotes the overall loss function of the convolutional neural network module I and the convolutional neural network module II as a whole, and $\lambda_a$ denotes the weight of the global attention map regression loss.
S05: the adaptive spatio-temporal graph convolutional neural network module III is constructed using the AU features extracted in step S04 to reason about the specific pattern of each AU and the spatio-temporal correlations (such as co-occurrence and mutual exclusion) between different AUs, so as to learn the spatio-temporal correlation features of each AU.
The adaptive spatio-temporal graph convolutional neural network module III consists of two spatio-temporal graph convolution layers with the same structure. The m AU features of dimension 12c of each frame are spliced into an overall feature of size t×m×12c, which is used as the input of spatio-temporal graph convolution layer III-I; the output of spatio-temporal graph convolution layer III-I is used as the input of spatio-temporal graph convolution layer III-II, and the output features obtained from spatio-temporal graph convolution layer III-II contain the specific pattern of each AU and the spatio-temporal correlations between different AUs.
The parameters of the two spatio-temporal graph convolution layers are learned independently. Each spatio-temporal graph convolution layer is formed by combining a spatial graph convolution unit with a gated recurrent unit; the structure of each spatio-temporal graph convolution layer is shown in FIG. 4. The details are as follows:
(51) Inferring the specific pattern of each AU:
A typical graph convolution is computed in the spectral domain and is well approximated by a first-order Chebyshev polynomial expansion:
$$F_{out} = \hat{D}^{-\frac{1}{2}}(A+I)\hat{D}^{-\frac{1}{2}}\,F_{in}\,\Theta^{(0)},$$
where $F_{in}$ and $F_{out}$ are the input and output of the graph convolution layer, respectively; $I \in \mathbb{R}^{m \times m}$ is the identity matrix; $A \in \mathbb{R}^{m \times m}$ is a symmetric weighted adjacency matrix indicating the strength of the edges between nodes; $\hat{D}$ is the degree matrix with $\hat{D}_{aa} = \sum_b (A+I)_{ab}$; and $\Theta^{(0)}$ is a parameter matrix. The graph convolution essentially transforms the input $F_{in}$ into the output $F_{out}$ by learning the matrices $A$ and $\Theta^{(0)}$, which are shared by all AUs.
Although the above formula can learn the interrelationships of the AUs, it ignores the specific pattern of each AU. For this reason, the associations between AUs are still inferred with the shared adjacency, while an independent parameter $\Theta^{(1)}_j$ is adopted for each AU, giving the convolution operation
$$F_{out} = \hat{D}^{-\frac{1}{2}}(A+I)\hat{D}^{-\frac{1}{2}}\,F_{in} \circledast \Theta^{(1)},$$
where $Z = X \circledast Y$ denotes the operation between a two-dimensional matrix $X$ and a three-dimensional tensor $Y$ whose result is the two-dimensional matrix $Z$ with elements $Z_{ab} = \sum_k X_{ak} Y_{akb}$, $X_{ak}$ being the element at index position $(a, k)$ of $X$ and $Y_{akb}$ the element at index position $(a, k, b)$ of $Y$.
To reduce the number of parameters of the $\Theta$ tensor, a feature dissociation matrix $Q \in \mathbb{R}^{m \times c_e}$ and a shared parameter matrix $W$ are introduced, and the graph convolution is re-expressed as
$$F_{out} = \hat{D}^{-\frac{1}{2}}(A+I)\hat{D}^{-\frac{1}{2}}\,F_{in} \circledast QW,$$
with $\Theta^{(1)} = QW$. The intermediate dimension $c_e$ of this new formulation is typically smaller than $m$; for the $j$-th AU, its parameters $\Theta^{(1)}_j$ can be dissociated from the shared parameter matrix $W$ by the feature dissociation matrix $Q$, and the use of the matrices $Q$ and $W$ facilitates reasoning about the specific pattern of each AU.
(52) Inferring the interrelationships between AUs in the spatial domain:
To reduce the amount of computation, a matrix $U$ is learned directly instead of first learning the adjacency matrix $A$ and then computing the normalized adjacency matrix $\hat{A}$, i.e.
$$\hat{A} = N(R(UU^T)),$$
where $R(\cdot)$ is the Rectified Linear Unit (ReLU) activation function and $N(\cdot)$ is a normalization function; in this way, dependencies between AUs, such as co-occurrence and mutual exclusion, can be encoded adaptively. The graph convolution is then re-expressed as
$$F_{out} = (I + N(R(UU^T)))\,F_{in} \circledast QW.$$
The matrices $U$ and $W$ are parameter matrices shared by all AUs, and in this way the dependencies between AUs in the spatial domain are inferred.
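A compact PyTorch sketch of the adaptive AU graph convolution described in (51) and (52) is given below; the initialization scale, the exact normalization N(·) (row normalization is assumed) and the tensor shapes are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptiveAUGraphConv(nn.Module):
        """Sketch of F_out = (I + N(R(U U^T))) F_in (*) QW with a learned relation matrix U
        and per-AU parameters dissociated from shared Q and W."""
        def __init__(self, m, c_in, c_out, c_e, c_u=8):
            super().__init__()
            self.U = nn.Parameter(torch.randn(m, c_u) * 0.01)             # adaptive AU relation embedding
            self.Q = nn.Parameter(torch.randn(m, c_e) * 0.01)             # feature dissociation matrix
            self.W = nn.Parameter(torch.randn(c_e, c_in * c_out) * 0.01)  # shared parameter matrix
            self.c_in, self.c_out = c_in, c_out

        def adjacency(self):
            A = F.relu(self.U @ self.U.T)                                 # R(.) : keep non-negative relations
            A = A / A.sum(dim=1, keepdim=True).clamp_min(1e-8)            # N(.) : row normalisation (assumed)
            return torch.eye(self.U.shape[0], device=A.device) + A

        def forward(self, x):                                             # x: (batch, m, c_in) AU features
            x = torch.einsum('ab,nbc->nac', self.adjacency(), x)          # mix features over related AUs
            theta = (self.Q @ self.W).view(-1, self.c_in, self.c_out)     # per-AU parameters Theta_j = Q_j W
            return torch.einsum('nak,akb->nab', x, theta)                 # per-AU contraction (the (*) operation)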
(53) Inferring the relationships between frames in the temporal domain:
The gated recurrent unit (GRU) is a popular approach to modeling temporal dynamics. A GRU cell consists of an update gate $z$ and a reset gate $r$, and its gating mechanism at time step $T$ is defined as follows:
$$z_T = \sigma\big(W_z\,C(h_{T-1}, x_T)\big),$$
$$r_T = \sigma\big(W_r\,C(h_{T-1}, x_T)\big),$$
$$\tilde{h}_T = \tanh\big(W_{\tilde{h}}\,C(r_T \odot h_{T-1}, x_T)\big),$$
$$h_T = z_T \odot h_{T-1} + (1 - z_T) \odot \tilde{h}_T,$$
where $h_T$ denotes the final hidden state at time $T$ (the output at time $T$), $\tilde{h}_T$ denotes the candidate (initial) hidden state at time $T$, $z_T$ determines how much of $h_{T-1}$ is retained at time $T$, $r_T$ determines how $h_{T-1}$ is combined with $x_T$ when forming $\tilde{h}_T$, $\odot$ denotes element-wise multiplication, $C(\cdot,\cdot)$ denotes the concatenation operation, $\sigma(\cdot)$ denotes the Sigmoid function, and $\tanh(\cdot)$ denotes the hyperbolic tangent activation function.
Combining the above steps gives the final definition of each spatio-temporal graph convolution layer:
$$z_T = \sigma\big((I + N(R(UU^T)))\,C(h_{T-1}, x_T) \circledast QW_z\big),$$
$$r_T = \sigma\big((I + N(R(UU^T)))\,C(h_{T-1}, x_T) \circledast QW_r\big),$$
$$\tilde{h}_T = \tanh\big((I + N(R(UU^T)))\,C(r_T \odot h_{T-1}, x_T) \circledast QW_{\tilde{h}}\big),$$
$$h_T = z_T \odot h_{T-1} + (1 - z_T) \odot \tilde{h}_T,$$
where $x_T$ denotes the input at time $T$, $h_T$ denotes the final hidden state at time $T$ (the output at time $T$), and $W_z$, $W_r$ and $W_{\tilde{h}}$ denote the adaptively learned weight matrices for $z_T$, $r_T$ and $\tilde{h}_T$, respectively. The input of each layer is the matrix of spliced AU features of size $t \times m \times 12c$, and the output matrix is obtained by splicing the outputs $h_1, h_2, \ldots, h_{t'}$ of the graph convolution layer along the dimension $t$, where $t'$ is the total number of frames input to the spatio-temporal graph convolution layer and $t' = t = 48$.
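The following sketch combines the graph convolution with the GRU-style recurrence to form one spatio-temporal graph convolution layer consistent with the gating equations above; the batch dimension is omitted, and the normalization and initialization choices are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatioTemporalGraphConvLayer(nn.Module):
        """GRU-style recurrence whose gate transforms are the adaptive AU graph convolution."""
        def __init__(self, m, c, c_e, c_u=8):
            super().__init__()
            self.U = nn.Parameter(torch.randn(m, c_u) * 0.01)             # adaptive AU relation embedding
            self.Q = nn.Parameter(torch.randn(m, c_e) * 0.01)             # feature dissociation matrix
            # shared gate parameters; row Q_j dissociates a (2c x c) block for AU j from each of them
            self.Wz = nn.Parameter(torch.randn(c_e, 2 * c * c) * 0.01)
            self.Wr = nn.Parameter(torch.randn(c_e, 2 * c * c) * 0.01)
            self.Wh = nn.Parameter(torch.randn(c_e, 2 * c * c) * 0.01)
            self.c = c

        def graph_conv(self, x, W):                        # x: (m, 2c) concatenated state/input per AU
            A = F.relu(self.U @ self.U.T)
            A = A / A.sum(dim=1, keepdim=True).clamp_min(1e-8)            # assumed row normalisation
            x = (torch.eye(x.shape[0], device=x.device) + A) @ x          # mix over related AUs
            theta = (self.Q @ W).view(-1, 2 * self.c, self.c)             # per-AU dissociated parameters
            return torch.einsum('ak,akb->ab', x, theta)                   # (m, c)

        def forward(self, x_seq):                          # x_seq: (t, m, c) AU features per frame
            h = torch.zeros_like(x_seq[0])
            outputs = []
            for x in x_seq:                                # iterate over time steps
                cat = torch.cat([h, x], dim=1)             # C(h_{T-1}, x_T): (m, 2c)
                z = torch.sigmoid(self.graph_conv(cat, self.Wz))          # update gate
                r = torch.sigmoid(self.graph_conv(cat, self.Wr))          # reset gate
                h_tilde = torch.tanh(self.graph_conv(torch.cat([r * h, x], dim=1), self.Wh))
                h = z * h + (1 - z) * h_tilde              # z retains part of h_{T-1}, as in the text
                outputs.append(h)
            return torch.stack(outputs)                    # (t, m, c) spatio-temporal correlation features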
S06: the fully connected module IV is constructed using the spatio-temporal correlation features of all AUs extracted in step S05 to perform AU recognition.
The fully connected module IV consists of one-dimensional fully connected layers, each followed by a Sigmoid activation function. The overall feature of size t×m×12c obtained in step S05 is decomposed frame by frame and AU by AU to obtain the feature vector of dimension 12c corresponding to each AU on each frame; the feature vector of the $j$-th AU of the $i$-th frame is input into the $j$-th fully connected layer, which, followed by a Sigmoid activation function, predicts the occurrence probability of that AU, and the fully connected module IV uses the same fully connected layer for the same AU in different frames. Because the learned features carry the spatio-temporal correlation information of the AUs, final AU detection is facilitated, and the following loss function is adopted to guide the learning of the spatio-temporal graph convolution parameter matrices:
$$E_{final} = -\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_j\big[v_j\,p_{ij}\log \tilde{p}_{ij} + (1-p_{ij})\log(1-\tilde{p}_{ij})\big],$$
where $E_{final}$ denotes the loss function of AU recognition; $t$ denotes the length of the augmented image frame sequence; $p_{ij}$ denotes the true occurrence probability of the $j$-th AU of the $i$-th frame; $\tilde{p}_{ij}$ denotes the final predicted occurrence probability of the $j$-th AU of the $i$-th frame; $\omega_j$ denotes the weight of the $j$-th AU; and $v_j$ denotes the weight of the occurrence of the $j$-th AU (applied to the first, occurrence term of the cross entropy).
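A minimal sketch of the fully connected module IV, with one independent linear head per AU shared across all frames, could look as follows; the feature dimension is left as a parameter and the names are illustrative.

    import torch
    import torch.nn as nn

    class PerAUClassifier(nn.Module):
        """Module IV sketch: one independent 1-D fully connected layer per AU, shared across frames,
        each followed by a Sigmoid to give the final occurrence probability."""
        def __init__(self, m, feat_dim):
            super().__init__()
            self.heads = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(m)])

        def forward(self, st_feats):                       # st_feats: (t, m, feat_dim)
            probs = [torch.sigmoid(self.heads[j](st_feats[:, j, :])) for j in range(len(self.heads))]
            return torch.cat(probs, dim=1)                 # (t, m) final AU occurrence probabilities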
S07: the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatio-temporal graph convolutional neural network module III and the fully connected module IV is trained on the training data set, and the parameters of the overall AU recognition network model are updated with a gradient-based optimization method.
The whole convolutional neural network and graph convolutional neural network model (FIG. 5) is trained in an end-to-end manner: the convolutional neural network part is trained first to extract AU features that serve as the input of the graph convolutional neural network, and the graph convolutional neural network is then trained to learn the specific pattern and spatio-temporal correlations of the AUs, with the spatio-temporal correlations between AUs used to promote facial action unit recognition.
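The end-to-end, gradient-based parameter update of step S07 can be sketched as below; the toy stand-in model, the plain BCE criterion and all hyper-parameter values are assumptions used only to keep the example self-contained and runnable.

    import torch
    import torch.nn as nn

    t, m, l = 48, 12, 176                                   # assumed sequence length, AU count, image size
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * l * l, m), nn.Sigmoid())   # toy stand-in for modules I-IV
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)                    # gradient-based optimisation
    bce = nn.BCELoss()

    # dummy loader yielding (frames, au_labels) with shapes (t, 3, l, l) and (t, m)
    loader = [(torch.rand(t, 3, l, l), torch.randint(0, 2, (t, m)).float()) for _ in range(2)]

    for epoch in range(2):                                  # a real training schedule would be much longer
        for frames, labels in loader:
            probs = model(frames)                           # (t, m) predicted AU occurrence probabilities
            loss = bce(probs, labels)                       # stand-in for L_AA plus the module-IV loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()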
S08: a video sequence with any given number of frames is input into the trained overall AU recognition network model, and the occurrence probability of each AU is predicted.
At prediction time, the recognition results of the facial action units are output directly.
The method can be carried out entirely by a computer without manual assistance, which means that batch automatic processing can be realized, greatly improving processing efficiency and reducing labor cost.
A facial action unit recognition device based on adaptive attention and spatio-temporal correlation comprises an image frame sequence acquisition unit, a hierarchical multi-scale region learning unit, an adaptive attention regression and feature extraction unit, an adaptive spatio-temporal graph convolution learning unit, an AU recognition unit and a parameter optimization unit.
The image frame sequence acquisition unit is used for extracting the original continuous images required for training from any video data to form a training data set, and for preprocessing the original continuous image frames to obtain an augmented image frame sequence.
The hierarchical multi-scale region learning unit comprises the convolutional neural network module I, which uses hierarchical multi-scale region layers to learn the features of each local block of each input frame at different scales, filtering each local block independently.
The adaptive attention regression and feature extraction unit comprises the convolutional neural network module II, which generates the global attention map of each image and performs adaptive regression under the supervision of a predefined attention map and an AU detection loss while accurately extracting AU features.
The adaptive spatio-temporal graph convolution learning unit comprises the adaptive spatio-temporal graph convolutional neural network module III, which learns the specific pattern of each AU and the spatio-temporal correlations between different AUs and extracts the spatio-temporal correlation features of each AU.
The AU recognition unit comprises the fully connected module IV, which uses the spatio-temporal correlation features of each AU to perform AU recognition effectively.
The parameter optimization unit computes the parameters and loss function values of the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the adaptive spatio-temporal graph convolutional neural network module III and the fully connected module IV, and updates the parameters with a gradient-based optimization method.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be appreciated by persons skilled in the art that the above embodiments are not intended to limit the invention in any way, and that all technical solutions obtained by means of equivalent substitutions or equivalent transformations fall within the scope of the invention.

Claims (2)

1. A facial action unit identification method based on self-adaptive attention and space-time association is characterized in that: the method comprises the following steps:
s01: extracting original continuous image frames required by training from the video to form a training data set;
S02: preprocessing original continuous image frames to obtain an amplified image frame sequence;
S03: constructing a convolutional neural network module I to extract hierarchical multi-scale region features of each frame in the amplified image frame sequence;
the features of each local block at different scales are learned by the convolutional neural network module I, which comprises two hierarchical multi-scale region layers of identical structure: the input of the convolutional neural network module I serves as the input of the first hierarchical multi-scale region layer, the output of the first hierarchical multi-scale region layer, after a max-pooling operation, serves as the input of the second hierarchical multi-scale region layer, and the output of the second hierarchical multi-scale region layer, after a max-pooling operation, serves as the output of the convolutional neural network module I; the output of the convolutional neural network module I is the hierarchical multi-scale region features of the amplified image frame sequence;
each hierarchical multi-scale region layer comprises a convolution layer I-I, a convolution layer I-II-I, a convolution layer I-II-II and a convolution layer I-II-III; in the convolution layer I-I, the input is convolved once as a whole, and the convolution result serves as the output of the convolution layer I-I; the output of the convolution layer I-I serves as the input of the convolution layer I-II-I, in which the input is uniformly divided into local blocks at the 8×8 scale that are convolved separately, and all convolution results are spliced to form the output of the convolution layer I-II-I; the output of the convolution layer I-II-I serves as the input of the convolution layer I-II-II, in which the input is uniformly divided into local blocks at the 4×4 scale that are convolved separately, and all convolution results are spliced to form the output of the convolution layer I-II-II; the output of the convolution layer I-II-II serves as the input of the convolution layer I-II-III, in which the input is uniformly divided into local blocks at the 2×2 scale that are convolved separately, and all convolution results are spliced to form the output of the convolution layer I-II-III; the outputs of the convolution layers I-II-I, I-II-II and I-II-III are concatenated at the channel level and then added to the output of the convolution layer I-I, and the result serves as the output of the hierarchical multi-scale region layer;
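The block-wise filtering described above can be sketched as follows in PyTorch. It assumes that the "8×8", "4×4" and "2×2" scales correspond to uniform grids of local blocks, that the input spatial size is divisible by 8, that 3×3 kernels are used, and that each block level outputs one third of the channels of convolution layer I-I so that the channel-level concatenation can be added to it; these choices are assumptions, not values taken from the text.

```python
import torch
import torch.nn as nn

class BlockwiseConv(nn.Module):
    """Splits a feature map into a grid x grid set of local blocks and filters each block independently."""
    def __init__(self, in_ch, out_ch, grid):
        super().__init__()
        self.grid = grid
        self.convs = nn.ModuleList(nn.Conv2d(in_ch, out_ch, 3, padding=1) for _ in range(grid * grid))

    def forward(self, x):
        n, _, h, w = x.shape
        bh, bw = h // self.grid, w // self.grid
        rows, k = [], 0
        for i in range(self.grid):
            cols = []
            for j in range(self.grid):
                block = x[:, :, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
                cols.append(self.convs[k](block))        # independent filter for this block
                k += 1
            rows.append(torch.cat(cols, dim=3))
        return torch.cat(rows, dim=2)                    # splice the filtered blocks back together

class HierarchicalMultiScaleRegionLayer(nn.Module):
    def __init__(self, in_ch, out_ch=24):                # out_ch assumed divisible by 3
        super().__init__()
        self.conv_whole = nn.Conv2d(in_ch, out_ch, 3, padding=1)        # convolution layer I-I: whole-image convolution
        self.level1 = BlockwiseConv(out_ch, out_ch // 3, grid=8)        # convolution layer I-II-I
        self.level2 = BlockwiseConv(out_ch // 3, out_ch // 3, grid=4)   # convolution layer I-II-II
        self.level3 = BlockwiseConv(out_ch // 3, out_ch // 3, grid=2)   # convolution layer I-II-III

    def forward(self, x):
        whole = self.conv_whole(x)
        b1 = self.level1(whole)
        b2 = self.level2(b1)
        b3 = self.level3(b2)
        return whole + torch.cat([b1, b2, b3], dim=1)    # channel-level concatenation, then residual add

# Usage: out = HierarchicalMultiScaleRegionLayer(3)(torch.randn(1, 3, 176, 176))
```

Module I would then stack two such layers, each followed by max pooling, as described in the preceding paragraph.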
S04: constructing a convolutional neural network module II by utilizing the hierarchical multi-scale region features extracted in the step S03, regressing a global attention map for each AU to extract AU features, and supervising the convolutional neural network module II through the AU detection loss; AU denotes a facial action unit;
the convolutional neural network module II is adopted to predict the global attention map of each AU and the occurrence probability of each AU, to adaptively regress the global attention map of each AU toward a predefined attention map under the supervision of the AU detection loss, and to extract the AU features; the loss function used for the AU detection loss is:
$$L_a = \frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\sum_{a=1}^{l/4}\sum_{b=1}^{l/4}\left(M_{ijab}-\hat{M}_{ijab}\right)^{2}$$

$$L_{AA} = -\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_{j}\left[v_{j}\,p_{ij}\log\hat{p}_{ij}+\left(1-p_{ij}\right)\log\left(1-\hat{p}_{ij}\right)\right]+\lambda_{a}L_{a}$$
wherein: l (L) a A loss function representing global attention seeking to regress,
Figure QLYQS_4
weighted cross entropy loss function representing AU recognition, L AA Representing the total loss function of the convolution neural network module I and the convolution neural network module II; lambda (lambda) a Representing global attention stricken returnsWeight of loss; t represents the length of the amplified image frame sequence; m represents the number of AUs in each frame of image; l/4×l/4 represents the size of the global attention map; m is M ijab Representing a true attention weight of a jth AU of the ith frame image at the coordinate position (a, b); m is M ijab A predicted attention weight representing a jth AU of the ith frame image at the coordinate position (a, b); p is p ij Representing the true probability of occurrence of the jth AU of the ith frame image; />
Figure QLYQS_5
Representing a prediction probability of occurrence of a j-th AU of the i-th frame image; omega j A weight indicating the jth AU; v j Represents the weight at which the jth AU appears;
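A sketch of how such a combined loss could be computed is given below. The mean-squared-error form of the attention regression term, the normalization constants, and the way v_j enters the cross-entropy are reconstructions from the variable definitions above rather than verbatim formulas.

```python
import torch

def attention_and_detection_loss(att_pred, att_true, p_pred, p_true, omega, v, lambda_a=1.0, eps=1e-8):
    """Reconstructed L_AA: weighted cross-entropy AU detection loss plus lambda_a times the attention regression loss.

    att_pred, att_true: (t, m, l/4, l/4) predicted / predefined attention maps
    p_pred, p_true:     (t, m) predicted / ground-truth AU occurrence probabilities
    omega, v:           (m,) per-AU weights omega_j and v_j
    """
    l_a = ((att_pred - att_true) ** 2).mean()                  # global attention map regression loss
    ce = -(omega * (v * p_true * torch.log(p_pred + eps)
                    + (1.0 - p_true) * torch.log(1.0 - p_pred + eps))).mean()
    return ce + lambda_a * l_a

# Usage with random tensors (t=4 frames, m=12 AUs, 44x44 attention maps):
# loss = attention_and_detection_loss(torch.rand(4, 12, 44, 44), torch.rand(4, 12, 44, 44),
#                                     torch.rand(4, 12), torch.randint(0, 2, (4, 12)).float(),
#                                     torch.ones(12), torch.ones(12))
```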
S05: constructing a self-adaptive space-time graph convolutional neural network module III by utilizing the AU features extracted in the step S04, and reasoning about the specific pattern of each AU and the space-time correlations between different AUs so as to learn the space-time correlation feature of each AU;
the self-adaptive space-time graph convolutional neural network module III is adopted to extract the specific pattern of each AU and the space-time correlations between different AUs, so as to learn the space-time correlation feature of each AU; the input of the self-adaptive space-time graph convolutional neural network module III is all the AU features of the t frame images extracted in the step S04, where each frame image comprises m AU features, giving t × m AU features in total, and the dimension of each AU feature is c', with c' = 12c and c a set parameter of the overall AU recognition network model; the self-adaptive space-time graph convolutional neural network module III comprises two space-time graph convolutional layers of identical structure whose parameters are learned independently, and each space-time graph convolutional layer combines a spatial graph convolution unit with a gated recurrent unit; the space-time graph convolutional layer is defined as follows:
$$z_T = \sigma\!\left(\left(N\!\left(R\!\left(UU^{\mathsf{T}}\right)\right)+E\right)\,C\!\left(\tilde{X}_T,\,h_{T-1}\right)\circledast W_z\right)$$

$$r_T = \sigma\!\left(\left(N\!\left(R\!\left(UU^{\mathsf{T}}\right)\right)+E\right)\,C\!\left(\tilde{X}_T,\,h_{T-1}\right)\circledast W_r\right)$$

$$\hat{h}_T = \tanh\!\left(\left(N\!\left(R\!\left(UU^{\mathsf{T}}\right)\right)+E\right)\,C\!\left(\tilde{X}_T,\,r_T\odot h_{T-1}\right)\circledast W_{\hat{h}}\right)$$

$$h_T = z_T\odot h_{T-1}+\left(1-z_T\right)\odot\hat{h}_T$$
wherein: $\tilde{X}_T$ denotes the input at time $T$; $h_T$ denotes the final hidden state at time $T$; $\hat{h}_T$ denotes the initial (candidate) hidden state at time $T$; $z_T$ is used to determine how much of $h_{T-1}$ is retained at time $T$; $r_T$ is used to determine how $\tilde{X}_T$ and $h_{T-1}$ are combined at time $T$; $E \in \mathbb{R}^{m \times m}$ denotes the identity matrix, where $m$ is the number of AUs in each frame image; $U$ denotes the adaptively learned matrix of the AU relationship graph, and $U^{\mathsf{T}}$ denotes the transpose of $U$; $Q$ denotes the adaptively learned dissociation matrix, and $c_e$ is the number of columns set for $Q$; $W_z$, $W_r$ and $W_{\hat{h}}$ denote the adaptively learned weights for $z_T$, $r_T$ and $\hat{h}_T$, respectively; $c'$ denotes the dimension of each AU feature, $c' = 12c$, where $c$ is a set parameter of the overall AU recognition network model; $R(X)$ denotes rectification of the element at each index position of the two-dimensional matrix $X$, after which the element $X_{ab}$ at the index position $(a, b)$ is updated to $X_{ab} = \max(0, X_{ab})$; $N(X)$ denotes normalization of the element at each index position of the two-dimensional matrix $X$, after which the element $X_{ab}$ at the index position $(a, b)$ is updated to $X_{ab} = X_{ab} / \sum_{k} X_{ak}$; $Z = X \circledast Y$ denotes the operation between a two-dimensional matrix $X$ and a three-dimensional matrix $Y$ whose result $Z$ is a two-dimensional matrix with the element $Z_{ab} = \sum_{k} X_{ak} Y_{akb}$ at the index position $(a, b)$, where $X_{ak}$ denotes the element at the index position $(a, k)$ of $X$ and $Y_{akb}$ denotes the element at the index position $(a, k, b)$ of $Y$; $\odot$ denotes element-level multiplication, $C(\cdot)$ denotes the concatenation operation, $\sigma(\cdot)$ denotes the Sigmoid function, and $\tanh(\cdot)$ denotes the hyperbolic tangent activation function;
S06: constructing a fully connected module IV to perform AU identification by utilizing the space-time correlation features of all AUs extracted in the step S05;
the fully connected module IV is adopted to identify the AUs of each frame image: the space-time correlation features of all AUs contained in the t frame images output in the step S05, including the space-time correlation feature of the j-th AU of the i-th frame image, are input to the fully connected module IV, and for each such feature the fully connected module IV predicts the final occurrence probability of the j-th AU of the i-th frame image by applying the j-th one-dimensional fully connected layer followed by a Sigmoid activation function, wherein the fully connected module IV adopts the same fully connected layer for the same AU of different frame images; the loss function adopted for AU identification is:
$$-\frac{1}{t\,m}\sum_{i=1}^{t}\sum_{j=1}^{m}\omega_{j}\left[v_{j}\,p_{ij}\log\hat{p}_{ij}+\left(1-p_{ij}\right)\log\left(1-\hat{p}_{ij}\right)\right]$$
wherein: $t$ denotes the length of the amplified image frame sequence; $m$ denotes the number of AUs in each frame image; $p_{ij}$ denotes the ground-truth probability of occurrence of the $j$-th AU of the $i$-th frame image; $\hat{p}_{ij}$ denotes the final predicted probability of occurrence of the $j$-th AU of the $i$-th frame image; $\omega_j$ denotes the weight of the $j$-th AU; $v_j$ denotes the weight applied when the $j$-th AU occurs;
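A minimal sketch of module IV and the AU identification loss follows. It assumes the space-time correlation features arrive as a (t, m, c') tensor and that v_j weights the positive term of the cross-entropy; the loss form is reconstructed from the variable definitions, not copied from the source.

```python
import torch
import torch.nn as nn

class PerAUClassifier(nn.Module):
    """Module IV sketch: the j-th one-dimensional FC layer plus Sigmoid predicts the j-th AU, shared across frames."""
    def __init__(self, num_aus, feat_dim):
        super().__init__()
        self.fcs = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(num_aus))

    def forward(self, feats):                                  # feats: (t, m, c') space-time correlation features
        probs = [torch.sigmoid(fc(feats[:, j])) for j, fc in enumerate(self.fcs)]
        return torch.cat(probs, dim=-1)                        # (t, m) final AU occurrence probabilities

def au_identification_loss(p_pred, p_true, omega, v, eps=1e-8):
    """Weighted cross-entropy over all frames and AUs (reconstructed form)."""
    return -(omega * (v * p_true * torch.log(p_pred + eps)
                      + (1.0 - p_true) * torch.log(1.0 - p_pred + eps))).mean()

# Usage: probs = PerAUClassifier(12, 384)(torch.randn(5, 12, 384))
```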
S07: training the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the self-adaptive space-time graph convolutional neural network module III and the fully connected module IV on the training data set, so as to update the parameters of the overall AU recognition network model by a gradient-based optimization method;
S08: inputting a video sequence with any given number of frames into the trained overall AU recognition network model to predict the occurrence probability of each AU.
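Steps S07 and S08 amount to standard end-to-end training followed by inference. The sketch below uses a toy stand-in for the overall AU recognition network and a plain BCE loss in place of the combined loss defined above, purely to illustrate the gradient-based parameter update and the arbitrary-length inference.

```python
import torch
import torch.nn as nn

# Toy stand-in for the overall AU recognition network (modules I-IV); shapes are assumptions.
model = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 64 * 64, 12), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(frames, au_labels):
    # frames: (t, 3, 64, 64) amplified frame sequence, au_labels: (t, 12) ground-truth AU occurrences
    optimizer.zero_grad()
    probs = model(frames)
    loss = bce(probs, au_labels)      # stand-in for the total training loss
    loss.backward()                   # back-propagate through the whole model
    optimizer.step()                  # gradient-based parameter update (step S07)
    return loss.item()

# Step S08: after training, a frame sequence of any length can be fed straight through.
with torch.no_grad():
    predictions = model(torch.randn(7, 3, 64, 64))   # (7, 12) AU occurrence probabilities
```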
2. A facial action unit recognition device based on self-adaptive attention and space-time correlation, for implementing the method as claimed in claim 1, characterized in that: the device comprises an image frame sequence acquisition unit, a hierarchical multi-scale region learning unit, an adaptive attention regression and feature extraction unit, a self-adaptive space-time graph convolution learning unit, an AU identification unit and a parameter optimization unit;
the image frame sequence acquisition unit is used for extracting the original continuous image frames required for training from video data to form a training data set, and for preprocessing the original continuous image frames to obtain an amplified image frame sequence;
the hierarchical multi-scale region learning unit comprises the convolutional neural network module I, in which a hierarchical multi-scale region layer is adopted to learn the features of each local block of each input image frame at different scales, with each local block filtered independently;
the adaptive attention regression and feature extraction unit comprises the convolutional neural network module II, a self-adaptive regression module and an AU feature extraction module, wherein the convolutional neural network module II is used for generating a global attention map of the image, performing self-adaptive regression under the supervision of a predefined attention map and the AU detection loss, and accurately extracting the AU features;
the self-adaptive space-time graph convolution learning unit comprises the self-adaptive space-time graph convolutional neural network module III, which learns the specific pattern of each AU and the space-time correlations among different AUs, and extracts the space-time correlation feature of each AU;
the AU identification unit comprises the fully connected module IV, which performs AU identification effectively by utilizing the space-time correlation feature of each AU;
the parameter optimization unit calculates the parameters and the loss function value of the overall AU recognition network model formed by the convolutional neural network module I, the convolutional neural network module II, the self-adaptive space-time graph convolutional neural network module III and the fully connected module IV, and updates the parameters by a gradient-based optimization method.
CN202210606040.5A 2022-05-31 2022-05-31 Facial action unit identification method and device based on self-adaptive attention and space-time correlation Active CN114842542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210606040.5A CN114842542B (en) 2022-05-31 2022-05-31 Facial action unit identification method and device based on self-adaptive attention and space-time correlation

Publications (2)

Publication Number Publication Date
CN114842542A CN114842542A (en) 2022-08-02
CN114842542B true CN114842542B (en) 2023-06-13

Family

ID=82572471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210606040.5A Active CN114842542B (en) 2022-05-31 2022-05-31 Facial action unit identification method and device based on self-adaptive attention and space-time correlation

Country Status (1)

Country Link
CN (1) CN114842542B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071809B (en) * 2023-03-22 2023-07-14 鹏城实验室 Face space-time representation generation method based on multi-class representation space-time interaction
CN116416667B (en) * 2023-04-25 2023-10-24 天津大学 Facial action unit detection method based on dynamic association information embedding
CN118277607A (en) * 2024-04-12 2024-07-02 山东万高电子科技有限公司 Video monitoring data storage device and method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633207B (en) * 2017-08-17 2018-10-12 平安科技(深圳)有限公司 AU characteristic recognition methods, device and storage medium
CN110363156A (en) * 2019-07-17 2019-10-22 北京师范大学 A kind of Facial action unit recognition methods that posture is unrelated
CN111597884A (en) * 2020-04-03 2020-08-28 平安科技(深圳)有限公司 Facial action unit identification method and device, electronic equipment and storage medium
CN112633153A (en) * 2020-12-22 2021-04-09 天津大学 Facial expression motion unit identification method based on space-time graph convolutional network
CN112990077B (en) * 2021-04-02 2021-10-01 中国矿业大学 Face action unit identification method and device based on joint learning and optical flow estimation
CN113496217B (en) * 2021-07-08 2022-06-21 河北工业大学 Method for identifying human face micro expression in video image sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant