CN112418074B - Coupled posture face recognition method based on self-attention - Google Patents

Coupled posture face recognition method based on self-attention

Info

Publication number
CN112418074B
CN112418074B (application CN202011308968.2A)
Authority
CN
China
Prior art keywords
image
face
posture
feature
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011308968.2A
Other languages
Chinese (zh)
Other versions
CN112418074A (en)
Inventor
周丽芳
陈旭
李伟生
雷帮军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Gerite Technology Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011308968.2A priority Critical patent/CN112418074B/en
Publication of CN112418074A publication Critical patent/CN112418074A/en
Application granted granted Critical
Publication of CN112418074B publication Critical patent/CN112418074B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships

Abstract

The invention claims a coupled pose face recognition method based on self-attention, belonging to the technical field of pattern recognition. The method mainly comprises the following steps. Step 1: preprocess the training images with MTCNN (face detection and feature-point calibration); propose a pose guidance strategy (PGS) based on the K-means algorithm and determine the pose templates. Step 2: from each image input to the network, generate several faces with different poses through the pose templates; the encoder of the generator network encodes the features of these faces, a weighted average yields the fusion feature, and the decoder restores it to a frontal face image. Step 3: construct a pose-guided dual-discriminator generative adversarial network (PGDD-GAN) and perform adversarial training on the synthesized image. Step 4: embed a self-attention model in the encoder and the discriminator networks to enhance the local texture information of the synthesized image. The method reduces the model's requirement on the source data set and improves the robustness of face recognition in an unsupervised environment.

Description

Coupled posture face recognition method based on self-attention
Technical Field
The invention belongs to the technical field of computer pattern recognition, and particularly relates to a pose-invariant face recognition method.
Background
In real life, in special settings such as access control systems, airports and customs checkpoints, the identity verification system requires the target subject to actively cooperate in the acquisition of a frontal face image, so that a relatively ideal recognition result can be obtained. In real scenes, however, active cooperation of the target subject is impossible in most cases; the subject may even be captured by a video surveillance system without being aware of it, that is, the face is acquired under non-ideal viewing angles such as top views, side views and bottom views. Under these non-ideal viewing angles, humans can still recognize faces accurately and maintain very high recognition performance; this is difficult for machine vision, where factors such as illumination, accessories such as glasses, and image resolution have a particularly pronounced influence on recognition performance. Therefore, active and in-depth research on the key problems of face recognition under pose change has important theoretical significance and broad application prospects.
Some researchers have worked on face recognition under pose change and achieved research results. The main idea of pose-invariant feature learning methods is as follows: features are extracted from images with different poses, a network module is used to restore the frontal face, and the frontal face is stored as supervision information for network training. The most representative method is the deep neural network (DNN); it starts from learning identity-preserving face features and obtains good recognition results, but has drawbacks: owing to the deep structure of the model, the network has millions of parameters to adjust and therefore requires a large amount of multi-pose training data. Compared with pose-invariant feature learning, methods based on face synthesis have more advantages in real-life application scenarios. Face synthesis methods can be roughly divided into two categories: 2D face synthesis and 3D face synthesis. 2D face synthesis methods try to extract pose-robust features through a nonlinear regression model and synthesize a frontal image using a local warping strategy; representative methods include Stack-Flow, the disentangled representation learning GAN (DR-GAN) and unsupervised face normalization with extreme pose and expression in the wild (FNM). The face images synthesized by these methods show a blurring effect and lose fine facial texture information. 3D face synthesis methods normalize a face image to a uniform pose by estimating changes in face depth information; representative methods include the 3D Morphable Model (3DMM) and large-pose face frontalization in the wild (FF-GAN). Such methods typically use limited information, such as dense facial key-point coordinates, to estimate pose and shape parameters, and errors in pose and shape estimation produce undesirable artifacts in the subsequent texture mapping and face synthesis operations, which adversely affects face recognition. To solve the above problems, the present invention provides a coupled pose face recognition method based on self-attention.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art by providing a coupled pose face recognition method based on self-attention. The technical solution of the invention is as follows:
A self-attention-based coupled pose face recognition method comprises the following steps:
101. Input the training data images into a multitask convolutional neural network for face detection and calibration of five feature points; compute the pose yaw angles corresponding to the feature points through a three-dimensional morphable model; then apply the proposed pose guidance strategy (PGS) based on the K-means algorithm: cluster the pose yaw angles of all images to obtain four optimal pose yaw angles, apply them to the three-dimensional morphable model, and generate four pose templates;
102. Using the pose guidance strategy PGS constructed in step 101, obtain four pose faces from the target image; input the target image and the four pose faces into the generator network simultaneously, extract features from the multiple pose face images with the encoding network, compute a weighted average of the features to obtain a fusion feature, and restore the fusion feature to a frontal face image with the decoding network;
103. Adopt a disentangled representation learning method and use a dual-discriminator network to perform discriminative learning on the synthesized image to obtain the pose face recognition result;
the step 101 specifically includes the following steps:
A1, input the training data image into the MTCNN network for face detection and calibrate five key points, namely the centers of the left and right eyes, the nose tip, and the left and right mouth corners;
B1, compute the Euler angles, namely the pitch angle (pitch), yaw angle (yaw) and roll angle (roll), from the two-dimensional coordinates (x_i, y_i) of the five key points, and assemble the obtained yaw angles into a one-dimensional angle vector;
C1, cluster the vectors obtained in step B1 with the K-means algorithm and divide them into four classes by computing Euclidean distances to obtain four pose templates;
In step 102, the target image is passed through the pose guidance strategy PGS to obtain four pose faces; the target image and the four pose faces are input into the generator network simultaneously, the encoding network extracts features from the multiple pose face images, the features are weighted and averaged to obtain a fusion feature, and the decoding network restores the fusion feature to a frontal face image. This specifically includes the following steps:
A2, apply the four pose templates obtained in step 101 to a single input image and, with the assistance of the 3D morphable model (3DMM), generate four face images with the same identity and different poses;
B2, use the original image and the four pose images as input to the generator network G. The encoder G_enc extracts features to obtain f_1, f_2, f_3, f_4, f_5. G_enc not only learns f(x) but also estimates a confidence coefficient w for each image that predicts the quality of the learned representation. To reduce the intra-class pose difference, the features are weighted and averaged to obtain the fusion feature f(x_1, x_2, x_3, x_4, x_5), which is fed to G_dec as input to generate a new image G_syn with the same identity as x and a frontal pose;
C2, input the synthesized image G_syn into the identity discriminator D_d, which classifies the identity of the synthesized image; then input the frontal image x̂ that has the same identity as the original image x, together with G_syn, into the pose discriminator D_p for pose classification. D_d and D_p are responsible for judging G_syn to be the fake class. Meanwhile, a self-attention encoder and a self-attention discriminator are designed: the self-attention encoder makes the extracted features more realistic and discriminative, while the self-attention discriminator is responsible for judging the extracted features to be the fake class and provides more robust features for the learning of the generator;
The objective of step C2 is computed as follows:

max_D E_{x,y∼p_d}[ log D_d^{y_d}(x) + log D_p^{y_p}(x) ] + E_{x∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{fake}(G(x, c, z)) + log D_p^{fake}(G(x, c, z)) ]

where p_z(z) denotes the noise code distribution (the noise code dimension is initialized to 50), p_c(c) denotes the pose code distribution, G(x, c, z) denotes the composite image, and the superscripts index the output classes (y_d the true identity, y_p the true pose, and "fake" the fake class); the formula uses cross entropy to compute the identity loss and the pose loss;
Step 103 adopts the disentangled representation learning method and uses the dual-discriminator network for discriminative learning of the synthesized image, specifically as follows:
First, the face image synthesized in step 102 is input into the identity discriminator to compute the identity loss; then the synthesized face and the real frontal face are input into the pose discriminator to compute the frontal pose loss; finally, following the game mechanism of the generative adversarial network, a realistic frontal face image is synthesized;
First, the face image synthesized in step 102 is input into the identity discriminator to compute the identity loss:

L_id = −E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{y_d}(G(x, c, z)) ]
Then the synthesized face and the real frontal face are input into the pose discriminator to compute the frontal pose loss:

L_pose = −E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_p^{y_f}(G(x, c, z)) ]

where y_f denotes the frontal pose label;
Finally, following the game mechanism of the generative adversarial network, the generator tries to synthesize an image that deceives the discriminators while the discriminators try to classify the synthesized image into the fake class; the two learn against each other and a realistic frontal face image is synthesized:

min_G max_D E_{x,y∼p_d}[ log D_d^{y_d}(x) + log D_p^{y_p}(x) ] + E_{x∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{fake}(G(x, c, z)) + log D_p^{fake}(G(x, c, z)) ]
further, in the step 103, in the process of performing discriminant learning on the synthesized image by using the dual-discriminator network, a self-attention module is further implanted into the generator and the discriminator, and the weight of the feature channel is distributed, so that the local texture information is enhanced.
Further, the fusion feature f(x_1, x_2, x_3, x_4, x_5) of step B2 is computed as follows:

f(x_1, x_2, x_3, x_4, x_5) = Σ_{i=1}^{n} w_i · f(x_i) / Σ_{i=1}^{n} w_i    (2)

where w_i is the weight parameter corresponding to each feature and n denotes the number of features (n = 5); w is learned, and a Sigmoid activation function constrains its range to [0, 1].
Further, the self-attention discriminator and the self-attention encoder of step C2 are designed through the following steps:
A3, the original image is passed through several convolution layers to obtain a feature map x; x is passed through the convolution layers f, g and h to obtain f'(x), g'(x) and h'(x). f, g and h are all 1×1 convolutions and differ only in the size of their output channels, which are (C/8, N), (C/8, N) and (C, N) respectively, where C denotes the number of feature channels, N = W × H, and W and H denote the width and height of the feature map;
B3, f'(x) is transposed and multiplied by g'(x) to obtain an output matrix S of size (N, N); the rows of S are normalized by softmax to obtain the matrix β, each row of which represents an attention map, as shown in formula (3):

β_{i,j} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}),   with s_{ij} = f(x_i)^T g(x_j)    (3)

where β_{i,j} indicates the degree of attention paid to the i-th position when synthesizing the j-th region, s_{ij} denotes the value at the i-th position of the j-th region of the output matrix, and f(x_i) and g(x_j) denote the two feature spaces obtained from x after the initial transformation;
C3, the N row vectors are multiplied pixel-wise with h'(x), i.e. every pixel is related to the whole feature map, and the resulting N new pixels form the attention feature map O, as shown in formula (4):

O_j = Σ_{i=1}^{N} β_{i,j} · h(x_i),   with h(x_i) = W_h · x_i    (4)

where W_h denotes the learned weight matrix, implemented as a 1×1 convolution, and h(x_i) denotes the feature space of x after the initial transformation;
and finally, performing feature fusion on the attention feature map O and the feature map x, as shown in formula (5):
y=γO+x (5)
where γ represents a learnable scalar, initialized to 0.
The invention has the following advantages and beneficial effects:
the innovation of the present invention is primarily the steps 101, 102, 103 of the claims.
The innovation of claim 101 is in data expansion of the original data set using a K-means clustering algorithm. In the common pose face data sets LFW and IJB-A, CFP, the poses are not uniformly distributed, and part of the poses do not exist, so that the manpower and financial resources consumed by large-scale data set acquisition cannot be measured, and the requirement of network training on equipment is very high. The invention provides a Posture Guidance Strategy (PGS), which is characterized in that five key points of a training image are obtained and mapped into a 3D model to calculate a deflection angle yaw, four optimal templates are obtained by clustering the deflection angles of all images, and the templates are acted on each image to obtain four face images with the same identity and different postures, so that the posture data expansion of an original data set is realized.
The innovation of step 102 lies in constructing a multi-image generator network using the augmented pose face images. The generator network in the traditional GAN framework takes a single image as input, which does not handle the feature differences of pose faces well. The invention proposes a multi-image generator network: using the expanded data from step 101, several pose images are input into the generator network simultaneously, which reduces the intra-class difference between poses; with a weighted-average method, the local texture information of profile faces supplements the learning of the local features of the frontal face, so that the network acquires a human-like discriminative reasoning and is better suited to real-life scenes.
The innovation of step 103 lies in the proposed self-attention dual-discriminator network. In pose face recognition tasks based on generative adversarial networks, the local edges of the synthesized image are blurred, the pose features are insensitive, and network training converges slowly. The invention constructs a self-attention dual-discriminator network framework in which the identity discriminator is responsible for identity feature computation and classification, the pose discriminator judges whether the face is frontal, and a self-attention module is embedded to extract deep feature representations and enhance the quality and local texture information of the synthesized image. As a result, the synthesized frontal face image is more realistic and the recognition effect is better.
The invention mainly addresses the problems of existing pose-invariant face recognition methods based on generative adversarial networks, namely the lack of robust feature extraction under face pose change, blurred synthesized images, and insufficient expression of face texture information, and designs a self-attention-based coupled pose face recognition method. The method fully considers the situation in real scenes where pose deflection of the face causes self-occlusion, the face cannot be recognized accurately, and many pose changes occur within a short time, and designs a pose-guided self-attention dual-discriminator network, PGDD-GAN. In addition, the geometric symmetry of the face is fully considered: to improve realism when synthesizing the frontal face from pose faces, a face symmetry loss function is defined, which effectively improves the tolerance of the network model to pose changes. The method achieves a better face recognition effect.
Drawings
FIG. 1 is a diagram of the encoder of the generator network;
fig. 2 is a flow chart of a self-attention module.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly in the following with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the method comprises the steps that based on the fact that a posture face recognition frame for generating a countermeasure network GAN is used as a basic frame, firstly, a posture yaw angle is obtained and calculated through an MTCNN, and four optimal posture templates are clustered through a K-means algorithm; then constructing a multi-image generation network structure; and finally, constructing a self-attention double-discriminant to carry out countermeasure training learning on the output image of the generated network.
The implementation process of the self-attention-based coupled pose face recognition framework provided by the embodiment of the invention is detailed as follows:
Step 1: acquire five key points of the face (the centers of the left and right eyes, the nose tip, and the left and right mouth corners) using MTCNN, align the key points with the corresponding 3D model, and compute the pose yaw angle yaw using weak perspective projection.
1.1, input the training data image into the MTCNN network for face detection and calibrate the five key points (the centers of the left and right eyes, the nose tip, and the left and right mouth corners);
1.2, map the two-dimensional coordinates (x_i, y_i) of the five key points to the corresponding 3D face model, compute the Euler angles (pitch, yaw and roll) with the weak perspective projection method, and assemble the obtained yaw angles into a one-dimensional angle vector; the computation is as follows:

[p 1]^T = f · A · [R | t_{3d}] · [P 1]^T    (1)

where f is a scale factor, A is an orthographic projection matrix, R is a 3×3 pose matrix composed of the pitch, yaw and roll angles, t_{3d} is a translation vector, and p and P denote a two-dimensional key point and the corresponding landmark in the 3D face model, respectively; the yaw angle yaw is obtained as the face pose angle by decomposing R;
1.3, cluster the yaw angle vectors obtained in step 1.2 with the K-means algorithm, divide them into four classes by computing Euclidean distances, and obtain four optimal pose templates; a minimal sketch of this clustering step is given below.
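The following is a hypothetical sketch of the PGS clustering step, assuming one yaw angle is already available per training image; the function name, the use of scikit-learn's KMeans and the stand-in angle values are illustrative only and are not taken from the original disclosure.

```python
import numpy as np
from sklearn.cluster import KMeans

def pose_templates(yaw_angles, n_templates=4, seed=0):
    """Cluster a 1-D vector of yaw angles (Euclidean K-means) into template angles."""
    yaws = np.asarray(yaw_angles, dtype=np.float64).reshape(-1, 1)
    km = KMeans(n_clusters=n_templates, n_init=10, random_state=seed).fit(yaws)
    return np.sort(km.cluster_centers_.ravel())   # four template yaw angles

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    yaws = rng.uniform(-90.0, 90.0, size=1000)     # stand-in yaw angles in degrees
    print(pose_templates(yaws))
```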
Step 2: to reduce the intra-class difference between poses and to handle surveillance recognition of a face that undergoes several pose changes in real life, a fused frontal face is generated using multi-image input, unlike the traditional single-image input of a GAN. G_enc extracts the features of each input image, the features are weighted and averaged to obtain a fusion feature, and G_dec finally synthesizes a frontal face image. The specific steps are as follows:
2.1, apply the four pose templates obtained in step 1 to a single input image to generate four face images with the same identity and different poses;
2.2, use the original image and the four pose images as input to the generator network G. The encoder G_enc performs feature extraction to obtain the features f_1, f_2, f_3, f_4, f_5. G_enc not only learns f(x) but also estimates a confidence coefficient w for each image that predicts the quality of the learned representation. To reduce the intra-class pose difference, the features are weighted and averaged to obtain the fusion feature f(x_1, x_2, x_3, x_4, x_5), computed as follows:

f(x_1, x_2, x_3, x_4, x_5) = Σ_{i=1}^{n} w_i · f(x_i) / Σ_{i=1}^{n} w_i    (2)

By learning w, high-quality images contribute more to the fused representation. The invention applies a Sigmoid activation function to constrain the range of w to [0, 1]. The fusion feature is fed to G_dec as input to generate a new image G_syn with the same identity as x_i and a frontal pose; a sketch of this fusion step is given below.
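The following is a minimal, hypothetical sketch of the confidence-weighted fusion of formula (2), assuming the encoder returns one feature vector and one scalar confidence logit per pose image; the tensor shapes, names and values are illustrative and not taken from the original disclosure.

```python
import torch

def fuse_features(features, confidence_logits):
    """Confidence-weighted average of per-image features, as in formula (2).

    features:          tensor of shape (n, d), one feature per pose image
    confidence_logits: tensor of shape (n,), raw confidences from the encoder
    """
    w = torch.sigmoid(confidence_logits)            # constrain each w_i to [0, 1]
    w = w / (w.sum() + 1e-8)                        # divide by the sum of the weights
    return (w.unsqueeze(1) * features).sum(dim=0)   # fused feature of shape (d,)

# usage with stand-in sizes: five pose images, 320-dimensional features
feats = torch.randn(5, 320)
logits = torch.randn(5)
print(fuse_features(feats, logits).shape)           # torch.Size([320])
```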
Step 3: use the network output of step 2 and the original image as the input of the self-attention dual discriminator for adversarial training to obtain the trained model. The specific steps are as follows:
Input the synthesized image G_syn into the identity discriminator D_d, which classifies the identity of the synthesized image; then input the frontal image x̂ that has the same identity as the original image x, together with G_syn, into the pose discriminator D_p for pose classification. D_d and D_p are responsible for classifying G_syn into the fake class, computed as follows:

E_{x∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{fake}(G(x, c, z)) + log D_p^{fake}(G(x, c, z)) ]

where the superscript "fake" indexes the additional fake class of each discriminator;
further, the step 2 and the step 3 construct and train a PGDD-GAN network framework, and the specific implementation steps are as follows:
A. As shown in FIG. 1, the network structure of the encoder of the generator network is, from top to bottom: the first layer is divided into two sublayers, both convolution layers with 3×3 kernels and 32 and 64 kernels respectively; the second layer is divided into three sublayers, all convolution layers with 3×3 kernels and 64, 64 and 128 kernels respectively; the third layer is divided into three sublayers, all convolution layers with 3×3 kernels and 128, 96 and 192 kernels respectively; the fourth layer is divided into three sublayers, all convolution layers with 3×3 kernels and 192, 128 and 256 kernels respectively; the fifth layer is divided into three sublayers, all convolution layers with 3×3 kernels and 256, 160 and 321 kernels respectively; the sixth layer is an average pooling layer with a 6×6 pooling window; the seventh layer is a fully connected layer with N_d + N_p + 1 neurons (N_d denotes the number of identities in the training data set and N_p the total number of discrete poses);
B. The decoder network structure is the deconvolution counterpart of the encoder network structure: the first layer is divided into three sublayers, a fully connected layer and two deconvolution layers, with 3×3 kernels and 320, 160 and 256 kernels respectively; the second layer is divided into three sublayers, all deconvolution layers with 3×3 kernels and 256, 128 and 192 kernels respectively; the third layer is divided into three sublayers, all deconvolution layers with 3×3 kernels and 192, 96 and 128 kernels respectively; the fourth layer is a self-attention layer consisting of three 1×1 convolution layers and a softmax layer with 128 kernels, whose weight parameters are continuously learned and updated so that the parts of the feature map that need attention are emphasized; the fifth layer is divided into three sublayers, all deconvolution layers with 3×3 kernels and 128, 64 and 64 kernels respectively; the sixth layer is a self-attention layer consisting of three 1×1 convolution layers and one softmax layer with 64 kernels, which again performs a weighted summation of the feature map to generate an attention feature map; the seventh layer is divided into three sublayers, all deconvolution layers with 3×3 kernels and 64, 32 and 3 kernels respectively; the image synthesized by the decoder is used as the input of the discriminator networks;
C. The network structures of the identity discriminator and the pose discriminator are similar to that of the decoder, except that a self-attention layer is added after the fourth layer and the fifth layer respectively. The identity discriminator and the pose discriminator differ in that the numbers of neurons in their seventh layers are N_d and N_p respectively. A partial sketch of the encoder stack described in item A is given below.
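For illustration, a partial PyTorch sketch of the encoder stack described in item A, covering only the first two layers (3×3 convolutions with 32/64 and 64/64/128 kernels); the activation function, normalization and any striding or downsampling are not specified above and are assumptions, as is every name in the snippet.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1):
    # 3x3 convolution; batch norm and ELU are assumed, not stated in the text above
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(c_out), nn.ELU())

encoder_head = nn.Sequential(
    conv_block(3, 32), conv_block(32, 64),                                   # layer 1: 32, 64
    conv_block(64, 64, stride=2), conv_block(64, 64), conv_block(64, 128),   # layer 2: 64, 64, 128
)
# the remaining layers would follow the same pattern with the kernel counts listed in item A
print(encoder_head(torch.randn(1, 3, 96, 96)).shape)   # torch.Size([1, 128, 48, 48])
```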
D. In the testing stage, a real face x with label y = {y_d, y_p} is input, where y_d denotes the identity label and y_p the pose label; the generator network outputs the composite image, the identity discriminator is D_d, and the pose discriminator is D_p.
E. When the neural network is trained, the cross-entropy function is used to compute the loss of the two-class (real versus synthetic) classification task, as shown in formula (3):

L_adv = − E_{x∼p_d}[ log D(x) ] − E_{x̂}[ log(1 − D(x̂)) ]    (3)

where x denotes a real face and x̂ denotes the synthesized face output by the generator.
Similarly, the identity loss and the pose loss of the two discriminators are computed with the cross-entropy function, as shown in formula (4):

L_id = − E_{x,y∼p_d}[ log D_d^{y_d}(x) ],   L_pose = − E_{x,y∼p_d}[ log D_p^{y_p}(x) ]    (4)
Given a real face x as input, the two discriminators aim to estimate the identity information and the pose information, leading to the following objective, as shown in equation (5):

max_D E_{x,y∼p_d}[ log D_d^{y_d}(x) + log D_p^{y_p}(x) ] + E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{N_d+1}(G(x, c, z)) + log D_p^{N_p+1}(G(x, c, z)) ]    (5)

The first term of equation (5) maximizes the probability that x is classified into its true identity and pose, and the second term maximizes the probability that x̂ = G(x, c, z) is classified into the fake class (indexed N_d + 1 and N_p + 1).
At the same time, the generator G consists of an encoder G_enc and a decoder G_dec. G_enc is intended to learn an identity representation from the real image x: f(x) = G_enc(x). G_dec is intended to synthesize a face image x̂ = G_dec(f(x), c, z) with the target pose code c and the identity label y_d, where z denotes random noise and c is a one-hot vector generated from the target pose y_t. The generator G aims to fool the discriminator D into classifying x̂ into the same true identity and target pose as the input x, leading to the following objective, as shown in equation (6):

max_G E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{y_d}(G(x, c, z)) + log D_p^{y_t}(G(x, c, z)) ]    (6)
g and D mutually improve the learning ability of the network in the alternate training, and strive to synthesize a face with a positive posture and real identity information. At G dec In the input of a separate gesture code c, training G enc The pose changes are separated from the feature map f (x), even if f (x) represents as much identity information as possible and as little pose information as possible. Meanwhile, in the phase of the discriminator, the identity discriminator can obtain more complete identity information, and the attitude discriminator can judge the target attitude more robustly, so that the attitude characteristic is more accurate, and the influence caused by the fusion of the identity characteristic and the attitude characteristic is reduced.
Finally, the geometric symmetry of the human face is fully exploited by imposing a symmetry constraint on the synthesized image, which effectively alleviates the self-occlusion problem and greatly improves performance under large pose changes. The following objective is therefore proposed, as shown in equation (7):

L_sym = (2 / (W × H)) Σ_{i=1}^{W/2} Σ_{j=1}^{H} | x̂_{i,j} − x̂_{W−(i−1), j} |    (7)

For convenience of calculation, the input picture is selectively flipped so that the occluded parts all lie on the right; W and H denote the width and height of the composite image, respectively. A sketch of this symmetry loss is given below.
F. In the present example, the CMU Multi-PIE dataset is used as the training set; it is the largest dataset for evaluating face synthesis and recognition. 337 subjects with neutral expression, 13 poses (±90°), and 20 illuminations from the Multi-PIE setup are used for training and testing. The first 200 subjects are used as the training set and the remaining 137 subjects as the test set.
Further, the self-attention module shown in fig. 2 is implemented as follows:
A. The original image is passed through several convolution layers to obtain a feature map x; x is then passed through the convolution layers f, g and h to obtain f'(x), g'(x) and h'(x). f, g and h are all 1×1 convolutions and differ only in the size of their output channels, which are (C/8, N), (C/8, N) and (C, N) respectively, where C denotes the number of feature channels and N = W × H;
B. f'(x) is transposed and multiplied by g'(x) to obtain an output matrix S of size (N, N); the rows of S are normalized by softmax to obtain the matrix β, each row of which represents an attention map, as shown in formula (8):

β_{i,j} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}),   with s_{ij} = f(x_i)^T g(x_j)    (8)
C. The N row vectors are multiplied pixel-wise with h'(x), i.e. every pixel is related to the whole feature map, and the resulting N new pixels form the attention feature map O, as shown in formula (9):

O_j = Σ_{i=1}^{N} β_{i,j} · h(x_i),   with h(x_i) = W_h · x_i    (9)
Finally, the attention feature map O and the feature map x are fused, as shown in formula (10); a PyTorch sketch of the whole module follows the formula:
y=γO+x (10)
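The following is a minimal, hypothetical PyTorch sketch of the self-attention module of FIG. 2, following steps A-C and formulas (8)-(10); the C/8 channel reduction, the 1×1 convolutions and the learnable scalar γ initialized to 0 match the description above, while the class name, tensor shapes and everything else are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Self-attention over a feature map x of shape (B, C, H, W)."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // 8, kernel_size=1)  # output (C/8, N)
        self.g = nn.Conv2d(channels, channels // 8, kernel_size=1)  # output (C/8, N)
        self.h = nn.Conv2d(channels, channels, kernel_size=1)       # output (C,   N)
        self.gamma = nn.Parameter(torch.zeros(1))                   # learnable scalar, initialized to 0

    def forward(self, x):
        b, c, hgt, wid = x.shape
        n = hgt * wid
        fx = self.f(x).view(b, -1, n)                 # (B, C/8, N)
        gx = self.g(x).view(b, -1, n)                 # (B, C/8, N)
        hx = self.h(x).view(b, c, n)                  # (B, C,   N)
        s = torch.bmm(fx.transpose(1, 2), gx)         # (B, N, N), s_ij = f(x_i)^T g(x_j)
        beta = F.softmax(s, dim=1)                    # attention maps, formula (8)
        o = torch.bmm(hx, beta).view(b, c, hgt, wid)  # attention feature map O, formula (9)
        return self.gamma * o + x                     # y = gamma * O + x, formula (10)

# usage with stand-in sizes
attn = SelfAttention(128)
print(attn(torch.randn(2, 128, 12, 12)).shape)        # torch.Size([2, 128, 12, 12])
```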
according to the invention, the attitude data set is subjected to data expansion through an attitude guide strategy, so that the labor cost required by data acquisition is reduced, the requirement on the quantity of training data sets is reduced, then a multi-image generation network is formed by utilizing the generated attitude template, the intra-class difference among the attitudes is reduced, the applicability of attitude diversity recognition in actual life is improved, and finally, the synthesized image is effectively classified through a self-attention double-discriminator network, so that the texture information of the synthesized image is enhanced, and the identity information of the original image is better preserved. Compared with other posture face recognition methods, the method effectively improves the performance of front face synthesis under the condition of adopting a conventional data set, thereby improving the face recognition precision and saving the labor cost and the network computing cost.
The methods, systems, apparatuses, modules or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure in any way whatsoever. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (1)

1. A coupled pose face recognition method based on self-attention, characterized by comprising the following steps:
101. Input the training data images into a multitask convolutional neural network for face detection and calibration of five feature points; compute the pose yaw angles corresponding to the feature points through a three-dimensional morphable model; then apply the proposed pose guidance strategy (PGS) based on the K-means algorithm: cluster the pose yaw angles of all images to obtain four optimal pose yaw angles, apply them to the three-dimensional morphable model, and generate four pose templates;
102. Using the pose guidance strategy PGS constructed in step 101, obtain four pose faces from the target image; input the target image and the four pose faces into the generator network simultaneously, extract features from the multiple pose face images with the encoding network, compute a weighted average of the features to obtain a fusion feature, and restore the fusion feature to a frontal face image with the decoding network;
103. Adopt a disentangled representation learning method and use a dual-discriminator network to perform discriminative learning on the synthesized image to obtain the pose face recognition result;
the step 101 specifically includes the following steps:
Step 1: acquire five key points of the face using MTCNN (multi-task cascaded convolutional neural network) and align them with the corresponding 3D model; the five key points comprise the centers of the left and right eyes, the nose tip, and the left and right mouth corners; compute the pose yaw angle yaw using weak perspective projection;
1.1, input the training data image into the MTCNN network for face detection and calibrate the five key points;
1.2, map the two-dimensional coordinates (x_i, y_i) of the five key points to the corresponding 3D face model, compute the Euler angles (the pitch angle pitch, the yaw angle yaw and the roll angle roll) with the weak perspective projection method, and assemble the obtained yaw angles into a one-dimensional angle vector, computed as follows:

[p 1]^T = f · A · [R | t_{3d}] · [P 1]^T    (1)

where f is a scale factor, A is an orthographic projection matrix, R is a 3×3 pose matrix composed of the pitch, yaw and roll angles, and t_{3d} is a translation vector; p and P denote a two-dimensional key point and the corresponding landmark in the 3D face model, respectively; the yaw angle yaw is obtained as the face pose angle by decomposing R;
1.3, cluster the yaw angle vectors obtained in step 1.2 with the K-means algorithm and divide them into four classes by computing Euclidean distances to obtain four optimal pose templates;
the step 102 specifically includes the following steps:
A2, apply the four pose templates obtained in step 101 to a single input image and, with the assistance of the 3D morphable model (3DMM), generate four face images with the same identity and different poses;
B2, use the original image and the four pose images as input to the generator network G. The encoder G_enc extracts features to obtain f_1, f_2, f_3, f_4, f_5. G_enc not only learns f(x) but also estimates a confidence coefficient w for each image that predicts the quality of the learned representation; to reduce the intra-class pose difference, the features are weighted and averaged to obtain the fusion feature f(x_1, x_2, x_3, x_4, x_5), which is fed to G_dec as input to generate a new image G_syn with the same identity as x_i and a frontal pose; the fusion feature f(x_1, x_2, x_3, x_4, x_5) of step B2 is computed as follows:

f(x_1, x_2, x_3, x_4, x_5) = Σ_{i=1}^{n} w_i · f(x_i) / Σ_{i=1}^{n} w_i    (2)

where w_i is the weight parameter corresponding to each feature and n denotes the number of features (n = 5); w is learned, and a Sigmoid activation function constrains its range to [0, 1];
C2, input the synthesized image G_syn into the identity discriminator D_d, which classifies the identity of the synthesized image; then input the frontal image x̂ that has the same identity as the original image x, together with G_syn, into the pose discriminator D_p for pose classification; D_d and D_p are responsible for judging G_syn to be the fake class; meanwhile, a self-attention encoder and a self-attention discriminator are designed: the self-attention encoder makes the extracted features more realistic and discriminative, while the self-attention discriminator is responsible for judging the extracted features to be the fake class and provides more robust features for the learning of the generator;
the design steps of the step C2 self-attention discriminator and the self-attention encoder are as follows:
A3, the original image is passed through several convolution layers to obtain a feature map x; x is passed through the convolution layers f, g and h to obtain f'(x), g'(x) and h'(x); f, g and h are all 1×1 convolutions and differ only in the size of their output channels, which are (C/8, N), (C/8, N) and (C, N) respectively, where C denotes the number of feature channels, N = W × H, and W and H denote the width and height of the feature map;
B3, f'(x) is transposed and multiplied by g'(x) to obtain an output matrix S of size (N, N); the rows of S are normalized by softmax to obtain the matrix β, each row of which represents an attention map, as shown in formula (3):

β_{i,j} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}),   with s_{ij} = f(x_i)^T g(x_j)    (3)

where β_{i,j} indicates the degree of attention paid to the i-th position when synthesizing the j-th region, s_{ij} denotes the value at the i-th position of the j-th region of the output matrix, and f(x_i) and g(x_j) denote the two feature spaces obtained from x after the initial transformation;
C3, the N row vectors are multiplied pixel-wise with h'(x), i.e. every pixel is related to the whole feature map, and the resulting N new pixels form the attention feature map O, as shown in formula (4):

O_j = Σ_{i=1}^{N} β_{i,j} · h(x_i),   with h(x_i) = W_h · x_i    (4)

where W_h denotes the learned weight matrix, implemented as a 1×1 convolution, and h(x_i) denotes the feature space of x after the initial transformation;
and finally, performing feature fusion on the attention feature map O and the feature map x, as shown in formula (5):
y=γO+x (5)
where γ represents a learnable scalar, initialized to 0;
the step 103 specifically includes the following steps:
First, the face image synthesized in step 102 is input into the identity discriminator to compute the identity loss:

L_id = −E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{y_d}(G(x, c, z)) ]

where p_c(c) denotes the pose code distribution of the original image and G(x, c, z) denotes the composite image; then the synthesized face and the real frontal face are input into the pose discriminator to compute the face pose loss:

L_pose = −E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_p^{y_f}(G(x, c, z)) ]

where y_f denotes the frontal pose label;
Finally, following the game mechanism of the generative adversarial network, the generator tries to synthesize an image that deceives the discriminator D while the discriminator tries to classify the synthesized image into the fake class; the two compete against each other and a realistic frontal face image is synthesized:

min_G max_D E_{x,y∼p_d}[ log D_d^{y_d}(x) + log D_p^{y_p}(x) ] + E_{x∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{fake}(G(x, c, z)) + log D_p^{fake}(G(x, c, z)) ]
In step 103, during the discriminative learning of the synthesized image by the dual-discriminator network, a self-attention module is also embedded in the generator and the discriminators, and local texture information is enhanced by assigning weights to the feature channels.
CN202011308968.2A 2020-11-20 2020-11-20 Coupled posture face recognition method based on self-attention Active CN112418074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011308968.2A CN112418074B (en) 2020-11-20 2020-11-20 Coupled posture face recognition method based on self-attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011308968.2A CN112418074B (en) 2020-11-20 2020-11-20 Coupled posture face recognition method based on self-attention

Publications (2)

Publication Number Publication Date
CN112418074A CN112418074A (en) 2021-02-26
CN112418074B true CN112418074B (en) 2022-08-23

Family

ID=74774075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011308968.2A Active CN112418074B (en) 2020-11-20 2020-11-20 Coupled posture face recognition method based on self-attention

Country Status (1)

Country Link
CN (1) CN112418074B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801069B (en) * 2021-04-14 2021-06-29 四川翼飞视科技有限公司 Face key feature point detection device, method and storage medium
CN113408351B (en) * 2021-05-18 2022-11-29 河南大学 Pedestrian re-recognition method for generating confrontation network based on attitude guidance
CN113222032B (en) * 2021-05-19 2023-03-10 西安电子科技大学 No-reference image quality evaluation method based on self-attention image coding
CN113674334B (en) * 2021-07-06 2023-04-18 复旦大学 Texture recognition method based on depth self-attention network and local feature coding
CN113553961B (en) * 2021-07-27 2023-09-05 北京京东尚科信息技术有限公司 Training method and device of face recognition model, electronic equipment and storage medium
US11803996B2 (en) 2021-07-30 2023-10-31 Lemon Inc. Neural network architecture for face tracking
CN113705358B (en) * 2021-08-02 2023-07-18 山西警察学院 Multi-angle side face normalization method based on feature mapping
CN113706404B (en) * 2021-08-06 2023-11-21 武汉大学 Depression angle face image correction method and system based on self-attention mechanism
CN114022930B (en) * 2021-10-28 2024-04-16 天津大学 Automatic generation method of portrait credentials
CN114330565A (en) * 2021-12-31 2022-04-12 深圳集智数字科技有限公司 Face recognition method and device
CN114005169B (en) * 2021-12-31 2022-03-22 中科视语(北京)科技有限公司 Face key point detection method and device, electronic equipment and storage medium
CN114360032B (en) * 2022-03-17 2022-07-12 北京启醒科技有限公司 Polymorphic invariance face recognition method and system
CN115083000B (en) * 2022-07-14 2023-09-05 北京百度网讯科技有限公司 Face model training method, face changing method, face model training device and electronic equipment
CN116152885B (en) * 2022-12-02 2023-08-01 南昌大学 Cross-modal heterogeneous face recognition and prototype restoration method based on feature decoupling
CN115862120B (en) * 2023-02-21 2023-11-10 天度(厦门)科技股份有限公司 Face action unit identification method and equipment capable of decoupling separable variation from encoder

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506717A (en) * 2017-08-17 2017-12-22 南京东方网信网络科技有限公司 Without the face identification method based on depth conversion study in constraint scene
CN108038474A (en) * 2017-12-28 2018-05-15 深圳云天励飞技术有限公司 Method for detecting human face, the training method of convolutional neural networks parameter, device and medium
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism
CN111738940A (en) * 2020-06-02 2020-10-02 大连理工大学 Human face image eye completing method for generating confrontation network based on self-attention mechanism model
CN111796681A (en) * 2020-07-07 2020-10-20 重庆邮电大学 Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8831299B2 (en) * 2007-05-22 2014-09-09 Intellectual Ventures Fund 83 Llc Capturing data for individual physiological monitoring
US10685262B2 (en) * 2015-03-20 2020-06-16 Intel Corporation Object recognition based on boosting binary convolutional neural network features

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506717A (en) * 2017-08-17 2017-12-22 南京东方网信网络科技有限公司 Without the face identification method based on depth conversion study in constraint scene
CN108038474A (en) * 2017-12-28 2018-05-15 深圳云天励飞技术有限公司 Method for detecting human face, the training method of convolutional neural networks parameter, device and medium
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism
CN111738940A (en) * 2020-06-02 2020-10-02 大连理工大学 Human face image eye completing method for generating confrontation network based on self-attention mechanism model
CN111796681A (en) * 2020-07-07 2020-10-20 重庆邮电大学 Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XueMei Zhao. A real-time face recognition system based on the improved LBPH algorithm. 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP), 2017. *
Chen Guanhao. Applied research on deep face feature extraction and recognition. Information Science and Technology Series, 2018. *

Also Published As

Publication number Publication date
CN112418074A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112418074B (en) Coupled posture face recognition method based on self-attention
Zhu et al. Data Augmentation using Conditional Generative Adversarial Networks for Leaf Counting in Arabidopsis Plants.
CN110348330B (en) Face pose virtual view generation method based on VAE-ACGAN
CN101320484B (en) Three-dimensional human face recognition method based on human face full-automatic positioning
CN101561874B (en) Method for recognizing face images
Holte et al. A local 3-D motion descriptor for multi-view human action recognition from 4-D spatio-temporal interest points
CN100410963C (en) Two-dimensional linear discrimination human face analysis identificating method based on interblock correlation
Bongsoo Choy et al. Enriching object detection with 2d-3d registration and continuous viewpoint estimation
CN107784284B (en) Face recognition method and system
CN113870157A (en) SAR image synthesis method based on cycleGAN
Wang et al. Joint head pose and facial landmark regression from depth images
KR20130059212A (en) Robust face recognition method through statistical learning of local features
CN105740838A (en) Recognition method in allusion to facial images with different dimensions
CN111680579A (en) Remote sensing image classification method for adaptive weight multi-view metric learning
Fu et al. Personality trait detection based on ASM localization and deep learning
Sun et al. Perceptual multi-channel visual feature fusion for scene categorization
CN109284692A (en) Merge the face identification method of EM algorithm and probability two dimension CCA
Weiss et al. Representation of similarity as a goal of early visual processing
Abdelaziz et al. Few-shot learning with saliency maps as additional visual information
Zhang et al. Vision-Based Satellite Recognition and Pose Estimation Using Gaussian Process Regression
Asad et al. Low complexity hybrid holistic–landmark based approach for face recognition
Nouri et al. Global visual saliency: Geometric and colorimetrie saliency fusion and its applications for 3D colored meshes
CN113344110A (en) Fuzzy image classification method based on super-resolution reconstruction
Shahin et al. Human Face Recognition from Part of a Facial Image based on Image Stitching
Gao et al. Boosting Pseudo Census Transform Features for Face Alignment.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240318

Address after: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee after: Shenzhen Hongyue Information Technology Co.,Ltd.

Country or region after: China

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240331

Address after: Floor 5-7, Block A, Building 1, No. 166 Wuxing Fourth Road, Wuhou District, Chengdu City, Sichuan Province, 610045

Patentee after: SICHUAN GERITE TECHNOLOGY Co.,Ltd.

Country or region after: China

Address before: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee before: Shenzhen Hongyue Information Technology Co.,Ltd.

Country or region before: China