CN112418074B - Coupled posture face recognition method based on self-attention - Google Patents

Coupled posture face recognition method based on self-attention

Info

Publication number
CN112418074B
CN112418074B (application CN202011308968.2A)
Authority
CN
China
Prior art keywords
image
face
posture
feature
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011308968.2A
Other languages
Chinese (zh)
Other versions
CN112418074A (en)
Inventor
周丽芳
陈旭
李伟生
雷帮军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Gerite Technology Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011308968.2A priority Critical patent/CN112418074B/en
Publication of CN112418074A publication Critical patent/CN112418074A/en
Application granted granted Critical
Publication of CN112418074B publication Critical patent/CN112418074B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships

Abstract

The invention claims a coupled pose face recognition method based on self-attention, belonging to the technical field of pattern recognition. The method mainly comprises the following steps. Step 1: preprocess the training images with MTCNN (face detection and feature-point calibration); propose a pose guidance strategy (PGS) based on the K-means algorithm and determine the pose templates. Step 2: from each image input to the network, generate several faces with different poses through the pose templates; the encoder of the generator network encodes the features of these faces, a weighted average yields the fusion feature, and the decoder restores it to a frontal face image. Step 3: construct a pose-guided dual-discriminator generative adversarial network (PGDD-GAN) and perform adversarial training on the synthesized image. Step 4: embed a self-attention model in the encoder and the discriminator networks to enhance the local texture information of the synthesized image. The method reduces the model's requirement on the source data set and improves the robustness of face recognition in an unsupervised environment.

Description

Coupled posture face recognition method based on self-attention
Technical Field
The invention belongs to the technical field of computer pattern recognition, and particularly relates to a pose-invariant face recognition method.
Background
In real life, in special settings such as access control systems, airports and customs checkpoints, the identity verification system requires the target subject to actively cooperate in the acquisition of a frontal face image, so that a relatively ideal recognition result can be obtained. In real scenes, however, active cooperation of the target subject is impossible in most cases; the subject may even be captured by a video surveillance system without being aware of it, that is, the face is acquired under non-ideal viewing angles such as top views, side views and bottom views. Under these non-ideal viewing angles, humans can still recognize faces accurately and maintain very high recognition performance; this is difficult for machine vision, where factors such as illumination, accessories such as glasses, and image resolution have a particularly pronounced influence on recognition performance. Therefore, active and in-depth research on the key problems of face recognition under pose change has important theoretical significance and broad application prospects.
Some researchers have worked on face recognition under pose change and achieved research results. The main idea of pose-invariant feature learning methods is as follows: features are extracted from images with different poses, a network module is used to restore the frontal face, and the frontal face is stored as supervision information for network training. The most representative method is the deep neural network (DNN); it starts from learning identity-preserving face features and obtains good recognition results, but has drawbacks: owing to the deep structure of the model, the network has millions of parameters to adjust and therefore requires a large amount of multi-pose training data. Compared with pose-invariant feature learning, methods based on face synthesis have more advantages in real-life application scenarios. Face synthesis methods can be roughly divided into two categories: 2D face synthesis and 3D face synthesis. 2D face synthesis methods try to extract pose-robust features through a nonlinear regression model and synthesize a frontal image using a local warping strategy; representative methods include Stack-Flow, the disentangled representation learning GAN (DR-GAN) and unsupervised face normalization with extreme pose and expression in the wild (FNM). The face images synthesized by these methods show a blurring effect and lose fine facial texture information. 3D face synthesis methods normalize a face image to a uniform pose by estimating changes in face depth information; representative methods include the 3D Morphable Model (3DMM) and large-pose face frontalization in the wild (FF-GAN). Such methods typically use limited information, such as dense facial key-point coordinates, to estimate pose and shape parameters, and errors in pose and shape estimation produce undesirable artifacts in the subsequent texture mapping and face synthesis operations, which adversely affects face recognition. To solve the above problems, the present invention provides a coupled pose face recognition method based on self-attention.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art by providing a coupled pose face recognition method based on self-attention. The technical solution of the invention is as follows:
A self-attention-based coupled pose face recognition method comprises the following steps:
101. Input the training data images into a multitask convolutional neural network for face detection and calibration of five feature points; compute the pose yaw angles corresponding to the feature points through a three-dimensional morphable model; then apply the proposed pose guidance strategy (PGS) based on the K-means algorithm: cluster the pose yaw angles of all images to obtain four optimal pose yaw angles, apply them to the three-dimensional morphable model, and generate four pose templates;
102. Using the pose guidance strategy PGS constructed in step 101, obtain four pose faces from the target image; input the target image and the four pose faces into the generator network simultaneously, extract features from the multiple pose face images with the encoding network, compute a weighted average of the features to obtain a fusion feature, and restore the fusion feature to a frontal face image with the decoding network;
103. Adopt a disentangled representation learning method and use a dual-discriminator network to perform discriminative learning on the synthesized image to obtain the pose face recognition result;
the step 101 specifically includes the following steps:
A1, input the training data image into the MTCNN network for face detection and calibrate five key points, namely the centers of the left and right eyes, the nose tip, and the left and right mouth corners;
B1, compute the Euler angles, namely the pitch angle (pitch), yaw angle (yaw) and roll angle (roll), from the two-dimensional coordinates (x_i, y_i) of the five key points, and assemble the obtained yaw angles into a one-dimensional angle vector;
C1, cluster the vectors obtained in step B1 with the K-means algorithm and divide them into four classes by computing Euclidean distances to obtain four pose templates;
In step 102, the target image is passed through the pose guidance strategy PGS to obtain four pose faces; the target image and the four pose faces are input into the generator network simultaneously, the encoding network extracts features from the multiple pose face images, the features are weighted and averaged to obtain a fusion feature, and the decoding network restores the fusion feature to a frontal face image. This specifically includes the following steps:
A2, apply the four pose templates obtained in step 101 to a single input image and, with the assistance of the 3D morphable model (3DMM), generate four face images with the same identity and different poses;
B2, use the original image and the four pose images as input to the generator network G. The encoder G_enc extracts features to obtain f_1, f_2, f_3, f_4, f_5. G_enc not only learns f(x) but also estimates a confidence coefficient w for each image that predicts the quality of the learned representation. To reduce the intra-class pose difference, the features are weighted and averaged to obtain the fusion feature f(x_1, x_2, x_3, x_4, x_5), which is fed to G_dec as input to generate a new image G_syn with the same identity as x and a frontal pose;
C2, input the synthesized image G_syn into the identity discriminator D_d, which classifies the identity of the synthesized image; then input the frontal image x̂ that has the same identity as the original image x, together with G_syn, into the pose discriminator D_p for pose classification. D_d and D_p are responsible for judging G_syn to be the fake class. Meanwhile, a self-attention encoder and a self-attention discriminator are designed: the self-attention encoder makes the extracted features more realistic and discriminative, while the self-attention discriminator is responsible for judging the extracted features to be the fake class and provides more robust features for the learning of the generator;
The objective of step C2 is computed as follows:

max_D E_{x,y∼p_d}[ log D_d^{y_d}(x) + log D_p^{y_p}(x) ] + E_{x∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{fake}(G(x, c, z)) + log D_p^{fake}(G(x, c, z)) ]

where p_z(z) denotes the noise code distribution (the noise code dimension is initialized to 50), p_c(c) denotes the pose code distribution, G(x, c, z) denotes the composite image, and the superscripts index the output classes (y_d the true identity, y_p the true pose, and "fake" the fake class); the formula uses cross entropy to compute the identity loss and the pose loss;
Step 103 adopts the disentangled representation learning method and uses the dual-discriminator network for discriminative learning of the synthesized image, specifically as follows:
First, the face image synthesized in step 102 is input into the identity discriminator to compute the identity loss; then the synthesized face and the real frontal face are input into the pose discriminator to compute the frontal pose loss; finally, following the game mechanism of the generative adversarial network, a realistic frontal face image is synthesized;
First, the face image synthesized in step 102 is input into the identity discriminator to compute the identity loss:

L_id = −E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{y_d}(G(x, c, z)) ]
Then the synthesized face and the real frontal face are input into the pose discriminator to compute the frontal pose loss:

L_pose = −E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_p^{y_f}(G(x, c, z)) ]

where y_f denotes the frontal pose label;
Finally, following the game mechanism of the generative adversarial network, the generator tries to synthesize an image that deceives the discriminators while the discriminators try to classify the synthesized image into the fake class; the two learn against each other and a realistic frontal face image is synthesized:

min_G max_D E_{x,y∼p_d}[ log D_d^{y_d}(x) + log D_p^{y_p}(x) ] + E_{x∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{fake}(G(x, c, z)) + log D_p^{fake}(G(x, c, z)) ]
further, in the step 103, in the process of performing discriminant learning on the synthesized image by using the dual-discriminator network, a self-attention module is further implanted into the generator and the discriminator, and the weight of the feature channel is distributed, so that the local texture information is enhanced.
Further, the fusion feature f(x_1, x_2, x_3, x_4, x_5) of step B2 is computed as follows:

f(x_1, x_2, x_3, x_4, x_5) = Σ_{i=1}^{n} w_i · f(x_i) / Σ_{i=1}^{n} w_i    (2)

where w_i is the weight parameter corresponding to each feature and n denotes the number of features (n = 5); w is learned, and a Sigmoid activation function constrains its range to [0, 1].
Further, the self-attention discriminator and the self-attention encoder of step C2 are designed through the following steps:
A3, the original image is passed through several convolution layers to obtain a feature map x; x is passed through the convolution layers f, g and h to obtain f'(x), g'(x) and h'(x). f, g and h are all 1×1 convolutions and differ only in the size of their output channels, which are (C/8, N), (C/8, N) and (C, N) respectively, where C denotes the number of feature channels, N = W × H, and W and H denote the width and height of the feature map;
B3, f'(x) is transposed and multiplied by g'(x) to obtain an output matrix S of size (N, N); the rows of S are normalized by softmax to obtain the matrix β, each row of which represents an attention map, as shown in formula (3):

β_{i,j} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}),   with s_{ij} = f(x_i)^T g(x_j)    (3)

where β_{i,j} indicates the degree of attention paid to the i-th position when synthesizing the j-th region, s_{ij} denotes the value at the i-th position of the j-th region of the output matrix, and f(x_i) and g(x_j) denote the two feature spaces obtained from x after the initial transformation;
C3, the N row vectors are multiplied pixel-wise with h'(x), i.e. every pixel is related to the whole feature map, and the resulting N new pixels form the attention feature map O, as shown in formula (4):

O_j = Σ_{i=1}^{N} β_{i,j} · h(x_i),   with h(x_i) = W_h · x_i    (4)

where W_h denotes the learned weight matrix, implemented as a 1×1 convolution, and h(x_i) denotes the feature space of x after the initial transformation;
and finally, performing feature fusion on the attention feature map O and the feature map x, as shown in formula (5):
y=γO+x (5)
where γ represents a learnable scalar, initialized to 0.
The invention has the following advantages and beneficial effects:
the innovation of the present invention is primarily the steps 101, 102, 103 of the claims.
The innovation of claim 101 is in data expansion of the original data set using a K-means clustering algorithm. In the common pose face data sets LFW and IJB-A, CFP, the poses are not uniformly distributed, and part of the poses do not exist, so that the manpower and financial resources consumed by large-scale data set acquisition cannot be measured, and the requirement of network training on equipment is very high. The invention provides a Posture Guidance Strategy (PGS), which is characterized in that five key points of a training image are obtained and mapped into a 3D model to calculate a deflection angle yaw, four optimal templates are obtained by clustering the deflection angles of all images, and the templates are acted on each image to obtain four face images with the same identity and different postures, so that the posture data expansion of an original data set is realized.
The innovation of step 102 lies in constructing a multi-image generator network using the augmented pose face images. The generator network in the traditional GAN framework takes a single image as input, which does not handle the feature differences of pose faces well. The invention proposes a multi-image generator network: using the expanded data from step 101, several pose images are input into the generator network simultaneously, which reduces the intra-class difference between poses; with a weighted-average method, the local texture information of profile faces supplements the learning of the local features of the frontal face, so that the network acquires a human-like discriminative reasoning and is better suited to real-life scenes.
The innovation of step 103 lies in the proposed self-attention dual-discriminator network. In pose face recognition tasks based on generative adversarial networks, the local edges of the synthesized image are blurred, the pose features are insensitive, and network training converges slowly. The invention constructs a self-attention dual-discriminator network framework in which the identity discriminator is responsible for identity feature computation and classification, the pose discriminator judges whether the face is frontal, and a self-attention module is embedded to extract deep feature representations and enhance the quality and local texture information of the synthesized image. As a result, the synthesized frontal face image is more realistic and the recognition effect is better.
The invention mainly addresses the problems of existing pose-invariant face recognition methods based on generative adversarial networks, namely the lack of robust feature extraction under face pose change, blurred synthesized images, and insufficient expression of face texture information, and designs a self-attention-based coupled pose face recognition method. The method fully considers the situation in real scenes where pose deflection of the face causes self-occlusion, the face cannot be recognized accurately, and many pose changes occur within a short time, and designs a pose-guided self-attention dual-discriminator network, PGDD-GAN. In addition, the geometric symmetry of the face is fully considered: to improve realism when synthesizing the frontal face from pose faces, a face symmetry loss function is defined, which effectively improves the tolerance of the network model to pose changes. The method achieves a better face recognition effect.
Drawings
FIG. 1 is a diagram of the encoder of the generator network;
fig. 2 is a flow chart of a self-attention module.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly in the following with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the method comprises the steps that based on the fact that a posture face recognition frame for generating a countermeasure network GAN is used as a basic frame, firstly, a posture yaw angle is obtained and calculated through an MTCNN, and four optimal posture templates are clustered through a K-means algorithm; then constructing a multi-image generation network structure; and finally, constructing a self-attention double-discriminant to carry out countermeasure training learning on the output image of the generated network.
The implementation process of the self-attention-based coupled pose face recognition framework provided by the embodiment of the invention is detailed as follows:
Step 1: acquire five key points of the face (the centers of the left and right eyes, the nose tip, and the left and right mouth corners) using MTCNN, align the key points with the corresponding 3D model, and compute the pose yaw angle yaw using weak perspective projection.
1.1, input the training data image into the MTCNN network for face detection and calibrate the five key points (the centers of the left and right eyes, the nose tip, and the left and right mouth corners);
1.2, map the two-dimensional coordinates (x_i, y_i) of the five key points to the corresponding 3D face model, compute the Euler angles (pitch, yaw and roll) with the weak perspective projection method, and assemble the obtained yaw angles into a one-dimensional angle vector; the computation is as follows:

[p 1]^T = f · A · [R | t_{3d}] · [P 1]^T    (1)

where f is a scale factor, A is an orthographic projection matrix, R is a 3×3 pose matrix composed of the pitch, yaw and roll angles, t_{3d} is a translation vector, and p and P denote a two-dimensional key point and the corresponding landmark in the 3D face model, respectively; the yaw angle yaw is obtained as the face pose angle by decomposing R;
1.3, cluster the yaw angle vectors obtained in step 1.2 with the K-means algorithm, divide them into four classes by computing Euclidean distances, and obtain four optimal pose templates; a minimal sketch of this clustering step is given below.
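The following is a hypothetical sketch of the PGS clustering step, assuming one yaw angle is already available per training image; the function name, the use of scikit-learn's KMeans and the stand-in angle values are illustrative only and are not taken from the original disclosure.

```python
import numpy as np
from sklearn.cluster import KMeans

def pose_templates(yaw_angles, n_templates=4, seed=0):
    """Cluster a 1-D vector of yaw angles (Euclidean K-means) into template angles."""
    yaws = np.asarray(yaw_angles, dtype=np.float64).reshape(-1, 1)
    km = KMeans(n_clusters=n_templates, n_init=10, random_state=seed).fit(yaws)
    return np.sort(km.cluster_centers_.ravel())   # four template yaw angles

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    yaws = rng.uniform(-90.0, 90.0, size=1000)     # stand-in yaw angles in degrees
    print(pose_templates(yaws))
```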
Step 2: to reduce the intra-class difference between poses and to handle surveillance recognition of a face that undergoes several pose changes in real life, a fused frontal face is generated using multi-image input, unlike the traditional single-image input of a GAN. G_enc extracts the features of each input image, the features are weighted and averaged to obtain a fusion feature, and G_dec finally synthesizes a frontal face image. The specific steps are as follows:
2.1, apply the four pose templates obtained in step 1 to a single input image to generate four face images with the same identity and different poses;
2.2, use the original image and the four pose images as input to the generator network G. The encoder G_enc performs feature extraction to obtain the features f_1, f_2, f_3, f_4, f_5. G_enc not only learns f(x) but also estimates a confidence coefficient w for each image that predicts the quality of the learned representation. To reduce the intra-class pose difference, the features are weighted and averaged to obtain the fusion feature f(x_1, x_2, x_3, x_4, x_5), computed as follows:

f(x_1, x_2, x_3, x_4, x_5) = Σ_{i=1}^{n} w_i · f(x_i) / Σ_{i=1}^{n} w_i    (2)

By learning w, high-quality images contribute more to the fused representation. The invention applies a Sigmoid activation function to constrain the range of w to [0, 1]. The fusion feature is fed to G_dec as input to generate a new image G_syn with the same identity as x_i and a frontal pose; a sketch of this fusion step is given below.
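The following is a minimal, hypothetical sketch of the confidence-weighted fusion of formula (2), assuming the encoder returns one feature vector and one scalar confidence logit per pose image; the tensor shapes, names and values are illustrative and not taken from the original disclosure.

```python
import torch

def fuse_features(features, confidence_logits):
    """Confidence-weighted average of per-image features, as in formula (2).

    features:          tensor of shape (n, d), one feature per pose image
    confidence_logits: tensor of shape (n,), raw confidences from the encoder
    """
    w = torch.sigmoid(confidence_logits)            # constrain each w_i to [0, 1]
    w = w / (w.sum() + 1e-8)                        # divide by the sum of the weights
    return (w.unsqueeze(1) * features).sum(dim=0)   # fused feature of shape (d,)

# usage with stand-in sizes: five pose images, 320-dimensional features
feats = torch.randn(5, 320)
logits = torch.randn(5)
print(fuse_features(feats, logits).shape)           # torch.Size([320])
```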
Step 3: use the network output of step 2 and the original image as the input of the self-attention dual discriminator for adversarial training to obtain the trained model. The specific steps are as follows:
Input the synthesized image G_syn into the identity discriminator D_d, which classifies the identity of the synthesized image; then input the frontal image x̂ that has the same identity as the original image x, together with G_syn, into the pose discriminator D_p for pose classification. D_d and D_p are responsible for classifying G_syn into the fake class, computed as follows:

E_{x∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{fake}(G(x, c, z)) + log D_p^{fake}(G(x, c, z)) ]

where the superscript "fake" indexes the additional fake class of each discriminator;
further, the step 2 and the step 3 construct and train a PGDD-GAN network framework, and the specific implementation steps are as follows:
A. As shown in FIG. 1, the network structure of the encoder of the generator network is, from top to bottom: the first layer is divided into two sublayers, both convolution layers with 3×3 kernels and 32 and 64 kernels respectively; the second layer is divided into three sublayers, all convolution layers with 3×3 kernels and 64, 64 and 128 kernels respectively; the third layer is divided into three sublayers, all convolution layers with 3×3 kernels and 128, 96 and 192 kernels respectively; the fourth layer is divided into three sublayers, all convolution layers with 3×3 kernels and 192, 128 and 256 kernels respectively; the fifth layer is divided into three sublayers, all convolution layers with 3×3 kernels and 256, 160 and 321 kernels respectively; the sixth layer is an average pooling layer with a 6×6 pooling window; the seventh layer is a fully connected layer with N_d + N_p + 1 neurons (N_d denotes the number of identities in the training data set and N_p the total number of discrete poses);
B. The decoder network structure is the deconvolution counterpart of the encoder network structure: the first layer is divided into three sublayers, a fully connected layer and two deconvolution layers, with 3×3 kernels and 320, 160 and 256 kernels respectively; the second layer is divided into three sublayers, all deconvolution layers with 3×3 kernels and 256, 128 and 192 kernels respectively; the third layer is divided into three sublayers, all deconvolution layers with 3×3 kernels and 192, 96 and 128 kernels respectively; the fourth layer is a self-attention layer consisting of three 1×1 convolution layers and a softmax layer with 128 kernels, whose weight parameters are continuously learned and updated so that the parts of the feature map that need attention are emphasized; the fifth layer is divided into three sublayers, all deconvolution layers with 3×3 kernels and 128, 64 and 64 kernels respectively; the sixth layer is a self-attention layer consisting of three 1×1 convolution layers and one softmax layer with 64 kernels, which again performs a weighted summation of the feature map to generate an attention feature map; the seventh layer is divided into three sublayers, all deconvolution layers with 3×3 kernels and 64, 32 and 3 kernels respectively; the image synthesized by the decoder is used as the input of the discriminator networks;
C. The network structures of the identity discriminator and the pose discriminator are similar to that of the decoder, except that a self-attention layer is added after the fourth layer and the fifth layer respectively. The identity discriminator and the pose discriminator differ in that the numbers of neurons in their seventh layers are N_d and N_p respectively. A partial sketch of the encoder stack described in item A is given below.
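For illustration, a partial PyTorch sketch of the encoder stack described in item A, covering only the first two layers (3×3 convolutions with 32/64 and 64/64/128 kernels); the activation function, normalization and any striding or downsampling are not specified above and are assumptions, as is every name in the snippet.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1):
    # 3x3 convolution; batch norm and ELU are assumed, not stated in the text above
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(c_out), nn.ELU())

encoder_head = nn.Sequential(
    conv_block(3, 32), conv_block(32, 64),                                   # layer 1: 32, 64
    conv_block(64, 64, stride=2), conv_block(64, 64), conv_block(64, 128),   # layer 2: 64, 64, 128
)
# the remaining layers would follow the same pattern with the kernel counts listed in item A
print(encoder_head(torch.randn(1, 3, 96, 96)).shape)   # torch.Size([1, 128, 48, 48])
```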
D. In the testing stage, a real face x with label y = {y_d, y_p} is input, where y_d denotes the identity label and y_p the pose label; the generator network outputs the composite image, the identity discriminator is D_d, and the pose discriminator is D_p.
E. When the neural network is trained, the cross-entropy function is used to compute the loss of the two-class (real versus synthetic) classification task, as shown in formula (3):

L_adv = − E_{x∼p_d}[ log D(x) ] − E_{x̂}[ log(1 − D(x̂)) ]    (3)

where x denotes a real face and x̂ denotes the synthesized face output by the generator.
Similarly, the identity loss and the pose loss of the two discriminators are computed with the cross-entropy function, as shown in formula (4):

L_id = − E_{x,y∼p_d}[ log D_d^{y_d}(x) ],   L_pose = − E_{x,y∼p_d}[ log D_p^{y_p}(x) ]    (4)
Given a real face x as input, the two discriminators aim to estimate the identity information and the pose information, leading to the following objective, as shown in equation (5):

max_D E_{x,y∼p_d}[ log D_d^{y_d}(x) + log D_p^{y_p}(x) ] + E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{N_d+1}(G(x, c, z)) + log D_p^{N_p+1}(G(x, c, z)) ]    (5)

The first term of equation (5) maximizes the probability that x is classified into its true identity and pose, and the second term maximizes the probability that x̂ = G(x, c, z) is classified into the fake class (indexed N_d + 1 and N_p + 1).
At the same time, the generator G consists of an encoder G_enc and a decoder G_dec. G_enc is intended to learn an identity representation from the real image x: f(x) = G_enc(x). G_dec is intended to synthesize a face image x̂ = G_dec(f(x), c, z) with the target pose code c and the identity label y_d, where z denotes random noise and c is a one-hot vector generated from the target pose y_t. The generator G aims to fool the discriminator D into classifying x̂ into the same true identity and target pose as the input x, leading to the following objective, as shown in equation (6):

max_G E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{y_d}(G(x, c, z)) + log D_p^{y_t}(G(x, c, z)) ]    (6)
g and D mutually improve the learning ability of the network in the alternate training, and strive to synthesize a face with a positive posture and real identity information. At G dec In the input of a separate gesture code c, training G enc The pose changes are separated from the feature map f (x), even if f (x) represents as much identity information as possible and as little pose information as possible. Meanwhile, in the phase of the discriminator, the identity discriminator can obtain more complete identity information, and the attitude discriminator can judge the target attitude more robustly, so that the attitude characteristic is more accurate, and the influence caused by the fusion of the identity characteristic and the attitude characteristic is reduced.
Finally, the geometric symmetry of the human face is fully exploited by imposing a symmetry constraint on the synthesized image, which effectively alleviates the self-occlusion problem and greatly improves performance under large pose changes. The following objective is therefore proposed, as shown in equation (7):

L_sym = (2 / (W × H)) Σ_{i=1}^{W/2} Σ_{j=1}^{H} | x̂_{i,j} − x̂_{W−(i−1), j} |    (7)

For convenience of calculation, the input picture is selectively flipped so that the occluded parts all lie on the right; W and H denote the width and height of the composite image, respectively. A sketch of this symmetry loss is given below.
F. In the present example, the CMU Multi-PIE dataset is used as the training set; it is the largest dataset for evaluating face synthesis and recognition. 337 subjects with neutral expression, 13 poses (±90°), and 20 illuminations from the Multi-PIE setup are used for training and testing. The first 200 subjects are used as the training set and the remaining 137 subjects as the test set.
Further, the self-attention module shown in fig. 2 is implemented as follows:
A. The original image is passed through several convolution layers to obtain a feature map x; x is then passed through the convolution layers f, g and h to obtain f'(x), g'(x) and h'(x). f, g and h are all 1×1 convolutions and differ only in the size of their output channels, which are (C/8, N), (C/8, N) and (C, N) respectively, where C denotes the number of feature channels and N = W × H;
B. f'(x) is transposed and multiplied by g'(x) to obtain an output matrix S of size (N, N); the rows of S are normalized by softmax to obtain the matrix β, each row of which represents an attention map, as shown in formula (8):

β_{i,j} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}),   with s_{ij} = f(x_i)^T g(x_j)    (8)
C. The N row vectors are multiplied pixel-wise with h'(x), i.e. every pixel is related to the whole feature map, and the resulting N new pixels form the attention feature map O, as shown in formula (9):

O_j = Σ_{i=1}^{N} β_{i,j} · h(x_i),   with h(x_i) = W_h · x_i    (9)
Finally, the attention feature map O and the feature map x are fused, as shown in formula (10); a PyTorch sketch of the whole module follows the formula:
y=γO+x (10)
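The following is a minimal, hypothetical PyTorch sketch of the self-attention module of FIG. 2, following steps A-C and formulas (8)-(10); the C/8 channel reduction, the 1×1 convolutions and the learnable scalar γ initialized to 0 match the description above, while the class name, tensor shapes and everything else are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Self-attention over a feature map x of shape (B, C, H, W)."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // 8, kernel_size=1)  # output (C/8, N)
        self.g = nn.Conv2d(channels, channels // 8, kernel_size=1)  # output (C/8, N)
        self.h = nn.Conv2d(channels, channels, kernel_size=1)       # output (C,   N)
        self.gamma = nn.Parameter(torch.zeros(1))                   # learnable scalar, initialized to 0

    def forward(self, x):
        b, c, hgt, wid = x.shape
        n = hgt * wid
        fx = self.f(x).view(b, -1, n)                 # (B, C/8, N)
        gx = self.g(x).view(b, -1, n)                 # (B, C/8, N)
        hx = self.h(x).view(b, c, n)                  # (B, C,   N)
        s = torch.bmm(fx.transpose(1, 2), gx)         # (B, N, N), s_ij = f(x_i)^T g(x_j)
        beta = F.softmax(s, dim=1)                    # attention maps, formula (8)
        o = torch.bmm(hx, beta).view(b, c, hgt, wid)  # attention feature map O, formula (9)
        return self.gamma * o + x                     # y = gamma * O + x, formula (10)

# usage with stand-in sizes
attn = SelfAttention(128)
print(attn(torch.randn(2, 128, 12, 12)).shape)        # torch.Size([2, 128, 12, 12])
```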
according to the invention, the attitude data set is subjected to data expansion through an attitude guide strategy, so that the labor cost required by data acquisition is reduced, the requirement on the quantity of training data sets is reduced, then a multi-image generation network is formed by utilizing the generated attitude template, the intra-class difference among the attitudes is reduced, the applicability of attitude diversity recognition in actual life is improved, and finally, the synthesized image is effectively classified through a self-attention double-discriminator network, so that the texture information of the synthesized image is enhanced, and the identity information of the original image is better preserved. Compared with other posture face recognition methods, the method effectively improves the performance of front face synthesis under the condition of adopting a conventional data set, thereby improving the face recognition precision and saving the labor cost and the network computing cost.
The methods, systems, apparatuses, modules or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure in any way whatsoever. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (1)

1. A coupled pose face recognition method based on self-attention, characterized by comprising the following steps:
101. Input the training data images into a multitask convolutional neural network for face detection and calibration of five feature points; compute the pose yaw angles corresponding to the feature points through a three-dimensional morphable model; then apply the proposed pose guidance strategy (PGS) based on the K-means algorithm: cluster the pose yaw angles of all images to obtain four optimal pose yaw angles, apply them to the three-dimensional morphable model, and generate four pose templates;
102. Using the pose guidance strategy PGS constructed in step 101, obtain four pose faces from the target image; input the target image and the four pose faces into the generator network simultaneously, extract features from the multiple pose face images with the encoding network, compute a weighted average of the features to obtain a fusion feature, and restore the fusion feature to a frontal face image with the decoding network;
103. Adopt a disentangled representation learning method and use a dual-discriminator network to perform discriminative learning on the synthesized image to obtain the pose face recognition result;
the step 101 specifically includes the following steps:
Step 1: acquire five key points of the face using MTCNN (multi-task cascaded convolutional neural network) and align them with the corresponding 3D model; the five key points comprise the centers of the left and right eyes, the nose tip, and the left and right mouth corners; compute the pose yaw angle yaw using weak perspective projection;
1.1, input the training data image into the MTCNN network for face detection and calibrate the five key points;
1.2, map the two-dimensional coordinates (x_i, y_i) of the five key points to the corresponding 3D face model, compute the Euler angles (the pitch angle pitch, the yaw angle yaw and the roll angle roll) with the weak perspective projection method, and assemble the obtained yaw angles into a one-dimensional angle vector, computed as follows:

[p 1]^T = f · A · [R | t_{3d}] · [P 1]^T    (1)

where f is a scale factor, A is an orthographic projection matrix, R is a 3×3 pose matrix composed of the pitch, yaw and roll angles, and t_{3d} is a translation vector; p and P denote a two-dimensional key point and the corresponding landmark in the 3D face model, respectively; the yaw angle yaw is obtained as the face pose angle by decomposing R;
1.3, cluster the yaw angle vectors obtained in step 1.2 with the K-means algorithm and divide them into four classes by computing Euclidean distances to obtain four optimal pose templates;
the step 102 specifically includes the following steps:
A2, apply the four pose templates obtained in step 101 to a single input image and, with the assistance of the 3D morphable model (3DMM), generate four face images with the same identity and different poses;
B2, use the original image and the four pose images as input to the generator network G. The encoder G_enc extracts features to obtain f_1, f_2, f_3, f_4, f_5. G_enc not only learns f(x) but also estimates a confidence coefficient w for each image that predicts the quality of the learned representation; to reduce the intra-class pose difference, the features are weighted and averaged to obtain the fusion feature f(x_1, x_2, x_3, x_4, x_5), which is fed to G_dec as input to generate a new image G_syn with the same identity as x_i and a frontal pose; the fusion feature f(x_1, x_2, x_3, x_4, x_5) of step B2 is computed as follows:

f(x_1, x_2, x_3, x_4, x_5) = Σ_{i=1}^{n} w_i · f(x_i) / Σ_{i=1}^{n} w_i    (2)

where w_i is the weight parameter corresponding to each feature and n denotes the number of features (n = 5); w is learned, and a Sigmoid activation function constrains its range to [0, 1];
C2, input the synthesized image G_syn into the identity discriminator D_d, which classifies the identity of the synthesized image; then input the frontal image x̂ that has the same identity as the original image x, together with G_syn, into the pose discriminator D_p for pose classification; D_d and D_p are responsible for judging G_syn to be the fake class; meanwhile, a self-attention encoder and a self-attention discriminator are designed: the self-attention encoder makes the extracted features more realistic and discriminative, while the self-attention discriminator is responsible for judging the extracted features to be the fake class and provides more robust features for the learning of the generator;
the design steps of the step C2 self-attention discriminator and the self-attention encoder are as follows:
A3, the original image is passed through several convolution layers to obtain a feature map x; x is passed through the convolution layers f, g and h to obtain f'(x), g'(x) and h'(x); f, g and h are all 1×1 convolutions and differ only in the size of their output channels, which are (C/8, N), (C/8, N) and (C, N) respectively, where C denotes the number of feature channels, N = W × H, and W and H denote the width and height of the feature map;
B3, f'(x) is transposed and multiplied by g'(x) to obtain an output matrix S of size (N, N); the rows of S are normalized by softmax to obtain the matrix β, each row of which represents an attention map, as shown in formula (3):

β_{i,j} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}),   with s_{ij} = f(x_i)^T g(x_j)    (3)

where β_{i,j} indicates the degree of attention paid to the i-th position when synthesizing the j-th region, s_{ij} denotes the value at the i-th position of the j-th region of the output matrix, and f(x_i) and g(x_j) denote the two feature spaces obtained from x after the initial transformation;
C3, the N row vectors are multiplied pixel-wise with h'(x), i.e. every pixel is related to the whole feature map, and the resulting N new pixels form the attention feature map O, as shown in formula (4):

O_j = Σ_{i=1}^{N} β_{i,j} · h(x_i),   with h(x_i) = W_h · x_i    (4)

where W_h denotes the learned weight matrix, implemented as a 1×1 convolution, and h(x_i) denotes the feature space of x after the initial transformation;
and finally, performing feature fusion on the attention feature map O and the feature map x, as shown in formula (5):
y=γO+x (5)
where γ represents a learnable scalar, initialized to 0;
the step 103 specifically includes the following steps:
First, the face image synthesized in step 102 is input into the identity discriminator to compute the identity loss:

L_id = −E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{y_d}(G(x, c, z)) ]

where p_c(c) denotes the pose code distribution of the original image and G(x, c, z) denotes the composite image; then the synthesized face and the real frontal face are input into the pose discriminator to compute the face pose loss:

L_pose = −E_{x,y∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_p^{y_f}(G(x, c, z)) ]

where y_f denotes the frontal pose label;
Finally, following the game mechanism of the generative adversarial network, the generator tries to synthesize an image that deceives the discriminator D while the discriminator tries to classify the synthesized image into the fake class; the two compete against each other and a realistic frontal face image is synthesized:

min_G max_D E_{x,y∼p_d}[ log D_d^{y_d}(x) + log D_p^{y_p}(x) ] + E_{x∼p_d, z∼p_z(z), c∼p_c(c)}[ log D_d^{fake}(G(x, c, z)) + log D_p^{fake}(G(x, c, z)) ]
In step 103, during the discriminative learning of the synthesized image by the dual-discriminator network, a self-attention module is also embedded in the generator and the discriminators, and local texture information is enhanced by assigning weights to the feature channels.
CN202011308968.2A 2020-11-20 2020-11-20 Coupled posture face recognition method based on self-attention Active CN112418074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011308968.2A CN112418074B (en) 2020-11-20 2020-11-20 Coupled posture face recognition method based on self-attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011308968.2A CN112418074B (en) 2020-11-20 2020-11-20 Coupled posture face recognition method based on self-attention

Publications (2)

Publication Number Publication Date
CN112418074A CN112418074A (en) 2021-02-26
CN112418074B true CN112418074B (en) 2022-08-23

Family

ID=74774075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011308968.2A Active CN112418074B (en) 2020-11-20 2020-11-20 Coupled posture face recognition method based on self-attention

Country Status (1)

Country Link
CN (1) CN112418074B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801069B (en) * 2021-04-14 2021-06-29 四川翼飞视科技有限公司 Face key feature point detection device, method and storage medium
CN113408351B (en) * 2021-05-18 2022-11-29 河南大学 Pedestrian re-recognition method for generating confrontation network based on attitude guidance
CN113222032B (en) * 2021-05-19 2023-03-10 西安电子科技大学 No-reference image quality evaluation method based on self-attention image coding
CN113674334B (en) * 2021-07-06 2023-04-18 复旦大学 Texture recognition method based on depth self-attention network and local feature coding
CN113553961B (en) * 2021-07-27 2023-09-05 北京京东尚科信息技术有限公司 Training method and device of face recognition model, electronic equipment and storage medium
US11803996B2 (en) 2021-07-30 2023-10-31 Lemon Inc. Neural network architecture for face tracking
CN113705358B (en) * 2021-08-02 2023-07-18 山西警察学院 Multi-angle side face normalization method based on feature mapping
CN113706404B (en) * 2021-08-06 2023-11-21 武汉大学 Depression angle face image correction method and system based on self-attention mechanism
CN114022930B (en) * 2021-10-28 2024-04-16 天津大学 Automatic generation method of portrait credentials
CN114330565A (en) * 2021-12-31 2022-04-12 深圳集智数字科技有限公司 Face recognition method and device
CN114005169B (en) * 2021-12-31 2022-03-22 中科视语(北京)科技有限公司 Face key point detection method and device, electronic equipment and storage medium
CN114360032B (en) * 2022-03-17 2022-07-12 北京启醒科技有限公司 Polymorphic invariance face recognition method and system
CN115083000B (en) * 2022-07-14 2023-09-05 北京百度网讯科技有限公司 Face model training method, face changing method, face model training device and electronic equipment
CN116152885B (en) * 2022-12-02 2023-08-01 南昌大学 Cross-modal heterogeneous face recognition and prototype restoration method based on feature decoupling
CN115862120B (en) * 2023-02-21 2023-11-10 天度(厦门)科技股份有限公司 Face action unit identification method and equipment capable of decoupling separable variation from encoder

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506717A (en) * 2017-08-17 2017-12-22 南京东方网信网络科技有限公司 Without the face identification method based on depth conversion study in constraint scene
CN108038474A (en) * 2017-12-28 2018-05-15 深圳云天励飞技术有限公司 Method for detecting human face, the training method of convolutional neural networks parameter, device and medium
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism
CN111738940A (en) * 2020-06-02 2020-10-02 大连理工大学 Human face image eye completing method for generating confrontation network based on self-attention mechanism model
CN111796681A (en) * 2020-07-07 2020-10-20 重庆邮电大学 Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8831299B2 (en) * 2007-05-22 2014-09-09 Intellectual Ventures Fund 83 Llc Capturing data for individual physiological monitoring
US10685262B2 (en) * 2015-03-20 2020-06-16 Intel Corporation Object recognition based on boosting binary convolutional neural network features

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506717A (en) * 2017-08-17 2017-12-22 南京东方网信网络科技有限公司 Without the face identification method based on depth conversion study in constraint scene
CN108038474A (en) * 2017-12-28 2018-05-15 深圳云天励飞技术有限公司 Method for detecting human face, the training method of convolutional neural networks parameter, device and medium
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism
CN111738940A (en) * 2020-06-02 2020-10-02 大连理工大学 Human face image eye completing method for generating confrontation network based on self-attention mechanism model
CN111796681A (en) * 2020-07-07 2020-10-20 重庆邮电大学 Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XueMei Zhao. A real-time face recognition system based on the improved LBPH algorithm. 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP), 2017. *
Chen Guanhao. Applied research on deep face feature extraction and recognition. Information Science and Technology Series, 2018. *

Also Published As

Publication number Publication date
CN112418074A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112418074B (en) Coupled posture face recognition method based on self-attention
Zhu et al. Data Augmentation using Conditional Generative Adversarial Networks for Leaf Counting in Arabidopsis Plants.
CN110348330B (en) Face pose virtual view generation method based on VAE-ACGAN
CN101320484B (en) Three-dimensional human face recognition method based on human face full-automatic positioning
CN101561874B (en) Method for recognizing face images
Holte et al. A local 3-D motion descriptor for multi-view human action recognition from 4-D spatio-temporal interest points
CN100410963C (en) Two-dimensional linear discrimination human face analysis identificating method based on interblock correlation
Bongsoo Choy et al. Enriching object detection with 2d-3d registration and continuous viewpoint estimation
CN107784284B (en) Face recognition method and system
CN113870157A (en) SAR image synthesis method based on cycleGAN
Wang et al. Joint head pose and facial landmark regression from depth images
KR20130059212A (en) Robust face recognition method through statistical learning of local features
CN105740838A (en) Recognition method in allusion to facial images with different dimensions
CN111680579A (en) Remote sensing image classification method for adaptive weight multi-view metric learning
Fu et al. Personality trait detection based on ASM localization and deep learning
Sun et al. Perceptual multi-channel visual feature fusion for scene categorization
CN109284692A (en) Merge the face identification method of EM algorithm and probability two dimension CCA
Weiss et al. Representation of similarity as a goal of early visual processing
Abdelaziz et al. Few-shot learning with saliency maps as additional visual information
Zhang et al. Vision-Based Satellite Recognition and Pose Estimation Using Gaussian Process Regression
Asad et al. Low complexity hybrid holistic–landmark based approach for face recognition
Nouri et al. Global visual saliency: Geometric and colorimetrie saliency fusion and its applications for 3D colored meshes
CN113344110A (en) Fuzzy image classification method based on super-resolution reconstruction
Shahin et al. Human Face Recognition from Part of a Facial Image based on Image Stitching
Gao et al. Boosting Pseudo Census Transform Features for Face Alignment.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240318

Address after: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee after: Shenzhen Hongyue Information Technology Co.,Ltd.

Country or region after: China

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240331

Address after: Floor 5-7, Block A, Building 1, No. 166 Wuxing Fourth Road, Wuhou District, Chengdu City, Sichuan Province, 610045

Patentee after: SICHUAN GERITE TECHNOLOGY Co.,Ltd.

Country or region after: China

Address before: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee before: Shenzhen Hongyue Information Technology Co.,Ltd.

Country or region before: China