CN112800937A - Intelligent face recognition method - Google Patents

Intelligent face recognition method

Info

Publication number
CN112800937A
CN112800937A (application CN202110101590.7A; granted publication CN112800937B)
Authority
CN
China
Prior art keywords
face
picture
posture
identity
expression
Prior art date
Legal status
Granted
Application number
CN202110101590.7A
Other languages
Chinese (zh)
Other versions
CN112800937B (en)
Inventor
李弘
肖南峰
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110101590.7A
Publication of CN112800937A
Application granted
Publication of CN112800937B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Hardware Design (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an intelligent face recognition method comprising the following steps: 1) face detection: cropping, from an original picture, a source-pose face picture whose main content is the face; 2) face alignment: identifying and locating the face key points in the source-pose face picture; 3) face pose rotation: according to the source-pose face picture and a selected pose, generating a target-pose face picture while keeping the identity and expression information of the source picture; 4) facial expression and identity recognition: judging the expression and identity of the face in the picture by combining the source-pose and target-pose face pictures. The invention provides an end-to-end recognition method built on three innovations: an attention mechanism, a generative adversarial network, and integrated learning. It breaks through the limitation of extreme poses, uses the synthesized frontal image for unconstrained face identity and expression recognition, improves accuracy and robustness, and has broad application prospects in the field of face recognition.

Description

Intelligent face recognition method
Technical Field
The invention relates to the technical field of face recognition, in particular to an intelligent face recognition method.
Background
Vision tasks related to the human face are an important field of computer vision applications and have made tremendous progress with the help of deep learning. However, in real application scenarios the performance of visual algorithms is severely restricted by complex factors such as multiple view angles, expressions, illumination and occlusion, among which pose change degrades performance most seriously. The "recognize after rotation" strategy, i.e., rotating the face to the front before recognizing it, is one of the mainstream means of solving the face pose problem. Referring to fig. 1, the general flow of "recognize after rotation" can be summarized as face detection, face alignment, face pose rotation, and face recognition.
Face detection: a partial picture whose main content is the human face is cropped from the original picture and fed to the subsequent stages. The current mainstream face detection pipeline works from coarse to fine: the whole image is sliced with different window sizes and step lengths, networks of different depths judge whether each slice contains a face image and correct the bounding-box localization, and finally several image regions most likely to contain a face are obtained.
Face alignment: identifying and locating the key points of the human face. Face key points are feature points predefined on the face picture, mainly located around or at the centers of facial components such as the five sense organs and the facial contour. The common 68-key-point labeling scheme is shown in fig. 2.
Face pose rotation: given a face picture in an arbitrary pose, keep its identity and expression information and convert it to generate visually realistic pictures in other poses. In the current literature, most work addresses frontalization in the horizontal direction, i.e., given a non-frontal face picture, generate a frontal-pose picture. Applications include 3D face modeling when data sources are scarce; correcting, during photo editing, faces in a group photo that do not look at the lens into direct-view faces; and face synthesis in virtual and augmented reality. Face rotation follows 2 main technical routes: the 2D route directly converts the source-pose face picture into the target-pose face picture, while the 3D route constructs a 3D model from the source-pose face picture, rotates it to the target pose, projects it onto a 2D plane and renders the final picture. The present invention adopts the 2D strategy.
Face recognition: a broad term whose detailed application categories include face authentication, identity recognition, attribute recognition, expression recognition, and the like. Identity recognition has two main application forms: identity query, where a face to be tested and a face database of a certain scale are given and the identity number of the face is identified; and identity authentication, where a face to be tested and a comparison face are given and it is judged whether the two faces share the same identity.
Expression recognition: descriptions of expression are generally divided into discrete labels, expression action units, and continuous expression spaces. For simplicity and practicality, the present invention adopts the discrete-label approach. The 7 basic expression categories are "fear", "anger", "disgust", "happiness", "neutral", "sadness" and "surprise", see fig. 3.
Face recognition has been widely applied to many aspects of social life, such as passes and payment authentication based on face verification, human-machine emotion interaction based on identity and expression recognition, public management monitoring, and driver monitoring. However, in real application environments a large number of face recognition tasks face unconstrained conditions such as changing poses, expressions, illumination and occlusion; extreme poses in particular degrade the performance of face recognition systems most severely.
In early research on extreme-pose face recognition, Liu et al. trained multiple mutually independent sub-networks for extracting bottom-layer features based on discrete pose labels, adopting a simple attention sub-channel strategy for feature extraction, so computational efficiency and flexibility were weak.
At present, few studies address the retention and recognition of expression features during face normalization. Existing rotate-then-recognize algorithms either have obvious defects in the visual quality of the generated pictures, cannot restore strong expression actions, or lack modeling of pose change in the vertical direction; Luan et al. even propose eliminating expression information as interference during rotation, regressing faces with various expressions into frontal neutral-expression faces. There remains great room for progress in the field of face rotation and recognition.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an intelligent face recognition method, which can effectively process and recognize face images in extreme postures in various practical application scenes and expand the application range of a face recognition system.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows: an intelligent face recognition method comprising the following steps (a high-level sketch in code follows the list):
1) face detection: cropping from an original picture a source-pose face picture whose main content is the face;
2) face alignment: identifying and locating the face key points in the source-pose face picture;
3) face pose rotation: according to the source-pose face picture and a selected pose, keeping the identity and expression information of the source-pose face picture and converting it to generate a visually realistic target-pose face picture;
4) facial expression and identity recognition: judging the expression and identity of the face in the picture by combining the source-pose and target-pose face pictures.
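The four stages chain into one pipeline. The following Python sketch shows the data flow only; all module names (detector, aligner, generator, recognizer and their methods) are hypothetical placeholders for the networks described below, not an implementation of them.

```python
# Hypothetical high-level pipeline sketch; module names are placeholders,
# not real library APIs. Each stage mirrors steps 1)-4) above.

def recognize_face(original_image, detector, aligner, generator, recognizer):
    # 1) Face detection: crop the source-pose face picture from the original.
    source_face = detector.crop_face(original_image)

    # 2) Face alignment: locate key points and build the two heat maps
    #    (key-point heat map, radius 3; attention heat map, radius 25).
    keypoints = aligner.locate(source_face)
    kp_heatmap, attn_heatmap = aligner.heatmaps(keypoints)

    # 3) Face pose rotation: synthesize the target-pose picture while
    #    preserving identity and expression information.
    target_face = generator.rotate(source_face, kp_heatmap, attn_heatmap)

    # 4) Expression and identity recognition, combining both pictures.
    expression = recognizer.expression(source_face, target_face)
    identity = recognizer.identity(source_face, target_face)
    return expression, identity
```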
In step 1), a multitask convolutional neural network (MTCNN) is adopted to preprocess the face data, cropping from original pictures in various practical application fields the local pictures whose main content is the face.
In step 2), a neural network based on an adaptive loss function (AWing) detects face pictures in various poses and determines the positions of the face key points, from which heat maps with Gaussian divergence around the key points are generated. The heat maps are of two kinds: key-point heat maps with a divergence radius of 3 and attention heat maps with a divergence radius of 25. The key-point heat maps provide pose guidance for the subsequent steps; the attention heat maps are used for image detail enhancement in the subsequent steps.
In step 3), a generative adversarial network with a fused attention mechanism (AFGAN) is provided; according to the source pose and the selected pose, it keeps the identity and expression information of the face picture and converts it to generate a visually realistic target-pose face picture.
The network structure of AFGAN is based on CGAN and comprises 1 generator G, 2 discriminators D_ii and D_ih, and an identity feature extractor D_ip. The generator G takes the source-pose face picture and the target-pose key-point heat map as input, uses the key-point heat map as pose condition information, and outputs the synthesized target-pose picture. The discriminator D_ii takes the source-pose face picture and a real or synthesized target-pose face picture as input; the discriminator D_ih takes a real or synthesized target-pose face picture and the target-pose key-point heat map; both output an authenticity label and an expression label. The structures of the discriminators D_ii and D_ih are basically consistent. The identity feature extractor D_ip is a LightCNN model, and the identity feature vectors it extracts are used to maintain identity-information consistency before and after rotation and for the identity recognition task.
The loss functions used to train the generator G in AFGAN comprise a conditional adversarial loss, a total variation loss, an identity-preserving loss, an expression recognition loss and a multi-scale pixel-value loss. The conditional adversarial loss ensures the realism of the synthesized picture, the total variation loss suppresses jagged distortion of the synthesized picture, the identity-preserving loss maintains identity-information consistency, the expression recognition loss maintains expression-information consistency, and the multi-scale pixel-value loss accelerates training convergence.
the method comprises the steps of using a generator G in the AFGAN to realize the rotation of the face pose and the picture synthesis, multiplying a source pose face picture by a source pose attention heat map based on an attention mechanism of a face key point to generate an attention subgraph, wherein the formal definition is as follows:
x=Is+Ht,x1=(Is*Hs)+Ht
wherein, IsAs a source pose, HsAs a source pose attention heat map, HtRepresenting element series multiplication operation for the key point heat map of the target attitude, + representing the connection operation of the matrix, x being the main channel input, x1An attention subchannel input;
the generator G respectively processes the input of the main channel and the attention sub-channel, and performs connection fusion on the characteristic output of the two channels, wherein the formalization definition is as follows:
G2(x,x1)=Linear(G1(x)+G1(x1))
wherein G is2(x,x1) To generate input features of the penultimate layer of G, G1(x) Output characteristic of the penultimate layer of the main channel, G1(x1) For feature output of the penultimate layer of the attention subchannel, Linear () represents a Linear combination;
the final feature after fusion is processed by the last layer of the generator G to obtain a synthetic target pose picture, formally defined as:
It=G3(G2(x,x1))
wherein, ItFor the resultant target pose face picture, G3() To the output of the generator G.
Further, during pose rotation AFGAN uses the identity recognition network LightCNN as the identity feature extractor D_ip to extract identity feature vectors from the synthesized and the real target-pose pictures respectively, penalizes the error between them through the identity-preserving loss function, and maintains the consistency of face identity information. Expression recognition learning is integrated in the subsequent discriminators: the expression recognition loss function forces the pictures synthesized by the generator G to exhibit the same expression characteristics as real pictures, maintaining the before-and-after consistency of facial expression information.
In step 4), the generative adversarial network AFGAN with the fused attention mechanism is used to recognize the facial expression, adopting the face-key-point-based attention mechanism during recognition.
The network structure of AFGAN is based on CGAN and comprises 1 generator G, 2 discriminators D_ii and D_ih, and an identity feature extractor D_ip. The generator G takes the source-pose face picture and the target-pose key-point heat map as input, uses the key-point heat map as pose condition information, and outputs the synthesized target-pose picture. The discriminator D_ii takes the source-pose face picture and a real or synthesized target-pose face picture as input; the discriminator D_ih takes a real or synthesized target-pose face picture and the target-pose key-point heat map; both output an authenticity label and an expression label. The structures of the discriminators D_ii and D_ih are basically consistent. The identity feature extractor D_ip is a LightCNN model, and the identity feature vectors it extracts are used to maintain identity-information consistency before and after rotation and for the identity recognition task.
The loss functions used to train the discriminators D_ii and D_ih in AFGAN comprise the conditional adversarial loss and the expression recognition loss. The conditional adversarial loss ensures the discriminators can distinguish real from synthesized face pictures, and the expression recognition loss ensures the discriminators can recognize facial expressions.
The 2 discriminators D_ii and D_ih in AFGAN check the realism of the synthesized picture and recognize the facial expression. Based on the face-key-point attention mechanism, the face picture to be recognized is multiplied element-wise by the corresponding attention heat maps to generate attention sub-pictures, formally defined as:

x = I_t + H_t,  x_1 = (I_s * H_s1) + H_t1,  x_2 = (I_s * H_s2) + H_t2

where I_t is the face picture to be recognized, i.e., a real or synthesized target-pose picture; H_s1 and H_s2 are the eye and mouth attention heat maps of the source pose respectively; H_t1 and H_t2 are the eye and mouth key-point heat maps of the target pose respectively; * denotes element-wise multiplication; + denotes matrix concatenation; and x, x_1, x_2 are the main-channel, eye-attention-sub-channel and mouth-attention-sub-channel inputs respectively.

The discriminator D_ih processes the main-channel input and the two attention-sub-channel inputs separately, then concatenates and fuses the feature outputs of the three channels, formally defined as:

D_2(x, x_1, x_2) = Linear(D_1(x) + D_1(x_1) + D_1(x_2))

where D_2(x, x_1, x_2) is the final output of the common feature-extraction stage of D_ih; D_1(x), D_1(x_1) and D_1(x_2) are the common feature outputs of the main channel, eye attention sub-channel and mouth attention sub-channel respectively; and Linear() denotes a linear combination.

The fused common features enter the two output branches of the discriminator D_ih respectively, yielding the realism feature matrix and the expression-prediction feature vector of the face picture to be recognized, formally defined as:

Exp = D_3(D_2(x, x_1, x_2)),  Gan = D_4(D_2(x, x_1, x_2))

where D_3() is the expression output branch, Exp is the expression-prediction feature vector, D_4() is the picture-realism output branch, and Gan is the picture-realism feature matrix.

The discriminator D_ii is processed the same way as D_ih, except that H_t is replaced by the source-pose face picture I_s; everything else stays consistent.
Further, a "recognize after rotation" strategy is adopted: the face is recognized after rotation to the front, and the identity recognition network LightCNN directly recognizes the face identity by combining the source-pose face picture with the synthesized frontal face picture.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention pioneers an attention mechanism based on face key points. Academia and industry have conducted extensive and intensive research on attention mechanisms in artificial intelligence, particularly in computer vision. Previous results mainly rely on the neural network self-learning an attention heat map, which is effective but lacks interpretability and stability, or simply crop the picture around fixed facial-feature regions, implicitly increasing the attention weight of those regions but lacking flexibility and adaptivity. The AFGAN method combines the reliable face key points extracted by the upstream networks MTCNN and AWing to generate attention heat maps around the key-point regions, giving higher interpretability and effectiveness.
2. The invention is the first to fuse the face-key-point attention mechanism with a generative adversarial network. It makes full use of the strong learning and reasoning ability of the generative adversarial network, modeling reliable face structure migration and reconstruction to ensure the overall realism of the image; it constructs the face-key-point attention mechanism to improve feature extraction and image reconstruction quality in the local key-point regions, obtaining synthesized pictures with more realistic and complete details. The advantages of the two complement each other.
3. The invention is not limited to face frontalization: it can rotate a face picture of any pose into a specified pose while keeping identity and expression information consistent. The face rotation part of the invention is therefore not limited to improving face recognition accuracy and can be applied to other fields related to face images.
4. In the expression recognition process, the invention also introduces the face-key-point attention mechanism, extracting and analyzing more efficiently the image regions closely related to facial expression. The "recognize after rotation" strategy improves the recognition accuracy of facial expression pictures in all poses.
5. In the identity recognition process, the invention adopts the "recognize after rotation" strategy, combining the source-pose face picture with the synthesized frontal face picture and improving the identity recognition accuracy in all poses.
6. The invention pioneers the integrated learning of face rotation, expression recognition and identity recognition. Conventional research treats face rotation, identity recognition and expression recognition as mutually independent machine learning tasks and develops different neural network models for each. In contrast, the invention proposes integrated learning that discovers and utilizes the common features and expression rules among the three tasks: identity and expression consistency is maintained during face rotation, and the bottom-layer features obtained during rotation are effectively reused during identity and expression recognition. Compared with learning the three tasks independently, this saves model capacity, increases computation speed, and improves both synthesis and recognition.
7. The invention breaks through the limitations that pose and expression impose on face recognition systems, broadens the usable range of face recognition applications under large poses and large expressions, effectively utilizes face image data that were previously hard to use, and benefits many fields such as social public safety monitoring, driving monitoring, teaching monitoring, pass verification, and human-computer interaction of service robots.
Drawings
Fig. 1 is an overall flow of face rotation and recognition.
Fig. 2 is a labeled diagram of 68 key points of a human face.
Fig. 3 shows 7 basic expressions of human face.
FIG. 4 is a schematic diagram of the structure of AFGAN.
Fig. 5 is a schematic diagram of an attention mechanism based on face key points.
FIG. 6 is a diagram of the visual effect of AFGAN application.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
The embodiment discloses an intelligent face recognition method; the specific implementation is as follows:
1) face detection: and intercepting a source posture face picture taking the face of the person as the main content from the original picture.
The face detection adopts a multitask convolutional neural network (MTCNN) to carry out preprocessing of face data, and partial pictures taking the face as main contents are intercepted from original pictures in various practical application fields.
MTCNN combines the face-region detection task and the arbitrary-pose key-point alignment task into one integrated learning task, sequentially learning image pyramids of different resolutions with cascaded CNNs. The model takes an original picture as input and outputs face-region bounding boxes and 5 corresponding key points.
First, an image pyramid of different resolutions is constructed over the whole picture, and the three-stage networks P-Net, R-Net and O-Net sequentially perform three tasks (i.e., three output branches) on picture slices of different resolutions: judging whether a slice is a face (binary classification), face-bounding-box regression (adjusting the frame position), and face key-point alignment (performed on O-Net). The result of each stage serves as input to the next-stage network.
Thus MTCNN first uses a small network to rapidly extract a large number of face candidates on low-resolution pictures, then uses larger networks to screen the extracted candidates, finally obtaining a fine result.
The MTCNN objective function consists of the loss terms of the three tasks. The face-region classification loss is a binary cross-entropy loss judging whether a picture region contains a face image; the bounding-box regression loss computes the squared Euclidean distance between the vertices of each candidate window and the closest real face region; and the face key-point localization loss similarly computes the squared Euclidean distances between the 5 predicted landmark points and the real key points.
Considering the different sample types (for example, a non-face region needs neither bounding-box regression nor landmark alignment), different networks use different task weights; if a sample does not involve a task, that task's label for the sample is set to 0, otherwise 1. During training, a hard-sample mining strategy identifies hard samples and gives them higher penalty weight: the samples with the top 70% loss values in each training batch are recorded as hard samples, and only the gradients they produce are back-propagated, as sketched below.
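A minimal PyTorch sketch of this mining step, assuming the per-sample losses have already been computed without reduction; the task weights in the usage comment are illustrative, not the patent's values.

```python
import torch

def hard_sample_loss(per_sample_losses: torch.Tensor, keep_ratio: float = 0.7):
    """Online hard-sample mining as described: keep the samples with the
    top 70% loss values in the batch and back-propagate only their gradients.
    `per_sample_losses` is a 1-D tensor of unreduced losses."""
    k = max(1, int(keep_ratio * per_sample_losses.numel()))
    hard_losses, _ = torch.topk(per_sample_losses, k)  # largest k losses
    return hard_losses.mean()

# Usage sketch: combine the three MTCNN task losses with per-network weights
# (cls_loss, box_loss, lmk_loss are unreduced per-sample loss tensors):
# total = hard_sample_loss(1.0 * cls_loss + 0.5 * box_loss + 0.5 * lmk_loss)
```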
2) Face alignment: identifying and locating the face key points in the source-pose face picture.
Face alignment adopts a neural network based on an adaptive loss function (AWing) to detect face pictures in various poses and determine the positions of the face key points, from which heat maps diverging around the key points are generated. The heat maps are of two kinds: key-point heat maps with a divergence radius of 3 and attention heat maps with a divergence radius of 25. The key-point heat maps provide pose guidance for the subsequent steps; the attention heat maps are used for image detail enhancement in the subsequent steps. A sketch of the heat-map construction follows.
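A minimal NumPy sketch of drawing the two kinds of heat maps; treating the divergence radius as the Gaussian spread is an assumption, since the patent does not specify the exact kernel.

```python
import numpy as np

def gaussian_heatmap(h, w, points, radius):
    """Draw a heat map that diverges with a Gaussian around each key point.
    `points` is an iterable of (x, y); `radius` controls the spread
    (3 for key-point heat maps, 25 for attention heat maps)."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for (px, py) in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * radius ** 2))
        heat = np.maximum(heat, g)  # keep the strongest response per pixel
    return heat

# keypoint_map = gaussian_heatmap(128, 128, landmarks, radius=3)
# attention_map = gaussian_heatmap(128, 128, landmarks, radius=25)
```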
AWing aims to solve face alignment in complex environments. The mainstream loss function currently used for face key-point alignment is the mean squared error (MSE), which over-tolerates tiny errors and yields blurry predicted key-point heat maps. AWing takes an original picture as input and outputs predicted key-point heat maps; the supervision signal is the heat map drawn by Gaussian divergence from the true key points (high pixel values near a key point are foreground, low pixel values far from it are background).
The network body is a stacked Hourglass (HG) model, similar to a nested residual model: the down-sampling layers are symmetric to the up-sampling layers, with addition between the outputs of symmetric layers, but each layer is expanded into a residual module. Each HG predicts a landmark heat map and a single-channel boundary-line heat map (boundary lines are interpolated from the real key points).
AWing provides an adaptive wing loss function that treats foreground and background pixels differently, punishing small errors on foreground pixels (making landmark localization accurate) and tolerating background-pixel errors (making the loss easy to converge); a weighted loss map adjusts the foreground and background pixel loss weights according to the real heat map; and boundary prediction with coordinate convolution on boundary pixels is introduced.
The adaptive wing function is the core component of the model. When the error is small, an ln form is adopted, and the smaller the error value, the larger the gradient; when the error is large, a linear function is adopted and the gradient is stable. The exponent depends on the pixel value of the real key-point heat map: where the value is large (foreground) the ln exponent approaches 1, yet the gradient drops rapidly as the error approaches 0, avoiding gradient discontinuity; where the value is small (background) the ln exponent approaches 2, so the loss behaves like MSE for small errors, i.e., background pixels are allowed to tolerate small errors. A sketch follows.
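The adaptive wing loss can be written down from its published form (Wang et al., ICCV 2019); the PyTorch sketch below uses that paper's default hyper-parameters (alpha=2.1, omega=14, epsilon=1, theta=0.5), which the patent text does not itself specify.

```python
import torch

def adaptive_wing_loss(pred, target, alpha=2.1, omega=14.0, epsilon=1.0, theta=0.5):
    """Adaptive wing loss over heat maps. The target pixel value modulates the
    ln exponent (alpha - target): near key points (target ~ 1) the exponent
    approaches alpha - 1; far away (target ~ 0) it approaches alpha ~ 2,
    behaving like MSE for small background errors."""
    diff = (pred - target).abs()
    expo = alpha - target
    # Linear-branch constants, chosen so the two branches join smoothly.
    a = omega * (1.0 / (1.0 + (theta / epsilon) ** expo)) * expo \
        * ((theta / epsilon) ** (expo - 1.0)) / epsilon
    c = theta * a - omega * torch.log1p((theta / epsilon) ** expo)
    loss = torch.where(
        diff < theta,
        omega * torch.log1p((diff / epsilon) ** expo),  # ln branch, small errors
        a * diff - c,                                   # linear branch, large errors
    )
    return loss.mean()
```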
For the weighted loss map, a dilation operation from digital image processing is applied to the real key-point heat map, a binary mask is divided with 0.2 as the threshold, and the computed AWing loss values falling on the foreground are multiplied by the loss weight.
For boundary prediction with coordinate convolution, a coordinate-information channel consisting of the pixels' x and y coordinates is first added to the network; the first HG generates a supervised boundary heat map, from which a boundary mask is generated with a threshold of 0.05 to obtain boundary coordinate information, as sketched below.
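A minimal sketch of the coordinate channels and boundary mask just described; normalizing coordinates to [-1, 1] is an assumption, while the 0.05 threshold is taken from the text.

```python
import torch

def add_coord_channels(feat: torch.Tensor) -> torch.Tensor:
    """Append x- and y-coordinate channels (normalized to [-1, 1])
    to a feature map of shape (N, C, H, W)."""
    n, _, h, w = feat.shape
    xs = torch.linspace(-1.0, 1.0, w, device=feat.device).view(1, 1, 1, w).expand(n, 1, h, w)
    ys = torch.linspace(-1.0, 1.0, h, device=feat.device).view(1, 1, h, 1).expand(n, 1, h, w)
    return torch.cat([feat, xs, ys], dim=1)

def boundary_mask(boundary_heatmap: torch.Tensor, thresh: float = 0.05):
    """Binary boundary mask from the first HG's boundary heat map."""
    return boundary_heatmap > thresh
```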
3) Face pose rotation: according to the source-pose face picture and the selected pose, keep the identity and expression information of the source-pose face picture and convert it to generate a visually realistic target-pose face picture.
The invention integrates the face rotation and recognition tasks into AFGAN, allowing the two tasks to share bottom-layer features and internal relations, reducing model capacity and improving operating efficiency and system performance. AFGAN combines the face-key-point attention mechanism, the generative adversarial network and the expression-recognition integrated-learning mechanism into one whole, with all parts complementing each other. The generator G synthesizes the target-pose face picture from the source-pose face picture and the selected pose; the two discriminators D_ii and D_ih judge picture authenticity and recognize the expression label.
Referring to FIG. 4, the network structure of AFGAN is based on CGAN and comprises 1 generator G, 2 discriminators D_ii and D_ih, and an identity feature extractor D_ip. G takes the source-pose face picture and the target-pose key-point heat map as input, uses the key-point heat map as pose condition information, and outputs the synthesized target-pose picture. D_ii takes the source-pose face picture and a real or synthesized target-pose face picture as input; D_ih takes a real or synthesized target-pose face picture and the target-pose key-point heat map; both output authenticity labels and expression labels. The structures of D_ii and D_ih are basically consistent. D_ip is a LightCNN model whose extracted identity feature vectors are used to maintain identity-information consistency before and after rotation and for the identity recognition task.
The network structure of the generator G is based on U-Net: the down-sampling layers stack "ReLU activation + convolution + batch normalization" modules; the up-sampling layers stack "ReLU activation + deconvolution + batch normalization" modules; the up- and down-sampling layers are symmetrically distributed, with concatenation between the output features of symmetric layers.
The network structures of the discriminators D_ii and D_ih are based on PatchGAN, basically composed of stacked "ReLU activation + convolution + batch normalization" modules. At the output end they split into two branches, picture realism and expression prediction: the expression prediction is a 7-dimensional feature vector, while the picture-realism prediction is not a 0/1 label but a feature matrix in which each feature value represents the realism of a 70×70 sub-picture.
The attention mechanism of AFGAN is built around the attention heat maps. Observation shows that the main characteristics of face identity and expression are expressed by the geometric shapes and action changes of the eyes and mouth. Increasing the weight of feature information in these key regions therefore helps targeted extraction of feature vectors and improves the effect of face rotation and recognition, as shown in fig. 5.
The generator G multiplies the attention heat map element-wise with the source-pose face picture to construct adaptive local pictures centered on the facial features, which form an input sub-channel of G. G processes the attention-weighted sub-channel, concatenates it with the main channel's feature matrix before the last up-sampling layer, and synthesizes the final picture. Thus G can attend to the facial-feature regions weighted by the attention heat map while ensuring the overall quality of the generated picture, obtaining finer detail in the key regions.
At the input end, the discriminators D_ii and D_ih multiply the face picture to be recognized element-wise with the attention heat maps for weighting, likewise splitting into three branches (a main channel, an eye sub-channel and a mouth sub-channel) so as to train convolution kernels specialized in analyzing different geometric shape characteristics. The three channels' features are then concatenated and merged, and the obtained common bottom-layer features serve the subsequent branch tasks. The expression output branch and the realism output branch each connect a linear fully-connected layer after several "ReLU activation + convolution + batch normalization" modules.
AFGAN aims to rotate a face picture of any pose to a specified pose while keeping identity and expression information consistent. Picture authenticity, identity-information consistency, expression-information consistency and expression recognition capability therefore form the model's multi-task learning objective. The model guides network training through 5 complementary loss functions: the conditional adversarial loss, total variation loss, identity-preserving loss, expression recognition loss and multi-scale pixel-value loss.
The conditional adversarial loss guides the migration of data from the source domain to the target domain and improves the realism of the synthesized picture; it suppresses over-smoothing and produces images with richer detail. The loss encourages the generator G to deceive the discriminators so that they assign the highest possible realism to the synthesized picture, while also strengthening the discriminators so that real pictures are judged highly authentic and synthesized pictures are judged low.
The total variation loss suppresses the jagged-distortion phenomenon caused by the conditional adversarial loss. Specifically, the sum of the generated picture's pixel-value changes in the vertical and horizontal directions is computed. This loss term makes the generated picture vary gradually overall and suppresses abrupt pixel-value changes.
The identity-preserving loss maintains identity-information consistency; for this AFGAN introduces the identity recognition network LightCNN, which has been trained to recognize the identities of a large number of faces and therefore has strong identity-feature extraction capability. AFGAN uses it as the identity extractor D_ip to compute the identity feature vectors of the source-pose picture and of the target-pose picture synthesized by the generator G, and requires them to be consistent. The loss is defined as the cosine distance between the identity feature vectors of the source and target pictures. The identity feature vectors extracted by D_ip largely eliminate the interference of non-identity information, making them an effective means of preserving identity information during face rotation.
As far as is currently known, the invention is the first to propose following a multi-task learning strategy that integrates expression recognition learning into the two discriminators D_ii, D_ih and the generator G. In this way, the bottom-layer features and latent associated information shared between the two tasks can be fully utilized. During expression recognition, the adaptive attention mechanism helps the discriminators focus on the key facial-feature regions and better extract the local information that matters for expression classification.
The expression recognition loss employs a cross-entropy function. In training, each of the two discriminators D_ii and D_ih outputs an expression prediction vector, and the model averages the two vectors to obtain the final prediction. In this way expression information from different poses is fused to make a robust prediction, as sketched below. The expression recognition loss is applied to the discriminators to improve their recognition ability, and to the generator G to force it to generate pictures with consistent expressions.
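A minimal sketch of this prediction fusion, under the assumption that each discriminator's expression branch outputs raw 7-dimensional logits.

```python
import torch
import torch.nn.functional as F

def expression_loss(logits_dii: torch.Tensor,
                    logits_dih: torch.Tensor,
                    labels: torch.Tensor) -> torch.Tensor:
    """Average the two discriminators' 7-dim expression predictions and
    apply cross entropy, fusing expression cues from different poses."""
    fused = 0.5 * (logits_dii + logits_dih)
    return F.cross_entropy(fused, labels)
```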
To further improve picture authenticity, AFGAN requires the generated picture to be as close as possible to the real picture: it builds a picture pyramid from the pictures generated by G and computes the pixel-level L1 distance between the synthesized and real pictures. Although this multi-scale pixel-value loss may cause some over-smoothing of the synthesized picture, it helps accelerate the convergence of training. Sketches of several of these losses follow.
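Minimal sketches of three of the generator losses described above (total variation, identity preservation, multi-scale pixel values), assuming LightCNN identity features are given as vectors and using average pooling with factors (1, 2, 4) to build the picture pyramid; the pyramid construction is an assumption, as the patent does not fix it.

```python
import torch
import torch.nn.functional as F

def total_variation_loss(img):
    """Sum of pixel-value changes in the vertical and horizontal directions."""
    dv = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dh = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dv + dh

def identity_preserving_loss(feat_src, feat_gen):
    """Cosine distance between identity vectors of the source picture
    and the synthesized target-pose picture."""
    return (1.0 - F.cosine_similarity(feat_src, feat_gen, dim=1)).mean()

def multiscale_pixel_loss(gen, real, scales=(1, 2, 4)):
    """L1 distance over a picture pyramid (downsampling factors assumed)."""
    loss = 0.0
    for s in scales:
        g = F.avg_pool2d(gen, s) if s > 1 else gen
        r = F.avg_pool2d(real, s) if s > 1 else real
        loss = loss + F.l1_loss(g, r)
    return loss
```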
With the assistance of the front-end face detection and face alignment methods, AFGAN can convert face pictures between arbitrary poses while keeping identity and expression information, and achieves robust expression recognition and classification by combining the source and synthesized images. Fig. 6 shows rotation and recognition examples of some pictures from the test process: the identity and expression information of the original faces are well maintained and the pose rotation is completed accurately.
The generator G in AFGAN implements face pose rotation and picture synthesis. Based on the face-key-point attention mechanism, the source-pose face picture is multiplied element-wise by the source-pose attention heat map to generate an attention sub-picture, formally defined as:

x = I_s + H_t,  x_1 = (I_s * H_s) + H_t

where I_s is the source-pose face picture, H_s is the source-pose attention heat map, H_t is the target-pose key-point heat map, * denotes element-wise multiplication, + denotes matrix concatenation, x is the main-channel input, and x_1 is the attention-sub-channel input.

The generator G processes the main-channel and attention-sub-channel inputs separately and concatenates and fuses the feature outputs of the two channels, formally defined as:

G_2(x, x_1) = Linear(G_1(x) + G_1(x_1))

where G_2(x, x_1) is the fused input feature to the last layer of G, G_1(x) is the penultimate-layer feature output of the main channel, G_1(x_1) is the penultimate-layer feature output of the attention sub-channel, and Linear() denotes a linear combination.

The fused final feature is processed by the last layer of the generator G to obtain the synthesized target-pose picture, formally defined as:

I_t = G_3(G_2(x, x_1))

where I_t is the synthesized target-pose face picture and G_3() is the output layer of the generator G. A sketch of this two-channel fusion follows.
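The formulas above translate into the following PyTorch sketch. The U-Net trunk G1, the fusion layer and the output layer G3 are reduced to placeholders, and rendering Linear() as a 1×1 convolution is an assumption; the patent gives only the layer roles, not their exact sizes.

```python
import torch
import torch.nn as nn

class AttentionGenerator(nn.Module):
    """Sketch of G's two-channel attention fusion: x = [I_s, H_t] and
    x1 = [I_s * H_s, H_t] share the trunk G1, are fused by a linear
    combination G2, and decoded by the output layer G3."""
    def __init__(self, trunk: nn.Module, feat_ch: int, out_ch: int = 3):
        super().__init__()
        self.g1 = trunk                               # shared trunk up to penultimate layer
        self.g2 = nn.Conv2d(2 * feat_ch, feat_ch, 1)  # Linear() as 1x1 conv (assumption)
        self.g3 = nn.Sequential(nn.ReLU(),
                                nn.Conv2d(feat_ch, out_ch, 3, padding=1),
                                nn.Tanh())            # output layer G3

    def forward(self, i_s, h_s, h_t):
        x = torch.cat([i_s, h_t], dim=1)              # main-channel input
        x1 = torch.cat([i_s * h_s, h_t], dim=1)       # attention-sub-channel input
        fused = self.g2(torch.cat([self.g1(x), self.g1(x1)], dim=1))
        return self.g3(fused)                         # synthesized I_t
```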
4) Facial expression and identity recognition: AFGAN continues to be used to recognize the facial expression (adopting the face-key-point attention mechanism during recognition), and the expression and identity of the face in the picture are judged by combining the source-pose and target-pose face pictures.
The network structure of AFGAN is based on CGAN and comprises 1 generator G, 2 discriminators D_ii and D_ih, and an identity feature extractor D_ip. The generator G takes the source-pose face picture and the target-pose key-point heat map as input, uses the key-point heat map as pose condition information, and outputs the synthesized target-pose picture. The discriminator D_ii takes the source-pose face picture and a real or synthesized target-pose face picture as input; the discriminator D_ih takes a real or synthesized target-pose face picture and the target-pose key-point heat map; both output an authenticity label and an expression label. The structures of the discriminators D_ii and D_ih are basically consistent. The identity feature extractor D_ip is a LightCNN model, and the identity feature vectors it extracts are used to maintain identity-information consistency before and after rotation and for the identity recognition task.

The loss functions used to train the discriminators D_ii and D_ih in AFGAN comprise the conditional adversarial loss and the expression recognition loss. The conditional adversarial loss ensures the discriminators can distinguish real from synthesized face pictures, and the expression recognition loss ensures the discriminators can recognize facial expressions.
The 2 discriminators D_ii and D_ih in AFGAN check the realism of the synthesized picture and recognize the facial expression. Taking D_ih as an example, based on the face-key-point attention mechanism the face picture to be recognized is multiplied element-wise by the corresponding attention heat maps to generate attention sub-pictures, formally defined as:

x = I_t + H_t,  x_1 = (I_s * H_s1) + H_t1,  x_2 = (I_s * H_s2) + H_t2

where I_t is the face picture to be recognized (a real or synthesized target-pose picture); H_s1 and H_s2 are the eye and mouth attention heat maps of the source pose respectively; H_t1 and H_t2 are the eye and mouth key-point heat maps of the target pose respectively; * denotes element-wise multiplication; + denotes matrix concatenation; and x, x_1, x_2 are the main-channel, eye-attention-sub-channel and mouth-attention-sub-channel inputs respectively.

The discriminator D_ih processes the main-channel input and the two attention-sub-channel inputs separately, then concatenates and fuses the feature outputs of the three channels, formally defined as:

D_2(x, x_1, x_2) = Linear(D_1(x) + D_1(x_1) + D_1(x_2))

where D_2(x, x_1, x_2) is the final output of the common feature-extraction stage of D_ih; D_1(x), D_1(x_1) and D_1(x_2) are the common feature outputs of the main channel, eye attention sub-channel and mouth attention sub-channel respectively; and Linear() denotes a linear combination.

The fused common features enter the two output branches of the discriminator D_ih respectively, yielding the realism feature matrix and the expression-prediction feature vector of the face picture to be recognized, formally defined as:

Exp = D_3(D_2(x, x_1, x_2)),  Gan = D_4(D_2(x, x_1, x_2))

where D_3() is the expression output branch, Exp is the expression-prediction feature vector, D_4() is the picture-realism output branch, and Gan is the picture-realism feature matrix.

The discriminator D_ii is processed the same way as D_ih, except that H_t is replaced by the source-pose face picture I_s; everything else stays consistent. A sketch of this three-channel fusion follows.
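A minimal sketch of D_ih's three-channel processing and its two output branches. The PatchGAN trunk D1 and the branch heads are placeholders, and rendering Linear() as a 1×1 convolution is again an assumption.

```python
import torch
import torch.nn as nn

class AttentionDiscriminator(nn.Module):
    """Sketch of D_ih: shared trunk D1 over the main, eye and mouth channels;
    fusion D2 as a linear combination; expression branch D3 (7-dim vector)
    and realism branch D4 (per-patch feature matrix)."""
    def __init__(self, trunk: nn.Module, feat_ch: int, n_expr: int = 7):
        super().__init__()
        self.d1 = trunk
        self.d2 = nn.Conv2d(3 * feat_ch, feat_ch, 1)  # Linear() as 1x1 conv (assumption)
        self.d3 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                nn.Linear(feat_ch, n_expr))  # expression logits Exp
        self.d4 = nn.Conv2d(feat_ch, 1, 3, padding=1)        # realism map Gan

    def forward(self, x, x1, x2):
        common = self.d2(torch.cat([self.d1(x), self.d1(x1), self.d1(x2)], dim=1))
        return self.d3(common), self.d4(common)       # (Exp, Gan)
```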
In addition, this step adopts the "recognize after rotation" strategy: the face is recognized after rotation to the front, and the identity recognition network LightCNN directly recognizes the face identity by combining the source-pose face picture with the synthesized frontal face picture, improving the identity recognition accuracy in all poses. A sketch of this identity query follows.
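A minimal sketch of the identity query, under the assumption that the identity vectors of the source and synthesized frontal pictures are fused by averaging; the patent says the two pictures are combined but does not fix the fusion rule.

```python
import torch
import torch.nn.functional as F

def identify(feat_source, feat_frontal, gallery):
    """Fuse the identity vectors of the source-pose picture and the
    synthesized frontal picture, then return the gallery index with the
    highest cosine similarity. `gallery` is an (N, D) matrix of enrolled
    identity vectors; averaging as the fusion rule is an assumption."""
    probe = F.normalize(0.5 * (feat_source + feat_frontal), dim=-1)
    sims = F.normalize(gallery, dim=-1) @ probe  # cosine similarities, shape (N,)
    return int(torch.argmax(sims)), sims
```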
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. An intelligent face recognition method, characterized by comprising the following steps:
1) face detection: cropping from an original picture a source-pose face picture whose main content is the face;
2) face alignment: identifying and locating the face key points in the source-pose face picture;
3) face pose rotation: according to the source-pose face picture and a selected pose, keeping the identity and expression information of the source-pose face picture and converting it to generate a visually realistic target-pose face picture;
4) facial expression and identity recognition: judging the expression and identity of the face in the picture by combining the source-pose and target-pose face pictures.
2. The intelligent face recognition method of claim 1, wherein: in step 1), a multitask convolutional neural network MTCNN is adopted to preprocess the face data, and local pictures whose main content is the face are cropped from original pictures in various practical application fields.
3. The intelligent face recognition method of claim 1, wherein: in step 2), a neural network AWing based on an adaptive loss function detects face pictures in various poses and determines the positions of the face key points, from which heat maps with Gaussian divergence around the key points are generated; the heat maps are of two kinds, key-point heat maps with a divergence radius of 3 and attention heat maps with a divergence radius of 25; the key-point heat maps provide pose guidance for the subsequent steps, and the attention heat maps are used for image detail enhancement in the subsequent steps.
4. The intelligent face recognition method of claim 1, wherein: in step 3), a generative adversarial network AFGAN with a fused attention mechanism is provided; according to the source pose and the selected pose, it keeps the identity and expression information of the face picture and converts it to generate a visually realistic target-pose face picture;
the network structure of AFGAN is based on CGAN and comprises 1 generator G, 2 discriminators D_ii and D_ih, and an identity feature extractor D_ip; the generator G takes the source-pose face picture and the target-pose key-point heat map as input, uses the key-point heat map as pose condition information, and outputs the synthesized target-pose picture; the discriminator D_ii takes the source-pose face picture and a real or synthesized target-pose face picture as input, the discriminator D_ih takes a real or synthesized target-pose face picture and the target-pose key-point heat map, and both output an authenticity label and an expression label; the structures of the discriminators D_ii and D_ih are basically consistent; the identity feature extractor D_ip is a LightCNN model, and the identity feature vectors it extracts are used to maintain identity-information consistency before and after rotation and for the identity recognition task;
the loss functions used to train the generator G in AFGAN comprise a conditional adversarial loss, a total variation loss, an identity-preserving loss, an expression recognition loss and a multi-scale pixel-value loss; the conditional adversarial loss ensures the realism of the synthesized picture, the total variation loss suppresses jagged distortion of the synthesized picture, the identity-preserving loss maintains identity-information consistency, the expression recognition loss maintains expression-information consistency, and the multi-scale pixel-value loss accelerates training convergence;
the generator G in AFGAN implements face pose rotation and picture synthesis; based on the face-key-point attention mechanism, the source-pose face picture is multiplied element-wise by the source-pose attention heat map to generate an attention sub-picture, formally defined as:
x = I_s + H_t,  x_1 = (I_s * H_s) + H_t
where I_s is the source-pose face picture, H_s is the source-pose attention heat map, H_t is the target-pose key-point heat map, * denotes element-wise multiplication, + denotes matrix concatenation, x is the main-channel input, and x_1 is the attention-sub-channel input;
the generator G processes the main-channel and attention-sub-channel inputs separately and concatenates and fuses the feature outputs of the two channels, formally defined as:
G_2(x, x_1) = Linear(G_1(x) + G_1(x_1))
where G_2(x, x_1) is the fused input feature to the last layer of G, G_1(x) is the penultimate-layer feature output of the main channel, G_1(x_1) is the penultimate-layer feature output of the attention sub-channel, and Linear() denotes a linear combination;
the fused final feature is processed by the last layer of the generator G to obtain the synthesized target-pose picture, formally defined as:
I_t = G_3(G_2(x, x_1))
where I_t is the synthesized target-pose face picture and G_3() is the output layer of the generator G.
5. The intelligent face recognition method of claim 4, wherein: during pose rotation, AFGAN uses the identity recognition network LightCNN as the identity feature extractor D_ip to extract identity feature vectors from the synthesized and the real target-pose pictures respectively, penalizes the error between them through the identity-preserving loss function, and maintains the consistency of face identity information; expression recognition learning is integrated in the subsequent discriminators, and the expression recognition loss function forces the pictures synthesized by the generator G to exhibit the same expression characteristics as real pictures, maintaining the before-and-after consistency of facial expression information.
6. The intelligent face recognition method of claim 1, wherein: in the step 4), a face expression is recognized by using a generation countermeasure network AFGAN of a fusion attention mechanism, and an attention mechanism based on key points of the face is adopted in the recognition process;
the network structure of the AFGAN is based on CGAN, and comprises 1 generator G and 2 discriminators Dii、DihAnd an identity feature extractor Dip(ii) a A generator G inputs a source posture face picture and a target posture key point heat map, takes the key point heat map as posture condition information, and outputs a synthesized target posture picture; discriminator DiiInputting the face picture of the source posture and the face picture of the real or synthesized target posture, and a discriminator DihInputting a real or synthesized target posture face picture and a target posture key point heat map, and outputting an authenticity label and an expression label; discriminator DiiAnd identity feature extractor DipIs consistent, and an identity feature extractor DipAnd the extracted identity characteristic vector is used for maintaining identity information consistency before and after rotation and an identity recognition task for the LightCNN model.
the loss functions used to train the discriminators D_ii and D_ih in the AFGAN comprise a conditional adversarial loss and an expression recognition loss; the conditional adversarial loss ensures the discriminators can distinguish real from synthesized face pictures, and the expression recognition loss ensures the discriminators can recognize facial expressions;
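A hedged sketch of such a combined discriminator objective follows, assuming a binary cross-entropy adversarial form (as in CGAN) and a weighting factor lambda_exp that the claim does not specify:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(gan_real, gan_fake, exp_logits, exp_labels, lambda_exp=1.0):
    # Conditional adversarial loss: push real pairs toward 1, synthesized toward 0
    adv = (F.binary_cross_entropy_with_logits(gan_real, torch.ones_like(gan_real))
           + F.binary_cross_entropy_with_logits(gan_fake, torch.zeros_like(gan_fake)))
    # Expression recognition loss: cross-entropy on the expression branch outputs
    exp = F.cross_entropy(exp_logits, exp_labels)
    return adv + lambda_exp * exp
```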
the two discriminators D_ii and D_ih in the AFGAN check the authenticity of the synthesized picture and recognize the facial expression; based on the facial key-point attention mechanism, the face picture to be recognized is multiplied by the corresponding attention heat maps to generate attention sub-images, formally defined as follows:
x = I_t + H_t,  x_1 = (I_s * H_s1) + H_t1,  x_2 = (I_s * H_s2) + H_t2
where I_t is the face picture to be recognized, i.e. the real or synthesized target-pose picture; H_s1 and H_s2 are the eye and mouth attention heat maps of the source pose, respectively; H_t1 and H_t2 are the eye and mouth key-point heat maps of the target pose, respectively; * denotes element-wise multiplication, + denotes matrix concatenation; and x, x_1 and x_2 are the main-channel, eye-attention-sub-channel and mouth-attention-sub-channel inputs, respectively;
the discriminator D_ih processes the main-channel input and the two attention-sub-channel inputs separately, then concatenates and fuses the feature outputs of the three channels, formally defined as follows:
D_2(x, x_1, x_2) = Linear(D_1(x) + D_1(x_1) + D_1(x_2))
where D_2(x, x_1, x_2) is the final output of the common feature extraction stage of the discriminator D_ih; D_1(x), D_1(x_1) and D_1(x_2) are the common feature outputs of the main channel, the eye attention sub-channel and the mouth attention sub-channel, respectively; and Linear() denotes a linear combination;
the fused common features then enter the two output branches of the discriminator D_ih to obtain an authenticity feature matrix and an expression prediction feature vector for the face picture to be recognized, formally defined as follows:
Exp = D_3(D_2(x, x_1, x_2)),  Gan = D_4(D_2(x, x_1, x_2))
where D_3() is the expression output branch, Exp is the expression prediction feature vector, D_4() is the picture authenticity output branch, and Gan is the picture authenticity feature matrix;
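The three-channel discriminator D_ih described by these formulas could look like the sketch below, where D_1 is an assumed shared convolutional trunk, Linear() is a 1x1 convolution, the authenticity branch D_4 emits a PatchGAN-style feature matrix, and the expression class count of 7 is an assumption not fixed by the claim:

```python
import torch
import torch.nn as nn

class ThreeChannelDiscriminator(nn.Module):
    def __init__(self, in_ch=4, feat=64, n_expr=7):
        super().__init__()
        # D_1: common feature extractor, assumed shared by the three channels
        self.d1 = nn.Sequential(
            nn.Conv2d(in_ch, feat, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(feat, feat, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        # Linear(): linear combination of the concatenated channel features
        self.fuse = nn.Conv2d(3 * feat, feat, 1)
        # D_4: authenticity branch producing a feature matrix (Gan)
        self.d4 = nn.Conv2d(feat, 1, 3, padding=1)
        # D_3: expression branch producing a prediction vector (Exp)
        self.d3 = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat, n_expr))

    def forward(self, x, x1, x2):
        d2 = self.fuse(torch.cat([self.d1(x), self.d1(x1), self.d1(x2)], dim=1))
        return self.d3(d2), self.d4(d2)   # Exp, Gan
```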
the processing of the discriminator D_ii is identical to that of D_ih, except that H_t is replaced by the source-pose face picture I_s.
7. The intelligent face recognition method of claim 1, wherein: a "rotate-then-recognize" strategy is adopted, i.e. recognition is performed after the face has been rotated to the frontal pose; the source-pose face picture and the synthesized frontal face picture are combined, and the identity recognition network LightCNN directly recognizes the face identity.
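A sketch of how the rotate-then-recognize strategy might be applied at inference time; "frontalize" stands in for generator G together with its heat-map inputs, feature averaging is an assumed combination rule, and gallery matching by cosine similarity is an assumed recognition step, none of which the claim prescribes:

```python
import torch
import torch.nn.functional as F

def recognize(frontalize, lightcnn, I_s, gallery_feats):
    """frontalize: callable synthesizing a frontal picture from the source-pose
    picture I_s; lightcnn: identity feature extractor; gallery_feats: (N, d)
    L2-normalized identity features of enrolled faces (all assumed helpers)."""
    with torch.no_grad():
        I_front = frontalize(I_s)                          # synthesized frontal face
        # Combine source-pose and synthesized frontal identity features
        f = F.normalize(lightcnn(I_s) + lightcnn(I_front), dim=-1)
        sims = f @ gallery_feats.t()                       # cosine similarities
    return sims.argmax(dim=-1)                             # predicted identity index
```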