CN110991281B - Dynamic face recognition method - Google Patents


Info

Publication number
CN110991281B
CN110991281B
Authority
CN
China
Prior art keywords
face
dimensional
layers
model
convolutional
Prior art date
Legal status
Active
Application number
CN201911145377.5A
Other languages
Chinese (zh)
Other versions
CN110991281A (en)
Inventor
高建彬
蒋文韬
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201911145377.5A
Publication of CN110991281A
Application granted
Publication of CN110991281B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation

Abstract

The invention provides a dynamic face recognition method comprising the following steps: first, a video is captured by a camera to obtain marked face images; targeted illumination compensation is applied to the marked face images; a three-dimensional face reconstruction model based on a convolutional neural network (CNN) is constructed to perform three-dimensional face reconstruction and obtain a three-dimensional face model; features are extracted from the three-dimensional face model to obtain a face feature vector; and face matching is carried out on the face feature vector to realize face recognition. The dynamic face recognition method provided by the invention handles problems such as pose and occlusion well through three-dimensional face reconstruction, and performing face recognition on the CNN-based three-dimensional reconstruction achieves a better dynamic face recognition effect than the prior art.

Description

Dynamic face recognition method
Technical Field
The invention relates to a face recognition technology, in particular to a dynamic face recognition method.
Background
Research on face recognition dates back to the 1960s and 1970s. Face recognition systems have the following advantages: 1. Non-mandatory acquisition: face images can be captured while the user is almost unaware, without requiring the user to cooperate with dedicated acquisition equipment, so sampling is not "compulsory". 2. Non-contact: face images can be obtained without the user directly touching the device. 3. Concurrency: multiple faces can be sorted, judged and recognized simultaneously in practical application scenarios. 4. Accordance with visual habit: recognition matches the human habit of "knowing people by their appearance", so operation is simple, results are intuitive, and concealment is good. Traditional face recognition technology is mainly based on visible-light images, a familiar recognition mode that has been developed for more than 30 years. However, this approach has a defect that is difficult to overcome: when ambient light changes, recognition performance degrades rapidly and cannot meet the needs of practical systems. Techniques proposed to solve the illumination problem include three-dimensional face recognition and thermal-imaging face recognition, but both are still far from mature and their recognition performance is unsatisfactory. In recent years, with the development of deep learning, the success rate of face recognition in controlled environments has become high, but a satisfactory effect is still difficult to achieve in uncontrolled environments, and several factors restrict large-scale commercialization of face recognition systems: illumination change, pose variation, partial occlusion of the face, complex expressions and age change greatly increase the difficulty of recognition. Moreover, deep learning is data-driven, and it is currently impractical to obtain a large number of images of every type for the same person, which also affects detection performance under uncontrolled conditions to some extent. Improving the accuracy and robustness of face recognition therefore has great research value, as well as great application value in fields such as security monitoring, identity authentication, criminal investigation and intelligent control. Early recognition methods were generally based on geometric features of the face, commonly the local shape features of facial organs such as the eyes, nose and mouth. Later, methods based on correlation matching, subspaces and statistics appeared, followed by neural-network approaches: Gutta et al. proposed a hybrid neural network; Lawrence et al. realized sample clustering through a multi-stage SOM and used a convolutional neural network (CNN) for face recognition; Lin et al. adopted a probabilistic-decision-based neural network; and Demers et al. proposed extracting face image features with a principal-component neural network, further compressing the features with an auto-associative neural network, and finally performing recognition with a multilayer perceptron (MLP).
Er et al. used principal component analysis (PCA) for dimensionality reduction, linear discriminant analysis (LDA) for feature extraction, and a radial basis function (RBF) network for recognition. The advantage of neural networks is that implicit expressions of rules and regularities are obtained through the learning process, giving them high adaptability. Finally, recognition methods based on three-dimensional models can overcome the shortcomings of traditional two-dimensional images: Atick et al. extended the eigenface idea, representing the frontal face mainly by a three-dimensional function; Nastar et al. proposed a gray-plane-based method that models the face image as a deformable 3D mesh surface and converts face matching into an elastic matching problem; Anh et al. proposed a 3D face reconstruction algorithm using a deep convolutional neural network, which accurately estimates the 3D face from several pictures of the same face and achieves good results. Generally, three-dimensional face reconstruction is based on multiple pictures. Inspired by the success of deep neural networks (DNNs), in 2017 Dou et al. proposed a DNN-based method for end-to-end 3D face reconstruction from a single 2D image; Jackson et al. realized large-pose three-dimensional face reconstruction from a single image by regressing the three-dimensional face directly from a CNN rather than estimating the parameters of a 3D morphable model (3DMM) through the CNN; and in 2018 Feng Y et al. proposed a reconstruction algorithm that accomplishes both three-dimensional reconstruction and alignment end to end. Lu et al. proposed a three-dimensional reconstruction method recovering geometric details from a single picture: a smooth, coarse 3D face is generated by an example-based bilinear face model, aligning the projections of 3D face landmarks with 2D landmarks detected in the input image; the 3D face is then refined with local corrections under a photometric-consistency constraint; finally, a shape-from-shading step recovers fine geometric details. The method retains high fidelity and can considerably improve recognition performance. Feng et al. evaluated dense three-dimensional reconstruction of in-the-wild 2D images, provided a dedicated data set, and compared the performance of three state-of-the-art 3D reconstruction systems.
However, existing dynamic face recognition technology faces a series of problems. (1) Illumination: under various environmental light sources, side light, backlight, top light, strong highlights and similar phenomena may occur; illumination may differ between time periods and even between positions within the monitored area. (2) Pose and accessories: because dynamic monitoring is non-cooperative, monitored people pass through the area in natural postures, so various non-frontal poses (side face, head down, head up) and accessories (hats, masks, glasses) occur. (3) Camera imaging: many technical parameters of the camera affect video quality, such as sensor size, DSP processing speed, the built-in image-processing chip and the lens, as do built-in settings such as exposure time, aperture and dynamic white balance. (4) Frame loss and missed face detection: network transmission and system computation can cause video frames to be dropped and faces to be missed; in areas with heavy foot traffic, bandwidth and computing-power limits often lead to frame loss and missed face detections.
Disclosure of Invention
In view of these problems, and to overcome the above defects, the invention provides an efficient, reasonable and effective dynamic face recognition method that addresses illumination, pose and related issues and is suitable for effective face recognition in uncontrolled environments. To achieve this object, the dynamic face recognition method of the present invention comprises the following steps:
(1) A video is captured by a high-specification camera (e.g., a Sony SSC-N21), a common face detection algorithm is invoked, and frames that may contain faces are extracted from the video and input to the detector.
(2) The extracted face frames are preprocessed: an illumination mode is estimated using an illumination-mode parameter space, and targeted illumination compensation is then applied to eliminate the effects of shadows, highlights and the like caused by non-uniform frontal illumination.
(3) To eliminate the influence of pose, a three-dimensional face reconstruction model based on a convolutional neural network (CNN) is constructed to realize three-dimensional face reconstruction. Residual blocks are combined with ordinary convolution operations, and feature points are added as guidance during reconstruction; because the face feature points are aligned during reconstruction itself, no extra alignment step is needed afterwards.
(4) A residual network model is constructed for feature-vector extraction and trained with a large-margin cosine measure, fine-tuning its network parameters. A skip-level structure is added (drawing on the characteristics of residual networks, the output of each layer can propagate across multiple layers), and the feature-extraction result is optimized by adjusting the number of layers skipped in each training pass and the number of convolutional layers of the model.
(5) The extracted feature vectors are matched to realize face recognition.
The disclosed dynamic face recognition method improves the loss functions used in three-dimensional face reconstruction and in the subsequent recognition of the reconstructed three-dimensional face model, and achieves efficient dynamic face recognition through three-dimensional reconstruction. The invention performs three-dimensional face reconstruction by constructing a CNN-based reconstruction model, extracts features by constructing a residual network, and finally performs face matching. Illumination compensation is applied to the face before feature extraction to eliminate lighting effects; three-dimensional reconstruction handles pose and occlusion problems well; and performing face recognition on the CNN-based three-dimensional reconstruction yields better results than prior methods, so a good effect is obtained in dynamic face recognition.
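By way of orientation, the five steps above can be sketched as a single processing loop. The sketch below is illustrative only: the helper names (detect_faces, compensate_illumination, reconstruct_3d, extract_features, match_feature) are hypothetical placeholders for the components detailed in the embodiments, not functions defined by the patent; some of them are fleshed out in the sketches that follow.

```python
# Minimal outline of the five-step pipeline; all helpers are placeholders.
import cv2

def recognize_from_video(video_path, face_db):
    """Run the five-step dynamic recognition pipeline over a video file."""
    capture = cv2.VideoCapture(video_path)
    results = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        for face_img in detect_faces(frame):             # step (1)
            face_img = compensate_illumination(face_img)  # step (2)
            face_3d = reconstruct_3d(face_img)            # step (3)
            feature = extract_features(face_3d)           # step (4)
            results.append(match_feature(feature, face_db))  # step (5)
    capture.release()
    return results
```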
Drawings
FIG. 1 is a flow chart of the dynamic face recognition implementation provided by the present invention
FIG. 2 is a schematic diagram of the face detection algorithm of the present invention
FIG. 3 is a schematic diagram of a three-dimensional human face reconstruction model according to the present invention
FIG. 4 is a schematic diagram of the residual error network model of the present invention
FIG. 5 is a diagram illustrating a residual block structure
FIG. 6 shows the comparison of the three-dimensional face reconstruction results
FIG. 7 shows a comparison of human face recognition effects
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The invention provides a dynamic face recognition method, as shown in fig. 1, the method comprises the following steps:
Step 1: First, a video is captured by a camera. A face detector is employed (an existing face detector is used): a common face detection algorithm is invoked, image frames that may contain faces are extracted from the video, and each extracted frame is input into the face detector to obtain marked face images. The specific implementation process is shown in fig. 2. The camera is a high-specification model, such as a Sony SSC-N21.
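As a minimal sketch of step 1, assuming OpenCV's stock Haar-cascade detector as a stand-in for the unspecified "common face detection algorithm":

```python
# Step 1 sketch: the Haar cascade is an assumed stand-in detector; the
# patent does not name a specific face detection algorithm.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return cropped face regions (the 'marked face images') from one frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [frame[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```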
Step 2: Then, to counter the influence of illumination, an illumination mode is estimated using an illumination-mode parameter space, and targeted illumination compensation is applied to the marked face image obtained in step 1, yielding an illumination-compensated marked face image free of the shadow, highlight and other effects caused by non-uniform frontal illumination.
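The illumination-mode parameter space itself is not specified here, so the following sketch substitutes a common compensation technique, CLAHE on the luminance channel, purely to illustrate where step 2 sits in the pipeline:

```python
# Assumed proxy for step 2: CLAHE on the L channel suppresses shadows and
# highlights from non-uniform frontal lighting; not the patented estimator.
import cv2

def compensate_illumination(face_img):
    """Equalize the luminance channel of a BGR face crop."""
    lab = cv2.cvtColor(face_img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
```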
Step 3: Three-dimensional face reconstruction is performed on the illumination-compensated marked face image obtained in step 2. Compared with a two-dimensional face, a three-dimensional face has great advantages: pose problems can be handled effectively, the influence of pose is eliminated, and the recognition effect is improved. The three-dimensional face reconstruction model combines residual blocks with ordinary convolution operations, and feature points are added as guidance during reconstruction. The specific implementation process is shown in fig. 3. Alignment in the three-dimensional face reconstruction model works as follows: feature points added during reconstruction guide the process, so the reconstructed three-dimensional face model directly determines the coordinate values of the feature points; the reconstructed three-dimensional face model is therefore already aligned.
The three-dimensional face reconstruction model adopts a combined encoder-decoder structure. The three-dimensional face is represented volumetrically: the face is regarded as 200 cross-sectional slices from the plane behind the ears to the plane of the nose tip, each slice being a contour of equal depth. Feature-point guidance is added during the work of the encoder and decoder so that the three-dimensional face model is obtained directly.
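A rough PyTorch skeleton of such an encoder-decoder is given below. The input, bottleneck and output dimensions follow the detailed embodiment later in this description (256 x 256 x 3 in, 8 x 8 x 512 bottleneck, 256 x 256 x 3 out), but the layer counts and strides are simplified assumptions chosen so that the spatial sizes close; the 10-residual-block encoder, 17-layer decoder and feature-point guidance of the actual model are not reproduced here.

```python
# Simplified encoder-decoder sketch: face image in, UV position map out.
import torch.nn as nn

class ReconstructionNet(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 512]   # five stride-2 convs: 256 -> 8
        enc = []
        for cin, cout in zip(chans, chans[1:]):
            enc += [nn.Conv2d(cin, cout, 4, stride=2, padding=1), nn.ReLU(True)]
        self.encoder = nn.Sequential(*enc)   # (B, 3, 256, 256) -> (B, 512, 8, 8)
        dec = []
        for cin, cout in zip(chans[::-1], chans[::-1][1:]):
            dec += [nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                    nn.ReLU(True)]
        dec[-1] = nn.Sigmoid()               # final Sigmoid, per the embodiment
        self.decoder = nn.Sequential(*dec)   # (B, 512, 8, 8) -> (B, 3, 256, 256)

    def forward(self, x):
        return self.decoder(self.encoder(x))
```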
Step 4: Features are extracted from the three-dimensional face model obtained in step 3. A residual network model is designed for feature extraction and trained and optimized with a large-margin cosine measure, fine-tuning its network parameters. A skip-level structure is added (drawing on the characteristics of residual networks, the output of each layer can propagate across multiple layers), and the recognition result is optimized by adjusting the number of layers skipped in each training pass (the number of layers per skip is the same after adjustment) and the number of convolutional layers of the model. The trained and optimized residual network then extracts the face feature vector of the three-dimensional face model generated in step 3; the specific implementation process is shown in fig. 4. In this embodiment, the residual network model comprises 18 sequentially connected convolutional layers, each using 3 x 3 convolution kernels; layers 1-6 have 64 kernels, layers 7-12 have 128 kernels, and layers 13-18 have 256 kernels. An additional connection is added after every two convolutional layers: the input of the first layer of the pair is fused with the output after the pair and used as the input of the next layer. After multiple skip-and-fuse stages, the result is average-pooled and fed into a fully connected layer, finally yielding the face feature vector of the three-dimensional face model. The purpose of this design is that an ordinary convolutional neural network easily suffers vanishing or exploding gradients as the number of convolutional layers grows; the skip-level structure avoids this problem even in deep network models.
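A sketch of this 18-layer residual feature extractor follows, assuming the input is the 3-channel UV position map of the reconstructed face, and that 1 x 1 projections and stride-2 downsampling are used where the channel count changes (neither detail is stated explicitly in the text):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers; the pair's input is fused (added) with its output.
    The 1x1 projection on channel/stride changes is an assumption."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(cout, cout, 3, padding=1)
        self.proj = (nn.Identity() if cin == cout and stride == 1
                     else nn.Conv2d(cin, cout, 1, stride=stride))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + self.proj(x))

class FeatureNet(nn.Module):
    """18 conv layers in 9 two-layer blocks (6 at 64, 6 at 128, 6 at 256
    channels), then average pooling and a fully connected layer."""
    def __init__(self, feat_dim=512):
        super().__init__()
        blocks, cin = [], 3
        for cout in [64, 64, 64, 128, 128, 128, 256, 256, 256]:
            blocks.append(BasicBlock(cin, cout, stride=2 if cin != cout else 1))
            cin = cout
        self.blocks = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256, feat_dim)

    def forward(self, x):
        return self.fc(self.pool(self.blocks(x)).flatten(1))
```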
To address the shortcomings of the softmax loss, a normalized large-margin cosine loss function is adopted during training, together with the skip-level structure (that is, the number of layers per skip in the residual network model is adjusted), which alleviates problems such as vanishing gradients as depth increases during training.
Step 5: Finally, the face feature vectors extracted in step 4 are compared against a preset face library to realize face matching and thus face recognition. Recognition is implemented with a recognition network, which outputs the recognition result, i.e., which person has been recognized.
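A minimal sketch of the matching in step 5, using cosine similarity against a preset library; the threshold value is an assumed parameter, and the recognition network of the embodiment is reduced here to a nearest-neighbor comparison:

```python
# Step 5 sketch: cosine-similarity lookup in a preset face library.
import numpy as np

def match_feature(feature, face_db, threshold=0.5):
    """face_db: dict mapping person name -> unit-norm feature vector."""
    feature = feature / np.linalg.norm(feature)
    best_name, best_sim = None, -1.0
    for name, ref in face_db.items():
        sim = float(np.dot(feature, ref))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else "unknown"
```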
The core of the invention is end-to-end three-dimensional face reconstruction using the three-dimensional face reconstruction model, together with a degree of improvement to the loss function of the subsequent residual network model. Unlike previous face reconstruction models, the reconstruction model here regresses the corresponding three-dimensional face model directly from the CNN, with no separate alignment needed. Compared with traditional methods, the invention generates a three-dimensional face model in a one-to-one manner, and the reconstruction effect is superior (current traditional methods mainly reconstruct the three-dimensional face model by fitting 3DMM parameters, generating it in a many-to-one manner).
For each three-dimensional face reconstruction, a corresponding UV position map is output; a three-dimensional face model is then produced from each UV position map, and face recognition is performed on each resulting model, specifically outputting which person has been recognized.
The most important part of the method is the three-dimensional face reconstruction: the performance of the reconstruction model determines the face recognition efficiency of the whole method.
The three-dimensional face reconstruction model adopts an encoder-decoder structure. A 256 x 256 x 3 face image is input; encoding produces an 8 x 8 x 512 feature map, and decoding outputs a 256 x 256 x 3 UV position map. The UV position map is a two-dimensional image recording the three-dimensional coordinates of the facial point cloud of the face image; each UV position map preserves semantic information and contains the RGB information of each point of the three-dimensional coordinates of the facial point cloud. The encoding structure is a cascade of 10 identical residual blocks, whose structure is shown in fig. 5. Each residual block comprises 3 convolutional layers: the first uses 1 x 1 kernels, 64 in number; the second uses 3 x 3 kernels, 64 in number; the third uses 1 x 1 kernels, 256 in number. Each convolutional layer is activated by a ReLU function after convolution and fed into the next convolutional layer; finally, the output of the three layers is summed with the input to the block's first convolutional layer and passed through a ReLU activation to give each residual block's output. The decoding structure consists of 17 deconvolution layers: the first layer has 4 x 4 kernels with stride 2 and padding 1, the second has 2 x 2 kernels with stride 1 and padding 0; the odd-numbered layers share the first layer's parameters and the even-numbered layers share the second's, and the image finally obtained through the decoding structure has the same size as the input. The output layer uses a Sigmoid function. The loss function adopted by the three-dimensional face reconstruction model is shown in formula (1):
Loss_1 = Σ_(x,y) ||P(x, y) − P̃(x, y)|| · W(x, y)    (1)

where (x, y) are coordinate values in the UV position map, W(x, y) is the weight of each point in the UV position map, P(x, y) is the predicted UV position map, and P̃(x, y) is the true UV position map information of the current face.
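Formula (1) can be written out directly; in the sketch below the weight map W is taken as given, since the weighting scheme itself is not spelled out in this text:

```python
# Formula (1) in PyTorch: per-point L2 distance between predicted and true
# UV position maps, weighted by W (assumed precomputed, landmark-favoring).
import torch

def uv_position_loss(pred, target, weight_map):
    """pred, target: (B, 3, 256, 256) UV maps; weight_map: (1, 1, 256, 256)."""
    per_point = torch.norm(pred - target, dim=1, keepdim=True)  # ||P - P~|| at each (x, y)
    return (per_point * weight_map).mean()
```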
The three-dimensional face reconstruction model is trained on the 300W-based synthetic data set (300W-LP), which contains annotations of face images at different angles together with estimated 3DMM coefficients. Face images in the training samples are first scaled to 256 x 256; an adaptive moment estimation (Adam) optimizer is then used with an initial learning rate of 0.0001, decayed by half every 5 epochs, and the training batch size is set to 16.
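Under the stated settings, a schematic training loop might look as follows; `loader` stands for an assumed DataLoader over 300W-LP, and the epoch count and weight map are placeholders (ReconstructionNet and uv_position_loss are the sketches above):

```python
# Training setup per the text: Adam at 1e-4, lr halved every 5 epochs,
# batch size 16, 256x256 inputs. Data loading is schematic.
import torch
from torch.optim.lr_scheduler import StepLR

model = ReconstructionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = StepLR(optimizer, step_size=5, gamma=0.5)   # halve lr every 5 epochs
weight_map = torch.ones(1, 1, 256, 256)  # placeholder; real weights favor landmarks
num_epochs = 40                          # assumption; total epochs are not stated

for epoch in range(num_epochs):
    for images, target_uv in loader:     # assumed DataLoader, batch_size=16
        optimizer.zero_grad()
        loss = uv_position_loss(model(images), target_uv, weight_map)
        loss.backward()
        optimizer.step()
    scheduler.step()
```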
The test set for the three-dimensional face reconstruction model is the AFLW2000-3D data set, which contains 2000 unconstrained face images for evaluating three-dimensional face reconstruction and alignment. The experimental effect is shown in fig. 6 (in each row, the first image is the input face image, the second is the result reconstructed with the present three-dimensional face reconstruction model, and the third is the result of VRN-Guided, currently the algorithm with the best reconstruction effect). As fig. 6 shows, the reconstruction effect of the invention is better than that of VRN-Guided.
The reconstructed three-dimensional face is input into the residual network model for feature extraction; the specific structure is shown in fig. 4. Unlike previous feature extraction, the residual network model is trained with an improved large-margin cosine loss function (LMCL), shown in formula (2):
Loss_2 = (1/N) Σ_i −log( e^(s·(cos(θ_yi) − m)) / ( e^(s·(cos(θ_yi) − m)) + Σ_(j≠yi) e^(s·cos(θ_j)) ) )    (2)

Formula (2) states that when an input face corresponds to the current face label to be recognized, a margin m is added during loss computation; otherwise it is not. The margin m further separates the training samples in the training set. Let W_j be the weight of the fully connected layer for recognizing the current face to be recognized in the residual network model, where j is the current face label, and let x' be the input vector. Assuming the fully connected layer bias is 0, the output after the fully connected layer (for example, if the current face label to be recognized is 1, then W_j is the fully connected layer weight for face label 1, i.e., j = 1) is:

f_j = ||W_j|| · ||x'|| · cos(θ_j)    (3)

where θ_j is the angle between W_j and x'. Setting ||W_j|| = 1 and ||x'|| = s (s a constant), with N the total number of training samples in the training set, i the index of any input face in the training set, yi the label of the input face to be recognized, and θ_yi the angle between the input face and the fully connected layer weight W_yi of its class, adding the margin m yields the loss function used in formula (2).
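Formula (2) corresponds to the large-margin cosine loss of CosFace; a sketch in PyTorch follows, with s and m set to commonly used values since the text leaves them unspecified:

```python
# Formula (2) sketch: rows of W and the feature are L2-normalized so that
# ||W_j|| = 1, the feature norm is replaced by the constant scale s, and the
# margin m is subtracted from the target-class cosine only.
import torch.nn.functional as F

def lmcl_loss(features, weights, labels, s=30.0, m=0.35):
    """features: (N, d); weights: (C, d) fully connected weights, bias 0;
    labels: (N,) identity labels. s, m are assumed CosFace-style defaults."""
    cos = F.linear(F.normalize(features), F.normalize(weights))  # (N, C) cosines
    one_hot = F.one_hot(labels, num_classes=weights.size(0)).float()
    logits = s * (cos - m * one_hot)   # margin applied to the target class only
    return F.cross_entropy(logits, labels)
```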
The final effect of the loss function adopted by the residual network model, relative to other loss functions, is shown in fig. 7: the LMCL loss adopted by the invention generally outperforms the loss functions in common use (Softmax Loss, L-Softmax Loss, A-Softmax Loss, etc.) across various data sets (LFW, YTF, MF1 Rank-1, etc.).
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand it, it should be understood that the scope of the invention is not limited to these specific embodiments. Variations will be obvious to those skilled in the art, and all inventions utilizing the concepts of the present invention are intended to be protected.

Claims (1)

1. A dynamic face recognition method is characterized by comprising the following steps:
Step 1: first, a video is shot by a camera; using a face detector, a face detection algorithm in the detector is invoked, image frames containing faces are extracted from the video, and each extracted face image frame is input into the face detector to obtain a marked face image;
Step 2: then, against the influence of illumination, an illumination mode is estimated using an illumination-mode parameter space, and targeted illumination compensation is applied to the marked face image obtained in step 1 to obtain an illumination-compensated marked face image, eliminating the shadow and highlight effects caused by non-uniform frontal illumination;
Step 3: three-dimensional face reconstruction is performed on the illumination-compensated marked face image obtained in step 2 to eliminate the influence of pose and improve the recognition effect; a three-dimensional face reconstruction model based on a convolutional neural network (CNN) is constructed to perform end-to-end three-dimensional face reconstruction and obtain the corresponding three-dimensional face model;
Step 4: features are extracted from the three-dimensional face model obtained in step 3; a residual network model is constructed for feature extraction, trained and optimized with a large-margin cosine measure, and its network parameters are fine-tuned; a skip-level structure is added, i.e., the output of each convolutional layer of the residual network model can propagate across multiple convolutional layers; the recognition result is optimized by adjusting the number of layers skipped in each training pass and the number of convolutional layers of the model, the number of layers per skip being the same after adjustment; the trained and optimized residual network model extracts the face feature vector of the three-dimensional face model generated in step 3;
Step 5: finally, the face feature vectors extracted in step 4 are compared in a preset face library to realize face matching and thus face recognition; recognition is implemented with a recognition network, which outputs the recognition result, i.e., which person has been recognized;
the three-dimensional face reconstruction model in step 3 combines residual blocks with ordinary convolution operations and adopts a combined encoder-decoder structure; the three-dimensional face is represented volumetrically, the face being regarded as 200 cross-sectional slices from the plane behind the ears to the plane of the nose tip, each slice being a contour of equal depth; feature-point guidance is added during the work of the encoder and decoder, and alignment of the three-dimensional face reconstruction model proceeds as follows: feature points added during reconstruction guide the process, so the reconstructed three-dimensional face model directly determines the coordinate values of the feature points and is therefore an aligned three-dimensional face model;
for each three-dimensional face reconstruction a corresponding UV position map is output; a three-dimensional face model is then output for each UV position map, and face recognition is performed on each resulting three-dimensional face model, specifically outputting which person is recognized;
specifically, the three-dimensional face reconstruction model adopts an encoder-decoder structure: a 256 × 256 × 3 face image is input, an 8 × 8 × 512 feature map is formed after encoding, and a 256 × 256 × 3 UV position map is output after decoding, the UV position map being a two-dimensional image recording the three-dimensional coordinates of the facial point cloud of the face image, each UV position map preserving semantic information and containing the RGB information of each point of the three-dimensional coordinates of the facial point cloud; the encoding structure is a cascade of 10 identical residual block structures, each comprising 3 convolutional layers: the first layer uses 1 × 1 kernels, 64 in number, the second uses 3 × 3 kernels, 64 in number, and the third uses 1 × 1 kernels, 256 in number; each convolutional layer is activated by a ReLU function after convolution and fed into the next convolutional layer, and finally the output of the three-layer residual block structure is summed with the input to the block's first convolutional layer and activated by a ReLU function to give the output of each residual block structure; the decoding structure consists of 17 deconvolution layers, the first with 4 × 4 kernels, stride 2 and padding 1, the second with 2 × 2 kernels, stride 1 and padding 0; the odd-numbered deconvolution layers share the parameters of the first layer and the even-numbered layers those of the second, the image finally obtained through the decoding structure has the same size as the input, and the final output layer uses a Sigmoid function; the loss function adopted by the three-dimensional face reconstruction model is:
Loss_1 = Σ_(x,y) ||P(x, y) − P̃(x, y)|| · W(x, y)

where (x, y) are coordinate values in the UV position map, W(x, y) is the weight of each point in the UV position map, P(x, y) is the predicted UV position map, and P̃(x, y) is the true UV position map information of the current face;
the three-dimensional face reconstruction model is trained using the 300W-based synthetic data set 300W-LP as the training set, which contains annotations of face images at different angles and estimated 3DMM coefficients; the face images in the training samples are first scaled to 256 × 256, an adaptive moment estimation (Adam) optimizer is then used with a learning rate starting at 0.0001 and decayed by half every 5 epochs, and the training batch size is set to 16;
the test set of the three-dimensional face reconstruction model is the data set AFLW2000-3D, which contains 2000 unconstrained face images for evaluating three-dimensional face reconstruction and alignment;
the residual network model adopted in step 4 comprises 18 sequentially connected convolutional layers, each using 3 × 3 convolution kernels; the number of kernels is 64 in convolutional layers 1-6, 128 in layers 7-12 and 256 in layers 13-18; an additional connection is added after every two convolutional layers, i.e., the input of the first layer of the pair is fused with the output after the pair and used as the input of the next convolutional layer; after multiple skip-and-fuse stages the result is average-pooled and input into the fully connected layer, finally yielding the face feature vector of the three-dimensional face model, and this skip structure avoids the vanishing- or exploding-gradient problem that arises as depth increases during training;
the residual network model is trained with an improved cosine-measure-based loss function (LMCL), the loss function of the residual network model being:
Loss_2 = (1/N) Σ_i −log( e^(s·(cos(θ_yi) − m)) / ( e^(s·(cos(θ_yi) − m)) + Σ_(j≠yi) e^(s·cos(θ_j)) ) )
where m is a margin that further separates the training samples in the training set of the residual network model; if the weight of the fully connected layer for recognizing the current face to be recognized is W_j, with j the current face label in the residual network model and x' the input vector, then assuming the fully connected layer bias is 0, the output after the fully connected layer is:

f_j = ||W_j|| · ||x'|| · cos(θ_j)

where θ_j is the angle between W_j and x'; setting ||W_j|| = 1 and ||x'|| = s with s a constant, N denoting the total number of training samples in the training set of the residual network model, i the index of any input face in the training set, yi the label of the input face to be recognized, and θ_yi the angle between the input face and the fully connected layer weight W_yi of its class, adding the margin m yields the loss function Loss_2.
CN201911145377.5A 2019-11-21 2019-11-21 Dynamic face recognition method Active CN110991281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911145377.5A CN110991281B (en) 2019-11-21 2019-11-21 Dynamic face recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911145377.5A CN110991281B (en) 2019-11-21 2019-11-21 Dynamic face recognition method

Publications (2)

Publication Number Publication Date
CN110991281A CN110991281A (en) 2020-04-10
CN110991281B true CN110991281B (en) 2022-11-04

Family

ID=70085444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911145377.5A Active CN110991281B (en) 2019-11-21 2019-11-21 Dynamic face recognition method

Country Status (1)

Country Link
CN (1) CN110991281B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488857A (en) * 2020-04-29 2020-08-04 北京华捷艾米科技有限公司 Three-dimensional face recognition model training method and device
CN112001268B (en) * 2020-07-31 2024-01-12 中科智云科技有限公司 Face calibration method and equipment
CN111950496B (en) * 2020-08-20 2023-09-15 广东工业大学 Mask person identity recognition method
CN112200006A (en) * 2020-09-15 2021-01-08 青岛邃智信息科技有限公司 Human body attribute detection and identification method under community monitoring scene
CN112489205A (en) * 2020-12-16 2021-03-12 北京航星机器制造有限公司 Method for manufacturing simulated human face
CN112507916B (en) * 2020-12-16 2021-07-27 苏州金瑞阳信息科技有限责任公司 Face detection method and system based on facial expression
CN112686202B (en) * 2021-01-12 2023-04-25 武汉大学 Human head identification method and system based on 3D reconstruction
CN112819928B (en) * 2021-01-27 2022-10-28 成都数字天空科技有限公司 Model reconstruction method and device, electronic equipment and storage medium
CN112882666A (en) * 2021-03-15 2021-06-01 上海电力大学 Three-dimensional modeling and model filling-based 3D printing system and method
CN113255466A (en) * 2021-04-30 2021-08-13 广州有酱料网络科技有限公司 Sauce supply chain logistics monitoring system
CN113468984A (en) * 2021-06-16 2021-10-01 哈尔滨理工大学 Crop pest and disease leaf identification system, identification method and pest and disease prevention method
CN113469269A (en) * 2021-07-16 2021-10-01 上海电力大学 Residual convolution self-coding wind-solar-charged scene generation method based on multi-channel fusion
CN113705393A (en) * 2021-08-16 2021-11-26 武汉大学 3D face model-based depression angle face recognition method and system
CN116188612A (en) * 2023-02-20 2023-05-30 信扬科技(佛山)有限公司 Image reconstruction method, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090403A (en) * 2016-11-22 2018-05-29 上海银晨智能识别科技有限公司 A kind of face dynamic identifying method and system based on 3D convolutional neural networks
CN108805977A (en) * 2018-06-06 2018-11-13 浙江大学 A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks
CN109299643A (en) * 2018-07-17 2019-02-01 深圳职业技术学院 A kind of face identification method and system based on big attitude tracking
CN110020620A (en) * 2019-03-29 2019-07-16 中国科学院深圳先进技术研究院 Face identification method, device and equipment under a kind of big posture
CN110046551A (en) * 2019-03-18 2019-07-23 中国科学院深圳先进技术研究院 A kind of generation method and equipment of human face recognition model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170000748A (en) * 2015-06-24 2017-01-03 삼성전자주식회사 Method and apparatus for face recognition
JP6754619B2 (en) * 2015-06-24 2020-09-16 三星電子株式会社Samsung Electronics Co.,Ltd. Face recognition method and device
CN108229322B (en) * 2017-11-30 2021-02-12 北京市商汤科技开发有限公司 Video-based face recognition method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090403A (en) * 2016-11-22 2018-05-29 上海银晨智能识别科技有限公司 A kind of face dynamic identifying method and system based on 3D convolutional neural networks
CN108805977A (en) * 2018-06-06 2018-11-13 浙江大学 A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks
CN109299643A (en) * 2018-07-17 2019-02-01 深圳职业技术学院 A kind of face identification method and system based on big attitude tracking
CN110046551A (en) * 2019-03-18 2019-07-23 中国科学院深圳先进技术研究院 A kind of generation method and equipment of human face recognition model
CN110020620A (en) * 2019-03-29 2019-07-16 中国科学院深圳先进技术研究院 Face identification method, device and equipment under a kind of big posture

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Josef Kittler et al. Conformal Mapping of a 3D Face Representation onto a 2D Image for CNN Based Face Recognition. 2018 International Conference on Biometrics (ICB). 2018, 124-131. *
Xia Yangyang. Research on Face Recognition under Unconstrained Conditions Based on Deep Learning. China Masters' Theses Full-text Database (Information Science and Technology), 2017. *

Also Published As

Publication number Publication date
CN110991281A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110991281B (en) Dynamic face recognition method
Lee et al. From big to small: Multi-scale local planar guidance for monocular depth estimation
Li et al. In ictu oculi: Exposing ai generated fake face videos by detecting eye blinking
CN110119686B (en) Safety helmet real-time detection method based on convolutional neural network
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN112766160B (en) Face replacement method based on multi-stage attribute encoder and attention mechanism
CN108537743B (en) Face image enhancement method based on generation countermeasure network
CN112766158B (en) Multi-task cascading type face shielding expression recognition method
CN106650653B (en) Construction method of human face recognition and age synthesis combined model based on deep learning
CN112418095B (en) Facial expression recognition method and system combined with attention mechanism
CN107403142B (en) A kind of detection method of micro- expression
CA2934514A1 (en) System and method for identifying faces in unconstrained media
CN109684969B (en) Gaze position estimation method, computer device, and storage medium
CN112418041B (en) Multi-pose face recognition method based on face orthogonalization
CN109960975B (en) Human face generation and human face recognition method based on human eyes
WO2023040679A1 (en) Fusion method and apparatus for facial images, and device and storage medium
Yu et al. Detecting deepfake-forged contents with separable convolutional neural network and image segmentation
CN116645917A (en) LED display screen brightness adjusting system and method thereof
CN115484410A (en) Event camera video reconstruction method based on deep learning
CN113850182A (en) Action identification method based on DAMR-3 DNet
Chen et al. Face recognition with masks based on spatial fine-grained frequency domain broadening
CN116091793A (en) Light field significance detection method based on optical flow fusion
CN115546828A (en) Method for recognizing cow faces in complex cattle farm environment
CN114898447A (en) Personalized fixation point detection method and device based on self-attention mechanism
CN114067187A (en) Infrared polarization visible light face translation method based on countermeasure generation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant