CN107122705B - Face key point detection method based on three-dimensional face model - Google Patents

Face key point detection method based on three-dimensional face model

Info

Publication number
CN107122705B
CN107122705B · Application CN201710159215.1A
Authority
CN
China
Prior art keywords
dimensional
parameter
face
human face
face model
Prior art date
Legal status
Active
Application number
CN201710159215.1A
Other languages
Chinese (zh)
Other versions
CN107122705A (en)
Inventor
朱翔昱
雷震
刘浩
李子青
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN201710159215.1A
Publication of CN107122705A
Application granted
Publication of CN107122705B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G06V40/171: Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/64: Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a face key point detection method based on a three-dimensional face model, comprising the following steps: step 01, acquiring a face image and initial parameters of a three-dimensional face model from a face training sample; step 02, generating a pose-adaptive feature and a normalized coordinate code from the face image and the initial parameters; step 03, respectively transforming and fusing the pose-adaptive feature and the normalized coordinate code with a convolutional neural network to obtain the parameter residual of the initial parameters, regressed toward the true residual; step 04, updating the initial parameters according to the parameter residual, and returning to step 02 until the parameter residual reaches a preset threshold; and step 05, updating the three-dimensional face model with the parameter residual that reaches the preset threshold, and collecting the face key points on the three-dimensional face model. The invention realizes face key point detection under full pose.

Description

Face key point detection method based on three-dimensional face model
Technical Field
The invention belongs to the technical field of image processing and pattern recognition, and particularly relates to a face key point detection method based on a three-dimensional face model.
Background
Face key points are a series of points with fixed semantics on the face, such as the eye corners, nose tip and mouth corners, and detecting them is an important preprocessing step in face-understanding computer vision. Most face analysis systems must first perform key point detection to accurately know the layout of the facial features, so that features can then be extracted at designated positions on the face. However, most existing key point detection methods can only handle faces in a medium pose or less, i.e., with a yaw angle below 45 degrees, and detecting key points on large-pose faces (where the yaw angle can reach 90 degrees) has always been a difficulty.
The challenges are mainly threefold. First, conventional key point detection algorithms assume that every key point has a stable appearance and can therefore be detected; under a large pose, however, some key points inevitably become invisible due to self-occlusion, their appearance information is occluded, and the conventional methods fail. Second, face appearance changes far more drastically under large poses, ranging from frontal to profile, which requires the localization algorithm to be robust enough to understand face appearance under different poses. Finally, on the training data side, calibrating key points on large-pose faces is difficult because the positions of invisible key points must be guessed; most faces in existing databases are in medium pose, the few databases containing large-pose faces label only the visible key points, and it is therefore hard to design a key point algorithm that handles arbitrary pose.
One possible solution in the prior art is to fit a three-dimensional face model directly from the image, generally using a cascaded convolutional neural network to transform the input image and regress the parameters of the three-dimensional face model. However, this technique has the following drawbacks. First, it expresses face rotation with Euler angles, which become ambiguous under large poses due to gimbal lock. Second, it uses only image-view input features, i.e., the raw image is fed directly into the convolutional neural network, whereas the intermediate results of the cascade could be used to progressively rectify the image and further improve fitting accuracy. Finally, it does not effectively model the priorities of the model parameters when training the convolutional neural network, so the fitting capacity of the network is dispersed over secondary parameters.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides a face key point detection method based on a three-dimensional face model, realizing face key point detection under full pose.
The method comprises the following steps:
step 01, extracting a face image and initial parameters of a three-dimensional face model from a face training sample;
step 02, generating a pose-adaptive feature and a normalized coordinate code from the face image and the initial parameters;
step 03, respectively transforming and fusing the pose-adaptive feature and the normalized coordinate code with a convolutional neural network to obtain the parameter residual of the initial parameters, regressed toward the true residual;
step 04, updating the initial parameters according to the parameter residual, and returning to step 02 until the parameter residual reaches a preset threshold;
and step 05, updating the three-dimensional face model with the parameter residual that reaches the preset threshold, and collecting the face key points on the three-dimensional face model.
Preferably, in step 02, the three-dimensional face model is projected when the pose-adaptive feature is generated, and the projection formula includes:

V(p) = f * Pr * R * (S̄ + A_id*α_id + A_exp*α_exp) + t_2d

where V(p) is the function that constructs and projects the three-dimensional face model, yielding the two-dimensional image coordinates of each key point on the three-dimensional model; S̄ denotes the mean face shape; A_id denotes the principal component axes of a PCA extracted from three-dimensional faces with neutral expression, and α_id denotes the shape parameters; A_exp denotes the principal component axes of a PCA extracted from the differences between expressive and neutral faces, and α_exp denotes the expression parameters; f is a scale factor; Pr is the orthographic projection matrix; R is the rotation matrix, constructed from the quaternion [q0, q1, q2, q3]; and t_2d is the translation vector. The fitting target parameters are [f, R, t_2d, α_id, α_exp], and the fitting target parameter set is [f, q0, q1, q2, q3, t_2d, α_id, α_exp].
Preferably, the formula for constructing the rotation matrix from the quaternion [q0, q1, q2, q3] is:

R = [ q0²+q1²−q2²−q3²   2(q1q2−q0q3)      2(q1q3+q0q2)
      2(q1q2+q0q3)      q0²−q1²+q2²−q3²   2(q2q3−q0q1)
      2(q1q3−q0q2)      2(q2q3+q0q1)      q0²−q1²−q2²+q3² ]
preferably, the generating of the pose adaptive feature in the step 02 includes:
calculating the cylindrical coordinates of each vertex of the three-dimensional face model, and sampling n x n anchor points at equal intervals on an azimuth axis and a height axis; in the fitting process, the anchor points are deformed, scaled, rotated and translated by using the parameters of the current model to obtain the positions of the anchor points on the image, and the posture self-adaptive feature is generated.
Preferably, the generating of the normalized coordinate code in step 02 involves the following formulas:

PNCC(I, p) = I & ZBuffer(V_3d(p), NCC)

NCC_d = (S̄_d − min(S̄_d)) / (max(S̄_d) − min(S̄_d)),  d ∈ {x, y, z}

where PNCC is the projected normalized coordinate code, NCC is the normalized coordinate code, I is the input face image, p is the current parameter, & is the stacking operation along the channel dimension, ZBuffer is a function that renders a three-dimensional mesh with a texture to generate a two-dimensional image, and V_3d(p) is the three-dimensional face after scaling, rotation, translation and deformation; the image generated by the stacking is the projected normalized coordinate code.
Preferably, step 03 specifically comprises:
transforming the pose-adaptive feature and the normalized coordinate code respectively with two parallel convolutional neural networks, fusing the transformed features with an additional fully connected layer, and regressing the fusion result to obtain the parameter residual.
Preferably, the parameter residual in step 03 is calculated as:

Δp_k = Net_k(PAF(p_k, I), PNCC(p_k, I))

where p_k is the current parameter, I is the input image, Δp_k is the residual between the current parameter and the true parameter, PAF is the pose-adaptive feature, PNCC is the projected normalized coordinate code, and Net_k is the two-way parallel convolutional neural network.
Preferably, step 03 further comprises training the convolutional neural network, the true residual being weighted during training with the formulas:

E_owpdc = (Δp − (p_g − p_0))^T diag(w*) (Δp − (p_g − p_0))

w* = argmin_{0 ≤ w ≤ 1} ||V(p_c + diag(w)*(p_g − p_c)) − V(p_g)||² + λ*||diag(w)*(p_g − p_c)||²

where p_c = p_0 + Δp, 0 ≤ w ≤ 1, w is the parameter weight vector, Δp is the output of the convolutional neural network, p_g is the ground-truth parameter, p_0 is the input parameter of the current iteration, p_c is the current parameter, V(p) is the deformation and weak perspective projection function, and diag denotes the construction of a diagonal matrix.
Preferably, in step 04, updating the initial parameters according to the parameter residual specifically means adding the parameter residual to the initial parameters.
Compared with the prior art, the invention has at least the following advantages:
the human face key point detection method based on the three-dimensional human face model realizes the human face key point detection under the full posture.
Drawings
FIG. 1 is a schematic flow chart of the face key point detection method based on a three-dimensional face model according to the present invention;
FIG. 2 is a schematic diagram of the processing flow of the two-way parallel convolutional neural network provided by the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The invention discloses a face key point detection method based on a three-dimensional face model which, as shown in FIG. 1, comprises the following steps:
and 00, constructing a three-dimensional variable human face model.
Obtaining a three-dimensional face point cloud sample through a three-dimensional scanner, and constructing a three-dimensional variable model by using Principal Component Analysis (PCA):
Figure GDA0002234280830000041
wherein S represents a three-dimensional face of a person,
Figure GDA0002234280830000042
representing the average shape of a human face, Aidprincipal component axis, alpha, of PCA extracted from three-dimensional face with neutral expressionidDenotes a shape parameter, Aexprepresenting the principal component axis, α, of PCA extracted from the difference between expressive and neutral facesexpRepresenting an expression parameter.
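For illustration, the construction of S can be sketched in a few lines of NumPy (a minimal sketch under assumed array shapes; the function name construct_face is illustrative, not from the patent):

    import numpy as np

    def construct_face(S_mean, A_id, alpha_id, A_exp, alpha_exp):
        # S = S_mean + A_id * alpha_id + A_exp * alpha_exp, with the face
        # stored as a flattened (3N,) vector of N vertices.
        S = S_mean + A_id @ alpha_id + A_exp @ alpha_exp
        return S.reshape(-1, 3)  # one (x, y, z) row per vertex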
After the three-dimensional face model is constructed, it is projected onto the image plane with a weak perspective projection:

V(p) = f * Pr * R * (S̄ + A_id*α_id + A_exp*α_exp) + t_2d

where V(p) is the function that constructs and projects the face model, yielding the two-dimensional image coordinates of each point on the three-dimensional model; f is a scale factor, Pr is the orthographic projection matrix, R is the rotation matrix, and t_2d is the translation vector. The fitting target parameters are then [f, R, t_2d, α_id, α_exp].
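A weak perspective projection of this form can be sketched in NumPy as follows (a minimal sketch; the array shapes and the name project are assumptions):

    import numpy as np

    def project(S, f, R, t2d):
        # V(p) = f * Pr * R * S + t_2d, where the orthographic projection
        # Pr = [[1, 0, 0], [0, 1, 0]] simply drops the z coordinate.
        Pr = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
        return f * (S @ R.T) @ Pr.T + t2d  # (N, 2) image coordinates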
Traditionally, the pose of a face is expressed with Euler angles, comprising pitch, yaw and roll. However, when the yaw angle approaches 90°, i.e., the pose approaches profile, gimbal lock makes the Euler angles ambiguous: two different sets of Euler angles may correspond to the same rotation matrix. Therefore the quaternion [q0, q1, q2, q3] is adopted to represent the rotation matrix, and the scale factor f is integrated into this matrix; the resulting model parameter set is:
[f, q0, q1, q2, q3, t_2d, α_id, α_exp].
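For illustration, the standard homogeneous quaternion-to-rotation formula can be written as follows (a sketch; folding the scale factor into the squared quaternion norm is an assumption consistent with the text, not necessarily the patent's exact construction):

    import numpy as np

    def quat_to_rotation(q0, q1, q2, q3):
        # Homogeneous quaternion-to-rotation formula; for a non-unit
        # quaternion the matrix is scaled by |q|^2, one way of folding
        # the scale factor into the matrix.
        return np.array([
            [q0*q0 + q1*q1 - q2*q2 - q3*q3, 2*(q1*q2 - q0*q3),             2*(q1*q3 + q0*q2)],
            [2*(q1*q2 + q0*q3),             q0*q0 - q1*q1 + q2*q2 - q3*q3, 2*(q2*q3 - q0*q1)],
            [2*(q1*q3 - q0*q2),             2*(q2*q3 + q0*q1),             q0*q0 - q1*q1 - q2*q2 + q3*q3],
        ])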
a three-dimensional variable face model is used as a fitting target. Manually calibrating the face key points to serve as basic training samples (or using a public face key point data set as a basic training sample), and performing out-of-plane rotation on the face by using a face sidedness technology on the basis to generate a face training sample set with a larger variable angle and richness.
Step 01, extracting the face image and the initial parameters.
Step 02, generating the pose-adaptive feature and the normalized coordinate code.
The following describes the convolutional neural network based three-dimensional face model fitting algorithm, i.e., how to estimate the pose, shape and expression parameters of a face with a convolutional neural network. Two input features are designed for the convolutional neural network: the pose-adaptive feature and the projected normalized coordinate code.
First, a Pose Adaptive Feature (PAF) is explained.
In a convolutional neural network, a conventional convolution layer convolves pixel by pixel along the two-dimensional image axes, whereas the PAF performs convolution at fixed semantic locations of the face. These locations are obtained as follows: considering that a face can be roughly approximated by a cylinder, the two-dimensional cylindrical coordinates of each vertex of the three-dimensional face model are computed, and n×n anchor points are sampled at equal intervals along the azimuth and height axes. In the fitting process, given the current model parameter p, the three-dimensional face model is projected and the positions of the anchor points on the two-dimensional image are obtained; these serve as the positions where the PAF convolution is performed. The convolution responses at the anchor points form an n×n feature map, on which conventional convolution operations can then be performed. To reduce the influence of features in occluded regions, the responses in occluded regions are divided by 2 when generating the pose-adaptive feature.
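A NumPy sketch of the anchor sampling follows (the axis conventions and the name sample_anchors are assumptions; each grid cell is mapped to its nearest mesh vertex):

    import numpy as np

    def sample_anchors(vertices, n=8):
        # Approximate the face by a cylinder: azimuth around the vertical
        # axis, height along it (axis conventions are an assumption).
        azimuth = np.arctan2(vertices[:, 0], vertices[:, 2])
        height = vertices[:, 1]
        az_grid = np.linspace(azimuth.min(), azimuth.max(), n)
        h_grid = np.linspace(height.min(), height.max(), n)
        anchors = []
        for h in h_grid:
            for az in az_grid:
                # nearest vertex to each (azimuth, height) grid cell
                d = ((azimuth - az) / np.ptp(azimuth)) ** 2 \
                    + ((height - h) / np.ptp(height)) ** 2
                anchors.append(int(np.argmin(d)))
        return np.array(anchors).reshape(n, n)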
The Projected Normalized Coordinate Code (PNCC) is described below. This input feature relies on a new coordinate encoding: first, the three-dimensional mean face is normalized to [0, 1] in three-dimensional space:

NCC_d = (S̄_d − min(S̄_d)) / (max(S̄_d) − min(S̄_d)),  d ∈ {x, y, z}
after normalization, the points on the three-dimensional model are uniquely distributed on [0,0,0] to [1,1,1], so that the three-dimensional model can be regarded as a three-dimensional coordinate code which is called normalized coordinate code. Unlike commonly used numbering (e.g., 0,1, …, n), normalized coordinate encodings are continuous in three-dimensional space. In the fitting process, given the current model parameters p, we use the ZBuffer algorithm to render the projected three-dimensional face with normalized coordinate encoding:
PNCC(I, p) = I & ZBuffer(V_3d(p), NCC)

NCC_d = (S̄_d − min(S̄_d)) / (max(S̄_d) − min(S̄_d)),  d ∈ {x, y, z}

where PNCC is the projected normalized coordinate code, NCC is the normalized coordinate code, I is the input face image, p is the current parameter, & is the stacking operation along the channel dimension, ZBuffer is a function that renders a three-dimensional mesh with a texture to generate a two-dimensional image, and V_3d(p) is the three-dimensional face after scaling, rotation, translation and deformation. The image generated by the stacking is the projected normalized coordinate code, which is fed into the convolutional neural network.
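A NumPy sketch of the NCC and the channel-stacking step is given below (the function names are assumptions; the z-buffer rendering itself, which needs a rasterizer, is left abstract):

    import numpy as np

    def normalized_coordinate_code(S_mean):
        # Min-max normalize the mean face to [0, 1] independently per axis.
        mn, mx = S_mean.min(axis=0), S_mean.max(axis=0)
        return (S_mean - mn) / (mx - mn)   # (N, 3), usable as RGB vertex colors

    def stack_pncc(image, rendered_ncc):
        # The '&' operation: stack the input image and the z-buffer rendering
        # of the NCC-colored face along the channel dimension.
        return np.concatenate([image, rendered_ncc], axis=2)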
The two generated features are complementary. The projected normalized coordinate code is an image-view feature, characterized by feeding the raw image directly into the convolutional neural network; the pose-adaptive feature is a model-view feature, characterized by rectifying the image with intermediate fitting results. The projected normalized coordinate code contains the whole face image, so its image context is richer; it suits face localization and coarse fitting, and matters most in the first few iterations. The pose-adaptive feature, because its convolution operates at the anchor points, in effect localizes and rectifies the face in the image with the current model parameters, progressively simplifying the fitting task; it suits fitting the details, and matters most in the final few iterations.
Step 03, transforming and fusing to obtain the parameter residual.
As can be seen, the two features are complementary. To make full use of their respective advantages, a two-way parallel convolutional neural network structure is used for K iterations. In the k-th iteration, given the initial parameter p_k, the pose-adaptive feature and the projected normalized coordinate code are generated from p_k, and the two-way parallel convolutional neural network shown in FIG. 2 is trained. The pose-adaptive feature branch comprises a pose-adaptive convolution layer, three ordinary convolution layers, three pooling layers and a fully connected layer; the projected normalized coordinate code branch comprises five convolution layers, four pooling layers and a fully connected layer. The network transforms the two features with the two parallel streams and fuses them with a fully connected layer. The fused features are used to regress the residual between the current parameters and the target parameters:
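The following PyTorch sketch illustrates the two-stream layout just described (the channel counts, layer sizes and the class name TwoStreamNet are illustrative assumptions; only the overall structure, two parallel streams fused by a fully connected layer that regresses Δp, follows the text):

    import torch
    import torch.nn as nn

    class TwoStreamNet(nn.Module):
        # One stream per input feature; a fully connected layer fuses the
        # two streams and regresses the parameter residual Δp.
        def __init__(self, n_params):
            super().__init__()
            def stream(in_ch):
                return nn.Sequential(
                    nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                    nn.Linear(64 * 4 * 4, 256), nn.ReLU())
            self.paf_stream = stream(in_ch=3)    # pose-adaptive feature map
            self.pncc_stream = stream(in_ch=6)   # image (3) stacked with rendered NCC (3)
            self.fuse = nn.Linear(256 + 256, n_params)

        def forward(self, paf, pncc):
            return self.fuse(torch.cat(
                [self.paf_stream(paf), self.pncc_stream(pncc)], dim=1))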
Δp_k = Net_k(PAF(p_k, I), PNCC(p_k, I))

where p_k is the current parameter, I is the input image, Δp_k is the residual between the current parameter and the true parameter, PAF is the pose-adaptive feature, PNCC is the projected normalized coordinate code, and Net_k is the two-way parallel convolutional neural network.
The basic idea for training the convolutional neural network is to make the regressed parameter residual close to the true parameter residual. However, the parameters of the face model differ in importance: a few parameters (such as the pose) matter far more than the rest, so the loss on each parameter needs to be weighted during training. In conventional algorithms the weights are mutually independent and are usually determined manually or from the loss caused by misestimating a given parameter. In reality, the weights of the parameters are correlated; for example, before the pose parameters are accurate enough, estimating the expression parameters is meaningless. The invention obtains the weights of all parameters jointly by optimizing an energy function, and designs the following Optimized Weighted Parameter Distance Cost (OWPDC):

E_owpdc = (Δp − (p_g − p_0))^T diag(w*) (Δp − (p_g − p_0))

w* = argmin_{0 ≤ w ≤ 1} ||V(p_c + diag(w)*(p_g − p_c)) − V(p_g)||² + λ*||diag(w)*(p_g − p_c)||²

where p_c = p_0 + Δp, 0 ≤ w ≤ 1, w is the parameter weight vector, Δp is the output of the convolutional neural network, p_g is the ground-truth parameter, p_0 is the input parameter of the current iteration, p_c is the current parameter, V(p) is the deformation and weak perspective projection function, and diag denotes the construction of a diagonal matrix.
As the formula shows, the weighted true residual diag(w)*(p_g − p_c) is added to the current parameter p_c, with the expectation that the three-dimensional face constructed from the updated parameters comes closer to the real face V(p_g). Meanwhile, since the fitting capacity of the neural network is limited, λ*||diag(w)*(p_g − p_c)||² models the pressure that fitting the current parameters exerts on the network; adding it as a loss term encourages the network to assign weight to the most cost-effective parameters.
During training, finding the optimal w for each sample in this way is too expensive, so V(p_c + diag(w)*(p_g − p_c)) is expanded with a Taylor series at p_g, which gives:

||V′(p_g) * diag(w − 1) * Δp_c||² + λ*||diag(w) * Δp_c||²

where Δp_c = p_g − p_c and V′(p_g) is the Jacobian of V(p) at p_g. Expanding this expression and removing the constant terms gives:

w^T (diag(Δp_c) V′(p_g)^T V′(p_g) diag(Δp_c)) w − 2*1^T (diag(Δp_c) V′(p_g)^T V′(p_g) diag(Δp_c)) w + λ*w^T diag(Δp_c .* Δp_c) w

Letting H = V′(p_g) * diag(Δp_c), the original optimization problem can be written as:

min_w  w^T (H^T H + λ*diag(Δp_c .* Δp_c)) w − 2*1^T H^T H w,  subject to 0 ≤ w ≤ 1
the above formula is a standard quadratic programming problem, which can be solved quickly by interior point method. However, the computation of H in this loss function is very time consuming, recalculating H while training each sample makes training time unacceptable. The experiment shows that the only non-constant term of H is V' (p)g) For each training sample V' (p)g) Is stationary. Thus, before training, V' (p) of each sample can be comparedg) Calculated and stored, and directly read during training. The weight value obtained, i.e., the weight lost by each parameter in the OWPDC, may describe the priority of each parameter.
Step 04, updating the initial parameters according to the parameter residual.
The input parameter and the parameter residual are added to obtain a better parameter, p_{k+1} = p_k + Δp_k, and the next iteration, comprising input feature construction and parameter estimation by the convolutional neural network, is performed. After the K iterations reach the preset threshold, V(p_K) yields the position of each point of the three-dimensional face on the image.
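The iteration can be summarized in a short sketch (schematic; build_paf and build_pncc are hypothetical stand-ins for the feature constructors sketched above, and K is the number of cascade stages):

    def fit_parameters(image, p0, networks, K=3):
        # Cascaded fitting: each iteration rebuilds the two input features
        # from the current parameters and adds the regressed residual,
        # p_{k+1} = p_k + Δp_k.
        p = p0
        for net in networks[:K]:
            delta_p = net(build_paf(image, p), build_pncc(image, p))
            p = p + delta_p
        return p  # feed into V(p) to read off keypoint positions on the image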
Step 05, collecting the face key points on the three-dimensional face model.
Because existing face key point training samples usually lie within the medium pose range, the invention generates large-pose training samples by rotating existing samples out of plane, specifically as follows:
Given a training sample comprising a face image and manually calibrated key points, a three-dimensional model of the face in the image can be obtained by key-point-based three-dimensional face model fitting. Some anchor points are then sampled uniformly over the background region, and the depth of each anchor point is estimated from the point on the three-dimensional face model closest to it. After the depths of all anchor points are obtained, triangulation is used to group the anchor points into a series of triangular patches. These patches, together with the fitted three-dimensional face, constitute the depth information of the image. This virtual depth image can be rotated out of plane in three-dimensional space and rendered at any angle, generating the appearance of the face in the image under different poses. Using 5 degrees of yaw as the step size, a series of virtual samples is generated by progressively enlarging the angle until it reaches 90 degrees.
The method overcomes the inability of conventional key point detection algorithms to locate self-occluded key points: it fits the three-dimensional face model directly from the image and samples the key points from the fitted three-dimensional face. During face fitting, in addition to the projected normalized coordinate code, an image-view feature, a dedicated model-view feature, the pose-adaptive feature, is designed; this feature uses intermediate fitting results to implicitly rectify the face in the image, simplifying the fitting task step by step and further improving fitting accuracy. Because the image-view and model-view features are complementary, the two input features are transformed and fused simultaneously by a two-way parallel convolutional neural network to combine their advantages, and the fused features are finally used for model parameter regression. When training the convolutional neural network, the priorities of the face model parameters are taken into account, so that fitting concentrates on the few important parameters and the fitting accuracy is further improved. Finally, the invention realizes face key point detection under full pose.
So far, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the drawings, but those skilled in the art will readily understand that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of the related technical features can be made without departing from the principle of the invention, and the technical solutions after such changes or substitutions will fall within the protection scope of the invention.

Claims (8)

1. A face key point detection method based on a three-dimensional face model is characterized by comprising the following steps:
step 01, acquiring a face image and initial parameters of a three-dimensional face model from a face training sample;
step 02, generating a pose-adaptive feature and a normalized coordinate code from the face image and the initial parameters;
step 03, respectively transforming and fusing the pose-adaptive feature and the normalized coordinate code with a convolutional neural network to obtain the parameter residual of the initial parameters, regressed toward the true residual;
step 04, updating the initial parameters according to the parameter residual, and returning to step 02 until the parameter residual reaches a preset threshold;
step 05, updating the three-dimensional face model with the parameter residual that reaches the preset threshold, and collecting the face key points on the three-dimensional face model;
the step 03 specifically includes:
and transforming the attitude self-adaptive feature and the normalized coordinate code respectively according to two parallel convolutional neural networks, fusing the transformed features by using an additional full-connection layer, and regressing a fusion result to obtain a parameter residual error.
2. The face key point detection method based on a three-dimensional face model according to claim 1, wherein in step 02 the three-dimensional face model is projected when the pose-adaptive feature is generated, and the projection formula includes:

V(p) = f * Pr * R * (S̄ + A_id*α_id + A_exp*α_exp) + t_2d

where V(p) is the function that constructs and projects the three-dimensional face model, yielding the two-dimensional image coordinates of each key point on the three-dimensional model; S̄ denotes the mean face shape; A_id denotes the principal component axes of a PCA extracted from three-dimensional faces with neutral expression, and α_id denotes the shape parameters; A_exp denotes the principal component axes of a PCA extracted from the differences between expressive and neutral faces, and α_exp denotes the expression parameters; f is a scale factor; Pr is the orthographic projection matrix; R is the rotation matrix, constructed from the quaternion [q0, q1, q2, q3]; and t_2d is the translation vector. The fitting target parameters are [f, R, t_2d, α_id, α_exp], and the fitting target parameter set is [f, q0, q1, q2, q3, t_2d, α_id, α_exp].
3. The face key point detection method based on a three-dimensional face model according to claim 2, wherein the formula for constructing the rotation matrix from the quaternion [q0, q1, q2, q3] is:

R = [ q0²+q1²−q2²−q3²   2(q1q2−q0q3)      2(q1q3+q0q2)
      2(q1q2+q0q3)      q0²−q1²+q2²−q3²   2(q2q3−q0q1)
      2(q1q3−q0q2)      2(q2q3+q0q1)      q0²−q1²−q2²+q3² ]
4. the method for detecting the key points of the human face based on the three-dimensional human face model according to claim 3, wherein the generating of the pose adaptive feature in the step 02 comprises:
calculating the cylindrical coordinates of each vertex of the three-dimensional face model, and sampling n x n anchor points at equal intervals on an azimuth axis and a height axis; in the fitting process, the anchor points are deformed, scaled, rotated and translated by using the parameters of the current model to obtain the positions of the anchor points on the image, and the posture self-adaptive feature is generated.
5. The face key point detection method based on a three-dimensional face model according to claim 3, wherein the generating of the normalized coordinate code in step 02 involves the following formulas:

PNCC(I, p) = I & ZBuffer(V_3d(p), NCC)

NCC_d = (S̄_d − min(S̄_d)) / (max(S̄_d) − min(S̄_d)),  d ∈ {x, y, z}

where PNCC is the projected normalized coordinate code, NCC is the normalized coordinate code, I is the input face image, p is the current parameter, & is the stacking operation along the channel dimension, ZBuffer is a function that renders a three-dimensional mesh with a texture to generate a two-dimensional image, and V_3d(p) is the three-dimensional face after scaling, rotation, translation and deformation; the image generated by the stacking is the projected normalized coordinate code.
6. The face key point detection method based on a three-dimensional face model according to claim 1, wherein the parameter residual in step 03 is calculated as:

Δp_k = Net_k(PAF(p_k, I), PNCC(p_k, I))

where p_k is the current parameter, I is the input image, Δp_k is the residual between the current parameter and the true parameter, PAF is the pose-adaptive feature, PNCC is the projected normalized coordinate code, and Net_k is the two-way parallel convolutional neural network.
7. The face key point detection method based on a three-dimensional face model according to any one of claims 1 to 6, wherein step 03 further comprises training the convolutional neural network, the true residual being weighted during training with the formulas:

E_owpdc = (Δp − (p_g − p_0))^T diag(w*) (Δp − (p_g − p_0))

w* = argmin_{0 ≤ w ≤ 1} ||V(p_c + diag(w)*(p_g − p_c)) − V(p_g)||² + λ*||diag(w)*(p_g − p_c)||²

where p_c = p_0 + Δp, 0 ≤ w ≤ 1, w is the parameter weight vector, Δp is the output of the convolutional neural network, p_g is the ground-truth parameter, p_0 is the input parameter of the current iteration, p_c is the current parameter, V(p) is the deformation and weak perspective projection function, and diag denotes the construction of a diagonal matrix.
8. The face key point detection method based on a three-dimensional face model according to any one of claims 1 to 6, wherein updating the initial parameters according to the parameter residual in step 04 means adding the parameter residual to the initial parameters.
CN201710159215.1A 2017-03-17 2017-03-17 Face key point detection method based on three-dimensional face model Active CN107122705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710159215.1A CN107122705B (en) 2017-03-17 2017-03-17 Face key point detection method based on three-dimensional face model


Publications (2)

Publication Number Publication Date
CN107122705A CN107122705A (en) 2017-09-01
CN107122705B true CN107122705B (en) 2020-05-19

Family

ID=59717971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710159215.1A Active CN107122705B (en) 2017-03-17 2017-03-17 Face key point detection method based on three-dimensional face model

Country Status (1)

Country Link
CN (1) CN107122705B (en)

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729904A (en) * 2017-10-09 2018-02-23 广东工业大学 A kind of face pore matching process based on the limitation of 3 D deformation face
CN109726613B (en) * 2017-10-27 2021-09-10 虹软科技股份有限公司 Method and device for detection
CN107944367B (en) * 2017-11-16 2021-06-01 北京小米移动软件有限公司 Face key point detection method and device
CN107967454B (en) * 2017-11-24 2021-10-15 武汉理工大学 Double-path convolution neural network remote sensing classification method considering spatial neighborhood relationship
CN108229313B (en) * 2017-11-28 2021-04-16 北京市商汤科技开发有限公司 Face recognition method and apparatus, electronic device, computer program, and storage medium
CN113688737A (en) * 2017-12-15 2021-11-23 北京市商汤科技开发有限公司 Face image processing method, face image processing device, electronic apparatus, storage medium, and program
CN108320274A (en) * 2018-01-26 2018-07-24 东华大学 It is a kind of to recycle the infrared video colorization method for generating confrontation network based on binary channels
CN108876894B (en) * 2018-02-01 2022-07-15 北京旷视科技有限公司 Three-dimensional human face model and three-dimensional human head model generation method and generation device
CN108764048B (en) * 2018-04-28 2021-03-16 中国科学院自动化研究所 Face key point detection method and device
CN108898556A (en) * 2018-05-24 2018-11-27 麒麟合盛网络技术股份有限公司 A kind of image processing method and device of three-dimensional face
CN111819568A (en) * 2018-06-01 2020-10-23 华为技术有限公司 Method and device for generating face rotation image
CN109035338B (en) * 2018-07-16 2020-11-10 深圳辰视智能科技有限公司 Point cloud and picture fusion method, device and equipment based on single-scale features
CN109299643B (en) * 2018-07-17 2020-04-14 深圳职业技术学院 Face recognition method and system based on large-posture alignment
CN109087379B (en) * 2018-08-09 2020-01-17 北京华捷艾米科技有限公司 Facial expression migration method and facial expression migration device
CN109191584B (en) 2018-08-16 2020-09-18 Oppo广东移动通信有限公司 Three-dimensional model processing method and device, electronic equipment and readable storage medium
WO2020037676A1 (en) * 2018-08-24 2020-02-27 太平洋未来科技(深圳)有限公司 Three-dimensional face image generation method and apparatus, and electronic device
WO2020041934A1 (en) * 2018-08-27 2020-03-05 华为技术有限公司 Data processing device and data processing method
CN109300114A (en) * 2018-08-30 2019-02-01 西南交通大学 The minimum target components of high iron catenary support device hold out against missing detection method
CN109448007B (en) * 2018-11-02 2020-10-09 北京迈格威科技有限公司 Image processing method, image processing apparatus, and storage medium
CN109726692A (en) * 2018-12-29 2019-05-07 重庆集诚汽车电子有限责任公司 High-definition camera 3D object detection system based on deep learning
CN109902616B (en) * 2019-02-25 2020-12-01 清华大学 Human face three-dimensional feature point detection method and system based on deep learning
CN109934196A (en) * 2019-03-21 2019-06-25 厦门美图之家科技有限公司 Human face posture parameter evaluation method, apparatus, electronic equipment and readable storage medium storing program for executing
CN110136243B (en) * 2019-04-09 2023-03-17 五邑大学 Three-dimensional face reconstruction method, system, device and storage medium thereof
CN110059602B (en) * 2019-04-10 2022-03-15 武汉大学 Forward projection feature transformation-based overlook human face correction method
CN110008873B (en) * 2019-04-25 2021-06-22 北京华捷艾米科技有限公司 Facial expression capturing method, system and equipment
CN112215050A (en) * 2019-06-24 2021-01-12 北京眼神智能科技有限公司 Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment
CN110298319B (en) * 2019-07-01 2021-10-08 北京字节跳动网络技术有限公司 Image synthesis method and device
CN110348406B (en) * 2019-07-15 2021-11-02 广州图普网络科技有限公司 Parameter estimation method and device
CN110705355A (en) * 2019-08-30 2020-01-17 中国科学院自动化研究所南京人工智能芯片创新研究院 Face pose estimation method based on key point constraint
CN110866962B (en) * 2019-11-20 2023-06-16 成都威爱新经济技术研究院有限公司 Virtual portrait and expression synchronization method based on convolutional neural network
CN111062266B (en) * 2019-11-28 2022-07-15 东华理工大学 Face point cloud key point positioning method based on cylindrical coordinates
CN111145166B (en) * 2019-12-31 2023-09-01 北京深测科技有限公司 Security monitoring method and system
CN111222469B (en) * 2020-01-09 2022-02-15 浙江工业大学 Coarse-to-fine human face posture quantitative estimation method
CN111401157A (en) * 2020-03-02 2020-07-10 中国电子科技集团公司第五十二研究所 Face recognition method and system based on three-dimensional features
CN111489435B (en) * 2020-03-31 2022-12-27 天津大学 Self-adaptive three-dimensional face reconstruction method based on single image
CN111898552B (en) * 2020-07-31 2022-12-27 成都新潮传媒集团有限公司 Method and device for distinguishing person attention target object and computer equipment
CN112002014B (en) * 2020-08-31 2023-12-15 中国科学院自动化研究所 Fine structure-oriented three-dimensional face reconstruction method, system and device
CN112307899A (en) * 2020-09-27 2021-02-02 中国科学院宁波材料技术与工程研究所 Facial posture detection and correction method and system based on deep learning
CN112287820A (en) * 2020-10-28 2021-01-29 广州虎牙科技有限公司 Face detection neural network, face detection neural network training method, face detection method and storage medium
CN112800882A (en) * 2021-01-15 2021-05-14 南京航空航天大学 Mask face posture classification method based on weighted double-flow residual error network
CN113643366B (en) * 2021-07-12 2024-03-05 中国科学院自动化研究所 Multi-view three-dimensional object attitude estimation method and device
CN113838134B (en) * 2021-09-26 2024-03-12 广州博冠信息科技有限公司 Image key point detection method, device, terminal and storage medium
CN113870227B (en) * 2021-09-29 2022-12-23 赛诺威盛科技(北京)股份有限公司 Medical positioning method and device based on pressure distribution, electronic equipment and storage medium
CN114125273B (en) * 2021-11-05 2023-04-07 维沃移动通信有限公司 Face focusing method and device and electronic equipment
CN114360031B (en) * 2022-03-15 2022-06-21 南京甄视智能科技有限公司 Head pose estimation method, computer device, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081733A (en) * 2011-01-13 2011-06-01 西北工业大学 Multi-modal information combined pose-varied three-dimensional human face five-sense organ marking point positioning method
CN105005755A (en) * 2014-04-25 2015-10-28 北京邮电大学 Three-dimensional face identification method and system
CN105354531A (en) * 2015-09-22 2016-02-24 成都通甲优博科技有限责任公司 Marking method for facial key points
CN105678284A (en) * 2016-02-18 2016-06-15 浙江博天科技有限公司 Fixed-position human behavior analysis method
CN106022228A (en) * 2016-05-11 2016-10-12 东南大学 Three-dimensional face recognition method based on vertical and horizontal local binary pattern on the mesh

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Face Alignment Across Large Poses: A 3D Solution"; Xiangyu Zhu et al.; 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016-12-31; abstract lines 13-17, Sections 3.1 and 3.2, Equation 10 in Section 5.1, Section 4.1, and Figures 2 and 4 *

Also Published As

Publication number Publication date
CN107122705A (en) 2017-09-01

Similar Documents

Publication Publication Date Title
CN107122705B (en) Face key point detection method based on three-dimensional face model
CN111795704B (en) Method and device for constructing visual point cloud map
CN106679648B (en) Visual inertia combination SLAM method based on genetic algorithm
CN105843223B (en) A kind of mobile robot three-dimensional based on space bag of words builds figure and barrier-avoiding method
JP4785880B2 (en) System and method for 3D object recognition
CN109671120A (en) A kind of monocular SLAM initial method and system based on wheel type encoder
CN105844276A (en) Face posture correction method and face posture correction device
CN114782691A (en) Robot target identification and motion detection method based on deep learning, storage medium and equipment
CN110288638B (en) Broken bone model rough registration method and system and broken bone model registration method
CN104395932A (en) Method for registering data
CN111862299A (en) Human body three-dimensional model construction method and device, robot and storage medium
CN113108773A (en) Grid map construction method integrating laser and visual sensor
CN113327275B (en) Point cloud double-view-angle fine registration method based on multi-constraint point to local curved surface projection
CN113470084B (en) Point set registration method based on outline rough matching
CN110097584A (en) The method for registering images of combining target detection and semantic segmentation
CN108597016B (en) Torr-M-Estimators basis matrix robust estimation method based on joint entropy
CN109425348A (en) A kind of while positioning and the method and apparatus for building figure
CN107016319A (en) A kind of key point localization method and device
CN111998862A (en) Dense binocular SLAM method based on BNN
Pilu et al. Training PDMs on models: the case of deformable superellipses
CN103700135B (en) A kind of three-dimensional model local spherical mediation feature extracting method
CN111598995B (en) Prototype analysis-based self-supervision multi-view three-dimensional human body posture estimation method
CN114565728A (en) Map construction method, pose determination method, related device and equipment
CN105488491A (en) Human body sleep posture detection method based on pyramid matching histogram intersection kernel
CN114332070A (en) Meteor crater detection method based on intelligent learning network model compression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant