CN114842136A - Single-image three-dimensional face reconstruction method based on differentiable renderer - Google Patents
- Publication number
- CN114842136A (application CN202210365752.2A)
- Authority
- CN
- China
- Prior art keywords
- image
- face
- dimensional face
- renderer
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T17/00 — Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
- G06T15/005 — 3D image rendering; general purpose rendering architectures
- G06V40/161 — Human faces; detection, localisation, normalisation
- G06V40/168 — Human faces; feature extraction, face representation
Abstract
The invention discloses a single-image three-dimensional face reconstruction method based on a differentiable renderer, which comprises the following steps: S1, inputting the target face image into a pre-trained regression network to obtain initialized three-dimensional face parameters; S2, inputting the three-dimensional face parameters and pose parameters into a differentiable renderer to obtain a rendered image with the same pose as the input image and rendered images with different poses; S3, comparing the target image with the rendered image of the same pose to obtain a keypoint loss value and a pixel-level loss value; S4, comparing the target image with the rendered images of different poses to obtain an adversarial loss value and an identity consistency loss value; S5, optimizing the three-dimensional face parameters according to the calculated loss values, and returning the updated parameters to step S2 until the iteration converges, yielding the optimized three-dimensional face parameters. By introducing a truly differentiable renderer, the invention obtains a higher-quality three-dimensional face reconstruction result.
Description
Technical Field
The invention belongs to the field of image processing, and relates to a single-image three-dimensional face reconstruction method based on a differentiable renderer.
Background
With the continuous development of face recognition technology, three-dimensional face reconstruction has gradually become an important application branch of computer graphics. Traditional three-dimensional face reconstruction mainly depends on expensive three-dimensional scanning equipment and extensive manual post-processing. How to reconstruct a three-dimensional face model quickly and accurately from a two-dimensional face picture is therefore a research focus.
The most advanced three-dimensional face reconstruction methods at present can be roughly divided into two types: learning-based methods and optimization-based methods. Deep-learning-based methods usually adopt a regression scheme, taking a face image as input and learning to regress the corresponding three-dimensional face model parameters. However, these methods usually require a large amount of labelled data, and ground-truth three-dimensional face model parameters are difficult to acquire. Optimization-based methods, on the other hand, treat face imaging as a generative process: a series of parameters (including face geometry, albedo, texture, illumination, viewing angle, and so on) is taken as input, a rendered image is generated according to fixed graphics rules, and the input parameters are optimized by minimizing the distance between the rendered image and the target image.
Recent developments in differentiable renderers provide an efficient optimization framework for both types of face reconstruction method. In learning-based methods, the regressed parameters can be rendered into an image and optimized with a pixel loss, enabling unsupervised training. For optimization-based methods, the differentiable renderer introduces gradient-based optimization, allowing more complex loss functions to be employed and stabilizing the training process. However, most existing methods simply use z-buffer rendering, which is not truly differentiable, especially when the triangle covering a pixel changes during optimization. As a result, the reconstructed three-dimensional faces lack some realism and some texture details (hair, eyebrows, wrinkles).
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art by providing a single-image three-dimensional face reconstruction method based on a differentiable renderer, solving the problem that three-dimensional face reconstruction cannot otherwise be realized accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
A single-image three-dimensional face reconstruction method based on a differentiable renderer comprises the following steps:
S1, inputting the target face image into a pre-trained three-dimensional face model to obtain initialized three-dimensional face parameters;
S2, inputting the initialized three-dimensional face parameters and the pose parameters into a differentiable renderer to obtain a rendered image with the same pose as the input image and rendered images with different poses;
S3, comparing the target face image with the rendered image of the same pose to obtain a keypoint loss value and a pixel-level loss value;
S4, comparing the target face image with the rendered images of different poses to obtain an adversarial loss value and an identity consistency loss value;
S5, optimizing the three-dimensional face parameters according to the calculated loss values, and returning the updated parameters to step S2 until the iteration converges, yielding the optimized three-dimensional face parameters.
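The iterative procedure of steps S2 to S5 can be sketched in code. The sketch below is an illustration, not the patent's implementation: the toy renderer, the loss weights and the finite-difference gradient (standing in for backpropagation through a differentiable renderer) are all assumptions, and the keypoint, adversarial and identity terms are left as zero-valued stubs since they require auxiliary networks.

```python
import numpy as np

def total_loss(params, target, render, w=(1.0, 1.0, 0.1, 0.5)):
    """Weighted sum of the four loss terms; the weights are assumed, not from the patent."""
    same_pose = render(params, pose="input")
    l_pix = np.abs(target - same_pose).mean()  # pixel-level L1 loss (step S3)
    l_lan = 0.0  # keypoint loss stub: needs a face-alignment network (step S3)
    l_adv = 0.0  # adversarial loss stub: needs a discriminator (step S4)
    l_id = 0.0   # identity loss stub: needs a face-recognition network (step S4)
    return w[0] * l_pix + w[1] * l_lan + w[2] * l_adv + w[3] * l_id

def optimize(params, target, render, lr=0.5, iters=200, h=1e-4):
    """Central-difference gradient descent over the face parameters (step S5)."""
    params = params.astype(float)
    for _ in range(iters):
        grad = np.zeros_like(params)
        for i in range(params.size):
            e = np.zeros_like(params)
            e[i] = h
            grad[i] = (total_loss(params + e, target, render)
                       - total_loss(params - e, target, render)) / (2 * h)
        params -= lr * grad
    return params

# Toy "renderer": the rendered image is just the parameter vector itself.
toy_render = lambda p, pose: p
target = np.array([0.3, -0.2, 0.7])
fitted = optimize(np.zeros(3), target, toy_render)
```

With a real differentiable renderer, the finite-difference loop would be replaced by automatic differentiation, which is precisely what the renderer's differentiability enables.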
Further, step S1 specifically includes:
inputting the target face image into a pre-trained Large Scale Face Model three-dimensional face model, and regressing initialized three-dimensional face parameters including an identity parameter α, an expression parameter β, a texture parameter T and a pose parameter p_0;
the expression of the three-dimensional face model is as follows:
S = S̄ + s_α·α + s_β·β
where S denotes the three-dimensional face shape, S̄ the mean face shape, s_α and s_β the face identity and facial expression basis vectors respectively, α a 158-dimensional face identity coefficient, and β a 29-dimensional facial expression coefficient.
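As a concrete illustration of the linear model above, the snippet below builds a toy shape from randomly generated bases; the tiny dimensions (a 9-value shape vector, 3 identity and 2 expression coefficients instead of 158 and 29) are placeholders, not the real model data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 9                                  # toy flattened shape: 3 vertices x 3 coords
mean_shape = rng.normal(size=n)        # S-bar: the mean face shape
id_basis = rng.normal(size=(n, 3))     # s_alpha: identity basis (158-dim in the patent)
exp_basis = rng.normal(size=(n, 2))    # s_beta: expression basis (29-dim in the patent)

alpha = np.array([0.5, -0.1, 0.2])     # identity coefficients
beta = np.array([0.3, 0.0])            # expression coefficients

# S = S-bar + s_alpha * alpha + s_beta * beta
S = mean_shape + id_basis @ alpha + exp_basis @ beta
```

Setting β to zero recovers the identity-only, neutral-expression shape, which is how the linear decomposition separates who the face is from what it is doing.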
Further, the pose parameters are obtained by random sampling: the pitch and yaw angles are sampled from U(-40°, 40°), and the roll angle is sampled from U(-15°, 15°).
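A minimal sketch of this sampling scheme (angles in degrees, uniform distributions as stated above):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_pose(rng):
    """Random pose in degrees: pitch, yaw ~ U(-40, 40); roll ~ U(-15, 15)."""
    pitch, yaw = rng.uniform(-40.0, 40.0, size=2)
    roll = rng.uniform(-15.0, 15.0)
    return pitch, yaw, roll

poses = [sample_pose(rng) for _ in range(1000)]
```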
Further, the initialized three-dimensional face parameters and the pose parameters obtained by regression are input into the differentiable renderer to obtain a rendered image G_1 with the same pose as the input image and a rendered image G_2 with a different pose from the input image.
Further, the differentiable renderer handles occlusion through an aggregation mechanism over the probabilistic contributions of all triangles, implemented as follows:
a probability map is constructed to estimate the contribution D_j^i of triangle f_j to pixel P_i:
D_j^i = sigmoid(δ_ij · d²(i, j) / σ)
where δ_ij is +1 if pixel P_i lies inside triangle f_j and -1 otherwise, σ is a scalar controlling the sharpness of the probability distribution, d(i, j) is the Euclidean distance from pixel P_i to the edges of f_j, and sigmoid(·) maps its input into (0, 1);
finally, through an aggregation function, the rendered output I_i at pixel P_i is:
I_i = Σ_j w_j^i·T_j + w_b^i·T_b
where T_j is the pixel value on the texture map for triangle f_j, T_b denotes the background color, and the weights w_j^i (normalized together with the background weight w_b^i) combine the probability maps with the relative depths of the triangles at P_i.
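The probability map and aggregation above can be sketched for a single pixel as follows. This follows the standard soft-rasterizer formulation; the sign convention δ_ij = +1 inside / -1 outside, the depth-softmax weighting with temperature γ, and the unit background factor (exp(ε/γ) with ε = 0) are assumptions filled in from that formulation rather than stated in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_pixel(dists, inside, depths, tex, t_bg, sigma=0.01, gamma=0.1):
    """Soft contribution of all triangles to one pixel P_i.

    dists:  Euclidean distance from P_i to each triangle's nearest edge
    inside: +1 where P_i lies inside the triangle, -1 where outside
    depths: per-triangle depth at P_i (larger = closer, an assumed convention)
    tex:    per-triangle texture value T_j;  t_bg: background value T_b
    """
    D = sigmoid(inside * dists**2 / sigma)   # probability map D_j^i in (0, 1)
    z = np.exp(depths / gamma)               # depth-softmax factor
    raw = D * z
    denom = raw.sum() + 1.0                  # background gets exp(eps/gamma), eps = 0
    w = raw / denom                          # triangle weights w_j^i
    w_bg = 1.0 / denom                       # background weight w_b^i
    return (w * tex).sum() + w_bg * t_bg     # aggregated output I_i

# Pixel inside a near triangle (index 0), outside a farther one (index 1):
out = soft_pixel(np.array([0.05, 0.2]),
                 np.array([1.0, -1.0]),
                 np.array([1.0, 0.2]),
                 np.array([0.9, 0.1]),
                 t_bg=0.0)
```

Here the covering, nearer triangle dominates the aggregation, so `out` lands close to its texture value 0.9, while every quantity involved remains a smooth function of the geometry.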
Further, step S3 is specifically:
calculating the pixel-level loss L_pix and the keypoint loss L_lan between the generated rendered image with the same pose as the input image and the input original image;
the pixel-level loss is computed from pixel value differences and also optimizes the illumination parameters; the expression is as follows:
L_pix = ||I_0 - R(α, β, T, p_0)||_1
where I_0 denotes the input original image, R denotes the differentiable renderer, and R(α, β, T, p_0) is the image rendered under the initial pose parameter p_0;
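A minimal sketch of this L1 pixel term, with the renderer's output passed in directly; the per-pixel mean reduction is a choice made here, since the formula only specifies the L1 norm.

```python
import numpy as np

def pixel_loss(target, rendered):
    """L_pix = ||I_0 - R(alpha, beta, T, p_0)||_1, taken as the mean absolute
    per-pixel difference between the target and the rendered image."""
    return np.abs(target.astype(float) - rendered.astype(float)).mean()

target = np.full((4, 4, 3), 0.5)
rendered = np.full((4, 4, 3), 0.25)
loss = pixel_loss(target, rendered)
```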
for the keypoint loss, a deep alignment network M performs 68-keypoint detection on the face image;
the original image and the rendered image are input to the deep alignment network to obtain the 68 keypoint coordinates of the facial features in each image, and the point-to-point Euclidean distances between corresponding keypoints are minimized to optimize the reconstruction result, making the pose of the rendered image consistent with that of the input image; the expression is as follows:
L_lan = ||M(I_0) - M(R(α, β, T, p_0))||_2
where M(·) denotes inputting an image to the deep alignment network to obtain its 68 keypoint coordinates of the facial features.
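A minimal sketch of the keypoint term, with the 68 detected coordinates passed in directly (the alignment network M itself is out of scope here); reducing the 68 point-to-point Euclidean distances by their mean is an interpretation, since the formula does not spell out the reduction.

```python
import numpy as np

def landmark_loss(kp_a, kp_b):
    """L_lan over 68 (x, y) keypoints: mean point-to-point Euclidean distance."""
    return np.linalg.norm(kp_a - kp_b, axis=1).mean()

kp_input = np.zeros((68, 2))
kp_rendered = kp_input + np.array([3.0, 4.0])  # every point offset by a 3-4-5 shift
loss = landmark_loss(kp_input, kp_rendered)
```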
Further, step S4 is specifically:
calculating, for the generated rendered image whose pose differs from the input, the adversarial loss L_adv and the identity consistency loss L_id:
the rendering branch is regarded as the generator of a generative adversarial network, and a discriminator D is designed to judge whether an image was produced by the generator; the generator-adversarial mechanism drives the renderer to produce more realistic face images, with the expression:
L_adv = log D(I_0) + log(1 - D(R(α, β, T, p)))
where R(α, β, T, p) is the image rendered under the random pose parameter p, and D(·) denotes inputting an image to the discriminator, which judges whether the input is a real image and outputs the probability that it is real;
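A minimal numeric sketch of L_adv, with the discriminator reduced to the probabilities it outputs; the small `eps` guard is an addition to avoid log(0), not part of the formula.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """L_adv = log D(I_0) + log(1 - D(R(alpha, beta, T, p))).

    d_real: discriminator probability that the target image is real
    d_fake: discriminator probability that the rendered image is real
    """
    return np.log(d_real + eps) + np.log(1.0 - d_fake + eps)

confident = adversarial_loss(1.0, 0.0)  # discriminator is never fooled
fooled = adversarial_loss(0.5, 0.5)     # discriminator reduced to guessing
```

From the generator's side, driving `d_fake` upward pushes the loss down, which is the "fool the discriminator" direction described in the text.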
a face recognition network is trained to extract 256-dimensional face features from the input image and the rendered image respectively; the cosine similarity between the two feature vectors is computed, and an identity consistency loss function is designed to judge whether the two images belong to the same person:
the identity consistency loss function expression is:
L_id = 1 - cos(F_R(R(α, β, T, p)), F_R(I_0))
where cos(·, ·) denotes the cosine similarity of two vectors and F_R(·) denotes inputting an image to the face recognition network F_R to obtain its 256-dimensional face feature.
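A minimal sketch of L_id on precomputed 256-dimensional embeddings; the recognition network F_R itself is out of scope and stands in as the vectors it would produce.

```python
import numpy as np

def identity_loss(feat_a, feat_b):
    """L_id = 1 - cos(F_R(rendered), F_R(target)) on 256-dim face embeddings."""
    cos_sim = feat_a @ feat_b / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))
    return 1.0 - cos_sim

f = np.random.default_rng(1).normal(size=256)
```

Identical embeddings give a loss of 0, opposite embeddings give 2, so minimizing the loss pulls the rendered face toward the target identity.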
Further, MTCNN is specifically used as the deep alignment network: the face image is input to obtain its 68 keypoint coordinates of the facial features.
Further, the face recognition network specifically adopts Light CNN-29v 2.
Further, the human face features are specifically eyes, eyebrows, nose, mouth, and contours.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method introduces a truly differentiable renderer; compared with an incompletely differentiable renderer, the dense correspondence between the two-dimensional image and the three-dimensional face can be learned better with a gradient descent algorithm, yielding a higher-quality three-dimensional face reconstruction result.
2. For an optimization-based method, the input three-dimensional face parameters are too abstract for a neural network, and the generated image often lacks realism and identity consistency; the method of the invention adds a generative adversarial network branch that can generate images identity-consistent with the target image but under different poses, and the introduced adversarial loss and identity consistency loss give the finally optimized three-dimensional face result higher realism and more detailed texture features.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is a network block diagram of the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
As shown in fig. 1 and fig. 2, a single-image three-dimensional face reconstruction method based on a differentiable renderer includes the following steps:
S1, inputting the target face image into a pre-trained regression network to obtain initialized three-dimensional face parameters; in this embodiment, the concrete steps are as follows:
the target face image is input into a pre-trained three-dimensional face model; the currently common Large Scale Face Model is adopted as the three-dimensional model, yielding initialized three-dimensional face parameters including an identity parameter α, an expression parameter β, a texture parameter T, a pose parameter p_0, and so on;
the expression of the 3DMM shape model is as follows:
S = S̄ + s_α·α + s_β·β
where S denotes the three-dimensional face shape, S̄ the mean face shape, s_α and s_β the face identity and facial expression basis vectors respectively, α a 158-dimensional face identity coefficient, and β a 29-dimensional facial expression coefficient.
S2, inputting the initialized three-dimensional face parameters and pose parameters into the differentiable renderer to obtain a rendered image G_1 with the same pose as the input image and a rendered image G_2 with a different pose;
in this embodiment, G_1 is rendered from the three-dimensional face parameters initialized in step S1, namely the identity parameter α, the expression parameter β, the texture parameter T and the pose parameter p_0. During the rendering of G_2, the input identity, expression and texture parameters are the same as in the rendering of G_1, but the pose parameter p is obtained by random sampling: the pitch and yaw angles are sampled from U(-40°, 40°), i.e. a uniform distribution ranging from -40° to 40°, and the roll angle is sampled from U(-15°, 15°).
The differentiable renderer in this embodiment differs from the z-buffer renderer of traditional rasterization. Renderers that take a three-dimensional mesh as input mostly adopt z-buffer rasterization at present. The principle of a traditional rasterization renderer follows the rules of computer graphics: a three-dimensional mesh is input, a triangular patch is assigned to each pixel of the rendered image, and the color of each pixel is computed by interpolating the attributes (illumination, texture, normal vector) of the vertices of the triangle corresponding to that pixel, yielding the rendered two-dimensional image. Traditional rasterization is a discrete, non-differentiable process; in research on single-image three-dimensional reconstruction, a traditional rasterization renderer cannot optimize the reconstruction parameters with a gradient descent algorithm, leading to low reconstruction quality for large-pose faces (i.e. non-frontal, partially occluded faces). The differentiable renderer of this embodiment instead handles occlusion through an aggregation mechanism over the probabilistic contributions of all triangles, implemented as follows:
a probability map is constructed to estimate the contribution D_j^i of triangle f_j to pixel P_i:
D_j^i = sigmoid(δ_ij · d²(i, j) / σ)
where δ_ij is +1 if pixel P_i lies inside triangle f_j and -1 otherwise, σ is a scalar controlling the sharpness of the probability distribution, d(i, j) is the Euclidean distance from pixel P_i to the edges of f_j, and sigmoid(·) maps its input into (0, 1); in this embodiment, σ = 0.01.
Finally, through an aggregation function, the rendered output I_i at pixel P_i is:
I_i = Σ_j w_j^i·T_j + w_b^i·T_b
where T_j is the pixel value on the texture map for triangle f_j, T_b denotes the background color, and the weights w_j^i (normalized together with the background weight w_b^i) combine the probability maps with the relative depths of the triangles at P_i.
S3, calculating the pixel-level loss L_pix and the keypoint loss L_lan between the generated rendered image with the same pose as the input image and the target face image;
the pixel-level loss (pixel loss) is computed from pixel value differences; it also optimizes illumination parameters such as the ambient color and the direction, distance and color of the light source, which helps improve the recovery of texture features. The expression is as follows:
L_pix = ||I_0 - R(α, β, T, p_0)||_1
where I_0 denotes the target face image and R denotes the renderer;
for the keypoint loss (landmark loss), a deep alignment network M performs 68-keypoint detection on the face image. In this embodiment, MTCNN is used as the deep alignment network: the original image and the rendered image are input to obtain the 68 keypoint coordinates of the facial features (eyes, eyebrows, nose, mouth and contour) in each image, and the point-to-point Euclidean distances between corresponding keypoints are minimized to optimize the reconstruction result, making the pose of the rendered image consistent with that of the input image; the expression is as follows:
L_lan = ||M(I_0) - M(R(α, β, T, p_0))||_2
where M(·) denotes inputting an image to the deep alignment network to obtain its 68 keypoint coordinates of the facial features.
S4, calculating the adversarial loss L_adv and the identity consistency loss L_id between the input image and the generated rendered image whose pose differs from the input:
the rendering branch is regarded as the generator part (Generator) of a generative adversarial network, and a discriminator D with a generic discriminator network structure is designed; the rendered image and the original image are both input to D, which finally outputs a probability representing how likely the image is to be real, judging whether the image was produced by the generator (real/fake); the generator-adversarial mechanism drives the renderer to produce more realistic face images, with the expression:
L_adv = log D(I_0) + log(1 - D(R(α, β, T, p)))
where R(α, β, T, p) is the image rendered under the random pose parameter p, and D(·) denotes inputting an image to the discriminator, which judges whether the input is a real image and outputs the probability that it is real.
In the generator-adversarial mechanism, the training objective of the generator is to produce sufficiently realistic images, while the training objective of the discriminator is to judge image authenticity ever more accurately. The generator and the discriminator play a game: in this embodiment and the formula above, the goal of the generator (the renderer) is for the rendered image generated under the random pose parameter p to "fool" the discriminator, until the trained discriminator can no longer distinguish real images from generated rendered images; the formula expresses the design and optimization of the generator (renderer).
A face recognition network F_R is trained; in this embodiment, Light CNN-29v2 is adopted as the face recognition network to extract 256-dimensional face features (embeddings) from the input image and the rendered image, the cosine similarity between them is computed, and an identity consistency loss function is designed to judge whether the two images belong to the same person:
the identity consistency loss function expression is:
L_id = 1 - cos(F_R(R(α, β, T, p)), F_R(I_0))
where cos(·, ·) denotes the cosine similarity of two vectors and F_R(·) denotes inputting an image to the face recognition network to obtain its 256-dimensional face feature.
And S5, optimizing the three-dimensional face parameters according to the calculated loss values, and returning the updated three-dimensional face parameters to the step S2 until iteration is converged to obtain the optimized three-dimensional face parameters.
It should also be noted that in this specification, terms such as "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A single-image three-dimensional face reconstruction method based on a differentiable renderer is characterized by comprising the following steps:
S1, inputting the target face image into a pre-trained three-dimensional face model to obtain initialized three-dimensional face parameters;
S2, inputting the initialized three-dimensional face parameters and the pose parameters into a differentiable renderer to obtain a rendered image with the same pose as the input image and rendered images with different poses;
S3, comparing the target face image with the rendered image of the same pose to obtain a keypoint loss value and a pixel-level loss value;
S4, comparing the target face image with the rendered images of different poses to obtain an adversarial loss value and an identity consistency loss value;
S5, optimizing the three-dimensional face parameters according to the calculated loss values, and returning the updated parameters to step S2 until the iteration converges, yielding the optimized three-dimensional face parameters.
2. The method for reconstructing a single-image three-dimensional face based on a differentiable renderer according to claim 1, wherein the step S1 specifically comprises:
inputting the target face image into a pre-trained Large Scale Face Model three-dimensional face model, and regressing initialized three-dimensional face parameters including an identity parameter α, an expression parameter β, a texture parameter T and a pose parameter p_0;
the expression of the three-dimensional face model being:
S = S̄ + s_α·α + s_β·β
where S denotes the three-dimensional face shape, S̄ the mean face shape, and s_α and s_β the face identity and facial expression basis vectors respectively.
3. The differentiable-renderer-based single-image three-dimensional face reconstruction method according to claim 1, wherein the pose parameters are obtained by random sampling, the pitch and yaw angles being sampled from U(-40°, 40°) and the roll angle from U(-15°, 15°).
4. The differentiable-renderer-based single-image three-dimensional face reconstruction method according to claim 1, wherein the initialized three-dimensional face parameters and the pose parameters obtained by regression are input into the differentiable renderer to obtain a rendered image G_1 with the same pose as the input image and a rendered image G_2 with a different pose from the input image.
5. The differentiable-renderer-based single-image three-dimensional face reconstruction method according to claim 4, wherein the differentiable renderer handles occlusion through an aggregation mechanism over the probabilistic contributions of all triangles, implemented as follows:
a probability map is constructed to estimate the contribution D_j^i of triangle f_j to pixel P_i:
D_j^i = sigmoid(δ_ij · d²(i, j) / σ)
where δ_ij is +1 if pixel P_i lies inside triangle f_j and -1 otherwise, σ is a scalar controlling the sharpness of the probability distribution, d(i, j) is the Euclidean distance from pixel P_i to the edges of f_j, and sigmoid(·) maps its input into (0, 1);
finally, through an aggregation function, the rendered output I_i at pixel P_i is:
I_i = Σ_j w_j^i·T_j + w_b^i·T_b
where T_j is the pixel value on the texture map and T_b denotes the background color.
6. The method for reconstructing a single-image three-dimensional face based on a differentiable renderer according to claim 1, wherein the step S3 specifically comprises:
calculating the pixel-level loss L_pix and the keypoint loss L_lan between the generated rendered image with the same pose as the input image and the input original image;
the pixel-level loss is computed from pixel value differences and also optimizes the illumination parameters; the expression is as follows:
L_pix = ||I_0 - R(α, β, T, p_0)||_1
where I_0 denotes the input original image, R denotes the differentiable renderer, and R(α, β, T, p_0) is the image rendered under the initial pose parameter p_0;
for the keypoint loss, a deep alignment network M performs 68-keypoint detection on the face image;
the original image and the rendered image are input to the deep alignment network to obtain the 68 keypoint coordinates of the facial features in each image, and the point-to-point Euclidean distances between corresponding keypoints are minimized to optimize the reconstruction result, making the pose of the rendered image consistent with that of the input image; the expression is as follows:
L_lan = ||M(I_0) - M(R(α, β, T, p_0))||_2
where M(·) denotes inputting an image to the deep alignment network to obtain its 68 keypoint coordinates of the facial features.
7. The method for reconstructing a single-image three-dimensional face based on a differentiable renderer according to claim 1, wherein the step S4 specifically comprises:
calculating the generation resisting loss L of the generated rendering image different from the input posture adv Loss of identity consistency L id :
The generation route is regarded as a generator part for generating the confrontation network, a discriminator D is designed to judge whether the generator generates an image or not, a generation-confrontation mechanism is utilized to optimize a renderer to generate a more realistic face image, and the expression is as follows:
L adv =log D(I 0 )+log(1-D(R(α,β,T,p)))
wherein R (alpha, beta, T, p) is a rendering image generated by the renderer under the random attitude parameter p; d () represents an input discriminator which judges whether an input image is a real image, to obtain a probability of being a real image;
A face recognition network is trained to extract 256-dimensional face features from the input image and the rendered image respectively; the cosine similarity of the two feature vectors is computed as their similarity measure, and an identity consistency loss function is designed to judge whether the two images belong to the same person:
the identity consistency loss function expression is:
L_id = 1 - cos(F_R(R(α, β, T, p)), F_R(I_0))
where cos(·, ·) denotes the cosine similarity of two vectors, F_R denotes the face recognition network, and F_R(·) denotes feeding an image to the face recognition network to obtain its 256-dimensional face features.
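The identity term is one minus the cosine similarity of two embeddings; a minimal NumPy sketch follows (illustrative, not part of the claims). The embeddings stand in for the outputs of F_R; the claim specifies 256-dimensional features, but the formula itself is dimension-agnostic.

```python
import numpy as np

def identity_loss(feat_rendered: np.ndarray, feat_input: np.ndarray) -> float:
    """Identity consistency loss 1 - cos(F_R(R(...)), F_R(I_0)),
    where the arguments are face embeddings (e.g., 256-dimensional)."""
    cos_sim = float(np.dot(feat_rendered, feat_input)
                    / (np.linalg.norm(feat_rendered) * np.linalg.norm(feat_input)))
    return 1.0 - cos_sim
```

Identical embeddings give a loss of 0 (same identity); orthogonal embeddings give a loss of 1.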
8. The differentiable-renderer-based single-image three-dimensional face reconstruction method according to claim 6, wherein MTCNN is used as the deep alignment network, taking the face image as input and obtaining its 68 keypoint coordinates for the facial features.
9. The differentiable-renderer-based single-image three-dimensional face reconstruction method according to claim 7, wherein the face recognition network specifically adopts Light CNN-29v2.
10. The differentiable-renderer-based single-image three-dimensional face reconstruction method according to claim 6 or 7, wherein the facial features are the eyes, eyebrows, nose, mouth, and contour.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210365752.2A CN114842136A (en) | 2022-04-08 | 2022-04-08 | Single-image three-dimensional face reconstruction method based on differentiable renderer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114842136A true CN114842136A (en) | 2022-08-02 |
Family
ID=82564953
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210365752.2A Pending CN114842136A (en) | 2022-04-08 | 2022-04-08 | Single-image three-dimensional face reconstruction method based on differentiable renderer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114842136A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115588108A (en) * | 2022-11-02 | 2023-01-10 | 上海人工智能创新中心 | Method, electronic device and medium for generating sequence images |
CN115588108B (en) * | 2022-11-02 | 2024-05-14 | 上海人工智能创新中心 | Method, electronic equipment and medium for generating sequence image |
CN116206035A (en) * | 2023-01-12 | 2023-06-02 | 北京百度网讯科技有限公司 | Face reconstruction method, device, electronic equipment and storage medium |
CN116206035B (en) * | 2023-01-12 | 2023-12-01 | 北京百度网讯科技有限公司 | Face reconstruction method, device, electronic equipment and storage medium |
CN116363329A (en) * | 2023-03-08 | 2023-06-30 | 广州中望龙腾软件股份有限公司 | Three-dimensional image generation method and system based on CGAN and LeNet-5 |
CN116363329B (en) * | 2023-03-08 | 2023-11-03 | 广州中望龙腾软件股份有限公司 | Three-dimensional image generation method and system based on CGAN and LeNet-5 |
CN116978102A (en) * | 2023-08-04 | 2023-10-31 | 深圳市英锐存储科技有限公司 | Face feature modeling and recognition method, chip and terminal |
CN116993929A (en) * | 2023-09-27 | 2023-11-03 | 北京大学深圳研究生院 | Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium |
CN116993929B (en) * | 2023-09-27 | 2024-01-16 | 北京大学深圳研究生院 | Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium |
CN118115638A (en) * | 2024-01-24 | 2024-05-31 | 广州紫为云科技有限公司 | Monocular three-dimensional facial expression driving system based on deep learning and optimizing method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |