WO2021223134A1 - Micro-renderer-based method for acquiring reflection material of human face from single image - Google Patents

Micro-renderer-based method for acquiring reflection material of human face from single image

Info

Publication number
WO2021223134A1
WO2021223134A1 · PCT/CN2020/088883 · CN2020088883W
Authority
WO
WIPO (PCT)
Prior art keywords
image
face
network
reflection material
spherical harmonic
Prior art date
Application number
PCT/CN2020/088883
Other languages
French (fr)
Chinese (zh)
Inventor
翁彦琳
周昆
耿佳豪
王律迪
Original Assignee
浙江大学
杭州相芯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学 and 杭州相芯科技有限公司
Priority to PCT/CN2020/088883 priority Critical patent/WO2021223134A1/en
Publication of WO2021223134A1 publication Critical patent/WO2021223134A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 — 3D [Three Dimensional] image rendering
    • G06T 15/04 — Texture mapping

Definitions

  • The present invention relates to the field of facial capture, and in particular to a method for recovering the reflection material of a face from a single image.
  • The 3D morphable model (Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In SIGGRAPH. https://doi.org/10.1145/311535.311556) was the first to successfully model variations in face shape and appearance as a linear combination of a set of orthogonal bases.
  • 3D morphable models have influenced the development of many later methods, such as (James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway. 2016. A 3D Morphable Model Learnt from 10,000 Faces. In 2016 IEEE CVPR. 5543–5552).
  • Yamaguchi et al. proposed another deep-learning-based algorithm (Shugo Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski, Shigeo Morishima, and Hao Li. 2018. High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Transactions on Graphics (TOG) 37, 4 (2018), 162.) that can infer high-quality face materials from a single unconstrained image and use them to render plausible, realistic results; however, their method cannot guarantee that the rendering result is consistent with the characteristics of the target image.
  • The purpose of the present invention is to address the shortcomings of the prior art and provide a method, based on a differentiable renderer, for recovering high-quality face reflection materials from a single image.
  • The present invention first detects the 3D geometric information of the face in the input image and initializes the latent-space variables of the face reflection materials and the spherical harmonic lighting variable. A neural-network decoder then decodes the latent variables into the corresponding face reflection materials, and a neural-network quality enhancer improves the quality of the materials produced by the decoder.
  • A physically based differentiable renderer renders the face from the reflection materials and the spherical harmonic lighting, and the color-space difference between the rendering result and the input face is minimized.
  • The latent-space and spherical harmonic lighting variables are updated iteratively until convergence. Decoding and quality enhancement of the final latent variables yield high-quality face reflection materials that match the characteristics of the input face, and rendering with these materials produces high-fidelity results that closely match the input. The method reaches the state of the art in face material generation and has high practical value.
  • A method for recovering the reflection material of a human face from a single image based on a differentiable renderer, comprising the following steps:
  • Step 1: compute the 3D information of the face in the input image, and from it obtain the texture-space face color map and the static information needed for physically based differentiable rendering. The 3D information includes a 3D model of the face, a rigid transformation matrix, and a projection matrix; the static information includes a shadow map T_sha and an environment normal map T_bn.
  • Step 2: based on the texture-space face color map obtained in step 1, a convolutional-neural-network encoder produces the initial values of the latent-space coefficients of the face reflection materials, and the initial value of the spherical harmonic lighting coefficient is also obtained; the subscript * ∈ {a, n, s} denotes the diffuse, normal, and specular reflection materials respectively.
  • step 1 includes the following sub-steps:
  • Step 1.3, calculation of static information for physically based differentiable rendering: using the 3D model, rigid transformation matrix, and projection matrix from step 1.1, the texture coordinates are rasterized as color information into image space to obtain the texture-coordinate image I_uv; the rigid transformation matrix and 3D model obtained in step 1.1 are used to obtain the rigidly transformed 3D model.
  • A ray-tracing algorithm computes the occlusion of each vertex of this 3D model in every direction and projects it onto the spherical harmonic basis, yielding the occlusion spherical harmonic coefficients of each vertex.
  • In addition, the proportion of the unoccluded area and the center direction of the unoccluded area are recorded to obtain the environment normal vector of each vertex.
  • Triangulation of the texture space and barycentric interpolation of the per-vertex occlusion coefficients and environment normal vectors yield the final shadow map T_sha and the environment normal map T_bn.
  • the Poisson algorithm is used to fill the void areas in the face color image in the texture space.
  • The convolutional-neural-network encoder and decoder are obtained by training them jointly as a U-shaped network; the training includes the following sub-steps:
  • Training data: obtain N target face images I_o and the corresponding diffuse, normal, and specular reflection materials, and map each face image to texture space to obtain the corresponding texture-space face color image I. These items form the training data of the U-shaped networks; each has a resolution of 1024×1024.
  • Diffuse material, normal material, and specular material each have a U-shaped network.
  • For the diffuse-material U-shaped network U_a, the input is the scaled texture-space face color image.
  • The encoder part E_a of U_a contains 9 down-sampling modules.
  • The first 8 down-sampling modules each consist of a convolutional layer with a 3×3 kernel and 2×2 stride, a batch normalization layer, and an LReLU activation function layer.
  • The last down-sampling module consists of a convolutional layer with a 1×1 kernel and 2×2 stride, a batch normalization layer, and an LReLU activation function layer.
  • The final encoding is a 1×1×1024 diffuse-material latent space.
  • The decoder part D_a of U_a contains 9 up-sampling modules.
  • Each up-sampling module consists of a resize-convolution layer with a 3×3 kernel and 2× upscaling, a batch normalization layer, and an LReLU activation function layer; a final convolutional layer with a 1×1 kernel, 1×1 stride, and Sigmoid activation produces an output with a final resolution of 512×512×3.
  • For the normal-material U-shaped network U_n, the input is the texture-space face color image scaled by area interpolation to a resolution of 256×256.
  • The encoder E_n contains 8 down-sampling modules: the first 7 each consist of a convolutional layer with a 3×3 kernel and 2×2 stride, a batch normalization layer, and an LReLU activation function layer; the last module uses a 1×1 kernel with 2×2 stride, a batch normalization layer, and an LReLU activation function layer.
  • The final encoding is a 1×1×512 normal-material latent space.
  • The decoder D_n contains 8 up-sampling modules: the first 7 each consist of a resize-convolution layer with a 3×3 kernel and 2× upscaling, a batch normalization layer, and an LReLU activation function layer; a final convolutional layer with a 1×1 kernel, 1×1 stride, and Sigmoid activation produces an output with a final resolution of 256×256×3.
  • For the specular-material U-shaped network U_s, the encoder E_s has the same structure as E_n. The first 7 up-sampling modules of D_s each consist of a resize-convolution layer with a 3×3 kernel and 2× upscaling, a batch normalization layer, and an LReLU activation function layer; a final convolutional layer with a 1×1 kernel, 1×1 stride, and Sigmoid activation produces an output with a final resolution of 256×256×1.
  • In each U-shaped network, the three highest-resolution modules of E_* and D_* are connected by skip connections, where * is a, n, s.
  • U * represents a U-shaped network, where the subscript * can be a, n, s representing diffuse reflection material, normal material, and specular reflection material, respectively.
  • The output resolution of the diffuse-material network is 512×512, while the output resolution of the normal and specular networks is 256×256.
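  • The following is a minimal PyTorch sketch of the diffuse-material U-shaped network U_a described above (9 down-sampling and 9 up-sampling modules, skip connections on the three highest-resolution levels, 1×1×1024 latent code). The channel widths, the LReLU slope, and the nearest-neighbor resize-convolution are assumptions for illustration rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

def down(cin, cout, k=3):
    """Down-sampling module: strided conv -> batch norm -> LReLU."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=2, padding=k // 2),
        nn.BatchNorm2d(cout),
        nn.LeakyReLU(0.2, inplace=True),
    )

def up(cin, cout):
    """Up-sampling module: 2x resize-convolution -> batch norm -> LReLU."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(cin, cout, 3, stride=1, padding=1),
        nn.BatchNorm2d(cout),
        nn.LeakyReLU(0.2, inplace=True),
    )

class DiffuseUNet(nn.Module):
    """U_a: 512x512x3 color image -> 1x1x1024 latent -> 512x512x3 diffuse map."""
    def __init__(self):
        super().__init__()
        enc_ch = [32, 64, 128, 256, 512, 512, 512, 512]          # assumed widths
        self.enc = nn.ModuleList()
        cin = 3
        for c in enc_ch:
            self.enc.append(down(cin, c))
            cin = c
        self.enc.append(down(512, 1024, k=1))                    # 1x1 kernel, stride 2
        dec_ch = [512, 512, 512, 512, 512, 256, 128, 64, 32]     # assumed widths
        skip_ch = {6: 128, 7: 64, 8: 32}                         # skips on 3 finest levels
        self.dec = nn.ModuleList()
        cin = 1024
        for i, c in enumerate(dec_ch):
            self.dec.append(up(cin + skip_ch.get(i, 0), c))
            cin = c
        self.out = nn.Sequential(nn.Conv2d(32, 3, 1), nn.Sigmoid())

    def forward(self, x):
        feats = []
        for block in self.enc:
            x = block(x)
            feats.append(x)
        z = x                                                    # 1x1x1024 latent code
        skips = {6: feats[2], 7: feats[1], 8: feats[0]}          # 64, 128, 256 px features
        for i, block in enumerate(self.dec):
            if i in skips:
                x = torch.cat([x, skips[i]], dim=1)
            x = block(x)
        return self.out(x), z

model = DiffuseUNet().eval()
with torch.no_grad():
    albedo, z_a = model(torch.randn(1, 3, 512, 512))
print(albedo.shape, z_a.shape)   # torch.Size([1, 3, 512, 512]) torch.Size([1, 1024, 1, 1])
```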
  • The initial value of the spherical harmonic lighting of the input image is obtained by constructing a spherical harmonic lighting coefficient regression network.
  • The spherical harmonic lighting coefficient regression network consists of a convolutional-neural-network encoder and a regression module made of fully connected layers.
  • the training process includes the following steps:
  • A training data pair consists of {I_o, z_e}, where the spherical harmonic coefficient z_e is computed from the HDR environment light image I_e by projecting it onto the spherical harmonic basis (the formula is given as an image in the original document).
  • The resolution and detail quality of the reflection material images are improved by constructing reflection-material quality enhancement networks R_*; this involves the following sub-steps:
  • Training data: the face color images I used for training are fed into the U-shaped networks trained in step 2 to generate the low-resolution material predictions, which are paired with the original ground-truth materials of I to form the training data pairs; * denotes a, n, s.
  • The SRGAN network is used as the reflection-material quality enhancement network R_*, trained with a generative adversarial (GAN) scheme.
  • For the diffuse quality enhancement network R_a, the input resolution is 512×512 and the output image resolution is 1024×1024.
  • For the normal and specular quality enhancement networks, the first layer of the network accepts an image depth of 4: the input consists of the predicted material and the scaled texture-space face color image at an input resolution of 256×256, and the output is a 1024×1024 high-quality material image.
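  • A minimal sketch of an SRGAN-style generator for the 4× enhancement branch (interpreted here as R_s, whose 1-channel specular map plus the 3-channel scaled color image gives the stated input depth of 4). The number of residual blocks, feature widths, and the omission of the adversarial discriminator and training losses are assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)

class EnhanceGenerator(nn.Module):
    """SRGAN-style generator: 4-channel 256x256 input -> 1024x1024 material output."""
    def __init__(self, in_ch=4, out_ch=1, ch=64, n_blocks=8):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(in_ch, ch, 9, padding=4), nn.PReLU())
        self.body = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        # two x2 pixel-shuffle stages give the overall x4 upsampling (256 -> 1024)
        self.up = nn.Sequential(
            nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2), nn.PReLU(),
            nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2), nn.PReLU())
        self.tail = nn.Conv2d(ch, out_ch, 9, padding=4)
    def forward(self, x):
        h = self.head(x)
        h = h + self.body(h)                  # global residual over the block stack
        return torch.sigmoid(self.tail(self.up(h)))

# hypothetical usage: enhance a predicted specular map guided by the face color image
spec_256 = torch.rand(1, 1, 256, 256)         # specular map predicted by U_s
color_256 = torch.rand(1, 3, 256, 256)        # scaled texture-space face color image
gen = EnhanceGenerator().eval()
with torch.no_grad():
    spec_1024 = gen(torch.cat([spec_256, color_256], dim=1))
print(spec_1024.shape)                        # torch.Size([1, 1, 1024, 1024])
```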
  • Step 4.2, quality enhancement of the material images: the material images generated in step 3 are fed to the quality enhancement networks trained in step 4.1 to obtain the high-quality material images T_*, where * denotes a, n, s.
  • step 5 includes the following sub-steps:
  • Using I_uv obtained in step 1.3, the maps T_a, T_n, and T_s output by the quality enhancement networks, together with the shadow map T_sha and the environment normal map T_bn, are bilinearly sampled to obtain the corresponding image-space material images t_*, where * is a, n, s, sha, bn, denoting the diffuse material, normal material, specular material, shadow map, and environment normal map respectively.
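  • A minimal sketch of this bilinear sampling step, assuming PyTorch: the texture-coordinate image I_uv (per-pixel UV values in [0, 1]) is used with grid_sample to pull each texture-space map into image space. The UV convention and tensor layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sample_texture(tex, i_uv):
    """Bilinearly sample a texture-space map into image space.

    tex:  (1, C, H_t, W_t) texture-space map, e.g. T_a, T_n, T_s, T_sha, T_bn
    i_uv: (1, H, W, 2) per-pixel texture coordinates in [0, 1] (the image I_uv)
    """
    grid = i_uv * 2.0 - 1.0              # grid_sample expects coordinates in [-1, 1]
    return F.grid_sample(tex, grid, mode="bilinear", align_corners=False)

# hypothetical usage
T_a = torch.rand(1, 3, 1024, 1024)       # high-quality diffuse map from the enhancer
I_uv = torch.rand(1, 600, 800, 2)        # texture-coordinate image from step 1.3
t_a = sample_texture(T_a, I_uv)          # image-space diffuse material, (1, 3, 600, 800)
```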
  • All pixels of I_uv are traversed, and the diffuse lighting of each pixel is computed with a physically based rendering formula (given as an image in the original document).
  • k denotes the order of the spherical harmonic basis term; z_e and v are re-projected using the spherical harmonic product-projection property to obtain w.
  • v denotes the per-direction visibility of each pixel and is read from t_sha.
  • c denotes the spherical harmonic coefficients of the clamped cosine max(0, cos θ), rotated to the normal direction n of the current pixel; n is read from t_n.
  • DFG denotes the pre-computed rendering transport term under the GGX distribution, and LD is computed from the lighting (its formula is given as an image in the original document).
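  • A minimal sketch of the per-pixel diffuse term implied by the description above, assuming the combined lighting-and-visibility coefficients w and the rotated clamped-cosine coefficients c (both with K = 9 spherical harmonic terms) have already been computed per pixel. The variable names and the single-channel irradiance are illustrative assumptions, not the patent's exact formula.

```python
import torch

def diffuse_shading(albedo, w, c):
    """Per-pixel SH diffuse term: albedo * sum_k w_k * c_k.

    albedo: (H, W, 3)  image-space diffuse material t_a
    w:      (H, W, K)  SH coefficients of lighting combined with visibility
    c:      (H, W, K)  SH coefficients of max(0, cos theta) rotated to the pixel normal
    """
    irradiance = (w * c).sum(dim=-1, keepdim=True)      # per-pixel scalar irradiance
    return albedo * irradiance.clamp(min=0.0)

# hypothetical usage with K = 9 coefficients
H, W, K = 600, 800, 9
shaded = diffuse_shading(torch.rand(H, W, 3), torch.rand(H, W, K), torch.rand(H, W, K))
print(shaded.shape)  # torch.Size([600, 800, 3])
```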
  • L denotes the loss function between the rendering result and the input image.
  • Because the renderer, the quality enhancement networks, and the decoders are all differentiable, the loss value is back-propagated to z_* and z_* is updated iteratively, where * is a, n, s, e, denoting the diffuse material, normal material, specular material, and spherical harmonic lighting respectively, until convergence. Finally, z_a, z_n, and z_s are fed to the diffuse, normal, and specular material decoders respectively, and their outputs are fed to the corresponding material quality enhancement networks.
  • This yields materials T_a, T_n, and T_s that match the characteristics of the person in the input image.
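  • A minimal sketch of this iterative optimization, assuming the decoders, quality enhancement networks, and a differentiable render() function are available as differentiable black boxes. The Adam optimizer, the learning rate, and the plain L1 color loss shown here are assumptions; the patent only specifies that the color-space difference is minimized and that roughly 150 iterations are needed.

```python
import torch

def fit_latents(decoders, enhancers, render, target_image, z_init, z_light, iters=150):
    """Iteratively optimize latent material codes and SH lighting against the input photo.

    decoders/enhancers: dicts of differentiable modules keyed by 'a', 'n', 's'
    render: differentiable renderer taking the enhanced maps and SH lighting
    z_init: dict of initial latent codes; z_light: initial SH lighting coefficients
    """
    z = {k: v.clone().requires_grad_(True) for k, v in z_init.items()}
    z_e = z_light.clone().requires_grad_(True)
    opt = torch.optim.Adam(list(z.values()) + [z_e], lr=1e-2)   # lr is an assumption
    for _ in range(iters):
        maps = {k: enhancers[k](decoders[k](z[k])) for k in z}  # decode, then enhance
        rendered = render(maps['a'], maps['n'], maps['s'], z_e)
        loss = (rendered - target_image).abs().mean()           # color-space L1 difference
        opt.zero_grad()
        loss.backward()
        opt.step()
    return {k: v.detach() for k, v in z.items()}, z_e.detach()
```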
  • The beneficial effect of the present invention is that it combines a neural-network-based non-linear decoder, quality enhancement, and a physically based differentiable renderer to compute face reflection materials from a single face image.
  • The method reaches the state of the art in face reflection material estimation, with a short processing time.
  • the present invention can be used in applications such as the capture of human face materials, the reconstruction of human faces, and the rendering of real human faces.
  • Figure 1 is the result of solving, reconstructing and re-rendering the material of the first face picture by applying the method of the present invention.
  • A is the input image
  • B is the reconstruction result using the solved face reflection materials
  • C is the rendering result under new lighting conditions
  • D is the diffuse material t a
  • E is the normal material t n
  • F is the specular material t s .
  • Figure 2 is the result of solving, reconstructing and re-rendering the material of the second face picture by applying the method of the present invention.
  • A is the input image
  • B is the reconstruction result using the solved face reflection materials
  • C is the rendering result under new lighting conditions
  • D is the diffuse material t a
  • E is the normal material t n
  • F is the specular material t s .
  • Figure 3 is the result of solving, reconstructing and re-rendering the material of the third face picture by applying the method of the present invention.
  • A is the input image
  • B is the reconstruction result using the solved face reflection materials
  • C is the rendering result under new lighting conditions
  • D is the diffuse material t a
  • E is the normal material t n
  • F is the specular material t s .
  • Figure 4 is the result of solving, reconstructing and re-rendering the material of the fourth face picture by applying the method of the present invention.
  • A is the input image
  • B is the reconstruction result using the solved face reflection materials
  • C is the rendering result under new lighting conditions
  • D is the diffuse material t a
  • E is the normal material t n
  • F is the specular material t s .
  • Figure 5 is the result of solving, reconstructing and re-rendering the material of the fifth face picture by applying the method of the present invention.
  • A is the input image
  • B is the reconstruction result using the solved face reflection materials
  • C is the rendering result under new lighting conditions
  • D is the diffuse material t a
  • E is the normal material t n
  • F is the specular material t s .
  • the core technology of the present invention uses a neural network to non-linearly express the complex face reflection material space, and uses a physically-based differentiable renderer to optimize the space to obtain a face reflection material that meets the characteristics of the input image.
  • The method is divided into five main steps: computing the 3D geometric information of the face; initializing the latent space of the face reflection materials and the spherical harmonic lighting; decoding the latent space into reflection material images; improving the quality of the face reflection materials; and iteratively optimizing the latent-space coefficients and the spherical harmonic lighting coefficient of the face reflection materials, then solving for the face reflection materials from the latent-space coefficients.
  • Figures 1-5 are the results of applying the method of the present invention to solving the material of five character pictures, reconstructing the face, and re-rendering under new lighting.
  • the left picture in the first row of each picture is the input image
  • the middle picture is the result of the reconstruction of the face reflection material obtained by the solution
  • the right picture is the rendering result under the new lighting conditions;
  • in the second row, the left picture is the diffuse reflection material t_a
  • the middle picture is the normal material t_n
  • the right picture is the specular reflection material t_s; these are obtained by bilinearly sampling the solved materials with I_uv.
  • Calculation of the 3D geometric information of the face in the image: compute the 3D information of the face in the input image, and obtain the texture-space face color map and the static information for physically based differentiable rendering.
  • The present invention uses the algorithm of (Chen Cao, Qiming Hou, and Kun Zhou. 2014a. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on Graphics (TOG) 33, 4 (2014), 43.) to detect the two-dimensional feature points of the face in the input image, and uses (Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2387–2395.) to solve for the identity coefficients, the rigid transformation matrix, and the projection matrix; the deformable shape model is interpolated with the identity coefficients to obtain the 3D model of the input face:
  • Step 1.2: using the rigid transformation matrix and projection matrix obtained in step 1.1, project the 3D model obtained in step 1.1 onto the input image and establish a mapping between each vertex of the 3D model and the image pixels, so that the input image pixels can be mapped to the vertices of the 3D model. The mapping between the 3D model vertices and texture space is then used to map the image pixels into texture space, and the texture-space face color image is obtained by triangulating the texture space and interpolating barycentric coordinates within each triangle. Because of self-occlusion of the input face, the texture-space face color image contains hole regions, which are filled with the Poisson algorithm to obtain the final texture-space face color image.
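  • A minimal NumPy sketch of the vertex-to-pixel mapping used in step 1.2: each 3D vertex is transformed by the rigid transformation and projection matrices and the image color at the projected location is assigned to the vertex. The matrix conventions and the nearest-pixel lookup are illustrative assumptions; the texture-space rasterization and Poisson hole filling that follow are omitted.

```python
import numpy as np

def vertex_colors_from_image(vertices, rigid, proj, image):
    """Project 3D vertices into the image and sample a color per vertex.

    vertices: (N, 3) face mesh vertices
    rigid:    (4, 4) rigid transformation matrix
    proj:     (3, 4) projection matrix mapping camera space to pixel coordinates
    image:    (H, W, 3) input photograph
    """
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)  # (N, 4)
    cam = rigid @ homo.T                          # (4, N) camera-space positions
    pix = proj @ cam                              # (3, N) homogeneous pixel coordinates
    xy = (pix[:2] / pix[2]).T                     # (N, 2) pixel positions
    h, w = image.shape[:2]
    x = np.clip(np.round(xy[:, 0]).astype(int), 0, w - 1)
    y = np.clip(np.round(xy[:, 1]).astype(int), 0, h - 1)
    return image[y, x]                            # (N, 3) nearest-pixel vertex colors
```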
  • Step 1.3: using the 3D model, rigid transformation matrix, and projection matrix from step 1.1, the texture coordinates are rasterized as color information into image space to obtain the texture-coordinate image I_uv; the rigid transformation matrix and 3D model from step 1.1 give the rigidly transformed 3D model, and a ray-tracing algorithm computes the occlusion of each vertex of this model in every direction and projects it onto the spherical harmonic basis.
  • The first 9 spherical harmonic terms are used, yielding the occlusion spherical harmonic coefficients of each vertex; the proportion of the unoccluded area and the center direction of the unoccluded area are also recorded to obtain the environment normal vector of each vertex.
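  • A minimal Monte Carlo sketch of the per-vertex occlusion projection described above: visibility is sampled over the sphere (a placeholder trace_ray() stands in for the ray tracer), projected onto the first 9 real spherical harmonic basis functions, and the unoccluded directions are averaged to obtain the environment normal. The basis ordering, sample count, and helper names are assumptions.

```python
import numpy as np

def sh9(d):
    """First 9 real SH basis functions evaluated at a unit direction d = (x, y, z)."""
    x, y, z = d
    return np.array([
        0.282095,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z * z - 1),
        1.092548 * x * z, 0.546274 * (x * x - y * y)])

def vertex_occlusion_sh(position, trace_ray, n_samples=256, seed=0):
    """Monte Carlo projection of per-vertex visibility onto SH, plus environment normal."""
    rng = np.random.default_rng(seed)
    coeffs = np.zeros(9)
    open_dirs = []
    for _ in range(n_samples):
        d = rng.normal(size=3)
        d /= np.linalg.norm(d)                        # uniform direction on the sphere
        visible = 0.0 if trace_ray(position, d) else 1.0   # 1 if the ray escapes the mesh
        coeffs += visible * sh9(d)
        if visible:
            open_dirs.append(d)
    coeffs *= 4.0 * np.pi / n_samples                 # Monte Carlo estimator weight
    ratio = len(open_dirs) / n_samples                 # proportion of unoccluded area
    normal = np.mean(open_dirs, axis=0) if open_dirs else np.zeros(3)
    norm = np.linalg.norm(normal)
    return coeffs, ratio, (normal / norm if norm > 0 else normal)
```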
  • The face model database contains 84 3D digital characters, each of which includes a 3D model and diffuse, normal, and specular reflection materials.
  • The data in this embodiment come from 3D Scan Store. The CFD dataset (Debbie S Ma, Joshua Correll, and Bernd Wittenbrink. 2015. The Chicago face database: A free stimulus set of faces and norming data. Behavior research methods 47, 4 (2015), 1122–1135.) is used to augment the skin tones of the diffuse reflection materials, yielding about 4000 diffuse reflection material images.
  • The ambient light database contains 2957 HDR ambient light images I_e. Using these data, face images are rendered with image-based lighting and screen-space subsurface scattering.
  • U-shaped network structure Diffuse reflection material, normal material, and specular reflection material each have a U-shaped network.
  • Each U-shaped network is composed of encoder E, decoder D, and skip transfer.
  • For U_a, the input is the scaled texture-space face color image; the area-interpolation scaling algorithm is used to scale I to a resolution of 512×512.
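  • A one-line illustration of this scaling step, assuming that OpenCV's INTER_AREA interpolation is an acceptable stand-in for the area-interpolation algorithm mentioned above:

```python
import cv2
import numpy as np

I = (np.random.rand(1024, 1024, 3) * 255).astype(np.uint8)        # stand-in for the texture-space color image
I_512 = cv2.resize(I, (512, 512), interpolation=cv2.INTER_AREA)    # input for U_a
I_256 = cv2.resize(I, (256, 256), interpolation=cv2.INTER_AREA)    # input for U_n and U_s
```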
  • The encoder part E_a of U_a contains 9 down-sampling modules.
  • The first 8 down-sampling modules each consist of a convolutional layer with a 3×3 kernel and 2×2 stride, a batch normalization layer (S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.), and an LReLU activation function layer (Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models.).
  • The last down-sampling module differs from the first eight only in that its kernel size is 1×1; the final encoding is a 1×1×1024 diffuse-material latent space.
  • The decoder part D_a of U_a contains 9 up-sampling modules, each consisting of a resize-convolution layer with a 3×3 kernel and 2× upscaling (Jon Gauthier. 2014. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester 2014, 5 (2014), 2.), a batch normalization layer, and an LReLU activation function layer; a final convolutional layer with a 1×1 kernel, 1×1 stride, and Sigmoid activation produces an output with a final resolution of 512×512×3.
  • The above network structure can be expressed as (C32K3S2,BN,LReLU,Skip1)->(C64K3S2,BN,LReLU,Skip2)->(C128K3S2,BN,LReLU,Skip3)->(C258K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C1024K1S2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K… (the remainder of the string is truncated in the original document; by analogy with U_n, the decoder ends with the Skip3/Skip2/Skip1 modules and the final Sigmoid output layer).
  • For U_n, the input is the texture-space face color image scaled by area interpolation to a resolution of 256×256.
  • The main difference from U_a is that the encoder E_n and the decoder D_n each have one fewer down-sampling and up-sampling module; the latent space size is 1×1×512 and the output of D_n has size 256×256×3.
  • The network structure is expressed as (C32K3S2,BN,LReLU,Skip1)->(C64K3S2,BN,LReLU,Skip2)->(C128K3S2,BN,LReLU,Skip3)->(C258K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K1S2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC256K3R2,BN,LReLU)->(Skip3,RC128K3R2,BN,LReLU)->(Skip2,RC64K3R2,BN,LReLU)->(Skip1,RC32K3R2,BN,… (the final output layer is truncated in the original document).
  • The diffuse output resolution is 512×512, while the normal and specular output resolutions are 256×256.
  • The learning rate is 1e-4, and the optimizer used is the Adam optimizer (D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.).
  • The spherical harmonic lighting coefficient z_e of I_e is computed by projecting I_e onto the spherical harmonic basis (the formula is given as an image in the original document).
  • The training data pairs consist of {I_o, z_e}.
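  • A minimal sketch of how z_e can be computed from an equirectangular HDR environment image, consistent with the definitions given in the detailed description (pixel coordinates i, j mapped to spherical coordinates, first 9 spherical harmonic terms, 0 ≤ k < 9). The solid-angle weighting and the sh9 basis implementation are standard choices assumed here, not necessarily the patent's exact formula.

```python
import numpy as np

def sh9(theta, phi):
    """First 9 real SH basis functions at spherical coordinates (theta, phi)."""
    x = np.sin(theta) * np.cos(phi)
    y = np.sin(theta) * np.sin(phi)
    z = np.cos(theta)
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z * z - 1),
        1.092548 * x * z, 0.546274 * (x * x - y * y)], axis=-1)

def env_sh_coeffs(env):
    """Project an equirectangular HDR image (H, W, 3) onto 9 SH coefficients per channel."""
    H, W, _ = env.shape
    j, i = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    theta = np.pi * (j + 0.5) / H                      # polar angle from the row index
    phi = 2.0 * np.pi * (i + 0.5) / W                  # azimuth from the column index
    basis = sh9(theta, phi)                            # (H, W, 9)
    d_omega = (np.pi / H) * (2.0 * np.pi / W) * np.sin(theta)   # per-pixel solid angle
    weighted = env * d_omega[..., None]                # (H, W, 3)
    return np.einsum("hwk,hwc->kc", basis, weighted)   # (9, 3) coefficients z_e

z_e = env_sh_coeffs(np.random.rand(64, 128, 3).astype(np.float32))
print(z_e.shape)  # (9, 3)
```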
  • Decoding from latent space to reflection material space: a differentiable decoder based on a convolutional neural network decodes the latent-space coefficients of the face reflection materials into the corresponding reflection materials.
  • A differentiable quality enhancement network based on a convolutional neural network is then used to further improve the quality of the reflection materials.
  • Training data: the U-shaped networks trained in step 2.1 are applied to the color images I of the training data from step 2.1 to generate the low-resolution material predictions, which are paired with the ground-truth materials T_* of the training data from step 2.1; * denotes a, n, s.
  • For the normal and specular materials, the super-resolution networks R_n and R_s are likewise trained with a generative adversarial scheme. They differ from R_a in two respects: first, they take 256×256 material images as input and output 1024×1024 high-quality material images; second, in addition to the predicted material, their input also includes the scaled texture-space face color image.
  • Quality enhancement of the material images: the material images generated in step 3 are fed to the quality enhancement networks trained in step 4.1 to obtain the high-quality material images T_*, where * denotes a, n, s.
  • Iterative optimization of the latent space with the physically based differentiable renderer: by minimizing the difference between the rendering result of the physically based differentiable renderer and the input face image, the latent space of the face reflection materials is optimized iteratively, and the output face reflection materials are obtained through the decoding and quality enhancement operations.
  • L(ω) denotes the incident light from direction ω, V denotes visibility, and N denotes the normal direction; the integral is taken over the hemisphere about the normal.
  • The above formula can be further simplified using a spherical harmonic approximation (Peter-Pike Sloan, Jan Kautz, and John Snyder. 2002. Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments. In ACM Transactions on Graphics (TOG), Vol. 21. ACM, 527–536.).
  • L and V can be expressed in spherical harmonics; v, recorded in t_sha, denotes the spherical harmonic coefficients of visibility. The term max(0, N·ω) can also be expressed in spherical harmonics, where c denotes the spherical harmonic coefficients of the truncated cosine function, obtained by rotating the coefficients of max(0, cos θ) to the normal direction n of the current pixel; n is recorded in t_n.
  • f_r denotes the light transport (BRDF) term following the GGX distribution (Bruce Walter, Stephen R. Marschner, Hongsong Li, and Kenneth E. Torrance. 2007. Microfacet Models for Refraction through Rough Surfaces.), and ω_o denotes the viewing direction.
  • DFG denotes the pre-computed GGX rendering transport term, and LD is computed from the lighting (its formula is given as an image in the original document).
  • L denotes the loss function between the rendering result and the input image.
  • The loss value is back-propagated to z_*, and z_* is updated iteratively until convergence; finally, z_a, z_n, and z_s are fed to the diffuse, normal, and specular material decoders respectively, and their outputs are fed to the corresponding material quality enhancement networks to obtain materials T_a, T_n, and T_s that match the characteristics of the person in the input image.
  • The subscript * can be a, n, s, e, denoting the diffuse material, normal material, specular material, and spherical harmonic lighting respectively.
  • the inventor implemented the embodiment of the present invention on a machine equipped with Intel Xeon E5-4650 CPU and NVidia GeForce RTX 2080Ti graphics processor (11GB). The inventor used all the parameter values listed in the specific embodiments to obtain all the experimental results shown in Figures 1-5.
  • The present invention can effectively output high-quality face reflection materials from an input portrait image. For an image with a face area of about 600×800, computing the 3D geometric information of the face takes about 30 seconds, initializing the latent space takes about 10 milliseconds, each round of forward computation in the iterative optimization (decoding, quality enhancement, rendering) takes about 250 milliseconds, and roughly 150 iterations are needed to converge, so the entire iterative process takes about 40 seconds. In addition, training the U-shaped networks takes 12 hours, training the spherical harmonic lighting coefficient regression network takes 4 hours, and training the material quality enhancement networks takes about 50 hours; these modules only need to be trained once and can then be used to process any input portrait image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A micro-renderer (differentiable renderer)-based method for acquiring the reflection material of a human face from a single image. The method comprises five steps: calculating the 3D geometric information of the human face in an image; initializing a latent space of the face reflection material and the spherical harmonic lighting; decoding the latent space into face reflection material images; improving the quality of the face reflection material; and iteratively optimizing the latent-space coefficients and the spherical harmonic lighting coefficient of the face reflection material, then obtaining the face reflection material from the latent-space coefficients. By means of the method, a high-quality face material can be acquired iteratively from a frontal facial image with a neutral expression, and both facial reconstruction and re-rendering with the obtained material reach the level of the current state of the art. The method can be applied to a series of applications such as capturing face materials, facial reconstruction, and realistic face rendering.

Description

A method for recovering the reflection material of a human face from a single image based on a differentiable renderer

Technical field

The present invention relates to the field of facial capture, and in particular to a method for recovering the reflection material of a face from a single image.

Background
In the field of facial capture there is a class of professional capture methods based on specialized equipment. These methods require the target person to be in a specific, controlled environment, where professionals use purpose-built devices and algorithms to solve for the person's reflection materials. For example, the high-quality data obtained with Light Stages (Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin, and Mark Sagar. 2000. Acquiring the Reflectance Field of a Human Face. In Proceedings of SIGGRAPH 2000.) (Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and Paul Debevec. 2011. Multiview Face Capture using Polarized Spherical Gradient Illumination. ACM Trans. Graphics (Proc. SIGGRAPH Asia) (2011).) (Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul Debevec. 2007. Rapid Acquisition of Specular and Diffuse Normal Maps from Polarized Spherical Gradient Illumination.) has driven the creation of many digital characters in the film and television industry. There are also methods such as (Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. 2010. High-Quality Single-Shot Capture of Facial Geometry. ACM Trans. on Graphics (Proc. SIGGRAPH) 29, 3 (2010), 40:1–40:9.) (Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman, Robert W. Sumner, and Markus Gross. 2011. High-quality passive facial performance capture using anchor frames. ACM Trans. Graph. 30, 4 (Aug. 2011), 75:1–75:10. https://doi.org/10.1145/2010324.1964970) that use multi-camera rigs and shape-from-shading techniques to reconstruct pore-level facial detail. Graham et al. (P. Graham, Borom Tunwattanapong, Jay Busch, X. Yu, Andrew Jones, and Paul Debevec. 2013. Measurement-based Synthesis of Facial Microgeometry.) measure facial microgeometry with optical and elastic sensors. Such technology can be used to create high-fidelity digital characters, as in (J. von der Pahlen, J. Jimenez, E. Danvoye, Paul Debevec, Graham Fyffe, and Oleg Alexander. 2014. Digital Ira and Beyond: Creating a Real-Time Photoreal Digital Actor. Technical Report.). Although these methods can reconstruct high-fidelity digital faces, they require expensive, specialized equipment operated by professionals and are not accessible to ordinary users.
In addition, there are single-view facial capture methods. Among these, the 3D morphable model (Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In SIGGRAPH. https://doi.org/10.1145/311535.311556) was the first to successfully model variations in face shape and appearance as a linear combination of a set of orthogonal bases. Over the years, 3D morphable models have influenced the development of many methods, such as (James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway. 2016. A 3D Morphable Model Learnt from 10,000 Faces. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5543–5552. https://doi.org/10.1109/CVPR.2016.598) (Ira Kemelmacher. 2013. Internet Based Morphable Model. 3256–3263. https://doi.org/10.1109/ICCV.2013.404) (Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2387–2395.). These parametric linear-model methods generate face shape and reflection materials by minimizing a fitting loss; their main drawback is that their quality is limited by the expressive power of the linear model, which struggles to represent facial characteristics realistically. Methods such as (Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Pérez, and Christian Theobalt. 2017. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In arXiv:1703.10580 [cs]. http://arxiv.org/abs/1703.10580) (Luan Tran, Feng Liu, and Xiaoming Liu. 2019. Towards High-fidelity Nonlinear 3D Face Morphable Model. In Proceedings of IEEE Computer Vision and Pattern Recognition. Long Beach, CA.) (Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and William T. Freeman. 2018. Unsupervised Training for 3D Morphable Model Regression. In arXiv:1806.06098 [cs]. http://arxiv.org/abs/1806.06098) (Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. 2019. Accurate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image to Image Set. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 0–0.) use neural networks to disentangle the geometry and reflection materials of a single face image. However, these methods are not designed to produce materials usable for high-fidelity face reconstruction, so their results still lack expressiveness.
There are also methods aimed at generating highly realistic face reflection materials. Saito et al. proposed an algorithm for inferring a high-resolution diffuse material from a single unconstrained image (Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. 2017. Photorealistic Facial Texture Inference Using Deep Neural Networks. In arXiv:1612.00523 [cs]. http://arxiv.org/abs/1612.00523); their central idea is to use feature correlations in the intermediate layers of a neural network to blend high-resolution materials from a database and thereby generate fine facial detail. Yamaguchi et al. proposed another deep-learning-based algorithm (Shugo Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski, Shigeo Morishima, and Hao Li. 2018. High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Transactions on Graphics (TOG) 37, 4 (2018), 162.) that can infer high-quality face materials from a single unconstrained image and render plausible, realistic results; however, their method cannot guarantee that the rendering result is consistent with the characteristics of the target image.
Summary of the invention

The purpose of the present invention is to address the shortcomings of the prior art and provide a method, based on a differentiable renderer, for recovering high-quality face reflection materials from a single image. The present invention first detects the 3D geometric information of the face in the input image and initializes the latent-space variables of the face reflection materials and the spherical harmonic lighting variable. A neural-network decoder then decodes the latent variables into the corresponding face reflection materials, and a neural-network quality enhancer improves the quality of the materials produced by the decoder. Finally, a physically based differentiable renderer renders the face from the reflection materials and the spherical harmonic lighting, the color-space difference between the rendering result and the input face is minimized, and the latent-space and spherical harmonic lighting variables are updated iteratively until convergence. Decoding and quality enhancement of the final latent variables yield high-quality face reflection materials that match the characteristics of the input face, and rendering with these materials produces high-fidelity results that closely match the input. The method reaches the state of the art in face material generation and has high practical value.
The purpose of the present invention is achieved through the following technical solution: a method for recovering the reflection material of a human face from a single image based on a differentiable renderer, comprising the following steps:

(1) Compute the 3D information of the face in the input image, and from it obtain the texture-space face color map and the static information needed for physically based differentiable rendering. The 3D information includes a 3D model of the face, a rigid transformation matrix, and a projection matrix; the static information includes the shadow map T_sha and the environment normal map T_bn.
(2) Based on the texture-space face color map obtained in step 1, a convolutional-neural-network encoder produces the initial values of the latent-space coefficients of the face reflection materials, and the initial value of the spherical harmonic lighting coefficient is obtained, where * ∈ {a, n, s} denotes the diffuse, normal, and specular reflection materials respectively (the symbols are given as images in the original document).
(3) A differentiable decoder implemented with a convolutional neural network decodes the latent-space coefficients of the face reflection materials into the corresponding reflection material images.
(4) The resolution and detail quality of the reflection material images obtained in step 3 are enhanced to obtain the images T_*.
(5) By minimizing the difference between the input face image and the rendering produced by the physically based differentiable renderer from the quality-enhanced reflection material images T_* of step 4, the latent-space coefficients of the face reflection materials and the spherical harmonic lighting coefficient are optimized iteratively; the optimized latent-space coefficients are then passed through the decoding and quality enhancement operations of steps 3-4 to obtain the face reflection materials.
Further, step 1 includes the following sub-steps:

(1.1) Computation of the 3D information of the face: detect the two-dimensional feature points of the face in the input image, and use a deformable model to optimize the identity coefficients, rigid transformation matrix, and projection matrix; the 3D model of the person is obtained by linear interpolation of the deformable model with the identity coefficients.

(1.2) Computation of the texture-space face color image: using the rigid transformation matrix and projection matrix obtained in step 1.1, project the 3D model obtained in step 1.1 onto the input image and establish a mapping between each vertex of the 3D model and the image pixels; map the input image pixels to the vertices of the 3D model, then use the mapping between the model vertices and texture space to map the image pixels into texture space, and obtain the texture-space face color image by triangulating the texture space and interpolating barycentric coordinates within each triangle.

(1.3) Computation of the static information for physically based differentiable rendering: using the 3D model, rigid transformation matrix, and projection matrix from step 1.1, rasterize the texture coordinates as color information into image space to obtain the texture-coordinate image I_uv; use the rigid transformation matrix and 3D model from step 1.1 to obtain the rigidly transformed 3D model, compute the occlusion of each vertex of this model in every direction with a ray-tracing algorithm, and project it onto the spherical harmonic basis to obtain the occlusion spherical harmonic coefficients of each vertex; in addition, record the proportion of the unoccluded area and its center direction to obtain the environment normal vector of each vertex. Finally, triangulate the texture space and interpolate the per-vertex occlusion spherical harmonic coefficients and environment normal vectors with barycentric coordinates to obtain the shadow map T_sha and the environment normal map T_bn.

Further, in step 1.2, the Poisson algorithm is used to fill the hole regions in the texture-space face color image.
Further, the convolutional-neural-network encoders and decoders are obtained by training them jointly as U-shaped networks; the training includes the following sub-steps:

(a) Training data: obtain N target face images I_o and the corresponding diffuse, normal, and specular reflection materials, and map each face image to texture space to obtain the corresponding texture-space face color image I. These items form the training data of the U-shaped networks; each has a resolution of 1024×1024.
(b) The diffuse, normal, and specular materials each have their own U-shaped network. For the diffuse-material network U_a, the input is the scaled texture-space face color image. The encoder part E_a of U_a contains 9 down-sampling modules: the first 8 each consist of a convolutional layer with a 3×3 kernel and 2×2 stride, a batch normalization layer, and an LReLU activation function layer; the last module uses a 1×1 kernel with 2×2 stride, a batch normalization layer, and an LReLU activation function layer, and the final encoding is a 1×1×1024 diffuse-material latent space. The decoder part D_a of U_a contains 9 up-sampling modules, each consisting of a resize-convolution layer with a 3×3 kernel and 2× upscaling, a batch normalization layer, and an LReLU activation function layer; a final convolutional layer with a 1×1 kernel, 1×1 stride, and Sigmoid activation produces an output with a final resolution of 512×512×3. For the normal-material network U_n, the input is the texture-space face color image scaled by area interpolation to a resolution of 256×256. Its encoder E_n contains 8 down-sampling modules: the first 7 each consist of a convolutional layer with a 3×3 kernel and 2×2 stride, a batch normalization layer, and an LReLU activation function layer; the last module uses a 1×1 kernel with 2×2 stride, a batch normalization layer, and an LReLU activation function layer, and the final encoding is a 1×1×512 normal-material latent space. The decoder D_n contains 8 up-sampling modules: the first 7 each consist of a resize-convolution layer with a 3×3 kernel and 2× upscaling, a batch normalization layer, and an LReLU activation function layer; a final convolutional layer with a 1×1 kernel, 1×1 stride, and Sigmoid activation produces an output with a final resolution of 256×256×3. For the specular-material network U_s, the encoder E_s has the same structure as E_n; the first 7 up-sampling modules of D_s each consist of a resize-convolution layer with a 3×3 kernel and 2× upscaling, a batch normalization layer, and an LReLU activation function layer, and a final convolutional layer with a 1×1 kernel, 1×1 stride, and Sigmoid activation produces an output with a final resolution of 256×256×1. In each U-shaped network, the three highest-resolution modules of E_* and D_* are connected by skip connections, where * is a, n, s.
(c) The training loss function is defined as follows (the loss formulas are given as images in the original document). U_* denotes a U-shaped network, where the subscript * can be a, n, s, denoting the diffuse, normal, and specular materials respectively; the scaled texture-space face color image is the network input, and the network output material image is compared with the correspondingly scaled ground-truth material image. The diffuse output (and its scaled ground truth) has a resolution of 512×512, while the normal and specular outputs have a resolution of 256×256.
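The loss formulas appear only as images in the original document. The sketch below shows a plausible per-pixel L1 reconstruction loss for one U-shaped network, assuming the prediction is compared against the ground-truth material scaled to the same resolution; the actual loss used by the patent may contain additional terms.

```python
import torch

def unet_loss(pred, gt_material):
    """Plausible per-pixel L1 reconstruction loss between the U-net output and the
    ground-truth material scaled to the same resolution (an assumption, not the
    patent's exact formula)."""
    return (pred - gt_material).abs().mean()

loss = unet_loss(torch.rand(1, 3, 512, 512), torch.rand(1, 3, 512, 512))
```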
Further, in step 2, the initial value of the spherical harmonic lighting of the input image is obtained by constructing a spherical harmonic lighting coefficient regression network, which consists of a convolutional-neural-network encoder and a regression module made of fully connected layers. The training process includes the following steps:

(A) A training data pair consists of {I_o, z_e}, where the spherical harmonic coefficient z_e is computed from the HDR environment light image I_e by projecting it onto the spherical harmonic basis (the formula is given as an image in the original document). Here i, j denote the Cartesian pixel coordinates along the image width W and height H, Y_k denotes the spherical harmonic basis polynomial, k denotes the spherical harmonic term index with 0 ≤ k < 9, and φ denotes the conversion from image coordinates i, j to spherical coordinates θ, φ (the conversion equations are given as images in the original document).

(B) I_o is scaled to a resolution of 256×256 as the network input, and the network is trained end-to-end in a supervised manner with the L2 norm as the loss function.
Further, in step 4, the resolution and detail quality of the reflection material images are improved by constructing reflection material quality enhancement networks R_*, which specifically includes the following sub-steps:
(4.1) Train the reflection material quality enhancement networks based on convolutional neural networks, as follows:
(4.1.1) Training data: the face color images I used for training are fed into the U-shaped networks trained in step 2 to generate decoded material images, which together with the original materials of the face color images I compose the training data pairs, where * denotes a, n, s.
(4.1.2) Training method: the SRGAN network is adopted as the reflection material quality enhancement network R_* and trained in a generative adversarial (GAN) manner. For the diffuse material quality enhancement network R_a, the input is the 512×512 material image produced by the decoder and the output image resolution is 1024×1024. For the normal material quality enhancement network R_n and the specular material quality enhancement network R_s, the first layer of the network accepts an image depth of 4; the input comprises the decoder output material image and the scaled texture-space face color image, the input resolution is 256×256, and the output is a high-quality material image with a resolution of 1024×1024.
(4.2) Quality enhancement of the material images: based on the material images generated in step 3, quality enhancement is performed with the networks trained in step 4.1 to obtain the high-quality material images T_*, where * denotes a, n, s. Writing T̃_* for the material image generated in step 3 and Ĩ for the texture-space face color image scaled to 256×256, the whole process can be expressed as T_a = R_a(T̃_a), T_n = R_n(T̃_n, Ĩ) and T_s = R_s(T̃_s, Ĩ) (the corresponding expressions are given as formula images PCTCN2020088883-appb-000036 to PCTCN2020088883-appb-000038).
Further, step 5 includes the following sub-steps:
(5.1) Physically based forward rendering using the reflection materials and spherical harmonic lighting:
(5.1.1) Compute the diffuse reflection of the face: according to I_uv obtained in step 1.3, bilinearly sample T_a, T_n and T_s output by the quality enhancement networks, as well as the shadow map T_sha and the environment normal map T_bn, to obtain the corresponding image-space material images t_*, where * is a, n, s, sha, bn, denoting the diffuse material, normal material, specular material, shadow map and environment normal map respectively. Traverse all pixels in I_uv and compute the diffuse lighting of each pixel with the physically based rendering formula given as formula image PCTCN2020088883-appb-000040, in which k denotes the order of the spherical harmonic polynomial; z_e and v are re-projected using the spherical harmonic product-projection property to obtain w, where v denotes the per-pixel visibility in each direction and is recorded in t_sha; c is obtained by rotating the spherical harmonic coefficients of max(0, cosθ) to the current pixel normal direction n, which is recorded in t_n.
(5.1.2) Compute the specular reflection of the face and the rendering result: the specular highlight reflection of the face is computed as
L_s = DFG · LD,
where DFG denotes the pre-computed rendering transfer term obeying the GGX distribution, and LD is computed according to the formula given as formula image PCTCN2020088883-appb-000041. The diffuse and specular reflections are then fused according to the formula given as formula image PCTCN2020088883-appb-000042 to compute the rendering result of each pixel in I_uv, which is the final rendering result.
(5.2) Iteratively optimize the material latent space variables and the spherical harmonic illumination coefficient z_e by minimizing the objective given as formula image PCTCN2020088883-appb-000044, where L denotes the loss function and the remaining operator denotes the differentiable rendering process of step 5.1. Using the differentiable renderer, the differentiable quality enhancement networks and the differentiable decoders, the loss value is back-propagated to z_*, and z_* is updated iteratively until convergence, where * is a, n, s, e, denoting the diffuse material, normal material, specular material and spherical harmonic lighting respectively. Finally, z_a, z_n and z_s are fed into the diffuse, normal and specular material decoders respectively, and their outputs are fed into the corresponding material quality enhancement networks, yielding materials T_a, T_n, T_s that match the characteristics of the person in the input image.
The beneficial effects of the present invention are as follows. The present invention proposes a method that combines a neural-network-based nonlinear decoder and quality enhancer with a physically based differentiable renderer to compute the face reflection material from a single face image. The neural-network-based nonlinear decoder and quality enhancer express the complex space of face reflection materials, while the physically based differentiable renderer optimizes within that space, so that the solved face reflection material conforms to the features of the input face and the rendered result is realistic and resembles the input face. The method reaches the state of the art in face reflection material estimation, with a short processing time. The present invention can be used in applications such as face material capture, face reconstruction and realistic face rendering.
Description of the drawings
Figure 1 shows the results of applying the method of the present invention to solve, reconstruct and re-render the material of the first face picture. In the figure, A is the input image, B is the reconstruction result using the solved face reflection material, and C is the rendering result under new lighting conditions; D is the diffuse material t_a, E is the normal material t_n, and F is the specular material t_s.
Figure 2 shows the results of applying the method of the present invention to solve, reconstruct and re-render the material of the second face picture. In the figure, A is the input image, B is the reconstruction result using the solved face reflection material, and C is the rendering result under new lighting conditions; D is the diffuse material t_a, E is the normal material t_n, and F is the specular material t_s.
Figure 3 shows the results of applying the method of the present invention to solve, reconstruct and re-render the material of the third face picture. In the figure, A is the input image, B is the reconstruction result using the solved face reflection material, and C is the rendering result under new lighting conditions; D is the diffuse material t_a, E is the normal material t_n, and F is the specular material t_s.
Figure 4 shows the results of applying the method of the present invention to solve, reconstruct and re-render the material of the fourth face picture. In the figure, A is the input image, B is the reconstruction result using the solved face reflection material, and C is the rendering result under new lighting conditions; D is the diffuse material t_a, E is the normal material t_n, and F is the specular material t_s.
Figure 5 shows the results of applying the method of the present invention to solve, reconstruct and re-render the material of the fifth face picture. In the figure, A is the input image, B is the reconstruction result using the solved face reflection material, and C is the rendering result under new lighting conditions; D is the diffuse material t_a, E is the normal material t_n, and F is the specular material t_s.
Detailed description of the embodiments
The core technology of the present invention uses neural networks to nonlinearly express the complex space of face reflection materials, and uses a physically based differentiable renderer to optimize this space so as to obtain a face reflection material that matches the features of the input image. The method is divided into the following five main steps: computation of the 3D geometric information of the face; initialization of the face reflection material latent space and of the spherical harmonic lighting; decoding from the latent space to the reflection material images; quality improvement of the face reflection materials; and iterative optimization of the latent space coefficients of the face reflection material and the spherical harmonic illumination coefficients, with the face reflection material solved from the latent space coefficients of the reflection material.
The steps of the present invention are described in detail below. Figures 1-5 show the results of applying the method of the present invention to five portrait pictures: material estimation, face reconstruction and re-rendering under new lighting. In each figure, the left image of the first row is the input image, the middle image is the reconstruction result using the solved face reflection material, and the right image is the rendering result under new lighting conditions; in the second row, the left image is the diffuse material t_a, the middle image is the normal material t_n, and the right image is the specular material t_s, obtained by bilinearly sampling the solved materials with I_uv.
1. Computation of the 3D geometric information of the face in the image: compute the 3D information of the face in the input image, and obtain the texture-space face color map as well as the static information used for physically based differentiable rendering.
1.1 Computation of the 3D face information
The present invention uses the algorithm of (Chen Cao, Qiming Hou, and Kun Zhou. 2014a. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on Graphics (TOG) 33, 4 (2014), 43.) to detect the two-dimensional facial feature points in the input image, and uses (Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2387–2395.) to solve the identity coefficients, the rigid transformation matrix and the projection matrix. By interpolating the deformable shape model with the identity coefficients, the 3D model of the input face is obtained.
1.2 Computation of the texture-space face color image
Using the rigid transformation matrix and projection matrix obtained in step 1.1, the 3D model obtained in step 1.1 is projected onto the input image and a mapping between each vertex of the 3D model and the image pixels is established; the input image pixels can thereby be mapped to the vertices of the 3D model, and, using the mapping between the 3D model vertices and the texture space, the image pixels are mapped into texture space. The texture-space face color image is then obtained by triangulating the texture space and interpolating with barycentric coordinates. Since the input face may be partly occluded, the texture-space face color image contains hole regions; the Poisson algorithm is used to fill these holes, yielding the final texture-space face color image.
1.3 Computation of the static information for physically based differentiable rendering
Using the 3D model, the rigid transformation matrix and the projection matrix from step 1.1, the texture coordinates are rasterized into image space as color information to obtain the texture coordinate image I_uv. Using the rigid transformation matrix and the 3D model obtained in step 1.1, the rigidly transformed 3D model is obtained; a ray tracing algorithm is used to compute the occlusion of each vertex of this 3D model in all directions, and the result is projected onto the spherical harmonic polynomials (order 9 in this embodiment), yielding the occlusion spherical harmonic coefficients of each vertex. In addition, the proportion of unoccluded directions and the center direction of the unoccluded region are recorded to obtain the environment normal vector of each vertex. Finally, through triangulation of the texture space and barycentric interpolation of each vertex's occlusion spherical harmonic coefficients and environment normal vector, the final shadow map T_sha and environment normal map T_bn are obtained.
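As a non-limiting illustration of this pre-computation, the following Python sketch projects the directional visibility of a single vertex onto the first 9 real spherical harmonic basis functions and accumulates a bent (environment) normal; the sampling scheme, the sample count and the is_occluded ray query are assumptions and not part of the disclosed embodiment:

```python
import numpy as np

def sh_basis_9(d):
    """First 9 real SH basis values for a unit direction d = (x, y, z)."""
    x, y, z = d
    return np.array([
        0.282095,                    # Y_0^0
        0.488603 * y,                # Y_1^-1
        0.488603 * z,                # Y_1^0
        0.488603 * x,                # Y_1^1
        1.092548 * x * y,            # Y_2^-2
        1.092548 * y * z,            # Y_2^-1
        0.315392 * (3 * z * z - 1),  # Y_2^0
        1.092548 * x * z,            # Y_2^1
        0.546274 * (x * x - y * y),  # Y_2^2
    ])

def vertex_visibility_sh(is_occluded, n_samples=512, seed=None):
    """Project a vertex's directional visibility onto 9 SH coefficients.

    is_occluded(d) -> bool is assumed to wrap a ray-tracing query from the
    vertex along direction d (a placeholder for the actual tracer).
    """
    rng = np.random.default_rng(seed)
    coeffs = np.zeros(9)
    visible_dirs = []
    for _ in range(n_samples):
        d = rng.normal(size=3)               # uniform direction on the sphere
        d /= np.linalg.norm(d)
        if not is_occluded(d):
            coeffs += sh_basis_9(d)
            visible_dirs.append(d)
    coeffs *= 4.0 * np.pi / n_samples        # Monte Carlo estimate of the projection integral
    unoccluded_ratio = len(visible_dirs) / n_samples
    bent_normal = np.mean(visible_dirs, axis=0) if visible_dirs else np.zeros(3)
    if np.linalg.norm(bent_normal) > 0:
        bent_normal /= np.linalg.norm(bent_normal)
    return coeffs, unoccluded_ratio, bent_normal
```

The per-vertex coefficients and bent normals would then be rasterized into texture space to form T_sha and T_bn, as described above.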
2. Initialization of the face reflection material latent space and of the spherical harmonic lighting: based on the texture-space face color map obtained in step 1, the initial values of the face reflection material latent space coefficients and of the spherical harmonic lighting are obtained through encoders based on convolutional neural networks.
2.1 Training the U-shaped networks based on convolutional neural networks
Training data. The face model database contains 84 3D digital characters; each character includes a 3D model, a diffuse material, a normal material and a specular material. In this embodiment the data come from 3D Scan Store. Face photos from the CFD (Debbie S Ma, Joshua Correll, and Bernd Wittenbrink. 2015. The Chicago face database: A free stimulus set of faces and norming data. Behavior Research Methods 47, 4 (2015), 1122–1135.) are used to augment the skin tones of the diffuse materials, yielding about 4000 diffuse material images. In addition, the ambient light database contains 2957 HDR ambient light images I_e. Using the above data, face images are rendered with image-based lighting and screen-space subsurface scattering techniques; during rendering, the 3D model and the HDR ambient light image I_e are rotated randomly. In total about one hundred thousand target face images I_o are obtained. The face images are mapped into texture space to obtain the corresponding texture-space face color images I. Each texture-space face color image I together with its diffuse, normal and specular materials composes the training data of the U-shaped networks, where each item has a resolution of 1024×1024.
Network structure. U-shaped network structure: the diffuse material, the normal material and the specular material each have their own U-shaped network. Each U-shaped network consists of an encoder E, a decoder D and skip connections. For the diffuse U-shaped network U_a, the input is the scaled texture-space face color image, obtained by scaling I to a resolution of 512×512 with an area-interpolation scaling algorithm. The encoder part E_a of U_a contains 9 down-sampling modules; the first 8 each contain a convolutional layer with a kernel size of 3×3 and a stride of 2×2, a batch normalization layer (S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.) and an LReLU activation layer (Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30. 3.); the last module differs from the first eight in that the kernel size is 1×1, and the final encoding is the 1×1×1024 diffuse material latent space. The decoder part D_a of U_a contains 9 up-sampling modules; each contains a resize convolution layer with a kernel size of 3×3 and an upscaling factor of two (Jon Gauthier. 2014. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester 2014, 5 (2014), 2.), a batch normalization layer and an LReLU activation layer, and a final convolutional layer with a kernel size of 1×1, a stride of 1×1 and a Sigmoid activation produces the output with a final resolution of 512×512×3. In addition, the three highest-resolution modules of E_a and D_a are connected by skip connections (Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE conference on computer vision and pattern recognition (2017).). The above network structure can be written as (C32K3S2,BN,LReLU,Skip1)->(C64K3S2,BN,LReLU,Skip2)->(C128K3S2,BN,LReLU,Skip3)->(C258K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C1024K1S2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(R256K3R2,BN,LReLU)->(Skip3,RC128K3R2,BN,LReLU)->(Skip2,RC64K3R2,BN,LReLU)->(Skip1,RC32K3R2,BN,LReLU)->(C3K1S1,Sigmoid), where CxKySz denotes a convolutional layer with stride z, kernel size y and output depth x, BN denotes batch normalization, RCxKyRz denotes a resize convolution layer with scaling factor z, kernel size y and output depth x, and Skip denotes a skip connection, the following number identifying the connection: the same number denotes the same pair of skip-connected modules. For the normal-material U-shaped network U_n, the input is the texture-space face color image scaled by area interpolation to a resolution of 256×256; the main differences from U_a are that the encoder E_n and the decoder D_n each have one fewer down-sampling and up-sampling layer, the latent space size is 1×1×512, and the output size of D_n is 256×256×3. The network structure is written as (C32K3S2,BN,LReLU,Skip1)->(C64K3S2,BN,LReLU,Skip2)->(C128K3S2,BN,LReLU,Skip3)->(C258K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K3S2,BN,LReLU)->(C512K1S2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(RC512K3R2,BN,LReLU)->(R256K3R2,BN,LReLU)->(Skip3,RC128K3R2,BN,LReLU)->(Skip2,RC64K3R2,BN,LReLU)->(Skip1,RC32K3R2,BN,LReLU)->(C3K1S1,Sigmoid). For the specular U-shaped network U_s, its encoder structure E_s is the same as E_n; the only difference between D_s and D_n is that the output depth of the last convolutional layer is 1, so the output size of D_s is 256×256×1.
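As a non-limiting sketch, the down-sampling and up-sampling modules and the diffuse network U_a described above could be organized as follows in PyTorch; the LReLU slope, the nearest-neighbor resize convolution, and reading the "C258" entry of the architecture string as 256 channels are assumptions:

```python
import torch
import torch.nn as nn

class Down(nn.Module):
    """Down-sampling module: stride-2 conv -> BatchNorm -> LReLU."""
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel, stride=2, padding=kernel // 2),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class Up(nn.Module):
    """Up-sampling module: x2 resize convolution -> BatchNorm -> LReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class UNetDiffuse(nn.Module):
    """Rough sketch of U_a: 9 down modules, 9 up modules, 3 skip connections."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 512, 512, 512, 512]
        self.downs = nn.ModuleList(
            [Down(chans[i], chans[i + 1]) for i in range(8)]
            + [Down(512, 1024, kernel=1)])             # 1x1x1024 latent code
        up_in = [1024, 512, 512, 512, 512, 512, 256 + 128, 128 + 64, 64 + 32]
        up_out = [512, 512, 512, 512, 512, 256, 128, 64, 32]
        self.ups = nn.ModuleList([Up(i, o) for i, o in zip(up_in, up_out)])
        self.head = nn.Sequential(nn.Conv2d(32, 3, 1), nn.Sigmoid())
    def forward(self, x):                               # x: B x 3 x 512 x 512
        skips = []
        for i, down in enumerate(self.downs):
            x = down(x)
            if i < 3:                                   # 3 highest-resolution feature maps
                skips.append(x)
        for i, up in enumerate(self.ups):
            if i >= 6:                                  # skip connections on the last 3 modules
                x = torch.cat([x, skips[8 - i]], dim=1)
            x = up(x)
        return self.head(x)                             # B x 3 x 512 x 512 diffuse material
```

U_n and U_s would follow the same pattern with one fewer down/up module and a 512-dimensional latent code, and U_s with a single-channel output head.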
Loss function. U_* denotes a U-shaped network, where the subscript * can be a, n, s, denoting the diffuse material, normal material and specular material respectively. The loss function is defined by the formulas given as formula images PCTCN2020088883-appb-000053 and PCTCN2020088883-appb-000054, in which the scaled texture-space face color image is the network input, and the material image output by the U-shaped network is compared with the corresponding scaled ground-truth material image. For the diffuse network, the scaled input and material images have a resolution of 512×512, while for the normal and specular networks the resolution is 256×256. During training, the learning rate is 1e-4, and the optimizer used is the Adam optimizer (D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.).
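A minimal supervised training step consistent with this description might look like the following; since the exact loss expressions are only given as formula images, an L1 reconstruction loss is assumed here:

```python
import torch

def train_step(u_net, optimizer, scaled_input, scaled_gt):
    """One supervised step for a material U-network (L1 loss assumed)."""
    optimizer.zero_grad()
    pred = u_net(scaled_input)                 # e.g. 512x512x3 diffuse prediction
    loss = torch.nn.functional.l1_loss(pred, scaled_gt)
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(u_net.parameters(), lr=1e-4)   # Adam and 1e-4 as stated in the text
```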
2.2 Training the spherical harmonic illumination coefficient regression network based on a convolutional neural network
Training data. From the target face images I_o obtained in 2.1 and the HDR ambient light images I_e used to render them, the spherical harmonic illumination coefficients z_e of I_e are computed by projecting I_e onto the spherical harmonic basis (formula image PCTCN2020088883-appb-000062), where i, j denote the Cartesian coordinates along the image width W and height H, Y_k denotes the spherical harmonic polynomial, k denotes the order of the spherical harmonic with 0 ≤ k < 9, and φ denotes the conversion from the image coordinates i, j to the spherical coordinates θ, φ, whose expressions are given as formula images PCTCN2020088883-appb-000064 and PCTCN2020088883-appb-000065. Finally, the training data pairs are composed of {I_o, z_e}.
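For illustration, the following sketch projects an equirectangular HDR environment map onto 9 spherical harmonic coefficients per color channel; the pixel-to-sphere coordinate convention and the solid-angle weighting are assumptions, since the exact formulas are only given as formula images:

```python
import numpy as np

def env_map_sh9(env):
    """Project an H x W x 3 HDR lat-long environment map onto 9 SH coefficients
    per color channel (standard solid-angle-weighted projection assumed)."""
    H, W, _ = env.shape
    j, i = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    theta = (j + 0.5) / H * np.pi            # polar angle, assumed convention
    phi = (i + 0.5) / W * 2.0 * np.pi        # azimuth, assumed convention
    x = np.sin(theta) * np.cos(phi)
    y = np.sin(theta) * np.sin(phi)
    z = np.cos(theta)
    Y = np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z * z - 1),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ])                                        # 9 x H x W basis images
    dw = np.sin(theta) * (np.pi / H) * (2.0 * np.pi / W)   # per-pixel solid angle
    # z_e[k, c] = sum over pixels of env * Y_k * solid angle
    return np.einsum('khw,hwc,hw->kc', Y, env, dw)
```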
Network training. A network structure similar to VGG (Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).) is used to build the spherical harmonic illumination coefficient regression network E_e. Specifically, I_o is scaled to a resolution of 256×256 and passed through the same 10 convolutional layers as VGG, and finally an average pooling layer and a fully connected layer output the spherical harmonic illumination coefficients z_e. The regression network is trained with the L2 norm between the network output and the ground-truth spherical harmonic illumination coefficients as the loss function. The training learning rate is 1e-4, and the optimizer used is Adam.
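A rough sketch of such a VGG-style regression network is given below; the channel widths, the pooling placement and the 27-dimensional output (9 coefficients per RGB channel) are assumptions:

```python
import torch
import torch.nn as nn

class SHRegressor(nn.Module):
    """Sketch of E_e: a small VGG-style conv stack, average pooling,
    and a fully connected layer regressing the SH lighting coefficients."""
    def __init__(self, n_out=27):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in [64, 64, 128, 128, 256, 256, 512, 512, 512, 512]:  # 10 conv layers
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
            if out_ch != in_ch:                    # halve resolution when the width grows
                layers.append(nn.MaxPool2d(2))
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, n_out)
    def forward(self, x):                          # x: B x 3 x 256 x 256
        h = self.pool(self.features(x)).flatten(1)
        return self.fc(h)

# loss = torch.nn.functional.mse_loss(pred_ze, gt_ze)   # L2 loss, as stated in the text
```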
2.3 Initializing the material latent space variables
The scaled texture-space face color images are fed into the encoders E_* of the three U-shaped networks trained in 2.1, and the resulting codes are taken as the initial values of the material latent space variables z_*. In addition, the sets of feature maps output by the first three down-sampling modules of each encoder are recorded; * is a, n, s, denoting the diffuse material, the normal material and the specular material respectively. This process can be expressed by the formula given as formula image PCTCN2020088883-appb-000069.
2.4 Initializing the spherical harmonic lighting
The face photograph scaled to 256×256 is fed into the spherical harmonic illumination regression network E_e trained in step 2.2 to obtain the spherical harmonic illumination coefficients, which are taken as the initial value of the spherical harmonic illumination coefficient z_e. This process can be expressed by the formula given as formula image PCTCN2020088883-appb-000072.
3. Decoding from the latent space to the reflection material space: differentiable decoders implemented with convolutional neural networks decode the coefficients of the face reflection material latent space into the corresponding reflection materials.
3.1 Decoding
The latent code z_* and the recorded encoder feature maps are fed into the decoder D_* of the U-shaped network U_* trained in step 2.1; the decoding operation yields the corresponding material image, as expressed by the formula given as formula image PCTCN2020088883-appb-000074.
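The initialization and decoding passes of sections 2.3-3.1 could be wired up as in the following non-limiting sketch; the module interfaces (encoders returning both the latent code and the skip feature maps, decoders consuming both) are assumptions about how the trained networks are exposed:

```python
import torch

@torch.no_grad()
def init_latents(encoders, tex_face_512, tex_face_256, sh_regressor, face_photo_256):
    """Encode the scaled texture-space face images into initial latent codes
    (section 2.3) and regress the initial SH lighting (section 2.4)."""
    latents, skips = {}, {}
    inputs = {'a': tex_face_512, 'n': tex_face_256, 's': tex_face_256}
    for key in ('a', 'n', 's'):
        z, feats = encoders[key](inputs[key])    # z: 1x1 latent code, feats: 3 skip maps
        latents[key], skips[key] = z, feats
    latents['e'] = sh_regressor(face_photo_256)  # initial SH illumination coefficients
    return latents, skips

def decode_materials(decoders, latents, skips):
    """D_* consumes the latent code plus the recorded encoder feature maps (section 3.1)."""
    return {key: decoders[key](latents[key], skips[key]) for key in ('a', 'n', 's')}
```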
4. Quality improvement of the face reflection materials: based on the reflection materials obtained in step 3, differentiable quality enhancement networks implemented with convolutional neural networks further improve the quality of the reflection materials.
4.1 Training the reflection material quality enhancement networks based on convolutional neural networks
Training data. Using the U-shaped networks trained in 2.1, the texture-space face color images I of the training data in step 2.1 are taken as network input to generate decoded material images, which together with the ground-truth materials T_* of the training data in step 2.1 compose the training data pairs, where * denotes a, n, s.
Training method. For the diffuse material quality enhancement network, SRGAN (Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4681–4690.) is followed, and the super-resolution network R_a is trained in a generative adversarial (GAN) manner: the 512×512 diffuse material image output by the decoder is quality-enhanced into the 1024×1024 T_a. For the normal material and the specular material, the super-resolution networks R_n and R_s are likewise trained in a generative adversarial manner; they differ from R_a in two respects. First, they enhance input material images of 256×256 into high-quality material images of 1024×1024. Second, in addition to the decoder output material image, their input also includes the scaled texture-space face color image.
4.2 Quality enhancement of the material images: based on the material images generated in step 3, quality enhancement is performed with the networks trained in step 4.1 to obtain the high-quality material images T_*, where * denotes a, n, s. Writing T̃_* for the material image produced by the decoder and Ĩ for the texture-space face color image scaled to 256×256, the whole process can be expressed as T_a = R_a(T̃_a), T_n = R_n(T̃_n, Ĩ) and T_s = R_s(T̃_s, Ĩ) (the corresponding expressions are given as formula images PCTCN2020088883-appb-000081 and PCTCN2020088883-appb-000082).
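A rough sketch of this enhancement pass is given below; how the decoded material and the 256×256 texture-space face image are packed into the 4-channel first-layer input of R_n and R_s is not reproduced in the text, so a plain channel concatenation is used here as an assumption:

```python
import torch

def enhance_materials(R_a, R_n, R_s, dec_a, dec_n, dec_s, tex_face_256):
    """Quality-enhancement pass (section 4.2), as a non-limiting sketch.
    R_* are assumed SRGAN-style generators producing 1024x1024 outputs."""
    T_a = R_a(dec_a)                                    # 512x512 diffuse -> 1024x1024
    T_n = R_n(torch.cat([dec_n, tex_face_256], dim=1))  # 256x256 inputs -> 1024x1024
    T_s = R_s(torch.cat([dec_s, tex_face_256], dim=1))
    return T_a, T_n, T_s
```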
5. Iterative optimization of the latent space with the physically based differentiable renderer: by minimizing the difference between the rendering result of the physically based differentiable renderer and the input face image, the latent space of the face reflection material is iteratively optimized, and the output face reflection materials are obtained through the decoding and quality improvement operations.
5.1 Physically based forward rendering using the reflection materials and spherical harmonic lighting
Computing the diffuse reflection of the face. First, according to I_uv obtained in step 1.3, the outputs T_* of the quality enhancement networks (* denotes a, n, s) and the shadow map T_sha and environment normal map T_bn obtained in step 1.3 are bilinearly sampled, giving the corresponding image-space material images t_*, where * can be a, n, s, sha, bn, denoting the diffuse material, normal material, specular material, shadow map and environment normal map respectively. All pixels in I_uv are traversed, and the diffuse lighting of each pixel is computed with the rendering formula given as formula image PCTCN2020088883-appb-000084, in which L(ω) denotes the incident light from direction ω, V denotes visibility and N denotes the normal; the whole formula is a spherical integral over the hemisphere around the normal. This formula can be further simplified with the spherical harmonic approximation (Peter-Pike Sloan, Jan Kautz, and John Snyder. 2002. Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments. In ACM Transactions on Graphics (TOG), Vol. 21. ACM, 527–536.). L and V are expressed with spherical harmonic functions (formula image PCTCN2020088883-appb-000085), where v, recorded in t_sha, denotes the spherical harmonic coefficients of the visibility; max(0, N·ω) can likewise be expressed in spherical harmonics (formula image PCTCN2020088883-appb-000086), where c denotes the spherical harmonic coefficients of the clamped cosine, obtained by rotating the spherical harmonic coefficients of max(0, cosθ) to the current pixel normal direction n, which is recorded in t_n. Using the spherical harmonic product projection (Peter-Pike Sloan. 2008. Stupid spherical harmonics (sh) tricks. In Game developers conference, Vol. 9. Citeseer, 42.), z_e and v are re-projected to obtain w; finally, taking the spherical harmonic dot product of the w and c terms reduces the diffuse term to the expression given as formula image PCTCN2020088883-appb-000087.
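Assuming that the reduced diffuse term is the sampled albedo multiplied by the dot product of the spherical harmonic coefficient vectors w and c (the exact expression is only given as a formula image), a per-pixel sketch of the texture sampling and diffuse shading could look like this, with color handling simplified:

```python
import torch

def sample_texture(texture, uv):
    """Bilinearly sample a texture map at the per-pixel UV coordinates I_uv.
    texture: B x C x Ht x Wt, uv: B x H x W x 2 with values in [0, 1]."""
    grid = uv * 2.0 - 1.0                              # grid_sample expects [-1, 1]
    return torch.nn.functional.grid_sample(texture, grid, mode='bilinear', align_corners=False)

def diffuse_shading(t_a, w, c):
    """Per-pixel diffuse term, assumed to be albedo * dot(w, c).
    t_a: B x 3 x H x W sampled albedo; w, c: B x 9 x H x W SH coefficient maps
    (w from re-projecting the lighting z_e with visibility, c the rotated clamped cosine)."""
    sh_dot = (w * c).sum(dim=1, keepdim=True)          # B x 1 x H x W
    return t_a * sh_dot
```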
Computing the specular reflection of the face. All pixels in I_uv are likewise traversed, and the specular lighting of each pixel is computed with the following rendering formula:
L_s = ∮ f_r(ω, ω_o) L(ω) V(ω) max(0, N·ω) dω,
where f_r denotes the light transport term obeying the GGX distribution (Bruce Walter, Stephen R. Marschner, Hongsong Li, and Kenneth E. Torrance. 2007. Microfacet Models for Refraction through Rough Surfaces.) and ω_o denotes the viewing direction. Following (Sébastien Lagarde and Charles de Rousiers. 2014. Moving frostbite to physically based rendering. In SIGGRAPH 2014 Conference, Vancouver.), the above integral is split into
L_s = DFG · LD,
where DFG denotes the pre-computed GGX rendering transfer term and LD is computed according to the formula given as formula image PCTCN2020088883-appb-000088.
The diffuse and specular reflections are fused according to the formula given as formula image PCTCN2020088883-appb-000089 to compute the rendering result of each pixel in I_uv; the resulting image is the final rendering result.
5.2 Iteratively optimizing the material latent space variables and the spherical harmonic illumination coefficient z_e: the objective given as formula image PCTCN2020088883-appb-000091 is minimized, where L denotes the loss function and the remaining operator denotes the differentiable rendering process of step 5.1. Using the differentiable renderer, the differentiable quality enhancement networks and the differentiable decoders, the loss value is back-propagated to z_*, and z_* is updated iteratively until convergence; * can be a, n, s, e, denoting the diffuse material, normal material, specular material and spherical harmonic lighting respectively. Finally, z_a, z_n and z_s are fed into the diffuse, normal and specular material decoders respectively, and their outputs are then fed into the corresponding material quality enhancement networks, yielding materials T_a, T_n, T_s that match the characteristics of the person in the input image.
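The optimization loop could be sketched as follows; the choice of optimizer and learning rate for this fitting stage, the pixel-wise L1 loss, and the omission of the extra face-image input to R_n and R_s are assumptions (the text only specifies that the loss is back-propagated to z_* through the differentiable renderer, enhancers and decoders, and the implementation example below reports roughly 150 iterations to convergence):

```python
import torch

def fit_latents(latents, decoders, skips, enhancers, render, target_image,
                n_iters=150, lr=1e-2):
    """Sketch of the iterative optimization in step 5.2.
    `render` is assumed to wrap the differentiable forward rendering of step 5.1
    (texture sampling, diffuse and specular shading)."""
    for z in latents.values():
        z.requires_grad_(True)
    opt = torch.optim.Adam(list(latents.values()), lr=lr)   # optimizer choice is an assumption
    for _ in range(n_iters):
        opt.zero_grad()
        mats = {k: decoders[k](latents[k], skips[k]) for k in ('a', 'n', 's')}
        # The extra 256x256 face-image input to R_n / R_s is omitted here for brevity.
        T_a, T_n, T_s = (enhancers['a'](mats['a']),
                         enhancers['n'](mats['n']),
                         enhancers['s'](mats['s']))
        rendered = render(T_a, T_n, T_s, latents['e'])       # differentiable renderer
        loss = torch.nn.functional.l1_loss(rendered, target_image)
        loss.backward()                                       # gradients flow back to z_*
        opt.step()
    return latents
```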
Implementation example
The inventors implemented an embodiment of the present invention on a machine equipped with an Intel Xeon E5-4650 CPU and an NVidia GeForce RTX 2080Ti graphics processor (11 GB). Using all the parameter values listed in the detailed description, the inventors obtained all the experimental results shown in Figures 1-5. The present invention can effectively output a high-quality face reflection material that matches the features of the input portrait image. For an image with a face region of 600×800, the computation of the 3D geometric information of the face takes about 30 seconds, the initialization of the latent space takes about 10 milliseconds, and each round of forward computation in the iterative optimization (decoding, quality enhancement, rendering) takes 250 milliseconds; about 150 iterations are needed for convergence, so the whole iterative process takes about 40 seconds. In addition, training the U-shaped networks takes 12 hours, training the spherical harmonic illumination coefficient regression network takes 4 hours, and training the material quality enhancement networks takes about 50 hours; these modules only need to be trained once and can then be used to process any input portrait image.

Claims (7)

  1. A method for solving the face reflection material from a single image based on a differentiable renderer, characterized in that it comprises the following steps:
    (1) computing the 3D information of the face in the input image, and obtaining, according to the 3D information, the texture-space face color map and the static information used for physically based differentiable rendering; the 3D information comprises the 3D model of the face, the rigid transformation matrix and the projection matrix; the static information comprises the shadow map T_sha and the environment normal map T_bn;
    (2) based on the texture-space face color map obtained in step 1, obtaining, through encoders based on convolutional neural networks, the initial values of the face reflection material latent space coefficients and the initial value of the spherical harmonic illumination coefficients, where * is a, n, s, denoting the diffuse material, the normal material and the specular material respectively;
    (3) decoding the coefficients of the face reflection material latent space into the corresponding reflection material images with differentiable decoders implemented with convolutional neural networks;
    (4) improving the resolution and detail quality of the reflection material images obtained in step (3) to obtain the images T_*;
    (5) iteratively optimizing the latent space coefficients of the face reflection material and the spherical harmonic illumination coefficients by minimizing the difference between the input face image and the rendering result obtained by rendering the quality-improved reflection material images T_* with the physically based differentiable renderer, and solving the face reflection material from the optimized latent space coefficients through the decoding and quality improvement operations of steps (3)-(4).
  2. The method for solving the face reflection material from a single image based on a differentiable renderer according to claim 1, characterized in that step (1) comprises the following sub-steps:
    (1.1) computation of the 3D face information: detecting the two-dimensional feature points of the face in the input image, optimizing the identity coefficients, the rigid transformation matrix and the projection matrix with a deformable model, and obtaining the 3D model of the person through linear interpolation of the deformable model with the identity coefficients;
    (1.2) computation of the texture-space face color image: using the rigid transformation matrix and projection matrix obtained in step (1.1), projecting the 3D model obtained in step (1.1) onto the input image and establishing a mapping between each vertex of the 3D model and the image pixels, mapping the input image pixels to the vertices of the 3D model, then mapping the image pixels into texture space using the mapping between the 3D model vertices and the texture space, and obtaining the texture-space face color image through triangulation of the texture space and barycentric interpolation;
    (1.3) computation of the static information for physically based differentiable rendering: using the 3D model, rigid transformation matrix and projection matrix from step (1.1), rasterizing the texture coordinates as color information into image space to obtain the texture coordinate image I_uv; using the rigid transformation matrix and 3D model obtained in step (1.1) to obtain the rigidly transformed 3D model, computing the occlusion of each vertex of this 3D model in all directions with a ray tracing algorithm, and projecting it onto the spherical harmonic polynomials to obtain the occlusion spherical harmonic coefficients of each vertex; in addition, recording the proportion of unoccluded directions and the center direction of the unoccluded region to obtain the environment normal vector of each vertex; finally, obtaining the final shadow map T_sha and environment normal map T_bn through triangulation of the texture space and barycentric interpolation of each vertex's occlusion spherical harmonic coefficients and environment normal vector.
  3. The method for solving the face reflection material from a single image based on a differentiable renderer according to claim 2, characterized in that, in step (1.2), the Poisson algorithm is used to fill the hole regions present in the texture-space face color image.
  4. The method for solving the face reflection material from a single image based on a differentiable renderer according to claim 2, characterized in that the encoders and decoders based on convolutional neural networks are trained jointly as U-shaped networks, and the training specifically comprises the following sub-steps:
    (a) training data: obtaining N target face images I_o and the corresponding diffuse materials, normal materials and specular materials, and mapping the face images into texture space to obtain the corresponding texture-space face color images I; these compose the training data of the U-shaped networks, where each item has a resolution of 1024×1024;
    (b) the diffuse material, the normal material and the specular material each have a U-shaped network; for the diffuse U-shaped network U_a, the input is the scaled texture-space face color image; the encoder part E_a of U_a contains 9 down-sampling modules, the first 8 each containing a convolutional layer with a kernel size of 3×3 and a stride of 2×2, a batch normalization layer and an LReLU activation layer, and the last containing a convolutional layer with a kernel size of 1×1 and a stride of 2×2, a batch normalization layer and an LReLU activation layer, the final encoding being the 1×1×1024 diffuse material latent space; the decoder part D_a of U_a contains 9 up-sampling modules, each containing a resize convolution layer with a kernel size of 3×3 and an upscaling factor of two, a batch normalization layer and an LReLU activation layer, and a final convolutional layer with a kernel size of 1×1, a stride of 1×1 and a Sigmoid activation produces the output with a final resolution of 512×512×3; for the normal-material U-shaped network U_n, the input is the texture-space face color image scaled by area interpolation to a resolution of 256×256, its encoder E_n comprises 8 down-sampling modules, the first 7 each containing a convolutional layer with a kernel size of 3×3 and a stride of 2×2, a batch normalization layer and an LReLU activation layer, and the last containing a convolutional layer with a kernel size of 1×1 and a stride of 2×2, a batch normalization layer and an LReLU activation layer, the final encoding being the 1×1×512 normal material latent space; the decoder D_n comprises 8 up-sampling modules, each containing a resize convolution layer with a kernel size of 3×3 and an upscaling factor of two, a batch normalization layer and an LReLU activation layer, and a final convolutional layer with a kernel size of 1×1, a stride of 1×1 and a Sigmoid activation produces the output with a final resolution of 256×256×3; for the specular U-shaped network U_s, its encoder structure E_s is the same as E_n, the 8 up-sampling modules of D_s each contain a resize convolution layer with a kernel size of 3×3 and an upscaling factor of two, a batch normalization layer and an LReLU activation layer, and a final convolutional layer with a kernel size of 1×1, a stride of 1×1 and a Sigmoid activation produces the output with a final resolution of 256×256×1; the three highest-resolution modules of E_* and D_* of each U-shaped network are connected by skip connections, * being a, n, s;
    (c) the training loss function is defined by the formulas given as formula images PCTCN2020088883-appb-100013 and PCTCN2020088883-appb-100014, where U_* denotes a U-shaped network, the subscript * can be a, n, s, denoting the diffuse material, normal material and specular material respectively, the scaled texture-space face color image is the network input, and the material image output by the U-shaped network is compared with the corresponding scaled ground-truth material image; for the diffuse network, the scaled input and material images have a resolution of 512×512, while for the normal and specular networks the resolution is 256×256.
  5. 根据权利要求2所述的基于可微渲染器的从单幅图像求解人脸反射材质的方法,其特征在于,所述步骤(2)中,输入图像的球谐光照的初始值
    Figure PCTCN2020088883-appb-100022
    通过构建球谐光照系数回归网络获得,所述球谐光照系数回归网络包括基于卷积神经网络的编码器以及全连接构成的回归模块,训练过程包括如下步骤:
    The method for solving the reflection material of a human face from a single image based on a differentiable renderer according to claim 2, characterized in that, in the step (2), the initial value of the spherical harmonic illumination of the input image
    Figure PCTCN2020088883-appb-100022
    It is obtained by constructing a spherical harmonic illumination coefficient regression network. The spherical harmonic illumination coefficient regression network includes an encoder based on a convolutional neural network and a regression module composed of a full connection. The training process includes the following steps:
    (A) Training data pairs {I o , z e } are assembled, where the spherical harmonic coefficients z e are computed from the HDR ambient light image I e by the following formula:
    [spherical harmonic projection formula, shown as an image in the original]
    where i, j are the Cartesian coordinates along the image width W and height H, Y k is the spherical harmonic polynomial, k is the order of the spherical harmonics with 0 ≤ k < 9, and φ denotes the conversion from the image coordinates i, j to the spherical coordinates θ, φ, whose expressions are as follows:
    [coordinate conversion formulas, shown as images in the original]
    (B) I o is scaled to a resolution of 256×256 and used as the network input; the network is trained end-to-end in a supervised manner with the L2 norm as the loss function.
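    As a hedged illustration of step (A), the sketch below projects an HDR environment map onto the first nine real spherical harmonic basis functions; the equirectangular latitude–longitude mapping and the solid-angle weighting are assumptions, since the original conversion equations are only available as images:

```python
import numpy as np

def sh_basis(theta, phi):
    """First 9 real spherical harmonics Y_k (k = 0..8) on the unit sphere."""
    x = np.sin(theta) * np.cos(phi)
    y = np.sin(theta) * np.sin(phi)
    z = np.cos(theta)
    return np.stack([
        0.282095 * np.ones_like(z),          # Y_0
        0.488603 * y,                        # Y_1
        0.488603 * z,                        # Y_2
        0.488603 * x,                        # Y_3
        1.092548 * x * y,                    # Y_4
        1.092548 * y * z,                    # Y_5
        0.315392 * (3.0 * z * z - 1.0),      # Y_6
        1.092548 * x * z,                    # Y_7
        0.546274 * (x * x - y * y),          # Y_8
    ], axis=-1)

def sh_coeffs_from_envmap(env):
    """env: H x W x 3 HDR image; returns 9 x 3 coefficients z_e."""
    H, W, _ = env.shape
    j, i = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    theta = (j + 0.5) / H * np.pi                 # polar angle (assumed mapping)
    phi = (i + 0.5) / W * 2.0 * np.pi             # azimuth (assumed mapping)
    Y = sh_basis(theta, phi)                      # H x W x 9
    d_omega = np.sin(theta) * (np.pi / H) * (2.0 * np.pi / W)   # pixel solid angle
    return np.einsum("hwk,hwc,hw->kc", Y, env, d_omega)
```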
  6. The method for solving the reflection material of a human face from a single image based on a differentiable renderer according to claim 1, characterized in that, in step (4), the resolution and detail quality of the reflection material images are improved by constructing reflection material quality enhancement networks R *, which specifically includes the following sub-steps:
    (4.1) Train the reflection material quality enhancement networks based on convolutional neural networks, as follows:
    (4.1.1) Training data: the face color image I used for training is fed into the U-shaped networks trained in step (2); the material images they output are paired with the original material images of the face color image I to form the training data pairs, where * denotes a, n, s;
    (4.1.2) Training method: the SRGAN network is used as the reflection material quality enhancement network R *, trained in a generative adversarial (GAN) manner. For the diffuse reflection material quality enhancement network R a , the input is the 512×512 diffuse material image and the output image resolution is 1024×1024. For the normal material quality enhancement network R n and the specular material quality enhancement network R s , the first layer of the network accepts an image depth of 4: the input consists of the corresponding material image together with the face color image in the scaled texture space, at an input resolution of 256×256, and the output is a high-quality material image at a resolution of 1024×1024;
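    For illustration, a hedged sketch of how the 4-channel input of R s could be assembled and passed to a super-resolution generator; the 4× upscaling from 256 to 1024 and the channel depth follow the text, while the placeholder generator architecture and tensor shapes are assumptions standing in for the actual SRGAN:

```python
import torch
import torch.nn as nn

class SRGenerator(nn.Module):
    """Placeholder for an SRGAN-style generator (not the actual architecture)."""
    def __init__(self, in_ch, out_ch, scale=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.PReLU(),
            nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
            nn.Conv2d(64, out_ch, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

R_s = SRGenerator(in_ch=4, out_ch=1)          # specular (1 ch) + texture-space color (3 ch)
coarse_spec = torch.rand(1, 1, 256, 256)      # specular material output by U_s
face_uv_256 = torch.rand(1, 3, 256, 256)      # face color image in scaled texture space
enhanced = R_s(torch.cat([coarse_spec, face_uv_256], dim=1))   # 1 x 1 x 1024 x 1024
```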
    (4.2) Quality enhancement of the material images: based on the material images generated in step 3, quality enhancement is performed with the networks trained in step (4.1) to obtain the high-quality material images T *, where * denotes a, n, s. The whole process can be expressed by the following formulas:
    [quality enhancement formulas for T a , T n and T s , shown as images in the original]
    where the additional input symbol denotes the face color image in texture space scaled to 256×256.
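    Given the inputs listed in step (4.1.2), a hedged reconstruction of the three formulas (the notation, in particular the symbols \tilde{T}_* for the decoded materials and I^{256}_{uv} for the scaled texture-space color image, is an assumption, since the originals are only available as images) would be:

```latex
T_a = R_a\!\left(\tilde{T}_a\right), \qquad
T_n = R_n\!\left(\tilde{T}_n,\ I^{256}_{uv}\right), \qquad
T_s = R_s\!\left(\tilde{T}_s,\ I^{256}_{uv}\right)
```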
  7. The method for solving the reflection material of a human face from a single image based on a differentiable renderer according to claim 2, characterized in that step (5) comprises the following sub-steps:
    (5.1) Physically-based forward rendering using the reflection materials and spherical harmonic lighting:
    (5.1.1) Compute the diffuse reflection of the face: using the I uv obtained in step 1.3, bilinearly sample the T a , T n and T s output by the quality enhancement networks, as well as the shadow map T sha and the environment normal map T bn , to obtain the corresponding image-space material images t *, where * is a, n, s, sha, bn, denoting the diffuse material, normal material, specular material, shadow map and environment normal map, respectively. Traverse all pixels in I uv and compute the diffuse lighting of each pixel with the following physically-based rendering formula:
    [diffuse spherical harmonic shading formula, shown as an image in the original]
    where k is the order of the spherical harmonic polynomial; z e is reprojected with v using the spherical harmonic product-projection property to obtain w, with v the per-pixel visibility in each direction recorded in t sha ; c is the spherical harmonic coefficients of max(0, cos θ) rotated to the normal direction n of the current pixel, with n recorded in t n ;
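    As a hedged illustration of the per-pixel diffuse term, the sketch below evaluates a dot product between visibility-modulated lighting coefficients w and cosine-lobe coefficients c, modulated by the sampled albedo; the exact reprojection and rotation operators of the original are not reproduced, only the general shape of an SH dot-product shading step:

```python
import numpy as np

def diffuse_sh_shading(t_a, w, c):
    """
    t_a : H x W x 3       sampled diffuse albedo
    w   : H x W x 9 x 3   lighting SH coefficients per colour channel, visibility-modulated
    c   : H x W x 9       SH coefficients of max(0, cos) rotated to the pixel normal
    Returns per-pixel diffuse radiance, H x W x 3.
    """
    irradiance = np.einsum("hwkc,hwk->hwc", w, c)   # SH dot product over the 9 bands
    return t_a * irradiance
```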
    (5.1.2) Compute the specular reflection of the face and the rendering result: the specular highlight reflection of the face is computed with the following formula:
    L s = DFG · LD,
    where DFG denotes the pre-computed rendering transfer equation following the GGX distribution, and LD is computed as follows:
    [LD formula, shown as an image in the original]
    The diffuse and specular reflections are then fused with the following formula to compute the rendering result for each pixel in I uv :
    [fusion formula, shown as an image in the original]
    which is the final rendering result;
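    The DFG·LD factorization is the standard split-sum form of GGX specular shading; a hedged sketch follows, where the look-up table, roughness handling and the additive fusion are assumptions, not details taken from the original:

```python
import numpy as np

def specular_split_sum(t_s, n_dot_v, roughness, dfg_lut, prefiltered_ld):
    """
    t_s            : H x W       sampled specular intensity
    n_dot_v        : H x W       clamped dot product between normal and view direction
    roughness      : H x W       assumed per-pixel roughness
    dfg_lut        : callable(n_dot_v, roughness) -> (scale, bias), pre-integrated GGX BRDF
    prefiltered_ld : H x W x 3   environment lighting pre-filtered for the reflection lobe
    Returns per-pixel specular radiance L_s, H x W x 3.
    """
    scale, bias = dfg_lut(n_dot_v, roughness)
    return (t_s[..., None] * scale[..., None] + bias[..., None]) * prefiltered_ld

def fuse(L_d, L_s):
    # Simple additive fusion of diffuse and specular radiance (assumed).
    return L_d + L_s
```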
    (5.2) Iteratively optimize the material latent-space variables and the spherical harmonic illumination coefficients z e by minimizing the following formula:
    [optimization objective, shown as an image in the original]
    where L denotes the loss function and the remaining symbol denotes the differentiable rendering process of step 5.1. Using the differentiable renderer, the differentiable quality enhancement networks and the differentiable decoders, the loss value is back-propagated to z * and z * is iteratively updated until convergence, where * is a, n, s, e, denoting the diffuse reflection material, normal material, specular reflection material and spherical harmonic illumination, respectively. Finally, z a , z n and z s are fed into the diffuse, normal and specular material decoders respectively, and their outputs are fed into the corresponding material quality enhancement networks to obtain materials T a , T n and T s that match the characteristics of the person in the input image.
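    A hedged sketch of such an analysis-by-synthesis loop in PyTorch; the decoders, enhancers and renderer are placeholders standing in for the trained networks and the differentiable renderer of step 5.1, and the optimizer, learning rate and photometric loss are assumptions:

```python
import torch

def fit_latents(I_target, decoders, enhancers, render, z_init, z_e_init, steps=200):
    """
    I_target  : target face image tensor
    decoders, enhancers : dicts keyed by 'a', 'n', 's' of differentiable networks
    render    : differentiable renderer taking (T_a, T_n, T_s, z_e) -> image
    """
    z = {k: v.clone().requires_grad_(True) for k, v in z_init.items()}
    z_e = z_e_init.clone().requires_grad_(True)
    opt = torch.optim.Adam(list(z.values()) + [z_e], lr=1e-2)   # assumed optimizer

    for _ in range(steps):
        opt.zero_grad()
        T = {k: enhancers[k](decoders[k](z[k])) for k in ("a", "n", "s")}
        I_render = render(T["a"], T["n"], T["s"], z_e)
        loss = torch.nn.functional.l1_loss(I_render, I_target)   # assumed photometric loss
        loss.backward()   # gradients flow through renderer, enhancers and decoders
        opt.step()

    materials = {k: enhancers[k](decoders[k](z[k].detach())) for k in ("a", "n", "s")}
    return materials, z_e.detach()
```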
PCT/CN2020/088883 2020-05-07 2020-05-07 Micro-renderer-based method for acquiring reflection material of human face from single image WO2021223134A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/088883 WO2021223134A1 (en) 2020-05-07 2020-05-07 Micro-renderer-based method for acquiring reflection material of human face from single image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/088883 WO2021223134A1 (en) 2020-05-07 2020-05-07 Micro-renderer-based method for acquiring reflection material of human face from single image

Publications (1)

Publication Number Publication Date
WO2021223134A1 true WO2021223134A1 (en) 2021-11-11

Family

ID=78468579

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/088883 WO2021223134A1 (en) 2020-05-07 2020-05-07 Micro-renderer-based method for acquiring reflection material of human face from single image

Country Status (1)

Country Link
WO (1) WO2021223134A1 (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7756325B2 (en) * 2005-06-20 2010-07-13 University Of Basel Estimating 3D shape and texture of a 3D object based on a 2D image of the 3D object
CN102346857A (en) * 2011-09-14 2012-02-08 西安交通大学 High-precision method for simultaneously estimating face image illumination parameter and de-illumination map
CN102426695A (en) * 2011-09-30 2012-04-25 北京航空航天大学 Virtual-real illumination fusion method of single image scene
CN108765550A (en) * 2018-05-09 2018-11-06 华南理工大学 A kind of three-dimensional facial reconstruction method based on single picture

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419233A (en) * 2021-12-31 2022-04-29 网易(杭州)网络有限公司 Model generation method and device, computer equipment and storage medium
CN114677292A (en) * 2022-03-07 2022-06-28 北京航空航天大学 High-resolution material recovery method based on two image inverse rendering neural network
CN114842121A (en) * 2022-06-30 2022-08-02 北京百度网讯科技有限公司 Method, device, equipment and medium for generating mapping model training and mapping
CN114842121B (en) * 2022-06-30 2022-09-09 北京百度网讯科技有限公司 Method, device, equipment and medium for generating mapping model training and mapping
CN116091684A (en) * 2023-04-06 2023-05-09 杭州片段网络科技有限公司 WebGL-based image rendering method, device, equipment and storage medium
CN117132461A (en) * 2023-10-27 2023-11-28 中影年年(北京)文化传媒有限公司 Method and system for whole-body optimization of character based on character deformation target body
CN117132461B (en) * 2023-10-27 2023-12-22 中影年年(北京)文化传媒有限公司 Method and system for whole-body optimization of character based on character deformation target body
CN117173383B (en) * 2023-11-02 2024-02-27 摩尔线程智能科技(北京)有限责任公司 Color generation method, device, equipment and storage medium
CN117173383A (en) * 2023-11-02 2023-12-05 摩尔线程智能科技(北京)有限责任公司 Color generation method, device, equipment and storage medium
CN117173343A (en) * 2023-11-03 2023-12-05 北京渲光科技有限公司 Relighting method and relighting system based on nerve radiation field
CN117173343B (en) * 2023-11-03 2024-02-23 北京渲光科技有限公司 Relighting method and relighting system based on nerve radiation field
CN117372604A (en) * 2023-12-06 2024-01-09 国网电商科技有限公司 3D face model generation method, device, equipment and readable storage medium
CN117372604B (en) * 2023-12-06 2024-03-08 国网电商科技有限公司 3D face model generation method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
WO2021223134A1 (en) Micro-renderer-based method for acquiring reflection material of human face from single image
CN111652960B (en) Method for solving human face reflection material from single image based on micro-renderer
Lombardi et al. Neural volumes: Learning dynamic renderable volumes from images
Nguyen-Phuoc et al. Rendernet: A deep convolutional network for differentiable rendering from 3d shapes
LeGendre et al. Deeplight: Learning illumination for unconstrained mobile mixed reality
Tewari et al. State of the art on neural rendering
Wang et al. Nerf-art: Text-driven neural radiance fields stylization
Georgoulis et al. Reflectance and natural illumination from single-material specular objects using deep learning
US20050017968A1 (en) Differential stream of point samples for real-time 3D video
Kopanas et al. Neural point catacaustics for novel-view synthesis of reflections
Bemana et al. Eikonal fields for refractive novel-view synthesis
US20030117675A1 (en) Curved image conversion method and record medium where this method for converting curved image is recorded
Li et al. Topologically consistent multi-view face inference using volumetric sampling
Huang et al. Refsr-nerf: Towards high fidelity and super resolution view synthesis
Karunratanakul et al. Harp: Personalized hand reconstruction from a monocular rgb video
Schwandt et al. A single camera image based approach for glossy reflections in mixed reality applications
Zhang et al. Video-driven neural physically-based facial asset for production
Ren et al. Facial geometric detail recovery via implicit representation
Feng et al. Learning disentangled avatars with hybrid 3d representations
Lin et al. Single-shot implicit morphable faces with consistent texture parameterization
Wang et al. GaussianHead: Impressive Head Avatars with Learnable Gaussian Diffusion
Sumantri et al. 360 panorama synthesis from a sparse set of images on a low-power device
Lin et al. Multiview textured mesh recovery by differentiable rendering
US20230031750A1 (en) Topologically consistent multi-view face inference using volumetric sampling
Chu et al. GPAvatar: Generalizable and Precise Head Avatar from Image (s)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20934489

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20934489

Country of ref document: EP

Kind code of ref document: A1