CN114119923B - Three-dimensional face reconstruction method and device and electronic equipment

Info

Publication number
CN114119923B
Authority
CN
China
Prior art keywords
face
map
diffuse reflection
convolution layer
feature
Prior art date
Legal status
Active
Application number
CN202111435940.XA
Other languages
Chinese (zh)
Other versions
CN114119923A (en)
Inventor
胡志鹏
林江科
袁燚
范长杰
卜佳俊
Current Assignee
Zhejiang University ZJU
Netease Hangzhou Network Co Ltd
Original Assignee
Zhejiang University ZJU
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU, Netease Hangzhou Network Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202111435940.XA
Publication of CN114119923A
Application granted
Publication of CN114119923B

Classifications

    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06T15/205 Image-based rendering
    • G06T3/02
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30201 Face

Abstract

The application provides a three-dimensional face reconstruction method, a three-dimensional face reconstruction device and electronic equipment, relates to the technical field of three-dimensional face reconstruction, and addresses the technical problem of poor accuracy of output maps. The method comprises the following steps: extracting features of the current face image to obtain a face feature map; inputting the face feature map into a face prediction neural network, wherein the face prediction network comprises a diffuse reflection network in which a first encoder and a first decoder formed by a first affine convolution layer are connected by a skip connection, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer; predicting a convolution kernel through the first auxiliary convolution layer and the face feature map, and outputting a first affine transformation matrix of each pixel point; converting the convolution kernels to target positions corresponding to the current feature categories based on the first affine transformation matrix so that the first main convolution layer extracts diffuse reflection features corresponding to each target position and outputs a diffuse reflection map; and performing three-dimensional face reconstruction based on the diffuse reflection map.

Description

Three-dimensional face reconstruction method and device and electronic equipment
Technical Field
The present application relates to the field of three-dimensional face reconstruction technologies, and in particular, to a three-dimensional face reconstruction method and apparatus, and an electronic device.
Background
At present, with the great success of neural networks in the field of computer vision, researchers have proposed schemes that directly regress the coefficients of a three-dimensional face morphable model (3D Morphable Model, 3DMM) from the face image input to the neural network. To obtain paired two-dimensional/three-dimensional data for supervised learning, researchers have generated synthetic data by randomly sampling deformable face models, or have created ground-truth samples by using iterative optimization methods to fit a large number of face images.
Recently, differentiable rendering techniques have been introduced into the three-dimensional face reconstruction task. With differentiable rendering, facial textures such as UV maps can be optimized during the training phase. Some researchers have even directly designed completely unsupervised network structures for three-dimensional face reconstruction. These methods use a network structure of multiple autoencoders (encoder-decoder), take a face image as input, output an albedo map, a depth map and the like of the three-dimensional face under a frontal view, and then construct a loss function for training through a differentiable renderer.
However, the autoencoder cannot guarantee the output precision, and its performance cannot easily be improved.
Disclosure of Invention
The application aims to provide a three-dimensional face reconstruction method, a three-dimensional face reconstruction device and electronic equipment, in which an affine transformation is applied to the pixel positions of the input face image so that the features extracted from the input face image correspond to the target feature categories of the output diffuse reflection map, thereby alleviating the technical problem of poor accuracy of the output map.
In a first aspect, an embodiment of the present application provides a three-dimensional face reconstruction method, where the method includes:
performing feature extraction on a current face image to obtain a face feature map, wherein the face feature map comprises a plurality of feature categories;
inputting the face feature map into a face prediction neural network, wherein the face prediction network comprises a diffuse reflection network in which a first encoder and a first decoder formed by a first affine convolution layer are connected by a skip connection, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer;
predicting a convolution kernel through the first auxiliary convolution layer and the face feature map, and outputting a first affine transformation matrix of each pixel point;
converting the convolution kernel to a target position corresponding to the current feature category based on the first affine transformation matrix so that the first main convolution layer extracts diffuse reflection features corresponding to each target position and outputs a diffuse reflection map;
and performing three-dimensional face reconstruction based on the diffuse reflection map.
In one possible implementation, the step of predicting the convolution kernel through the first auxiliary convolution layer and the face feature map and outputting the first affine transformation matrix of each pixel point includes:
predicting the convolution kernel position of each pixel point based on the texture coordinate corresponding to each target feature category in the target grid and the face feature map through the first auxiliary convolution layer, and outputting an affine transformation matrix of each pixel point so that each pixel point corresponds to the texture coordinate of the corresponding target feature category of the pixel point under the action of the affine transformation matrix.
In one possible implementation, the face prediction network further includes a position network in which a second encoder and a second decoder formed by a second affine convolution layer are connected by a skip connection, the second affine convolution layer including a second main convolution layer and a second auxiliary convolution layer; the method further comprises the following steps:
predicting a convolution kernel through the second auxiliary convolution layer and the face feature map, and outputting a second affine transformation matrix of each pixel point;
converting the pixel point corresponding to each feature type to the target position corresponding to the feature type based on the second affine transformation matrix, so that the second main convolution layer extracts the position feature corresponding to each target position, and outputting a position map;
the three-dimensional face reconstruction based on the diffuse reflection mapping comprises the following steps:
and performing three-dimensional face reconstruction based on the position map and the diffuse reflection mapping.
In one possible implementation, the face prediction network further comprises an illumination network, the illumination network comprising a third encoder formed by a third affine convolution layer and a bilinear upsampling layer, the third affine convolution layer comprising a third main convolution layer and a third auxiliary convolution layer; the method further comprises the following steps:
predicting a convolution kernel through the third auxiliary convolution layer and the face feature map, and outputting a third affine transformation matrix of each pixel point;
converting the pixel point corresponding to each feature type to the target position corresponding to the feature type based on the third affine transformation matrix, so that the third main convolution layer extracts the illumination feature corresponding to each target position and outputs an illumination map;
the three-dimensional face reconstruction based on the diffuse reflection mapping comprises the following steps:
and performing three-dimensional face reconstruction based on the illumination map, the position map and the diffuse reflection map.
In one possible implementation, the face prediction network further includes a renderer, and the method further includes:
inputting the illumination map, the position map and the diffuse reflection map into the renderer;
respectively reading position characteristic information, illumination characteristic information and color characteristic information from the position map, the light map and the diffuse reflection map according to texture coordinates of the vertex of the target grid;
and generating a projection rendering image of the three-dimensional face in a two-dimensional space according to the position characteristic information, the illumination characteristic information and the color characteristic information.
In one possible implementation, the renderer is a differentiable renderer.
In one possible implementation, the method further comprises:
training the face prediction neural network by back-propagating through the differentiable renderer until the loss function meets expectations, wherein the loss function comprises at least one of the following: a perceptual loss term, a reconstruction loss term, a symmetry loss term, and a skin color loss term.
In one possible implementation, the perceptual loss term includes a first perceptual loss term and a second perceptual loss term, and the method further includes:
extracting a first characteristic vector of the current face image and a second characteristic vector of the projection rendering image;
determining a first perception loss term of the face prediction neural network based on the first feature vector and the second feature vector;
and determining a second perceptual loss term according to the difference between the feature vector of the diffuse reflection map and that of the diffuse reflection map true value.
In one possible implementation, the reconstruction loss terms include a first reconstruction loss term and a second reconstruction loss term, the method including:
determining a first reconstruction loss term according to the pixel difference value between the current face image and the projection rendering image;
and determining a second reconstruction loss term according to the pixel difference between the diffuse reflection map and the diffuse reflection map true value.
In one possible implementation, the method further comprises:
and determining a symmetry loss term according to the diffuse reflection map and the horizontally flipped diffuse reflection map.
In one possible implementation, the method further comprises:
and performing Gaussian blur processing on the diffuse reflection map, and determining the skin color loss term based on the standard deviation of the color values of the pixels in the skin area.
In a second aspect, a three-dimensional face reconstruction apparatus is provided, the apparatus comprising:
the extraction module is used for extracting the features of the current face image to obtain a face feature map, wherein the face feature map comprises a plurality of feature categories;
the input module is used for inputting the face feature map into a face prediction neural network, wherein the face prediction neural network comprises a diffuse reflection network in which a first encoder and a first decoder formed by a first affine convolution layer are connected by a skip connection, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer;
the matrix output module is used for outputting a first affine transformation matrix of each pixel point through the first auxiliary convolution layer and the face feature map prediction convolution kernel;
an affine transformation module, configured to convert the convolution kernel to a target position corresponding to a current feature class based on the first affine transformation matrix, so that the first main convolution layer extracts a diffuse reflection feature corresponding to each target position, and outputs a diffuse reflection map;
and the reconstruction module is used for reconstructing a three-dimensional face based on the diffuse reflection map.
In a third aspect, an embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor implements the method of the first aspect when executing the computer program.
In a fourth aspect, this embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, which, when invoked and executed by a processor, cause the processor to perform the method of the first aspect.
The embodiment of the application brings the following beneficial effects:
according to the three-dimensional face reconstruction method, the three-dimensional face reconstruction device and the electronic equipment, the feature class features corresponding to the output diffuse reflection maps correspond to the feature classes of the input face images.
In this scheme, because the spatial layout of the input image and that of the actual map may differ, the eyebrow position in the input image may be aligned with the eye position of the diffuse reflection map; that is, the eyebrow feature of the input image would be extracted and output as the eye feature of the map, so that the accuracy of the diffuse reflection map output by the encoder is poor. The auxiliary convolution layer outputs an affine transformation matrix according to the target position of each feature category in the input image to transform the position of the convolution kernel, so that the feature corresponding to the target category can be extracted with the position-transformed convolution kernel, which ensures the accuracy of the output result of the encoder.
In order to make the aforementioned objects, features and advantages of the present application comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the detailed description of the present application or the technical solutions in the prior art, the drawings needed to be used in the detailed description of the present application or the prior art description will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flow chart of a three-dimensional face reconstruction method according to an embodiment of the present application;
FIG. 2 illustrates an affine convolutional layer application diagram provided by an embodiment of the present application;
fig. 3 is a schematic diagram illustrating an application of a face prediction neural network according to an embodiment of the present application;
fig. 4 is a schematic diagram of a training method for a face prediction neural network according to an embodiment of the present application;
fig. 5 is another schematic flow chart of a three-dimensional face reconstruction method according to an embodiment of the present application;
fig. 6 is an image result corresponding to the three-dimensional face reconstruction method provided in the embodiment of the present application;
fig. 7 is a schematic structural diagram of a three-dimensional face reconstruction device according to an embodiment of the present application;
fig. 8 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "comprising" and "having," and any variations thereof, as referred to in the embodiments of the present application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For the sake of understanding, the following terms of art are to be interpreted accordingly:
Deep learning is a machine learning algorithm composed of large numbers of neurons; because it can solve complex nonlinear problems well, it is now widely applied in many fields such as computer vision, speech recognition and natural language processing.
Affine transformation (also called Affine mapping) is a process in which, in geometry, one vector space is linearly transformed once and then translated into another vector space.
Convolutional Neural Networks (CNNs) are networks that use a mathematical operation called convolution. Convolution is a special linear operation, and convolutional networks are special neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
Mesh, generally referring to a triangular mesh in the embodiments of the present invention, is a data structure for representing three-dimensional models. It is composed of vertices in three-dimensional space and triangular patches connecting groups of three vertices. In addition to its position coordinates, each vertex may contain information such as color and normal.
Diffuse reflection maps (Diffuse maps) reflect the color and intensity of an object surface under Diffuse reflection, represent the inherent color and texture of an object, and are the most fundamental maps of objects. Which can also be understood directly as texture in general.
The Position Map (Position Map) is a 2D image that records the 3D coordinates of a complete point cloud while preserving the semantics of each UV Position. The process of creating a texture map, a normal map, a bump map, or the like on a two-dimensional UV space is called UV unfolding. U and V refer to horizontal and vertical axes of a 2D space because X, Y and Z have been used in a 3D space.
A three-dimensional face morphable model (3D Morphable Model, 3DMM) is composed of a mesh, and each dimension of its coefficients controls a local variation of the human face.
Three-dimensional face reconstruction is a hot problem in the field of computer vision, and takes one or more face images as input and outputs three-dimensional representation of a face. The three-dimensional face has various representation methods, and more common methods include Mesh (Mesh), Voxel (Voxel), Point Cloud (Point Cloud), Depth Map (Depth Map), and the like.
Most early three-dimensional face reconstruction methods focus only on geometric information, namely the shape of the face, and neglect the texture information of the face. This is mainly because three-dimensional face data are difficult to acquire at large scale, while training neural networks in a supervised fashion requires a large amount of data. With the introduction of differentiable rendering, the loss function can be computed and back-propagated between the input face image and the rendered three-dimensional face, so that a neural network can be trained to predict a three-dimensional face from a two-dimensional face image in a self-supervised or weakly supervised fashion.
Since Blanz and Vetter proposed the 3D Morphable Face Model (3DMM) in 1999, much work in the field of three-dimensional face reconstruction has been based on 3DMM. The classic 3DMM-based three-dimensional face reconstruction method iteratively optimizes a template mesh to fit it to the input two-dimensional face image. However, such methods are very sensitive to the lighting, expression and pose of the face image. Although some later work improves these iterative optimization methods, they do not perform well on face images acquired outside laboratory environments.
Furthermore, because the input image and the output map are not spatially aligned, a convolutional neural network cannot handle this well. Therefore, when an autoencoder network is used, the input image needs to be encoded into a latent code vector, which results in information loss. Moreover, performance cannot be improved by adding skip connections to the autoencoder.
Based on this, the embodiments of the application provide a three-dimensional face reconstruction method and device and an electronic device, which can alleviate the technical problem that an autoencoder cannot guarantee output accuracy during three-dimensional face reconstruction.
Embodiments of the present application are further described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a three-dimensional face reconstruction method according to an embodiment of the present application. The method can be applied to a server: three-dimensional face information is obtained from an input two-dimensional face image, so that three-dimensional face reconstruction in scenes such as games and social applications can be realized and the three-dimensional effect of the face can be displayed. As shown in fig. 1, the method includes:
and S102, extracting the features of the current face image to obtain a face feature image.
The human face feature map comprises a plurality of feature categories, and each feature category corresponds to a pixel point and a pixel position.
It should be noted that the features can be extracted through a feature extraction neural network; that is, features of the face image are extracted to obtain each feature category corresponding to the face image, together with the pixel points and positions corresponding to each feature category.
For example, the facial image includes an eyebrow feature category, an eye feature category, a nose feature category, a mouth feature category, and the like. Each feature category may include a plurality of pixels, and each pixel corresponds to a position coordinate.
And step S104, inputting the face feature map into a face prediction neural network.
The face prediction network comprises a diffuse reflection network in which a first encoder and a first decoder formed by a first affine convolution layer are connected by a skip connection, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer.
It should be noted that the diffuse reflection network is a network structure similar to U-Net, consisting mainly of an encoder (down-sampling) and a decoder (up-sampling) connected by skip connections. Unlike the usual U-Net structure, all convolution layers in the encoder are replaced by affine convolution layers. The input to the diffuse reflection network is the face feature map, and it outputs a diffuse reflection map d, as shown in fig. 3.
In the embodiment of the invention, the affine convolution layers are used to align each feature category of the input image with that of the diffuse reflection map, so that skip connections can be added to the network at the stage from the input image to the diffuse reflection map, and the affine convolution can automatically learn to handle the spatial misalignment between the 2D input image and the UV space of the diffuse reflection map.
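For illustration only, the following Python (PyTorch) sketch shows one possible way to organize such a U-Net-like diffuse reflection network; the class name, channel widths and depth are assumptions, and AffineConv2d denotes the hypothetical affine convolution layer sketched after the affine transformation formula below. A practical network would use a deeper encoder/decoder stack with skip connections at every scale.

import torch
import torch.nn as nn

class DiffuseNet(nn.Module):
    """Sketch of the U-Net-like diffuse reflection network: an encoder built from
    affine convolution layers, a convolutional decoder, and a skip connection
    between encoder and decoder stages."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.enc1 = AffineConv2d(in_ch, base)              # full resolution
        self.pool = nn.MaxPool2d(2)
        self.enc2 = AffineConv2d(base, base * 2)           # half resolution
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec1 = nn.Conv2d(base * 2, base, 3, padding=1)
        self.out = nn.Conv2d(base * 2, 3, 3, padding=1)    # diffuse reflection map d

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))
        e2 = torch.relu(self.enc2(self.pool(e1)))
        d1 = torch.relu(self.dec1(self.up(e2)))
        # skip connection: concatenate encoder features with decoder features
        return torch.sigmoid(self.out(torch.cat([d1, e1], dim=1)))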
And step S106, predicting a convolution kernel through the first auxiliary convolution layer and the face feature map, and outputting a first affine transformation matrix of each pixel point.
An affine convolution layer comprises two ordinary convolution layers; the auxiliary convolution layer is itself a convolutional network and is named the auxiliary convolution layer to distinguish it from the main convolution layer. In other words, an affine convolution layer involves two convolution operations. The output of the auxiliary convolution is an affine transformation matrix, based on which new coordinates are calculated.
It should be noted that the first affine transformation matrix corresponding to each pixel point in the current face feature map is output by predicting the convolution kernel through the first auxiliary convolution layer and the face feature map. Optionally, each pixel point may correspond to one convolution kernel.
And step S108, converting the convolution kernels to target positions corresponding to the current feature types based on the first affine transformation matrix, so that the first main convolution layer extracts the diffuse reflection features corresponding to each target position, and outputting a diffuse reflection map.
As shown in fig. 2, based on the affine transformation matrix, the 3×3 square convolution kernel on the left is transformed into a rhombus; feature extraction is then performed with the rhombus-shaped convolution kernel through the main convolution layer to obtain the small square on the right, which is output. It should be noted that, at this point, the face feature category extracted by the rhombus convolution kernel corresponds to the feature category of the target output.
For example, if the face feature category covered by the original square convolution kernel is the eyebrow while the target output feature category is the eye, the convolution kernel is moved to the pixel coordinates of the eye feature category through the affine transformation matrix, so that the output face feature category is the eye and is consistent with the target output feature category.
And step S110, reconstructing the three-dimensional face based on the diffuse reflection mapping.
Because the extracted feature category is kept consistent with the output feature category through this transformation, the output diffuse reflection map is accurate, and a more accurate three-dimensional face reconstruction operation is realized.
In the embodiment of the application, the problem that the feature categories of the input image and of the output diffuse reflection map are not spatially aligned can be solved in an autoencoder or similar network structure, and at the same time skip connections can be added to the autoencoder, thereby improving the performance of the autoencoder network.
Because the spatial layout of the input image and that of the actual map may differ, the eyebrow position in the input image may be aligned with the eye position of the diffuse reflection map; that is, the eyebrow feature of the input image would be extracted and output as the eye feature of the map, so that the accuracy of the diffuse reflection map output by the encoder is poor. The auxiliary convolution layer outputs an affine transformation matrix according to the target position of each feature category in the input image to transform the position of the convolution kernel, so that the feature corresponding to the target category can be extracted with the position-transformed convolution kernel, which ensures the accuracy of the output result of the encoder.
The above steps are described in detail below.
In some embodiments, the auxiliary convolution layer may generate an affine transformation matrix to spatially align feature classes of the input image and the output diffuse reflectance image. As an example, the step S106 may include the following steps:
step 1.1), predicting the convolution kernel position of each pixel point based on the texture coordinate corresponding to each target feature category in the target grid and the face feature map through the first auxiliary convolution layer, and outputting an affine transformation matrix of each pixel point so that each pixel point corresponds to the texture coordinate of the corresponding target feature category of the pixel point under the action of the affine transformation matrix.
The affine convolution network enables the convolution kernel to extract features from any region of the image through an affine transformation. The network structure of the affine convolution is shown in fig. 2. The auxiliary convolution layer outputs an affine transformation matrix based on the texture UV coordinates corresponding to each feature category in the target grid and the position coordinates of the current feature category, so that the convolution kernel calculates new coordinates based on the affine transformation matrix, and the feature category at the new coordinates of the convolution kernel is consistent with the corresponding output target feature category.
Assuming that the width, height and channel number of three dimensions of the face feature map are W, H and C respectively, the dimension of the affine transformation matrix generated by the auxiliary convolution layer is W × H × 6. For each pixel point on the feature map, its affine transformation matrix can be represented by 6 values.
For example, for a convolution kernel coordinate (x, y), given the 6 values (a, b, c, d, e, f) of the affine transformation matrix, the new convolution kernel coordinate (x′, y′) can be calculated by the following formulas:

x′ = a·x + b·y + c
y′ = d·x + e·y + f
The new coordinates of the convolution kernel at each position are calculated through the affine matrix, the corresponding convolution kernels of the main convolution layer are then transformed to the new coordinates, the features corresponding to the convolution kernels at the new coordinates are extracted, and the features of each feature category are output at the positions corresponding to the target feature categories.
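As a rough illustration of this mechanism, the sketch below predicts the six affine values per pixel with an auxiliary convolution and resamples the input at the transformed coordinates before applying the main convolution. This warps the feature map with the per-pixel affine transform rather than offsetting each kernel tap individually, so it is a simplification of the affine convolution described here, and the module and parameter names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineConv2d(nn.Module):
    """Sketch of an affine convolution layer: the auxiliary convolution predicts the
    six affine values (a, b, c, d, e, f) for every pixel; the input is resampled at
    the transformed coordinates and then passed through the main convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.aux = nn.Conv2d(in_ch, 6, kernel_size, padding=pad)    # W x H x 6 output
        self.main = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)

    def forward(self, x):
        _, _, h, w = x.shape
        a, b, c, d, e, f = self.aux(x).unbind(dim=1)                # each (N, H, W)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                                torch.linspace(-1, 1, w, device=x.device),
                                indexing="ij")
        # x' = a*x + b*y + c,  y' = d*x + e*y + f   (in normalized coordinates)
        grid = torch.stack((a * xs + b * ys + c, d * xs + e * ys + f), dim=-1)
        warped = F.grid_sample(x, grid, align_corners=True)         # (N, C, H, W)
        return self.main(warped)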
In some embodiments, the face prediction network further comprises a position network in which a second encoder and a second decoder formed by a second affine convolution layer are connected by a skip connection, the second affine convolution layer comprising a second main convolution layer and a second auxiliary convolution layer; taking the position network and the diffuse reflection network into account together enables more accurate three-dimensional face reconstruction. As an example, the method may further comprise the following steps:
and 2.1) predicting a convolution kernel through the second auxiliary convolution layer and the face feature map, and outputting a second affine transformation matrix of each pixel point.
And 2.2) converting the convolution kernels to target positions corresponding to the current feature classes based on the second affine transformation matrix, so that the second main convolution layer extracts position features corresponding to each target position and outputs a position map.
Wherein, step S110 further includes: and 2.3) reconstructing a three-dimensional face based on the position map and the diffuse reflection map.
The structure of the position network is basically the same as that of the diffuse reflection network described above; its input is also the face image and its output is a position map, and the position of the convolution kernel is transformed through the affine transformation matrix output by the auxiliary convolution layer so as to output features of the same category as the target features.
Because real-world illumination is complex, it is difficult to fully simulate the illumination on a face image simply with parallel light, and because of the various occluders that may be present in a face image (such as bangs, glasses, etc.), the illumination is difficult to represent with a small number of parameters. The embodiment of the invention therefore proposes to simulate the illumination information with an illumination map, so that the illumination is decoupled from the face image.
In some embodiments, by improving the representation of the illumination, the illumination network can more accurately simulate real-world illumination information. The embodiment of the invention uses an illumination map to represent the illumination information, which can simulate real-world illumination more freely than parallel-light or spherical-harmonic illumination parameters. As an example, the face prediction network further comprises an illumination network, the illumination network comprising a third encoder formed by a third affine convolution layer and a bilinear upsampling layer, the third affine convolution layer comprising a third main convolution layer and a third auxiliary convolution layer; the method further comprises the following steps:
and 3.1) predicting a convolution kernel through the third auxiliary convolution layer and the face feature map, and outputting a third affine transformation matrix of each pixel point.
And 3.2) converting the convolution kernels to target positions corresponding to the current feature types based on a third affine transformation matrix, so that the third main convolution layer extracts illumination features corresponding to each target position and outputs an illumination map.
Wherein, step S110 further includes: and 3.3) carrying out three-dimensional face reconstruction based on the illumination map, the position map and the diffuse reflection map.
The structure of the illumination network differs from the structures of the position network and the diffuse reflection network introduced above: the illumination network replaces the decoder with a bilinear upsampling layer and also removes the skip connections, so that the illumination network focuses on low-frequency information in the face image rather than high-frequency information such as pores. The input of the network is the face image and the output is an illumination map.
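Under the same assumptions as the earlier sketches, a correspondingly simplified illumination network might look as follows: an affine-convolution encoder, bilinear upsampling in place of a learned decoder, and no skip connections.

import torch
import torch.nn as nn

class LightNet(nn.Module):
    """Sketch of the illumination network: an affine-convolution encoder followed by
    bilinear upsampling instead of a learned decoder, with no skip connections, so
    the predicted illumination map carries mainly low-frequency information."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.encoder = nn.Sequential(
            AffineConv2d(in_ch, base), nn.ReLU(), nn.MaxPool2d(2),
            AffineConv2d(base, base * 2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(base * 2, 3, 1))
        self.up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)

    def forward(self, x):
        return torch.sigmoid(self.up(self.encoder(x)))     # illumination map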
In the embodiment of the present invention, the three-dimensional face reconstruction described in step S110 may be understood as generating a three-dimensional representation of the face from the face image, where the three-dimensional representation includes a three-dimensional face mesh (the target mesh) and a corresponding diffuse reflection map. The embodiment of the invention also predicts an illumination map from the face image at the same time, which is used to decouple the illumination information from the face image so as to generate a better diffuse reflection map.
In some embodiments, a renderer may be introduced to enable a more accurate three-dimensional face reconstruction process; for example, the face prediction network also comprises a renderer, which is preferably differentiable so that gradients can be back-propagated to train the neural network. The method further comprises the following steps:
and 4.1) inputting the illumination map, the position map and the diffuse reflection map into a renderer.
And 4.2) respectively reading position feature information, illumination feature information and color feature information from the position map, the illumination map and the diffuse reflection map according to the texture coordinates of the vertices of the target grid, and generating a projection rendering of the three-dimensional face in two-dimensional space.
The differentiable renderer is used to generate a projection of the three-dimensional face onto two-dimensional space from a given three-dimensional game face (target mesh), which comprises the coordinates of the mesh vertices, the definition of the triangular patches, and information such as the corresponding diffuse reflection map, position map and illumination map.
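A minimal sketch of this sampling step is given below: it reads per-vertex position, illumination and color values from the three maps at the mesh vertices' UV coordinates and applies a simple multiplicative shading. The actual rasterization onto the image plane (e.g. with an off-the-shelf differentiable rasterizer) is omitted, and the tensor layout is an assumption.

import torch.nn.functional as F

def sample_maps_at_uv(position_map, light_map, diffuse_map, vertex_uv):
    """vertex_uv: (N, V, 2) texture coordinates in [0, 1]; each map is (N, C, H, W)."""
    grid = (vertex_uv * 2.0 - 1.0).unsqueeze(2)            # (N, V, 1, 2) in [-1, 1]

    def sample(m):                                         # -> (N, V, C)
        return F.grid_sample(m, grid, align_corners=True).squeeze(-1).permute(0, 2, 1)

    positions = sample(position_map)                       # per-vertex 3D coordinates
    shaded = sample(diffuse_map) * sample(light_map)       # simple multiplicative shading
    return positions, shaded                               # handed to the rasterizer afterwards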
As another example, in order to obtain a more accurate three-dimensional face reconstruction result, the face prediction neural network may be further trained, for example, the method further includes:
and 5.1) reversely propagating and training a face prediction neural network through a micro-renderer until a loss function reaches an expectation, wherein the loss function comprises a perception loss item, a reconstruction loss item, a symmetry loss item and a skin color loss item.
In some embodiments, the perceptual loss term includes a first perceptual loss term and a second perceptual loss term, and the training method further includes:
step 6.1), extracting a first characteristic vector of the current face image and a second characteristic vector of the projection rendering image;
step 6.2), determining a first perception loss item of the face prediction neural network based on the first feature vector and the second feature vector;
among them, the purpose of the perceptual loss is to minimize the difference in feature vectors between the rendered image (projected rendering) and the input image (current face image). As an alternative embodiment, the embodiment of the present invention may use a neural network such as VGG19 pre-trained on a public data set such as ImageNet as a feature extractor to extract feature vectors of an input image and a rendered image respectively, and then calculate the difference between the two feature vectors as a perception loss term Lperc(). The formula is expressed as:
Lperc(x, x′) = ||F(x) − F(x′)||2
where x and x′ represent the input image and the rendered image respectively, F(·) represents the feature extractor, and F(x) represents the extracted feature vector.
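A sketch of this term under the stated choice of VGG19 features might look as follows; the layer cut-off and the squared form of the difference are assumptions.

import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """Sketch of Lperc: squared difference between VGG19 feature maps of the input
    image x and the rendered image x'."""
    def __init__(self):
        super().__init__()
        self.extractor = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)

    def forward(self, x, x_prime):
        return torch.mean((self.extractor(x) - self.extractor(x_prime)) ** 2)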
And 6.3) determining a second perception loss item according to the characteristic vector comparison condition of the diffuse reflection mapping and the true value of the diffuse reflection mapping.
Here, the ground-truth data, such as the diffuse reflection map truth values and the position map truth values, are derived from the public data set RGB 3D Face.
In some embodiments, in order to obtain a more accurate three-dimensional face reconstruction result, the face prediction neural network may be further trained, and the reconstruction loss term includes a first reconstruction loss term and a second reconstruction loss term, and the method may include:
step 7.1), determining a first reconstruction loss term according to a pixel difference value between the current face image and the projection rendering image;
wherein the reconstruction loss term Lrec is calculated as the difference between corresponding pixel values of the current input image and the rendered image:
Lrec(x,x′)=||x-x′||2
where x and x' represent the input image and the rendered image, respectively.
And 7.2) determining a second reconstruction loss item according to the pixel difference comparison condition of the diffuse reflection map and the true value of the diffuse reflection map.
In some embodiments, in order to obtain a more accurate three-dimensional face reconstruction result, the face prediction neural network may be further trained, and the method further includes:
and 8.1) determining a symmetry loss item according to the diffuse reflection map and the horizontally flipped diffuse reflection map.
Because the human face is bilaterally symmetric, a symmetry loss term Lsym is designed on the diffuse reflection map. The calculation formula is as follows:

Lsym(x) = ||x − x̄||2

where x is the diffuse reflection map and x̄ is the image obtained by horizontally flipping x.
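A minimal sketch of this term, assuming the squared form written above:

import torch

def symmetry_loss(diffuse_map):
    """Sketch of Lsym: distance between the diffuse map and its horizontal flip."""
    flipped = torch.flip(diffuse_map, dims=[-1])           # flip along the width axis
    return torch.mean((diffuse_map - flipped) ** 2)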
In some embodiments, in order to obtain a more accurate three-dimensional face reconstruction result, the face prediction neural network may be further trained, and the method further includes:
step 9.1), performing Gaussian blur processing on the diffuse reflection map, and determining a skin color loss item based on the standard deviation of the color value of each pixel of the skin area.
The skin color loss is intended to promote uniformity of the overall skin color of the generated texture map. In order to keep the overall skin color consistent without affecting facial details (such as wrinkles, moles, etc.), the embodiment of the invention first performs Gaussian blur processing on the generated diffuse reflection map, then calculates the standard deviation of the color values of the pixels in the skin area, and determines the skin color loss term Lstd based on this standard deviation. When blurring the map, a suitable Gaussian-kernel blur radius and normal-distribution standard deviation are selected according to the resolution of the diffuse reflection map, so that the Gaussian-blurred map filters out high-frequency features (such as wrinkles) while retaining low-frequency features (such as the skin color of a local area).
Lstd(x) = sqrt( Σi∈Mskin (xi − x̄)² / |Mskin| )

where x represents the Gaussian-blurred image, x̄ represents the average color value over the skin region, and Mskin represents the skin region. The diffuse reflection map generated by using this Gaussian-blur-based global skin color loss keeps the global skin color consistent while preserving the personalized features of the human face.
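A minimal sketch of this term; the Gaussian kernel size, sigma and the way the skin-region mask is supplied are assumptions.

import torch
import torchvision.transforms.functional as TF

def skin_tone_loss(diffuse_map, skin_mask, kernel_size=21, sigma=5.0):
    """Sketch of Lstd: Gaussian-blur the diffuse map, then take the standard deviation
    of the color values inside the skin region (skin_mask: (N, 1, H, W), 1 = skin)."""
    blurred = TF.gaussian_blur(diffuse_map, kernel_size=kernel_size, sigma=sigma)
    n, c, _, _ = blurred.shape
    pixels = blurred.reshape(n, c, -1)                     # (N, C, H*W)
    mask = skin_mask.reshape(n, 1, -1)
    count = mask.sum(dim=-1).clamp(min=1.0)                # number of skin pixels, (N, 1)
    mean = (pixels * mask).sum(dim=-1) / count             # mean skin color, (N, C)
    var = (((pixels - mean.unsqueeze(-1)) ** 2) * mask).sum(dim=-1) / count
    return torch.sqrt(var + 1e-8).mean()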
In summary, in the training phase of the neural network, the total loss function L is as follows:
L = Lperc(d, dt) + Lrec(d, dt) + Lsym(d) + Lstd(d) + Lrec(p, pt) + Lperc(i, r) + Lrec(i, r)

where d represents the network-predicted diffuse reflection map, dt represents the diffuse reflection map true value, p represents the network-predicted position map, pt represents the position map true value, i represents the input face image, and r represents the rendered image. The face prediction neural network is trained through back-propagation according to this loss function until the loss function meets expectations.
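Combining the terms, the total loss could be sketched as follows, where perc is a PerceptualLoss instance from the earlier sketch, skin_mask marks the skin region, and the equal weighting of the terms is an assumption:

import torch

def l2(a, b):                                              # pixel-wise reconstruction loss
    return torch.mean((a - b) ** 2)

def total_loss(d, d_t, p, p_t, i, r, skin_mask, perc):
    """Sketch of L = Lperc(d,dt) + Lrec(d,dt) + Lsym(d) + Lstd(d)
                   + Lrec(p,pt) + Lperc(i,r) + Lrec(i,r)."""
    return (perc(d, d_t) + l2(d, d_t)
            + symmetry_loss(d) + skin_tone_loss(d, skin_mask)
            + l2(p, p_t)
            + perc(i, r) + l2(i, r))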
As shown in fig. 4, the flow of the training method for the face prediction neural network includes the following steps (a minimal training-loop sketch is given after the list):
step a), initializing neural network parameters;
step b), loading data of a training data set, wherein the data can comprise a face image, a corresponding diffuse reflection mapping truth value and a position mapping truth value;
step c), using a diffuse reflection network to generate a corresponding diffuse reflection map according to the input face image;
step d), using an illumination network to generate a corresponding illumination map according to the input face image;
step e), generating a corresponding position map according to the input face image by using a position network;
step f), inputting the diffuse reflection map, the illumination map and the position map into the differentiable renderer to generate a rendered image;
step g), calculating various loss functions according to the data and a loss function calculation formula, and training a face prediction neural network by using a gradient back propagation method;
step h), judging whether the loss function of the network has converged; if not, repeating steps b) to g); if so, the training is complete and the process ends.
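Under the assumptions of the earlier sketches (a model object exposing the three sub-networks, a renderer callable, and a data loader yielding the face image, the two ground-truth maps and a skin mask), the training loop of steps a) to h) could be sketched as:

import torch

def train(model, renderer, loader, perc, epochs=20, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)              # after step a), initialization
    for _ in range(epochs):                                        # repeat until convergence, step h)
        for face, d_t, p_t, skin_mask in loader:                   # step b), load training data
            d = model.diffuse_net(face)                            # step c), diffuse reflection map
            light = model.light_net(face)                          # step d), illumination map
            p = model.position_net(face)                           # step e), position map
            r = renderer(d, light, p)                              # step f), rendered image
            loss = total_loss(d, d_t, p, p_t, face, r, skin_mask, perc)   # step g)
            opt.zero_grad()
            loss.backward()                                        # gradient back-propagation
            opt.step()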
As shown in fig. 5, an embodiment of the present invention further includes a three-dimensional face reconstruction method, and an operation flow of the method may include:
loading pre-trained neural network parameters; loading any one face image; generating a diffuse reflection map according to the input face image by using a diffuse reflection network; generating a position map according to the input face image by using a position network; storing the generated diffuse reflection map and the position map as a three-dimensional file; the flow ends.
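A corresponding sketch of this inference flow, with the same hypothetical model object and an assumed output format:

import torch

def reconstruct(face_image, model, out_path="face_3d.pt"):
    """Predict the diffuse reflection map and position map for one face image and
    save them for use as a three-dimensional asset."""
    model.eval()
    with torch.no_grad():
        diffuse = model.diffuse_net(face_image)
        position = model.position_net(face_image)
    torch.save({"diffuse_map": diffuse.cpu(), "position_map": position.cpu()}, out_path)
    return diffuse, position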
In the embodiment of the invention, in an autoencoder or similar network structure, the features corresponding to the target category can be extracted with the position-transformed convolution kernel, which significantly improves the results when generating three-dimensional face maps from a two-dimensional face image. In addition, because the illumination-parameter representation is replaced by the illumination map, the face prediction neural network in the embodiment of the invention can better predict real-world illumination. Fig. 6 shows some example results: the first column on the left is the input face image, and the second to fifth columns are, in order, the diffuse reflection map, the position map, the illumination map and the rendered image generated by the face prediction neural network.
Fig. 7 provides a schematic structural diagram of a three-dimensional face reconstruction device. The device can be applied to a server. As shown in fig. 7, the three-dimensional face reconstruction apparatus 700 includes:
the extraction module 701 is configured to perform feature extraction on a current face image to obtain a face feature map, where the face feature map includes a plurality of feature categories, and each feature category corresponds to a pixel point and a pixel position;
an input module 702, configured to input the face feature map into a face prediction neural network, where the face prediction neural network includes a diffuse reflection network in which a first encoder and a first decoder formed by a first affine convolution layer are connected by a skip connection, and the first affine convolution layer includes a first main convolution layer and a first auxiliary convolution layer;
a matrix output module 703, configured to output a first affine transformation matrix of each pixel point through the first auxiliary convolution layer and the face feature map prediction convolution kernel;
an affine transformation module 704, configured to convert the convolution kernel to a target position corresponding to a current feature class based on the first affine transformation matrix, so that the first main convolution layer extracts a diffuse reflection feature corresponding to each target position, and outputs a diffuse reflection map;
a reconstruction module 705, configured to perform three-dimensional face reconstruction based on the diffuse reflection map.
In some embodiments, the matrix output module 703 is further specifically configured to predict, through the first auxiliary convolution layer, a convolution kernel position of each pixel point based on a texture coordinate corresponding to each target feature category in the target grid and the face feature map, and output an affine transformation matrix of each pixel point, so that each pixel point corresponds to the texture coordinate of the corresponding target feature category of the pixel point under the action of the affine transformation matrix.
In some embodiments, the face prediction network further comprises a location network in which a second encoder and a second decoder formed by a second affine convolution layer are jump-connected, the second affine convolution layer comprising a second main convolution layer and a second auxiliary convolution layer; the matrix output module 703 is further specifically configured to output a second affine transformation matrix of each pixel point through the second auxiliary convolution layer and the face feature map prediction convolution kernel; converting the pixel point corresponding to each feature type to the target position corresponding to the feature type based on the second affine transformation matrix, so that the second main convolution layer extracts the position feature corresponding to each target position, and outputting a position map;
in some embodiments, the reconstruction module 705 is further specifically configured to perform three-dimensional face reconstruction based on the location map and the diffuse reflection map.
In some embodiments, the face prediction network further comprises an illumination network comprising a third encoder formed by a third affine convolution layer and a bilinear upsampling layer, the third affine convolution layer comprising a third main convolution layer and a third auxiliary convolution layer; the matrix output module 703 is further specifically configured to output a third affine transformation matrix of each pixel point through the third auxiliary convolution layer and the face feature map prediction convolution kernel; and convert the pixel points corresponding to each feature type to the target positions corresponding to the feature types based on the third affine transformation matrix, so that the third main convolution layer extracts the illumination features corresponding to each target position and outputs an illumination map.
In some embodiments, the reconstruction module 705 is further specifically configured to perform three-dimensional face reconstruction based on the illumination map, the location map, and the diffuse reflection map.
In some embodiments, the face prediction network further comprises a renderer, and the apparatus further comprises a rendering module for inputting the illumination map, the location map, and the diffuse reflection map into the renderer; respectively reading position characteristic information, illumination characteristic information and color characteristic information from the position map, the light map and the diffuse reflection map according to texture coordinates of the vertex of the target grid; and generating a projection rendering image of the three-dimensional face in a two-dimensional space according to the position characteristic information, the illumination characteristic information and the color characteristic information.
In some embodiments, the renderer is a differentiable renderer.
In some embodiments, the apparatus further comprises a training module to train the face prediction neural network by back-propagation through the differentiable renderer until the loss function meets expectations, wherein the loss function comprises at least one of: perceptual loss terms, reconstruction loss terms, symmetry loss terms, and skin color loss terms.
In some embodiments, the perceptual loss term includes a first perceptual loss term and a second perceptual loss term, and the training module is further specifically configured to extract a first feature vector of the current face image and a second feature vector of the projection rendering; determining a first perception loss term of the face prediction neural network based on the first feature vector and the second feature vector; and determining a second perception loss item according to the characteristic vector comparison condition of the diffuse reflection mapping and the true value of the diffuse reflection mapping.
In some embodiments, the reconstruction loss term includes a first reconstruction loss term and a second reconstruction loss term, and the training module is further specifically configured to determine the first reconstruction loss term from a pixel difference between the current face image and the projection rendering; and determining a second reconstruction loss item according to the pixel difference comparison condition of the diffuse reflection mapping and the diffuse reflection mapping truth value.
In some embodiments, the training module is further specifically configured to determine a symmetry-loss term from the diffuse reflection map and the horizontally flipped diffuse reflection map.
In some embodiments, the training module is further specifically configured to perform gaussian blurring on the diffuse reflection map, and determine the skin color loss term based on a standard deviation of color values of each pixel of the skin region.
The three-dimensional face reconstruction device provided by the embodiment of the application has the same technical characteristics as the three-dimensional face reconstruction method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
As shown in fig. 8, an electronic device 800 includes a memory 801 and a processor 802, where the memory stores a computer program that can run on the processor, and the processor executes the computer program to implement the steps of the method provided in the foregoing embodiment.
Referring to fig. 8, the electronic device further includes: a bus 803 and a communication interface 804, the processor 802, the communication interface 804, and the memory 801 being connected by the bus 803; the processor 802 is used to execute executable modules, such as computer programs, stored in the memory 801.
The memory 801 may include a high-speed Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk storage. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 804 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, and the like may be used.
The bus 803 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 8, but this does not indicate only one bus or one type of bus.
The memory 801 is used for storing a program, and the processor 802 executes the program after receiving an execution instruction. The method performed by the apparatus defined by the processes disclosed in any of the foregoing embodiments of the present application may be applied to, or implemented by, the processor 802.
The processor 802 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by hardware integrated logic circuits or software-form instructions in the processor 802. The processor 802 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or registers. The storage medium is located in the memory 801, and the processor 802 reads the information in the memory 801 and completes the steps of the method in combination with its hardware.
Corresponding to the three-dimensional face reconstruction method, an embodiment of the present application further provides a computer-readable storage medium, where computer-executable instructions are stored, and when the computer-executable instructions are called and executed by a processor, the computer-executable instructions cause the processor to execute the steps of the three-dimensional face reconstruction method.
The three-dimensional face reconstruction device provided by the embodiment of the application can be specific hardware on equipment or software or firmware installed on the equipment. The device provided by the embodiment of the present application has the same implementation principle and technical effect as the foregoing method embodiments, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiments where no part of the device embodiments is mentioned. It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the system, the apparatus and the unit described above may all refer to the corresponding processes in the method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
For another example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the three-dimensional face reconstruction method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate the technical solutions of the present application rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features within the technical scope disclosed in the present application; such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present application, and are intended to be covered by the protection scope of the present application.

Claims (14)

1. A method for reconstructing a three-dimensional face, the method comprising:
performing feature extraction on a current face image to obtain a face feature map, wherein the face feature map comprises a plurality of feature categories;
inputting the face feature map into a face prediction neural network, wherein the face prediction neural network comprises a diffuse reflection network, a first decoder in the diffuse reflection network is connected by a skip connection to a first encoder formed by a first affine convolution layer, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer;
predicting a convolution kernel through the first auxiliary convolution layer and the face feature map, and outputting a first affine transformation matrix of each pixel point;
converting the convolution kernel to a target position corresponding to the current feature category based on the first affine transformation matrix so that the first main convolution layer extracts diffuse reflection features corresponding to each target position and outputs a diffuse reflection map;
and performing three-dimensional face reconstruction based on the diffuse reflection map.
2. The method according to claim 1, wherein predicting the convolution kernel through the first auxiliary convolution layer and the face feature map and outputting the first affine transformation matrix of each pixel point comprises:
predicting, through the first auxiliary convolution layer, the convolution kernel position of each pixel point based on the face feature map and the texture coordinate corresponding to each target feature category in the target mesh, and outputting the affine transformation matrix of each pixel point, so that under the action of the affine transformation matrix each pixel point corresponds to the texture coordinate of its corresponding target feature category.
3. The method of claim 1, wherein the face prediction neural network further comprises a position network in which a second decoder is connected by a skip connection to a second encoder formed by a second affine convolution layer, the second affine convolution layer comprising a second main convolution layer and a second auxiliary convolution layer; the method further comprises:
predicting a convolution kernel through the second auxiliary convolution layer and the face feature map, and outputting a second affine transformation matrix of each pixel point;
converting the pixel point corresponding to each feature category to the target position corresponding to that feature category based on the second affine transformation matrix, so that the second main convolution layer extracts the position feature corresponding to each target position, and outputting a position map;
wherein performing three-dimensional face reconstruction based on the diffuse reflection map comprises:
performing three-dimensional face reconstruction based on the position map and the diffuse reflection map.
4. The method of claim 3, wherein the face prediction neural network further comprises an illumination network comprising a bilinear sampling layer and a third encoder formed by a third affine convolution layer, the third affine convolution layer comprising a third main convolution layer and a third auxiliary convolution layer; the method further comprises:
predicting a convolution kernel through the third auxiliary convolution layer and the face feature map, and outputting a third affine transformation matrix of each pixel point;
converting the pixel point corresponding to each feature category to the target position corresponding to that feature category based on the third affine transformation matrix, so that the third main convolution layer extracts the illumination feature corresponding to each target position, and outputting an illumination map;
wherein performing three-dimensional face reconstruction based on the diffuse reflection map comprises:
performing three-dimensional face reconstruction based on the illumination map, the position map, and the diffuse reflection map.
5. The method of claim 4, wherein the face prediction neural network further comprises a renderer, the method further comprising:
inputting the illumination map, the position map and the diffuse reflection map into the renderer;
reading position feature information, illumination feature information, and color feature information from the position map, the illumination map, and the diffuse reflection map, respectively, according to the texture coordinates of the vertices of the target mesh;
and generating a projection rendering image of the three-dimensional face in a two-dimensional space according to the position characteristic information, the illumination characteristic information and the color characteristic information.
6. The method of claim 5, wherein the renderer is a differentiable renderer.
7. The method of claim 6, further comprising:
training the face prediction neural network by backpropagating through the differentiable renderer until the loss function reaches an expected value, wherein the loss function comprises at least one of: a perceptual loss term, a reconstruction loss term, a symmetry loss term, and a skin color loss term.
8. The method of claim 7, wherein the perceptual loss term comprises a first perceptual loss term and a second perceptual loss term, the method further comprising:
extracting a first feature vector of the current face image and a second feature vector of the projection rendering image;
determining a first perceptual loss term of the face prediction neural network based on the first feature vector and the second feature vector;
and determining a second perceptual loss term by comparing the feature vector of the diffuse reflection map with the feature vector of the diffuse reflection map ground truth.
9. The method of claim 7, wherein the reconstruction loss terms comprise a first reconstruction loss term and a second reconstruction loss term, the method further comprising:
determining a first reconstruction loss term according to the pixel difference value between the current face image and the projection rendering image;
and determining a second reconstruction loss term by comparing, pixel by pixel, the diffuse reflection map with the diffuse reflection map ground truth.
10. The method of claim 7, further comprising:
and determining a symmetry loss term according to the diffuse reflection map and the horizontally flipped diffuse reflection map.
11. The method of claim 7, further comprising:
and performing Gaussian blur processing on the diffuse reflection map, and determining the skin color loss term based on the standard deviation of the color values of the pixels in the skin region.
12. A three-dimensional face reconstruction apparatus, the apparatus comprising:
the extraction module is used for extracting the features of the current face image to obtain a face feature map, wherein the face feature map comprises a plurality of feature categories;
the input module is used for inputting the face feature map into a face prediction neural network, wherein the face prediction neural network comprises a diffuse reflection network, a first decoder in the diffuse reflection network is connected by a skip connection to a first encoder formed by a first affine convolution layer, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer;
the matrix output module is used for predicting a convolution kernel through the first auxiliary convolution layer and the face feature map and outputting a first affine transformation matrix of each pixel point;
an affine transformation module, configured to convert the convolution kernel to a target position corresponding to a current feature class based on the first affine transformation matrix, so that the first main convolution layer extracts a diffuse reflection feature corresponding to each target position, and outputs a diffuse reflection map;
and the reconstruction module is used for reconstructing a three-dimensional face based on the diffuse reflection map.
13. An electronic device comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and wherein the processor implements the steps of the method of any of claims 1 to 11 when executing the computer program.
14. A computer readable storage medium having stored thereon computer executable instructions which, when invoked and executed by a processor, cause the processor to execute the method of any of claims 1 to 11.
CN202111435940.XA 2021-11-29 2021-11-29 Three-dimensional face reconstruction method and device and electronic equipment Active CN114119923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111435940.XA CN114119923B (en) 2021-11-29 2021-11-29 Three-dimensional face reconstruction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111435940.XA CN114119923B (en) 2021-11-29 2021-11-29 Three-dimensional face reconstruction method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN114119923A CN114119923A (en) 2022-03-01
CN114119923B true CN114119923B (en) 2022-07-19

Family

ID=80367857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111435940.XA Active CN114119923B (en) 2021-11-29 2021-11-29 Three-dimensional face reconstruction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114119923B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993216A (en) * 2017-11-22 2018-05-04 腾讯科技(深圳)有限公司 A kind of image interfusion method and its equipment, storage medium, terminal
CN109741456A (en) * 2018-12-17 2019-05-10 深圳市航盛电子股份有限公司 3D based on GPU concurrent operation looks around vehicle assistant drive method and system
CN111652960A (en) * 2020-05-07 2020-09-11 浙江大学 Method for solving human face reflection material from single image based on micro-renderer
CN112348947A (en) * 2021-01-07 2021-02-09 南京理工大学智能计算成像研究院有限公司 Three-dimensional reconstruction method for deep learning based on reference information assistance
CN112435343A (en) * 2020-11-24 2021-03-02 杭州唯实科技有限公司 Point cloud data processing method and device, electronic equipment and storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107103285B (en) * 2017-03-24 2020-03-03 深圳市未来媒体技术研究院 Face depth prediction method based on convolutional neural network
CN109147048B (en) * 2018-07-23 2021-02-26 复旦大学 Three-dimensional mesh reconstruction method by utilizing single-sheet colorful image
CN109766895A (en) * 2019-01-03 2019-05-17 京东方科技集团股份有限公司 The training method and image Style Transfer method of convolutional neural networks for image Style Transfer
CN113728335A (en) * 2019-02-08 2021-11-30 新加坡健康服务有限公司 Method and system for classification and visualization of 3D images
CN112132739B (en) * 2019-06-24 2023-07-18 北京眼神智能科技有限公司 3D reconstruction and face pose normalization method, device, storage medium and equipment
GB2585645B (en) * 2019-07-08 2024-04-17 Toshiba Kk Computer vision method and system
CN111241998B (en) * 2020-01-09 2023-04-28 中移(杭州)信息技术有限公司 Face recognition method, device, electronic equipment and storage medium
CN112085836A (en) * 2020-09-03 2020-12-15 华南师范大学 Three-dimensional face reconstruction method based on graph convolution neural network
CN112115860A (en) * 2020-09-18 2020-12-22 深圳市威富视界有限公司 Face key point positioning method and device, computer equipment and storage medium
CN112257645B (en) * 2020-11-02 2023-09-01 浙江大华技术股份有限公司 Method and device for positioning key points of face, storage medium and electronic device
CN112652058A (en) * 2020-12-31 2021-04-13 广州华多网络科技有限公司 Human face image replay method and device, computer equipment and storage medium
CN113240792B (en) * 2021-04-29 2022-08-16 浙江大学 Image fusion generation type face changing method based on face reconstruction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993216A (en) * 2017-11-22 2018-05-04 腾讯科技(深圳)有限公司 A kind of image interfusion method and its equipment, storage medium, terminal
CN109741456A (en) * 2018-12-17 2019-05-10 深圳市航盛电子股份有限公司 3D based on GPU concurrent operation looks around vehicle assistant drive method and system
CN111652960A (en) * 2020-05-07 2020-09-11 浙江大学 Method for solving human face reflection material from single image based on micro-renderer
CN112435343A (en) * 2020-11-24 2021-03-02 杭州唯实科技有限公司 Point cloud data processing method and device, electronic equipment and storage medium
CN112348947A (en) * 2021-01-07 2021-02-09 南京理工大学智能计算成像研究院有限公司 Three-dimensional reconstruction method for deep learning based on reference information assistance

Also Published As

Publication number Publication date
CN114119923A (en) 2022-03-01

Similar Documents

Publication Publication Date Title
US11257279B2 (en) Systems and methods for providing non-parametric texture synthesis of arbitrary shape and/or material data in a unified framework
US11037274B2 (en) Denoising Monte Carlo renderings using progressive neural networks
Georgoulis et al. Reflectance and natural illumination from single-material specular objects using deep learning
CN109859296B (en) Training method of SMPL parameter prediction model, server and storage medium
US20210074052A1 (en) Three-dimensional (3d) rendering method and apparatus
Petersen et al. Pix2vex: Image-to-geometry reconstruction using a smooth differentiable renderer
CN113838176B (en) Model training method, three-dimensional face image generation method and three-dimensional face image generation equipment
Müller et al. Compression and Real-Time Rendering of Measured BTFs Using Local PCA.
CN112215050A (en) Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment
US20100295850A1 (en) Apparatus and method for finding visible points in a cloud point
CN102136156B (en) System and method for mesoscopic geometry modulation
US20090309877A1 (en) Soft shadow rendering
CN114746904A (en) Three-dimensional face reconstruction
CN116109798B (en) Image data processing method, device, equipment and medium
JP7129529B2 (en) UV mapping to 3D objects using artificial intelligence
CN116416376A (en) Three-dimensional hair reconstruction method, system, electronic equipment and storage medium
CN112862807A (en) Data processing method and device based on hair image
Li et al. Detailed 3D human body reconstruction from multi-view images combining voxel super-resolution and learned implicit representation
Marques et al. Deep spherical harmonics light probe estimator for mixed reality games
CN114450719A (en) Human body model reconstruction method, reconstruction system and storage medium
Tiwary et al. Towards learning neural representations from shadows
CN117333637B (en) Modeling and rendering method, device and equipment for three-dimensional scene
Zhang et al. DIMNet: Dense implicit function network for 3D human body reconstruction
CN115713585B (en) Texture image reconstruction method, apparatus, computer device and storage medium
CN114119923B (en) Three-dimensional face reconstruction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant