CN114119923B - Three-dimensional face reconstruction method and device and electronic equipment

Info

Publication number
CN114119923B
Authority
CN
China
Prior art keywords
face
map
diffuse reflection
convolution layer
feature
Prior art date
Legal status
Active
Application number
CN202111435940.XA
Other languages
Chinese (zh)
Other versions
CN114119923A (en)
Inventor
胡志鹏
林江科
袁燚
范长杰
卜佳俊
Current Assignee
Zhejiang University ZJU
Netease Hangzhou Network Co Ltd
Original Assignee
Zhejiang University ZJU
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU, Netease Hangzhou Network Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202111435940.XA
Publication of CN114119923A
Application granted
Publication of CN114119923B

Classifications

    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06T15/205 Image-based rendering
    • G06T3/02
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30201 Face

Abstract

The application provides a three-dimensional face reconstruction method, a three-dimensional face reconstruction device and electronic equipment, relates to the technical field of three-dimensional face reconstruction, and addresses the technical problem of poor accuracy of output maps. The method comprises the following steps: extracting features of the current face image to obtain a face feature map; inputting the face feature map into a face prediction neural network, wherein the face prediction network comprises a diffuse reflection network in which a first encoder and a first decoder formed by a first affine convolution layer are connected by a skip connection, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer; predicting a convolution kernel through the first auxiliary convolution layer and the face feature map, and outputting a first affine transformation matrix of each pixel point; converting the convolution kernels to target positions corresponding to the current feature categories based on the first affine transformation matrix so that the first main convolution layer extracts diffuse reflection features corresponding to each target position and outputs a diffuse reflection map; and performing three-dimensional face reconstruction based on the diffuse reflection map.

Description

Three-dimensional face reconstruction method and device and electronic equipment
Technical Field
The present application relates to the field of three-dimensional face reconstruction technologies, and in particular, to a three-dimensional face reconstruction method and apparatus, and an electronic device.
Background
At present, with the great success of neural networks in the field of computer vision, researchers have proposed schemes that directly regress the coefficients of a three-dimensional face morphable model (3D Morphable Model, 3DMM) from the face image input to the neural network. To obtain paired two-dimensional/three-dimensional data for supervised learning, researchers have generated synthetic data by randomly sampling deformable face models, or have created ground-truth samples by using iterative optimization methods to fit a large number of face images.
Recently, differentiable rendering techniques have been introduced into the three-dimensional face reconstruction task. With differentiable rendering, facial textures such as UV maps can be optimized during the training phase. Some researchers have even directly designed completely unsupervised network structures for three-dimensional face reconstruction. These methods use a network structure of multiple autoencoders (encoder-decoder), take a face image as input, output an albedo map, a depth map and the like of the three-dimensional face under a frontal view, and then construct a loss function for training through a differentiable renderer.
However, the autoencoder cannot guarantee the output precision, and its performance cannot easily be improved.
Disclosure of Invention
The application aims to provide a three-dimensional face reconstruction method, a three-dimensional face reconstruction device and electronic equipment, in which an affine transformation is applied to the pixel positions of the input face image so that the features extracted from the input face image correspond to the target feature categories of the output diffuse reflection map, thereby alleviating the technical problem of poor accuracy of the output map.
In a first aspect, an embodiment of the present application provides a three-dimensional face reconstruction method, where the method includes:
performing feature extraction on a current face image to obtain a face feature map, wherein the face feature map comprises a plurality of feature categories;
inputting the face feature map into a face prediction neural network, wherein the face prediction network comprises a diffuse reflection network in which a first encoder and a first decoder formed by a first affine convolution layer are connected by a skip connection, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer;
predicting a convolution kernel through the first auxiliary convolution layer and the face feature map, and outputting a first affine transformation matrix of each pixel point;
converting the convolution kernel to a target position corresponding to the current feature category based on the first affine transformation matrix so that the first main convolution layer extracts diffuse reflection features corresponding to each target position and outputs a diffuse reflection map;
and performing three-dimensional face reconstruction based on the diffuse reflection map.
In one possible implementation, the step of predicting the convolution kernel through the first auxiliary convolution layer and the face feature map and outputting the first affine transformation matrix of each pixel point includes:
predicting the convolution kernel position of each pixel point based on the texture coordinate corresponding to each target feature category in the target grid and the face feature map through the first auxiliary convolution layer, and outputting an affine transformation matrix of each pixel point so that each pixel point corresponds to the texture coordinate of the corresponding target feature category of the pixel point under the action of the affine transformation matrix.
In one possible implementation, the face prediction network further includes a position network in which a second encoder and a second decoder formed by a second affine convolution layer are connected by a skip connection, the second affine convolution layer including a second main convolution layer and a second auxiliary convolution layer; the method further comprises the following steps:
predicting a convolution kernel through the second auxiliary convolution layer and the face feature map, and outputting a second affine transformation matrix of each pixel point;
converting the pixel point corresponding to each feature type to the target position corresponding to the feature type based on the second affine transformation matrix, so that the second main convolution layer extracts the position feature corresponding to each target position, and outputting a position map;
the three-dimensional face reconstruction based on the diffuse reflection mapping comprises the following steps:
and performing three-dimensional face reconstruction based on the position map and the diffuse reflection mapping.
In one possible implementation, the face prediction network further comprises an illumination network, the illumination network comprising a third encoder formed by a third affine convolution layer and a bilinear upsampling layer, the third affine convolution layer comprising a third main convolution layer and a third auxiliary convolution layer; the method further comprises the following steps:
predicting a convolution kernel through the third auxiliary convolution layer and the face feature map, and outputting a third affine transformation matrix of each pixel point;
converting the pixel point corresponding to each feature type to the target position corresponding to the feature type based on the third affine transformation matrix, so that the third main convolution layer extracts the illumination feature corresponding to each target position and outputs an illumination map;
the three-dimensional face reconstruction based on the diffuse reflection mapping comprises the following steps:
and performing three-dimensional face reconstruction based on the illumination map, the position map and the diffuse reflection map.
In one possible implementation, the face prediction network further includes a renderer, and the method further includes:
inputting the illumination map, the position map and the diffuse reflection map into the renderer;
respectively reading position characteristic information, illumination characteristic information and color characteristic information from the position map, the light map and the diffuse reflection map according to texture coordinates of the vertex of the target grid;
and generating a projection rendering image of the three-dimensional face in a two-dimensional space according to the position characteristic information, the illumination characteristic information and the color characteristic information.
In one possible implementation, the renderer is a differentiable renderer.
In one possible implementation, the method further comprises:
training the face prediction neural network by back-propagating through the differentiable renderer until the loss function meets expectations, wherein the loss function comprises at least one of the following: a perceptual loss term, a reconstruction loss term, a symmetry loss term, and a skin color loss term.
In one possible implementation, the perceptual loss term includes a first perceptual loss term and a second perceptual loss term, and the method further includes:
extracting a first characteristic vector of the current face image and a second characteristic vector of the projection rendering image;
determining a first perception loss term of the face prediction neural network based on the first feature vector and the second feature vector;
and determining a second perceptual loss term according to the difference between the feature vector of the diffuse reflection map and that of the diffuse reflection map true value.
In one possible implementation, the reconstruction loss terms include a first reconstruction loss term and a second reconstruction loss term, the method including:
determining a first reconstruction loss term according to the pixel difference value between the current face image and the projection rendering image;
and determining a second reconstruction loss term according to the pixel difference between the diffuse reflection map and the diffuse reflection map true value.
In one possible implementation, the method further comprises:
and determining a symmetry loss term according to the diffuse reflection map and the horizontally flipped diffuse reflection map.
In one possible implementation, the method further comprises:
and performing Gaussian blur processing on the diffuse reflection map, and determining the skin color loss term based on the standard deviation of the color values of the pixels in the skin area.
In a second aspect, a three-dimensional face reconstruction apparatus is provided, the apparatus comprising:
the extraction module is used for extracting the features of the current face image to obtain a face feature map, wherein the face feature map comprises a plurality of feature categories;
the input module is used for inputting the face feature map into a face prediction neural network, wherein the face prediction neural network comprises a diffuse reflection network in which a first encoder and a first decoder formed by a first affine convolution layer are connected by a skip connection, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer;
the matrix output module is used for outputting a first affine transformation matrix of each pixel point through the first auxiliary convolution layer and the face feature map prediction convolution kernel;
an affine transformation module, configured to convert the convolution kernel to a target position corresponding to a current feature class based on the first affine transformation matrix, so that the first main convolution layer extracts a diffuse reflection feature corresponding to each target position, and outputs a diffuse reflection map;
and the reconstruction module is used for reconstructing a three-dimensional face based on the diffuse reflection map.
In a third aspect, an embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor implements the method of the first aspect when executing the computer program.
In a fourth aspect, this embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, which, when invoked and executed by a processor, cause the processor to perform the method of the first aspect.
The embodiment of the application brings the following beneficial effects:
according to the three-dimensional face reconstruction method, the three-dimensional face reconstruction device and the electronic equipment, the feature class features corresponding to the output diffuse reflection maps correspond to the feature classes of the input face images.
In this scheme, because the spatial layout of the input image and that of the actual map may differ, the eyebrow position in the input image may be aligned with the eye position of the diffuse reflection map; that is, the eyebrow feature of the input image would be extracted and output as the eye feature of the map, so that the accuracy of the diffuse reflection map output by the encoder is poor. The auxiliary convolution layer outputs an affine transformation matrix according to the target position of each feature category in the input image to transform the position of the convolution kernel, so that the feature corresponding to the target category can be extracted with the position-transformed convolution kernel, which ensures the accuracy of the output result of the encoder.
In order to make the aforementioned objects, features and advantages of the present application comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the detailed description of the present application or the technical solutions in the prior art, the drawings needed to be used in the detailed description of the present application or the prior art description will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flow chart of a three-dimensional face reconstruction method according to an embodiment of the present application;
FIG. 2 illustrates an affine convolutional layer application diagram provided by an embodiment of the present application;
fig. 3 is a schematic diagram illustrating an application of a face prediction neural network according to an embodiment of the present application;
fig. 4 is a schematic diagram of a training method for a face prediction neural network according to an embodiment of the present application;
fig. 5 is another schematic flow chart of a three-dimensional face reconstruction method according to an embodiment of the present application;
fig. 6 is an image result corresponding to the three-dimensional face reconstruction method provided in the embodiment of the present application;
fig. 7 is a schematic structural diagram of a three-dimensional face reconstruction device according to an embodiment of the present application;
fig. 8 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "comprising" and "having," and any variations thereof, as referred to in the embodiments of the present application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For the sake of understanding, the following terms of art are to be interpreted accordingly:
Deep learning is a machine learning algorithm composed of large numbers of neurons; because it can solve complex nonlinear problems well, it is now widely applied in many fields such as computer vision, speech recognition and natural language processing.
Affine transformation (also called Affine mapping) is a process in which, in geometry, one vector space is linearly transformed once and then translated into another vector space.
Convolutional Neural Networks (CNNs) are networks that use a mathematical operation called convolution. Convolution is a special linear operation, and convolutional networks are special neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
Mesh, generally referring to a triangular mesh in the embodiments of the present invention, is a data structure for representing three-dimensional models. It is composed of vertices in three-dimensional space and triangular patches connecting groups of three vertices. In addition to its position coordinates, each vertex may contain information such as color and normal.
Diffuse reflection maps (Diffuse maps) reflect the color and intensity of an object surface under Diffuse reflection, represent the inherent color and texture of an object, and are the most fundamental maps of objects. Which can also be understood directly as texture in general.
The Position Map (Position Map) is a 2D image that records the 3D coordinates of a complete point cloud while preserving the semantics of each UV Position. The process of creating a texture map, a normal map, a bump map, or the like on a two-dimensional UV space is called UV unfolding. U and V refer to horizontal and vertical axes of a 2D space because X, Y and Z have been used in a 3D space.
A three-dimensional face morphable model (3D Morphable Model, 3DMM) is composed of a mesh, and each dimension of its coefficients controls a local variation of the human face.
Three-dimensional face reconstruction is a hot problem in the field of computer vision, and takes one or more face images as input and outputs three-dimensional representation of a face. The three-dimensional face has various representation methods, and more common methods include Mesh (Mesh), Voxel (Voxel), Point Cloud (Point Cloud), Depth Map (Depth Map), and the like.
Most early three-dimensional face reconstruction methods focus only on geometric information, namely the shape of the face, and neglect the texture information of the face. This is mainly because three-dimensional face data are difficult to acquire at large scale, while training neural networks in a supervised fashion requires a large amount of data. With the introduction of differentiable rendering, the loss function can be computed and back-propagated between the input face image and the rendered three-dimensional face, so that a neural network can be trained to predict a three-dimensional face from a two-dimensional face image in a self-supervised or weakly supervised fashion.
Since Blanz and Vetter proposed the 3D Morphable Face Model (3DMM) in 1999, much work in the field of three-dimensional face reconstruction has been based on 3DMM. The classic 3DMM-based three-dimensional face reconstruction method iteratively optimizes a template mesh to fit it to the input two-dimensional face image. However, such methods are very sensitive to the lighting, expression and pose of the face image. Although some later work improves these iterative optimization methods, they do not perform well on face images acquired outside laboratory environments.
Furthermore, because the input image and the output map are not spatially aligned, a convolutional neural network cannot handle this well. Therefore, when an autoencoder network is used, the input image needs to be encoded into a latent code vector, which results in information loss. Moreover, performance cannot be improved by adding skip connections to the autoencoder.
Based on this, the embodiments of the application provide a three-dimensional face reconstruction method and device and an electronic device, which can alleviate the technical problem that an autoencoder cannot guarantee output accuracy during three-dimensional face reconstruction.
Embodiments of the present application are further described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a three-dimensional face reconstruction method according to an embodiment of the present application. The method can be applied to a server: three-dimensional face information is obtained from an input two-dimensional face image, so that three-dimensional face reconstruction in scenes such as games and social applications can be realized and the three-dimensional effect of the face can be displayed. As shown in fig. 1, the method includes:
and S102, extracting the features of the current face image to obtain a face feature image.
The human face feature map comprises a plurality of feature categories, and each feature category corresponds to a pixel point and a pixel position.
It should be noted that the features can be extracted through a feature extraction neural network; that is, features of the face image are extracted to obtain each feature category corresponding to the face image, together with the pixel points and positions corresponding to each feature category.
For example, the facial image includes an eyebrow feature category, an eye feature category, a nose feature category, a mouth feature category, and the like. Each feature category may include a plurality of pixels, and each pixel corresponds to a position coordinate.
And step S104, inputting the face feature map into a face prediction neural network.
The face prediction network comprises a diffuse reflection network in which a first encoder and a first decoder formed by a first affine convolution layer are connected by a skip connection, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer.
It should be noted that the diffuse reflection network is a network structure similar to U-Net, consisting mainly of an encoder (down-sampling) and a decoder (up-sampling) connected by skip connections. Unlike the usual U-Net structure, all convolution layers in the encoder are replaced by affine convolution layers. The input to the diffuse reflection network is the face feature map, and it outputs a diffuse reflection map d, as shown in fig. 3.
In the embodiment of the invention, the affine convolution layers are used to align each feature category of the input image with that of the diffuse reflection map, so that skip connections can be added to the network at the stage from the input image to the diffuse reflection map, and the affine convolution can automatically learn to handle the spatial misalignment between the 2D input image and the UV space of the diffuse reflection map.
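For illustration only, the following Python (PyTorch) sketch shows one possible way to organize such a U-Net-like diffuse reflection network; the class name, channel widths and depth are assumptions, and AffineConv2d denotes the hypothetical affine convolution layer sketched after the affine transformation formula below. A practical network would use a deeper encoder/decoder stack with skip connections at every scale.

import torch
import torch.nn as nn

class DiffuseNet(nn.Module):
    """Sketch of the U-Net-like diffuse reflection network: an encoder built from
    affine convolution layers, a convolutional decoder, and a skip connection
    between encoder and decoder stages."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.enc1 = AffineConv2d(in_ch, base)              # full resolution
        self.pool = nn.MaxPool2d(2)
        self.enc2 = AffineConv2d(base, base * 2)           # half resolution
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec1 = nn.Conv2d(base * 2, base, 3, padding=1)
        self.out = nn.Conv2d(base * 2, 3, 3, padding=1)    # diffuse reflection map d

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))
        e2 = torch.relu(self.enc2(self.pool(e1)))
        d1 = torch.relu(self.dec1(self.up(e2)))
        # skip connection: concatenate encoder features with decoder features
        return torch.sigmoid(self.out(torch.cat([d1, e1], dim=1)))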
And step S106, predicting a convolution kernel through the first auxiliary convolution layer and the face feature map, and outputting a first affine transformation matrix of each pixel point.
An affine convolution layer comprises two ordinary convolution layers; the auxiliary convolution layer is itself a convolutional network and is named the auxiliary convolution layer to distinguish it from the main convolution layer. In other words, an affine convolution layer involves two convolution operations. The output of the auxiliary convolution is an affine transformation matrix, based on which new coordinates are calculated.
It should be noted that the first affine transformation matrix corresponding to each pixel point in the current face feature map is output by predicting the convolution kernel through the first auxiliary convolution layer and the face feature map. Optionally, each pixel point may correspond to one convolution kernel.
And step S108, converting the convolution kernels to target positions corresponding to the current feature types based on the first affine transformation matrix, so that the first main convolution layer extracts the diffuse reflection features corresponding to each target position, and outputting a diffuse reflection map.
As shown in fig. 2, based on the affine transformation matrix, the 3×3 square convolution kernel on the left is transformed into a rhombus; feature extraction is then performed with the rhombus-shaped convolution kernel through the main convolution layer to obtain the small square on the right, which is output. It should be noted that, at this point, the face feature category extracted by the rhombus convolution kernel corresponds to the feature category of the target output.
For example, if the face feature category covered by the original square convolution kernel is the eyebrow while the target output feature category is the eye, the convolution kernel is moved to the pixel coordinates of the eye feature category through the affine transformation matrix, so that the output face feature category is the eye and is consistent with the target output feature category.
And step S110, reconstructing the three-dimensional face based on the diffuse reflection mapping.
Because the extracted feature category is kept consistent with the output feature category through this transformation, the output diffuse reflection map is accurate, and a more accurate three-dimensional face reconstruction operation is realized.
In the embodiment of the application, the problem that the feature categories of the input image and of the output diffuse reflection map are not spatially aligned can be solved in an autoencoder or similar network structure, and at the same time skip connections can be added to the autoencoder, thereby improving the performance of the autoencoder network.
Because the spatial layout of the input image and that of the actual map may differ, the eyebrow position in the input image may be aligned with the eye position of the diffuse reflection map; that is, the eyebrow feature of the input image would be extracted and output as the eye feature of the map, so that the accuracy of the diffuse reflection map output by the encoder is poor. The auxiliary convolution layer outputs an affine transformation matrix according to the target position of each feature category in the input image to transform the position of the convolution kernel, so that the feature corresponding to the target category can be extracted with the position-transformed convolution kernel, which ensures the accuracy of the output result of the encoder.
The above steps are described in detail below.
In some embodiments, the auxiliary convolution layer may generate an affine transformation matrix to spatially align feature classes of the input image and the output diffuse reflectance image. As an example, the step S106 may include the following steps:
step 1.1), predicting the convolution kernel position of each pixel point based on the texture coordinate corresponding to each target feature category in the target grid and the face feature map through the first auxiliary convolution layer, and outputting an affine transformation matrix of each pixel point so that each pixel point corresponds to the texture coordinate of the corresponding target feature category of the pixel point under the action of the affine transformation matrix.
The affine convolution network enables the convolution kernel to extract features from any region of the image through an affine transformation. The network structure of the affine convolution is shown in fig. 2. The auxiliary convolution layer outputs an affine transformation matrix based on the texture UV coordinates corresponding to each feature category in the target grid and the position coordinates of the current feature category, so that the convolution kernel calculates new coordinates based on the affine transformation matrix, and the feature category at the new coordinates of the convolution kernel is consistent with the corresponding output target feature category.
Assuming that the width, height and channel number of three dimensions of the face feature map are W, H and C respectively, the dimension of the affine transformation matrix generated by the auxiliary convolution layer is W × H × 6. For each pixel point on the feature map, its affine transformation matrix can be represented by 6 values.
For example, for a convolution kernel coordinate (x, y), given the 6 values (a, b, c, d, e, f) of the affine transformation matrix, the new convolution kernel coordinate (x′, y′) can be calculated by the following formulas:

x′ = a·x + b·y + c
y′ = d·x + e·y + f
The new coordinates of the convolution kernel at each position are calculated through the affine matrix, the corresponding convolution kernels of the main convolution layer are then transformed to the new coordinates, the features corresponding to the convolution kernels at the new coordinates are extracted, and the features of each feature category are output at the positions corresponding to the target feature categories.
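As a rough illustration of this mechanism, the sketch below predicts the six affine values per pixel with an auxiliary convolution and resamples the input at the transformed coordinates before applying the main convolution. This warps the feature map with the per-pixel affine transform rather than offsetting each kernel tap individually, so it is a simplification of the affine convolution described here, and the module and parameter names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineConv2d(nn.Module):
    """Sketch of an affine convolution layer: the auxiliary convolution predicts the
    six affine values (a, b, c, d, e, f) for every pixel; the input is resampled at
    the transformed coordinates and then passed through the main convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.aux = nn.Conv2d(in_ch, 6, kernel_size, padding=pad)    # W x H x 6 output
        self.main = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)

    def forward(self, x):
        _, _, h, w = x.shape
        a, b, c, d, e, f = self.aux(x).unbind(dim=1)                # each (N, H, W)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                                torch.linspace(-1, 1, w, device=x.device),
                                indexing="ij")
        # x' = a*x + b*y + c,  y' = d*x + e*y + f   (in normalized coordinates)
        grid = torch.stack((a * xs + b * ys + c, d * xs + e * ys + f), dim=-1)
        warped = F.grid_sample(x, grid, align_corners=True)         # (N, C, H, W)
        return self.main(warped)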
In some embodiments, the face prediction network further comprises a position network in which a second encoder and a second decoder formed by a second affine convolution layer are connected by a skip connection, the second affine convolution layer comprising a second main convolution layer and a second auxiliary convolution layer; taking the position network and the diffuse reflection network into account together enables more accurate three-dimensional face reconstruction. As an example, the method may further comprise the following steps:
and 2.1) predicting a convolution kernel through the second auxiliary convolution layer and the face feature map, and outputting a second affine transformation matrix of each pixel point.
And 2.2) converting the convolution kernels to target positions corresponding to the current feature classes based on the second affine transformation matrix, so that the second main convolution layer extracts position features corresponding to each target position and outputs a position map.
Wherein, step S110 further includes: and 2.3) reconstructing a three-dimensional face based on the position map and the diffuse reflection map.
The structure of the position network is basically the same as that of the diffuse reflection network described above; its input is also the face image and its output is a position map, and the position of the convolution kernel is transformed through the affine transformation matrix output by the auxiliary convolution layer so as to output features of the same category as the target features.
Because real-world illumination is complex, it is difficult to fully simulate the illumination on a face image simply with parallel light, and because of the various occluders that may be present in a face image (such as bangs, glasses, etc.), the illumination is difficult to represent with a small number of parameters. The embodiment of the invention therefore proposes to simulate the illumination information with an illumination map, so that the illumination is decoupled from the face image.
In some embodiments, by improving the representation of the illumination, the illumination network can more accurately simulate real-world illumination information. The embodiment of the invention uses an illumination map to represent the illumination information, which can simulate real-world illumination more freely than parallel-light or spherical-harmonic illumination parameters. As an example, the face prediction network further comprises an illumination network, the illumination network comprising a third encoder formed by a third affine convolution layer and a bilinear upsampling layer, the third affine convolution layer comprising a third main convolution layer and a third auxiliary convolution layer; the method further comprises the following steps:
and 3.1) predicting a convolution kernel through the third auxiliary convolution layer and the face feature map, and outputting a third affine transformation matrix of each pixel point.
And 3.2) converting the convolution kernels to target positions corresponding to the current feature types based on a third affine transformation matrix, so that the third main convolution layer extracts illumination features corresponding to each target position and outputs an illumination map.
Wherein, step S110 further includes: and 3.3) carrying out three-dimensional face reconstruction based on the illumination map, the position map and the diffuse reflection map.
The structure of the illumination network differs from the structures of the position network and the diffuse reflection network introduced above: the illumination network replaces the decoder with a bilinear upsampling layer and also removes the skip connections, so that the illumination network focuses on low-frequency information in the face image rather than high-frequency information such as pores. The input of the network is the face image and the output is an illumination map.
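Under the same assumptions as the earlier sketches, a correspondingly simplified illumination network might look as follows: an affine-convolution encoder, bilinear upsampling in place of a learned decoder, and no skip connections.

import torch
import torch.nn as nn

class LightNet(nn.Module):
    """Sketch of the illumination network: an affine-convolution encoder followed by
    bilinear upsampling instead of a learned decoder, with no skip connections, so
    the predicted illumination map carries mainly low-frequency information."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.encoder = nn.Sequential(
            AffineConv2d(in_ch, base), nn.ReLU(), nn.MaxPool2d(2),
            AffineConv2d(base, base * 2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(base * 2, 3, 1))
        self.up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)

    def forward(self, x):
        return torch.sigmoid(self.up(self.encoder(x)))     # illumination map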
In the embodiment of the present invention, the three-dimensional face reconstruction described in step S110 may be understood as generating a three-dimensional representation of the face from the face image, where the three-dimensional representation includes a three-dimensional face mesh (the target mesh) and a corresponding diffuse reflection map. The embodiment of the invention also predicts an illumination map from the face image at the same time, which is used to decouple the illumination information from the face image so as to generate a better diffuse reflection map.
In some embodiments, a renderer may be introduced to enable a more accurate three-dimensional face reconstruction process; for example, the face prediction network also comprises a renderer, which is preferably differentiable so that gradients can be back-propagated to train the neural network. The method further comprises the following steps:
and 4.1) inputting the illumination map, the position map and the diffuse reflection map into a renderer.
And 4.2) respectively reading position feature information, illumination feature information and color feature information from the position map, the illumination map and the diffuse reflection map according to the texture coordinates of the vertices of the target grid, and generating a projection rendering of the three-dimensional face in two-dimensional space.
The differentiable renderer is used to generate a projection of the three-dimensional face onto two-dimensional space from a given three-dimensional game face (target mesh), which comprises the coordinates of the mesh vertices, the definition of the triangular patches, and information such as the corresponding diffuse reflection map, position map and illumination map.
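A minimal sketch of this sampling step is given below: it reads per-vertex position, illumination and color values from the three maps at the mesh vertices' UV coordinates and applies a simple multiplicative shading. The actual rasterization onto the image plane (e.g. with an off-the-shelf differentiable rasterizer) is omitted, and the tensor layout is an assumption.

import torch.nn.functional as F

def sample_maps_at_uv(position_map, light_map, diffuse_map, vertex_uv):
    """vertex_uv: (N, V, 2) texture coordinates in [0, 1]; each map is (N, C, H, W)."""
    grid = (vertex_uv * 2.0 - 1.0).unsqueeze(2)            # (N, V, 1, 2) in [-1, 1]

    def sample(m):                                         # -> (N, V, C)
        return F.grid_sample(m, grid, align_corners=True).squeeze(-1).permute(0, 2, 1)

    positions = sample(position_map)                       # per-vertex 3D coordinates
    shaded = sample(diffuse_map) * sample(light_map)       # simple multiplicative shading
    return positions, shaded                               # handed to the rasterizer afterwards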
As another example, in order to obtain a more accurate three-dimensional face reconstruction result, the face prediction neural network may be further trained, for example, the method further includes:
and 5.1) reversely propagating and training a face prediction neural network through a micro-renderer until a loss function reaches an expectation, wherein the loss function comprises a perception loss item, a reconstruction loss item, a symmetry loss item and a skin color loss item.
In some embodiments, the perceptual loss term includes a first perceptual loss term and a second perceptual loss term, and the training method further includes:
step 6.1), extracting a first characteristic vector of the current face image and a second characteristic vector of the projection rendering image;
step 6.2), determining a first perception loss item of the face prediction neural network based on the first feature vector and the second feature vector;
among them, the purpose of the perceptual loss is to minimize the difference in feature vectors between the rendered image (projected rendering) and the input image (current face image). As an alternative embodiment, the embodiment of the present invention may use a neural network such as VGG19 pre-trained on a public data set such as ImageNet as a feature extractor to extract feature vectors of an input image and a rendered image respectively, and then calculate the difference between the two feature vectors as a perception loss term Lperc(). The formula is expressed as:
Lperc(x, x′) = ||F(x) − F(x′)||2
where x and x′ represent the input image and the rendered image respectively, F(·) represents the feature extractor, and F(x) represents the extracted feature vector.
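A sketch of this term under the stated choice of VGG19 features might look as follows; the layer cut-off and the squared form of the difference are assumptions.

import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """Sketch of Lperc: squared difference between VGG19 feature maps of the input
    image x and the rendered image x'."""
    def __init__(self):
        super().__init__()
        self.extractor = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)

    def forward(self, x, x_prime):
        return torch.mean((self.extractor(x) - self.extractor(x_prime)) ** 2)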
And 6.3) determining a second perception loss item according to the characteristic vector comparison condition of the diffuse reflection mapping and the true value of the diffuse reflection mapping.
Here, the ground-truth data, such as the diffuse reflection map truth values and the position map truth values, are derived from the public data set RGB 3D Face.
In some embodiments, in order to obtain a more accurate three-dimensional face reconstruction result, the face prediction neural network may be further trained, and the reconstruction loss term includes a first reconstruction loss term and a second reconstruction loss term, and the method may include:
step 7.1), determining a first reconstruction loss term according to a pixel difference value between the current face image and the projection rendering image;
wherein the reconstruction loss term Lrec is calculated as the difference between corresponding pixel values of the current input image and the rendered image:
Lrec(x,x′)=||x-x′||2
where x and x' represent the input image and the rendered image, respectively.
And 7.2) determining a second reconstruction loss item according to the pixel difference comparison condition of the diffuse reflection map and the true value of the diffuse reflection map.
In some embodiments, in order to obtain a more accurate three-dimensional face reconstruction result, the face prediction neural network may be further trained, and the method further includes:
and 8.1) determining a symmetry loss item according to the diffuse reflection map and the horizontally flipped diffuse reflection map.
Because the human face is bilaterally symmetric, a symmetry loss term Lsym is designed on the diffuse reflection map. The calculation formula is as follows:

Lsym(x) = ||x − x̄||2

where x is the diffuse reflection map and x̄ is the image obtained by horizontally flipping x.
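A minimal sketch of this term, assuming the squared form written above:

import torch

def symmetry_loss(diffuse_map):
    """Sketch of Lsym: distance between the diffuse map and its horizontal flip."""
    flipped = torch.flip(diffuse_map, dims=[-1])           # flip along the width axis
    return torch.mean((diffuse_map - flipped) ** 2)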
In some embodiments, in order to obtain a more accurate three-dimensional face reconstruction result, the face prediction neural network may be further trained, and the method further includes:
step 9.1), performing Gaussian blur processing on the diffuse reflection map, and determining a skin color loss item based on the standard deviation of the color value of each pixel of the skin area.
The skin color loss is intended to promote uniformity of the overall skin color of the generated texture map. In order to keep the overall skin color consistent without affecting facial details (such as wrinkles, moles, etc.), the embodiment of the invention first performs Gaussian blur processing on the generated diffuse reflection map, then calculates the standard deviation of the color values of the pixels in the skin area, and determines the skin color loss term Lstd based on this standard deviation. When blurring the map, a suitable Gaussian-kernel blur radius and normal-distribution standard deviation are selected according to the resolution of the diffuse reflection map, so that the Gaussian-blurred map filters out high-frequency features (such as wrinkles) while retaining low-frequency features (such as the skin color of a local area).
Lstd(x) = sqrt( Σi∈Mskin (xi − x̄)² / |Mskin| )

where x represents the Gaussian-blurred image, x̄ represents the average color value over the skin region, and Mskin represents the skin region. The diffuse reflection map generated by using this Gaussian-blur-based global skin color loss keeps the global skin color consistent while preserving the personalized features of the human face.
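A minimal sketch of this term; the Gaussian kernel size, sigma and the way the skin-region mask is supplied are assumptions.

import torch
import torchvision.transforms.functional as TF

def skin_tone_loss(diffuse_map, skin_mask, kernel_size=21, sigma=5.0):
    """Sketch of Lstd: Gaussian-blur the diffuse map, then take the standard deviation
    of the color values inside the skin region (skin_mask: (N, 1, H, W), 1 = skin)."""
    blurred = TF.gaussian_blur(diffuse_map, kernel_size=kernel_size, sigma=sigma)
    n, c, _, _ = blurred.shape
    pixels = blurred.reshape(n, c, -1)                     # (N, C, H*W)
    mask = skin_mask.reshape(n, 1, -1)
    count = mask.sum(dim=-1).clamp(min=1.0)                # number of skin pixels, (N, 1)
    mean = (pixels * mask).sum(dim=-1) / count             # mean skin color, (N, C)
    var = (((pixels - mean.unsqueeze(-1)) ** 2) * mask).sum(dim=-1) / count
    return torch.sqrt(var + 1e-8).mean()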
In summary, in the training phase of the neural network, the total loss function L is as follows:
L = Lperc(d, dt) + Lrec(d, dt) + Lsym(d) + Lstd(d) + Lrec(p, pt) + Lperc(i, r) + Lrec(i, r)

where d represents the network-predicted diffuse reflection map, dt represents the diffuse reflection map true value, p represents the network-predicted position map, pt represents the position map true value, i represents the input face image, and r represents the rendered image. The face prediction neural network is trained through back-propagation according to this loss function until the loss function meets expectations.
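Combining the terms, the total loss could be sketched as follows, where perc is a PerceptualLoss instance from the earlier sketch, skin_mask marks the skin region, and the equal weighting of the terms is an assumption:

import torch

def l2(a, b):                                              # pixel-wise reconstruction loss
    return torch.mean((a - b) ** 2)

def total_loss(d, d_t, p, p_t, i, r, skin_mask, perc):
    """Sketch of L = Lperc(d,dt) + Lrec(d,dt) + Lsym(d) + Lstd(d)
                   + Lrec(p,pt) + Lperc(i,r) + Lrec(i,r)."""
    return (perc(d, d_t) + l2(d, d_t)
            + symmetry_loss(d) + skin_tone_loss(d, skin_mask)
            + l2(p, p_t)
            + perc(i, r) + l2(i, r))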
As shown in fig. 4, the flow of the training method for the face prediction neural network includes the following steps (a minimal training-loop sketch is given after the list):
step a), initializing neural network parameters;
step b), loading data of a training data set, wherein the data can comprise a face image, a corresponding diffuse reflection mapping truth value and a position mapping truth value;
step c), using a diffuse reflection network to generate a corresponding diffuse reflection map according to the input face image;
step d), using an illumination network to generate a corresponding illumination map according to the input face image;
step e), generating a corresponding position map according to the input face image by using a position network;
step f), inputting the diffuse reflection map, the illumination map and the position map into the differentiable renderer to generate a rendered image;
step g), calculating various loss functions according to the data and a loss function calculation formula, and training a face prediction neural network by using a gradient back propagation method;
step h), judging whether the loss function of the network has converged; if not, repeating steps b) to g); if so, the training is complete and the process ends.
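Under the assumptions of the earlier sketches (a model object exposing the three sub-networks, a renderer callable, and a data loader yielding the face image, the two ground-truth maps and a skin mask), the training loop of steps a) to h) could be sketched as:

import torch

def train(model, renderer, loader, perc, epochs=20, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)              # after step a), initialization
    for _ in range(epochs):                                        # repeat until convergence, step h)
        for face, d_t, p_t, skin_mask in loader:                   # step b), load training data
            d = model.diffuse_net(face)                            # step c), diffuse reflection map
            light = model.light_net(face)                          # step d), illumination map
            p = model.position_net(face)                           # step e), position map
            r = renderer(d, light, p)                              # step f), rendered image
            loss = total_loss(d, d_t, p, p_t, face, r, skin_mask, perc)   # step g)
            opt.zero_grad()
            loss.backward()                                        # gradient back-propagation
            opt.step()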
As shown in fig. 5, an embodiment of the present invention further includes a three-dimensional face reconstruction method, and an operation flow of the method may include:
loading pre-trained neural network parameters; loading any one face image; generating a diffuse reflection map according to the input face image by using a diffuse reflection network; generating a position map according to the input face image by using a position network; storing the generated diffuse reflection map and the position map as a three-dimensional file; the flow ends.
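A corresponding sketch of this inference flow, with the same hypothetical model object and an assumed output format:

import torch

def reconstruct(face_image, model, out_path="face_3d.pt"):
    """Predict the diffuse reflection map and position map for one face image and
    save them for use as a three-dimensional asset."""
    model.eval()
    with torch.no_grad():
        diffuse = model.diffuse_net(face_image)
        position = model.position_net(face_image)
    torch.save({"diffuse_map": diffuse.cpu(), "position_map": position.cpu()}, out_path)
    return diffuse, position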
In the embodiment of the invention, in an autoencoder or similar network structure, the features corresponding to the target category can be extracted with the position-transformed convolution kernel, which significantly improves the results when generating three-dimensional face maps from a two-dimensional face image. In addition, because the illumination-parameter representation is replaced by the illumination map, the face prediction neural network in the embodiment of the invention can better predict real-world illumination. Fig. 6 shows some example results: the first column on the left is the input face image, and the second to fifth columns are, in order, the diffuse reflection map, the position map, the illumination map and the rendered image generated by the face prediction neural network.
Fig. 7 provides a schematic structural diagram of a three-dimensional face reconstruction device. The device can be applied to a server. As shown in fig. 7, the three-dimensional face reconstruction apparatus 700 includes:
the extraction module 701 is configured to perform feature extraction on a current face image to obtain a face feature map, where the face feature map includes a plurality of feature categories, and each feature category corresponds to a pixel point and a pixel position;
an input module 702, configured to input the face feature map into a face prediction neural network, where the face prediction neural network includes a diffuse reflection network in which a first encoder and a first decoder formed by a first affine convolution layer are connected by a skip connection, and the first affine convolution layer includes a first main convolution layer and a first auxiliary convolution layer;
a matrix output module 703, configured to output a first affine transformation matrix of each pixel point through the first auxiliary convolution layer and the face feature map prediction convolution kernel;
an affine transformation module 704, configured to convert the convolution kernel to a target position corresponding to a current feature class based on the first affine transformation matrix, so that the first main convolution layer extracts a diffuse reflection feature corresponding to each target position, and outputs a diffuse reflection map;
a reconstruction module 705, configured to perform three-dimensional face reconstruction based on the diffuse reflection map.
In some embodiments, the matrix output module 703 is further specifically configured to predict, through the first auxiliary convolution layer, a convolution kernel position of each pixel point based on a texture coordinate corresponding to each target feature category in the target grid and the face feature map, and output an affine transformation matrix of each pixel point, so that each pixel point corresponds to the texture coordinate of the corresponding target feature category of the pixel point under the action of the affine transformation matrix.
In some embodiments, the face prediction network further comprises a location network in which a second encoder and a second decoder formed by a second affine convolution layer are jump-connected, the second affine convolution layer comprising a second main convolution layer and a second auxiliary convolution layer; the matrix output module 703 is further specifically configured to output a second affine transformation matrix of each pixel point through the second auxiliary convolution layer and the face feature map prediction convolution kernel; converting the pixel point corresponding to each feature type to the target position corresponding to the feature type based on the second affine transformation matrix, so that the second main convolution layer extracts the position feature corresponding to each target position, and outputting a position map;
in some embodiments, the reconstruction module 705 is further specifically configured to perform three-dimensional face reconstruction based on the location map and the diffuse reflection map.
In some embodiments, the face prediction network further comprises an illumination network comprising a third encoder formed by a third affine convolution layer and a bilinear upsampling layer, the third affine convolution layer comprising a third main convolution layer and a third auxiliary convolution layer; the matrix output module 703 is further specifically configured to output a third affine transformation matrix of each pixel point through the third auxiliary convolution layer and the face feature map prediction convolution kernel; and convert the pixel points corresponding to each feature type to the target positions corresponding to the feature types based on the third affine transformation matrix, so that the third main convolution layer extracts the illumination features corresponding to each target position and outputs an illumination map.
In some embodiments, the reconstruction module 705 is further specifically configured to perform three-dimensional face reconstruction based on the illumination map, the location map, and the diffuse reflection map.
In some embodiments, the face prediction network further comprises a renderer, and the apparatus further comprises a rendering module for inputting the illumination map, the location map, and the diffuse reflection map into the renderer; respectively reading position characteristic information, illumination characteristic information and color characteristic information from the position map, the light map and the diffuse reflection map according to texture coordinates of the vertex of the target grid; and generating a projection rendering image of the three-dimensional face in a two-dimensional space according to the position characteristic information, the illumination characteristic information and the color characteristic information.
In some embodiments, the renderer is a differentiable renderer.
In some embodiments, the apparatus further comprises a training module to train the face prediction neural network by back-propagation through the differentiable renderer until the loss function meets expectations, wherein the loss function comprises at least one of: perceptual loss terms, reconstruction loss terms, symmetry loss terms, and skin color loss terms.
In some embodiments, the perceptual loss term includes a first perceptual loss term and a second perceptual loss term, and the training module is further specifically configured to extract a first feature vector of the current face image and a second feature vector of the projection rendering; determining a first perception loss term of the face prediction neural network based on the first feature vector and the second feature vector; and determining a second perception loss item according to the characteristic vector comparison condition of the diffuse reflection mapping and the true value of the diffuse reflection mapping.
In some embodiments, the reconstruction loss term includes a first reconstruction loss term and a second reconstruction loss term, and the training module is further specifically configured to determine the first reconstruction loss term from a pixel difference between the current face image and the projection rendering; and determining a second reconstruction loss item according to the pixel difference comparison condition of the diffuse reflection mapping and the diffuse reflection mapping truth value.
In some embodiments, the training module is further specifically configured to determine a symmetry-loss term from the diffuse reflection map and the horizontally flipped diffuse reflection map.
In some embodiments, the training module is further specifically configured to perform gaussian blurring on the diffuse reflection map, and determine the skin color loss term based on a standard deviation of color values of each pixel of the skin region.
The three-dimensional face reconstruction device provided by the embodiment of the application has the same technical characteristics as the three-dimensional face reconstruction method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
As shown in fig. 8, an electronic device 800 includes a memory 801 and a processor 802, where the memory stores a computer program that can run on the processor, and the processor executes the computer program to implement the steps of the method provided in the foregoing embodiment.
Referring to fig. 8, the electronic device further includes: a bus 803 and a communication interface 804, the processor 802, the communication interface 804, and the memory 801 being connected by the bus 803; the processor 802 is used to execute executable modules, such as computer programs, stored in the memory 801.
The memory 801 may include a high-speed Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk storage. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 804 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, and the like may be used.
The bus 803 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 8, but this does not indicate only one bus or one type of bus.
The memory 801 is used for storing a program, and the processor 802 executes the program after receiving an execution instruction. The method performed by the apparatus defined by the processes disclosed in any of the foregoing embodiments of the present application may be applied to, or implemented by, the processor 802.
The processor 802 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by hardware integrated logic circuits or software-form instructions in the processor 802. The processor 802 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or registers. The storage medium is located in the memory 801, and the processor 802 reads the information in the memory 801 and completes the steps of the method in combination with its hardware.
Corresponding to the three-dimensional face reconstruction method, an embodiment of the present application further provides a computer-readable storage medium, where computer-executable instructions are stored, and when the computer-executable instructions are called and executed by a processor, the computer-executable instructions cause the processor to execute the steps of the three-dimensional face reconstruction method.
The three-dimensional face reconstruction device provided by the embodiment of the application can be specific hardware on equipment or software or firmware installed on the equipment. The device provided by the embodiment of the present application has the same implementation principle and technical effect as the foregoing method embodiments, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiments where no part of the device embodiments is mentioned. It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the system, the apparatus and the unit described above may all refer to the corresponding processes in the method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
For another example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the three-dimensional face reconstruction method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate the technical solutions of the present application rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features within the technical scope disclosed in the present application; such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present application, and are intended to be covered by the protection scope of the present application.

Claims (14)

1. A method for reconstructing a three-dimensional face, the method comprising:
performing feature extraction on a current face image to obtain a face feature map, wherein the face feature map comprises a plurality of feature categories;
inputting the face feature map into a face prediction neural network, wherein the face prediction neural network comprises a diffuse reflection network, a first decoder in the diffuse reflection network is connected by a skip connection to a first encoder formed by a first affine convolution layer, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer;
predicting a convolution kernel through the first auxiliary convolution layer and the face feature map, and outputting a first affine transformation matrix of each pixel point;
converting the convolution kernel to a target position corresponding to the current feature category based on the first affine transformation matrix so that the first main convolution layer extracts diffuse reflection features corresponding to each target position and outputs a diffuse reflection map;
and performing three-dimensional face reconstruction based on the diffuse reflection map.
2. The method according to claim 1, wherein predicting the convolution kernel through the first auxiliary convolution layer and the face feature map and outputting the first affine transformation matrix of each pixel point comprises:
predicting, through the first auxiliary convolution layer, the convolution kernel position of each pixel point based on the face feature map and the texture coordinate corresponding to each target feature category in the target mesh, and outputting the affine transformation matrix of each pixel point, so that under the action of the affine transformation matrix each pixel point corresponds to the texture coordinate of its corresponding target feature category.
3. The method of claim 1, wherein the face prediction neural network further comprises a position network in which a second decoder is connected by a skip connection to a second encoder formed by a second affine convolution layer, the second affine convolution layer comprising a second main convolution layer and a second auxiliary convolution layer; the method further comprises:
predicting a convolution kernel through the second auxiliary convolution layer and the face feature map, and outputting a second affine transformation matrix of each pixel point;
converting the pixel point corresponding to each feature category to the target position corresponding to that feature category based on the second affine transformation matrix, so that the second main convolution layer extracts the position feature corresponding to each target position, and outputting a position map;
wherein performing three-dimensional face reconstruction based on the diffuse reflection map comprises:
performing three-dimensional face reconstruction based on the position map and the diffuse reflection map.
4. The method of claim 3, wherein the face prediction neural network further comprises an illumination network comprising a bilinear sampling layer and a third encoder formed by a third affine convolution layer, the third affine convolution layer comprising a third main convolution layer and a third auxiliary convolution layer; the method further comprises:
predicting a convolution kernel through the third auxiliary convolution layer and the face feature map, and outputting a third affine transformation matrix of each pixel point;
converting the pixel point corresponding to each feature category to the target position corresponding to that feature category based on the third affine transformation matrix, so that the third main convolution layer extracts the illumination feature corresponding to each target position, and outputting an illumination map;
wherein performing three-dimensional face reconstruction based on the diffuse reflection map comprises:
performing three-dimensional face reconstruction based on the illumination map, the position map, and the diffuse reflection map.
5. The method of claim 4, wherein the face prediction neural network further comprises a renderer, the method further comprising:
inputting the illumination map, the position map and the diffuse reflection map into the renderer;
reading position feature information, illumination feature information, and color feature information from the position map, the illumination map, and the diffuse reflection map, respectively, according to the texture coordinates of the vertices of the target mesh;
and generating a projection rendering image of the three-dimensional face in a two-dimensional space according to the position characteristic information, the illumination characteristic information and the color characteristic information.
6. The method of claim 5, wherein the renderer is a differentiable renderer.
7. The method of claim 6, further comprising:
training the face prediction neural network by backpropagating through the differentiable renderer until the loss function reaches an expected value, wherein the loss function comprises at least one of: a perceptual loss term, a reconstruction loss term, a symmetry loss term, and a skin color loss term.
8. The method of claim 7, wherein the perceptual loss term comprises a first perceptual loss term and a second perceptual loss term, the method further comprising:
extracting a first feature vector of the current face image and a second feature vector of the projection rendering image;
determining a first perceptual loss term of the face prediction neural network based on the first feature vector and the second feature vector;
and determining a second perceptual loss term by comparing the feature vector of the diffuse reflection map with the feature vector of the diffuse reflection map ground truth.
9. The method of claim 7, wherein the reconstruction loss terms comprise a first reconstruction loss term and a second reconstruction loss term, the method further comprising:
determining a first reconstruction loss term according to the pixel difference value between the current face image and the projection rendering image;
and determining a second reconstruction loss term by comparing, pixel by pixel, the diffuse reflection map with the diffuse reflection map ground truth.
10. The method of claim 7, further comprising:
and determining a symmetry loss term according to the diffuse reflection map and the horizontally flipped diffuse reflection map.
11. The method of claim 7, further comprising:
and performing Gaussian blur processing on the diffuse reflection map, and determining the skin color loss term based on the standard deviation of the color values of the pixels in the skin region.
12. A three-dimensional face reconstruction apparatus, the apparatus comprising:
the extraction module is used for extracting the features of the current face image to obtain a face feature map, wherein the face feature map comprises a plurality of feature categories;
the input module is used for inputting the face feature map into a face prediction neural network, wherein the face prediction neural network comprises a diffuse reflection network, a first decoder in the diffuse reflection network is connected by a skip connection to a first encoder formed by a first affine convolution layer, and the first affine convolution layer comprises a first main convolution layer and a first auxiliary convolution layer;
the matrix output module is used for predicting a convolution kernel through the first auxiliary convolution layer and the face feature map and outputting a first affine transformation matrix of each pixel point;
an affine transformation module, configured to convert the convolution kernel to a target position corresponding to a current feature class based on the first affine transformation matrix, so that the first main convolution layer extracts a diffuse reflection feature corresponding to each target position, and outputs a diffuse reflection map;
and the reconstruction module is used for reconstructing a three-dimensional face based on the diffuse reflection map.
13. An electronic device comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and wherein the processor implements the steps of the method of any of claims 1 to 11 when executing the computer program.
14. A computer readable storage medium having stored thereon computer executable instructions which, when invoked and executed by a processor, cause the processor to execute the method of any of claims 1 to 11.
CN202111435940.XA 2021-11-29 2021-11-29 Three-dimensional face reconstruction method and device and electronic equipment Active CN114119923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111435940.XA CN114119923B (en) 2021-11-29 2021-11-29 Three-dimensional face reconstruction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111435940.XA CN114119923B (en) 2021-11-29 2021-11-29 Three-dimensional face reconstruction method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN114119923A CN114119923A (en) 2022-03-01
CN114119923B true CN114119923B (en) 2022-07-19

Family

ID=80367857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111435940.XA Active CN114119923B (en) 2021-11-29 2021-11-29 Three-dimensional face reconstruction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114119923B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993216A (en) * 2017-11-22 2018-05-04 腾讯科技(深圳)有限公司 A kind of image interfusion method and its equipment, storage medium, terminal
CN109741456A (en) * 2018-12-17 2019-05-10 深圳市航盛电子股份有限公司 3D based on GPU concurrent operation looks around vehicle assistant drive method and system
CN111652960A (en) * 2020-05-07 2020-09-11 浙江大学 Method for solving human face reflection material from single image based on micro-renderer
CN112348947A (en) * 2021-01-07 2021-02-09 南京理工大学智能计算成像研究院有限公司 Three-dimensional reconstruction method for deep learning based on reference information assistance
CN112435343A (en) * 2020-11-24 2021-03-02 杭州唯实科技有限公司 Point cloud data processing method and device, electronic equipment and storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107103285B (en) * 2017-03-24 2020-03-03 深圳市未来媒体技术研究院 Face depth prediction method based on convolutional neural network
CN109147048B (en) * 2018-07-23 2021-02-26 复旦大学 Three-dimensional mesh reconstruction method by utilizing single-sheet colorful image
CN109766895A (en) * 2019-01-03 2019-05-17 京东方科技集团股份有限公司 The training method and image Style Transfer method of convolutional neural networks for image Style Transfer
CN113728335A (en) * 2019-02-08 2021-11-30 新加坡健康服务有限公司 Method and system for classification and visualization of 3D images
CN112132739B (en) * 2019-06-24 2023-07-18 北京眼神智能科技有限公司 3D reconstruction and face pose normalization method, device, storage medium and equipment
GB2585645B (en) * 2019-07-08 2024-04-17 Toshiba Kk Computer vision method and system
CN111241998B (en) * 2020-01-09 2023-04-28 中移(杭州)信息技术有限公司 Face recognition method, device, electronic equipment and storage medium
CN112085836A (en) * 2020-09-03 2020-12-15 华南师范大学 Three-dimensional face reconstruction method based on graph convolution neural network
CN112115860A (en) * 2020-09-18 2020-12-22 深圳市威富视界有限公司 Face key point positioning method and device, computer equipment and storage medium
CN112257645B (en) * 2020-11-02 2023-09-01 浙江大华技术股份有限公司 Method and device for positioning key points of face, storage medium and electronic device
CN112652058A (en) * 2020-12-31 2021-04-13 广州华多网络科技有限公司 Human face image replay method and device, computer equipment and storage medium
CN113240792B (en) * 2021-04-29 2022-08-16 浙江大学 Image fusion generation type face changing method based on face reconstruction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993216A (en) * 2017-11-22 2018-05-04 腾讯科技(深圳)有限公司 A kind of image interfusion method and its equipment, storage medium, terminal
CN109741456A (en) * 2018-12-17 2019-05-10 深圳市航盛电子股份有限公司 3D based on GPU concurrent operation looks around vehicle assistant drive method and system
CN111652960A (en) * 2020-05-07 2020-09-11 浙江大学 Method for solving human face reflection material from single image based on micro-renderer
CN112435343A (en) * 2020-11-24 2021-03-02 杭州唯实科技有限公司 Point cloud data processing method and device, electronic equipment and storage medium
CN112348947A (en) * 2021-01-07 2021-02-09 南京理工大学智能计算成像研究院有限公司 Three-dimensional reconstruction method for deep learning based on reference information assistance

Also Published As

Publication number Publication date
CN114119923A (en) 2022-03-01

Similar Documents

Publication Publication Date Title
US11257279B2 (en) Systems and methods for providing non-parametric texture synthesis of arbitrary shape and/or material data in a unified framework
US11037274B2 (en) Denoising Monte Carlo renderings using progressive neural networks
Georgoulis et al. Reflectance and natural illumination from single-material specular objects using deep learning
CN109859296B (en) Training method of SMPL parameter prediction model, server and storage medium
US20210074052A1 (en) Three-dimensional (3d) rendering method and apparatus
Petersen et al. Pix2vex: Image-to-geometry reconstruction using a smooth differentiable renderer
CN113838176B (en) Model training method, three-dimensional face image generation method and three-dimensional face image generation equipment
Müller et al. Compression and Real-Time Rendering of Measured BTFs Using Local PCA.
CN112215050A (en) Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment
US20100295850A1 (en) Apparatus and method for finding visible points in a cloud point
CN102136156B (en) System and method for mesoscopic geometry modulation
US20090309877A1 (en) Soft shadow rendering
CN114746904A (en) Three-dimensional face reconstruction
CN116109798B (en) Image data processing method, device, equipment and medium
JP7129529B2 (en) UV mapping to 3D objects using artificial intelligence
CN116416376A (en) Three-dimensional hair reconstruction method, system, electronic equipment and storage medium
CN112862807A (en) Data processing method and device based on hair image
Li et al. Detailed 3D human body reconstruction from multi-view images combining voxel super-resolution and learned implicit representation
Marques et al. Deep spherical harmonics light probe estimator for mixed reality games
CN114450719A (en) Human body model reconstruction method, reconstruction system and storage medium
Tiwary et al. Towards learning neural representations from shadows
CN117333637B (en) Modeling and rendering method, device and equipment for three-dimensional scene
Zhang et al. DIMNet: Dense implicit function network for 3D human body reconstruction
CN115713585B (en) Texture image reconstruction method, apparatus, computer device and storage medium
CN114119923B (en) Three-dimensional face reconstruction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant