CN117422829A - Face image synthesis optimization method based on neural radiance field - Google Patents
- Publication number
- CN117422829A (application CN202311379379.7A, CN202311379379A)
- Authority
- CN
- China
- Prior art keywords
- face image
- dimensional
- radiation field
- parameters
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T17/00 — Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06N3/0475 — Generative networks
- G06N3/048 — Activation functions
- G06N3/0499 — Feedforward networks
- G06N3/094 — Adversarial learning
- G06T15/005 — General purpose rendering architectures
- G06T7/70 — Determining position or orientation of objects or cameras
- G06T7/80 — Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
- G06T7/90 — Determination of colour characteristics
- G06V40/174 — Facial expression recognition
- G06T2207/10016 — Video; Image sequence
- G06T2207/10024 — Color image
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30201 — Face
Abstract
The invention discloses a face image synthesis optimization method based on a neural radiance field. The method comprises: obtaining a monocular RGB face video, performing frame extraction and two-dimensional face image fitting, and performing real-time face tracking and analysis on the two-dimensional face images to obtain camera intrinsic parameters, pose parameters and expression parameters; sampling rays of the two-dimensional face image using the camera intrinsic and pose parameters to obtain corresponding three-dimensional coordinates, and position-encoding the obtained three-dimensional coordinates; inputting the encoded positions, the expression parameters and latent codes into an MLP neural network to train the neural radiance field, obtaining RGB color values and densities; synthesizing the RGB color values and densities into a two-dimensional face image at a new view angle by volume rendering; and optimizing the two-dimensional face image synthesized in step S4 with a generative adversarial network. The neural network training time is short, the generated face images contain fewer artifacts and richer details, and the method has broad application prospects.
Description
Technical Field
The invention belongs to the technical field of three-dimensional reconstruction and novel-view image synthesis, and particularly relates to a face image synthesis optimization method based on a neural radiance field.
Background
A virtual digital human is a computer-based human simulation technology that can reproduce human appearance, actions and behaviors. Technically, virtual digital humans are highly comprehensive and interdisciplinary, spanning three-dimensional vision, computer graphics and natural language processing, as well as bionics, behavioral psychology and behavioral logic. Thanks to the continued development of digital human technology, industrial applications have begun to land; for example, in games and virtual reality, virtual digital humans enable highly realistic character simulation and motion capture, improving immersion and realism.
As an important component of virtual digital human technology, three-dimensional face reconstruction has long been a hot research direction in computer vision, computer graphics and three-dimensional reconstruction. Because facial features are distributed in broadly similar positions, a face can be mapped to a constructed low-dimensional parameterized face model, enabling an efficient digital representation of the face.
Most existing research is based on face reconstruction with generative adversarial networks: by constructing an effective network structure and constraining the generated result to be consistent with the data distribution of a pre-collected dataset, such methods bypass traditional explicit three-dimensional modeling and directly render photo-level, high-resolution, high-quality face images. However, when the pose parameters of the camera change, the reconstructed face image struggles to maintain view consistency. In recent years, face reconstruction based on implicit neural functions has begun to rise; taking HeadNeRF as an example, it implicitly expresses information such as facial expression and identity by means of a 3DMM low-dimensional face model. However, for images deviating from the training data, the related fitting task can only return results close to the training data, so the fit cannot be accurate; and since the training data rarely include images with headwear, it is difficult to render headwear-related content in the fitting results.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a face image synthesis optimization method based on a neural radiance field, addressing the defects of the prior art.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
A face image synthesis optimization method based on a neural radiance field comprises the following steps:
S1, acquiring a monocular RGB face video, performing frame extraction and two-dimensional face image fitting, and performing real-time face tracking and analysis on the two-dimensional face images to obtain camera intrinsic parameters, pose parameters and expression parameters;
S2, performing hierarchical ray sampling on the two-dimensional face image using the camera intrinsic parameters and pose parameters to obtain corresponding three-dimensional coordinates, and position-encoding the three-dimensional coordinates;
S3, inputting the encoded positions, expression parameters and latent codes together into an MLP neural network to train the neural radiance field, obtaining RGB color values and densities;
S4, synthesizing the RGB color values and densities into a two-dimensional face image at a new view angle by volume rendering;
S5, optimizing the two-dimensional face image synthesized in step S4 with a generative adversarial network.
In order to optimize the technical scheme, the specific measures adopted further comprise:
the above S1 uses a flap low dimensional face three dimensional model to perform two dimensional face image fitting to estimate head geometry, appearance and facial expression from a single Zhang Ren face image.
In the above S2, rays are sampled from the real 3D scene in a hierarchical manner: dense sampling is performed near points that contribute heavily to the color, and sparse sampling near points that contribute little; before being input to the MLP networks, the samples are proportionally distributed to optimize the two subsequent MLP networks (a coarse network and a fine network), thereby improving rendering efficiency.
The encoding equation adopted in the above S2 is as follows:
γ(p) = (sin(2^0 πp), cos(2^0 πp), …, sin(2^(L-1) πp), cos(2^(L-1) πp))
where γ(·) acts separately on each of the three components (x, y, z) of the three-dimensional spatial coordinate X and on each component of the unit view-direction vector d; p is the argument of the γ(·) function, representing a component of the three-dimensional coordinate X or of the view-direction vector d.
The MLP neural network described in the above S3 is the main body of the neural radiance field. Its backbone consists of 8 fully connected layers, each with 256 neurons, connected through ReLU activation functions; the backbone processes the position encoding and finally outputs the density σ. The 24-dimensional view-direction encoding is then appended, and the 3-dimensional RGB color value is output after 4 further fully connected layers.
The loss function of the neural network in S3 is as follows:
L = Σ_(r∈R) [ ‖Ĉ_c(r) − C(r)‖₂² + ‖Ĉ_f(r) − C(r)‖₂² ]
where R is the set of rays in each batch of the S2 hierarchical sampling, r is a ray in the set, and C(r), Ĉ_c(r) and Ĉ_f(r) are respectively the ground-truth RGB color of the ray in the real scene and the colors output by the coarse and fine networks of step 3;
In S3, the training process is accelerated using the NerfAcc technique, by skipping empty regions and occluded regions.
The volume rendering formula of S4 is as follows:
C(r) = ∫_(z_near)^(z_far) T(t) σ_θ(r(t)) RGB_θ(r(t), d) dt
where r(t) = o + td represents the ray, σ_θ represents the density parameter, RGB_θ represents the color parameter, d represents the view-direction vector, and z_far and z_near represent the far plane and near plane respectively; T(t) represents the transmittance of the ray from t_n over the distance t, i.e. the probability that the ray propagates without hitting any other particle, and is defined as follows:
T(t) = exp( − ∫_(t_n)^t σ_θ(r(s)) ds )
In S5, the two-dimensional face image synthesized in S4 is optimized with a HiFaceGAN network. The network uses a front-end suppression module to suppress heterogeneous degradation and encode robust hierarchical semantic information, which guides a subsequent replenishment module to reconstruct a renovated face with correspondingly lifelike detail; after semantic features are acquired from the front-end suppression module, the encoded features are used to guide detail replenishment.
The invention has the following beneficial effects:
the invention relates to a combined face tracking, nerve radiation field (NeRF), generation type countermeasure network (GAN) and the like, and only a group of monocular RGB face video sequences are required to be input, so that a new view angle face image which can be edited and has rich details can be reconstructed; the training of the nerve radiation field is accelerated by using a NerfAcc acceleration technology, so that the time required by training is greatly shortened, and the time required by training can be compressed on the premise of ensuring the accuracy of the result by skipping an empty region and stopping the ray in a shielding region in advance; generating a new view angle image of the head of the human face in a volume rendering mode; a user-friendly visual interface can be developed, and the expression and the pose of the generated face image can be explicitly edited by modifying corresponding parameters in the interface, so that the model has wider prospect in the face reconstruction; the generation countermeasure network module HiFaceGAN is used for optimizing the picture generated by volume rendering, so that rendering result optimization is realized, noise reduction, artifact removal and detail enhancement are performed on the reconstructed face picture, and the generated face image artifact is reduced and the detail is richer.
Drawings
Fig. 1 is a schematic diagram of the overall design of the present invention.
FIG. 2 is a schematic diagram of a portion of a neural network according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Although the steps of the present invention are arranged by reference numerals, the order of the steps is not limited, and the relative order of the steps may be adjusted unless the order of the steps is explicitly stated or the execution of a step requires other steps as a basis. It is to be understood that the term "and/or" as used herein relates to and encompasses any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, the invention relates to a face image synthesis optimization method based on a neural radiance field, which comprises the following steps:
s1, acquiring monocular RGB face video, carrying out frame extraction and two-dimensional face image fitting, and carrying out real-time face tracking and analysis on the two-dimensional face image to obtain camera internal parameters, pose parameters and expression parameters; specifically, a monocular RGB face video sequence is obtained, frame extraction is carried out, an available frame is obtained, a FLAME low-dimensional face three-dimensional model is used for fitting, and face tracking is used for analyzing the available frame to obtain relevant parameters.
S2, carrying out layered light sampling on the two-dimensional face image by adopting camera internal parameters and pose parameters to obtain corresponding three-dimensional coordinates, and carrying out position coding on the three-dimensional coordinates; specifically, object position parameters obtained in light sampling are encoded, and coordinates are mapped into a space with higher dimension, so that MLP can represent a function with higher frequency, and the geometry and texture of the object surface are more vivid;
s3, inputting the coded position, expression parameters and latent codes into a neural network formed by the MLP together for training a nerve radiation field to obtain RGB color values and densities;
S2-S3, sampling each extracted human face available frame image by utilizing the pose matrix and the camera internal and external parameters obtained in the S1, and simultaneously adding a latent code for reducing deviation in a preliminary model fitting process as much as possible, wherein the output trained by an MLP network is density and RGB color values;
s4, synthesizing RGB color values and density into a two-dimensional face image of a two-dimensional new view angle by utilizing volume rendering; accelerating training by using a NerfAcc technology, and synthesizing a new view by using volume rendering;
and S5, optimizing the two-dimensional face image synthesized in the step S4 by adopting a generation countermeasure network. And inputting the preliminarily rendered face image into a HiFaceGAN network, and carrying out noise reduction, artifact removal and detail enhancement on the reconstructed face image.
In an embodiment, the S1 input is a set of monocular RGB face video sequences for which the camera position is kept fixed. After preprocessing the raw data, the face is tracked and analyzed in real time by a face tracking technique to obtain the required information, such as camera intrinsic and extrinsic parameters, displacement matrices and expression parameters. This comprises the following steps:
Step S11, for a group of input monocular RGB face videos, frames are first extracted to obtain available frames; then two-dimensional face images are fitted with the low-dimensional face model FLAME using a face tracking technique, and head geometry and appearance are estimated from the single face image. The core idea is to regard a three-dimensional face as a linear combination of basis vectors such as shape, texture and expression, so that each group of three-dimensional face data can be represented by combinations within the basis-vector spaces of the database; solving the model of an arbitrary three-dimensional face is therefore equivalent to solving the coefficient of each basis vector. The relevant parameters of the FLAME model include spherical harmonic illumination parameters, texture and geometry parameters, facial expression parameters, and camera intrinsic and extrinsic parameters. Ray sampling is then performed using the camera intrinsic parameters and pose parameters;
Step S12, for the first frame, the number of iterations is increased to obtain a more accurate initialization model; the remaining frames can be initialized from the previous estimate;
Step S13, after data preprocessing is finished, a corresponding folder is generated for each two-dimensional face picture, containing the picture and the 2D position coordinates of 68 facial key points, used for fitting the low-dimensional deformable face model in the next stage.
By editing the configuration file, 5 pictures with significant differences in expression and pose are manually selected as key frames for model fitting (including texture and shape) during face tracking.
In the embodiment, S2 performs ray sampling on the two-dimensional face image using the camera intrinsic and extrinsic parameters and the pose matrix from the face tracking result. The expression parameters and pose parameters are the editable part: facial expression and pose can be explicitly edited by modifying the corresponding parameters.
(1) A coarse-to-fine hierarchical volume sampling technique is adopted: sampling densely near points with large color contributions and sparsely near points with small contributions, so as to reduce the required number of samples while fully sampling the high-frequency scene representation. The samples are allocated in proportion to their expected impact on the final rendering, for the subsequent optimization of two MLP networks (a coarse network and a fine network), improving rendering efficiency.
Hierarchical sampling first divides the interval between near and far along the camera ray into N_c equal bins, then samples uniformly within each bin to obtain one sampling point, giving N_c sampling points in total:
t_i ~ U[ near + ((i−1)/N_c)(far − near), near + (i/N_c)(far − near) ]
where far = 1 and near = 0;
For the ray sampling part, the number of randomly sampled rays is 2048; the chunk size is 2048 during training and 65536 during validation; and the number of sampling points per ray is 64 on both the coarse and the fine network.
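As a rough illustration of this coarse-stage sampling (a sketch under the stated settings near = 0, far = 1 and 64 samples per ray; the function name is hypothetical and this is not the patent's actual implementation), the per-bin uniform draw can be written as:

```python
import numpy as np

def stratified_samples(near, far, n_samples, rng):
    """Divide [near, far] into n_samples equal bins and draw one
    uniform sample inside each bin (coarse hierarchical sampling)."""
    edges = np.linspace(near, far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    u = rng.random(n_samples)            # one uniform draw per bin
    return lower + u * (upper - lower)

rng = np.random.default_rng(0)
t_vals = stratified_samples(0.0, 1.0, 64, rng)  # 64 points along one ray
```

Because each bin contributes exactly one point, the samples are ordered along the ray while still covering the interval stochastically, which is what lets the coarse network estimate where the color-contributing density lies.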
(2) The three-dimensional coordinates obtained from ray sampling are position-encoded: they are mapped to a higher-dimensional space through high-frequency functions before being passed to the network, so that data containing high-frequency variation can be fitted better and the learned images are not overly blurred;
the position coding is to change the original representation function into a combination of two functions:
wherein F' Θ The function is an MLP requiring learning over the network, while the gamma function does not require learning, and is here only a mapping function, mapping from space R to high-dimensional spaceThe coding equation used is:
γ(p) = (sin(2^0 πp), cos(2^0 πp), …, sin(2^(L-1) πp), cos(2^(L-1) πp))
γ(·) acts separately on each of the three components (x, y, z) of the three-dimensional spatial coordinate X and on each component of the unit view-direction vector. Finally, the encoded result is normalized to the interval [−1, 1] using the sinh function. p is the argument of the γ(·) function, representing a component of the three-dimensional coordinate X or of the view-direction vector.
The choice of the dimension L is related to the complexity of the scene and the available hardware computing power, and determines the magnitude of the highest frequency the neural network can learn. For the spatial coordinate X, L = 10; for the view direction, L = 4;
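A minimal numpy sketch of the γ(·) encoding above (the helper name is hypothetical): each scalar component is expanded into 2L sine/cosine features, giving 3·2·10 = 60 dimensions for the spatial coordinate and 3·2·4 = 24 dimensions for the view direction:

```python
import numpy as np

def positional_encoding(p, L):
    """gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ...,
    sin(2^(L-1) pi p), cos(2^(L-1) pi p)), applied componentwise."""
    p = np.asarray(p, dtype=np.float64)
    feats = []
    for i in range(L):
        freq = (2.0 ** i) * np.pi
        feats.append(np.sin(freq * p))
        feats.append(np.cos(freq * p))
    return np.stack(feats, axis=-1)   # shape: p.shape + (2L,)

# L = 10 for the spatial coordinate (3 components -> 60 features),
# L = 4 for the view direction (3 components -> 24 features).
xyz = np.array([0.1, -0.3, 0.7])
enc = positional_encoding(xyz, L=10)
```

The lowest-frequency pair (sin(πp), cos(πp)) preserves the coarse location, while the 2^(L-1)π pair lets the MLP resolve fine surface detail.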
In the embodiment, the S3 MLP network is the main part of the neural radiance field. Its backbone consists of 8 fully connected layers, each with 256 neurons, connected through ReLU activation functions; it processes the position encoding and finally outputs the density σ. The 24-dimensional view-direction encoding is then appended, and the 3-dimensional RGB color value is output after 4 fully connected layers (each with 128 neurons);
Since the coordinate system is converted into the canonical head space, the near plane is set to 0.2 and the far plane to 0.8. The number of position encoding layers of the two networks is 10, and the number of view-direction encoding layers is 4.
The encoded position parameters and view-direction parameters are input into the neural radiance field; to reduce possible deviation in the face tracking process as much as possible, 32-dimensional latent code parameters are added to the input of the MLP;
the optimizer adopts Adam and the initial learning rate lr origin For 0.0005, the learning rate update formula is as follows:
wherein lr is decay Set to 250 lr decay_factor Set to 0.1.
The encoded position is input together with the 76-dimensional expression parameters and the 32-dimensional latent codes into the MLP neural network (shown in figure 2) to train the neural radiance field. The loss function of the MLP in the neural radiance field is as follows:
L = Σ_(r∈R) [ ‖Ĉ_c(r) − C(r)‖₂² + ‖Ĉ_f(r) − C(r)‖₂² ]
where R is the set of rays in each batch of the S2 hierarchical sampling, r is a ray in the set, and C(r), Ĉ_c(r) and Ĉ_f(r) are respectively the ground-truth RGB color of the ray in the real scene and the colors output by the coarse and fine networks of step 3;
the neural network is a neural radiation field backbone network, and outputs RGB color values and densities for subsequent image synthesis.
The NerfAcc technique is used to accelerate training of the neural radiance field by skipping empty regions and occluded regions, as follows:
An occupancy grid estimator (Occupancy Grid Estimator) caches the scene densities in a binarized voxel grid, which offers faster reads; during sampling, rays step through the grid at a preset step size, and blank regions are skipped by querying the voxel grid.
For a ray that strikes an occluding object, the points occluded by the object can be ignored; that is, the expected color of the sampled ray is the color of the occluder.
In NerfAcc, a threshold on T is set. During ray casting, the density σ at each point is computed and the corresponding transmittance T is updated; if the T value at a point falls below the set threshold, the ray has struck an occluding object and the ray-casting process can be terminated.
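A toy sketch of this transmittance-threshold early termination (the threshold value 1e-4 and the function name are illustrative assumptions; actual NerfAcc internals differ):

```python
import numpy as np

def march_with_early_stop(sigmas, deltas, t_threshold=1e-4):
    """Accumulate transmittance T along a ray; terminate as soon as T
    falls below the threshold (the ray has hit an occluder)."""
    T = 1.0
    used = 0
    for sigma, delta in zip(sigmas, deltas):
        used += 1
        T *= np.exp(-sigma * delta)
        if T < t_threshold:
            break                     # remaining samples are never evaluated
    return T, used

# Opaque surface at the third/fourth samples; the last two samples are skipped.
sigmas = np.array([0.0, 0.0, 50.0, 50.0, 0.0, 0.0])
deltas = np.full(6, 0.1)
T, used = march_with_early_stop(sigmas, deltas)
```

Every skipped sample is one fewer MLP evaluation, which is where the training-time saving comes from.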
In the embodiment, S4 synthesizes the densities and RGB colors output by the network into a two-dimensional face image at a new view angle using volume rendering. The principle of volume rendering is to sample in a three-dimensional dataset, trace a ray through each sampling point, and finally generate a two-dimensional image. In this process, the value at each sampling point represents color and transparency; since each pixel of the two-dimensional image can be regarded as the cumulative superposition of all points along one ray emitted from the camera, the color of each pixel is obtained by integrating the colors weighted by density, thereby rendering the two-dimensional image at that view angle.
The sampling results are accumulated. Suppose the distance from the camera to the near plane is t_n and to the far plane is t_f. For the sampled ray r(t) = o + td, the final expected color is computed as follows:
C(r) = ∫_(z_near)^(z_far) T(t) σ_θ(r(t)) RGB_θ(r(t), d) dt
where r(t) = o + td represents the ray, σ_θ represents the density parameter, RGB_θ represents the color parameter, d represents the view-direction vector, and z_far and z_near represent the far plane and near plane respectively; T(t) represents the transmittance of the ray from t_n over the distance t, i.e. the probability that the ray propagates without hitting any other particle, and is defined as follows:
T(t) = exp( − ∫_(t_n)^t σ_θ(r(s)) ds )
In an embodiment, S5 optimizes the face image rendered in step S4 using a HiFaceGAN network, whose full name is Face Renovation via Collaborative Suppression and Replenishment. This network is used to denoise the volume-rendered image, remove artifacts, and improve detail fidelity.
(1) The suppression module aims to suppress heterogeneous degradation and encode robust hierarchical semantic information, guiding the subsequent replenishment module to reconstruct a renovated face with correspondingly lifelike detail.
(2) After semantic features are obtained from the front-end suppression module, the encoded features are used to guide detail replenishment.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted for clarity only. The specification should be taken as a whole, and the technical solutions in the various embodiments may be combined as appropriate to form other implementations that will be apparent to those skilled in the art.
Claims (9)
1. A face image synthesis optimization method based on a neural radiance field, characterized by comprising the following steps:
S1, acquiring a monocular RGB face video, performing frame extraction and two-dimensional face image fitting, and performing real-time face tracking and analysis on the two-dimensional face images to obtain camera intrinsic parameters, pose parameters and expression parameters;
S2, performing hierarchical ray sampling on the two-dimensional face image using the camera intrinsic parameters and pose parameters to obtain corresponding three-dimensional coordinates, and applying position encoding to the three-dimensional coordinates;
S3, inputting the encoded positions, expression parameters and latent codes together into an MLP-based neural network to train the neural radiance field, obtaining RGB color values and densities;
S4, synthesizing the RGB color values and densities into a two-dimensional face image at a new view angle using volume rendering;
and S5, optimizing the two-dimensional face image synthesized in step S4 using a generative adversarial network.
2. The method of claim 1, wherein S1 uses the FLAME low-dimensional face model to fit the two-dimensional face images, estimating head geometry, appearance and facial expression from a single face image.
3. The face image synthesis optimization method based on the neural radiance field according to claim 1, wherein S2 performs ray sampling of the real 3D scene using a coarse-to-fine hierarchical sampling technique: sampling densely near points that contribute strongly to the color, and sparsely near points that contribute little. Before being input to the MLP networks, samples are allocated in proportion to their expected effect on the final rendering, for the subsequent optimization of two MLP networks (a coarse network and a fine network), thereby improving rendering efficiency.
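The coarse-to-fine allocation described in claim 3 is commonly implemented by inverse-transform sampling of the coarse network's rendering weights; regions that contributed more color are then sampled more densely by the fine network. A NumPy sketch (the function name and the linear interpolation within each bin are assumptions):

```python
import numpy as np

def sample_fine(bins, weights, n_fine, rng):
    """Draw n_fine new depths along a ray by inverse-transform sampling
    the piecewise-constant PDF defined by the coarse rendering weights.

    bins:    (N+1,) edges of the coarse sampling intervals along the ray
    weights: (N,)   rendering weights from the coarse network"""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.random(n_fine)                               # uniform samples
    idx = np.searchsorted(cdf, u, side='right') - 1      # bin containing each u
    idx = np.clip(idx, 0, len(weights) - 1)
    # linear interpolation within the chosen bin
    denom = cdf[idx + 1] - cdf[idx]
    frac = (u - cdf[idx]) / np.where(denom > 0, denom, 1.0)
    return bins[idx] + frac * (bins[idx + 1] - bins[idx])
```

If all the coarse weight sits in one interval, every fine sample lands inside that interval, which is exactly the dense-near-contributing-points behavior the claim describes.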
4. The face image synthesis optimization method based on the neural radiance field according to claim 1, wherein the encoding equation adopted by S2 is:
$$\gamma(p) = \left(\sin(2^{0}\pi p),\ \cos(2^{0}\pi p),\ \dots,\ \sin(2^{L-1}\pi p),\ \cos(2^{L-1}\pi p)\right)$$
wherein γ(·) is applied separately to each of the three components (x, y, z) of the three-dimensional coordinate X and to each of the three components of the unit viewing-direction vector $\vec{d}$; L is the encoding dimension, and p denotes the argument of γ(·), i.e. a component of the three-dimensional coordinate X or of the viewing-direction vector $\vec{d}$.
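The encoding of claim 4 can be written directly in NumPy; the vectorized helper below is illustrative:

```python
import numpy as np

def positional_encoding(p, L):
    """gamma(p) applied elementwise to the coordinate components.

    p: array of coordinate components (e.g. x, y, z)
    L: number of frequency bands
    Returns (sin(2^0 pi p), ..., sin(2^{L-1} pi p)) followed by the
    matching cosines, per component."""
    p = np.asarray(p, dtype=float)
    freqs = 2.0 ** np.arange(L) * np.pi     # 2^l * pi for l = 0 .. L-1
    angles = p[..., None] * freqs           # broadcast over components
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```

With L = 4 frequency bands, the three components of the viewing direction yield 3 × 2 × 4 = 24 values, matching the 24-dimensional viewing-direction encoding mentioned in claim 5.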
5. The face image synthesis optimization method based on the neural radiance field according to claim 1, wherein the MLP-based neural network in S3 is the main body of the neural radiance field. Its backbone consists of 8 fully connected layers with 256 neurons each, connected by ReLU activation functions; the backbone takes the computed position encoding as input and outputs the density σ. The 24-dimensional viewing-direction encoding $\vec{v}$ is then concatenated, and the 3-dimensional RGB color value is output after 4 further fully connected layers.
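A PyTorch sketch of the claim-5 architecture follows. The hidden width of the 4-layer color head, the 63-dimensional position encoding, and the Sigmoid/ReLU output activations are assumptions; the claim fixes only the layer counts, the 256-unit backbone, and the input/output dimensions:

```python
import torch
import torch.nn as nn

class NerfMLP(nn.Module):
    """Backbone: 8 FC layers of 256 units with ReLU, outputting density sigma.
    Color head: the 24-dim view encoding is concatenated, then 4 FC layers
    (widths assumed) produce the 3-dim RGB value."""
    def __init__(self, pos_dim=63, view_dim=24):
        super().__init__()
        layers, d = [], pos_dim
        for _ in range(8):                        # 8 FC layers, 256 neurons each
            layers += [nn.Linear(d, 256), nn.ReLU()]
            d = 256
        self.backbone = nn.Sequential(*layers)
        self.sigma_head = nn.Linear(256, 1)       # density output
        self.color_head = nn.Sequential(          # 4 FC layers -> 3-dim RGB
            nn.Linear(256 + view_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid())

    def forward(self, pos_enc, view_enc):
        h = self.backbone(pos_enc)
        sigma = torch.relu(self.sigma_head(h))    # keep density non-negative
        rgb = self.color_head(torch.cat([h, view_enc], dim=-1))
        return sigma, rgb
```

Predicting σ from the position encoding alone and injecting the view encoding only into the color head keeps the geometry view-independent while letting the color vary with viewpoint.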
6. The face image synthesis optimization method based on the neural radiance field according to claim 1, wherein the loss function of the neural network in S3 is:

$$\mathcal{L} = \sum_{r \in R}\left[\left\|\hat{C}_{c}(r) - C(r)\right\|_{2}^{2} + \left\|\hat{C}_{f}(r) - C(r)\right\|_{2}^{2}\right]$$

wherein R is the set of rays in each batch of the S2 hierarchical sampling, r is each ray in the set, and $C(r)$, $\hat{C}_{c}(r)$ and $\hat{C}_{f}(r)$ are the RGB colors of the ray in the real 3D scene, output by the coarse network, and output by the fine network in S3, respectively.
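The claim-6 loss is a plain sum of squared color errors for both networks over a batch of rays; a NumPy sketch (function name illustrative):

```python
import numpy as np

def nerf_loss(c_gt, c_coarse, c_fine):
    """Summed squared RGB error of the coarse and fine networks
    against the ground-truth ray colors.

    c_gt, c_coarse, c_fine: (R, 3) RGB colors per ray in the batch."""
    return (np.sum((c_coarse - c_gt) ** 2) +
            np.sum((c_fine - c_gt) ** 2))
```

Supervising the coarse network alongside the fine one keeps its weight estimates useful for the hierarchical sampling of claim 3, even though only the fine network's output is rendered.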
7. The face image synthesis optimization method based on the neural radiance field according to claim 1, wherein S3 accelerates the training process by using the NerfAcc technique to skip empty regions and occluded regions.
8. The face image synthesis optimization method based on the neural radiance field according to claim 1, wherein the volume rendering formula of S4 is:

$$C(r) = \int_{z_{near}}^{z_{far}} T(t)\,\sigma_{\theta}(r(t))\,RGB_{\theta}(r(t),\vec{d})\,dt$$

wherein $r(t) = o + t\vec{d}$ represents the ray, $\sigma_{\theta}$ the density parameters, $RGB_{\theta}$ the color parameters, $\vec{d}$ the viewing-direction unit vector, and $z_{far}$ and $z_{near}$ the far and near planes respectively; $T(t)$ is the transmittance of the ray accumulated from $z_{near}$ up to $t$, i.e. the probability that the ray propagates that far without hitting any other particle, defined as:

$$T(t) = \exp\!\left(-\int_{z_{near}}^{t} \sigma_{\theta}(r(s))\,ds\right)$$
9. The face image synthesis optimization method based on the neural radiance field according to claim 1, wherein S5 optimizes the two-dimensional face image synthesized in S4 using a HiFaceGAN network; the network employs a front-end suppression module to suppress heterogeneous degradation and encode robust hierarchical semantic information that guides a subsequent replenishment module to reconstruct a correspondingly lifelike face, and after semantic features are obtained from the front-end suppression module, the encoded features are used to guide detail replenishment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311379379.7A CN117422829A (en) | 2023-10-24 | 2023-10-24 | Face image synthesis optimization method based on nerve radiation field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311379379.7A CN117422829A (en) | 2023-10-24 | 2023-10-24 | Face image synthesis optimization method based on nerve radiation field |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117422829A true CN117422829A (en) | 2024-01-19 |
Family
ID=89532032
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311379379.7A Pending CN117422829A (en) | 2023-10-24 | 2023-10-24 | Face image synthesis optimization method based on nerve radiation field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117422829A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117689783A (en) * | 2024-02-02 | 2024-03-12 | 湖南马栏山视频先进技术研究院有限公司 | Face voice driving method and device based on super-parameter nerve radiation field |
CN117953165A (en) * | 2024-03-26 | 2024-04-30 | 合肥工业大学 | New human face view synthesis method and system based on nerve radiation field |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115170559A (en) * | 2022-08-12 | 2022-10-11 | 杭州像衍科技有限公司 | Personalized human head nerve radiation field substrate representation and reconstruction method based on multilevel Hash coding |
CN115439311A (en) * | 2022-08-24 | 2022-12-06 | 南京航空航天大学 | Face mask editing method based on generation of confrontation network |
CN115689869A (en) * | 2022-10-21 | 2023-02-03 | 中国科学院计算技术研究所 | Video makeup migration method and system |
CN116071494A (en) * | 2022-12-23 | 2023-05-05 | 杭州像衍科技有限公司 | High-fidelity three-dimensional face reconstruction and generation method based on implicit nerve function |
WO2023093186A1 (en) * | 2022-06-15 | 2023-06-01 | 之江实验室 | Neural radiation field-based method and apparatus for constructing pedestrian re-identification three-dimensional data set |
CN116228979A (en) * | 2023-02-24 | 2023-06-06 | 上海大学 | Voice-driven editable face replay method, device and storage medium |
-
2023
- 2023-10-24 CN CN202311379379.7A patent/CN117422829A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023093186A1 (en) * | 2022-06-15 | 2023-06-01 | 之江实验室 | Neural radiation field-based method and apparatus for constructing pedestrian re-identification three-dimensional data set |
CN115170559A (en) * | 2022-08-12 | 2022-10-11 | 杭州像衍科技有限公司 | Personalized human head nerve radiation field substrate representation and reconstruction method based on multilevel Hash coding |
CN115439311A (en) * | 2022-08-24 | 2022-12-06 | 南京航空航天大学 | Face mask editing method based on generation of confrontation network |
CN115689869A (en) * | 2022-10-21 | 2023-02-03 | 中国科学院计算技术研究所 | Video makeup migration method and system |
CN116071494A (en) * | 2022-12-23 | 2023-05-05 | 杭州像衍科技有限公司 | High-fidelity three-dimensional face reconstruction and generation method based on implicit nerve function |
CN116228979A (en) * | 2023-02-24 | 2023-06-06 | 上海大学 | Voice-driven editable face replay method, device and storage medium |
Non-Patent Citations (7)
Title |
---|
JIAYANG BAI等: "Self-NeRF: A Self-Training Pipeline for Few-Shot Neural Radiance Fields", 《ARXIV》, 10 March 2023 (2023-03-10), pages 2303 * |
LINGBO YANG等: "HiFaceGAN: Face Renovation via Collaborative Suppression and Replenishment", 《IN 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM ’20)》, 12 October 2020 (2020-10-12), pages 1 - 6 * |
SHAHRUKH ATHAR等: "FLAME-in-NeRF: Neural control of Radiance Fields for Free View Face Animation", 《2023 IEEE 17TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG)》, 16 February 2023 (2023-02-16) * |
YIYU ZHUANG等: "MoFaNeRF: Morphable Facial Neural Radiance Field", 《ARXIV》, 22 July 2022 (2022-07-22), pages 1 - 6 * |
古路: "Accelerating NeRF Training: nerfacc", pages 0 - 2, Retrieved from the Internet <URL:https://blog.csdn.net/fb_941219/article/details/131680149> *
ZHANG Yao et al.: "Research Progress in Deep-Learning-Based Visual Simultaneous Localization and Mapping", Chinese Journal of Scientific Instrument, vol. 44, no. 07, 31 July 2023 (2023-07-31), pages 214 - 241 *
HONG Yang: "Representation and Reconstruction of High-Fidelity Virtual Digital Humans", China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 03, 15 March 2023 (2023-03-15), pages 138 - 32 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117689783A (en) * | 2024-02-02 | 2024-03-12 | 湖南马栏山视频先进技术研究院有限公司 | Face voice driving method and device based on super-parameter nerve radiation field |
CN117689783B (en) * | 2024-02-02 | 2024-04-30 | 湖南马栏山视频先进技术研究院有限公司 | Face voice driving method and device based on super-parameter nerve radiation field |
CN117953165A (en) * | 2024-03-26 | 2024-04-30 | 合肥工业大学 | New human face view synthesis method and system based on nerve radiation field |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Gao et al. | Reconstructing personalized semantic facial nerf models from monocular video | |
Rematas et al. | Novel views of objects from a single image | |
CN112887698B (en) | High-quality face voice driving method based on nerve radiation field | |
CN117422829A (en) | Face image synthesis optimization method based on nerve radiation field | |
CN113066171B (en) | Face image generation method based on three-dimensional face deformation model | |
CN115914505B (en) | Video generation method and system based on voice-driven digital human model | |
CN117496072B (en) | Three-dimensional digital person generation and interaction method and system | |
CN113808047B (en) | Denoising method for human motion capture data | |
CN117315211B (en) | Digital human synthesis and model training method, device, equipment and storage medium thereof | |
CN115457169A (en) | Voice-driven human face animation generation method and system | |
CN115951784B (en) | Method for capturing and generating motion of wearing human body based on double nerve radiation fields | |
CN116385667B (en) | Reconstruction method of three-dimensional model, training method and device of texture reconstruction model | |
CN111402403B (en) | High-precision three-dimensional face reconstruction method | |
CN116416376A (en) | Three-dimensional hair reconstruction method, system, electronic equipment and storage medium | |
CN112184912A (en) | Multi-metric three-dimensional face reconstruction method based on parameterized model and position map | |
CN116385606A (en) | Speech signal driven personalized three-dimensional face animation generation method and application thereof | |
CN116134491A (en) | Multi-view neuro-human prediction using implicit differentiable renderers for facial expression, body posture morphology, and clothing performance capture | |
Chen et al. | TeSTNeRF: Text-Driven 3D Style Transfer via Cross-Modal Learning. | |
Yang et al. | Poxture: Human posture imitation using neural texture | |
CN117333604A (en) | Character face replay method based on semantic perception nerve radiation field | |
CN114972619A (en) | Single-image face three-dimensional reconstruction method based on self-alignment double regression | |
CN117745932A (en) | Neural implicit curved surface reconstruction method based on depth fusion constraint | |
CN116883524A (en) | Image generation model training, image generation method and device and computer equipment | |
Han et al. | Learning residual color for novel view synthesis | |
Wang et al. | Expression-aware neural radiance fields for high-fidelity talking portrait synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |