CN116778061B - Three-dimensional object generation method based on non-realistic picture - Google Patents

Three-dimensional object generation method based on non-realistic picture

Info

Publication number
CN116778061B
CN116778061B (application CN202311070901.3A)
Authority
CN
China
Prior art keywords
loss function
training
dimensional object
diffusion model
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311070901.3A
Other languages
Chinese (zh)
Other versions
CN116778061A (en)
Inventor
徐浩然 (Xu Haoran)
李泽健 (Li Zejian)
陈培 (Chen Pei)
孙凌云 (Sun Lingyun)
王小松 (Wang Xiaosong)
陈晓皎 (Chen Xiaojiao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311070901.3A priority Critical patent/CN116778061B/en
Publication of CN116778061A publication Critical patent/CN116778061A/en
Application granted granted Critical
Publication of CN116778061B publication Critical patent/CN116778061B/en
Legal status: Active (granted)


Classifications

    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/02 Non-photorealistic rendering
    • G06N 3/045 Combinations of networks (computing arrangements based on biological models; neural networks)
    • G06N 3/08 Learning methods
    • G06T 15/205 Image-based rendering (geometric effects; perspective computation)
    • G06T 17/10 Constructive solid geometry [CSG] using solid primitives, e.g. cylinders, cubes
    • G06T 17/20 Finite element generation, e.g. wire-frame surface description, tessellation
    • G06V 20/64 Three-dimensional objects (scenes; scene-specific elements)
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Generation (AREA)

Abstract

The invention discloses a three-dimensional object generation method based on non-realistic pictures. The probability distribution of a generated image is obtained through a pre-trained diffusion model conditioned on text prompts and a depth map, and a loss function is constructed from the KL divergence between the probability distributions of the generated image and the target image to update the parameters of the neural radiance field, so that the three-dimensional geometric model produced by the updated neural radiance field is more accurate without depending on a depth estimator. The invention suppresses the density of non-subject matter outside the semantic mask through a floating artifact loss function, i.e., a loss function over the density map and the subject semantic mask, thereby eliminating floating artifacts, and encourages density growth within the semantic mask to form a more accurate three-dimensional geometric model.

Description

Three-dimensional object generation method based on non-realistic picture
Technical Field
The invention belongs to the technical fields of deep-learning image processing and three-dimensional object generation, and particularly relates to a three-dimensional object generation method based on non-realistic pictures.
Background
In recent years, Neural Radiance Fields (NeRF) have made tremendous progress in modeling realistic three-dimensional objects. NeRF-based three-dimensional generation methods learn a three-dimensional model from a series of two-dimensional images by training a neural network. Meanwhile, Diffusion Models have significantly advanced text-to-image generation. By using a diffusion model to provide the prior distribution of generated images, NeRF-based three-dimensional generation can work without known camera parameters and positions.
Existing diffusion-model-distillation-guided NeRF generation methods require the rendered image (distribution) that NeRF produces through a differentiable rendering process to be close to the target image (distribution) generated by the diffusion model, and comprise two approaches: Score Distillation Sampling (SDS) and Variational Score Distillation (VSD). The former optimizes the mean squared error between a single target image and a single rendered image, while the latter optimizes the KL divergence between the target image distribution and the rendered image distribution. In neural radiance field methods based on diffusion model distillation, the diffusion model can accept various modalities (such as text and line drafts) as input conditions, covering a wide range of production scenarios, and therefore has good research and application value.
Existing distillation-guided NeRF generation methods are based on realistic pictures. When non-realistic modeling is performed from non-realistic pictures (for example, cartoon-style three-dimensional object modeling based on Guofeng, i.e. Chinese-style, flat cartoon pictures), the modeling quality of existing methods drops significantly. On the one hand, existing NeRF generation methods mostly assume a specific illumination model (such as the Lambertian diffuse illumination model), but non-realistic pictures mostly do not follow such a model, so errors (such as a large number of floating artifacts) are introduced when optimizing the view-dependent neural radiance field. On the other hand, existing distillation-guided NeRF generation methods depend on the accuracy of a depth estimator for geometric optimization, and the depth of a non-realistic picture is difficult to estimate with existing depth estimators, even after fine-tuning, which impairs the optimization of the radiance field's geometric structure.
Therefore, in neural radiance field generation methods based on diffusion model distillation and guided by non-realistic pictures, the floating artifacts caused by non-Lambertian illumination during model training must be removed so that a correct geometric structure forms and downstream production applications such as mesh generation are ensured; furthermore, a geometric optimization method independent of depth estimation is needed to meet the requirements of non-realistic modeling.
Disclosure of Invention
The invention provides a three-dimensional object generation method based on a non-realistic picture, which can accurately obtain a three-dimensional geometric model based on the non-realistic picture.
The embodiment of the invention provides a three-dimensional object generation method based on a non-realistic picture, which comprises the following steps:
fine-tuning the basic diffusion model based on the non-realistic picture set to obtain a pre-trained diffusion model;
constructing a training system comprising the pre-trained diffusion model, a neural radiance field, a ControlNet network and a semantic segmentation network, wherein a text prompt is input into the pre-trained diffusion model to obtain the probability distribution of a target image; the probability distribution of a rendered image and the depth map and density map corresponding to the rendered image are obtained from the neural radiance field; based on the text prompt, the ControlNet network conditions the pre-trained diffusion model on the depth map to obtain the probability distribution of a generated image; and the semantic segmentation network performs semantic segmentation on the rendered image to obtain a subject semantic mask;
constructing a total loss function comprising a variational score distillation loss function, a geometric optimization loss function and a floating artifact loss function, wherein:
the variational score distillation loss function makes the expectation of the KL divergence between the probability distributions of the rendered image and the target image not higher than a first loss threshold; the geometric optimization loss function makes the expectation of the KL divergence between the probability distributions of the generated image and the target image not higher than a second loss threshold; and the floating artifact loss function makes the expectation of the loss value over the density map and the subject semantic mask not higher than a third loss threshold;
training a group of neural radiance fields through the training system based on the text prompt and the camera poses using the total loss function, obtaining a plurality of final neural radiance fields;
inputting a camera pose into one final neural radiance field randomly selected from the plurality of final neural radiance fields to obtain a rendered image of the three-dimensional object.
Further, the geometric optimization loss function is constructed based on the expectation of the KL divergence between the probability distributions of the target image and the generated image at different diffusion steps; the generated image is made to approach the target image through the geometric optimization loss function so as to update the depth map, and the parameters of the neural radiance field are updated by updating the depth map.
Further, the floating artifact loss function is constructed based on the density maps and subject semantic masks under different camera poses; the density map is updated through the floating artifact loss function, and the parameters of the neural radiance field are updated by updating the density map, thereby removing floating artifacts from the rendered image.
Further, the variational score distillation loss function is constructed based on the expectation of the KL divergence between the probability distributions of the target image and the rendered image under different camera poses and at different diffusion steps; the rendered image is made to approach the target image through the variational score distillation loss function so as to update the parameters of the neural radiance field.
Further, when the training iteration round reaches a set round hyperparameter, the density map is updated through the floating artifact loss function, and the parameters of the neural radiance field are updated by updating the density map, thereby eliminating floating artifacts of the three-dimensional object.
Further, fine tuning the diffusion model based on the non-realistic picture set to obtain a pre-trained diffusion model, including:
using the non-realistic pictures as a pre-training data set, adding an up/down-sampling multi-layer perceptron on a bypass of the basic diffusion model, and training the multi-layer perceptron on the pre-training data set through an image synthesis technique to obtain the pre-trained diffusion model.
Further, adding the up/down-sampling multi-layer perceptron on the bypass of the basic diffusion model comprises:
adding the up/down-sampling multi-layer perceptron on the bypass of the basic diffusion model using a low-rank matrix fine-tuning method, and, based on the text prompt, superposing the outputs of the trained basic diffusion model and of the up/down-sampling multi-layer perceptron to obtain the target image.
Further, one final neural radiance field is randomly sampled from the plurality of final neural radiance fields, the signed distance function values corresponding to the three-dimensional object are obtained from the sampled field, and the geometric mesh of the three-dimensional object is generated using the Deep Marching Tetrahedra technique based on the signed distance function values.
Compared with the prior art, the invention has the beneficial effects that:
According to the method, the probability distribution of the generated image is obtained through the pre-trained diffusion model based on the text prompt and the depth map, and a loss function is constructed from the KL divergence between the probability distributions of the generated image and the target image to update the parameters of the neural radiance field, so that the three-dimensional geometric model produced by the updated neural radiance field is more accurate without depending on a depth estimator.
The floating artifact suppression method provided by the invention suppresses the density of non-subject matter outside the semantic mask through a loss function over the density map and the subject semantic mask, thereby eliminating floating artifacts, and encourages density growth within the semantic mask, which accelerates the convergence of the training system and forms an accurate three-dimensional geometric model.
Drawings
FIG. 1 is a flow chart of a method for generating a three-dimensional object based on a non-realistic picture according to an embodiment of the present invention;
fig. 2 is a data path diagram of a three-dimensional object generating method based on a non-realistic picture according to an embodiment of the present invention.
Detailed Description
The present invention will be further described in detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
The embodiment of the invention provides a variational score distillation loss function, a geometric optimization loss function and a floating artifact loss function to train the neural radiance field, so that the neural radiance field can produce a more accurate three-dimensional geometric model for non-realistic images. The details are as follows:
the embodiment of the invention provides a three-dimensional object generation method based on a non-realistic picture, which comprises the following steps as shown in fig. 1 and 2:
s1, fine tuning is carried out on a basic diffusion model based on a non-realistic picture set to obtain a pre-training diffusion model, and the specific steps are as follows:
the embodiment of the invention obtains a non-realistic picture set, wherein the non-realistic picture set comprises a plurality of non-realistic pictures, specifically, as shown in fig. 2, generally 10-50 non-realistic pictures, the plurality of non-realistic pictures are new country lotus pictures and white backgrounds, and the non-realistic picture set is used as a pre-training data set.
This embodiment adopts the low-rank matrix fine-tuning method (Low-Rank Adaptation, LoRA) to add an up/down-sampling multi-layer perceptron on a bypass of the basic diffusion model; based on the text prompt, the outputs of the trained basic diffusion model and of the up/down-sampling multi-layer perceptron are superposed to obtain the target image. In one embodiment, the basic diffusion model is a latent space diffusion model (Latent Diffusion Model, LDM).
The target image \(\hat{x}\) provided by the embodiment of the invention is:

\[\hat{x}=G_{\phi+\Delta\phi}\!\left(\epsilon\mid y\right),\]

where \(\phi\) are the diffusion model parameters, \(\Delta\phi\) are the multi-layer perceptron parameters, and \(\epsilon\sim\mathcal{N}(0,I)\) is Gaussian noise.
This embodiment trains the up/down-sampling multi-layer perceptron added on the bypass of the basic diffusion model through an image synthesis technique (DreamBooth) on the pre-training data set, obtaining a pre-trained diffusion model capable of generating target-style content from given text.
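For illustration, a minimal Python sketch of this fine-tuning setup follows. It is not the patent's actual implementation: latents are treated as flat vectors so that a plain linear bypass applies, and `base_eps` (the frozen base model's noise prediction) and `scheduler` (exposing `num_steps` and `add_noise`) are assumed wrapper objects.

```python
import torch
import torch.nn as nn

class BypassMLP(nn.Module):
    """Up/down-sampling multi-layer perceptron on a bypass of the frozen
    base diffusion model (low-rank adaptation; illustrative sketch)."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # down-sampling projection
        self.up = nn.Linear(rank, dim, bias=False)     # up-sampling projection
        nn.init.zeros_(self.up.weight)                 # bypass starts at zero

    def forward(self, h):
        return self.up(self.down(h))

def finetune_step(base_eps, bypass, scheduler, opt, x0, y_emb):
    """One DreamBooth-style step on the non-realistic picture set: the noise
    prediction superposes the frozen base output and the bypass output, and
    only the bypass parameters receive gradients."""
    noise = torch.randn_like(x0)                       # Gaussian noise epsilon
    t = torch.randint(0, scheduler.num_steps, (x0.shape[0],))
    xt = scheduler.add_noise(x0, noise, t)             # forward diffusion
    with torch.no_grad():
        eps_base = base_eps(xt, t, y_emb)              # frozen phi prediction
    eps_hat = eps_base + bypass(eps_base.flatten(1)).view_as(eps_base)
    loss = nn.functional.mse_loss(eps_hat, noise)
    loss.backward()
    opt.step(); opt.zero_grad()
    return loss.item()
```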
S2, constructing a training system comprising the pre-trained diffusion model, a neural radiance field, a control network (ControlNet) and a semantic segmentation network (Segment Anything Model, SAM). A text prompt is input into the pre-trained diffusion model to obtain the probability distribution of the target image; the probability distribution of the rendered image and the depth map and density map corresponding to the rendered image are obtained from the neural radiance field; based on the text prompt, the ControlNet network conditions the pre-trained diffusion model on the depth map to obtain the probability distribution of the generated image; and the semantic segmentation network performs semantic segmentation on the rendered image to obtain the subject semantic mask.
In one embodiment, a text prompt \(y\) is obtained and used as input to the training system. The text prompt \(y\) is input to the pre-trained diffusion model to obtain the probability distribution of the target image. A neural radiance field with parameters \(\theta\) is randomly initialized, and a differentiable renderer \(g\) performs the rendering operation on it: given a camera pose \(c\), the renderer \(g\) emits rays pixel by pixel and weights the colors of the sample points along each ray \(r(t)=o+t\,v\) to obtain a rendered image \(x=g(\theta,c)\), and thereby the probability distribution of the rendered image, where \(o\) is the ray origin, \(t\) is the time parameter, and \(v\) is the ray direction. The depth map \(d\) and density map \(\rho\) are computed from the neural radiance field. The text prompt \(y\) and the depth map \(d\) are input to the pre-trained diffusion model and the ControlNet network to obtain the probability distribution of the generated image. The rendered image is input to the semantic segmentation network SAM to obtain the subject semantic mask \(M\).
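The pixel-wise ray rendering described above can be sketched as follows, under standard NeRF quadrature assumptions; `nerf`, returning per-point density and color, is an assumed callable, and the returned density samples are reused by the depth and density quadrature sketched later.

```python
import torch

def render_rays(nerf, rays_o, rays_d, t_near, t_far, n_samples=64):
    """Render colors along rays r(t) = o + t*v by weighting the colors of
    the sample points (standard NeRF quadrature; illustrative sketch)."""
    t = torch.linspace(t_near, t_far, n_samples)                  # time samples
    pts = rays_o[:, None, :] + t[None, :, None] * rays_d[:, None, :]
    sigma, rgb = nerf(pts, rays_d)                                # density, color
    delta = t[1] - t[0]
    alpha = 1.0 - torch.exp(-sigma * delta)                       # per-bin opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], -1),
        dim=-1)[:, :-1]                                           # transmittance
    weights = trans * alpha                                       # color weights
    color = (weights[..., None] * rgb).sum(dim=1)                 # rendered pixels
    return color, sigma, t
```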
S3, constructing a total loss function comprising a variational score distillation loss function, a geometric optimization loss function and a floating artifact loss function, wherein:
Through the variational score distillation loss function, this embodiment makes the expectation of the KL divergence between the probability distributions of the rendered image and the target image not higher than the set first loss threshold.
In a specific embodiment, the variational score distillation loss function is constructed based on the expectation of the KL divergence between the probability distributions of the target image and the rendered image under different camera poses and at different diffusion steps. In view of optimization complexity, this embodiment updates the parameters of the neural radiance field through the variational score distillation loss function so that the noisy rendered image distribution approximates the noisy target image distribution.
The variational score distillation loss function \(\mathcal{L}_{\mathrm{VSD}}\) provided by the embodiment of the invention is:

\[\mathcal{L}_{\mathrm{VSD}}=\mathbb{E}_{s\sim\mathcal{U}(s_{\min},s_{\max}),\,c}\!\left[\omega(s)\,D_{\mathrm{KL}}\!\left(q_{s}^{\theta}\!\left(x_{s}\mid c\right)\;\big\|\;p_{s}\!\left(x_{s}\mid y\right)\right)\right]\]

where the expectation is taken over the diffusion step \(s\) and the camera pose \(c\); \(s_{\min}\) and \(s_{\max}\) are hyperparameters of the variational score distillation loss function; \(\omega(s)\) is the weight at the \(s\)-th diffusion step in the loss function; \(D_{\mathrm{KL}}\) is the KL divergence function; \(q_{s}^{\theta}(x_{s}\mid c)\) is the probability distribution of the noisy rendered image \(x_{s}\) at the \(s\)-th diffusion step; \(p_{s}(x_{s}\mid y)\) is the probability distribution of the noisy target image at the \(s\)-th diffusion step; \(c\) is the camera pose; \(y\) is the text prompt; \(\mu(\theta\mid y)\) is the probability distribution over the parameterized three-dimensional representation \(\theta\) given all possible generation paths for \(y\); and the noisy rendered image at the \(s\)-th diffusion step is \(x_{s}=\alpha_{s}x+\sigma_{s}\epsilon\) with \(\epsilon\sim\mathcal{N}(0,I)\), where \(I\) is the identity tensor and \(\mathcal{N}\) denotes a Gaussian distribution. Using the variational score distillation (VSD, Variational Score Distillation) method, the three-dimensional representation \(\theta\sim\mu(\theta\mid y)\) is updated so that the expectation of the KL divergence (Kullback-Leibler Divergence) between the probability distributions of the noisy rendered image and the noisy target image is not higher than the set first loss threshold, thereby updating the neural radiance field parameters \(\theta\).
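In code, this update is commonly realized, as in ProlificDreamer-style VSD implementations, by backpropagating the difference between the pre-trained model's noise prediction and that of a camera-conditioned LoRA score estimator through the renderer. A minimal sketch under those assumptions, with all names illustrative:

```python
import torch

def vsd_step(render_fn, pretrain_eps, lora_eps, y_emb, c_emb,
             alphas, sigmas, w, s_min=20, s_max=980):
    """One VSD update: grad = w(s) * (eps_pretrained - eps_lora) * dx/dtheta."""
    x = render_fn()                                  # x = g(theta, c), keeps grad
    s = int(torch.randint(s_min, s_max, (1,)))       # random diffusion step
    eps = torch.randn_like(x)                        # Gaussian noise
    xs = alphas[s] * x + sigmas[s] * eps             # noisy rendered image
    with torch.no_grad():
        grad = w[s] * (pretrain_eps(xs, s, y_emb) - lora_eps(xs, s, y_emb, c_emb))
    x.backward(gradient=grad)                        # flows into theta
```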
Through the geometric optimization loss function, this embodiment makes the expectation of the KL divergence between the probability distributions of the generated image and the target image not higher than the second loss threshold.
In a specific embodiment, the geometric optimization loss function is constructed based on the expectation of the KL divergence between the probability distributions of the target image and the generated image at different diffusion steps; the generated image is made to approach the target image through the geometric optimization loss function to update the depth map, and the parameters of the neural radiance field are updated by updating the depth map.
The geometric optimization loss function \(\mathcal{L}_{\mathrm{geo}}\) provided by this embodiment is:

\[\mathcal{L}_{\mathrm{geo}}=\mathbb{E}_{s\sim\mathcal{U}(s_{\min},s_{\max})}\!\left[\omega(s)\,D_{\mathrm{KL}}\!\left(p_{s}\!\left(x_{s}\mid y,d\right)\;\big\|\;p_{s}\!\left(x_{s}\mid y\right)\right)\right]\]

where the expectation is taken over the diffusion step \(s\); \(s_{\min}\) and \(s_{\max}\) are hyperparameters of the geometric optimization loss function; \(\omega(s)\) is the weight of the \(s\)-th diffusion step; \(p_{s}(x_{s}\mid y,d)\) is the probability distribution of the noisy generated image of the pre-trained diffusion model at the \(s\)-th diffusion step conditioned on the depth map \(d\) and the text prompt \(y\); and \(p_{s}(x_{s}\mid y)\) is the probability distribution of the noisy target image at the \(s\)-th diffusion step. The distribution of images generated by the pre-trained diffusion model conditioned on the depth map \(d\) and the text prompt \(y\) characterizes pictures consistent with the current geometry of the neural radiance field, while the distribution of target images generated without the depth condition characterizes all pictures. When the neural radiance field forms the correct geometry, \(p_{s}(x_{s}\mid y,d)\) and \(p_{s}(x_{s}\mid y)\) differ little. On this basis, geometric optimization of the neural radiance field without depth estimation is performed: the depth map \(d\) is optimized to make the expectation of the KL divergence between the (noisy) generated-image distribution of the depth-conditioned diffusion model and the (noisy) generated-image distribution of the diffusion model without the depth condition less than the set second loss threshold.
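One way to realize this term in code is a surrogate that penalizes the discrepancy between the depth-conditioned (ControlNet) and depth-free noise predictions, letting gradients reach the depth map through the ControlNet condition. This is a sketch under that assumption, not necessarily the patent's exact estimator; `controlnet_eps` and `pretrain_eps` are assumed wrappers.

```python
import torch

def geometry_loss(depth_map, x_target, pretrain_eps, controlnet_eps,
                  y_emb, alphas, sigmas, w, s_min=20, s_max=980):
    """Surrogate for the KL term: small when the depth-conditioned generated
    distribution matches the depth-free target distribution."""
    s = int(torch.randint(s_min, s_max, (1,)))
    eps = torch.randn_like(x_target)
    xs = alphas[s] * x_target + sigmas[s] * eps          # noisy image at step s
    eps_depth = controlnet_eps(xs, s, y_emb, depth_map)  # depends on depth map d
    with torch.no_grad():
        eps_free = pretrain_eps(xs, s, y_emb)            # depth-free reference
    return w[s] * ((eps_depth - eps_free) ** 2).mean()
```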
The depth map \(d\) of the rendered view provided by the embodiment of the invention is:

\[d(r)=\int_{t_{n}}^{t_{f}}\sigma\!\left(r(t)\right)\,t\,\mathrm{d}t\]

where \(\sigma(r(t))\) is the density value at the ray point \(r(t)\) under camera pose \(c\), and \(t_{n}\) and \(t_{f}\) are, respectively, the start and end points of the ray's passage through the object.
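Discretized with uniform bins, this integral, together with the accumulated density used below as the density map \(\rho\), becomes a simple sum over the per-ray density samples returned by the rendering sketch above (an assumed quadrature, for illustration):

```python
def depth_and_density(sigma, t):
    """Quadrature for d = integral of sigma(r(t)) * t dt and
    rho = integral of sigma(r(t)) dt over [t_n, t_f], uniform bins."""
    delta = t[1] - t[0]                                 # uniform bin width
    depth = (sigma * t[None, :] * delta).sum(dim=-1)    # depth map d
    density = (sigma * delta).sum(dim=-1)               # density map rho
    return depth, density
```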
Generating three-dimensional objects from non-realistic pictures produces floating artifacts, which must be suppressed. Depth estimation of non-realistic pictures with conventional depth estimators (e.g., the Dense Prediction Transformer, DPT) suffers a high error rate, causing incorrect gradient propagation in existing radiance-field methods whose losses depend on depth information; when training on non-realistic pictures, the neural radiance field therefore generates a large number of floating artifacts. In the initial training stage of the neural radiance field (iteration round \(n\le N_{\mathrm{start}}\), where \(N_{\mathrm{start}}\) is the round hyperparameter and \(n\) the iteration round), this embodiment trains with the variational score distillation loss function and the geometric optimization loss function described above; in this stage the neural radiance field mainly forms a three-dimensional object prototype. Afterwards, a floating artifact loss function based on subject recognition is additionally introduced to eliminate the generated floating artifacts and guide the subject density to grow. Specifically, when \(n>N_{\mathrm{start}}\), the density map \(\rho\) corresponding to the rendered image of the neural radiance field is computed as:

\[\rho(r)=\int_{t_{n}}^{t_{f}}\sigma\!\left(r(t)\right)\,\mathrm{d}t\]
Subject semantic segmentation is performed on the generated image of the pre-trained diffusion model using the semantic segmentation network SAM to form a subject semantic mask \(M\). A floating artifact loss function \(\mathcal{L}_{\mathrm{float}}\) is defined to suppress the density of non-subject matter outside the semantic mask, thereby eliminating floating artifacts, and to encourage density growth within the semantic mask to increase the model convergence rate.
The floating artifact loss function \(\mathcal{L}_{\mathrm{float}}\) provided by the embodiment of the invention is:

\[\mathcal{L}_{\mathrm{float}}=\mathbb{E}_{c}\!\left[f(\rho)\odot(1-M)+g(\rho)\odot M\right]\]

where \(c\) is the camera pose, \(\rho\) is the density map, \(M\) is the subject semantic mask, \(\odot\) is the element-wise product, \(f\) is an increasing function on \([0,1]\) (e.g., \(f(\rho)=1-e^{-\lambda\rho}\)), and \(g\) is a decreasing function on \([0,1]\) (e.g., \(g(\rho)=e^{-\lambda\rho}\)), where \(\lambda\) is a hyperparameter. Through the floating artifact loss function, the expectation of the loss value over the density map and the subject semantic mask is made not higher than the third loss threshold.
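A direct sketch of this loss with the example choices of \(f\) and \(g\) given above; the mean reduction over pixels is an assumption for illustration:

```python
import torch

def floating_artifact_loss(density, mask, lam=1.0):
    """Suppress density outside the subject mask, encourage it inside.
    f(rho) = 1 - exp(-lam * rho) is increasing on [0, 1];
    g(rho) = exp(-lam * rho) is decreasing on [0, 1]."""
    f = 1.0 - torch.exp(-lam * density)        # penalized outside the mask
    g = torch.exp(-lam * density)              # penalized inside the mask
    return (f * (1.0 - mask) + g * mask).mean()
```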
The total loss function \(\mathcal{L}\) provided by the embodiment of the invention is:

\[\mathcal{L}=\lambda_{\mathrm{VSD}}\,\mathcal{L}_{\mathrm{VSD}}+\lambda_{\mathrm{float}}\,\mathcal{L}_{\mathrm{float}}+\lambda_{\mathrm{geo}}\,\mathcal{L}_{\mathrm{geo}}\]

where \(\lambda_{\mathrm{VSD}}\), \(\lambda_{\mathrm{float}}\) and \(\lambda_{\mathrm{geo}}\) are the weights of the variational score distillation loss function, the floating artifact loss function and the geometric optimization loss function, respectively.
S4, training a group of neural radiance fields through the training system using the total loss function based on the text prompt, obtaining a plurality of final neural radiance fields.
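Putting the pieces together, a schematic training loop over the group of radiance fields might look as follows; `vsd_term`, `geo_term` and `float_term` are assumed to wrap the loss sketches above, and the weights and `n_start` round hyperparameter are illustrative values, not the patent's settings.

```python
def train_radiance_fields(fields, opts, sample_pose, vsd_term, geo_term,
                          float_term, n_iters=10000, n_start=2000,
                          w_vsd=1.0, w_float=0.5, w_geo=0.5):
    """Train a group of radiance fields with the total loss; the floating
    artifact term is enabled only after the prototype stage (it > n_start)."""
    for it in range(n_iters):
        for field, opt in zip(fields, opts):
            c = sample_pose()                            # random camera pose
            loss = w_vsd * vsd_term(field, c) + w_geo * geo_term(field, c)
            if it > n_start:                             # round hyperparameter
                loss = loss + w_float * float_term(field, c)
            loss.backward()
            opt.step(); opt.zero_grad()
    return fields
```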
S5, randomly sampling one final neural radiance field from the plurality of final neural radiance fields; inputting any camera pose into this final field yields a rendered image of the three-dimensional object from that viewpoint. Based on the final neural radiance field, the signed distance function (Signed Distance Function, SDF) values corresponding to the three-dimensional representation can be computed, and the Deep Marching Tetrahedra technique (DMTet) can be used to generate the geometric mesh of the three-dimensional object from the signed distance function values.
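As a rough illustration of this last step, the sketch below queries a trained field on a grid, turns density into a signed-distance-like level set, and extracts a mesh. Marching cubes (scikit-image) stands in here for the Deep Marching Tetrahedra step, and `field.density` and the threshold `tau` are assumptions.

```python
import numpy as np
from skimage import measure

def extract_mesh(field, grid_n=128, tau=10.0, bound=1.0):
    """Convert the radiance field's density to an SDF-like level set and
    extract a triangle mesh (marching cubes as a stand-in for DMTet)."""
    xs = np.linspace(-bound, bound, grid_n)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
    sigma = field.density(grid.reshape(-1, 3)).reshape(grid_n, grid_n, grid_n)
    level_set = tau - sigma                    # negative inside the object
    verts, faces, _, _ = measure.marching_cubes(level_set, level=0.0)
    return verts, faces
```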
The above embodiment is intended only to illustrate the present invention and is not to be construed as limiting it. Modifications to this embodiment that persons skilled in the art can make without creative contribution after reading this specification are all protected by patent law within the scope of the claims of the present invention.

Claims (8)

1. The three-dimensional object generation method based on the non-realistic picture is characterized by comprising the following steps of:
fine-tuning the basic diffusion model based on the non-realistic picture set to obtain a pre-trained diffusion model;
constructing a training system comprising the pre-trained diffusion model, a neural radiance field, a ControlNet network and a semantic segmentation network, wherein a text prompt is input into the pre-trained diffusion model to obtain the probability distribution of a target image; the probability distribution of a rendered image and the depth map and density map corresponding to the rendered image are obtained from the neural radiance field; based on the text prompt, the ControlNet network conditions the pre-trained diffusion model on the depth map to obtain the probability distribution of a generated image; and the semantic segmentation network performs semantic segmentation on the rendered image to obtain a subject semantic mask;
constructing a total loss function comprising a variational score distillation loss function, a geometric optimization loss function and a floating artifact loss function, wherein:
the variational score distillation loss function makes the expectation of the KL divergence between the probability distributions of the rendered image and the target image not higher than a first loss threshold; the geometric optimization loss function makes the expectation of the KL divergence between the probability distributions of the generated image and the target image not higher than a second loss threshold; and the floating artifact loss function makes the expectation of the loss value over the density map and the subject semantic mask not higher than a third loss threshold;
training a group of neural radiance fields through the training system based on the text prompt and the camera poses using the total loss function, obtaining a plurality of final neural radiance fields;
inputting a camera pose into one final neural radiance field randomly selected from the plurality of final neural radiance fields to obtain a rendered image of the three-dimensional object.
2. The method for generating a three-dimensional object based on a non-realistic picture according to claim 1, wherein the geometric optimization loss function is constructed based on the expectation of the KL divergence between the probability distributions of the target image and the generated image at different diffusion steps, the generated image is made to approach the target image through the geometric optimization loss function so as to update the depth map, and the parameters of the neural radiance field are updated by updating the depth map.
3. The method for generating a three-dimensional object based on a non-realistic picture according to claim 1, wherein the floating artifact loss function is constructed based on the density maps and subject semantic masks under different camera poses, the density map is updated through the floating artifact loss function, and the parameters of the neural radiance field are updated by updating the density map, so that floating artifacts of the rendered image are removed.
4. The method for generating a three-dimensional object based on a non-realistic picture according to claim 1, wherein the variational score distillation loss function is constructed based on the expectation of the KL divergence between the probability distributions of the target image and the rendered image under different camera poses and at different diffusion steps, and the parameters of the neural radiance field are updated by making the rendered image approach the target image through the variational score distillation loss function.
5. The method for generating a three-dimensional object based on a non-realistic picture according to claim 1, wherein when the training iteration round reaches a set round hyperparameter, the density map is updated through the floating artifact loss function, and the parameters of the neural radiance field are updated by updating the density map, so that floating artifacts of the three-dimensional object are removed.
6. The method for generating a three-dimensional object based on a non-realistic picture according to claim 1, wherein fine-tuning the diffusion model based on the non-realistic picture set to obtain the pre-trained diffusion model comprises:
using the non-realistic pictures as a pre-training data set, adding an up/down-sampling multi-layer perceptron on a bypass of the basic diffusion model, and training the multi-layer perceptron on the pre-training data set through an image synthesis technique to obtain the pre-trained diffusion model.
7. The method for generating a three-dimensional object based on non-realistic pictures according to claim 6, wherein adding the up/down-sampling multi-layer perceptron on the bypass of the basic diffusion model comprises:
adding the up/down-sampling multi-layer perceptron on the bypass of the basic diffusion model using a low-rank matrix fine-tuning method, and, based on the text prompt, superposing the outputs of the trained basic diffusion model and of the up/down-sampling multi-layer perceptron to obtain the target image.
8. The method for generating a three-dimensional object based on a non-realistic picture according to claim 1, wherein one final neural radiance field is randomly sampled from the plurality of final neural radiance fields, the signed distance function values corresponding to the three-dimensional object are obtained from the randomly sampled final neural radiance field, and the geometric mesh of the three-dimensional object is generated using the Deep Marching Tetrahedra technique based on the signed distance function values.
CN202311070901.3A 2023-08-24 2023-08-24 Three-dimensional object generation method based on non-realistic picture Active CN116778061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311070901.3A CN116778061B (en) 2023-08-24 2023-08-24 Three-dimensional object generation method based on non-realistic picture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311070901.3A CN116778061B (en) 2023-08-24 2023-08-24 Three-dimensional object generation method based on non-realistic picture

Publications (2)

Publication Number Publication Date
CN116778061A CN116778061A (en) 2023-09-19
CN116778061B true CN116778061B (en) 2023-10-27

Family

ID=87986385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311070901.3A Active CN116778061B (en) 2023-08-24 2023-08-24 Three-dimensional object generation method based on non-realistic picture

Country Status (1)

Country Link
CN (1) CN116778061B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237542B (en) * 2023-11-10 2024-02-13 中国科学院自动化研究所 Three-dimensional human body model generation method and device based on text


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023080921A1 (en) * 2021-11-03 2023-05-11 Google Llc Neural radiance field generative modeling of object classes from single two-dimensional views
WO2023093186A1 (en) * 2022-06-15 2023-06-01 之江实验室 Neural radiation field-based method and apparatus for constructing pedestrian re-identification three-dimensional data set
CN115393410A (en) * 2022-07-18 2022-11-25 华东师范大学 Monocular view depth estimation method based on nerve radiation field and semantic segmentation
CN116563459A (en) * 2023-04-13 2023-08-08 北京航空航天大学 Text-driven immersive open scene neural rendering and mixing enhancement method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
3D semantic scene restoration network (三维语义场景复原网络); 林金花 (Lin Jinhua); 王延杰 (Wang Yanjie); Optics and Precision Engineering (光学精密工程), (05); full text *

Also Published As

Publication number Publication date
CN116778061A (en) 2023-09-19

Similar Documents

Publication Publication Date Title
CN108230329B (en) Semantic segmentation method based on multi-scale convolution neural network
CN116778061B (en) Three-dimensional object generation method based on non-realistic picture
CN111386536A (en) Semantically consistent image style conversion
CN111985376A (en) Remote sensing image ship contour extraction method based on deep learning
CN109583509B (en) Data generation method and device and electronic equipment
CN113888689A (en) Image rendering model training method, image rendering method and image rendering device
US20230281913A1 (en) Radiance Fields for Three-Dimensional Reconstruction and Novel View Synthesis in Large-Scale Environments
CN109461177B (en) Monocular image depth prediction method based on neural network
CN108898639A (en) A kind of Image Description Methods and system
US20230177822A1 (en) Large scene neural view synthesis
US11010948B2 (en) Agent navigation using visual inputs
CN115797571A (en) New visual angle synthesis method of 3D stylized scene
CN109741378A (en) Multimodal medical image registration method, apparatus, platform and medium based on MRF model
KR20200063368A (en) Unsupervised stereo matching apparatus and method using confidential correspondence consistency
KR101602593B1 (en) Method and arrangement for 3d model morphing
US11403807B2 (en) Learning hybrid (surface-based and volume-based) shape representation
Zhu et al. FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting
CN111161384B (en) Path guiding method of participation medium
CN117333637B (en) Modeling and rendering method, device and equipment for three-dimensional scene
Li et al. Is Synthetic Data From Diffusion Models Ready for Knowledge Distillation?
CN117095132B (en) Three-dimensional reconstruction method and system based on implicit function
CN106407932A (en) Handwritten number recognition method based on fractional calculus and generalized inverse neural network
KR20230167746A (en) Method and system for generating polygon meshes approximating surfaces using root-finding and iteration for mesh vertex positions
CN116363320A (en) Training of reconstruction model and three-dimensional model reconstruction method, device, equipment and medium
Xia et al. Vecfontsdf: Learning to reconstruct and synthesize high-quality vector fonts via signed distance functions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant