WO2023225891A1 - Neural rendering method based on multi-resolution network structure - Google Patents

Neural rendering method based on multi-resolution network structure Download PDF

Info

Publication number
WO2023225891A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural
resolution
images
texture
image
Prior art date
Application number
PCT/CN2022/094877
Other languages
French (fr)
Chinese (zh)
Inventor
周昆
吴鸿智
任重
马晟杰
Original Assignee
浙江大学
杭州相芯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学, 杭州相芯科技有限公司 filed Critical 浙江大学
Priority to PCT/CN2022/094877 priority Critical patent/WO2023225891A1/en
Publication of WO2023225891A1 publication Critical patent/WO2023225891A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/04Texture mapping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image

Definitions

  • the present invention relates to the field of image-based rendering and material capture and modeling, and in particular to a method of synthesizing images of modeling objects under new viewing angles and new lighting conditions.
  • Relighting technology supports the digitization of real-world scenes, allowing creators to arbitrarily modify the subject's viewing angle and lighting, and synthesize new images that comply with physical laws, which has broad application prospects.
  • Existing work can be mainly divided into two categories: model-based and image-based methods.
  • Model-based methods fit a priori models to measured data and rely on the prior models to perform interpolation and extrapolation to new viewing angles and lighting conditions.
  • However, the reconstruction quality of this type of method has major limitations: the a priori model is usually designed by hand and cannot perfectly explain all measurement data, and the reconstruction quality is also strongly affected by several factors, such as the accuracy of the geometric model and of the camera calibration, which in turn affects the reliability of the fitted parameters.
  • Image-based methods do not rely on a priori models but adopt a more direct, data-driven approach, so their estimation accuracy is not affected by these factors. In recent years, with the development of deep learning, the quality of images synthesized by such data-driven methods has improved greatly. However, although they can synthesize realistic-looking images, existing methods still suffer from blurred high-frequency details and poor temporal stability.
  • The purpose of the present invention is to provide a neural rendering method based on a multi-resolution network structure that addresses the shortcomings of existing relighting techniques. It solves the problems of blurred high-frequency details and poor temporal stability, reaches the state of the art in relighting, and has high practical value.
  • Image acquisition and preprocessing: take images of the object to be modeled under different viewing angles and lighting, process the data and obtain camera parameters and light source positions, proxy geometry and neural texture, foreground mattes, radiance cues and UV maps;
  • step 1 includes the following sub-steps:
  • Collect images: use two cameras to shoot the object to be modeled simultaneously in a dark environment, where one camera also provides the lighting and keeps its flash permanently on while the other keeps its flash off; this yields two associated image sequences of the object to be modeled; an additional image sequence of the object, taken with a single camera under natural lighting, is used only to generate the proxy geometry;
  • the material includes an ideal diffuse surface model and four Cook-Torrance BRDF models with roughnesses of 0.02, 0.05, 0.13 and 0.34.
  • the rendering process is implemented by a physically based path tracing renderer.
  • step (2) includes the following sub-steps:
  • (2.1) Define training data: the radiance cues, UV map and foreground matte corresponding to each frame form one set of training data;
  • the radiance cues and UV map are used as inputs to the neural rendering pipeline model, and the foreground matte is the fitting target;
  • The sampling module takes the UV map obtained in step (1.6) as input; for each pixel, the value in the UV map is used as a coordinate at which the neural texture described in step (1.3) is sampled, yielding the projected neural texture.
  • The neural network model takes the concatenation of the projected neural texture and the radiance cues described in step (1.5) as input and generates a set of multi-resolution representations; at each level of the multi-resolution representation, a feature transformation module processes the input into intermediate features, and a post-processing module turns the intermediate features into an output image of the corresponding resolution; the intermediate features are also passed through an upsampling module to the next, higher-resolution level, where they are concatenated with that level's representation to form its input; a set of spatial filters is likewise applied to the foreground matte to generate a set of multi-resolution representations that serve as the fitting targets for the output images at each resolution level;
  • The feature transformation module has the following structure: first a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, then an instance normalization layer, and finally a rectified linear (ReLU) activation layer;
  • The post-processing module is a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1;
  • The upsampling module has the following structure: first a nearest-neighbour upsampling operation that doubles the resolution, then a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, an instance normalization layer, and finally a ReLU activation layer;
  • N represents the total number of training images
  • L represents the total number of layers of multi-resolution representation
  • θ_T represents the neural texture parameters
  • i is the picture serial number
  • l represents the resolution level serial number
  • λ_l represents the weighting factor of the loss at different resolution levels.
  • The present invention is the first to use multi-resolution representations, an effective prior model structure, in the field of relighting.
  • Compared with a traditional neural network, the multi-resolution neural network explicitly separates the different spatial frequency components, reducing potential mutual interference, and imposes additional regularizing constraints on the different resolution levels, making the synthesized image sequence more stable in the time domain; and because of the independent high-frequency processing modules, it solves the high-frequency loss caused in traditional methods by mixing the encoding of different frequency components, so the synthesized image retains more detailed textures and achieves higher fidelity.
  • This method matches the most advanced relighting techniques and can be used in applications such as e-commerce, digital preservation of cultural relics, virtual reality and augmented reality.
  • Figure 1 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the first captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;
  • Figure 2 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the second captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;
  • Figure 3 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the third captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;
  • Figure 4 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the fourth captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;
  • Figure 5 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the fifth captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image.
  • The core of the invention is a novel multi-resolution neural network: given the viewing angle, illumination and proxy geometry, the projected neural texture and radiance cues are first synthesized as the network input, and the input is then processed by the multi-resolution network into the final synthesized image.
  • This multi-resolution network structure is superior to other existing network structures in terms of image detail and the temporal stability of synthesized animations. The method consists of three main steps: image acquisition and preprocessing, construction and training of the neural rendering pipeline model, and generation of new images and animations.
  • The invention follows the neural relighting algorithm of Gao et al. (Duan Gao, Guojun Chen, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. 2020. Deferred neural lighting: free-viewpoint relighting from unstructured photographs.
  • ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–15) to collect images of the object to be modeled under different lighting and viewing angles. Specifically: ensure the capture site is completely dark with no interference from other light sources, and use two cameras (one camera C1 with its flash permanently on, the other C2 with its flash off) to record videos of the object to be modeled, denoted sequences A1 and A2.
  • the two cameras should move in a certain pattern around the object to be modeled to ensure that the image covers a variety of viewing angles and lighting combinations.
  • the typical number of image acquisitions is several thousand, determined by the geometric and material complexity of the object. The higher the complexity, the greater the number of images required.
  • The additional sequence captured under natural lighting, used only to generate the proxy geometry, requires only a few dozen images.
  • The present invention runs a multi-view stereo algorithm on sequences A1 and A2 (Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. 2006. A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. In CVPR. 519–528.), calibrates the camera intrinsics, and obtains the trajectories and poses of the two cameras (P1 and P2) over the entire capture. Because the light source is rigidly attached to one of the cameras (C1), the trajectory of the light source during capture (namely P1) is obtained at the same time.
  • The present invention runs the COLMAP algorithm (Johannes Lutz Schönberger and Jan-Michael Frahm. 2016. Structure-from-Motion Revisited. In CVPR.) to obtain an inaccurate geometric model of the object to be modeled, called the proxy geometry.
  • The present invention then uses a UV unwrapping algorithm to generate UV coordinates for the vertices of the proxy geometry, and binds to the model a randomly initialized texture map with a resolution of 512*512 and 16 channels. Because this map is optimized jointly with the neural network parameters, it is called the neural texture.
  • For the optimized neural texture, see Figures 1-5(b).
  • For each image, the present invention draws the proxy geometry with a rasterization shader using that frame's camera parameters, marks the region around the object silhouette as undetermined using dilation and erosion operations, and finally runs the closed-form matting algorithm (Anat Levin, Dani Lischinski, and Yair Weiss. 2008. A Closed-Form Solution to Natural Image Matting. IEEE PAMI 30, 2 (Feb 2008), 228–242.) to obtain the foreground mask. Training the neural network model only requires sequence A2, so the present invention generates only the corresponding mask sequence M2 for sequence A2; for each image in the sequence, the product of the image and its mask is computed, finally yielding the background-free foreground matte sequence A′2.
  • the present invention builds a physically based path tracing renderer (NVidia OptiX framework).
  • The invention assigns five different materials to the proxy geometry: an ideal diffuse model (Lambertian BRDF) and Cook-Torrance models (Cook-Torrance BRDF) with roughnesses of 0.02, 0.05, 0.13 and 0.34.
  • For each frame, given the light source position and camera parameters, the path tracing renderer draws 5 images, one per preset material.
  • The five images are stacked together into a three-dimensional tensor called the radiance cues, as shown in Figures 1-5(a).
  • The radiance cue sequence obtained from A′2 is denoted R2.
  • For each frame of the training image sequence A′2, the present invention applies a rasterization shader, combined with that frame's camera parameters, to draw the proxy geometry into screen space.
  • Using the vertex UV coordinates described in 1.3, the corresponding UV coordinate is interpolated and filled in for each screen pixel, yielding the UV map (see Figures 1-5(c)); the UV map sequence is denoted U2.
  • The present invention uses the UV map obtained in step 1.6 as input; for each pixel, the value in the UV map is used as a coordinate at which the neural texture described in step 1.3 is sampled, yielding a three-dimensional tensor called the projected neural texture.
  • For the projected neural texture, see Figures 1-5(d). Because the neural texture is continuously updated during training, the projected neural texture must be recomputed in every iteration.
  • Multi-resolution neural network model: the multi-resolution neural network used in the present invention takes the concatenated projected neural texture and radiance cues as input; the input passes through a set of mean pooling operations with window size 2 and stride 2, producing a set of multi-resolution representations.
  • At each level of the multi-resolution representation, the input goes through a feature transformation module that outputs intermediate features, and the intermediate features go through a post-processing module that outputs an output image of the corresponding resolution.
  • The intermediate features also pass through an upsampling module, are passed to the next, finer level, concatenated with that level's representation and fed into the next feature transformation module, and so on, until the full-resolution image is finally output.
  • the feature transformation modules at each level are independent and do not share parameters with each other.
  • Multi-resolution neural networks work in the logarithmic domain to represent a larger dynamic range, so the input needs to be mapped to the logarithmic domain in advance and the network output is mapped back to linear space.
  • The present invention uses a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, followed by an instance normalization layer (Instance Normalization) and finally a rectified linear activation layer (ReLU), as the feature transformation module.
  • The upsampling module consists of a nearest-neighbour upsampling operation that doubles the resolution, followed by a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, an instance normalization layer, and finally a ReLU activation layer;
  • The complete training data of the multi-resolution neural network used in the present invention comprises the foreground matte sequence A′2, the projected neural texture generated on the fly by the neural texture sampling module described in 2.1, and the radiance cue sequence R2.
  • The multi-resolution neural network used in this invention has a corresponding image output at each resolution level, and an L1 loss function is applied to the result of each level.
  • The target image of each level can be generated from the image sequence A′2.
  • N denotes the total number of training images
  • S denotes the neural texture sampling module
  • θ_T denotes the neural texture parameters
  • λ_l denotes the weighting factor of the losses at different resolutions.
  • Synthesizing a new image or animation requires specifying the corresponding camera intrinsics, the camera trajectory and poses, and the light source trajectory and poses. Radiance cues and a UV map must then be synthesized as the input of the neural rendering pipeline model.
  • The generation of the radiance cues and UV maps is exactly analogous to the procedures described in 1.5 and 1.6; only the camera and light source parameters need to be replaced by those of the new sequence to be generated.
  • The inventors implemented an embodiment of the present invention on a server equipped with an Intel Xeon Platinum 8268 CPU and an NVidia Tesla V100 GPU (32GB), adopted all parameter values listed in the detailed description, and obtained all experimental results shown in Figures 1-5.
  • the present invention can synthesize images of modeling objects under different viewing angles and lighting conditions as well as time-domain stable image sequences (animations).
  • For a 512*512 image, the entire processing pipeline takes about 1.9 seconds: the UV map and radiance cues, generated by the rasterization shader and the path tracing renderer respectively, take about 1.4 seconds; the forward pass of the neural rendering pipeline model, implemented in TensorFlow, takes about 460 milliseconds in total, of which data IO accounts for 385 milliseconds and the network forward pass for 75 milliseconds. In addition, training a multi-resolution neural network for a specific modeled object takes 20 hours.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Image Processing (AREA)

Abstract

Disclosed in the present application is a neural rendering method based on a multi-resolution network structure. The method comprises: first photographing images of an object to be modeled under different viewing angles and illumination, obtaining camera parameters, light source positions, proxy geometry, neural textures, foreground mattes, radiance cues and UV map data, and building and training a neural rendering pipeline model; and finally photographing images under specified camera parameters and illumination conditions, processing them to obtain radiance cues and a UV map, and synthesizing a new image or animation with the neural rendering pipeline model. Compared with a traditional neural network, the present application explicitly splits the different spatial frequency components, so that the synthesized image sequence has better stability in the time domain. The present application also solves the problem of high-frequency loss caused by the mixed encoding of different frequency components in traditional methods, so that the synthesized image retains more detailed textures and achieves higher fidelity.

Description

A neural rendering method based on a multi-resolution network structure

Technical Field

The present invention relates to the field of image-based rendering and of material capture and modeling, and in particular to a method for synthesizing images of a modeled object under new viewing angles and new lighting conditions.

Background

Relighting technology supports the digitization of real-world scenes, allowing creators to arbitrarily modify the subject's viewing angle and lighting and to synthesize new, physically plausible images; it therefore has broad application prospects. Existing work can be divided into two main categories: model-based and image-based methods.

Model-based methods fit a priori models to the measured data and rely on those prior models to interpolate and extrapolate to new viewing angles and lighting conditions. However, the reconstruction quality of such methods has major limitations: the a priori model is usually designed by hand and cannot perfectly explain all measurement data, and the reconstruction quality is also strongly affected by several factors, such as the accuracy of the geometric model and of the camera calibration, which in turn affects the reliability of the fitted parameters.

Image-based methods do not rely on a priori models but adopt a more direct, data-driven approach, so their estimation accuracy is not affected by these factors. In recent years, with the development of deep learning, the quality of images synthesized by such data-driven methods has improved greatly. However, although they can synthesize realistic-looking images, existing methods still suffer from blurred high-frequency details and poor temporal stability.
Summary of the Invention

The purpose of the present invention is to provide a neural rendering method based on a multi-resolution network structure that addresses the shortcomings of existing relighting techniques. It solves the problems of blurred high-frequency details and poor temporal stability, reaches the state of the art in relighting, and has high practical value.

The object of the present invention is achieved through the following technical solution, which comprises the following steps:

(1) Image acquisition and preprocessing: take images of the object to be modeled under different viewing angles and lighting, process the data and obtain camera parameters and light source positions, proxy geometry and neural texture, foreground mattes, radiance cues and UV maps;

(2) Construction and training of the neural rendering pipeline model: build a neural rendering pipeline model comprising a neural texture sampling module and a multi-resolution neural network. The neural texture sampling module takes the UV map and the neural texture as input and generates the projected neural texture, which is then concatenated with the radiance cues and fed into the multi-resolution neural network to obtain the rendering result. The loss function and back-propagated gradients are computed between the rendering result and the corresponding real captured image, and the neural texture and the multi-resolution network parameters are jointly optimized from these gradients, thereby training the neural rendering pipeline model;

(3) Generation of new images and animations: generate the radiance cues and UV map under specified camera parameters and lighting conditions, and synthesize a new image or animation with the neural rendering pipeline model.
Further, step (1) comprises the following sub-steps:

(1.1) Collect images: use two cameras to shoot the object to be modeled simultaneously in a dark environment, where one camera also provides the lighting and keeps its flash permanently on while the other keeps its flash off; this yields two associated image sequences of the object to be modeled. An additional image sequence of the object, taken with a single camera under natural lighting, is used only to generate the proxy geometry;

(1.2) Generate camera parameters and light source positions: generate the intrinsic and extrinsic parameter sequences of the two cameras, and derive the spatial trajectory of the light source from the extrinsic parameter sequence of the lighting camera;
(1.3) Generate proxy geometry: the COLMAP algorithm (Schönberger, Johannes L., and Jan-Michael Frahm. Structure-from-motion revisited. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016; Schönberger, Johannes L., et al. Pixelwise view selection for unstructured multi-view stereo. European Conference on Computer Vision. Springer, Cham, 2016.) generates an inaccurate geometric model of the object to be modeled, called the proxy geometry; a UV unwrapping algorithm (Kun Zhou, John Snyder, Baining Guo, et al. Iso-charts: stretch-driven mesh parameterization using spectral analysis. In ACM SIGGRAPH Symposium on Geometry Processing. 2004: 45-54.) computes per-vertex UV coordinates for the proxy geometry; and an optimizable texture map, called the neural texture, is bound to the proxy geometry.
(1.4) Compute foreground mattes: for each frame of the captured image sequence, the proxy geometry is drawn to the screen using the camera parameters to obtain foreground, background and undetermined regions; on this basis, the closed-form matting algorithm is run to obtain a foreground mask. The foreground mask and the image are multiplied to obtain the foreground matte of the captured image, with the background removed, which serves as the fitting target of the algorithm.

(1.5) Generate radiance cues: for each frame of the captured image sequence, proxy-geometry images with different materials are rendered from the camera and lighting parameters, and the results are stacked together as the radiance cues.

The materials comprise an ideal diffuse surface model and four Cook-Torrance BRDF models with roughnesses of 0.02, 0.05, 0.13 and 0.34. The rendering is performed by a physically based path tracing renderer.

(1.6) Generate UV maps: for each frame of the captured image sequence, the proxy geometry is drawn to the screen using the camera parameters, and the UV coordinate of every screen pixel is interpolated from the vertex UV coordinates of the model to generate a screen-space UV map.
Further, step (2) comprises the following sub-steps:
(2.1) Define training data: the radiance cues, UV map and foreground matte corresponding to each frame form one set of training data, in which the radiance cues and UV map are used as the inputs of the neural rendering pipeline model and the foreground matte is the fitting target;
(2.2) Build the neural texture sampling module: the sampling module takes the UV map obtained in step (1.6) as input; for each pixel, the value in the UV map is used as a coordinate at which the neural texture described in step (1.3) is sampled, yielding the projected neural texture.
(2.3) Build the multi-resolution neural network: the neural network model takes the concatenation of the projected neural texture and the radiance cues described in step (1.5) as input and generates a set of multi-resolution representations. At each level of the multi-resolution representation, a feature transformation module processes the input into intermediate features, and a post-processing module turns the intermediate features into an output image of the corresponding resolution; the intermediate features are also passed through an upsampling module to the next, higher-resolution level, where they are concatenated with that level's representation to form its input. A set of spatial filters is likewise applied to the foreground matte to generate a set of multi-resolution representations that serve as the fitting targets for the output images at each resolution level;
The feature transformation module has the following structure: first a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, then an instance normalization layer, and finally a rectified linear (ReLU) activation layer;

The post-processing module is a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1;

The upsampling module has the following structure: first a nearest-neighbour upsampling operation that doubles the resolution, then a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, an instance normalization layer, and finally a ReLU activation layer;
(2.4) Define the loss function: a constraint is imposed on the output image at every multi-resolution level, and the neural texture and the multi-resolution network parameters are optimized jointly. The mathematical description is:

$$\theta_T^{*},\ \theta_{\mathcal{F}}^{*} \;=\; \arg\min_{\theta_T,\ \theta_{\mathcal{F}}}\ \sum_{i=1}^{N}\sum_{l=1}^{L} \lambda_l\, \mathcal{L}_1\!\left( a_i^{\,l},\ \hat{a}_i^{\,l} \right)$$

where $\mathcal{L}_1$ denotes the L1 loss function, N the total number of training images, L the total number of levels of the multi-resolution representation, $\mathcal{F}$ the multi-resolution neural network, $\theta_T$ the neural texture parameters, $\theta_{\mathcal{F}}$ the multi-resolution network parameters, i the image index, l the resolution-level index, $\hat{a}_i^{\,l}$ the foreground mattes at the different resolution levels used as fitting targets, $a_i^{\,l}$ the predicted images output by the network at the different resolution levels, and $\lambda_l$ the weighting factor of the loss at each resolution level.
The beneficial effects of the present invention are as follows: the present invention is the first to use multi-resolution representations, an effective prior model structure, in the field of relighting. Compared with a traditional neural network, the multi-resolution neural network explicitly separates the different spatial frequency components, reducing potential mutual interference, and imposes additional regularizing constraints on the different resolution levels, so that the synthesized image sequence is more stable in the time domain; and because of the independent high-frequency processing modules, it solves the high-frequency loss caused in traditional methods by mixing the encoding of different frequency components, so the synthesized image retains more detailed textures and achieves higher fidelity. The method matches the most advanced relighting techniques and can be used in applications such as e-commerce, digital preservation of cultural relics, virtual reality and augmented reality.
Brief Description of the Drawings

Figure 1 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the first captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;

Figure 2 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the second captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;

Figure 3 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the third captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;

Figure 4 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the fourth captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;

Figure 5 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the fifth captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image.
Detailed Description

The core of the invention is a novel multi-resolution neural network: given the viewing angle, illumination and proxy geometry, the projected neural texture and radiance cues are first synthesized as the network input, and the input is then processed by the multi-resolution network into the final synthesized image. This multi-resolution network structure is superior to other existing network structures in terms of image detail and the temporal stability of synthesized animations. The method consists of three main steps: image acquisition and preprocessing, construction and training of the neural rendering pipeline model, and generation of new images and animations.

Each step of the invention is described in detail below with reference to Figures 1-5:

Image acquisition and preprocessing

1.1 Image acquisition
The invention follows the neural relighting algorithm of Gao et al. (Duan Gao, Guojun Chen, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. 2020. Deferred neural lighting: free-viewpoint relighting from unstructured photographs. ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–15) to collect images of the object to be modeled under different lighting and viewing angles. Specifically: ensure the capture site is completely dark with no interference from other light sources, and use two cameras (one camera C1 with its flash permanently on, the other C2 with its flash off) to record videos of the object to be modeled, denoted sequences A1 and A2. During capture, the two cameras should move around the object in a fixed pattern to ensure that the images cover a wide variety of viewing-angle and lighting combinations. A typical capture contains several thousand images, determined by the geometric and material complexity of the object: the higher the complexity, the more images are required. The object to be modeled is then illuminated with natural light, and a single camera with its flash off photographs it from various angles; these photographs, denoted sequence B, are used for the subsequent generation of the proxy geometry, and only a few dozen images need to be collected.
1.2 Generating camera poses and light source positions

The present invention runs a multi-view stereo algorithm on sequences A1 and A2 (Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. 2006. A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. In CVPR. 519–528.), calibrates the camera intrinsics, and obtains the trajectories and poses of the two cameras (P1 and P2) over the entire capture. Because the light source is rigidly attached to one of the cameras (C1), the trajectory of the light source during capture (namely P1) is obtained at the same time.
1.3 Generating the proxy geometry

The present invention runs the COLMAP algorithm (Johannes Lutz Schönberger and Jan-Michael Frahm. 2016. Structure-from-Motion Revisited. In CVPR.) on sequence B to obtain an inaccurate geometric model of the object to be modeled, called the proxy geometry. The present invention then uses a UV unwrapping algorithm to generate UV coordinates for the vertices of the proxy geometry, and binds to the model a randomly initialized texture map with a resolution of 512*512 and 16 channels. Because this map is optimized jointly with the neural network parameters, it is called the neural texture; for the optimized neural texture, see Figures 1-5(b).
1.4 Computing the foreground mattes

Since capture inevitably records background objects beyond the object to be modeled, foreground masks must be generated to remove them. For each image, the present invention draws the proxy geometry with a rasterization shader using that frame's camera parameters, marks the region around the object silhouette as undetermined using dilation and erosion operations, and finally runs the closed-form matting algorithm (Anat Levin, Dani Lischinski, and Yair Weiss. 2008. A Closed-Form Solution to Natural Image Matting. IEEE PAMI 30, 2 (Feb 2008), 228–242.) to obtain the foreground mask. Training the neural network model only requires sequence A2, so the present invention generates only the corresponding mask sequence M2 for sequence A2; for each image in the sequence, the product of the image and its mask is computed, finally yielding the background-free foreground matte sequence A′2.
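A minimal sketch of the trimap construction described above, assuming OpenCV for the dilation/erosion step; the rasterized proxy-geometry mask is taken as given, and `closed_form_matting` is a placeholder for an implementation of Levin et al. 2008, not a library call:

```python
import cv2
import numpy as np

def build_trimap(proxy_mask, radius=15):
    """Turn a binary proxy-geometry mask (H, W, uint8 in {0, 1}) into a trimap:
    255 = foreground, 0 = background, 128 = undetermined band around the silhouette.
    The radius of the structuring element is an assumption, not specified in the text."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (radius, radius))
    sure_fg = cv2.erode(proxy_mask, kernel)    # shrink the mask: definitely object
    maybe = cv2.dilate(proxy_mask, kernel)     # grow the mask: outside is definitely background
    trimap = np.zeros_like(proxy_mask, dtype=np.uint8)
    trimap[maybe > 0] = 128                    # undetermined band around the silhouette
    trimap[sure_fg > 0] = 255                  # confident foreground
    return trimap

# image: a captured frame from sequence A2; proxy_mask: the proxy geometry rasterized
# with that frame's camera parameters. `closed_form_matting` stands for an external
# implementation of Levin et al. 2008 and is not provided here.
# alpha = closed_form_matting(image, build_trimap(proxy_mask))
# foreground = image * alpha[..., None]        # background removed, one element of A'2
```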
1.5 Generating radiance cues

The present invention builds a physically based path tracing renderer (on the NVidia OptiX framework). The invention assigns five different materials to the proxy geometry: an ideal diffuse model (Lambertian BRDF) and Cook-Torrance models (Cook-Torrance BRDF) with roughnesses of 0.02, 0.05, 0.13 and 0.34. For each frame of A′2, given the light source position (P1) and camera parameters (P2), the path tracing renderer draws 5 images, one per preset material. The five images are stacked together into a three-dimensional tensor called the radiance cues, see Figures 1-5(a). The radiance cue sequence obtained from A′2 is denoted R2.
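The patent names the Cook-Torrance model and its roughness values but not the exact microfacet terms; the following sketch therefore uses one common GGX/Schlick/Smith parameterization (an assumption) only to illustrate the per-material BRDF that such a path tracer would evaluate when producing the five radiance-cue renderings:

```python
import numpy as np

def cook_torrance_brdf(n, v, l, roughness, f0=0.04):
    """Point evaluation of an isotropic Cook-Torrance BRDF with a GGX distribution,
    Schlick Fresnel and Smith geometry term; these specific terms and the f0 value
    are assumptions, the patent only names the model and its roughness values."""
    h = v + l
    h = h / np.linalg.norm(h)
    nl = max(float(np.dot(n, l)), 1e-6)
    nv = max(float(np.dot(n, v)), 1e-6)
    nh = max(float(np.dot(n, h)), 1e-6)
    vh = max(float(np.dot(v, h)), 1e-6)
    a2 = roughness ** 4                                      # alpha = roughness^2, a2 = alpha^2
    d = a2 / (np.pi * (nh * nh * (a2 - 1.0) + 1.0) ** 2)     # GGX normal distribution
    f = f0 + (1.0 - f0) * (1.0 - vh) ** 5                    # Schlick Fresnel approximation
    k = (roughness + 1.0) ** 2 / 8.0                         # Schlick-GGX geometry factor
    g = (nl / (nl * (1.0 - k) + k)) * (nv / (nv * (1.0 - k) + k))
    return d * f * g / (4.0 * nl * nv)

# The five radiance-cue materials named in the text: an ideal diffuse (Lambertian)
# surface plus Cook-Torrance lobes at the four stated roughness values.
materials = [("lambertian", None)] + [("cook-torrance", r) for r in (0.02, 0.05, 0.13, 0.34)]
```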
1.6 UV map generation

For each frame of the training image sequence A′2, the present invention applies a rasterization shader, combined with that frame's camera parameters, to draw the proxy geometry into screen space. Using the vertex UV coordinates described in 1.3, the corresponding UV coordinate is interpolated and filled in for each screen pixel, yielding the UV map (see Figures 1-5(c)); the UV map sequence is denoted U2.
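A sketch of the per-pixel UV interpolation, assuming the rasterizer has already produced a per-pixel triangle index and barycentric coordinates for the proxy geometry (these inputs and all names are illustrative assumptions):

```python
import numpy as np

def screen_space_uv_map(bary, tri_ids, faces_uv, vert_uv, background=-1.0):
    """Interpolate per-vertex UVs into a screen-space UV map.

    bary: (H, W, 3) barycentric coordinates per pixel from a rasterizer.
    tri_ids: (H, W) index of the covering triangle per pixel, -1 for background.
    faces_uv: (F, 3) vertex indices per triangle; vert_uv: (V, 2) per-vertex UVs
    from the UV unwrapping step in 1.3."""
    h, w = tri_ids.shape
    uv_map = np.full((h, w, 2), background, dtype=np.float32)
    mask = tri_ids >= 0
    tri_uv = vert_uv[faces_uv[tri_ids[mask]]]            # (P, 3, 2): UVs of the 3 vertices
    uv_map[mask] = np.einsum('pk,pkc->pc', bary[mask], tri_uv)  # barycentric interpolation
    return uv_map
```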
Construction and training of the neural rendering pipeline model

2.1 Neural texture sampling module

The present invention takes the UV map obtained in step 1.6 as input; for each pixel, the value in the UV map is used as a coordinate at which the neural texture described in step 1.3 is sampled, yielding a three-dimensional tensor called the projected neural texture, see Figures 1-5(d). Because the neural texture is continuously updated during training, the projected neural texture must be recomputed in every iteration.

2.2 Multi-resolution neural network model

The multi-resolution neural network used in the present invention takes the concatenated projected neural texture and radiance cues as input; the input passes through a set of mean pooling operations with window size 2 and stride 2, producing a 5-level multi-resolution representation (mipmap). At each level of the multi-resolution representation, the input goes through a feature transformation module that outputs intermediate features, and the intermediate features go through a post-processing module that outputs an output image of the corresponding resolution. In addition, the intermediate features pass through an upsampling module, are passed to the next, finer level, concatenated with that level's representation and fed into the next feature transformation module, and so on, until the full-resolution image is finally output. The feature transformation modules of the different levels are independent and do not share parameters. The multi-resolution neural network works in the logarithmic domain to represent a larger dynamic range, so the input must be mapped to the logarithmic domain in advance and the network output must be mapped back to linear space.
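A minimal PyTorch sketch of the two operations just described; the authors' implementation is in TensorFlow, so the port, the tensor layout, the use of grid_sample and the log1p mapping are all assumptions. The UV map drives a bilinear lookup into the learnable neural texture, and repeated mean pooling builds the 5-level input pyramid:

```python
import torch
import torch.nn.functional as F

# Learnable neural texture: 16 channels at 512x512, randomly initialized (step 1.3).
neural_texture = torch.nn.Parameter(0.01 * torch.randn(1, 16, 512, 512))

def sample_neural_texture(uv_map, texture):
    """uv_map: (B, H, W, 2) with UV coordinates in [0, 1]; returns (B, 16, H, W).
    grid_sample expects coordinates in [-1, 1], so the UVs are remapped first."""
    grid = uv_map * 2.0 - 1.0
    tex = texture.expand(uv_map.shape[0], -1, -1, -1)
    return F.grid_sample(tex, grid, mode='bilinear', align_corners=False)

def build_pyramid(x, levels=5):
    """Mean pooling with window 2 and stride 2, repeated to get a 5-level mipmap
    (finest level first)."""
    pyramid = [x]
    for _ in range(levels - 1):
        pyramid.append(F.avg_pool2d(pyramid[-1], kernel_size=2, stride=2))
    return pyramid

# Assumed shapes: uv_map (B, 512, 512, 2); radiance_cues (B, 15, 512, 512), i.e. the
# five RGB renderings concatenated along channels (15 channels is an assumption).
uv_map = torch.rand(1, 512, 512, 2)
radiance_cues = torch.rand(1, 15, 512, 512)
projected = sample_neural_texture(uv_map, neural_texture)
radiance_cues_log = torch.log1p(radiance_cues)          # log-domain mapping (log1p is an assumption)
net_input = torch.cat([projected, radiance_cues_log], dim=1)
pyramid = build_pyramid(net_input, levels=5)
```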
The present invention uses a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, followed by an instance normalization layer (Instance Normalization) and finally a rectified linear activation layer (ReLU), as the feature transformation module. The upsampling module consists of a nearest-neighbour upsampling operation that doubles the resolution, followed by a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, an instance normalization layer, and finally a ReLU activation layer.
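The three sub-modules, sketched in PyTorch under the same porting assumption; the 128 output channels, 3x3 kernels, stride 1, instance normalization and ReLU follow the text, while the padding and input channel counts are assumptions:

```python
import torch.nn as nn

def feature_transform(in_ch):
    """Conv(128, 3x3, stride 1) -> InstanceNorm -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 128, kernel_size=3, stride=1, padding=1),
        nn.InstanceNorm2d(128),
        nn.ReLU(inplace=True),
    )

def post_process(in_ch=128):
    """Single Conv(128, 3x3, stride 1) producing the per-level output image
    (128 output channels, as stated in the text)."""
    return nn.Conv2d(in_ch, 128, kernel_size=3, stride=1, padding=1)

def upsample(in_ch=128):
    """Nearest-neighbour 2x upsampling -> Conv(128, 3x3, stride 1) -> InstanceNorm -> ReLU."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, 128, kernel_size=3, stride=1, padding=1),
        nn.InstanceNorm2d(128),
        nn.ReLU(inplace=True),
    )
```

Per the text, the feature transformation modules of the different levels are independent, so one `feature_transform` instance would be created per pyramid level rather than shared.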
2.3 Network training

The complete training data of the multi-resolution neural network used in the present invention comprises the foreground matte sequence A′2, the projected neural texture generated on the fly by the neural texture sampling module described in 2.1, and the radiance cue sequence R2. The training data are appropriately cropped and stretched to a resolution of 512*512. The multi-resolution neural network used in this invention has a corresponding image output at each resolution level, and an L1 loss function is applied to the result of each level. The target image of each level can be generated from the image sequence A′2: a single image of the sequence A′2 is denoted $\hat{a}_i$, where i is the image index and a superscript denotes the resolution level; a set of mean pooling operations with window size 2 and stride 2 generates a 5-level multi-resolution representation $\{\hat{a}_i^{\,l}\}$. The UV map described in 1.6 is denoted $u_i$, where i is the image index. The parameters of the neural texture and of the multi-resolution neural network are optimized jointly; the mathematical description of the training process is as follows:

$$R_2 = \{\, r_i \mid i = 1, 2, \ldots, N \,\}$$

$$U_2 = \{\, u_i \mid i = 1, 2, \ldots, N \,\}$$

$$A'_2 = \{\, \hat{a}_i \mid i = 1, 2, \ldots, N \,\}$$

$$\{\, a_i^{\,l} \mid l = 1, \ldots, 5 \,\} = \mathcal{F}\!\left( S(u_i; \theta_T),\ r_i;\ \theta_{\mathcal{F}} \right)$$

$$\theta_T^{*},\ \theta_{\mathcal{F}}^{*} = \arg\min_{\theta_T,\ \theta_{\mathcal{F}}}\ \sum_{i=1}^{N} \sum_{l=1}^{5} \lambda_l\, \mathcal{L}_1\!\left( a_i^{\,l},\ \hat{a}_i^{\,l} \right)$$

where $\mathcal{L}_1$ denotes the L1 loss function, N the total number of training images, S the neural texture sampling module, $\mathcal{F}$ the multi-resolution neural network, $\theta_T$ the neural texture parameters, $\theta_{\mathcal{F}}$ the multi-resolution network parameters, $\{a_i^{\,l}\}$ the set of five multi-resolution predicted images output by the network, and $\lambda_l$ the weighting factor of the loss at each resolution level; the weight of the finest resolution level is set to 1 and the remaining levels to 0.01.
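A hedged sketch of the per-level L1 objective under the same assumptions (PyTorch rather than the authors' TensorFlow); the model is assumed to return its five predictions finest level first, matching the pyramid builder above:

```python
import torch
import torch.nn.functional as F

def multi_resolution_l1_loss(predictions, target, levels=5):
    """predictions: list of 5 images, finest resolution first.
    target: full-resolution foreground matte (B, C, H, W) from A'2; its coarser
    versions are built with the same window-2, stride-2 mean pooling used for
    the inputs. The finest level is weighted 1.0, all other levels 0.01."""
    targets = [target]
    for _ in range(levels - 1):
        targets.append(F.avg_pool2d(targets[-1], kernel_size=2, stride=2))
    weights = [1.0] + [0.01] * (levels - 1)
    return sum(w * F.l1_loss(p, t) for w, p, t in zip(weights, predictions, targets))

# Joint optimization of the neural texture and the network weights; `model` and
# `matte` are placeholders for the trained pipeline and a target image:
# optimizer = torch.optim.Adam([neural_texture] + list(model.parameters()))
# loss = multi_resolution_l1_loss(model(pyramid), matte)
# loss.backward(); optimizer.step()
```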
Generation of new images and animations

3.1 Preparing the network input

Synthesizing a new image or animation requires specifying the corresponding camera intrinsics, the camera trajectory and poses, and the light source trajectory and poses. Radiance cues and a UV map must then be synthesized as the input of the neural rendering pipeline model. Their generation is exactly analogous to the procedures described in 1.5 and 1.6; only the camera and light source parameters need to be replaced by those of the new sequence to be generated.
3.2 Running the neural rendering pipeline model

Feeding the UV map and radiance cues generated in Section 3.1 into the neural rendering pipeline model synthesizes a new image of the modeled object under the specified viewing angle and lighting conditions, see Figures 1-5(e).
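At inference time the trained pipeline is simply run forward on the newly generated inputs; a sketch of that step, reusing the sampling and pyramid helpers from the Section 2.1/2.2 sketch and keeping the same log-domain assumption:

```python
import torch

@torch.no_grad()
def relight_frame(uv_map, radiance_cues, neural_texture, model):
    """Synthesize one frame for a new camera/light configuration.
    uv_map and radiance_cues come from step 3.1 (rasterizer + path tracer);
    `model` is the trained multi-resolution network, assumed to return its
    predictions finest level first."""
    projected = sample_neural_texture(uv_map, neural_texture)
    net_input = torch.cat([projected, torch.log1p(radiance_cues)], dim=1)
    predictions = model(build_pyramid(net_input, levels=5))
    return torch.expm1(predictions[0])   # map the finest-level output back to linear space
```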
Implementation Example

The inventors implemented an embodiment of the present invention on a server equipped with an Intel Xeon Platinum 8268 CPU and an NVidia Tesla V100 GPU (32GB), adopted all parameter values listed in the detailed description, and obtained all experimental results shown in Figures 1-5. The present invention can synthesize images of the modeled object under different viewing angles and lighting conditions as well as temporally stable image sequences (animations). For a 512*512 image, the entire processing pipeline takes about 1.9 seconds: the UV map and radiance cues, generated by the rasterization shader and the path tracing renderer respectively, take about 1.4 seconds; the forward pass of the neural rendering pipeline model, implemented in TensorFlow, takes about 460 milliseconds in total, of which data IO accounts for 385 milliseconds and the network forward pass for 75 milliseconds. In addition, training a multi-resolution neural network for a specific modeled object takes 20 hours.

Claims (3)

  1. A neural rendering method based on a multi-resolution network structure, characterized in that it comprises the following steps:
    (1) Image acquisition and preprocessing: take images of the object to be modeled under different viewing angles and lighting, process the data and obtain camera parameters and light source positions, proxy geometry and neural texture, foreground mattes, radiance cues and UV maps;
    (2) Construction and training of the neural rendering pipeline model: build a neural rendering pipeline model comprising a neural texture sampling module and a multi-resolution neural network, wherein the neural texture sampling module takes the UV map and the neural texture as input and generates the projected neural texture, which is then concatenated with the radiance cues and fed into the multi-resolution neural network to obtain the rendering result; the loss function and back-propagated gradients are computed between the rendering result and the corresponding real captured image, and the neural texture and the multi-resolution network parameters are jointly optimized from these gradients, thereby training the neural rendering pipeline model;
    (3) Generation of new images and animations: take images under specified camera parameters and lighting conditions, process them to obtain radiance cues and a UV map, and synthesize a new image or animation with the neural rendering pipeline model.
  2. The neural rendering method based on a multi-resolution network structure according to claim 1, wherein step (1) comprises the following sub-steps:
    (1.1) Image acquisition: in a dark environment, capture the object to be modeled with two synchronized cameras, one of which also provides the illumination and keeps its flash permanently on while the other keeps its flash off, yielding two associated image sequences of the object; additionally capture an extra image sequence of the object with one camera under natural lighting, which is used only for generating the proxy geometry;
    (1.2) Camera parameters and light-source positions: estimate the intrinsic parameters and the extrinsic-parameter sequences of the two cameras, and derive the spatial trajectory of the light source from the extrinsic-parameter sequence of the lighting camera;
    (1.3) Proxy geometry and neural texture: generate an approximate (not necessarily accurate) geometric model of the object to be modeled with the COLMAP algorithm and take it as the proxy geometry; compute per-vertex UV coordinates of the proxy geometry with a UV unwrapping algorithm; bind an optimizable texture map to the proxy geometry to obtain the neural texture;
    (1.4) Foreground matting: for each frame of the captured image sequences, rasterize the proxy geometry to the screen using the camera parameters to obtain foreground, background, and undetermined regions; using these as the basis, run the closed-form matting algorithm to obtain a foreground mask, and multiply the foreground mask with the image to obtain the foreground matte of the captured frame;
    (1.5) Radiance cues: for each frame of the captured image sequences, render images of the proxy geometry assigned several preset materials, using the frame's camera and lighting parameters, and concatenate the results as the radiance cues; the materials comprise an ideal diffuse surface model and four Cook-Torrance models with roughness 0.02, 0.05, 0.13, and 0.34, respectively; rendering is performed with a ray-tracing renderer based on the path-tracing algorithm (an illustrative BRDF sketch follows the claims);
    (1.6) UV maps: for each frame of the captured image sequences, generate a screen-space UV map from the camera parameters and the proxy geometry.
  3. The neural rendering method based on a multi-resolution network structure according to claim 2, wherein in step (2) the neural rendering pipeline model is built and trained as follows:
    (2.1) Training data: for each frame, the radiance cues, the UV map, and the foreground matte $\hat{I}_i$ form one training sample, where the radiance cues and the UV map serve as inputs to the neural rendering pipeline model and the foreground matte $\hat{I}_i$ serves as the fitting target;
    (2.2) Neural texture sampling module: the sampling module takes the UV map obtained in step (1.6) as input; for each pixel, the value stored in the UV map is used as a coordinate at which the neural texture described in step (1.3) is looked up, yielding the projected neural texture (a minimal sampling sketch follows the claims);
    (2.3) Multi-resolution neural network: the network takes the concatenation of the projected neural texture and the radiance cues described in step (1.5) as input and generates a set of multi-resolution representations; for each level of the multi-resolution representation, a feature transformation module processes the level's input into intermediate features, a post-processing module maps the intermediate features to an output image at the corresponding resolution, and an upsampling module passes the intermediate features to the next, higher-resolution level, where they are concatenated with that level's representation to form its input; a set of spatial filters is likewise applied to the foreground matte $\hat{I}_i$ to generate a set of multi-resolution representations that serve as fitting targets for the output images at each resolution level (a module-level sketch follows the claims);
    The feature transformation module has the following structure: first a convolution layer with 128 output channels, kernel size 3, and stride 1, then a separate normalization layer, and finally a rectified linear activation layer;
    The post-processing module is a convolution layer with 128 output channels, kernel size 3, and stride 1;
    The upsampling module has the following structure: first a 2x nearest-neighbour upsampling operation, then a convolution layer with 128 output channels, kernel size 3, and stride 1, then a separate normalization layer, and finally a rectified linear activation layer;
    (2.4) Loss function: constraints are imposed on the output images at every multi-resolution level, and the neural texture and the multi-resolution network parameters are optimized jointly (a training-loss sketch follows the claims). Mathematically:

    $$\min_{\theta_T,\;\theta_R}\;\sum_{i=1}^{N}\sum_{l=1}^{L}\lambda_l\,\big\lVert \hat{I}_i^{\,l}-\mathcal{R}_l(x_i;\,\theta_T,\theta_R)\big\rVert_1$$

    where $\lVert\cdot\rVert_1$ denotes the L1 loss function, N is the total number of training images, L is the total number of levels of the multi-resolution representation, $\mathcal{R}$ is the multi-resolution neural network, $\theta_T$ are the neural texture parameters, $\theta_R$ are the multi-resolution network parameters, i is the image index, l is the resolution-level index, $x_i$ is the model input (radiance cues and UV map) of frame i, $\hat{I}_i^{\,l}$ is the foreground matte at resolution level l serving as the fitting target, $\mathcal{R}_l(x_i;\theta_T,\theta_R)$ is the image predicted by the network at resolution level l, and $\lambda_l$ is the weighting factor of the loss at level l.
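Step (1.5) above renders the proxy geometry with an ideal diffuse material and four Cook-Torrance materials to produce the radiance cues. The claims do not fix which Cook-Torrance variant is meant; the sketch below assumes the common GGX distribution, Smith-Schlick geometry term, Schlick Fresnel, and a default specular reflectance f0 of 0.04, so it only illustrates the kind of material involved, not the exact renderer settings of the embodiment.

```python
import numpy as np

def cook_torrance_specular(normal, view_dir, light_dir, roughness, f0=0.04):
    """Specular BRDF value for one Cook-Torrance material (GGX distribution,
    Smith-Schlick geometry, Schlick Fresnel); all direction vectors are unit length."""
    half = (view_dir + light_dir) / np.linalg.norm(view_dir + light_dir)
    n_l = max(float(np.dot(normal, light_dir)), 1e-6)
    n_v = max(float(np.dot(normal, view_dir)), 1e-6)
    n_h = max(float(np.dot(normal, half)), 0.0)
    v_h = max(float(np.dot(view_dir, half)), 0.0)

    alpha2 = roughness ** 4                                               # alpha = roughness^2 remapping
    d = alpha2 / (np.pi * (n_h * n_h * (alpha2 - 1.0) + 1.0) ** 2)        # GGX normal distribution
    k = (roughness + 1.0) ** 2 / 8.0
    g = (n_l / (n_l * (1.0 - k) + k)) * (n_v / (n_v * (1.0 - k) + k))     # Smith-Schlick geometry
    f = f0 + (1.0 - f0) * (1.0 - v_h) ** 5                                # Schlick Fresnel
    return d * g * f / (4.0 * n_l * n_v)

# Materials used for the radiance cues in step (1.5): an ideal diffuse surface
# plus Cook-Torrance materials at these four roughness values.
COOK_TORRANCE_ROUGHNESS = [0.02, 0.05, 0.13, 0.34]
```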
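A sketch of the neural texture sampling module of step (2.2) in TensorFlow. Bilinear interpolation, UV values in [0, 1], and the texture resolution and channel count are assumptions made here for illustration; the claim itself only requires a lookup of the neural texture at the UV coordinate.

```python
import tensorflow as tf

def bilinear_sample(texture, uv):
    """texture: [H_t, W_t, C] learnable neural texture; uv: [H, W, 2] with values in [0, 1]."""
    h_t = tf.cast(tf.shape(texture)[0], tf.float32)
    w_t = tf.cast(tf.shape(texture)[1], tf.float32)
    x = uv[..., 0] * (w_t - 1.0)
    y = uv[..., 1] * (h_t - 1.0)
    x0, y0 = tf.floor(x), tf.floor(y)
    x1 = tf.minimum(x0 + 1.0, w_t - 1.0)
    y1 = tf.minimum(y0 + 1.0, h_t - 1.0)
    wx, wy = x - x0, y - y0

    def gather(yy, xx):
        idx = tf.stack([tf.cast(yy, tf.int32), tf.cast(xx, tf.int32)], axis=-1)
        return tf.gather_nd(texture, idx)

    # Blend the four neighbouring texels.
    top = gather(y0, x0) * (1.0 - wx)[..., None] + gather(y0, x1) * wx[..., None]
    bot = gather(y1, x0) * (1.0 - wx)[..., None] + gather(y1, x1) * wx[..., None]
    return top * (1.0 - wy)[..., None] + bot * wy[..., None]

# The neural texture itself is simply a trainable variable; resolution and channel
# count here are illustrative, not values fixed by the claims.
neural_texture = tf.Variable(tf.random.normal([1024, 1024, 16], stddev=0.01))
```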
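A structural sketch of the three modules of step (2.3) and one way to chain them, using Keras layers. The "separate normalization layer" is read here as a per-image, per-channel normalization, the number of levels and the channel count of the output images are illustrative, and the per-level inputs are assumed to be supplied coarse to fine; none of these choices are fixed by the claims.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SimpleNorm(layers.Layer):
    """Stand-in for the claim's normalization layer: per-image, per-channel statistics."""
    def call(self, x):
        mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
        return (x - mean) / tf.sqrt(var + 1e-5)

def feature_transform():
    # conv (128 output channels, 3x3 kernel, stride 1) -> normalization -> ReLU
    return tf.keras.Sequential(
        [layers.Conv2D(128, 3, strides=1, padding="same"), SimpleNorm(), layers.ReLU()])

def post_process(out_channels=128):
    # Per the claim a 128-channel conv; an RGB output image would instead use 3 channels.
    return layers.Conv2D(out_channels, 3, strides=1, padding="same")

def upsample():
    # nearest-neighbour x2 -> conv (128, 3x3, stride 1) -> normalization -> ReLU
    return tf.keras.Sequential(
        [layers.UpSampling2D(2, interpolation="nearest"),
         layers.Conv2D(128, 3, strides=1, padding="same"), SimpleNorm(), layers.ReLU()])

class MultiResRenderer(tf.keras.Model):
    def __init__(self, num_levels=4, out_channels=3):
        super().__init__()
        self.transforms = [feature_transform() for _ in range(num_levels)]
        self.heads = [post_process(out_channels) for _ in range(num_levels)]
        self.ups = [upsample() for _ in range(num_levels - 1)]

    def call(self, inputs_per_level):
        # inputs_per_level: coarse-to-fine list of [B, H_l, W_l, C] tensors.
        images, carried = [], None
        for l, x in enumerate(inputs_per_level):
            if carried is not None:
                x = tf.concat([carried, x], axis=-1)   # fuse upsampled features with this level
            feats = self.transforms[l](x)
            images.append(self.heads[l](feats))        # output image at this resolution
            if l < len(self.ups):
                carried = self.ups[l](feats)           # pass features to the next level
        return images
```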
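A sketch of the multi-level training loss of step (2.4), reusing bilinear_sample, MultiResRenderer, and neural_texture from the sketches above. The spatial filters that build the multi-resolution inputs and targets are assumed to be 2x average pooling, and the optimizer and the per-level weights lambda_l are illustrative values, not taken from the disclosure.

```python
import tensorflow as tf

def build_pyramid(x, num_levels):
    """Coarse-to-fine pyramid of a [B, H, W, C] tensor via 2x average pooling (assumed filter)."""
    pyr = [x]
    for _ in range(num_levels - 1):
        pyr.insert(0, tf.nn.avg_pool2d(pyr[0], 2, 2, "VALID"))
    return pyr

def multires_l1_loss(outputs, targets, level_weights):
    # sum over levels of lambda_l * L1(matte target at level l, prediction at level l)
    return tf.add_n([w * tf.reduce_mean(tf.abs(p - t))
                     for p, t, w in zip(outputs, targets, level_weights)])

optimizer = tf.keras.optimizers.Adam(1e-3)   # optimizer choice is illustrative
level_weights = [0.125, 0.25, 0.5, 1.0]      # illustrative lambda_l values, coarse to fine

def train_step(uv_map, cues, matte, renderer, neural_texture):
    """uv_map: [H, W, 2]; cues: [H, W, C_cues]; matte: [1, H, W, 3] foreground matte."""
    targets = build_pyramid(matte, len(level_weights))
    with tf.GradientTape() as tape:
        # Sample the neural texture inside the tape so theta_T receives gradients.
        projected = bilinear_sample(neural_texture, uv_map)[tf.newaxis]
        net_in = tf.concat([projected, cues[tf.newaxis]], axis=-1)
        outputs = renderer(build_pyramid(net_in, len(level_weights)))
        loss = multires_l1_loss(outputs, targets, level_weights)
    # Joint optimization of the neural texture and the network parameters.
    variables = [neural_texture] + renderer.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```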
PCT/CN2022/094877 2022-05-25 2022-05-25 Neural rendering method based on multi-resolution network structure WO2023225891A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/094877 WO2023225891A1 (en) 2022-05-25 2022-05-25 Neural rendering method based on multi-resolution network structure


Publications (1)

Publication Number Publication Date
WO2023225891A1

Family

ID=88918038

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/094877 WO2023225891A1 (en) 2022-05-25 2022-05-25 Neural rendering method based on multi-resolution network structure

Country Status (1)

Country Link
WO (1) WO2023225891A1 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114514561A (en) * 2019-10-03 2022-05-17 谷歌有限责任公司 Neural light transmission
RU2764144C1 (en) * 2020-07-27 2022-01-13 Самсунг Электроникс Ко., Лтд. Rapid two-layer neural network synthesis of realistic images of a neural avatar based on a single image
CN113538664A (en) * 2021-07-14 2021-10-22 清华大学 Vehicle de-illumination three-dimensional reconstruction method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAO, Duan; CHEN, Guojun; DONG, Yue; PEERS, Pieter; XU, Kun; TONG, Xin: "Deferred neural lighting", ACM Transactions on Graphics, vol. 39, no. 6, 26 November 2020, pages 1-15, XP059134795, ISSN: 0730-0301, DOI: 10.1145/3414685.3417767 *
THIES, Justus; ZOLLHÖFER, Michael; NIESSNER, Matthias: "Deferred neural rendering", ACM Transactions on Graphics, vol. 38, no. 4, 12 July 2019, pages 1-12, XP058439437, ISSN: 0730-0301, DOI: 10.1145/3306346.3323035 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333627A (en) * 2023-12-01 2024-01-02 南方科技大学 Reconstruction and complement method, system and storage medium for automatic driving scene
CN117333627B (en) * 2023-12-01 2024-04-02 南方科技大学 Reconstruction and complement method, system and storage medium for automatic driving scene

Similar Documents

Publication Publication Date Title
Rückert et al. Adop: Approximate differentiable one-pixel point rendering
Martin-Brualla et al. Nerf in the wild: Neural radiance fields for unconstrained photo collections
CN111656407B (en) Fusing, texturing and rendering views of a dynamic three-dimensional model
Srinivasan et al. Lighthouse: Predicting lighting volumes for spatially-coherent illumination
CN106803267B (en) Kinect-based indoor scene three-dimensional reconstruction method
Meilland et al. 3d high dynamic range dense visual slam and its application to real-time object re-lighting
CN112802173B (en) Relightable texture for use in rendering images
CN111243071A (en) Texture rendering method, system, chip, device and medium for real-time three-dimensional human body reconstruction
Weng et al. Vid2actor: Free-viewpoint animatable person synthesis from video in the wild
Li et al. Physically-based editing of indoor scene lighting from a single image
Maier et al. Super-resolution keyframe fusion for 3D modeling with high-quality textures
KR100834157B1 (en) Method for Light Environment Reconstruction for Image Synthesis and Storage medium storing program therefor.
US11887256B2 (en) Deferred neural rendering for view extrapolation
CN113345063B (en) PBR three-dimensional reconstruction method, system and computer storage medium based on deep learning
Wang et al. Stereo vision–based depth of field rendering on a mobile device
CN115428027A (en) Neural opaque point cloud
Ji et al. Geometry-aware single-image full-body human relighting
Unger et al. Spatially varying image based lighting using HDR-video
WO2023225891A1 (en) Neural rendering method based on multi-resolution network structure
Sevastopolsky et al. Relightable 3d head portraits from a smartphone video
Xu et al. Scalable image-based indoor scene rendering with reflections
Qiao et al. Dynamic mesh-aware radiance fields
Ye et al. High-fidelity 3D real-time facial animation using infrared structured light sensing system
Ma et al. Neural compositing for real-time augmented reality rendering in low-frequency lighting environments
Xu et al. Renerf: Relightable neural radiance fields with nearfield lighting

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22943096

Country of ref document: EP

Kind code of ref document: A1