WO2023225891A1 - Neural rendering method based on multi-resolution network structure - Google Patents

Neural rendering method based on multi-resolution network structure Download PDF

Info

Publication number
WO2023225891A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural
resolution
images
texture
image
Prior art date
Application number
PCT/CN2022/094877
Other languages
French (fr)
Chinese (zh)
Inventor
周昆
吴鸿智
任重
马晟杰
Original Assignee
浙江大学
杭州相芯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学, 杭州相芯科技有限公司 filed Critical 浙江大学
Priority to PCT/CN2022/094877 priority Critical patent/WO2023225891A1/en
Publication of WO2023225891A1 publication Critical patent/WO2023225891A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/04Texture mapping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image

Definitions

  • the present invention relates to the field of image-based rendering and material capture and modeling, and in particular to a method of synthesizing images of modeling objects under new viewing angles and new lighting conditions.
  • Relighting technology supports the digitization of real-world scenes, allowing creators to arbitrarily modify the subject's viewing angle and lighting, and synthesize new images that comply with physical laws, which has broad application prospects.
  • Existing work can be mainly divided into two categories: model-based and image-based methods.
  • Model-based methods fit a priori models to measured data and rely on the prior models to perform interpolation and extrapolation to new viewing angles and lighting conditions.
  • However, the reconstruction quality of this type of method has major limitations: the a priori model is usually designed by hand and cannot perfectly explain all measurement data, and the reconstruction quality is also strongly affected by several factors, such as the accuracy of the geometric model and of the camera calibration, which in turn affects the reliability of the fitted parameters.
  • Image-based methods do not rely on a priori models but adopt a more direct, data-driven approach, so their estimation accuracy is not affected by these factors. In recent years, with the development of deep learning, the quality of images synthesized by such data-driven methods has improved greatly. However, although they can synthesize realistic-looking images, existing methods still suffer from blurred high-frequency details and poor temporal stability.
  • The purpose of the present invention is to provide a neural rendering method based on a multi-resolution network structure that addresses the shortcomings of existing relighting techniques. It solves the problems of blurred high-frequency details and poor temporal stability, reaches the state of the art in relighting, and has high practical value.
  • Image acquisition and preprocessing: take images of the object to be modeled under different viewing angles and lighting, process the data and obtain camera parameters and light source positions, proxy geometry and neural texture, foreground mattes, radiance cues and UV maps;
  • step 1 includes the following sub-steps:
  • Collect images: use two cameras to shoot the object to be modeled simultaneously in a dark environment, where one camera also provides the lighting and keeps its flash permanently on while the other keeps its flash off; this yields two associated image sequences of the object to be modeled; an additional image sequence of the object, taken with a single camera under natural lighting, is used only to generate the proxy geometry;
  • the material includes an ideal diffuse surface model and four Cook-Torrance BRDF models with roughnesses of 0.02, 0.05, 0.13 and 0.34.
  • the rendering process is implemented by a physically based path tracing renderer.
  • step (2) includes the following sub-steps:
  • (2.1) Define training data: the radiance cues, UV map and foreground matte corresponding to each frame form one set of training data;
  • the radiance cues and UV map are used as inputs to the neural rendering pipeline model, and the foreground matte is the fitting target;
  • The sampling module takes the UV map obtained in step (1.6) as input; for each pixel, the value in the UV map is used as a coordinate at which the neural texture described in step (1.3) is sampled, yielding the projected neural texture.
  • The neural network model takes the concatenation of the projected neural texture and the radiance cues described in step (1.5) as input and generates a set of multi-resolution representations; at each level of the multi-resolution representation, a feature transformation module processes the input into intermediate features, and a post-processing module turns the intermediate features into an output image of the corresponding resolution; the intermediate features are also passed through an upsampling module to the next, higher-resolution level, where they are concatenated with that level's representation to form its input; a set of spatial filters is likewise applied to the foreground matte to generate a set of multi-resolution representations that serve as the fitting targets for the output images at each resolution level;
  • The feature transformation module has the following structure: first a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, then an instance normalization layer, and finally a rectified linear (ReLU) activation layer;
  • The post-processing module is a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1;
  • The upsampling module has the following structure: first a nearest-neighbour upsampling operation that doubles the resolution, then a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, an instance normalization layer, and finally a ReLU activation layer;
  • N represents the total number of training images
  • L represents the total number of layers of multi-resolution representation
  • θ_T represents the neural texture parameters
  • i is the picture serial number
  • l represents the resolution level serial number
  • λ_l represents the weighting factor of the loss at different resolution levels.
  • The present invention is the first to use multi-resolution representations, an effective prior model structure, in the field of relighting.
  • Compared with a traditional neural network, the multi-resolution neural network explicitly separates the different spatial frequency components, reducing potential mutual interference, and imposes additional regularizing constraints on the different resolution levels, making the synthesized image sequence more stable in the time domain; and because of the independent high-frequency processing modules, it solves the high-frequency loss caused in traditional methods by mixing the encoding of different frequency components, so the synthesized image retains more detailed textures and achieves higher fidelity.
  • This method matches the most advanced relighting techniques and can be used in applications such as e-commerce, digital preservation of cultural relics, virtual reality and augmented reality.
  • Figure 1 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the first captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;
  • Figure 2 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the second captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;
  • Figure 3 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the third captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;
  • Figure 4 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the fourth captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;
  • Figure 5 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the fifth captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image.
  • The core of the invention is a novel multi-resolution neural network: given the viewing angle, illumination and proxy geometry, the projected neural texture and radiance cues are first synthesized as the network input, and the input is then processed by the multi-resolution network into the final synthesized image.
  • This multi-resolution network structure is superior to other existing network structures in terms of image detail and the temporal stability of synthesized animations. The method consists of three main steps: image acquisition and preprocessing, construction and training of the neural rendering pipeline model, and generation of new images and animations.
  • The invention follows the neural relighting algorithm of Gao et al. (Duan Gao, Guojun Chen, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. 2020. Deferred neural lighting: free-viewpoint relighting from unstructured photographs.
  • ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–15) to collect images of the object to be modeled under different lighting and viewing angles. Specifically: ensure the capture site is completely dark with no interference from other light sources, and use two cameras (one camera C1 with its flash permanently on, the other C2 with its flash off) to record videos of the object to be modeled, denoted sequences A1 and A2.
  • the two cameras should move in a certain pattern around the object to be modeled to ensure that the image covers a variety of viewing angles and lighting combinations.
  • the typical number of image acquisitions is several thousand, determined by the geometric and material complexity of the object. The higher the complexity, the greater the number of images required.
  • The additional sequence captured under natural lighting, used only to generate the proxy geometry, requires only a few dozen images.
  • The present invention runs a multi-view stereo algorithm on sequences A1 and A2 (Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. 2006. A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. In CVPR. 519–528.), calibrates the camera intrinsics, and obtains the trajectories and poses of the two cameras (P1 and P2) over the entire capture. Because the light source is rigidly attached to one of the cameras (C1), the trajectory of the light source during capture (namely P1) is obtained at the same time.
  • The present invention runs the COLMAP algorithm (Johannes Lutz Schönberger and Jan-Michael Frahm. 2016. Structure-from-Motion Revisited. In CVPR.) to obtain an inaccurate geometric model of the object to be modeled, called the proxy geometry.
  • The present invention then uses a UV unwrapping algorithm to generate UV coordinates for the vertices of the proxy geometry, and binds to the model a randomly initialized texture map with a resolution of 512*512 and 16 channels. Because this map is optimized jointly with the neural network parameters, it is called the neural texture.
  • For the optimized neural texture, see Figures 1-5(b).
  • For each image, the present invention draws the proxy geometry with a rasterization shader using that frame's camera parameters, marks the region around the object silhouette as undetermined using dilation and erosion operations, and finally runs the closed-form matting algorithm (Anat Levin, Dani Lischinski, and Yair Weiss. 2008. A Closed-Form Solution to Natural Image Matting. IEEE PAMI 30, 2 (Feb 2008), 228–242.) to obtain the foreground mask. Training the neural network model only requires sequence A2, so the present invention generates only the corresponding mask sequence M2 for sequence A2; for each image in the sequence, the product of the image and its mask is computed, finally yielding the background-free foreground matte sequence A′2.
  • the present invention builds a physically based path tracing renderer (NVidia OptiX framework).
  • The invention assigns five different materials to the proxy geometry: an ideal diffuse model (Lambertian BRDF) and Cook-Torrance models (Cook-Torrance BRDF) with roughnesses of 0.02, 0.05, 0.13 and 0.34.
  • For each frame, given the light source position and camera parameters, the path tracing renderer draws 5 images, one per preset material.
  • The five images are stacked together into a three-dimensional tensor called the radiance cues, as shown in Figures 1-5(a).
  • The radiance cue sequence obtained from A′2 is denoted R2.
  • For each frame of the training image sequence A′2, the present invention applies a rasterization shader, combined with that frame's camera parameters, to draw the proxy geometry into screen space.
  • Using the vertex UV coordinates described in 1.3, the corresponding UV coordinate is interpolated and filled in for each screen pixel, yielding the UV map (see Figures 1-5(c)); the UV map sequence is denoted U2.
  • The present invention uses the UV map obtained in step 1.6 as input; for each pixel, the value in the UV map is used as a coordinate at which the neural texture described in step 1.3 is sampled, yielding a three-dimensional tensor called the projected neural texture.
  • For the projected neural texture, see Figures 1-5(d). Because the neural texture is continuously updated during training, the projected neural texture must be recomputed in every iteration.
  • Multi-resolution neural network model: the multi-resolution neural network used in the present invention takes the concatenated projected neural texture and radiance cues as input; the input passes through a set of mean pooling operations with window size 2 and stride 2, producing a set of multi-resolution representations.
  • At each level of the multi-resolution representation, the input goes through a feature transformation module that outputs intermediate features, and the intermediate features go through a post-processing module that outputs an output image of the corresponding resolution.
  • The intermediate features also pass through an upsampling module, are passed to the next, finer level, concatenated with that level's representation and fed into the next feature transformation module, and so on, until the full-resolution image is finally output.
  • the feature transformation modules at each level are independent and do not share parameters with each other.
  • Multi-resolution neural networks work in the logarithmic domain to represent a larger dynamic range, so the input needs to be mapped to the logarithmic domain in advance and the network output is mapped back to linear space.
  • The present invention uses a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, followed by an instance normalization layer (Instance Normalization) and finally a rectified linear activation layer (ReLU), as the feature transformation module.
  • The upsampling module consists of a nearest-neighbour upsampling operation that doubles the resolution, followed by a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, an instance normalization layer, and finally a ReLU activation layer;
  • The complete training data of the multi-resolution neural network used in the present invention comprises the foreground matte sequence A′2, the projected neural texture generated on the fly by the neural texture sampling module described in 2.1, and the radiance cue sequence R2.
  • The multi-resolution neural network used in this invention has a corresponding image output at each resolution level, and an L1 loss function is applied to the result of each level.
  • The target image of each level can be generated from the image sequence A′2.
  • N denotes the total number of training images
  • S denotes the neural texture sampling module
  • θ_T denotes the neural texture parameters
  • λ_l denotes the weighting factor of the losses at different resolutions.
  • Synthesizing a new image or animation requires specifying the corresponding camera intrinsics, the camera trajectory and poses, and the light source trajectory and poses. Radiance cues and a UV map must then be synthesized as the input of the neural rendering pipeline model.
  • The generation of the radiance cues and UV maps is exactly analogous to the procedures described in 1.5 and 1.6; only the camera and light source parameters need to be replaced by those of the new sequence to be generated.
  • The inventors implemented an embodiment of the present invention on a server equipped with an Intel Xeon Platinum 8268 CPU and an NVidia Tesla V100 GPU (32GB), adopted all parameter values listed in the detailed description, and obtained all experimental results shown in Figures 1-5.
  • the present invention can synthesize images of modeling objects under different viewing angles and lighting conditions as well as time-domain stable image sequences (animations).
  • For a 512*512 image, the entire processing pipeline takes about 1.9 seconds: the UV map and radiance cues, generated by the rasterization shader and the path tracing renderer respectively, take about 1.4 seconds; the forward pass of the neural rendering pipeline model, implemented in TensorFlow, takes about 460 milliseconds in total, of which data IO accounts for 385 milliseconds and the network forward pass for 75 milliseconds. In addition, training a multi-resolution neural network for a specific modeled object takes 20 hours.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Image Processing (AREA)

Abstract

Disclosed in the present application is a neural rendering method based on a multi-resolution network structure. The method comprises: first photographing images of an object to be modeled under different viewing angles and illumination, obtaining camera parameters, light source positions, proxy geometry, neural textures, foreground mattes, radiance cues and UV map data, and building and training a neural rendering pipeline model; and finally photographing images under specified camera parameters and illumination conditions, processing them to obtain radiance cues and a UV map, and synthesizing a new image or animation with the neural rendering pipeline model. Compared with a traditional neural network, the present application explicitly splits the different spatial frequency components, so that the synthesized image sequence has better stability in the time domain. The present application also solves the problem of high-frequency loss caused by the mixed encoding of different frequency components in traditional methods, so that the synthesized image retains more detailed textures and achieves higher fidelity.

Description

A neural rendering method based on a multi-resolution network structure

Technical Field

The present invention relates to the field of image-based rendering and of material capture and modeling, and in particular to a method for synthesizing images of a modeled object under new viewing angles and new lighting conditions.

Background

Relighting technology supports the digitization of real-world scenes, allowing creators to arbitrarily modify the subject's viewing angle and lighting and to synthesize new, physically plausible images; it therefore has broad application prospects. Existing work can be divided into two main categories: model-based and image-based methods.

Model-based methods fit a priori models to the measured data and rely on those prior models to interpolate and extrapolate to new viewing angles and lighting conditions. However, the reconstruction quality of such methods has major limitations: the a priori model is usually designed by hand and cannot perfectly explain all measurement data, and the reconstruction quality is also strongly affected by several factors, such as the accuracy of the geometric model and of the camera calibration, which in turn affects the reliability of the fitted parameters.

Image-based methods do not rely on a priori models but adopt a more direct, data-driven approach, so their estimation accuracy is not affected by these factors. In recent years, with the development of deep learning, the quality of images synthesized by such data-driven methods has improved greatly. However, although they can synthesize realistic-looking images, existing methods still suffer from blurred high-frequency details and poor temporal stability.
Summary of the Invention

The purpose of the present invention is to provide a neural rendering method based on a multi-resolution network structure that addresses the shortcomings of existing relighting techniques. It solves the problems of blurred high-frequency details and poor temporal stability, reaches the state of the art in relighting, and has high practical value.

The object of the present invention is achieved through the following technical solution, which comprises the following steps:

(1) Image acquisition and preprocessing: take images of the object to be modeled under different viewing angles and lighting, process the data and obtain camera parameters and light source positions, proxy geometry and neural texture, foreground mattes, radiance cues and UV maps;

(2) Construction and training of the neural rendering pipeline model: build a neural rendering pipeline model comprising a neural texture sampling module and a multi-resolution neural network. The neural texture sampling module takes the UV map and the neural texture as input and generates the projected neural texture, which is then concatenated with the radiance cues and fed into the multi-resolution neural network to obtain the rendering result. The loss function and back-propagated gradients are computed between the rendering result and the corresponding real captured image, and the neural texture and the multi-resolution network parameters are jointly optimized from these gradients, thereby training the neural rendering pipeline model;

(3) Generation of new images and animations: generate the radiance cues and UV map under specified camera parameters and lighting conditions, and synthesize a new image or animation with the neural rendering pipeline model.
Further, step (1) comprises the following sub-steps:

(1.1) Collect images: use two cameras to shoot the object to be modeled simultaneously in a dark environment, where one camera also provides the lighting and keeps its flash permanently on while the other keeps its flash off; this yields two associated image sequences of the object to be modeled. An additional image sequence of the object, taken with a single camera under natural lighting, is used only to generate the proxy geometry;

(1.2) Generate camera parameters and light source positions: generate the intrinsic and extrinsic parameter sequences of the two cameras, and derive the spatial trajectory of the light source from the extrinsic parameter sequence of the lighting camera;
(1.3) Generate proxy geometry: the COLMAP algorithm (Schönberger, Johannes L., and Jan-Michael Frahm. Structure-from-motion revisited. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016; Schönberger, Johannes L., et al. Pixelwise view selection for unstructured multi-view stereo. European Conference on Computer Vision. Springer, Cham, 2016.) generates an inaccurate geometric model of the object to be modeled, called the proxy geometry; a UV unwrapping algorithm (Kun Zhou, John Snyder, Baining Guo, et al. Iso-charts: stretch-driven mesh parameterization using spectral analysis. In ACM SIGGRAPH Symposium on Geometry Processing. 2004: 45-54.) computes per-vertex UV coordinates for the proxy geometry; and an optimizable texture map, called the neural texture, is bound to the proxy geometry.
(1.4) Compute foreground mattes: for each frame of the captured image sequence, the proxy geometry is drawn to the screen using the camera parameters to obtain foreground, background and undetermined regions; on this basis, the closed-form matting algorithm is run to obtain a foreground mask. The foreground mask and the image are multiplied to obtain the foreground matte of the captured image, with the background removed, which serves as the fitting target of the algorithm.

(1.5) Generate radiance cues: for each frame of the captured image sequence, proxy-geometry images with different materials are rendered from the camera and lighting parameters, and the results are stacked together as the radiance cues.

The materials comprise an ideal diffuse surface model and four Cook-Torrance BRDF models with roughnesses of 0.02, 0.05, 0.13 and 0.34. The rendering is performed by a physically based path tracing renderer.

(1.6) Generate UV maps: for each frame of the captured image sequence, the proxy geometry is drawn to the screen using the camera parameters, and the UV coordinate of every screen pixel is interpolated from the vertex UV coordinates of the model to generate a screen-space UV map.
Further, step (2) comprises the following sub-steps:
(2.1) Define training data: the radiance cues, UV map and foreground matte corresponding to each frame form one set of training data, in which the radiance cues and UV map are used as the inputs of the neural rendering pipeline model and the foreground matte is the fitting target;
(2.2) Build the neural texture sampling module: the sampling module takes the UV map obtained in step (1.6) as input; for each pixel, the value in the UV map is used as a coordinate at which the neural texture described in step (1.3) is sampled, yielding the projected neural texture.
(2.3) Build the multi-resolution neural network: the neural network model takes the concatenation of the projected neural texture and the radiance cues described in step (1.5) as input and generates a set of multi-resolution representations. At each level of the multi-resolution representation, a feature transformation module processes the input into intermediate features, and a post-processing module turns the intermediate features into an output image of the corresponding resolution; the intermediate features are also passed through an upsampling module to the next, higher-resolution level, where they are concatenated with that level's representation to form its input. A set of spatial filters is likewise applied to the foreground matte to generate a set of multi-resolution representations that serve as the fitting targets for the output images at each resolution level;
The feature transformation module has the following structure: first a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, then an instance normalization layer, and finally a rectified linear (ReLU) activation layer;

The post-processing module is a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1;

The upsampling module has the following structure: first a nearest-neighbour upsampling operation that doubles the resolution, then a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, an instance normalization layer, and finally a ReLU activation layer;
(2.4) Define the loss function: a constraint is imposed on the output image at every multi-resolution level, and the neural texture and the multi-resolution network parameters are optimized jointly. The mathematical description is:

$$\theta_T^{*},\ \theta_{\mathcal{F}}^{*} \;=\; \arg\min_{\theta_T,\ \theta_{\mathcal{F}}}\ \sum_{i=1}^{N}\sum_{l=1}^{L} \lambda_l\, \mathcal{L}_1\!\left( a_i^{\,l},\ \hat{a}_i^{\,l} \right)$$

where $\mathcal{L}_1$ denotes the L1 loss function, N the total number of training images, L the total number of levels of the multi-resolution representation, $\mathcal{F}$ the multi-resolution neural network, $\theta_T$ the neural texture parameters, $\theta_{\mathcal{F}}$ the multi-resolution network parameters, i the image index, l the resolution-level index, $\hat{a}_i^{\,l}$ the foreground mattes at the different resolution levels used as fitting targets, $a_i^{\,l}$ the predicted images output by the network at the different resolution levels, and $\lambda_l$ the weighting factor of the loss at each resolution level.
The beneficial effects of the present invention are as follows: the present invention is the first to use multi-resolution representations, an effective prior model structure, in the field of relighting. Compared with a traditional neural network, the multi-resolution neural network explicitly separates the different spatial frequency components, reducing potential mutual interference, and imposes additional regularizing constraints on the different resolution levels, so that the synthesized image sequence is more stable in the time domain; and because of the independent high-frequency processing modules, it solves the high-frequency loss caused in traditional methods by mixing the encoding of different frequency components, so the synthesized image retains more detailed textures and achieves higher fidelity. The method matches the most advanced relighting techniques and can be used in applications such as e-commerce, digital preservation of cultural relics, virtual reality and augmented reality.
Brief Description of the Drawings

Figure 1 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the first captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;

Figure 2 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the second captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;

Figure 3 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the third captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;

Figure 4 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the fourth captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image;

Figure 5 shows intermediate results and the final result of applying the method of the present invention to synthesize a relit image of the fifth captured object, where (a) is the radiance cue map, (b) is the neural texture, (c) is the UV map, (d) is the projected neural texture, and (e) is the synthesized image.
Detailed Description

The core of the invention is a novel multi-resolution neural network: given the viewing angle, illumination and proxy geometry, the projected neural texture and radiance cues are first synthesized as the network input, and the input is then processed by the multi-resolution network into the final synthesized image. This multi-resolution network structure is superior to other existing network structures in terms of image detail and the temporal stability of synthesized animations. The method consists of three main steps: image acquisition and preprocessing, construction and training of the neural rendering pipeline model, and generation of new images and animations.

Each step of the invention is described in detail below with reference to Figures 1-5:

Image acquisition and preprocessing

1.1 Image acquisition
The invention follows the neural relighting algorithm of Gao et al. (Duan Gao, Guojun Chen, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. 2020. Deferred neural lighting: free-viewpoint relighting from unstructured photographs. ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–15) to collect images of the object to be modeled under different lighting and viewing angles. Specifically: ensure the capture site is completely dark with no interference from other light sources, and use two cameras (one camera C1 with its flash permanently on, the other C2 with its flash off) to record videos of the object to be modeled, denoted sequences A1 and A2. During capture, the two cameras should move around the object in a fixed pattern to ensure that the images cover a wide variety of viewing-angle and lighting combinations. A typical capture contains several thousand images, determined by the geometric and material complexity of the object: the higher the complexity, the more images are required. The object to be modeled is then illuminated with natural light, and a single camera with its flash off photographs it from various angles; these photographs, denoted sequence B, are used for the subsequent generation of the proxy geometry, and only a few dozen images need to be collected.
1.2 Generating camera poses and light source positions

The present invention runs a multi-view stereo algorithm on sequences A1 and A2 (Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. 2006. A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. In CVPR. 519–528.), calibrates the camera intrinsics, and obtains the trajectories and poses of the two cameras (P1 and P2) over the entire capture. Because the light source is rigidly attached to one of the cameras (C1), the trajectory of the light source during capture (namely P1) is obtained at the same time.
1.3 Generating the proxy geometry

The present invention runs the COLMAP algorithm (Johannes Lutz Schönberger and Jan-Michael Frahm. 2016. Structure-from-Motion Revisited. In CVPR.) on sequence B to obtain an inaccurate geometric model of the object to be modeled, called the proxy geometry. The present invention then uses a UV unwrapping algorithm to generate UV coordinates for the vertices of the proxy geometry, and binds to the model a randomly initialized texture map with a resolution of 512*512 and 16 channels. Because this map is optimized jointly with the neural network parameters, it is called the neural texture; for the optimized neural texture, see Figures 1-5(b).
1.4 Computing the foreground mattes

Since capture inevitably records background objects beyond the object to be modeled, foreground masks must be generated to remove them. For each image, the present invention draws the proxy geometry with a rasterization shader using that frame's camera parameters, marks the region around the object silhouette as undetermined using dilation and erosion operations, and finally runs the closed-form matting algorithm (Anat Levin, Dani Lischinski, and Yair Weiss. 2008. A Closed-Form Solution to Natural Image Matting. IEEE PAMI 30, 2 (Feb 2008), 228–242.) to obtain the foreground mask. Training the neural network model only requires sequence A2, so the present invention generates only the corresponding mask sequence M2 for sequence A2; for each image in the sequence, the product of the image and its mask is computed, finally yielding the background-free foreground matte sequence A′2.
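A minimal sketch of the trimap construction described above, assuming OpenCV for the dilation/erosion step; the rasterized proxy-geometry mask is taken as given, and `closed_form_matting` is a placeholder for an implementation of Levin et al. 2008, not a library call:

```python
import cv2
import numpy as np

def build_trimap(proxy_mask, radius=15):
    """Turn a binary proxy-geometry mask (H, W, uint8 in {0, 1}) into a trimap:
    255 = foreground, 0 = background, 128 = undetermined band around the silhouette.
    The radius of the structuring element is an assumption, not specified in the text."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (radius, radius))
    sure_fg = cv2.erode(proxy_mask, kernel)    # shrink the mask: definitely object
    maybe = cv2.dilate(proxy_mask, kernel)     # grow the mask: outside is definitely background
    trimap = np.zeros_like(proxy_mask, dtype=np.uint8)
    trimap[maybe > 0] = 128                    # undetermined band around the silhouette
    trimap[sure_fg > 0] = 255                  # confident foreground
    return trimap

# image: a captured frame from sequence A2; proxy_mask: the proxy geometry rasterized
# with that frame's camera parameters. `closed_form_matting` stands for an external
# implementation of Levin et al. 2008 and is not provided here.
# alpha = closed_form_matting(image, build_trimap(proxy_mask))
# foreground = image * alpha[..., None]        # background removed, one element of A'2
```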
1.5 Generating radiance cues

The present invention builds a physically based path tracing renderer (on the NVidia OptiX framework). The invention assigns five different materials to the proxy geometry: an ideal diffuse model (Lambertian BRDF) and Cook-Torrance models (Cook-Torrance BRDF) with roughnesses of 0.02, 0.05, 0.13 and 0.34. For each frame of A′2, given the light source position (P1) and camera parameters (P2), the path tracing renderer draws 5 images, one per preset material. The five images are stacked together into a three-dimensional tensor called the radiance cues, see Figures 1-5(a). The radiance cue sequence obtained from A′2 is denoted R2.
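The patent names the Cook-Torrance model and its roughness values but not the exact microfacet terms; the following sketch therefore uses one common GGX/Schlick/Smith parameterization (an assumption) only to illustrate the per-material BRDF that such a path tracer would evaluate when producing the five radiance-cue renderings:

```python
import numpy as np

def cook_torrance_brdf(n, v, l, roughness, f0=0.04):
    """Point evaluation of an isotropic Cook-Torrance BRDF with a GGX distribution,
    Schlick Fresnel and Smith geometry term; these specific terms and the f0 value
    are assumptions, the patent only names the model and its roughness values."""
    h = v + l
    h = h / np.linalg.norm(h)
    nl = max(float(np.dot(n, l)), 1e-6)
    nv = max(float(np.dot(n, v)), 1e-6)
    nh = max(float(np.dot(n, h)), 1e-6)
    vh = max(float(np.dot(v, h)), 1e-6)
    a2 = roughness ** 4                                      # alpha = roughness^2, a2 = alpha^2
    d = a2 / (np.pi * (nh * nh * (a2 - 1.0) + 1.0) ** 2)     # GGX normal distribution
    f = f0 + (1.0 - f0) * (1.0 - vh) ** 5                    # Schlick Fresnel approximation
    k = (roughness + 1.0) ** 2 / 8.0                         # Schlick-GGX geometry factor
    g = (nl / (nl * (1.0 - k) + k)) * (nv / (nv * (1.0 - k) + k))
    return d * f * g / (4.0 * nl * nv)

# The five radiance-cue materials named in the text: an ideal diffuse (Lambertian)
# surface plus Cook-Torrance lobes at the four stated roughness values.
materials = [("lambertian", None)] + [("cook-torrance", r) for r in (0.02, 0.05, 0.13, 0.34)]
```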
1.6 UV map generation

For each frame of the training image sequence A′2, the present invention applies a rasterization shader, combined with that frame's camera parameters, to draw the proxy geometry into screen space. Using the vertex UV coordinates described in 1.3, the corresponding UV coordinate is interpolated and filled in for each screen pixel, yielding the UV map (see Figures 1-5(c)); the UV map sequence is denoted U2.
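A sketch of the per-pixel UV interpolation, assuming the rasterizer has already produced a per-pixel triangle index and barycentric coordinates for the proxy geometry (these inputs and all names are illustrative assumptions):

```python
import numpy as np

def screen_space_uv_map(bary, tri_ids, faces_uv, vert_uv, background=-1.0):
    """Interpolate per-vertex UVs into a screen-space UV map.

    bary: (H, W, 3) barycentric coordinates per pixel from a rasterizer.
    tri_ids: (H, W) index of the covering triangle per pixel, -1 for background.
    faces_uv: (F, 3) vertex indices per triangle; vert_uv: (V, 2) per-vertex UVs
    from the UV unwrapping step in 1.3."""
    h, w = tri_ids.shape
    uv_map = np.full((h, w, 2), background, dtype=np.float32)
    mask = tri_ids >= 0
    tri_uv = vert_uv[faces_uv[tri_ids[mask]]]            # (P, 3, 2): UVs of the 3 vertices
    uv_map[mask] = np.einsum('pk,pkc->pc', bary[mask], tri_uv)  # barycentric interpolation
    return uv_map
```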
Construction and training of the neural rendering pipeline model

2.1 Neural texture sampling module

The present invention takes the UV map obtained in step 1.6 as input; for each pixel, the value in the UV map is used as a coordinate at which the neural texture described in step 1.3 is sampled, yielding a three-dimensional tensor called the projected neural texture, see Figures 1-5(d). Because the neural texture is continuously updated during training, the projected neural texture must be recomputed in every iteration.

2.2 Multi-resolution neural network model

The multi-resolution neural network used in the present invention takes the concatenated projected neural texture and radiance cues as input; the input passes through a set of mean pooling operations with window size 2 and stride 2, producing a 5-level multi-resolution representation (mipmap). At each level of the multi-resolution representation, the input goes through a feature transformation module that outputs intermediate features, and the intermediate features go through a post-processing module that outputs an output image of the corresponding resolution. In addition, the intermediate features pass through an upsampling module, are passed to the next, finer level, concatenated with that level's representation and fed into the next feature transformation module, and so on, until the full-resolution image is finally output. The feature transformation modules of the different levels are independent and do not share parameters. The multi-resolution neural network works in the logarithmic domain to represent a larger dynamic range, so the input must be mapped to the logarithmic domain in advance and the network output must be mapped back to linear space.
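A minimal PyTorch sketch of the two operations just described; the authors' implementation is in TensorFlow, so the port, the tensor layout, the use of grid_sample and the log1p mapping are all assumptions. The UV map drives a bilinear lookup into the learnable neural texture, and repeated mean pooling builds the 5-level input pyramid:

```python
import torch
import torch.nn.functional as F

# Learnable neural texture: 16 channels at 512x512, randomly initialized (step 1.3).
neural_texture = torch.nn.Parameter(0.01 * torch.randn(1, 16, 512, 512))

def sample_neural_texture(uv_map, texture):
    """uv_map: (B, H, W, 2) with UV coordinates in [0, 1]; returns (B, 16, H, W).
    grid_sample expects coordinates in [-1, 1], so the UVs are remapped first."""
    grid = uv_map * 2.0 - 1.0
    tex = texture.expand(uv_map.shape[0], -1, -1, -1)
    return F.grid_sample(tex, grid, mode='bilinear', align_corners=False)

def build_pyramid(x, levels=5):
    """Mean pooling with window 2 and stride 2, repeated to get a 5-level mipmap
    (finest level first)."""
    pyramid = [x]
    for _ in range(levels - 1):
        pyramid.append(F.avg_pool2d(pyramid[-1], kernel_size=2, stride=2))
    return pyramid

# Assumed shapes: uv_map (B, 512, 512, 2); radiance_cues (B, 15, 512, 512), i.e. the
# five RGB renderings concatenated along channels (15 channels is an assumption).
uv_map = torch.rand(1, 512, 512, 2)
radiance_cues = torch.rand(1, 15, 512, 512)
projected = sample_neural_texture(uv_map, neural_texture)
radiance_cues_log = torch.log1p(radiance_cues)          # log-domain mapping (log1p is an assumption)
net_input = torch.cat([projected, radiance_cues_log], dim=1)
pyramid = build_pyramid(net_input, levels=5)
```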
The present invention uses a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, followed by an instance normalization layer (Instance Normalization) and finally a rectified linear activation layer (ReLU), as the feature transformation module. The upsampling module consists of a nearest-neighbour upsampling operation that doubles the resolution, followed by a convolution layer with 128 output channels, a kernel size of 3 and a stride of 1, an instance normalization layer, and finally a ReLU activation layer.
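The three sub-modules, sketched in PyTorch under the same porting assumption; the 128 output channels, 3x3 kernels, stride 1, instance normalization and ReLU follow the text, while the padding and input channel counts are assumptions:

```python
import torch.nn as nn

def feature_transform(in_ch):
    """Conv(128, 3x3, stride 1) -> InstanceNorm -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 128, kernel_size=3, stride=1, padding=1),
        nn.InstanceNorm2d(128),
        nn.ReLU(inplace=True),
    )

def post_process(in_ch=128):
    """Single Conv(128, 3x3, stride 1) producing the per-level output image
    (128 output channels, as stated in the text)."""
    return nn.Conv2d(in_ch, 128, kernel_size=3, stride=1, padding=1)

def upsample(in_ch=128):
    """Nearest-neighbour 2x upsampling -> Conv(128, 3x3, stride 1) -> InstanceNorm -> ReLU."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, 128, kernel_size=3, stride=1, padding=1),
        nn.InstanceNorm2d(128),
        nn.ReLU(inplace=True),
    )
```

Per the text, the feature transformation modules of the different levels are independent, so one `feature_transform` instance would be created per pyramid level rather than shared.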
2.3 Network training

The complete training data of the multi-resolution neural network used in the present invention comprises the foreground matte sequence A′2, the projected neural texture generated on the fly by the neural texture sampling module described in 2.1, and the radiance cue sequence R2. The training data are appropriately cropped and stretched to a resolution of 512*512. The multi-resolution neural network used in this invention has a corresponding image output at each resolution level, and an L1 loss function is applied to the result of each level. The target image of each level can be generated from the image sequence A′2: a single image of the sequence A′2 is denoted $\hat{a}_i$, where i is the image index and a superscript denotes the resolution level; a set of mean pooling operations with window size 2 and stride 2 generates a 5-level multi-resolution representation $\{\hat{a}_i^{\,l}\}$. The UV map described in 1.6 is denoted $u_i$, where i is the image index. The parameters of the neural texture and of the multi-resolution neural network are optimized jointly; the mathematical description of the training process is as follows:

$$R_2 = \{\, r_i \mid i = 1, 2, \ldots, N \,\}$$

$$U_2 = \{\, u_i \mid i = 1, 2, \ldots, N \,\}$$

$$A'_2 = \{\, \hat{a}_i \mid i = 1, 2, \ldots, N \,\}$$

$$\{\, a_i^{\,l} \mid l = 1, \ldots, 5 \,\} = \mathcal{F}\!\left( S(u_i; \theta_T),\ r_i;\ \theta_{\mathcal{F}} \right)$$

$$\theta_T^{*},\ \theta_{\mathcal{F}}^{*} = \arg\min_{\theta_T,\ \theta_{\mathcal{F}}}\ \sum_{i=1}^{N} \sum_{l=1}^{5} \lambda_l\, \mathcal{L}_1\!\left( a_i^{\,l},\ \hat{a}_i^{\,l} \right)$$

where $\mathcal{L}_1$ denotes the L1 loss function, N the total number of training images, S the neural texture sampling module, $\mathcal{F}$ the multi-resolution neural network, $\theta_T$ the neural texture parameters, $\theta_{\mathcal{F}}$ the multi-resolution network parameters, $\{a_i^{\,l}\}$ the set of five multi-resolution predicted images output by the network, and $\lambda_l$ the weighting factor of the loss at each resolution level; the weight of the finest resolution level is set to 1 and the remaining levels to 0.01.
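A hedged sketch of the per-level L1 objective under the same assumptions (PyTorch rather than the authors' TensorFlow); the model is assumed to return its five predictions finest level first, matching the pyramid builder above:

```python
import torch
import torch.nn.functional as F

def multi_resolution_l1_loss(predictions, target, levels=5):
    """predictions: list of 5 images, finest resolution first.
    target: full-resolution foreground matte (B, C, H, W) from A'2; its coarser
    versions are built with the same window-2, stride-2 mean pooling used for
    the inputs. The finest level is weighted 1.0, all other levels 0.01."""
    targets = [target]
    for _ in range(levels - 1):
        targets.append(F.avg_pool2d(targets[-1], kernel_size=2, stride=2))
    weights = [1.0] + [0.01] * (levels - 1)
    return sum(w * F.l1_loss(p, t) for w, p, t in zip(weights, predictions, targets))

# Joint optimization of the neural texture and the network weights; `model` and
# `matte` are placeholders for the trained pipeline and a target image:
# optimizer = torch.optim.Adam([neural_texture] + list(model.parameters()))
# loss = multi_resolution_l1_loss(model(pyramid), matte)
# loss.backward(); optimizer.step()
```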
Generation of new images and animations

3.1 Preparing the network input

Synthesizing a new image or animation requires specifying the corresponding camera intrinsics, the camera trajectory and poses, and the light source trajectory and poses. Radiance cues and a UV map must then be synthesized as the input of the neural rendering pipeline model. Their generation is exactly analogous to the procedures described in 1.5 and 1.6; only the camera and light source parameters need to be replaced by those of the new sequence to be generated.
3.2 Running the neural rendering pipeline model

Feeding the UV map and radiance cues generated in Section 3.1 into the neural rendering pipeline model synthesizes a new image of the modeled object under the specified viewing angle and lighting conditions, see Figures 1-5(e).
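At inference time the trained pipeline is simply run forward on the newly generated inputs; a sketch of that step, reusing the sampling and pyramid helpers from the Section 2.1/2.2 sketch and keeping the same log-domain assumption:

```python
import torch

@torch.no_grad()
def relight_frame(uv_map, radiance_cues, neural_texture, model):
    """Synthesize one frame for a new camera/light configuration.
    uv_map and radiance_cues come from step 3.1 (rasterizer + path tracer);
    `model` is the trained multi-resolution network, assumed to return its
    predictions finest level first."""
    projected = sample_neural_texture(uv_map, neural_texture)
    net_input = torch.cat([projected, torch.log1p(radiance_cues)], dim=1)
    predictions = model(build_pyramid(net_input, levels=5))
    return torch.expm1(predictions[0])   # map the finest-level output back to linear space
```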
Implementation Example

The inventors implemented an embodiment of the present invention on a server equipped with an Intel Xeon Platinum 8268 CPU and an NVidia Tesla V100 GPU (32GB), adopted all parameter values listed in the detailed description, and obtained all experimental results shown in Figures 1-5. The present invention can synthesize images of the modeled object under different viewing angles and lighting conditions as well as temporally stable image sequences (animations). For a 512*512 image, the entire processing pipeline takes about 1.9 seconds: the UV map and radiance cues, generated by the rasterization shader and the path tracing renderer respectively, take about 1.4 seconds; the forward pass of the neural rendering pipeline model, implemented in TensorFlow, takes about 460 milliseconds in total, of which data IO accounts for 385 milliseconds and the network forward pass for 75 milliseconds. In addition, training a multi-resolution neural network for a specific modeled object takes 20 hours.

Claims (3)

  1. A neural rendering method based on a multi-resolution network structure, characterized in that it comprises the following steps:
    (1) Image acquisition and preprocessing: take images of the object to be modeled under different viewing angles and lighting, process the data and obtain camera parameters and light source positions, proxy geometry and neural texture, foreground mattes, radiance cues and UV maps;
    (2) Construction and training of the neural rendering pipeline model: build a neural rendering pipeline model comprising a neural texture sampling module and a multi-resolution neural network, wherein the neural texture sampling module takes the UV map and the neural texture as input and generates the projected neural texture, which is then concatenated with the radiance cues and fed into the multi-resolution neural network to obtain the rendering result; the loss function and back-propagated gradients are computed between the rendering result and the corresponding real captured image, and the neural texture and the multi-resolution network parameters are jointly optimized from these gradients, thereby training the neural rendering pipeline model;
    (3) Generation of new images and animations: take images under specified camera parameters and lighting conditions, process them to obtain radiance cues and a UV map, and synthesize a new image or animation with the neural rendering pipeline model.
  2. The neural rendering method based on a multi-resolution network structure according to claim 1, wherein step (1) comprises the following sub-steps:
    (1.1) Image acquisition: in a dark environment, capture the object to be modeled with two synchronized cameras, one of which also provides the illumination and keeps its flash permanently on while the other keeps its flash off, yielding two associated image sequences of the object; additionally capture an extra image sequence of the object with one camera under natural lighting, which is used only for generating the proxy geometry;
    (1.2) Camera parameters and light-source positions: estimate the intrinsic parameters and the extrinsic-parameter sequences of the two cameras, and derive the spatial trajectory of the light source from the extrinsic-parameter sequence of the lighting camera;
    (1.3) Proxy geometry and neural texture: generate an approximate (not necessarily accurate) geometric model of the object to be modeled with the COLMAP algorithm and take it as the proxy geometry; compute per-vertex UV coordinates of the proxy geometry with a UV unwrapping algorithm; bind an optimizable texture map to the proxy geometry to obtain the neural texture;
    (1.4) Foreground matting: for each frame of the captured image sequences, rasterize the proxy geometry to the screen using the camera parameters to obtain foreground, background, and undetermined regions; using these as the basis, run the closed-form matting algorithm to obtain a foreground mask, and multiply the foreground mask with the image to obtain the foreground matte of the captured frame;
    (1.5) Radiance cues: for each frame of the captured image sequences, render images of the proxy geometry assigned several preset materials, using the frame's camera and lighting parameters, and concatenate the results as the radiance cues; the materials comprise an ideal diffuse surface model and four Cook-Torrance models with roughness 0.02, 0.05, 0.13, and 0.34, respectively; rendering is performed with a ray-tracing renderer based on the path-tracing algorithm (an illustrative BRDF sketch follows the claims);
    (1.6) UV maps: for each frame of the captured image sequences, generate a screen-space UV map from the camera parameters and the proxy geometry.
  3. The neural rendering method based on a multi-resolution network structure according to claim 2, wherein in step (2) the neural rendering pipeline model is built and trained as follows:
    (2.1) Training data: for each frame, the radiance cues, the UV map, and the foreground matte $\hat{I}_i$ form one training sample, where the radiance cues and the UV map serve as inputs to the neural rendering pipeline model and the foreground matte $\hat{I}_i$ serves as the fitting target;
    (2.2) Neural texture sampling module: the sampling module takes the UV map obtained in step (1.6) as input; for each pixel, the value stored in the UV map is used as a coordinate at which the neural texture described in step (1.3) is looked up, yielding the projected neural texture (a minimal sampling sketch follows the claims);
    (2.3) Multi-resolution neural network: the network takes the concatenation of the projected neural texture and the radiance cues described in step (1.5) as input and generates a set of multi-resolution representations; for each level of the multi-resolution representation, a feature transformation module processes the level's input into intermediate features, a post-processing module maps the intermediate features to an output image at the corresponding resolution, and an upsampling module passes the intermediate features to the next, higher-resolution level, where they are concatenated with that level's representation to form its input; a set of spatial filters is likewise applied to the foreground matte $\hat{I}_i$ to generate a set of multi-resolution representations that serve as fitting targets for the output images at each resolution level (a module-level sketch follows the claims);
    The feature transformation module has the following structure: first a convolution layer with 128 output channels, kernel size 3, and stride 1, then a separate normalization layer, and finally a rectified linear activation layer;
    The post-processing module is a convolution layer with 128 output channels, kernel size 3, and stride 1;
    The upsampling module has the following structure: first a 2x nearest-neighbour upsampling operation, then a convolution layer with 128 output channels, kernel size 3, and stride 1, then a separate normalization layer, and finally a rectified linear activation layer;
    (2.4) Loss function: constraints are imposed on the output images at every multi-resolution level, and the neural texture and the multi-resolution network parameters are optimized jointly (a training-loss sketch follows the claims). Mathematically:

    $$\min_{\theta_T,\;\theta_R}\;\sum_{i=1}^{N}\sum_{l=1}^{L}\lambda_l\,\big\lVert \hat{I}_i^{\,l}-\mathcal{R}_l(x_i;\,\theta_T,\theta_R)\big\rVert_1$$

    where $\lVert\cdot\rVert_1$ denotes the L1 loss function, N is the total number of training images, L is the total number of levels of the multi-resolution representation, $\mathcal{R}$ is the multi-resolution neural network, $\theta_T$ are the neural texture parameters, $\theta_R$ are the multi-resolution network parameters, i is the image index, l is the resolution-level index, $x_i$ is the model input (radiance cues and UV map) of frame i, $\hat{I}_i^{\,l}$ is the foreground matte at resolution level l serving as the fitting target, $\mathcal{R}_l(x_i;\theta_T,\theta_R)$ is the image predicted by the network at resolution level l, and $\lambda_l$ is the weighting factor of the loss at level l.
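Step (1.5) above renders the proxy geometry with an ideal diffuse material and four Cook-Torrance materials to produce the radiance cues. The claims do not fix which Cook-Torrance variant is meant; the sketch below assumes the common GGX distribution, Smith-Schlick geometry term, Schlick Fresnel, and a default specular reflectance f0 of 0.04, so it only illustrates the kind of material involved, not the exact renderer settings of the embodiment.

```python
import numpy as np

def cook_torrance_specular(normal, view_dir, light_dir, roughness, f0=0.04):
    """Specular BRDF value for one Cook-Torrance material (GGX distribution,
    Smith-Schlick geometry, Schlick Fresnel); all direction vectors are unit length."""
    half = (view_dir + light_dir) / np.linalg.norm(view_dir + light_dir)
    n_l = max(float(np.dot(normal, light_dir)), 1e-6)
    n_v = max(float(np.dot(normal, view_dir)), 1e-6)
    n_h = max(float(np.dot(normal, half)), 0.0)
    v_h = max(float(np.dot(view_dir, half)), 0.0)

    alpha2 = roughness ** 4                                               # alpha = roughness^2 remapping
    d = alpha2 / (np.pi * (n_h * n_h * (alpha2 - 1.0) + 1.0) ** 2)        # GGX normal distribution
    k = (roughness + 1.0) ** 2 / 8.0
    g = (n_l / (n_l * (1.0 - k) + k)) * (n_v / (n_v * (1.0 - k) + k))     # Smith-Schlick geometry
    f = f0 + (1.0 - f0) * (1.0 - v_h) ** 5                                # Schlick Fresnel
    return d * g * f / (4.0 * n_l * n_v)

# Materials used for the radiance cues in step (1.5): an ideal diffuse surface
# plus Cook-Torrance materials at these four roughness values.
COOK_TORRANCE_ROUGHNESS = [0.02, 0.05, 0.13, 0.34]
```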
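A sketch of the neural texture sampling module of step (2.2) in TensorFlow. Bilinear interpolation, UV values in [0, 1], and the texture resolution and channel count are assumptions made here for illustration; the claim itself only requires a lookup of the neural texture at the UV coordinate.

```python
import tensorflow as tf

def bilinear_sample(texture, uv):
    """texture: [H_t, W_t, C] learnable neural texture; uv: [H, W, 2] with values in [0, 1]."""
    h_t = tf.cast(tf.shape(texture)[0], tf.float32)
    w_t = tf.cast(tf.shape(texture)[1], tf.float32)
    x = uv[..., 0] * (w_t - 1.0)
    y = uv[..., 1] * (h_t - 1.0)
    x0, y0 = tf.floor(x), tf.floor(y)
    x1 = tf.minimum(x0 + 1.0, w_t - 1.0)
    y1 = tf.minimum(y0 + 1.0, h_t - 1.0)
    wx, wy = x - x0, y - y0

    def gather(yy, xx):
        idx = tf.stack([tf.cast(yy, tf.int32), tf.cast(xx, tf.int32)], axis=-1)
        return tf.gather_nd(texture, idx)

    # Blend the four neighbouring texels.
    top = gather(y0, x0) * (1.0 - wx)[..., None] + gather(y0, x1) * wx[..., None]
    bot = gather(y1, x0) * (1.0 - wx)[..., None] + gather(y1, x1) * wx[..., None]
    return top * (1.0 - wy)[..., None] + bot * wy[..., None]

# The neural texture itself is simply a trainable variable; resolution and channel
# count here are illustrative, not values fixed by the claims.
neural_texture = tf.Variable(tf.random.normal([1024, 1024, 16], stddev=0.01))
```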
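A structural sketch of the three modules of step (2.3) and one way to chain them, using Keras layers. The "separate normalization layer" is read here as a per-image, per-channel normalization, the number of levels and the channel count of the output images are illustrative, and the per-level inputs are assumed to be supplied coarse to fine; none of these choices are fixed by the claims.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SimpleNorm(layers.Layer):
    """Stand-in for the claim's normalization layer: per-image, per-channel statistics."""
    def call(self, x):
        mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
        return (x - mean) / tf.sqrt(var + 1e-5)

def feature_transform():
    # conv (128 output channels, 3x3 kernel, stride 1) -> normalization -> ReLU
    return tf.keras.Sequential(
        [layers.Conv2D(128, 3, strides=1, padding="same"), SimpleNorm(), layers.ReLU()])

def post_process(out_channels=128):
    # Per the claim a 128-channel conv; an RGB output image would instead use 3 channels.
    return layers.Conv2D(out_channels, 3, strides=1, padding="same")

def upsample():
    # nearest-neighbour x2 -> conv (128, 3x3, stride 1) -> normalization -> ReLU
    return tf.keras.Sequential(
        [layers.UpSampling2D(2, interpolation="nearest"),
         layers.Conv2D(128, 3, strides=1, padding="same"), SimpleNorm(), layers.ReLU()])

class MultiResRenderer(tf.keras.Model):
    def __init__(self, num_levels=4, out_channels=3):
        super().__init__()
        self.transforms = [feature_transform() for _ in range(num_levels)]
        self.heads = [post_process(out_channels) for _ in range(num_levels)]
        self.ups = [upsample() for _ in range(num_levels - 1)]

    def call(self, inputs_per_level):
        # inputs_per_level: coarse-to-fine list of [B, H_l, W_l, C] tensors.
        images, carried = [], None
        for l, x in enumerate(inputs_per_level):
            if carried is not None:
                x = tf.concat([carried, x], axis=-1)   # fuse upsampled features with this level
            feats = self.transforms[l](x)
            images.append(self.heads[l](feats))        # output image at this resolution
            if l < len(self.ups):
                carried = self.ups[l](feats)           # pass features to the next level
        return images
```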
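A sketch of the multi-level training loss of step (2.4), reusing bilinear_sample, MultiResRenderer, and neural_texture from the sketches above. The spatial filters that build the multi-resolution inputs and targets are assumed to be 2x average pooling, and the optimizer and the per-level weights lambda_l are illustrative values, not taken from the disclosure.

```python
import tensorflow as tf

def build_pyramid(x, num_levels):
    """Coarse-to-fine pyramid of a [B, H, W, C] tensor via 2x average pooling (assumed filter)."""
    pyr = [x]
    for _ in range(num_levels - 1):
        pyr.insert(0, tf.nn.avg_pool2d(pyr[0], 2, 2, "VALID"))
    return pyr

def multires_l1_loss(outputs, targets, level_weights):
    # sum over levels of lambda_l * L1(matte target at level l, prediction at level l)
    return tf.add_n([w * tf.reduce_mean(tf.abs(p - t))
                     for p, t, w in zip(outputs, targets, level_weights)])

optimizer = tf.keras.optimizers.Adam(1e-3)   # optimizer choice is illustrative
level_weights = [0.125, 0.25, 0.5, 1.0]      # illustrative lambda_l values, coarse to fine

def train_step(uv_map, cues, matte, renderer, neural_texture):
    """uv_map: [H, W, 2]; cues: [H, W, C_cues]; matte: [1, H, W, 3] foreground matte."""
    targets = build_pyramid(matte, len(level_weights))
    with tf.GradientTape() as tape:
        # Sample the neural texture inside the tape so theta_T receives gradients.
        projected = bilinear_sample(neural_texture, uv_map)[tf.newaxis]
        net_in = tf.concat([projected, cues[tf.newaxis]], axis=-1)
        outputs = renderer(build_pyramid(net_in, len(level_weights)))
        loss = multires_l1_loss(outputs, targets, level_weights)
    # Joint optimization of the neural texture and the network parameters.
    variables = [neural_texture] + renderer.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```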
PCT/CN2022/094877 2022-05-25 2022-05-25 Neural rendering method based on multi-resolution network structure WO2023225891A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/094877 WO2023225891A1 (en) 2022-05-25 2022-05-25 Neural rendering method based on multi-resolution network structure


Publications (1)

Publication Number Publication Date
WO2023225891A1

Family

ID=88918038

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/094877 WO2023225891A1 (en) 2022-05-25 2022-05-25 Neural rendering method based on multi-resolution network structure

Country Status (1)

Country Link
WO (1) WO2023225891A1 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114514561A (en) * 2019-10-03 2022-05-17 谷歌有限责任公司 Neural light transmission
RU2764144C1 (en) * 2020-07-27 2022-01-13 Самсунг Электроникс Ко., Лтд. Rapid two-layer neural network synthesis of realistic images of a neural avatar based on a single image
CN113538664A (en) * 2021-07-14 2021-10-22 清华大学 Vehicle de-illumination three-dimensional reconstruction method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAO, Duan; CHEN, Guojun; DONG, Yue; PEERS, Pieter; XU, Kun; TONG, Xin: "Deferred neural lighting", ACM Transactions on Graphics, vol. 39, no. 6, 26 November 2020, pages 1-15, XP059134795, ISSN: 0730-0301, DOI: 10.1145/3414685.3417767 *
THIES, Justus; ZOLLHÖFER, Michael; NIESSNER, Matthias: "Deferred neural rendering", ACM Transactions on Graphics, vol. 38, no. 4, 12 July 2019, pages 1-12, XP058439437, ISSN: 0730-0301, DOI: 10.1145/3306346.3323035 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333627A (en) * 2023-12-01 2024-01-02 南方科技大学 Reconstruction and complement method, system and storage medium for automatic driving scene
CN117333627B (en) * 2023-12-01 2024-04-02 南方科技大学 Reconstruction and complement method, system and storage medium for automatic driving scene

Similar Documents

Publication Publication Date Title
Rückert et al. Adop: Approximate differentiable one-pixel point rendering
Martin-Brualla et al. Nerf in the wild: Neural radiance fields for unconstrained photo collections
CN111656407B (en) Fusing, texturing and rendering views of a dynamic three-dimensional model
Srinivasan et al. Lighthouse: Predicting lighting volumes for spatially-coherent illumination
CN106803267B (en) Kinect-based indoor scene three-dimensional reconstruction method
Meilland et al. 3d high dynamic range dense visual slam and its application to real-time object re-lighting
CN112802173B (en) Relightable texture for use in rendering images
CN111243071A (en) Texture rendering method, system, chip, device and medium for real-time three-dimensional human body reconstruction
Weng et al. Vid2actor: Free-viewpoint animatable person synthesis from video in the wild
Li et al. Physically-based editing of indoor scene lighting from a single image
Maier et al. Super-resolution keyframe fusion for 3D modeling with high-quality textures
KR100834157B1 (en) Method for Light Environment Reconstruction for Image Synthesis and Storage medium storing program therefor.
US11887256B2 (en) Deferred neural rendering for view extrapolation
CN113345063B (en) PBR three-dimensional reconstruction method, system and computer storage medium based on deep learning
Wang et al. Stereo vision–based depth of field rendering on a mobile device
CN115428027A (en) Neural opaque point cloud
Ji et al. Geometry-aware single-image full-body human relighting
Unger et al. Spatially varying image based lighting using HDR-video
WO2023225891A1 (en) Neural rendering method based on multi-resolution network structure
Sevastopolsky et al. Relightable 3d head portraits from a smartphone video
Xu et al. Scalable image-based indoor scene rendering with reflections
Qiao et al. Dynamic mesh-aware radiance fields
Ye et al. High-fidelity 3D real-time facial animation using infrared structured light sensing system
Ma et al. Neural compositing for real-time augmented reality rendering in low-frequency lighting environments
Xu et al. Renerf: Relightable neural radiance fields with nearfield lighting

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22943096

Country of ref document: EP

Kind code of ref document: A1