WO2021096190A1 - Method for synthesizing 2D image of scene as viewed from desired viewpoint and electronic computing device implementing the same


Info

Publication number
WO2021096190A1
Authority
WO
WIPO (PCT)
Prior art keywords
point
image
machine learning
viewpoint
ray
Application number
PCT/KR2020/015686
Other languages
French (fr)
Inventor
Kara-Ali Alibulatovich ALIEV
Maria Vladimirovna KOLOS
Victor Sergeevich LEMPITSKY
Artem Mikhailovich SEVASTOPOLSKIY
Original Assignee
Samsung Electronics Co., Ltd.
Priority claimed from RU2020113525A (RU2749749C1)
Application filed by Samsung Electronics Co., Ltd.
Publication of WO2021096190A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 - 3D [Three Dimensional] image rendering
    • G06T15/08 - Volume rendering
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • G06T2210/00 - Indexing scheme for image generation or computer graphics
    • G06T2210/36 - Level of detail
    • G06T2210/56 - Particle system, point based geometry or rendering

Definitions

  • the present invention relates generally to the fields of computer vision and computer graphics to produce 2D images of a 3D scene as viewed from different viewpoints, in particular, to a method for synthesizing a 2D image of a scene as viewed from desired viewpoint and an electronic computing device implementing the method.
  • the present invention follows the neural point-based graphics framework (2), which uses neural modeling based on point cloud geometry representations.
  • Point clouds have several attractive properties compared to mesh representations and volumetric representations. First, unlike volumetric representations, they scale well to large scenes, as points in the cloud need not be uniformly or near-uniformly distributed. Second, while meshes may be unsuitable to represent various complex phenomena such as thin objects, point clouds can model them efficiently. In general, point clouds of natural scenes are easier to obtain than their mesh representations, with the meshing process being one of the most brittle steps of the traditional image-based modeling pipelines.
  • point clouds arise as an intermediate representation early in such pipelines, whenever a scene is captured using a depth sensor, which generates a collection of depth scans, or by passive multi-view stereo, which usually also generates a collection of dense or semi-dense depth maps.
  • the rendering pipeline (2) starts with a "hard” rasterization of points using OpenGL z-buffering. This may introduce significant noise and overfitting into the learning process when the point cloud is noisy, as the outlier points occlude the true surface points. While, in principle, the learning process can identify such outlier points and learn to "inpaint them out” , this requires extra network capacity and may lead to overfitting, when outliers are observed in very few views.
  • the success of (2) depends on the choice of the point radius used for their rasterization. If the selected radius does not roughly match point density, the results may degrade considerably, with either fine details being lost, or invisible surfaces "bleeding" through the visible ones.
  • Point clouds are simple to process, as they are stored as two real-valued arrays - point coordinates with respect to some world coordinate system and point colors. Their expressiveness depends only on the number of points, and clouds of varying spatial density can be represented.
  • Signed Distance Functions (SDF) (21) are another common representation.
  • Voxel representations are also natural to learn and use for any kind of processing; however, they occupy a large amount of memory and cannot adapt to varying resolution.
  • Meshes, basically being point clouds with non-differentiable triangles, are much harder to process and are mainly employed for rendering (19, 5).
  • Differentiable rendering frameworks make it possible to generate gradients with respect to various scene parameters, such as the intrinsic and extrinsic parameters of a camera, the spatial and physical properties of a given 3D representation (e.g. mesh vertex positions, colors, or reflectance), and lighting.
  • A significant amount of work is devoted to mesh rendering, since mesh topology makes it possible to leverage geometry deformation and to use generic priors for various reconstruction tasks.
  • Soft Rasterizer (15) suggests probabilistic smoothing of the discrete sampling operation, while OpenDR (17) and Kato (9) explicitly derive approximate partial derivatives which, however, rely on numerical methods.
  • The authors of (32) address this issue by computing the gradients through integrating the pixel intensity function.
  • Pix2Vex (22) features a semi-transparent Z-buffer and a cyclic image/geometry training pipeline.
  • Like triangle rasterization, ray tracing can be extended in a differentiable manner: in particular, the method of (12) enforces edge sampling with further integration to approximate gradients while handling occlusions accurately.
  • Point clouds, despite being a simpler representation for automated processing than meshes due to the absence of non-differentiable topology, are significantly harder to render realistically.
  • The Differentiable Surface Splatting (33) method evaluates a rendered image by projecting points onto a canvas (an element for producing a 2D image, an image plane) and blending them with truncated Gaussian kernels. Due to the truncation introduced for the sake of efficiency, the derivatives are calculated approximately.
  • Neural rendering implies learning an arbitrary scene representation in order to generate realistic imagery and manipulate its appearance (from scene attribute manipulation to inpainting).
  • Neural Volumes (16) is based on predicting a 4D volume (RGB + opacity) for a model from several photos with a variational autoencoder, warping the volume, and integrating it in an opacity-aware manner.
  • Deferred Neural Rendering (31) learns to estimate a neural texture of an object based on its UV coordinate maps and aims to synthesize an image from samples of this deep texture.
  • DeepVoxels (28) takes a similar approach and estimates a volumetric latent neural code of an object with a combination of a CNN and an RNN (GRU); it is followed up by Scene Representation Networks (29), which employ only an RNN for learning point depths.
  • Neural Point-Based Graphics (2) undertakes an in-between approach: it involves learning embeddings of points in a cloud, splats visible points with a large kernel onto a canvas via a fast Z-buffer, and desparsifies the result by a CNN of U-Net type. Nevertheless, due to the hard Z-buffer, this pipeline is not fully differentiable.
  • the present invention has been created to eliminate at least one of the above shortcomings and to provide at least one of the advantages described below.
  • a differentiable neural renderer of point clouds obtained from scans of scenes reconstructed from real-world imagery is introduced.
  • the system is capable of synthesizing realistic and high quality looks of 3D scenes represented as point clouds, even when occlusions, noise, reflections and other complications take place.
  • The suggested neural architecture consists of a recurrent neural network for sequential processing of points grouped by imaginary rays forming the camera frustum, and of a fully-convolutional neural network which refines the obtained image.
  • the renderer is trained on a number of 3D scenes captured in-the-wild with a corresponding set of their photographs from several viewpoints, and is able to generate novel photorealistic views of a new scene after training as perceived by an arbitrarily located camera in the real world. Decent results for scenes taken by commodity RGB-D scanners and for point clouds compiled from RGB photos are presented.
  • the aim of the present invention is to leverage the flexibility of the point cloud representation within an end-to-end, fully differentiable framework that makes it possible to solve various computer graphics tasks while handling flaws that naturally occur in real-world scans.
  • two improvements to the neural point-based graphics pipeline (2) are introduced that address two shortcomings.
  • the hard rasterization process is replaced with recurrent rasterization, where an LSTM network (7) performs a neural analog of the z-buffer algorithm.
  • Such replacement allows much more graceful handling of outlier points, as the LSTM network can learn to render them fully transparent.
  • LSTM rendering precedes the convolutional rendering and can be trained jointly with it.
  • the point radius selection problem is avoided by using a neural analog of the classical Mipmapping algorithm from computer graphics.
  • a convolutional MipMapNet architecture is proposed that rasterizes the point cloud several times at different resolutions while always using a single-pixel radius for point rasterization.
  • the resulting rasterizations are fused inside the MipMapNet, so that the fusion process implicitly selects the optimal point radius based on the local point density.
  • the present invention demonstrates that both improvements (the neural z-buffer and neural Mipmapping) of the neural point-based graphics framework lead to renderings that are more compelling, temporally stable, and less prone to artifacts in the presence of noisy points.
  • Technologies of the present invention make it possible to synthesize, from the point clouds, a 2D image of a scene as viewed from desired viewpoint with high quality and low computational cost.
  • One aspect of the present invention provides a method for synthesizing a 2D image of a scene as viewed from desired viewpoint, the method comprising: receiving (S101) a 3D point cloud obtained from a plurality of 2D images of the same scene, wherein each point of the cloud is defined by 3D coordinates in a world coordinate system and a point embedding; setting (S102) the viewpoint as a camera having intrinsic parameters and extrinsic parameters; transforming (S103) the 3D coordinates of each point to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters; defining (S104) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters; grouping (S105) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order
  • training of the machine learning predictor comprises two consecutive stages: a pretraining stage performed on first training data set, wherein the first training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images; and a fine-tuning stage performed on second training data set, wherein the second training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, wherein the scenes in the second training data set differ from the scenes in the first training data set, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
  • each of the two stages of the training of the machine learning predictor comprises: randomly choosing (S201), from respective training data set, training data of randomly chosen scene comprising the 3D point cloud, the viewpoint, and the 2D image captured from said viewpoint; transforming (S202) the 3D coordinates of each point of the 3D point cloud to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters of the camera by which the set of 2D images was captured; defining (S203) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters; grouping (S204) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint; calculating (S205), for each ray, a ray embedding by aggregating the
  • the trained machine learning predictor comprises two parts, wherein the first part of the trained machine learning predictor executes the step (S106), and the second part of the trained machine learning predictor executes the step (S108).
  • the first part of the trained machine learning predictor is at least one of a recurrent neural network
  • the second part of the trained machine learning predictor is a U-net-based neural network
  • Another aspect of the present invention provides an electronic computing device, comprising: at least one processor; and a memory that stores numerical parameters of a trained machine learning predictor and instructions that, when executed by at least one processor, cause the at least one processor to perform a method for synthesizing a 2D image of a scene as viewed from desired viewpoint.
  • Fig. 1 is a schematic diagram illustrating the ray grouping process.
  • Fig. 2 is a schematic diagram illustrating operations of a machine learning predictor.
  • Fig. 3 is a flowchart illustrating a preferred embodiment of a method for synthesizing a 2D image of a scene as viewed from desired viewpoint.
  • Fig. 4 is a flowchart illustrating a training process of a machine learning predictor according to the present invention.
  • Fig. 5 is a block diagram illustrating an electronic computing device according to the present invention.
  • "Module" or "unit" may perform at least one function or operation, and may be implemented with hardware, software, or a combination thereof.
  • “Plurality of modules” or “plurality of units” may be implemented with at least one processor (not shown) through integration thereof with at least one module other than “module” or “unit” which needs to be implemented with specific hardware.
  • a camera parameterized by its intrinsic parameters and extrinsic parameters according to pinhole camera model (4) is provided.
  • the intrinsic parameters of the camera comprise a focal length, an image sensor format, and a principal point.
  • the extrinsic parameters of the camera comprise the location and orientation of the camera with respect to the world coordinate system.
  • an image of the point cloud perceived by the camera could be recreated by splatting (34, 33) the points onto a canvas, i.e. setting the color of each point as the color of a pixel in a hard way (straightforward assignment) or in a soft way (additionally setting the point color to neighboring pixels and fusing the colors of neighboring points for the sake of smoothness).
  • Fig. 1 illustrates the ray grouping process.
  • A bunny is shown as an example of a sample 3D model, which is visualized and stored as a point cloud.
  • A camera in the left-bottom corner perceives an image of the bunny collected on a yellow plane nearby.
  • A red-green-yellow trapezoid is a slice of an infinite pyramid comprising the camera frustum - the space of all possible point locations which can have an effect on the image.
  • The image canvas is saved in a discretized way and contains a grid of pixels. Let us consider a ray (drawn in blue) which passes through one pixel on the image. One point from the bunny's chest and another point from the bunny's leg will be considered as belonging to this ray, since they are projected onto the same integer position on the image.
  • The points of the cloud are grouped according to their rounded screen-space coordinates as a first step of the pipeline. More specifically, imaginary rays coming out of the camera are considered. Each ray originates at the camera location and passes through the respective pixel of the image canvas. The screen-space coordinates of each point are arithmetically rounded, and the point is considered as belonging to the corresponding ray.
  • The grouping process results in a one-to-one distribution of points among buckets (rays), and the points in each bucket are subsequently sorted in order of their decreasing depth with respect to the camera (see Fig. 1). After this procedure, different buckets (rays) will store a different number of ordered points, and some of the buckets might remain empty in case no points were placed in them (see Fig. 1).
  • In stage (a), at each scale, the point cloud grouped by a number of rays is passed to the LSTM network, with weights shared across all pixels and all scales. The LSTM accepts the embedding and depth of each point in a ray, one by one, in the order of decreasing depth.
  • In stage (b), for each ray, the LSTM yields an aggregated ray embedding - a vector of scalars. A tensor is obtained by placing the ray embeddings on a canvas.
  • In stage (c), a CNN of U-Net-based architecture collects the tensors from all scales at the respective levels of resolution and fuses them into the final RGB image.
  • Ray grouping results in a distribution of the points of the cloud into rays of variable length.
  • The rays are processed by a recurrent neural network (RNN) with learnable parameters, comprised of recurrent cells, for example LSTM cells (7), but not limited to them.
  • The method will be described for the case of an RNN consisting of LSTM cells. Being fed with an input, the last hidden state, and the last cell state, such an RNN transforms this information into the consequent hidden and cell states; it is used to sequentially process the points in each ray one by one and to aggregate this cumulative information into the output feature of the corresponding image pixel (see Fig. 2).
  • Such a construction allows one to mix points along the ray in a back-to-front order, effectively ignoring non-relevant clusters of points and identifying frontal surface of underlying 3D structure with respect to the chosen camera.
  • the resulting cell state is interpreted as an aggregated feature of a ray which contains relevant information about the pixel color to be estimated (for instance, one of the possible solutions for RNN would be to reproduce embedding of the frontmost point) and it is called a ray embedding.
  • the intuition behind such an approach lies in the ability of LSTM cell to simulate both simple and complex transparency blending rules, including OVER operator (23), order-independent overlay, etc.
  • Stepwise outputs of the LSTM are not utilized, since the expression for the LSTM cell state is more similar to transparency blending formulas than the expression for the output variables. A predefined value is set for those pixels which correspond to empty rays, and the result is further considered as a multi-channel image tensor.
  • the ray embeddings are fused and transformed into a final RGB image by a fully-convolutional network (FCN).
  • The architecture of this network was mainly inspired by U-Net (26), which was augmented by adding a pyramid of multi-scale inputs (see Fig. 2).
  • At each scale of the contracting path, the network accepts the ray-embedding tensor of the respective level, where 1, 2, 4, ... are the downsampling factors of the levels, and stacks it with the feature maps from the level of higher resolution. This allows the network to fill missing regions and to exploit information about a wider context from several resolutions at once.
  • Instead of plain convolutions, partial convolutions (14) were used in the contracting part.
  • These layers receive an input and a mask and process only the values at non-masked positions in the input, properly reweighting the result with respect to the given mask. This is done in order to make the refining network less dependent on the possible sparsity of the input caused by holes in the ray-embedding tensors.
  • the first loss restricts the RNN to produce a canvas of ray embeddings semantically consistent with the ground truth picture. More specifically, along with the RNN and the CNN, one 1 × 1 convolution layer is trained, and its result is compared with the ground truth.
  • the second loss restricts the refining ConvNet to produce a final prediction similar to the ground truth picture. (An illustrative sketch of these two losses is given after this list.)
  • the learning procedure consists of two stages.
  • In the first stage, called pretraining, a set of point clouds of scenes of a similar kind is provided.
  • Both the embeddings of the points of each scene and the parameters of all ConvNets are optimized together according to the sum of all loss functions over all scenes.
  • The optimization is performed by the ADAM algorithm and involves backpropagation of the loss gradient.
  • In the second stage, called fine-tuning, a point cloud for at least one scene is provided, and the learning aims to optimize only the point embeddings for this scene or set of scenes with respect to the frozen, previously pretrained ConvNets.
  • The learning process starts with zero descriptor values for this new scene or set of scenes.
  • the method 100 comprises steps S101 to S108.
  • a 3D point cloud is received.
  • the 3D point cloud can be either stored in the device memory or received from any remote device by wire or wireless communications.
  • the 3D point cloud can be obtained from a plurality of 2D images of the same scene by any technique known in prior art. Each point of the cloud is defined by 3D coordinates in a world coordinate system and a point embedding.
  • the desired viewpoint is set as a camera having intrinsic parameters and extrinsic parameters.
  • the 3D coordinates of each point are transformed to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters.
  • the intrinsic parameters and the extrinsic parameters define a rule of perspective transformation of coordinates from the world coordinate system to a respective screen space coordinate system. Such transformation is known in the prior art and described above. Therefore, detailed description of the transformation is omitted herein.
  • a plurality of rays diverging from the viewpoint are defined.
  • the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters. The definition of the plurality of rays is described in detail above with reference to Fig. 1.
  • the points are grouped into point sets associated with the rays.
  • Each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint.
  • A detailed description of grouping the points is set forth above in the "Ray Grouping" section.
  • a trained machine learning predictor calculates a ray embedding by aggregating the point embeddings and depths of the respective point set.
  • the ray embeddings are projected onto an image plane.
  • the steps (S103) to (S107) are performed for a plurality of scales.
  • the plurality of scales are defined in advance.
  • the trained machine learning predictor fuses the image planes of the plurality of scales into the 2D image.
  • the trained machine learning predictor comprises two parts. The first part of the trained machine learning predictor executes the step (S106), and the second part of the trained machine learning predictor executes the step (S108).
  • the first part of the trained machine learning predictor is at least one of a recurrent neural network.
  • the second part of the trained machine learning predictor is a U-net-based neural network.
  • the machine learning predictor is trained in two consecutive stages.
  • the first stage is a pretraining stage.
  • the second stage is a fine-tuning stage.
  • the first stage is performed on first training data set.
  • the first training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
  • the second stage is performed on second training data set.
  • the second training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, wherein the scenes in the second training data set differ from the scenes in the first training data set, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
  • a training process 200 of a machine learning predictor is illustrated in Fig. 4.
  • Each of the two stages of the training of the machine learning predictor comprises steps S201 to S208.
  • training data are randomly chosen from respective training data set.
  • the training data are randomly chosen from the first training data set.
  • the training data are randomly chosen from the second training data set.
  • the training data belong to randomly chosen scene and comprise the 3D point cloud, the viewpoint, and the 2D image captured from said viewpoint.
  • the 3D coordinates of each point of the 3D point cloud are transformed to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters of the camera by which the set of 2D images was captured.
  • a plurality of rays is defined.
  • the rays diverge from the viewpoint.
  • the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters.
  • the points of the point cloud are grouped into point sets associated with the rays.
  • Each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint.
  • a ray embedding is calculated for each ray by aggregating the point embeddings and depths of the respective point set.
  • the machine learning predictor processes the ray embeddings to obtain a sum of the loss function values.
  • the process relating to the loss functions is described in detail above in the "Loss functions" section.
  • In step S207, the gradient of the obtained sum is evaluated with respect to each scalar weight of the machine learning predictor and the point embeddings.
  • each scalar weight of the machine learning predictor and the embeddings of all points are changed according to the predefined optimizer rule based on the evaluated gradient.
  • the steps (S201) to (S208) are repeated for a predefined number of times.
  • the electronic computing device 300 comprises at least one processor 301 and a memory 302.
  • the memory 302 stores numerical parameters of the trained machine learning predictor and instructions. At least one processor 301 executes the instructions stored in the memory 302 to perform the method 100 for synthesizing a 2D image of a scene as viewed from desired viewpoint.
  • the method disclosed herein can be implemented by at least one processor, such as a central processing unit (CPU) or a graphics processing unit (GPU), implemented on at least one of an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA), but not limited to them.
  • the method disclosed herein can be implemented by a computer-readable medium that stores numerical parameters of the trained machine learning predictor and computer-executable instructions that, when executed by a computer processor, cause the computer to perform the inventive method.
  • the trained machine learning predictor and instructions for implementing the present method can be downloaded to the electronic computing device via a network or from the medium.
  • the present invention can be applied in Virtual Reality headsets, Augmented Reality glasses, Mixed Reality glasses, smartphones and other Virtual and/or Augmented Reality devices and systems.
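As referenced in the loss-function items above, the two training losses compare (i) the ray-embedding canvas mapped to RGB by a jointly trained 1 × 1 convolution and (ii) the refining network's final prediction, each against the ground truth picture. The sketch below is illustrative only: the tensor shapes and the use of a plain L1 distance are assumptions, since the exact loss formulas are not reproduced in this text.

```python
import torch
import torch.nn as nn

class RendererLosses(nn.Module):
    """Sketch of the two losses: a 1x1 convolution maps the ray-embedding canvas to
    RGB for comparison with the ground truth, and the refining network's output is
    compared with the ground truth as well (L1 distance assumed for illustration)."""

    def __init__(self, embed_dim=8):
        super().__init__()
        # Trained jointly with the RNN and the refining CNN.
        self.to_rgb = nn.Conv2d(embed_dim, 3, kernel_size=1)

    def forward(self, ray_embedding_canvas, final_prediction, ground_truth):
        # ray_embedding_canvas: (B, embed_dim, H, W); the other two: (B, 3, H, W)
        loss_rays = (self.to_rgb(ray_embedding_canvas) - ground_truth).abs().mean()
        loss_final = (final_prediction - ground_truth).abs().mean()
        return loss_rays + loss_final
```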


Abstract

The present invention relates generally to: the fields of computer vision and computer graphics to produce 2D images of a 3D scene as viewed from different viewpoints; a method for synthesizing a 2D image of a scene as viewed from desired viewpoint; and an electronic computing device implementing the method. The method comprises: receiving (S101) a 3D point cloud obtained from a plurality of 2D images of the same scene; setting (S102) the viewpoint as a camera having intrinsic parameters and extrinsic parameters; transforming (S103) the 3D coordinates of each point to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters; defining (S104) a plurality of rays diverging from the viewpoint; grouping (S105) the points into point sets associated with the rays; calculating (S106), for each ray, a ray embedding by aggregating the point embeddings and depths of the respective point set with a trained machine learning predictor; projecting (S107) the ray embeddings onto an image plane, the steps (S103) to (S107) being performed for a plurality of scales; and fusing (S108) the image planes of the plurality of scales by the trained machine learning predictor into the 2D image. The present invention makes it possible to synthesize, from the point clouds, a 2D image of a scene as viewed from desired viewpoint with high quality and low computational cost.

Description

METHOD FOR SYNTHESIZING 2D IMAGE OF SCENE AS VIEWED FROM DESIRED VIEWPOINT AND ELECTRONIC COMPUTING DEVICE IMPLEMENTING THE SAME
The present invention relates generally to the fields of computer vision and computer graphics to produce 2D images of a 3D scene as viewed from different viewpoints, in particular, to a method for synthesizing a 2D image of a scene as viewed from desired viewpoint and an electronic computing device implementing the method.
Several approaches for neural appearance capture of geometrically and photometrically complex scenes have been proposed recently (16, 31, 28, 29). Such approaches rely on differentiable rendering of certain geometric representations, and the capture process is usually performed by the fitting procedure that iterates between such rendering and backpropagation of the loss between the rendered images and the ground truth images onto the learnable parameters.
Several types of geometric representations have been investigated in this context. Thus (31) assumes that the scene geometry is modeled with a triangular mesh that is provided to the fitting process, and estimates neural textures in order to capture the photometric properties of different parts of the surface. Alternatively, several works (9, 22, 15) attempt to learn both the mesh and the texture (or surface colors) through the backpropagation process, which however has proven to be very difficult due to inherent non-differentiability of mesh rasterization near occluding boundaries. Another line of works (16, 21) investigates explicit and implicit volumetric representations that are learned jointly with the rendering network.
The present invention follows the neural point-based graphics framework (2), which uses neural modeling based on point cloud geometry representations. Point clouds have several attractive properties compared to mesh representations and volumetric representations. First, unlike volumetric representations, they scale well to large scenes, as points in the cloud need not be uniformly or near-uniformly distributed. Second, while meshes may be unsuitable to represent various complex phenomena such as thin objects, point clouds can model them efficiently. In general, point clouds of natural scenes are easier to obtain than their mesh representations, with the meshing process being one of the most brittle steps of the traditional image-based modeling pipelines. At the same time, point clouds arise as an intermediate representation early in such pipelines, whenever a scene is captured using a depth sensor, which generates a collection of depth scans, or by passive multi-view stereo, which usually also generates a collection of dense or semi-dense depth maps.
While the original framework (2) demonstrated several capture results of high realism, it is limited in several aspects. Thus, the rendering pipeline (2) starts with a "hard" rasterization of points using OpenGL z-buffering. This may introduce significant noise and overfitting into the learning process when the point cloud is noisy, as the outlier points occlude the true surface points. While, in principle, the learning process can identify such outlier points and learn to "inpaint them out" , this requires extra network capacity and may lead to overfitting, when outliers are observed in very few views. Secondly, the success of (2) depends on the choice of the point radius used for their rasterization. If the selected radius does not roughly match point density, the results may degrade considerably, with either fine details being lost, or invisible surfaces "bleeding" through the visible ones.
3D representation of a scene
Today a lot of representations of 3D scenes with different properties exist, and many of them can be used for automated processing. These include point clouds, meshes, Signed Distance Functions (SDF) (21), voxel representations and Octrees (18, 30), etc. Point clouds are simple to process, as they are stored as two real-valued arrays - point coordinates with respect to some world coordinate system and point colors. Their expressiveness depends only on the number of points, and clouds of varying spatial density can be represented. Nowadays, numerous studies exist which operate on point clouds for 3D model classification, segmentation and generation (13, 11, 20, 10, 24, 25). Voxel representations are also natural to learn and use for any kind of processing; however, they occupy a large amount of memory and cannot adapt to varying resolution. Meshes, basically being point clouds with non-differentiable triangles, are much harder to process and are mainly employed for rendering (19, 5). A large body of work exists on synthesis of projections via various hardware and software techniques with a trade-off between photorealism and speed (1, 3, 19, 23).
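To make the "two real-valued arrays" remark concrete, a point cloud can be held as simply as in the following sketch; the array sizes and dtypes are illustrative assumptions, not part of the invention.

```python
import numpy as np

# A point cloud as two parallel real-valued arrays, as described above.
num_points = 100_000                                   # illustrative size
points = np.zeros((num_points, 3), dtype=np.float32)  # xyz coordinates in a world system
colors = np.zeros((num_points, 3), dtype=np.float32)  # RGB color of each point
```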
Differentiable rendering
Differentiable rendering frameworks make it possible to generate gradients with respect to various scene parameters, such as the intrinsic and extrinsic parameters of a camera, the spatial and physical properties of a given 3D representation (e.g. mesh vertex positions, colors, or reflectance), and lighting. A significant amount of work is devoted to mesh rendering, since mesh topology makes it possible to leverage geometry deformation and to use generic priors for various reconstruction tasks. Soft Rasterizer (15) suggests probabilistic smoothing of the discrete sampling operation, while OpenDR (17) and Kato (9) explicitly derive approximate partial derivatives which, however, rely on numerical methods. The authors of (32) address this issue by computing the gradients through integrating the pixel intensity function. Pix2Vex (22) features a semi-transparent Z-buffer and a cyclic image/geometry training pipeline. Like triangle rasterization, ray tracing can be extended in a differentiable manner: in particular, the method of (12) enforces edge sampling with further integration to approximate gradients while handling occlusions accurately. Point clouds, despite being a simpler representation for automated processing than meshes due to the absence of non-differentiable topology, are significantly harder to render realistically. The Differentiable Surface Splatting (33) method evaluates a rendered image by projecting points onto a canvas (an element for producing a 2D image, an image plane) and blending them with truncated Gaussian kernels. Due to the truncation introduced for the sake of efficiency, the derivatives are calculated approximately.
Neural rendering
Unlike physically based pipelines, neural rendering implies learning an arbitrary scene representation in order to generate realistic imagery and manipulate its appearance (from scene attribute manipulation to inpainting). For instance, Neural Volumes (16) is based on predicting a 4D volume (RGB + opacity) for a model from several photos with a variational autoencoder, warping the volume, and integrating it in an opacity-aware manner. Deferred Neural Rendering (31) learns to estimate a neural texture of an object based on its UV coordinate maps and aims to synthesize an image from samples of this deep texture. DeepVoxels (28) takes a similar approach and estimates a volumetric latent neural code of an object with a combination of a CNN and an RNN (GRU); it is followed up by Scene Representation Networks (29), which employ only an RNN for learning point depths. Neural Point-Based Graphics (2) undertakes an in-between approach: it involves learning embeddings of points in a cloud, splats visible points with a large kernel onto a canvas via a fast Z-buffer, and desparsifies the result by a CNN of U-Net type. Nevertheless, due to the hard Z-buffer, this pipeline is not fully differentiable.
The present invention has been created to eliminate at least one of the above shortcomings and to provide at least one of the advantages described below.
CITATION LIST
(1) T. Aila and S. Laine. Understanding the efficiency of ray traversal on GPUs. In Proceedings of the Conference on High Performance Graphics 2009, pages 145-149. ACM, 2009.
(2) K.-A. Aliev, D. Ulyanov, and V. Lempitsky. Neural point-based graphics. arXiv preprint arXiv:1906.08240, 2019.
(3) F. C. Crow. A comparison of antialiasing techniques. IEEE Computer Graphics and Applications, (1):40-48, 1981.
(4) J. M. Cychosz. An introduction to ray tracing. Computers & Graphics, 17(1):107, 1993.
(5) K. Dempski and D. S. Dietrich. Real-Time Rendering Tricks and Techniques in DirectX. Premier Press, 2002.
(6) J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255, June 2009.
(7) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
(8) J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proc. ECCV, pages 694-711, 2016.
(9) H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907-3916, 2018.
(10) L. Landrieu and M. Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558-4567, 2018.
(11) C.-L. Li, M. Zaheer, Y. Zhang, B. Poczos, and R. Salakhutdinov. Point cloud GAN. arXiv preprint arXiv:1810.05795, 2018.
(12) T.-M. Li, M. Aittala, F. Durand, and J. Lehtinen. Differentiable Monte Carlo ray tracing through edge sampling. In SIGGRAPH Asia 2018 Technical Papers, page 222. ACM, 2018.
(13) C.-H. Lin, C. Kong, and S. Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
(14) G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 85-100, 2018.
(15) S. Liu, T. Li, W. Chen, and H. Li. Soft Rasterizer: A differentiable renderer for image-based 3D reasoning. arXiv preprint arXiv:1904.01786, 2019.
(16) S. Lombardi, T. Simon, J. M. Saragih, G. Schwartz, A. M. Lehrmann, and Y. Sheikh. Neural Volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph., 38(4):65:1-65:14, 2019.
(17) M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In European Conference on Computer Vision, pages 154-169. Springer, 2014.
(18) D. Meagher. Geometric modeling using octree encoding. Computer Graphics and Image Processing, 19(2):129-147, 1982.
(19) J. Neider, T. Davis, and M. Woo. OpenGL Programming Guide, volume 14. Addison-Wesley, Reading, MA, 1993.
(20) A. Nguyen and B. Le. 3D point cloud segmentation: A survey. In 2013 6th IEEE Conference on Robotics, Automation and Mechatronics (RAM), pages 225-230. IEEE, 2013.
(21) J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103, 2019.
(22) F. Petersen, A. H. Bermano, O. Deussen, and D. Cohen-Or. Pix2Vex: Image-to-geometry reconstruction using a smooth differentiable renderer. arXiv preprint arXiv:1903.11149, 2019.
(23) T. Porter and T. Duff. Compositing digital images. In ACM SIGGRAPH Computer Graphics, volume 18, pages 253-259. ACM, 1984.
(24) C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652-660, 2017.
(25) C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099-5108, 2017.
(26) O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015.
(27) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
(28) V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhöfer. DeepVoxels: Learning persistent 3D feature embeddings. In Proc. CVPR, 2019.
(29) V. Sitzmann, M. Zollhöfer, and G. Wetzstein. Scene Representation Networks: Continuous 3D-structure-aware neural scene representations. CoRR, abs/1906.01618, 2019.
(30) R. Szeliski. Rapid octree construction from image sequences. CVGIP: Image Understanding, 58(1):23-32, 1993.
(31) J. Thies, M. Zollhöfer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures. In Proc. SIGGRAPH, 2019.
(32) Z. Wu and W. Jiang. Analytical derivatives for differentiable renderer: 3D pose estimation by silhouette consistency. arXiv preprint arXiv:1906.07870, 2019.
(33) W. Yifan, F. Serena, S. Wu, C. Öztireli, and O. Sorkine-Hornung. Differentiable surface splatting for point-based geometry processing. arXiv preprint arXiv:1906.04173, 2019.
(34) M. Zwicker, H. Pfister, J. Van Baar, and M. Gross. Surface splatting. In Proc. SIGGRAPH, pages 371-378. ACM, 2001.
-
In the present invention a differentiable neural renderer of point clouds obtained from scans of scenes reconstructed from real-world imagery is introduced. The system is capable of synthesizing realistic and high-quality looks of 3D scenes represented as point clouds, even when occlusions, noise, reflections and other complications take place. The suggested neural architecture consists of a recurrent neural network for sequential processing of points grouped by imaginary rays forming the camera frustum, and of a fully-convolutional neural network which refines the obtained image. The renderer is trained on a number of 3D scenes captured in the wild with a corresponding set of their photographs from several viewpoints, and after training it is able to generate novel photorealistic views of a new scene as perceived by an arbitrarily located camera in the real world. Decent results for scenes taken by commodity RGB-D scanners and for point clouds compiled from RGB photos are presented.
Inspired by ray-based rendering, the aim of the present invention is to leverage the flexibility of the point cloud representation within an end-to-end, fully differentiable framework that makes it possible to solve various computer graphics tasks while handling flaws that naturally occur in real-world scans.
In the present invention, two improvements to the neural point-based graphics pipeline (2) are introduced that address these two shortcomings. First, the hard rasterization process is replaced with recurrent rasterization, where an LSTM network (7) performs a neural analog of the z-buffer algorithm. Such a replacement allows much more graceful handling of outlier points, as the LSTM network can learn to render them fully transparent. In this approach, the LSTM rendering precedes the convolutional rendering and can be trained jointly with it.
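As an illustration of the recurrent rasterization idea, the following PyTorch sketch aggregates the points of a single ray, fed back-to-front, with an LSTM cell whose final cell state serves as the ray embedding. The dimensions, the per-ray loop, and the concatenation of depth with the embedding are assumptions made for clarity; a practical implementation would process all rays in parallel.

```python
import torch
import torch.nn as nn

class RayAggregator(nn.Module):
    """Sketch of a neural z-buffer analog: an LSTM cell consumes the points of one
    ray in order of decreasing depth; its final cell state is the ray embedding."""

    def __init__(self, embed_dim=8, hidden_dim=8):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.cell = nn.LSTMCell(embed_dim + 1, hidden_dim)  # +1 channel for the depth

    def forward(self, point_embeddings, point_depths):
        # point_embeddings: (num_points, embed_dim), already sorted back-to-front
        # point_depths:     (num_points,)
        h = torch.zeros(1, self.hidden_dim)
        c = torch.zeros(1, self.hidden_dim)
        for emb, depth in zip(point_embeddings, point_depths):
            x = torch.cat([emb, depth.view(1)]).unsqueeze(0)  # (1, embed_dim + 1)
            h, c = self.cell(x, (h, c))
        return c.squeeze(0)  # cell state interpreted as the aggregated ray embedding

# Usage: aggregate a ray of 5 points with 8-dimensional embeddings.
ray = RayAggregator()
embedding = ray(torch.randn(5, 8), torch.linspace(4.0, 1.0, 5))
```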
As the second part of the contribution, the point radius selection problem is avoided by using a neural analog of the classical Mipmapping algorithm from computer graphics. Thus, a convolutional MipMapNet architecture is proposed that rasterizes the point cloud several times at different resolutions while always using a single-pixel radius for point rasterization. The resulting rasterizations are fused inside the MipMapNet, so that the fusion process implicitly selects the optimal point radius based on the local point density.
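The sketch below conveys only the fusion idea: ray-embedding canvases rasterized at several scales are stacked level by level and reduced to an RGB image. It is not the patent's MipMapNet, which is U-Net-based and uses partial convolutions; every layer size and the use of average pooling here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Toy fusion of per-scale ray-embedding canvases into one RGB image."""

    def __init__(self, channels=8, num_scales=3):
        super().__init__()
        self.num_scales = num_scales
        self.convs = nn.ModuleList(
            nn.Conv2d(channels if i == 0 else 2 * channels, channels, 3, padding=1)
            for i in range(num_scales))
        self.to_rgb = nn.Conv2d(channels, 3, kernel_size=1)

    def forward(self, canvases):
        # canvases[i]: (B, C, H / 2**i, W / 2**i), one canvas per scale
        x = F.relu(self.convs[0](canvases[0]))
        for i in range(1, self.num_scales):
            x = F.avg_pool2d(x, 2)                   # descend one level
            x = torch.cat([x, canvases[i]], dim=1)   # inject the coarser canvas
            x = F.relu(self.convs[i](x))
        x = F.interpolate(x, scale_factor=2 ** (self.num_scales - 1))  # back to full size
        return torch.sigmoid(self.to_rgb(x))

# Usage: three canvases at full, half, and quarter resolution.
net = MultiScaleFusion()
canvases = [torch.randn(1, 8, 64 // 2 ** i, 64 // 2 ** i) for i in range(3)]
rgb = net(canvases)  # (1, 3, 64, 64)
```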
The present invention demonstrates that both improvements (the neural z-buffer and neural Mipmapping) of the neural point-based graphics framework lead to renderings that are more compelling, temporally stable, and less prone to artifacts in the presence of noisy points.
Technologies of the present invention make it possible to synthesize, from the point clouds, a 2D image of a scene as viewed from desired viewpoint with high quality and low computational cost.
One aspect of the present invention provides a method for synthesizing a 2D image of a scene as viewed from desired viewpoint, the method comprising: receiving (S101) a 3D point cloud obtained from a plurality of 2D images of the same scene, wherein each point of the cloud is defined by 3D coordinates in a world coordinate system and a point embedding; setting (S102) the viewpoint as a camera having intrinsic parameters and extrinsic parameters; transforming (S103) the 3D coordinates of each point to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters; defining (S104) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters; grouping (S105) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint; calculating (S106), for each ray, a ray embedding by aggregating the point embeddings and depths of the respective point set with a trained machine learning predictor; projecting (S107) the ray embeddings onto an image plane, wherein the steps (S103) to (S107) are performed for a predefined plurality of scales; fusing (S108) the image planes of the plurality of scales by the trained machine learning predictor into the 2D image.
In additional aspect, training of the machine learning predictor comprises two consecutive stages: a pretraining stage performed on first training data set, wherein the first training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images; and a fine-tuning stage performed on second training data set, wherein the second training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, wherein the scenes in the second training data set differ from the scenes in the first training data set, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
In another additional aspect, each of the two stages of the training of the machine learning predictor comprises: randomly choosing (S201), from respective training data set, training data of randomly chosen scene comprising the 3D point cloud, the viewpoint, and the 2D image captured from said viewpoint; transforming (S202) the 3D coordinates of each point of the 3D point cloud to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters of the camera by which the set of 2D images was captured; defining (S203) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters; grouping (S204) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint; calculating (S205), for each ray, a ray embedding by aggregating the point embeddings and depths of the respective point set; processing (S206), by the machine learning predictor, the ray embeddings to obtain a sum of the loss function values; evaluating (S207) gradient of the obtained sum with respect to each scalar weight of the machine learning predictor and point embeddings; changing (S208) each scalar weight of the machine learning predictor and the embeddings of all points according to the predefined optimizer rule based on the evaluated gradient, wherein the steps (S201) to (S208) are repeated for a predefined number of times.
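The following is a minimal sketch of how the two training stages could be organized in PyTorch. The per-scene embedding parameters, `networks`, `render_fn`, `loss_fn`, and the `scene` attributes are hypothetical names introduced only for illustration; the loss is assumed to be the sum described in the "Loss functions" items above.

```python
import torch

def make_optimizer(scene_embeddings, networks, fine_tuning):
    """Stage selection: pretraining optimizes per-scene point embeddings and all
    network weights; fine-tuning freezes the pretrained networks and optimizes
    only the new scene embeddings, starting from zero descriptors."""
    if fine_tuning:
        for net in networks:
            for p in net.parameters():
                p.requires_grad_(False)       # frozen, previously pretrained ConvNets
        for emb in scene_embeddings:
            torch.nn.init.zeros_(emb)         # zero descriptor values for the new scene
        params = list(scene_embeddings)
    else:
        params = list(scene_embeddings)
        for net in networks:
            params += list(net.parameters())
    return torch.optim.Adam(params)

def training_step(optimizer, render_fn, loss_fn, scene):
    """One iteration of S201-S208: render the randomly chosen scene from its
    ground-truth viewpoint, sum the losses, backpropagate, and update."""
    optimizer.zero_grad()
    prediction = render_fn(scene.point_cloud, scene.embeddings, scene.viewpoint)
    loss = loss_fn(prediction, scene.ground_truth_image)
    loss.backward()                           # gradient w.r.t. weights and embeddings
    optimizer.step()                          # predefined optimizer rule (ADAM here)
    return loss.item()
```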
In another additional aspect, the trained machine learning predictor comprises two parts, wherein the first part of the trained machine learning predictor executes the step (S106), and the second part of the trained machine learning predictor executes the step (S108).
In another additional aspect, the first part of the trained machine learning predictor is at least one of a recurrent neural network, and the second part of the trained machine learning predictor is a U-net-based neural network.
Another aspect of the present invention provides an electronic computing device, comprising: at least one processor; and a memory that stores numerical parameters of a trained machine learning predictor and instructions that, when executed by at least one processor, cause the at least one processor to perform a method for synthesizing a 2D image of a scene as viewed from desired viewpoint.
-
The above and other aspects, features and advantages of the present invention will be more apparent from the following detailed description, taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic diagram illustrating the ray grouping process.
Fig. 2 is a schematic diagram illustrating operations of a machine learning predictor.
Fig. 3 is a flowchart illustrating a preferred embodiment of a method for synthesizing a 2D image of a scene as viewed from desired viewpoint.
Fig. 4 is a flowchart illustrating a training process of a machine learning predictor according to the present invention.
Fig. 5 is a block diagram illustrating an electronic computing device according to the present invention.
In the following description, unless otherwise indicated, the same reference numbers are used for the same elements when they are depicted in different drawings, and their parallel description is not given.
-
The following description with reference to the accompanying drawings is provided to facilitate a thorough understanding of various embodiments of the present invention defined by the claims and its equivalents. To facilitate such an understanding the description includes various specific details, but these details should be considered only exemplary. Accordingly, those skilled in the art will find that various changes and modifications to the various embodiments described herein can be developed without departing from the scope of the present invention. In addition, descriptions of known functions and structures may be omitted for clarity and conciseness.
The terms and wordings used in the following description and claims are not limited to bibliographic meanings, but simply used by the inventor to provide a clear and consistent understanding of the present invention. Accordingly, it should be apparent to those skilled in the art that the following description of various embodiments of the present invention is provided for illustration only.
It should be understood that the singular forms as "a," "an," and "the" include the plural, unless the context clearly indicates otherwise.
It should be understood that, although the terms first, second, etc. may be used herein in reference to elements of the present disclosure, such elements should not be construed as limited by these terms. The terms are used only to distinguish one element from other elements.
Additionally, it should be understood that the terms "comprises" , "comprising" , "includes" and/or "including" , as used herein, mean the presence of the mentioned features, meanings, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, values, operations, elements, components and/or groups thereof.
In various embodiments of the present disclosure, "module" or "unit" may perform at least one function or operation, and may be implemented with hardware, software, or a combination thereof. "Plurality of modules" or "plurality of units" may be implemented with at least one processor (not shown) through integration thereof with at least one module other than "module" or "unit" which needs to be implemented with specific hardware.
Hereinafter, various embodiments of the present invention are described in more detail with reference to the accompanying drawings.
A 3D point cloud is defined by point coordinates with respect to a selected world coordinate system and by point embeddings - vectors associated with each point. The point cloud represents a 3D object or scene we aim to realistically render, and the embeddings will be further described as learnable parameters of the scene. Besides that, a camera parameterized by its intrinsic parameters and extrinsic parameters according to the pinhole camera model (4) is provided. The intrinsic parameters of the camera comprise a focal length, an image sensor format, and a principal point. The extrinsic parameters of the camera comprise the location and orientation of the camera with respect to the world coordinate system. These quantities define a rule of perspective transformation of coordinates from the world coordinate system to the respective screen space coordinate system. By this perspective projection, screen-space coordinates and their distances from the camera (depths) are obtained for all points in the cloud.
If the points were characterized by RGB colors instead of embeddings, an image of the point cloud perceived by the camera could be recreated by splatting (34, 33) the points onto a canvas, i.e. setting the color of each point as the color of a pixel in a hard way (straightforward assignment) or in a soft way (additionally setting the point color to neighboring pixels and fusing the colors of neighboring points for the sake of smoothness).
Nonetheless, while the described approach to rendering of point clouds is straightforward, it does not allow one to handle many important issues which naturally arise in realistic rendering, such as the presence of holes in the image between the projected points, accounting for visibility (most often, if several points splat onto the same pixel, only the one closest to the camera should influence the pixel color), lighting effects, and many others.
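For illustration only, the perspective projection (1) can be sketched with a few lines of Python/NumPy. The function and variable names (project_points, points, K, R, t) are introduced here as assumptions of this sketch and are not part of the claimed method.

    import numpy as np

    def project_points(points, K, R, t):
        # points: (N, 3) world coordinates; K: (3, 3) intrinsics; R: (3, 3), t: (3,) extrinsics
        cam = points @ R.T + t          # world -> camera coordinates
        uvw = cam @ K.T                 # apply intrinsics
        d = uvw[:, 2]                   # depth of each point with respect to the camera
        uv = uvw[:, :2] / d[:, None]    # perspective division -> screen-space (u, v)
        return uv, d

A hard splat would then simply write the color of each point into the pixel (round(u_i), round(v_i)), keeping, at most, the point closest to the camera per pixel.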
Ray Grouping
Fig. 1 illustrates the ray grouping process. The bunny shown is an example of a 3D model which is visualized and stored as a point cloud. A camera in the bottom-left corner perceives an image of the bunny collected on the yellow plane nearby. The red-green-yellow trapezoid is a slice of the infinite pyramid comprising the camera frustum - the space of all possible point locations which can have an effect on the image. The image canvas is stored in a discretized way and contains W×H pixels. Consider a ray (drawn in blue) which passes through pixel (x, y) of the image. One point from the bunny's chest and another point from the bunny's leg are considered as belonging to this ray, since they are projected onto the same integer position on the image.
In order to make the rendering rule more visibility-aware, the points in P are grouped according to their rounded screen-space coordinates as a first step of the pipeline. More specifically, imaginary rays coming out of the camera are considered. Each ray r_{x,y} originates at the camera location and passes through the respective pixel (x, y) of the image canvas. The screen-space coordinates (u_i, v_i) of each point in P are arithmetically rounded, and the point is considered as belonging to the ray r_{round(u_i),round(v_i)}. The grouping process results in a one-to-one distribution of the points among buckets (rays), and within each bucket the points are subsequently sorted in order of decreasing depth with respect to the camera (see Fig. 1). After this procedure, different buckets (rays) r_{x,y} store a different number of ordered points, and some of the buckets may remain empty in case no points were placed in them (see Fig. 1).
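As a minimal sketch, and assuming the project_points helper introduced above, the bucketing can be written as follows; the names group_points_by_ray, embeddings, W and H are illustrative.

    import numpy as np

    def group_points_by_ray(uv, d, embeddings, W, H):
        # uv: (N, 2) screen coordinates, d: (N,) depths, embeddings: (N, M) point embeddings
        x = np.rint(uv[:, 0]).astype(int)
        y = np.rint(uv[:, 1]).astype(int)
        inside = (x >= 0) & (x < W) & (y >= 0) & (y < H) & (d > 0)   # keep points inside the canvas
        buckets = {}
        for i in np.flatnonzero(inside):
            buckets.setdefault((x[i], y[i]), []).append((embeddings[i], d[i]))
        for key in buckets:                                          # back-to-front ordering
            buckets[key].sort(key=lambda item: -item[1])
        return buckets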
Neural Architecture
The main parts of the pipeline are illustrated in Fig. 2. Although the processing is shown in Fig. 2 with two scales, the number of scales can be arbitrary and is set in advance. In a stage (a), at each scale, the point cloud P grouped by rays is passed to an LSTM network with weights shared across all pixels and all scales. The LSTM accepts the embedding and depth of each point in a ray one by one, in order of decreasing depth. In a stage (b), for each ray, the LSTM yields an aggregated ray embedding - a vector with M scalars. An H×W×M tensor T is obtained by placing the ray embeddings on a canvas. In a stage (c), a CNN of U-Net-based architecture collects the tensors T from all scales at the respective levels of resolution and fuses them into the final RGB image.
Ray grouping results in a distribution of the N points of P into W×H rays r_{x,y} of variable length. Consider a recurrent neural network (RNN) f_φ with learnable parameters φ comprised of recurrent cells, for example LSTM cells (7), but not limited to them; the method will be described for the case of an RNN consisting of LSTM cells. Being fed with an input, the last hidden state and the last cell state (h, c), such an RNN transforms this information into the consequent hidden and cell states (h', c'). The RNN f_φ is used to sequentially process the points in each ray r_{x,y} one by one and to aggregate this cumulative information into the output feature of the corresponding image pixel (x, y) (see Fig. 2).
In more detail, for each ray r_{x,y}, feature vectors x_1, ..., x_K of its points are constructed, where x_j is composed of the embedding and the depth of the j-th point of the ray in back-to-front order. In case the points in P possess more features than just world coordinates and embeddings, such as point color, semantic segmentation, etc., these features can be projected onto screen coordinates and included in x_j. Fed with the features of a new point, the RNN produces an updated estimate (h_j, c_j) of the parameters of the whole ray:

    (h_j, c_j) = f_φ(x_j, h_{j-1}, c_{j-1}),   j = 1, ..., K.
Such a construction allows one to mix the points along the ray in a back-to-front order, effectively ignoring non-relevant clusters of points and identifying the frontal surface of the underlying 3D structure with respect to the chosen camera. The resulting cell state c_K is interpreted as an aggregated feature of the ray which contains the relevant information about the pixel color to be estimated (for instance, one of the possible solutions for the RNN would be to reproduce the embedding of the frontmost point), and it is called a ray embedding. The intuition behind such an approach lies in the ability of the LSTM cell to simulate both simple and complex transparency blending rules, including the OVER operator (23), order-independent overlay, etc. The stepwise outputs of the LSTM are not utilized, since the expression for the LSTM cell state is more similar to transparency blending formulas than the expression for the output variables. A zero vector is set for those pixels (x, y) which correspond to empty rays, and further the canvas T of ray embeddings is considered as a multi-channel image tensor.
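A compact PyTorch sketch of this per-ray aggregation is given below, under the assumption that the final LSTM cell state serves as the ray embedding; the class name RayAggregator and its arguments are illustrative rather than a reference implementation.

    import torch
    import torch.nn as nn

    class RayAggregator(nn.Module):
        # Aggregates the (embedding, depth) sequence of one ray into a single ray embedding.
        def __init__(self, emb_dim, hidden_dim):
            super().__init__()
            self.cell = nn.LSTMCell(emb_dim + 1, hidden_dim)   # +1 input channel for the depth
            self.hidden_dim = hidden_dim

        def forward(self, point_feats):
            # point_feats: (K, emb_dim + 1), ordered back to front (decreasing depth)
            h = point_feats.new_zeros(1, self.hidden_dim)
            c = point_feats.new_zeros(1, self.hidden_dim)
            for x in point_feats:
                h, c = self.cell(x.unsqueeze(0), (h, c))
            return c.squeeze(0)                                # final cell state = ray embedding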
Thus, the tensor T contains a set of ray embeddings for all non-empty rays, which reflect aggregated information about the points on a ray but do not depend on the points from adjacent rays. The notation T = R_{W×H}(P) is introduced, which comprises the aforementioned procedure of grouping across the W×H rays, the RNN-based processing, and the construction of the W×H canvas with ray embeddings.
Reusing this operation several times, a pyramid of tensors of different resolutions is constructed:

    T^(s) = R_{W/s × H/s}(P),   s = 1, 2, 4, ...

With the increase of downscaling, the degree of detail and "sharpness" declines; however, individual rays start containing more points. This results in better context handling by the RNN and fewer holes corresponding to empty rays in the tensors T^(s) at smaller scales.
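For illustration, the pyramid can be built by calling the same grouping and aggregation operator at progressively downscaled canvases. The sketch below assumes the project_points, group_points_by_ray and RayAggregator helpers introduced above, keeps the embeddings as a NumPy array for simplicity (a trainable version would keep them as torch parameters so that gradients can flow into them), and uses illustrative names throughout.

    import torch

    def rasterize_pyramid(points, embeddings, K, R, t, W, H, aggregator, scales=(1, 2, 4)):
        # Returns one (H/s, W/s, hidden_dim) canvas of ray embeddings per scale s.
        tensors = []
        for s in scales:
            Ws, Hs = W // s, H // s
            Ks = K.copy()
            Ks[:2, :] = Ks[:2, :] / s                     # shrink the intrinsics with the canvas
            uv, d = project_points(points, Ks, R, t)
            buckets = group_points_by_ray(uv, d, embeddings, Ws, Hs)
            canvas = torch.zeros(Hs, Ws, aggregator.hidden_dim)   # zero vectors for empty rays
            for (x, y), pts in buckets.items():
                feats = torch.tensor([list(e) + [depth] for e, depth in pts], dtype=torch.float32)
                canvas[y, x] = aggregator(feats)
            tensors.append(canvas)
        return tensors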
As a last step of the pipeline, the ray embeddings are fused and transformed into a final RGB image by a fully-convolutional network (FCN) g_θ with learnable parameters θ. The architecture of g_θ is mainly inspired by U-Net (26), augmented with a pyramid of multi-scale inputs (see Fig. 2). At each scale of the contracting path, g_θ accepts the tensor T^(s), where s = 1, 2, 4, ... is the downsampling factor of the respective level, and stacks it with the feature maps from the level of higher resolution. This allows the network to fill missing regions and to exploit information about a wider context from several resolutions at once. Instead of plain convolutions, partial convolutions (14) are used in the contracting part. These layers receive an input and a mask and process only the values at non-masked positions of the input, properly reweighting the result with respect to the given mask. This is done in order to make the refining network g_θ less dependent on the possible sparsity of the input caused by holes in the tensors T^(s).
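A minimal single-layer sketch of a partial convolution, written in PyTorch from the description in (14) rather than taken from any reference implementation, may look as follows; the class name and hyper-parameters are placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PartialConv2d(nn.Module):
        # Convolves only over valid (mask == 1) positions and re-weights the output
        # by the fraction of valid inputs under the kernel; also updates the mask.
        def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=True)
            self.register_buffer("ones", torch.ones(1, in_ch, kernel_size, kernel_size))
            self.window = in_ch * kernel_size * kernel_size
            self.padding = padding

        def forward(self, x, mask):
            # x: (B, in_ch, H, W); mask: (B, 1, H, W), 1 for non-empty rays, 0 for holes
            mask = mask.expand(-1, x.shape[1], -1, -1)
            with torch.no_grad():
                valid = F.conv2d(mask, self.ones, padding=self.padding)  # count of valid inputs
            out = self.conv(x * mask)
            bias = self.conv.bias.view(1, -1, 1, 1)
            out = (out - bias) * (self.window / valid.clamp(min=1.0)) + bias
            new_mask = (valid > 0).float()
            return out * new_mask, new_mask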
During training, the view-invariant point embeddings e_1, ..., e_N, the parameters φ and θ of both networks and the auxiliary parameters ψ are learned (see below) altogether.
Loss functions
The system is trained by optimizing a sum of two loss functions. Both losses leverage a perceptual loss L_p(I_1, I_2) between two images I_1 and I_2, based on the VGG-19 (27) network pretrained on ImageNet (6):

    L_p(I_1, I_2) = Σ_l || Φ_l(I_1) - Φ_l(I_2) ||,

where Φ_l(·) correspond to the feature maps from the relu1_2, relu2_2, relu3_4, relu4_4 and relu5_4 layers of VGG-19. This choice of layers is commonly justified as a representative subset of perceptual VGG features in style transfer and related areas (8).
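For illustration, a perceptual loss over exactly these VGG-19 activations can be assembled from torchvision's pretrained model as sketched below. The use of the L1 distance between feature maps, the omission of ImageNet input normalization and the torchvision weights identifier are assumptions of this sketch, not statements about the original training code.

    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import vgg19

    class PerceptualLoss(nn.Module):
        # Indices of relu1_2, relu2_2, relu3_4, relu4_4, relu5_4 in vgg19().features
        LAYERS = (3, 8, 17, 26, 35)

        def __init__(self):
            super().__init__()
            layers = list(vgg19(weights="IMAGENET1K_V1").features.eval().children())
            self.blocks = nn.ModuleList()
            prev = 0
            for idx in self.LAYERS:
                self.blocks.append(nn.Sequential(*layers[prev:idx + 1]))
                prev = idx + 1
            for p in self.parameters():
                p.requires_grad_(False)

        def forward(self, img1, img2):
            # img1, img2: (B, 3, H, W) images in the same value range
            loss, f1, f2 = 0.0, img1, img2
            for block in self.blocks:
                f1, f2 = block(f1), block(f2)
                loss = loss + F.l1_loss(f1, f2)
            return loss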
The first loss restricts the RNN f_φ to produce a canvas of ray embeddings T^(1) semantically consistent with the ground truth picture I_gt. More specifically, along with the RNN and the CNN, one 1×1 convolution layer C_ψ is trained, and its result is compared with the ground truth:

    L_RGB = L_p(C_ψ(T^(1)), I_gt).
The second loss restricts the refining ConvNet to produce a final prediction g_θ(T^(1), T^(2), ...) similar to the ground truth picture I_gt:

    L_final = L_p(g_θ(T^(1), T^(2), ...), I_gt).
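Putting the two objectives together, a sketch of the training loss follows; PerceptualLoss is the module defined above, while C_psi, fuse_net and ray_emb_dim are placeholder names for the auxiliary 1×1 convolution, the U-Net-based refining network and the ray embedding width.

    import torch.nn as nn

    ray_emb_dim = 8                                    # example embedding width (assumption)
    perc = PerceptualLoss()
    C_psi = nn.Conv2d(ray_emb_dim, 3, kernel_size=1)   # auxiliary 1x1 convolution head

    def total_loss(T1, pyramid, gt_image, fuse_net):
        # T1: finest-scale canvas of ray embeddings, shape (B, ray_emb_dim, H, W)
        loss_rgb = perc(C_psi(T1), gt_image)            # first loss, on the raw ray embeddings
        loss_final = perc(fuse_net(pyramid), gt_image)  # second loss, on the fused RGB prediction
        return loss_rgb + loss_final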
The learning procedure consists of two stages. In the first stage, called pretraining, given a set of point clouds of scenes of a similar kind, both the embeddings e_1, ..., e_N of the points of each scene and the parameters φ, θ, ψ of all ConvNets are optimized altogether according to the sum of all loss functions over all scenes. The optimization is performed by the ADAM algorithm and involves backpropagation of the loss gradient. In the second stage, called fine-tuning, given a point cloud of at least one new scene, the learning optimizes only the point embeddings of this scene or set of scenes with respect to the frozen, previously pretrained ConvNets. At the beginning of this stage, the learning process starts from zero descriptor values for the new scene or set of scenes.
One embodiment of the method for synthesizing a 2D image of a scene as viewed from desired viewpoint is described in more detail with reference to Fig. 3. The method 100 comprises steps S101 to S108.
In the step S101, a 3D point cloud is received. The 3D point cloud can be either stored in the device memory or received from any remote device by wire or wireless communications. The 3D point cloud can be obtained from a plurality of 2D images of the same scene by any technique known in prior art. Each point of the cloud is defined by 3D coordinates in a world coordinate system and a point embedding.
In the step S102, the desired viewpoint is set as a camera having intrinsic parameters and extrinsic parameters.
In the step S103, the 3D coordinates of each point are transformed to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters. The intrinsic parameters and the extrinsic parameters define a rule of perspective transformation of coordinates from the world coordinate system to a respective screen space coordinate system. Such a transformation is known in the prior art and is described above; therefore, a detailed description of the transformation is omitted herein.
In the step S104, a plurality of rays diverging from the viewpoint is defined. The rays are defined by the screen space coordinates and by the intrinsic parameters and the extrinsic parameters. The definition of the plurality of rays is described in detail above with reference to Fig. 1.
In the step S105, the points are grouped into point sets associated with the rays. Each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing depth with respect to the viewpoint. A detailed description of the point grouping is set forth above in the "Ray Grouping" section.
In the step S106, for each ray, a trained machine learning predictor calculates a ray embedding by aggregating the point embeddings and depths of the respective point set. In the step S107, the ray embeddings are projected onto an image plane. The steps (S103) to (S107) are performed for a plurality of scales. The plurality of scales is defined in advance. In the step S108, the trained machine learning predictor fuses the image planes of the plurality of scales into the 2D image. The steps S106, S107 and S108 are described in detail above with reference to Fig. 2.
The trained machine learning predictor comprises two parts. The first part of the trained machine learning predictor executes the step (S106), and the second part of the trained machine learning predictor executes the step (S108).
The first part of the trained machine learning predictor is a recurrent neural network (for example, built from LSTM cells), and the second part of the trained machine learning predictor is a U-Net-based neural network.
The machine learning predictor is trained in two consecutive stages. The first stage is a pretraining stage. The second stage is a fine-tuning stage. The first stage is performed on a first training data set. The first training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene and each 2D image in a set being captured from a different viewpoint; the viewpoints; and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
The second stage is performed on a second training data set. The second training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene and each 2D image in a set being captured from a different viewpoint, wherein the scenes in the second training data set differ from the scenes in the first training data set; the viewpoints; and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
A training process 200 of a machine learning predictor is illustrated on Fig. 4. Each of the two stages of the training of the machine learning predictor comprises steps S201 to S208.
In the step S201, training data are randomly chosen from respective training data set. For the pretraining stage, the training data are randomly chosen from the first training data set. For the fine-tuning stage, the training data are randomly chosen from the second training data set. The training data belong to randomly chosen scene and comprise the 3D point cloud, the viewpoint, and the 2D image captured from said viewpoint.
In the step S202, the 3D coordinates of each point of the 3D point cloud are transformed to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters of the camera by which the set of 2D images was captured.
In the step S203, a plurality of rays is defined. The rays diverge from the viewpoint. The rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters.
In the step S204, the points of the point cloud are grouped into point sets associated with the rays. Each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing depth with respect to the viewpoint.
In the step S205, a ray embedding is calculated for each ray by aggregating the point embeddings and depths of the respective point set.
In the step S206, the machine learning predictor processes the ray embeddings to obtain a sum of the loss function values. The processing relating to the loss functions is described in detail above in the "Loss functions" section.
In the step S207, the gradient of the obtained sum is evaluated with respect to each scalar weight of the machine learning predictor and to the point embeddings.
In the step S208, each scalar weight of the machine learning predictor and the embeddings of all points are changed according to the predefined optimizer rule (for example, ADAM) based on the evaluated gradient. The steps (S201) to (S208) are repeated a predefined number of times.
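Steps S201 to S208 correspond to a standard stochastic optimization loop. A sketch is given below under the assumptions stated earlier: training_scenes, scene.views, view.K/R/t/W/H/image, num_iterations and optimizer are placeholders, and the NumPy-based helpers above would need to be kept in torch tensors end-to-end for gradients to reach the point embeddings.

    import random
    import torch

    for step in range(num_iterations):                            # repeat S201..S208
        scene = random.choice(training_scenes)                    # S201: random scene
        view = random.choice(scene.views)                         # S201: random viewpoint/image
        pyramid = rasterize_pyramid(scene.points, scene.embeddings,
                                    view.K, view.R, view.t,
                                    view.W, view.H, aggregator)   # S202-S205
        T1 = pyramid[0].permute(2, 0, 1).unsqueeze(0)             # finest canvas as (1, M, H, W)
        loss = total_loss(T1, pyramid, view.image, fuse_net)      # S206: sum of losses
        optimizer.zero_grad()
        loss.backward()                                           # S207: gradient evaluation
        optimizer.step()                                          # S208: optimizer update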
A block diagram illustrating an electronic computing device according to the present invention is shown on Fig. 5. The electronic computing device 300 comprises at least one processor 301 and a memory 302.
The memory 302 stores numerical parameters of the trained machine learning predictor and instructions. At least one processor 301 executes the instructions stored in the memory 302 to perform the method 100 for synthesizing a 2D image of a scene as viewed from desired viewpoint.
The method disclosed herein can be implemented by at least one processor, such as a central processing unit (CPU) or a graphics processing unit (GPU), implemented on an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or the like, but is not limited thereto. In addition, the method disclosed herein can be implemented by a computer-readable medium that stores the numerical parameters of the trained machine learning predictor and computer-executable instructions that, when executed by a computer processor, cause the computer to perform the inventive method. The trained machine learning predictor and the instructions for implementing the present method can be downloaded to the electronic computing device via a network or from the medium.
The present invention can be applied in Virtual Reality headsets, Augmented Reality glasses, Mixed Reality glasses, smartphones and other Virtual and/or Augmented Reality devices and systems.
The above descriptions of embodiments of the invention are illustrative, and modifications of the configuration and implementation are within the scope of the present description. For example, although embodiments of the invention are described generally in connection with Figs. 1 and 2, the descriptions presented above are exemplary. Although the subject matter of the invention is described in language characteristic of structural features or method steps, it is understood that the subject matter of the invention is not necessarily limited to the described features or steps. Moreover, the specific features and steps described above are disclosed as exemplary forms of implementing the claims. The invention is not limited to the illustrated sequence of the method steps; the sequence can be modified by a skilled person without inventive effort. Some or all of the steps of the method may be performed sequentially or in parallel.
Accordingly, it is contemplated that the scope of the embodiment of the invention is limited only by the following claims.

Claims (6)

  1. A method for synthesizing a 2D image of a scene as viewed from desired viewpoint, the method comprising:
    receiving (S101) a 3D point cloud obtained from a plurality of 2D images of the same scene, wherein each point of the cloud is defined by 3D coordinates in a world coordinate system and a point embedding;
    setting (S102) the viewpoint as a camera having intrinsic parameters and extrinsic parameters;
    transforming (S103) the 3D coordinates of each point to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters;
    defining (S104) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters;
    grouping (S105) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint;
    calculating (S106), for each ray, a ray embedding by aggregating the point embeddings and depths of the respective point set with a trained machine learning predictor;
    projecting (S107) the ray embeddings onto an image plane,
    wherein the steps (S103) to (S107) are performed for a predefined plurality of scales;
    fusing (S108) the image planes of the plurality of scales by the trained machine learning predictor into the 2D image.
  2. The method according to claim 1, wherein training of the machine learning predictor comprises two consecutive stages:
    a pretraining stage performed on first training data set, wherein the first training data set includes:
    - sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint,
    - the viewpoints, and
    - 3D point clouds, each 3D point cloud obtained from the respective set of 2D images; and
    a fine-tuning stage performed on second training data set, wherein the second training data set includes:
    - sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, wherein the scenes in the second training data set differ from the scenes in the first training data set,
    - the viewpoints, and
    - 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
  3. The method according to claim 2, wherein each of the two stages of the training of the machine learning predictor comprises:
    randomly choosing (S201), from respective training data set, training data of randomly chosen scene comprising the 3D point cloud, the viewpoint, and the 2D image captured from said viewpoint;
    transforming (S202) the 3D coordinates of each point of the 3D point cloud to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters of the camera by which the set of 2D images was captured;
    defining (S203) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters;
    grouping (S204) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint;
    calculating (S205), for each ray, a ray embedding by aggregating the point embeddings and depths of the respective point set;
    processing (S206), by the machine learning predictor, the ray embeddings to obtain a sum of the loss function values;
    evaluating (S207) gradient of the obtained sum with respect to each scalar weight of the machine learning predictor and point embeddings;
    changing (S208) each scalar weight of the machine learning predictor and the embeddings of all points according to the predefined optimizer rule based on the evaluated gradient,
    wherein the steps (S201) to (S208) are repeated for a predefined number of times.
  4. The method according to any one of claims 1 to 3, wherein the trained machine learning predictor comprises two parts, wherein the first part of the trained machine learning predictor executes the step (S106), and the second part of the trained machine learning predictor executes the step (S108).
  5. The method according to claim 4, wherein the first part of the trained machine learning predictor is at least one of a recurrent neural network, and the second part of the trained machine learning predictor is a U-net-based neural network.
  6. An electronic computing device, comprising:
    at least one processor; and
    a memory that stores numerical parameters of a trained machine learning predictor and instructions that, when executed by at least one processor, cause the at least one processor to perform a method for synthesizing a 2D image of a scene as viewed from desired viewpoint according to any one of claims 1 to 5.
PCT/KR2020/015686 2019-11-12 2020-11-10 Method for synthesizing 2d image of scene as viewed from desired viewpoint and electronic computing device implementing the same WO2021096190A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
RU2019136333 2019-11-12
RU2019136333 2019-11-12
RU2020113525A RU2749749C1 (en) 2020-04-15 2020-04-15 Method of synthesis of a two-dimensional image of a scene viewed from a required view point and electronic computing apparatus for implementation thereof
RU2020113525 2020-04-15

Publications (1)

Publication Number Publication Date
WO2021096190A1 true WO2021096190A1 (en) 2021-05-20

Family

ID=75911419

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/015686 WO2021096190A1 (en) 2019-11-12 2020-11-10 Method for synthesizing 2d image of scene as viewed from desired viewpoint and electronic computing device implementing the same

Country Status (1)

Country Link
WO (1) WO2021096190A1 (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180096527A1 (en) * 2013-10-25 2018-04-05 Appliance Computing III, Inc. Image-based rendering of real spaces
US20180096463A1 (en) * 2016-09-30 2018-04-05 Disney Enterprises, Inc. Point cloud noise and outlier removal for image-based 3d reconstruction
US20190088004A1 (en) * 2018-11-19 2019-03-21 Intel Corporation Method and system of 3d reconstruction with volume-based filtering for image processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Advances in Intelligent Data Analysis XIX", vol. 42, 19 June 2019, SPRINGER INTERNATIONAL PUBLISHING, Cham, ISBN: 978-3-030-71592-2, ISSN: 0302-9743, article ALIEV KARA-ALI; SEVASTOPOLSKY ARTEM; KOLOS MARIA; ULYANOV DMITRY; LEMPITSKY VICTOR: "Neural Point-Based Graphics", pages: 696 - 712, XP047590859, DOI: 10.1007/978-3-030-58542-6_42 *
SITZMANN VINCENT; THIES JUSTUS; HEIDE FELIX; NIEBNER MATTHIAS; WETZSTEIN GORDON; ZOLLHOFER MICHAEL: "DeepVoxels: Learning Persistent 3D Feature Embeddings", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 15 June 2019 (2019-06-15), pages 2432 - 2441, XP033686397, DOI: 10.1109/CVPR.2019.00254 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505845A (en) * 2021-07-23 2021-10-15 黑龙江省博雅智睿科技发展有限责任公司 Deep learning training set image generation method based on language
CN113744379A (en) * 2021-08-25 2021-12-03 北京字节跳动网络技术有限公司 Image generation method and device and electronic equipment
CN115375884A (en) * 2022-08-03 2022-11-22 北京微视威信息科技有限公司 Free viewpoint synthesis model generation method, image rendering method and electronic device
CN115375884B (en) * 2022-08-03 2023-05-30 北京微视威信息科技有限公司 Free viewpoint synthesis model generation method, image drawing method and electronic device


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20887740; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20887740; Country of ref document: EP; Kind code of ref document: A1)