WO2020242170A1 - Electronic device and controlling method thereof - Google Patents

Electronic device and controlling method thereof Download PDF

Info

Publication number
WO2020242170A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural
rendering
point
camera
network
Prior art date
Application number
PCT/KR2020/006777
Other languages
French (fr)
Inventor
Kara-Ali Alibulatovich ALIEV
Victor Sergeevich LEMPITSKY
Dmitry Vladimirovich ULYANOV
Original Assignee
Samsung Electronics Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from RU2019138692A external-priority patent/RU2729166C1/en
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Publication of WO2020242170A1 publication Critical patent/WO2020242170A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00: 3D [Three Dimensional] image rendering
    • G06T15/10: Geometric effects
    • G06T15/40: Hidden part removal

Definitions

  • the disclosure relates to an electronic device and a controlling method thereof, and, for example, to an electronic apparatus capable of rendering computer graphics, virtual reality, and augmented reality, and to a method for modeling complex scenes by representing the geometry of a scene using its point cloud.
  • the outlined pipeline has been developed and polished by the computer graphics researchers and practitioners for decades. Under controlled settings, this pipeline yields stunningly realistic results. Yet several of its stages (and, consequently, the entire pipeline) remain brittle, often require manual intervention of designers and photogrammetrists, and are challenged by certain classes of objects (e.g. thin objects).
  • image-based rendering techniques [15, 27, 32, 38] aim to obtain photorealistic views by warping the original camera images using certain (oftentimes very coarse) approximations of scene geometry.
  • point-based graphics [16, 17, 25, 28] discards the estimation of the surface mesh and uses a collection of points or unconnected disks (surfels) to model the geometry.
  • deep rendering approaches [4, 5, 18, 20, 33] aim to replace physics-based rendering with a generative neural network, so that some of the mistakes of the modeling pipeline can be rectified by the rendering deep network.
  • RGBD sensors: Since the introduction of Kinect, RGBD sensors have been actively used for scene modeling due to the combination of their low cost and their suitability for 3D geometry acquisition [7, 34].
  • Robust algorithms for RGBD-based simultaneous localization and mapping (SLAM) are now available [11, 23, 41, 46].
  • Most registration (SLAM) algorithms working with RGBD data construct a dense volumetric scene representation, from which the scene surface can be extracted, e.g., using the marching cubes algorithm [30].
  • Such a surface estimation procedure is limited by the resolution of the underlying volumetric grid, and in general will lose, e.g., thin details that might be present in the raw RGBD data.
  • This parameterization samples the plenoptic function densely at the surface of the scene. Namely, for a dense set of surface elements (parameterized using surface coordinates (u,v)), the radiance/color along rays at arbitrary 3D angles α is recorded. Most recently, the deep variant of this parameterization was proposed in [5], where a fully-connected neural network accepting (u,v,α) as an input is used to store the surface lightfield. The network parameters are learned from a dataset of images and a surface mesh.
  • Deep splatting [4] and deep surface lightfields [5] are examples of a fast growing body of work that use neural networks to generate photorealistic images [10].
  • these works benefit greatly from the work in machine learning and image processing on generative image modeling and deep image processing, and in particular on works that use adversarial learning [14] and perceptual losses [9, 21] to train convolutional neural networks (ConvNets) [26] to output images (rather than to, e.g., classify them).
  • some methods avoid explicit warping and instead use some form of plenoptic function estimation and parameterization using neural networks.
  • [5] proposes a network-parameterized deep version of surface lightfields.
  • the approach [40] learns neural parameterization of the plenoptic function in the form of low-dimensional descriptors situated at the nodes of a regular voxel grid and a rendering function that turns the reprojection of such descriptors to the new view into an RGB image.
  • the closest analogue is the work [42], which proposes learning neural textures that encode the pointwise plenoptic function at different surface points alongside a neural rendering convolutional network.
  • the disclosure relates to an electronic device that reconstructs picture scenes on the basis of point clouds used as geometric proxies for a "geometric model", representing an object in 3D space with a set of points or polygons.
  • a method for controlling an electronic device includes the steps of obtaining input data including a point cloud with a neural descriptor for each point and camera parameters for the point cloud, estimating the viewpoint direction based on the input data, rasterizing the points of the point cloud with a z-buffer algorithm using the neural descriptor concatenated with the viewpoint direction as pseudo-colors, obtaining the resulting image by passing the rasterized points through a neural rendering network, learning the neural descriptor for every point and the neural rendering network, and rendering, using a loss function, the resulting image on a display as a ground truth.
  • the estimating step may be performed by software for the camera including at least one of Agisoft Metashape, COLMAP, and Open3D.
  • the estimating step may further comprise obtaining raw data using a hand-held commodity RGB-D sensor, and estimating the viewpoint direction using the raw data, wherein the hand-held commodity RGB-D sensor is used to capture the raw data, which is processed by the software for the camera.
  • the rasterizing step may further comprise first rasterizing each point into a square with a side length that is inversely proportional to a depth of the point with respect to the camera, wherein the neural rendering network provides a rendering process that is performed in OpenGL without anti-aliasing; superimposing the squares onto each other based on their depths with respect to the camera by using the Z-buffer algorithm; obtaining a multi-channel raw image by iterating over all footprint sets and filling all pixels; and mapping the multi-channel raw image into a three-channel RGB image by using a pretrained neural rendering network with learnable parameters.
  • the point cloud is obtained by at least one of open and proprietary applications, selected from the group of COLMAP and Agisoft Metashape.
  • the point cloud is a scene geometry representation
  • a convolutional neural network is used so that the output color value at a pixel depends on multiple neural descriptors and multiple points projected to the neighborhood of this pixel.
  • the neural rendering network uses a deep convolutional neural network to generate photorealistic renderings from new viewpoints.
  • the neural descriptors describe both geometric and photometric properties of the data.
  • the descriptors are local descriptors that are learned directly from data, and the learning is performed in coordination with the learning of the rendering network.
  • the camera is a hand-held RGBD camera, and the point clouds are reconstructed via stereo matching.
  • an electronic device includes a memory configured to store one or more instructions; and a processor configured to execute the one or more instructions, wherein the processor is configured to: obtain input data including a point cloud with a neural descriptor for each point and camera parameters for the point cloud, estimate the viewpoint direction based on the input data, rasterize the points of the point cloud with a z-buffer algorithm using the neural descriptor concatenated with the viewpoint direction as pseudo-colors, obtain the resulting image by passing the rasterized points through a neural rendering network, learn the neural descriptor for every point and the neural rendering network, and render, using a loss function, the resulting image on a display as a ground truth.
  • the processor may obtain raw data using a hand-held commodity RGB-D sensor, and estimate the viewpoint direction using the raw data, wherein the hand-held commodity RGB-D sensor is used to capture the raw data, which is then processed by the above-mentioned software for the camera.
  • the processor may first rasterize each point into a square with a side length that is inversely proportional to a depth of the point with respect to the camera, wherein the neural rendering network provides a rendering process that is performed in OpenGL without anti-aliasing; superimpose the squares onto each other based on their depths with respect to the camera by using the Z-buffer algorithm; obtain a multi-channel raw image by iterating over all footprint sets and filling all pixels; and map the multi-channel raw image into a three-channel RGB image by using a pretrained rendering network with learnable parameters.
  • Fig. 1a illustrates point cloud constructed from registered RGBD scans.
  • Fig. 1b illustrates learning the neural descriptors for every point.
  • Fig. 1c illustrates the proposed neural rendering network that maps the rasterized point descriptors to realistic images.
  • Fig. 3 illustrates comparative results on the 'Studio' dataset.
  • Fig. 4 illustrates comparative results on the 'LivingRoom' dataset (from [7]) - same format as in Fig. 3.
  • Fig. 5 illustrates comparative results on the 'Plant' dataset - same format as in Fig. 3.
  • Fig. 6 illustrates comparative results on the 'Shoe' dataset - same format as in Fig. 3.
  • Fig. 7 illustrates that the system can be used to speed up renderings of synthetic scenes.
  • FIG. 8 is a block diagram of an electronic apparatus according to an embodiment of the disclosure.
  • FIG. 9 is a flowchart provided to explain a controlling method of an electronic apparatus according to an embodiment.
  • The proposed disclosure learns neural descriptors of surface elements jointly with the rendering network (neural network).
  • The proposed approach uses a point-based geometry representation and thus avoids the need for surface estimation and meshing.
  • The authors follow the point-based graphics paradigm, as they represent the geometry of a scene using its point cloud. However, they do not estimate the surface orientation, suitable disk radii, or, in fact, even color explicitly. Instead, they keep the 3D point as the modeling primitive and encode all local parameters of the surface (both photometric and geometric) within neural descriptors that are learned from data.
  • Proposed is a method for rendering pictures on a display comprising: receiving a point cloud with neural descriptors D for each point and camera parameters C for the point cloud as input data; estimating the viewpoint direction from the input data by software for camera pose and geometry estimation; learning neural descriptors for every point together with a neural network; constructing the loss function according to which the neural network and the descriptors are learned; rasterizing the points of the point cloud with a z-buffer algorithm using the learned neural descriptors concatenated with the viewpoint direction as learned descriptors, wherein said rasterization is passed through the learned neural rendering network to obtain the resulting image; and rendering, using the loss function, the resulting image on the display.
  • the estimating consists of camera pose and geometry estimation by camera software such as Agisoft Metashape, COLMAP, or Open3D.
  • Hand-held commodity RGB-D sensors are used to capture raw data, which is then processed by the above-mentioned camera software.
  • the rasterizing consists of first rasterizing each point into a square with a side length that is inversely proportional to the depth of the point with respect to the camera, wherein the neural rendering network provides a rendering process that is performed in OpenGL without anti-aliasing; the Z-buffer algorithm is used for superimposing these squares onto each other using their depths with respect to the camera; a multi-channel raw image is created by iterating over all footprint sets and filling all pixels; and a pretrained rendering network with learnable parameters is used to map the multi-channel raw image into a three-channel RGB image.
  • the point cloud is obtained by algorithms implemented in various applications, open and proprietary, selected from the group of COLMAP and Agisoft Metashape.
  • the point cloud is a scene geometry representation.
  • the neural rendering network uses a deep convolutional neural network to generate photorealistic renderings from new viewpoints.
  • the convolutional neural network is used so that the output color value at a pixel depends on multiple neural descriptors and multiple points projected to the neighborhood of this pixel.
  • the neural descriptors describe both the geometric and the photometric properties of the data.
  • the descriptors are local descriptors that are learned directly from data, and such learning happens in coordination with the learning of the rendering network.
  • the camera is a hand-held RGBD camera, wherein the point clouds are reconstructed from simple RGB streams or via stereo matching.
  • The present disclosure is about a new point-based approach for modeling complex scenes from images.
  • the approach uses a raw point cloud as the geometric representation of a scene, and augments each point with a learnable neural descriptor that encodes the local geometry and appearance of the scene.
  • The proposed approach brings together several lines of work from the computer graphics, computer vision, and deep learning communities.
  • the present disclosure provides high realism of rendering given imperfectly reconstructed scene geometry, as well as simplicity and robustness of scene modeling.
  • Provided is improved realism of scene rendering in situations where the scene geometry is not perfectly modeled.
  • the solution may be stored in the device memory or on any suitable storage medium, and can be realized in any system that involves computer graphics (games, VR, AR), desktop, laptop, mobile.
  • The presented disclosure is a new point-based approach for modeling complex scenes. Similarly to classical point-based approaches, proposed is using 3D points as modeling primitives (surfels). Each of the points in the proposed approach is associated with a local descriptor containing information about the local geometry and appearance of the scene. Each descriptor may comprise, among other things, color information of the point. Although information other than color may be present in the descriptor vector, the descriptor vector may be referred to as a "pseudo-color".
  • a rendering network that translates point rasterizations into realistic views, while taking the learned descriptors as input, is learned in parallel with the descriptors.
  • the learning process is performed using a dataset of point clouds and images for each point cloud.
  • a point cloud is obtained from third-party software (e.g., from a video). Each point is described by three coordinates.
  • a descriptor (e.g., an 8-dimensional vector) is learned that, when passed through a neural network, turns into an RGB color, for example. Further, these descriptors are saved and can be used to render a picture from a different angle than the video capture from which the descriptors were learned.
  • the descriptor may be an M-dimensional vector; originally the vector is "empty", and during learning it is filled with information about the local geometry and/or appearance of the scene for each point in the cloud.
  • the proposed model can be fitted to new scenes and is capable of producing realistic views from new viewpoints.
  • the proposed system accomplishes this in a purely data-driven manner, while avoiding meshing or any other form of explicit surface reconstruction, and without performing explicit geometric and photometric surface parameter estimation.
  • The main effect of the disclosure is the possibility of reconstructing picture scenes on the basis of point clouds that are used as geometric proxies for a "geometric model", and of representing an object in 3D space with a set of points or polygons, while missing connectivity information as well as geometric noise and holes can be handled gracefully by deep rendering networks. It has also been shown that the model benefits from pretraining on a corpus of scenes, and that good results can be obtained with a universal rendering network that has not been fine-tuned for a particular scene.
  • Fig. 2: Given the point cloud P with neural descriptors D and camera parameters C, the viewpoint directions are estimated, and the points are then rasterized with a z-buffer, using the neural descriptors concatenated with the viewpoint directions as pseudo-colors. (It is still possible to use a soft z-buffer, which differs from the usual z-buffer in that it takes the transparency of objects into account; however, this method works much slower and is not suitable for real-time rendering.) Such a rasterization is then passed through the rendering network to obtain the resulting image. The proposed model is fit to new scene(s) by optimizing the parameters of the rendering network and the neural descriptors by backpropagation of the perceptual loss function.
  • a deep rendering network is learned in parallel with the descriptors, so that new views of the scene can be obtained by passing the rasterizations of a point cloud from new viewpoints through this network.
  • the input rasterizations use the learned descriptors as point pseudo colors.
  • the proposed approach can be used for modeling complex scenes and obtaining their photorealistic views, while avoiding explicit surface estimation and meshing.
  • compelling results are obtained for scenes scanned using hand-held commodity RGB-D sensors as well as standard RGB cameras, even in the presence of objects that are challenging for standard mesh-based modeling (a variety of objects with thin structures, such as leaves, ropes, bicycle wheels, etc.).
  • There are several commodity RGB-D sensors; the most widely available are Microsoft Kinect and Intel RealSense. These sensors are used to capture RGB and depth images. They are not able to estimate viewpoint directions on their own; special software, such as KinectFusion, is used for that.
  • Creating virtual models of real scenes usually involves a lengthy pipeline of operations. Such modeling usually starts with a scanning process, where the photometric properties are captured using camera images and the raw scene geometry is captured using depth scanners or dense stereo matching. The latter process usually provides a noisy and incomplete point cloud that needs to be further processed by applying certain surface reconstruction and meshing approaches. Given the mesh, the texturing and material estimation processes determine the photometric properties of surface fragments and store them in the form of 2D parameterized maps, such as texture maps [3], bump maps [2], view-dependent textures [8], and surface lightfields [47]. Finally, generating photorealistic views of the modeled scene involves a computationally-heavy rendering process such as ray tracing and/or radiance transfer estimation. The proposed approach instead starts from a point cloud constructed from registered RGBD scans (Fig. 1a), learns neural descriptors for every point (Fig. 1b), and uses a neural rendering network that maps the rasterized point descriptors to realistic images (Fig. 1c).
  • the approach uses the raw point-cloud as a scene geometry representation, thus eliminating the need for surface estimation and meshing.
  • proposed approach also uses a deep convolutional neural network to generate photorealistic renderings from new viewpoints.
  • the realism of the rendering is facilitated by the estimation of latent vectors (neural descriptors) that describe both the geometric and the photometric properties of the data.
  • the local descriptors are learned directly from data, and such learning happens in coordination with the learning of the rendering network (see fig. 1a to 1c).
  • Descriptors are learned by the gradient descent method, in parallel with the neural network that learns to interpret these descriptors. Learning the neural network and the descriptors relies on the loss function, which indicates how to change the neural network and the descriptors so that together they produce the desired picture. In other words, the perceptual loss is taken as the objective function, and when finding the extremum of the objective function using the gradient descent method, the parameter values of the descriptors are determined. However, any optimization method may be used.
  • The proposed approach is capable of modeling and rendering scenes that are captured by hand-held RGBD cameras as well as by simple RGB streams.
  • the point cloud is restored at the data-preparation stage from the RGB or RGB-D sequence (a video or many photos taken from different angles). This process is associated with viewpoint estimation, that is, it is necessary to understand from what angle the point cloud needs to be looked at, so that this cloud of points appears in place of the object that is in the photo.
  • The proposed approach is related to approaches based on surface lightfields, as it implicitly learns the parameterization of the pointwise plenoptic function at the scene surface within the neural descriptors. Unlike surface lightfields, the proposed approach does not require scene surface modeling. Also, differently from [5], which outputs the color value independently at each surface vertex, the proposed approach uses a convolutional neural network for rendering, so that the output color value at a pixel depends on multiple neural descriptors and multiple points projected to the neighborhood of this pixel.
  • Assume that the target image has a W×H-sized pixel grid and that its viewpoint is located at a given point.
  • Rendering point cloud is drawing points as pixels on an image.
  • New viewpoint means viewpoint not present in the training images.
  • Target image is an image from the training image sequence, which should be reconstructed with proposed neural network and point descriptors.
  • A rendering network means a network that takes the rendered point cloud as input and tries to re-paint it to look similar to the corresponding target image.
  • The point cloud is presented as a set of 3D points, coupled with neural descriptors (M-dimensional vectors) and viewpoint directions.
  • A viewpoint direction is a vector pointing from the camera position to a point in the cloud.
  • The camera defines from which direction the point cloud must be viewed to match the corresponding target image.
  • Points are rasterized (or rendered) using the z-buffer.
  • The z-buffer keeps only the foremost points.
  • This image is then fed to the neural network, which transforms this input into an RGB image that looks like the ground truth (target image). The loss is a function that measures the similarity between the image from the neural network and the ground truth.
  • the rendering is performed by first rasterizing each point into a square with the side length that is inversely proportional to the depth of the point with respect to the camera C.
  • the rendering is performed using OpenGL without anti-aliasing, so that the dimensions of each square are effectively rounded to the nearest integers.
  • the Z-buffer algorithm (Fig. 2) is used for superimposing these squares onto each other using their depths w.r.t. the camera. The "footprint" set of the i-th point resulting from such rendering denotes the set of pixels that are occupied by the rasterization of the i-th square after the z-buffer.
  • the concatenation of the neural descriptor d_i of a point with its viewpoint direction v_i is denoted (d_i, v_i).
  • [x,y] denotes the vectorial entry of the raw image corresponding to the pixel (x,y).
  • Concatenating the local surface information encoded within d_i with the viewpoint direction allows the proposed system to model view-dependent photometric effects, and also to fill in the holes in the point cloud in a way that takes the relative orientation of the surface with respect to the viewpoint direction vector into account.
  • the pixels not covered by any footprint are set to a special descriptor value d_0 ∈ R^M (which is also learned for a particular scene), and their viewpoint direction dimensions are set to zero.
  • Pretraining stage: 52 scans are taken from ScanNet (http://www.scan-net.org/) and the network is trained to render these scenes. In this stage, the network learns how to interpret point descriptors, which are jointly learned with the network.
  • the rendering network in the proposed case has a popular convolutional U-Net architecture [36] with gated convolutions [48].
  • The neural network is trained jointly with the point descriptors for two reasons: first, a neural network is needed that is able to interpret point descriptors; second, it is necessary to learn descriptors for a particular scene or object in order to render it using the neural network.
  • the learning is performed by optimizing the loss (equation (3)) over both the parameters ⁇ of the rendering network and the neural descriptors of points in the training set of scenes.
  • proposed approach learns the neural descriptors directly from data. Optimization is performed by the ADAM algorithm [24].
  • the neural descriptors are updated via backpropagation through (1) of the loss derivatives w.r.t. S(P,D,C) onto d_i.
  • a new scene can be modeled by proposed system given its point cloud and a set of RGB views registered with this point cloud.
  • the registered RGBD views can provide both the point cloud and the RGB views.
  • the model is based on a popular U-Net architecture with four downsampling and upsampling blocks. Max pooling layers are replaced with average pooling layers, and transposed convolutions with bilinear upsampling layers. It is observed that gated convolutions improve the performance of the model on sparse input data, so normal convolutions are substituted with gated convolutions in the proposed model. Since a U-Net is used as the proposed rendering network and rich point features are learned separately, it turns out that a lightweight network with fewer parameters can be used. The proposed model has four times fewer channels in each convolutional layer than in the original architecture, resulting in 1.96M parameters. This allows real-time rendering, taking 50ms on a GeForce RTX 2080 Ti to render a 1296x968 image.
  • Agisoft Metashape provides the registration, the point cloud, and the mesh by running proprietary structure-and-motion and dense multiview stereo methods. Two scenes are evaluated: 'Shoe' and 'Plant'.
  • the plant scene contains 2727 frames taken at 250ms intervals, of which every 50th is put into the validation set; 10 frames around each of these are withheld, and the rest are used as the fitting set.
  • for the shoe scene, a deliberately very small number of images has been taken: it contains 100 frames taken at 250ms intervals, which are shuffled, and 10 frames are held out for validation.
  • this option does the same as above.
  • the rendering network is not fine-tuned for the evaluation scene and is kept fixed, while the neural descriptors of the points are trained. Keeping the rendering network "universal", i.e. unadapted to a specific scene, may be more practical in many scenarios. Such learning converges after 20 epochs (5 minutes to 1 hour on 4x NVIDIA Tesla V-100, depending on the size of the scene).
  • Direct RenderNet: In this variant, an ablation of the proposed point-based system without neural descriptors is evaluated.
  • The rendering network that maps the point cloud, rasterized in the same way as in the proposed method, is learned.
  • The color of the point (taken from the original RGBD scan/RGB image), the 3D coordinate of the point, and the viewpoint direction v_i are used as a 9D pseudo-color.
  • the rendering network is then trained with the same loss as proposed.
  • the rendering network is also pretrained on the set of 52 scenes.
  • Direct RenderNet slow: It is observed that the Direct RenderNet variant described above benefits considerably from a higher-capacity, slower rendering network. Therefore, a variant with a rendering network with a doubled number of channels in all intermediate layers is evaluated (resulting in 4x parameters and 4x FLOPs).
  • Table 1 shows comparison results in terms of the perceptual loss (lower is better), PSNR (higher is better), and SSIM (higher is better) measures.
  • the methods marked with * have been pretrained on a hold-out scene dataset. See the text for the description of the methods. In most cases, the variants of the proposed method outperform the baselines.
  • Fig. 5 illustrates comparative results on the 'Plant' dataset - same format as in Fig. 3.
  • both the quantitative and the qualitative comparison reveals the advantage of using the point cloud as the geometric proxy.
  • Mesh+texture and Mesh+RenderNet perform worse than all methods that use the point clouds.
  • the exception is the Shoe scene, where the meshing procedure was successful at generating a reasonably good mesh.
  • the meshing process is performed by BundleFusion or Metashape.
  • the qualitative comparison reveals such failures, which are particularly notorious on thin objects (such as the details of the bicycle in Fig. 3 or the leaves of the plant in Fig. 5).
  • Fig. 3 illustrates comparative results on the 'Studio' dataset. Shown are the textured mesh, the colored point cloud, the results of three neural rendering systems, and the ground truth. The proposed system can successfully reproduce details that pose a challenge for meshing, and suffers less from blurriness than the Direct RenderNet system. From left to right, top to bottom: 1) mesh + texture (see Table 1), 2) point cloud with colors obtained from BundleFusion, 3) Direct RenderNet (see Table 1), 4) mesh + RenderNet (see Table 1), 5) Ours-full (see Table 1), 6) image taken from the RGB sensor ("ground truth").
  • Fig. 6 illustrates comparative results on the 'Shoe' dataset - same format as in Fig. 3. Unlike the other three datasets, the geometry of this scene was more suitable for mesh representation, and the mesh-based rendering performs relatively well. The proposed method again outperforms the Direct RenderNet baseline.
  • The proposed system based on neural descriptors of the points generally outperforms the Direct RenderNet ablation, which does not have such descriptors.
  • The validation frames are not too far from the fitting set, and it is observed that qualitatively the difference between methods becomes larger when the camera is moved further from the fitting-set cameras. The effect of this can be observed in the supplementary video.
  • the quality of single frames for such camera positions is considerably better for the proposed method than for the Direct baseline (which suffers from blurriness and loss of details).
  • this strong improvement in the quality of individual frames comes at the price of increased temporal flickering.
  • Fig. 7 shows the capability of the proposed approach to model synthetic scenes with extremely complex photometric properties.
  • Renderings of the standard Blender test scene using the proposed system are shown (third column).
  • the closest frame from the dataset of frames used for model fitting is shown in the fourth column.
  • While the proposed system does not match the result of the ray-tracing rendering exactly, it manages to reproduce some details in the specular reflection, and fine details in the texture, while doing so at real-time speed.
  • The Blender [2] test scene has complex lighting and a highly-specular object in the center; a point cloud (2.5 million points) is sampled from its surface, and the neural descriptors and the rendering network are learned from 200 random views of the scene.
  • the comparison of the proposed renderings with the "ground truth" synthetic renderings obtained by ray tracing within Blender reveals a very close match (Fig. 7). While Blender takes about 2 minutes to render one frame of this scene on two GeForce RTX 2080 Ti GPUs (highest quality setting), the proposed renderings are obtained in 50ms (20 frames per second) on one GeForce RTX 2080 Ti. It is noted that, given the availability of a good surface mesh for this scene, mesh-based neural rendering approaches are also likely to perform well at this task.
  • Presented is a neural point-based approach for modeling complex scenes.
  • Similarly to classical point-based approaches, 3D points are used as modeling primitives.
  • Each of the points in the proposed approach is associated with a local descriptor containing information about local geometry and appearance.
  • a rendering network that translates point rasterizations into realistic views, while taking the learned descriptors as input point pseudo-colors, is learned in parallel with the descriptors themselves.
  • the learning process is performed using a dataset of point clouds and images. After learning, the proposed model can be fitted to new scenes and is capable of producing realistic views from new viewpoints.
  • the proposed system accomplishes this in a purely data-driven manner, while avoiding meshing or any other form of explicit surface reconstruction, and without performing explicit geometric and photometric surface parameter estimation.
  • The model benefits from pretraining on a corpus of scenes, and good results can be obtained with a universal rendering network that has not been fine-tuned for a particular scene.
  • The proposed model currently cannot fill very big holes in geometry in a realistic way. Such an ability is likely to come with additional point cloud processing/inpainting that could potentially be trained jointly with the proposed modeling pipeline. A further direction is the performance of the system for dynamic scenes, where some update mechanism for the neural descriptors of points would need to be introduced.
  • FIG. 8 is a block diagram of an electronic apparatus according to an embodiment of the disclosure.
  • An electronic apparatus 100 may include a memory 110, and a processor 120.
  • the electronic apparatus 100 according to an embodiment may be implemented as various types of electronic apparatus such as smartphone, AR glasses, tablet PC, artificial intelligence speaker, mobile phone, video phone, e-book reader, TV, desktop PC, laptop PC, netbook computer, workstation, camera, smart watch, and the like.
  • the memory 110 may store various programs and data necessary for the operation of the electronic apparatus 100. Specifically, at least one instruction may be stored in the memory 110.
  • the processor 120 may perform the operation of the electronic apparatus 100 by executing the instructions stored in the memory 110.
  • the memory 110 may be implemented as a non-volatile memory, a volatile memory, a flash memory, a hard disk drive (HDD) or a solid state drive (SSD).
  • the memory 110 is accessed by the processor 120, and reading/recording/modifying/deleting/updating of data can be performed by the processor 120.
  • the term 'memory' may include the memory 110, a ROM (not illustrated) in the processor 120, a RAM (not illustrated), or a memory card (not illustrated) (e.g., a micro SD card or a memory stick) mounted on the electronic apparatus 100.
  • programs and data for configuring various screens to be displayed on a display area of the display may be stored in the memory 110.
  • the functions related to the artificial intelligence according to the disclosure are operated through the processor 120 and the memory 110.
  • the processor 120 may be implemented by a system-on-chip (SoC) or a large scale integration (LSI) in which a processing algorithm is embedded, and may also be implemented in the form of a field programmable gate array (FPGA).
  • the processor 120 may perform various functions by executing computer executable instructions stored in the memory to be described later.
  • the processor 120 may be configured as one or a plurality of processors.
  • one or the plurality of processors may be a general-purpose processor such as a CPU, an AP, or the like, a graphics-dedicated processor such as a GPU, a VPU, or the like, or an artificial intelligence dedicated processor such as an NPU.
  • One or the plurality of processors performs a control to process input data according to predefined operating rules or artificial intelligence models stored in the memory 110.
  • the predefined operating rules or artificial intelligence models are characterized by being created through learning.
  • the predefined operating rules or artificial intelligence models created through learning refer to predefined operating rules or artificial intelligence models of desired characteristics created by applying learning algorithms to a large amount of learning data.
  • the learning of the artificial intelligence model may be performed in a device itself in which the artificial intelligence according to the disclosure is performed, or may also be performed through a separate server/system.
  • the processor 120 may be electrically connected to the memory 110 and control the overall operations and functions of the electronic apparatus 100.
  • the processor 120 may obtain input data including a point cloud with neural descriptors D for each point and camera parameters C for the point cloud.
  • the point cloud is obtained by one of open and proprietary applications, selected from the group of COLMAP and Agisoft Metashape.
  • the camera is a hand-held RGBD camera, and the point clouds are reconstructed via stereo matching.
  • the processor 120 may estimate the viewpoint directions based on the input data. For example, the processor 120 may obtain raw data using hand-held commodity RGB-D sensors, and estimate the viewpoint directions using the raw data. According to an embodiment, the hand-held commodity RGB-D sensors are used to capture the raw data, which is then processed by the above-mentioned camera software.
  • the processor 120 may rasterize the points of the point cloud with a z-buffer algorithm using the neural descriptors concatenated with the viewpoint directions as pseudo-colors. For example, the processor 120 may first rasterize each point into a square with a side length that is inversely proportional to the depth of the point with respect to the camera, wherein the neural rendering network provides a rendering process that is performed in OpenGL without anti-aliasing. Then, the processor 120 may superimpose the squares onto each other based on their depths with respect to the camera by using the Z-buffer algorithm, obtain a multi-channel raw image by iterating over all footprint sets and filling all pixels, and map the multi-channel raw image into a three-channel RGB image by using a pretrained rendering network with learnable parameters.
  • the neural rendering network uses a deep convolutional neural network to generate photorealistic renderings from new viewpoints.
  • the processor 120 may obtain the resulting image by passing the rasterized points through the neural rendering network, learning the neural descriptors for every point and the neural rendering network.
  • the processor 120 may render, using a loss function, the resulting image on the display as a ground truth.
  • the point cloud is a scene geometry representation
  • the convolutional neural network is used so that the output color value at a pixel depends on multiple neural descriptors and multiple points projected to the neighborhood of this pixel.
  • the descriptors are local descriptors at that the local descriptors are learned directly from data, and the learning performs in coordination with the learning of the rendering network.
  • the neural descriptors describe both the geometric and the photometric properties of the data.
  • FIG. 9 is a flowchart provided to explain a controlling method of an electronic apparatus according to an embodiment.
  • the electronic apparatus 100 may obtain input data including a point cloud with neural descriptors for each point and camera parameters for the point cloud (S910).
  • the point cloud is obtained by one of open and proprietary applications, selected from the group of COLMAP and Agisoft Metashape.
  • the camera is a hand-held RGBD camera, and the point clouds are reconstructed via stereo matching.
  • the electronic apparatus 100 may estimate the viewpoint directions based on the input data (S920).
  • hand-held commodity RGB-D sensors are used to capture the raw data, which is then processed by the above-mentioned camera software.
  • the electronic apparatus 100 may rasterize the points of the point cloud with a z-buffer algorithm using the neural descriptors concatenated with the viewpoint directions as pseudo-colors (S930).
  • the electronic apparatus 100 may obtain the resulting image by passing the rasterized points through the neural rendering network, learning the neural descriptors for every point and the neural rendering network (S940). Then, the electronic apparatus 100 may render, using a loss function, the resulting image on the display as a ground truth (S950).
  • terms including an ordinal number such as 'first' , 'second' , etc. may be used to describe various components, but the components are not to be construed as being limited to the terms. The terms are only used to differentiate one component from other components.
  • the first element may be referred to as the second element and similarly, the second element may be referred to as the first element without going beyond the scope of rights of the present disclosure.
  • the term 'and/or' includes a combination of a plurality of items or any one of a plurality of terms.
  • a 'module' or a 'unit' performs at least one function or operation and may be implemented by hardware or software or a combination of the hardware and the software.
  • a plurality of 'modules' or 'units' may be integrated into at least one module and may be realized as at least one processor in an integrated manner except for 'modules' or 'units' that should be realized in specific hardware.
  • a case in which any one part is "connected” with the other part includes a case in which the parts are “directly connected” with each other and a case in which the parts are “electrically connected” with each other with other elements interposed therebetween.
  • the above-described various embodiments of the disclosure may be implemented as software including instructions that can be stored in machine-readable storage media, which can be read by machine (e.g.: computers).
  • the machine refers to an apparatus that calls instructions stored in a storage medium, and can operate according to the called instructions, and the apparatus may include an electronic apparatus (e.g.: an electronic apparatus (A)) according to the embodiments described in the disclosure.
  • When an instruction is executed by a processor, the processor may perform a function corresponding to the instruction by itself, or by using other components under its control.
  • the instruction may include a code that is generated or executed by a compiler or an interpreter.
  • the storage medium that is readable by machine may be provided in the form of a non-transitory storage medium.
  • the term 'non-transitory' only means that a storage medium does not include signals, and is tangible, but does not indicate whether data is stored in the storage medium semi-permanently or temporarily.
  • a computer program product refers to a product, and it can be traded between a seller and a buyer.
  • the computer program product can be distributed on-line in the form of a storage medium that is readable by machines (e.g.: a compact disc read only memory (CD-ROM)), or through an application store (e.g.: play store TM).
  • at least a portion of the computer program product may be stored in a storage medium such as the server of the manufacturer, the server of the application store, and the memory of the relay server at least temporarily, or may be generated temporarily
  • each of the components according to the aforementioned various embodiments may consist of a singular object or a plurality of objects.
  • some sub components may be omitted, or other sub components may be further included in the various embodiments.
  • some components (e.g., a module or a program) may be integrated as one object, and perform the functions that were performed by each of the components before integration identically or in a similar manner. Operations performed by a module, a program, or other components according to the various embodiments may be executed sequentially, in parallel, repetitively, or heuristically. Alternatively, at least some of the operations may be executed in a different order or omitted, or other operations may be added.
  • Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam. http://www.blender.org (retrieved 20.05.2019).
  • KinectFusion: Real-time dense surface mapping and tracking. In ISMAR. IEEE Computer Society, 127-136.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Generation (AREA)

Abstract

An electronic apparatus and a method thereof are provided. The method for controlling an electronic device includes obtaining input data including a point cloud with a neural descriptor for each point and camera parameters for the point cloud, estimating the viewpoint direction based on the input data, rasterizing the points of the point cloud with a z-buffer algorithm using the neural descriptor concatenated with the viewpoint direction as pseudo-colors, obtaining the resulting image by passing the rasterized points through a neural rendering network, learning the neural descriptor for every point and the neural rendering network, and rendering, using a loss function, the resulting image on a display as a ground truth.

Description

ELECTRONIC DEVICE AND CONTROLLING METHOD THEREOF
The disclosure relates to an electronic device and a controlling method thereof, and, for example, to an electronic apparatus capable of rendering computer graphics, virtual reality, and augmented reality, and to a method for modeling complex scenes by representing the geometry of a scene using its point cloud.
The outlined pipeline has been developed and polished by the computer graphics researchers and practitioners for decades. Under controlled settings, this pipeline yields stunningly realistic results. Yet several of its stages (and, consequently, the entire pipeline) remain brittle, often require manual intervention of designers and photogrammetrists, and are challenged by certain classes of objects (e.g. thin objects).
Multiple streams of work aim to simplify the entire pipeline by eliminating some of its stages. Thus, image-based rendering techniques [15, 27, 32, 38] aim to obtain photorealistic views by warping the original camera images using certain (oftentimes very coarse) approximations of scene geometry. Alternatively, point-based graphics [16, 17, 25, 28] discards the estimation of the surface mesh and uses a collection of points or unconnected disks (surfels) to model the geometry. More recently, deep rendering approaches [4, 5, 18, 20, 33] aim to replace physics-based rendering with a generative neural network, so that some of the mistakes of the modeling pipeline can be rectified by the rendering deep network.
RGBD scene modeling.
Since the introduction of Kinect, RGBD sensors have been actively used for scene modeling due to the combination of their low cost and their suitability for 3D geometry acquisition [7, 34]. Robust algorithms for RGBD-based simultaneous localization and mapping (SLAM) are now available [11, 23, 41, 46]. Most registration (SLAM) algorithms working with RGBD data construct a dense volumetric scene representation, from which the scene surface can be extracted, e.g., using the marching cubes algorithm [30]. Such a surface estimation procedure, however, is limited by the resolution of the underlying volumetric grid, and in general will lose, e.g., thin details that might be present in the raw RGBD data.
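For illustration only, the following is a minimal sketch of the surface extraction step mentioned above, using the marching cubes implementation in scikit-image; the toy signed-distance volume and the voxel size are hypothetical placeholders, not data or code from the disclosure.

```python
# Minimal sketch: extracting a triangle mesh from a dense volumetric
# representation (e.g. a TSDF fused from RGBD scans) with marching cubes.
# The toy volume and `voxel_size` below are hypothetical placeholders.
import numpy as np
from skimage import measure

# Toy signed-distance volume: a sphere of radius 20 voxels in a 64^3 grid.
grid = np.mgrid[-32:32, -32:32, -32:32]
tsdf_volume = np.sqrt((grid ** 2).sum(axis=0)) - 20.0
voxel_size = 0.01  # metres per voxel (assumed)

# Extract the zero level set; the resolution is limited by the voxel grid,
# which is why thin structures present in the raw RGBD data may be lost.
verts, faces, normals, _ = measure.marching_cubes(tsdf_volume, level=0.0)
verts_metric = verts * voxel_size
print(verts_metric.shape, faces.shape)
```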
Surface lightfields.
Since the inception of image-based rendering methods [32, 38], several ways to parameterize the plenoptic function [32] have been proposed. Among the most efficient are surface lightfields [47].
This parameterization samples the plenoptic function densely at the surface of the scene. Namely, for a dense set of surface elements (parameterized using surface coordinates (u,v)), the radiance/color along rays at arbitrary 3D angles α is recorded. Most recently, the deep variant of this parameterization was proposed in [5], where a fully-connected neural network accepting (u,v,α) as an input is used to store the surface lightfield. The network parameters are learned from a dataset of images and a surface mesh.
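As a hedged illustration of the deep surface lightfield idea described above, the sketch below shows a small fully-connected network that maps surface coordinates (u, v) and a viewing direction to a color; the layer sizes and activation choices are assumptions, not the network of [5].

```python
# Illustrative sketch (not the network of [5]): a fully-connected network
# that stores a surface lightfield by mapping (u, v, viewing direction) -> RGB.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SurfaceLightfieldMLP(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        # Input: surface coordinates (u, v) plus a unit 3D viewing direction.
        self.net = nn.Sequential(
            nn.Linear(2 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, uv: torch.Tensor, view_dir: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([uv, view_dir], dim=-1))

# Usage: query the radiance of 1024 surface samples along random directions.
uv = torch.rand(1024, 2)
view_dir = F.normalize(torch.randn(1024, 3), dim=-1)
rgb = SurfaceLightfieldMLP()(uv, view_dir)  # (1024, 3)
print(rgb.shape)
```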
Image generation with ConvNets.
Deep splatting [4] and deep surface lightfields [5] are examples of a fast-growing body of work that uses neural networks to generate photorealistic images [10]. Generally, these works benefit greatly from the work in machine learning and image processing on generative image modeling and deep image processing, and in particular on works that use adversarial learning [14] and perceptual losses [9, 21] to train convolutional neural networks (ConvNets) [26] to output images (rather than to, e.g., classify them).
Recent works have demonstrated the ability to synthesize high resolution images [22] and to model sophisticated image [20, 45] and video [44] transformations using deep convolutional networks trained with such losses. In particular, [33] demonstrated how such pixel-to-pixel networks can be used to replace computationally heavy rendering and to directly transform images with rasterized material properties and normal orientations to photorealistic views.
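As a hedged illustration of the perceptual losses mentioned above, the sketch below compares VGG-19 feature activations of a generated image and a target image; the chosen layers and the L1 distance are assumptions and not the exact losses of [9, 21].

```python
# Illustrative VGG-19 perceptual loss: compares feature activations of a
# rendered image and the ground-truth image. Layer choice is an assumption.
# (In practice pass pretrained weights, e.g. weights="IMAGENET1K_V1", and
# normalize the inputs with the ImageNet mean/std; omitted here for brevity.)
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class PerceptualLoss(nn.Module):
    def __init__(self, layer_ids=(2, 7, 12, 21)):
        super().__init__()
        vgg = torchvision.models.vgg19(weights=None).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def forward(self, prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        loss, x, y = 0.0, prediction, target
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + F.l1_loss(x, y)  # distance in feature space
        return loss

# Usage with two random 256x256 images (stand-ins for render and ground truth).
criterion = PerceptualLoss()
print(criterion(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)))
```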
Deep image based rendering.
Recent years have also seen active convergence of image-based rendering and deep learning. A number of works combine warping of preexisting photographs and the use of neural networks to combine warped images and/or to post-process the warping result. The warping can be estimated by stereo matching [12]. Estimating warping fields from a single input image and a low-dimensional parameter specifying a certain motion from a low-parametric family is also possible [13, 49]. Other works perform warping using coarse mesh geometry, which can be obtained through multi-view stereo [18, 43] or volumetric RGBD fusion [31].
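To make the warping idea above concrete, here is a minimal illustrative sketch (not any cited method) of backward-warping a source photograph into a target view with a 3x3 homography; in practice the warp would come from stereo matching or coarse geometry, while here an identity homography is used as a placeholder.

```python
# Illustrative backward warping of a source image into a target view using a
# 3x3 homography (a stand-in for warps estimated from stereo or coarse meshes).
import torch
import torch.nn.functional as F

def warp_with_homography(src: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """src: (1, C, H, W) image; H: (3, 3) target->source homography expressed
    in normalized [-1, 1] image coordinates. Returns the warped image."""
    _, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    ones = torch.ones_like(xs)
    grid = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)   # (H*W, 3)
    mapped = grid @ H.T                                         # (H*W, 3)
    mapped = mapped[:, :2] / mapped[:, 2:3].clamp(min=1e-8)     # dehomogenize
    grid = mapped.reshape(1, h, w, 2)
    return F.grid_sample(src, grid, align_corners=True)

# Usage: the identity homography leaves the image unchanged.
image = torch.rand(1, 3, 64, 64)
warped = warp_with_homography(image, torch.eye(3))
print(torch.allclose(image, warped, atol=1e-5))
```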
Alternatively, some methods avoid explicit warping and instead use some form of plenoptic function estimation and parameterization using neural networks. As mentioned above, [5] proposes a network-parameterized deep version of surface lightfields. The approach of [40] learns a neural parameterization of the plenoptic function in the form of low-dimensional descriptors situated at the nodes of a regular voxel grid, together with a rendering function that turns the reprojection of such descriptors to the new view into an RGB image.
The closest analogue is the work [42], which proposes learning neural textures that encode the pointwise plenoptic function at different surface points alongside the neural rendering convolutional network.
The disclosure relates to an electronic device that reconstructs picture scenes on the basis of point clouds used as geometric proxies for a "geometric model", representing an object in 3D space with a set of points or polygons.
A method for controlling an electronic device according to an embodiment of the disclosure includes the steps of obtaining input data including a point cloud with a neural descriptor for each point and camera parameters for the point cloud, estimating the viewpoint direction based on the input data, rasterizing the points of the point cloud with a z-buffer algorithm using the neural descriptor concatenated with the viewpoint direction as pseudo-colors, obtaining the resulting image by passing the rasterized points through a neural rendering network, learning the neural descriptor for every point and the neural rendering network, and rendering, using a loss function, the resulting image on a display as a ground truth.
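A minimal sketch of the claimed sequence of steps is shown below; rasterize_fn and rendering_net are hypothetical stand-ins for the z-buffer rasterizer and the neural rendering network, so the function is illustrative rather than a definitive implementation.

```python
# Sketch of the forward pass under assumed interfaces: `rasterize_fn` and
# `rendering_net` are hypothetical stand-ins for the z-buffer point rasterizer
# and the neural rendering network described in the disclosure.
import torch
import torch.nn.functional as F

def render_view(points, descriptors, camera_position, rasterize_fn, rendering_net):
    """points: (N, 3) point cloud, descriptors: (N, M) per-point neural
    descriptors, camera_position: (3,) camera centre of the target view."""
    # 1. Estimate the viewpoint direction for every point of the cloud.
    view_dirs = F.normalize(points - camera_position, dim=-1)       # (N, 3)
    # 2. Concatenate descriptors with viewpoint directions -> pseudo-colors.
    pseudo_colors = torch.cat([descriptors, view_dirs], dim=-1)     # (N, M+3)
    # 3. Rasterize the points with a z-buffer into a multi-channel raw image.
    raw_image = rasterize_fn(points, pseudo_colors)                 # (M+3, H, W)
    # 4. Pass the rasterization through the rendering network -> RGB image.
    return rendering_net(raw_image.unsqueeze(0))                    # (1, 3, H, W)
```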
Here, the estimating step may be performed by software for the camera including at least one of Agisoft Metashape, COLMAP, and Open3D.
Here, the estimating step may further comprise obtaining raw data using a hand-held commodity RGB-D sensor, and estimating the viewpoint direction using the raw data, wherein the hand-held commodity RGB-D sensor is used to capture the raw data, which is processed by the software for the camera.
Here, the rasterizing step may further comprise first rasterizing each point into a square with a side length that is inversely proportional to a depth of the point with respect to the camera, wherein the neural rendering network provides a rendering process that is performed in OpenGL without anti-aliasing; superimposing the squares onto each other based on their depths with respect to the camera by using the Z-buffer algorithm; obtaining a multi-channel raw image by iterating over all footprint sets and filling all pixels; and mapping the multi-channel raw image into a three-channel RGB image by using a pretrained neural rendering network with learnable parameters.
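The following is a hedged CPU reference sketch of the rasterization step just described: each point is projected with an assumed pinhole model and splatted as a square whose side length is inversely proportional to its depth, with a z-buffer keeping the foremost point per pixel. The projection model, the point_radius constant, and the empty-pixel value are assumptions, and a practical implementation would use OpenGL as stated above.

```python
# Sketch of z-buffer point rasterization (assumptions: simple pinhole
# projection, an arbitrary `point_radius`, zeros for uncovered pixels).
import numpy as np

def rasterize_points(points_cam, pseudo_colors, focal, width, height,
                     point_radius=0.005, empty_value=0.0):
    """points_cam: (N, 3) points in camera coordinates (z > 0 in front),
    pseudo_colors: (N, C) per-point descriptor/view-direction concatenations.
    Returns a (C, height, width) multi-channel raw image and the z-buffer."""
    n, c = pseudo_colors.shape
    raw = np.full((c, height, width), empty_value, dtype=np.float32)
    zbuf = np.full((height, width), np.inf, dtype=np.float32)
    cx, cy = width / 2.0, height / 2.0
    for i in range(n):
        x, y, z = points_cam[i]
        if z <= 0.0:
            continue
        u = focal * x / z + cx                  # pinhole projection (assumed)
        v = focal * y / z + cy
        # Square footprint: side length inversely proportional to depth,
        # effectively rounded to integers (no anti-aliasing).
        side = max(1, int(round(2.0 * point_radius * focal / z)))
        u0, v0 = int(round(u)) - side // 2, int(round(v)) - side // 2
        for py in range(max(v0, 0), min(v0 + side, height)):
            for px in range(max(u0, 0), min(u0 + side, width)):
                if z < zbuf[py, px]:            # z-buffer keeps the foremost point
                    zbuf[py, px] = z
                    raw[:, py, px] = pseudo_colors[i]
    return raw, zbuf

# Usage: 1000 points in front of the camera, 8-D descriptors + 3-D view
# directions concatenated into 11 pseudo-color channels.
pts = np.random.randn(1000, 3) * 0.2 + np.array([0.0, 0.0, 2.0])
feats = np.random.randn(1000, 11).astype(np.float32)
raw, zbuf = rasterize_points(pts, feats, focal=300.0, width=128, height=128)
print(raw.shape)  # (11, 128, 128)
```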
Here, the point cloud is obtained by at least one of open and proprietary applications, selected from the group of COLMAP and Agisoft Metashape.
Here, the point cloud is a scene geometry representation, and a convolutional neural network is used so that the output color value at a pixel depends on multiple neural descriptors and multiple points projected to the neighborhood of this pixel.
Here, the neural rendering network uses a deep convolutional neural network to generate photorealistic renderings from new viewpoints.
Here, the neural descriptors describe both geometric and photometric properties of the data.
Here, the descriptors are local descriptors that are learned directly from data, and the learning is performed in coordination with the learning of the rendering network.
Here, the camera is a hand-held RGBD camera, and the point clouds are reconstructed via stereo matching.
Meanwhile, an electronic device according to another embodiment of the disclosure includes a memory configured to store one or more instructions; and a processor configured to execute the one or more instructions, wherein the processor is configured to: obtain input data including a point cloud with a neural descriptor for each point and camera parameters for the point cloud, estimate the viewpoint direction based on the input data, rasterize the points of the point cloud with a z-buffer algorithm using the neural descriptor concatenated with the viewpoint direction as pseudo-colors, obtain the resulting image by passing the rasterized points through a neural rendering network, learn the neural descriptor for every point and the neural rendering network, and render, using a loss function, the resulting image on a display as a ground truth.
Here, the processor may obtain raw data using a hand-held commodity RGB-D sensor, and estimate the viewpoint direction using the raw data, wherein the hand-held commodity RGB-D sensor is used to capture the raw data, which is then processed by the above-mentioned software for the camera.
Here, the processor may first rasterize each point into a square with a side length that is inversely proportional to a depth of the point with respect to the camera, wherein the neural rendering network provides a rendering process that is performed in OpenGL without anti-aliasing; superimpose the squares onto each other based on their depths with respect to the camera by using the Z-buffer algorithm; obtain a multi-channel raw image by iterating over all footprint sets and filling all pixels; and map the multi-channel raw image into a three-channel RGB image by using a pretrained rendering network with learnable parameters.
The above and other aspects, features and advantages of certain embodiments of the present disclosure will be more apparent from the following detailed description, taken in conjunction with the accompanying drawings, in which:
Fig. 1a illustrates a point cloud constructed from registered RGBD scans.
Fig. 1b illustrates learning the neural descriptors for every point.
Fig. 1c illustrates the proposed neural rendering network that maps the rasterized point descriptors to realistic images.
Fig. 2 schematically represents the proposed disclosure.
Fig. 3 illustrates comparative results on the 'Studio' dataset.
Fig. 4 illustrates comparative results on the 'LivingRoom' dataset (from [7]) - same format as in Fig. 3.
Fig. 5 illustrates comparative results on the 'Plant' dataset - same format as in Fig. 3.
Fig. 6 illustrates comparative results on the 'Shoe' dataset - same format as in Fig. 3.
Fig. 7 illustrates that the system can be used to speed up renderings of synthetic scenes.
FIG. 8 is a block diagram of an electronic apparatus according to an embodiment of the disclosure.
FIG. 9 is a flowchart provided to explain a controlling method of an electronic apparatus according to an embodiment.
Proposed disclosure learns neural descriptors of surface elements jointly with the rendering network (neural network). Proposed approach uses point-based geometry representation and thus avoids the need for surface estimation and meshing.
Proposed approach directly benefits from the availability of robust RGBD SLAM/registration algorithms, however it does not rely on the volumetric scene modeling and uses the point cloud assembled from the raw RGBD scans as the geometric model.
Highly relevant to the proposed disclosure are methods that successfully apply deep ConvNets to image inpainting tasks [19, 29, 48]. Several modifications of the convolutional architecture with the ability to handle and fill in holes have been suggested, and the proposed disclosure uses the gated convolutional layers from [48].
In the proposed disclosure, the authors follow the point-based graphics paradigm, as they represent the geometry of a scene using its point cloud. However, they do not estimate the surface orientation, suitable disk radii, or, in fact, even color explicitly. Instead, they keep a 3D point as the modeling primitive and encode all local parameters of the surface (both photometric and geometric) within neural descriptors that are learned from data.
Proposed is a method for rendering pictures on a display comprising: receiving a point cloud with neural descriptors D for each point and camera parameters C for the point cloud as input data; estimating the viewpoint direction from the input data by software for camera pose and geometry estimation; learning neural descriptors for every point together with a neural rendering network; forming the loss function that drives the learning of the neural network and the descriptors; rasterizing the points of the point cloud with a z-buffer algorithm using the learned neural descriptors concatenated with the viewpoint direction as learned descriptors, wherein the rasterization is passed through the learned neural rendering network to obtain the resulting image; and rendering, using the loss function, the resulting image on the display. The estimating consists in camera pose and geometry estimation by software for the camera, such as Agisoft Metashape, COLMAP or Open3D. Hand-held commodity RGB-D sensors are used to capture raw data, which is then processed by the above-mentioned software for the camera. The rasterizing consists in first rasterizing each point into a square with a side length that is inversely proportional to the depth of the point with respect to the camera, wherein the neural rendering network provides a rendering process which is performed using OpenGL without anti-aliasing; the Z-buffer algorithm is used for superimposing these squares onto each other using their depths w.r.t. the camera; a multi-channel raw image is created by iterating over all footprint sets and filling all pixels; and a pretrained rendering network with learnable parameters is used to map the multi-channel raw image into a three-channel RGB image. The point cloud is obtained by algorithms implemented in various applications, open and proprietary, selected from the group consisting of COLMAP and Agisoft Metashape. The point cloud is a scene geometry representation. The neural rendering network uses a deep convolutional neural network to generate photorealistic renderings from new viewpoints. The convolutional neural network is used so that the output color value at a pixel depends on multiple neural descriptors and multiple points projected to the neighborhood of this pixel. The neural descriptors describe both the geometric and the photometric properties of the data. In some embodiments the descriptors are local descriptors, the local descriptors are learned directly from data, and such learning happens in coordination with the learning of the rendering network. The camera is a hand-held RGBD camera, and the point clouds are reconstructed from simple RGB streams or via stereo matching.
Also proposed is a computer-readable medium storing computer-implemented instructions for implementing the proposed method.
The present disclosure is about a new point-based approach for modeling complex scenes from images. The approach uses a raw point cloud as the geometric representation of a scene and augments each point with a learnable neural descriptor that encodes the local geometry and appearance of the scene.
The proposed approach brings together several lines of work from the computer graphics, computer vision, and deep learning communities. The present disclosure provides high realism of rendering given imperfectly reconstructed scene geometry, as well as simplicity and robustness of scene modeling. In particular, it improves the realism of scene rendering in situations when the scene geometry is not perfectly modeled. The solution may be stored in the device memory or on any suitable storage medium, and can be realized in any system that involves computer graphics (games, VR, AR), whether desktop, laptop, or mobile.
The presented disclosure is a new point-based approach for modeling complex scenes. Similarly to classical point-based approaches, 3D points are used as modeling primitives (surfels). Each of the points in the proposed approach is associated with a local descriptor containing information about the local geometry and appearance of the scene. Each descriptor may comprise, among other things, color information of the point. Although any information besides color may be present in the descriptor vector, the descriptor vector may be named a "pseudo-color". A rendering network that translates point rasterizations into realistic views, while taking the learned descriptors as input, is learned in parallel with the descriptors.
The learning process is performed using a dataset of point clouds and images for each point cloud. A point cloud is obtained from third-party software (via video). Each point is described by three coordinates. For each point in this cloud, a descriptor (for example, an 8-dimensional vector) is learned that, after passing through a neural network, turns into an RGB color. Further, these descriptors are saved and can be used to render a picture from a different angle than the video capture from which the descriptors were learned.
In practice, the descriptor may be an M-dimensional vector. Originally the vector is "empty"; during learning, the vector is filled with information about the local geometry and/or appearance of the scene for each point in the cloud.
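As a concrete illustration, the per-point descriptors can be held in a single learnable matrix. The sketch below assumes PyTorch (the disclosure does not prescribe a framework), and the sizes N and M are illustrative, with M = 8 following the example above; it only shows how such a matrix might be declared and exposed to an optimizer.

```python
import torch

N, M = 100_000, 8                                      # number of points, descriptor length (illustrative)
points = torch.rand(N, 3)                              # XYZ coordinates from third-party reconstruction
descriptors = torch.nn.Parameter(torch.zeros(N, M))    # initially "empty" descriptor vectors

# The descriptors are registered with the optimizer alongside the rendering network
# parameters, so gradient descent fills them with local geometry/appearance information.
optimizer = torch.optim.Adam([descriptors], lr=1e-1)
```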
After learning, proposed model can be fitted to new scenes and is capable of producing realistic views from new viewpoints. Notably, proposed system accomplishes this in a purely data-driven manner, while avoiding meshing, or any other form of explicit surface reconstruction, as well as without performing explicit geometric and photometric surface parameter estimation.
The main effect of the disclosure is the possibility of reconstructing picture scenes on the basis of point clouds that are used as geometric proxies (a "geometric model", i.e. a representation of an object in 3D space with a set of points or polygons), while missing information about connectivity, as well as geometric noise and holes, can be handled gracefully by deep rendering networks. It has also been shown that the model benefits from pretraining on a corpus of scenes, and that good results can be obtained with a universal rendering network that has not been fine-tuned for a particular scene.
As illustrated in Fig. 2, given the point cloud P with neural descriptors D and camera parameters C, the viewpoint directions are estimated, and the points are then rasterized with the z-buffer algorithm, using the neural descriptors concatenated with the viewpoint directions as pseudo-colors. (It is still possible to use a soft z-buffer, which differs from the usual z-buffer in that it takes the transparency of objects into account; however, this method is much slower and is not suitable for real-time rendering.) Such a rasterization is then passed through the rendering network to obtain the resulting image. The proposed model is fit to new scene(s) by optimizing the parameters of the rendering network and the neural descriptors through backpropagation of the perceptual loss function.
A deep rendering network is learned in parallel with the descriptors, so that new views of the scene can be obtained by passing the rasterizations of a point cloud from new viewpoints through this network. The input rasterizations use the learned descriptors as point pseudo-colors. The proposed approach can be used for modeling complex scenes and obtaining their photorealistic views, while avoiding explicit surface estimation and meshing. In particular, compelling results are obtained for scenes scanned using hand-held commodity RGB-D sensors as well as standard RGB cameras, even in the presence of objects that are challenging for standard mesh-based modeling (a variety of objects with thin structure, such as leaves, ropes, bicycle wheels, etc.). There are several commodity RGB-D sensors; the most available are Microsoft Kinect and Intel RealSense. These sensors are used to capture RGB and depth images. They are not able to estimate viewpoint directions on their own; there is special software for that, such as KinectFusion.
Creating virtual models of real scenes usually involves a lengthy pipeline of operations. Such modeling usually starts with a scanning process, where the photometric properties are captured using camera images and the raw scene geometry is captured using depth scanners or dense stereo matching. The latter process usually provides a noisy and incomplete point cloud that needs to be further processed by applying certain surface reconstruction and meshing approaches. Given the mesh, the texturing and material estimation processes determine the photometric properties of surface fragments and store them in the form of 2D parameterized maps, such as texture maps [3], bump maps [2], view-dependent textures [8], or surface lightfields [47]. Finally, generating photorealistic views of the modeled scene involves a computationally-heavy rendering process such as ray tracing and/or radiance transfer estimation. Given a point cloud constructed from registered RGBD scans (fig. 1a), the proposed system learns the neural descriptors for every point (the first three PCA dimensions are shown in false color in fig. 1b), and a neural rendering network that maps the rasterized point descriptors to realistic images (fig. 1c). Gaps in geometry, geometric noise, and outlier points are inevitable in raw point clouds collected with consumer RGBD cameras (fig. 1a), such as this scene from the ScanNet dataset. The proposed approach can handle these deficiencies gracefully and synthesizes realistic renderings despite them.
In the present disclosure, proposed is a system (approach) that eliminates most of the steps of the classical pipeline. It combines the ideas of image-based rendering, point-based graphics, and neural rendering into a simple approach. The approach uses the raw point cloud as the scene geometry representation, thus eliminating the need for surface estimation and meshing. Similarly to other neural rendering approaches, the proposed approach also uses a deep convolutional neural network to generate photorealistic renderings from new viewpoints. The realism of the rendering is facilitated by the estimation of latent vectors (neural descriptors) that describe both the geometric and the photometric properties of the data. The local descriptors are learned directly from data, and such learning happens in coordination with the learning of the rendering network (see fig. 1a to 1c). The descriptors are learned by gradient descent, in parallel with the neural network that learns to interpret these descriptors. Learning the neural network and the descriptors is driven by a loss function, which indicates how to change the neural network and the descriptors so that together they produce the desired picture. In other words, the perceptual loss is taken as the objective function, and the parameter values of the descriptors are determined while finding the extremum of the objective function using gradient descent. However, any optimization method may be used.
The proposed approach is capable of modeling and rendering scenes that are captured by hand-held RGBD cameras as well as simple RGB streams. The point cloud is restored at the stage of preparing data from the RGB or RGB-D sequence (video or many photos taken from different angles). This process is associated with a viewpoint assessment; that is, it is necessary to understand from what angle the point cloud needs to be viewed so that this cloud of points appears in place of the object that is in the photo.
A number of comparisons are performed with ablations and competing approaches, demonstrating the capabilities and advantages of the proposed method. In general, results suggest that given the power of modern deep networks, the simplest 3D primitives (i.e. 3D points) represent sufficient and most suitable geometric proxies (another term for "geometric proxy" is "geometric model" , a way to represent an object in 3D space with a set of points or polygons) for neural rendering.
The proposed approach is related to approaches based on surface lightfields, as it implicitly learns the parameterization of the pointwise plenoptic function at the scene surface within the neural descriptors. Unlike surface lightfields, the proposed approach does not require scene surface modeling. Also, differently from [5], which outputs the color value independently in each surface vertex, the proposed approach uses a convolutional neural network for rendering, so that the output color value at a pixel depends on multiple neural descriptors and multiple points projected to the neighborhood of this pixel.
The proposed rendering process is as follows. Assume that a point cloud (a set of 3D points) P = {p1, p2, ..., pN} (ref. 1 in fig. 2) with M-dimensional neural descriptors D = {d1, d2, ..., dN} (ref. 2 in fig. 2) is given, wherein each di represents the M-dimensional neural descriptor of the corresponding point in the cloud, and its rendering from a new view characterized by the camera C (including both extrinsic and intrinsic parameters) needs to be obtained. In particular, assume that the target image has a W × H-sized pixel grid, and that its viewpoint is located at the camera position.
Rendering point cloud is drawing points as pixels on an image. New viewpoint means viewpoint not present in the training images.
Camera is a widespread term in computer graphics meaning viewpoint, image size, focal distance and other parameters of real-world camera optics. To render a 3D object, one needs all these parameters, like one needs optics to take a photo. Target image is an image from the training image sequence, which should be reconstructed with proposed neural network and point descriptors. Rendering network means a network which takes rendered point cloud as input and tries to re-paint it to look similar to corresponding target image.
As indicated in fig. 2, the point cloud is presented as a set of 3D points, coupled with neural descriptors (M-dimensional vectors) and viewpoint directions. Here, the viewpoint direction is a vector pointing from the camera position to a point in the cloud. The camera defines from which direction the point cloud must be viewed to match the corresponding target image. Given the point cloud and a camera, the points are rasterized (or rendered) using the z-buffer. The z-buffer leaves only the foremost points. This image is then fed to the neural network, which transforms this input into an RGB image looking like the ground truth (target image). The loss is a function which measures the similarity between the image produced by the neural network and the ground truth.
The rendering is performed by first rasterizing each point pi into a square with a side length that is inversely proportional to the depth of the point with respect to the camera C. The rendering is performed using OpenGL without anti-aliasing, so that the dimensions of each square are effectively rounded to the nearest integers. The Z-buffer (fig. 2) algorithm is used for superimposing these squares onto each other using their depths w.r.t. the camera. Let si(C) denote the "footprint" set of the point pi resulting from such rendering, i.e. the set of pixels that are occupied by the rasterization of the i-th square after the z-buffer.
Hereinafter, an (M + 3)-channel raw image S(P,D,C) is created by iterating over all footprint sets s1(C), ..., sN(C) and filling all pixels from si(C) with the values of di (the first M channels), as indicated in fig. 2 (rasterized points). The last three channels are set to the coordinates of the normalized viewpoint direction vector vi (the unit vector pointing from the camera position towards pi). Thus, the pixels (x,y) of the raw image are filled as follows:

S(P,D,C)[x,y] = {di, vi}, if (x,y) ∈ si(C),    (1)

where {di, vi} denotes concatenation, and [x,y] denotes the vectorial entry of the raw image corresponding to the pixel (x,y). Concatenating the local surface information encoded within di with the viewpoint direction vi allows the proposed system to model view-dependent photometric effects, and also to fill in the holes in the point cloud in a way that takes into account the relative orientation of the surface with respect to the viewpoint direction vector. The pixels not covered by any footprint are set to the special descriptor value d0 ∈ RM (which is also learned for a particular scene), and their viewpoint direction dimensions are set to zeros.
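For illustration, the construction of the raw image S(P,D,C) can be sketched as a simple software rasterizer. The sketch below assumes NumPy and a pinhole camera with focal length f; the constant base_side, the projection conventions, and the helper name rasterize_raw_image are assumptions made for this illustration, whereas the actual pipeline performs this step with OpenGL.

```python
import numpy as np

def rasterize_raw_image(pts_cam, desc, d0, W, H, f, base_side=8.0):
    """pts_cam: (N, 3) points in camera coordinates (z > 0); desc: (N, M); d0: (M,)."""
    N, M = desc.shape
    raw = np.concatenate([np.tile(d0, (H, W, 1)), np.zeros((H, W, 3))], axis=-1)  # (H, W, M+3)
    zbuf = np.full((H, W), np.inf)                          # depth buffer
    for i in range(N):
        x, y, z = pts_cam[i]
        if z <= 0:
            continue                                        # point behind the camera
        u = int(f * x / z + W / 2)                          # pinhole projection to pixel coords
        v = int(f * y / z + H / 2)
        side = max(1, int(round(base_side / z)))            # square side inversely proportional to depth
        view_dir = pts_cam[i] / np.linalg.norm(pts_cam[i])  # unit vector camera -> point (vi)
        for py in range(v - side // 2, v + (side + 1) // 2):
            for px in range(u - side // 2, u + (side + 1) // 2):
                if 0 <= px < W and 0 <= py < H and z < zbuf[py, px]:
                    zbuf[py, px] = z                        # z-buffer keeps the foremost point
                    raw[py, px, :M] = desc[i]               # first M channels: descriptor di
                    raw[py, px, M:] = view_dir              # last 3 channels: viewpoint direction
    return raw
```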
Finally, a pretrained rendering network Rθ with learnable parameters θ is used to map the (M + 3)-channel raw image S(P,D,C) into a three-channel RGB image I:

I = Rθ(S(P,D,C)).    (2)

At the pretraining stage, 52 scans from ScanNet (http://www.scan-net.org/) are taken and the network is trained to render these scenes; in this stage, the network learns how to interpret point descriptors, which are jointly learned with the network. When training on a new scene, the pretrained network is taken and trained further on a scene that was not "seen" by the network before. Using a pretrained network boosts rendering quality.
The rendering network in proposed case has a popular convolutional U-Net architecture [36] with gated convolutions [48].
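A gated convolution in the spirit of [48] can be sketched as follows (PyTorch is assumed; the kernel size, activation, and channel counts are illustrative choices, not taken from the filing): a feature branch is modulated per pixel by a learned sigmoid gate, which helps the network cope with the sparse, hole-ridden rasterized input.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Convolution whose output is modulated by a learned per-pixel gate."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, x):
        # empty or unreliable regions of the rasterized input can be suppressed by the gate
        return torch.relu(self.feature(x)) * torch.sigmoid(self.gate(x))

# e.g. a first layer for an (M + 3)-channel raw image with M = 8
layer = GatedConv2d(in_ch=11, out_ch=16)
out = layer(torch.randn(1, 11, 64, 64))     # -> (1, 16, 64, 64)
```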
Learning process in the proposed system.
The neural network is trained jointly with the point descriptors for two reasons: first, a neural network is needed that is able to interpret point descriptors; second, the descriptors must be learned for a particular scene or object in order to render it using the neural network.
It is assumed that during learning K training scenes are available. For the k-th scene, the point cloud Pk as well as the set of Lk training ground truth RGB images Ik,1, Ik,2, ..., Ik,Lk with known camera parameters Ck,1, Ck,2, ..., Ck,Lk are given. The learning objective L then corresponds to the mismatch between the rendered and the ground truth RGB images:

L(θ, D1, D2, ..., DK) = Σk Σl Δ( Rθ(S(Pk, Dk, Ck,l)), Ik,l ),    (3)

where the outer sum runs over the K training scenes, the inner sum runs over the Lk training images of the k-th scene, Dk denotes the set of neural descriptors for the point cloud of the k-th scene, and Δ denotes the mismatch between the two images (the ground truth and the rendered one). In the proposed implementation, the perceptual loss [9, 21] is used, which computes the mismatch between the activations of a pretrained VGG network [39].
The learning is performed by optimizing the loss (equation (3)) over both the parameters θ of the rendering network and the neural descriptors D1, D2, ..., DK of the points in the training set of scenes. Thus, the proposed approach learns the neural descriptors directly from data. Optimization is performed by the ADAM algorithm [24]. The neural descriptors are updated via backpropagation through (1) of the loss derivatives w.r.t. S(P,D,C) onto di.
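A schematic optimization step is sketched below (PyTorch assumed). The z-buffer pass is summarized by a precomputed per-view index map that stores, for each pixel, the index of the foremost point or -1 when the pixel is uncovered, so that the raw image becomes a differentiable gather and gradients reach both θ and the descriptors. The helper names, the index-map shortcut, and the single-layer perceptual term are illustrative simplifications of the procedure described above, not the filing's exact implementation.

```python
import torch
import torch.nn.functional as F

def build_raw_image(descriptors, d0, index_map, view_dirs):
    """descriptors: (N, M); d0: (M,); index_map: (H, W) long, -1 = empty;
    view_dirs: (H, W, 3), assumed already zeroed at uncovered pixels."""
    table = torch.cat([descriptors, d0[None]], dim=0)        # row N holds the default descriptor d0
    idx = index_map.clone()
    idx[idx < 0] = descriptors.shape[0]                      # uncovered pixels -> d0 row
    gathered = table[idx]                                    # (H, W, M), differentiable gather
    raw = torch.cat([gathered, view_dirs], dim=-1)           # append viewpoint direction channels
    return raw.permute(2, 0, 1).unsqueeze(0)                 # (1, M + 3, H, W)

def train_step(render_net, descriptors, d0, view, vgg, optimizer):
    """view: dict with precomputed tensors 'index_map', 'view_dirs', 'image' (1, 3, H, W);
    vgg: an assumed pretrained feature extractor used for the perceptual mismatch."""
    raw = build_raw_image(descriptors, d0, view["index_map"], view["view_dirs"])
    pred = render_net(raw)                                   # Rθ(S(P, D, C))
    loss = F.l1_loss(vgg(pred), vgg(view["image"]))          # simplified perceptual mismatch Δ
    optimizer.zero_grad()
    loss.backward()                                          # gradients for θ, D and d0
    optimizer.step()
    return loss.item()
```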
Modeling new scenes.
After the learning (3) is performed, a new scene can be modeled by proposed system given its point cloud and a set of RGB views registered with this point cloud. For example, in the case of the scene scanned with an RGBD camera, the registered RGBD views can provide both the point cloud and the RGB views.
For a new scene, given a point cloud P' and a set of images I'1, I'2, ..., I'L with camera parameters C'1, C'2, ..., C'L, the neural descriptors D' = {d'1, d'2, ..., d'N'} of the new scene are learned, while keeping the parameters θ fixed, by optimizing the objective:

L(D') = Σl Δ( Rθ(S(P', D', C'l)), I'l ),    (4)

where the sum runs over the L registered images of the new scene.
By sharing the rendering parameters θ between the training scene and the new scene, proposed system is capable of better generalization resulting in a better new view synthesis.
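Fitting a new scene under objective (4) with a universal (frozen) rendering network can then be sketched as follows, reusing the hypothetical train_step above; the epoch count and learning rate are illustrative only.

```python
import torch

def fit_new_scene(render_net, n_points, M, views, vgg, epochs=20, lr=1e-1):
    for p in render_net.parameters():
        p.requires_grad_(False)                                    # keep the rendering parameters θ fixed
    descriptors = torch.nn.Parameter(torch.zeros(n_points, M))    # D' of the new scene
    d0 = torch.nn.Parameter(torch.zeros(M))                        # default descriptor, also learned
    optimizer = torch.optim.Adam([descriptors, d0], lr=lr)
    for _ in range(epochs):
        for view in views:                                         # registered RGB views of the new scene
            train_step(render_net, descriptors, d0, view, vgg, optimizer)
    return descriptors, d0
```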
Alternatively, rather than keeping the parameters θ of the rendering network fixed, it is proposed to fine-tune them to the new scene, using the pre-learned values as initialization. For some scenes, modest improvements in the rendering quality of new views are observed from such fine-tuning. In practical systems, however, it may be desirable to keep the rendering network compatible across multiple scenes (i.e. to have a universal rendering network).
Experimental details.
The model is based on a popular U-Net architecture with four downsampling and upsampling blocks. Max pooling layers are replaced with average pooling layers, and transposed convolutions with bilinear upsampling layers. It is observed that gated convolutions improve the performance of the model on sparse input data, so normal convolutions are substituted with gated convolutions in the proposed model. Since U-Net is used as the rendering network and rich point features are learned separately, a lightweight network with fewer parameters can be used. The proposed model has four times fewer channels in each convolutional layer than the original architecture, resulting in 1.96M parameters. This allows real-time rendering, taking 50ms on a GeForce RTX 2080 Ti to render a 1296x968 image.
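The lightweight encoder/decoder can be roughly sketched as below (PyTorch assumed). The channel widths and block contents are illustrative rather than the exact 1.96M-parameter configuration, and plain convolutions stand in for the gated convolutions; the sketch only shows the four average-pooling downsampling stages and the four bilinear-upsampling stages with skip connections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    def __init__(self, in_ch=11, out_ch=3, width=16):
        super().__init__()
        chs = [width * 2 ** i for i in range(5)]                 # e.g. 16, 32, 64, 128, 256
        self.enc = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, padding=1)
            for c_in, c_out in zip([in_ch] + chs[:-1], chs))
        self.dec = nn.ModuleList(
            nn.Conv2d(c_skip + c_in, c_out, 3, padding=1)
            for c_in, c_skip, c_out in zip(chs[:0:-1], chs[-2::-1], chs[-2::-1]))
        self.head = nn.Conv2d(chs[0], out_ch, 1)

    def forward(self, x):
        skips = []
        for i, conv in enumerate(self.enc):
            x = F.relu(conv(x))
            if i < len(self.enc) - 1:
                skips.append(x)
                x = F.avg_pool2d(x, 2)                           # average-pooling downsample
        for conv, skip in zip(self.dec, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            x = F.relu(conv(torch.cat([x, skip], dim=1)))        # bilinear upsample + skip connection
        return self.head(x)                                      # three-channel RGB output

net = TinyUNet()
img = net(torch.randn(1, 11, 256, 256))                          # -> (1, 3, 256, 256)
```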
To demonstrate the versatility of the approach, it is evaluated on several types of scenes. Of particular interest is the capture of real scenes using consumer low-cost devices; thus, two types of capture are considered. First, RGBD streams from the ScanNet dataset of room-scale scenes scanned with a structured-light RGBD sensor are considered. Second, RGB video streams captured by a smartphone are considered. Finally, the relevance of the proposed approach to modeling photometrically-complex synthetic scenes is demonstrated by running it on a standard test scene from the Blender software.
For the ScanNet scenes, the provided registration data obtained with BundleFusion are used. The mesh geometry computed by BundleFusion is used in the respective baselines. Given the registration data, point clouds are obtained by joining together the 3D points from all RGBD frames and using volumetric subsampling (with a grid step of 1 cm), resulting in point clouds containing a few million points per scene.
In the evaluation, two ScanNet scenes are used: 'Studio' (scene 0), which has 5578 frames, and 'LivingRoom' (scene 24), which has 3300 frames. In each case, every 100th frame in the trajectory is used for validation. Frames within 20 time steps of each of these validation frames are then removed from the fitting set, leaving 3303 and 2007 frames respectively for the fitting (fine-tuning) and descriptor estimation. The rendering networks are pretrained on a set of 52 scenes (preprocessed in a similar fashion) that does not include the Studio and LivingRoom scenes.
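One reading of this split can be sketched as follows (the function name and the exact boundary handling are assumptions; the published frame counts may have been produced with slightly different conventions):

```python
def split_frames(num_frames, val_every=100, margin=20):
    """Every `val_every`-th frame goes to validation; frames within `margin`
    time steps of any validation frame are dropped from the fitting set."""
    val = set(range(0, num_frames, val_every))
    fit = [i for i in range(num_frames)
           if i not in val and all(abs(i - v) > margin for v in val)]
    return sorted(val), fit

val_idx, fit_idx = split_frames(5578)   # e.g. the 'Studio' trajectory of 5578 frames
```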
For the smartphone-captured scenes, the commercial Agisoft Metashape package (Agisoft. retrieved 20.05.2019. Metashape software. https://www.agisoft.com/) has been run, which is one of the best packages for scene modeling/reconstruction. Agisoft Metashape provides the registration, the point cloud, and the mesh by running proprietary structure-from-motion and dense multiview stereo methods. Two scenes are evaluated: 'Shoe' and 'Plant'. The Plant scene contains 2727 frames taken at 250ms intervals, out of which every 50th is put into the validation set, 10 frames around each of these frames are withheld, and the rest is used as the fitting set. The Shoe scene has been captured with a deliberately very small number of images: it contains 100 frames taken at 250ms intervals, which are shuffled, and 10 frames are held out for validation.
Several approaches are compared on the evaluation scenes. Most of these approaches have a rendering network similar to the proposed method, which takes an intermediate representation and is trained to output the final RGB image. Unless stated otherwise, the network with 1.96M parameters described above is used for all methods.
· Adapted. This is a variant of the proposed system, where the rendering network and the descriptor space are pretrained on the 52 ScanNet scenes. The neural descriptors are then learned and the rendering network is fine-tuned (adapted) on the fitting part of the evaluation scene. Such fine-tuning converges after 30 epochs (8 minutes to 1.5 hours on 4x NVIDIA Tesla V-100, depending on the size of the scene).
· Universal. This variant does the same as above; however, the rendering network is not fine-tuned for the evaluation scene and is kept fixed, while the neural descriptors of the points are trained. Keeping the rendering network "universal", i.e. unadapted to a specific scene, may be more practical in many scenarios. Such learning converges after 20 epochs (5 minutes to 1 hour on 4x NVIDIA Tesla V-100, depending on the size of the scene).
· Scene. This option does not pretrain the rendering network, and instead learns it on the evaluation scene (its fitting part) only, alongside the point descriptors. Naturally, such an approach is more prone to overfitting. Such learning converges after 50 epochs (12 minutes to 2.5 hours on 4x NVIDIA Tesla V-100, depending on the size of the scene).
· Mesh+Texture. In this baseline, given the mesh of the scene obtained with BundleFusion or Metashape, the texture is learned via backpropagation of the same loss as used in the proposed method through the texture mapping process onto the texture map. This results in a "classical" scene representation of a textured mesh.
· Mesh+RenderNet. In this variant (similar to e.g. LookinGood [31]), a rendering network that maps the rasterizations of the textured mesh into the final RGB images is additionally learned. The rendering network has the same architecture as the proposed one (except that the input has three channels), and the learning uses the same loss.
· Direct RenderNet. In this variant, an ablation of the proposed point-based system without neural descriptors is evaluated. Here, a rendering network is learned that maps the point cloud rasterized in the same way as in the proposed method. However, instead of neural descriptors, the color of the point (taken from the original RGBD scan/RGB image), the 3D coordinate of the point, and the viewpoint direction vi are used as a 9D pseudo-color (see the sketch after this list). The rendering network is then trained with the same loss as the proposed one and is also pretrained on the set of 52 scenes.
· Direct RenderNet (slow). It is observed that the Direct RenderNet variant described above benefits considerably from a higher-capacity, slower rendering network. Therefore, a variant with a rendering network with a doubled number of channels in all intermediate layers (resulting in 4x the parameters and 4x the FLOPs) is also evaluated.
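The 9D pseudo-color input of the Direct RenderNet ablation can be sketched as follows (NumPy assumed; the channel layout and the helper name are one reasonable reading of the description, reusing the per-view index map idea from the training sketch above):

```python
import numpy as np

def direct_pseudocolor(colors, coords, view_dirs, index_map):
    """colors, coords: (N, 3); view_dirs: (H, W, 3); index_map: (H, W) int, -1 = empty."""
    H, W = index_map.shape
    raw = np.zeros((H, W, 9), dtype=np.float32)
    covered = index_map >= 0
    raw[covered, 0:3] = colors[index_map[covered]]   # RGB from the original RGBD scan / RGB image
    raw[covered, 3:6] = coords[index_map[covered]]   # 3D coordinate of the point
    raw[covered, 6:9] = view_dirs[covered]           # viewpoint direction vi
    return raw
```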
The inventors have also invested significant effort into adapting the surface lightfields approach to the proposed data. However, any improvement over the Mesh+Texture variant was seldom observed, and on average the results on hold-out data were worse. Apparently, surface light field estimation is not suitable for cases when the mesh geometry is coarse.
Comparison results
The quantitative results of the comparison are shown in Table 1.
[Table 1]
Table 1 presents comparison results in terms of the perceptual loss (lower is better), PSNR (higher is better), and SSIM (higher is better). The methods marked with * have been pretrained on a hold-out scene dataset. See the text for the description of the methods. In most cases, the variants of the proposed method outperform the baselines.
All comparisons are measured on the validation subsets, for which the obtained and the ground truth RGB images are compared. The value of the loss on these subsets is reported (note that this comparison is valid, since most of the methods optimize the same loss on the training set). Also reported are the peak signal-to-noise ratio (PSNR) and the structural similarity measure (SSIM). Qualitative comparisons on the validation set frames are shown in Figs. 3-6, where the point cloud is also shown.
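For reference, the PSNR used in Table 1 can be computed as below (a minimal sketch assuming images normalized to [0, 1]; SSIM is typically taken from an image-quality library rather than re-implemented):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((np.asarray(pred) - np.asarray(target)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```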
Fig. 5 illustrates comparative results on the 'Plant' dataset - same format as in Fig. 3.
Generally, both the quantitative and the qualitative comparison reveal the advantage of using the point cloud as the geometric proxy. Thus, Mesh+Texture and Mesh+RenderNet perform worse than all methods that use point clouds. The exception is the Shoe scene, where the meshing procedure was successful at generating a reasonably good mesh. In all other scenes, there are parts of the scene where the meshing process (BundleFusion or Metashape) has failed, leading to gross mistakes in the renderings. The qualitative comparison reveals such failures, which are particularly noticeable on thin objects (such as the details of the bicycle in Fig. 3 or the leaves of the plant in Fig. 5).
Fig. 3 illustrates comparative results on the 'Studio' dataset. Shown are the textured mesh, the colored point cloud, the results of three neural rendering systems, and the ground truth. The proposed system can successfully reproduce details that are challenging for meshing, and suffers less from blurriness than the Direct RenderNet system. From left to right, from top to bottom: 1) mesh + texture (see Table 1), 2) point cloud with colors obtained from BundleFusion, 3) Direct RenderNet (see Table 1), 4) mesh + RenderNet (see Table 1), 5) Ours-full (see Table 1), 6) image taken from the RGB sensor ("ground truth").
Fig. 4 shows comparative results on the 'LivingRoom' scene from the ScanNet dataset.
Fig. 6 illustrates comparative results on the 'Shoe' dataset - same format as in Fig. 3. Unlike the other three datasets, the geometry of this scene was more suitable for mesh representation, and the mesh-based rendering performs relatively well. The proposed method again outperforms the Direct RenderNet baseline.
The proposed system based on neural descriptors of the points generally outperforms the Direct RenderNet ablation, which does not have such descriptors. The validation frames are not too far from the fitting set, and it is observed that qualitatively the difference between methods becomes larger when the camera is moved further from the fitting set cameras. The effect of this can be observed in the supplementary video. Generally, the quality of single frames for such camera positions is considerably better for the proposed method than for the Direct baseline (which suffers from blurriness and loss of details). At the same time, admittedly, this strong improvement in the quality of individual frames comes at the price of increased temporal flickering.
Results on synthetic data
The capability of the proposed approach to model synthetic scenes with extremely complex photometric properties is shown in Fig. 7. Fig. 7 illustrates that the system can be used to speed up renderings of synthetic scenes. Shown are renderings of the standard Blender test scene using the proposed system (third column). The closest frame from the dataset of frames used for model fitting is shown in the fourth column. While the proposed system does not match the result of the ray tracing rendering exactly, it manages to reproduce some details in the specular reflection and fine details in the texture, while doing so at real-time speed.
Here, the use of the proposed approach may be justified as a means of accelerating rendering. Towards this end, the default Blender [1] test scene with complex lighting and a highly-specular object in the center is taken, a point cloud (2.5 million points) is sampled from its surface, and the neural descriptors and the rendering network are learned from 200 random views of the scene. The comparison of the proposed renderings with the "ground truth" synthetic renderings obtained by ray tracing within Blender reveals a very close match (Fig. 7). While Blender takes about 2 minutes to render one frame of this scene on two GeForce RTX 2080 Ti (highest quality setting), the proposed renderings are obtained in 50ms (20 frames per second) on one GeForce RTX 2080 Ti. Note that, given the availability of a good surface mesh for this scene, mesh-based neural rendering approaches are also likely to perform well at this task.
Thus, presented is a neural point-based approach for modeling complex scenes. Similarly to classical point-based approaches, 3D points are used as modeling primitives. Each of the points in the proposed approach is associated with a local descriptor containing information about local geometry and appearance. A rendering network that translates point rasterizations into realistic views, while taking the learned descriptors as input point pseudo-colors, is learned in parallel with the descriptors themselves.
The learning process is performed using a dataset of point clouds and images. After learning, proposed model can be fitted to new scenes and is capable of producing realistic views from new viewpoints.
Notably, proposed system accomplishes this in a purely data-driven manner, while avoiding meshing, or any other form of explicit surface reconstruction, as well as without performing explicit geometric and photometric surface parameter estimation.
Main contribution is the demonstration that point clouds can be successfully used as geometric proxies for neural rendering, while missing information about connectivity as well as geometric noise and holes can be handled by deep rendering networks gracefully.
The model benefits from pretraining on a corpus of scenes, and good results can be obtained with a universal rendering network that has not been fine-tuned for a particular scene.
Limitations and improvements. The proposed model currently cannot fill very big holes in geometry in a realistic way. Such an ability is likely to come with additional point cloud processing/inpainting that could potentially be trained jointly with the proposed modeling pipeline. Another direction for investigation is the performance of the system on dynamic scenes, where some update mechanism for the neural descriptors of points would need to be introduced.
FIG. 8 is a block diagram of an electronic apparatus according to an embodiment of the disclosure.
An electronic apparatus 100 according to an embodiment may include a memory 110, and a processor 120. The electronic apparatus 100 according to an embodiment may be implemented as various types of electronic apparatus such as smartphone, AR glasses, tablet PC, artificial intelligence speaker, mobile phone, video phone, e-book reader, TV, desktop PC, laptop PC, netbook computer, workstation, camera, smart watch, and the like.
The memory 110 may store various programs and data necessary for the operation of the electronic apparatus 100. Specifically, at least one instruction may be stored in the memory 110. The processor 120 may perform the operation of the electronic apparatus 100 by executing the instruction stored in the memory 110.
The memory 110 may be implemented as a non-volatile memory, a volatile memory, a flash memory, a hard disk drive (HDD) or a solid state drive (SSD). The memory 110 is accessed by the processor 120, and reading/recording/modifying/deleting/updating of data can be performed by the processor 120. In the present disclosure, the term 'memory' may include the memory 110, a ROM (not illustrated) in the processor 120, a RAM (not illustrated), or a memory card (not illustrated) (e.g., a micro SD card or a memory stick) mounted on the electronic apparatus 100. In addition, when the electronic apparatus 100 includes a display, programs and data for configuring various screens to be displayed on a display area of the display may be stored in the memory 110.
The functions related to the artificial intelligence according to the disclosure are operated through the processor 120 and the memory 110. The processor 120 may be implemented by a system-on-chip (SoC) or a large scale integration (LSI) in which a processing algorithm is embedded, and may also be implemented in the form of a field programmable gate array (FPGA). The processor 120 may perform various functions by executing computer executable instructions stored in the memory to be described later. The processor 120 may be configured as one or a plurality of processors. Here, one or the plurality of processors may be a general-purpose processor such as a CPU, an AP, or the like, a graphic-dedicated processor such as a GPU, a VPU, or the like, or an artificial intelligence dedicated processor such as an NPU. One or the plurality of processors performs a control to process input data according to predefined operating rules or artificial intelligence models stored in the memory 110. The predefined operating rules or artificial intelligence models are characterized by being created through learning. Here, the predefined operating rules or artificial intelligence models created through learning refer to the predefined operating rules or artificial intelligence models of desired characteristics created by applying learning algorithms to a large number of learning data. The learning of the artificial intelligence model may be performed in a device itself in which the artificial intelligence according to the disclosure is performed, or may also be performed through a separate server/system.
The processor 120 may be electrically connected to the memory 110 and control the overall operations and functions of the electronic apparatus 100. In particular, the processor 120 may obtain input data including a point cloud with neural descriptors D for each point and camera parameters C for the point cloud. For example, the point cloud is obtained by one of open and proprietary applications selected from the group consisting of COLMAP and Agisoft Metashape. For example, the camera is a hand-held RGBD camera, and the point clouds are reconstructed via stereo matching.
And the processor 120 may estimate the viewpoint directions based on the input data. For example, the processor 120 may obtain raw data using hand-held commodity RGB-D sensors, and estimate the viewpoint directions using the raw data. According to an embodiment, the hand-held commodity RGB-D sensors are used to capture raw data, which is then processed by the above-mentioned software for the camera.
And the processor 120 may rasterize the points of the point cloud with the z-buffer algorithm using the neural descriptors concatenated with the viewpoint directions as pseudo-colors. For example, the processor 120 may first rasterize each point into a square with a side length that is inversely proportional to the depth of the point with respect to the camera, wherein the neural rendering network provides a rendering process which is performed using OpenGL without anti-aliasing. And, the processor 120 may superimpose the squares onto each other based on their depths with respect to the camera by using the Z-buffer algorithm, obtain a multi-channel raw image by iterating over all footprint sets and filling all pixels, and map the multi-channel raw image into a three-channel RGB image by using a pretrained rendering network with learnable parameters. According to an embodiment, the neural rendering network uses a deep convolutional neural network to generate photorealistic renderings from new viewpoints.
And the processor 120 may obtain the resulting image by passing the rasterized points through the neural rendering network, while learning the neural descriptors for every point and the neural rendering network.
And the processor 120 may render, using a loss function, the resulting image on the display as a ground truth.
According to an embodiment, the point cloud is a scene geometry representation, and the convolutional neural network is used so that the output color value at a pixel depends on multiple neural descriptors and multiple points projected to the neighborhood of this pixel. And the descriptors are local descriptors; the local descriptors are learned directly from data, and the learning is performed in coordination with the learning of the rendering network.
According to an embodiment, the neural descriptors describe both the geometric and the photometric properties of the data.
FIG. 9 is a flowchart provided to explain a controlling method of an electronic apparatus according to an embodiment.
First of all, the electronic apparatus 100 may obtain input data including a point cloud with neural descriptors for each point and camera parameters for the point cloud (S910). For example, the point cloud is obtained by one of open and proprietary applications selected from the group consisting of COLMAP and Agisoft Metashape. For example, the camera is a hand-held RGBD camera, and the point clouds are reconstructed via stereo matching.
And, the electronic apparatus 100 may estimate the viewpoint directions based on the input data (S920). According to an embodiment, hand-held commodity RGB-D sensors are used to capture raw data, which is then processed by the above-mentioned software for the camera.
And, the electronic apparatus 100 may rasterize the points of the point cloud with the z-buffer algorithm using the neural descriptors concatenated with the viewpoint directions as pseudo-colors (S930).
And, the electronic apparatus 100 may obtain the resulting image by passing the rasterized points through the neural rendering network, while learning the neural descriptors for every point and the neural rendering network (S940). And, the electronic apparatus 100 may render, using a loss function, the resulting image on the display as a ground truth (S950).
As for the terms used in the embodiments of the disclosure, general terms that are currently used widely were selected in consideration of the functions described in the disclosure. However, the terms may vary depending on the intention of those skilled in the art who work in the pertinent field, previous court decisions or emergence of new technologies. Also, in particular cases, there may be terms that were designated by the applicant, and in such cases, the meaning of the terms will be described in detail in the relevant descriptions in the disclosure. Thus, the terms used in the disclosure should be defined based on the meaning of the terms and the overall content of the disclosure, but not just based on the names of the terms.
In the present disclosure, terms including an ordinal number such as 'first' , 'second' , etc. may be used to describe various components, but the components are not to be construed as being limited to the terms. The terms are only used to differentiate one component from other components. For example, the first element may be referred to as the second element and similarly, the second element may be referred to as the first element without going beyond the scope of rights of the present disclosure. The term 'and/or' includes a combination of a plurality of items or any one of a plurality of terms.
In addition, in the present disclosure, singular forms used in the specification are intended to include plural forms unless the context clearly indicates otherwise.
Further, it will be further understood that the terms "comprises" or "have" used in the present disclosure, specify the presence of stated features, numerals, steps, operations, components, parts mentioned in this specification, or a combination thereof, but do not preclude the presence or addition of one or more other features, numerals, steps, operations, components, parts, or a combination thereof.
Further, in the present disclosure, a 'module' or a 'unit' performs at least one function or operation and may be implemented by hardware or software or a combination of the hardware and the software. In addition, a plurality of 'modules' or 'units' may be integrated into at least one module and may be realized as at least one processor in an integrated manner except for 'modules' or 'units' that should be realized in specific hardware.
Further, in the present disclosure, a case in which any one part is "connected" with the other part includes a case in which the parts are "directly connected" with each other and a case in which the parts are "electrically connected" with each other with other elements interposed therebetween.
Meanwhile, according to an embodiment, the above-described various embodiments of the disclosure may be implemented as software including instructions that can be stored in machine-readable storage media, which can be read by machine (e.g.: computers). The machine refers to an apparatus that calls instructions stored in a storage medium, and can operate according to the called instructions, and the apparatus may include an electronic apparatus (e.g.: an electronic apparatus (A)) according to the embodiments described in the disclosure. When an instruction is executed by a processor, the processor may perform a function corresponding to the instruction by itself, or by using other components under its control. The instruction may include a code that is generated or executed by a compiler or an interpreter. The storage medium that is readable by machine may be provided in the form of a non-transitory storage medium. Here, the term 'non-transitory' only means that a storage medium does not include signals, and is tangible, but does not indicate whether data is stored in the storage medium semi-permanently or temporarily.
In addition, according to an embodiment of the disclosure, the method according to the various embodiments described above may be provided while being included in a computer program product. A computer program product refers to a product, and it can be traded between a seller and a buyer. The computer program product can be distributed on-line in the form of a storage medium that is readable by machines (e.g.: a compact disc read only memory (CD-ROM)), or through an application store (e.g.: play store TM). In the case of on-line distribution, at least a portion of the computer program product may be stored in a storage medium such as the server of the manufacturer, the server of the application store, and the memory of the relay server at least temporarily, or may be generated temporarily.
Also, each of the components according to the aforementioned various embodiments (e.g.: a module or a program) may consist of a singular object or a plurality of objects. In addition, among the aforementioned corresponding sub components, some sub components may be omitted, or other sub components may be further included in the various embodiments. Generally or additionally, some components (e.g.: a module or a program) may be integrated as an object, and perform the functions that were performed by each of the components before integration identically or in a similar manner. Operations performed by a module, a program, or other components according to the various embodiments may be executed sequentially, in parallel, repetitively, or heuristically. Or, at least some of the operations may be executed in a different order, or omitted, or other operations may be added.
While preferred embodiments of the disclosure have been shown and described, the disclosure is not limited to the aforementioned specific embodiments, and it is apparent that various modifications can be made by those having ordinary skill in the art to which the disclosure belongs, without departing from the gist of the disclosure as claimed by the appended claims, and such modifications are not to be interpreted independently from the technical idea or prospect of the disclosure.
References
[1] Blender Online Community, retrieved 20.05.2019. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam. http://www.blender.org
[2] James F Blinn. 1978. Simulation of wrinkled surfaces. In Proc. SIGGRAPH, Vol. 12.ACM, 286-292.
[3] James F Blinn and Martin E Newell. 1976. Texture andreflection in computer generated images. Commun. ACM 19, 10 (1976), 542-547.
[4] Giang Bui, True Le, Brittany Morago, and Ye Duan. 2018. Point-based rendering enhancement via deep learning. The Visual Computer 34, 6-8 (2018), 829-841.
[5] Anpei Chen, Minye Wu, Yingliang Zhang, Nianyi Li, Jie Lu, Shenghua Gao, and Jingyi Yu. 2018. Deep Surface Light Fields. Proceedings of the ACM on Computer Graphics and Interactive Techniques 1, 1 (2018), 14.
[6] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In Proc. CVPR.
[7] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. 2017. BundleFusion: Real-Time Globally Consistent 3D Reconstruction Using On-the-Fly Surface Reintegration. ACM Trans. Graph. 36, 3 (2017), 24:1-24:18.
[8] Paul Debevec, Yizhou Yu, and George Borshukov. 1998. Efficient view-dependent image-based rendering with projective texture-mapping. In Rendering Techniques. Springer, 105-116.
[9] Alexey Dosovitskiy and Thomas Brox. 2016. Generating Images with Perceptual Similarity Metrics based on Deep Networks. In Proc. NIPS. 658-666.
[10] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. 2015. Learning to generate chairs with convolutional neural networks. In Proc. CVPR. 1538-1546.
[11] Felix Endres, Jürgen Hess, Jürgen Sturm, Daniel Cremers, and Wolfram Burgard. 2014. 3-D mapping with an RGB-D camera. IEEE Transactions on Robotics 30, 1 (2014), 177-187.
[12] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. 2016. Deepstereo: Learning to predict new views from the world' s imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5515-5524.
[13] Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, and Victor S. Lempitsky. 2016. DeepWarp: Photorealistic Image Resynthesis for Gaze Manipulation. In Proc. ECCV. 311-326.
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proc. NIPS. 2672-2680.
[15] Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. 1996. The Lumigraph. In SIGGRAPH. ACM, 43-54.
[16] Markus Gross, Hanspeter Pfister, Marc Alexa, Mark Pauly, Marc Stamminger, and Matthias Zwicker. 2002. Point based computer graphics. Eurographics Assoc.
[17] Jeffrey P Grossman and William J Dally. 1998. Point sample rendering. In Rendering Techniques '98. Springer, 181-192.
[18] Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel J. Brostow. 2018. Deep blending for free-viewpoint image-based rendering. ACM Trans. Graph. 37, 6 (2018), 257:1-257:15.
[19] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2017. Globally and Locally Consistent Image Completion. ACM Transactions on Graphics (Proc. of SIGGRAPH 2017) 36, 4, Article 107 (2017), 107:1-107:14 pages.
[20] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Proc. CVPR. 5967-5976.
[21] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proc. ECCV. 694-711.
[22] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations. https://openreview.net/forum?id=Hk99zCeAb
[23] Christian Kerl, Jürgen Sturm, and Daniel Cremers. 2013. Dense visual SLAM for RGB-D cameras. In Proc. IROS. IEEE, 2100-2106.
[24] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014). arXiv:1412.6980
[25] Leif Kobbelt and Mario Botsch. 2004. A survey of point-based techniques in computer graphics. Computers & Graphics 28, 6 (2004), 801-814.
[26] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural computation 1, 4 (1989), 541-551.
[27] Marc Levoy and Pat Hanrahan. 1996. Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. ACM, 31-42.
[28] Marc Levoy and Turner Whitted. 1985. The use of points as a display primitive. Citeseer.
[29] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. 2018. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV). 85-100.
[30] William E Lorensen and Harvey E Cline. 1987. Marching cubes: A high resolution 3D surface construction algorithm. In Proc. SIGGRAPH, Vol. 21. 163-169.
[31] Ricardo Martin-Brualla, Rohit Pandey, Shuoran Yang, Pavel Pidlypenskyi, Jonathan Taylor, Julien Valentin, Sameh Khamis, Philip Davidson, Anastasia Tkach, Peter Lincoln, et al. 2018. LookinGood: enhancing performance capture with real-time neural re-rendering. In SIGGRAPH Asia 2018 Technical Papers. ACM, 255.
[32] Leonard McMillan and Gary Bishop. 1995. Plenoptic modeling: an image-based rendering system. In SIGGRAPH. ACM, 39-46.
[33] Oliver Nalbach, Elena Arabadzhiyska, Dushyant Mehta, Hans-Peter Seidel, and Tobias Ritschel. 2017. Deep Shading: Convolutional Neural Networks for Screen Space Shading. Comput. Graph. Forum 36, 4 (2017), 65-78.
[34] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew W. Fitzgibbon. 2011. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR. IEEE Computer Society, 127-136.
[35] Hanspeter Pfister, Matthias Zwicker, Jeroen Van Baar, and Markus Gross. 2000. Surfels: Surface elements as rendering primitives. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 335-342.
[36] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. Springer, 234-241.
[37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. CoRR abs/1505.04597 (2015). arXiv:1505.04597 http://arxiv.org/abs/1505.04597
[38] Steven M Seitz and Charles R Dyer. 1996. View morphing. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. ACM,21-30.
[39] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).arXiv:1409.1556 http://arxiv.org/abs/1409.1556
[40] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. 2019. DeepVoxels: Learning Persistent 3D Feature Embeddings. In Proc. CVPR.
[41] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. 2012. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. IROS. IEEE, 573-580.
[42] Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred Neural Rendering: Image Synthesis using Neural Textures. In Proc. SIGGRAPH.
[43] J. Thies, M. Zollhöfer, C. Theobalt, M. Stamminger, and M. Nießner. 2018. IGNOR: Image-guided Neural Object Rendering. arXiv 2018 (2018).
[44] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. Video-to-Video Synthesis. In Proc. NIPS.
[45] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proc. CVPR.
[46] Thomas Whelan, Michael Kaess, Hordur Johannsson, Maurice Fallon, John J Leonard, and John McDonald. 2015. Real-time large-scale dense RGB-D SLAM with volumetric fusion. The International Journal of Robotics Research 34, 4-5 (2015), 598-626.
[47] Daniel N Wood, Daniel I Azuma, Ken Aldinger, Brian Curless, Tom Duchamp, David H Salesin, and Werner Stuetzle. 2000. Surface light fields for 3D photography. In Proc. SIGGRAPH. 287-296.
[48] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. 2018. Free-Form Image Inpainting with Gated Convolution. arXiv preprint arXiv:1806.03589 (2018).
[49] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. 2016. View synthesis by appearance flow. In Proc. ECCV. 286-301.
[50] Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and Markus Gross. 2001. Surface splatting. In Proc. SIGGRAPH. ACM, 371-378.

Claims (15)

  1. A method for controlling an electronic device, the method comprising:
    obtaining input data including a point cloud with a neural descriptor for each point and a camera parameter for the point cloud,
    estimating a viewpoint direction based on the input data,
    rasterizing the points of the point cloud with a z-buffer algorithm, using the neural descriptors concatenated with the viewpoint direction as pseudo-colors,
    obtaining a resulting image by passing the rasterized points through a neural rendering network, and learning the neural descriptor for every point and the neural rendering network, and
    rendering the resulting image on a display, the learning being performed using a loss function with respect to a ground truth.
  2. The method according to claim 1, wherein the estimating is performed by software for a camera, the software including at least one of Agisoft Metashape, COLMAP, and Open3D.
  3. The method according to claim 2,
    wherein the estimating further comprises:
    obtaining raw data using a hand-held commodity RGB-D sensor, and
    estimating the viewpoint direction using the raw data,
    wherein the hand-held commodity RGB-D sensor is used to capture the raw data, which is processed by the software for the camera.
  4. The method according to claim 1,
    wherein the rasterizing further comprises:
    first rasterizing each point into a square with a side length that is inversely proportional to a depth of the point with respect to the camera, wherein the neural rendering network provides a rendering process which is performed in OpenGL without anti-aliasing;
    superimposing the squares onto each other based on their depths with respect to the camera by using the z-buffer algorithm;
    obtaining a multi-channel raw image by iterating over all footprint sets and filling all pixels, and
    mapping the multi-channel raw image into a three-channel RGB image by using a pretrained neural rendering network with learnable parameters.
  5. The method according to claim 1, wherein the point cloud is obtained by at least one of open or proprietary software selected from the group consisting of COLMAP and Agisoft Metashape.
  6. The method according to claim 1,
    wherein the point cloud is a scene geometry representation, and
    wherein a convolutional neural network is used so that the output color value at a pixel depends on multiple neural descriptors and multiple points projected to the neighborhood of this pixel.
  7. The method according to claim 1, wherein the neural rendering network uses a deep convolutional neural network to generate photorealistic renderings from new viewpoints.
  8. The method according to claim 1,
    wherein the neural descriptor describes both geometric and photometric properties of the data.
  9. The method according to claim 6, wherein the neural descriptors are local descriptors, the local descriptors are learned directly from data, and the learning is performed in coordination with the learning of the neural rendering network.
  10. The method according to claim 2,
    wherein the camera is a hand-held RGB-D camera, and
    wherein the point cloud is reconstructed via stereo matching.
  11. An electronic device comprising:
    a memory configured to store one or more instructions; and
    a processor configured to execute the one or more instructions,
    wherein the processor is configured to:
    obtain input data including a point cloud with a neural descriptor for each point and a camera parameter for the point cloud,
    estimate a viewpoint direction based on the input data,
    rasterize the points of the point cloud with a z-buffer algorithm using the neural descriptors concatenated with the viewpoint direction as pseudo-colors,
    obtain a resulting image by passing the rasterized points through a neural rendering network, and learn the neural descriptor for every point and the neural rendering network, and
    render the resulting image on a display, the learning being performed using a loss function with respect to a ground truth.
  12. The electronic device of claim 11, wherein the viewpoint direction is estimated by software for a camera, the software including at least one of Agisoft Metashape, COLMAP, and Open3D.
  13. The electronic device of claim 12, wherein the processor is configured to:
    obtain raw data using a hand-held commodity RGB-D sensor, and
    estimate the viewpoint direction using the raw data,
    wherein the hand-held commodity RGB-D sensor is used to capture the raw data, which is then processed by the software for the camera.
  14. The electronic device of claim 11, wherein the processor is configured to:
    first rasterize each point into a square with a side length that is inversely proportional to a depth of the point with respect to the camera, wherein the neural rendering network provides a rendering process which is performed in OpenGL without anti-aliasing;
    superimpose the squares onto each other based on their depths with respect to the camera by using the z-buffer algorithm;
    obtain a multi-channel raw image by iterating over all footprint sets and filling all pixels, and
    map the multi-channel raw image into a three-channel RGB image by using a pretrained neural rendering network with learnable parameters.
  15. The electronic device of claim 11, wherein the point cloud is obtained by at least one of open or proprietary software selected from the group consisting of COLMAP and Agisoft Metashape.
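
To make the rasterization recited in claims 1, 4, 11 and 14 concrete, the following is a minimal sketch rather than the claimed implementation: each point is splatted as a screen-aligned square whose side length is inversely proportional to the point's depth, overlapping footprints are resolved with a z-buffer, and the point's neural descriptor concatenated with the viewpoint direction is written as a pseudo-color. The NumPy formulation (in place of OpenGL), the function and parameter names, and the `base_size` footprint constant are all illustrative assumptions.

```python
import numpy as np

def rasterize_points(points, descriptors, K, R, t, height, width, base_size=8.0):
    """points: (N, 3) world coordinates; descriptors: (N, D) learned neural descriptors;
    K, R, t: pinhole intrinsics and extrinsics. Returns a (D + 3, H, W) raw image."""
    cam = (R @ points.T + t[:, None]).T                 # world -> camera coordinates
    depth = cam[:, 2]
    front = depth > 1e-6                                # keep only points in front of the camera
    proj = (K @ cam[front].T).T
    px = proj[:, 0] / proj[:, 2]                        # pixel coordinates after perspective division
    py = proj[:, 1] / proj[:, 2]
    z = depth[front]
    view_dir = cam[front] / np.linalg.norm(cam[front], axis=1, keepdims=True)
    pseudo = np.concatenate([descriptors[front], view_dir], axis=1)   # (M, D + 3) pseudo-colors

    raw = np.zeros((pseudo.shape[1], height, width), dtype=np.float32)
    zbuf = np.full((height, width), np.inf, dtype=np.float32)

    for i in range(len(z)):
        side = max(1, int(round(base_size / z[i])))     # square footprint shrinks with depth
        x0 = max(int(px[i]) - side // 2, 0)
        y0 = max(int(py[i]) - side // 2, 0)
        x1 = min(int(px[i]) + side // 2 + 1, width)
        y1 = min(int(py[i]) + side // 2 + 1, height)
        if x0 >= x1 or y0 >= y1:
            continue
        win = zbuf[y0:y1, x0:x1] > z[i]                 # pixels where this point is the nearest so far
        raw[:, y0:y1, x0:x1][:, win] = pseudo[i][:, None]
        zbuf[y0:y1, x0:x1][win] = z[i]
    return raw
```

In this sketch the z-buffer test plays the role of the claimed superimposition step: later, nearer squares overwrite the pseudo-colors of farther ones pixel by pixel, so the returned multi-channel raw image only contains the front-most point at each covered pixel.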
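Claims 1, 6, 7 and 9 further describe a deep convolutional rendering network that maps the rasterized pseudo-color image to a three-channel RGB image, with the per-point neural descriptors learned jointly with the network against a ground truth. The sketch below is an assumption-laden illustration, not the claimed implementation: the small convolutional stack, the L1 loss, the descriptor length `D`, and the `point_index_map`/`view_dir_map` inputs are hypothetical placeholders (the actual method may use a U-Net-style network and a perceptual loss).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RenderNet(nn.Module):
    """Small stand-in for the deep convolutional rendering network of the claims."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),      # three-channel RGB output
        )

    def forward(self, raw):                                    # raw: (B, D + 3, H, W)
        return self.net(raw)

D = 8                                                          # descriptor length (assumed)
descriptors = nn.Parameter(0.01 * torch.randn(100_000, D))     # one learnable descriptor per point
renderer = RenderNet(in_ch=D + 3)
optimizer = torch.optim.Adam([descriptors, *renderer.parameters()], lr=1e-4)

def training_step(point_index_map, view_dir_map, ground_truth):
    """point_index_map: (H, W) long tensor with the id of the point that won the
    z-buffer at each pixel (-1 where no point projects); view_dir_map: (3, H, W);
    ground_truth: the photograph from the same viewpoint, shape (1, 3, H, W)."""
    optimizer.zero_grad()
    covered = (point_index_map >= 0).float().unsqueeze(0)      # (1, H, W) coverage mask
    # Differentiable gather: each covered pixel takes the descriptor of its winning point,
    # so the loss gradient updates `descriptors` and `renderer` jointly, as in claim 9.
    desc_map = descriptors[point_index_map.clamp(min=0)].permute(2, 0, 1) * covered
    raw = torch.cat([desc_map, view_dir_map], dim=0).unsqueeze(0)   # (1, D + 3, H, W)
    rendered = renderer(raw)
    loss = F.l1_loss(rendered, ground_truth)                   # placeholder for the claimed loss
    loss.backward()
    optimizer.step()
    return loss.item()
```

In use, a rasterizer such as the one sketched above (or an OpenGL pass) would supply `point_index_map` and `view_dir_map` for each training viewpoint, and `training_step` would be called once per viewpoint/photograph pair until the descriptors and network weights converge.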
PCT/KR2020/006777 2019-05-28 2020-05-26 Electronic device and controlling method thereof WO2020242170A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
RU2019116513 2019-05-28
RU2019116513 2019-05-28
RU2019138692 2019-11-29
RU2019138692A RU2729166C1 (en) 2019-11-29 2019-11-29 Neural dot graphic

Publications (1)

Publication Number Publication Date
WO2020242170A1 true WO2020242170A1 (en) 2020-12-03

Family

ID=73554083

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/006777 WO2020242170A1 (en) 2019-05-28 2020-05-26 Electronic device and controlling method thereof

Country Status (1)

Country Link
WO (1) WO2020242170A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180096527A1 (en) * 2013-10-25 2018-04-05 Appliance Computing III, Inc. Image-based rendering of real spaces
US20190065824A1 (en) * 2016-04-04 2019-02-28 Fugro N.V. Spatial data analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BUI GIANG; LE TRUC; MORAGO BRITTANY; DUAN YE: "Point-based rendering enhancement via deep learning", VISUAL COMPUTER, SPRINGER, BERLIN, DE, vol. 34, no. 6, 11 May 2018 (2018-05-11), DE, pages 829 - 841, XP036515456, ISSN: 0178-2789, DOI: 10.1007/s00371-018-1550-6 *
LU WEIXIN; WAN GUOWEI; ZHOU YAO; FU XIANGYU; YUAN PENGFEI; SONG SHIYU: "DeepVCP: An End-to-End Deep Neural Network for Point Cloud Registration", 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 27 October 2019 (2019-10-27), pages 12 - 21, XP033723373, DOI: 10.1109/ICCV.2019.00010 *
THIES JUSTUS; ZOLLHÖFER MICHAEL; NIESSNER MATTHIAS: "Deferred neural rendering", ACM TRANSACTIONS ON GRAPHICS, ACM, NY, US, vol. 38, no. 4, 12 July 2019 (2019-07-12), US, pages 1 - 12, XP058452138, ISSN: 0730-0301, DOI: 10.1145/3306346.3323035 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022150217A1 (en) * 2021-01-06 2022-07-14 Microsoft Technology Licensing, Llc End-to-end 3d scene reconstruction and image projection
CN113163185A (en) * 2021-03-26 2021-07-23 复旦大学 VR real-time self-adaptive transmission system and method based on heterogeneous calculation
CN113140033A (en) * 2021-03-29 2021-07-20 北京航空航天大学 Single cumulant cloud image reconstruction method based on micro-rendering
WO2022247414A1 (en) * 2021-05-26 2022-12-01 北京地平线信息技术有限公司 Method and apparatus for generating space geometry information estimation model
CN113487493B (en) * 2021-06-02 2023-08-18 厦门大学 GANilla-based SAR image automatic colorization method
CN113506362A (en) * 2021-06-02 2021-10-15 湖南大学 Method for synthesizing new view of single-view transparent object based on coding and decoding network
CN113506362B (en) * 2021-06-02 2024-03-19 湖南大学 Method for synthesizing new view of single-view transparent object based on coding and decoding network
CN113487493A (en) * 2021-06-02 2021-10-08 厦门大学 SAR image automatic colorization method based on GANILA
CN113674373B (en) * 2021-07-02 2024-04-26 清华大学 Real face rendering method based on deep learning
CN113674373A (en) * 2021-07-02 2021-11-19 清华大学 Realistic face rendering method and device based on deep learning
CN113706714B (en) * 2021-09-03 2024-01-05 中科计算技术创新研究院 New view angle synthesizing method based on depth image and nerve radiation field
CN113706714A (en) * 2021-09-03 2021-11-26 中科计算技术创新研究院 New visual angle synthesis method based on depth image and nerve radiation field
CN114119850B (en) * 2022-01-26 2022-06-03 之江实验室 Virtual and actual laser radar point cloud fusion method
CN114119850A (en) * 2022-01-26 2022-03-01 之江实验室 Virtual and actual laser radar point cloud fusion method and device
CN115018989A (en) * 2022-06-21 2022-09-06 中国科学技术大学 Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment
CN115018989B (en) * 2022-06-21 2024-03-29 中国科学技术大学 Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment
CN116363284A (en) * 2023-03-30 2023-06-30 上海艾特海浦网络科技有限公司 Intelligent rendering control system based on multipoint influence
CN116363284B (en) * 2023-03-30 2023-12-08 上海艾特海浦网络科技有限公司 Intelligent rendering control system based on multipoint rendering

Similar Documents

Publication Publication Date Title
WO2020242170A1 (en) Electronic device and controlling method thereof
Aliev et al. Neural point-based graphics
Wu et al. Multi-view neural human rendering
Tucker et al. Single-view view synthesis with multiplane images
Tewari et al. State of the art on neural rendering
Nguyen-Phuoc et al. Rendernet: A deep convolutional network for differentiable rendering from 3d shapes
Thies et al. Ignor: Image-guided neural object rendering
Khakhulin et al. Realistic one-shot mesh-based head avatars
Chen et al. Deep surface light fields
Wang et al. High-quality real-time stereo using adaptive cost aggregation and dynamic programming
Kawai et al. Diminished reality based on image inpainting considering background geometry
Zhi et al. Texmesh: Reconstructing detailed human texture and geometry from rgb-d video
Aizawa et al. Producing object-based special effects by fusing multiple differently focused images
Li et al. Capturing relightable human performances under general uncontrolled illumination
Meuleman et al. Real-time sphere sweeping stereo from multiview fisheye images
Mihajlovic et al. Deepsurfels: Learning online appearance fusion
Sevastopolsky et al. Relightable 3d head portraits from a smartphone video
Martin-Brualla et al. Gelato: Generative latent textured objects
Yazdani et al. Physically inspired dense fusion networks for relighting
Ouyang et al. Real-time neural character rendering with pose-guided multiplane images
Wang et al. Neural opacity point cloud
Xu et al. Renerf: Relightable neural radiance fields with nearfield lighting
Nicolet et al. Repurposing a relighting network for realistic compositions of captured scenes
Casas et al. Rapid photorealistic blendshape modeling from RGB-D sensors
Zhou et al. 3D reconstruction based on light field information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20814902

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20814902

Country of ref document: EP

Kind code of ref document: A1