WO2021096190A1 - Method for synthesizing 2D image of scene as viewed from desired viewpoint and electronic computing device implementing the same


Info

Publication number
WO2021096190A1
Authority
WO
WIPO (PCT)
Prior art keywords
point
image
machine learning
viewpoint
ray
Application number
PCT/KR2020/015686
Other languages
French (fr)
Inventor
Kara-Ali Alibulatovich ALIEV
Maria Vladimirovna KOLOS
Victor Sergeevich LEMPITSKY
Artem Mikhailovich SEVASTOPOLSKIY
Original Assignee
Samsung Electronics Co., Ltd.
Priority claimed from RU2020113525A (RU2749749C1)
Application filed by Samsung Electronics Co., Ltd.
Publication of WO2021096190A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 - 3D [Three Dimensional] image rendering
    • G06T15/08 - Volume rendering
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • G06T2210/00 - Indexing scheme for image generation or computer graphics
    • G06T2210/36 - Level of detail
    • G06T2210/56 - Particle system, point based geometry or rendering

Definitions

  • the present invention relates generally to the fields of computer vision and computer graphics to produce 2D images of a 3D scene as viewed from different viewpoints, in particular, to a method for synthesizing a 2D image of a scene as viewed from desired viewpoint and an electronic computing device implementing the method.
  • the present invention follows the neural point-based graphics framework (2), which uses neural modeling based on point cloud geometry representations.
  • Point clouds have several attractive properties compared to mesh representations and volumetric representations. First, unlike volumetric representations, they scale well to large scenes, as points in the cloud need not be uniformly or near-uniformly distributed. Second, while meshes may be unsuitable to represent various complex phenomena such as thin objects, point clouds can model them efficiently. In general, point clouds of natural scenes are easier to obtain than their mesh representations, with the meshing process being one of the most brittle steps of the traditional image-based modeling pipelines.
  • point clouds arise as an intermediate representation early in such pipelines, whenever a scene is captured using a depth sensor, which generates a collection of depth scans, or by passive multi-view stereo, which usually also generates a collection of dense or semi-dense depth maps.
  • the rendering pipeline (2) starts with a "hard” rasterization of points using OpenGL z-buffering. This may introduce significant noise and overfitting into the learning process when the point cloud is noisy, as the outlier points occlude the true surface points. While, in principle, the learning process can identify such outlier points and learn to "inpaint them out” , this requires extra network capacity and may lead to overfitting, when outliers are observed in very few views.
  • the success of (2) depends on the choice of the point radius used for their rasterization. If the selected radius does not roughly match point density, the results may degrade considerably, with either fine details being lost, or invisible surfaces "bleeding" through the visible ones.
  • Point clouds are simple to process, as they are stored as two real-valued arrays - point coordinates with respect to some world coordinate system and point colors. Their expressiveness depends only on the number of points, and clouds of varying spatial density can be represented.
  • Signed Distance Functions (SDF) (21) are another common representation.
  • Voxel representations are also natural to learn and use for any kind of processing; however, they occupy a large amount of memory and cannot adapt to varying resolution.
  • Meshes, basically being point clouds with non-differentiable triangles, are much harder to process and are mainly employed for rendering (19, 5).
  • Differentiable rendering frameworks make it possible to generate gradients with respect to various scene parameters, such as the intrinsic and extrinsic parameters of a camera, the spatial and physical properties of a given 3D representation (e.g. mesh vertex positions, colors, or reflectance), and lighting.
  • A significant amount of work is devoted to mesh rendering, since mesh topology makes it possible to leverage geometry deformation and to use generic priors for various reconstruction tasks.
  • Soft Rasterizer (15) suggests probabilistic smoothing of the discrete sampling operation, while OpenDR (17) and Kato (9) explicitly derive approximate partial derivatives which, however, rely on numerical methods.
  • The authors of (32) address this issue by computing the gradients through integrating the pixel intensity function.
  • Pix2Vex (22) features a semi-transparent Z-buffer and a cyclic image/geometry training pipeline.
  • Like triangle rasterization, ray tracing can be extended in a differentiable manner: in particular, the method of (12) enforces edge sampling with further integration to approximate gradients while handling occlusions accurately.
  • Point clouds, despite being a simpler representation for automated processing than meshes due to the absence of non-differentiable topology, are significantly harder to render realistically.
  • The Differentiable Surface Splatting (33) method evaluates a rendered image by projecting points onto a canvas (an element for producing a 2D image, an image plane) and blending them with truncated Gaussian kernels. Due to the truncation introduced for the sake of efficiency, the derivatives are calculated approximately.
  • Neural rendering implies learning an arbitrary scene representation in order to generate realistic imagery and manipulate its appearance (from scene attribute manipulation to inpainting).
  • Neural Volumes (16) is based on predicting a 4D volume (RGB + opacity) for a model from several photos with a variational autoencoder, warping the volume, and integrating it in an opacity-aware manner.
  • Deferred Neural Rendering (31) learns to estimate a neural texture of an object based on its UV coordinate maps and aims to synthesize an image from samples of this deep texture.
  • DeepVoxels (28) takes a similar approach and estimates a volumetric latent neural code of an object with a combination of a CNN and an RNN (GRU); it is followed up by Scene Representation Networks (29), which employ only an RNN for learning point depths.
  • Neural Point-Based Graphics (2) undertakes an in-between approach: it involves learning embeddings of points in a cloud, splats visible points with a large kernel onto a canvas via a fast Z-buffer, and desparsifies the result by a CNN of U-Net type. Nevertheless, due to the hard Z-buffer, this pipeline is not fully differentiable.
  • the present invention has been created to eliminate at least one of the above shortcomings and to provide at least one of the advantages described below.
  • a differentiable neural renderer of point clouds obtained from scans of scenes reconstructed from real-world imagery is introduced.
  • the system is capable of synthesizing realistic and high quality looks of 3D scenes represented as point clouds, even when occlusions, noise, reflections and other complications take place.
  • The suggested neural architecture consists of a recurrent neural network for sequential processing of points grouped by imaginary rays forming the camera frustum, and of a fully-convolutional neural network which refines the obtained image.
  • the renderer is trained on a number of 3D scenes captured in-the-wild with a corresponding set of their photographs from several viewpoints, and is able to generate novel photorealistic views of a new scene after training as perceived by an arbitrarily located camera in the real world. Decent results for scenes taken by commodity RGB-D scanners and for point clouds compiled from RGB photos are presented.
  • the aim of the present invention is to leverage the flexibility of the point cloud representation within an end-to-end, fully differentiable framework that makes it possible to solve various computer graphics tasks while handling flaws that naturally occur in real-world scans.
  • two improvements to the neural point-based graphics pipeline (2) are introduced that address two shortcomings.
  • the hard rasterization process is replaced with recurrent rasterization, where an LSTM network (7) performs a neural analog of the z-buffer algorithm.
  • Such replacement allows much more graceful handling of outlier points, as the LSTM network can learn to render them fully transparent.
  • LSTM rendering precedes the convolutional rendering and can be trained jointly with it.
  • the point radius selection problem is avoided by using a neural analog of the classical Mipmapping algorithm from computer graphics.
  • a convolutional MipMapNet architecture is proposed that rasterizes the point cloud several times at different resolutions while always using a single-pixel radius for point rasterization.
  • the resulting rasterizations are fused inside the MipMapNet, so that the fusion process implicitly selects the optimal point radius based on the local point density.
  • the present invention demonstrates that both improvements (the neural z-buffer and neural Mipmapping) of the neural point-based graphics framework lead to renderings that are more compelling, temporally stable, and less prone to artifacts in the presence of noisy points.
  • Technologies of the present invention make it possible to synthesize, from the point clouds, a 2D image of a scene as viewed from desired viewpoint with high quality and low computational cost.
  • One aspect of the present invention provides a method for synthesizing a 2D image of a scene as viewed from desired viewpoint, the method comprising: receiving (S101) a 3D point cloud obtained from a plurality of 2D images of the same scene, wherein each point of the cloud is defined by 3D coordinates in a world coordinate system and a point embedding; setting (S102) the viewpoint as a camera having intrinsic parameters and extrinsic parameters; transforming (S103) the 3D coordinates of each point to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters; defining (S104) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters; grouping (S105) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order
  • training of the machine learning predictor comprises two consecutive stages: a pretraining stage performed on first training data set, wherein the first training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images; and a fine-tuning stage performed on second training data set, wherein the second training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, wherein the scenes in the second training data set differ from the scenes in the first training data set, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
  • each of the two stages of the training of the machine learning predictor comprises: randomly choosing (S201), from respective training data set, training data of randomly chosen scene comprising the 3D point cloud, the viewpoint, and the 2D image captured from said viewpoint; transforming (S202) the 3D coordinates of each point of the 3D point cloud to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters of the camera by which the set of 2D images was captured; defining (S203) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters; grouping (S204) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint; calculating (S205), for each ray, a ray embedding by aggregating the
  • the trained machine learning predictor comprises two parts, wherein the first part of the trained machine learning predictor executes the step (S106), and the second part of the trained machine learning predictor executes the step (S108).
  • the first part of the trained machine learning predictor is at least one of a recurrent neural network
  • the second part of the trained machine learning predictor is a U-net-based neural network
  • Another aspect of the present invention provides an electronic computing device, comprising: at least one processor; and a memory that stores numerical parameters of a trained machine learning predictor and instructions that, when executed by at least one processor, cause the at least one processor to perform a method for synthesizing a 2D image of a scene as viewed from desired viewpoint.
  • Fig. 1 is a schematic diagram illustrating the ray grouping process.
  • Fig. 2 is a schematic diagram illustrating operations of a machine learning predictor.
  • Fig. 3 is a flowchart illustrating a preferred embodiment of a method for synthesizing a 2D image of a scene as viewed from desired viewpoint.
  • Fig. 4 is a flowchart illustrating a training process of a machine learning predictor according to the present invention.
  • Fig. 5 is a block diagram illustrating an electronic computing device according to the present invention.
  • "Module" or "unit" may perform at least one function or operation, and may be implemented with hardware, software, or a combination thereof.
  • “Plurality of modules” or “plurality of units” may be implemented with at least one processor (not shown) through integration thereof with at least one module other than “module” or “unit” which needs to be implemented with specific hardware.
  • a camera parameterized by its intrinsic parameters and extrinsic parameters according to pinhole camera model (4) is provided.
  • the intrinsic parameters of the camera comprise a focal length, an image sensor format, and a principal point.
  • the extrinsic parameters of the camera comprise the location and orientation of the camera with respect to the world coordinate system.
  • an image of the point cloud perceived by the camera could be recreated by splatting (34, 33) the points onto a canvas, i.e. setting the color of each point as the color of a pixel in a hard way (straightforward assignment) or in a soft way (additionally setting the point color to neighboring pixels and fusing the colors of neighboring points for the sake of smoothness).
  • Fig. 1 illustrates the ray grouping process.
  • A bunny is shown as an example of a sample 3D model, which is visualized and stored as a point cloud.
  • A camera in the left-bottom corner perceives an image of the bunny collected on a yellow plane nearby.
  • A red-green-yellow trapezoid is a slice of an infinite pyramid comprising the camera frustum - the space of all possible point locations which can have an effect on the image.
  • The image canvas is saved in a discretized way and contains a grid of pixels. Let us consider a ray (drawn in blue) which passes through one pixel on the image. One point from the bunny's chest and another point from the bunny's leg will be considered as belonging to this ray, since they are projected onto the same integer position on the image.
  • The points of the cloud are grouped according to their rounded screen-space coordinates as a first step of the pipeline. More specifically, imaginary rays coming out of the camera are considered. Each ray originates at the camera location and passes through the respective pixel of the image canvas. The screen-space coordinates of each point are arithmetically rounded, and the point is considered as belonging to the corresponding ray.
  • The grouping process results in a one-to-one distribution of points among buckets (rays), and the points in each bucket are subsequently sorted in order of their decreasing depth with respect to the camera (see Fig. 1). After this procedure, different buckets (rays) will store a different number of ordered points, and some of the buckets might remain empty in case no points were placed in them (see Fig. 1).
  • In stage (a), at each scale, the point cloud grouped by a number of rays is passed to the LSTM network, with weights shared across all pixels and all scales. The LSTM accepts the embedding and depth of each point in a ray, one by one, in the order of decreasing depth.
  • In stage (b), for each ray, the LSTM yields an aggregated ray embedding - a vector of scalars. A tensor is obtained by placing the ray embeddings on a canvas.
  • In stage (c), a CNN of U-Net-based architecture collects the tensors from all scales at the respective levels of resolution and fuses them into the final RGB image.
  • Ray grouping results in a distribution of the points of the cloud into rays of variable length.
  • The rays are processed by a recurrent neural network (RNN) with learnable parameters, comprised of recurrent cells, for example LSTM cells (7), but not limited to them.
  • The method will be described for the case of an RNN consisting of LSTM cells. Being fed with an input, the last hidden state, and the last cell state, such an RNN transforms this information into the consequent hidden and cell states; it is used to sequentially process the points in each ray one by one and to aggregate this cumulative information into the output feature of the corresponding image pixel (see Fig. 2).
  • Such a construction allows one to mix points along the ray in a back-to-front order, effectively ignoring non-relevant clusters of points and identifying frontal surface of underlying 3D structure with respect to the chosen camera.
  • the resulting cell state is interpreted as an aggregated feature of a ray which contains relevant information about the pixel color to be estimated (for instance, one of the possible solutions for RNN would be to reproduce embedding of the frontmost point) and it is called a ray embedding.
  • the intuition behind such an approach lies in the ability of LSTM cell to simulate both simple and complex transparency blending rules, including OVER operator (23), order-independent overlay, etc.
  • Stepwise outputs of the LSTM are not utilized, since the expression for the LSTM cell state is more similar to transparency blending formulas than the expression for the output variables. A predefined value is set for those pixels which correspond to empty rays, and the result is further considered as a multi-channel image tensor.
  • the ray embeddings are fused and transformed into a final RGB image by a fully-convolutional network (FCN).
  • The architecture of this network was mainly inspired by U-Net (26), which was augmented by adding a pyramid of multi-scale inputs (see Fig. 2).
  • At each scale of the contracting path, the network accepts the ray-embedding tensor of the respective level, where 1, 2, 4, ... are the downsampling factors of the levels, and stacks it with the feature maps from the level of higher resolution. This allows the network to fill missing regions and to exploit information about a wider context from several resolutions at once.
  • Instead of plain convolutions, partial convolutions (14) were used in the contracting part.
  • These layers receive an input and a mask and process only the values at non-masked positions in the input, properly reweighting the result with respect to the given mask. This is done in order to make the refining network less dependent on the possible sparsity of the input caused by holes in the ray-embedding tensors.
  • the first loss restricts the RNN to produce a canvas of ray embeddings semantically consistent with the ground truth picture. More specifically, along with the RNN and the CNN, one 1 × 1 convolution layer is trained, and its result is compared with the ground truth.
  • the second loss restricts the refining ConvNet to produce a final prediction similar to the ground truth picture. (An illustrative sketch of these two losses is given after this list.)
  • the learning procedure consists of two stages.
  • In the first stage, called pretraining, a set of point clouds of scenes of a similar kind is provided.
  • Both the embeddings of the points of each scene and the parameters of all ConvNets are optimized together according to the sum of all loss functions over all scenes.
  • The optimization is performed by the ADAM algorithm and involves backpropagation of the loss gradient.
  • In the second stage, called fine-tuning, a point cloud for at least one scene is provided, and the learning aims to optimize only the point embeddings for this scene or set of scenes with respect to the frozen, previously pretrained ConvNets.
  • The learning process starts with zero descriptor values for this new scene or set of scenes.
  • the method 100 comprises steps S101 to S108.
  • a 3D point cloud is received.
  • the 3D point cloud can be either stored in the device memory or received from any remote device by wire or wireless communications.
  • the 3D point cloud can be obtained from a plurality of 2D images of the same scene by any technique known in prior art. Each point of the cloud is defined by 3D coordinates in a world coordinate system and a point embedding.
  • the desired viewpoint is set as a camera having intrinsic parameters and extrinsic parameters.
  • the 3D coordinates of each point are transformed to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters.
  • the intrinsic parameters and the extrinsic parameters define a rule of perspective transformation of coordinates from the world coordinate system to a respective screen space coordinate system. Such transformation is known in the prior art and described above. Therefore, detailed description of the transformation is omitted herein.
  • a plurality of rays diverging from the viewpoint are defined.
  • the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters. The definition of the plurality of rays is described in detail above with reference to Fig. 1.
  • the points are grouped into point sets associated with the rays.
  • Each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint.
  • A detailed description of grouping the points is set forth above in the "Ray Grouping" section.
  • a trained machine learning predictor calculates a ray embedding by aggregating the point embeddings and depths of the respective point set.
  • the ray embeddings are projected onto an image plane.
  • the steps (S103) to (S107) are performed for a plurality of scales.
  • the plurality of scales are defined in advance.
  • the trained machine learning predictor fuses the image planes of the plurality of scales into the 2D image.
  • the trained machine learning predictor comprises two parts. The first part of the trained machine learning predictor executes the step (S106), and the second part of the trained machine learning predictor executes the step (S108).
  • the first part of the trained machine learning predictor is at least one of a recurrent neural network.
  • the second part of the trained machine learning predictor is a U-net-based neural network.
  • the machine learning predictor is trained in two consecutive stages.
  • the first stage is a pretraining stage.
  • the second stage is a fine-tuning stage.
  • the first stage is performed on first training data set.
  • the first training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
  • the second stage is performed on second training data set.
  • the second training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, wherein the scenes in the second training data set differ from the scenes in the first training data set, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
  • a training process 200 of a machine learning predictor is illustrated in Fig. 4.
  • Each of the two stages of the training of the machine learning predictor comprises steps S201 to S208.
  • training data are randomly chosen from respective training data set.
  • the training data are randomly chosen from the first training data set.
  • the training data are randomly chosen from the second training data set.
  • the training data belong to randomly chosen scene and comprise the 3D point cloud, the viewpoint, and the 2D image captured from said viewpoint.
  • the 3D coordinates of each point of the 3D point cloud are transformed to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters of the camera by which the set of 2D images was captured.
  • a plurality of rays is defined.
  • the rays diverge from the viewpoint.
  • the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters.
  • the points of the point cloud are grouped into point sets associated with the rays.
  • Each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint.
  • a ray embedding is calculated for each ray by aggregating the point embeddings and depths of the respective point set.
  • the machine learning predictor processes the ray embeddings to obtain a sum of the loss function values.
  • the process relating to the loss functions is described in detail above in the "Loss functions" section.
  • In step S207, the gradient of the obtained sum is evaluated with respect to each scalar weight of the machine learning predictor and the point embeddings.
  • each scalar weight of the machine learning predictor and the embeddings of all points are changed according to the predefined optimizer rule based on the evaluated gradient.
  • the steps (S201) to (S208) are repeated for a predefined number of times.
  • the electronic computing device 300 comprises at least one processor 301 and a memory 302.
  • the memory 302 stores numerical parameters of the trained machine learning predictor and instructions. At least one processor 301 executes the instructions stored in the memory 302 to perform the method 100 for synthesizing a 2D image of a scene as viewed from desired viewpoint.
  • the method disclosed herein can be implemented by at least one processor, such as a central processing unit (CPU) or a graphics processing unit (GPU), implemented on at least one of an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA), but not limited to them.
  • the method disclosed herein can be implemented by a computer-readable medium that stores numerical parameters of the trained machine learning predictor and computer-executable instructions that, when executed by a computer processor, cause the computer to perform the inventive method.
  • the trained machine learning predictor and instructions for implementing the present method can be downloaded to the electronic computing device via a network or from the medium.
  • the present invention can be applied in Virtual Reality headsets, Augmented Reality glasses, Mixed Reality glasses, smartphones and other Virtual and/or Augmented Reality devices and systems.
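As referenced in the loss-function items above, the two training losses compare (i) the ray-embedding canvas mapped to RGB by a jointly trained 1 × 1 convolution and (ii) the refining network's final prediction, each against the ground truth picture. The sketch below is illustrative only: the tensor shapes and the use of a plain L1 distance are assumptions, since the exact loss formulas are not reproduced in this text.

```python
import torch
import torch.nn as nn

class RendererLosses(nn.Module):
    """Sketch of the two losses: a 1x1 convolution maps the ray-embedding canvas to
    RGB for comparison with the ground truth, and the refining network's output is
    compared with the ground truth as well (L1 distance assumed for illustration)."""

    def __init__(self, embed_dim=8):
        super().__init__()
        # Trained jointly with the RNN and the refining CNN.
        self.to_rgb = nn.Conv2d(embed_dim, 3, kernel_size=1)

    def forward(self, ray_embedding_canvas, final_prediction, ground_truth):
        # ray_embedding_canvas: (B, embed_dim, H, W); the other two: (B, 3, H, W)
        loss_rays = (self.to_rgb(ray_embedding_canvas) - ground_truth).abs().mean()
        loss_final = (final_prediction - ground_truth).abs().mean()
        return loss_rays + loss_final
```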


Abstract

The present invention relates generally to: the fields of computer vision and computer graphics to produce 2D images of a 3D scene as viewed from different viewpoints; a method for synthesizing a 2D image of a scene as viewed from desired viewpoint; and an electronic computing device implementing the method. The method comprises: receiving (S101) a 3D point cloud obtained from a plurality of 2D images of the same scene; setting (S102) the viewpoint as a camera having intrinsic parameters and extrinsic parameters; transforming (S103) the 3D coordinates of each point to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters; defining (S104) a plurality of rays diverging from the viewpoint; grouping (S105) the points into point sets associated with the rays; calculating (S106), for each ray, a ray embedding by aggregating the point embeddings and depths of the respective point set with a trained machine learning predictor; projecting (S107) the ray embeddings onto an image plane, the steps (S103) to (S107) being performed for a plurality of scales; and fusing (S108) the image planes of the plurality of scales by the trained machine learning predictor into the 2D image. The present invention makes it possible to synthesize, from the point clouds, a 2D image of a scene as viewed from desired viewpoint with high quality and low computational cost.

Description

METHOD FOR SYNTHESIZING 2D IMAGE OF SCENE AS VIEWED FROM DESIRED VIEWPOINT AND ELECTRONIC COMPUTING DEVICE IMPLEMENTING THE SAME
The present invention relates generally to the fields of computer vision and computer graphics to produce 2D images of a 3D scene as viewed from different viewpoints, in particular, to a method for synthesizing a 2D image of a scene as viewed from desired viewpoint and an electronic computing device implementing the method.
Several approaches for neural appearance capture of geometrically and photometrically complex scenes have been proposed recently (16, 31, 28, 29). Such approaches rely on differentiable rendering of certain geometric representations, and the capture process is usually performed by the fitting procedure that iterates between such rendering and backpropagation of the loss between the rendered images and the ground truth images onto the learnable parameters.
Several types of geometric representations have been investigated in this context. Thus (31) assumes that the scene geometry is modeled with a triangular mesh that is provided to the fitting process, and estimates neural textures in order to capture the photometric properties of different parts of the surface. Alternatively, several works (9, 22, 15) attempt to learn both the mesh and the texture (or surface colors) through the backpropagation process, which however has proven to be very difficult due to inherent non-differentiability of mesh rasterization near occluding boundaries. Another line of works (16, 21) investigates explicit and implicit volumetric representations that are learned jointly with the rendering network.
The present invention follows the neural point-based graphics framework (2), which uses neural modeling based on point cloud geometry representations. Point clouds have several attractive properties compared to mesh representations and volumetric representations. First, unlike volumetric representations, they scale well to large scenes, as points in the cloud need not be uniformly or near-uniformly distributed. Second, while meshes may be unsuitable to represent various complex phenomena such as thin objects, point clouds can model them efficiently. In general, point clouds of natural scenes are easier to obtain than their mesh representations, with the meshing process being one of the most brittle steps of the traditional image-based modeling pipelines. At the same time, point clouds arise as an intermediate representation early in such pipelines, whenever a scene is captured using a depth sensor, which generates a collection of depth scans, or by passive multi-view stereo, which usually also generates a collection of dense or semi-dense depth maps.
While the original framework (2) demonstrated several capture results of high realism, it is limited in several aspects. Thus, the rendering pipeline (2) starts with a "hard" rasterization of points using OpenGL z-buffering. This may introduce significant noise and overfitting into the learning process when the point cloud is noisy, as the outlier points occlude the true surface points. While, in principle, the learning process can identify such outlier points and learn to "inpaint them out" , this requires extra network capacity and may lead to overfitting, when outliers are observed in very few views. Secondly, the success of (2) depends on the choice of the point radius used for their rasterization. If the selected radius does not roughly match point density, the results may degrade considerably, with either fine details being lost, or invisible surfaces "bleeding" through the visible ones.
3D representation of a scene
Today a lot of representations of 3D scenes with different properties exist, and many of them can be used for automated processing. These include point clouds, meshes, Signed Distance Functions (SDF) (21), voxel representations and Octrees (18, 30), etc. Point clouds are simple to process, as they are stored as two real-valued arrays - point coordinates with respect to some world coordinate system and point colors. Their expressiveness depends only on the number of points, and clouds of varying spatial density can be represented. Nowadays, numerous studies exist which operate on point clouds for 3D model classification, segmentation and generation (13, 11, 20, 10, 24, 25). Voxel representations are also natural to learn and use for any kind of processing; however, they occupy a large amount of memory and cannot adapt to varying resolution. Meshes, basically being point clouds with non-differentiable triangles, are much harder to process and are mainly employed for rendering (19, 5). A large body of work exists on synthesis of projections via various hardware and software techniques with a trade-off between photorealism and speed (1, 3, 19, 23).
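To make the "two real-valued arrays" remark concrete, a point cloud can be held as simply as in the following sketch; the array sizes and dtypes are illustrative assumptions, not part of the invention.

```python
import numpy as np

# A point cloud as two parallel real-valued arrays, as described above.
num_points = 100_000                                   # illustrative size
points = np.zeros((num_points, 3), dtype=np.float32)  # xyz coordinates in a world system
colors = np.zeros((num_points, 3), dtype=np.float32)  # RGB color of each point
```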
Differentiable rendering
Differentiable rendering frameworks make it possible to generate gradients with respect to various scene parameters, such as the intrinsic and extrinsic parameters of a camera, the spatial and physical properties of a given 3D representation (e.g. mesh vertex positions, colors, or reflectance), and lighting. A significant amount of work is devoted to mesh rendering, since mesh topology makes it possible to leverage geometry deformation and to use generic priors for various reconstruction tasks. Soft Rasterizer (15) suggests probabilistic smoothing of the discrete sampling operation, while OpenDR (17) and Kato (9) explicitly derive approximate partial derivatives which, however, rely on numerical methods. The authors of (32) address this issue by computing the gradients through integrating the pixel intensity function. Pix2Vex (22) features a semi-transparent Z-buffer and a cyclic image/geometry training pipeline. Like triangle rasterization, ray tracing can be extended in a differentiable manner: in particular, the method of (12) enforces edge sampling with further integration to approximate gradients while handling occlusions accurately. Point clouds, despite being a simpler representation for automated processing than meshes due to the absence of non-differentiable topology, are significantly harder to render realistically. The Differentiable Surface Splatting (33) method evaluates a rendered image by projecting points onto a canvas (an element for producing a 2D image, an image plane) and blending them with truncated Gaussian kernels. Due to the truncation introduced for the sake of efficiency, the derivatives are calculated approximately.
Neural rendering
Unlike physically based pipelines, neural rendering implies learning an arbitrary scene representation in order to generate realistic imagery and manipulate its appearance (from scene attribute manipulation to inpainting). For instance, Neural Volumes (16) is based on predicting a 4D volume (RGB + opacity) for a model from several photos with a variational autoencoder, warping the volume, and integrating it in an opacity-aware manner. Deferred Neural Rendering (31) learns to estimate a neural texture of an object based on its UV coordinate maps and aims to synthesize an image from samples of this deep texture. DeepVoxels (28) takes a similar approach and estimates a volumetric latent neural code of an object with a combination of a CNN and an RNN (GRU); it is followed up by Scene Representation Networks (29), which employ only an RNN for learning point depths. Neural Point-Based Graphics (2) undertakes an in-between approach: it involves learning embeddings of points in a cloud, splats visible points with a large kernel onto a canvas via a fast Z-buffer, and desparsifies the result by a CNN of U-Net type. Nevertheless, due to the hard Z-buffer, this pipeline is not fully differentiable.
The present invention has been created to eliminate at least one of the above shortcomings and to provide at least one of the advantages described below.
CITATION LIST
(1) T. Aila and S. Laine. Understanding the efficiency of ray traversal on GPUs. In Proceedings of the Conference on High Performance Graphics 2009, pages 145-149. ACM, 2009.
(2) K.-A. Aliev, D. Ulyanov, and V. Lempitsky. Neural point-based graphics. arXiv preprint arXiv:1906.08240, 2019.
(3) F. C. Crow. A comparison of antialiasing techniques. IEEE Computer Graphics and Applications, (1):40-48, 1981.
(4) J. M. Cychosz. An introduction to ray tracing. Computers & Graphics, 17(1):107, 1993.
(5) K. Dempski and D. S. Dietrich. Real-Time Rendering Tricks and Techniques in DirectX. Premier Press, 2002.
(6) J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255, June 2009.
(7) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
(8) J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proc. ECCV, pages 694-711, 2016.
(9) H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907-3916, 2018.
(10) L. Landrieu and M. Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558-4567, 2018.
(11) C.-L. Li, M. Zaheer, Y. Zhang, B. Poczos, and R. Salakhutdinov. Point cloud GAN. arXiv preprint arXiv:1810.05795, 2018.
(12) T.-M. Li, M. Aittala, F. Durand, and J. Lehtinen. Differentiable Monte Carlo ray tracing through edge sampling. In SIGGRAPH Asia 2018 Technical Papers, page 222. ACM, 2018.
(13) C.-H. Lin, C. Kong, and S. Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
(14) G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 85-100, 2018.
(15) S. Liu, T. Li, W. Chen, and H. Li. Soft Rasterizer: A differentiable renderer for image-based 3D reasoning. arXiv preprint arXiv:1904.01786, 2019.
(16) S. Lombardi, T. Simon, J. M. Saragih, G. Schwartz, A. M. Lehrmann, and Y. Sheikh. Neural Volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph., 38(4):65:1-65:14, 2019.
(17) M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In European Conference on Computer Vision, pages 154-169. Springer, 2014.
(18) D. Meagher. Geometric modeling using octree encoding. Computer Graphics and Image Processing, 19(2):129-147, 1982.
(19) J. Neider, T. Davis, and M. Woo. OpenGL Programming Guide, volume 14. Addison-Wesley, Reading, MA, 1993.
(20) A. Nguyen and B. Le. 3D point cloud segmentation: A survey. In 2013 6th IEEE Conference on Robotics, Automation and Mechatronics (RAM), pages 225-230. IEEE, 2013.
(21) J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103, 2019.
(22) F. Petersen, A. H. Bermano, O. Deussen, and D. Cohen-Or. Pix2Vex: Image-to-geometry reconstruction using a smooth differentiable renderer. arXiv preprint arXiv:1903.11149, 2019.
(23) T. Porter and T. Duff. Compositing digital images. In ACM SIGGRAPH Computer Graphics, volume 18, pages 253-259. ACM, 1984.
(24) C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652-660, 2017.
(25) C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099-5108, 2017.
(26) O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015.
(27) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
(28) V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhöfer. DeepVoxels: Learning persistent 3D feature embeddings. In Proc. CVPR, 2019.
(29) V. Sitzmann, M. Zollhöfer, and G. Wetzstein. Scene Representation Networks: Continuous 3D-structure-aware neural scene representations. CoRR, abs/1906.01618, 2019.
(30) R. Szeliski. Rapid octree construction from image sequences. CVGIP: Image Understanding, 58(1):23-32, 1993.
(31) J. Thies, M. Zollhöfer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures. In Proc. SIGGRAPH, 2019.
(32) Z. Wu and W. Jiang. Analytical derivatives for differentiable renderer: 3D pose estimation by silhouette consistency. arXiv preprint arXiv:1906.07870, 2019.
(33) W. Yifan, F. Serena, S. Wu, C. Öztireli, and O. Sorkine-Hornung. Differentiable surface splatting for point-based geometry processing. arXiv preprint arXiv:1906.04173, 2019.
(34) M. Zwicker, H. Pfister, J. Van Baar, and M. Gross. Surface splatting. In Proc. SIGGRAPH, pages 371-378. ACM, 2001.
-
In the present invention a differentiable neural renderer of point clouds obtained from scans of scenes reconstructed from real-world imagery is introduced. The system is capable of synthesizing realistic and high-quality looks of 3D scenes represented as point clouds, even when occlusions, noise, reflections and other complications take place. The suggested neural architecture consists of a recurrent neural network for sequential processing of points grouped by imaginary rays forming the camera frustum, and of a fully-convolutional neural network which refines the obtained image. The renderer is trained on a number of 3D scenes captured in the wild with a corresponding set of their photographs from several viewpoints, and after training it is able to generate novel photorealistic views of a new scene as perceived by an arbitrarily located camera in the real world. Decent results for scenes taken by commodity RGB-D scanners and for point clouds compiled from RGB photos are presented.
Inspired by ray-based rendering, the aim of the present invention is to leverage the flexibility of the point cloud representation within an end-to-end, fully differentiable framework that makes it possible to solve various computer graphics tasks while handling flaws that naturally occur in real-world scans.
In the present invention, two improvements to the neural point-based graphics pipeline (2) are introduced that address these two shortcomings. First, the hard rasterization process is replaced with recurrent rasterization, where an LSTM network (7) performs a neural analog of the z-buffer algorithm. Such a replacement allows much more graceful handling of outlier points, as the LSTM network can learn to render them fully transparent. In this approach, the LSTM rendering precedes the convolutional rendering and can be trained jointly with it.
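As an illustration of the recurrent rasterization idea, the following PyTorch sketch aggregates the points of a single ray, fed back-to-front, with an LSTM cell whose final cell state serves as the ray embedding. The dimensions, the per-ray loop, and the concatenation of depth with the embedding are assumptions made for clarity; a practical implementation would process all rays in parallel.

```python
import torch
import torch.nn as nn

class RayAggregator(nn.Module):
    """Sketch of a neural z-buffer analog: an LSTM cell consumes the points of one
    ray in order of decreasing depth; its final cell state is the ray embedding."""

    def __init__(self, embed_dim=8, hidden_dim=8):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.cell = nn.LSTMCell(embed_dim + 1, hidden_dim)  # +1 channel for the depth

    def forward(self, point_embeddings, point_depths):
        # point_embeddings: (num_points, embed_dim), already sorted back-to-front
        # point_depths:     (num_points,)
        h = torch.zeros(1, self.hidden_dim)
        c = torch.zeros(1, self.hidden_dim)
        for emb, depth in zip(point_embeddings, point_depths):
            x = torch.cat([emb, depth.view(1)]).unsqueeze(0)  # (1, embed_dim + 1)
            h, c = self.cell(x, (h, c))
        return c.squeeze(0)  # cell state interpreted as the aggregated ray embedding

# Usage: aggregate a ray of 5 points with 8-dimensional embeddings.
ray = RayAggregator()
embedding = ray(torch.randn(5, 8), torch.linspace(4.0, 1.0, 5))
```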
As the second part of the contribution, the point radius selection problem is avoided by using a neural analog of the classical Mipmapping algorithm from computer graphics. Thus, a convolutional MipMapNet architecture is proposed that rasterizes the point cloud several times at different resolutions while always using a single-pixel radius for point rasterization. The resulting rasterizations are fused inside the MipMapNet, so that the fusion process implicitly selects the optimal point radius based on the local point density.
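The sketch below conveys only the fusion idea: ray-embedding canvases rasterized at several scales are stacked level by level and reduced to an RGB image. It is not the patent's MipMapNet, which is U-Net-based and uses partial convolutions; every layer size and the use of average pooling here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Toy fusion of per-scale ray-embedding canvases into one RGB image."""

    def __init__(self, channels=8, num_scales=3):
        super().__init__()
        self.num_scales = num_scales
        self.convs = nn.ModuleList(
            nn.Conv2d(channels if i == 0 else 2 * channels, channels, 3, padding=1)
            for i in range(num_scales))
        self.to_rgb = nn.Conv2d(channels, 3, kernel_size=1)

    def forward(self, canvases):
        # canvases[i]: (B, C, H / 2**i, W / 2**i), one canvas per scale
        x = F.relu(self.convs[0](canvases[0]))
        for i in range(1, self.num_scales):
            x = F.avg_pool2d(x, 2)                   # descend one level
            x = torch.cat([x, canvases[i]], dim=1)   # inject the coarser canvas
            x = F.relu(self.convs[i](x))
        x = F.interpolate(x, scale_factor=2 ** (self.num_scales - 1))  # back to full size
        return torch.sigmoid(self.to_rgb(x))

# Usage: three canvases at full, half, and quarter resolution.
net = MultiScaleFusion()
canvases = [torch.randn(1, 8, 64 // 2 ** i, 64 // 2 ** i) for i in range(3)]
rgb = net(canvases)  # (1, 3, 64, 64)
```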
The present invention demonstrates that both improvements (the neural z-buffer and neural Mipmapping) of the neural point-based graphics framework lead to renderings that are more compelling, temporally stable, and less prone to artifacts in the presence of noisy points.
Technologies of the present invention make it possible to synthesize, from the point clouds, a 2D image of a scene as viewed from desired viewpoint with high quality and low computational cost.
One aspect of the present invention provides a method for synthesizing a 2D image of a scene as viewed from desired viewpoint, the method comprising: receiving (S101) a 3D point cloud obtained from a plurality of 2D images of the same scene, wherein each point of the cloud is defined by 3D coordinates in a world coordinate system and a point embedding; setting (S102) the viewpoint as a camera having intrinsic parameters and extrinsic parameters; transforming (S103) the 3D coordinates of each point to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters; defining (S104) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters; grouping (S105) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint; calculating (S106), for each ray, a ray embedding by aggregating the point embeddings and depths of the respective point set with a trained machine learning predictor; projecting (S107) the ray embeddings onto an image plane, wherein the steps (S103) to (S107) are performed for a predefined plurality of scales; fusing (S108) the image planes of the plurality of scales by the trained machine learning predictor into the 2D image.
In additional aspect, training of the machine learning predictor comprises two consecutive stages: a pretraining stage performed on first training data set, wherein the first training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images; and a fine-tuning stage performed on second training data set, wherein the second training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, wherein the scenes in the second training data set differ from the scenes in the first training data set, the viewpoints, and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
In another additional aspect, each of the two stages of the training of the machine learning predictor comprises: randomly choosing (S201), from respective training data set, training data of randomly chosen scene comprising the 3D point cloud, the viewpoint, and the 2D image captured from said viewpoint; transforming (S202) the 3D coordinates of each point of the 3D point cloud to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters of the camera by which the set of 2D images was captured; defining (S203) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters; grouping (S204) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint; calculating (S205), for each ray, a ray embedding by aggregating the point embeddings and depths of the respective point set; processing (S206), by the machine learning predictor, the ray embeddings to obtain a sum of the loss function values; evaluating (S207) gradient of the obtained sum with respect to each scalar weight of the machine learning predictor and point embeddings; changing (S208) each scalar weight of the machine learning predictor and the embeddings of all points according to the predefined optimizer rule based on the evaluated gradient, wherein the steps (S201) to (S208) are repeated for a predefined number of times.
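The following is a minimal sketch of how the two training stages could be organized in PyTorch. The per-scene embedding parameters, `networks`, `render_fn`, `loss_fn`, and the `scene` attributes are hypothetical names introduced only for illustration; the loss is assumed to be the sum described in the "Loss functions" items above.

```python
import torch

def make_optimizer(scene_embeddings, networks, fine_tuning):
    """Stage selection: pretraining optimizes per-scene point embeddings and all
    network weights; fine-tuning freezes the pretrained networks and optimizes
    only the new scene embeddings, starting from zero descriptors."""
    if fine_tuning:
        for net in networks:
            for p in net.parameters():
                p.requires_grad_(False)       # frozen, previously pretrained ConvNets
        for emb in scene_embeddings:
            torch.nn.init.zeros_(emb)         # zero descriptor values for the new scene
        params = list(scene_embeddings)
    else:
        params = list(scene_embeddings)
        for net in networks:
            params += list(net.parameters())
    return torch.optim.Adam(params)

def training_step(optimizer, render_fn, loss_fn, scene):
    """One iteration of S201-S208: render the randomly chosen scene from its
    ground-truth viewpoint, sum the losses, backpropagate, and update."""
    optimizer.zero_grad()
    prediction = render_fn(scene.point_cloud, scene.embeddings, scene.viewpoint)
    loss = loss_fn(prediction, scene.ground_truth_image)
    loss.backward()                           # gradient w.r.t. weights and embeddings
    optimizer.step()                          # predefined optimizer rule (ADAM here)
    return loss.item()
```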
In another additional aspect, the trained machine learning predictor comprises two parts, wherein the first part of the trained machine learning predictor executes the step (S106), and the second part of the trained machine learning predictor executes the step (S108).
In another additional aspect, the first part of the trained machine learning predictor is at least one of a recurrent neural network, and the second part of the trained machine learning predictor is a U-net-based neural network.
Another aspect of the present invention provides an electronic computing device, comprising: at least one processor; and a memory that stores numerical parameters of a trained machine learning predictor and instructions that, when executed by at least one processor, cause the at least one processor to perform a method for synthesizing a 2D image of a scene as viewed from desired viewpoint.
-
The above and other aspects, features and advantages of the present invention will be more apparent from the following detailed description, taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic diagram illustrating the ray grouping process.
Fig. 2 is a schematic diagram illustrating operations of a machine learning predictor.
Fig. 3 is a flowchart illustrating a preferred embodiment of a method for synthesizing a 2D image of a scene as viewed from desired viewpoint.
Fig. 4 is a flowchart illustrating a training process of a machine learning predictor according to the present invention.
Fig. 5 is a block diagram illustrating an electronic computing device according to the present invention.
In the following description, unless otherwise indicated, the same reference numbers are used for the same elements when they are depicted in different drawings, and their parallel description is not given.
-
The following description with reference to the accompanying drawings is provided to facilitate a thorough understanding of various embodiments of the present invention defined by the claims and its equivalents. To facilitate such an understanding the description includes various specific details, but these details should be considered only exemplary. Accordingly, those skilled in the art will find that various changes and modifications to the various embodiments described herein can be developed without departing from the scope of the present invention. In addition, descriptions of known functions and structures may be omitted for clarity and conciseness.
The terms and wordings used in the following description and claims are not limited to bibliographic meanings, but simply used by the inventor to provide a clear and consistent understanding of the present invention. Accordingly, it should be apparent to those skilled in the art that the following description of various embodiments of the present invention is provided for illustration only.
It should be understood that the singular forms as "a," "an," and "the" include the plural, unless the context clearly indicates otherwise.
It should be understood that, although the terms first, second, etc. may be used herein in reference to elements of the present disclosure, such elements should not be construed as limited by these terms. The terms are used only to distinguish one element from other elements.
Additionally, it should be understood that the terms "comprises" , "comprising" , "includes" and/or "including" , as used herein, mean the presence of the mentioned features, meanings, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, values, operations, elements, components and/or groups thereof.
In various embodiments of the present disclosure, "module" or "unit" may perform at least one function or operation, and may be implemented with hardware, software, or a combination thereof. "Plurality of modules" or "plurality of units" may be implemented with at least one processor (not shown) through integration thereof with at least one module other than "module" or "unit" which needs to be implemented with specific hardware.
Hereinafter, various embodiments of the present invention are described in more detail with reference to the accompanying drawings.
A 3D point cloud is defined by point coordinates with respect to a selected world coordinate system and by point embeddings - vectors associated with each point. The point cloud represents a 3D object or scene we aim to realistically render, and the embeddings will be further described as learnable parameters of the scene. Besides that, a camera parameterized by its intrinsic parameters and extrinsic parameters according to the pinhole camera model (4) is provided. The intrinsic parameters of the camera comprise a focal length, an image sensor format, and a principal point. The extrinsic parameters of the camera comprise the location and orientation of the camera with respect to the world coordinate system. These quantities define a rule of perspective transformation of coordinates from the world coordinate system to the respective screen space coordinate system. By this perspective projection, screen-space coordinates and their distances from the camera (depths) are obtained for all points in the cloud.
If the points were characterized by RGB colors instead of embeddings, an image of the point cloud perceived by the camera could be recreated by splatting (34, 33) the points onto a canvas, i.e. setting the color of each point as the color of a pixel in a hard way (straightforward assignment) or in a soft way (additionally setting the point color to neighboring pixels and fusing the colors of neighboring points for the sake of smoothness).
Nonetheless, while the described approach to rendering of point clouds is straightforward, it does not allow one to handle many important issues which naturally arise in realistic rendering, such as the presence of holes in the image between the projected points, accounting for visibility (most often, if several points splat onto the same pixel, only the one closest to the camera should influence the pixel color), lighting effects, and many others.
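For illustration only, the perspective projection (1) can be sketched with a few lines of Python/NumPy. The function and variable names (project_points, points, K, R, t) are introduced here as assumptions of this sketch and are not part of the claimed method.

    import numpy as np

    def project_points(points, K, R, t):
        # points: (N, 3) world coordinates; K: (3, 3) intrinsics; R: (3, 3), t: (3,) extrinsics
        cam = points @ R.T + t          # world -> camera coordinates
        uvw = cam @ K.T                 # apply intrinsics
        d = uvw[:, 2]                   # depth of each point with respect to the camera
        uv = uvw[:, :2] / d[:, None]    # perspective division -> screen-space (u, v)
        return uv, d

A hard splat would then simply write the color of each point into the pixel (round(u_i), round(v_i)), keeping, at most, the point closest to the camera per pixel.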
Ray Grouping
Fig. 1 illustrates the ray grouping process. The bunny shown is an example of a 3D model which is visualized and stored as a point cloud. A camera in the bottom-left corner perceives an image of the bunny collected on the yellow plane nearby. The red-green-yellow trapezoid is a slice of the infinite pyramid comprising the camera frustum - the space of all possible point locations which can have an effect on the image. The image canvas is stored in a discretized way and contains W×H pixels. Consider a ray (drawn in blue) which passes through pixel (x, y) of the image. One point from the bunny's chest and another point from the bunny's leg are considered as belonging to this ray, since they are projected onto the same integer position on the image.
In order to make the rendering rule more visibility-aware, the points in P are grouped according to their rounded screen-space coordinates as a first step of the pipeline. More specifically, imaginary rays coming out of the camera are considered. Each ray r_{x,y} originates at the camera location and passes through the respective pixel (x, y) of the image canvas. The screen-space coordinates (u_i, v_i) of each point in P are arithmetically rounded, and the point is considered as belonging to the ray r_{round(u_i),round(v_i)}. The grouping process results in a one-to-one distribution of the points among buckets (rays), and within each bucket the points are subsequently sorted in order of decreasing depth with respect to the camera (see Fig. 1). After this procedure, different buckets (rays) r_{x,y} store a different number of ordered points, and some of the buckets may remain empty in case no points were placed in them (see Fig. 1).
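As a minimal sketch, and assuming the project_points helper introduced above, the bucketing can be written as follows; the names group_points_by_ray, embeddings, W and H are illustrative.

    import numpy as np

    def group_points_by_ray(uv, d, embeddings, W, H):
        # uv: (N, 2) screen coordinates, d: (N,) depths, embeddings: (N, M) point embeddings
        x = np.rint(uv[:, 0]).astype(int)
        y = np.rint(uv[:, 1]).astype(int)
        inside = (x >= 0) & (x < W) & (y >= 0) & (y < H) & (d > 0)   # keep points inside the canvas
        buckets = {}
        for i in np.flatnonzero(inside):
            buckets.setdefault((x[i], y[i]), []).append((embeddings[i], d[i]))
        for key in buckets:                                          # back-to-front ordering
            buckets[key].sort(key=lambda item: -item[1])
        return buckets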
Neural Architecture
The main parts of the pipeline are illustrated in Fig. 2. Although the processing is shown in Fig. 2 with two scales, the number of scales can be arbitrary and is set in advance. In a stage (a), at each scale, the point cloud P grouped by rays is passed to an LSTM network with weights shared across all pixels and all scales. The LSTM accepts the embedding and depth of each point in a ray one by one, in order of decreasing depth. In a stage (b), for each ray, the LSTM yields an aggregated ray embedding - a vector with M scalars. An H×W×M tensor T is obtained by placing the ray embeddings on a canvas. In a stage (c), a CNN of U-Net-based architecture collects the tensors T from all scales at the respective levels of resolution and fuses them into the final RGB image.
Ray grouping results in a distribution of the N points of P into W×H rays r_{x,y} of variable length. Consider a recurrent neural network (RNN) f_φ with learnable parameters φ comprised of recurrent cells, for example LSTM cells (7), but not limited to them; the method will be described for the case of an RNN consisting of LSTM cells. Being fed with an input, the last hidden state and the last cell state (h, c), such an RNN transforms this information into the consequent hidden and cell states (h', c'). The RNN f_φ is used to sequentially process the points in each ray r_{x,y} one by one and to aggregate this cumulative information into the output feature of the corresponding image pixel (x, y) (see Fig. 2).
In more detail, for each ray r_{x,y}, feature vectors x_1, ..., x_K of its points are constructed, where x_j is composed of the embedding and the depth of the j-th point of the ray in back-to-front order. In case the points in P possess more features than just world coordinates and embeddings, such as point color, semantic segmentation, etc., these features can be projected onto screen coordinates and included in x_j. Fed with the features of a new point, the RNN produces an updated estimate (h_j, c_j) of the parameters of the whole ray:

    (h_j, c_j) = f_φ(x_j, h_{j-1}, c_{j-1}),   j = 1, ..., K.
Such a construction allows one to mix the points along the ray in a back-to-front order, effectively ignoring non-relevant clusters of points and identifying the frontal surface of the underlying 3D structure with respect to the chosen camera. The resulting cell state c_K is interpreted as an aggregated feature of the ray which contains the relevant information about the pixel color to be estimated (for instance, one of the possible solutions for the RNN would be to reproduce the embedding of the frontmost point), and it is called a ray embedding. The intuition behind such an approach lies in the ability of the LSTM cell to simulate both simple and complex transparency blending rules, including the OVER operator (23), order-independent overlay, etc. The stepwise outputs of the LSTM are not utilized, since the expression for the LSTM cell state is more similar to transparency blending formulas than the expression for the output variables. A zero vector is set for those pixels (x, y) which correspond to empty rays, and further the canvas T of ray embeddings is considered as a multi-channel image tensor.
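A compact PyTorch sketch of this per-ray aggregation is given below, under the assumption that the final LSTM cell state serves as the ray embedding; the class name RayAggregator and its arguments are illustrative rather than a reference implementation.

    import torch
    import torch.nn as nn

    class RayAggregator(nn.Module):
        # Aggregates the (embedding, depth) sequence of one ray into a single ray embedding.
        def __init__(self, emb_dim, hidden_dim):
            super().__init__()
            self.cell = nn.LSTMCell(emb_dim + 1, hidden_dim)   # +1 input channel for the depth
            self.hidden_dim = hidden_dim

        def forward(self, point_feats):
            # point_feats: (K, emb_dim + 1), ordered back to front (decreasing depth)
            h = point_feats.new_zeros(1, self.hidden_dim)
            c = point_feats.new_zeros(1, self.hidden_dim)
            for x in point_feats:
                h, c = self.cell(x.unsqueeze(0), (h, c))
            return c.squeeze(0)                                # final cell state = ray embedding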
Thus, the tensor T contains a set of ray embeddings for all non-empty rays, which reflect aggregated information about the points on a ray but do not depend on the points from adjacent rays. The notation T = R_{W×H}(P) is introduced, which comprises the aforementioned procedure of grouping across the W×H rays, the RNN-based processing, and the construction of the W×H canvas with ray embeddings.
Reusing this operation several times, a pyramid of tensors of different resolutions is constructed:

    T^(s) = R_{W/s × H/s}(P),   s = 1, 2, 4, ...

With the increase of downscaling, the degree of detail and "sharpness" declines; however, individual rays start containing more points. This results in better context handling by the RNN and fewer holes corresponding to empty rays in the tensors T^(s) at smaller scales.
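For illustration, the pyramid can be built by calling the same grouping and aggregation operator at progressively downscaled canvases. The sketch below assumes the project_points, group_points_by_ray and RayAggregator helpers introduced above, keeps the embeddings as a NumPy array for simplicity (a trainable version would keep them as torch parameters so that gradients can flow into them), and uses illustrative names throughout.

    import torch

    def rasterize_pyramid(points, embeddings, K, R, t, W, H, aggregator, scales=(1, 2, 4)):
        # Returns one (H/s, W/s, hidden_dim) canvas of ray embeddings per scale s.
        tensors = []
        for s in scales:
            Ws, Hs = W // s, H // s
            Ks = K.copy()
            Ks[:2, :] = Ks[:2, :] / s                     # shrink the intrinsics with the canvas
            uv, d = project_points(points, Ks, R, t)
            buckets = group_points_by_ray(uv, d, embeddings, Ws, Hs)
            canvas = torch.zeros(Hs, Ws, aggregator.hidden_dim)   # zero vectors for empty rays
            for (x, y), pts in buckets.items():
                feats = torch.tensor([list(e) + [depth] for e, depth in pts], dtype=torch.float32)
                canvas[y, x] = aggregator(feats)
            tensors.append(canvas)
        return tensors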
As a last step of the pipeline, the ray embeddings are fused and transformed into a final RGB image by a fully-convolutional network (FCN) g_θ with learnable parameters θ. The architecture of g_θ is mainly inspired by U-Net (26), augmented with a pyramid of multi-scale inputs (see Fig. 2). At each scale of the contracting path, g_θ accepts the tensor T^(s), where s = 1, 2, 4, ... is the downsampling factor of the respective level, and stacks it with the feature maps from the level of higher resolution. This allows the network to fill missing regions and to exploit information about a wider context from several resolutions at once. Instead of plain convolutions, partial convolutions (14) are used in the contracting part. These layers receive an input and a mask and process only the values at non-masked positions of the input, properly reweighting the result with respect to the given mask. This is done in order to make the refining network g_θ less dependent on the possible sparsity of the input caused by holes in the tensors T^(s).
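A minimal single-layer sketch of a partial convolution, written in PyTorch from the description in (14) rather than taken from any reference implementation, may look as follows; the class name and hyper-parameters are placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PartialConv2d(nn.Module):
        # Convolves only over valid (mask == 1) positions and re-weights the output
        # by the fraction of valid inputs under the kernel; also updates the mask.
        def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=True)
            self.register_buffer("ones", torch.ones(1, in_ch, kernel_size, kernel_size))
            self.window = in_ch * kernel_size * kernel_size
            self.padding = padding

        def forward(self, x, mask):
            # x: (B, in_ch, H, W); mask: (B, 1, H, W), 1 for non-empty rays, 0 for holes
            mask = mask.expand(-1, x.shape[1], -1, -1)
            with torch.no_grad():
                valid = F.conv2d(mask, self.ones, padding=self.padding)  # count of valid inputs
            out = self.conv(x * mask)
            bias = self.conv.bias.view(1, -1, 1, 1)
            out = (out - bias) * (self.window / valid.clamp(min=1.0)) + bias
            new_mask = (valid > 0).float()
            return out * new_mask, new_mask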
During training, the view-invariant point embeddings e_1, ..., e_N, the parameters φ and θ of both networks and the auxiliary parameters ψ are learned (see below) altogether.
Loss functions
The system is trained by optimizing a sum of two loss functions. Both losses leverage a perceptual loss L_p(I_1, I_2) between two images I_1 and I_2, based on the VGG-19 (27) network pretrained on ImageNet (6):

    L_p(I_1, I_2) = Σ_l || Φ_l(I_1) - Φ_l(I_2) ||,

where Φ_l(·) correspond to the feature maps from the relu1_2, relu2_2, relu3_4, relu4_4 and relu5_4 layers of VGG-19. This choice of layers is commonly justified as a representative subset of perceptual VGG features in style transfer and related areas (8).
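For illustration, a perceptual loss over exactly these VGG-19 activations can be assembled from torchvision's pretrained model as sketched below. The use of the L1 distance between feature maps, the omission of ImageNet input normalization and the torchvision weights identifier are assumptions of this sketch, not statements about the original training code.

    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import vgg19

    class PerceptualLoss(nn.Module):
        # Indices of relu1_2, relu2_2, relu3_4, relu4_4, relu5_4 in vgg19().features
        LAYERS = (3, 8, 17, 26, 35)

        def __init__(self):
            super().__init__()
            layers = list(vgg19(weights="IMAGENET1K_V1").features.eval().children())
            self.blocks = nn.ModuleList()
            prev = 0
            for idx in self.LAYERS:
                self.blocks.append(nn.Sequential(*layers[prev:idx + 1]))
                prev = idx + 1
            for p in self.parameters():
                p.requires_grad_(False)

        def forward(self, img1, img2):
            # img1, img2: (B, 3, H, W) images in the same value range
            loss, f1, f2 = 0.0, img1, img2
            for block in self.blocks:
                f1, f2 = block(f1), block(f2)
                loss = loss + F.l1_loss(f1, f2)
            return loss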
The first loss restricts the RNN f_φ to produce a canvas of ray embeddings T^(1) semantically consistent with the ground truth picture I_gt. More specifically, along with the RNN and the CNN, one 1×1 convolution layer C_ψ is trained, and its result is compared with the ground truth:

    L_RGB = L_p(C_ψ(T^(1)), I_gt).
The second loss restricts the refining ConvNet to produce a final prediction g_θ(T^(1), T^(2), ...) similar to the ground truth picture I_gt:

    L_final = L_p(g_θ(T^(1), T^(2), ...), I_gt).
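Putting the two objectives together, a sketch of the training loss follows; PerceptualLoss is the module defined above, while C_psi, fuse_net and ray_emb_dim are placeholder names for the auxiliary 1×1 convolution, the U-Net-based refining network and the ray embedding width.

    import torch.nn as nn

    ray_emb_dim = 8                                    # example embedding width (assumption)
    perc = PerceptualLoss()
    C_psi = nn.Conv2d(ray_emb_dim, 3, kernel_size=1)   # auxiliary 1x1 convolution head

    def total_loss(T1, pyramid, gt_image, fuse_net):
        # T1: finest-scale canvas of ray embeddings, shape (B, ray_emb_dim, H, W)
        loss_rgb = perc(C_psi(T1), gt_image)            # first loss, on the raw ray embeddings
        loss_final = perc(fuse_net(pyramid), gt_image)  # second loss, on the fused RGB prediction
        return loss_rgb + loss_final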
The learning procedure consists of two stages. In the first stage, called pretraining, given a set of point clouds of scenes of a similar kind, both the embeddings e_1, ..., e_N of the points of each scene and the parameters φ, θ, ψ of all ConvNets are optimized altogether according to the sum of all loss functions over all scenes. The optimization is performed by the ADAM algorithm and involves backpropagation of the loss gradient. In the second stage, called fine-tuning, given a point cloud of at least one new scene, the learning optimizes only the point embeddings of this scene or set of scenes with respect to the frozen, previously pretrained ConvNets. At the beginning of this stage, the learning process starts from zero descriptor values for the new scene or set of scenes.
One embodiment of the method for synthesizing a 2D image of a scene as viewed from desired viewpoint is described in more detail with reference to Fig. 3. The method 100 comprises steps S101 to S108.
In the step S101, a 3D point cloud is received. The 3D point cloud can be either stored in the device memory or received from any remote device by wire or wireless communications. The 3D point cloud can be obtained from a plurality of 2D images of the same scene by any technique known in prior art. Each point of the cloud is defined by 3D coordinates in a world coordinate system and a point embedding.
In the step S102, the desired viewpoint is set as a camera having intrinsic parameters and extrinsic parameters.
In the step S103, the 3D coordinates of each point are transformed to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters. The intrinsic parameters and the extrinsic parameters define a rule of perspective transformation of coordinates from the world coordinate system to a respective screen space coordinate system. Such a transformation is known in the prior art and is described above; therefore, a detailed description of the transformation is omitted herein.
In the step S104, a plurality of rays diverging from the viewpoint is defined. The rays are defined by the screen space coordinates and by the intrinsic parameters and the extrinsic parameters. The definition of the plurality of rays is described in detail above with reference to Fig. 1.
In the step S105, the points are grouped into point sets associated with the rays. Each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing depth with respect to the viewpoint. A detailed description of the point grouping is set forth above in the "Ray Grouping" section.
In the step S106, for each ray, a trained machine learning predictor calculates a ray embedding by aggregating the point embeddings and depths of the respective point set. In the step S107, the ray embeddings are projected onto an image plane. The steps (S103) to (S107) are performed for a plurality of scales. The plurality of scales is defined in advance. In the step S108, the trained machine learning predictor fuses the image planes of the plurality of scales into the 2D image. The steps S106, S107 and S108 are described in detail above with reference to Fig. 2.
The trained machine learning predictor comprises two parts. The first part of the trained machine learning predictor executes the step (S106), and the second part of the trained machine learning predictor executes the step (S108).
The first part of the trained machine learning predictor is a recurrent neural network (for example, built from LSTM cells), and the second part of the trained machine learning predictor is a U-Net-based neural network.
The machine learning predictor is trained in two consecutive stages. The first stage is a pretraining stage. The second stage is a fine-tuning stage. The first stage is performed on a first training data set. The first training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene and each 2D image in a set being captured from a different viewpoint; the viewpoints; and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
The second stage is performed on a second training data set. The second training data set includes: sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene and each 2D image in a set being captured from a different viewpoint, wherein the scenes in the second training data set differ from the scenes in the first training data set; the viewpoints; and 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
A training process 200 of a machine learning predictor is illustrated on Fig. 4. Each of the two stages of the training of the machine learning predictor comprises steps S201 to S208.
In the step S201, training data are randomly chosen from respective training data set. For the pretraining stage, the training data are randomly chosen from the first training data set. For the fine-tuning stage, the training data are randomly chosen from the second training data set. The training data belong to randomly chosen scene and comprise the 3D point cloud, the viewpoint, and the 2D image captured from said viewpoint.
In the step S202, the 3D coordinates of each point of the 3D point cloud are transformed to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters of the camera by which the set of 2D images was captured.
In the step S203, a plurality of rays is defined. The rays diverge from the viewpoint. The rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters.
In the step S204, the points of the point cloud are grouped into point sets associated with the rays. Each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing depth with respect to the viewpoint.
In the step S205, a ray embedding is calculated for each ray by aggregating the point embeddings and depths of the respective point set.
In the step S206, the machine learning predictor processes the ray embeddings to obtain a sum of the loss function values. The processing relating to the loss functions is described in detail above in the "Loss functions" section.
In the step S207, the gradient of the obtained sum is evaluated with respect to each scalar weight of the machine learning predictor and to the point embeddings.
In the step S208, each scalar weight of the machine learning predictor and the embeddings of all points are changed according to the predefined optimizer rule (for example, ADAM) based on the evaluated gradient. The steps (S201) to (S208) are repeated a predefined number of times.
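Steps S201 to S208 correspond to a standard stochastic optimization loop. A sketch is given below under the assumptions stated earlier: training_scenes, scene.views, view.K/R/t/W/H/image, num_iterations and optimizer are placeholders, and the NumPy-based helpers above would need to be kept in torch tensors end-to-end for gradients to reach the point embeddings.

    import random
    import torch

    for step in range(num_iterations):                            # repeat S201..S208
        scene = random.choice(training_scenes)                    # S201: random scene
        view = random.choice(scene.views)                         # S201: random viewpoint/image
        pyramid = rasterize_pyramid(scene.points, scene.embeddings,
                                    view.K, view.R, view.t,
                                    view.W, view.H, aggregator)   # S202-S205
        T1 = pyramid[0].permute(2, 0, 1).unsqueeze(0)             # finest canvas as (1, M, H, W)
        loss = total_loss(T1, pyramid, view.image, fuse_net)      # S206: sum of losses
        optimizer.zero_grad()
        loss.backward()                                           # S207: gradient evaluation
        optimizer.step()                                          # S208: optimizer update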
A block diagram illustrating an electronic computing device according to the present invention is shown on Fig. 5. The electronic computing device 300 comprises at least one processor 301 and a memory 302.
The memory 302 stores numerical parameters of the trained machine learning predictor and instructions. At least one processor 301 executes the instructions stored in the memory 302 to perform the method 100 for synthesizing a 2D image of a scene as viewed from desired viewpoint.
The method disclosed herein can be implemented by at least one processor, such as a central processing unit (CPU) or a graphics processing unit (GPU), implemented on an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or the like, but is not limited thereto. In addition, the method disclosed herein can be implemented by a computer-readable medium that stores the numerical parameters of the trained machine learning predictor and computer-executable instructions that, when executed by a computer processor, cause the computer to perform the inventive method. The trained machine learning predictor and the instructions for implementing the present method can be downloaded to the electronic computing device via a network or from the medium.
The present invention can be applied in Virtual Reality headsets, Augmented Reality glasses, Mixed Reality glasses, smartphones and other Virtual and/or Augmented Reality devices and systems.
The above descriptions of embodiments of the invention are illustrative, and modifications of the configuration and implementation are within the scope of the present description. For example, although embodiments of the invention are described generally in connection with Figs. 1 and 2, the descriptions presented above are exemplary. Although the subject matter of the invention is described in language characteristic of structural features or method steps, it is understood that the subject matter of the invention is not necessarily limited to the described features or steps. Moreover, the specific features and steps described above are disclosed as exemplary forms of implementing the claims. The invention is not limited to the illustrated sequence of the method steps; the sequence can be modified by a skilled person without inventive effort. Some or all of the steps of the method may be performed sequentially or in parallel.
Accordingly, it is contemplated that the scope of the embodiment of the invention is limited only by the following claims.

Claims (6)

  1. A method for synthesizing a 2D image of a scene as viewed from desired viewpoint, the method comprising:
    receiving (S101) a 3D point cloud obtained from a plurality of 2D images of the same scene, wherein each point of the cloud is defined by 3D coordinates in a world coordinate system and a point embedding;
    setting (S102) the viewpoint as a camera having intrinsic parameters and extrinsic parameters;
    transforming (S103) the 3D coordinates of each point to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters;
    defining (S104) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters;
    grouping (S105) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint;
    calculating (S106), for each ray, a ray embedding by aggregating the point embeddings and depths of the respective point set with a trained machine learning predictor;
    projecting (S107) the ray embeddings onto an image plane,
    wherein the steps (S103) to (S107) are performed for a predefined plurality of scales;
    fusing (S108) the image planes of the plurality of scales by the trained machine learning predictor into the 2D image.
  2. The method according to claim 1, wherein training of the machine learning predictor comprises two consecutive stages:
    a pretraining stage performed on first training data set, wherein the first training data set includes:
    - sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint,
    - the viewpoints, and
    - 3D point clouds, each 3D point cloud obtained from the respective set of 2D images; and
    a fine-tuning stage performed on second training data set, wherein the second training data set includes:
    - sets of 2D images of different scenes of the same kind, each set of 2D images presenting one scene, each 2D image in the set being captured from different viewpoint, wherein the scenes in the second training data set differ from the scenes in the first training data set,
    - the viewpoints, and
    - 3D point clouds, each 3D point cloud obtained from the respective set of 2D images.
  3. The method according to claim 2, wherein each of the two stages of the training of the machine learning predictor comprises:
    randomly choosing (S201), from respective training data set, training data of randomly chosen scene comprising the 3D point cloud, the viewpoint, and the 2D image captured from said viewpoint;
    transforming (S202) the 3D coordinates of each point of the 3D point cloud to 2D coordinates and a depth of each point in a screen space coordinate system of the camera using the intrinsic parameters and the extrinsic parameters of the camera by which the set of 2D images was captured;
    defining (S203) a plurality of rays diverging from the viewpoint, wherein the rays are defined by the screen space coordinates and the intrinsic parameters and the extrinsic parameters;
    grouping (S204) the points into point sets associated with the rays, wherein each point set comprises the points through which one ray passes and, in each point set, the points are arranged in order of decreasing their depths with respect to the viewpoint;
    calculating (S205), for each ray, a ray embedding by aggregating the point embeddings and depths of the respective point set;
    processing (S206), by the machine learning predictor, the ray embeddings to obtain a sum of the loss function values;
    evaluating (S207) gradient of the obtained sum with respect to each scalar weight of the machine learning predictor and point embeddings;
    changing (S208) each scalar weight of the machine learning predictor and the embeddings of all points according to the predefined optimizer rule based on the evaluated gradient,
    wherein the steps (S201) to (S208) are repeated for a predefined number of times.
  4. The method according to any one of claims 1 to 3, wherein the trained machine learning predictor comprises two parts, wherein the first part of the trained machine learning predictor executes the step (S106), and the second part of the trained machine learning predictor executes the step (S108).
  5. The method according to claim 4, wherein the first part of the trained machine learning predictor is at least one of a recurrent neural network, and the second part of the trained machine learning predictor is a U-net-based neural network.
  6. An electronic computing device, comprising:
    at least one processor; and
    a memory that stores numerical parameters of a trained machine learning predictor and instructions that, when executed by at least one processor, cause the at least one processor to perform a method for synthesizing a 2D image of a scene as viewed from desired viewpoint according to any one of claims 1 to 5.
PCT/KR2020/015686 2019-11-12 2020-11-10 Method for synthesizing 2d image of scene as viewed from desired viewpoint and electronic computing device implementing the same WO2021096190A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
RU2019136333 2019-11-12
RU2019136333 2019-11-12
RU2020113525A RU2749749C1 (en) 2020-04-15 2020-04-15 Method of synthesis of a two-dimensional image of a scene viewed from a required view point and electronic computing apparatus for implementation thereof
RU2020113525 2020-04-15

Publications (1)

Publication Number Publication Date
WO2021096190A1 true WO2021096190A1 (en) 2021-05-20

Family

ID=75911419

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/015686 WO2021096190A1 (en) 2019-11-12 2020-11-10 Method for synthesizing 2d image of scene as viewed from desired viewpoint and electronic computing device implementing the same

Country Status (1)

Country Link
WO (1) WO2021096190A1 (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180096527A1 (en) * 2013-10-25 2018-04-05 Appliance Computing III, Inc. Image-based rendering of real spaces
US20180096463A1 (en) * 2016-09-30 2018-04-05 Disney Enterprises, Inc. Point cloud noise and outlier removal for image-based 3d reconstruction
US20190088004A1 (en) * 2018-11-19 2019-03-21 Intel Corporation Method and system of 3d reconstruction with volume-based filtering for image processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Advances in Intelligent Data Analysis XIX", vol. 42, 19 June 2019, SPRINGER INTERNATIONAL PUBLISHING, Cham, ISBN: 978-3-030-71592-2, ISSN: 0302-9743, article ALIEV KARA-ALI; SEVASTOPOLSKY ARTEM; KOLOS MARIA; ULYANOV DMITRY; LEMPITSKY VICTOR: "Neural Point-Based Graphics", pages: 696 - 712, XP047590859, DOI: 10.1007/978-3-030-58542-6_42 *
SITZMANN VINCENT; THIES JUSTUS; HEIDE FELIX; NIEBNER MATTHIAS; WETZSTEIN GORDON; ZOLLHOFER MICHAEL: "DeepVoxels: Learning Persistent 3D Feature Embeddings", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 15 June 2019 (2019-06-15), pages 2432 - 2441, XP033686397, DOI: 10.1109/CVPR.2019.00254 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505845A (en) * 2021-07-23 2021-10-15 黑龙江省博雅智睿科技发展有限责任公司 Deep learning training set image generation method based on language
CN113744379A (en) * 2021-08-25 2021-12-03 北京字节跳动网络技术有限公司 Image generation method and device and electronic equipment
CN115375884A (en) * 2022-08-03 2022-11-22 北京微视威信息科技有限公司 Free viewpoint synthesis model generation method, image rendering method and electronic device
CN115375884B (en) * 2022-08-03 2023-05-30 北京微视威信息科技有限公司 Free viewpoint synthesis model generation method, image drawing method and electronic device


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20887740; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20887740; Country of ref document: EP; Kind code of ref document: A1)