CN113222137A - Neural rendering - Google Patents

Neural rendering

Info

Publication number
CN113222137A
Authority
CN
China
Prior art keywords
training
representation
image
machine learning
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110156872.7A
Other languages
Chinese (zh)
Inventor
Qi Shan
J. Susskind
A. Sankar
R. A. Colburn
E. Dupont
M. A. Bautista Martin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apple Inc
Original Assignee
Apple Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Priority claimed from US17/145,232 external-priority patent/US11967015B2/en
Application filed by Apple Inc filed Critical Apple Inc
Publication of CN113222137A publication Critical patent/CN113222137A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 - 3D [Three Dimensional] image rendering
    • G06T 15/10 - Geometric effects
    • G06T 15/20 - Perspective computation
    • G06T 19/00 - Manipulating 3D models or images for computer graphics
    • G06T 19/20 - Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06T 3/00 - Geometric image transformation in the plane of the image
    • G06T 3/60 - Rotation of a whole image or part thereof
    • G06T 3/608 - Skewing or deskewing, e.g. by two-pass or three-pass rotation
    • G06T 2219/00 - Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T 2219/20 - Indexing scheme for editing of 3D models
    • G06T 2219/2016 - Rotation, translation, scaling
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G06N 20/00 - Machine learning
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to neural rendering. The subject technology provides a framework for learning neural scene representations directly from images using machine learning models, without three-dimensional (3D) supervision. In the disclosed systems and methods, 3D structure may be imposed by ensuring that the learned representation transforms like a real 3D scene. For example, a loss function may be provided that enforces equivariance of the scene representation with respect to 3D rotations. Since natural tensor rotations cannot be used to define a model that is equivariant with respect to 3D rotations, a new operation called the reversible shear rotation is disclosed, which has the required equivariance properties. In some implementations, the model can be used to generate a 3D representation of an object, such as a mesh, from an image of the object.

Description

Neural rendering
Cross Reference to Related Applications
This patent application claims the benefit of priority from U.S. provisional patent application No. 62/971,198, entitled "Neural Rendering," filed on February 6, 2020, and U.S. provisional patent application No. 63/018,434, entitled "Neural Rendering," filed on April 30, 2020, the disclosures of each of which are hereby incorporated herein in their entirety.
Technical Field
The present specification relates generally to developing machine learning applications.
Background
Software engineers and scientists have been using computer hardware for machine learning to drive improvements across different industry applications, including neural rendering.
Drawings
Some of the features of the subject technology are set forth in the appended claims. However, for purposes of explanation, several embodiments of the subject technology are set forth in the following figures.
FIG. 1 illustrates an exemplary network environment in accordance with one or more implementations.
Fig. 2 illustrates an exemplary computing architecture of a system for providing a machine learning model trained based on equivariance, according to one or more implementations.
FIG. 3 illustrates various input images that may be provided to a machine learning model trained based on equivariance, according to one or more implementations.
FIG. 4 illustrates a schematic diagram of a machine learning model in accordance with one or more implementations.
FIG. 5 illustrates features of a model architecture of a machine learning model in accordance with one or more implementations.
FIG. 6 illustrates a flow diagram of an exemplary process for operating a trained machine learning model in accordance with one or more implementations.
FIG. 7 illustrates an output image that can be generated by a trained machine learning model based on various input images in accordance with one or more implementations.
FIG. 8 illustrates additional output images that may be generated by the trained machine learning model based on various additional input images, according to one or more implementations.
Fig. 9 illustrates additional output images that may be generated by the trained machine learning model based on various additional input images, according to one or more implementations.
FIG. 10 illustrates aspects of explicit three-dimensional representations and implicit three-dimensional representations of objects in accordance with one or more implementations.
FIG. 11 illustrates a process for training a machine learning model in accordance with one or more implementations.
FIG. 12 illustrates a flow diagram of an exemplary process for training a machine learning model in accordance with one or more implementations.
Fig. 13 illustrates an example of a shear rotation operation in accordance with one or more implementations.
Fig. 14 illustrates additional details of an exemplary shear rotation operation in accordance with one or more implementations.
FIG. 15 illustrates an electronic system that may be used to implement one or more implementations of the subject technology.
Detailed Description
The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The accompanying drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. The subject technology is not limited to the specific details set forth herein, however, and may be practiced with one or more other implementations. In one or more implementations, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
The popularity of machine learning has risen dramatically in recent years due to the availability of large amounts of training data and the advancement of more powerful and efficient computing hardware. Machine learning may utilize models that are executed to provide predictions (e.g., analyzing images and video, object detection and/or tracking, etc.) in a particular application among many other types of applications.
For example, neural rendering methods produce realistic renderings given noisy or incomplete 3D or 2D observations. For example, neural textures have been used to convert incomplete 3D inputs into rich scene representations, filling in and regularizing the noisy measurements. However, conventional approaches for neural rendering require 3D information, complex rendering priors, or expensive run-time decoding schemes during training.
The subject technology provides techniques for training a machine learning model to extract three-dimensional information from two-dimensional images. For example, a machine learning model may be trained to render an output image of an object based on an input image of the object, the output image depicting a view of the object that is different from a view of the object depicted in the input image. In one illustrative example, based on a two-dimensional input image depicting a view of the mug from above the mug and to the left of the mug handle, the trained machine learning model may provide an output image of the same mug as viewed in three dimensions from the bottom of the mug, from the right side of the mug, or from any other view of the mug. The trained machine learning model may generate these output images even if the input images do not contain depth information for the mugs, and even if the machine learning model does not have any depth information about the input images.
The subject technology does not require expensive sequential decoding steps and enforces 3D structure through equivariance. The subject technology may be trained using only images and their relative poses, and thus may be more easily extended to real scenes with minimal assumptions about geometry.
Conventional neural networks are generally not equivariant with respect to arbitrary groups of transformations. Equivariance to discrete rotations can be achieved by copying and rotating filters. In the present disclosure, an equivariant neural network is provided by treating the latent representation as a geometric 3D data structure and applying rotations directly to that representation. Traditional scene representations (e.g., explicit representations such as point clouds, voxel grids, and meshes) may not scale well due to memory and computational requirements. Thus, in this disclosure, the implicit neural representation is encoded as a latent 3D tensor.
In contrast to the subject technology, neural rendering approaches that use flow estimation for view synthesis predict a flow field over the input image based on the camera view transformation. These methods model free-form deformations in image space and therefore may not explicitly enforce equivariance with respect to 3D rotations. Furthermore, these models are typically limited to a single, segmented object rather than an entire scene.
Returning to the above example of an input image of a mug, in some implementations of the subject technology, a machine learning model may be trained to output an explicit representation of a mug in three dimensions in addition to or in place of a two dimensional output image of a mug. An explicit representation of a mug in three dimensions may be a point cloud, mesh, or voxel grid (as examples) that may be rendered so as to be recognizable by a human observer as an object and may be manipulated (e.g., rotated, translated, resized, etc.) in three dimensions.
Implementations of the subject technology improve the computing functionality of a given electronic device by providing an equivariance constraint that, when applied during training of a machine learning model, allows the model to (i) be trained without 3D supervision, (ii) be tested without providing pose information as input to the model, and/or (iii) operate to generate an implicit representation of a three-dimensional object (also referred to herein as a "scene representation") from a single two-dimensional image of the object in a single forward pass. Existing approaches may require expensive optimization procedures to extract three-dimensional information from an image or set of images, and may typically require 3D supervision and/or input pose information during training and/or at runtime. The subject technology avoids this by providing an equivariance constraint that only requires that an implicit representation generated by the model from an input image transform (e.g., under rotation, translation, and/or scaling) equivariantly with the three-dimensional object itself (e.g., under the same rotation, translation, and/or scaling). These benefits are therefore understood to improve the computing functionality of a given electronic device, such as an end-user device, which may generally have less available computing resources and/or power than, for example, one or more cloud-based servers.
FIG. 1 illustrates an exemplary network environment 100 in accordance with one or more implementations. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
Network environment 100 includes electronic device 110 and server 120. The network 106 may communicatively (directly or indirectly) couple the electronic device 110 and/or the server 120. In one or more implementations, the network 106 may be an interconnected network that may include the internet or a device communicatively coupled to the internet. For purposes of explanation, network environment 100 is shown in FIG. 1 as including an electronic device 110 and a server 120; however, network environment 100 may include any number of electronic devices and any number of servers.
The electronic device 110 may be, for example, a desktop computer, a portable computing device such as a laptop computer, a smartphone, a peripheral device (e.g., digital camera, headset), a tablet device, a wearable device such as a watch, a band, and so forth. In fig. 1, by way of example, the electronic device 110 is depicted as a mobile electronic device (e.g., a smartphone). Electronic device 110 may be and/or may include all or a portion of an electronic system described below with respect to fig. 15.
In one or more implementations, the electronic device 110 can provide a system for training a machine learning model using training data, where the trained machine learning model is subsequently deployed to the electronic device 110. Further, the electronic device 110 can provide one or more machine learning frameworks for training machine learning models and/or developing applications using such machine learning models. In one example, such a machine learning framework may provide various machine learning algorithms and models for different problem domains in machine learning. In one example, the electronic device 110 can include a deployed machine learning model that provides an output of data corresponding to the prediction or some other type of machine learning output.
The server 120 can provide a system for training a machine learning model using training data, where the trained machine learning model is subsequently deployed to the server 120. In one implementation, the server 120 may train a given machine learning model for deployment to a client electronic device (e.g., electronic device 110). The machine learning models deployed on the server 120 and/or the electronic device 110 may then execute one or more machine learning algorithms. In one implementation, the server 120 provides a cloud service that utilizes a trained machine learning model and that learns continuously over time.
FIG. 2 illustrates an exemplary computing architecture of a system that provides an equivariance constraint for a machine learning model, according to one or more implementations. For purposes of illustration, the computing architecture is described as being provided by the server 120, such as by a processor and/or memory of the server 120; however, the computing architecture may be implemented by any other electronic device, such as electronic device 110. However, not all of the depicted components may be used in all implementations, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of these components may be made without departing from the spirit or scope of the claims set forth herein. Additional components, different components, or fewer components may be provided.
As shown, the server 120 includes training data 210 for training the machine learning model. In one example, server 120 may utilize one or more machine learning algorithms that train Machine Learning (ML) model 220 using training data 210. The ML model 220 may be trained based on at least two training images in the training data 210 that depict different views of a training object. The ML model 220 may be trained using an equivariance constraint. The equivariance constraint may enforce equivariance (e.g., under rotation, translation, and/or scaling) between the implicit representation of the training object and the training object itself.
The training data 210 may include two-dimensional images of various objects, each depicting one or more of the objects from a particular view. The images may include a set of images of the particular object from various views that are rotated, translated, scaled, and/or otherwise different relative to views depicted in other images of the particular object.
For example, FIG. 3 shows several sets of example training images that may be used to train ML model 220. Neural rendering and scene representation models are typically tested and benchmarked on the ShapeNet dataset. However, the images produced from the ShapeNet dataset are typically very different from real scenes: they are rendered on an empty background and involve only fixed, rigid objects. Since the ML model 220 does not rely on 3D supervision, the ML model 220 may be trained on rich data for which 3D ground truth may be very expensive or difficult to obtain. The training data 210 may include three new datasets of images that may be used to train and/or test models with complex visual effects. The first dataset, referred to herein as MugsHQ, consists of photorealistic renderings of colored mugs on a table with an environmental background. The second dataset, referred to herein as the 3D mountain dataset, includes renderings of over five hundred mountains in the Alps generated using satellite and terrain map data. The third dataset contains real images of succulent plants in an indoor scene. According to various aspects, the subject technology provides three new, challenging datasets for testing representations and neural renderings of complex natural scenes, and shows compelling rendering results for each dataset, highlighting the flexibility of the disclosed systems and methods.
In the example of fig. 3, the training data 210 includes a first set of training images 300, a second set of training images 304, and a third set of training images 308. In this example, the first set of training images 300 includes a plurality of images 302, each image including a particular view of a particular training object 303 (e.g., a mug). The training data 210 may include a number of training images 300 for a number of views of each of a number of mugs.
The first set of training images 300 may be referred to as the MugsHQ dataset and may be based on the mug category of the ShapeNet dataset. In the example of fig. 3, rather than rendering images on blank backgrounds, each scene is rendered with an environment map (e.g., lighting conditions) and a checkerboard disk platform 321. The first set of training images 300 may include, for example, one hundred fifty uniform viewing angles on the upper hemisphere for each of two hundred fourteen mugs. In this example, the environment map and the disk platform are the same for each mug. As such, the scene in each image 302 in the first set of training images 300 is much more complex and realistic-looking than a typical ShapeNet rendering.
While the MugsHQ dataset (e.g., first set of training images 300) contains photorealistic renderings with complex backgrounds and lighting, the background scene is the same for every object. The ML model 220 may also be trained and/or tested using the second set of training images 304, in which the depicted training objects 307 are mountains. The second set of training images 304 may be a dataset of mountain landscapes, in which the scenes do not share a common structure with one another. The second set of training images 304 may be generated based on, for example, the altitude, latitude, and longitude of the five hundred sixty-three highest mountains in the Alps. Satellite images in combination with terrain data may be used to sample random views of each mountain at a fixed height for the second set of training images 304. Some samples from this dataset are shown in fig. 3. With its varied and complex geometry and texture, this dataset can be very challenging for neural rendering.
In this example, the second set of training images 304 includes a plurality of training images 306, each training image including a particular view of a training object 307. The training objects 307 depicted in the training images 306 may be training objects of a different category (e.g., mountains) than the training objects 303. The training data 210 may include a number of training images 306 for a number of views of each of a number of mountains.
The third set of training images 308 may be a dataset of real images (e.g., images of several views, such as several perspectives, distances, and/or positions, of a physical object, such as a succulent plant). The third set of training images 308 in this example consists of images of succulent plants viewed from different views around a table (e.g., views that change in azimuth but remain constant in elevation). The illumination and background in the images 310 in the third set of training images 308 are approximately constant for all scenes, and there is some noise in the azimuth and elevation measurements. The third set of training images 308 may include, for example, twenty different succulent plants and, for example, sixteen views of each succulent plant. Some samples from the dataset are shown in fig. 3.
In this example, the third set of training images 308 includes a plurality of images 310, each image including a particular view of a particular training object 311. Training objects 311 depicted in the images 310 may be training objects of a different category (e.g., succulent plants) than the training objects 303 and the training objects 307. The training data 210 may include a number of training images 310 for a number of views of each of a number of succulent plants.
The first, second, and third sets of training images 300, 304, and 308 provide three new, challenging datasets that can be used to train the ML model 220 and to test representations and neural renderings of complex natural scenes, showing compelling rendering results for each and highlighting the flexibility of the disclosed systems and methods.
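As an illustration of the kind of training data described above, the following sketch shows one way a training example might be structured in code; the field names and the use of relative azimuth and elevation angles are assumptions for illustration and are not taken from the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingPair:
    """Two views of the same training object plus their relative camera rotation.

    No depth maps, meshes, or other 3D ground truth are stored; only the images
    and the relative pose between the two views are assumed to be available.
    """
    image_1: np.ndarray        # first view, shape (channels, height, width)
    image_2: np.ndarray        # second view of the same object/scene
    delta_azimuth: float       # relative azimuth rotation, in degrees
    delta_elevation: float     # relative elevation rotation, in degrees
```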
Designing a useful 3D scene representation for neural networks is a challenging task. While several works have used traditional 3D representations, such as voxel grids, meshes, point clouds, and signed distance functions, each of these has limitations. For example, it is often difficult to incorporate texture, non-rigid objects, and lighting into these representations at scale. Recently, neural scene representations have been proposed to overcome these problems, typically by incorporating ideas from graphics rendering into the model architecture.
In the present disclosure, equivariance with respect to 3D transformations provides a strong inductive bias for neural rendering and scene representation. Indeed, for many tasks the scene representation need not be explicit (such as a point cloud or mesh) as long as it transforms like an explicit representation.
However, building such models is challenging in practice. In the subject disclosure, a model is provided that includes an inverse renderer that maps images to a neural scene representation and a forward neural renderer that generates an output, such as an image, from the representation. The scene representation itself may be a three-dimensional tensor, which may undergo the same transformations as an explicit 3D scene. In the subject disclosure, specific examples focus on 3D rotation, but the model may be generalized to other symmetry transformations, such as translation and scaling.
FIG. 4 schematically illustrates components that may be included in a machine learning model 220. As shown in the example of fig. 4, the machine learning model 220 may include an inverse renderer 402 and a forward renderer 406. The adjustment module 404 may be disposed between the inverse renderer 402 and the forward renderer 406. The inverse renderer 402 is trained to generate an implicit representation 408 (also referred to as a scene representation) of an object from an input image 400 of the object (e.g., a two-dimensional input image) for a particular view. The forward renderer 406 generates an output 412 based on the implicit representation 408 from the inverse renderer 402.
The output 412 may be, for example, an output two-dimensional image of the object from a different view that is rotated, translated, and/or scaled relative to the particular view of the input image 400. As another example, the output may include a three-dimensional representation of the object. The three-dimensional representation of the object may be a mesh, a point cloud, or a voxel grid that would be visually recognizable by a human user as the object (e.g., if the explicit representation were rendered on a computer display, such as the display of electronic device 110). The ML model 220 may generate, based on the provided input image 400, at least one of an output image depicting the object from a rotated view that is different from the view of the object in the input image 400, or a three-dimensional representation of the object.
As shown in fig. 4, in some operational scenarios, the implicit representation 408 generated by the inverse renderer 402 may be provided to the adjustment module 404 before the implicit representation 408 is provided to the forward renderer 406. The adjustment module 404 may adjust the implicit representation 408 by rotating, translating, and/or scaling the implicit representation 408 to generate an adjusted implicit representation 410. For example, the adjustment module 404 may be a rotation module that rotates the implicit representation 408. The adjustment module 404 may be, for example, a shear rotation module. A shear rotation module may be particularly helpful in facilitating a machine learning model based on rotational equivariance.
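A minimal sketch of the FIG. 4 pipeline is shown below, assuming the inverse renderer, the forward renderer, and the rotation operation are available as separate components; the class and argument names are illustrative rather than taken from the patent.

```python
import torch.nn as nn

class NeuralRenderer(nn.Module):
    """Composition of the FIG. 4 components: inverse renderer 402,
    adjustment (rotation) module 404, and forward renderer 406."""

    def __init__(self, inverse_renderer: nn.Module, forward_renderer: nn.Module, rotate_fn):
        super().__init__()
        self.inverse_renderer = inverse_renderer  # f: input image -> implicit representation
        self.forward_renderer = forward_renderer  # g: implicit representation -> output image
        self.rotate_fn = rotate_fn                # e.g., a shear rotation of the scene tensor

    def forward(self, image, azimuth_deg=0.0, elevation_deg=0.0):
        scene = self.inverse_renderer(image)                       # implicit representation 408
        scene = self.rotate_fn(scene, azimuth_deg, elevation_deg)  # adjusted representation 410
        return self.forward_renderer(scene)                        # output 412
```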
According to aspects of the present disclosure, the model is trained without 3D supervision, using only images and their relative poses, to learn an equivariant scene representation. Unlike most other scene representation models, the disclosed model does not require any pose information at inference time. From a single image, the model may infer a scene representation, transform it, and render it (see, e.g., figs. 7-9 below). Furthermore, the model may infer a scene representation in a single forward pass, in contrast to conventional scene representation algorithms that may require computationally and/or resource-intensive optimization processes to extract a scene representation from an image or set of images.
According to various aspects, the subject technology introduces a framework for learning scene representations and new view synthesis without explicit 3D supervision, by enforcing equivariance between changes in camera perspective and changes in the latent scene representation.
FIG. 5 shows additional details of a model architecture that may be used for ML model 220. According to aspects of the present disclosure, the model architecture may be fully differentiable, including the shear rotation operation discussed in further detail below. Providing a fully differentiable model architecture facilitates implementing a model that can be trained using backpropagation to update all learnable parameters in the neural network. As shown in the example of fig. 5, an input image 400 (e.g., an image depicting a car) is mapped by one or more 2D convolutions 500, followed by an inverse projection 502 and a set of 3D convolutions 504 (e.g., by the inverse renderer 402), to generate the implicit representation 408. In the example of fig. 5, the inferred scene (e.g., implicit representation 408) is then rendered (e.g., by the forward renderer 406) by transposed 3D convolutions 504', a forward projection 506, and one or more transposed 2D convolutions 500'. In this example, the implicit representation 408 is provided to the forward renderer 406 without rotation, so the output 412 (e.g., an output image) is a copy of the input image 400.
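The layer-level structure described for FIG. 5 can be sketched as follows; the channel counts, image size, and the reshape used for the inverse projection are assumptions chosen for illustration, and the forward renderer would mirror this structure with transposed 3D and 2D convolutions.

```python
import torch.nn as nn

class InverseRendererSketch(nn.Module):
    """2D convolutions, an inverse projection that lifts 2D features into a 3D
    tensor, then 3D convolutions (cf. elements 500, 502, and 504 of FIG. 5)."""

    def __init__(self):
        super().__init__()
        self.conv2d = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),   # e.g., 128x128 -> 64x64
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(), # 64x64 -> 32x32
        )
        self.conv3d = nn.Sequential(
            nn.Conv3d(4, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 32, kernel_size=3, padding=1),
        )

    def forward(self, image):
        feats = self.conv2d(image)              # (batch, 128, 32, 32)
        b, c, h, w = feats.shape
        scene = feats.view(b, 4, c // 4, h, w)  # inverse projection: split channels into a depth axis
        return self.conv3d(scene)               # implicit scene representation (batch, 32, 32, 32, 32)
```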
FIG. 6 illustrates a flow diagram of an exemplary process for generating an output using a machine learning model, according to one or more implementations. For purposes of explanation, process 600 is described herein primarily with respect to server 120 of fig. 1. However, process 600 is not limited to server 120 of fig. 1, and one or more blocks (or operations) of process 600 may be performed by one or more other components of server 120 and/or by other suitable devices, such as electronic device 110. For further explanation purposes, the blocks of process 600 are described herein as occurring sequentially or linearly. However, multiple blocks of process 600 may occur in parallel. Further, the blocks of the process 600 need not be performed in the order shown, and/or one or more blocks of the process 600 need not be performed and/or may be replaced by other operations.
At block 602, server 120 provides an input image, such as input image 400 of fig. 4, depicting a view of an object, to a machine learning model, such as ML model 220, that has been trained based on an equivariance constraint under rotation between a training object (e.g., one or more of training objects 303, 307, and 311 of fig. 3) and a model-generated representation (e.g., an implicit representation) of the training object.
The machine learning model may include an inverse renderer, such as inverse renderer 402, and a forward renderer, such as forward renderer 406.
At block 604, the server 120 generates at least one of an output image depicting the object from a rotated view that is different from the view of the object in the image or a three-dimensional representation of the object based on the provided image using the machine learning model. The three-dimensional representation may comprise an explicit three-dimensional representation comprising at least one of a voxel grid, mesh, or point cloud.
At block 604, generating at least one of an output image depicting the object or a three-dimensional representation of the object from a rotated view that is different from the view of the object in the image may include utilizing a forward renderer to generate at least one of an output image depicting the object or a three-dimensional representation of the object from a rotated view that is different from the view of the object in the image. The generating operation of block 604 may also include generating an implicit representation, such as implicit representation 408, of the object with the inverse renderer based on the input image.
The forward renderer may generate at least one of an output image depicting the object or a three-dimensional representation of the object from a rotated view that is different from a view of the object in the image based on the implicit representation generated by the inverse renderer. At block 604, generating at least one of an output image depicting the object or a three-dimensional representation of the object from a rotated view that is different from a view of the object in the image based on the implicit representation may include rotating the implicit representation of the object.
The machine learning model may include an adjustment module, such as adjustment module 404, for performing rotations of the implicit representation. Rotating the implicit representation of the object may include performing a shear rotation of the implicit representation of the object. The implicit representation of the object may be, for example, a tensor or the latent space of an autoencoder.
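As a usage illustration of process 600, and reusing the NeuralRenderer sketch above, a single forward pass could generate a rotated view from one input image roughly as follows; the stand-in components, rotation angles, and tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# toy stand-in components for illustration; real ones would follow the sketches in this description
inverse_renderer = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1))   # image -> "scene" tensor (toy)
forward_renderer = nn.Sequential(nn.Conv2d(8, 3, 3, padding=1))   # "scene" tensor -> image (toy)

def rotate_fn(scene, azimuth_deg, elevation_deg):
    return scene   # placeholder for the shear rotation module

model = NeuralRenderer(inverse_renderer, forward_renderer, rotate_fn)   # sketch class from above
input_image = torch.rand(1, 3, 128, 128)                                # single 2D input image, no depth
with torch.no_grad():
    novel_view = model(input_image, azimuth_deg=90.0, elevation_deg=-30.0)  # rotated view in one pass
print(novel_view.shape)   # torch.Size([1, 3, 128, 128])
```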
Defining tensor rotations in 3D is not straightforward, and it has been found that natural tensor rotation may not be used for learning an equivariant representation. The adjustment module 404 may provide a new differentiable layer for performing reversible shear rotations, which allows the neural network to learn equivariant representations. According to various aspects, the subject technology shows that natural tensor rotations cannot be made equivariant, and introduces reversible shear operations that address this limitation within a differentiable neural architecture.
Figs. 7, 8, and 9 illustrate various exemplary outputs of an ML model, such as ML model 220, trained based on an equivariance constraint under rotation between a training object and a model-generated representation of the training object. In the examples of figs. 7, 8, and 9, the outputs are output images, each depicting the object shown in the input image from a rotated view that is different from the view of the object in that image.
FIG. 7 shows two input images 400-1 and 400-2 depicting two different views of two different mugs, respectively. FIG. 7 also shows five output images generated by the ML model 220 based on each of the two input images 400-1 and 400-2. For input image 400-1, five output images 800-1, 800-2, 800-3, 800-4, and 800-5 are shown, each depicting the mug in input image 400-1, but from a view different from that depicted in input image 400-1. For input image 400-2, five output images 802-1, 802-2, 802-3, 802-4, and 802-5 are shown, each depicting the mug in input image 400-2, but from a view different from that depicted in input image 400-2.
The results shown in fig. 7 for single-shot new view synthesis of previously unseen mugs indicate that the ML model 220 successfully infers the shape of a previously unseen mug from a single image and is able to perform large perspective transformations around the scene. Even from difficult-to-observe perspectives, the model is able to produce consistent and realistic views of the scene, even generating reflections on the mug rim and the shiny mug handle when they are not visible in the input, which indicates that the model has learned a good prior over object shape.
Fig. 8 shows three input images 400-3, 400-4 and 400-5 depicting three different views of three different mountains, respectively. FIG. 8 also shows five output images generated by ML model 220 based on each of three input images 400-3, 400-4, and 400-5. For the input image 400-3 five output images 803-1, 803-2, 803-3, 803-4 and 803-5 are shown, each depicting a mountain range in the input image 400-3, but with a view different from the view depicted in the input image 400-3. For the input image 400-4, five output images 804-1, 804-2, 804-3, 804-4 and 804-5 are shown, each depicting a mountain in the input image 400-4, but with a view different from the view depicted in the input image 400-4. For the input image 400-5, five output images 805-1, 805-2, 805-3, 805-4, and 805-5 are shown, each depicting a mountain range in the input image 400-5, but a view different from that depicted in the input image 400-5.
The results shown in fig. 8 for single-shot new view synthesis show that, even for input images depicting complex objects and backgrounds such as mountains, the model can faithfully reproduce the 3D structure and texture of the mountains as the camera rotates around the inferred scene representation, although it may be difficult for the model to capture high-frequency details. For a wide variety of mountain landscapes (e.g., snowy mountains, rocky mountains, etc.), the ML model 220 can generate plausible other views of the landscape.
Fig. 9 shows three input images 400-6, 400-7 and 400-8, which show three different views of three different plants (e.g., succulent plants), respectively. FIG. 9 also shows five output images generated by the ML model 220 based on each of the three input images 400-6, 400-7, and 400-8. For the input image 400-6, five output images 806-1, 806-2, 806-3, 806-4 and 806-5 are shown, each depicting a succulent plant in the input image 400-6, but a different view than the view depicted in the input image 400-6. For the input image 400-7, five output images 807-1, 807-2, 807-3, 807-4 and 807-5 are shown, each depicting a succulent plant in the input image 400-7, but with a different view than that depicted in the input image 400-7. For the input image 400-8, five output images 808-1, 808-2, 808-3, 808-4, and 808-5 are shown, each depicting a succulent plant in the input image 400-8, but with a different view than that depicted in the input image 400-8.
The results shown in fig. 9 for single-shot new view synthesis indicate that the ML model 220 is able to generate plausible other views of the plant shown in the input image and, in particular, produces notably consistent shading. Additional plant training images may be used to further train the ML model to learn a strong prior over plant shape, reducing or avoiding overfitting by the network to the input data at runtime.
In various implementations, different ML models may be trained using training images of training objects of different classes (e.g., mugs, mountains, vegetation, etc.) to perform neural rendering on input images of objects in each class, or a single ML model may be trained using training images of objects of various classes to perform neural rendering on input images of substantially any object or scene.
As described above, a two-dimensional output image, such as the output images shown in fig. 7, 8, and 9, is but one example of an output of the trained ML model 220. In other implementations, the ML model 220 may output three-dimensional information about an object depicted in a two-dimensional image, such as a three-dimensional representation of the object.
Fig. 10 illustrates a three-dimensional representation 1000 of an object (e.g., a rabbit) that may be generated by the trained ML model 220 based on a depiction of the object in a two-dimensional input image. In fig. 10, the three-dimensional representation 1000 is an explicit three-dimensional representation (e.g., a mesh) that is recognizable by a human observer as a rabbit and/or that may be rendered in a form recognizable by a human observer as a rabbit. The three-dimensional representation generated by the ML model 220 may be rendered as a point cloud, a voxel grid, or any other three-dimensional representation.
Fig. 10 also shows an implicit representation 408 of the same object (e.g., the rabbit) that is unrecognizable to a human observer, but that transforms equivariantly under rotation with the explicit three-dimensional representation 1000. As shown by the camera 1002 in fig. 10, the explicit three-dimensional representation 1000 may be viewed from a view 1004 at a location 1006 on the sphere 1003 (e.g., for rendering an output image such as those shown in figs. 7, 8, and 9) or from views associated with other locations on the sphere 1003, such as locations 1008 and 1010. The implicit representation 408 may also be viewed from any of the locations 1006, 1008, 1010 on the sphere 1003 (or any other suitable location). To train the ML model 220 to be able to generate the explicit three-dimensional representation 1000 and/or the output images of figs. 7, 8, and 9, the model may be trained based on an equivariance constraint requiring that the implicit representation 408 of the object transform (e.g., rotate) equivariantly with the object itself (e.g., under the same rotation).
FIG. 11 illustrates a training operation on ML model 220 using an equivariance constraint. As shown in fig. 11, training of an ML model using an equivariance constraint may be performed using two input training images 1100 and 1102 depicting the same input training object 1101 from two different views (e.g., rotated views).
As shown in fig. 11, training the ML model 220 based on an equivariance constraint under rotation between the training object 1101 and a model-generated representation of the training object may include providing a first input training image 1100 (also referred to as $x_1$) depicting a first view of the training object 1101 to the machine learning model, and providing a second input training image 1102 (also referred to as $x_2$) depicting a second view of the training object 1101 to the machine learning model.

Training may also include generating a first implicit representation 1104 (also referred to as $z_1$) of the training object 1101 based on the first input training image 1100 (e.g., in an operation $f(x_1)$), and generating a second implicit representation 1106 (also referred to as $z_2$) of the training object 1101 based on the second input training image 1102 (e.g., in an operation $f(x_2)$). Training may also include rotating the first implicit representation 1104 of the training object 1101 as shown by arrow 1107 (e.g., to form a rotated first implicit representation 1108, also referred to as $\tilde{z}_1$), and rotating the second implicit representation 1106 of the training object 1101 as indicated by arrow 1109 (e.g., to form a rotated second implicit representation 1110, also referred to as $\tilde{z}_2$).

Training may also include generating a first output training image 1116 (also referred to as $g(\tilde{z}_1)$) based on the rotated first implicit representation 1108 of the training object 1101, and generating a second output training image 1114 (also referred to as $g(\tilde{z}_2)$) based on the rotated second implicit representation 1110 of the training object 1101. Training may also include comparing the first input training image 1100 with the second output training image 1114, and comparing the second input training image 1102 with the first output training image 1116.
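One training pass of the FIG. 11 scheme might look like the following sketch, in which f, g, and rotate stand in for the inverse renderer, forward renderer, and shear rotation module; the use of an L1 image difference is an assumption rather than something specified above.

```python
import torch

def training_step(f, g, rotate, x1, x2, d_azimuth, d_elevation):
    """Encode both views, rotate each representation toward the other view,
    render, and compare each render with the opposite input image."""
    z1, z2 = f(x1), f(x2)                          # implicit representations 1104 and 1106
    z1_rot = rotate(z1, d_azimuth, d_elevation)    # rotated first representation 1108
    z2_rot = rotate(z2, -d_azimuth, -d_elevation)  # rotated second representation 1110
    x2_hat = g(z1_rot)                             # first output training image 1116
    x1_hat = g(z2_rot)                             # second output training image 1114
    # compare the first input image with the second output image, and vice versa
    return torch.mean(torch.abs(x1 - x1_hat)) + torch.mean(torch.abs(x2 - x2_hat))
```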
FIG. 12 illustrates a flow diagram of an exemplary process for training a model, such as ML model 220, based on an equivariance constraint under rotation between a training object and a model-generated representation, according to one or more implementations. For purposes of explanation, process 1200 is described herein primarily with respect to server 120 of fig. 1. However, process 1200 is not limited to server 120 of fig. 1, and one or more blocks (or operations) of process 1200 may be performed by one or more other components of server 120 and/or by other suitable devices, such as electronic device 110. Further for purposes of explanation, the blocks of process 1200 are described herein as occurring sequentially or linearly. However, multiple blocks of process 1200 may occur in parallel. Further, the blocks of process 1200 need not be performed in the order shown, and/or one or more blocks of process 1200 need not be performed and/or may be replaced by other operations.
At block 1202, server 120 may provide a first input training image, such as input training image 1100 of fig. 11, depicting a first view of a training object, such as training object 1101, to the machine learning model.
At block 1204, the server 120 may provide a second input training image, such as the input training image 1102 of fig. 11, depicting a second view of a training object, such as training object 1101, to the machine learning model.
At block 1206, the server 120 may generate a first implicit representation of the training object, such as implicit representation 1104, based on the first input training image.
At block 1208, the server 120 may generate a second implicit representation of the training object, such as the implicit representation 1106, based on the second input training image.
At block 1210, the server 120 may rotate the first implicit representation of the training object (e.g., to form a rotated first implicit representation 1108).
At block 1212, the server 120 may rotate the second implicit representation of the training object (e.g., to form a rotated second implicit representation 1110).
At block 1214, the server 120 may generate a first output training image, such as the output training image 1116, based on the rotated first implicit representation 1108 of the training object.
At block 1216, server 120 may generate a second output training image, such as output training image 1114, based on rotated second implicit representation 1110 of the training object.
At block 1218, the server 120 may compare the first input training image to the second output training image.
At block 1220, the server 120 may compare the second input training image with the first output training image. Training (e.g., comparison of the first input training image to the second output training image and comparison of the second input training image to the first output training image) may include minimizing a loss function based on the comparison of the first input training image to the second output training image and the comparison of the second input training image to the first output training image.
As discussed herein, in one or more embodiments, the framework for ML model 220 may consist of two models: an inverse renderer $f: \mathcal{X} \rightarrow \mathcal{Z}$ that maps an image $x \in \mathcal{X}$ to a scene representation $z \in \mathcal{Z}$ (see also inverse renderer 402 of fig. 4), and a forward renderer $g: \mathcal{Z} \rightarrow \mathcal{X}$ that maps the scene representation to an image (see also forward renderer 406 of fig. 4). Suppose that $x \in \mathbb{R}^{c \times h \times w}$, where $c$, $h$, $w$ are the channels, height, and width of the image, and similarly $z \in \mathbb{R}^{c_s \times d_s \times h_s \times w_s}$, where $c_s$, $d_s$, $h_s$, $w_s$ are the channels, depth, height, and width of the scene representation.
Generally, since the true geometry of the 3D scene cannot be accessed, structure is imposed on the scene representation $z$ by ensuring that it transforms like a 3D scene. In particular, the training operation ensures that the inverse renderer $f$ and the forward renderer $g$ are equivariant with respect to rotations of the scene. The rotation operation performed in scene space is denoted by $R_{\mathcal{Z}}$, and the equivalent rotation operation applied to the rendered image $x$ is denoted by $R_{\mathcal{X}}$. The equivariance of the inverse renderer (or encoder) $f$ and the forward renderer (or decoder) $g$ is then given by:

$$f(R_{\mathcal{X}} x) = R_{\mathcal{Z}} f(x), \qquad g(R_{\mathcal{Z}} z) = R_{\mathcal{X}} g(z) \qquad (1)$$

The first equation in equation (1) means that if a camera view change is performed in image space, the scene representation encoded by $f$ should undergo the equivalent rotation. The second equation means that if the scene representation is rotated, the image rendered by $g$ should undergo the equivalent rotation.
To design a loss function that enforces equivariance with respect to rotational transformations, two images of the same scene and their relative transformation, $(x_1, x_2, \Delta\phi, \Delta\theta)$, are considered, as described above in connection with fig. 11. The server 120 first maps the images through the inverse renderer to obtain their scene representations $z_1 = f(x_1)$ and $z_2 = f(x_2)$. The server 120 (e.g., the adjustment module 404) then rotates each encoded representation (e.g., the first implicit representation 1104 and the second implicit representation 1106) by the relative transformation $R$, such that $\tilde{z}_1 = R z_1$ and $\tilde{z}_2 = R^{-1} z_2$. Since $z_1$ and $z_2$ represent the same scene in different poses, the rotated representation $\tilde{z}_1$ is expected to render as the image $x_2$, and the rotated representation $\tilde{z}_2$ is expected to render as $x_1$, as shown in fig. 11. Training can then be performed by minimizing a rendering loss function $\mathcal{L}_{\mathrm{render}}$ to ensure that the model adheres to these transformations, where:

$$\mathcal{L}_{\mathrm{render}} = \lVert x_2 - g(\tilde{z}_1) \rVert + \lVert x_1 - g(\tilde{z}_2) \rVert \qquad (2)$$
Since $g(\tilde{z}_1)$ should match $x_2$ and $g(\tilde{z}_2)$ should match $x_1$, minimizing this loss corresponds to satisfying the equivariance property of the forward renderer $g$. While this loss enforces equivariance of $g$, in practice it has been found that it typically does not enforce equivariance of the inverse renderer $f$. Accordingly, training the machine learning model may also include comparing the first implicit representation 1104 with the rotated second implicit representation 1110, and comparing the second implicit representation 1106 with the rotated first implicit representation 1108. For example, the loss function may be further based on a comparison of the first implicit representation to the rotated second implicit representation and a comparison of the second implicit representation to the rotated first implicit representation. For example, a loss function that enforces equivariance of the inverse renderer with respect to rotation may be defined as:

$$\mathcal{L}_{\mathrm{scene}} = \lVert z_1 - \tilde{z}_2 \rVert + \lVert z_2 - \tilde{z}_1 \rVert \qquad (3)$$

The total loss function may then be $\mathcal{L} = \mathcal{L}_{\mathrm{render}} + \mathcal{L}_{\mathrm{scene}}$, ensuring that both the inverse renderer and the forward renderer in the trained machine learning model 220 are equivariant with respect to the camera perspective or rotation.
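Equations (2) and (3) could be combined in code roughly as follows; the choice of an L1 norm and the equal weighting of the two terms are assumptions, since the norm and weights are not pinned down above.

```python
import torch

def total_loss(x1, x2, x1_hat, x2_hat, z1, z2, z1_rot, z2_rot):
    """L = L_render + L_scene, enforcing equivariance of both g and f."""
    l_render = torch.mean(torch.abs(x2 - x2_hat)) + torch.mean(torch.abs(x1 - x1_hat))  # equation (2)
    l_scene = torch.mean(torch.abs(z1 - z2_rot)) + torch.mean(torch.abs(z2 - z1_rot))   # equation (3)
    return l_render + l_scene
```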
Any or all of the operations for training the model described above in connection with blocks 1202-1220 may be performed based on the at least two training images, without three-dimensional supervision. The trained machine learning model may be tested without providing pose information to the trained machine learning model.
It has been found that defining the rotation operation $R_{\mathcal{Z}}$ in the scene space requires particular care. In practice, natural tensor rotation is not suitable for this task due to spatial aliasing: rotating points on a discrete grid generally results in points that are no longer aligned with the grid, requiring some form of sampling to reconstruct their values.

To illustrate this, the rotation of a 2D image is described below (since the effect for 3D rotations is the same). To show the aliasing effects caused by rotation on a grid, an image may be rotated by an angle $\theta$, and the resulting image may then be rotated back by an angle $-\theta$ (sampled with bilinear interpolation to obtain values at grid points). If rotation on the grid were reversible, the final image should be identical to the original image. To test whether this holds in practice, one thousand images were sampled from the CIFAR10 dataset, each rotated back and forth for every angle in [0, 360), and the errors were recorded. In this exemplary scenario, the average pixel value error is about 3%, which is quite significant.
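The back-and-forth rotation check described above can be reproduced with a short sketch like the one below; scipy's image rotation with linear interpolation is used here as a stand-in for whatever interpolation scheme the original experiment used.

```python
import numpy as np
from scipy.ndimage import rotate as nd_rotate

def round_trip_error(image, angle_deg):
    """Rotate an image by angle_deg and back again with (bi)linear interpolation,
    then return the mean absolute drift from the original pixel values."""
    rotated = nd_rotate(image, angle_deg, reshape=False, order=1, mode="nearest")
    restored = nd_rotate(rotated, -angle_deg, reshape=False, order=1, mode="nearest")
    return np.abs(restored.astype(float) - image.astype(float)).mean()

# example: a random 32x32 grayscale image, rotated back and forth by 20 degrees
image = np.random.rand(32, 32)
print(round_trip_error(image, 20.0))   # nonzero: rotation on the grid is not reversible
```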
These results imply that tensor rotations may not be used to learn a representation of a scene that is equivariant with respect to camera rotations. In fact, for tensor rotations, the rotation operation $R_{\mathcal{Z}}$ is not reversible, i.e., $R_{\mathcal{Z}}^{-1} R_{\mathcal{Z}} z \neq z$. For example, consider a camera rotation in image space, $R_{\mathcal{X}} x$, and then consider its inverse:

$$R_{\mathcal{X}}^{-1} R_{\mathcal{X}} x = x \qquad (4)$$

Applying $f$ to both sides of equation (4) and using the equivariance property twice gives:

$$f(x) = f(R_{\mathcal{X}}^{-1} R_{\mathcal{X}} x) = R_{\mathcal{Z}}^{-1} R_{\mathcal{Z}} f(x) \qquad (5)$$

Since, for tensor rotations, generally $R_{\mathcal{Z}}^{-1} R_{\mathcal{Z}} z \neq z$, the operator cannot satisfy the equivariance equations. To overcome this problem, the adjustment module 404 (see fig. 4) may be implemented to perform rotations via shear rotations.
In the discussion that follows, it is shown that shear rotations can be used to define reversible tensor rotations that can be used in neural networks. Rotating an image corresponds to rotating the pixel values at given $(x, y)$ coordinates in the image by applying a rotation matrix to the coordinate vector. In contrast, a shear rotation rotates the image by performing a series of shear operations. According to aspects of the present disclosure, the rotation matrix may be factored as:

$$\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} 1 & -\tan(\theta/2) \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \sin\theta & 1 \end{pmatrix} \begin{pmatrix} 1 & -\tan(\theta/2) \\ 0 & 1 \end{pmatrix} \qquad (6)$$

Thus, the rotation is performed using three shear operations, as opposed to a single matrix multiplication.
For example, as shown in fig. 13, in a shear rotation operation 1300, an input image 1304 may be sheared three times to obtain a rotated image 1312. Fig. 13 also shows how a nearest-neighbor shear rotation operation 1302 for rotating an input image 1314 into a rotated output image 1322 using three nearest-neighbor shear operations may allow the adjustment module 404 to perform reversible rotations on the grid.
Although the shear operations themselves will not, in general, be aligned with the grid coordinates and would also require a form of interpolation, the following illustrates how these operations can be made reversible by using a nearest-neighbor approach (e.g., with the adjustment module 404).
Applying a shear transform involves moving either the columns or the rows of the image, but not both. Thus, for each shifted point, there is a unique nearest neighbor on the grid. In contrast, for a regular rotation, two shifted points may be mapped to the same grid point by a nearest-neighbor operation.
FIG. 14 shows a regular rotation 1400 of an image 1401 to a rotated image 1406, where two shifted points 1410 and 1414 can be mapped to the same grid point 1408 by a nearest-neighbor operation. In contrast, fig. 14 also shows a shear rotation operation 1402 of an image 1404 to a rotated image 1420, where for each shifted point 1424 there is a unique nearest neighbor 1422 on the grid.
Since the server 120 can find a unique nearest neighbor for each grid point, shearing with nearest-neighbor sampling is a reversible operation. In some implementations, since each of the shear operations is reversible, the composition of the three shear operations is also reversible, meaning that a reversible rotation on the grid can be defined and performed by the adjustment module 404.
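A grid-level sketch of the reversible nearest-neighbor shear rotation is shown below for a single-channel 2D image; the wrap-around boundary handling via np.roll and the centering convention are implementation assumptions, not details taken from the patent.

```python
import numpy as np

def shear_rows(img, amount):
    """Shift each row by a rounded, row-dependent number of pixels (a nearest-neighbor
    shear along x). Because every shift is an integer, the operation is exactly
    undone by shear_rows(img, -amount)."""
    h, _ = img.shape
    out = np.empty_like(img)
    centre = (h - 1) / 2.0
    for j in range(h):
        shift = int(np.rint(amount * (j - centre)))
        out[j] = np.roll(img[j], shift)
    return out

def shear_cols(img, amount):
    """Nearest-neighbor shear along y, implemented by transposing and shearing rows."""
    return shear_rows(img.T, amount).T

def shear_rotate(img, theta):
    """Rotate by composing the three shears of equation (6) with nearest-neighbor shifts."""
    a, b = -np.tan(theta / 2.0), np.sin(theta)
    return shear_rows(shear_cols(shear_rows(img, a), b), a)

def shear_rotate_inverse(img, theta):
    """Undo shear_rotate exactly by applying the inverse shears in reverse order."""
    a, b = -np.tan(theta / 2.0), np.sin(theta)
    return shear_rows(shear_cols(shear_rows(img, -a), -b), -a)

img = np.arange(64.0).reshape(8, 8)
round_trip = shear_rotate_inverse(shear_rotate(img, np.deg2rad(20.0)), np.deg2rad(20.0))
assert np.array_equal(round_trip, img)   # reversible, unlike the interpolated rotation above
```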
While defining tensor rotations through shears allows reversibility, it comes at the cost of angular resolution. In practice, the minimum rotation that can be represented using a reversible shear rotation depends on the grid size n, as:
[Equation not reproduced: the minimum representable angle as a function of the grid size n.]
This means that the model may not be equivariant with respect to continuous rotations, but only up to a finite angular resolution. However, for grid sizes used in practice, the angular resolution is fine enough to model most rotations. For example, for a 32 × 32 grid, the angular resolution is less than 2 degrees. Some example values of the angular resolution are given in Table 1 below.
[Table values not reproduced.]
Table 1: Angular resolution (in degrees) for various grid sizes.
The shear rotation matrix factorization involves tan(θ/2) terms. For rotations of the "camera" or view over the entire sphere around the scene representation, the adjustment module 404 may perform rotations for θ ∈ [0, 360). To avoid the tan function becoming infinite, the angle can be decomposed as θ = θ_90 + θ_small, where θ_90 ∈ {0, 90, 180, 270} and θ_small ∈ [−45, 45]. Since image rotation on the grid is exactly reversible for multiples of 90 degrees, the 90, 180, and 270 degree rotations can be performed first by flipping and transposing the image, followed by a small-angle shear rotation. As a result, shear rotations are only ever performed for angles in [−45, 45], thereby avoiding any numerical problems.
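A minimal sketch of this angle decomposition is shown below, reusing the hypothetical shear_rotate helper from the earlier snippet; np.rot90 supplies the exact quarter-turn rotations, and the residual angle always falls in [−45, 45]. Names and sign conventions are illustrative assumptions.

```python
import numpy as np

def decompose_angle(theta_deg):
    """Split an angle into a multiple of 90 degrees plus a residual in [-45, 45]."""
    theta_deg = theta_deg % 360.0
    quarter_turns = int(round(theta_deg / 90.0)) % 4
    residual = theta_deg - 90.0 * round(theta_deg / 90.0)
    return quarter_turns, residual

def rotate_full_range(img, theta_deg):
    quarter_turns, residual = decompose_angle(theta_deg)
    img = np.rot90(img, quarter_turns)                 # flips/transposes: exact on the grid
    return shear_rotate(img, np.deg2rad(residual))     # small-angle shear rotation (see sketch above)
```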
While the shear rotation operation has been defined above for a 2D grid, it extends to a 3D grid by performing two 2D rotations: a fully reversible 3D shear rotation operation may be defined as an elevation rotation about the width axis followed by an azimuth rotation about the height axis of the scene representation.
The shear rotation operation is discontinuous in the angle. However, this is of no consequence in practice, since the gradient with respect to the angle need not be calculated. In effect, the shear rotation layer of the machine learning model 220 simply relocates voxels within the transformed scene representation tensor, allowing backpropagation through the operation.
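To illustrate how such a relocation of values remains compatible with backpropagation, the following sketch expresses a single row-wise shear as an index gather in PyTorch. The names and the 2D-only shape handling are assumptions for illustration, not the actual layer of the model 220; gradients flow through the gathered pixel values, and no gradient with respect to the angle is required.

```python
import torch

def shear_x_differentiable(image, shifts):
    """Row-wise integer shifts expressed as an index gather over a 2D feature map."""
    n = image.shape[-1]
    cols = torch.arange(n, device=image.device)
    src = (cols.unsqueeze(0) - shifts.unsqueeze(1)) % n   # source column for each output pixel
    return torch.gather(image, -1, src)                   # differentiable w.r.t. the pixel values

# Example: per-row shifts derived from tan(theta / 2), as in the shear factorization.
feat = torch.randn(32, 32, requires_grad=True)
theta = torch.deg2rad(torch.tensor(10.0))
rows = torch.arange(32) - 16
shifts = torch.round(-torch.tan(theta / 2) * rows).long()
out = shear_x_differentiable(feat, shifts)
out.sum().backward()          # gradients propagate back to the input feature map
```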
As described above, one aspect of the present technology is the gathering and use of images from specific and legitimate sources for neural rendering. The present disclosure contemplates that, in some cases, these images may include personal information data that uniquely identifies or may be used to identify a specific person. Such personal information data may include images of the user's face or portions of the user's body, video data, demographic data, location-based data, online identifiers, contact information such as phone numbers, email addresses, and home addresses, data or records relating to the user's health or fitness (e.g., vital signs measurements, medication information, exercise information), date of birth, or any other personal information.
The present disclosure recognizes that the use of such personal information data in the present technology may be useful to benefit the user. For example, the personal information data may be used for neural rendering of a person image.
The present disclosure contemplates that entities responsible for collecting, analyzing, disclosing, transmitting, storing, or otherwise using such personal information data will comply with established privacy policies and/or privacy practices. In particular, it would be desirable for such entities to implement and consistently apply privacy practices generally recognized as meeting or exceeding industry or government requirements for maintaining user privacy. Such information regarding the use of personal data should be prominently and conveniently accessible to users and should be updated as the collection and/or use of the data changes. Personal information from users should be collected for legitimate uses only. In addition, such collection/sharing should occur only after receiving user consent or on another legal basis set forth in applicable law. Furthermore, such entities should consider taking any steps necessary to safeguard and secure access to such personal information data and to ensure that others with access to the personal information data adhere to their privacy policies and procedures. In addition, such entities may subject themselves to third-party evaluation to certify their adherence to widely accepted privacy policies and practices. Moreover, policies and practices should be tailored to the particular type of personal information data being collected and/or accessed and adapted to applicable laws and standards, including jurisdiction-specific considerations that may serve to impose a higher standard. For example, in the United States, the collection of or access to certain health data may be governed by federal and/or state laws, such as the Health Insurance Portability and Accountability Act (HIPAA); whereas health data in other countries may be subject to other regulations and policies and should be handled accordingly.
Regardless of the foregoing, the present disclosure also contemplates embodiments in which a user selectively prevents use or access to personal information data. That is, the present disclosure contemplates that hardware elements and/or software elements may be provided to prevent or block access to such personal information data. For example, in the case of neural rendering, the subject technology may be configured to allow a user to opt-in or opt-out of participating in the collection and/or sharing of personal information data during registration service or at any time thereafter. In addition to providing "opt-in" and "opt-out" options, the present disclosure contemplates providing notifications related to accessing or using personal information. For example, the user may be notified that their personal information data is to be accessed when the application is downloaded, and then be reminded again just before the personal information data is accessed by the application.
Further, it is the intent of the present disclosure that personal information data should be managed and handled in a way that minimizes the risk of unintentional or unauthorized access or use. Risk can be minimized by limiting the collection of data and deleting the data once it is no longer needed. In addition, and when applicable, including in certain health-related applications, data de-identification may be used to protect the privacy of the user. De-identification may be facilitated, where appropriate, by removing identifiers, controlling the amount or specificity of stored data (e.g., collecting location data at a city level rather than at an address level, or at a level insufficient for facial recognition), controlling how data is stored (e.g., aggregating data across users), and/or by other methods such as differential privacy.
Thus, while the present disclosure broadly covers the use of personal information data to implement one or more of the various disclosed embodiments, the present disclosure also contemplates that various embodiments may be implemented without the need to access such personal information data. That is, various embodiments of the present technology do not fail to function properly due to the lack of all or a portion of such personal information data.
FIG. 15 illustrates an electronic system 1500 that may be used to implement one or more implementations of the subject technology. Electronic system 1500 may be and/or may be part of electronic device 110 and/or server 120 shown in fig. 1. Electronic system 1500 may include various types of computer-readable media and interfaces for various other types of computer-readable media. Electronic system 1500 includes bus 1508, one or more processing units 1512, system memory 1504 (and/or buffers), ROM 1510, persistent storage 1502, input device interface 1514, output device interface 1506, and one or more network interfaces 1516, or subsets and variations thereof.
Bus 1508 generally represents all of the system, peripheral, and chipset buses that communicatively connect the many internal devices of electronic system 1500. In one or more implementations, a bus 1508 communicatively connects the one or more processing units 1512 with the ROM 1510, the system memory 1504, and the permanent storage device 1502. The one or more processing units 1512 retrieve instructions to be executed and data to be processed from these various memory units in order to perform the processes of the subject disclosure. In different implementations, the one or more processing units 1512 may be a single processor or a multi-core processor.
The ROM 1510 stores static data and instructions required by the one or more processing units 1512 as well as other modules of the electronic system 1500. On the other hand, the persistent storage device 1502 may be a read-write memory device. Persistent storage 1502 may be a non-volatile memory unit that stores instructions and data even when electronic system 1500 is turned off. In one or more implementations, a mass storage device (such as a magnetic or optical disk and its corresponding disk drive) may be used as the persistent storage device 1502.
In one or more implementations, a removable storage device (such as a floppy disk, a flash drive, and their corresponding disk drives) may be used as the permanent storage device 1502. Like the persistent storage device 1502, the system memory 1504 may be a read-write memory device. However, unlike the persistent storage device 1502, the system memory 1504 may be a volatile read-and-write memory, such as a random access memory. The system memory 1504 may store any of the instructions and data that may be needed by the one or more processing units 1512 at runtime. In one or more implementations, the processes of the subject disclosure are stored in the system memory 1504, the persistent storage device 1502, and/or the ROM 1510. The one or more processing units 1512 retrieve instructions to be executed and data to be processed from the various memory units in order to perform one or more embodied processes.
The bus 1508 is also connected to an input device interface 1514 and an output device interface 1506. The input device interface 1514 enables a user to transfer information and select commands to the electronic system 1500. Input devices that may be used with input device interface 1514 may include, for example, an alphanumeric keyboard and a pointing device (also referred to as a "cursor control device"). The output device interface 1506 may enable, for example, display of images generated by the electronic system 1500. Output devices that may be used with output device interface 1506 may include, for example, printers and display devices, such as Liquid Crystal Displays (LCDs), Light Emitting Diode (LED) displays, Organic Light Emitting Diode (OLED) displays, flexible displays, flat panel displays, solid state displays, projectors, or any other device for outputting information. One or more implementations may include a device that acts as both an input device and an output device, such as a touch screen. In these implementations, the feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Finally, as shown in fig. 15, bus 1508 also couples electronic system 1500 to one or more networks and/or to one or more network nodes, such as electronic device 110 shown in fig. 1, through one or more network interfaces 1516. In this manner, electronic system 1500 may be part of a computer network, such as a LAN, wide area network ("WAN"), or intranet, or may be part of a network of networks, such as the internet. Any or all of the components of electronic system 1500 may be used with the subject disclosure.
The disclosed systems and methods provide advantages for neural rendering, including providing a machine learning model that makes very few assumptions about the scene representation and rendering process. In fact, the disclosed machine learning model learns the representation only by enforcing equivariance with respect to 3D rotations. Thus, the materials, textures, and lighting in the scene may be encoded into the model. The simplicity of the disclosed model also means that it can be trained purely from posed 2D images, without 3D supervision.
As described herein, these properties give rise to further advantages, including allowing the model to be applied to data of interest in situations where 3D geometry is difficult to obtain. In contrast to other approaches, the disclosed machine learning model does not require pose information at test time. Moreover, operating the disclosed machine learning model is fast: inferring a scene representation corresponds to a single forward pass of the neural network. This is in contrast to other methods that require expensive optimization for each newly observed image at inference time.
In the disclosed systems and methods, rendering is also performed in a single forward pass, making it faster than other methods, which typically require recurrence to produce an image.
In operational scenarios where the training data is sparse (e.g., the number of views per scene is small), a new view synthesis model may exhibit a tendency to "snap" to a fixed view rather than rotate smoothly around the scene. The disclosed systems and methods contemplate additional training data and/or training operations to reduce this type of undesirable behavior.
In the disclosed systems and methods, equivariance is described in various examples as being enforced during training with respect to 3D rotation. However, real scenes have other symmetries, such as translation (panning) and scale (zooming). It should be understood that translation and scale equivariance may also be applied as constraints during model training.
Furthermore, while scene representations are sometimes described as being used to render images, additional structure may be imposed on the underlying latent space to make the representations more interpretable or even editable. Additionally, it should be understood that adding inductive biases from the rendering process, such as explicitly handling occlusion, has been shown to improve the performance of other models and may also be applied to the disclosed models. It should also be appreciated that the learned scene representation may be used to generate a 3D reconstruction. In fact, most existing 3D reconstruction methods are object-centric (i.e., each object is reconstructed at the same orientation), which has been shown to make such models effectively perform shape classification rather than reconstruction. Since the disclosed scene representation is view-centric, it may be used to generate view-centric 3D reconstructions.
In the disclosed systems and methods, a machine learning model is provided that learns a scene representation by ensuring that the representation transforms like a real 3D scene.
It should also be appreciated that various examples are discussed herein in which the machine learning model is deterministic, while inferring a scene from an image is an inherently uncertain process. In fact, for a given image, there are several possible scenes that could have generated it, and similarly, several different scenes may be rendered to the same image. In some implementations, the disclosed systems and methods can be used to train a model to learn a distribution over scenes, P(scene | image).
Further, in some examples described herein, each pair of views in the training images is treated equally during training. However, views that are close to each other should be easier to reconstruct, while views that are far from each other may not be accurately reconstructed due to the inherent uncertainty caused by occlusion. The training operations described herein may be modified, for example, to reflect this by weighting the reconstruction loss by how far apart the views are from each other. It should also be appreciated that, in some examples described herein, pairs of views of training objects are provided to train the machine learning model. However, in some implementations, the machine learning model may be provided with a greater number of views of the training objects, which would also reduce the entropy of the P(scene | image) distribution and strengthen the training process.
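For illustration only, the following sketch shows one way the pairwise swap-and-render objective discussed above could be written, with an optional weight that could be made to decrease with the angular separation between views. The callables inverse_renderer, forward_renderer, and rotate, and the argument theta_ab, are placeholders rather than the disclosed model's API.

```python
import torch

def swap_and_render_loss(inverse_renderer, forward_renderer, rotate,
                         img_a, img_b, theta_ab, weight=1.0):
    """One training objective in the spirit of the pairwise procedure described above."""
    z_a = inverse_renderer(img_a)          # implicit scene representation from view A
    z_b = inverse_renderer(img_b)          # implicit scene representation from view B
    z_a_to_b = rotate(z_a, theta_ab)       # rotate A's scene into B's viewpoint
    z_b_to_a = rotate(z_b, -theta_ab)      # rotate B's scene into A's viewpoint
    loss_img = ((forward_renderer(z_a_to_b) - img_b) ** 2).mean() \
             + ((forward_renderer(z_b_to_a) - img_a) ** 2).mean()
    # Optional scene-space term, mirroring the comparison of rotated implicit representations.
    loss_scene = ((z_a_to_b - z_b) ** 2).mean() + ((z_b_to_a - z_a) ** 2).mean()
    return weight * (loss_img + loss_scene)   # weight may decay with view separation
```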
According to aspects of the present disclosure, systems and methods are provided for learning a scene representation by ensuring that the scene representation transforms like a real 3D scene. To facilitate this learning, the model may include reversible shear rotations that allow the model to transform scene representations while remaining trainable by gradient descent or the like. The disclosed machine learning model may be trained without 3D supervision, using only posed 2D images. According to aspects of the present disclosure, systems and methods are provided for inferring scene representations directly from images using a single forward pass of an inverse renderer. With the disclosed techniques, the learned scene representation can be easily manipulated and rendered to produce new perspectives of the scene.
Three challenging new datasets for neural rendering and scene representation are also provided. The disclosed systems and methods have been shown to work well on these data sets as well as on the standard ShapeNet task.
According to an aspect of the present disclosure, there is provided a method comprising providing an input image depicting a view of an object to a machine learning model, the machine learning model having been trained based on an equivariance constraint under rotation between a training object and a model-generated representation of the training object; and generating, using the machine learning model and based on the provided image, at least one of an output image depicting the object from a rotated view that is different from the view of the object in the image, or a three-dimensional representation of the object.
According to an aspect of the present disclosure, there is provided a system comprising a processor; and a memory device containing instructions that, when executed by the processor, cause the processor to: provide an input image depicting a view of an object to a machine learning model, the machine learning model having been trained based on an equivariance constraint under rotation between a training object and a model-generated representation of the training object; and generate, using the machine learning model and based on the provided image, at least one of an output image depicting the object from a rotated view that is different from the view of the object in the image, or a three-dimensional representation of the object.
According to an aspect of the disclosure, there is provided a non-transitory machine-readable medium comprising code that, when executed by a processor, causes the processor to: providing an input image depicting a view of an object to a machine learning model, the machine learning model having been trained based on at least two training images depicting different views of a training object; and generating, using the machine learning model and based on the provided images, at least one of an output image depicting the object from a rotated view that is different from a view of the object in the image, or a three-dimensional representation of the object.
According to aspects of the disclosure, there is provided a non-transitory machine-readable medium comprising code that, when executed by a processor, causes the processor to: provide an input image depicting a view of an object to a machine learning model, the machine learning model having been trained based on an equivariance constraint under rotation between a training object and a model-generated representation of the training object; and generate, using the machine learning model and based on the provided image, at least one of an output image depicting the object from a rotated view that is different from the view of the object in the image, or a three-dimensional representation of the object.
Implementations within the scope of the present disclosure may be realized, in part or in whole, by a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) having one or more instructions written thereon. The tangible computer readable storage medium may also be non-transitory in nature.
A computer-readable storage medium may be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device and that includes any processing electronics and/or processing circuitry capable of executing instructions. For example, without limitation, the computer-readable medium may include any volatile semiconductor memory, such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM. The computer readable medium may also include any non-volatile semiconductor memory, such as ROM, PROM, EPROM, EEPROM, NVRAM, flash memory, nvSRAM, FeRAM, FeTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack memory, FJG, and Millipede memory.
Further, the computer-readable storage medium may include any non-semiconductor memory, such as optical disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions. In one or more implementations, the tangible computer-readable storage medium may be directly coupled to the computing device, while in other implementations, the tangible computer-readable storage medium may be indirectly coupled to the computing device, e.g., via one or more wired connections, one or more wireless connections, or any combination thereof.
The instructions may be directly executable or may be used to develop executable instructions. For example, the instructions may be implemented as executable or non-executable machine code, or may be implemented as high-level language instructions that may be compiled to produce executable or non-executable machine code. Further, instructions may also be implemented as, or may include, data. Computer-executable instructions may also be organized in any format, including routines, subroutines, programs, data structures, objects, modules, applications, applets, functions, and the like. As those skilled in the art will recognize, details including, but not limited to, number, structure, sequence, and organization of instructions may vary significantly without changing the underlying logic, function, processing, and output.
Although the above discussion has primarily referred to microprocessor or multi-core processors executing software, one or more implementations are performed by one or more integrated circuits, such as ASICs or FPGAs. In one or more implementations, such integrated circuits execute instructions stored on the circuit itself.
Those skilled in the art will appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. The various components and blocks may be arranged differently (e.g., arranged in a different order, or divided in a different manner) without departing from the scope of the subject technology.
It is to be understood that the specific order or hierarchy of blocks in the processes disclosed is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes may be rearranged or that all illustrated blocks may be performed. Any of these blocks may be performed simultaneously. In one or more implementations, multitasking and parallel processing may be advantageous. Moreover, the division of various system components in the implementations described above should not be understood as requiring such division in all implementations, and it should be understood that program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
As used in this specification and any claims of this patent application, the terms "base station," "receiver," "computer," "server," "processor," and "memory" all refer to electronic or other technical devices. These terms exclude a person or group of persons. For the purposes of this specification, the term "display" or "displaying" means displaying on an electronic device.
As used herein, the phrase "at least one of" preceding a series of items, with the term "and" or "or" separating any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item). The phrase "at least one of" does not require the selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. For example, the phrases "at least one of A, B, and C" or "at least one of A, B, or C" each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
The predicate words "configured to", "operable to", and "programmed to" do not imply any particular tangible or intangible modification to a certain subject but are intended to be used interchangeably. In one or more implementations, a processor configured to monitor and control operations or components may also mean that the processor is programmed to monitor and control operations or that the processor is operable to monitor and control operations. Also, a processor configured to execute code may be interpreted as a processor that is programmed to execute code or that is operable to execute code.
Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, a specific implementation, the specific implementation, another specific implementation, some specific implementation, one or more specific implementations, embodiments, the embodiment, another embodiment, some embodiments, one or more embodiments, configurations, the configuration, other configurations, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof, and the like are for convenience and do not imply that a disclosure relating to such phrase or phrases is essential to the subject technology, nor that such disclosure applies to all configurations of the subject technology. Disclosure relating to such one or more phrases may apply to all configurations or one or more configurations. Disclosure relating to such one or more phrases may provide one or more examples. Phrases such as an aspect or some aspects may refer to one or more aspects and vice versa and this applies similarly to the other preceding phrases.
The word "exemplary" is used herein to mean "serving as an example, instance, or illustration. Any embodiment described herein as "exemplary" or as "exemplary" is not necessarily to be construed as preferred or advantageous over other implementations. Furthermore, to the extent that the terms "includes," "has," "having," "has," "with," "has," "having," "contains," "containing," "contain" within a certain extent to be able to contain or contain said.
All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element should be construed in accordance with the provisions of 35 U.S.C. § 112(f), unless the element is explicitly recited using the phrase "means for."
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more." The term "some" refers to one or more unless specifically stated otherwise. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the subject disclosure.

Claims (22)

1. A method, the method comprising:
providing an input image depicting a view of an object to a machine learning model, the machine learning model having been trained based on an equivariance constraint under rotation between a training object and a model-generated representation of the training object; and
generating, using the machine learning model and based on the provided input images, at least one of an output image depicting the object from a rotated view that is different from the view of the object in the input images, or a three-dimensional representation of the object.
2. The method of claim 1, wherein the machine learning model comprises:
an inverse renderer; and
a forward renderer.
3. The method of claim 2, wherein generating the at least one of the output image depicting the object or the three-dimensional representation of the object from the rotated view that is different from the view of the object in the input image comprises utilizing the forward renderer to generate the at least one of the output image depicting the object or the three-dimensional representation of the object from the rotated view that is different from the view of the object in the input image.
4. The method of claim 3, further comprising generating an implicit representation of the object with the inverse renderer based on the input image.
5. The method of claim 4, wherein the forward renderer generates the at least one of the output image depicting the object or the three-dimensional representation of the object from the rotated view that is different from the view of the object in the input image based on the implicit representation generated by the inverse renderer.
6. The method of claim 5, wherein generating, based on the implicit representation, the at least one of the output image depicting the object or the three-dimensional representation of the object from the rotated view that is different from the view of the object in the input image comprises rotating the implicit representation of the object.
7. The method of claim 6, wherein rotating the implicit representation of the object comprises performing a shear rotation of the implicit representation of the object.
8. The method of claim 7, wherein the three-dimensional representation comprises an explicit three-dimensional representation comprising at least one of a voxel grid, a mesh, or a point cloud.
9. The method of claim 7, wherein the implicit representation of the object comprises a tensor or latent space of an auto-encoder.
10. The method of claim 4, wherein generating the implicit representation of the object with the reverse renderer based on the input image comprises generating the implicit representation in a single forward pass of the reverse renderer.
11. The method of claim 1, further comprising training the machine learning model based on the equivariance constraint under rotation between the training object and the model-generated representation of the training object by:
providing a first input training image to the machine learning model, the first input training image depicting a first view of the training object;
providing a second input training image to the machine learning model, the second input training image depicting a second view of the training object;
generating a first implicit representation of the training object based on the first input training image;
generating a second implicit representation of the training object based on the second input training image;
rotating the first implicit representation of the training object;
rotating the second implicit representation of the training object;
generating a first output training image based on the rotated first implicit representation of the object;
generating a second output training image based on the rotated second implicit representation of the object;
comparing the first input training image with the second output training image; and
comparing the second input training image to the first output training image.
12. The method of claim 11, wherein the training further comprises minimizing a loss function based on the comparison of the first input training image to the second output training image and the comparison of the second input training image to the first output training image.
13. The method of claim 12, further comprising:
comparing the first implicit representation to the rotated second implicit representation; and
comparing the second implicit representation to the rotated first implicit representation.
14. The method of claim 13, wherein the loss function is further based on the comparison of the first implicit representation to the rotated second implicit representation and the comparison of the second implicit representation to the rotated first implicit representation.
15. The method of claim 1, further comprising training the machine learning model based on at least two input training images without three-dimensional supervision of the training.
16. The method of claim 15, further comprising testing the trained machine learning model without providing pose information to the trained machine learning model.
17. A system, the system comprising:
a processor;
a memory device including instructions that, when executed by the processor, cause the processor to:
providing an input image depicting a view of an object to a machine learning model, the machine learning model having been trained based on an equivariance constraint under rotation between a training object and a model-generated representation of the training object; and
generating, using the machine learning model and based on the provided input images, at least one of an output image depicting the object from a rotated view that is different from the view of the object in the input images, or a three-dimensional representation of the object.
18. The system of claim 17, wherein the machine learning model comprises an inverse renderer, a forward renderer, and a shear rotation module.
19. The system of claim 18, wherein a model architecture of the machine learning model, including the shear rotation module, is fully differentiable.
20. A non-transitory machine-readable medium comprising code that, when executed by a processor, causes the processor to:
providing an input image depicting a view of an object to a machine learning model, the machine learning model having been trained based on at least two training images depicting different views of training objects; and
generating, using the machine learning model and based on the provided input images, at least one of an output image depicting the object from a rotated view that is different from the view of the object in the input images, or a three-dimensional representation of the object.
21. The non-transitory machine-readable medium of claim 20, wherein the machine learning model comprises an inverse renderer, a forward renderer, and a shear rotation module.
22. The non-transitory machine-readable medium of claim 20, wherein the machine learning model has been trained based on the at least two training images without three-dimensional supervision.
CN202110156872.7A 2020-02-06 2021-02-04 Neural rendering Pending CN113222137A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US202062971198P 2020-02-06 2020-02-06
US62/971,198 2020-02-06
US202063018434P 2020-04-30 2020-04-30
US63/018,434 2020-04-30
US17/145,232 US11967015B2 (en) 2020-02-06 2021-01-08 Neural rendering
US17/145,232 2021-01-08

Publications (1)

Publication Number Publication Date
CN113222137A true CN113222137A (en) 2021-08-06

Family

ID=74562021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110156872.7A Pending CN113222137A (en) 2020-02-06 2021-02-04 Neural rendering

Country Status (2)

Country Link
EP (1) EP4085427A2 (en)
CN (1) CN113222137A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130156297A1 (en) * 2011-12-15 2013-06-20 Microsoft Corporation Learning Image Processing Tasks from Scene Reconstructions
CN102930302A (en) * 2012-10-18 2013-02-13 山东大学 On-line sequential extreme learning machine-based incremental human behavior recognition method
WO2017142397A1 (en) * 2016-02-19 2017-08-24 Scyfer B.V. Device and method for generating a group equivariant convolutional neural network
US20180150726A1 (en) * 2016-11-29 2018-05-31 Google Inc. Training and/or using neural network models to generate intermediary output of a spectral image
CN108230277A (en) * 2018-02-09 2018-06-29 中国人民解放军战略支援部队信息工程大学 A kind of dual intensity CT picture breakdown methods based on convolutional neural networks
CN109191255A (en) * 2018-09-04 2019-01-11 中山大学 A kind of commodity alignment schemes based on the detection of unsupervised characteristic point

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
BENJAMIN CHIDESTER et al.: "Rotation equivariant and invariant neural networks for microscopy image analysis", Bioinformatics, vol. 35, no. 14, pages 530
DANIEL E. WORRALL et al.: "Interpretable Transformations with Encoder-Decoder Networks", arXiv:1710.07307v1, pages 1-11
KYLE OLSZEWSKI et al.: "Transformable Bottleneck Networks", arXiv:1904.06458v5, pages 1-15
MAURICE WEILER et al.: "3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data", arXiv:1807.02547v2, pages 1-17
ZHANG Jianguo: "Research on Soft Tissue Deformation Modeling Methods in Virtual Surgery", China Doctoral Dissertations Full-text Database, Medicine and Health Sciences, pages 066-1
LI Haisheng et al.: "A Survey of Deep-Learning-Based Methods for 3D Data Analysis and Understanding", Chinese Journal of Computers, vol. 43, no. 01, pages 41-63

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119849A (en) * 2022-01-24 2022-03-01 阿里巴巴(中国)有限公司 Three-dimensional scene rendering method, device and storage medium

Also Published As

Publication number Publication date
EP4085427A2 (en) 2022-11-09

Similar Documents

Publication Publication Date Title
US11967015B2 (en) Neural rendering
CN111386536B (en) Method and system for semantically consistent image style conversion
Nalbach et al. Deep shading: convolutional neural networks for screen space shading
US9916679B2 (en) Deepstereo: learning to predict new views from real world imagery
US11823322B2 (en) Utilizing voxel feature transformations for view synthesis
US20220239844A1 (en) Neural 3D Video Synthesis
US11055910B1 (en) Method and system for generating models from multiple views
US20220292781A1 (en) Generative scene networks
WO2022164895A2 (en) Neural 3d video synthesis
EP3857457A1 (en) Neural network systems for decomposing video data into layered representations
US11403807B2 (en) Learning hybrid (surface-based and volume-based) shape representation
US20220245910A1 (en) Mixture of volumetric primitives for efficient neural rendering
TW202338740A (en) Explicit radiance field reconstruction from scratch
CN113222137A (en) Neural rendering
WO2021151380A1 (en) Method for rendering virtual object based on illumination estimation, method for training neural network, and related products
CN116824092A (en) Three-dimensional model generation method, three-dimensional model generation device, computer equipment and storage medium
CN116797768A (en) Method and device for reducing reality of panoramic image
Pegoraro Handbook of Digital Image Synthesis: Scientific Foundations of Rendering
KR20180136707A (en) Apparatus and method for volume rendering
Li et al. Semantic volume texture for virtual city building model visualisation
Kunert et al. Neural network adaption for depth sensor replication
CN117541703B (en) Data rendering method, device, equipment and computer readable storage medium
Chen Data visualization and virtual reality
US11615512B2 (en) Relighting system for single images
US20230360327A1 (en) Generating three-dimensional representations for digital objects utilizing mesh-based thin volumes

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination