CN112771539B - Employing three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications

Employing three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications

Info

Publication number
CN112771539B
CN112771539B
Authority
CN
China
Prior art keywords
data
image
images
dimensional
model
Prior art date
Legal status
Active
Application number
CN201980062890.XA
Other languages
Chinese (zh)
Other versions
CN112771539A (en
Inventor
D. A. Gausebeck
M. T. Bell
W. K. Abdulla
P. K. Hahn
Current Assignee
Matterport, Inc.
Original Assignee
Matterport, Inc.
Priority date
Filing date
Publication date
Priority claimed from US16/141,558 (US11094137B2)
Application filed by Matterport, Inc.
Publication of CN112771539A
Application granted
Publication of CN112771539B
Status: Active


Classifications

    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06T 7/50: Depth or shape recovery
    • G06T 7/593: Depth or shape recovery from multiple images, from stereo images
    • G06N 7/01: Probabilistic graphical models, e.g. probabilistic networks
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The disclosed subject matter relates to employing machine learning models that use deep learning techniques to predict 3D data from 2D images, thereby deriving 3D data for those 2D images. In some embodiments, a system is described that includes a memory storing computer-executable components and a processor that executes the computer-executable components stored in the memory. The computer-executable components include: a receiving component configured to receive a two-dimensional image; and a three-dimensional data derivation component configured to derive three-dimensional data for the two-dimensional image using one or more three-dimensional-data-from-two-dimensional-data (3D-from-2D) neural network models.

Description

Employing three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications
Technical Field
The present application relates generally to techniques for employing three-dimensional (3D) data predicted from two-dimensional (2D) images using neural networks for 3D modeling applications and other applications.
Background
Interactive, first-person 3D immersive environments are becoming increasingly popular. In these environments, a user is able to navigate through a virtual space. Examples of such environments include first-person video games and tools for visualizing 3D models of terrain. Aerial navigation tools allow users to virtually explore metropolitan areas in three dimensions from an aerial viewpoint. Panoramic navigation tools (e.g., street views) allow users to view multiple 360-degree (360°) panoramas of an environment and to navigate between those panoramas with transitions rendered through visual blending and interpolation.
Such interactive 3D immersive environments can be generated from a real-world environment based on photorealistic 2D images captured of that environment, using 3D depth information associated with the respective 2D images. Although methods for capturing 3D depth data along with 2D images have existed for over a decade, such methods have traditionally been expensive and have required complex 3D capture hardware, such as light detection and ranging (LiDAR) devices, laser rangefinder devices, time-of-flight sensor devices, structured light sensor devices, light-field cameras, and the like. Furthermore, current alignment software remains limited in functionality and ease of use. For example, existing alignment methods such as the iterative closest point (ICP) algorithm require the user to manually provide an initial coarse alignment. Such manual input is typically beyond the ability of most non-technical users and inhibits real-time alignment of the captured images. Thus, there is a strong need for techniques that generate 3D data for 2D images using affordable, user-friendly devices, and for techniques that accurately and efficiently align the 2D images using that 3D data to generate an immersive 3D environment.
Drawings
Fig. 1 presents an example system that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data in accordance with various aspects and embodiments described herein.
Fig. 2 presents an exemplary illustration of a reconstruction environment that may be generated based on 3D data derived from 2D image data in accordance with various aspects and embodiments described herein.
Fig. 3 presents another exemplary reconstruction environment that may be generated based on 3D data derived from 2D image data in accordance with various aspects and embodiments described herein.
Fig. 4 presents another exemplary reconstruction environment that may be generated based on 3D data derived from 2D image data in accordance with various aspects and embodiments described herein.
Fig. 5 presents another example system that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data in accordance with various aspects and embodiments described herein.
Fig. 6 presents an exemplary computer-implemented method for deriving 3D data from panoramic 2D image data in accordance with various aspects and embodiments described herein.
Fig. 7 presents an exemplary computer-implemented method for deriving 3D data from panoramic 2D image data in accordance with various aspects and embodiments described herein.
Fig. 8 presents another example system that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data in accordance with various aspects and embodiments described herein.
Fig. 9 presents an example auxiliary data component that facilitates employing auxiliary data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data and generating a reconstructed 3D model based on the 3D data and the captured 2D image data in accordance with various aspects and embodiments described herein.
Fig. 10 presents an exemplary computer-implemented method for employing auxiliary data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein.
Fig. 11 presents an exemplary computer-implemented method for employing auxiliary data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein.
Fig. 12 presents an exemplary computer-implemented method for employing auxiliary data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein.
Fig. 13 presents another example system that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data in accordance with various aspects and embodiments described herein.
Fig. 14-25 present example devices and/or systems that facilitate capturing 2D images of an object or environment and deriving 3D/depth data from the images using one or more 3D-from-2D techniques, according to various aspects and embodiments described herein.
Fig. 26 presents an exemplary computer-implemented method that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein.
Fig. 27 presents another exemplary computer-implemented method that facilitates capturing 2D image data and deriving 3D data from 2D image data in accordance with various aspects and embodiments described herein.
Fig. 28 presents another exemplary computer-implemented method that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein.
Fig. 29 presents another exemplary computer-implemented method that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein.
Fig. 30 presents an example system that facilitates using one or more 3D-from-2D techniques in association with an augmented reality (AR) application in accordance with various aspects and embodiments described herein.
Fig. 31 presents an exemplary computer-implemented method for using one or more 3D-from-2D techniques in association with an AR application in accordance with various aspects and embodiments described herein.
FIG. 32 presents an exemplary computing device that employs one or more 3D-from-2D techniques in association with object tracking, real-time navigation, and 3D-feature-based security applications in accordance with various aspects and embodiments described herein.
FIG. 33 presents an exemplary system for developing and training 3D-from-2D models in accordance with various aspects and embodiments described herein.
FIG. 34 presents an exemplary computer-implemented method for developing and training a 3D-from-2D model in accordance with various aspects and embodiments described herein.
FIG. 35 is a schematic block diagram illustrating a suitable operating environment in accordance with various aspects and embodiments.
FIG. 36 is a schematic block diagram of a sample-computing environment in accordance with various aspects and embodiments.
Detailed Description
By way of introduction, the present disclosure relates to systems, methods, apparatuses, and computer-readable media that provide techniques for deriving 3D data from 2D images using one or more machine learning models, and for using that 3D data in 3D modeling applications and other applications. Various techniques for predicting 3D data (e.g., depth data or the relative 3D positions of image pixels) from a single 2D image (color or grayscale) using machine learning (referred to herein as "3D-from-2D prediction" or simply "3D-from-2D") have been developed and have recently received increasing attention. Over the last decade, the research community has made tremendous efforts to improve the performance of monocular depth estimation, and significant accuracy has been achieved thanks to the rapid development and advancement of deep neural networks.
The disclosed subject matter relates to employing one or more machine learning models configured to predict 3D data from 2D images using deep learning techniques (including one or more neural network models) to derive 3D data for 2D images. In various embodiments, the predicted depth data may be used to generate a 3D model of the environment captured in the 2D image data. Other applications include employing the predicted depth data to facilitate augmented reality applications, real-time object tracking, real-time navigation of an environment, biometric authentication applications based on a user's face, and the like. The various elements described in connection with the disclosed techniques may be embodied in a computer-implemented system or apparatus and/or in a different form, such as a computer-implemented method or a computer program product (and vice versa).
In one embodiment, a method is provided for generating accurate 3D-from-2D depth predictions using panoramic image data. The method may include receiving, by a system including a processor, a panoramic image, and employing, by the system, a 3D-from-2D convolutional neural network model to derive 3D data from the panoramic image, wherein the 3D-from-2D convolutional neural network model employs convolutional layers that wrap around the panoramic image as projected onto a 2D plane to facilitate deriving the three-dimensional data. According to this method, by wrapping around the projected panoramic image, the convolutional layers minimize or eliminate edge effects associated with deriving the 3D data. In some implementations, the panoramic image may be received already projected onto a two-dimensional plane. In other implementations, the panoramic image may be received as a spherical or cylindrical panoramic image, and the method further comprises projecting, by the system, the spherical or cylindrical panoramic image onto a 2D plane prior to employing the 3D-from-2D convolutional neural network model to derive the 3D data.
In one or more implementations, the 3D-from-2D neural network model may include a model trained based on weighting values applied to respective pixels of the projected panoramic image in relation to deriving depth data for those pixels, wherein the weighting values vary based on the angular areas of the respective pixels. For example, during training, the weighting value decreases as the angular area of the corresponding pixel decreases. Further, in some implementations, a downstream convolutional layer that follows a previous layer is configured to re-project the portion of the panoramic image processed by the previous layer in association with deriving depth data for the panoramic image, thereby producing a re-projected version of the panoramic image for each downstream convolutional layer. In this regard, the downstream convolutional layer is further configured to employ input data from the previous layer by extracting that input data from the re-projected version of the panoramic image. For example, in one implementation, the input data may be extracted from the re-projected version of the panoramic image based on locations in the corresponding portion of the panoramic image determined according to a defined angular receptive field.
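As a non-limiting illustration of the wrap-around convolution and the angular-area weighting described above, the following sketch (assuming PyTorch and an equirectangular projection; the layer and weighting choices are illustrative and are not the claimed architecture) pads the panorama circularly in the horizontal direction so the convolution sees across the seam, and weights the training loss by the cosine of each row's latitude so that pixels with smaller angular area contribute less:

    import math

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WrapConv2d(nn.Module):
        """3x3 convolution whose receptive field wraps across the panorama seam."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=0)

        def forward(self, x):
            # Wrap horizontally: the left and right edges of an equirectangular
            # image are adjacent on the sphere, so circular padding removes the seam.
            x = F.pad(x, (1, 1, 0, 0), mode='circular')
            # The poles do not wrap vertically, so replicate the top/bottom rows.
            x = F.pad(x, (0, 0, 1, 1), mode='replicate')
            return self.conv(x)

    def angular_area_weights(height, width):
        """Per-pixel weights proportional to the solid angle of each pixel."""
        lat = torch.linspace(math.pi / 2, -math.pi / 2, height)   # row latitudes
        w = torch.cos(lat).clamp(min=1e-3)                        # small near poles
        return w.view(1, 1, height, 1).expand(1, 1, height, width)

    def weighted_depth_loss(pred, target):
        """L1 depth loss weighted by angular pixel area (smaller area, lower weight)."""
        w = angular_area_weights(pred.shape[-2], pred.shape[-1]).to(pred.device)
        per_image = (w * (pred - target).abs()).sum(dim=(1, 2, 3)) / w.sum()
        return per_image.mean()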
In another embodiment, a method for generating accurate 3D-from-2D depth predictions using panoramic image data is provided, which may include receiving, by a system operably coupled to a processor, a request for depth data associated with an area of an environment depicted in a panoramic image. The method may further include, in response to the receiving, deriving, by the system, depth data for the entire panoramic image using a neural network model configured to derive depth data from a single 2D image. The method may also include extracting, by the system, the portion of the depth data corresponding to the requested area of the environment, and providing, by the system, that portion of the depth data to an entity associated with the request.
Other embodiments of the disclosed subject matter provide techniques for optimizing 3D-from-2D depth prediction by using augmented input data in addition to a single 2D image as input to a 3D-from-2D neural network model and/or by using two or more images as input to a 3D-from-2D neural network model. For example, in one embodiment, a method is provided that includes receiving, by a system operatively coupled to a processor, a 2D image, and determining, by the system, auxiliary data for the 2D image, wherein the auxiliary data includes orientation information regarding a capture orientation of the 2D image. The method may also include deriving, by the system, 3D information for the 2D image using one or more neural network models configured to infer the 3D information based on the 2D image and the auxiliary data. In some implementations, the orientation information may be determined based on inertial measurement data associated with the 2D image, generated by an inertial measurement unit (IMU) in association with capture of the 2D image.
The auxiliary data may further comprise location information about the capture location of the 2D image, wherein determining the auxiliary data comprises identifying the location information in metadata associated with the 2D image. The auxiliary data may also include one or more image capture parameters associated with capture of the 2D image, wherein determining the auxiliary data includes extracting the one or more image capture parameters from metadata associated with the 2D image. For example, the one or more image capture parameters may include one or more camera settings of the camera used to capture the 2D image. In another example, the one or more image capture parameters are selected from a group consisting of camera lens parameters, illumination parameters, and color parameters.
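As one non-limiting way to condition a 3D-from-2D network on such auxiliary data, orientation readings can be broadcast into extra input channels alongside the RGB image. The sketch below assumes PyTorch and scalar pitch/roll values (e.g., reported by an IMU); it illustrates the general idea rather than the specific networks of the disclosed embodiments:

    import torch
    import torch.nn as nn

    class AuxConditionedDepthNet(nn.Module):
        """Toy 3D-from-2D network conditioned on capture orientation.

        Pitch/roll angles are broadcast into two constant-valued planes and
        concatenated with the RGB channels, so the network can exploit the
        gravity direction when inferring depth.
        """
        def __init__(self):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3 + 2, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 1, 3, padding=1),   # one depth value per pixel
            )

        def forward(self, rgb, pitch, roll):
            b, _, h, w = rgb.shape
            # Broadcast the scalar orientation readings to image-sized planes.
            pitch_plane = pitch.view(b, 1, 1, 1).expand(b, 1, h, w)
            roll_plane = roll.view(b, 1, 1, 1).expand(b, 1, h, w)
            x = torch.cat([rgb, pitch_plane, roll_plane], dim=1)
            return self.backbone(x)

    # Example: a 256x512 image captured with the camera tilted 10 degrees down.
    net = AuxConditionedDepthNet()
    rgb = torch.rand(1, 3, 256, 512)
    depth = net(rgb, pitch=torch.tensor([-10.0]), roll=torch.tensor([0.0]))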
In some implementations, the 2D image comprises a first 2D image, and the method further includes receiving, by the system, one or more second 2D images related to the first 2D image, and determining, by the system, the auxiliary data based on the one or more second 2D images. For example, the auxiliary data may comprise a capture location of the first 2D image, wherein determining the auxiliary data comprises determining the capture location based on the one or more second 2D images. In another example, the first 2D image and the one or more second 2D images are captured in association with movement of the capture device to different locations relative to the environment, and determining the auxiliary data includes employing at least one of a photogrammetry algorithm, a simultaneous localization and mapping (SLAM) algorithm, or a structure-from-motion algorithm. In another example, the first 2D image and a second 2D image of the one or more second 2D images form a stereoscopic image pair, wherein the auxiliary data comprises depth data for the first 2D image, and determining the auxiliary data comprises determining the depth data from the stereoscopic image pair using a passive stereo function.
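For the stereoscopic case, one example of a passive stereo function is OpenCV's semi-global block matching, sketched below under the assumption of a rectified image pair; the focal length and baseline values are placeholders:

    import cv2
    import numpy as np

    # Assumes a rectified stereo pair; file names are placeholders.
    left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
    right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,      # must be a multiple of 16
        blockSize=5,
    )
    # StereoSGBM returns fixed-point disparities scaled by 16.
    disparity = matcher.compute(left, right).astype(np.float32) / 16.0

    focal_px = 700.0             # focal length in pixels (placeholder)
    baseline_m = 0.12            # camera separation in meters (placeholder)
    valid = disparity > 0
    depth = np.zeros_like(disparity)
    depth[valid] = focal_px * baseline_m / disparity[valid]
    # 'depth' can then serve as auxiliary input alongside the first 2D image.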
The method may further include receiving, by the system, depth information for the 2D image captured by a 3D sensor in association with capture of the 2D image, wherein the deriving includes deriving the 3D information using a neural network model, of the one or more neural network models, configured to infer the 3D information based on the 2D image and the depth information. Additionally, in some implementations, the auxiliary data includes one or more semantic labels for one or more objects depicted in the 2D image, and determining the auxiliary data includes determining, by the system, the semantic labels using one or more machine learning algorithms.
In still other implementations, the 2D image comprises a first 2D image, and the auxiliary data includes one or more second 2D images that are related to the first 2D image based on comprising image data depicting the same object or environment as the first 2D image from a different perspective. For example, the first 2D image and the one or more second 2D images may include partially overlapping fields of view of the object or environment. According to these implementations, the auxiliary data may further include information regarding one or more relationships between the first 2D image and the one or more second 2D images, and determining the auxiliary data includes determining relationship information comprising at least one of a relative capture location, a relative capture orientation, or a relative capture time of the first 2D image and the one or more second 2D images.
In another embodiment, a method is provided that includes receiving, by a system operatively coupled to a processor, related 2D images captured of an object or environment, wherein the 2D images are related based on providing different perspectives of the object or environment. The method may further include deriving, by the system, depth information for at least one of the related 2D images using one or more neural network models, with the related 2D images serving as inputs to the one or more neural network models. In some implementations, the method further includes determining, by the system, relationship information regarding one or more relationships between the related 2D images, wherein the deriving further includes using the relationship information as an additional input to the one or more neural network models. For example, the relationship information may include the relative capture locations of the related 2D images. In another example, the relationship information may include the relative capture orientations of the related 2D images. In another example, the relationship information includes the relative capture times of the related 2D images.
In other embodiments, a system includes a memory storing computer-executable components and a processor that executes the computer-executable components stored in the memory. The computer-executable components may include: a receiving component that receives a 2D image; and a preprocessing component that alters one or more characteristics of the 2D image to convert the image into a preprocessed image conforming to a standard representation format. The computer-executable components may also include a depth derivation component that derives 3D information for the preprocessed 2D image using one or more neural network models configured to infer the 3D information based on the preprocessed 2D image.
In some implementations, the preprocessing component alters the one or more characteristics based on one or more image capture parameters associated with capture of the 2D image. The preprocessing component can also extract the one or more image capture parameters from metadata associated with the 2D image. The one or more image capture parameters may include, for example, one or more camera settings of the camera used to capture the 2D image. For example, the one or more image capture parameters may be selected from a group consisting of camera lens parameters, illumination parameters, and color parameters. In some implementations, the one or more characteristics may include one or more visual characteristics of the 2D image, and the preprocessing component alters the one or more characteristics based on differences between those characteristics and one or more defined image characteristics of the standard representation format.
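A minimal sketch of the kind of normalization such a preprocessing component might perform is shown below; the target resolution, exposure handling, and normalization statistics are illustrative assumptions, not the standard representation format itself:

    import numpy as np
    from PIL import Image

    TARGET_SIZE = (512, 384)       # (width, height) expected by the model (assumed)

    def preprocess(path, exif_exposure_bias=0.0):
        """Convert an arbitrary capture into a standardized input image.

        Resizes to the model's expected resolution, forces RGB, and applies a
        crude exposure compensation based on a capture parameter pulled from
        metadata (passed in directly here for simplicity).
        """
        img = Image.open(path).convert('RGB').resize(TARGET_SIZE, Image.BILINEAR)
        x = np.asarray(img).astype(np.float32) / 255.0
        # Undo the camera's exposure bias so differently exposed captures
        # land in a comparable brightness range.
        x = np.clip(x * (2.0 ** -exif_exposure_bias), 0.0, 1.0)
        # Normalize to the statistics the network was trained with (assumed values).
        mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
        return (x - mean) / std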
Various additional embodiments relate to example devices and/or systems that facilitate capturing 2D images of an object or environment and deriving 3D/depth data from the images using one or more 3D-from-2D techniques in accordance with various aspects and embodiments described herein. Various arrangements of devices and/or systems are disclosed that include one or more cameras configured to capture 2D images, a 3D data derivation component configured to derive 3D data for the images, and a 3D modeling component configured to generate a 3D model of the environment included in the images. These arrangements include embodiments in which all of the components are disposed on a single device, embodiments in which the components are distributed between two devices, and embodiments in which the components are distributed among three devices.
For example, in one embodiment, there is provided a device comprising: a camera configured to capture 2D images; a memory storing computer-executable components; and a processor that executes the computer-executable components stored in the memory. The computer-executable components may include a 3D data derivation component configured to derive 3D data for the 2D images using one or more 3D-from-2D neural network models. In some implementations, the computer-executable components may also include a modeling component configured to align the 2D images based on the 3D data to generate a 3D model of the object or environment included in the 2D images. In other implementations, the computer-executable components may include a communication component configured to transmit the 2D images and the 3D data to an external device, wherein, based on receiving the 2D images and the 3D data, the external device generates a 3D model of the object or environment included in the 2D images by aligning the 2D images with one another based on the 3D data. With these implementations, the communication component may also be configured to receive the 3D model from the external device, and the device may render the 3D model via a display of the device.
In some implementations of this embodiment, the 2D images may include one or more images characterized as wide-field-of-view images based on having a field of view exceeding a minimum threshold. In another implementation, the computer-executable components may further include a stitching component configured to combine two or more first images of the 2D images to generate a second image having a field of view larger than the respective fields of view of the two or more first images, and the 3D data derivation component is configured to employ the one or more 3D-from-2D neural network models to derive at least some of the 3D data from the second image.
In some implementations of this embodiment, the device may further comprise, in addition to the camera, a 3D sensor configured to capture depth data for a portion of the 2D image, wherein the 3D data derivation component is further configured to use that depth data as an additional input to the one or more 3D-from-2D neural network models to derive the 3D data for the 2D image. For example, the 2D image may include a panoramic color image having a first vertical field of view, and the 3D sensor may include a structured light sensor configured to capture depth data over a second vertical field of view within the first vertical field of view, wherein the second vertical field of view is narrower than the first vertical field of view.
In another embodiment, a device is provided that comprises: a memory storing computer-executable components; and a processor that executes the computer-executable components stored in the memory. The computer-executable components include: a receiving component configured to receive 2D images from a 2D image capture device; and a 3D data derivation component configured to derive 3D data for the 2D images using one or more 3D-from-2D neural network models. In some implementations, the computer-executable components further include a modeling component configured to align the 2D images based on the 3D data to generate a 3D model of the object or environment included in the 2D images. The computer-executable components may also include a rendering component configured to facilitate rendering the 3D model via a display of the device (e.g., directly, using a web browser, using a web application, etc.). In some implementations, the computer-executable components may also include a navigation component configured to facilitate navigating the displayed 3D model. In one or more alternative implementations, the computer-executable components may include a communication component configured to transmit the 2D images and the 3D data to an external device, wherein, based on receiving the 2D images and the 3D data, the external device generates a 3D model of the object or environment included in the 2D images by aligning the 2D images with one another based on the 3D data. With these implementations, the communication component can receive the 3D model from the external device, and the computer-executable components further comprise a rendering component configured to render the 3D model via a display of the device. The external device may further facilitate navigating the 3D model (e.g., using a web browser, etc.) in association with accessing and rendering the 3D model.
In yet another embodiment, a device is provided that includes a memory storing computer-executable components, and a processor that executes the computer-executable components stored in the memory. The computer-executable components include a receiving component configured to receive 2D images of an object or environment captured from different perspectives of the object or environment, and a 3D data derivation component configured to derive depth data for respective ones of the 2D images using one or more 3D-from-2D neural network models. The computer-executable components also include a modeling component configured to align the 2D images with one another based on the depth data to generate a 3D model of the object or environment. In some implementations, the computer-executable components further include a communication component configured to send the 3D model, via a network, to a rendering device for display at the rendering device. With these implementations, the computer-executable components may also include a navigation component configured to facilitate navigating the 3D model displayed at the rendering device. In one or more alternative implementations, the computer-executable components may include a rendering component configured to facilitate rendering the 3D model via a display of the device. With this alternative implementation, the computer-executable components may also include a navigation component configured to facilitate navigating the 3D model displayed at the device.
In another embodiment, a method is provided that includes capturing, by a device comprising a processor, 2D images of an object or environment, and transmitting, by the device, the 2D images to a server device, wherein, upon receipt of the 2D images, the server device derives 3D data for the 2D images using one or more 3D-from-2D neural network models and generates a 3D reconstruction of the object or environment using the 2D images and the 3D data. The method further comprises receiving, by the device, the 3D reconstruction from the server device, and rendering, by the device, the 3D reconstruction via a display of the device.
In some implementations, the 2D images are captured from different perspectives of the object or environment in association with an image scan of the object or environment. With these implementations, the method may further include sending, by the device, a confirmation message confirming completion of the image scan. In addition, based on receipt of the confirmation message, the server device generates a final 3D reconstruction of the object or environment. For example, in some implementations, the final 3D reconstruction has a higher level of image quality relative to the initial 3D reconstruction. In another implementation, the final 3D reconstruction comprises a navigable model of the environment, whereas the initial 3D reconstruction is not navigable. In another implementation, the final 3D reconstruction is generated using a more accurate alignment process than that used to generate the initial 3D reconstruction.
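The capture/upload/confirm round trip described above can be sketched as follows; the endpoint URLs and payload fields are hypothetical placeholders, and the actual protocol between the device and the server device may differ:

    import requests

    SERVER = 'https://example.com/api'      # hypothetical reconstruction service

    def upload_scan(image_paths):
        """Send captured 2D images to the server and fetch the 3D reconstruction."""
        scan_id = requests.post(f'{SERVER}/scans').json()['scan_id']
        for path in image_paths:
            with open(path, 'rb') as f:
                # The server runs 3D-from-2D inference and incremental alignment
                # on each upload, maintaining a preliminary reconstruction.
                requests.post(f'{SERVER}/scans/{scan_id}/images', files={'image': f})
        # Confirm the scan is complete so the server can run the slower,
        # more accurate alignment pass for the final model.
        requests.post(f'{SERVER}/scans/{scan_id}/complete')
        return requests.get(f'{SERVER}/scans/{scan_id}/model').json()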
In various additional embodiments, systems and devices are disclosed that facilitate improving AR applications using 3D-from-2D processing techniques. For example, in one embodiment, a system is provided that includes: a memory storing computer-executable components; and a processor that executes the computer-executable components stored in the memory. The computer-executable components may include a 3D data derivation component configured to employ one or more 3D-from-2D neural network models to derive 3D data from one or more 2D images of an object or environment captured from the current perspective from which the object or environment is viewed on or through a display of a device. The computer-executable components may also include a spatial alignment component configured to determine, based on the current perspective and the 3D data, a position for integrating a (virtual) graphical data object on or within a representation of the object or environment viewed on or through the display. For example, the representation of the object or environment may include a live view of the environment viewed through a transparent display of the device. In another implementation, the representation of the object or environment may include one or more captured 2D images and/or video frames of the object or environment. In various implementations, the device may include one or more cameras that capture the one or more 2D images.
The computer-executable components may also include an integration component configured to integrate the graphical data object on or within the representation of the object or environment based on the determined position. In some implementations, the computer-executable components may also include an occlusion mapping component configured to determine, based on the current perspective and the 3D data, the position of the graphical data object relative to another object included in the representation of the object or environment. In this regard, based on determining that the graphical data object is positioned behind the other object, the integration component may be configured to occlude at least the portion of the graphical data object located behind the other object in association with integrating the graphical data object on or within the representation of the object or environment. Likewise, based on determining that the graphical data object is positioned in front of the other object, the integration component may be configured to occlude at least the portion of the other object located behind the graphical data object in association with integrating the graphical data object on or within the representation of the object or environment.
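The occlusion mapping described above can be illustrated with a simple per-pixel depth comparison: the virtual object is drawn only where its depth is smaller than the scene depth predicted by the 3D-from-2D model. The following NumPy sketch assumes the stated array shapes and is illustrative only:

    import numpy as np

    def composite_with_occlusion(frame, scene_depth, overlay_rgba, overlay_depth):
        """Blend a rendered virtual object into a camera frame with occlusion.

        frame:         (H, W, 3) camera image
        scene_depth:   (H, W) depth predicted by the 3D-from-2D model, meters
        overlay_rgba:  (H, W, 4) rendering of the virtual object (alpha = coverage)
        overlay_depth: (H, W) depth of the virtual object at each pixel, meters
                       (np.inf where the object is absent)
        """
        alpha = overlay_rgba[..., 3:4] / 255.0
        # The virtual object is visible only where it is closer than the real scene.
        visible = (overlay_depth < scene_depth)[..., None].astype(np.float32)
        alpha = alpha * visible
        out = frame.astype(np.float32) * (1 - alpha) + overlay_rgba[..., :3] * alpha
        return out.astype(np.uint8)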
In yet another embodiment, systems and devices are disclosed that facilitate real-time tracking of objects using 3D-from-2D processing techniques. For example, there is provided a device comprising: a memory storing computer-executable components; and a processor that executes the computer-executable components stored in the memory. The computer-executable components may include: a 3D data derivation component configured to employ one or more 3D-from-2D neural network models to derive 3D data from 2D images of an object captured over a period of time; and an object tracking component configured to track the position of the object over the period of time based on the 3D data. For example, the 2D image data may include successive frames of video data captured over the period of time. In some implementations, the object comprises a moving object, and the 2D images include images captured by one or more stationary capture devices. In other implementations, the object comprises a stationary object, and the 2D image data includes images of the object captured by a camera in association with movement of the camera over the period of time. For example, the camera may be attached to a vehicle, and the object tracking component may be configured to track the position of the object relative to the vehicle.
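A minimal sketch of such depth-assisted tracking is shown below: a 2D detection (bounding box) is combined with the per-frame depth map predicted by the 3D-from-2D model and back-projected through assumed pinhole intrinsics to yield a 3D trajectory. The detector and the intrinsics are assumed inputs, not part of the disclosed components:

    import numpy as np

    def track_3d_positions(depths, boxes, fx, fy, cx, cy):
        """Estimate an object's 3D trajectory from per-frame boxes and depth maps.

        depths: list of (H, W) depth maps from the 3D-from-2D model, meters
        boxes:  list of (x0, y0, x1, y1) detections of the tracked object
        fx, fy, cx, cy: pinhole intrinsics of the capture camera
        """
        trajectory = []
        for depth, (x0, y0, x1, y1) in zip(depths, boxes):
            u, v = (x0 + x1) / 2.0, (y0 + y1) / 2.0      # box center in pixels
            z = float(np.median(depth[int(y0):int(y1), int(x0):int(x1)]))
            # Back-project the box center through the pinhole camera model.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            trajectory.append((x, y, z))
        return trajectory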
It should be noted that the terms "3D model," "3D object," "3D reconstruction," "3D image," "3D representation," "3D rendering," "3D construction," and the like are used interchangeably throughout unless the context warrants a particular distinction between the terms. It should be understood that such terms may refer to data representing objects, spaces, scenes, and the like in three dimensions, which may or may not be displayed on an interface. In one aspect, a computing device, such as a graphics processing unit (GPU), may generate executable/visual content in three dimensions based on the data. The term "3D data" refers to data used to generate a 3D model, data describing a perspective or view of a 3D model, captured data (e.g., sensory data, images, etc.), metadata associated with a 3D model, and the like. In various embodiments, the terms "3D data" and "depth data" are used interchangeably throughout unless the context warrants a particular distinction between these terms.
The term "image" as used herein refers to a 2D image unless otherwise indicated. In various embodiments, the term "2D image" is used merely to clarify and/or emphasize the fact that the image itself is 2D, as distinct from the 3D data derived from it and/or a 3D model generated based on the image and the derived 3D data. It should be noted that the terms "2D model," "2D image," and the like are used interchangeably throughout unless the context warrants a particular distinction between the terms. It should be understood that such terms may refer to data representing objects, spaces, scenes, and the like in two dimensions, which may or may not be displayed on an interface. The terms "2D data," "2D image data," and the like are used interchangeably throughout unless the context warrants a particular distinction between the terms, and may refer to data describing a 2D image (e.g., metadata), captured data associated with a 2D image, a representation of a 2D image, and the like. In one aspect, a computing device, such as a graphics processing unit (GPU), may generate executable/visual content in two dimensions based on the data. In another aspect, a 2D model may be generated based on captured image data, 3D image data, or the like. In an embodiment, a 2D model may refer to a 2D representation of a 3D model, a real-world scene, a 3D object, or another 3D construct. As an example, the 2D model may include a 2D image, a set of 2D images, a panoramic 2D image, a set of panoramic 2D images, 2D data wrapped onto geometry, or various other 2D representations of the 3D model. It should be noted that a 2D model may include a set of navigation controls.
Furthermore, terms such as "navigation location," "current location," "user location," and the like are used interchangeably throughout unless the context warrants a particular distinction between the terms. It should be understood that these terms may refer to data representing a position relative to the digital 3D model during user navigation or the like. For example, according to various embodiments, the 3D model may be viewed and rendered, in association with navigation of the 3D model, interaction with the 3D model, generation of the 3D model, and the like, from various perspectives and/or fields of view of a virtual camera relative to the 3D model. In some embodiments, different views or perspectives of the model may be generated based on interactions with the 3D model in one or more modes (such as walking mode, dollhouse/orbit mode, floor plan mode, feature mode, etc.). In one aspect, a user may provide input to a 3D modeling system, and the 3D modeling system may facilitate navigation of the 3D model. As used herein, navigation of the 3D model may include changing the perspective and/or field of view, as described in more detail below. For example, the perspective may rotate about a viewpoint (e.g., an axis or pivot point) or alternate between viewpoints, and the field of view may emphasize a region of the model, change the size of a region of the model (e.g., "zoom in" or "zoom out"), and so forth.
Versions of the 3D model presented from different views or perspectives of the 3D model are referred to herein as representations or renderings of the 3D model. In various implementations, the representation of the 3D model may represent a volume of the 3D model, an area of the 3D model, or an object of the 3D model. The representation of the 3D model may comprise 2D image data, 3D image data or a combination of 2D and 3D image data. For example, in some implementations, the representation or rendering of the 3D model may be a 2D image or panorama associated with the 3D model from a particular perspective of the virtual camera located at a particular navigational position and orientation relative to the 3D model. In other implementations, the representation or rendering of the 3D model may be a 3D model or a portion of a 3D model that is generated from a particular navigational position and orientation of the virtual camera relative to the 3D model and generated using an aligned set or subset of captured 3D data used to generate the 3D model. In still other implementations, the representation or rendering of the 3D model may include a combination of the 2D image and an aligned 3D dataset associated with the 3D model.
Terms such as "user equipment," "user equipment device," "mobile device," "user device," "client device," "cell phone," or terms representing similar terms may refer to a device used by a subscriber or user to receive data, transmit data, control, voice, video, sound, 3D model, game, etc. The foregoing terms are used interchangeably herein and with reference to the associated drawings. In addition, the terms "user," "subscriber," "customer," "consumer," "end user," and the like are used interchangeably throughout, unless the context warrants a particular distinction between terms. It should be appreciated that such terms can refer to human entities, human entities represented by user accounts, computing systems, or automated components supported by artificial intelligence (e.g., the ability to infer based on complex mathematical formalism), which can provide simulated vision, voice recognition, and the like.
In various implementations, the components described herein may perform actions on-line or off-line. On/off line may refer to a state that identifies connectivity between one or more components. Typically, "online" indicates a connected state and "offline" indicates a disconnected state. For example, in online mode, models and tags may be streamed from a first device (e.g., a server device) to a second device (e.g., a client device), such as streaming raw model data or rendered models. In another example, in an offline mode, models and tags may be generated and rendered on one device (e.g., a client device) such that the device does not receive data or instructions from a second device (e.g., a server device). Although the various components are shown as separate components, it should be noted that the various components may be comprised of one or more other components. In addition, it should be noted that embodiments may include additional components not shown for brevity. Additionally, various aspects described herein may be performed by one device or two or more devices in communication with each other.
The embodiments outlined above are now described in more detail with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It may be evident, however, that the embodiments may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the embodiments.
Referring now to the drawings, fig. 1 presents an exemplary system 100 that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data in accordance with various aspects and embodiments described herein. Aspects of the systems, apparatuses, or processes explained in this disclosure may constitute machine-executable components embodied within one or more machines, e.g., embodied in one or more computer-readable media associated with one or more machines. Such components, when executed by the one or more machines (e.g., computers, computing devices, virtual machines, etc.), may cause the machines to perform the operations described.
In the illustrated embodiment, the system 100 includes a computing device 104 configured to receive and process 2D image data 102 using one or more computer-executable components. These computer-executable components may include a 3D-from-2D processing module 106 configured to perform various functions associated with processing the 2D image data 102 to derive 3D data (e.g., derived 3D data 116) from the 2D image data 102. The computer-executable components may also include a 3D model generation component 118 configured to generate a reconstructed 3D model of the object or environment included in the 2D image data 102 based at least in part on the derived 3D data 116. The computer-executable components may also include a navigation component 126 that facilitates navigating the immersive 3D model generated by the 3D model generation component 118. For example, as described in more detail below, in various embodiments the 2D image data 102 may include a number of 2D images captured of an object or environment, such as a number of 2D images captured of a house interior. The 3D model generation component 118 may be configured to use the derived 3D data 116, corresponding to the relative 3D positions of the 2D images and/or of features (e.g., pixels, superpixels, objects, etc.) included in the 2D images, to generate an alignment of the 2D images and/or of the features included in the respective 2D images relative to a common 3D coordinate space. The 3D model generation component 118 may further employ the alignment between the 2D image data and/or the associated 3D data to generate a reconstructed representation or 3D model of the object or environment represented in the 2D image data. In some embodiments, the 3D model may include an immersive virtual reality (VR) environment that can be navigated with the aid of the navigation component 126. In the illustrated embodiment, the reconstructed representation/3D model and the associated alignment data generated by the 3D model generation component 118 are identified as the 3D model and alignment data 128. The system 100 may also include a suitable user device 130 that can receive and render a display 132 of the reconstructed 3D model generated by the 3D model generation component 118. For example, the user device 130 may include, but is not limited to: a desktop computer, a laptop computer, a mobile phone, a smartphone, a tablet personal computer (PC), a personal digital assistant (PDA), a heads-up display (HUD), a virtual reality (VR) headset, an augmented reality (AR) headset or device, a standalone digital camera, or another type of wearable computing device.
The computing device 104 may include or be operatively coupled to at least one memory 122 and at least one processor 124. The at least one memory 122 may store computer-executable components (e.g., the 3D model generation component 118, the 3D-from-2D processing module 106 and one or more of its components, and the navigation component 126) that, when executed by the at least one processor 124, cause performance of the operations defined by those components. In some embodiments, the memory 122 may also store data received and/or generated by the computing device 104, such as (but not limited to) the received 2D image data 102, the derived 3D data 116, and the 3D model and alignment data 128. In other embodiments, the various data sources and data structures of the system 100 (and other systems described herein) may be stored in another memory (e.g., at a remote device or system) accessible to the computing device 104 (e.g., via one or more networks). The computing device 104 may also include a device bus 120 that communicatively couples the various components and data sources/data structures of the computing device 104. Examples of the processor 124 and the memory 122, as well as other suitable computer or computing-based elements, can be found with reference to fig. 35 and can be used in connection with implementing one or more of the systems or components shown in fig. 1 or the other figures disclosed herein.
In the illustrated embodiment, the 3D-from-2D processing module 106 may include a receiving component 108, a 3D data derivation component 110, and a 3D-from-2D model database 112. The receiving component 108 may be configured to receive the 2D image data 102 for processing by the 3D-from-2D processing module 106 (and/or the 3D model generation component 118). The source of the 2D image data 102 may vary. For example, in some implementations, the receiving component 108 can receive the 2D image data 102 from one or more image capture devices (e.g., one or more cameras), one or more network-accessible data sources (e.g., an archive of network-accessible 2D image data), a user device (e.g., images uploaded by a user from a personal computing device), and/or the like. In some implementations, the receiving component 108 can receive the 2D image data in real time as it is captured (or substantially in real time, such that the 2D image data is received within a few seconds of capture) in order to facilitate real-time processing applications associated with deriving 3D data from the 2D image data, including real-time generation and rendering of 3D models based on the 2D image data, real-time object tracking, real-time relative position estimation, real-time AR applications, and the like. In some embodiments, the 2D image data 102 may include images captured by various camera types having various settings and image processing capabilities (e.g., various resolutions, fields of view, color spaces, etc.). For example, the 2D image data may include standard red-green-blue (RGB) images, black-and-white images, high dynamic range images, and the like. In some implementations, the 2D image data 102 may include images captured using a camera included with another device (such as a mobile phone, smartphone, tablet PC, standalone digital camera, etc.). In various embodiments, the 2D image data 102 may include multiple images that provide different perspectives of the same object or environment. For these embodiments, image data from the respective images may be combined and aligned, with respect to one another and to a 3D coordinate space, by the 3D model generation component 118 to generate a 3D model of the object or environment.
The 3D data derivation component 110 can be configured to process the received 2D image data 102 using one or more 3D-from-2D machine learning models to determine (or derive, infer, predict, etc.) the derived 3D data 116 for the received 2D image data 102. For example, the 3D data derivation component 110 can be configured to employ one or more 3D-from-2D machine learning models configured to determine depth information for one or more visual features (e.g., pixels, superpixels, objects, planes, etc.) included in a single 2D image. In the illustrated embodiment, these one or more machine learning models may be provided in a 3D-from-2D model database 112 accessible to the 3D data derivation component 110.
In various embodiments, the 3D data derivation component 110 can employ one or more existing, proprietary, and/or non-proprietary 3D-from-2D machine learning models that have been developed in the field to generate the derived 3D data 116 for the received 2D image data 102. These existing 3D-from-2D models are represented in the system 100, and referred to herein, as "standard models." For example, in the illustrated embodiment, the 3D-from-2D model database 112 may include one or more standard models 114 that can be selected and applied to the received 2D image data 102 by the 3D data derivation component 110 to generate the derived 3D data 116 from the 2D image data 102. These standard models 114 may include various types of 3D-from-2D prediction models configured to receive a single 2D image as input and to process the 2D image using one or more machine learning techniques to infer or predict 3D/depth data for the 2D image. The machine learning techniques may include, for example, supervised learning techniques, unsupervised learning techniques, semi-supervised learning techniques, decision tree learning techniques, association rule learning techniques, artificial neural network techniques, inductive logic programming techniques, support vector machine techniques, clustering techniques, Bayesian network techniques, reinforcement learning techniques, representation learning techniques, and the like.
For example, the standard models 114 may include one or more standard 3D-from-2D models that perform depth estimation using Markov random field (MRF) techniques, conditional MRF techniques, or non-parametric methods. These standard 3D-from-2D models make strong geometric assumptions, namely that the scene structure consists of horizontal planes, vertical walls, and superpixels, and they employ an MRF to estimate depth using hand-crafted features. The standard models 114 may also include one or more models that perform 3D-from-2D depth estimation using non-parametric algorithms. Non-parametric algorithms learn depth from a single RGB image by relying on the assumption that similarity between regions of RGB images implies similar depth cues. After clustering the training dataset based on global features, these models first search the feature space for candidate RGB-D pairs for the input RGB image, and then warp and fuse the candidate pairs to obtain the final depth.
In various exemplary embodiments, the standard models 114 may employ one or more deep learning techniques, including deep learning techniques that use one or more neural networks and/or deep convolutional neural networks to derive 3D data from a single 2D image. Over the last decade, the research community has made tremendous efforts to improve the performance of monocular depth estimation, and significant accuracy has been achieved thanks to the rapid development and advancement of deep neural networks. Deep learning is a class of machine learning algorithms that uses multiple cascaded layers of nonlinear processing units for feature extraction and transformation. In some implementations, each successive layer uses the output of the previous layer as its input. Deep learning models may learn using one or more layers of supervised learning (e.g., classification) and/or unsupervised learning (e.g., pattern analysis). In some implementations, deep learning techniques for deriving 3D data from 2D images may learn using multiple levels of representation corresponding to different levels of abstraction, where the different levels form a hierarchy of concepts.
There are many 3D-from-2D depth prediction models based on deep convolutional neural networks. One approach is a fully convolutional residual network that directly predicts depth values as a regression output. Other models use multi-scale neural networks to separate coarse, scene-scale predictions from predictions of fine detail. Some models refine the results by merging fully connected layers, adding conditional random field (CRF) elements to the network, or predicting additional outputs, such as normal vectors, and combining them with the initial depth prediction to generate a refined depth prediction.
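To make the preceding description concrete, the following is a minimal sketch of a fully convolutional depth-regression network in PyTorch. The layer counts, channel widths, and class name are illustrative assumptions for exposition only, not the architecture of the standard models 114 or of any particular published model.

```python
# Minimal sketch of a fully convolutional depth-regression network of the
# general kind described above. Layer counts and channel widths are
# illustrative assumptions, not the architecture of any specific model.
import torch
import torch.nn as nn

class MonocularDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: progressively downsample the RGB input.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: upsample back to input resolution and regress one
        # depth value per pixel.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, rgb):                 # rgb: (N, 3, H, W)
        depth = self.decoder(self.encoder(rgb))
        return torch.relu(depth)            # predicted depths are non-negative

model = MonocularDepthNet()
dummy = torch.rand(1, 3, 128, 160)          # a single 2D image
print(model(dummy).shape)                   # torch.Size([1, 1, 128, 160])
```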
In various embodiments, the 3D model generation component 118 may use the derived 3D data 116 for the corresponding images received by the computing device 104 to generate a reconstructed 3D model of the object or environment included in the images. The 3D models described herein may include data representing positions, geometries, curved surfaces, and the like. For example, a 3D model may include a set of points represented by 3D coordinates (such as points in 3D Euclidean space). The sets of points may be associated (e.g., connected) with each other by geometric entities. For example, a mesh may connect the set of points with a series of triangles, lines, curved surfaces (e.g., non-uniform rational basis splines (NURBS)), quadrilaterals, n-gons, or other geometric shapes. For example, a 3D model of a building interior environment may include mesh data (e.g., a triangle mesh, a quadrilateral mesh, a parameterized mesh, etc.), one or more texture-mapped meshes (e.g., one or more texture-mapped polygon meshes, etc.), a point cloud, a set of point clouds, surfels, and/or other data constructed with one or more 3D sensors. In one example, the captured 3D data may be configured in a triangle mesh format, a quadrilateral mesh format, a surfel format, a parameterized solid format, a geometric primitive format, and/or another type of format. For example, each vertex of a polygon in a texture-mapped mesh may include UV coordinates of a point in a given texture (e.g., a 2D texture), where U and V are the axes of the given texture. In a non-limiting example of a triangle mesh, each vertex of a triangle may include the UV coordinates of a point in a given texture. The triangle formed by the three points (e.g., a set of three UV coordinates) in the texture may be mapped onto a mesh triangle for rendering purposes.
Portions of the 3D model geometry data (e.g., the mesh) may include image data describing texture, color, intensity, and the like. For example, the geometric data may include geometric data points in addition to texture coordinates associated with the geometric data points (e.g., texture coordinates indicating how the texture data is to be applied to the geometric data). In various embodiments, the received 2D image data 102 (or portions thereof) may be associated with portions of the mesh to associate visual data (e.g., texture data, color data, etc.) from the 2D image data 102 with the mesh. In this regard, the 3D model generation component 118 may generate a 3D model based on the 2D images and the 3D data respectively associated with the 2D images. In one aspect, the data for generating the 3D model may be collected by scanning (e.g., with sensors) real-world scenes, spaces (e.g., houses, office spaces, outdoor spaces, etc.), objects (e.g., furniture, decorations, merchandise, etc.), and the like. The data may also be generated based on a computer-implemented 3D modeling system.
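The following sketch illustrates, under assumed conventions rather than any format required by the embodiments above, how a texture-mapped triangle mesh pairs 3D vertex positions and triangle indices with per-vertex UV coordinates into a 2D texture; the array layout and the sample_texture helper are hypothetical.

```python
# Illustrative sketch (not a prescribed data format) of a texture-mapped
# triangle mesh: 3D vertex positions, triangle indices, and per-vertex UV
# coordinates into a 2D texture image.
import numpy as np

vertices = np.array([[0.0, 0.0, 0.0],      # 3D points in model space
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])              # one triangle, indexing vertices
uvs = np.array([[0.0, 0.0],                # per-vertex texture coordinates,
                [1.0, 0.0],                # where U and V are the axes of
                [0.0, 1.0]])               # the texture image

def sample_texture(texture, uv):
    """Nearest-neighbour lookup of a texel for a given (u, v) pair."""
    h, w, _ = texture.shape
    u, v = np.clip(uv, 0.0, 1.0)
    return texture[int(v * (h - 1)), int(u * (w - 1))]

texture = np.random.rand(64, 64, 3)        # stand-in for a captured 2D image
print(sample_texture(texture, uvs[1]))     # color applied at the second vertex
```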
In some embodiments, the 3D model generation component 118 may convert a single 2D image of an object or environment into a 3D model of the object or environment based on the derived 3D data 116 for that single image. According to these embodiments, the 3D model generation component 118 may use the depth information derived for respective pixels, superpixels, features, etc. of the 2D image to generate a 3D point cloud, 3D mesh, etc. corresponding to those respective pixels in 3D. The 3D model generation component 118 can further register the visual data (e.g., color, texture, brightness, etc.) of the corresponding pixels, superpixels, features, etc. with their corresponding geometric points in 3D (e.g., to form colored point clouds, colored meshes, etc.). In some implementations, the 3D model generation component 118 can further manipulate the 3D model to facilitate rotating it in 3D about one or more axes such that the 3D point cloud or mesh can be viewed from perspectives other than the original capture perspective.
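As an illustration of converting a per-pixel depth prediction and its source image into a colored point cloud as described above, the following is a minimal sketch assuming a simple pinhole camera model; the intrinsic parameters (fx, fy, cx, cy) are assumed values, not parameters specified by the embodiments.

```python
# A minimal sketch, assuming a simple pinhole camera model, of turning a
# per-pixel depth prediction plus the source RGB image into a colored 3D
# point cloud. The intrinsics (fx, fy, cx, cy) are assumed values.
import numpy as np

def depth_to_colored_points(depth, rgb, fx=500.0, fy=500.0, cx=None, cy=None):
    h, w = depth.shape
    cx = w / 2.0 if cx is None else cx
    cy = h / 2.0 if cy is None else cy
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (us - cx) * z / fx                 # back-project each pixel into 3D
    y = (vs - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)            # register color with the geometry
    valid = points[:, 2] > 0               # drop pixels with no depth
    return points[valid], colors[valid]

depth = np.full((120, 160), 2.5)           # stand-in for derived 3D data
rgb = np.random.rand(120, 160, 3)
pts, cols = depth_to_colored_points(depth, rgb)
print(pts.shape, cols.shape)               # (19200, 3) (19200, 3)
```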
In other embodiments, where the 2D image data 102 includes a plurality of different images of the environment captured from different capture locations and/or orientations relative to the environment, the 3D model generation component 118 may perform an alignment process that involves aligning the 2D images and/or features in the 2D images with each other and with a common 3D coordinate space, based at least in part on the derived 3D data 116 for the respective images, to generate alignments between the image data and/or the respective features in the image data. For example, the alignment data may include information mapping corresponding pixels, superpixels, objects, features, etc. represented in the image data to defined 3D points, geometric data, triangles, regions, and/or volumes relative to the 3D space.
For these embodiments, the quality of the alignment will depend in part on the amount, type, and accuracy of the derived 3D data 116 determined for the respective 2D images, which may vary according to the machine learning technique used by the 3D data derivation component 110 to generate the derived 3D data 116 (e.g., the one or more 3D-from-2D models applied). In this regard, the derived 3D data 116 may include 3D location information for each (or, in some implementations, one or more) received 2D image of the 2D image data 102. Depending on the machine learning technique used to determine the derived 3D data 116, the derived 3D data may include depth information for each pixel of a single 2D image, depth information for a subset or group of pixels (e.g., superpixels), depth information for only one or more portions of the 2D image, and so forth. In some implementations, the 2D images may also be associated with additional known or derived spatial information that may be used to facilitate aligning the 2D image data with one another in a 3D coordinate space, including, but not limited to, the relative capture position and relative capture orientation of the respective 2D images with respect to the 3D coordinate space.
In one or more embodiments, the alignment process may involve determining positional information (e.g., relative to a 3D coordinate space) and visual feature information of respective points in the received 2D images relative to each other in a common 3D coordinate space. In this regard, the 2D images, the derived 3D data respectively associated with the 2D images, the visual feature data mapped to the geometry of the derived 3D data, and other sensor data and auxiliary data (if available) (e.g., the auxiliary data described with reference to fig. 30) may be used as inputs to an algorithm that determines potential alignments between the different 2D images via coordinate transformations. For example, in some implementations, the 3D location information of the respective pixels or features derived for a single 2D image may correspond to a point cloud comprising a set of points in 3D space. The alignment process may involve iteratively aligning the different point clouds from adjacent and overlapping images captured from different positions and orientations relative to the object or environment, using correspondences in the derived position information of the respective points, in order to generate a global alignment between the respective point clouds. The visual feature information (including correspondences in color data, texture data, brightness data, etc.) of the corresponding points or pixels included in the point clouds (as well as other sensor data, if any) may also be used to generate the alignment data. The 3D model generation component 118 may further evaluate the quality of a potential alignment and may align the 2D images together once an alignment of sufficiently high relative or absolute quality is achieved. Through repeated alignment of new 2D images (and potential refinement of the alignment of the existing dataset), a global alignment of all or most of the input 2D images into a single coordinate system can be achieved.
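One building block of such an alignment process can be sketched as follows: estimating the rigid transform that best maps one point cloud onto another given putative point correspondences (a Kabsch/Procrustes step that an iterative aligner would repeat as correspondences are re-estimated). This is a hedged illustration of the general technique, not the specific alignment algorithm of the embodiments described above.

```python
# Sketch of a single rigid-alignment step between two corresponding point
# sets: find R (rotation) and t (translation) minimizing ||R @ src + t - dst||.
import numpy as np

def estimate_rigid_transform(src, dst):
    """Kabsch/Procrustes estimate of the rigid transform mapping src to dst."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)          # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

# Toy check: recover a known rotation and translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
dst = src @ R_true.T + np.array([0.5, -0.2, 1.0])
R, t = estimate_rigid_transform(src, dst)
print(np.allclose(R, R_true), np.round(t, 3))    # True [ 0.5 -0.2  1. ]
```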
The 3D model generation component 118 may further employ the alignments between the 2D image data and/or corresponding features in the image data to generate one or more reconstructed 3D models (e.g., 3D model and alignment data 128) of the object or environment included in the captured 2D image data. For example, the 3D model generation component 118 may employ the set of aligned 2D image data and/or the associated 3D data to generate various representations of the 3D model of the environment or object from different perspectives or viewpoints of a virtual camera positioned outside or inside the 3D model. In one aspect, the representations may include one or more of the captured 2D images and/or image data from one or more of the 2D images.
The format and appearance of the 3D model may vary. In some embodiments, the 3D model may include a photorealistic 3D representation of the object or environment. The 3D model generation component 118 may further remove captured objects (e.g., walls, furniture, fixtures, etc.) from the 3D model, integrate new 2D and 3D graphical objects on or within the 3D model in spatially aligned positions relative to the 3D model, change the appearance (e.g., color, texture, etc.) of visual features of the 3D model, and so forth. The 3D model generation component 118 may also generate reconstructed views of the 3D model from different perspectives, generate 2D versions/representations of the 3D model, and so forth. For example, the 3D model generation component 118 may generate a 3D model or representation of a 3D model of an environment corresponding to a floor plan model of the environment, a dollhouse model of the environment (e.g., in implementations where the environment includes a building space such as the interior space of a house), and so forth.
In various embodiments, the floor plan model may be a simplified representation of the surfaces (e.g., walls, floors, ceilings, etc.), entrances (e.g., door openings), and/or window openings associated with an interior environment. The floor plan model may include the locations of the boundary edges of each given surface, entrance (e.g., door opening), and/or window opening. The floor plan model may also include one or more objects. Alternatively, the floor plan may be generated without objects (e.g., objects may be omitted from the floor plan). In some implementations, the floor plan model can include one or more dimensions associated with the surfaces (e.g., walls, floors, ceilings, etc.), entrances (e.g., door openings), and/or window openings. In one aspect, dimensions below a particular size may be omitted from the floor plan. The planes included in the floor plan may extend a particular distance (e.g., until they intersect the building structure).
In various embodiments, the floor plan model generated by the 3D model generation component 118 may be a schematic floor plan of a building structure (e.g., a house), a schematic floor plan of the interior space of a building structure (e.g., a house), and so forth. For example, the 3D model generation component 118 may generate a floor plan model of a building structure by employing the identified walls associated with the derived 3D data 116 derived from the captured 2D images of the building structure. In some implementations, the 3D model generation component 118 can employ generic architectural symbols to illustrate architectural features of the building structure (e.g., doors, windows, fireplaces, wall lengths, other features of the building, etc.). In another example, the floor plan model may include a series of lines in 3D space that represent intersections of walls and/or floors, the outlines of doorways and/or windows, the edges of steps, and the outlines of other objects of interest (e.g., mirrors, paintings, fireplaces, etc.). The floor plan model may also include wall measurements and/or other common annotations that appear in building floor plans.
The floor plan model generated by the 3D model generation component 118 may be a 3D floor plan model or a 2D floor plan model. The 3D floor plan model may contain the edges of each floor, wall, and ceiling as lines. The lines of the floors, walls, and ceilings may be dimensioned (e.g., annotated) with their associated measurements. In one or more embodiments, the 3D floor plan model may be navigated in 3D via a viewer on a remote device. In one aspect, a subsection (e.g., a room) of the 3D floor plan model may be associated with text data (e.g., a name). Measurement data (e.g., square footage, etc.) associated with surfaces may also be determined based on the derived 3D data corresponding to and associated with the respective surfaces. These measurements may be displayed in association with viewing and/or navigating the 3D floor plan model. The area (e.g., square footage) may be calculated for any identified surface or portion of the 3D model having known boundaries, for example, by summing the areas of the polygons comprising the identified surface or portion of the 3D model. The display of individual items (e.g., dimensions) and/or item categories may be toggled in the floor plan via a viewer on the remote device (e.g., via a user interface on the remote client device). The 2D floor plan model may include the surfaces (e.g., walls, floors, ceilings, etc.), entrances (e.g., door openings), and/or window openings associated with the derived 3D data 116 used to generate the 3D model, projected onto a flat 2D surface. In yet another aspect, the floor plan may be viewed at a plurality of different heights relative to a vertical surface (e.g., a wall) via a viewer on a remote device.
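The area computation mentioned above (summing the areas of the polygons comprising an identified surface) can be sketched as follows; the vertex and face arrays are illustrative.

```python
# Minimal sketch of the area measurement described above: the measurement for
# an identified surface is obtained by summing the areas of the polygons
# (here, triangles) that make up that portion of the 3D model.
import numpy as np

def triangle_area(a, b, c):
    """Area of a 3D triangle from its vertices."""
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a))

def surface_area(vertices, faces):
    """Sum triangle areas for the faces belonging to one surface."""
    return sum(triangle_area(*vertices[f]) for f in faces)

# A 2 m x 3 m rectangular floor patch split into two triangles.
vertices = np.array([[0, 0, 0], [2, 0, 0], [2, 3, 0], [0, 3, 0]], dtype=float)
faces = [[0, 1, 2], [0, 2, 3]]
print(surface_area(vertices, faces))   # 6.0 square meters
```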
In various embodiments, the 3D model and various representations of the 3D model (e.g., different views of the 3D model, floor plan models in 2D or 3D, etc.) that can be generated by the 3D model generation component 118, and/or the associated aligned 2D and 3D data may be rendered at the user device 130 via the display 132. For example, in some implementations, the 3D model generation component 118 and/or the user device 130 may generate a Graphical User Interface (GUI) that includes the 3D reconstruction model (e.g., depth map, 3D mesh, 3D point cloud, 3D color point cloud, etc.) generated by the 3D model generation component 118.
In some embodiments, the 3D model generation component 118 may be configured to generate such reconstructed 3D models in real time or substantially in real time as the 2D image data is received and the derived 3D data 116 for the 2D image data is generated. Thus, throughout the alignment process, as new 2D image data 102 is received and aligned, real-time or substantially real-time feedback is provided to a user viewing the rendered 3D model regarding the progress of the 3D model. In this regard, in some implementations in which a user is facilitating or controlling the capture of the 2D image data 102 used to create the 3D model, the system 100 may provide real-time or live feedback to the user during the capture process regarding the progress of the 3D model generated based on the captured and aligned 2D image data (and derived 3D data). For example, in some embodiments, using one or more cameras (or one or more camera lenses) provided on the user device 130 or a separate camera, the user may control capturing 2D images of the environment at various positions and/or orientations relative to the environment. The capture process of capturing 2D image data of an environment at various nearby locations in the environment to generate a 3D model of the environment is referred to herein as "scanning". According to this example, as new images are captured, they may be provided to the computing device 104, and 3D data may be derived for the respective images and used to align the images to generate a 3D model of the environment. The 3D model may be further rendered at the user device 130 and updated in real time based on new image data as it is received during the capture of the 2D image data. For these embodiments, the system 100 may thus provide visual feedback during the capture process about the 2D image data that has been captured and aligned, as well as the quality of the alignment and the resulting 3D model, based on the derived 3D data for the 2D image data. In this regard, based on viewing the aligned image data, the user may monitor what has been captured and aligned so far, find potential alignment errors, evaluate scan quality, plan which area to scan next, determine where and how to position the one or more cameras for capturing the 2D image data 102, and otherwise complete the scan. Additional details regarding graphical user interfaces that facilitate viewing and assisting the capture process are described in U.S. Patent No. 9,324,190, filed February 23, 2013, and entitled "CAPTURING AND ALIGNING MULTIPLE 3-DIMENSIONAL SCENES," which is incorporated herein by reference in its entirety.
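The real-time capture-and-feedback loop described above can be outlined as follows. All functions and class names here are illustrative stand-ins defined within the sketch itself (not APIs of the described system); the depth derivation and alignment steps are reduced to placeholders so the overall loop structure is visible.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np

@dataclass
class SceneModel:
    """Accumulates aligned 3D points as the scan progresses (illustrative)."""
    point_sets: List[np.ndarray] = field(default_factory=list)

def derive_depth(image):
    # Stand-in for applying a 3D-from-2D model to the newly captured image.
    return np.full(image.shape[:2], 2.0)

def align_into_model(model, image, depth):
    # Stand-in for the alignment step: here we simply pair pixel-grid
    # coordinates with depth and append them to the running model.
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pts = np.stack([us, vs, depth], axis=-1).reshape(-1, 3)
    model.point_sets.append(pts)

def render_feedback(model):
    # Stand-in for refreshing the rendered model at the user device.
    total = sum(len(p) for p in model.point_sets)
    print(f"model now contains {total} aligned points")

model = SceneModel()
for _ in range(3):                    # each newly captured 2D image in the scan
    frame = np.random.rand(60, 80, 3)
    depth = derive_depth(frame)
    align_into_model(model, frame, depth)
    render_feedback(model)            # live progress shown to the user
```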
Figs. 2-4 present exemplary illustrations of reconstructed 3D models of a building environment that may be generated by the 3D model generation component 118 based on 3D data derived from 2D image data, in accordance with various aspects and embodiments described herein. In the illustrated embodiments, the 3D models are rendered at a user device (e.g., user device 130) that is a tablet PC. It should be appreciated that the type of user device on which the 3D model may be displayed can vary. In some implementations, the 2D image data of the environment represented in each 3D model was captured using one or more cameras (or one or more camera lenses) of the tablet PC, and the depth data used to generate the 3D model was derived from that image data (e.g., via the 3D data derivation component 110). In other implementations, the 2D images used to generate the respective 3D models may have been captured by one or more cameras (or one or more camera lenses) of another device. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
Fig. 2 provides a visualization of an exemplary 3D model 200 of a living room in association with generation of the 3D model by the 3D model generation component 118. In this regard, the depicted 3D model 200 is currently under construction and includes missing image data. In various embodiments, while the 3D model generation component 118 is building the 3D model 200, the model may be presented to a user at a client device. In this regard, the 3D model 200 may be dynamically updated as new images of the living room are captured, received, and aligned with the previously aligned image data based on the depth data derived (e.g., by the 3D data derivation component 110) for the respective images.
FIG. 3 provides a visualization of an exemplary 3D floor plan model 300 that may be generated by the 3D model generation component 118 based on captured image data of the environment. For example, in one implementation, the 2D image data of the portion of the house depicted in the 3D floor plan model is captured by a camera held and operated by a user as the user walks from one room to another and takes pictures of the house from different perspectives within the rooms (e.g., while standing on the floor). Based on the captured image data, the 3D model generation component 118 may use the depth data derived from the respective images to generate the 3D floor plan model 300, which provides a completely new (not included in the 2D image data) reconstructed top-down perspective of the environment.
Fig. 4 provides a visualization of an exemplary 3D dollhouse view representation 400 of a model that may be generated by the 3D model generation component 118 based on captured image data of the environment. For example, in the same manner as described above with respect to fig. 3, in one implementation, the 2D image data of the portion of the house depicted in the 3D dollhouse view may have been captured by a camera held and operated by a user as the user walks from one room to another and takes pictures of the house from different perspectives within the rooms (e.g., while standing on the floor). Based on the captured image data, the 3D model generation component 118 may use the depth data derived from the respective images to generate a 3D model (e.g., a mesh) of the environment by aligning the respective images with each other relative to a common 3D coordinate space using the depth data respectively derived for the images. According to this implementation, the 3D model may be viewed from various perspectives, including the dollhouse view shown. In this regard, based on input indicating a particular dollhouse perspective from which the 3D model is to be viewed, the 3D model generation component 118 may generate the 3D dollhouse view representation 400 based on the 3D model and the associated aligned image data.
Referring again to FIG. 1, in some embodiments, the computing device 104 may also include a navigation component 126. The navigation component 126 can facilitate viewing, navigating, and interacting with 3D models. The navigation component 126 can facilitate navigating a 3D model after the 3D model has been generated and/or in association with generation of the 3D model by the 3D model generation component 118. For example, in some implementations, the 3D model generated by the 3D model generation component 118, as well as the 2D images used to create the 3D model and the 3D information associated with the 3D model, may be stored in the memory 122 (or another accessible memory device) and accessed by the user device (e.g., via a network using a browser, via a thin client application, etc.). In association with accessing the 3D model, the user device 130 may display (e.g., via the display 132) an initial representation of the 3D model from a predefined initial perspective of a virtual camera relative to the 3D model. The user device 130 may further receive user input (e.g., via a mouse, touch screen, keyboard, gesture detection, gaze detection, etc.) that instructs or requests the virtual camera to move through or around the 3D model to view different portions of the 3D model and/or to view different portions of the 3D spatial model from different perspectives and navigation modes (e.g., walking mode, dollhouse mode, feature view mode, and floor plan mode). The navigation component 126 can facilitate navigating the 3D model by receiving and interpreting user gesture input, and by selecting or generating a representation of the 3D model from the new perspective of the virtual camera relative to the 3D spatial model determined based on the user input. The representations may include 2D images associated with the 3D model, as well as novel views of the 3D model derived from a combination of the 2D image data and the 3D mesh data. The 3D model generation component 118 may also generate and provide the corresponding representation of the 3D model for rendering at the user device 130 via the display 132.
The navigation component 126 can provide various navigation tools that allow a user to provide inputs that facilitate viewing and interacting with different portions of the 3D model. These navigation tools may include, but are not limited to: selecting a location to view on the representation of the 3D model (e.g., which may include a point, area, object, room, surface, etc.), selecting a location for positioning the virtual camera (e.g., including a waypoint), selecting an orientation of the virtual camera, selecting a field of view of the virtual camera, selecting a marker icon, moving the position of the virtual camera (forward, backward, left, right, up, or down), rotating the orientation of the virtual camera (e.g., panning up, down, left, or right), and selecting a different viewing mode/context (as described below). The various types of navigation tools described above allow a user to provide input indicating how to move the virtual camera relative to the 3D model in order to view the 3D model from a desired perspective. The navigation component 126 can also interpret received navigation input indicating a desired perspective for viewing the 3D model, thereby facilitating determining a representation of the 3D model to render based on the navigation input.
In various implementations, in association with generating a 3D model of an environment, the 3D model generation component 118 can determine the locations of objects, obstacles, flat planes, and the like. For example, based on the aligned 3D data derived for the respective images of the captured environment, the 3D model generation component 118 may identify obstacles, walls, objects (e.g., countertops, furniture, etc.), or other 3D features included in the aligned 3D data. In some implementations, the 3D data derivation component 110 can identify or partially identify features, objects, etc. included in the 2D images and associate that information with the derived 3D data of the respective features, objects, etc., thereby identifying them and/or defining the boundaries of the object or feature. In one aspect, objects may be defined as solid such that they cannot be traversed when rendered (e.g., during navigation, transitions between modes, etc.). Defining an object as solid may facilitate various aspects of model navigation. For example, a user may browse a 3D model of an interior living space. The living space may include walls, furniture, and other objects. As the user browses the model, the navigation component 126 can prevent the user (e.g., relative to a particular representation that can be provided to the user) from traversing a wall or other object, and can also limit movement according to one or more configurable constraints (e.g., maintaining the viewpoint at a specified height above the model surface or defined floor). In one aspect, the constraints may be based at least in part on the mode (e.g., walking mode) or type of model. It should be noted that in other embodiments, an object may be defined as non-solid such that the object may be traversed (e.g., during navigation, transitions between modes, etc.).
In one or more implementations, the navigation component 126 can provide different viewing modes or viewing contexts, including but not limited to a walking mode, a dollhouse/orbit mode, a floor plan mode, and a feature view. The walking mode may refer to a mode for navigating from a viewpoint within the 3D model and viewing the 3D model. The viewpoint may be based on camera position, a point within the 3D model, camera orientation, and the like. In one aspect, the walking mode may provide a view of the 3D model that simulates a user walking through or otherwise traveling through the 3D model (e.g., a real-world scene). The user may freely rotate and move to view the scene from different angles, vantage points, heights, or perspectives. For example, the walking mode may provide a perspective of the 3D model from a virtual camera that corresponds to the eyes of a virtual user as that virtual user walks around the space of the 3D model (e.g., at a defined distance above the floor surface of the 3D model). In one aspect, during walking mode, the user may be restricted to having a camera view at a particular height above the model surface except when crouching or in the air (e.g., jumping, falling off a ledge, etc.). In one aspect, collision checking or a navigation mesh may be applied such that the user is restricted from passing through objects (e.g., furniture, walls, etc.). The walking mode may also include moving between waypoints, where the waypoints are associated with the known locations at which the 2D images associated with the 3D model were captured. For example, in association with navigating the 3D model in walking mode, the user may click on or select a point or region in the 3D model for viewing, and the navigation component 126 may determine a waypoint associated with the capture location of a 2D image associated with the point or region that provides the best view of that point or region.
The dollhouse/orbit mode represents a mode in which the user perceives the model such that the user is outside or above the model and can freely rotate the model about a center point, as well as move the center point around the model (e.g., as with the dollhouse view representation 400). For example, the dollhouse/orbit mode may provide a perspective of the 3D model in which the virtual camera is configured to view the interior environment from a position removed from the interior environment, in a manner similar to looking into a dollhouse (e.g., with one or more walls removed) at various pitches relative to the model floor. In the dollhouse/orbit mode, there may be multiple types of movement. For example, the viewpoint may pitch up or down, rotate left and right about a vertical axis, zoom in or out, or move horizontally. The pitch, rotation about the vertical axis, and zoom motions may be relative to a center point, such as one defined by (X, Y, Z) coordinates. The vertical axis of rotation may pass through the center point. In pitch and in rotation about the vertical axis, these movements may maintain a constant distance from the center point. Thus, the pitch and rotation movements of the viewpoint can be regarded as vertical and horizontal travel, respectively, on the surface of a sphere centered on the center point. Zooming may be regarded as movement along a ray defined from the viewpoint through the center point. The point on the 3D model rendered at the center of the display may be used as the center point, with or without backface culling or other ceiling removal techniques. Alternatively, the center point may be defined by the point located on a horizontal plane at the center of the display. The horizontal plane may be invisible, and its height may be defined by the overall height of the floor of the 3D model. Alternatively, a local floor height may be determined, and the intersection of a ray projected from the camera through the center of the display with the surface at the local floor height may be used to determine the center point.
The floor plan mode presents a view of the 3D model that is orthogonal or substantially orthogonal to the floor of the 3D model (e.g., looking down at the model from directly above, as with the 3D floor plan model 300). The floor plan mode may represent a mode in which the user perceives the model such that the user is outside or above the model. For example, a user may view all or a portion of the 3D model from an aerial vantage point. The 3D model may be moved or rotated about an axis. As an example, the floor plan mode may correspond to a top-down view in which the model is rendered such that the user looks directly down at the model or down at the model at a fixed angle (e.g., about 90 degrees above the floor or bottom plane of the model). In some implementations, the representation of the 3D model generated in the floor plan mode may appear 2D or substantially 2D. The set of motion or navigation controls and mappings in the floor plan mode may be a subset of the controls available in the dollhouse/orbit mode or other modes. For example, the floor plan mode controls may be the same as those described in the context of the orbit mode, except that the pitch is fixed pointing straight down. Rotation about a vertical axis through the center point is still possible, as are zooming in and out toward and away from that point and moving the center point. However, due to the fixed pitch, the model can only be viewed directly from above.
The feature view may provide a perspective of the 3D model with a narrower field of view than the dollhouse/orbit view context (e.g., a close-up view of a particular item or object of the 3D model). In particular, the feature view allows a user to navigate within and around the details of a scene. For example, with the feature view, a user may view different perspectives of a single object included in the internal environment represented by the 3D model. In various embodiments, selection of a marker icon included in the 3D model or in a representation of the 3D model may result in generation of a feature view (as described in more detail below) of the point, region, or object associated with the marker icon.
The navigation component 126 can provide mechanisms for navigating within and between these different modes or perspectives of the 3D model based on discrete user gestures in a virtual 3D space or 2D coordinates relative to the screen. In some implementations, the navigation component 126 may provide navigation tools that allow a user to move the virtual camera relative to the 3D model using the various viewing modes described herein. For example, the navigation component 126 can provide and implement navigation controls that allow a user to change the position and orientation of the virtual camera relative to the 3D model and to change the field of view of the virtual camera. In some implementations, based on received user navigation input relative to the 3D model or a visualization of the 3D model (including 2D images of the 3D model and hybrid 2D/3D representations), the navigation component 126 can determine a desired position, orientation, and/or field of view of the virtual camera relative to the 3D model.
Referring back to fig. 1, in accordance with one or more embodiments, the computing device 104 may correspond to a server device that facilitates various services associated with deriving 3D data from 2D images, including, for example, 3D model generation and navigation of 3D models based on 2D images. In some implementations of these embodiments, the computing device 104 and the user device 130 may be configured to operate in a client/server relationship, where the computing device 104 provides the user device 130 with access to 3D modeling and navigation services via a network-accessible platform (e.g., a website, thin client application, etc.) using a browser or the like. However, the system 100 is not limited to this architectural configuration. For example, in some embodiments, one or more features, functions, and associated components of the computing device 104 may be provided at the user device 130, and vice versa. In another embodiment, one or more features and functions of the computing device 104 may be provided at a capture device (not shown) used to capture the 2D image data. For example, in some implementations, at least some of the 3D-from-2D processing module 106, or components of the 3D-from-2D processing module 106, may be provided at the capture device. According to this example, the capture device may be configured to derive depth data (e.g., derived 3D data 116) from the captured images and provide the images and associated depth data to the computing device 104 for further processing by the 3D model generation component 118 and the optional navigation component. In yet another exemplary embodiment, the one or more cameras (or one or more camera lenses) for capturing the 2D image data, the 3D-from-2D processing module, the 3D model generation component 118, the navigation component 126, and the display 132 displaying the 3D model and representations of the 3D model may all be provided on the same device. Various architectural configurations of different systems and devices that may provide one or more features and functions of the system 100 (and the additional systems described herein) are described below with reference to figs. 14-25.
In this regard, the various components and devices of the system 100 and the additional systems described herein may be connected directly or via one or more networks. Such networks may include wired and wireless networks, including, but not limited to, cellular networks, wide area networks (WANs, e.g., the Internet), local area networks (LANs), or personal area networks (PANs). For example, the computing device 104 and the user device 130 may communicate with each other using virtually any desired wired or wireless technology, including, for example, cellular, WAN, Wi-Fi, WiMAX, WLAN, Bluetooth™, near-field communication, and the like. In one aspect, one or more components of the system 100 and the additional systems described herein are configured to interact via different networks.
Fig. 5 presents another exemplary system 500 that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data, in accordance with various aspects and embodiments described herein. The system 500 includes the same or similar features as the system 100, with panoramic image data (e.g., panoramic image data 502) added as input. The system 500 also includes an upgraded 3D-from-2D processing module 504 that differs from the 3D-from-2D processing module 106 by the addition of the panorama component 506, the model selection component 512, and one or more 3D-from-2D panorama models 514 (hereinafter panorama models 514) to the 3D-from-2D model database 112. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
The system 500 is specifically configured to receive and process 2D image data having a relatively wide field of view (referred to herein as panoramic image data and identified in system 500 as panoramic image data 502). The term panoramic image or panoramic image data is used herein to refer to a 2D image of an environment having a relatively wide field of view compared to a standard 2D image, which typically has a relatively narrow field of view of between about 50° and 75°. In contrast, the field of view of a panoramic image may span up to 360° in the horizontal direction (e.g., a cylindrical panoramic image) or in both the horizontal and vertical directions (e.g., a spherical panoramic image). In this regard, in some contexts, the term panoramic image as used herein may refer to an image having a field of view equal to or substantially equal to 360° in the horizontal and/or vertical directions. In other contexts, the term panoramic image as used herein may refer to an image having a field of view less than 360° but greater than a minimum threshold, such as 120°, 150°, 180° (e.g., as provided by a fisheye lens), or, for example, 250°.
Using panoramic images as input to one or more 3D-from-2D models to derive 3D data therefrom produces significantly better results than using standard 2D images (e.g., with a field of view of less than 75°) as input. According to these embodiments, the system 500 may include one or more panoramic 3D-from-2D models, referred to herein and depicted in the system 500 as panorama models 514, that have been specially trained to derive 3D data from panoramic images. The 3D data derivation component 110 may further include a model selection component 512 for selecting, from the 3D-from-2D model database 112, one or more suitable models to be used for deriving 3D data from the received 2D images based on one or more parameters associated with the input data, including whether the input data comprises a 2D image having a field of view exceeding a defined threshold such that it is classified as a panoramic image (e.g., 120°, 150°, 180°, 250°, 350°, 359°, etc.). In this regard, based on receipt of panoramic image data 502 (e.g., an image having a field of view greater than the minimum threshold) and/or generation of a panoramic image by the stitching component 508 (as described below), the model selection component 512 may be configured to select one or more panorama models 514 for application by the 3D data derivation component 110 to determine the derived 3D data 116 for the panoramic image data 502.
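A minimal sketch of such field-of-view-based model selection follows; the threshold value, function, and argument names are illustrative assumptions, not part of the described system.

```python
# Hedged sketch of the model-selection logic described above: the field of
# view of the incoming image is compared against a configurable threshold to
# decide whether a panorama-specific model or a standard model is applied.
PANORAMA_FOV_THRESHOLD_DEG = 120.0

def select_model(image_fov_deg, standard_model, panorama_model,
                 threshold=PANORAMA_FOV_THRESHOLD_DEG):
    """Return the 3D-from-2D model appropriate for the image's field of view."""
    if image_fov_deg >= threshold:
        return panorama_model        # e.g., trained on equirectangular data
    return standard_model            # e.g., trained on narrow-FOV images

print(select_model(360.0, "standard", "panorama"))   # -> 'panorama'
print(select_model(60.0, "standard", "panorama"))    # -> 'standard'
```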
The one or more panorama models 514 may employ a neural network model that has been trained on panoramic images using 3D ground truth data associated therewith. For example, in various implementations, the one or more panorama models 514 may be generated based on 2D panoramic image data having associated 3D data (referred to herein as 2D/3D panoramic data), where the 3D data was captured by a 2D/3D capture device in association with the capture of the 2D panoramic image data. The 2D/3D panorama capture device may incorporate one or more cameras (or one or more camera lenses) providing up to a 360° field of view and one or more depth sensors providing up to a 360° field of view, thereby capturing an entire panoramic image while simultaneously capturing panoramic depth data and incorporating it into a 2D/3D panoramic image. The depth sensors may include one or more 3D capture devices that capture depth information using at least some hardware. For example, depth sensors may include, but are not limited to, LiDAR sensors/devices, laser rangefinder sensors/devices, time-of-flight sensors/devices, structured light sensors/devices, light field camera sensors/devices, active stereo depth derivation sensors/devices, and the like. In other embodiments, the panoramic 2D/3D training data used to develop the one or more panorama models 514 may include panoramic image data and associated 3D data generated by a capture device assembly incorporating one or more color cameras and one or more 3D sensors attached to a rotating stage or a device configured to rotate about an axis (e.g., using synchronized rotation signals) during the capture process. During rotation, multiple images and depth readings are captured, which may be combined into a single panoramic 2D/3D image. In some implementations, by rotating the platform, images with fields of view that overlap each other but have different viewpoints may be obtained, and 3D information may be derived from them using a stereo algorithm. The 2D/3D panoramic training data may also be associated with information identifying the capture location and capture orientation of each 2D/3D panoramic image, which may be generated by the 2D/3D capture device and/or derived in association with the capture process. Additional details regarding capturing and aligning panoramic image and depth data are described in U.S. Patent Application Serial No. 15/417,162, filed January 26, 2017, and entitled "CAPTURING AND ALIGNING PANORAMIC IMAGE AND DEPTH DATA," which is incorporated herein by reference in its entirety.
In various embodiments, the one or more panorama models 514 may employ an optimized neural network architecture that has been specifically trained based on the 2D/3D panoramic image training data discussed above to evaluate and process panoramic images in order to derive 3D data therefrom. In various embodiments, unlike various existing 3D-from-2D models (e.g., the standard models 114), the one or more panorama models 514 may employ a neural network configured to process panoramic image data using convolution layers that wrap around the panoramic image when it is projected onto a flat (2D) plane. For example, image projection may refer to mapping a flat image onto a curved surface and vice versa. In this regard, the geometry of a panoramic image differs from that of an ordinary (camera) picture in that all points along a horizontal (scan) line are equidistant from the camera's focal point. In practice, this creates a cylindrical or spherical image that is displayed correctly only when viewed from the exact center of the cylinder. When the image is "unrolled" onto a flat surface, such as a computer display, it exhibits severe distortion. Such "unrolled" or flat versions of panoramic images are sometimes referred to as equirectangular projections or equirectangular images.
In this regard, in some implementations, the one or more panorama models 514 may be configured to receive the panoramic image data 502 in the form of an equirectangular projection, i.e., already projected onto a 2D plane. In other implementations, the panorama component 506 may be configured to project a received spherical or cylindrical panoramic image onto a 2D plane to generate a projected panoramic image in equirectangular form. To account for the inherent distortion in the received panoramic image data when deriving depth information from it, the one or more panorama models may employ a neural network having convolution layers that wrap around based on the image projection to account for edge effects. In particular, convolutional layers in a neural network typically pad their inputs with zeros when their receptive fields would otherwise extend outside the valid data region. To properly process an equirectangular image, a convolutional layer whose receptive field extends beyond one horizontal edge of the valid data region will instead draw its inputs from the data at the opposite horizontal edge of the region, rather than setting those inputs to zero.
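The wrap-around behavior described above can be sketched as a convolution that pads its input circularly along the horizontal axis (so receptive fields crossing one side edge read data from the opposite edge) while zero-padding vertically. This is an illustrative PyTorch sketch, not the network used by the panorama models 514; sizes and names are assumptions.

```python
# Minimal sketch of a wrap-around convolution for equirectangular feature maps:
# circular padding along the horizontal (longitude) axis, zero padding along
# the vertical axis.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WrapAroundConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.pad = kernel_size // 2
        # No built-in padding; we pad explicitly below.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=0)

    def forward(self, x):                     # x: (N, C, H, W), equirectangular
        x = F.pad(x, (self.pad, self.pad, 0, 0), mode="circular")  # wrap left/right
        x = F.pad(x, (0, 0, self.pad, self.pad), mode="constant")  # zeros top/bottom
        return self.conv(x)

layer = WrapAroundConv2d(3, 8)
pano = torch.rand(1, 3, 64, 128)              # H x W of a projected panorama
print(layer(pano).shape)                      # torch.Size([1, 8, 64, 128])
```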
In some implementations, weighting may be performed during training of the neural network model based on the image projection to enhance the accuracy of the depth predictions of the trained model. Specifically, the angular area represented by pixels near the top or bottom (the poles) of an equirectangular image is smaller than the angular area represented by pixels near the equator. To avoid training a network that makes good predictions near the poles at the expense of making poor predictions near the equator, the per-pixel training loss propagated through the network during training is weighted in proportion to the angular area represented by the pixel under the image projection. Thus, the one or more panorama models 514 may be configured to apply weights to the 3D-from-2D predictions based on the angular area represented by each pixel, wherein the weight attributed to the 3D prediction determined for a respective pixel decreases as the angular area decreases.
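A hedged sketch of such an area-weighted training loss follows. For an equirectangular image, a pixel row at latitude φ spans an angular area proportional to cos(φ), so the per-pixel loss is weighted accordingly; the choice of an L1 loss and the tensor shapes are assumptions for illustration only.

```python
# Illustrative area-weighted loss for equirectangular depth maps: rows near
# the poles (small angular area) contribute less than rows near the equator.
import math
import torch

def area_weighted_l1_loss(pred_depth, gt_depth):
    """pred_depth, gt_depth: (N, 1, H, W) equirectangular depth maps."""
    n, _, h, w = pred_depth.shape
    # Latitude of each row, from +pi/2 (top pole) to -pi/2 (bottom pole).
    lat = torch.linspace(math.pi / 2, -math.pi / 2, h, device=pred_depth.device)
    weights = torch.cos(lat).clamp(min=0.0).view(1, 1, h, 1)
    per_pixel = (pred_depth - gt_depth).abs()
    # Weighted mean over all pixels in the batch.
    return (weights * per_pixel).sum() / (weights.sum() * w * n)

pred = torch.rand(2, 1, 64, 128)
gt = torch.rand(2, 1, 64, 128)
print(area_weighted_l1_loss(pred, gt))
```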
In one or more implementations, the one or more panorama models 514 may be further configured to compensate for image distortion by re-projecting the panoramic image at each convolution layer. Specifically, instead of each convolution layer extracting its input from a square region (e.g., a 3 x 3 region) of the previous layer, the input is instead sampled from locations in the previous layer corresponding to particular angular positions, based on the projection. For example, for an equirectangular projection, the inputs to a convolution layer may come from a square (3 x 3) region of elements near the equator, while near the poles those same nine inputs will be sampled from a region wider than it is high, corresponding to the horizontal stretching near the poles in the equirectangular projection. In this regard, the output of the previous convolutional layer may be interpolated and then used as input to the next subsequent or downstream layer.
In various embodiments, the panorama component 506 may facilitate processing panoramic images to facilitate the derivation of 3D data from them by the 3D data derivation component 110 using the one or more panorama models 514. In the illustrated embodiment, the panorama component 506 may include a stitching component 508 and a cropping component 510.
In some implementations, the received panoramic image data 502 may be input directly to the one or more panorama models 514 based on being classified as a panoramic image (e.g., having a field of view exceeding the defined threshold). For example, the received panoramic image data 502 may include a 360° panoramic image captured as a single image using a capture device employing a conical mirror. In other examples, the received panoramic image data 502 may include an image with a 180° field of view captured as a single image using, for example, a fisheye lens. In still other implementations, the 2D panoramic image may be formed by combining two or more 2D images whose combined field of view spans up to about 360°, which are stitched together (by another device) before being received by the receiving component 108.
In other implementations, the panorama component 506 may include a stitching component 508 that may be configured to generate a panoramic image for input to the one or more panorama models 514 based on receiving two or more images with adjacent perspectives of the environment. For example, in some implementations, two or more images may be captured in conjunction with rotating a camera about an axis such that the combined field of view of the images equals 360° or another wide field-of-view range (e.g., greater than 120°). In another example, the two or more images may include images captured by two or more cameras positioned relative to the environment and to each other such that the combined field of view of the respective image captures equals 360°, such as two fisheye-lens cameras, each with a 180° field of view, positioned in opposite directions. In another example, a single device may include two or more cameras with partially overlapping fields of view configured to capture two or more images whose combined field of view spans up to 360°. For these embodiments, the stitching component 508 may be configured to stitch the respective images together to generate a single panoramic image for use as input to the one or more panorama models 514, thereby generating the derived 3D data 116 therefrom.
In this regard, the stitching component 508 may be configured to align or "stitch together" respective 2D images that provide different perspectives of the same environment to generate a panoramic 2D image of the environment. For example, the stitching component 508 can employ known or derived information (e.g., using the techniques described herein) regarding the capture locations and orientations of the respective 2D images in order to align and order the respective 2D images relative to one another, and then merge or combine the respective images to generate a single panoramic image. Combining two or more 2D images into a single, larger field-of-view image before inputting it into the 3D-from-2D predictive neural network model improves the accuracy of the depth results compared to providing the inputs separately and then combining the depth outputs (e.g., in association with generating a 3D model or for another application). In other words, stitching the input images in 2D may provide better results than stitching the predicted depth outputs in 3D.
Thus, in some embodiments, one or more standard models 114 or panorama models 514 may be used to process the wider field-of-view image generated by the stitching component 508 to obtain a single depth dataset for the wider field-of-view image, rather than processing each image separately to obtain a separate depth dataset for each image. In this regard, the single depth dataset may be associated with increased accuracy relative to the separate depth datasets. In addition, by aligning the wider field-of-view image and its associated depth data with other image and depth data captured for the environment at different capture locations, the 3D model generation component 118 can use the wider field-of-view image and its associated single depth dataset in association with generating the 3D model. The resulting alignment generated using the wider field-of-view image and associated depth data will have higher accuracy than an alignment generated using the separate images and their associated separate depth datasets.
In some embodiments, before two or more images are stitched together to generate a panoramic image, depth information may be derived for the respective images by the 3D data derivation component 110 using one or more standard models 114. The stitching component 508 can further employ this initially derived depth information for the respective images (e.g., for pixels in the respective images, features in the respective images, etc.) to facilitate aligning the respective 2D images with one another in association with generating a single 2D panoramic image of the environment. In this regard, initial 3D data may be derived for each 2D image prior to stitching using one or more standard 3D-from-2D models. In association with combining the images to generate a single panoramic image, this initial depth data may be used to align the respective images with each other. Once generated, the panoramic image may be reprocessed by the 3D data derivation component 110 using the one or more panorama models 514 to derive more accurate 3D data for the panoramic image.
In some implementations, in association with combining two or more images to generate a panoramic image, the stitching component 508 can project the respective images into a common 3D coordinate space based on the initially derived depth information and the calibrated capture positions/orientations of the respective images relative to that 3D coordinate space. In particular, the stitching component 508 may be configured to project two or more adjacent images (to be stitched together as a panorama) and the corresponding initially derived 3D depth data into a common spatial 3D coordinate space in order to facilitate accurate alignment of the respective images in association with generating a single panoramic image. For example, in one embodiment, the stitching component 508 may merge the image data of the respective images and the initially derived 3D data onto a discretized sinusoidal projection (or another type of projection). The stitching component 508 may convert each 3D point included in the initially derived 3D data into sinusoidal projection coordinates and assign it to a discretized cell. The stitching component 508 can further average multiple points mapped to the same cell to reduce sensor noise, while detecting and removing anomalous readings from the average calculation.
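A minimal sketch of this merge step, under assumed coordinate conventions, follows: 3D points are mapped into a discretized sinusoidal projection and points falling into the same cell are averaged (the outlier rejection mentioned above is omitted for brevity).

```python
# Illustrative merge of 3D points into a discretized sinusoidal (equal-area)
# projection, averaging points that land in the same cell to suppress noise.
import numpy as np
from collections import defaultdict

def sinusoidal_cell(point, cells_per_radian=100):
    """Map a 3D point (relative to the capture position) to a grid cell."""
    x, y, z = point
    r = np.linalg.norm(point)
    lat = np.arcsin(np.clip(z / r, -1.0, 1.0))      # elevation angle
    lon = np.arctan2(y, x)                          # azimuth angle
    u = lon * np.cos(lat)                           # sinusoidal projection
    v = lat
    return (int(u * cells_per_radian), int(v * cells_per_radian))

def merge_points(points, cells_per_radian=100):
    cells = defaultdict(list)
    for p in points:
        cells[sinusoidal_cell(p, cells_per_radian)].append(p)
    # Average the points that land in the same discretized cell.
    return np.array([np.mean(ps, axis=0) for ps in cells.values()])

pts = np.random.normal(size=(1000, 3)) + np.array([3.0, 0.0, 0.0])
print(merge_points(pts).shape)                      # (num_cells, 3)
```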
In some implementations, the stitching component 508 can also generate a panoramic 3D image (e.g., a point cloud, depth map, etc.) based on the projected points relative to the 3D coordinate space. For example, the stitching component 508 may employ the initial depth data to create a sinusoidal depth map or point cloud comprising the 3D points projected onto the common 3D spatial coordinate plane. The stitching component 508 may further apply pixel color data to the depth map or point cloud by projecting color data from the respective 2D images onto the depth map or point cloud. This may involve projecting rays outward from the color camera along each captured pixel toward the portion of interest of the depth map or point cloud to color the depth map or point cloud. The stitching component 508 can also back-project color data from the colored point cloud or depth map to create a single 2D panoramic image. For example, by back-projecting color data from the colored point cloud or 3D depth map onto the intersection points or areas of the 2D panorama, the stitching component 508 can fill any pinholes in the panorama with neighboring color data, thereby unifying the exposure data (if needed) across the boundaries between the respective 2D images. The stitching component 508 can also perform blending and/or graph cuts at the edges to remove seams. The resulting panoramic image may then be reprocessed by the 3D data derivation component 110 to determine more accurate 3D data for the panoramic image using the one or more panorama models 514.
In some embodiments, panoramic image data captured for an environment may be used to generate optimized derived 3D data (e.g., derived by the 3D data derivation component 110 using 3D-from-2D techniques) for smaller or cropped portions of the panoramic image. For example, in the embodiments described above, the 3D data derivation component 110 can process the panoramic image (e.g., in the form of an equirectangular projection) using the one or more panorama models 514 to generate depth data for the entire panoramic image, such as depth data for each pixel, or depth data for groups of pixels (e.g., superpixels, defined features, objects, etc.) that collectively cover the entire panoramic span. However, in various applications, depth data for the entire panoramic image may not be desired or needed. For example, in various contexts associated with using 3D-from-2D techniques to optimize placement of a digital object in an AR application, depth data for a wide field of view of the environment may not be needed (e.g., depth data may only be needed for objects in the line of sight, or for the area of the environment in front of the viewer). (AR applications of the disclosed 3D-from-2D techniques are described below with reference to fig. 30.) In another example, depth data for a wide field of view may not be needed when using the derived 3D data to generate real-time relative 3D position data for automated navigation and collision avoidance by intelligent machines (e.g., unmanned aerial vehicles, unmanned vehicles, robots, etc.). For example, accurate real-time depth data for object avoidance may only be needed along the forward trajectory path of a vehicle.
In some embodiments where depth data is desired for a smaller field of view of the environment relative to the entire panoramic view of the environment, the panoramic image of the environment may still be used to generate optimized derived 3D data for the desired cropped portion of the image. For example, the 3D data derivation component 110 can apply one or more panoramic models 514 to a panoramic image of an environment to derive depth data for the panoramic image. The cropping component 510 may then crop the panoramic image, with the derived 3D data associated therewith, to select a desired portion of the image. For example, the cropping component 510 may select a portion of the panoramic image that corresponds to a narrower field of view. In another example, the cropping component 510 may crop the panoramic image to select a particular segmented object (e.g., person, face, tree, building, etc.) in the panoramic image. The technique used to determine the desired portion of the panoramic image for cropping may vary based on the application of the resulting 3D data. For example, in some implementations, user input may be received that identifies or indicates the desired portion for cropping. In another implementation, for example where 3D data has been derived for real-time object tracking, the cropping component 510 may receive information identifying the object being tracked, information defining or characterizing the object, etc., and automatically crop the panoramic image to extract the corresponding object. In another example, the cropping component 510 may be configured to crop the panoramic image according to a default setting (e.g., select a portion of the image having a low level of distortion). The cropping component 510 may further identify the portion of the derived 3D data that corresponds to the cropped portion of the panoramic image and associate that portion of the derived 3D data with the cropped image.
For these embodiments, by deriving depth data for the entire panoramic image using one or more panoramic models 514 and then using only the portion of the derived depth data associated with the desired cropped portion of the panoramic image, the accuracy of the derived depth data associated with the cropped portion may be improved relative to first cropping the panoramic image and then deriving depth data for the smaller field of view portion using one or more standard 3D-from-2D models 114 or alternative depth derivation techniques.
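For illustration, a minimal sketch of cropping a panoramic image together with the corresponding portion of its already-derived depth data is shown below; an equirectangular layout and a yaw-based crop are assumed.

```python
import numpy as np

def crop_panorama_with_depth(pano_rgb, pano_depth, yaw_deg, fov_deg):
    """pano_rgb: (H, W, 3) panorama; pano_depth: (H, W) depth derived for the full panorama."""
    h, w = pano_depth.shape
    center = int((yaw_deg % 360.0) / 360.0 * w)
    half = max(int(fov_deg / 360.0 * w / 2), 1)
    cols = np.arange(center - half, center + half) % w   # wrap across the 0/360 seam
    # The cropped image keeps the depth values that were derived for the full panorama.
    return pano_rgb[:, cols], pano_depth[:, cols]
```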
Fig. 6 presents an exemplary computer-implemented method 600 for deriving 3D data from panoramic 2D image data in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 602, a system (e.g., system 500) including a processor may receive a panoramic image. At 604, the system employs a 3D-from-2D convolutional neural network model to derive 3D data from the panoramic image, wherein the 3D-from-2D convolutional neural network model employs convolutional layers that wrap around the panoramic image as projected onto a 2D plane to facilitate deriving the three-dimensional data. According to method 600, the convolutional layers minimize or eliminate edge effects associated with deriving 3D data by wrapping around the panoramic image when it is projected onto a 2D plane. In some implementations, the panoramic image may be received already projected onto a two-dimensional plane. In other implementations, the panoramic image may be received as a spherical or cylindrical panoramic image, and the system may project (e.g., using the panoramic component 506) the spherical or cylindrical panoramic image onto a 2D plane before employing the 3D-from-2D convolutional neural network model to derive the 3D data.
In one or more implementations, the 3D-from-2D convolutional neural network accounts for weighting values applied to the respective pixels based on their projected angular areas during training. In this regard, the 3D-from-2D neural network model may include a model trained based on weighting values applied to respective pixels of the projected panoramic image in association with deriving depth data for the respective pixels, wherein the weighting values vary based on the angular areas of the respective pixels. For example, during training, the weighting value decreases as the angular area of the corresponding pixel decreases. Further, in some implementations, a downstream convolutional layer (i.e., a convolutional layer subsequent to a previous layer) is configured to re-project the portion of the panoramic image processed by the previous layer in association with deriving depth data for the panoramic image, thereby producing a re-projected version of the panoramic image for each downstream convolutional layer. In this regard, the downstream convolutional layer is further configured to employ input data from the previous layer by extracting the input data from the re-projected version of the panoramic image. For example, in one implementation, the input data may be extracted from the re-projected version of the panoramic image based on where the re-projected version corresponds to a defined angular receptive field in a portion of the panoramic image.
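By way of non-limiting illustration, one way (assumed here, not mandated by the embodiments above) to realize such angular-area weighting during training is to weight the per-pixel loss by the solid angle each pixel covers in the equirectangular projection, which shrinks toward the poles:

```python
import math
import torch

def angular_area_weights(height, width):
    # Pixel solid angle in an equirectangular projection is proportional to cos(latitude),
    # so the weight decreases as the projected angular area of a pixel decreases.
    lat = (torch.arange(height, dtype=torch.float32) + 0.5) / height * math.pi - math.pi / 2
    return torch.cos(lat).clamp(min=0.0)[:, None].expand(height, width)

def weighted_depth_loss(pred, target):
    """pred, target: (B, H, W) depth maps for panoramas projected onto a 2D plane."""
    weights = angular_area_weights(pred.shape[-2], pred.shape[-1]).to(pred.device)
    return (weights * (pred - target).abs()).sum() / (weights.sum() * pred.shape[0])
```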
Fig. 7 presents an exemplary computer-implemented method 700 for deriving 3D data from panoramic 2D image data in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 702, a system (e.g., system 500) operatively coupled to a processor receives a request for depth data associated with an area of an environment depicted in a panoramic image. For example, in some implementations, a request may be received from a user device based on user-provided input requesting a particular portion of a panoramic image for 3D viewing, for use in association with a 3D imaging or modeling application, and so forth. In another example, a request can be received from a 3D modeling application in association with determining that depth data of the region is needed to facilitate an alignment process or to generate a 3D model. In another example, a request may be received from an AR application based on information indicating that an area of an environment is within a current field of view of a user employing the AR application. In yet another example, a request may be received from an autonomous navigational vehicle based on information indicating that an area of the environment is within a current field of view of the vehicle (e.g., to facilitate avoiding a collision with an object in front of the vehicle). In yet another example, a request may be received from an object tracking device based on information indicating that an object tracked by the device is located within an environment area.
At 704, based on receiving the request, the system may derive depth data for the entire panoramic image using a neural network model configured to derive depth data from a single two-dimensional image (e.g., using one or more panoramic models 514 via the 3D data derivation component 110). At 706, the system extracts the portion of the depth data corresponding to the region of the environment (e.g., via the cropping component 510), and at 708, the system provides that portion of the depth data to an entity associated with the request (e.g., a device, system, user device, application, etc., from which the request was received) (e.g., via the panoramic component 506, the computing device 104, etc.).
Fig. 8 presents another exemplary system 800 that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data in accordance with various aspects and embodiments described herein. The system 800 includes the same or similar features as the system 500, with the addition of the native auxiliary data 802 as input. The system 800 also includes an upgraded 3D-from-2D processing module 804, which differs from the 3D-from-2D processing module 504 in the addition of an auxiliary data component 806, auxiliary data component output data 808, and one or more enhanced 3D-from-2D models configured to process the 2D image data and the auxiliary data to provide derived 3D data that is more accurate relative to the data provided by the one or more standard models 114. These enhanced 3D-from-2D models are referred to herein, and described with respect to system 800, as enhancement models 810. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
The systems 100 and 500 are generally directed to using only a single 2D image (including panoramic images and narrower field of view images) as input to one or more 3D-from-2D models (e.g., the one or more standard models 114 and/or the one or more panoramic models 514) to derive 3D data therefrom (i.e., the derived 3D data 116). The system 800 incorporates the use of various types of auxiliary input data that may be associated with 2D images to facilitate improving the accuracy of 3D-from-2D predictions. Such auxiliary input data may include, for example: information about the capture location and orientation of the 2D image, information about the capture parameters of the capture device generating the 2D image (e.g., focal length, resolution, lens distortion, illumination, other image metadata, etc.), actual depth data associated with the 2D image captured by a 3D sensor (e.g., 3D capture hardware), depth data derived for the 2D image using stereo image processing, and the like.
In the illustrated embodiment, the auxiliary input data that can be used as additional input to facilitate improving the accuracy of 3D-from-2D predictions can be received in association with one or more 2D images as native auxiliary data 802. In this regard, the auxiliary data is characterized as "native" to indicate that, in some embodiments, it may include raw sensory data and other types of raw auxiliary data that may be processed by the auxiliary data component 806 to generate structured auxiliary data, which may then be used as input to the one or more enhancement models 810. For these embodiments, as described in greater detail with reference to fig. 9, the auxiliary data component output data 808 may include structured auxiliary data generated by the auxiliary data component 806 based on the native auxiliary data 802. For example, in one implementation, the native auxiliary data 802 may include motion data captured by an inertial measurement unit (IMU) in association with an environment scan involving capturing several images at different capture locations. According to this example, the auxiliary data component 806 may determine capture position and orientation information for the respective 2D images based on the IMU motion data. The determined capture position and orientation information may be considered structured auxiliary data, which may then be associated with the respective 2D images and used as input to the one or more enhancement models 810.
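As a non-limiting sketch only (a real system would handle gravity compensation, sensor bias, and drift correction), capture position and orientation can be roughly estimated from raw IMU samples by dead reckoning and then associated with each image by timestamp:

```python
import numpy as np

def dead_reckon(timestamps, gyro, accel):
    """timestamps: (N,) seconds; gyro: (N, 3) rad/s; accel: (N, 3) m/s^2 with gravity removed."""
    orientation = np.zeros(3)   # roll/pitch/yaw under a small-angle approximation
    velocity = np.zeros(3)
    position = np.zeros(3)
    poses = []
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        orientation = orientation + gyro[i] * dt      # integrate angular rate
        velocity = velocity + accel[i] * dt           # integrate acceleration
        position = position + velocity * dt           # integrate velocity
        poses.append((timestamps[i], orientation.copy(), position.copy()))
    return poses

# Each 2D image can then be tagged with the pose whose timestamp is closest to the
# image's own capture timestamp, yielding structured capture position/orientation data.
```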
In other embodiments, the native auxiliary data 802 may include various auxiliary data (e.g., actual ground truth data provided by the capture device, actual capture location and orientation information, etc.) that may be used directly as input to one or more 3D-from-2D models. With these implementations, the auxiliary data component 806 can ensure accurate correlation of the native auxiliary data 802 with a particular 2D image and/or convert the native auxiliary data into a structured machine-readable format (if desired) for input, together with the 2D image, to the one or more enhancement models 810. Thus, the auxiliary data component output data 808 may include native auxiliary data 802 associated with the 2D image in its original form and/or in a structured format.
According to any of these embodiments, the one or more enhancement models 810 may include one or more enhanced 3D-from-2D models employing one or more neural networks that have been specifically trained to derive 3D data from a 2D image in combination with one or more auxiliary data parameters associated with the 2D image. Thus, the derived 3D data 116 generated by an enhancement model may be more accurate than the 3D data determined by the one or more standard models 114. The one or more enhancement models 810 may also include one or more enhanced panoramic models. In this regard, an enhanced panoramic model may employ one or more of the features and functions of the panoramic models 514 discussed herein, and may additionally be configured to evaluate auxiliary data associated with panoramic images or images otherwise classified as having a wide field of view. In some implementations, the derived 3D data 116 generated by an enhanced panoramic model may be more accurate than the data determined by the one or more panoramic models 514.
In some implementations, the enhancement models 810 may include a plurality of different 3D-from-2D models, each configured to process a different set or subset of auxiliary data parameters associated with the 2D image. With these implementations, the model selection component 512 may be configured to select an applicable enhancement model from the plurality of enhancement models 810 to apply to the 2D image based on the auxiliary data associated with the 2D image. For example, based on the type of auxiliary data associated with the 2D image (e.g., included in the native auxiliary data 802 and/or determined by the auxiliary data component 806 based on the native auxiliary data 802), the model selection component 512 may be configured to select an appropriate enhancement model from the plurality of enhancement models 810 to apply to the input data set comprising the 2D image and the associated auxiliary data, thereby deriving 3D data for the image. In other implementations, the enhancement models 810 may include a generic model configured to process the 2D image plus one or more defined auxiliary data parameters. With these implementations, the 3D-from-2D processing module 804 may be configured to receive and/or determine the one or more defined auxiliary parameters for the respective 2D images processed by the 3D data derivation component 110 using the enhancement model. Otherwise, if the 2D image is not associated with auxiliary data (e.g., none is received or can be determined by the auxiliary data component 806) or is associated with insufficient or incomplete auxiliary data, the 3D data derivation component 110 can employ one or more standard models 114 to derive 3D data for the 2D image.
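For illustration only, the following sketch shows one way a model selection step might choose among enhancement models based on the auxiliary data accompanying an image; the model identifiers, auxiliary data keys, and the 2:1 aspect-ratio test for panoramas are hypothetical.

```python
def is_panoramic(width, height):
    # Assumption: treat a roughly 2:1 aspect ratio as an equirectangular panorama.
    return abs(width / height - 2.0) < 0.1

def select_model(width, height, aux):
    """aux: dict of auxiliary data associated with the 2D image (keys are hypothetical)."""
    if aux.get("sensor_depth") is not None:
        return "enhanced_partial_depth_model"
    if aux.get("capture_pose") is not None:
        return "enhanced_pose_model"
    if aux.get("stereo_partner") is not None:
        return "enhanced_stereo_model"
    if is_panoramic(width, height):
        return "panoramic_model"
    return "standard_model"   # fall back when no usable auxiliary data is available
```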
In various additional embodiments discussed in more detail below with reference to fig. 9, the native auxiliary data 802 may include auxiliary data associated with the 2D image that may be used by the auxiliary data component 806 to pre-process the 2D image prior to its input to the one or more 3D-from-2D models used to generate the derived 3D data 116 for the image. Such preprocessing can convert the image into a unified representation before one or more 3D-from-2D models are applied to derive 3D data therefrom, so that the results of the neural network are not degraded by differences between the training images and the real images. With these embodiments, the one or more enhancement models 810 may include an enhanced 3D-from-2D model that has been specifically configured to derive depth data for preprocessed 2D images using training data that was preprocessed according to the techniques described below. Thus, in some implementations, after the received 2D image has been preprocessed, the model selection component 512 may select a particular enhancement model configured to evaluate preprocessed 2D images for use by the 3D data derivation component 110 to generate the derived 3D data 116 for the preprocessed 2D image. In other implementations, the preprocessed 2D image may be used as input to one or more standard models 114, which nonetheless provide more accurate results due to the consistency of the input data. The auxiliary data component 806 can also pre-process panoramic images prior to inputting them into the one or more panoramic models 514 to further improve the accuracy of the results.
In the illustrated embodiment, the auxiliary data component output data 808 may also be provided to and used by the 3D model generation component 118 to facilitate generation of a 3D model. For example, the auxiliary data may be used by the 3D model generation component 118 to facilitate alignment of images captured at different capture locations and/or orientations (and their associated derived 3D data 116) relative to one another in a three-dimensional coordinate space. In this regard, in various embodiments, some or all of the auxiliary data component output data 808 may not be used as input to a 3D-from-2D prediction model to improve the accuracy of the derived 3D data for an associated 2D image. Instead, the auxiliary data component output data 808 associated with the 2D image may be employed by the 3D model generation component 118 to facilitate generating a 3D model based on the 2D image and the derived 3D data 116 determined for the 2D image. In this regard, the combination of the auxiliary data, the 2D image, and the derived 3D data 116 for the 2D image may be used by the 3D model generation component 118 to facilitate generating an immersive 3D environment of a scene, as well as other forms of 3D (and in some implementations, 2D) reconstruction.
For example, in one implementation, the auxiliary data component output data 808 (or the native auxiliary data 802) may include depth sensor measurements associated with a 2D image captured by one or more depth sensors. For this example, the depth sensor measurements may be combined with the derived 3D data of the 2D image so that each fills gaps in the other. In another example, the auxiliary data may include location information identifying the capture location of the 2D image. For this example, the location information may not be used as an input to the 3D-from-2D model to facilitate depth prediction, but instead may be used by the 3D model generation component 118 to facilitate aligning the 2D image and its associated derived 3D data 116 with other 2D images and their associated derived 3D data sets.
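A minimal sketch of this kind of combination, assuming the sensor depth and the derived depth are already aligned to the same image grid (with missing sensor readings marked as NaN), is shown below; the median-ratio rescaling is an illustrative choice, not a required step.

```python
import numpy as np

def fuse_depth(sensor_depth, derived_depth):
    """sensor_depth, derived_depth: (H, W) arrays; NaN marks missing values."""
    both = ~np.isnan(sensor_depth) & ~np.isnan(derived_depth)
    if both.any():
        # Scale the derived depth so it agrees with the sensor where both are available.
        scale = np.median(sensor_depth[both] / np.maximum(derived_depth[both], 1e-6))
        derived_depth = derived_depth * scale
    # Keep sensor readings where present; fill the gaps with the derived depth.
    return np.where(np.isnan(sensor_depth), derived_depth, sensor_depth)
```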
FIG. 9 presents a more detailed representation of the native auxiliary data 802, the auxiliary data component 806, and the auxiliary data component output data 808 in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
The native auxiliary data 802 may include various types of auxiliary data associated with the 2D image and/or the process used to capture the 2D image, which may be used to facilitate improving the accuracy of 3D-from-2D predictions and/or used by the 3D model generation component 118 to improve the quality of the 3D model. For example, the native auxiliary data 802 may include, but is not limited to, capture device motion data 904, capture device location data 906, camera/image parameters 908, and 3D sensor data 910. The capture device motion data 904 may include information regarding movement of a camera associated with the capture of multiple images of an object or environment. For example, in some implementations, the capture device motion data 904 may include data captured by an IMU, accelerometer, or the like that is physically coupled to the camera used to capture the images. For example, IMU measurements may include data captured in association with movement of the camera to different locations in the environment while the camera is capturing images (or while it is not, such as movement between captures), rotation of the camera about a fixed axis, movement of the camera in the vertical and horizontal directions, and so forth. In some implementations, the IMU measurements may be associated, via timestamps or the like, with the respective images captured by the camera in association with camera movement during the capture process. For example, in implementations in which a camera is used to capture many images of an environment as the camera operator moves the camera to different locations throughout the environment to capture different areas and perspectives, each captured image may be associated with a timestamp indicating its capture time relative to the other images and with motion data reflecting camera movement during and/or between captures.
The capture device location data 906 may include information identifying or indicating a capture location of the 2D image. For example, in some implementations, the capture device location data may include Global Positioning System (GPS) coordinates associated with the 2D image. In other implementations, the capture device location data 906 may include location information indicating a relative location of the capture device (e.g., camera and/or 3D sensor) with respect to its environment, such as a relative location or calibration location of the capture device with respect to an object in the environment, another camera in the environment, another device in the environment, etc. In some implementations, this type of location data may be determined by a capture device (e.g., a camera and/or a device operatively coupled to the camera, including positioning hardware and/or software) in association with image capture and received with the image.
The camera/image parameters 908 may include information regarding operating parameters and/or settings of one or more cameras (or one or more camera lenses) used to capture the 2D image data 102, as well as contextual information associated with the capture conditions. For example, various camera operating parameters for capturing images may vary based on the function of the camera, the default or user-selected camera settings employed, the illumination in which the image is captured, and so forth. In this regard, the camera/image parameters 908 may include camera settings and capture context information associated with the 2D image (e.g., as metadata or otherwise associated with the received 2D image), including, but not limited to: focal length, aperture, field of view, shutter speed, lens distortion, illumination (exposure, gamma, tone mapping, black level), color space (white balance), ISO, and/or other parameters that may vary from image to image.
The 3D sensor data 910 may include any type of 3D data, captured by a 3D sensor or 3D capture hardware, that is associated with a 2D image included in the received 2D image data 102. This may include 3D data or depth data captured using one or more structured light sensor devices, LiDAR devices, laser rangefinder devices, time-of-flight sensor devices, light field cameras, active stereo devices, and the like. For example, in some embodiments, the received 2D image data 102 may include 2D images captured by a 2D/3D capture device, or a 2D/3D capture device component, that includes one or more 3D sensors in addition to one or more 2D cameras (e.g., RGB cameras). In various implementations, the 2D/3D capture device may be configured to capture 2D images using one or more cameras (or one or more camera lenses) and to capture associated depth data for the 2D images using the one or more 3D sensors simultaneously (e.g., at or near the same time), or, if not simultaneously, in a manner that allows them to be correlated after capture. The sophistication (e.g., complexity, hardware cost, etc.) of such 2D/3D capture devices/components may vary. For example, in some implementations, to reduce cost, the 2D/3D capture device may include one or more cameras (or one or more camera lenses) and a limited range/field of view 3D sensor configured to capture partial 3D data for the 2D image. One version of such a 2D/3D capture device generates spherical color images and depth data. For example, the 2D/3D capture device may include one or more color cameras capable of capturing image data (e.g., spherical panoramic images) having fields of view spanning up to 360° vertically and horizontally, and a structured light sensor configured to capture depth data for a middle portion of the vertical field of view (e.g., near the equator).
Although the native auxiliary data 802 is described as a separate entity from the 2D image data 102, this description is for exemplary purposes only, to indicate that the native auxiliary data is an optional addition in one or more embodiments of the disclosed systems. In this regard, it should be appreciated that the 2D image data 102 may be received with its associated native auxiliary data 802 as a single data object/file, as metadata, and so forth. For example, the 2D image data 102 may include a 2D image having 3D sensor depth data captured for the 2D image associated therewith, metadata describing camera/image parameters, and the like.
The auxiliary data component 806 can include various computer-executable components that facilitate processing the native auxiliary data 802 and/or the received 2D image data 102 to generate structured auxiliary data 930 and/or preprocessed 2D image data 932. In the illustrated embodiment, these components include an orientation estimation component 912, a position estimation component 914, a depth estimation component 916, a multiple image analysis component 918, a 3D sensor data association component 924, a preprocessing component 926, and a semantic labeling component 928.
The orientation estimation component 912 may be configured to determine or estimate the capture orientation or pitch of a 2D image and/or the relative orientation/pitch of the 2D image with respect to a common 3D coordinate space. For example, in some embodiments, the orientation estimation component 912 may determine the orientation of a received 2D image based on IMU or accelerometer measurements associated with the 2D image (as provided by the capture device motion data 904). The determined orientation or pitch information may be characterized as structured auxiliary data 930 and associated with the 2D image. Orientation information determined for the 2D image may be used with the 2D image as input to one or more enhanced 3D-from-2D models (e.g., one or more enhancement models 810) to generate the derived 3D data 116 of the 2D image, used by the 3D model generation component 118 to facilitate the alignment process associated with 3D model generation, and/or stored in memory (e.g., memory 122 or external memory) for additional applications.
The position estimation component 914 may be configured to determine or estimate the capture position of a 2D image and/or the relative position of the 2D image with respect to a common 3D coordinate space. The determined capture position information may also be characterized as structured auxiliary data 930 and associated with the 2D image. The position information may also be used with the 2D image as input to one or more enhanced 3D-from-2D models (e.g., one or more enhancement models 810) to generate the derived 3D data 116 of the 2D image, used by the 3D model generation component 118 to facilitate the alignment process associated with 3D model generation, and/or stored in memory (e.g., memory 122 or external memory) for additional applications.
The position estimation component 914 can employ various techniques to determine the capture position (i.e., the capture location) of a 2D image based on the type of auxiliary data available. For example, in some implementations, the capture device location data 906 may identify or indicate the capture location of the received 2D image (e.g., GPS coordinates of the capture device). In other implementations, the position estimation component 914 can employ the capture device motion data 904 to determine the capture position of the 2D image using inertial position tracking analysis. In other embodiments, the native auxiliary data 802 may include sensed data captured in association with the capture of one or more 2D images, which may be used to facilitate determining the capture location of the 2D images. For example, the sensed data may include 3D data captured by a stationary sensor, an ultrasound system, a laser scanner, or the like, which may be used to facilitate determining the location of the capture device that captured the one or more 2D images using visual odometry techniques, line-of-sight mapping and localization, time-of-flight mapping and localization, or the like.
In some embodiments, the orientation estimation component 912 and/or the position estimation component 914 can employ one or more related images included in the 2D image data 102 to facilitate determining the capture orientation and/or position of a 2D image. For example, the related 2D images may include neighboring images, images with partially overlapping fields of view, images with slightly different capture locations and/or capture orientations, stereoscopic image pairs, images providing different perspectives of the same object or environment captured at significantly different capture locations (e.g., separated by more than a threshold distance such that they do not constitute a stereoscopic image pair, such as an interocular distance of greater than about 6.5 centimeters), and so forth. The sources of the related 2D images included in the 2D image data 102, and the relationships between the related 2D images, may vary. For example, in some implementations, the 2D image data 102 may include video data 902 comprising successive frames of video captured in association with movement of a video camera. The related 2D images may also include video frames captured by a video camera at a fixed position/orientation but at different points in time, as one or more characteristics of the environment change over time. In another example, similar to successive frames of video captured by a video camera, an entity (e.g., a user, a robot, an autonomous vehicle, etc.) may use a camera to capture several 2D images of an environment in association with the entity's movement about the environment. For example, using a stand-alone digital camera, smartphone, or similar device with a camera, a user may walk around the environment and take 2D images at several points along the way, capturing different perspectives of the environment. In another exemplary implementation, the related 2D images may include 2D images from nearby or overlapping perspectives captured by a single camera in association with rotation of the camera about a fixed axis. In another implementation, the related 2D images may include two or more images captured by two or more cameras, respectively, with partially overlapping fields of view or from different perspectives of the environment (e.g., captured simultaneously or near simultaneously by the different cameras). In this implementation, the related 2D images may include images that form a stereoscopic image pair. The related 2D images may also include images captured by two or more different cameras not arranged as a stereo pair.
In some embodiments, the orientation estimation component 912 and/or the position estimation component 914 can employ visual odometry and/or simultaneous localization and mapping (SLAM) to determine or estimate the capture orientation/position of a 2D image based on a sequence of related images captured in association with movement of the camera. Visual odometry can be used to estimate the camera capture orientation and position from a sequence of images using feature matching (matching features across multiple frames), feature tracking (matching features in adjacent frames), and optical flow techniques (based on the intensity of all pixels or of specific regions in sequential images). In some embodiments, the orientation estimation component 912 and/or the position estimation component 914 may employ the capture device motion data 904, the capture device location data 906, and/or the 3D sensor data 910 in association with evaluating the image sequence using visual odometry and/or SLAM to determine the capture position/orientation of the 2D image. A SLAM algorithm is configured to simultaneously localize a capture device (e.g., a 2D image capture device or a 3D capture device) relative to its surrounding environment (e.g., determine its position and/or orientation) and map the structure of that environment. SLAM algorithms may involve tracking a set of points through a sequence of images, using these tracks to triangulate the 3D positions of the points while simultaneously using the points to determine the relative position/orientation of the capture device that observed them. In this regard, in addition to determining the position/orientation of the capture device, a SLAM algorithm may be used to estimate depth information for features included in one or more images of the sequence.
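As a non-limiting sketch of the feature-matching approach described above, the relative pose between two frames can be estimated with OpenCV by matching ORB features and decomposing the essential matrix; the camera intrinsics K are assumed to be known, and the recovered translation has only relative (unit) scale.

```python
import cv2
import numpy as np

def relative_pose(img1_gray, img2_gray, K):
    """Estimate the rotation and unit-scale translation from frame 1 to frame 2."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1_gray, None)
    kp2, des2 = orb.detectAndCompute(img2_gray, None)

    # Match binary descriptors between the two frames.
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Robustly estimate the essential matrix and recover the relative camera motion.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```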
In some embodiments, the sequence of related images may include images captured in association with a scan of the environment involving capturing several images at different capture locations. In another example, the sequence of related images may include video data 902 associated with 2D images of an object or environment captured during movement of the capture device in association with a scan of the object or environment, where the scan involves capturing multiple images of the object or environment from different capture locations and/or orientations. For example, in some implementations, the video data 902 may include video captured in addition to the one or more 2D images (e.g., by a separate camera) during the scan. The video data 902 may also be used by the orientation estimation component 912 and/or the position estimation component 914 to determine the capture orientation/position of one or more 2D images captured during the scan using visual odometry and/or SLAM techniques. In some implementations, the video data 902 may include frames that can be processed by the system 800 or the like to derive 3D data therefrom using one or more of the 3D-from-2D techniques described herein (e.g., using one or more standard models 114, panoramic models 514, enhancement models 810, etc.). According to this example, one or more of these frames may be used as the primary input image from which 3D data is derived using one or more of the 3D-from-2D techniques described herein. Additionally, the orientation estimation component 912 and/or the position estimation component 914 can employ neighboring frames to facilitate determining the capture orientation/position of the primary input frame using visual odometry and/or SLAM.
The depth estimation component 916 can also evaluate related images to estimate depth data for one or more of the related images. For example, in some embodiments, the depth estimation component 916 may employ SLAM to estimate depth data based on a sequence of images. The depth estimation component 916 can also employ related photogrammetry techniques to determine depth information for a 2D image based on one or more related images. In some implementations, the depth estimation component 916 can also employ the capture device motion data 904 and one or more structure-from-motion techniques to facilitate estimating depth data for the 2D image.
In some embodiments, the depth estimation component 916 may also be configured to employ one or more passive stereo processing techniques to derive depth data from image pairs classified as stereo image pairs (e.g., image pairs offset by a stereo pair distance, such as an interocular distance offset of about 6.5 centimeters). Passive stereo involves comparing two stereo images that are horizontally displaced from one another and provide two different views of a scene. By comparing the two images, relative depth information can be obtained in the form of a disparity map, which encodes the differences in the horizontal coordinates of corresponding image points. The values in the disparity map are inversely proportional to the scene depth at the corresponding pixel locations. In this regard, given a pair of stereo images acquired from slightly different viewpoints, the depth estimation component 916 can employ a passive stereo matching function that identifies and extracts corresponding points in the two images. Given these correspondences and the capture locations of the images, the 3D world coordinates of each image point, and thus the scene structure, can be reconstructed by triangulation. The disparity, in which the depth data is encoded, represents the difference between the x-coordinates of corresponding points in the left and right images.
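A minimal sketch of passive stereo matching, assuming rectified grayscale inputs and illustrative matcher parameters, is shown below; depth is recovered from the disparity map as focal_length * baseline / disparity, consistent with disparity being inversely proportional to depth.

```python
import cv2
import numpy as np

def stereo_depth(left_gray, right_gray, focal_px, baseline_m):
    """left_gray, right_gray: rectified uint8 images; baseline_m e.g. ~0.065 m."""
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    # OpenCV returns fixed-point disparities scaled by 16.
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0

    depth = np.full_like(disparity, np.nan)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```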
In various implementations, the stereo image pair may include images offset along a horizontal axis by a stereo pair distance (e.g., an interocular distance of about 6.5 centimeters), corresponding to left and right images similar to those seen by the left and right eyes. In other implementations, the stereo image pair may include an image pair offset by a stereo pair distance along a vertical axis. For example, in some embodiments, the received 2D images may include panoramic image pairs having fields of view spanning 360° (or up to 360°) captured from different vertical positions relative to the same vertical axis, wherein the different vertical positions are offset by a stereo pair distance. In some implementations, the respective vertically offset stereo images may be captured by a camera configured to move to the different vertical positions to capture the respective images. In other implementations, the respective vertically offset stereo images may be captured by two different cameras (or camera lenses) located at the different vertical positions.
In some implementations, the depth estimation component 916 can also employ one or more active stereo processes, in accordance with various active stereo capture techniques, to derive depth data for stereo image pairs captured in association with projected light (e.g., structured light, laser light, etc.). Active stereo processing employs light emission (e.g., via a laser, a structured light device, etc.) associated with the capture of stereo images to facilitate stereo matching; the word "active" refers to projecting energy into the environment. In an active stereo system, in association with the capture of the stereo images, a light projection unit or laser unit projects a light or light pattern (or multiple light patterns simultaneously) onto the scene. The light pattern detected in the captured stereo images may be used to facilitate extraction of depth information for features included in the respective images. For example, the depth estimation component 916 may perform active stereo analysis by finding correspondences between visual features included in the respective images, based in part on the correspondence between the light appearing in the respective images and the known position of the light/laser beam relative to the image capture positions.
The passive and/or active stereo derived depth data may be associated with one or both images in the stereo pair. Depth data determined by the depth estimation component 916 for a 2D image based on analysis of one or more related images (e.g., using SLAM, photogrammetry, structure from motion, stereo processing, etc.) may also be characterized as structured auxiliary data 930. The depth data may also be used with the 2D image as input to one or more enhanced 3D-from-2D models (e.g., one or more enhancement models 810) to generate the derived 3D data 116 of the 2D image, used by the 3D model generation component 118 to facilitate the alignment process associated with 3D model generation, and/or stored in memory (e.g., memory 122 or external memory) for additional applications.
In other embodiments, rather than determining depth data with a passive stereo algorithm, the depth estimation component 916 may evaluate a stereo image pair to determine data about the quality of photometric matches between the images at various depths (a more intermediate result). In this regard, the depth estimation component 916 can determine auxiliary data for one or both images included in the stereo pair by determining match quality data regarding the quality of photometric matches between the respective images at various depths. The photometric match quality data can be used as auxiliary data for either 2D image in the stereo pair when used as input to an enhanced 3D-from-2D model, thereby facilitating the derivation of depth data for that 2D image.
The multiple image analysis component 918 can facilitate identifying, associating, and/or defining a relationship between two or more related images. In the illustrated embodiment, the multiple image analysis component 918 can include an image correlation component 920 and a relationship extraction component 922.
The image correlation component 920 may be configured to identify and/or classify the correlated images included in the received 2D image data 102. The image correlation component 920 can employ various techniques to identify and/or classify two or more images as correlated. In some embodiments, the correlated images, and the information defining the relationships between them that is employed by the auxiliary data component 806, may be predefined. In this regard, the auxiliary data component 806 can identify and extract one or more related images included in the 2D image data 102 based on predefined information associated therewith. For example, an image pair may be received with information classifying the images as a stereoscopic image pair. In another example, the capture device may be configured to provide two or more images captured in association with rotation about a fixed axis. According to this example, the images may be received with information annotating the capture scenario and identifying their capture locations and orientations relative to one another. The image correlation component 920 may further be configured to automatically classify images captured under such a capture scenario as correlated. In another example, the image correlation component 920 may be configured to automatically classify a set of images captured by the same camera within a defined time window in association with a scan as correlated. Similarly, based on the capture device motion data 904 (e.g., movement in a particular direction that is less than a threshold distance or degree of rotation), the image correlation component 920 can be configured to automatically classify as correlated respective frames of video that are included in the same video clip of less than a defined duration and/or are associated with a defined range of movement.
In other embodiments, the image correlation component 920 may be configured to identify the correlated images included in the 2D image data 102 based on the respective capture locations of the images (which may be provided with the received images and/or determined at least in part by the position estimation component 914) and the respective capture orientations of the images (which may be provided with the received images and/or determined at least in part by the orientation estimation component 912). For example, the image correlation component 920 may be configured to classify two or more images as correlated based on their capture locations and/or capture orientations differing by defined distances and/or degrees of rotation. For instance, the image correlation component 920 can identify and classify two images as correlated based on their having the same capture location but capture orientations that differ by a defined degree of rotation. Likewise, the image correlation component 920 can identify and classify two images as correlated based on their having the same capture orientation but capture locations separated by a defined distance or range of distances. According to this example, the image correlation component 920 can also identify and classify image pairs as stereo pairs.
The image correlation component 920 can also identify correlated images based on capture times and/or motion data regarding relative changes in motion between two or more images. For example, the image correlation component 920 can identify correlated images based on their having respective capture times within a defined time window, respective capture times separated by at most a maximum duration, and so forth. In other implementations, the image correlation component 920 can use one or more image analysis techniques to identify correspondences between visual features included in two or more images in order to identify correlated images. The image correlation component 920 can further identify/classify correlated images based on the degree of correspondence in the visual features relative to a defined threshold. The image correlation component 920 can also use depth data associated with the respective images (e.g., the 3D sensor data 910), if provided, to determine spatial relationships between the relative locations of corresponding visual features, and employ these spatial relationships to identify/classify correlated images.
The relationship extraction component 922 may be configured to determine relationship information defining the relationships between correlated images and to associate that information with the images. For example, the relationship extraction component 922 can determine information regarding the elapsed time between the capture of two or more potentially correlated images, the relative capture locations of the images (which can be provided with the received images and/or determined at least in part by the position estimation component 914), the relative capture orientations of the images (which can be provided with the received images and/or determined at least in part by the orientation estimation component 912), information regarding correspondences between visual and/or spatial features of the correlated images, and the like. The relationship extraction component 922 can further generate and associate with two or more correlated images relationship information defining the relationships between the images (e.g., relative position, orientation, time of capture, visual/spatial correspondence, etc.).
In some embodiments, as described above, one or more components of the auxiliary data component 806 (e.g., the orientation estimation component 912, the position estimation component 914, and/or the depth estimation component 916) can employ the correlated images to generate structured auxiliary data 930 for one or more images included in a set of (two or more) correlated images. For example, as described above, in various embodiments, the 2D image data 102 may include video data 902 and/or 2D images captured in association with a scan that provide adjacent (but different) perspectives of an environment in sequential images (e.g., video frames and/or still images). For these embodiments, the position estimation component 914 and/or the orientation estimation component 912 can use the related sequential images to determine capture position/orientation information for a single 2D image using visual odometry and/or SLAM techniques. Similarly, the depth estimation component 916 can employ the correlated images to derive depth data using stereo processing, SLAM, structure from motion, and/or photogrammetry techniques.
In other embodiments, the correlated images may be used as input to one or more 3D-from-2D models (e.g., included in the 3D-from-2D model database 112) to facilitate the derivation (e.g., by the 3D data derivation component 110) of depth data for one or more images included in a set of (two or more) correlated images. For these embodiments, the one or more enhancement models 810 may include an enhanced 3D-from-2D neural network model configured to receive and process two or more input images (as opposed to, e.g., the standard models 114, which are configured to evaluate only a single image at a time). The enhanced 3D-from-2D neural network model may be configured to evaluate relationships between the correlated images (e.g., using a deep learning technique) in order to facilitate deriving depth data for one or more of them. For example, in some implementations, one image included in a set of correlated images may be selected as the primary image for which the derived 3D data 116 is determined, and the neural network model may use one or more of the other related images in the set to facilitate deriving the 3D data of the primary image. In other implementations, the enhanced 3D-from-2D model may be configured to derive depth data for multiple input images at once. For example, the enhanced 3D-from-2D model may determine depth information for all or some of the related input images. In association with using the correlated images as inputs to the enhanced 3D-from-2D neural network model, relationship information describing the relationships between the respective images (e.g., determined by the relationship extraction component 922 and/or associated with the respective images) may be provided as input along with the respective images and evaluated by the enhanced 3D-from-2D neural network model.
The 3D sensor data association component 924 may be configured to identify 3D sensor data 910 received for an image and associate it with the corresponding 2D image to facilitate using the 3D sensor data 910 as input to the one or more enhancement models 810. In this regard, the 3D sensor data association component 924 may ensure that 3D data received in association with a 2D image is in a consistent, structured, machine-readable format prior to input to the neural network. In some implementations, the 3D sensor data association component 924 can process the 3D sensor data 910 to ensure that the data is accurately correlated with the corresponding pixels, super-pixels, etc., of the image for which the data was captured. For example, in implementations in which partial 3D sensor data is received for a 2D image (e.g., for a middle portion of a spherical image located near the equator, as compared to the entire field of view of the spherical image), the 3D sensor data association component 924 may ensure that the partial 3D data is accurately mapped to the region of the 2D image for which the data was captured. In some implementations, the 3D sensor data association component 924 may calibrate the 3D depth data received with the 2D image to the capture location and/or a corresponding location in the common 3D coordinate space, such that additional or optimized depth data determined for the image using an enhanced 3D-from-2D model may be based on or calibrated to the same reference point. The 3D sensor data 910 associated with the 2D image (e.g., in a standardized format and/or, in some implementations, with calibration information) may also be used with the 2D image as input to one or more enhanced 3D-from-2D models (e.g., one or more enhancement models 810) to generate the derived 3D data 116 of the 2D image, used by the 3D model generation component 118 to facilitate the alignment process associated with 3D model generation, and/or stored in memory (e.g., memory 122 or external memory) for additional applications.
The preprocessing component 926 may be configured to pre-process images, based on the camera/image parameters 908 associated with the respective images, to convert them into a unified representation format prior to input into a 3D-from-2D neural network model (e.g., included in the 3D-from-2D model database 112), so that the results of the neural network are not degraded by differences between the training images and the real images. In this regard, the preprocessing component 926 may alter one or more characteristics of a 2D image to convert it into a modified version that conforms to a standard representation format defined for 2D images processed by a particular neural network model. Accordingly, that neural network model may be an enhanced neural network model that has been trained to evaluate images conforming to the standard representation format. For example, the preprocessing component 926 can correct or modify image defects to account for lens distortion, illumination changes (exposure, gamma, tone mapping, black level), color space (white balance) changes, and/or other image defects. In this regard, the preprocessing component 926 may comprehensively balance the respective images to account for differences between their camera/image parameters.
In various embodiments, the preprocessing component 926 may determine whether and how to change the 2D image based on camera/image parameters associated with the image (e.g., received with the image as metadata). For example, the preprocessing component 926 may identify differences between one or more camera/image parameters associated with the received 2D image and one or more defined camera/image parameters of the standard representation format. The preprocessing component 926 can further alter (e.g., edit, modify, etc.) one or more characteristics of the 2D image based on the differences. In some implementations, the one or more characteristics may include visual characteristics, and the preprocessing component 926 may alter the one or more visual characteristics. The preprocessing component 926 can also change the orientation of the image, the size of the image, the shape of the image, the magnification level of the image, and so forth.
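By way of illustration only, the sketch below converts an image toward a unified representation by correcting lens distortion with known camera parameters and then normalizing white balance and exposure; the specific target brightness and gray-world balance are assumptions of the sketch, not the required standard representation format.

```python
import cv2
import numpy as np

def to_standard_representation(img_bgr, K, dist_coeffs, target_mean=110.0):
    """K: 3x3 camera matrix; dist_coeffs: lens distortion coefficients from image metadata."""
    # Correct lens distortion using the camera parameters associated with the image.
    undistorted = cv2.undistort(img_bgr, K, dist_coeffs)

    # Simple gray-world white balance: scale each channel toward the global mean.
    f = undistorted.astype(np.float32)
    f *= f.mean() / np.maximum(f.reshape(-1, 3).mean(axis=0), 1e-6)

    # Global exposure normalization toward a defined target brightness.
    f *= target_mean / max(f.mean(), 1e-6)
    return np.clip(f, 0, 255).astype(np.uint8)
```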
In some embodiments, the preprocessing component 926 may also use position and/or orientation information regarding the relative positions and/or orientations from which the input images were captured to rotate the input images so that the direction of motion between them is horizontal prior to input to the enhanced neural network model. For these embodiments, horizontal disparity cues may be used to train the enhanced neural network model (e.g., included in the one or more enhancement models 810) to predict depth data (e.g., the derived 3D data 116). Images preprocessed by the preprocessing component 926 can be characterized as preprocessed 2D image data 932 and used as input to one or more enhanced 3D-from-2D models (e.g., one or more enhancement models 810) trained specifically to evaluate such preprocessed images. In some implementations, the preprocessed images may also be used as input to one or more standard models 114 and/or panoramic models 514 to improve the accuracy of the results of those models. The preprocessed 2D image data 932 may also be stored in memory (e.g., memory 122 or external memory) for additional applications.
The semantic labeling component 928 may be configured to process the 2D image data 102 to determine semantic labels for features included in the image data. For example, the semantic labeling component 928 can be configured to employ one or more machine learning object recognition techniques to automatically recognize defined objects and features (e.g., walls, floors, ceilings, windows, doors, furniture, people, buildings, etc.) included in a 2D image. The semantic labeling component 928 can further assign to each recognized object a label identifying the object. In some implementations, the semantic labeling component 928 can also perform semantic segmentation and further identify and define the boundaries of the recognized objects in the 2D image. Semantic labels/boundaries associated with features included in the 2D image may be characterized as structured auxiliary data 930 and used to facilitate deriving depth data for the 2D image. In this regard, semantic label/segmentation information associated with the 2D image may also be used with the 2D image as input to one or more enhanced 3D-from-2D models (e.g., one or more enhancement models 810) to generate the derived 3D data 116 of the 2D image, used by the 3D model generation component 118 to facilitate the alignment process associated with 3D model generation, and/or stored in memory (e.g., memory 122 or external memory) for additional applications.
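As a non-limiting sketch, automatic semantic labeling/segmentation of a 2D image can be performed with an off-the-shelf pretrained model such as torchvision's DeepLabV3; the model choice and class set are illustrative and are not the specific recognition models of the semantic labeling component 928.

```python
import torch
from torchvision import models, transforms
from PIL import Image

def semantic_labels(image_path):
    """Return a (H, W) tensor of per-pixel class indices for the input image."""
    # Pretrained DeepLabV3; requires torchvision >= 0.13 for the weights argument.
    model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()
    preprocess = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        out = model(preprocess(img).unsqueeze(0))["out"][0]
    return out.argmax(0)
```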
Fig. 10 presents an exemplary computer-implemented method 1000 for employing auxiliary data related to captured 2D image data in order to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 1002, a system (e.g., system 800) operatively coupled to a processor receives a 2D image. At 1004, the system receives (e.g., via the receiving component 111) or determines (e.g., via the auxiliary data component 806) auxiliary data for the 2D image, wherein the auxiliary data includes orientation information regarding the capture orientation of the two-dimensional image. At 1006, the system derives 3D information for the 2D image using one or more neural network models (e.g., one or more enhancement models 810) configured to infer three-dimensional information based on the two-dimensional image and the auxiliary data (e.g., using the 3D data derivation component 110).
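Purely as an illustration of the data flow in method 1000, the following sketch passes the capture orientation alongside the image to a model object with an assumed predict() interface; neither the interface nor the auxiliary field name comes from the disclosure.

```python
# Hypothetical sketch of method 1000: the capture orientation is supplied as
# auxiliary data to a neural network model that accepts auxiliary inputs.
import numpy as np

def derive_3d_with_orientation(image: np.ndarray, orientation_deg: float, model) -> np.ndarray:
    """image: H x W x 3 array; orientation_deg: estimated capture orientation."""
    auxiliary = {"orientation_degrees": orientation_deg}
    # The enhanced model is assumed to take the 2D image plus auxiliary data
    # and return per-pixel depth (the derived 3D data).
    return model.predict(image, auxiliary)
```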
Fig. 11 presents another exemplary computer-implemented method 1100 for employing auxiliary data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 1102, a system (e.g., system 800) operatively coupled to a processor receives captured 2D images of an object or environment, wherein the 2D images are related in that they provide different perspectives of the object or environment. At 1104, the system derives depth information for at least one of the related 2D images based on the related 2D images, using the one or more neural network models (e.g., the one or more enhancement models 810) and the related 2D images as inputs to the one or more neural network models (e.g., via the 3D data derivation component 110). For example, the one or more neural network models may include a neural network model configured to evaluate/process more than one 2D image and use information about the relationships between the respective 2D images to facilitate deriving depth data for some or all of the input images.
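As a rough illustration of feeding related images jointly, the sketch below batches several perspectives of the same scene for a multi-view model; the model is a stand-in callable and its assumed output shape is an assumption for illustration.

```python
# Hypothetical sketch of method 1100: several related 2D images (different
# perspectives of the same object or environment) are batched so a multi-view
# model can exploit their relationships. multi_view_model is a stand-in callable.
import numpy as np

def derive_depth_from_related_images(images: list, multi_view_model) -> dict:
    """images: list of H x W x 3 arrays of the same object or environment."""
    batch = np.stack(images, axis=0)          # N x H x W x 3
    depth_maps = multi_view_model(batch)      # assumed to return N x H x W depths
    return {i: depth_maps[i] for i in range(len(images))}
```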
Fig. 12 presents another exemplary computer-implemented method 1200 for employing auxiliary data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 1202, a system (e.g., system 800) operatively coupled to a processor receives a 2D image. At 1204, the system preprocesses the 2D image, wherein the preprocessing includes changing one or more characteristics of the two-dimensional image to convert the image into a preprocessed image according to a standard representation format (e.g., via the preprocessing component 926). At 1206, the system derives 3D information for the preprocessed 2D image using one or more neural network models configured to infer 3D information based on the preprocessed 2D image (e.g., using the 3D data derivation component 110).
Fig. 13 presents another exemplary system 1300 that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data in accordance with various aspects and embodiments described herein. The system 1300 includes the same or similar functionality as the system 800, with the addition of the optimized 3D data 1306. The system 1300 also includes an updated 3D-from-2D processing module 1304, which differs from the 3D-from-2D processing module 804 in the addition of a 3D data optimization component 1302, which can be configured to generate the optimized 3D data 1306. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
Referring to fig. 9 and 13, in various embodiments, the auxiliary data component output data 808 may include structured auxiliary data 930 that includes certain depth data associated with the 2D image. For example, in some implementations, the depth data may include 3D sensor data captured in association with the capture of the 2D image and associated with the 2D image. In other implementations, the depth data may include one or more depth measurements determined by the depth estimation component 916 for the 2D image (e.g., determined using SLAM, structure-from-motion, photogrammetry, etc.). In some embodiments, this depth data (hereinafter referred to as "initial" depth data) may be used, along with the associated 2D image, as input to one or more enhancement models 810 to facilitate generating the derived 3D data 116 of the 2D image.
However, in other embodiments, in addition to and/or instead of using the initial depth data as input to one or more 3D-from-2D models included in the 3D-from-2D model database 112, the initial depth data may be provided to the 3D data optimization component 1302. The 3D data optimization component 1302 may be configured to analyze 3D/depth data obtained from different sensors and/or depth derivation modalities, including the derived 3D data 116 and the initial depth data values, in order to determine an optimized, unified interpretation of the depth data, referred to herein and depicted in the system 1300 as optimized 3D data 1306. In particular, where different types of depth sensor devices and/or depth derivation techniques are used (e.g., including different types of 3D sensor depth data, passive stereo processing, active stereo processing, SLAM processing, photogrammetry processing, structure-from-motion processing, and 3D-from-2D processing), the 3D data optimization component 1302 can analyze the different types of depth data captured and/or determined for the 2D image to determine optimized 3D data for respective pixels, superpixels, features, etc. of the 2D image.
For example, in one implementation, the 3D data optimization component 1302 may be configured to combine different depth data values associated with the same pixels, superpixels, features, areas/regions, etc. of a 2D image. The 3D data optimization component 1302 can further employ heuristics to evaluate the quality of depth data generated separately using different modalities to determine a unified interpretation of the depth data for the respective pixels, superpixels, features, areas/regions, etc. In another example, the 3D data optimization component 1302 may employ, for respective pixels, superpixels, features, areas/regions, etc. of the 2D image, an average depth measurement that averages the initial depth data and the corresponding depth measurements reflected in the derived 3D data 116. In some embodiments, the 3D data optimization component 1302 can map depth measurements determined using different approaches (including the derived 3D data 116, depth data received from 3D sensors, depth data determined using stereo processing, depth data determined using SLAM, depth data determined using photogrammetry, etc.) to corresponding pixels, superpixels, features, etc. of an image. The 3D data optimization component 1302 can further combine the respective depth values to determine an optimal depth value for each respective pixel, superpixel, etc., which weights the different measurements based on a defined weighting scheme. For example, the weighting scheme may account for known limitations of the respective depth data sources to determine the accuracy associated with each applicable source and combine the depth data from each applicable source in a principled manner to determine optimized depth information. In another implementation, the initial depth data may include partial depth data for a portion of the 2D image. In this implementation, the 3D data optimization component 1302 may be configured to use the initial depth data for the image portion associated therewith and populate the missing depth data for the remainder of the 2D image using the derived 3D data 116 determined for the remainder of the image.
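A minimal sketch of one such weighted combination is given below (Python/NumPy). The weights, the set of modalities, and the NaN convention for missing coverage are assumptions for illustration only; the disclosure's actual weighting scheme may differ.

```python
# Hypothetical sketch of the 3D data optimization step: per-pixel depth maps from
# different modalities are fused with a fixed weighting scheme, and pixels missing
# sensor coverage fall back to the 3D-from-2D prediction. Weights are illustrative.
import numpy as np

def optimize_depth(sensor_depth: np.ndarray,
                   stereo_depth: np.ndarray,
                   predicted_depth: np.ndarray,
                   weights=(0.6, 0.25, 0.15)) -> np.ndarray:
    """All inputs are H x W depth maps; NaN marks pixels a modality did not cover."""
    stacked = np.stack([sensor_depth, stereo_depth, predicted_depth], axis=0)
    w = np.array(weights, dtype=np.float32).reshape(3, 1, 1)

    valid = ~np.isnan(stacked)                 # which sources cover each pixel
    w_valid = np.where(valid, w, 0.0)
    w_sum = w_valid.sum(axis=0)

    fused = np.nansum(stacked * w_valid, axis=0) / np.maximum(w_sum, 1e-6)
    # Pixels covered by no modality keep the neural-network prediction as a fallback.
    return np.where(w_sum > 0, fused, predicted_depth)
```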
The systems 100, 500, 800, and 1300 discussed above each describe an architecture in which 2D image data and, optionally, auxiliary data associated with the 2D image data are received and processed by a general purpose computing device (e.g., computing device 104) to generate derived depth data for the 2D image, generate a reconstructed 3D model, and/or facilitate navigation of the reconstructed 3D model. For example, a general purpose computing device may be or correspond to a server device, a client device, a virtual machine, a cloud computing device, or the like. The systems 100, 500, 800, and 1300 also include a user device 130 configured to receive and display the reconstructed model, and in some implementations interface with the navigation component 126 to facilitate navigating the 3D model rendered at the user device 130. However, the systems 100, 500, 800, and 1300 are not limited to this architectural configuration. For example, in some embodiments, one or more features, functions, and associated components of the computing device 104 may be provided at the user device 130, and vice versa. In another embodiment, one or more features and functions of the computing device 104 may be provided at a capture device used to capture the 2D image data. In yet another exemplary embodiment, one or more cameras (or one or more camera lenses) for capturing the 2D image data, the 3D-from-2D processing module, the 3D model generation component 118, the navigation component 126, and the display 132 for displaying the 3D model and/or a representation of the 3D model may all be provided on the same device.
Fig. 14-25 present various example devices and/or systems having different architectural configurations that may provide one or more features and functions of systems 100, 500, 800, and/or 1300 (and additional systems described herein). In particular, according to various aspects and embodiments described herein, the various example devices and/or systems shown in fig. 14-25 respectively facilitate capturing 2D images (e.g., 2D image data 102) of an object or environment and deriving depth data from the 2D images using one or more 3D-from-2D techniques.
In this regard, the respective devices and/or systems presented in fig. 14-25 may include at least one or more cameras 1404 configured to capture 2D images, and a 3D-from-2D processing module 1406 configured to derive 3D data from the 2D images (e.g., one or more 2D images). The 3D-from-2D processing module 1406 may correspond to the 3D-from-2D processing module 106, the 3D-from-2D processing module 504, the 3D-from-2D processing module 804, the 3D-from-2D processing module 1304, or a combination thereof. In this regard, the 3D-from-2D processing module 1406 is used to collectively represent a 3D-from-2D processing module that may provide one or more features and functions (e.g., components) of any of the 3D-from-2D processing modules described herein.
The one or more cameras 1404 may include, for example, RGB cameras, HDR cameras, video cameras, and the like. In some embodiments, the one or more cameras 1404 may include one or more cameras capable of generating panoramic images (e.g., panoramic image data 502). According to some embodiments, the one or more cameras 1404 may also include video cameras capable of capturing video (e.g., video data 902). In some implementations, the one or more cameras 1404 can include cameras that provide a relatively standard field of view (e.g., about 75°). In other implementations, the one or more cameras may include cameras that provide a relatively wide field of view (e.g., from 120° to 360°), such as fisheye cameras, capture devices that use conical mirrors (e.g., capable of capturing 360° panoramic images from a single image capture), cameras capable of generating spherical color panoramic images (e.g., a RICOH THETA™ camera), and the like.
In some embodiments, the devices and/or systems presented in fig. 14-25 may employ a single camera (or single camera lens) to capture the 2D input images. For these embodiments, the one or more cameras 1404 may represent a single camera (or camera lens). According to some of these embodiments, the single camera and/or the device containing the camera may be configured to rotate about an axis to generate images at different capture orientations relative to the environment, where the combined field of view of the images spans horizontally up to 360°. For example, in one implementation, the camera and/or the device housing the camera may be mounted on a rotatable mount that is rotatable 360° while the camera captures two or more images at different points of rotation with a combined field of view spanning up to 360°. In another exemplary implementation, rather than using a rotatable mount, the camera and/or camera-containing device may be configured to rotate 360° using an internal mechanical drive mechanism (such as a wheel or vibratory force) of the camera and/or camera-containing device when placed on a flat surface. In another implementation, the one or more cameras 1404 employed by the devices and/or systems presented in fig. 14-25 may correspond to a single panoramic camera (or a camera capable of rotating to generate panoramic images) employing an actuation mechanism that allows the camera to move up and down relative to the same vertical axis. Using this implementation, a single camera may capture two or more panoramic images that span different vertical fields of view but provide the same or similar horizontal fields of view. In some embodiments, the two or more panoramic images may be combined (e.g., by the stitching component 508 or at the capture device) to generate a single panoramic image having a wider vertical field of view than either image alone. In other embodiments, a single camera may capture two panoramic images with a vertical stereoscopic offset such that the two panoramic images form a stereoscopic image pair. For these embodiments, the stereoscopic panoramic images may be used directly as input to a 3D-from-2D neural network model and/or processed by the depth estimation component 916 to derive depth data for one or both images using passive stereo processing. This additional depth data may be used as auxiliary input data for a 3D-from-2D neural network model (e.g., an enhancement model 810).
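For a rough sense of how passive stereo processing can yield such auxiliary depth, the sketch below uses OpenCV block matching on a rectified stereo pair (a vertically offset pair can be rotated 90° first so the offset is horizontal). The focal length and baseline values, and the choice of block matcher, are assumptions for illustration, not the disclosure's implementation.

```python
# Minimal passive-stereo sketch (illustrative, not the disclosure's method):
# derive a coarse depth map from a rectified, grayscale stereo pair.
import cv2
import numpy as np

def stereo_depth(left_gray: np.ndarray, right_gray: np.ndarray,
                 focal_px: float = 700.0, baseline_m: float = 0.06) -> np.ndarray:
    """left_gray/right_gray: rectified 8-bit single-channel images."""
    matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    disparity[disparity <= 0] = np.nan            # unmatched pixels
    # depth = focal_length * baseline / disparity for a rectified pair
    return focal_px * baseline_m / disparity
```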
In other embodiments, the devices and/or systems presented in fig. 14-25 may employ two or more cameras (or two or more camera lenses) to capture the 2D input images. For these embodiments, the one or more cameras 1404 may represent two or more cameras (or camera lenses). In some of these embodiments, the two or more cameras may be arranged on or in the same housing in relative positions to each other such that their combined field of view spans up to 360°. In some implementations of these embodiments, a camera pair (or lens pair) capable of generating a stereoscopic image pair (e.g., having slightly offset but partially overlapping fields of view) may be used. For example, a capture device (e.g., a device including one or more cameras 1404 for capturing the 2D input images) may include two cameras with horizontally stereo-offset fields of view capable of capturing stereo image pairs. In another example, the capture device may include two cameras with vertically offset fields of view capable of capturing a vertical stereoscopic image pair. According to any of these examples, each camera may have a field of view spanning up to 360°. In this regard, in one embodiment, the capture device may employ two panoramic cameras with a vertical stereo offset capable of capturing a pair of panoramic images (with vertical stereo offset) forming a stereo pair. With these implementations capable of capturing stereo image pairs, the 3D-from-2D processing module 1406 may be or include the 3D-from-2D processing module 804 or 1304, and the auxiliary data component 806 can use stereo processing (e.g., via the depth estimation component 916) to derive initial depth data for the respective images included in the stereo image pairs. As discussed above with reference to fig. 9 and 13, the initial depth data may be used as input to an enhanced 3D-from-2D model (e.g., selected from the one or more enhancement models 810) to facilitate deriving 3D data for either stereoscopic image included in a pair, used by the 3D data optimization component 1302 to facilitate generating the optimized 3D data 1306, and/or used by the 3D model generation component 118 to facilitate generating a 3D model of the object or environment captured in the respective images.
The devices and/or systems described in fig. 14-25 may include machine-executable components embodied within a machine, such as in one or more computer-readable media associated with one or more machines. Such components, when executed by the one or more machines (e.g., computers, computing devices, virtual machines, etc.), may cause the machine to perform the operations described. In this regard, although not shown, the devices and/or systems described in fig. 14-25 may include or be operatively coupled to at least one memory and at least one processor. The at least one memory may further store computer-executable instructions/components that, when executed by the at least one processor, cause performance of operations defined by the computer-executable instructions/components. Examples of the memory and processor, as well as other computing device hardware and software components that may be included in the described devices and/or systems, are provided with reference to fig. 35. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
Referring to fig. 14, an exemplary user device 1402 that facilitates capturing 2D images and deriving 3D data from the 2D images in accordance with various aspects and embodiments described herein is presented. In this regard, the user device 1402 can include one or more cameras 1404 for capturing 2D images and/or video, and a 3D-from-2D processing module 1406 for deriving 3D data from the 2D images, as discussed above. The user device 1402 may also include the 3D model generation component 118 for generating a reconstructed 3D model based on the 3D data and the 2D image data, and a display/rendering component 1408 that facilitates rendering the reconstructed 3D model at the user device 1402 (e.g., via a device display). For example, the display/rendering component 1408 may include suitable hardware and/or software that facilitates accessing or otherwise receiving 3D models and/or representations of 3D models (e.g., including 3D floor plan models, 2D floor plan models, dollhouse view representations of 3D models, etc.) and displaying them via a display (e.g., display 132) of the user device. In some embodiments, the user device 1402 may be or correspond to the user device 130. For example, the user device 1402 can be or include, but is not limited to: a desktop computer, laptop computer, mobile phone, smart phone, tablet PC, PDA, standalone digital camera, HUD device, virtual reality (VR) headset, AR headset or device, or other type of wearable computing device.
In other embodiments, the user device 1402 may not include the 3D model generation component 118 and/or the display/rendering component 1408. For these embodiments, the user device 1402 may simply be configured to capture 2D images (e.g., 2D image data 102) via the one or more cameras 1404 and derive depth data for the 2D images (e.g., the derived 3D data 116). The user device 1402 may further store the 2D images and their associated derived depth data (e.g., in a memory of the user device 1402), and/or provide the 2D images and their associated derived depth data to another device for use by the other device (e.g., to generate a 3D model or for another use context).
Fig. 15 presents another exemplary user device 1502 that facilitates capturing 2D images and deriving 3D data from the 2D images in accordance with various aspects and embodiments described herein. In this regard, the user device 1502 may include the same or similar features and functions as the user device 1402. The user device 1502 differs from the user device 1402 in the addition of one or more 3D sensors 1504 and a positioning component 1506. In some embodiments, the user device 1502 may include the one or more 3D sensors 1504 but not the positioning component 1506, or vice versa. The user device 1502 may further (optionally) include the navigation component 126 to provide on-board navigation of the 3D model generated by the 3D model generation component 118 (in implementations in which the user device 1502 includes the 3D model generation component 118).
Referring to fig. 9 and 15, the one or more 3D sensors 1504 may include one or more 3D sensors or 3D capture devices configured to capture 3D/depth data in association with the capture of 2D images. For example, the one or more 3D sensors 1504 may be configured to capture one or more of the various types of 3D sensor data 910 discussed with reference to fig. 9. In this regard, the one or more 3D sensors 1504 may include, but are not limited to: structured light sensors/devices, LiDAR sensors/devices, laser rangefinder sensors/devices, time-of-flight sensors/devices, light-field camera sensors/devices, active stereo sensors/devices, and the like. In one embodiment, the one or more cameras 1404 of the user device 1502 may include a camera that produces spherical color image data, and the one or more 3D sensors 1504 may include a structured light sensor (or another 3D sensor) configured to capture depth data for a portion of the spherical color image (e.g., a middle portion of the vertical FOV or otherwise near the equator). With this embodiment, the 3D-from-2D processing module 1406 may be configured to employ a 3D-from-2D neural network model (e.g., the enhancement model 810) trained to accept both the spherical color image data and the partial depth data as input and predict depth for the entire sphere.
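A minimal sketch of how such a combined input could be assembled is shown below (Python/NumPy). The equatorial band extent, channel layout, and validity-mask convention are assumptions for illustration; the actual model interface is not specified here.

```python
# Hypothetical sketch: combine an equirectangular color panorama with partial
# depth covering only a band near the equator, producing a (color + depth + mask)
# input for a model trained to predict depth for the full sphere.
import numpy as np

def build_spherical_input(pano_rgb: np.ndarray, band_depth: np.ndarray,
                          band_top: int, band_bottom: int) -> np.ndarray:
    """pano_rgb: H x W x 3; band_depth: (band_bottom - band_top) x W depths."""
    h, w, _ = pano_rgb.shape
    depth = np.zeros((h, w), dtype=np.float32)
    mask = np.zeros((h, w), dtype=np.float32)
    depth[band_top:band_bottom, :] = band_depth   # sensor depth near the equator
    mask[band_top:band_bottom, :] = 1.0           # 1 where depth is valid
    rgb = pano_rgb.astype(np.float32) / 255.0
    return np.concatenate([rgb, depth[..., None], mask[..., None]], axis=-1)
```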
Similarly, the positioning component 1506 may include hardware and/or software configured to capture the capture device motion data 904 and/or capture device location data 906. For example, in the illustrated embodiment, the positioning component 1506 may include an IMU configured to generate capture device motion data 904 in association with capturing one or more images via the one or more cameras 1404. The positioning component 1506 may also include a GPS unit configured to provide GPS coordinate information in association with image capture by one or more cameras. In some embodiments, the positioning component 1506 may associate motion data and position data of the user device 1502 with respective images captured via the one or more cameras 1404.
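One simple way of keeping such motion and position readings associated with the images they belong to is to tag each frame with the readings closest in time to its capture, as in the illustrative sketch below (field names and data layout are assumptions, not from the disclosure).

```python
# Hypothetical sketch: associate IMU and GPS readings with captured frames by
# nearest timestamp. Field names and sample formats are illustrative.
from dataclasses import dataclass, field

@dataclass
class CapturedFrame:
    image_path: str
    timestamp: float
    imu: dict = field(default_factory=dict)   # e.g., accelerometer/gyro readings
    gps: dict = field(default_factory=dict)   # e.g., {"lat": ..., "lon": ...}

def associate_positioning(frames, imu_samples, gps_samples):
    """imu_samples/gps_samples: lists of (timestamp, reading) tuples."""
    for frame in frames:
        frame.imu = min(imu_samples, key=lambda s: abs(s[0] - frame.timestamp))[1]
        frame.gps = min(gps_samples, key=lambda s: abs(s[0] - frame.timestamp))[1]
    return frames
```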
In various embodiments, the user device 1502 may provide one or more features and functions of system 800 or 1300. In particular, via inclusion of the one or more 3D sensors 1504, the user device 1502 may generate auxiliary data at least in the form of initial 3D depth sensor data associated with the 2D images captured by the one or more cameras 1404. This initial depth data may be used by the 3D-from-2D processing module 1406 and/or the 3D model generation component 118 as described with reference to fig. 8 and 13. The user device 1502 may also capture additional auxiliary data, including the capture device motion data 904 and the capture device location data 906, and provide it to the 3D-from-2D processing module.
Fig. 16 presents an example system 1600 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. The system 1600 includes a capture device 1601 and a user device 1602. According to this embodiment, the separate capture device 1601 may include one or more cameras 1404 to capture 2D images (and/or video) of an object or environment. For example, the capture device 1601 may include a camera having one or more lenses disposed within a housing that is configured to be handheld (e.g., a stand-alone handheld camera, a stand-alone digital camera, a phone or smart phone including one or more cameras, a tablet PC including one or more cameras, etc.), mounted on a tripod, located on or within a robotic device, located on or within a vehicle (including an autonomous vehicle), positioned in a fixed location relative to the environment (e.g., mounted to a wall or fixture), or arranged in another suitable configuration. The capture device 1601 may further provide the captured 2D images to the user device 1602 for further processing by the 3D-from-2D processing module 1406 and/or the 3D model generation component 118 located at the user device 1602. In this regard, the capture device 1601 may include suitable hardware and software to facilitate communication with the user device 1602, and vice versa. In implementations in which the user device 1602 includes the 3D model generation component 118, the user device may also include a display/rendering component 1408 for receiving and displaying the 3D model (and/or a representation of the 3D model) at the user device.
According to this embodiment, the user device 1602 may include a receiving/communicating component 1604 to facilitate communication with the capture device 1601, and to receive the 2D images captured by the capture device (e.g., via one or more cameras). For example, the receiving/communicating component may facilitate wired and/or wireless communication between the user device 1602 and the capture device 1601, as well as between the user device 1602 and one or more additional devices (e.g., server devices, discussed below). For example, the receive/communication component 1604 can be or include various hardware and software devices associated with establishing and/or conducting wireless communications between the user device 1602 and an external device. For example, the receiving/communication component 1604 may control operation of a transmitter-receiver or transceiver (not shown) of the user device to receive information (e.g., 2D image data) from the capture device 1601, provide information to the capture device 1601, and the like. The receive/communication component 1604 may facilitate wireless communication between the user device and an external device (e.g., the capture device 1601 and/or another device) using various wireless telemetry communication protocols. For example, the receive/communication component 1604 may communicate with external devices using communication protocols including, but not limited to: an NFC-based protocol, a BLUETOOTH® technology-based protocol, a ZigBee®-based protocol, a Wi-Fi protocol, an RF-based communication protocol, an IP-based communication protocol, a cellular communication protocol, a UWB technology-based protocol, or other forms of communication (including proprietary and non-proprietary communication protocols).
Fig. 17 presents another example system 1700 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. Similar to system 1600, system 1700 includes a capture device 1701 including one or more cameras 1404 configured to capture 2D images (and/or video); and a user device 1702 comprising a receiving/communication component 1604, a 3D-from-2D processing module 1406, and (optionally) a 3D model generation component 118 and a display/rendering component 1408. In this regard, the system 1700 may provide the same or similar features as the system 1600.
The system 1700 differs from the system 1600 in that one or more 3D sensors 1504 and a positioning component 1506 are added to the capture device 1701. The user device 1702 may also include the navigation component 126 to provide on-board navigation of the 3D model generated by the 3D model generation component 118. According to this embodiment, the capture device 1701 may capture at least some initial depth data (e.g., 3D sensor data 910) for the respective images captured by the one or more cameras 1404. The capture device 1701 may also provide the user device 1702 with the captured 2D images and the initial depth data associated therewith. For example, in one implementation, the one or more cameras 1404 may be configured to capture and/or generate panoramic images of an environment having a relatively wide field of view (e.g., greater than 120°) spanning up to 360° at least in the horizontal direction. The one or more 3D sensors 1504 may also include a 3D sensor configured to capture depth data for a portion of the panoramic image, such that the 3D depth sensor has a smaller field of view of the environment relative to the panoramic 2D image. For these embodiments, the 3D-from-2D processing module 1406 of the user device 1702 may include the additional features and functionality of the system 800 or 1300 related to using auxiliary data to enhance 3D-from-2D prediction. In this regard, the 3D-from-2D processing module 1406 may employ the initial depth data to enhance the 3D-from-2D prediction by using the initial depth data as input to the one or more enhancement models 810 and/or in conjunction with the derived 3D data 116 to generate optimized 3D data 1306. For example, in implementations in which the initial depth data includes partial depth data for the panoramic image, the 3D-from-2D processing module 1406 may use one or more 3D-from-2D predictive models to derive depth data for the remaining portion of the panoramic image for which the initial depth data was not captured. In some implementations, the capture device 1701 may also generate and provide the capture device motion data 904 and/or the capture device location data 906 to the user device 1702 in association with the 2D images.
Fig. 18 presents another exemplary system 1800 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. Similar to system 1600, system 1800 includes a capture device 1801 including one or more cameras 1404 configured to capture 2D images (and/or video); and a user device 1802 configured to communicate with the capture device 1801 (e.g., using the receive/communication component 1604). The system 1800 differs from the system 1600 in that the 3D-from-2D processing module 1406 is located at the capture device 1801, rather than at the user device 1802. According to this embodiment, the capture device 1801 may be configured to capture 2D images (and/or video) of an object or environment, and further derive depth data (e.g., the derived 3D data 116) for one or more images using the 3D-from-2D processing module 1406. The capture device 1801 may further provide the images and their associated derived depth data to the user device for further processing. For example, in the illustrated embodiment, the user device 1802 may include the 3D model generation component 118 to generate one or more 3D models (and/or representations of 3D models) based on the received image data and the derived depth data associated therewith. The user device 1802 may also include the display/rendering component 1408 to render the 3D model and/or a representation of the 3D model at the user device 1802.
Fig. 19 presents another exemplary system 1900 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. Similar to system 1700, system 1900 includes a capture device 1901 including one or more cameras 1404 configured to capture 2D images (and/or video); and a user device 1902 configured to communicate with the capture device 1901 (e.g., using the receiving/communicating component 1604). Also similar to system 1700, the capture device 1901 can include one or more 3D sensors 1504 and a positioning component 1506, and the user device 1902 can include the 3D model generation component 118, the display/rendering component 1408, and the navigation component 126. The system 1900 differs from the system 1700 in that the 3D-from-2D processing module 1406 is located at the capture device 1901, rather than at the user device 1902.
According to this embodiment, the capture device 1901 may be configured to capture 2D images (and/or video) of an object or environment, as well as auxiliary data, including 3D sensor data 910, capture device motion data 904, and/or capture device location data 906. The capture device 1901 may further derive depth data (e.g., derived 3D data 116) for one or more captured 2D images using the 3D-from-2D processing module 1406, where the 3D-from-2D processing module corresponds to the 3D-from-2D processing module 804 or 1304 (e.g., and is configured to use the auxiliary data with the 2D images to facilitate depth data derivation/optimization). The capture device 1901 may further provide the images and their associated derived depth data to the user device 1902 for further processing and use by the navigation component 126. In some embodiments, the capture device 1901 may also provide the auxiliary data to the user device 1902 in order to facilitate the alignment process associated with generating a 3D model by the 3D model generation component 118 based on the image data and its associated derived depth data.
Fig. 20 presents another exemplary system 2000 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. Unlike the previous systems 1600, 1700, 1800, and 1900, which distribute various components between the capture device and the user device, system 2000 distributes components between the user device 2002 and the server device 2003. In the illustrated embodiment, the user device 2002 may include one or more cameras 1404, a 3D-from-2D processing module 1406, a display/rendering component 1408, and a receiving/communication component 1604. The server device 2003 may include the 3D model generation component 118 and the navigation component 126.
According to this embodiment, the user device 2002 may operate as a capture device and use the one or more cameras 1404 to capture at least 2D images (e.g., 2D image data 102) of an object or environment. For example, the user device 2002 may include a tablet PC, a smart phone, a stand-alone digital camera, a HUD device, an AR device, or the like, having a single camera, a single camera with two lenses that may capture a stereoscopic image pair, a single camera with two lenses that may capture a 2D image with a wide field of view, two or more cameras, and so on. The user device 2002 may also include a device capable of capturing and/or generating (e.g., via the stitching component 508 of the 3D-from-2D processing module 1406) panoramic images (e.g., images having a field of view greater than a minimum threshold and up to 360°). The user device 2002 may further execute the 3D-from-2D processing module 1406 to derive 3D/depth data for the respective images captured via the one or more cameras 1404 according to one or more of the various techniques described with reference to the 3D-from-2D processing module 106, the 3D-from-2D processing module 504, the 3D-from-2D processing module 804, and the 3D-from-2D processing module 1304.
The user device 2002 and the server device 2003 may be configured to operate in a server-client relationship, wherein the server device 2003 provides services and information to the user device 2002, including various 3D modeling services provided by the 3D model generation component 118, and navigation services provided by the navigation component 126 that facilitate navigation of the 3D model displayed at the user device 2002. The respective devices may communicate with each other via one or more wireless communication networks (e.g., cellular network, internet, etc.). For example, in the illustrated embodiment, the server device 2003 may also include a receiving/communication component 2004, which may include suitable hardware and/or software to facilitate wireless communication with the user device 2002. In this regard, the receive/communication component 2004 may include the same or similar features and functions as the receive/communication component 1604. In some implementations, the server device 2003 can operate as a Web server, an application server, a cloud-based server, or the like to provide 3D modeling and navigation services to the user device 2002 via a website, a Web application, a thin client application, a hybrid application, or another suitable network accessible platform.
In one or more implementations, the user device 2002 may be configured to capture 2D images via the one or more cameras 1404, derive depth data for the 2D images, and provide (e.g., transmit, send, transfer, etc.) the captured 2D images and their associated derived depth data to the server device 2003 for further processing by the 3D model generation component 118 and/or the navigation component 126. For example, using the 3D model generation component 118, the server device 2003 may generate a 3D model of the object or environment included in the received 2D images according to the techniques described herein with reference to fig. 1. The server device 2003 may further provide (e.g., transmit, send, stream, etc.) the 3D model (or a 2D model, such as a 2D floor plan model) to the user device 2002 for rendering via a display at the user device 2002 (e.g., using the display/rendering component 1408).
In some embodiments, the server device 2003 may generate and provide to the user device 2002 one or more intermediate versions of the 3D model based on the image data and associated depth data received so far during the scanning process. These intermediate versions may include 3D reconstructions, 3D images, or 3D models. For example, during a scanning process in which the user device is positioned at different locations and/or orientations relative to the environment in order to capture different images from different perspectives of the environment, the receiving/communicating component 1604 may be configured to send the respective images and associated derived 3D data to the server device 2003 as they are captured (and processed by the 3D-from-2D processing module 1406 to derive the 3D data). In this regard, as described with reference to system 100 and illustrated with reference to the 3D model 200 shown in fig. 2, the display/rendering component 1408 may receive and display an intermediate version of the 3D model to facilitate guiding a user during the capture process in determining where to position the camera to capture additional image data that the user wishes to reflect in the final version of the 3D model. For example, based on viewing an intermediate 3D reconstruction generated from the 2D image data captured so far, an entity (e.g., a user or computing device) controlling the capture process may determine which portions or regions of the object or environment have not yet been captured and are excluded from the intermediate version. The entity may also identify areas of the object or environment associated with poor image data or misaligned image data. The entity may then position the one or more cameras 1404 to capture additional 2D images of the object or environment based on the missing or misaligned data. In some implementations, when the entity controlling the capture process is satisfied with the last presented intermediate 3D reconstruction or otherwise determines that the collection of 2D images captured in association with the scan is complete, the user device 2002 may send a confirmation message to the server device 2003 confirming that the scan is complete. Based on receipt of the confirmation message, the server device 2003 may generate a final version of the 3D model based on the complete set of 2D images (and associated 3D data).
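The overall capture loop described above can be sketched as follows (Python). The camera, depth module, server, and display objects are stand-ins with assumed interfaces used only to illustrate the flow of data between the user device and the server; they are not APIs from the disclosure.

```python
# Hypothetical sketch of the scanning loop on the user device: each captured image
# and its derived depth are sent to the server as they become available, the
# intermediate 3D reconstruction returned by the server is rendered as a preview,
# and a completion confirmation triggers generation of the final model.
def run_scan(camera, depth_module, server, display, scan_is_complete):
    while not scan_is_complete():
        image = camera.capture()                   # 2D image of the scene
        depth = depth_module.derive_depth(image)   # 3D-from-2D prediction
        preview = server.add_capture(image, depth) # returns intermediate model
        display.render(preview)                    # guide where to scan next

    server.confirm_scan_complete()                 # server builds the final model
    display.render(server.fetch_final_model())
```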
Additionally, in some embodiments, after generating (or partially generating) the 3D model, the server device 2003 may use the features and functions of the navigation component 126 to facilitate navigation of the 3D model displayed at the user device. In various implementations, the intermediate 3D reconstructions discussed herein may represent "draft" versions of the final navigable 3D reconstruction. For example, an intermediate version may have a lower image quality relative to the final version and/or may be generated using a less accurate alignment process relative to the final version. In some implementations, the intermediate version may include a static 3D reconstruction that cannot be navigated, unlike the final 3D reconstruction.
Fig. 21 presents another example system 2100 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. The system 2100 may include the same or similar features and functions as the system 2000, except for the location of the 3D-from-2D processing module 1406. In this regard, the 3D-from-2D processing module 1406 may be disposed on the server device 2103 instead of the user device 2102. According to system 2100, the user device 2102 may include one or more cameras 1404 configured to capture 2D images (and/or video). The user device 2102 may further provide the captured 2D images to the server device 2103 for further processing by the server device 2103 using the 3D-from-2D processing module 1406, the 3D model generation component 118, and/or the navigation component 126. In this regard, intermediate versions may be generated and rendered with relatively little processing time, thereby enabling a real-time (or substantially real-time) 3D reconstruction process that provides continuously updated, coarse 3D versions of the scene during the capture process.
According to embodiments in which the 3D-from-2D processing module 1406 is provided at a server device (e.g., the server device 2103 or another server device described herein), the server device 2103 may also, in a similar manner to the techniques discussed above, generate and provide to the user device 2102 intermediate 3D reconstructions of the object or environment included in the received 2D images (e.g., captured in association with a scan). However, unlike the technique described with reference to fig. 20, the server device 2103, rather than the user device 2102, may derive the depth data for the received 2D images. For example, the user device 2102 may capture 2D images of an object or environment using the one or more cameras 1404 and send the 2D images (e.g., using the receiving/communication component 1604) to the server device 2103. Based on receipt of the 2D images, the server device 2103 may employ the 3D-from-2D processing module 1406 to derive 3D data for the 2D images and use the 2D images and the 3D data to generate an intermediate 3D reconstruction of the object or environment. The server device 2103 may also send the intermediate 3D reconstruction to the user device 2102 for rendering at the user device 2102 as a preview to facilitate the capture process.
Once the user device 2102 informs the server device 2103 that the scan is complete (e.g., using a completion acknowledgement message or the like), the server device 2103 may further perform additional (and in some implementations more complex) processing techniques to generate a final 3D model of the environment. In some implementations, the additional processing may include using additional depth derivation and/or depth data optimization techniques (e.g., provided by the panorama component 506, the auxiliary data component 806, and/or the 3D data optimization component 1302) in order to generate more accurate depth data for the 2D images for use by the 3D model generation component 118. For example, in one exemplary implementation, the server device 2103 may employ a first 3D-from-2D neural network model (e.g., the standard model 114) to derive first depth data for the received 2D images and use this first depth data to generate one or more intermediate 3D reconstructions. Upon receiving the complete set of 2D image data, the server device 2103 may then use the techniques provided by the panorama component 506, the auxiliary data component 806, and/or the 3D data optimization component 1302 in order to derive more accurate depth data for the 2D images in the complete set. The server device 2103 may further employ this more accurate depth data in order to generate a final 3D model of the object or environment using the 3D model generation component 118.
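The two-pass flow described above can be illustrated with the short sketch below; the standard model and the enhanced pipeline are stand-in callables, and the split into exactly two passes is an assumption made for clarity.

```python
# Hypothetical sketch of the server-side two-pass flow: a fast standard model
# gives quick per-image depth for intermediate previews, and a slower, more
# accurate enhanced pipeline recomputes depth once the full image set arrives.
def server_side_depth(images, standard_model, enhanced_pipeline):
    # Pass 1: fast depth used only for intermediate reconstructions.
    preview_depth = [standard_model(img) for img in images]

    # Pass 2: after scan completion, refine depth for the whole set using the
    # more accurate (e.g., auxiliary-data-aware, optimized) processing.
    final_depth = enhanced_pipeline(images)
    return preview_depth, final_depth
```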
Fig. 22 presents another exemplary system 2200 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. The system 2200 may include the same or similar features and functions as the system 2000, with the addition of one or more 3D sensors 1504 and a positioning component 1506 to the user device 2202. According to system 2200, the user device 2202 can capture auxiliary data, including 3D sensor data 910, capture device motion data 904, and/or capture device location data 906 associated with one or more 2D images captured via the one or more cameras. In some implementations, the 3D-from-2D processing module 1406 may be configured to employ the auxiliary data to facilitate generating the derived 3D data 116 and/or the optimized 3D data 1306 of a 2D image, in accordance with the features and functions of the systems 800 and 1300. The user device 2202 can also determine, associate with the respective images, and/or employ other types of auxiliary data (e.g., camera/image parameters 908) discussed herein, thereby facilitating generation of the derived 3D data 116 by the 3D-from-2D processing module 1406 according to the techniques described with reference to the auxiliary data component 806 and the 3D-from-2D processing module 804. The user device 2202 may further provide the 2D images and their associated depth data (e.g., the derived 3D data 116 or the optimized 3D data 1306) to the server device 2003. In some implementations, the user device 2202 may also provide the auxiliary data to the server device 2003 to facilitate 3D model generation by the 3D model generation component 118 and/or navigation by the navigation component 126. In other implementations, rather than using the auxiliary data to facilitate 3D-from-2D depth derivation by the 3D-from-2D processing module 1406, the user device may alternatively provide the auxiliary data to the server device 2003 for use by the 3D model generation component 118 and/or the navigation component 126.
Fig. 23 presents another example system 2300 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. The system 2300 may include the same or similar features and functions as the system 2200, except for the location of the 3D-from-2D processing module. According to the system 2300, the server device 2103 may include the 3D-from-2D processing module 1406 (e.g., in the same or similar manner as described with reference to the system 2100). The user device 2302 may include one or more 3D sensors 1504, one or more cameras 1404, a positioning component 1506, and a receiving/communication component 1604. According to this embodiment, the user device 2302 may capture 2D images and associated auxiliary data and further send the images and the auxiliary data associated therewith to the server device for further processing by the 3D-from-2D processing module 1406, the 3D model generation component 118, and/or the navigation component 126.
Fig. 24 presents another example system 2400 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. System 2400 can include the same or similar features as the previous systems disclosed herein. However, system 2400 distributes the various components of the previous systems disclosed herein among a capture device 2401, a user device 2402, and the server device 2003 (previously described with reference to fig. 20). In system 2400, the capture device 2401 can include one or more cameras 1404 for capturing 2D image data. For example, in one implementation, the capture device 2401 may be moved to different positions and orientations relative to an object or environment to capture different images of the object or environment from different perspectives. In some implementations, the different images may include one or more panoramic images (e.g., having a horizontal field of view of up to 360°, or between 120° and 360°) generated using one or more techniques described herein. The capture device 2401 may also provide the captured images to the user device 2402, wherein upon receipt of the images, the user device 2402 may employ the 3D-from-2D processing module 1406 to derive 3D/depth data for the respective images using the various techniques described herein. According to this embodiment, the user device 2402 may further (optionally) provide the 2D images and the 3D/depth data associated therewith to the server device 2003 for further processing by the 3D model generation component 118 to generate a 3D model of the object or environment (e.g., by aligning the derived depth data associated with the 2D images with one another). In some implementations, as discussed with reference to fig. 20 and 21, the server device 2003 may further provide one or more intermediate versions of the 3D model to the user device for rendering at the user device 2402 (e.g., using the display/rendering component 1408). These intermediate versions of the 3D model may provide a preview of the reconstructed spatial alignment in order to help guide the entity operating the capture device 2401 through the capture process (e.g., knowing where to place the camera to obtain additional images). In this regard, once the user has captured as much image data of the object or environment as desired, the 3D model generation component 118 may further optimize the alignment to create a refined 3D reconstruction of the environment. The final 3D reconstruction may be provided to the user device for viewing and navigation as an interactive space (as facilitated by the navigation component 126). In various implementations, the intermediate versions may represent "draft" versions of the final 3D reconstruction. For example, an intermediate version may have a lower image quality relative to the final version and/or may be generated using a less accurate alignment process relative to the final version. In some implementations, the intermediate version may include a static 3D reconstruction that cannot be navigated, unlike the final 3D reconstruction.
Fig. 25 presents another example system 2500 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, generating a reconstructed 3D model based on the 3D data and the 2D image data, and navigating the reconstructed 3D model in accordance with various aspects and embodiments described herein. The system 2500 may include the same or similar features and functions as the system 2400, except that the 3D-from-2D processing module 1406 is located at the server device 2103 (previously described with reference to fig. 21) and one or more 3D sensors 1504 and a positioning component 1506 are added to the capture device 1701. In this regard, the user device 2502 may include only the receive/communication component 1604 to facilitate relaying information between the capture device 1701 and the server device 2103. For example, the user device 2502 may be configured to receive 2D images and/or associated native assistance data from the capture device 1701 and send the 2D images and/or associated native assistance data to the server device for processing by the 3D-from-2D processing module 1406 and (optionally) the 3D model generation component 118. The server device 2103 may also provide the user device 2502 with the 3D model and/or a representation of the 3D model generated based on the 2D images and/or the assistance data.
In another implementation of this embodiment, the server device 2103 may provide a cloud-based, web-based, thin client application-based, or similar service, wherein a user may select images that are already stored at the user device 2502 and upload them to the server device 2103. The server device 2103 may then automatically align the images in 3D and create a 3D reconstruction using the 3D-from-2D techniques described herein.
Fig. 26 presents an example computer-implemented method 2600 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 2602, a device (e.g., one or more of the capture devices described with reference to fig. 14-25) operatively coupled to a processor captures a 2D image of an object or environment (e.g., using the one or more cameras 1404). At 2604, the device derives 3D data for the 2D image using one or more 3D-from-2D neural network models (e.g., using the 3D-from-2D processing module 1406).
Fig. 27 presents another example computer-implemented method 2700 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 2702, a device (e.g., one or more of the capture devices, user devices, or server devices described with reference to fig. 14-25) operatively coupled to a processor receives or captures a 2D image of an object or environment. At 2704, the device derives 3D data for the 2D image using one or more 3D-from-2D neural network models (e.g., using the 3D-from-2D processing module 1406). At 2706, the device aligns the 2D image based on the 3D data to generate a 3D model of the object or environment, or the device transmits the 2D image and the 3D data to an external device (e.g., one or more of the server devices described with reference to fig. 20-25) via a network, wherein the external device generates the 3D model of the object or environment based on the transmission.
Fig. 28 presents another example computer-implemented method 2800 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 2802, a device (e.g., one or more of the user devices or server devices described with reference to fig. 14-25) operatively coupled to a processor receives 2D images of an object or environment captured from different perspectives of the object or environment, wherein the device also receives derived depth data for respective images of the 2D images, derived using one or more 3D-from-2D neural network models (e.g., using the 3D data derivation component 110). At 2804, the device aligns the 2D images with one another based on the depth data to generate a 3D model of the object or environment (e.g., via the 3D model generation component 118).
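For a concrete sense of what aligning images via their depth data can involve, the sketch below back-projects each depth map into a point cloud with a pinhole model and moves it into a common frame using a known camera pose; the merged cloud stands in for the 3D model. The intrinsics and poses are assumed inputs for illustration and are not values from the disclosure.

```python
# Hypothetical sketch of the alignment step: depth maps are back-projected to
# point clouds and transformed into a common world frame using camera poses.
import numpy as np

def backproject(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def align_images(depth_maps, poses, intrinsics):
    """poses: list of 4x4 camera-to-world matrices; intrinsics: (fx, fy, cx, cy)."""
    fx, fy, cx, cy = intrinsics
    clouds = []
    for depth, pose in zip(depth_maps, poses):
        pts = backproject(depth, fx, fy, cx, cy)
        pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
        clouds.append((pose @ pts_h.T).T[:, :3])   # into the common world frame
    return np.concatenate(clouds, axis=0)
```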
Fig. 29 presents another example computer-implemented method 2900 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 2902, a device (e.g., user device 2102, user device 2302, user device 2502, etc.) including a processor captures a 2D image of an object or environment (e.g., using the one or more cameras 1404). At 2904, the device sends the 2D image to a server device (e.g., server device 2103), wherein based on receipt of the 2D image, the server device derives 3D data for the 2D image using one or more 3D-from-2D neural network models (e.g., using the 3D-from-2D processing module 1406), and generates a 3D reconstruction of the object or environment using the 2D image and the 3D data (e.g., using the 3D model generation component 118). At 2906, the device further receives the 3D reconstruction from the server device, and at 2908, the device renders the 3D reconstruction via a display of the device.
In one or more embodiments, the device may capture 2D images from different perspectives of the object or environment in association with an image scan of the object or environment. For these embodiments, the device may further send a confirmation message to the server device confirming that the image scan is complete. In this regard, the 3D reconstruction may include a first or initial 3D reconstruction, and based on receipt of the confirmation message, the server device may generate a second (or final) 3D reconstruction of the object or environment. For example, in some implementations, the second 3D reconstruction has a higher level of image quality relative to the first 3D reconstruction. In another exemplary implementation, the second 3D reconstruction includes a navigable model of the environment, whereas the first 3D reconstruction is not navigable. In another exemplary implementation, the second 3D reconstruction is generated using a more accurate alignment process than the alignment process used to generate the first 3D reconstruction.
Fig. 30 presents an example system 3000 that facilitates using one or more 3D-from-2D techniques in association with an augmented reality (AR) application in accordance with various aspects and embodiments described herein. The system 3000 includes at least some features (e.g., one or more cameras 1404, the 3D-from-2D processing module 1406, the receiving/communication component 1604, and the display/rendering component 1408) that are the same or similar to previous systems disclosed herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
In the illustrated embodiment, the system 3000 includes a user device 3002 having one or more cameras 1404 configured to capture 2D image data (e.g., including panoramic images and video) of an object or environment, and a 3D-from-2D processing module 1406 configured to derive depth data for one or more 2D images included in the 2D image data. As described above, the 3D-from-2D processing module 1406 may be or correspond to the 3D-from-2D processing module 106, the 3D-from-2D processing module 504, the 3D-from-2D processing module 804, or the 3D-from-2D processing module 1304. Although not shown, in some embodiments, the user device may further include one or more 3D sensors 1504, a positioning component 1506, and/or one or more additional hardware and/or software components that facilitate generating native auxiliary data 802 to facilitate deriving depth data for a captured image according to the various techniques described herein with reference to fig. 8, 9, and 13. The user device also includes an AR component 3004, a receiving/communication component 1604, and a display/rendering component 1408. The user device 3002 may include various types of computing devices, including one or more cameras on or within a housing configured to capture 2D image data of an environment, and a display/rendering component 1408 including hardware and/or software that facilitates rendering digital objects, as holograms or the like, on or within a representation of the environment via a display of the user device 3002. For example, in some embodiments, the user device 3002 may include an AR headset configured to be worn by a user, with a display (e.g., a transparent glass display) positioned in front of the user's eyes (e.g., glasses, goggles, HUDs, etc.). In another embodiment, the user device may be or include a mobile handheld device, such as a mobile phone or smartphone, tablet PC, or similar device. In still other embodiments, the user device 3002 may comprise a device (such as a laptop PC, desktop PC, etc.) that may be positioned in a relatively fixed location with respect to the environment.
The user device 3002 may include or be operatively coupled to at least one memory 3020 and at least one processor 3024. The at least one memory 3020 may further store computer-executable instructions (e.g., one or more software elements of the 3D-from-2D processing module 1406, the AR component 3004, the receiving/communication component 1604, and/or the display/rendering component 1408) that, when executed by the at least one processor 3024, cause performance of the operations defined by the computer-executable instructions. In some embodiments, the memory 3020 may also store information received, generated, and/or employed by the user device. For example, in the illustrated embodiment, the memory 3020 may store one or more AR data objects 3022 that may be used by the AR component 3004. The memory 3020 may also store information including, but not limited to, captured image data and depth information derived for the captured image data, received 2D image data 102, derived 3D data 116, and 3D model and alignment data 128. The user device 3002 may also include a device bus 120 that communicatively couples the various components of the user device. Examples of the processor 3024 and the memory 3020, as well as other suitable computer or computing-based elements that may be used in connection with implementing one or more of the systems or components shown and described in connection with fig. 30 or other figures disclosed herein, may be found with reference to fig. 35.
The system 3000 may also include a server device 3003. The server device 3003 may provide information and/or services to the user device 3002 that facilitate one or more features and functions of the AR component 3004. In this regard, the AR component 3004 can be or correspond to an AR application that provides one or more AR features and functions related to integrating virtual digital data objects on or within a real-time view of an environment. For example, in embodiments in which the user device 3002 comprises a wearable device configured to be worn by a user and includes a transparent display (e.g., glasses, goggles, or other forms of eyewear) that is positioned in front of the user's eyes when worn, the real-time view of the environment may include the actual view of the environment currently being viewed through the transparent display. With this embodiment, the digital data object may be rendered on the glass display with an appearance and position that causes the digital data object to be aligned with the real-time view of the environment. In another example, the user device may include a tablet PC, smartphone, or the like having a display configured to render real-time image data (e.g., video) of an environment captured via a forward-facing camera of the device. According to this exemplary embodiment, the digital data objects may be rendered as overlay data onto the real-time image data (e.g., snapshots and/or video) rendered on the device display.
The types of digital data objects that can be integrated on or within a real-time view of the environment may vary and are referred to herein as AR data objects (e.g., AR data objects 3022). For example, an AR data object 3022 may include a 3D or 2D graphical image or data model of an object or person. In another example, an AR data object 3022 may include icons, text, symbolic markers, hyperlinks, etc. that may be visually displayed and interacted with. In another example, an AR data object 3022 may include a data object that is not visually displayed (or not initially visually displayed) but that may be interacted with and/or activated in response to a trigger (e.g., user pointing, viewing along the user's line of sight, a gesture, etc.). For example, in one embodiment involving viewing or pointing to an actual object (e.g., a building) that appears in the environment, auxiliary data associated with the building may be rendered, such as a text overlay identifying the building, video data, sound data, graphical image data corresponding to objects or things that appear from an open window of the building, and so forth. In this regard, the AR data objects 3022 may include various types of auxiliary data sets. For example, the AR data objects 3022 may include markers or tags identifying objects or locations captured in image data (e.g., real-time video and/or snapshots) by the one or more cameras 1404. These markers may be created manually or automatically (via image or object recognition algorithms) during the current or a previous capture of the environment, or previously generated and associated with known objects or locations of the environment being viewed. In another example, an AR data object 3022 may include an image or 3D object having a predefined association with one or more actual objects, locations, or things included in the current environment. In yet another example, an AR data object 3022 may include a video data object, an audio data object, a hyperlink, and the like.
The AR component 3004 may employ 3D/depth data derived by the 3D-from-2D processing module 1406 from real-time 2D image data (e.g., snapshots or video frames) of an object or environment captured via the one or more cameras 1404 to facilitate various AR applications. In particular, the AR component 3004 can employ the 3D-from-2D techniques described herein to enhance various AR applications with more accurate and photorealistic integration of AR data objects as overlays onto a real-time view of an environment. In this regard, according to various embodiments, the one or more cameras 1404 may capture real-time image data of an environment that corresponds to the current perspective of the environment viewed on or through a display of the user device 3002. The 3D-from-2D processing module 1406 may further derive depth data from the image data in real-time or substantially real-time. For example, in implementations in which a user walks through an open house for a potential purchase while wearing or holding the user device 3002 such that at least one of the one or more cameras 1404 of the user device 3002 captures image data corresponding to the user's current perspective, the 3D-from-2D processing module 1406 may derive depth data from the image data that corresponds to the actual 3D position (e.g., depth/distance) of the user relative to the physical structures of the house (e.g., walls, ceilings, counters, appliances, openings, doors, windows, etc.). The AR component 3004 can use the depth data to facilitate integration of one or more AR data objects on or within the real-time view of the environment.
In the illustrated embodiment, the AR component 3004 may include a spatial alignment component 3006, an integration component 3008, an occlusion mapping component 3010, an AR data object interaction component 3012, an AR data object generation component 3014, and a 3D model localization component 3016.
The spatial alignment component 3006 may be configured to determine, based on the derived depth/3D data, a location for integrating an AR data object on or within a representation of an object or environment corresponding to the current perspective of the object or environment viewed by the user. The integration component 3008 may integrate the AR data object on or within the representation of the object or environment at that location. For example, the integration component 3008 may render an auxiliary data object on a display having a size, shape, orientation, and position that aligns the auxiliary data object with the real-time view of the environment at the determined location. In this regard, if the display is a transparent display, the integration component 3008 may render the AR data object on the glass of the transparent display at a location on the display, and with a size, shape, and/or orientation, that aligns the AR data object with the determined location in the environment. The integration component 3008 may also determine a suitable position, size, shape, and/or orientation for the AR data object based on the relative positions of the user's eyes and the display and the type of AR data object. In other implementations in which the representation of the environment includes image data captured from the environment and rendered on a display, the integration component 3008 may render the AR data object as an overlay on the image data having a size, shape, and/or orientation that aligns the AR data object with the determined location in the environment.
For example, based on depth data indicating a relative 3D position of a user with respect to an actual object, thing, person, etc. included in an environment (such as a wall, appliance, window, etc.), the spatial alignment component 3006 may determine a position for integrating the AR data object that spatially aligns the AR data object with the wall, appliance, window, etc. For example, in one implementation, based on the known relative position of the user to the actual wall, appliance, window, etc. (as determined based on the derived depth data), the spatial alignment component 3006 may determine the assumed 3D position and orientation of the AR data object relative to the actual wall, appliance, window, etc. The integration component 3008 may further use this hypothetical 3D position and orientation of the AR data object to determine a position for overlaying the data object onto the display or a real-time representation of the environment viewed through the display that spatially aligns the data object at the hypothetical 3D position with the appropriate scale of size and shape (e.g., based on what the data object is).
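A minimal sketch of this spatial alignment computation follows: given the derived depth map and assumed pinhole camera intrinsics (the numeric values below are illustrative), a selected pixel is back-projected into camera coordinates to obtain the assumed 3D position at which an AR data object can be anchored, and that 3D point can be projected back to the display for overlay.

```python
# Minimal sketch of the spatial-alignment step: back-project a chosen pixel of
# the derived depth map into camera coordinates to obtain the assumed 3D
# position for an AR data object. Intrinsics and depth values are illustrative.
import numpy as np

def backproject_pixel(u: int, v: int, depth: np.ndarray,
                      fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Return the 3D point (in the camera frame) seen at pixel (u, v)."""
    z = float(depth[v, u])                     # derived depth in meters
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def project_point(p: np.ndarray, fx, fy, cx, cy) -> tuple:
    """Project a camera-frame 3D point back to pixel coordinates for overlay."""
    u = fx * p[0] / p[2] + cx
    v = fy * p[1] / p[2] + cy
    return u, v

# Example: anchor an AR marker where the user tapped, then scale it with depth.
depth_map = np.full((480, 640), 2.5)           # stand-in for derived depth data
anchor = backproject_pixel(320, 240, depth_map, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
pixels_per_meter = 525.0 / anchor[2]           # apparent size falls off with depth
```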
The occlusion mapping component 3010 can facilitate accurate integration of AR data objects into a real-time view of an environment by accounting, based on the derived 3D/depth data, for the relative positions of objects in the environment with respect to each other and to the user's current viewpoint. In this regard, the occlusion mapping component 3010 may be configured to determine the relative position of an AR data object with respect to another object included in the real-time representation of the object or environment viewed on or through the display, based on the user's current perspective and the derived 3D/depth data. For example, the occlusion mapping component 3010 may ensure that if an AR data object is placed in front of an actual object that appears in the environment, the portion of the AR data object that is in front of the actual object occludes that actual object. Likewise, if the AR data object is placed behind an actual object that appears in the environment, the portion of the AR data object that is located behind the actual object is occluded by that actual object. Thus, relative to the user's current position and viewpoint with respect to the respective objects, things, etc., the occlusion mapping component 3010 can employ the derived 3D/depth data of the respective objects, things, etc. in the environment to ensure correct occlusion mapping of virtual objects relative to actual objects (e.g., drawing a virtual object as occluded by actual objects that are closer to the viewer than it is).
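The per-pixel depth comparison underlying such occlusion mapping can be sketched as follows; the array shapes and the toy scene values are illustrative assumptions.

```python
# Sketch of the occlusion test described above: a virtual object's rendered
# depth is compared per pixel against the depth derived from the live image,
# and the virtual pixel is drawn only where it is closer than the real scene.
import numpy as np

def composite_with_occlusion(frame: np.ndarray, scene_depth: np.ndarray,
                             ar_color: np.ndarray, ar_depth: np.ndarray) -> np.ndarray:
    """frame/ar_color: HxWx3 images; scene_depth/ar_depth: HxW depth maps.
    ar_depth is np.inf wherever the virtual object does not cover a pixel."""
    out = frame.copy()
    visible = ar_depth < scene_depth          # virtual surface is nearer -> visible
    out[visible] = ar_color[visible]          # otherwise the real object occludes it
    return out

# Toy example: a virtual quad at 1.5 m in front of a wall derived at 2.0 m.
h, w = 480, 640
frame = np.zeros((h, w, 3), dtype=np.uint8)
scene_depth = np.full((h, w), 2.0)
ar_color = np.zeros((h, w, 3), dtype=np.uint8); ar_color[200:280, 280:360] = 255
ar_depth = np.full((h, w), np.inf); ar_depth[200:280, 280:360] = 1.5
composited = composite_with_occlusion(frame, scene_depth, ar_color, ar_depth)
```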
The AR data object interaction component 3012 can employ the derived 3D/depth data of the environment, determined from the viewer's current position and perspective of the environment, to facilitate user interaction with virtual AR data objects that have been spatially integrated with the environment by the spatial alignment component 3006 and the integration component 3008. In this regard, the AR data object interaction component 3012 may employ the derived 3D/depth data directly so that virtual objects interact with the environment in a more realistic manner or are constrained by the environment.
The AR data object generation component 3014 may provide for generating 3D virtual data objects for use by the AR component 3004. For example, in one or more embodiments, the AR data object generation component 3014 may be configured to extract object image data of an object included in a 2D image. For example, using the features and functions of the cropping component 510 discussed below, and given substantially any 2D image including an object that may be segmented from the image, the AR data object generation component 3014 may crop, segment, or otherwise extract a 2D representation of the object from the image. The AR data object generation component 3014 may further employ the 3D data (i.e., object image data) derived by the 3D-from-2D processing module 1406 for, and associated with, the extracted 2D object to generate a 3D representation or model of the object. In various embodiments, the spatial alignment component 3006 may be further configured to determine, based on the 3D data of the object, a location for integrating the 3D representation or model of the object (i.e., the object image data) on or within the real-time representation of the object or environment.
In some embodiments, a real-time environment that is viewed and/or interacted with by a user using AR (e.g., using the features and functionality of the user device 3002) may be associated with a previously generated 3D model of the environment. The previously generated 3D model of the environment may also include or otherwise be associated with information identifying defined positions and/or orientations of AR data objects relative to the 3D model. For example, the 3D model generated by the 3D model generation component 118 can be associated with markers at various defined locations relative to the 3D model that identify objects (e.g., appliances, furniture, walls, buildings, etc.), provide information about the objects, provide hyperlinks to applications associated with the objects, and so forth. Other types of AR data objects that may be associated with a previously generated 3D model of an object or environment may include, but are not limited to:
A tag or label identifying a captured object or location; these markers may be created manually or automatically (via image or object recognition algorithms) during the current or a previous capture of the environment, or via a user manipulating an external tool that has the captured 3D data.
Images or 3D objects added at specific locations relative to a previous 3D capture of the same environment; for example, an interior decorator or other user may capture a 3D environment, import the 3D environment into a 3D design program, make changes and additions to the 3D environment, and then use the 3D reconstruction system to see how those changes and additions would appear in the environment.
Previously captured 3D data from the same object or environment; in this case, the difference between the previous 3D data and the current 3D data may be highlighted.
A 3D CAD model of the captured object or environment; in this case, differences between the CAD model and the current 3D data may be highlighted, which is useful for finding defects in manufacturing or construction or incorrectly installed items.
Data captured by additional sensors during the current or previous 3D capture process.
AR data objects (e.g., the markers and other examples described above) that have previously been associated with a defined location relative to a 3D model of an object or environment are referred to herein as aligned AR data objects. In the illustrated embodiment, such previously generated 3D models of the environment and associated aligned AR data objects may be stored in a network-accessible 3D spatial model database 3026 as 3D spatial models 3028 and aligned AR data objects 3030, respectively. In the illustrated embodiment, the 3D spatial model database 3026 may be provided by the server device 3003 and accessed by the AR component 3004 via one or more networks (e.g., using the receiving/communication component 1604). In other embodiments, the 3D spatial model database 3026 and/or some of the information provided by the 3D spatial model database 3026 may be stored locally at the user device 3002.
According to these embodiments, the 3D model localization component 3016 may provide for using a previously generated 3D model of the environment and the aligned AR data objects (e.g., the markers and other AR data objects discussed herein) to facilitate integrating the aligned AR data objects 3030 with a real-time view of the environment. In particular, the 3D model localization component 3016 may employ the derived 3D/depth data, determined for the current perspective of the environment from the current position and orientation of the user device 3002, to "localize" the user device relative to the 3D model. In this regard, based on the derived 3D data indicating the position of the user device relative to corresponding objects in the environment, the 3D model localization component 3016 may determine the relative position and orientation of the user with respect to the 3D model (as if the user were actually standing in the 3D model). The 3D model localization component 3016 may further identify AR data objects associated with defined locations in the 3D model that are within the user's current perspective relative to the 3D model and the actual environment. The 3D model localization component 3016 may also determine how to spatially align the AR data objects with the real-time/live view of the environment based on how the auxiliary data objects are aligned with the 3D model and the relative position of the user to the 3D model.
For example, assume a scenario in which a 3D spatial model of a house or building was previously generated, and various objects included in the 3D spatial model are associated with markers, such as a marker associated with an electrical panel that indicates the respective functions of the different circuits on the electrical panel. Now imagine that a user operating the user device 3002 is looking at the house on site and in person and has a current view of the real electrical panel (e.g., looking through a transparent display). The AR component 3004 can provide the marking data as an overlay aligned with the actual electrical panel when viewed in real time through the transparent display. In order to accurately align the marking data with the electrical panel, the user device 3002 needs to localize itself relative to the previously generated 3D model. The 3D model localization component 3016 may perform this localization using derived 3D/depth data determined from real-time images of the environment corresponding to the real-time perspective of the electrical panel. For example, the 3D model localization component 3016 may use the derived depth information corresponding to the actual position/orientation of the user relative to the actual electrical panel to determine the relative position/orientation of the user with respect to the electrical panel in the 3D model. Using the relative position/orientation and the actual position/orientation, the spatial alignment component 3006 may determine how to position the marker data as an overlay on the transparent display that aligns the marker data with the actual view of the electrical panel.
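One simplified way to perform such localization is to register the point cloud back-projected from the derived depth data against points sampled from the previously generated 3D model. The sketch below uses a naive point-to-point ICP loop with NumPy and SciPy; it only illustrates the idea and is not the alignment process of the disclosure.

```python
# Illustrative localization sketch: register the device's depth-derived point
# cloud to a point cloud sampled from the previously generated 3D model using a
# few naive point-to-point ICP iterations. A production system would use a more
# robust alignment; this only shows the principle.
import numpy as np
from scipy.spatial import cKDTree

def best_fit_transform(src: np.ndarray, dst: np.ndarray):
    """Kabsch/Umeyama: rigid (R, t) minimizing ||R @ src_i + t - dst_i||."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # avoid reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def localize(device_points: np.ndarray, model_points: np.ndarray, iters: int = 20):
    """Return (R, t) mapping camera-frame points into the 3D model's frame."""
    tree = cKDTree(model_points)
    R_total, t_total = np.eye(3), np.zeros(3)
    current = device_points.copy()
    for _ in range(iters):
        _, idx = tree.query(current)         # closest model point per device point
        R, t = best_fit_transform(current, model_points[idx])
        current = current @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```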
Fig. 31 presents an example computer-implemented method 3100 for using one or more 3D-from-2D techniques in association with an AR application in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 3102, a device (e.g., user device 3002) operatively coupled to a processor employs one or more 3D-from-2D neural network models to derive 3D data (e.g., using the 3D-from-2D processing module 1406) from one or more 2D images of an object or environment captured from a current perspective of the object or environment viewed on or through a display of the device. At 3104, the device determines a location (e.g., using the spatial alignment component 3006 and/or the 3D model localization component 3016) for integrating a graphical data object on or within a representation of the object or environment viewed on or through the display based on the current perspective and the 3D data. At 3106, the device integrates the graphical data object (e.g., using the integration component 3008) on or within the representation of the object or environment based on the location.
Fig. 32 presents an exemplary computing device 3202 that employs one or more 3D-from-2D techniques in association with object tracking, real-time navigation, and 3D feature-based security applications in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
Referring to fig. 13 and 32, computing device 3202 may include the same or similar features and functions as computing device 104, including the 3D-from-2D processing module 1304 configured to generate derived 3D data 116 and/or optimized 3D data 1306 based on received 2D image data 102 and optional native auxiliary data 802. Computing device 3202 also includes a tracking component 3204, a real-time navigation component 3206, and a 3D feature authentication component 3208. The tracking component 3204, the real-time navigation component 3206, and the 3D feature authentication component 3208 may each comprise computer-executable components stored in a memory (e.g., memory 122) that, when executed by a processor (e.g., processor 124), perform the described operations.
In one or more embodiments, based on derived 3D data 116 and/or optimized 3D data 1306 determined for an object from 2D image data of the object captured over a period of time, the tracking component 3204 can facilitate tracking the relative position or location of objects, things, people, and the like included in the environment. For example, in some implementations, the tracking component 3204 may receive successive frames of video of an object captured via one or more cameras over a period of time. The tracking component 3204 may then use the derived 3D data 116 and/or optimized 3D data 1306 determined for the object in at least some sequential frames of the video to determine the relative position of the object with respect to the camera over that period of time. In some implementations, computing device 3202 may also house the one or more cameras. In some embodiments, the object comprises a moving object, and the one or more cameras may track the position of the object while the cameras themselves also move over the period of time or remain in a fixed position relative to the moving object. In other embodiments, the object may comprise a fixed object, and the one or more cameras may move relative to the object. For example, the one or more cameras may be attached to a moving vehicle or object, held in a user's hand as the user moves through the environment, and so forth.
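A hedged sketch of such depth-based tracking follows: for each frame, the derived depth map and an object mask (assumed to come from a separate detector, which is not shown) yield the object's 3D centroid in the camera frame. The helper names and intrinsics are illustrative.

```python
# Sketch of depth-based tracking: for each video frame, the derived depth map
# and a per-frame object mask give the object's 3D centroid in the camera
# frame, which can then be followed over time.
import numpy as np

def object_position(depth: np.ndarray, mask: np.ndarray,
                    fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Mean 3D position (camera frame) of the pixels where mask is True."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1).mean(axis=0)

def track(frames_depth, frames_mask, fx, fy, cx, cy):
    """Return one 3D position per frame; a real tracker would also smooth/filter."""
    return [object_position(d, m, fx, fy, cx, cy)
            for d, m in zip(frames_depth, frames_mask)]
```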
The real-time navigation component 3206 may facilitate real-time navigation of an environment by a mobile entity that includes computing device 3202 and one or more cameras configured to capture and provide 2D image data (and optionally native auxiliary data 802). For example, a mobile entity may include a user-operated vehicle, an autonomous vehicle, a drone, a robot, or another device that may benefit from knowing its location relative to objects included in the environment in which the device is navigating. According to this embodiment, the real-time navigation component 3206 may capture image data corresponding to the current perspective of the computing device relative to the environment continuously, regularly (e.g., at defined points in time), or in response to a trigger (e.g., a sensed signal indicating that one or more objects are within a defined distance from the computing device). The 3D-from-2D processing module 1304 may further determine the derived 3D data 116 and/or optimized 3D data 1306 of the corresponding objects, things, and people included in the immediate environment of computing device 3202. Based on the derived 3D data 116 and/or optimized 3D data 1306 indicating the relative position of computing device 3202 with respect to one or more objects in the environment being navigated, the real-time navigation component 3206 may determine navigation information for the entity employing computing device 3202, including a navigation path that avoids collisions with the objects, a navigation path that brings the entity to a desired position with respect to an object in the environment, and so forth. In some implementations, the real-time navigation component 3206 may also use information that semantically identifies objects included in the environment to facilitate navigation (e.g., where the vehicle should go, what the vehicle should avoid, etc.).
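As a simplified illustration of how derived depth can inform a collision-avoiding heading, the sketch below splits the depth map into vertical sectors and picks the sector with the greatest clearance; the sector count and clearance threshold are arbitrary assumptions.

```python
# Naive obstacle-avoidance sketch: split the derived depth map into vertical
# sectors, measure the nearest obstacle in each, and steer toward the sector
# with the most clearance. Real navigation stacks are far more involved.
import numpy as np

def pick_heading(depth: np.ndarray, n_sectors: int = 9, min_clearance: float = 0.75):
    """Return (sector_index, clearance_m) for the most open direction, or None."""
    h, w = depth.shape
    band = depth[h // 3: 2 * h // 3, :]              # consider the middle rows only
    clearances = [np.nanmin(band[:, i * w // n_sectors:(i + 1) * w // n_sectors])
                  for i in range(n_sectors)]
    best = int(np.argmax(clearances))
    if clearances[best] < min_clearance:
        return None                                   # no safe heading; stop
    return best, float(clearances[best])
```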
The 3D feature authentication component 3208 may employ the derived 3D data 116 and/or optimized 3D data 1306 determined for an object to facilitate an authentication process. For example, in some embodiments, the object may include a face, and the derived 3D data 116 and/or optimized 3D data 1306 may include a depth map that represents the surface of the face. The depth map may be used to facilitate face-based biometric authentication of the user's identity.
FIG. 33 presents an exemplary system 3300 for developing and training 3D-from-2D models in accordance with various aspects and embodiments described herein. The system 3300 includes at least some features (e.g., the 3D-from-2D processing module 1406, 2D image data 102, panoramic image data 502, native auxiliary data 802, derived 3D data 116, and optimized 3D data 1306) that are the same or similar to previous systems disclosed herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
In the illustrated embodiment, the system 3300 includes a computing device 3312 that includes computer-executable components, including a 3D-from-2D model development module 3314 and the 3D-from-2D processing module 1406. The computing device 3312 may include or be operatively coupled to at least one memory 3322 and at least one processor 3320. In one or more embodiments, the at least one memory 3322 may further store computer-executable instructions (e.g., the 3D-from-2D model development module 3314 and the 3D-from-2D processing module 1406) that, when executed by the at least one processor 3320, cause performance of the operations defined by the computer-executable instructions. In some embodiments, the memory 3322 may also store information received, generated, and/or employed by the computing device 3312 (e.g., the 3D spatial model database 3302, the 3D-from-2D model database 3326, the received 2D image data 102, the received native auxiliary data 802, the derived 3D data 116, the optimized 3D data 1306, and/or additional training data generated by the 3D-from-2D model development module 3314 discussed below). The computing device 3312 may also include a device bus 3324 that communicatively couples the various components of the computing device 3312. Examples of the processor 3320 and the memory 3322, as well as other suitable computer or computing-based elements that may be used in connection with implementing one or more of the systems or components shown and described in connection with fig. 33 or other figures disclosed herein, may be found with reference to fig. 35.
The system 3300 also includes a 3D spatial model database 3302 and a 3D-from-2D model database 3326. In one or more embodiments, the 3D-from-2D model development module 3314 may be configured to facilitate generating and/or training one or more 3D-from-2D models included in the 3D-from-2D model database 3326 based at least in part on data provided by the 3D spatial model database 3302. For example, in the illustrated embodiment, the 3D-from-2D model development module 3314 may include a training data development component 3316 to facilitate collection and/or generation of training data based on the various types of rich 3D model information (described below) provided by the 3D spatial model database 3302. The 3D-from-2D model development module 3314 may also include a model training component 3318 that may be configured to employ the training data to train and/or develop one or more 3D-from-2D neural network models included in the 3D-from-2D model database 3326. The 3D-from-2D processing module 1406 may further employ the 3D-from-2D models included in the 3D-from-2D model database 3326 to generate the derived 3D data 116 and/or optimized 3D data 1306 based on the received input data (including the 2D image data 102 and/or the native auxiliary data 802) according to the various techniques described above.
In one or more embodiments, the 3D spatial model database 3302 may include a large amount of proprietary data associated with previously generated 3D spatial models that were generated using proprietary alignment techniques (e.g., those described herein), captured 2D image data, and associated depth data captured by various 3D sensors. In this regard, data for generating a 3D spatial model may be collected by scanning (e.g., with one or more types of 3D sensors) real-world scenes, spaces (e.g., houses, office spaces, outdoor spaces, etc.), objects (e.g., furniture, decorations, merchandise, etc.), and the like. The data may also be generated by a computer-implemented 3D modeling system. For example, in some embodiments, the 3D spatial models are generated using one or more of the 2D/3D capture devices and/or systems described in U.S. patent application No. 15/417,162, filed on January 26, 2017 and entitled "CAPTURING AND ALIGNING PANORAMIC IMAGE AND DEPTH DATA," and U.S. patent application No. 14/070,426, filed on November 1, 2013 and entitled "CAPTURING AND ALIGNING THREE-DIMENSIONAL SCENES," the entire contents of which are incorporated herein by reference. In some embodiments, the data provided by the 3D spatial model database 3302 may also include information of 3D spatial models generated by the 3D model generation component 118 according to the techniques described herein. The 3D spatial model database 3302 may also include information of the 3D spatial models 3028 discussed with reference to fig. 30.
In this regard, the 3D spatial model database 3302 may include 3D model and alignment data 3304, indexed 2D image data 3306, indexed 3D sensor data 3308, and indexed semantic tag data 3310. The 3D model and alignment data 3304 may include previously generated 3D spatial models of various objects and environments and associated alignment information related to the relative positions of the geometric points, shapes, etc. that form the 3D models. For example, a 3D spatial model may include data representing positions, geometries, curved surfaces, and the like. The 3D spatial model may also include data comprising sets of points represented by 3D coordinates, such as points in 3D Euclidean space. The sets of points may be associated (e.g., connected) with each other by geometric entities. For example, a mesh may connect the sets of points using a series of triangles, lines, curved surfaces (e.g., non-uniform rational basis splines (NURBS)), quadrilaterals, n-gons, or other geometric shapes. For example, a 3D model of a building interior environment may include mesh data (e.g., a triangle mesh, a quadrilateral mesh, a parameterized mesh, etc.), one or more texture-mapped meshes (e.g., one or more texture-mapped polygon meshes, etc.), a point cloud, a set of point clouds, surfels, and/or other data constructed with one or more 3D sensors. In some implementations, portions of the 3D model geometry data (e.g., the mesh) may include image data describing texture, color, intensity, etc. For example, the geometric data may include geometric data points in addition to texture coordinates associated with the geometric data points (e.g., texture coordinates indicating how the texture data is to be applied to the geometric data).
The indexed 2D image data 3306 may include 2D image data for generating a 3D spatial model represented by the 3D model and alignment data 3304. For example, the indexed 2D image data 3306 may include a set of images used to generate the 3D spatial model, and also include information that associates the respective images with portions of the 3D spatial model. For example, 2D image data may be associated with portions of a 3D model mesh to associate visual data (e.g., texture data, color data, etc.) from the 2D image data 102 with the mesh. The indexed 2D image data 3306 may also include information that associates the 2D image with a particular location of the 3D model and/or a particular perspective for viewing the 3D spatial model. The indexed 3D sensor data 3308 may include 3D/depth measurements associated with the respective 2D images used to generate the 3D spatial model. In this regard, the indexed 3D sensor data 3308 may include captured 3D sensor readings captured by one or more 3D sensors and associated with respective pixels, superpixels, objects, etc. of respective 2D images, which are used to align the 2D images to generate a 3D spatial model. The indexed semantic tag data 3310 may include previously determined semantic tags associated with corresponding objects or features of the 3D spatial model. For example, the indexed semantic tag data 3310 may identify walls, ceilings, fixtures, appliances, etc. included in the 3D model and also include information identifying spatial boundaries of corresponding objects within the 3D spatial model.
Conventional training data for generating 3D-from-2D neural network models includes 2D images having known depth data for respective pixels, superpixels, objects, etc. included in the respective 2D images, such as the indexed 3D sensor data 3308 associated with the respective 2D images included in the indexed 2D image data 3306 that were used to generate the 3D spatial models included in the 3D model and alignment data 3304. In one or more embodiments, the 3D-from-2D model development module 3314 may extract such training data (e.g., indexed 2D images and associated 3D sensor data) from the 3D spatial model database 3302 for provision to the model training component 3318 for use in association with generating and/or training one or more 3D-from-2D neural network models included in the 3D-from-2D model database 3326. In various additional embodiments, the training data development component 3316 may further use the reconstructed 3D spatial models to create training examples for 2D images that were not directly captured by a camera or 3D sensor. For example, in some implementations, the training data development component 3316 may employ the textured 3D meshes of the 3D spatial models included in the 3D model and alignment data 3304 to generate 2D images from camera positions at which no actual camera was placed. For example, the training data development component 3316 may use the capture location/orientation information of the respective images included in the indexed 2D image data 3306 to determine various virtual capture location/orientation combinations that are not represented by the captured 2D images. The training data development component 3316 may further generate synthetic images of the 3D model from these virtual capture locations/orientations. In some implementations, the training data development component 3316 may generate synthetic 2D images from various perspectives of the 3D model that correspond to a series of images captured by a virtual camera in association with navigating the 3D spatial model, wherein the navigation captures the scene as if a user were actually walking through the environment represented by the 3D model while holding a camera and capturing images along the way.
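A simplified sketch of this synthetic-view generation is shown below: a virtual pinhole camera pose is chosen, points sampled from the reconstructed model are projected into it, and the nearest depth per pixel is kept as ground-truth depth for that synthetic view. Rendering the matching RGB image from the textured mesh would normally use a renderer and is omitted; all parameter names are illustrative.

```python
# Simplified sketch of synthetic-example generation: place a virtual pinhole
# camera at a pose where no real camera stood, project points sampled from the
# reconstructed 3D model into it, and keep the nearest depth per pixel as the
# ground-truth depth for that synthetic view.
import numpy as np

def render_depth(model_points: np.ndarray, R: np.ndarray, t: np.ndarray,
                 fx: float, fy: float, cx: float, cy: float,
                 h: int, w: int) -> np.ndarray:
    """model_points: Nx3 in world frame; (R, t) maps world -> camera frame."""
    cam = model_points @ R.T + t
    cam = cam[cam[:, 2] > 0]                       # keep points in front of camera
    u = np.round(fx * cam[:, 0] / cam[:, 2] + cx).astype(int)
    v = np.round(fy * cam[:, 1] / cam[:, 2] + cy).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.full((h, w), np.inf)
    # z-buffer: keep the nearest surface point that lands on each pixel
    np.minimum.at(depth, (v[ok], u[ok]), cam[ok, 2])
    return depth
```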
The training data development component 3316 may also generate other forms of training data associated with the synthetic 2D images and the actual 2D images in a similar manner. For example, the training data development component 3316 may generate IMU measurements, magnetometer data, depth sensor data, etc., as if such sensors had been placed in, or moved through, the 3D space. Based on the known locations of the points included in a synthetic image and the virtual camera capture location and orientation relative to the 3D spatial model from which the synthetic image was generated, the training data development component 3316 may generate depth data for the respective pixels, superpixels, objects, etc. included in the synthetic image. In another example, the training data development component 3316 may determine depth data for a captured 2D image by aligning visual features of the 2D image with known features of a 3D model for which depth information is available. Other inputs may likewise be generated as if a particular sensor had been used in the 3D space.
In some embodiments, the training data development component 3316 may further employ the 3D spatial models included in the 3D model and alignment data 3304 to create synthetic "ground truth" 3D data from those reconstructed environments to match each 2D image used to create the 3D spatial models (e.g., included in the indexed 2D image data 3306), as well as synthetic 2D images generated from perspectives of the 3D spatial model that were never actually captured from the actual environment by an actual camera. The resulting synthetic 3D "ground truth" data for a respective image may thus exceed the quality of the actual 3D sensor data captured for that image (e.g., the actual 3D sensor data included in the indexed 3D sensor data 3308), thereby improving the training effect. In this regard, because the synthetic 3D data is derived from a 3D model generated by aligning several images with overlapping or partially overlapping image data to one another using various alignment optimization techniques, the aligned 3D positions of corresponding points in the images may be more accurate than the 3D sensor data associated with the individual images captured by a 3D sensor. In this regard, the aligned pixels of a single 2D image included in the 3D model will have 3D positions relative to the 3D model that are determined not only based on the captured 3D sensor data associated with the 2D image, but also based on the alignment process used to create the 3D model, wherein the relative positions of the other images to the 2D image and the 3D coordinate space are used to determine the final 3D positions of the aligned pixels. Thus, the aligned 3D pixel positions associated with the 3D model may be considered more accurate than the 3D measurements for those pixels captured by a depth sensor.
In one or more additional embodiments, the training data development component 3316 may also extract additional scene information associated with the 3D spatial model, such as the semantic tags included in the indexed semantic tag data 3310, and include it with the corresponding 2D images used as training data. In this regard, the training data development component 3316 may use the indexed semantic tag data 3310 to determine semantic tags and associate the semantic tags with the 2D images (e.g., indexed 2D images and/or synthetic 2D images) that the model training component 3318 uses to develop and/or train the 3D-from-2D neural network models. This allows the model training component 3318 to train the 3D-from-2D neural network models to predict semantic tags (e.g., walls, ceilings, doors, etc.) without manual annotation of the data set.
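A minimal sketch of this label transfer is shown below: semantically tagged model points are projected into a camera pose, and the nearest point per pixel contributes its label, yielding a per-pixel label image with no manual annotation. The function signature and label encoding are assumptions for illustration.

```python
# Sketch of label transfer: because each model point already carries a semantic
# tag (wall, ceiling, door, ...), projecting the labeled points into a camera
# pose yields a per-pixel label image for that view without manual annotation.
import numpy as np

def render_labels(model_points, point_labels, R, t, fx, fy, cx, cy, h, w):
    """Project tagged model points into a virtual camera; return an HxW integer
    label image (-1 where nothing projects), with the nearest point winning."""
    cam = model_points @ R.T + t
    keep = cam[:, 2] > 0
    cam, labs = cam[keep], point_labels[keep]
    u = np.round(fx * cam[:, 0] / cam[:, 2] + cx).astype(int)
    v = np.round(fy * cam[:, 1] / cam[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    label_img = np.full((h, w), -1, dtype=np.int32)
    zbuf = np.full((h, w), np.inf)
    for ui, vi, zi, li in zip(u[inside], v[inside], cam[inside, 2], labs[inside]):
        if zi < zbuf[vi, ui]:                 # keep only the nearest surface point
            zbuf[vi, ui] = zi
            label_img[vi, ui] = li
    return label_img
```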
In various embodiments, the model training component 3318 may employ the training data collected and/or generated by the training data development component 3316 to train and/or develop one or more 3D-from-2D neural network models included in the 3D-from-2D model database 3326. In some implementations, the 3D-from-2D model database 3326 may be, include, or correspond to the 3D-from-2D model database 112. For example, in the illustrated embodiment, the 3D-from-2D model database may include one or more panoramic models 514 and one or more enhancement models 810. In some implementations, the model training component 3318 may generate and/or train the one or more panoramic models 514 and/or the one or more enhancement models 810 (discussed above) based at least in part on training data provided by the training data development component 3316. The 3D-from-2D model database 3326 may also include one or more optimized models 3328. The one or more optimized models 3328 may include one or more 3D-from-2D neural network models that have been specially trained using training data provided by the training data development component 3316. In this regard, the one or more optimized models 3328 may employ various 3D-from-2D derivation techniques to derive 3D data from 2D images as discussed herein, including the 3D-from-2D derivation techniques discussed with reference to the one or more standard models 114. However, relative to other 3D-from-2D models trained on conventional input data, the one or more optimized models 3328 may be configured to generate more accurate and precise depth derivations based on training using the training data provided by the training data development component 3316. For example, in some embodiments, the optimized models 3328 may include standard 3D-from-2D models that have been specially trained using training data provided by the training data development component 3316. Thus, a standard 3D-from-2D model may be converted into an optimized 3D-from-2D model configured to provide more accurate results relative to a standard 3D-from-2D model trained on alternative training data (e.g., training data not provided by the training data development component 3316).
FIG. 34 presents an exemplary computer-implemented method 3400 for developing and training 3D-from-2D models in accordance with various aspects and embodiments described herein. For brevity, repeated descriptions of similar elements employed in the corresponding embodiments are omitted.
At 3402, a system (e.g., system 3300) operably coupled to a processor accesses (e.g., from the 3D spatial model database 3302) a 3D model of an object or environment, the model having been generated based on 2D images of the object or environment captured at different capture locations relative to the object or environment and depth data captured for the 2D images via one or more depth sensor devices (e.g., using the training data development component 3316). At 3404, the system determines auxiliary training data for the 2D images based on the 3D model. For example, the training data development component 3316 may determine semantic tags for the images and/or synthetic 3D data for the 2D images. At 3406, the system may train one or more 3D-from-2D neural networks with the 2D images and the auxiliary training data to derive 3D information from new 2D images, the auxiliary training data being treated as ground truth data in association with training the one or more neural networks.
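The training step of method 3400 can be sketched, under the assumption of a toy PyTorch network, as joint supervision of per-pixel depth (an L1 loss against the synthetic ground-truth depth) and semantic tags (cross-entropy against the label maps); the architecture and hyperparameters are illustrative only and do not represent the models of the disclosure.

```python
# Hedged sketch of the training step of method 3400: a toy encoder/decoder
# predicts per-pixel depth and semantic logits from an RGB image, supervised by
# the synthetic ground-truth depth and label maps derived from the 3D models.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDepthNet(nn.Module):
    def __init__(self, n_classes: int = 8):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.depth_head = nn.Conv2d(64, 1, 3, padding=1)
        self.label_head = nn.Conv2d(64, n_classes, 3, padding=1)

    def forward(self, rgb):
        f = self.enc(rgb)
        depth = F.softplus(self.depth_head(f))            # depths are positive
        logits = self.label_head(f)
        size = rgb.shape[-2:]
        return (F.interpolate(depth, size=size, mode="bilinear", align_corners=False),
                F.interpolate(logits, size=size, mode="bilinear", align_corners=False))

def train_step(model, opt, rgb, gt_depth, gt_labels):
    """rgb: Bx3xHxW, gt_depth: Bx1xHxW, gt_labels: BxHxW (long class indices)."""
    pred_depth, logits = model(rgb)
    loss = F.l1_loss(pred_depth, gt_depth) + F.cross_entropy(logits, gt_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

model = TinyDepthNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```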
Exemplary Operating Environment
In order to provide a context for various aspects of the disclosed subject matter, fig. 35 and 36 and the following discussion are intended to provide a brief, general description of a suitable environment in which the various aspects of the disclosed subject matter may be implemented.
With reference to fig. 35, a suitable environment 3500 for implementing various aspects of the disclosure includes a computer 3512. The computer 3512 includes a processing unit 3514, a system memory 3516, and a system bus 3518. The system bus 3518 couples system components including, but not limited to, the system memory 3516 to the processing unit 3514. The processing unit 3514 can be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as the processing unit 3514.
The system bus 3518 can be any of several types of bus structure including a memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, industry Standard Architecture (ISA), micro-channel architecture (MSA), extended ISA (EISA), intelligent Drive Electronics (IDE), VESA Local Bus (VLB), peripheral Component Interconnect (PCI), card bus, universal Serial Bus (USB), advanced Graphics Port (AGP), personal computer memory card international association bus (PCMCIA), firewire (IEEE 1394), and Small Computer System Interface (SCSI).
The system memory 3516 includes volatile memory 3520 and nonvolatile memory 3522. A basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 3512, such as during start-up, is stored in nonvolatile memory 3522. By way of illustration, and not limitation, nonvolatile memory 3522 can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), flash memory, or nonvolatile Random Access Memory (RAM) (e.g., ferroelectric RAM (FeRAM)). Volatile memory 3520 includes Random Access Memory (RAM) which acts as external cache memory. By way of illustration and not limitation, RAM can be available in a variety of forms, such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), direct Rambus RAM (DRRAM), direct Rambus Dynamic RAM (DRDRAM), and Rambus dynamic RAM.
The computer 3512 also includes volatile/nonvolatile computer storage media that are removable/non-removable. For example, fig. 35 illustrates a disk storage device 3524. Disk storage 3524 includes, but is not limited to, devices like a magnetic disk drive, floppy disk drive, tape drive, jaz drive, zip drive, LS-100 drive, flash memory card, or memory stick. The magnetic disk storage 3524 can also include storage media separately or in combination with other storage media including, but not limited to, an optical disk Drive such as a compact disk ROM device (CD-ROM), CD recordable Drive (CD-R Drive), CD rewritable Drive (CD-RW Drive) or a digital versatile disk ROM Drive (DVD-ROM). To facilitate connection of the disk storage devices 3524 to the system bus 3518, a removable or non-removable interface is typically used such as interface 3526.
Fig. 35 also depicts software that acts as an intermediary between users and the basic computer resources described in suitable operating environment 3500. Such software includes, for example, an operating system 3528. The operating system 3528, which can be stored on disk storage 3524, acts to control and allocate resources of the computer system 3512. System applications 3530 take advantage of the management of resources by operating system 3528 through program modules 3532 and program data 3534 (e.g., stored either in system memory 3516 or on disk storage 3524). It is to be appreciated that the present disclosure can be implemented with various operating systems or combinations of operating systems.
A user enters commands or information into the computer 3512 through input device 3536. Input devices 3536 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices are connected to the processing unit 3514 through the system bus 3518 via interface ports 3538. The interface port(s) 3538 include, for example, a serial port, a parallel port, a game port, and a Universal Serial Bus (USB). The output device(s) 3540 use some of the same type of ports as the input device 3536. Thus, for example, a USB port may be used to provide input to computer 3512 and to output information from computer 3512 to an output device 3540. Output adapter 3542 is provided to illustrate that there are some output devices 3540 like monitors, speakers, and printers, among other output devices 3540, which require special adapters. By way of illustration, and not limitation, output adapters 3542 include video and sound cards that provide a means of connection between the output device 3540 and the system bus 3518. It should be noted that other devices and/or systems of devices provide both input and output capabilities, such as remote computer 3544.
The computer 3512 can operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 3544. The remote computer 3544 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 3512. For simplicity, memory storage 3546 is shown only with respect to remote computer 3544. The remote computer(s) 3544 are logically connected to computer 3512 through a network interface 3548 and then physically connected via communication connection 3550. The network interface 3548 includes wired and/or wireless communication networks such as a Local Area Network (LAN), wide Area Network (WAN), cellular network, and the like. LAN technologies include Fiber Distributed Data Interface (FDDI), copper Distributed Data Interface (CDDI), ethernet, token ring, and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switched networks such as Integrated Services Digital Networks (ISDN) and variants thereof, packet switched networks, and Digital Subscriber Lines (DSL).
The communication connection(s) 3550 refers to hardware/software for connecting the network interface 3548 to the bus 3518. While communication connection 3550 is shown for illustrative clarity inside computer 3512, it can also be external to computer 3512. The hardware/software necessary for connection to the network interface 3548 includes, for exemplary purposes only, internal and external technologies such as, modems including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.
It should be appreciated that the computer 3512 may be utilized in conjunction with one or more of the systems, components, and/or methods illustrated and described in fig. 1-34. According to various aspects and implementations, the computer 3512 can be used to facilitate determining and/or executing commands associated with deriving depth data from 2D images, using the derived depth data for various applications including AR and object tracking, generating training data, and the like (e.g., by the systems 100, 500, 800, 1300, 3000, 3200, and 3300). The computer 3512 may further provide various processes of the 2D image data and the 3D depth data described in association with the primary processing component 104, the secondary processing component 110, the third processing component 114, the processing component 420, the processing component 1222, and the processing component 1908. The computer 3512 may further provide for rendering and/or displaying 2D/3D image data and video data generated by the various 2D/3D panorama capturing devices, apparatuses, and systems described herein. The computer 3512 includes components 3506 that can embody one or more of the various components described in association with the various systems, devices, components, and computer-readable media described herein.
FIG. 36 is a schematic block diagram of a sample-computing environment 3600 with which the subject matter of the present disclosure may interact. The system 3600 includes one or more clients 3610. The client(s) 3610 can be hardware and/or software (e.g., threads, processes, computing devices). The system 3600 also includes one or more servers 3630. Thus, the system 3600 may correspond to, among other models, a two-tier client server model or a multi-tier model (e.g., client, middle tier server, data server). The server(s) 3630 can also be hardware and/or software (e.g., threads, processes, computing devices). The servers 3630 can house threads to perform transformations by employing the present disclosure, for example. One possible communication between a client 3610 and a server 3630 may be in the form of a data packet transmitted between two or more computer processes.
The system 3600 includes a communication framework 3650 that can be employed to facilitate communications between the client(s) 3610 and the server(s) 3630. The client(s) 3610 are operably connected to one or more client data store(s) 3620 that can be employed to store information local to the client(s) 3610. Similarly, the server(s) 3630 are operatively connected to one or more server data store(s) 3640 that can be employed to store information local to the servers 3630.
It is noted that aspects or features of the present disclosure may be utilized in substantially any wireless telecommunications or radio technology, such as Wi-Fi; bluetooth; worldwide Interoperability for Microwave Access (WiMAX); enhanced general packet radio service (enhanced GPRS); third generation partnership project (3 GPP) Long Term Evolution (LTE); third generation partnership project 2 (3 GPP 2) Ultra Mobile Broadband (UMB); 3GPP Universal Mobile Telecommunications System (UMTS); high Speed Packet Access (HSPA); high Speed Downlink Packet Access (HSDPA); high Speed Uplink Packet Access (HSUPA); GSM (global system for mobile communications) EDGE (enhanced data rates for GSM Evolution) Radio Access Network (GERAN); UMTS Terrestrial Radio Access Network (UTRAN); LTE-advanced (LTE-a), and the like. Additionally, some or all aspects described herein may be utilized in conventional telecommunications technology (e.g., GSM). In addition, mobile as well as non-mobile networks (e.g., the internet, data services networks such as Internet Protocol Television (IPTV), etc.) may utilize aspects or features described herein.
While the subject matter has been described above in the general context of computer-executable instructions of a computer program that runs on a computer and/or computers, those skilled in the art will recognize that the disclosure also may or can be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, etc. that perform particular tasks and/or implement particular abstract data types. Furthermore, those skilled in the art will appreciate that the inventive methods can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, minicomputers, mainframe computers, as well as personal computers, hand-held computing devices (e.g., PDAs, telephones), microprocessor-based or programmable consumer electronics, or industrial electronics, and the like. The illustrated aspects may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. However, some, if not all, aspects of the disclosure may be practiced on stand-alone computers. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
As used in this disclosure, the terms "component," "system," "platform," "interface," and the like can refer to and/or can include a computer-related entity or an entity associated with an operating machine having one or more particular functions. The entities disclosed herein may be hardware, a combination of hardware and software, or executing software. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
In another example, the respective components may execute from various computer readable media having various data structures stored thereon. The components may communicate via local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems via the signal). As another example, a component may be an apparatus having particular functions provided by mechanical parts operated by electrical or electronic circuitry operated by software or firmware applications executed by a processor. In this case, the processor may be internal or external to the device and may execute at least a portion of the software or firmware application. As yet another example, a component may be a device that provides a particular function through an electronic component without a mechanical portion, where the electronic component may include a processor or other device to execute software or firmware that at least partially imparts functionality to the electronic component. In one aspect, the component may emulate the electronic component via a virtual machine (e.g., in a cloud computing system).
In addition, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then "X employs A or B" is satisfied in any of the above cases. Furthermore, the articles "a" and "an" as used in the subject specification and drawings should generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form.
As used herein, the terms "example" and/or "exemplary" are used to mean serving as an example, instance, or illustration. For the avoidance of doubt, the subject matter disclosed herein is not limited by such examples. Moreover, any aspect or design described herein as an "example" and/or "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs, nor is it meant to preclude equivalent exemplary structures and techniques known to those of ordinary skill in the art.
Various aspects or features described herein may be implemented as a method, apparatus, system, or article of manufacture using standard programming or engineering techniques. Additionally, various aspects or features disclosed in this disclosure can be implemented by program modules that implement at least one or more of the methods disclosed herein, the program modules being stored in memory and executed by at least a processor. Combinations of hardware and software, or of hardware and firmware, can enable or implement aspects described herein, including the disclosed methods. The term "article of manufacture" as used herein may encompass a computer program accessible from any computer-readable device, carrier, or storage media. For example, computer-readable storage media can include, but are not limited to, magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips …), optical disks (e.g., Compact Disk (CD), Digital Versatile Disk (DVD), Blu-ray Disk (BD) …), smart cards, and flash memory devices (e.g., card, stick, key drive …), etc.
As used in this specification, the term "processor" may refer to essentially any computing processing unit or device, including, but not limited to: a single core processor; a single processor having software multithreading capability; a multi-core processor; a multi-core processor having software multithreading capability; a multi-core processor having hardware multithreading; a parallel platform; and a parallel platform with distributed shared memory. Additionally, a processor may refer to an integrated circuit, an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Controller (PLC), a Complex Programmable Logic Device (CPLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. In addition, processors may utilize nanoscale architectures such as, but not limited to, molecular and quantum dot based transistors, switches, and gates in order to optimize space usage or enhance performance of user equipment. A processor may also be implemented as a combination of computing processing units.
In this disclosure, terms such as "store," "storage," "data store," "database," and essentially any other information storage component related to the operation and function of the component are used to refer to "memory components," entities embodied in "memory," or components comprising memory. It should be appreciated that the memory and/or memory components described herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
By way of illustration, and not limitation, nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, or nonvolatile random access memory (RAM) (e.g., ferroelectric RAM (FeRAM)). For example, volatile memory can include RAM, which can act as external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SynchLink DRAM (SLDRAM), Direct Rambus RAM (DRRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM). Additionally, the memory components of the systems or methods disclosed herein are intended to comprise, without being limited to comprising, these and any other suitable types of memory.
It is to be appreciated and understood that components described with respect to a particular system or method may include the same or similar functionality as corresponding components (e.g., correspondingly named components or similarly named components) described with respect to other systems or methods disclosed herein.
What has been described above includes examples of systems and methods that provide the advantages of the present disclosure. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the present disclosure, but one of ordinary skill in the art may recognize that many further combinations and permutations of the present disclosure are possible. Furthermore, to the extent that the terms "includes," "has," "possesses," and the like are used in the description, claims, appendix and drawings, such terms are intended to be inclusive in a manner similar to the term "comprising," as "comprising" is interpreted when employed as a transitional word in a claim.

Claims (29)

1. A system, comprising:
a processor, and
a memory comprising instructions executable by the processor for performing a method comprising:
receiving a plurality of two-dimensional images of an environment, the plurality of two-dimensional images including a panoramic image of the environment; and
generating a three-dimensional image of the environment using one or more three-dimensional-data-from-two-dimensional-data (3D-from-2D) neural network models, the one or more 3D-from-2D neural network models being trained based on weighted values applied to respective pixels of a projected panoramic image in association with deriving depth data from the respective pixels, wherein the weighted values are based on angular areas of the respective pixels, and wherein the weight of a first pixel decreases with decreasing angular area of the respective pixel, the one or more 3D-from-2D neural network models receiving the plurality of two-dimensional images as input, wherein, using the one or more 3D-from-2D neural network models, a neural network obtains sampled pixel values from the panoramic image between convolutional layers for feature extraction and transformation using cascaded layers of non-linear processing units, the sampled pixel values being sampled from positions in a preceding layer corresponding to a defined angular receptive field based on a projection of a current layer.
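By way of illustration and not limitation, the sketch below shows one way the angular-area weighting recited in claim 1 could be realized for an equirectangular panorama, where the solid angle covered by a pixel shrinks toward the poles roughly in proportion to the cosine of its latitude. The function names and the simple weighted L1 loss are illustrative assumptions and are not taken from the disclosure.

import numpy as np

def angular_area_weights(height, width):
    """Per-pixel weights for an equirectangular panorama, proportional to the
    solid angle each pixel subtends; pixels near the poles cover a smaller
    angular area and therefore receive a smaller weight."""
    # Latitude of each pixel-row center, from near +pi/2 (top) to near -pi/2 (bottom).
    latitudes = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    row_weights = np.cos(latitudes)  # angular area per row ~ cos(latitude)
    return np.repeat(row_weights[:, None], width, axis=1)

def weighted_depth_loss(predicted_depth, target_depth):
    """Training loss in which each pixel's depth error is scaled by its angular area."""
    weights = angular_area_weights(*predicted_depth.shape)
    return np.sum(weights * np.abs(predicted_depth - target_depth)) / np.sum(weights)

Weighting the loss in this manner de-emphasizes the heavily oversampled polar rows of the projected panorama during training.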
2. The system of claim 1, the method further comprising:
determining an alignment between the two-dimensional images and a common three-dimensional coordinate space based on the three-dimensional data respectively associated with the two-dimensional images.
3. The system of claim 2, the method further comprising generating a three-dimensional model of an object or environment included in the two-dimensional images based on the alignment.
4. The system of claim 3, the method further comprising:
rendering the three-dimensional model via a display of a device.
5. The system of claim 3, the method further comprising:
navigating the three-dimensional model as rendered via a display of a device.
6. The system of claim 1, the method further comprising:
rendering the three-dimensional data of a respective image of the two-dimensional images via a display of a device.
7. The system of claim 1, the method further comprising:
transmitting the two-dimensional images and the three-dimensional data to an external device via a network, wherein, based on receiving the two-dimensional images and the three-dimensional data, the external device generates a three-dimensional model of an object or environment included in the two-dimensional images by aligning the two-dimensional images with each other based on the three-dimensional data.
8. The system of claim 1, wherein the two-dimensional image comprises a wide field-of-view image having a field of view exceeding a minimum threshold and spanning up to 360 degrees.
9. The system of claim 1, the method further comprising:
combining two or more first images of the two-dimensional images to generate a second image having a field of view larger than respective fields of view of the two or more first images, and
employing the one or more 3D-from-2D neural network models to derive at least some of the three-dimensional data from the second image.
10. The system of claim 1, the method further comprising:
receiving depth data of a portion of the plurality of two-dimensional images captured by one or more three-dimensional sensors, and
using the depth data as input to the one or more 3D-from-2D neural network models to derive the three-dimensional data of the two-dimensional images.
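Purely as an illustrative sketch of claim 10, the helper below shows one plausible way partial sensor depth could accompany a color image as extra input channels to a 3D-from-2D model; the channel layout, shapes, and names are assumptions rather than a specification from the disclosure.

import numpy as np

def build_network_input(rgb, partial_depth):
    """Stack a color image, partial sensor depth, and a validity mask so a
    3D-from-2D model can exploit the sparse measurements where they exist.

    rgb:           (H, W, 3) array of color values
    partial_depth: (H, W) array of metric depth, 0 where nothing was captured
    """
    valid_mask = (partial_depth > 0).astype(np.float32)[..., None]   # (H, W, 1)
    depth_channel = partial_depth.astype(np.float32)[..., None]      # (H, W, 1)
    return np.concatenate([rgb.astype(np.float32), depth_channel, valid_mask], axis=-1)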
11. The system of claim 10, wherein the one or more three-dimensional sensors are selected from the group consisting of: structured light sensors, light detection and ranging sensors, i.e., liDAR sensors, laser rangefinder sensors, time-of-flight sensors, light field camera sensors, and active stereo sensors.
12. The system of claim 10, wherein the two-dimensional image comprises a panoramic color image having a first vertical field of view, wherein the depth data corresponds to a second vertical field of view within the first vertical field of view, and wherein the second vertical field of view comprises a narrower field of view than the first vertical field of view.
13. The system of claim 1, wherein the two-dimensional images comprise a panoramic image pair having a horizontal field of view spanning up to 360 degrees, and wherein respective images included in the panoramic image pair are captured from different vertical positions relative to a same vertical axis, wherein the different vertical positions are offset by a stereoscopic image pair distance, wherein the plurality of two-dimensional images are captured by one or more image capture devices attached to a rotation stage to rotate about an axis during a capture process.
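As an illustrative aside, the vertically offset panorama pair recited in claim 13 admits a standard triangulation relation between vertical disparity and range; the helper below is a generic sketch of that relation (elevation angles in radians, baseline in the same units as the returned range) and is not asserted to be the disclosure's depth-derivation method.

import math

def range_from_vertical_stereo(phi_lower, phi_upper, baseline):
    """Horizontal range to a scene point seen at elevation angle phi_lower
    from the lower viewpoint and phi_upper from the upper viewpoint, the
    viewpoints being separated vertically by `baseline`.

    From tan(phi_lower) = h / d and tan(phi_upper) = (h - baseline) / d it
    follows that d = baseline / (tan(phi_lower) - tan(phi_upper)).
    """
    disparity = math.tan(phi_lower) - math.tan(phi_upper)
    if abs(disparity) < 1e-9:
        raise ValueError("No measurable vertical disparity; point is effectively at infinity.")
    return baseline / disparity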
14. The system of claim 1, wherein the system is located on a device selected from the group consisting of: mobile phones, tablet personal computers, notebook personal computers, stand-alone cameras, and wearable optical systems.
15. The system of claim 1, wherein sampling the sampled pixel values from positions in a preceding layer corresponding to a defined angular receptive field comprises sampling, near a pole of the panoramic image, over a region of greater width, the region corresponding to the horizontal stretching near the pole.
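As a further illustrative sketch of the pole-region sampling recited in claim 15, the helper below estimates how many equirectangular columns a fixed horizontal angular receptive field spans at a given image row; near a pole the span grows roughly as 1/cos(latitude), which corresponds to the horizontal stretching the claim refers to. The parameter names, default field of view, and tap count are assumptions for illustration.

import numpy as np

def horizontal_sampling_offsets(row, height, width, angular_fov_deg=3.0, taps=3):
    """Column offsets to sample so that a fixed horizontal angular receptive
    field is covered at any latitude of an equirectangular panorama; the same
    angle spans many more columns near the poles than at the equator."""
    latitude = (0.5 - (row + 0.5) / height) * np.pi
    degrees_per_column = 360.0 / width
    # Columns needed to cover the angular field at this latitude (clamped near the poles).
    span_in_columns = angular_fov_deg / (degrees_per_column * max(np.cos(latitude), 1e-3))
    return np.linspace(-span_in_columns / 2.0, span_in_columns / 2.0, taps)

For a 1024-column panorama, for example, the same 3-degree field covers roughly 8 to 9 columns at the equator but hundreds of columns within a few degrees of a pole.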
16. An apparatus, comprising:
a camera configured to capture two-dimensional images;
a processor, and
a memory comprising instructions executable by the processor for performing a method comprising:
generating a plurality of two-dimensional images of an environment, the plurality of two-dimensional images including a panoramic image of the environment; and
generating a three-dimensional image of the environment using one or more three-dimensional-data-from-two-dimensional-data (3D-from-2D) neural network models, the one or more 3D-from-2D neural network models being trained based on weighted values applied to respective pixels of a projected panoramic image in association with deriving depth data from the respective pixels, wherein the weighted values are based on angular areas of the respective pixels, and wherein the weight of a first pixel decreases with decreasing angular area of the respective pixel, the one or more 3D-from-2D neural network models receiving the plurality of two-dimensional images as input to derive three-dimensional data of the plurality of two-dimensional images, wherein, using the one or more 3D-from-2D neural network models, a neural network obtains sampled pixel values from the panoramic image between convolutional layers, the sampled pixel values being sampled from positions in a preceding layer corresponding to a defined angular receptive field based on a projection of a current layer, and generates the three-dimensional data for feature extraction and transformation using cascaded layers of non-linear processing units.
17. The apparatus of claim 16, the method further comprising:
determining an alignment between the two-dimensional images and a common three-dimensional coordinate space based on the three-dimensional data respectively associated with the two-dimensional images.
18. The apparatus of claim 16, the method further comprising:
transmitting the two-dimensional images and the three-dimensional data to an external device, wherein, based on receiving the two-dimensional images and the three-dimensional data, the external device generates a three-dimensional model of an object or environment included in the two-dimensional images by aligning the two-dimensional images with each other based on the three-dimensional data.
19. The apparatus of claim 16, wherein the two-dimensional image comprises a wide field-of-view image having a field of view exceeding a minimum threshold and spanning up to 360 degrees.
20. The apparatus of claim 16, the method further comprising:
capturing depth data of a portion of the plurality of two-dimensional images by one or more three-dimensional sensors, and
using the depth data as input to the one or more 3D-from-2D neural network models to derive the three-dimensional data of the two-dimensional images.
21. The apparatus of claim 20, wherein the one or more three-dimensional sensors are selected from the group consisting of: structured light sensors, light detection and ranging sensors, i.e., liDAR sensors, laser rangefinder sensors, time-of-flight sensors, light field camera sensors, and active stereo sensors.
22. The apparatus of claim 20, wherein the two-dimensional image comprises a panoramic color image having a first vertical field of view, wherein the one or more three-dimensional sensors are configured to capture the depth data for a second vertical field of view within the first vertical field of view, and wherein the second vertical field of view comprises a field of view narrower than the first vertical field of view.
23. The apparatus of claim 16, wherein the apparatus is selected from the group consisting of: mobile phones, tablet personal computers, notebook personal computers, stand-alone cameras, and wearable optical devices.
24. A method, comprising:
capturing, by a system comprising a processor, a plurality of two-dimensional images of an environment, the plurality of two-dimensional images comprising a panoramic image of the environment; and
transmitting, by the system, the plurality of two-dimensional images to a remote device, wherein, based on receipt of the two-dimensional images, the remote device employs one or more three-dimensional-data-from-two-dimensional-data (3D-from-2D) neural network models, the one or more 3D-from-2D neural network models being trained based on weighted values applied to respective pixels of a projected panoramic image in association with deriving depth data from the respective pixels, wherein the weighted values are based on angular areas of the respective pixels, and wherein the weight of a first pixel decreases as the angular area of the respective pixel decreases, the one or more 3D-from-2D neural network models receiving the plurality of two-dimensional images as input, the remote device further using a neural network that, with the one or more 3D-from-2D neural network models, obtains sampled pixel values from the panoramic image between convolutional layers for feature extraction and transformation using cascaded layers of non-linear processing units, the sampled pixel values being sampled from positions in a preceding layer corresponding to a defined angular receptive field based on a projection of a current layer, the remote device generating a three-dimensional reconstruction based on the three-dimensional data derived from the plurality of two-dimensional images.
25. The method of claim 24, further comprising:
receiving, by the system, the three-dimensional reconstruction from the remote device; and
rendering, by the system, the three-dimensional reconstruction via a display of a device.
26. The method of claim 24, wherein the capturing comprises capturing the two-dimensional image as a panoramic image having a horizontal field of view spanning up to 360 degrees, wherein the plurality of two-dimensional images are captured by one or more image capture devices attached to a rotation stage to rotate about an axis during a capture process.
27. The method of claim 26, wherein the capturing comprises capturing a plurality of pairs of the panoramic images, comprising capturing respective images of the plurality of pairs of panoramic images from different vertical positions relative to a same vertical axis, wherein the different vertical positions are offset by a stereoscopic image pair distance.
28. The method of claim 27, wherein the capturing comprises employing a camera configured to move to the different vertical position to capture the respective image.
29. The method of claim 27, wherein the capturing comprises employing two cameras located at the different vertical positions.
CN201980062890.XA 2018-09-25 2019-09-25 Employing three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications Active CN112771539B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US16/141,558 US11094137B2 (en) 2012-02-24 2018-09-25 Employing three-dimensional (3D) data predicted from two-dimensional (2D) images using neural networks for 3D modeling applications and other applications
US16/141,558 2018-09-25
PCT/US2019/053040 WO2020069049A1 (en) 2018-09-25 2019-09-25 Employing three-dimensional data predicted from two-dimensional images using neural networks for 3d modeling applications

Publications (2)

Publication Number Publication Date
CN112771539A CN112771539A (en) 2021-05-07
CN112771539B true CN112771539B (en) 2023-08-25

Family

ID=69949995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980062890.XA Active CN112771539B (en) 2018-09-25 2019-09-25 Employing three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications

Country Status (3)

Country Link
EP (1) EP3857451A4 (en)
CN (1) CN112771539B (en)
WO (1) WO2020069049A1 (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020155522A1 (en) * 2019-01-31 2020-08-06 Huawei Technologies Co., Ltd. Three-dimension (3d) assisted personalized home object detection
US11157774B2 (en) * 2019-11-14 2021-10-26 Zoox, Inc. Depth data model training with upsampling, losses, and loss balancing
WO2021226528A1 (en) * 2020-05-08 2021-11-11 Korman Max Virtual reality film hybridization
CN113890984B (en) * 2020-07-03 2022-12-27 华为技术有限公司 Photographing method, image processing method and electronic equipment
EP3944183A1 (en) * 2020-07-20 2022-01-26 Hexagon Technology Center GmbH Method and system for enhancing images using machine learning
WO2022017779A2 (en) * 2020-07-21 2022-01-27 Interdigital Ce Patent Holdings, Sas Map for augmented reality
CN112150608A (en) * 2020-09-07 2020-12-29 鹏城实验室 Three-dimensional face reconstruction method based on graph convolution neural network
CN112396703B (en) * 2020-11-18 2024-01-12 北京工商大学 Reconstruction method of single-image three-dimensional point cloud model
US11860641B2 (en) * 2021-01-28 2024-01-02 Caterpillar Inc. Visual overlays for providing perception of depth
CN115461784A (en) * 2021-02-18 2022-12-09 株式会社Live2D Derivation model construction method, derivation model construction device, program, recording medium, configuration device, and configuration method
US20220351394A1 (en) * 2021-04-27 2022-11-03 Faro Technologies, Inc. Hybrid feature matching between intensity image and color image
CN113223173B (en) * 2021-05-11 2022-06-07 华中师范大学 Three-dimensional model reconstruction migration method and system based on graph model
CN113099847B (en) * 2021-05-25 2022-03-08 广东技术师范大学 Fruit picking method based on fruit three-dimensional parameter prediction model
EP4174773A1 (en) 2021-06-23 2023-05-03 3I Inc. Depth map image generation method and computing device for same
CN113793255A (en) * 2021-09-09 2021-12-14 百度在线网络技术(北京)有限公司 Method, apparatus, device, storage medium and program product for image processing
CN113962274B (en) * 2021-11-18 2022-03-08 腾讯科技(深圳)有限公司 Abnormity identification method and device, electronic equipment and storage medium
TWI817266B (en) * 2021-11-29 2023-10-01 邦鼎科技有限公司 Display system of sample house
CN114693670B (en) * 2022-04-24 2023-05-23 西京学院 Ultrasonic detection method for weld defects of longitudinal submerged arc welded pipe based on multi-scale U-Net
CN115861572B (en) * 2023-02-24 2023-05-23 腾讯科技(深圳)有限公司 Three-dimensional modeling method, device, equipment and storage medium
CN117291845B (en) * 2023-11-27 2024-03-19 成都理工大学 Point cloud ground filtering method, system, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10848731B2 (en) * 2012-02-24 2020-11-24 Matterport, Inc. Capturing and aligning panoramic image and depth data
US10706615B2 (en) * 2015-12-08 2020-07-07 Matterport, Inc. Determining and/or generating data for an architectural opening area associated with a captured three-dimensional model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7054478B2 (en) * 1997-12-05 2006-05-30 Dynamic Digital Depth Research Pty Ltd Image conversion and encoding techniques
US7697749B2 (en) * 2004-08-09 2010-04-13 Fuji Jukogyo Kabushiki Kaisha Stereo image processing device
WO2018140656A1 (en) * 2017-01-26 2018-08-02 Matterport, Inc. Capturing and aligning panoramic image and depth data
CN106709481A (en) * 2017-03-03 2017-05-24 深圳市唯特视科技有限公司 Indoor scene understanding method based on 2D-3D semantic data set

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas; Nikolaos Zioulis et al.; arXiv:1807.09620v1 [cs.CV]; 2018-07-15; pp. 1-18 *

Also Published As

Publication number Publication date
WO2020069049A1 (en) 2020-04-02
EP3857451A4 (en) 2022-06-22
EP3857451A1 (en) 2021-08-04
CN112771539A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN112771539B (en) Employing three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications
US20220207849A1 (en) Employing three-dimensional (3d) data predicted from two-dimensional (2d) images using neural networks for 3d modeling applications and other applications
US11677920B2 (en) Capturing and aligning panoramic image and depth data
US11989822B2 (en) Damage detection from multi-view visual data
US20210312702A1 (en) Damage detection from multi-view visual data
WO2018140656A1 (en) Capturing and aligning panoramic image and depth data
AU2020211387A1 (en) Damage detection from multi-view visual data
US20210225038A1 (en) Visual object history
WO2021146418A1 (en) Structuring visual data
Xu et al. Robotic cross-platform sensor fusion and augmented visualization for large indoor space reality capture
WO2019213392A1 (en) System and method for generating combined embedded multi-view interactive digital media representations
Babahajiani Geometric computer vision: Omnidirectional visual and remotely sensed data analysis
US20220254007A1 (en) Multi-view interactive digital media representation viewer
US20220254008A1 (en) Multi-view interactive digital media representation capture
Naheyan Extending the Range of Depth Cameras using Linear Perspective for Mobile Robot Applications
CA3131587A1 (en) 2d and 3d floor plan generation
WO2022170327A1 (en) Multi-view interactive digital media representation viewer
Tang 3D Scene Modeling And Understanding From Image Sequences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant