CN112771539A

CN112771539A - Using three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications

Info

Publication number: CN112771539A
Application number: CN201980062890.XA
Authority: CN
Inventors: D·A·高斯贝克; M·T·贝尔; W·K·阿卜杜拉; P·K·哈恩
Original assignee: Matt Porter Co
Current assignee: Matt Porter Co
Priority date: 2018-09-25
Filing date: 2019-09-25
Publication date: 2021-05-07
Anticipated expiration: 2039-09-25
Also published as: CN112771539B; EP3857451A4; WO2020069049A1; EP3857451A1

Abstract

The disclosed subject matter relates to employing a machine learning model configured to predict 3D data from a 2D image using a deep learning technique to derive 3D data for the 2D image. In some embodiments, a system is described that includes a memory storing computer-executable components, and a processor executing the computer-executable components stored in the memory. The computer-executable components include: a receiving component configured to receive a two-dimensional image; and a three-dimensional data derivation component configured to employ one or more three-dimensional data from a two-dimensional data (3D from 2D) neural network model to derive three-dimensional data for the two-dimensional image.

Description

Using three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications

Technical Field

The present application relates generally to techniques for employing three-dimensional (3D) data predicted from two-dimensional (2D) images using neural networks for 3D modeling applications and other applications.

Background

Interactive, first-person 3D immersive environments are becoming increasingly popular. In these environments, a user is able to navigate through a virtual space. Examples of such environments include first person videogames and tools for visualizing 3D models of terrain. The aerial navigation tool allows a user to virtually explore a three-dimensional urban area from an aerial viewpoint. A panoramic navigation tool (e.g., street view) allows a user to view multiple 360 degree (360 °) panoramas of an environment and navigate between these multiple panoramas by visual hybrid interpolation.

Such interactive 3D immersive environments may be generated from the real world environment based on photo-level 2D images captured from the real environment, with 3D depth information for the respective 2D images. While methods of capturing 3D depth for 2D images have existed for decades, such methods have traditionally been expensive and require complex 3D capture hardware, such as light detection and ranging (LiDAR) devices, laser rangefinder devices, time-of-flight sensor devices, structured light sensor devices, light field cameras, and the like. Furthermore, current alignment software is still limited in functionality and ease of use. For example, existing alignment methods such as the iterative closest point algorithm (ICP) require the user to manually enter an initial coarse alignment. Such manual input is typically beyond the capabilities of most non-technical users and inhibits real-time alignment of the captured images. Accordingly, there is a high need for techniques for generating 3D data for 2D images using affordable, user-friendly devices, and techniques for accurately and efficiently aligning 2D images using the 3D data to generate an immersive 3D environment.

Drawings

Fig. 1 presents an exemplary system that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data, in accordance with various aspects and embodiments described herein.

Fig. 2 presents an exemplary illustration of a reconstructed environment that may be generated based on 3D data derived from 2D image data according to various aspects and embodiments described herein.

Fig. 3 presents another exemplary reconstruction environment that may be generated based on 3D data derived from 2D image data, in accordance with various aspects and embodiments described herein.

Fig. 4 presents another exemplary reconstruction environment that may be generated based on 3D data derived from 2D image data in accordance with various aspects and embodiments described herein.

Fig. 5 presents another exemplary system that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data, in accordance with various aspects and embodiments described herein.

Fig. 6 presents an exemplary computer-implemented method for deriving 3D data from panoramic 2D image data, in accordance with various aspects and embodiments described herein.

Fig. 7 presents an exemplary computer-implemented method for deriving 3D data from panoramic 2D image data, in accordance with various aspects and embodiments described herein.

Fig. 8 presents another exemplary system that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data in accordance with various aspects and embodiments described herein.

Fig. 9 presents an exemplary assistance data component that facilitates employing assistance data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data and generating a reconstructed 3D model based on the 3D data and the captured 2D image data, in accordance with various aspects and embodiments described herein.

Fig. 10 presents an exemplary computer-implemented method for employing assistance data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein.

Fig. 11 presents an exemplary computer-implemented method for employing assistance data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein.

Fig. 12 presents an exemplary computer-implemented method for employing assistance data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein.

Fig. 13 presents another exemplary system that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data in accordance with various aspects and embodiments described herein.

Fig. 14-25 present example devices and/or systems that facilitate capturing a 2D image of an object or environment and deriving 3D/depth data from the image using one or more 3D self-2D techniques, in accordance with various aspects and embodiments described herein.

Fig. 26 presents an exemplary computer-implemented method that facilitates capturing 2D image data and deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein.

Fig. 27 presents another exemplary computer-implemented method that facilitates capturing 2D image data and deriving 3D data from the 2D image data according to various aspects and embodiments described herein.

Fig. 28 presents another exemplary computer-implemented method that facilitates capturing 2D image data and deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein.

Fig. 29 presents another exemplary computer-implemented method that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein.

Fig. 30 presents an exemplary system that facilitates associating with an Augmented Reality (AR) application using one or more 3D self-2D techniques in accordance with various aspects and embodiments described herein.

Fig. 31 presents an exemplary computer-implemented method for associating with an AR application using one or more 3D self-2D techniques in accordance with various aspects and embodiments described herein.

Fig. 32 presents an exemplary computing device employing one or more 3D self-2D techniques associated with object tracking, real-time navigation, and 3D feature-based security applications in accordance with various aspects and embodiments described herein.

Fig. 33 presents an exemplary system for developing and training a 2D self-3D model in accordance with various aspects and embodiments described herein.

Fig. 34 presents an exemplary computer-implemented method for developing and training a 2D self-3D model according to various aspects and embodiments described herein.

FIG. 35 is a schematic block diagram illustrating a suitable operating environment in accordance with various aspects and embodiments.

FIG. 36 is a schematic block diagram of a sample-computing environment in accordance with various aspects and embodiments.

Detailed Description

By way of introduction, the present disclosure relates to systems, methods, apparatuses, and computer-readable media that provide techniques for using one or more machine learning models to derive 3D data from 2D images and use the 3D data for 3D modeling applications and other applications. Various techniques for predicting 3D data (e.g., depth data or relative 3D position of image pixels) from a single 2D (color or grayscale) using machine learning (referred to herein as "self-2D predictive 3D" or simply "3D self-2D") have been developed and have recently received increasing attention. During the past decade, great efforts have been made by the research community to improve the performance of monocular deep learning, and significant accuracy has been achieved due to the rapid development and advancement of deep neural networks.

The disclosed subject matter relates to employing one or more machine learning models configured to predict 3D data from 2D images using deep learning techniques (including one or more neural network models) to derive 2D 3D data. In various embodiments, the predicted depth data may be used to generate a 3D model of the environment captured in the 2D image data. Other applications include employing predicted depth data to facilitate augmented reality applications, real-time object tracking, real-time navigation of an environment, biometric authentication applications based on a user's face, and the like. Various elements described in connection with the disclosed technology may be embodied in a computer-implemented system or apparatus and/or in a different form, such as a computer-implemented method, a computer program product, or another form (or vice versa).

In one embodiment, a method is provided for using panoramic image data to generate accurate depth predictions using 3D from 2D. The method may include receiving, by a system including a processor, a panoramic image, and employing, by the system, a 3D self-2D convolutional neural network model to derive 3D data from the panoramic image, wherein the 3D self-2D convolutional neural network model surrounds convolutional layers of the panoramic image when the panoramic image is projected on a 2D plane so as to facilitate deriving three-dimensional data. According to the method, the convolutional layer minimizes or eliminates edge effects associated with deriving the 3D data based on surrounding the panoramic image when the panoramic image is projected on the 2D plane. In some implementations, the panoramic image may be received while the panoramic image is projected on a two-dimensional plane. In other implementations, the panoramic image may be received as a spherical or cylindrical panoramic image, and wherein the method further comprises projecting, by the system, the spherical or cylindrical panoramic image on a 2D plane prior to employing the 3D self-2D convolutional neural network model to derive the 3D data.

In one or more implementations, the 3D self-2D neural network model may include a model trained based on weighting values applied to respective pixels of the projected panoramic image in relation to depth data from which the respective pixels are derived, wherein the weighting values vary based on an angular area of the respective pixels. For example, during training, the weighting values decrease as the angular area of the corresponding pixel decreases. Further, in some implementations, a downstream convolutional layer of the convolutional layers that follows the previous layer is configured to re-project a portion of the panoramic image processed by the previous layer in association with depth data from which the panoramic image was derived, thereby producing a re-projected version of the panoramic image for each downstream convolutional layer. In this regard, the downstream convolutional layer is further configured to employ input data from a previous layer by extracting the input data from a re-projected version of the panoramic image. For example, in one implementation, input data may be extracted from a re-projected version of a panoramic image based on a location in a portion of the panoramic image that corresponds to the re-projected version of the panoramic image based on a defined angle bin.

In another embodiment, a method for using panoramic image data to generate accurate depth predictions using 3D from 2D is provided and may include receiving, by a system operatively coupled to a processor, a request for depth data associated with a region of an environment depicted in a panoramic image. The method may also include deriving, by the system, depth data for the entire panoramic image based on the receiving using a neural network model configured to derive the depth data from the single 2D image. The method may also include extracting, by the system, a portion of the depth data corresponding to the environmental region, and providing, by the system, the portion of the depth data to an entity associated with the request.

Other embodiments of the disclosed subject matter provide techniques for optimizing 3D from 2D based depth prediction that use enhanced input data in addition to using a single 2D image as input to a 3D from 2D neural network model and/or using two or more images as input to a 3D from 2D neural network model. For example, in one embodiment, a method is provided that includes receiving, by a system operatively coupled to a processor, a 2D image, and determining, by the system, assistance data for the 2D image, wherein the assistance data includes orientation information about a capture orientation of the 2D image. The method may also include deriving, by the system, 3D information for the 2D image using one or more neural network models configured to infer the 3D information based on the 2D image and the assistance data. In some implementations, the orientation information may be determined based on internal measurement data associated with a 2D image generated by an IMD associated with the capture of the 2D image.

The assistance data may also include location information regarding a capture location of the 2D image, and wherein determining the assistance data includes identifying the location information in metadata associated with the 2D image. The assistance data may further include one or more image capture parameters associated with the 2D image capture, and wherein determining the assistance data includes extracting the one or more image capture parameters from metadata associated with the 2D image. For example, the one or more image capture parameters may include one or more camera settings of a camera used to capture the 2D image. In another example, the one or more image capture parameters are selected from the group consisting of: camera lens parameters, lighting parameters, and color parameters.

In some implementations, the 2D image includes a first 2D image, and wherein the method further includes receiving, by the system, one or more second 2D images related to the first 2D image, and determining, by the system, the assistance data based on the one or more second 2D images. For example, the assistance data may comprise a capture location of the 2D image, and wherein determining the assistance data comprises determining the capture location based on the one or more second 2D images. In another example, the first 2D image and the one or more second 2D images are captured in association with movement of the capture device to different positions relative to the environment, and wherein determining the assistance data comprises employing at least one of: a photogrammetric algorithm, a simultaneous localization and mapping (SLAM) algorithm, or a structure-by-motion algorithm. In another example, a first 2D image from the stereoscopic image pair and a second 2D image of the one or more second 2D images, wherein the assistance data comprises depth data of the first 2D image, and wherein determining the assistance data comprises determining the depth data based on the stereoscopic image pair using passive stereoscopic functionality.

The method may also include receiving, by the system, depth information for a 2D image captured by a 3D sensor associated with the capture of the 2D image, wherein deriving comprises deriving the 3D information using a neural network model of the one or more neural network models, the neural network model configured to infer the 3D information based on the 2D image and the depth information. Additionally, in some implementations, the assistance data includes one or more semantic tags for one or more objects depicted in the 2D image, and wherein determining the assistance data includes determining, by the system, the semantic tags using one or more machine learning algorithms.

In still other implementations, the 2D image comprises a first 2D image, and wherein the auxiliary data comprises one or more second 2D images related to the first 2D image based on the image data comprising different perspectives depicting the same object or environment as the first 2D image. For example, the first 2D image and the one or more second 2D images may comprise partially overlapping fields of view of the object or environment. According to these implementations, the assistance data may further include information regarding one or more relationships between the first 2D image, and wherein determining the assistance data includes determining the relationship information, which includes determining at least one of a relative capture position of the first 2D image and the one or more second 2D images, a relative capture orientation of the first 2D image, and a relative capture time of the first 2D image and the one or more second 2D images.

In another embodiment, a method is provided that includes receiving, by a system operatively coupled to a processor, a captured related 2D image of an object or environment, wherein the 2D image is associated based on providing different perspectives of the object or environment. The method may also include deriving, by the system, depth information for at least one of the related 2D images based on the related 2D images using the one or more neural network models and the related 2D images as inputs to the one or more neural network models. In some implementations, the method further includes determining, by the system, relationship information regarding one or more relationships between the related images, wherein deriving further includes deriving the depth information using the relationship information as an input to the one or more neural network models. For example, the relationship information may include relative capture positions of the related 2D images. In another example, the relationship information may include a relative capture orientation of the related 2D images. In another example, the relationship information includes relative capture times of the plurality of 2D images.

In other embodiments, a system includes a memory storing computer-executable components and a processor executing the computer-executable components stored in the memory. The computer-executable components may include: a receiving section that receives a 2D image; and a pre-processing component that changes one or more characteristics of the 2D image to convert the image into a pre-processed image according to a standard representation format. The computer-executable components may also include a depth derivation component that derives 3D information for the pre-processed 2D image using one or more neural network models configured to infer the 3D information based on the pre-processed 2D image.

In some implementations, the preprocessing component changes one or more characteristics based on one or more image capture parameters associated with the capture of the 2D image. The preprocessing component may also extract one or more image capture parameters from metadata associated with the 2D image. The one or more image capture parameters may include, for example, one or more camera settings of a camera used to capture the 2D image. For example, the one or more image capture parameters are selected from the group consisting of: camera lens parameters, lighting parameters, and color parameters. In some implementations, the one or more characteristics may include one or more visual characteristics of the 2D image, and the preprocessing component changes the one or more characteristics based on differences between the one or more characteristics and one or more defined image characteristics of the standard representation format.

Various additional embodiments are directed to example devices and/or systems that facilitate capturing a 2D image of an object or environment and deriving 3D/depth data from the image using one or more 3D self-2D techniques according to various aspects and embodiments described herein. Various arrangements of devices and/or systems are disclosed including one or more cameras configured to capture 2D images, a 3D data derivation component configured to derive 3D data for the images, and a 3D modeling component configured to generate a 3D model of an environment included in the images. These permutations may include: some embodiments in which all components are disposed on a single device, embodiments in which components are distributed between two devices, and embodiments in which components are distributed between three devices.

For example, in one embodiment, there is provided an apparatus comprising: a camera configured to capture a 2D image; a memory storing computer-executable components; and a processor that executes the computer-executable components stored in the memory. The computer-executable components may include a 3D data derivation component configured to employ one or more 3D self-2D neural network models to derive 3D data for the 2D image. In some implementations, the computer-executable components may also include a modeling component configured to align the 2D image based on the 3D data to generate a 3D model of an object or environment included in the 2D image. In other implementations, the computer-executable components may include a communication component configured to transmit the 2D image and the 3D data to an external device, wherein based on receiving the two-dimensional image and the three-dimensional data, the external device generates a 3D model of an object or environment included in the 2D image by aligning the 2D images with each other based on the 3D data. With these implementations, the communication component may also be configured to receive the 3D model from an external device, and the device may render the 3D model via a display of the device.

In some implementations of this embodiment, the 2D image may include one or more images characterized as a wide field of view image based on having a field of view that exceeds a minimum threshold. In another implementation, the computer-executable components may further include a stitching component configured to combine two or more of the two-dimensional images to generate a second image having a field of view greater than respective fields of view of the two or more first images, and wherein the three-dimensional data derivation component is configured to employ the one or more 3D self-2D neural network models to derive at least some of the three-dimensional data from the second image.

In some implementations of the embodiment, the device may further include, in addition to the camera, a 3D sensor configured to capture depth data of the partial 2D image, wherein the 3D data derivation component is further configured to use the depth data as input to one or more 3D self 2D neural network models to derive the 3D data of the 2D image. For example, the 2D image may include a panoramic color image having a first vertical field of view, wherein the 3D sensor includes a structured light sensor configured to capture depth data of a second vertical field of view within the first vertical field of view, and wherein the second vertical field of view includes a narrower field of view than the first vertical field of view.

In another embodiment, there is provided an apparatus comprising: a memory storing computer-executable components; and a processor that executes the computer-executable components stored in the memory. The computer-executable components include: a receiving section configured to receive a 2D image from a 2D image capturing apparatus; and a 3D data derivation component configured to employ one or more 3D self-2D neural network models to derive 3D data for the 2D image. In some implementations, the computer-executable components further include a modeling component configured to align the 2D image based on the 3D data to generate a 3D model of an object or environment included in the 2D image. The computer-executable components may also include a rendering component configured to facilitate rendering of the 3D model via a display of the device (e.g., directly, using a network (web) browser, using a web application, etc.). In some implementations, the computer-executable components may also include a navigation component configured to facilitate navigating the displayed 3D model. In one or more alternative implementations, the computer-executable components may include a communication component configured to transmit the 2D image and the 3D data to an external device, wherein based on receiving the two-dimensional image and the three-dimensional data, the external device generates a three-dimensional model of an object or environment included in the two-dimensional image by aligning the two-dimensional images with each other based on the three-dimensional data. Using these implementations, the communication component can receive the 3D model from the external device, and wherein the computer-executable components further include a rendering component configured to render the 3D model via a display of the device. The external device may further facilitate navigating the 3D model (e.g., using a web browser, etc.) in association with accessing and rendering the 3D model.

In yet another embodiment, an apparatus is provided that includes a memory storing computer-executable components; and a processor that executes the computer-executable components stored in the memory. The computer-executable components include a receiving component configured to receive 2D images of an object or environment captured from different perspectives of the object or environment, and to derive depth data for respective ones of the 2D images using one or more 3D self-2D neural network models. The computer-executable components also include a modeling component configured to align the 2D images with one another based on the depth data to generate a 3D model of the object or environment. In some implementations, the computer-executable components further include a communication component configured to send the 3D model to a rendering device via a network for display at a rendering display. With these implementations, the computer-executable components may also include a navigation component configured to facilitate navigation of the 3D model displayed at the rendering device. In one or more alternative implementations, the computer-executable components may include a rendering component configured to cause rendering of the 3D model via a display of the device. With this alternative embodiment, the computer-executable components may also include a navigation component configured to facilitate navigation of the 3D model displayed at the device.

In another embodiment, a method is provided, the method comprising: capturing, by a device comprising a processor, a 2D image of an object or environment, and transmitting, by the device, the 2D image to a server device, wherein upon receipt of the 2D image, the server device employs one or more 3D self-2D neural network models to derive 3D data for the 2D image, and generates a 3D reconstruction of the object or environment using the 2D image and the 3D data. The method further comprises the following steps: the method includes receiving, by the device, a 3D reconstruction from a server device, and rendering, by the device, the 3D reconstruction via a display of the device.

In some implementations, 2D images are captured from different perspectives of an object or environment in association with an image scan of the object or environment. With these implementations, the method may further include sending, by the device, a confirmation message confirming completion of the image scan. In addition, upon receipt of the confirmation message, the server device generates a final 3D reconstruction of the object or environment. For example, in some implementations, the final 3D reconstruction has a higher level of image quality relative to the initial 3D reconstruction. In another implementation, the final 3D reconstruction includes a navigable model of the environment, and wherein the initial 3D reconstruction is non-navigable. In another implementation, the final 3D reconstruction is generated using a more precise alignment process than the alignment process used to generate the initial 3D reconstruction.

In various additional embodiments, systems and devices are disclosed that facilitate improving AR applications using 3D self-2D processing techniques. For example, in one embodiment, a system is provided, the system comprising: a memory storing computer-executable components; and a processor that executes the computer-executable components stored in the memory. The computer-executable components may include a 3D data derivation component configured to employ one or more 3D self-2D neural network models to derive 3D data from one or more 2D images of the object or environment captured from a current perspective of the object or environment viewed on or through a display of the device. The one or more computer-executable components may also include a spatial alignment component configured to determine a location for integrating (virtual) graphical data objects on or within a representation of an object or environment on or viewed through the display based on the current perspective and the 3D data. For example, the representation of the object or environment may include a real-time view of the environment viewed through a transparent display of the device. In another implementation, the representation of the object or environment may include one or more 2D images of the captured object or environment and/or video frames of video. In various implementations, a device may include one or more cameras that capture one or more 2D images.

The computer-executable components may also include an integration component configured to integrate graphical data objects on or within the representation of the object or environment based on location. In some implementations, the computer-executable components may further include an occlusion mapping component configured to determine a relative position of the graphical data object with respect to another object included in the representation of the object or environment based on the current perspective and the 3D data. In this regard, based on determining that the relative position of the graphical data object is behind another object, the integration component may be configured to occlude at least a portion of the graphical data object located behind the other object in association with integrating the graphical data object on or within the representation of the object or environment. Also, based on determining that the relative position of the graphical data object is in front of another object, the integration component is configured to occlude, in association with integrating the graphical data object on or within the environmental representation, at least a portion of another object located behind the graphical data object.

In yet another embodiment, systems and devices are disclosed that facilitate real-time tracking of objects using 3D self-2D processing techniques. For example, there is provided an apparatus comprising: a memory storing computer-executable components; and a processor that executes the computer-executable components stored in the memory. The computer-executable components may include: a 3D data derivation component configured to employ one or more 3D self-2D neural network models to derive 3D data from 2D images of an object captured over a period of time; and an object tracking component configured to track a location of the object over a period of time based on the 3D data. For example, 2D image data includes successive frames of video data captured over a period of time. In some implementations, the object includes a moving object, and wherein the 2D image includes images captured from one or more stationary capture devices. In other implementations, the object comprises a fixed object, and wherein the 2D image data comprises an image of the object captured by the camera in association with movement of the camera over a period of time. For example, a camera may be attached to the vehicle, and wherein the object tracking component is configured to track a position of the object relative to the vehicle.

It should be noted that the terms "3D model", "3D object", "3D reconstruction", "3D image", "3D representation", "3D rendering", "3D construction", etc. are used interchangeably throughout, unless context ensures that there is a particular distinction between the terms. It should be understood that such terms may refer to data representing objects, spaces, scenes, etc., in three dimensions, which may or may not be displayed on an interface. In one aspect, a computing device, such as a Graphics Processing Unit (GPU), may generate executable/visual content in three dimensions based on the data. The term "3D data" refers to data used to generate a 3D model, data describing a perspective or viewpoint of a 3D model, captured data (e.g., sensory data, images, etc.), metadata associated with a 3D model, and the like. In various embodiments, the terms 3D data and depth data are used interchangeably throughout, unless context warrants a particular distinction between these terms.

The term image as used herein refers to a 2D image unless otherwise specified. In various embodiments, the term 2D image is used to clarify and/or merely emphasize the fact that: the image is 2D rather than 3D data derived therefrom and/or a 3D model generated based on the image and the derived 3D data. It should be noted that the terms "2D model", "2D image", and the like are used interchangeably throughout unless context warrants a particular distinction between the terms. It should be understood that such terms may refer to data representing objects, spaces, scenes, etc., in two dimensions, which may or may not be displayed on an interface. The terms "2D data," "2D image data," and the like, are used interchangeably throughout unless context ensures that there is a particular distinction between the terms, and may refer to data (e.g., metadata) describing a 2D image, captured data associated with a 2D image, a representation of a 2D image, and the like. In one aspect, a computing device, such as a Graphics Processing Unit (GPU), may generate executable/visual content in two dimensions based on the data. In another aspect, a 2D model may be generated based on captured image data, 3D image data, and the like. In embodiments, a 2D model may refer to a 3D model, a real world scene, a 3D object, or other 3D constructed 2D representation. As an example, the 2D model may include a 2D image, a set of 2D images, a panoramic 2D image, a set of panoramic 2D images, 2D data wrapped onto a geometric figure, or other various 2D representations of the 3D model. It should be noted that the 2D model may include a set of navigation controls.

Furthermore, terms such as "navigational position," "current position," "user position," and the like are used interchangeably throughout, unless context warrants a particular distinction between the terms. It should be understood that these terms may refer to data representing a position relative to a digital 3D model during user navigation, etc. For example, according to various embodiments, 3D models may be viewed and rendered in association with navigation of the 3D model, interaction with the 3D model, generation of the 3D model, and so forth, from various perspectives and/or fields of view of the virtual camera relative to the 3D model. In some embodiments, different views or perspectives of the model may be generated based on interactions with the 3D model in one or more modes (such as a walking mode, a playhouse/track mode, a floor plan mode, a feature mode, and the like). In one aspect, a user may provide input to the 3D modeling system, and the 3D modeling system may facilitate navigation of the 3D model. As used herein, navigation of a 3D model may include changing the perspective and/or field of view, as described in more detail below. For example, the perspective may rotate about a viewpoint (e.g., an axis or pivot point) or alternate between viewpoints, and the field of view may enhance an area of the model, change a size of an area of the model (e.g., "zoom in" or "zoom out," etc.), and so on.

Versions of the 3D model that are presented from different views or perspectives of the 3D model are referred to herein as representations or renderings of the 3D model. In various implementations, the representation of the 3D model may represent a volume of the 3D model, an area of the 3D model, or an object of the 3D model. The representation of the 3D model may include 2D image data, 3D image data, or a combination of 2D and 3D image data. For example, in some implementations, the representation or rendering of the 3D model may be a 2D image or panorama associated with the 3D model from a particular perspective of the virtual camera located at a particular navigation position and orientation relative to the 3D model. In other implementations, the representation or rendering of the 3D model may be the 3D model or a portion of the 3D model that is generated from the particular navigation position and orientation of the virtual camera relative to the 3D model and generated using the aligned set or subset of captured 3D data used to generate the 3D model. In still other implementations, the representation or rendering of the 3D model may include a combination of the 2D image and an aligned 3D data set associated with the 3D model.

Terms such as "user equipment," "user equipment device," "mobile device," "user device," "client device," "handset," or terms representing similar terms may refer to a device used by a subscriber or user to receive data, transmit data, control, voice, video, sound, 3D models, games, and so forth. The foregoing terms are used interchangeably herein and with reference to the associated drawings. In addition, the terms "user," "subscriber," "client," "consumer," "end user," and the like are used interchangeably throughout, unless context warrants a particular distinction between the terms. It should be understood that such terms may refer to a human entity, a human entity represented by a user account, a computing system, or an automated component supported by artificial intelligence (e.g., the ability to reason about complex mathematical formalisms), which may provide simulated vision, voice recognition, and so forth.

In various implementations, the components described herein may perform actions online or offline. Online/offline may refer to a state that identifies connectivity between one or more components. Typically, "online" indicates a connected state, while "offline" indicates a disconnected state. For example, in online mode, the model and markup may be streamed from a first device (e.g., a server device) to a second device (e.g., a client device), such as streaming raw model data or rendered models. In another example, in an offline mode, models and markup may be generated and rendered on one device (e.g., a client device) such that the device does not receive data or instructions from a second device (e.g., a server device). While the various components are shown as separate components, it should be noted that the various components may be comprised of one or more other components. Additionally, it should be noted that embodiments may include additional components that are not shown for the sake of brevity. In addition, various aspects described herein may be performed by one device or two or more devices in communication with each other.

The embodiments summarized above are now described in more detail with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It may be evident, however, that the embodiments may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the embodiments.

Referring now to the drawings, fig. 1 presents an exemplary system 100 that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data, in accordance with various aspects and embodiments described herein. Various aspects of the systems, apparatus, or processes explained in this disclosure may constitute machine-executable components embodied within a machine, such as in one or more computer-readable medium (or media) associated with one or more machines. Such components, when executed by one or more computers (e.g., computers, computing devices, virtual machines, etc.), can cause the machine to perform the operations.

In the illustrated embodiment, the system 100 includes a computing device 104 configured to receive and process 2D image data 102 using one or more computer-executable components. These computer-executable components may include a 3D self-2D processing module 106 configured to perform various functions associated with processing 2D image data 102 to derive 3D data from 2D image data 102 (e.g., derived 3D data 116). The computer-executable components may also include a 3D model generation component 118 configured to generate a reconstructed 3D model of an object or environment included in the 2D image data 102 based at least in part on the derived 3D data 116. The computer-executable components may also include a navigation component 126 that facilitates navigation of the immersive 3D model generated by the 3D model generation component. For example, as described in more detail below, in various embodiments, the 2D image data 102 may include several 2D images of a captured object or environment, such as several 2D images of a captured house interior. The 3D model generation component 118 may be configured to use the derived 3D data 116 corresponding to the relative 3D positions of the 2D images and/or features (e.g., pixels, superpixels, objects, etc.) included in the 2D images to generate an alignment between the 2D images and/or features included in the respective 2D images relative to a common 3D coordinate space. The 3D model generation component 118 can further employ the alignment between the 2D image data and/or the associated 3D data to generate a reconstructed representation or 3D model of the object or environment represented in the 2D image data. In some embodiments, the 3D model may include an immersive virtual reality VR environment, which may be navigated with the assistance of navigation component 126. In the illustrated embodiment, the reconstructed representation/3D model and associated alignment data generated by the 3D model generation component 118 are identified as a 3D model and alignment data 128. The system 100 can also include a suitable user device 130 that can receive and render a display 132 of the reconstructed/3D model generated by the 3D model generation component 118. For example, user devices 130 may include, but are not limited to: a desktop computer, a laptop computer, a mobile phone, a smartphone, a tablet Personal Computer (PC), a Personal Digital Assistant (PDA), a Heads Up Display (HUD), a Virtual Reality (VR) headset, an Augmented Reality (AR) headset or device, a standalone digital camera, or other type of wearable computing device.

The computing device 104 may include or be operatively coupled to at least one memory 104 and at least one processor 124. The at least one memory 122 may further store computer-executable instructions (e.g., the 3D model generation component 118, the 2D from 3D processing module 106, one or more components of the 2D from 3D processing module 106, and the navigation component 126) that, when executed by the at least one processor 124, cause performance of operations defined by the computer-executable instructions. In some embodiments, the memory 122 may also store data received and/or generated by the computing device, such as (but not limited to) the received 2D image data 102, the derived 3D data 116, and the 3D model and alignment data 128. In other embodiments, various data sources and data structures of system 100 (and other systems described herein) may be stored in another memory (e.g., at a remote device or system) accessible to computing device 104 (e.g., via one or more networks). The computing device 104 may also include a device bus 120 that communicatively couples various components of the computing device 104 and data sources/data structures. Examples of such processors 124 and memory 122, as well as other suitable computer or computing-based elements, may be found with reference to fig. 35, which may be used in connection with implementing one or more of the systems or components illustrated in fig. 1 or other figures disclosed herein and described in connection with fig. 1 or other figures disclosed herein.

In the illustrated embodiment, the 3D self 2D processing module 106 may include a receiving component 108, a 3D data derivation component 110, and a 3D self 2D model database 112. The receiving component 108 may be configured to receive the 2D image data 102 for processing by the 3D self-2D processing module 106 (and/or the 3D model generating component 118). The source of the 2D image data 102 may vary. For example, in some implementations, the receiving component 108 can receive the 2D image data 102 from one or more image capture devices (e.g., one or more cameras), one or more network accessible data sources (e.g., an archive of network accessible 2D image data), a user device (e.g., an image uploaded by a user from a personal computing device), and so forth. In some implementations, the receiving component 108 can receive the 2D image data in real-time as the 2D image data is captured (or receive the 2D image data substantially in real-time as the 2D image data is captured such that the 2D image data is received within a few seconds of capture) to facilitate real-time processing applications associated with deriving 3D data from 2D image data in real-time, including real-time generation and rendering of 3D models based on 2D image data, real-time object tracking, real-time relative position estimation, real-time AR applications, and the like. In some embodiments, the 2D image data 102 may include images captured by various camera types with various settings and image processing capabilities (e.g., various resolutions, fields of view, color spaces, etc.). For example, the 2D image data may include standard red, green, blue (RGB) images, black and white images, high dynamic range images, and the like. In some implementations, the 2D image data 102 may include images captured using a camera included with another device (such as a mobile phone, a smartphone, a tablet PC, a standalone digital camera, etc.). In various embodiments, the 2D image data 102 may include multiple images that provide different perspectives of the same object or environment. For these embodiments, the image data from the respective images may be combined and aligned relative to each other and the 3D coordinate space by the model generation component 118 to generate a 3D model of the object or environment.

The 3D data derivation component 110 can be configured to process the received 2D image data 102 using one or more 3D self-2D machine learning models to determine (or derive, infer, predict, etc.) derived 3D data 116 for the received 2D image data 102. For example, the 3D data derivation component 110 can be configured to employ one or more 3D self-2D machine learning models configured to determine depth information for one or more visual features (e.g., pixels, superpixels, objects, planes, etc.) included in a single 2D image. In the illustrated embodiment, these one or more machine learning models may be provided in a 3D self-2D model database 112 accessible to the 3D data derivation component 110.

In various embodiments, the 3D data derivation component 110 can employ one or more existing, proprietary, and/or non-proprietary 3D self-2D machine learning models that have been developed in the art to generate the derived 3D data 116 for the received 2D image data 102. These existing 3D self-2D models are characterized in the system 100 and are referred to herein as "standard models". For example, in the illustrated embodiment, the 3D self 2D model database 112 may include one or more standard models 114 that may be selected and applied by the 3D data derivation component 110 to the received 2D image data 102 to generate derived 3D data 116 from the 2D image data 102. These standard models 114 may include various types of 3D self-2D predictive models configured to receive a single 2D image as input and process the 2D image using one or more machine learning techniques to infer or predict 3D/depth data for the 2D image. Machine learning techniques may include, for example, supervised learning techniques, unsupervised learning techniques, semi-supervised learning techniques, decision tree learning techniques, association rule learning techniques, artificial neural network techniques, inductive logic programming techniques, support vector machine techniques, clustering techniques, bayesian network techniques, reinforcement learning techniques, representation learning techniques, and the like.

For example, the standard models 114 may include one or more standard 3D self-2D models that perform depth estimation using Markov Random Field (MRF) techniques, conditional MRF techniques, and nonparametric methods. These standard 3D self-2D models make strong geometric assumptions that scene structures are composed of horizontal planes, vertical walls and superpixels, and MRF is employed to estimate depth by exploiting handmade features. The standard model 114 may also include one or more models that perform 3D self-2D depth estimation using non-parametric algorithms. Non-parametric algorithms rely on the assumption that similarities between regions in RGB images imply similar depth cues to learn depth from a single RGB image. After clustering the training data set based on global features, the models first search the feature space for candidate RGB-D of the input RGB image, and then deform and fuse the candidate pairs to obtain the final depth.

In various exemplary embodiments, the standard model 114 may employ one or more deep learning techniques, including deep learning techniques using one or more neural networks and/or deep convolutional neural networks, to derive 3D data from a single 2D image. During the past decade, great efforts have been made by the research community to improve the performance of monocular deep learning, and significant accuracy has been achieved due to the rapid development and advancement of deep neural networks. Deep learning is a type of machine learning AlgorithmUse ofNon-linear processingCascaded multiple layers of cells for feature extractionAnd conversion. In some implementations, each successive layer uses the output of the previous layer as an input. The deep learning model may include using supervised learning (e.g., classification) and/or negationMonitor forDu learning (e.g., mode score)Analysis)One or more layers learned by the method. In some implementations, a deep learning technique for deriving 3D data from 2D images can be learned using multiple levels of representation corresponding to different levels of abstraction, where the different levels form a hierarchy of concepts.

There are many models for 3D self-2D depth prediction based on deep convolutional neural networks. One approach is a fully convolved residual network using directly predicted depth values as the regression output. Other models use multi-scale neural networks to separate the overall scale prediction from the prediction of fine details. Some models refine the results by: the fully-connected layers are merged, Conditional Random Field (CRF) elements are added to the network, or additional outputs (such as normal vectors) are predicted and combined with the initial depth prediction to generate a refined depth prediction.

In various embodiments, the 3D model generation component 118 may use the derived 3D data 116 for respective images received by the computing device 104 to generate a reconstructed 3D model of the object or environment included in the images. The 3D models described herein may include data representing positions, geometries, curved surfaces, and the like. For example, a 3D model may include a set of points represented by 3D coordinates (such as points in 3D euclidean space). The sets of points may be associated with (e.g., connected to) each other by a geometric entity. For example, a set of lattice connectable points comprising a series of triangles, lines, curved surfaces (e.g., non-uniform rational basis splines (NURBS)), quadrilaterals, n-grams, or other geometric shapes. For example, a 3D model of a building interior environment may include mesh data (e.g., a triangular mesh, a quadrilateral mesh, a parametric mesh, etc.), one or more texture-mapped meshes (e.g., one or more texture-mapped polygonal meshes, etc.), a point cloud, a set of point clouds, bins, and/or other data constructed using one or more 3D sensors. In one example, the captured 3D data may be configured in a triangular mesh format, a quadrilateral mesh format, a bin format, a parameterized entity format, a geometric primitive format, and/or another type of format. For example, each vertex of a polygon in a texture-mapped mesh may include the UV coordinates of a point in a given texture (e.g., a 2D texture), where U and V are the axes of the given texture. In a non-limiting example of a triangular mesh, each vertex of a triangle may include the UV coordinates of a point in a given texture. A triangle formed by three points of the triangle (e.g., a set of three UV coordinates) in the texture may be mapped onto the mesh triangle for rendering purposes.

Portions of the 3D model geometry data (e.g., meshes) may include image data describing textures, colors, intensities, and so forth. For example, in addition to including texture coordinates associated with the geometric data points (e.g., texture coordinates indicating how the texture data is applied to the geometric data), the geometric data may also include the geometric data points. In various embodiments, the received 2D image data 102 (or portions thereof) may be associated with portions of a mesh to associate visual data (e.g., texture data, color data, etc.) from the 2D image data 102 with the mesh. In this regard, the 3D model generation component 118 may generate a 3D model based on the 2D image and the 3D data respectively associated with the 2D image. In one aspect, data for generating a 3D model may be collected by scanning (e.g., with sensors) of a real-world scene, space (e.g., house, office space, outdoor space, etc.), object (e.g., furniture, ornaments, merchandise, etc.), and so forth. The data may also be generated based on a computer-implemented 3D modeling system.

In some embodiments, the 3D model generation component 118 may convert a single 2D image of the object or environment into a 3D model of the object or environment based on the derived depth data 116 for the single image. According to these embodiments, the 3D model generation component 118 may use the depth information for the respective pixels, superpixels, features, etc. derived for the 2D image to generate a 3D point cloud, 3D mesh, etc. corresponding to the respective pixels in 3D. The 3D model generation component 118 can further register visual data of respective pixels, superpixels, features, etc. (e.g., color, texture, brightness, etc.) with their corresponding geometric points in 3D (e.g., a color point cloud, a color mesh, etc.). In some implementations, the 3D model generation component 118 can further manipulate the 3D model to cause the 3D model to be rotated in 3D relative to one or more axes such that the 3D point cloud or mesh can be viewed from a different perspective as an alternative to the original capture perspective.

In other embodiments, where the 2D image data 102 includes a plurality of different images of the environment captured from different capture positions and/or orientations relative to the environment, the 3D model generation component 118 may perform an alignment process that involves aligning features in the 2D images and/or 2D images with each other and with a common 3D coordinate space based at least in part on the derived 3D data 116 of the respective images to generate an alignment between the respective features in the image data and/or image data. For example, the alignment data may also include information mapping respective pixels, superpixels, objects, features, etc. represented in the image data with defined 3D points, geometric data, triangles, regions, and/or volumes relative to the 3D space, for example.

For these embodiments, the quality of the alignment will depend in part on the amount, type, and accuracy of the derived 3D data 116 determined for the respective 2D images, which may vary according to the machine learning technique (e.g., the 3D self 2D model or models used) used by the 3D data derivation component 110 to generate the derived 3D data 116. In this regard, the derived 3D data 116 may include 3D location information for each (or in some implementations one or more) received 2D image (of the 2D image data 102). Depending on the machine learning techniques used to determine the derived 3D data 116, the derived 3D data may include depth information for each pixel of a single 2D image, depth information for a subset or group of pixels (e.g., superpixels), depth information for only one or more portions of the 2D image, and so forth. In some implementations, the 2D images can also be associated with additional known or derived spatial information that can be used to facilitate aligning the 2D image data with each other in 3D coordinate space, including but not limited to the relative capture position and relative capture orientation of the respective 2D images with respect to the 3D coordinate space.

In one or more embodiments, the alignment process may involve determining positional information (e.g., relative to a 3D coordinate space) and visual feature information of respective points in the received 2D image relative to each other in a common 3D coordinate space. In this regard, the 2D images, the derived 3D data respectively associated with the 2D images, the visual feature data mapped to the derived 3D data geometry, and other sensor data and assistance data (if available) (e.g., assistance data described with reference to fig. 30) may then be used as inputs to an algorithm that determines potential alignment between the different 2D images via coordinate transformations. For example, in some implementations, the 3D location information for the respective pixels or features derived for a single 2D image may correspond to a point cloud comprising a set of points in 3D space. The alignment process may involve iteratively aligning different point clouds from adjacent and overlapping images captured from different positions and orientations relative to the object or environment in order to generate a global alignment between the respective point clouds using correspondences in the derived position information of the respective points. The alignment data may also be generated using visual feature information (including correspondences in color data, texture data, brightness data, etc.) of the respective points or pixels included in the point cloud (as well as other sensor data, if any). The model generation component 118 can further evaluate the quality of the potential alignment and can align the 2D images together once a sufficiently high relative or absolute quality alignment is achieved. By repeated alignment of new 2D images (and potential improvements to the alignment of existing data sets), global alignment of all or most of the input 2D images into a single coordinate system can be achieved.

The 3D model generation component 118 can further employ the alignment between the 2D image data and/or corresponding features in the image data (e.g., the alignment data for the 3D model and the alignment data 128) to generate one or more reconstructed 3D models of the object or environment included in the captured 2D image data (e.g., the 3D model data for the 3D model and the alignment data 128). For example, the 3D model generation component 118 can also employ the aligned 2D image data and/or associated sets of 3D data to generate various representations of the 3D model of the environment or object from different perspectives or viewpoints of virtual camera positions external to or internal to the 3D model. In one aspect, the representations may include one or more of a captured 2D image and/or image data from one or more 2D images.

The format and appearance of the 3D model may vary. In some embodiments, the 3D model may include a photo-level 3D representation of the object or environment. The 3D model generation component 118 can further remove shot objects (e.g., walls, furniture, fixtures, etc.) from the 3D model, integrate new 2D and 3D graphical objects on or within the 3D model in a spatially aligned position relative to the 3D model, change the appearance (e.g., color, texture, etc.) of visual features of the 3D model, and so forth. The 3D model generation component 118 may also generate reconstructed views of the 3D model from different perspectives of the 3D model, generate 2D versions/representations of the 3D model, and so on. For example, the 3D model generation component 118 can generate a 3D model or representation of a 3D model of an environment corresponding to a floor plan model of the environment, a toy house model of the environment (e.g., in implementations where the environment includes an interior space of a building space such as a house), and so forth.

In various embodiments, the floor plan model may be a simplified representation of surfaces (e.g., walls, floors, ceilings, etc.), entrances (e.g., door openings), and/or window openings associated with the interior environment. The floorplan model may include the location of the boundary edge of each given surface, entrance (e.g., door opening), and/or window opening. The floorplan model may also include one or more objects. Alternatively, the floor plan may be generated without the objects (e.g., the objects may be omitted from the floor plan). In some implementations, the floor plan model may include one or more dimensions associated with a surface (e.g., a wall, a floor, a ceiling, etc.), an entrance (e.g., a door opening), and/or a window opening. In one aspect, dimensions below a particular size may be omitted from the floor plan. The planes included in the floor plan may extend a particular distance (e.g., intersect a pose).

In various embodiments, the floorplan model generated by the 3D model generation component 118 may be a schematic floorplan of a building structure (e.g., a house), a schematic floorplan of an interior space of a building structure (e.g., a house), or the like. For example, the 3D model generation component 118 can generate a floor plan model of the architectural structure by employing the identified walls associated with the derived 3D data 116 derived from the captured 2D image of the architectural structure. In some implementations, the 3D model generation component 118 can employ common architectural symbols to illustrate architectural features of a building structure (e.g., the length of doors, windows, fireplaces, walls, other features of a building, etc.). In another example, the floor plan model may include a series of lines in 3D space that represent intersections of walls and/or floors, contours of porches and/or windows, edges of steps, contours of other objects of interest (e.g., mirrors, paintings, fireplaces, etc.). The floor plan model may also include measurements of walls and/or other common annotations that appear in the floor plan of the building.

The floorplan model generated by the 3D model generation component 118 may be a 3D floorplan model or a 2D floorplan model. The 3D floorplan model may contain the edges of each floor, wall, and ceiling as lines. The lines of the floor, walls, and ceiling may be sized (e.g., annotated) with an associated size. In one or more embodiments, the 3D floorplan model may be navigated in 3D via a viewer on a remote device. In one aspect, a sub-portion (e.g., a room) of a 3D floorplan model may be associated with text data (e.g., a name). Measurement data (e.g., square feet, etc.) associated with the surface may also be determined based on the derived 3D data corresponding to and associated with the respective surface. These measurements may be displayed in association with viewing and/or navigation of the 3D floorplan model. The calculation of the area (e.g., square feet) may be determined for any identified surface or portion of the 3D model having known boundaries, for example, by summing the areas of polygons comprising the identified surfaces or portions of the 3D model. The display of individual items (e.g., sizes) and/or categories of items may be toggled in the floor plan via a viewer on the remote device (e.g., via a user interface on the remote client device). The 2D floorplan model may include a surface (e.g., a wall, floor, ceiling, etc.), an entrance (e.g., a door opening), and/or a window opening associated with the derived 3D data 116 used to generate the 3D model and projected onto the flat 2D surface. In yet another aspect, the floor plan may be viewed at a plurality of different heights relative to a vertical surface (e.g., a wall) via a viewer on the remote device.

In various embodiments, the 3D model and various representations of the 3D model (e.g., different views of the 3D model, floor plan models in 2D or 3D, etc.) that can be generated by the 3D model generation component 118, and/or associated aligned 2D and 3D data, may be rendered at the user device 130 via the display 132. For example, in some implementations, the 3D model generation component 118 and/or the user device 130 can generate a Graphical User Interface (GUI) that includes a 3D reconstructed model (e.g., a depth map, a 3D mesh, a 3D point cloud, a 3D color point cloud, etc.) generated by the 3D model generation component 118.

In some embodiments, the 3D model generation component 118 may be configured to generate such reconstructed 3D models in real-time or substantially real-time as the 2D image data is received and the derived 3D data 116 of the 2D image data is generated. Thus, during the entire alignment process, real-time or substantially real-time feedback regarding the progress of the 3D model is provided to a user viewing the rendered 3D model as new 2D image data 102 is received and aligned. In this regard, in some implementations in which a user is causing or controlling the capture of 2D image data 102 for creating a 3D model, the system 100 may cause real-time or live feedback to be provided to the user during the capture process regarding the progress of the 3D model generated based on the captured and aligned 2D image data (and derived 3D data). For example, in some embodiments, using one or more cameras (or one or more camera lenses) or a separate camera provided on user device 130, a user may control the capture of 2D images of the environment at various positions and/or orientations relative to the environment. The capture process that involves capturing 2D image data of the environment at various nearby locations in the environment to generate a 3D model of the environment is referred to herein as "scanning. According to this example, as new images are captured, they may be provided to the computing device 104, and 3D data may be derived for the respective images and used to align the images to generate a 3D model of the environment. The 3D model may further be rendered at the user device 130 and updated in real-time based on new image data as it is received during the capture of the 2D image data. For these embodiments, the system 100 may thus provide visual feedback during the capture process on the 2D image data that has been captured and aligned, based on the derived 3D data of the 2D image data, as well as the quality of the alignment and the resulting 3D model. In this regard, based on viewing the alignment image data, the user may monitor what has been captured and aligned so far, look for potential alignment errors, evaluate scan quality, plan areas to scan next, determine where and how to position one or more cameras for capturing the 2D image data 102, and otherwise complete the scan. Additional details regarding the graphical user interface that facilitates the viewing and auxiliary capture process are described in U.S. patent No. 9,324,190, filed on 23/2/2013 and entitled "CAPTURING AND ALIGNING MULTIPLE 3-multimedia services," the entire contents of which are incorporated herein by reference.

Fig. 2-4 present exemplary illustrations of reconstructed 3D models of a building environment that may be generated by the 3D model generation component 118 based on 3D data derived from 2D image data, in accordance with various aspects and embodiments described herein. In the illustrated embodiment, the 3D model is rendered at a user device (e.g., user device 130) that is a tablet PC. It should be understood that the type of user device that can display the 3D model may vary. In some implementations, one or more cameras (or one or more camera lenses) of the tablet PC are used to capture 2D image data of the corresponding environment represented in a 3D model to generate the 3D model, and depth data is derived from the 2D image data (e.g., via 3D derivation component 110). In another implementation, the 2D images used to generate the respective 3D models may have been captured by one or more cameras (or one or more camera lenses) of another device. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

FIG. 2 provides a visualization of an exemplary 3D model 200 of a living room associated with the generation of the 3D model by 3D model generation component 118. In this regard, the depicted 3D model 200 is currently under construction and includes missing image data. In various embodiments, when the 3D model generation component 118 is building the 3D model 200, the model may be presented to the user at the client device. In this regard, the 3D model 200 may be dynamically updated as new images of the living room are captured, received, and aligned with previously aligned image data based on depth data derived for the respective images (e.g., by the 3D depth derivation component 110).

Fig. 3 provides a visualization of an exemplary 3D floorplan model 300 that may be generated by the 3D model generation component 118 based on captured image data of an environment. For example, in one implementation, when a user walks from one room to another and takes pictures of a house from different perspectives within the room (e.g., while standing on the floor), 2D image data of the house portion depicted in the 3D floorplan model is captured by a camera held and operated by the user. Based on the captured image data, the 3D model generation component 118 may use depth data derived from the respective images to generate a 3D floorplan model 300 that provides a completely new (not included in the 2D image data) reconstructed top-down perspective of the environment.

Fig. 4 provides a visualization of an exemplary 3D toy roofview representation 400 of a model that may be generated by the 3D model generation component 118 based on captured image data of an environment. For example, in the same manner as described above with respect to fig. 3, in one implementation, 2D image data of a portion of a house depicted in a 3D toy house view may have been captured by a camera held and operated by a user as the user walks from one room to another and takes pictures of the house from different perspectives within the room (e.g., while standing on the floor). Based on the captured image data, the 3D model generation component 118 may generate a 3D model (e.g., a mesh) of the environment using depth data derived from the respective images by aligning the respective images with one another relative to a common 3D coordinate space using the depth data derived separately for the images. According to this implementation, the 3D model may be viewed from various perspectives, including the playhouse view shown. In this regard, based on the input indicating the particular playhouse perspective from which the 3D model is desired, the 3D model generation component 118 may generate the 3D playhouse view representation 400 based on the 3D model and the associated alignment image data.

Referring again to fig. 1, in some embodiments, the computing device 104 can also include a navigation component 126. Navigation component 126 can facilitate viewing, navigating, and interacting with 3D models. Navigation component 126 can facilitate navigation of the 3D model after the 3D model has been generated and/or in association with generation of the 3D model by 3D model generation component 118. For example, in some implementations, the 3D model generated by the 3D model generation component 118, as well as the 2D image used to create the 3D model and the 3D information associated with the 3D model, can be stored in the memory 122 (or another accessible memory device) and accessed by the user device (e.g., via a network using a browser, via a thin client application, etc.). In association with accessing the 3D model, user device 130 may display (e.g., via display 132) an initial representation of the 3D model from a predefined initial perspective of the virtual camera relative to the 3D model. User device 130 may further receive user input (e.g., via a mouse, touch screen, keyboard, gesture detection, gaze detection, etc.) indicating or requesting movement of the virtual camera through or around the 3D model to view different portions of the 3D model and/or to view different portions of the 3D spatial model from different perspectives and navigation modes (e.g., walking mode, playhouse mode, functional view mode, and floorplan mode). Navigation component 126 can facilitate navigating the 3D model by: receiving and interpreting user gesture input, and selecting or generating a representation of the 3D model from a new perspective of the virtual camera relative to the 3D spatial model determined based on the user input. The representation may comprise a 2D image associated with the 3D model, and a novel view of the 3D model derived from a combination of the 2D image data and the 3D mesh data. The 3D model generation component 118 can also generate and provide a corresponding representation of the 3D model for rendering at the user device 130 via the display 132.

The navigation component 126 can provide various navigation tools that allow a user to provide input that facilitates viewing and interacting with different portions or perspectives of the 3D model. These navigation tools may include, but are not limited to: selecting a location to view (e.g., which may include a point, a region, an object, a room, a surface, etc.) on a representation of the 3D model, selecting a location for positioning a virtual camera (e.g., which includes a waypoint) on a representation of the 3D model, selecting an orientation of the virtual camera, selecting a field of view of the virtual camera, selecting a marker icon, moving the location of the virtual camera (forward, backward, left, right, up, or down), moving the orientation of the virtual camera (e.g., pan up, pan down, pan left, pan right), and selecting a different viewing mode/context (as described below). Various types of navigation tools described above allow a user to provide input indicating how to move the virtual camera relative to the 3D model in order to view the 3D model from a desired perspective. The navigation component 126 can also interpret received navigation input indicating a desired perspective for viewing the 3D model, thereby facilitating determination of a representation of the 3D model for rendering based on the navigation input.

In various implementations, in association with generating a 3D model of an environment, the 3D model generation component 118 can determine the location of objects, obstacles, flat planes, and the like. For example, based on the aligned 3D data derived for the respective images of the captured environment, the 3D model generation component 116 can identify obstacles, walls, objects (e.g., countertops, furniture, etc.), or other 3D features included in the aligned 3D data. In some implementations, the 3D data derivation component 110 can identify or partially identify features, objects, etc. included in the 2D image and associate information with the derived 3D data for the respective features, objects, etc. to identify them and/or define the boundaries of the objects or features. In one aspect, objects may be defined as entity objects such that they cannot be traversed when rendered (e.g., during navigation, transitioning between modes, etc.). Defining objects as entities may facilitate various aspects of model navigation. For example, a user may browse a 3D model of an interior living space. Living spaces may include walls, furniture, and other objects. As the user navigates through the model, the navigation component 126 may prevent the user (e.g., with respect to a particular representation that may be provided to the user) from traversing walls or other objects, and may also limit movement according to one or more configurable constraints (e.g., maintaining a viewpoint at a specified height above the model surface or defining the floor). In one aspect, constraints may be based at least in part on a mode (e.g., a walking mode) or a type of model. It should be noted that in other embodiments, an object may be defined as not being a physical object such that the object may be traversed (e.g., during navigation, transitioning between modes, etc.).

In one or more implementations, the navigation component 126 may provide different viewing modes or viewing contexts, including but not limited to a walking mode, a playhouse/track mode, a floor plan mode, and a feature view. The walking mode may refer to a mode for navigating and viewing the 3D model from a viewpoint within the 3D model. The viewpoint may be based on camera position, points within the 3D model, camera orientation, and the like. In one aspect, the walking mode may provide a view of the 3D model that simulates a user walking or otherwise traveling through the 3D model (e.g., a real-world scene). The user is free to rotate and move in order to view the scene from different angles, altitudes or perspectives. For example, when the virtual user walks around the space of the 3D model (e.g., at a defined distance relative to a floor surface of the 3D model), the walking mode may provide a perspective of the 3D model from a virtual camera that corresponds to the virtual user's eyes. In one aspect, during walking mode, unless squatting or in the air (e.g., jumping, falling off the edge, etc.), the user may be restricted from having a camera viewpoint at a particular height above the model surface. In one aspect, a collision check or navigation grid may be applied such that a user is restricted from passing through an object (e.g., furniture, wall, etc.). The walking mode may also include moving between waypoints associated with known locations of captured 2D images associated with the 3D model. For example, in association with navigating the 3D model in the walking mode, the user may click or select a point or region in the 3D model to view, and the navigation component 126 may determine a waypoint associated with the capture location of the 2D image associated with the point or region that provides the best view of the point or region.

The playhouse/track mode represents a mode in which a user perceives the model such that the user is outside or above the model, and may freely rotate the model about a center point and move the center point about the model (e.g., relative to the playhouse view representation 400). For example, the playhouse/track mode may provide a perspective of the 3D model in which the virtual camera is configured to view the interior environment from a position removed from the interior environment in a manner similar to viewing the playhouse at various elevations relative to the model floor (e.g., in which one or more walls are removed). In the playhouse/track mode, there may be multiple types of movements. For example, the viewpoint may tilt up or down, rotate left or right about a vertical axis, zoom in or out, or move horizontally. Pitch, rotation about a vertical axis, and zoom motion may be relative to a central point, such as defined by (X, Y, Z) coordinates. The vertical axis of rotation may pass through the center point. In the case of pitch and rotation about a vertical axis, these motions can be maintained at a constant distance from a central point. Thus, the pitch of the viewpoint and the motion of rotation about the vertical axis can be considered as vertical and horizontal travel, respectively, on the surface of a sphere centered on the center point. Scaling can be considered as propagation along a ray defined as passing through the viewpoint to the central point. With or without backface culling or other ceiling removal techniques, a point on the 3D model rendered in the center of the display may be used as a center point. Alternatively, the center point may be defined by a point located on a horizontal plane at the center of the display. The horizontal plane may not be visible and its height may be defined by the overall height of the floor of the 3D model. Alternatively, the local floor height may be determined and the intersection of a ray projected from the camera to the center of the display and the surface of the local floor height may be used to determine the center point.

The floorplan mode presents a view of the 3D model that is orthogonal or substantially orthogonal to the floor of the 3D model (e.g., looking down the model from directly above with respect to the 3D floorplan model 300). The floor plan pattern may represent a pattern in which the user perceives the model such that the user is outside or above the model. For example, a user may view all or a portion of the 3D model from an overhead pinch point. The 3D model may be moved or rotated about an axis. As an example, the floorplan mode may correspond to a top-down view, where the model is rendered such that the user looks directly down at the model or at a fixed angle down at the model (e.g., about 90 degrees above the floor or bottom plane of the model). In some implementations, the representation of the 3D model generated in the floorplan mode may appear to be 2D or substantially 2D. The set of motion or navigation controls and mappings in the floor plan mode may be a subset of all available controls for those controls or other models of the playhouse/track mode. For example, the controls for the floor plan mode may be the same as those described in the context of the track mode, except that the pitch is down at a fixed degree. Rotation about the center point along the vertical axis is still possible because the center point is zoomed in and out and moved towards and away from the point. However, due to the fixed pitch, the model can only be viewed directly from above.

The feature view may provide a perspective of the 3D model (e.g., a close-up view of a particular item or object of the 3D model) from a field of view that is narrower in context than the playhouse/track view. In particular, the feature view allows the user to navigate within and around the details of the scene. For example, with the feature view, a user may view different perspectives of a single object included in the internal environment represented by the 3D model. In various embodiments, selection of a marker icon included in the 3D model or a representation of the 3D model may result in generation of a feature view of a point, region, or object associated with the marker icon (as described in more detail below).

The navigation component 126 may provide a mechanism for navigating within and between these different modes or perspectives of the 3D model based on discrete user gestures in virtual 3D space or 2D coordinates relative to the screen. In some implementations, the navigation component 126 can provide navigation tools that allow a user to move the virtual camera relative to the 3D model using the various viewing modes described herein. For example, the navigation component 408 may provide and implement navigation controls that allow a user to change the position and orientation of the virtual camera relative to the 3D model and to change the field of view of the virtual camera. In some implementations, based on received user navigation input (including 2D images and hybrid 2D/3D representations of the 3D model) relative to the 3D model or visualization of the 3D model, the navigation component 126 can determine a desired position, orientation, and/or field of view of the virtual camera relative to the 3D model.

Referring back to fig. 1, in accordance with one or more embodiments, computing device 104 can correspond to a server device that facilitates various services associated with deriving 3D data from 2D images, including, for example, 3D model generation based on 2D images and navigation of 3D models. In some implementations of these embodiments, the computing device 104 and the user device 130 may be configured to operate in a client/server relationship, where the computing device 104 provides the user device 130 with access to 3D modeling and navigation services via a network accessible platform (e.g., a website, thin client application, etc.) using a browser or the like. However, the system 100 is not limited to this architectural configuration. For example, in some embodiments, one or more features, functions, and associated components of computing device 104 may be provided at user device 130, and vice versa. In another embodiment, one or more features and functions of the computing device 104 may be provided at a capture device (not shown) for capturing 2D image data. For example, in some implementations, the 3D from 2D processing module 106 or at least some of the components of the 3D from 2D processing module 106 may be provided at the capture device. According to this example, the capture device may be configured to derive depth data from the captured images (e.g., derived 3D data 116), and provide the images and associated depth data to computing device 104 for further processing by 3D model generation component 118 and optional navigation component. In yet another exemplary embodiment, the one or more cameras (or one or more camera lenses) used to capture 2D image data, the 3D self 2D processing module, the 3D model generation component 118, the navigation component 126, and the display 132 displaying the 3D model and a representation of the 3D model may all be disposed on the same device. Various architectural configurations of different systems and devices that can provide one or more features and functions of system 100 (and additional systems described herein) are described below with reference to fig. 14-25.

In this regard, the various components and devices of system 100 and the additional systems described herein may be connected directly or via one or more networks. Such networks may include wired and wireless networks, including but not limited to cellular networks, wide area networks (WANs, e.g., the internet), Local Area Networks (LANs), or Personal Area Networks (PANs). For example, computing device 104 and user device 130 may communicate with each other using virtually any desired wired or wireless technology, including, for example, cellular, WAN, Wi-Fi, Wi-Max, WLAN, Bluethooth^TMNear field/approach communications, etc. In one aspect, one or more components of system 100 and the additional systems described herein are configured to interact via different networks.

Fig. 5 presents another exemplary system 500 that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data, in accordance with various aspects and embodiments described herein. The system 500 includes the same or similar features as the system 100, with panoramic image data (e.g., panoramic image data 502) added as input. The system 500 also includes an upgraded 3D self 2D processing module 504 that is different from the 3D self 2D processing module 106 with respect to adding the panorama component 506, the model selection component 512, and one or more 2D self 3D panorama models 514 (hereinafter referred to as panorama models 514) to the 2D self 3D model database 112. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

The system 500 is specifically configured to receive and process 2D image data (referred to herein as panoramic image data and identified in the system 100 as panoramic image data 502) having a relatively wide field of view. The term panoramic image or panoramic image data is used herein to refer to a 2D image of an environment having a relatively wide field of view compared to a standard 2D image typically having a relatively narrow field of view between about 50 ° to 75 °. In contrast, the field of view of a panoramic image may span up to 360 ° in the horizontal direction (e.g., a cylindrical panoramic image) or both the horizontal and vertical directions (e.g., a spherical panoramic image). In this regard, the term panoramic image as used herein may, in some instances, refer to an image having a field of view equal or substantially equal to 360 ° in the horizontal and/or vertical directions. In other contexts, the term panoramic image as used herein may refer to an image having a field of view less than 360 ° but greater than a minimum threshold, such as 120 °, 150 °, 180 ° (e.g., provided by a fisheye lens), or 250 °, for example.

Using panoramic images as input to one or more 2D self-3D models to derive 3D data therefrom produces significantly better results than using standard 2D images as input (e.g., with a field of view less than 75 °). According to these embodiments, system 500 may include one or more panoramic 3D self 2D models that have been specially trained to derive 3D data from panoramic images, referred to herein and depicted in system 500 as panoramic model 514. The 3D data derivation component 110 can also include a model selection component 512 for selecting one or more appropriate models included in the 3D self-2D model database 112 to use in order to derive 3D data from the received 2D images based on one or more parameters associated with the input data, including whether the input data includes a 2D image having a field of view that exceeds a defined threshold in order to classify it as a panoramic image (e.g., 120 °, 150 °, 180 °, 250 °, 350 °, 359 °, etc.). In this regard, based on receipt of the panoramic image data 502 (e.g., images having a field of view greater than a minimum threshold) and/or generation of a panoramic image by the stitching component 508 (as described below), the model selection component 512 may be configured to select one or more panoramic models 514 for application by the 3D data derivation component 110 to determine the derived 3D data 116 of the panoramic image data 502.

One or more panoramic models 514 may employ a neural network model that has been trained on panoramic images using 3D ground truth data associated therewith. For example, in various implementations, one or more panoramic models 514 may be generated based on 2D panoramic image data (referred to herein as 2D/3D panoramic data) having associated 3D data, the 3D data captured by a 2D/3D capture device in association with the capture of the 2D panoramic image data. The 2D/3D panoramic capture device may incorporate one or more cameras (or one or more camera lenses) providing up to a 360 ° field of view and one or more depth sensors providing up to a 360 ° field of view, thereby capturing the entire panoramic image and simultaneously capturing and incorporating panoramic depth data associated therewith into a 2D/3D panoramic image. The depth sensor may include one or more 3D capture devices that capture depth information using at least some hardware. For example, the depth sensors may include, but are not limited to, LiDAR sensors/devices, laser rangefinder sensors/devices, time-of-flight sensors/devices, structured light sensors/devices, light field camera sensors/devices, active stereo depth-deriving sensors/devices, and the like. In other embodiments, the panoramic 2D/3D training used to develop the one or more panoramic models 514 may include panoramic image data and associated 3D data generated by a capture device component incorporating one or more color cameras and one or more 3D sensors attached to a rotating stage or a device configured to rotate about an axis (e.g., using synchronized rotation signals) during a capture process. During rotation, multiple images and depth readings are captured, which may be combined into a single panoramic 2D/3D image. In some implementations, by rotating the platform, images with mutually overlapping fields of view but with different viewpoints may be obtained, and 3D information may be derived therefrom using stereo algorithms. The 2D/3D panoramic training data may also be associated with information identifying a capture location and a capture orientation of the 2D/3D panoramic image, which may be generated by the 2D/3D capture device and/or derived in association with the capture process. Additional details regarding the graphical user interface that facilitates the viewing and assisted capture process are described in U.S. patent application No. 15/417,162, filed on 26.1.2017 and entitled "CAPTURING AND ALIGNING PANORAMIC IMAGE AND DEPTH DATA," the entire contents of which are incorporated herein by reference.

In various embodiments, one or more panoramic models 514 may employ an optimized neural network architecture that has been specially trained based on the 2D/3D panoramic image training data discussed above to evaluate and process panoramic images in order to derive 3D data therefrom. In various embodiments, unlike various existing 3D self-2D models (e.g., standard model 114), one or more panoramic models 514 may employ a neural network configured to process panoramic image data using convolutional layers that surround the panoramic image when projected onto a flat (2D) plane. For example, image projection may refer to mapping a flat image onto a curved surface, and vice versa. In this regard, the geometry of the panoramic image differs from that of a normal (camera) picture in that all points along the horizontal (scan) line are equidistant from the focal point of the camera. In practice, this creates a cylindrical or spherical image that is displayed correctly only when viewed from the exact center of the cylinder. When an image is "unrolled" on a flat surface, such as a computer display, the image has severe distortion. Such an "unfolded" or flat version of the panoramic image is sometimes referred to as an equirectangular projection or equirectangular image.

In this regard, in some implementations, one or more panoramic models 514 may be configured to receive panoramic image data 502 in the form of equirectangular projections or already projected onto a 2D plane. In other implementations, the panorama component 504 may be configured to project the received spherical or cylindrical panoramic image onto a 2D plane to generate a projected panoramic image in an equirectangular form. To account for inherent distortions in the received panoramic image data associated with deriving depth information therefrom, one or more panoramic models may employ a neural network having convolutional layers that wrap around based on image projection to account for edge effects. In particular, convolutional layers in a neural network typically fill their inputs with zeros when the receive fields of the convolutional layers would otherwise extend outside of the valid data region. For proper processing of an iso-rectangular image, a convolutional layer with received fields extending beyond one horizontal edge of the valid data region will instead extract inputs from the data at the opposite horizontal edge of the region, rather than setting these inputs to zero.

In some implementations, weighting may be performed during training of the neural network model based on the image projections to enhance the accuracy of the depth prediction of the trained model. Specifically, the angular area represented by pixels near the top or bottom of the equirectangular image (poles) is smaller than the angular area represented by pixels near the equator. To avoid training a network that makes good predictions near the poles at the expense of making poor predictions near the equator, the per-pixel training penalty propagated through the network during training is proportional to the angular area of the image projection-based representation of that pixel. Accordingly, one or more panoramic models 514 may be configured to apply weighted 3D self 2D parameters based on the angular area represented by the pixel, where the weight due to the 3D prediction determined for the respective pixel decreases with decreasing angular area.

In one or more implementations, the one or more panorama models 514 may be further configured to compensate for image distortion by re-projecting the panoramic image during each convolutional layer. Specifically, instead of each convolutional layer extracting the input from a square region (e.g., a 3 × 3 region) of the previous layer, the input is sampled from a position in the previous layer corresponding to a particular angular received field based on the projection instead. For example, for an equirectangular projection, the inputs for the convolutional layer may come from a square region (3 × 3) of the elements near the equator, while near the poles those same nine inputs would be sampled from a region wider than its height, which corresponds to a horizontal stretch near the poles in an equiangular projection. In this regard, the output of a previous convolution layer may be interpolated and then used as input for a next subsequent or downstream layer.

In various embodiments, panorama component 506 may facilitate processing the panoramic image to facilitate derivation of 3D data therefrom by 3D data derivation component 110 using one or more panorama models 514. In the illustrated embodiment, the panorama component 506 may include a splice component 508 and a crop component 510.

In some implementations, the received panoramic image data 502 may be input directly to one or more panoramic models 514 based on being classified as a panoramic image (e.g., having a field of view that exceeds a defined threshold). For example, received panoramic image data 502 may include a 360 ° panoramic image captured as a single image using a capture device that employs a conical mirror. In other examples, the received panoramic image data 502 may include an image having a 180 ° field of view captured as a single image using, for example, a fisheye lens. In still other implementations, a 2D panoramic image may be formed via combining two or more 2D images whose common field of view spans at most about 360 °, which are stitched together (by another device) prior to being received by the receiving component 108.

In other implementations, the panorama component 506 can include a mosaic component 508 that can be configured to generate a panoramic image for input to the one or more panoramic models 514 based on receiving two or more images having adjacent perspectives of the environment. For example, in some implementations, two or more images may be captured in conjunction with the camera being rotated about an axis to capture two or more images having a common field of view equal to 360 ° or another wide field of view range (e.g., greater than 120 °). In another example, the two or more images may include images captured by two or more cameras positioned relative to the environment and each other, respectively, such that the combined field of view of the respective image captures is equal to 360 °, such as two fisheye lens cameras each having a 180 ° field of view positioned in opposite directions. In another example, a single device may include two or more cameras with partially overlapping fields of view configured to capture two or more images whose common field of view spans up to 360 °. For these embodiments, the stitching component 508 may be configured to stitch the respective images together to generate a single panoramic image to serve as an input to the one or more panoramic models 514 to generate the derived 3D data 116 therefrom.

In this regard, the stitching component 508 may be configured to align or "stitch together" respective 2D images providing different perspectives of the same environment to generate a panoramic 2D image of the environment. For example, the stitching component 508 can also employ known or derived information (e.g., using techniques described herein) regarding the capture location and orientation of respective 2D images in order to align and order the respective 2D images relative to one another, and then merge or combine the respective images to generate a single panoramic image. By combining two or more 2D images into a single larger field-of-view image before input to the 3D self-2D predictive neural network model, the accuracy of the depth results is improved over providing the input separately and then combining the depth outputs (e.g., combined in association with generating the 3D model or for another application). In other words, stitching the input images in 2D may provide better results than stitching the predicted depth output in 3D.

Thus, in some embodiments, the wider field-of-view image generated by the stitching component 508 may be processed using one or more of the standard model 114 or the panoramic model 514 to obtain a single depth dataset for the wider field-of-view image as compared to separately processing each image to obtain separate depth datasets for each image. In this regard, a single depth set may be associated with increased accuracy relative to separate depth data sets. Additionally, by aligning the wider field of view image and its associated depth data with other image and depth data captures for the environment at different capture locations, the 3D model generation component 118 can use the wider field of view image and its associated single depth data set in association with generating the 3D model. The resulting alignment generated using the wider field-of-view image and associated depth data will be of greater accuracy relative to the alignment generated using the individual image and associated separate depth data sets.

In some embodiments, depth information may be derived by the 3D data derivation component 110 for two or more images using one or more standard models 114 for the respective images prior to stitching the images together to generate a panoramic image. The stitching component 508 may further employ the initially derived depth information for respective images (e.g., pixels in the respective images, features in the respective images, etc.) in order to facilitate aligning the respective 2D images with one another in association with generating a single 2D panoramic image of the environment. In this regard, initial 3D data may be derived for each 2D image prior to stitching using one or more standard 3D self-2D models. In association with combining the images to generate a single panoramic image, this initial depth data may be used to align the respective images with one another. Once generated, the panoramic image may be reprocessed by the 3D data derivation component 110 using one or more panoramic models 514 to derive more accurate 3D data for the panoramic image.

In some implementations, in association with combining two or more images together to generate a panoramic image, the stitching component 508 can project the respective images to a common 3D coordinate space based on the initially derived depth information and the calibrated capture positions/orientations of the respective images relative to the 3D coordinate space. In particular, the stitching component 508 may be configured to project two or more adjacent images (for stitching together as a panorama) and corresponding initially derived 3D depth data into a common spatial 3D coordinate space in order to facilitate accurate alignment of the respective images in association with generating a single panoramic image. For example, in one embodiment, the stitching component 508 can merge the respective image data of the respective images and the initially derived 3D onto a discretized sinusoidal projection (or another type of projection). The stitching component 508 can convert each 3D point included in the initially derived 3D data into the sinogram's coordinate space and assign it to a discretized cell. The stitching component 508 can further average multiple points mapped to the same cell to reduce sensor noise while detecting and removing anomalous readings from the average calculation

In some implementations, the stitching component 508 can also generate panoramic 3D images (e.g., point clouds, depth maps, etc.) based on projection points relative to the 3D coordinate space. For example, the stitching component 508 may employ the initial depth data to create a sinusoidal depth map or a point cloud comprising 3D points projected onto a common 3D space coordinate plane. The stitching component 508 may further apply the pixel color data to the depth map or point cloud by projecting the color data from the respective 2D image onto the depth map or point cloud. This may involve projecting light rays outwardly from the color camera along each captured pixel toward a portion of interest of the depth map or point cloud to color the depth map or point cloud. The mosaics 508 may also perform back projection of color data from the color point cloud or depth map to create a single 2D panoramic image. For example, by back projecting color data from a color point cloud or 3D depth map onto the intersection points or regions of the 2D panorama, the stitching component 508 can fill any possible pinholes in the panorama with adjacent color data, thereby unifying the exposure data across the boundaries between the respective 2D images (if needed). The splice 508 may also perform blending and/or pattern cutting at the edges to remove seams. The resulting panoramic image may then be reprocessed by the 3D data derivation component 110 to determine more accurate 3D data for the panoramic image using one or more panoramic models 514.

In some embodiments, panoramic image data captured for an environment may be used to generate optimized derived 3D data for smaller or cropped portions of the panoramic image (e.g., derived by 3D data derivation component 110 using 3D from 2D). For example, in the above-described embodiments, the 3D data derivation component 110 can process the panoramic image (e.g., in an equiangular projection format) using one or more panoramic models 514 to generate depth data for the entire panoramic image, such as depth data for each pixel, depth data for groups of pixels (e.g., superpixels, defined features, objects, etc.) that collectively cover the entire panoramic image span, and so forth. However, in various applications, depth data for the entire panoramic image may not be desired or needed. For example, in various contexts associated with optimizing the placement of digital objects in AR applications using 3D from 2D, depth data for a wide field of view of the environment may not be needed (e.g., it may only require depth data for objects in line of sight or for the environment or region of objects in front of the observer's eyes). (the AR application of the disclosed techniques for deriving 3D from 2D is described below with reference to fig. 30). In another example, in association with using the derived 3D data to generate real-time relative 3D location data for automatic navigation and collision avoidance by a smart machine (e.g., drone, unmanned vehicle, robot, etc.), a wide field of view of depth data may not be needed. For example, accurate real-time depth data for object avoidance may only be needed for, for example, a forward trajectory path of a vehicle.

In some embodiments, where depth data is desired for a smaller field of view of the environment relative to the entire panoramic view of the environment, the panoramic image of the environment may still be used to generate optimized derived 3D data for the desired cropped portion of the image. For example, the 3D data derivation component 110 can apply one or more panoramic models 514 to a panoramic image of the environment to derive depth data for the panoramic image. Then, cropping component 510 may crop the panoramic image having the derived 3D data associated therewith to select a desired portion of the image. For example, cropping component 510 may select a portion of the panoramic image that corresponds to a narrower field of view. In another example, cropping component 510 may crop the panoramic image to select a particular segmented object (e.g., a person, a face, a tree, a building, etc.) in the panoramic image. The technique for determining the desired portion of the panoramic image for cropping may vary based on the application of the resulting 3D data. For example, in some implementations, user input may be received that identifies or indicates a desired portion for cropping. In another implementation, for example, where 3D data has been derived for real-time object tracking, cropping component 510 may receive information identifying the desired object being tracked, information defining or characterizing the object, etc., and automatically crop the panoramic image to extract the corresponding object. In another example, the cropping component 510 may be configured to crop the panoramic image according to a default setting (e.g., select a portion of the image having a low degree of distortion effect). Cropping component 510 may further identify and associate corresponding derived 3D data associated with the cropped portion of the panoramic image and associate the corresponding portion of the derived 3D data with the cropped portion of the panoramic image.

For these embodiments, the accuracy of the derived depth data associated with the cropped portion of the panoramic image may be optimized by using one or more panoramic models 514 to derive depth data for the entire panoramic image, and then using only the portion of the derived depth data associated with the desired cropped portion of the panoramic image, relative to first cropping the panoramic image, and then using one or more standard 2D self-3D models 114 or alternative depth derivation techniques to derive depth data for the smaller field of view portion of the panoramic image.

Fig. 6 presents an exemplary computer-implemented method 600 for deriving 3D data from panoramic 2D image data, in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 602, a system (e.g., system 500) including a processor may receive a panoramic image. At 604, the system employs a 3D self-2D convolutional neural network model to derive 3D data from the panoramic image, wherein the 3D self-2D convolutional neural network model employs convolutional layers that surround the panoramic image when the panoramic image is projected on the 2D plane in order to facilitate deriving three-dimensional data. According to the method 600, the convolutional layer minimizes or eliminates edge effects associated with deriving 3D data based on surrounding panoramic images when projected on a 2D plane. In some implementations, the panoramic image may be received while the panoramic image is projected on a two-dimensional plane. In other implementations, the panoramic image may be received as a spherical or cylindrical panoramic image, and the system may project (e.g., using the panoramic component 506) the spherical or cylindrical panoramic image onto a 2D plane before employing the 3D self-2D convolutional neural network model to derive 3D data.

In one or more implementations, the 3D self-2D convolutional neural network considers the weighting values applied to the respective pixels based on the projected angular areas of the respective pixels during training. In this regard, the 3D self-2D neural network model may include a model trained based on weighting values applied to respective pixels of the projected panoramic image in association with depth data from which the respective pixels are derived, wherein the weighting values vary based on the angular area of the respective pixels. For example, during training, the weighting values decrease as the angular area of the corresponding pixel decreases. Further, in some implementations, a downstream convolutional layer of the convolutional layers that follows the previous layer is configured to re-project a portion of the panoramic image processed by the previous layer in association with depth data from which the panoramic image was derived, thereby producing a re-projected version of the panoramic image for each downstream convolutional layer. In this regard, the downstream convolutional layer is further configured to employ input data from a previous layer by extracting the input data from a re-projected version of the panoramic image. For example, in one implementation, input data may be extracted from a re-projected version of the panoramic image based on a location in a portion of the panoramic image where the re-projected version of the panoramic image corresponds to a defined angle receive field.

Fig. 7 presents an exemplary computer-implemented method for deriving 3D data from panoramic 2D image data, in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 702, a system (e.g., system 500) operably coupled to a processor receives a request for depth data associated with a region of an environment depicted in a panoramic image. For example, in some implementations, a request may be received from a user device based on input provided by a user requesting a particular portion of a panoramic image for 3D viewing, for use in association with a 3D imaging or modeling application, and so forth. In another example, a request can be received from a 3D modeling application in association with determining that depth data for the region is needed to facilitate an alignment process or to generate a 3D model. In another example, the request may be received from the AR application based on information indicating that the area of the environment is within a current field of view of a user employing the AR application. In yet another example, a request may be received from an autonomous navigation vehicle based on information indicating that an area of the environment is within a current field of view of the vehicle (e.g., to facilitate avoiding a collision with an object in front of the vehicle). In yet another example, the request may be received from an object tracking device based on information indicating that an object tracked by the device is located within the environmental area.

At 704, based on receiving the request, the system may derive depth data for the entire panoramic image using a neural network model configured to derive depth data from the single two-dimensional image (e.g., using one or more panoramic models 514 via 3D data derivation component 110). At 704, the system extracts a portion of the depth data corresponding to the region of the environment (e.g., via cropping component 510), and at 708, the system provides the portion of the depth data to an entity associated with the request (e.g., a device, system, user device, application, etc. from which the request was received) (e.g., via panorama component 505, computing device 104, etc.).

Fig. 8 presents another exemplary system 800 that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data in accordance with various aspects and embodiments described herein. The system 800 includes the same or similar features as the system 500, with the addition of native assistance data 802 as input. The system 800 also includes an upgraded 3D self-2D processing module 804, which is different from the 3D self-2D processing module 504, in which an auxiliary data component 806, auxiliary data component output data 808, and one or more enhanced 3D self-2D data models are added that are configured to process the 2D image data and the auxiliary data to provide more accurate derived 3D data relative to data provided by the one or more standard models 114. These enhanced 3D self-2D models are referred to herein and described in system 800 as enhanced models 810. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

Systems

100 and 500 are generally directed to using only a single 2D image (including panoramic images and narrower field of view images) as input to one or more 3D self 2D models (e.g., one or more standard models and/or one or more panoramic models 514) to derive 3D data therefrom (derived 3D data 116). The system 800 introduces the use of various types of auxiliary input data that may be associated with 2D images to facilitate improving the accuracy of 3D self-2D prediction. Such auxiliary input data may include, for example: information about the capture position and orientation of the 2D image, information about the capture parameters of the capture device that generated the 2D image (e.g., focal length, resolution, lens distortion, lighting, other image metadata, etc.), actual depth data associated with the 2D image captured by a 3D sensor (e.g., 3D capture hardware), depth data derived for the 2D image using stereo image processing, etc.

In the illustrated embodiment, auxiliary input data that may be used as additional input to facilitate improving the accuracy of 3D self-2D predictions may be received in association with one or more 2D images as native auxiliary data 802. In this regard, the assistance data is characterized as "native" to indicate that in some embodiments it may include raw sensory data and other types of raw assistance data that may be processed by assistance data component 806 to generate structured assistance data, which may then be used as input to one or more augmentation models 810. For these embodiments, assistance data component output data 808 may include structured assistance data generated by assistance data component 806 based on native assistance data 802, as described in more detail with reference to fig. 9. For example, (as described in more detail with reference to fig. 9), in one implementation, the native assistance data 802 may include motion data captured by an Inertial Measurement Unit (IMU) in association with an environmental scan involving the capture of several images at different capture locations. According to this example, the assistance data component 806 can determine capture position and orientation information for the respective 2D image based on the IMU motion data. The determined capture position and orientation information may be considered structured assistance data, which may then be associated with the respective 2D image and used as input for one or more augmented models 810.

In other embodiments, the native assistance data 802 may include various assistance data (e.g., actual ground truth data provided by the capture device, actual capture position and orientation information, etc.) that may be used directly as input to one or more 3D self-2D models. With these implementations, the assistance data component 806 can ensure accurate correlation of the native assistance data 802 with a particular 2D image and/or convert the native 2D image data into a structured machine-readable format (if needed) for input with the 2D image to one or more augmented models 810. Accordingly, the assistance data component output data 808 may include native assistance data 802 associated with the 2D image in raw form and/or structured format.

According to any of these embodiments, the one or more augmented models 810 may include one or more augmented 3D self-2D models employing one or more neural networks that have been specially trained to derive 3D data from the 2D images in conjunction with one or more auxiliary data parameters associated with the 2D images. Thus, the derived 3D data 116 generated by the augmented model may be more accurate than 3D data that can be determined by one or more standard models 114. The one or more augmented models 110 may also include one or more augmented panoramic models. In this regard, the enhanced panoramic model may employ one or more features and functions, and as with the panoramic model 514 discussed herein, the enhanced panoramic model may also be configured to evaluate auxiliary data associated with panoramic images or images otherwise classified as having a wide field of view. In some implementations, the derived 3D data 116 generated by the enhanced panoramic model may be more accurate than the data that can be determined by one or more panoramic models 514.

In some implementations, the augmented model 810 may include a plurality of different 3D self-2D models, each configured to handle a different set or subset of auxiliary data parameters associated with the 2D image. With these implementations, the model selection component 512 may be configured to select an applicable enhancement model from the plurality of enhancement models 810 to apply to the 2D image based on the assistance data associated with the 2D image. For example, based on the type of assistance data associated with the 2D image (e.g., included in the native assistance data 802 and/or determined by the assistance data component 806 based on the native assistance data 802), the model selection component 512 may be configured to select an appropriate enhancement model from the plurality of enhancement models 810 to apply to an input data set comprising the 2D image and the associated assistance data, thereby deriving 3D data for the image. In other implementations, the augmented model 810 may include a generic model configured to process the 2D image plus one or more defined auxiliary data parameters. With these implementations, the 3D self-2D processing module 804 may be configured to receive and/or determine one or more defined auxiliary parameters of the respective 2D image processed by the 3D data derivation component 110 using the augmented model. Otherwise, if the 2D image is not associated with auxiliary data (which is not received or cannot be determined by auxiliary data component 806) or is associated with insufficient or incomplete assistance, 3D data derivation component 110 can employ one or more standard models 114 to derive 3D data for the 2D image.

In various additional embodiments, discussed in more detail below with reference to fig. 9, the native assistance data 802 may include assistance data associated with the 2D image that may be used by the assistance data component 806 to pre-process the 2D image prior to input to the one or more 3D self 2D models in order to generate the derived 3D data 116 for that image. Such pre-processing of the 2D image may convert the image to a unified representation before applying one or more 3D self-2D models thereto to derive 3D data therefrom, such that the results of the neural network are not degraded by differences between the training image and the real image. With these embodiments, the one or more augmented models 810 may include augmented 3D self-2D models that have been specifically configured to derive depth data for pre-processed 2D images using training data pre-processed according to the techniques described below. Thus, in some implementations, after the received 2D image has been pre-processed, the model selection component 512 can select a particular augmented model configured to evaluate the pre-processed 2D image for use by the 3D data derivation component to generate the derived 3D data 116 for the pre-processed 2D image. In other implementations, the pre-processed 2D image may be used as input to one or more standard models 114, but provides more accurate results due to the consistency of the input data. The auxiliary data component 806 can also pre-process the panoramic image prior to inputting it to one or more panoramic models 514 to further improve the accuracy of the results.

In the illustrated embodiment, the auxiliary data component output data 808 can also be provided to and used by the 3D model generation component 118 to facilitate generation of the 3D model. For example, the assistance data may be used by the 3D model generation component 118 to facilitate aligning images (and their associated derived 3D data 116) captured at different capture positions and/or orientations relative to each other in a three-dimensional coordinate space. In this regard, in various embodiments, some or all of the auxiliary data component output data 808 may not be used as an input to a 3D self-2D predictive model associated with a 2D image to improve the accuracy of the derived 3D data. Instead, the auxiliary data component output data 808 associated with the 2D image may be employed by the 3D model generation component 118 to facilitate generation of a 3D model based on the 2D image and the derived 3D data 116 determined for the 2D image. In this regard, the combination of the auxiliary data, the 2D image, and the derived 3D data 116 of the 2D image may be used by the 3D model generation component 118 to facilitate generating an immersive 3D environment of the scene as well as other forms of 3D (and in some implementations, 2D) reconstruction.

For example, in one implementation, the assistance data component output data 808 (or the native assistance data 802) may include depth sensor measurements of 2D images captured by one or more depth sensors. For this example, the depth sensor measurements may be combined with the derived 3D data of the 2D image to fill in gaps lacking derived 3D data, and vice versa. In another example, the assistance data may include location information identifying a capture location of the 2D image. For this example, the location information may not be used as input to the 3D self 2D model to facilitate depth prediction, but is alternatively used by the 3D model generation component 118 to facilitate aligning the 2D image and associated derived 3D data 116 with other 2D images and associated derived 3D data sets.

Fig. 9 presents a more detailed representation of native assistance data 802, assistance data component 806, and assistance data component output data 808 in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

The native assistance data 802 may include various types of assistance data associated with the 2D image and/or the process used to capture the 2D image, which may be used to facilitate improving the accuracy of the 3D self-2D prediction and/or used by the 3D modeling component 118 to improve the quality of the 3D model. For example, native assistance data 802 may include, but is not limited to, capture device motion data 904, capture device location data 906, camera/image parameters 908, and 3D sensor data 910. Capture device motion data 904 may include information about the movement of a camera associated with the capture of multiple images of an object or environment. For example, in some implementations, capturing device motion data 904 may include data captured by an IMU, accelerometer, or the like physically coupled to a camera used to capture the image. For example, IMU measurements may include captured data that is captured in association with movement of the camera to different locations in the environment while the camera is capturing images (or the camera is not capturing images, such as movement between captures), rotation of the camera about a fixed axis, movement of the camera in vertical and horizontal directions, and so forth. In some implementations, the IMU measurements may be associated with respective images captured by the camera in association with camera movement during the capture process via a timestamp or the like. For example, in implementations where a camera is used to capture many images of an environment as a camera operator moves the camera to different locations throughout the environment to capture different areas and perspectives of the environment, each image captured may be associated with a timestamp indicating its relative capture time relative to the other images and motion data reflecting camera movement during and/or between captures.

Capture device location data 906 may include information identifying or indicating a capture location of the 2D image. For example, in some implementations, the capture device location data may include Global Positioning System (GPS) coordinates associated with the 2D image. In other implementations, the capture device location data 906 may include location information that indicates a relative location of the capture device (e.g., camera and/or 3D sensor) with respect to its environment, such as a relative location or calibrated location of the capture device with respect to an object in the environment, another camera in the environment, another device in the environment, and so forth. In some implementations, this type of position data may be determined by a capture device (e.g., a camera and/or a device operatively coupled to the camera, including positioning hardware and/or software) in association with image capture and received with the image.

The camera/image parameters 908 may include information regarding operating parameters and/or settings of one or more cameras (or one or more camera lenses) used to capture the 2D image data 102, as well as contextual information associated with the capture conditions. For example, various camera operating parameters for capturing images may vary based on the function of the camera, default or user-selected camera settings employed, the lighting in which the images are captured, and so forth. In this regard, the camera/image parameters 908 may include camera settings and capture context information associated with the 2D image (e.g., as metadata or otherwise associated with the received 2D image), including but not limited to: focal length, aperture, field of view, shutter speed, lens distortion, illumination (exposure, gamma, tone mapping, black level), color space (white balance), ISO, and/or other parameters that may vary from image to image.

The 3D sensor data 910 may include any type of 3D data associated with a 2D image included in the received 2D image data 102 captured by a 3D sensor or 3D capture hardware. This may include 3D data or depth data captured using one or more structured light sensor devices, LiDAR devices, laser rangefinder devices, time-of-flight sensor devices, light field cameras, active stereo devices, and the like. For example, in some embodiments, the received 2D image data 102 may include 2D images captured by a 2D/3D capture device or a 2D/3D capture device component that includes one or more 3D sensors in addition to one or more 2D cameras (e.g., RGB cameras). In various implementations, the 2D/3D capture device may be configured to capture 2D images using one or more cameras (or one or more camera lenses) and simultaneously (e.g., at or near the same time) capture associated depth data for the 2D images using one or more 3D sensors, or in a manner that may associate them after capture, if not simultaneously. The complexity (e.g., complexity, hardware cost, etc.) of such 2D/3D capture devices/components may vary. For example, in some implementations, to reduce cost, the 2D/3D capture device may include one or more cameras (or one or more camera lenses) and a limited range/field of view 3D sensor configured to capture partial 3D data of the 2D image. One version of such a 2D/3D capture device may include a 2D/3D capture device that produces spherical color images and depth data. For example, a 2D/3D capture may include one or more color cameras capable of capturing image data (e.g., spherical panoramic images) having fields of view spanning vertically and horizontally up to 360 °, and a structured light sensor configured to capture depth data for a middle portion of the vertical field of view (e.g., near the equator).

Although the native assistance data 802 is described as a separate entity from the 2D image data 102, this description is for exemplary purposes only to indicate that the native assistance data is a new addition (optional) to one or more embodiments of the disclosed system. In this regard, it should be understood that the 2D image data 102 may be received with its associated native auxiliary data 802 as a single data object/file, as metadata, and so forth. For example, the 2D image data 102 may include a 2D image having 3D sensor depth data associated therewith for 2D image capture, metadata describing camera/image parameters, and the like.

The assistance data component 806 may include various computer-executable components that facilitate processing the native assistance data 802 and/or the received 2D image data 102 to generate structured assistance data 930 and/or pre-processed 2D image data 932. In the illustrated embodiment, these components include an orientation estimation component 912, a location estimation component 914, a depth estimation component 916, a multi-image analysis component 918, a 3D sensor data association component 924, a preprocessing component 926, and a semantic labeling component 928.

The orientation estimation component 912 may be configured to determine or estimate the capture orientation or pitch of the 2D image and/or the relative orientation/pitch of the 2D with respect to the common 3D coordinate space. For example, in some embodiments, orientation estimation component 912 may determine an orientation of the received 2D image (as provided by capture device motion data 904) based on IMU or accelerometer measurements associated with the 2D image. The determined orientation or pitch information may be characterized as structured assistance data 930 and associated with the 2D image. The orientation information determined for the 2D image may be used with the 2D image as input to one or more enhanced 3D self-2D models (e.g., one or more enhanced models 810) to generate derived 3D data 116 for the 2D image, used by model generation component 118 to facilitate alignment procedures associated with 3D model generation, and/or stored in a memory (e.g., memory 122 or external memory) for additional applications.

The position estimation component 914 may be configured to determine or estimate a capture position of the 2D image and/or a relative position of the 2D image with respect to a common 3D coordinate space. The determined capture location information may also be characterized as structured assistance data 930 and associated with the 2D image. The location information may also be used with the 2D image as input to one or more enhanced 3D self-2D models (e.g., one or more enhanced models 810) to generate derived 3D data 116 for the 2D image, used by model generation component 118 to facilitate alignment procedures associated with 3D model generation, and/or stored in a memory (e.g., memory 122 or external memory) for additional applications.

The location estimation component 914 can employ various techniques to determine a capture position (i.e., a capture location) of the 2D image based on the type of assistance data available. For example, in some implementations, the capture device location data 906 may identify or indicate a capture location of the received 2D image (e.g., GPS coordinates of the capture device). In other implementations, the position estimation component 914 can employ the capture device motion data 904 to determine a capture position of the 2D image using inertial position tracking analysis. In other embodiments, the native assistance data 802 may include sensed data captured in association with the capture of one or more 2D images, which may be used to facilitate determining a capture location of the 2D images. For example, the sensed data may include 3D data captured by stationary sensors, ultrasound systems, laser scanners, etc., which may be used to facilitate determining a location of a capture device that captures one or more 2D images using visual ranging techniques, line of sight for mapping and positioning, time-of-flight mapping and positioning, etc.

In some embodiments, orientation estimation component 912 and/or location estimation component 914 may employ one or more related images included in 2D image data 102 in order to facilitate determining a capture orientation and/or location of the 2D image. For example, the related 2D images may include adjacent images, images with partially overlapping fields of view, images with slightly different capture locations and/or capture orientations, stereo image pairs, images providing different perspectives of the same object or environment captured at significantly different capture locations (e.g., beyond a threshold distance to not constitute a stereo image pair, such as an interocular distance greater than about 6.5 centimeters), and so on. The relationship between the source of the associated 2D image and the associated 2D image included in the 2D image data 102 may vary. For example, in some implementations, the 2D image data 102 may include video data 902 that includes consecutive frames of video captured in association with movement of a video camera. The related 2D images may also include video frames captured by the video camera in a fixed position/orientation, but captured at different points in time as one or more characteristics of the environment change at the different points in time. In another example, similar to consecutive frames of video captured by a video camera, an entity (e.g., a user, a robot, an autonomous vehicle, etc.) may use the camera to capture several 2D images of the environment in association with movement of the entity about the environment. For example, using a standalone digital camera, smartphone, or similar device with a camera, a user may walk around the environment and take 2D images at several points along the way, capturing different perspectives of the environment. In another exemplary implementation, the relevant 2D images may include 2D images from nearby or overlapping perspectives captured by a single camera in association with rotation of the machine about a fixed axis. In another implementation, the related 2D images may include two or more images respectively captured by two or more cameras at different perspectives of the partially overlapping fields of view or environment (e.g., captured simultaneously or near simultaneously by different cameras). Using this implementation, the related 2D images may include images that form a stereoscopic image pair. The related 2D images may also include images captured by two or more different cameras that are not arranged as a stereo pair.

In some embodiments, orientation estimation component 912 and/or position estimation component 914 may employ visual ranging and/or simultaneous localization and mapping (SLAM) to determine or estimate a capture orientation/position of a 2D image based on a sequence of related images captured in association with movement of a camera. Visual ranging methods may be used to determine an estimate of camera capture orientation and position based on an image sequence using feature matching (matching features over multiple frames), feature tracking (matching features in adjacent frames), and optical flow techniques (based on the intensity of all pixels or a particular region in sequential images). In some embodiments, orientation estimation component 912 and/or position estimation component 914 may employ capture device motion data 904, capture device position data 906, and/or 3D sensor data 910 in association with evaluating the image sequence using visual ranging and/or SLAM to determine the capture position/orientation of the 2D image. SLAM techniques employ algorithms that are configured to simultaneously locate (e.g., determine a position or orientation of) a capture device (e.g., a 2D image capture device or a 3D capture device) relative to its surroundings and simultaneously map the structure of the environment. SLAM algorithms may involve tracking a set of points through a sequence of images, using these tracks to triangulate the 3D position of the points, while using the point locations to determine the relative position/orientation of the capture device that captured them. In this regard, in addition to determining the position/orientation of the capture device, the SLAM algorithm may also be used to estimate depth information for features included in one or more images of the sequence of images.

In some embodiments, the sequence of related data images may include images captured in association with a scan of an environment involving the capture of several images at different capture locations. In another example, the sequence of related images may include video data 902 associated with 2D images of an object or environment captured during movement of a capture device associated with a scan of the object or environment (where the scan involves capturing multiple images of the object or environment from different capture locations and/or orientations). For example, in some implementations, the video data 902 may include video data captured in addition to one or more 2D images (e.g., by a separate camera) during the scanning process. The orientation estimation component 912 and/or the position estimation component 914 may also use the video data 902 to determine a capture orientation/position of one or more 2D images captured during a scan using visual ranging and/or SLAM techniques. In some implementations, the video data 902 may include primary images that may be processed by the system 800, or the like, to derive 3D data therefrom using one or more 3D self-2D techniques described herein (e.g., using one or more of the standard model 114, the panoramic model 514, the augmented model 810, etc.). According to this example, one or more of these frames may be used as a primary input image from which 3D data is derived using one or more of the 3D self-2D techniques described herein. Additionally, orientation estimation component 912 and/or position estimation component 914 may use neighboring images to facilitate determining a captured orientation/position of a primary input frame using visual ranging and/or SLAM.

Depth estimation component 916 can also evaluate the related images to estimate depth data for one or more related images. For example, in some embodiments, depth estimation component 916 may employ SLAM to estimate depth data based on the image sequence. Depth estimation component 916 can also employ related photogrammetry techniques to determine depth information for the 2D image based on one or more related images. In some implementations, depth estimation component 916 can also employ capture device motion data 904 and one or more motion recovery structure techniques to facilitate estimating depth data for a 2D image.

In some embodiments, depth estimation component 916 may also be configured to employ one or more passive stereo processing techniques to derive depth data from image pairs classified as stereo image pairs (e.g., image pairs offset by stereo image pair distance, such as an interocular distance offset of about 6.5 centimeters). For example, passive stereo involves comparing two stereo images that are horizontally displaced from each other and providing two different views of a scene. By comparing the two images, it is possible to obtain relative depth information in the form of a disparity map encoding the difference in the horizontal coordinates of the corresponding image points. The values in the disparity map are inversely proportional to the depth of the scene at the corresponding pixel location. In this regard, given a pair of stereo images taken from slightly different viewpoints, depth estimation component 916 may employ a passive stereo matching function that identifies and extracts corresponding points in the two images. Knowing these correspondences, the image capture location and the scene structure, the 3D world coordinates of each image point can be reconstructed by triangulation. The disparity at which the depth data is encoded represents the x-coordinate or distance between a pair of corresponding points in the left and right images.

In various implementations, a stereoscopic image pair may include images (e.g., corresponding to left and right images similar to an image pair seen by the left and right eyes) that are offset along a horizontal axis by a stereoscopic image pair distance (e.g., an interocular distance of about 6.5 centimeters). In other implementations, the stereo image pair may include an image pair offset by a stereo image distance along a vertical axis. For example, in some embodiments, the received 2D images may include panoramic image pairs having a field of view spanning 360 ° (or up to 360 °) captured from different vertical positions relative to the same vertical axis, wherein the different vertical positions are offset by a stereoscopic image pair distance. In some implementations, the respective vertically offset stereoscopic images can be captured by a camera configured to move to a different vertical position to capture the respective images. In other implementations, the respective vertically offset stereoscopic images may be captured by two different cameras (or camera lenses) located at different vertical positions.

In some implementations, the depth estimation component 916 can also employ one or more active stereo processes to derive depth data for a stereo image pair captured in association with projected light (e.g., structured light, laser light, etc.), in accordance with various active stereo capture techniques. For example, active stereo processing employs light emission associated with the capture of stereo images (e.g., via lasers, structured light devices, etc.) to facilitate stereo matching. The word "active" means projecting energy into the environment. In active stereo vision systems, in connection with the capture of stereo images, a light projection unit or laser unit projects light or a light pattern (or simultaneous multiple sheets of light) onto a scene at a time. Light patterns detected in the captured stereo images may be used to facilitate extraction of depth information for features included in the respective images. For example, based in part on the correspondence between the light appearing in the respective images and the known position of the light/laser beam relative to the image capture location, the depth derivation component can perform active stereo analysis by looking up the correspondence between visual features included in the respective images.

Passive and/or active stereo-derived depth data may be associated with one or both images in a stereo pair. Depth data determined for the 2D image by the depth estimation component 916 based on analysis of one or more related images (e.g., using SLAM, photogrammetry, structure from motion recovery, stereo processing, etc.) can also be characterized as structured auxiliary data 930. The depth data may also be used with the 2D image as input to one or more enhanced 3D self-2D models (e.g., one or more enhanced models 810) to generate derived 3D data 116 for the 2D image, used by model generation component 118 to facilitate alignment procedures associated with 3D model generation, and/or stored in a memory (e.g., memory 122 or external memory) for additional applications.

In other embodiments, rather than determining depth data from passive stereo algorithms, the depth estimation component 916 may evaluate stereo image pairs to determine data regarding the quality of photometric matches between images at various depths (more intermediate results). In this regard, the depth estimation component 916 may determine the auxiliary data for one or both images included in a stereoscopic pair by determining match quality data regarding the quality of photometric matches between corresponding images at various depths. This photometric matching quality data can be used as auxiliary data for any 2D image in a stereo pair to be used as input to enhance a 3D self 2D model, resulting in deriving depth data for any 2D image.

Multiple image analysis component 918 can facilitate identifying, associating, and/or defining a relationship between two or more related images. In the illustrated embodiment, the multiple image analysis component 918 can include an image correlation component 920 and a relationship extraction component 922.

The image correlation component 920 may be configured to identify and/or classify correlated images included in the received 2D image data 102. The image correlation component 920 may employ various techniques to identify and/or classify two or more images as correlated images. In some embodiments, the related images and information defining the relationship between the related images employed by assistance data component 806 may be predefined. In this regard, the auxiliary data component 806 can identify and extract one or more related images included in the 2D image data 102 based on predefined information associated therewith. For example, an image having information classifying the image as a stereoscopic image pair may be received. In another example, the capture device may be configured to provide two or more images captured in association with rotation about a fixed axis. According to this example, an image may be received with information that annotates the captured scene and identifies its relative capture position and orientation to each other. The image correlation component 920 may be further configured to automatically classify images captured under the capture scene as being correlated. In another example, the image correlation component 920 may be configured to automatically classify a set of images captured by the same camera in association with a scan within a defined time window as being correlated. Similarly, based on the capture device motion data 904 (e.g., movement in a particular direction less than a threshold distance or degree of rotation), the image correlation component 920 may be configured to automatically classify respective frames of video included in the same video clip as being correlated for less than a defined duration and/or associated with a defined range of movement.

In other embodiments, the image correlation component 920 may be configured to identify the correlated images included in the 2D image data 102 based on: a respective captured location of the image (which may be provided with the received image and/or determined at least in part by location estimating component 914), and a respective captured orientation of the image (which may be provided with the received image and/or determined at least in part by orientation estimating component 912). For example, the image correlation component 920 may be configured to classify two or more images as being correlated based on the capture positions and/or capture orientations having a defined distance and/or degree of rotation that differs. For example, the image correlation component 920 may identify and classify two images as correlated based on different capture orientations having the same capture position but with a defined degree of rotation difference. Likewise, the image correlation component 920 may identify and classify two images as correlated based on different capture positions having the same capture orientation but differing by a defined distance or range of distances. According to this example, the image correlation component 920 can also identify and classify image pairs as stereo pairs.

The image correlation component 920 may also identify correlated images based on the time of capture and/or motion data regarding relative changes in motion between two or more images. For example, the image correlation component 920 may identify correlated images based on having respective capture times within a defined time window, having respective capture times separated by a maximum duration, and so forth. In other implementations, the image correlation component 920 can identify correspondences in visual features included in two or more images using one or more image analysis techniques to identify correlated images. The image correlation component 920 may further identify/classify correlated images based on the degree of correspondence in the visual features relative to a defined threshold. The image correlation component 920 may also use depth data (e.g., as the 3D sensor data 910) associated with the respective images (if provided) to determine spatial relationships between relative positions of corresponding visual features and employ these spatial relationships to identify/classify related images.

The relationship extraction component 922 may be configured to determine and/or associate relationship information with related images, the related information defining information about relationships between the related images. For example, relationship extraction component 922 may determine information regarding elapsed time between captures of two or more potentially related images, relative capture positions of two or more potentially related images (which may be provided with the received image and/or determined at least in part by position estimation component 914), relative capture orientations of two or more potentially related images (which may be provided with the received image and/or determined at least in part by orientation estimation component 912), information regarding correspondence between visual and/or spatial features of the related images, and so forth. Relationship extraction component 922 may further generate and associate relationship information with two or more related images that define a relationship (e.g., relative position, orientation, time of capture, visual/spatial correspondence, etc.) between the images.

In some embodiments, as described above, one or more components of assistance data component 806 (e.g., orientation estimation component 912, location estimation component 914, and/or depth estimation component 916) may employ the correlated images to generate structured assistance data 930 for one or more images included in the set (two or more) of correlated images. For example, as described above, in various embodiments, the 2D image data 102 may include video data 902 and/or 2D images captured in association with a scan that provides a contiguous (but different) perspective of the environment to sequential images (e.g., video frames and/or still images). For these embodiments, position estimation component 914 and/or orientation estimation component 912 may use the related sequential images to determine the captured position/orientation information or the single 2D image using visual ranging and/or SLAM techniques. Similarly, depth estimation component 916 can employ the correlated images to derive depth data using stereo processing, SLAM, motion recovery from structures, and/or photogrammetry techniques.

In other embodiments, the correlated images may be used as input to one or more 3D self-2D models (e.g., included in the 3D self-2D model database 112) in order to facilitate deriving (e.g., by the depth data derivation component 110) depth data for one or more images included in the set (two or more) of correlated images. For these embodiments, the one or more augmented models 810 may include an augmented 3D self-2D neural network model configured to receive and process two or more input images (e.g., as opposed to, for example, a standard model 114 configured only to evaluate a single image at a time). The enhanced 3D self-2D neural network model may be configured to evaluate relationships between related images (e.g., using depth learning techniques) in order to facilitate deriving depth data for one or more of the related images (e.g., where the related images may include a group of two or more related images). For example, in some implementations, one image included in a set of related images may be selected as the primary image for which derived 3D data 116 is determined, and the neural network model may use one or more other images in the set that are related in order to facilitate the derivation of the 3D data of the primary image. In other implementations, the enhanced 3D self-2D model may be configured to derive depth data for multiple input images at once. For example, enhancing a 3D self-2D model may determine depth information for all or some of the relevant input images. In association with using the relevant images as input to the enhanced 3D from 2D neural network model, relationship information describing relationships between the respective images (e.g., determined by the relationship extraction component 922 and/or associated with the respective images) may be provided as input with the respective images and evaluated by the enhanced 3D from 2D neural network model.

The 3D sensor data association component 924 may be configured to identify and associate any received 3D sensor data 910 of an image with a 2D image in order to facilitate using the 3D sensor data 910 as input to one or more augmented models 810. In this regard, the 3D sensor data correlation component 924 can ensure that the 3D data received in association with the 2D image is in a consistent structured machine-readable format prior to input to the neural network. In some implementations, the 3D sensor data correlation component 924 can process the 3D sensor data 910 to ensure that the data accurately correlates with respective pixels, superpixels, etc. of the image for which the data was captured. For example, in implementations in which partial 3D sensor data is received for a 2D image (e.g., for a middle portion of a spherical image located near the equator as compared to the entire field of view of the spherical image), the 3D sensor data correlation component 924 can ensure that the partial 3D data is accurately mapped to the region of the 2D image for which the data was captured. In some implementations, the sensor data correlation component 924 can calibrate the 3D depth data received with the 2D image by capturing locations and/or corresponding locations in the common 3D coordinate space so that additional or optimized depth data determined from the 2D model using enhanced 3D for the image can be based on or calibrated to the same reference point. The 3D sensor data 910 associated with the 2D image (e.g., in a standardized format and/or with calibration information in some implementations) can also be used with the 2D image as input to one or more enhanced 3D self-2D models (e.g., one or more enhanced models 810) to generate derived 3D data 116 of the 2D image, used by the model generation component 118 to facilitate alignment procedures associated with 3D model generation, and/or stored in a memory (e.g., memory 122 or external memory) for additional applications.

The preprocessing component 926 may be configured to preprocess the images to convert the images into a unified representation format prior to input into the 3D self 2D neural network model (e.g., included in the 3D self 2D model database 112) based on the camera/image parameters 908 associated with the respective images, such that the results of the neural network are not degraded by differences between the training images and the real images. In this regard, the preprocessing component 926 may change one or more characteristics of the 2D image to convert the 2D image into an altered version of the 2D image that conforms to a standard representation format defined for the 2D image, processed by a particular neural network model. Thus, the neural network model may comprise an enhanced neural network model that has been trained to evaluate images that conform to a standard representation format. For example, the preprocessing component 926 may correct or modify image defects to account for lens distortion, illumination variations (exposure, gamma, tone mapping, black level), color space (white balance) variations, and/or other image defects. In this regard, the preprocessing component 926 may synthetically balance the respective images to account for differences between camera/image parameters.

In various embodiments, the preprocessing component 926 may determine whether and how to alter the 2D image based on camera/image parameters associated with the image (e.g., received with the image as metadata). For example, the preprocessing component 926 may identify differences between one or more camera/image parameters associated with the received 2D image and one or more defined camera/image parameters of the standard representation format. The preprocessing component 926 may further alter (e.g., edit, modify, etc.) one or more characteristics of the 2D image based on the differences. In some implementations, the one or more characteristics may include visual characteristics, and the preprocessing component 926 may change the one or more visual characteristics. The preprocessing component 926 may also change the orientation of the image, the size of the image, the shape of the image, the magnification level of the image, etc.

In some embodiments, the preprocessing component 926 may also use the position and/or orientation information regarding the relative position and/or orientation from which the input images were captured in order to rotate the input images so that the direction of motion between them is horizontal prior to input to the augmented neural network model. For these embodiments, the enhanced neural network model may be trained (e.g., included in the one or more enhanced models 810) using horizontal disparity cues to predict depth data (e.g., the 3D data 116 has been derived). The images pre-processed by the pre-processing component 926 may be characterized as pre-processed 2D image data 932 and used as input to one or more enhanced 3D self 2D models (e.g., one or more enhanced models 810) that are specially trained to evaluate such pre-processed images. In some implementations, the preprocessed images may also be used as inputs to one or more standard models 114 and/or panoramic models 514 to improve the accuracy of the results of those models. The pre-processed 2D image data 932 may also be stored in a memory (e.g., memory 122 or an external memory) for additional applications.

The semantic labeling component 928 may be configured to process the 2D image data 102 to determine semantic labels for features included in the image data. For example, the semantic labeling component 928 may be configured to employ one or more machine learning object recognition techniques to automatically recognize defined objects and functions (e.g., walls, floors, ceilings, windows, doors, furniture, people, buildings, etc.) included in the 2D image. Semantic labeling component 928 may further assign a label of the identified object to the identified object. In some implementations, the semantic labeling component 928 may also perform semantic segmentation and further identify and define the boundaries of identified objects in the 2D image. Semantic labels/boundaries associated with features included in the 2D image may be characterized as structured auxiliary data 930 and used to facilitate deriving depth data for the 2D image. In this regard, semantic tag/segmentation information associated with the 2D image may also be used with the 2D image as input to one or more enhanced 3D self-2D models (e.g., one or more enhanced models 810) to generate derived 3D data 116 for the 2D image, used by model generation component 118 to facilitate alignment procedures associated with 3D model generation, and/or stored in a memory (e.g., memory 122 or external memory) for additional applications.

Fig. 10 presents an exemplary computer-implemented method 1000 for employing auxiliary data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data, in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 1002, a system (e.g., system 800) operably coupled to a processor receives a 2D image. At 1004, the system receives (e.g., via receiving component 111) or determines (e.g., via assistance data component 806) assistance data for the 2D image, wherein the assistance data includes orientation information about a capture orientation of the two-dimensional image. At 1006, the system derives the 3D information for the 2D image using one or more neural network models (e.g., one or more augmented models 810) configured to infer three-dimensional information based on the two-dimensional image and the assistance data (e.g., using the 3D data derivation component 110).

Fig. 11 presents another exemplary computer-implemented method 1100 for employing auxiliary data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 1102, a system (e.g., system 800) operably coupled to a processor receives a captured 2D image of an object or environment, wherein the 2D image is associated based on providing different perspectives of the object or environment. At 1104, the system derives depth information for at least one of the relevant 2D images based on the relevant 2D images using the one or more neural network models (e.g., one or more augmented models 810) and the relevant 2D images as inputs to the one or more neural network models (e.g., via the 3D data derivation component 110). For example, the one or more neural network models may include a neural network model configured to evaluate/process more than one 2D image and use information about relationships between the respective 2D images to facilitate deriving depth data for some or all of the input images.

Fig. 12 presents another exemplary computer-implemented method 1000 for employing auxiliary data related to captured 2D image data to facilitate deriving 3D data from the captured 2D image data in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 1202, a system (e.g., system 800) operatively coupled to a processor receives a 2D image. At 1204, the system pre-processes the 2D, wherein the pre-processing includes changing one or more characteristics of the two-dimensional image to convert the image into a pre-processed image according to a standard representation format (e.g., via the pre-processing component 926). At 1206, the system derives the 3D information for the preprocessed 2D image using one or more neural network models configured to infer 3D information based on the preprocessed 2D image (e.g., using the 3D data derivation component 110).

Fig. 13 presents another example system 1300 that facilitates deriving 3D data from 2D image data and generating a reconstructed 3D model based on both the 3D data and the 2D image data, in accordance with various aspects and embodiments described herein. System 1300 includes the same or similar functionality as system 800 with the addition of optimized 3D data 1306. System 1300 also includes an upgraded 3D from 2D processing module 1304, which is different from 3D from 2D processing module 804, where a 3D data optimization component 1302 is added, which may be configured to generate optimized 3D data 1306. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

Referring to fig. 9 and 13, in various embodiments, the assistance data component output data 808 may include structured assistance data 930 that includes certain depth data associated with the 2D image. For example, in some implementations, the depth data may include 3D sensor data captured in association with the capture of a 2D image, and the 3D sensor data is associated with the 2D image. In other implementations, the depth data can include one or more depth measurements determined by the depth estimation component 916 for the 2D image (e.g., determined using SLAM, by motion recovery structures, photogrammetry, etc.). In some embodiments, this depth data (hereinafter "initial" depth data) may be used as input associated with the 2D image for one or more augmented models 810 to facilitate generation of the derived 3D data 116 for the 2D image.

However, in other embodiments, the initial depth data may be provided to the 3D data optimization component 1302 in addition to and/or instead of using the initial depth data as input to one or more 3D self 2D models included in the 3D self 2D model database 112. The 3D data optimization component 1302 may be configured to analyze 3D/depth data obtained from different sensors and/or depth derivation modalities, including the derived 3D depth data 116 and initial depth data values, in order to determine an optimized unified interpretation of the depth data, referred to herein and depicted in the system 1300 as optimized 3D data 1306. In particular, using different types of depth sensor devices and/or depth derivation techniques (e.g., including different types of 3D sensor depth data, passive stereo processing, active stereo processing, SLAM processing, photogrammetry processing, self-motion restoration structure processing, and 3D self-2D processing), 3D data optimization component 1302 may analyze the different types of depth data captured and/or determined for the 2D image to determine optimized 3D data for respective pixels, superpixels, features, etc. of the 2D image.

For example, in one implementation, the 3D data optimization component 1302 may be configured to combine different depth data values associated with the same pixels, superpixels, features, areas/regions, etc. of a 2D image. The 3D data optimization component 1302 can further employ heuristics to evaluate the quality of depth data generated separately using different modalities to determine a uniform interpretation of depth data for pixels, superpixels, features, regions/zones, etc. In another example, the 3D data optimization component 1302 may employ an average depth measurement for a respective pixel, super-pixel, feature, region/zone, etc. of the 2D image that averages the initial depth data and a corresponding depth measurement reflected in the derived 3D data 116. In some embodiments, 3D data optimization component 1302 can map the determined depth measurements (including derived 3D data 116, depth data received from 3D sensors, depth data determined using stereo processing, depth data determined using SLAM, depth data determined using photogrammetry, etc.) to corresponding pixels, superpixels, features, etc. of the image using different approaches. The 3D data optimization component 1302 can further combine the respective depth values to determine an optimal depth value for the respective pixel, superpixel, etc., that weights the different measurement values based on a defined weighting scheme. For example, the weighting scheme may take advantage of known advantages and disadvantages of the respective depth data sources to determine the accuracy associated with each applicable source, and combine the depth data from each applicable source in a principled manner to determine optimized depth information. In another implementation, the initial depth data may include partial depth data of a portion of the 2D image. In one implementation, the 3D data optimization component 1302 may be configured to use the initial depth data for the image portion associated therewith and to populate the missing depth data for the remainder of the 2D image with the derived 3D data 116 determined for the remainder of the image.

The

systems

100, 500, 800, and 1300 discussed above each describe an architecture in which 2D image data and, optionally, auxiliary data associated with the 2D image data are received and processed by a general purpose computing device (e.g., computing device 104) to generate derived depth data for a 2D image, generate a 3D reconstructed model, and/or facilitate navigation of the 3D reconstructed model. For example, a general purpose computing device may be or correspond to a server device, a client device, a virtual machine, a cloud computing device, and so forth. The

systems

100, 500, 800, and 1300 also include a user device 130 configured to receive and display the reconstructed model, and in some implementations interface with the navigation component 126 to facilitate navigation of the 3D model rendered at the user device 130. However,

systems

100, 500, 800, and 1300 are not limited to this architectural configuration. For example, in some embodiments, one or more features, functions, and associated components of computing device 104 may be provided at user device 130, and vice versa. In another embodiment, one or more features and functions of the computing device 104 may be provided at a capture device used to capture 2D image data. In yet another exemplary embodiment, the one or more cameras (or one or more camera lenses) used to capture 2D image data, the 3D self 2D processing module, the 3D model generation component 118, the navigation component 126, and the display 132 displaying the 3D model and a representation of the 3D model may all be disposed on the same device.

Fig. 14-25 present various exemplary devices and/or systems that provide different architectural configurations that can provide one or more features and functions of

systems

100, 500, 800, and/or 1300 (as well as additional systems described herein). In particular, according to various aspects and embodiments described herein, the various exemplary devices and/or systems shown in fig. 14-25 facilitate capturing a 2D image (e.g., 2D image data 102) of an object or environment, and deriving depth data from the 2D image using one or more 3D self-2D techniques, respectively.

In this regard, the respective devices and/or systems presented in fig. 14-25 may include at least one or more cameras 1404 configured to capture 2D images, and a 3D self-2D processing module 1406 configured to derive 3D data from the 2D images (e.g., one or more 2D images). The 3D-from-2D processing module 1406 may correspond to the 3D-from-2D processing module 106, the 3D-from-2D processing module 504, the 3D-from-2D processing module 804, the 3D-from-2D processing module 1304, or a combination thereof. In this regard, the 3D self-2D processing module 1406 is used to collectively represent a 3D self-2D processing module that may provide one or more features and functions (e.g., components) of any of the 3D self-2D processing modules described herein.

The one or more cameras 1404 may include, for example, RGB cameras, HDR cameras, video cameras, and the like. In some embodiments, the one or more cameras 1404 may include one or more cameras capable of generating panoramic images (e.g., panoramic image data 502). According to some embodiments, the one or more cameras 1404 may also include a video camera capable of capturing video (e.g., video data 902). In some implementations, the one or more cameras 1404 can include cameras that provide a relatively standard field of view (e.g., about 75 °). In other embodiments, one orThe plurality of cameras may include cameras that provide a relatively wide field of view (e.g., from 120 ° to 360 °), capture devices that use a conical mirror (e.g., capable of capturing 360 ° panoramic images from a single image capture), cameras that are capable of generating spherical color panoramic images (e.g., RICOH theta (tm)) and cameras that are capable of capturing spherical color panoramic images^TMA camera), etc.

In some embodiments, the devices and/or systems presented in fig. 14-25 may employ a single camera (or a single camera lens) to capture 2D input images. For these embodiments, the one or more cameras 1404 may represent a single camera (or camera lens). According to some of these embodiments, a single camera and/or a device housing the camera may be configured to rotate about an axis to generate images at different capture orientations relative to the environment, wherein the common field of view of the images spans horizontally up to 360 °. For example, in one implementation, the camera and/or the device housing the camera may be mounted on a rotatable mount that is rotatable 360 °, while the camera captures two or more images at different points of rotation whose common field of view spans 360 °. In another exemplary implementation, rather than using a rotatable mount, the camera and/or the device housing the camera may be configured to rotate 360 ° when placed on a flat plane using an internal mechanical drive mechanism (such as a wheel or vibrational force) of the camera and/or the device housing the camera. In another implementation, the one or more cameras 1404 employed by the devices and/or systems presented in fig. 14-25 may correspond to a single panoramic camera (or a camera that is rotatable to generate a panoramic image) that employs an actuation mechanism that allows the cameras to move up and down relative to the same vertical axis. Using this implementation, a single camera may capture two or more panoramic images that span different vertical fields of view but provide the same or similar horizontal fields of view. In some embodiments, the two or more panoramic images may be combined (e.g., by stitching 508 or at the capture device) to generate a single panoramic image having a wider vertical field of view than either image alone. In other embodiments, a single camera may capture two panoramic images with a vertical stereo offset such that the two panoramic images form a stereo image pair. For these embodiments, the stereoscopic panoramic image may be used directly as an input to a 3D self-2D neural network model and/or processed by depth estimation component 916 to derive depth data for one or both images using passive stereoscopic processing. This additional depth data may be used as auxiliary input data for a 3D self 2D neural network model (e.g., the augmented model 810).

In other embodiments, the devices and/or systems presented in fig. 14-25 may employ two or more cameras (or two or more camera lenses) to capture 2D input images. For these embodiments, the one or more cameras 1404 may represent two or more cameras (or camera lenses). In some of these embodiments, two or more cameras may be arranged on or in the same housing in a relative position to each other such that their common field of view spans up to 360 °. In some implementations of these embodiments, a camera pair (or lens pair) capable of generating a stereoscopic image pair (e.g., with slightly offset but partially overlapping fields of view) may be used. For example, a capture device (e.g., a device including one or more cameras 1404 for capturing 2D input images) may include two cameras with horizontally stereo offset fields of view that are capable of capturing stereo image pairs. In another example, the capture device may include two cameras with vertically stereo offset fields of view that are capable of capturing a vertical stereo image pair. According to any of these examples, each camera may have a field of view that spans up to 360 °. In this regard, in one embodiment, the capture device may employ two panoramic cameras with a vertical stereo offset capable of capturing pairs of panoramic images forming a stereo pair (with a vertical stereo offset). With these implementations of the capturable stereo image pair, the 3D self-2D processing module 1406 may be or include the 3D self-2D processing module 804 or 1304, and the auxiliary data component 806 may use stereo processing (e.g., via the depth estimation component 916) to derive initial depth data for the respective images included in the stereo image pair. As discussed above with reference to fig. 9 and 13, the initial depth data may be used as input to an enhanced model 3D from 2D model (selected from one or more enhanced data models 806) to facilitate deriving 3D data for either stereoscopic image included in a pair, used by 3D data optimization component 1302 to facilitate generating optimized 3D data 1306, and/or used by 3D model generation component 118 to facilitate generating a 3D model of an object or environment captured in the respective image.

The devices and/or systems described in fig. 14-25 may include machine-executable components embodied within a machine, such as in one or more computer-readable medium(s) associated with one or more machines. Such components, when executed by one or more computers (e.g., computers, computing devices, virtual machines, etc.), can cause the machine to perform the operations. In this regard, although not shown, the devices and/or systems described in fig. 14-25 may include or be operatively coupled to at least one memory and at least one processor. The at least one memory may further store computer-executable instructions/components that, when executed by the at least one processor, cause performance of operations defined by the computer-executable instructions/components. Examples of the memory and processes and other computing device hardware and software components that may be included in the described devices and/or systems are provided with reference to fig. 35. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

Referring to fig. 14, an example user device 1402 that facilitates capturing 2D images and deriving 3D data from the 2D images in accordance with various aspects and embodiments described herein is presented. In this regard, the user device 1402 may include one or more cameras 1404 for capturing 2D images and/or video, and a 3D self-2D processing module 1406 for deriving 3D data from the 2D images, as discussed above. The user device 1402 can also include a 3D model generation component 118 for generating a reconstructed 3D model based on the 3D data and the 2D image data, and a display/rendering component that facilitates rendering the 3D reconstructed model at the user device 1402 (e.g., via a device display). For example, the display/rendering component 1408 may include suitable hardware and/or software that facilitates accessing or otherwise receiving a 3D model and/or a representation of a 3D model (e.g., including a 3D floorplan model, a 2D floorplan model, a toy house view representation of a 3D model, etc.) and displaying them via a display of a user device (e.g., the display 132). In some embodiments, the user device 1402 may be or correspond to the user device 130. For example, the user device 1402 may be or include, but is not limited to: a desktop computer, a laptop computer, a mobile phone, a smartphone, a tablet PC, a PDA, a standalone digital camera, a HUD device, a virtual reality VR headset, an AR headset or device, or other type of wearable computing device.

In other embodiments, the user device 1402 may not include the 3D model generation component 118 and/or the display/rendering component 1408. For these embodiments, the user device 1402 may simply be configured to capture 2D images (e.g., 2D image data 102) via the one or more cameras 1404 and derive depth data for the 2D images (e.g., derived 3D data 116). The user device 1402 may further store the 2D image and its associated derived depth data (e.g., in a memory of the user device 1402, not shown), and/or provide the 2D image and its associated derived depth data to another device for use by the other device (e.g., to generate a 3D model or for another use context).

Fig. 15 presents another example user device 1502 that facilitates capturing a 2D image and deriving 3D data from the 2D image in accordance with various aspects and embodiments described herein. In this regard, the user device 1502 may include the same or similar features and functionality as the user device 1402. The user device 1502 differs from the user device 1402 in the addition of one or more 3D sensors 1504 and a positioning component 1506. In some embodiments, the user device 1502 may not include the positioning component 1506, but rather one or more 3D sensors, and vice versa. The user device 150 may further (optionally) include a navigation component 126 to provide on-board navigation of the 3D model generated by the 3D model navigation component 118 (in implementations in which the user device 1502 includes the 3D model generation component 118).

Referring to fig. 9 and 15, the one or more 3D sensors 1504 may include one or more 3D sensors or 3D capture devices configured to capture 3D/depth data in association with the capture of 2D images. For example, one or more 3D sensors 1504 may be configured to capture one or more of the various types of 3D sensor data 910 discussed with reference to fig. 9. In this regard, the one or more 3D sensors 1504 may include, but are not limited to: structured light sensors/devices, LiDAR sensors/devices, laser rangefinder sensors/devices, time-of-flight sensors/devices, light field-camera sensors/devices, active stereo sensors/devices, and the like. In one embodiment, the one or more cameras 1404 of the user device 1502 may include a camera that produces spherical color image data, and the one or more 3D sensors 1504 may include a structured light sensor (or another 3D sensor) configured to capture depth data for a portion of the spherical color image (e.g., near a middle portion of the vertical FOV or other equator). With this embodiment, the 3D self 2D processing module 1406 may be configured to employ a 3D self 2D neural network model (e.g., the augmented model 810) trained to acquire both spherical color image data and partial depth input and predict the depth of the entire sphere.

Similarly, the positioning component 1506 may include hardware and/or software configured to capture the capture device motion data 904 and/or capture device location data 906. For example, in the illustrated embodiment, the positioning component 1506 may include an IMU configured to generate capture device motion data 904 in association with capturing one or more images via one or more cameras 1404. The positioning component 1506 may also include a GPS unit configured to provide GPS coordinate information in association with image capture by one or more cameras. In some embodiments, the positioning component 1506 can associate motion data and position data of the user device 1502 with respective images captured via the one or more cameras 1404.

In various embodiments, the user equipment 1502 may provide one or more features and functions of the

systems

800 or 1300. In particular, via the inclusion of one or more 3D sensors 1504, user device 1502 may generate assistance data in the form of at least initial 3D depth sensor data associated with 2D images captured by one or more cameras 1404. This initial depth data may be used by the 3D self-2D processing module 1406 and/or the 3D model generation component 118 as described with reference to fig. 8 and 13. The user device 1502 may also capture additional assistance data and provide the additional assistance data to the 3D slave 2D processing module, including capture device motion data 904 and capture device position data 906.

Fig. 16 presents an exemplary system 1600 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. The system 1600 includes a capture device 1601 and a user device 1602. According to this embodiment, the separate capture device 1601 may include one or more cameras 1404 to capture 2D images (and/or video) of an object or environment. For example, the capture device 1601 may include a camera having one or more lenses disposed within a housing that is configured to hold (e.g., a standalone handheld camera, a standalone digital camera, a phone or smartphone that includes one or more cameras, a tablet PC that includes one or more cameras, etc.), is mounted on a tripod, is located on or within a robotic device, is located on or within a vehicle including an autonomous vehicle, is positioned in a fixed position relative to the environment (e.g., mounted to a wall or fixture), or another suitable configuration. The capture device 1601 can further provide the captured 2D image to the user device 1602 for further processing by a 3D self-2D processing module and/or a 3D model generation component 118 located at the user device 1602. In this regard, the capture device 1601 can include suitable hardware and software to facilitate communication with the user device 1602, and vice versa. In implementations in which the user device 1602 includes the 3D model generation component 118, the user device may also include a display/rendering component 1408 for receiving and displaying the 3D model (and/or a representation of the 3D model) at the user device.

According to this embodiment, the user device 1602 can include a receive/communication component 1604 to facilitate communicating with the capture device 1601 and to receive 2D images captured by the capture device (e.g., via one or more cameras). For example, the receiving/communicating component can facilitate wired and/or wireless communication between the user device 1602 and the capture device 1601, as well as between the user device 1602 and one or more additional devices (e.g., server devices, as discussed below)And/or wireless communication therebetween. For example, the receiving/communicating component 1604 can be or include various hardware and software devices associated with establishing and/or conducting wireless communications between the user device 1602 and external devices. For example, the receive/communication component 1604 may control operation of a transmitter-receiver or transceiver (not shown) of the user device to receive information (e.g., 2D image data) from the capture device 1601, provide information to the capture device 1601, and so on. The reception/communication component 1604 may facilitate wireless communication between the user device and an external device (e.g., the capture device 1601 and/or another device) using various wireless telemetry communication protocols. For example, the receiving/communicating component 1604 may communicate with external devices using communication protocols including, but not limited to: NFC-based protocol based on

Protocol of technology, based on

A Wi-Fi protocol, an RF-based communication protocol, an IP-based communication protocol, a cellular communication protocol, a UWB technology-based protocol, or other forms of communication (including proprietary and non-proprietary communication protocols).

Fig. 17 presents another example system 1700 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. Similar to system 1600, system 1700 includes a capture device 1701 including one or more cameras 1404 configured to capture 2D images (and/or video); and a user device 1702 comprising receiving/communicating means 1604, 3D self-2D processing module 1506 and (optionally) 3D model generating means 118 and display/rendering means 1408. In this regard, the system 1700 may provide the same or similar features as the system 1600.

System 1700 differs from system 1600 in that one or more 3D sensors 1504 and a positioning component 1506 are added to capture device 1701. The user device 1702 may also include a navigation component 126 to provide onboard navigation of the 3D model generated by the 3D model generation component 118. According to this embodiment, the capture device 1701 may capture at least some initial depth data (e.g., 3D sensor data 910) for respective images captured by the one or more cameras 1404. The capture device 1701 may also provide the captured 2D image and initial depth data associated therewith to the user device 1702. For example, in one implementation, the one or more cameras 1404 may be configured to capture and/or generate panoramic images of the environment having a relatively wide field of view that spans up to 360 ° (e.g., greater than 120 °) at least in a horizontal direction. The one or more 3D sensors 1504 may also include a 3D sensor configured to capture depth data of a portion of the panoramic image such that the 3D depth sensor has a smaller field of view of the environment relative to the panoramic 2D image. For these embodiments, the 3D self-2D processing module 1406 of the user device 1702 may include additional features and functionality of the

systems

800 or 1300 relating to using assistance data to enhance 3D self-2D prediction. In this regard, the 3D self-2D processing module 1406 may employ the initial depth data to enhance the 3D self-2D prediction to generate optimized 3D data 1306 by using the initial depth data as input to one or more enhancement models 810 and/or using the initial depth data in conjunction with the derived 3D data 116. For example, in implementations in which the initial depth data includes partial depth data of the panoramic image, the 3D self-2D processing module 1406 may use one or more 3D self-2D predictive models to derive depth data for the remaining portion of the panoramic image for which the initial depth data was not captured. In some implementations, the capture device 1701 may also generate and provide capture device motion data 904 and/or capture device position data 906 to the user device 1702 in association with the 2D image.

Fig. 18 presents another exemplary system 1800 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. Similar to system 1600, system 1800 includes a capture device 1801 including one or more cameras 1404 configured to capture 2D images (and/or video); and a user device 1802 configured to communicate with the capture device 1801 (e.g., using the receiving/communicating component 1604). System 1800 differs from system 1600 in that the location of the 3D self 2D processing module 1406 is at the capture device 1801, rather than at the user device 1802. In accordance with this embodiment, the capture device 1801 may be configured to capture 2D images (and/or video) of an object or environment, and further to use the 3D self-2D processing module 1406 to derive depth data (e.g., derived 3D data 116) for one or more images. The capture device 1801 may further provide the image and its associated derived depth data to the user device for further processing. For example, in the illustrated embodiment, the user device 1802 may include the 3D model generation component 118 to generate one or more 3D models (and/or representations of the 3D models) based on the received imaging data and the derived depth data associated therewith. The user device 1802 may also include a display/rendering component 1408 to render the 3D model and/or a representation of the 3D model at the user device 1802.

Fig. 19 presents another exemplary system 1900 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. Similar to system 1700, system 1900 includes a capture device 1901 that includes one or more cameras 1404 configured to capture 2D images (and/or video); and a user device 1902 configured to communicate with the capture device (e.g., using receiving/communicating component 1604). Also similar to system 1700, capture device 1901 can include one or more 3D sensors 1504 and a positioning component 1506, and user device 1902 can include a 3D model generation component 118, a display/rendering component 1408, and a navigation component 126. System 1900 differs from system 1700 in that the location of the 3D self 2D processing module 1406 is at the capture device 1901, rather than at the user device 1902.

According to this embodiment, the capture device 1901 may be configured to capture 2D images (and/or video) of an object or environment as well as auxiliary data, including 3D sensor data 910, capture device motion data 904, and/or capture device location data 906. The capture device 1901 may further use a 3D self-2D processing module 1406 to derive depth data (e.g., derived 3D data 116) for one or more captured 2D images, where the 3D self-2D processing module corresponds to the 3D self-2D processing module 804 or 1304 (e.g., and is configured to use auxiliary data with the 2D images to facilitate depth data derivation/optimization). Capture device 1901 may further provide the image and its associated derived depth data to user device 1902 for further processing and use by navigation component 126. In some embodiments, the capture device 1901 may also provide assistance data to the user device 1902 in order to facilitate an alignment process associated with: the 3D model is generated by the 3D model generation component 118 based on the image data and its associated derived depth data.

Fig. 20 presents another exemplary system 2000 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. Unlike

previous systems

1600, 1700, 1800, and 1900, which distributed various components between capture devices and user devices, system 2000 distributed components between user device 2002 and server device 2003. In the illustrated embodiment, user device 2002 may include one or more cameras 1404, 3D self-2D processing modules 1406, display/rendering components 1408, and receiving/communication components 1604. The server device 2003 may include a 3D model generation component 118 and a navigation component 126.

According to this embodiment, user device 2002 is operable as a capture device and captures at least a 2D image (e.g., 2D image data 102) of an object or environment using one or more cameras 1404. For example, user device 2002 may include a tablet PC, a smart phone, a stand-alone digital camera, a HUD device, an AR device, and so forth, having a single camera, a single camera having two lenses that may capture a stereoscopic image pair, a single camera having two lenses that may capture a 2D image having a wide field of view, two or more cameras, and so forth. The user device 2002 may also include a device capable of capturing and/or generating (e.g., via stitching component 508 of stitching 3D from 2D processing module 1406) panoramic images (e.g., images having a field of view greater than a minimum threshold and up to 360 °). User device 2002 may further execute 3D self-2D processing module 1406 to derive 3D/depth data for respective images captured via one or more cameras 1404 according to one or more of the various techniques described with reference to 3D self-2D processing module 106, 3D self-2D processing module 504, 3D self-2D processing module 804, and 3D self-2D processing module 1304.

User device 2002 and server device 2003 may be configured to operate in a server-client relationship, where server device 2003 provides services and information to user device 2002, including various 3D modeling services provided by 3D model generation component 118, and navigation services provided by navigation component 126 that facilitate navigating the 3D model displayed at user device 2002. The respective devices may communicate with each other via one or more wireless communication networks (e.g., cellular networks, the internet, etc.). For example, in the illustrated embodiment, the server device 2003 may also include a receive/communication component 2004, which may include suitable hardware and/or software to facilitate wireless communication with the user device 2002. In this regard, the receiving/communicating component 2004 can include the same or similar features and functionality as the receiving/communicating component 1604. In some implementations, the server device 2003 may operate as a Web server, an application server, a cloud-based server, etc. to provide 3D modeling and navigation services to the user device 2002 via a website, a Web application, a thin client application, a hybrid application, or another suitable network accessible platform.

In one or more implementations, the user device 2002 may be configured to capture 2D images via the one or more cameras 1404, derive depth data for the 2D images, and provide the captured 2D images and their associated derived depth data to the server device 2003 (e.g., transmit, send, transmit, etc.) for further processing by the 3D model generation component 118 and/or the navigation component 126. For example, using the 3D model generation component 118, the server device 2003 may generate a 3D model of an object or environment included in the received 2D image according to the techniques described herein with reference to fig. 1. Server device 2003 may further provide (e.g., transmit, send, transfer, stream, etc.) the 3D model (or 2D model, such as a 2D floorplan model) to user device 2002 for rendering via a display at user device 2002 (e.g., using display/rendering component 1408).

In some embodiments, server device 2003 may generate and provide one or more intermediate versions of the 3D model to user device 2002 based on the image data and associated depth data that have been received so far during the scanning process. These intermediate versions may include 3D reconstructions, 3D images, or 3D models. For example, during a scanning process, wherein the user device is positioned at different locations and/or orientations relative to the environment so as to capture different images at different perspectives of the environment, the receiving/communicating component 1604 may be configured to send the respective images and associated derived 3D data to the server device 2003 as they are captured (and processed by the 3D self-2D processing module 1406 to derive the 3D data). In this regard, as described with reference to system 100 and illustrated with reference to 3D model 200 shown in fig. 2, display/rendering component 1408 may receive and display intermediate versions of the 3D model to facilitate guiding the user during the capture process to determine where to position the camera to capture additional image data that the user wishes to reflect in the final version of the 3D model. For example, based on viewing the intermediate 3D reconstruction generated based on the 2D image data captured so far, the entity (e.g., a user or computing device) controlling the capture process may determine which portions or regions of the object or environment have not been captured and excluded from the intermediate version. The entity may also identify regions of the object or environment associated with the poor image data or the mis-aligned image data. The entity may also position one or more cameras 1404 to capture additional 2D images of the object or environment based on the missing or misaligned data. In some implementations, when the entity controlling the capture process is satisfied with the last rendered intermediate 3D reconstruction or otherwise determines that collection of the 2D image captured in association with the scan is complete, user device 2002 can send a confirmation message to server device 2003 confirming that the scan is complete. Based on receipt of the confirmation message, server device 2003 may generate a final version of the 3D model based on the complete set of 2D images (and associated 3D data).

Additionally, in some embodiments, after generating (or partially generating) the 3D model, server device 2003 may use the features and functionality of navigation component 126 to facilitate navigating the 3D model displayed at the user device. In various implementations, the intermediate 3D reconstructions discussed herein may represent a "draft" version of the final navigable 3D reconstruction. For example, the intermediate version may have a lower image quality relative to the final version, and/or may be generated using a less precise alignment process relative to the final version. In some implementations, unlike the final 3D representation, the intermediate version may include an un-navigable static or 3D reconstruction.

Fig. 21 presents another example system 2100 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. The system 2100 may include the same or similar features and functionality as the system 2000, except for the location of the 3D self-2D processing module 1406. In this regard, the 3D self 2D processing module 1406 may be disposed on the server device 2103, rather than on the user device 2102. In accordance with the system 2100, the user device 1202 can include one or more cameras 1404 configured to capture 2D images (and/or video). The user device 2102 may further provide the captured 2D image to the server device 2103 for further processing by the server device 2103 using 3D from the 2D processing module 1406, the 3D model generation component 118, and/or the navigation component 126. In this regard, the intermediate version may be generated and rendered with relatively little processing time, thereby enabling a real-time (or substantially real-time) 3D reconstruction process that provides a continuously updated, coarse 3D version of the scene during the capture process.

According to embodiments in which the 3D self-2D processing module 1406 is provided at a server device (e.g., server device 2103 or another server device described herein) in a manner similar to the techniques discussed above, the server device 2103 may also generate and provide to the user device 2102 a generated intermediate 3D reconstruction of an object or environment included in the received 2D image (e.g., captured in association with the scan). However, unlike the techniques described with reference to fig. 20, the server device 2103 may derive depth data for the received 2D image instead of the user device 2102. For example, the user device 2102 can capture a 2D image of an object or environment using one or more cameras 1404 and send (e.g., using the receiving/communicating component 1604) the 2D image to the server device 2103. Based on the receipt of the 2D image, the server device 2103 may employ the 3D self-2D processing module 1406 to derive 3D data for the 2D image and generate an intermediate 3D reconstruction of the object or environment using the 2D image and the 3D data. The server device 2103 may also send the intermediate 3D reconstruction to the user device 2102 for rendering at the user device 2102 as a preview for facilitating the capture process.

Once the user device 2102 notifies the server device 2103 (e.g., using a completion acknowledgement message, etc.) of the scan completion, the server device 2103 may further perform additional (and in some implementations more complex processing techniques) to generate a final 3D model of the environment. In some implementations, the additional processing may include using additional depth derivation and/or depth data optimization techniques (e.g., provided by panorama component 506, auxiliary data component 806, and/or 3D data optimization component 1302) to generate more accurate depth data for the 2D image for use by 3D modeling generation component 118. For example, in one exemplary implementation, the server device 2103 may employ a first 2D self-2D neural network model (e.g., the standard model 114) to derive first depth data for the received 2D image, and use this first depth data to generate one or more intermediate 3D reconstructions. Upon receiving the complete set of 2D image data, server device 2103 may then use the techniques provided by panorama component 506, auxiliary data component 806, and/or 3D data optimization component 1302 in order to derive more accurate depth data for the 2D images in the complete set. The server device 2103 may further employ this more accurate depth data in order to generate a final 3D model of the object or environment using the 3D model generation component 118.

Fig. 22 presents another exemplary system 2200 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. System 2200 can include the same or similar features and functionality as system 2000, wherein one or more 3D sensors 1504 and a positioning component 1506 are added to user device 2202. According to the system 2200, the user device may capture assistance data, including 3D sensor data 910, capture device motion data 904, and/or capture device location data 906 associated with one or more 2D images captured via one or more cameras. In some implementations, depending on the features and functionality of the

systems

800 and 1300, the 3D self-2D processing module 1406 may be configured to employ the auxiliary data to facilitate generating the derived 3D data 116 and/or optimized 3D data 1306 for the 2D image. The user device 2202 can also determine images, associate with respective images, and/or employ other types of assistance data (e.g., camera/image parameters 908) discussed herein, facilitating generation of the derived 3D data 116 by the 3D self-2D processing module 1406 according to techniques described with reference to the assistance data component 806 and the 3D self-2D processing module 804. The user device 2202 may further provide the 2D image and its associated depth data (e.g., the derived 3D data 116 or the optimized 3D data 1306). In some implementations, user device 2202 can also provide assistance data to server device 2003 to facilitate 3D model generation by 3D model generation component 118 and/or navigation by navigation component 126. In other implementations, rather than using the assistance data to facilitate 3D from 2D depth derivation by the 3D from 2D processing module 1406, the user device may alternatively provide the assistance data to the server device 2003 for use by the 3D model generation component 118 and/or the navigation component 126.

Fig. 23 presents another exemplary system 2300 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. System 2300 can include the same or similar features and functionality as system 2200, except for the location of the 3D self 2D processing module. In accordance with system 2300, server device 2103 can include a 3D self 2D processing module 1406 (e.g., in the same or similar manner as described with reference to system 2100). User device 2303 may include one or more 3D capture devices, one or more cameras 1404, a positioning component 1506, and a receiving/communication component 1604. According to this embodiment, the user device 2302 may capture the 2D image and associated assistance data and further transmit the image and its associated assistance data to the service device for further processing by the 3D self-2D processing module 1406, the 3D model generation component 118, and/or the navigation component 126.

Fig. 24 presents another exemplary system 2400 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. System 2400 can include the same or similar features as previous systems disclosed herein. However, system 2400 distributes the various components of the previous systems disclosed herein among capture device 2401, user device 2402, and server device 2003 (previously described with reference to fig. 20). With system 2400, capture device 2401 can include one or more cameras 1404 for capturing 2D image data. For example, in one implementation, the capture device 2401 may be moved to different positions and orientations relative to the object or environment to capture different images of the object or environment from different perspectives. In some implementations, the different images may include one or more panoramic images (e.g., having a field of view 360 ° horizontal or between 120 ° and 360 ° horizontal) generated using one or more techniques described herein. The capture device 2401 may also provide the captured images to the user device 2402, where upon receipt of the images, the user device 2402 may employ the 3D self-2D processing module 1406 to derive 3D/depth data for the respective images using various techniques described herein. According to this embodiment, the user device 2402 may further (optionally) provide the 2D images and the 3D/depth data associated therewith to the server device 2003 for further processing by the 3D model generation component 118 in order to generate a 3D model of the object or environment (e.g., by aligning them with each other using the derived depth data associated with the 2D images). In some implementations, as discussed with reference to fig. 20 and 21, the server device 2003 may further provide one or more intermediate versions of the 3D model to the user device for rendering at the user device 2402 (e.g., using the display/rendering component 1408). These intermediate versions of the 3D model may provide a preview of the reconstruction spatial alignment in order to facilitate guidance or control of the entity operating the capture device 2401 through the capture process (e.g., knowing where to place the camera to obtain additional images). In this regard, once the user has captured image data of as many objects or environments as they want, the 3D model generation component may further optimize the alignment approach to create a refined 3D reconstruction of the environment. The final 3D reconstruction can be provided to the user device for viewing and navigation as an interactive space (as facilitated by navigation component 126). In various implementations, the intermediate version may represent a "draft" version of the final 3D reconstruction. For example, the intermediate version may have a lower image quality relative to the final version, and/or may be generated using a less precise alignment process relative to the final version. In some implementations, unlike the final 3D representation, the intermediate version may include an un-navigable static or 3D reconstruction.

Fig. 25 presents another example system 2500 that facilitates capturing 2D image data, deriving 3D data from the 2D image data, generating a reconstructed 3D model based on the 3D data and the 2D image data, and navigating the reconstructed 3D model, in accordance with various aspects and embodiments described herein. System 2500 may include the same or similar features and functionality as system 2400, where the location of 3D self-2D processing module 1406 (previously described with reference to fig. 21) at server device 2103 is modified and one or more 3D sensors 1504 and a positioning component 1506 are added to capture device 1701. In this regard, the user device 2502 may include only the receive/communication component 1604 to facilitate relaying information between the capture device 1701 and the server device 2103. For example, the user device 2502 may be configured to receive 2D images and/or associated native assistance data from the capture device 1701 and send the 2D images and/or associated native assistance data to a server device for processing by the 3D self-2D processing module 1406 and the optional 3D model generation component 118. The server device 2103 may also provide the user device 2502 with 3D models and/or representations of 3D models generated based on 2D images and/or assistance data.

In another implementation of this embodiment, the server device 2103 may provide a cloud-based, web-based, thin-client application-based, etc. service in which a user may select and upload images already stored at the user device 2502 to the server device 2103. The server service 2103 may then automatically align the images in 3D and create a 3D reconstruction using 3D self-2D techniques described herein.

Fig. 26 presents an exemplary computer-implemented method 2600 that facilitates capturing 2D image data and deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 2602, a device (e.g., one or more capture devices described with reference to fig. 14-25) operatively coupled to the processor captures a 2D image of the object or environment (e.g., using one or more cameras 1404). At 2704, the device employs one or more 3D self-2D neural network models to derive 3D data for the 2D image (e.g., using 3D self-2D processing module 1406).

Fig. 27 presents another exemplary computer-implemented method 2700 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 2702, a device (e.g., one or more capture devices, user devices, or server devices described with reference to fig. 14-25) operatively coupled to the processor receives or captures a 2D image of an object or environment. At 2704, the device employs one or more 3D self-2D neural network models to derive 3D data for the 2D image (e.g., using 3D self-2D processing module 1406). At 2706, the device aligns the 2D image based on the 3D data to generate a 3D model of the object or environment, or the device transmits the 2D image and the 3D data to an external device (e.g., one or more server devices described with reference to fig. 20-25) via a network, wherein the external device generates the 3D model of the object or environment based on the transmission.

Fig. 28 presents another exemplary computer-implemented method 2800 that facilitates capturing 2D image data and deriving 3D data from the 2D image data in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 2802, a device (e.g., one or more user devices or server devices described with reference to fig. 14-25) operably coupled to the processor receives 2D images of the object or environment captured from different perspectives of the object or environment, wherein the device also receives derived depth data for respective ones of the 2D images derived using one or more 3D self-2D neural network models (e.g., using the 3D data derivation component 110). At 2804, the device aligns the 2D images with one another based on the depth data to generate 3D of the object or environment (e.g., via the 3D model generation component 118).

Fig. 29 presents another exemplary computer-implemented method 2900 that facilitates capturing 2D image data and deriving 3D data from the 2D image data, in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 2902, a device including a processor (e.g., user device 2102, user device 2302, user device 2502, etc.) captures a 2D image of an object or environment (e.g., using one or more cameras 1404). At 2904, the device sends the 2D image to a server device (e.g., server device 2103), where upon receipt of the 2D image, the server device employs one or more 3D from 2D neural network models to derive 3D data for the 2D image (e.g., using 3D from 2D processing module 1406), and generates a 3D reconstruction of the object or environment using the 2D image and the 3D data (e.g., using 3D model generating component 118). At 2906, the device further receives a 3D reconstruction from the server device, and at 2908 the device renders the 3D reconstruction via a display of the device.

In one or more embodiments, a device may capture 2D images from different perspectives of an object or environment in association with an image scan of the object or environment. For these embodiments, the device may further send a confirmation message to the remote device confirming completion of the image scan. In this regard, the 3D reconstruction may include a first or initial 3D reconstruction, and wherein the remote device may generate a second (or final) 3D reconstruction of the object or environment based on receipt of the confirmation message. For example, in some implementations, the second 3D reconstruction has a higher level of image quality relative to the first three-dimensional reconstruction. In another exemplary implementation, the second 3D reconstruction includes a navigable model of the environment, and wherein the first 3D reconstruction is non-navigable. In another exemplary implementation, the second 3D reconstruction is generated using a more precise alignment process than the alignment process used to generate the first 3D reconstruction.

Fig. 30 presents an exemplary system 3000 that facilitates associating with an Augmented Reality (AR) application using one or more 3D self-2D techniques in accordance with various aspects and embodiments described herein. System 3000 includes at least some features (e.g., one or more cameras 1404, 3D from 2D processing module 1406, receiving/communicating component 1604, and displaying/rendering component 1408) that are the same as or similar to previous systems disclosed herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

In the illustrated embodiment, the system 100 includes a user device 3002 having one or more cameras 1404 configured to capture 2D image data (e.g., including panoramic images and video) of an object or environment, and a 3D self-2D processing module 1406 configured to derive depth data for one or more 2D images included in the 2D image data. As described above, the 3D-from-2D processing module 1406 may be or correspond to the 3D-from-2D processing module 106, the 3D-from-2D processing module 504, the 3D-from-2D processing module 804, or the 3D-from-2D processing module 1304. Although not shown, in some embodiments, the user device may also include one or more 3D sensors 1504, a positioning component 1506, and/or one or more additional hardware and/or software components that facilitate generating native assistance data 802 to facilitate deriving depth data for the captured image according to various techniques described herein with reference to fig. 8, 9, and 13. The user device also includes an AR component 3004, a receiving/communication component 1604, and a display/rendering component 1408. The user device 3002 may include various types of computing devices, including one or more cameras on or within a housing configured to capture 2D image data of an environment, and a display/rendering component 1408 including hardware and/or software that facilitates rendering digital objects on or within a representation of the environment in a hologram or the like via a display of the user device 3002. For example, in some embodiments, the user device 3002 may include an AR headset configured to be worn by a user and including a display (e.g., a transparent glass display) located in front of the user's eyes (e.g., glasses, goggles, HUD, etc.). In another embodiment, the user device may be or include a mobile handheld device, such as a mobile phone or smart phone, tablet PC, or similar device. In still other embodiments, the user device 3002 may comprise a device (such as a laptop PC, desktop PC, etc.) that may be positioned in a relatively fixed location relative to the environment.

The user device 3002 may include or be operatively coupled to at least one memory 3020 and at least one processor 3024. The at least one memory 122 may further store computer-executable instructions (e.g., one or more software elements of the 3D self-2D processing module 1406, the AR component 3004, the receiving/communication component 1604, and/or one or more software elements of the display/rendering component 1408) that, when executed by the at least one processor 3024, cause performance of operations defined by the computer-executable instructions. In some embodiments, memory 122 may also store information received, generated, and/or employed by the computing device. For example, in the illustrated embodiment, the memory 3020 may store one or more AR data objects 3022 that may be used by the AR component 3004. The memory 3020 may also store information including, but not limited to, captured image data and depth information derived for the captured image data, the received 2D image data 102, the derived 3D data 116, and the 3D model and alignment data 128. The user device 3002 may also include a device bus 120 that communicatively couples various components of the user device. Examples of such a processor 3024 and memory 3020, as well as other suitable computers or computer-based elements that may be used in conjunction with implementing one or more of the systems or components illustrated and described in connection with fig. 30 or other figures disclosed herein, may be found with reference to fig. 35.

System 3000 can also include server device 3003. The server device 3003 may provide information and/or services to the user device 3002 that facilitate one or more features and functions of the AR component 3004. In this regard, the AR component 3004 may be or correspond to an AR application that provides one or more AR features and functionality related to integrating virtual digital data objects on or within a real-time view of an environment. For example, in embodiments where the user device 3002 comprises a wearable device configured to be worn by a user and includes a transparent display (e.g., glasses, goggles, or other forms of glasses) that is positioned in front of the user's eyes when worn, the real-time view of the environment may include an actual view of the environment that is currently being viewed through the transparent display. With this embodiment, a digital data object may be rendered on a glass display whose appearance and position cause the digital data object to align with a real-time view of the environment. In another example, the user device may include a tablet PC, smartphone, or the like having a display configured to render real-time image data (e.g., video) of the environment captured via a forward-facing camera of the device. According to this exemplary embodiment, the digital data object may be rendered as overlay data onto real-time image data (e.g., snapshots and/or video) rendered on a device display.

The type of digital data object that can be integrated on or within a real-time view of an environment may vary and is referred to herein as an AR data object (e.g., AR data object 3022). For example, AR data object 3022 may include a 3D or 2D graphical image or data model of an object or person. In another example, AR data object 3022 may include icons, text, symbolic labels, hyperlinks, and the like that may be visually displayed and interacted with. In another example, AR data object 3022 may include data objects that are not visually displayed (or initially visually displayed) but may interact with and/or be activated in response to a trigger (e.g., a user pointing, viewing along a user's line of sight, a gesture, etc.). For example, in one embodiment involving viewing or pointing to an actual object (e.g., a building) that appears in the environment, auxiliary data associated with the building may be rendered, such as text overlays identifying the building, video data, sound data, graphical image data corresponding to objects or things that appear from an open window of the building, and so forth. In this regard, the AR data object 3022 may include various types of auxiliary data sets. For example, AR data object 3022 may include markers or tags that identify objects or locations captured in image data (e.g., real-time video and/or snapshots) by one or more cameras 1404. These markings may be made manually or automatically (via image or object recognition algorithms) during the current or previous capture of the environment, or previously generated and associated with known objects or environmental locations being viewed. In another example, AR data object 3022 may include an image or 3D object having a predefined association with one or more actual objects or locations or things included in the current environment. In yet another example, AR data object 3022 may include a video data object, an audio data object, a hyperlink, and the like.

The AR component 3004 may employ 3D/depth data derived by the 3D self-2D processing module 1406 from real-time 2D image data (e.g., snapshots or video frames) of an object or environment captured via one or more cameras 1404 to facilitate various AR applications. In particular, the AR component 3004 may employ 3D self-2D techniques described herein to facilitate enhancing various AR applications with more accurate and photo-level integration of AR data objects as an overlay onto a real-time view of the environment. In this regard, according to various embodiments, the one or more cameras 1404 may capture real-time image data of the environment that corresponds to a current perspective of a view of the environment viewed on or through a display of the user device 3002. The 3D self-2D processing module 1406 may also derive depth data from the image data in real time or substantially real time. For example, in implementations where a user walks around in an open house for a potential purchase while wearing or holding the user device 3002 such that at least one of the one or more cameras 1404 of the user device 3002 captures image data corresponding to a current perspective of the user, the 3D self-2D processing module 1406 may derive depth data from the image data that corresponds to the actual 3D location (e.g., depth/distance) of the user relative to a physical structure of the house (e.g., a wall, ceiling, kitchen, appliance, opening, door, window, etc.). The AR component 3004 may use this depth data to facilitate integration of one or more AR data objects on or within a real-time view of the environment.

In the illustrated embodiment, the AR component 3004 may include a spatial alignment component 3006, an integration component 3008, an occlusion mapping component 3010, as well as an AR data object interaction component 3012, an AR data object generation component 3014, and a 3D model localization component 3016.

The spatial alignment component 3006 may be configured to determine a location for integrating an AR data object on or within a representation of an object or environment corresponding to a current perspective of the object environment viewed by the user based on the derived depth/3D data. The integration component 3008 may integrate the AR data object on or within a representation of the object or environment at the location. For example, the integration component 3008 may render an auxiliary data object on the display having a size, shape, orientation, and position that aligns the auxiliary data object with the real-time view of the environment at the determined location. In this regard, if the display is a transparent display, the integration component 3008 may render the AR data object on the glass of the transparent display at a location on the display and with a size, shape, and/or orientation that aligns the AR data object with the determined location in the environment. The integration component 3008 may also determine an appropriate location, size, shape, and/or orientation for the AR data object based on the relative position of the user's eyes to the display and the type of AR data object. In other implementations where the representation of the environment includes image data captured from the environment and rendered on a display, the integration component 3008 may render the AR data object as an overlay on the image data having a size, shape, and/or orientation that aligns the AR data object with a determined location in the environment.

For example, based on depth data indicating a relative 3D position of a user with respect to actual objects, things, people, etc. included in an environment (such as walls, appliances, windows, etc.), the spatial alignment component 3006 may determine a location for integrating AR data objects that spatially aligns the AR data objects with the walls, appliances, windows, etc. For example, in one implementation, based on the user's known relative position to the actual wall, appliance, window, etc. (as determined based on the derived depth data), the spatial alignment component 3006 may determine an assumed 3D position and orientation of the AR data object relative to the actual wall, appliance, window, etc. The integration component 3008 may further use this assumed 3D position and orientation of the AR data object to determine a location for overlaying the data object onto a real-time representation of the environment on or viewed through the display that spatially aligns the data object at the assumed 3D position with an appropriately scaled size and shape (e.g., based on what the data object is).

The occlusion mapping component 3010 may facilitate accurate integration of AR data objects into a real-time view of the environment, taking into account the relative positions of objects in the environment to each other and to the user's current viewpoint based on the derived 3D/depth data. In this regard, the occlusion mapping component 3010 may be configured to determine a relative position of the AR data object with respect to another object included in a real-time representation of the object or environment viewed on or through the display based on a current perspective of the user and the derived 3D/depth data. For example, the occlusion mapping component 3010 may ensure that if an AR object is placed in front of an actual object that appears in the environment, the portion of the AR data object that is in front of the actual data object occludes the actual data object. Likewise, if an AR object is placed behind an actual object present in the environment, the portion of the AR data object located behind the actual data object is occluded by the actual data object. Thus, relative to the user's current location and viewpoint for the respective object, thing, etc., the occlusion mapping component 3010 can employ the derived 3D/depth data of the respective object, thing, etc., in the environment to ensure a correct occlusion mapping or virtual object relative to the actual object (e.g., draw the virtual object behind the real object that is closer than them).

The AR data object interaction component 3012 may employ the derived 3D/depth data of the environment based on the viewer's current location and perspective to the environment in order to facilitate user interaction with virtual AR data objects spatially integrated with the environment through the spatial alignment component 3006 and the integration component 3008. In this regard, the AR data object interaction component 3012 may directly employ the derived 3D/depth data by having the virtual object interact with its environment in a more realistic manner or be constrained by the environment.

The AR data object generation component 3014 may provide for generating a 3D virtual data object for use by the AR component 3004. For example, in one or more embodiments, the AR data object generation component 3014 may be configured to extract object image data of an object included in the 2D image. For example, using the features and functionality of the cropping component 510 discussed below and substantially any 2D image (including objects that may be segmented from the image), the AR data object generation component 3014 may crop, segment, or otherwise extract a 2D representation of the object from the image. The AR data object generation component 3014 may further employ 3D data (i.e., object image data) derived by the 3D self-2D processing module 1406 for and associated with the extracted 2D object to generate a 3D representation or model of the object. In various embodiments, the spatial alignment component may be further configured to determine a location for integrating a 3D representation or model of the object (i.e., object image data) on or within the real-time representation of the object based on the object three-dimensional data.

In some embodiments, a real-time environment viewed and/or interacted with by a user using AR (e.g., using features and functionality of user device 3002) may be associated with a previously generated 3D model of the environment. The previously generated 3D model of the environment may also include or otherwise be associated with information identifying a defined position and/or orientation of the AR data object relative to the 3D model. For example, the 3D model generated by the 3D model generation component 118 can be associated with markers at various defined locations relative to the 3D model that identify objects (e.g., appliances, furniture, walls, buildings, etc.), provide information about the objects, provide hyperlinks to applications associated with the objects, and so forth. Other types of AR data objects that may be associated with a previously generated 3D model of an object or environment may include, but are not limited to:

a marker or tag identifying the captured object or location; these markings may be made manually or automatically (via image or object recognition algorithms) during current or previous capture of the environment or via a user manipulating an external tool that has captured the 3D data.

Images or 3D objects added at specific locations with respect to a previous 3D capture of the same environment; for example, an upholstery person or other user may capture a 3D environment, import the 3D environment into a 3D design program, make changes and additions to the 3D environment, and then use a 3D reconstruction system to see how these changes and additions will appear in the environment.

Previously captured 3D data from the same object or environment; in this case, a difference between the previous 3D data and the current 3D data may be highlighted.

A 3D CAD model of the captured object or environment; in this case, differences between the CAD model and the current 3D data may be highlighted, which is useful for finding defects in the manufacturing or construction or installing incorrect items.

Data captured by additional sensors during the current or previous 3D capture process.

AR data objects (e.g., markers, the above, etc.) that have been previously associated with a defined position relative to a 3D model of the object or environment are referred to herein as aligned AR data objects. In the illustrated embodiment, such previously generated 3D models and associated aligned AR data objects of the environment may be stored in a network-accessible 3D spatial model database 3026 as 3D spatial model 3028 and aligned AR data object 3030, respectively. In the illustrated embodiment, the 3D spatial model database 3026 may be provided by the server device 3003 and accessed by the AR component 3004 via one or more networks (e.g., using the receiving/communicating component 1604). In other embodiments, the 3D spatial model database 3026 and/or some of the information provided by the 3D spatial model database 3026 may be stored locally at the user device 3002.

In accordance with these embodiments, the 3D model localization component 3016 may provide a previously generated 3D model of the usage environment and aligned AR data objects (e.g., markers and other AR data objects discussed herein) to facilitate integrating the aligned AR data objects 3030 with a real-time view of the environment. In particular, the 3D model localization component 3016 can employ the derived 3D/depth data determined for the current perspective of the environment from the current position and orientation of the user device 3002 to "localize" the user device relative to the 3D model. In this regard, based on the derived 3D data indicating the location of the user device with respect to the corresponding objects in the environment, the 3D model localization component 3016 can determine the relative location and orientation of the user with respect to the 3D model (as if the user were actually standing in the 3D model). The 3D model positioning component 3016 may further identify AR data objects associated with defined locations in the 3D model that are within the user's current perspective relative to the 3D model and the actual environment. The 3D model localization component 3016 may also determine how to spatially align the AR data objects with the real-time/live view of the environment based on how the auxiliary data objects are aligned with the 3D model and the relative position of the user to the 3D model.

For example, assume that there is a scenario in which a 3D space model of a house or building was previously generated, and various objects included in the 3D space model are associated with markers, such as markers associated with an electrical panel, that indicate respective functions describing different circuits on the electrical panel. Now imagine that the user operating the user device 3002 is on site and personally viewing the premises, and has a current view of the real electrical panel (e.g., viewing through a transparent display). The AR component 3004 can provide overlay marking data that is aligned with the actual electrical panel when viewed in real time through the transparent display. In order to accurately align the marking data with the electrical panel, the user device 3002 needs to position itself through the previously generated 3D model. The 3D model positioning component 3016 may perform this positioning using derived 3D/depth data determined from real-time images of the environment corresponding to real-time perspectives of the electrical panels. For example, the 3D model positioning component 3016 may use the derived depth information corresponding to the actual position/orientation of the user relative to the electrical panel to determine the relative position/orientation of the user relative to the electrical panel in the 3D model. Using the relative position/orientation and the actual position/orientation, the 3D spatial alignment component 3006 can determine how to position the marking data as an overlay onto a transparent display that aligns the marking data with the actual view of the electrical panel.

Fig. 31 presents an exemplary computer-implemented method 3100 for associating with an AR application using one or more 3D self-2D techniques in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 3102, a device (e.g., user device 3002) operatively coupled to the processor employs one or more 3D self-2D neural network models to derive 3D data from one or more 2D images of the object or environment (e.g., using 3D self-2D processing module 1406) captured from a current perspective of the object or environment viewed on or through a display of the device. At 3104, the device determines a location for integrating graphical data objects on or within the display (e.g., using the spatial alignment component 3006 and/or the 3D model localization component 3016) on or through the representation of the object or environment viewed by the display based on the current perspective and the three-dimensional data. At 3106, the device integrates graphical data objects on or within the representation of the object or environment based on the location (e.g., using integration component 3008).

Fig. 32 presents an exemplary computing device 3202 employing one or more 3D self-2D techniques associated with object tracking, real-time navigation, and 3D feature-based security applications in accordance with various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

Referring to fig. 13 and 32, computing device 3202 may include the same or similar features and functionality as computing device 104, including a 3D self-2D processing module 1304 configured to generate derived 3D data 116 and/or optimized 3D data 1306 based on received 2D image data 102 and optionally native assistance data 802. Computing device 3202 also includes a tracking component 3204, a real-time navigation component 3206, and a 3D feature authentication component 3208. The tracking component 3204, the real-time navigation component 3206, and the 3D feature authentication component 3208 may each include computer-executable components stored in a memory (e.g., memory 122) that, when executed by a processor (e.g., processor 124), may perform the described operations.

In one or more embodiments, tracking component 3204 may facilitate tracking a relative location or position of an object, thing, person, etc. included in the environment based on derived 3D data 116 and/or optimized 3D data 1306 determined for the object over a period of time from captured 2D image data of the object. For example, in some implementations, the tracking component 3204 may receive successive frames of video of an object captured via one or more cameras over a period of time. The tracking component 3204 may also use the derived 3D data 116 and/or optimized 3D data 1306 determined for the object in at least some sequential frames of the video to determine a relative position of the object with respect to the camera over a period of time. In some implementations, computing device 3202 may also house one or more cameras. In some embodiments, the object comprises a moving object, and the one or more cameras may track the position of the object when the one or more cameras also move within a period of time or remain fixed positions relative to the perspective of the moving object within a period of time. In other embodiments, the object may comprise a fixed object and the one or more cameras may be movable relative to the object. For example, one or more cameras may be attached to a moving vehicle or object, held in a user's hand while the user is moving in the environment, and so on.

The real-time navigation component 3206 may facilitate real-time navigation of an environment by a mobile entity that includes a computing device 3202 and one or more cameras configured to capture and provide 2D image data (and optionally native assistance data 802). For example, a mobile entity may include a user-operated vehicle, an autonomous vehicle, an unmanned airplane, a robot, or another device that may benefit from knowing its relative location with respect to objects included in the environment that the device is navigating. According to this embodiment, the real-time navigation component 3206 may capture image data corresponding to a current perspective of the computing device relative to the environment continuously, regularly (e.g., at defined points in time), or in response to a trigger (e.g., a sensed signal indicating that one or more objects are located within a defined distance from the computing device). The 3D self-2D processing module 1304 may further determine derived 3D data 116 and/or optimized 3D data 1306 for respective objects, things, people included in the direct environment of the computing device 3202. Based on the derived 3D data 116 and/or optimized 3D data 1306 indicating the relative position of the computing device 3202 with respect to one or more objects in the environment being navigated, the real-time navigation component 3206 may determine navigation information for an entity employing the computing device 3202, including navigation paths that avoid collisions with objects, navigation paths that facilitate bringing entities to desired positions with respect to objects in the environment, and so forth. In some implementations, the real-time navigation component 3206 can also use information that semantically identifies objects included in the environment to facilitate navigation (e.g., where the vehicle should go, what the vehicle should avoid, etc.).

The 3D feature authentication component 3208 may employ the derived 3D data 116 and/or optimized 3D data determined for the object to facilitate the authentication process. For example, in some embodiments, the object may comprise a human face, and the derived 3D data 116 and/or optimized 3D data may comprise a depth map providing the face surface. The depth map may be used to facilitate face-based biometric authentication of a user identity.

Fig. 33 presents an exemplary system 3300 for developing and training 2D self-3D models in accordance with various aspects and embodiments described herein. System 3300 includes at least some features (e.g., 3D self-2D processing module 1406, 2D image data 102, panoramic image data 502, native auxiliary data 802, derived 3D data 116, and optimized 3D data 1306) that are the same or similar to previous systems disclosed herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

In the illustrated embodiment, the system 3300 includes a computing device 3312 that includes computer-executable components, including a 3D from 2D development module 3314 and a 3D from 2D processing module 1406. The computing device 3312 may include or be operatively coupled to at least one memory 3322 and at least one processor 3320. In one or more embodiments, the at least one memory 3322 may further store computer-executable instructions (e.g., the 3D from 2D development module 3314 and the 3D from 2D processing module 1406) that, when executed by the at least one processor 3320, cause performance of operations defined by the computer-executable instructions. In some embodiments, memory 3322 may also store information received, generated, and/or employed by computing device 3312 (e.g., 3D spatial model database 3302, 3D from 2D model database 3326, received 2D image data 102, received native assistance data 802, derived 3D data 116, optimized 3D data 1306, and/or additional training data generated by 3D from 2D model development module 3314 discussed below). Computing device 3312 may also include a device bus 3324 that communicatively couples the various components of computing device 3312. Examples of such processors 3320 and memories 3322, as well as other suitable computer or computing-based elements that may be used in conjunction with implementing one or more of the systems or components illustrated and described in connection with fig. 33 or other figures disclosed herein, may be found with reference to fig. 35.

The system 3300 also includes a 3D spatial model database 3302 and a 3D self-2D model database 3326. In one or more embodiments, the 3D self-2D model development module 3314 may be configured to facilitate generating and/or training one or more 3D self-2D models included in the 3D self-2D model database 3326 based at least in part on data provided by the 3D spatial model database 3302. For example, in the illustrated embodiment, the 3D self 2D model development module 3314 may include a training data development component 3316 to facilitate the collection and/or generation of training data based on various types of rich 3D model information (described below) provided by the 3D spatial model database 3302. The 3D self 2D model development module 3314 may also include a model training component 3318 that may be configured to employ training data to train and/or develop one or more 3D self 2D neural network models included in the 3D self 2D model database 3326. The 3D self-2D processing module 1406 may further employ the 3D self-2D model included in the 3D self-2D model database 3326 to generate the derived 3D data 116 and/or optimized 3D data 1306 based on the received input data (including the 2D image data 102 and/or the native auxiliary data 802) according to the various techniques described above.

In one or more embodiments, the 3D spatial model database 3302 may include a large amount of proprietary data associated with previously generated 3D spatial models that were generated using proprietary alignment techniques (e.g., those described herein), captured 2D image data, and associated captured depth data captured by various 3D sensors. In this regard, data for generating a 3D space model may be collected (e.g., with one or more types of 3D sensors) by scanning (e.g., with one or more types of 3D sensors) of real world scenes, spaces (e.g., houses, office spaces, outdoor spaces, etc.), objects (e.g., furniture, ornaments, merchandise, etc.), and so forth. The data may also be generated based on a computer-implemented 3D modeling system. For example, in some embodiments, the 3D spatial model is generated using one or more 2D/3D capture devices and/or systems described in U.S. patent application No. 15/417,162, filed on 26.1.2017 and entitled "CAPTURING AND ALIGNING PANORAMIC IMAGE AND DEPTH DATA," and U.S. patent application No. 14/070,426, filed on 1.11.2013 and entitled "CAPTURING AND ALIGNING THREE-dimensionnal SCENES," the entire contents of which are incorporated herein by reference. In some embodiments, the data provided by the 3D spatial model database 3302 may also include information for 3D spatial models generated by the 3D model generation component 118 according to the techniques described herein. The 3D spatial model database 3302 may also include information for the 3D spatial model 3028 discussed with reference to fig. 30.

In this regard, the 3D spatial model database 3302 may include 3D model and alignment data 3304, indexed 2D image data 3306, indexed 3D sensor data 3308, and indexed semantic tag data 3310. The 3D model and alignment data 3304 may include previously generated 3D spatial models of various objects and environments and associated alignment information related to the relative positions of geometric points, shapes, etc. that form the 3D models. For example, the 3D spatial model may include data representing positions, geometries, curved surfaces, and the like. The 3D spatial model may also include data comprising a set of points represented by 3D coordinates, such as points in 3D euclidean space. The sets of points may be associated with (e.g., connected to) each other by a geometric entity. For example, a set of lattice connectable points comprising a series of triangles, lines, curved surfaces (e.g., non-uniform rational basis splines (NURBS)), quadrilaterals, n-grams, or other geometric shapes. For example, a 3D model of a building interior environment may include mesh data (e.g., a triangular mesh, a quadrilateral mesh, a parametric mesh, etc.), one or more texture-mapped meshes (e.g., one or more texture-mapped polygonal meshes, etc.), a point cloud, a set of point clouds, bins, and/or other data constructed using one or more 3D sensors. In some implementations, portions of the 3D model geometry data (e.g., a mesh) may include image data describing textures, colors, intensities, and so forth. For example, in addition to including texture coordinates associated with the geometric data points (e.g., texture coordinates indicating how the texture data is applied to the geometric data), the geometric data may also include the geometric data points.

The indexed 2D image data 3306 may include 2D image data used to generate a 3D spatial model represented by the 3D model and the alignment data 3304. For example, the indexed 2D image data 3306 may include a set of images used to generate a 3D spatial model, and also include information associating the respective images with portions of the 3D spatial model. For example, the 2D image data may be associated with portions of a 3D model mesh to associate visual data (e.g., texture data, color data, etc.) from the 2D image data 102 with the mesh. The indexed 2D image data 3306 may also include information associating the 2D image with a particular location of the 3D model and/or a particular perspective for viewing the 3D spatial model. The indexed 3D sensor data 3308 may include 3D/depth measurements associated with respective 2D images used to generate the 3D spatial model. In this regard, the indexed 3D sensor data 3308 can include captured 3D sensor readings captured by one or more 3D sensors and associated with respective pixels, superpixels, objects, etc. of respective 2D images that are used to align the 2D images to generate the 3D spatial model. The indexed semantic tag data 3310 may include semantic tags that were previously determined and associated with respective objects or features of the 3D spatial model. For example, indexed semantic tag data 3310 may identify walls, ceilings, fixtures, appliances, etc. included in the 3D model and also include information identifying spatial boundaries of corresponding objects within the 3D spatial model.

Conventional training data for generating a 3D self-2D neural network model includes 2D images having known depth data for respective pixels, superpixels, objects, etc. included in the respective 2D images, such as indexed 3D sensor data 3308 associated with the respective 2D images included in the indexed 2D image data 3306, which is used to generate a 3D spatial model and alignment data 3304 included in the 3D model. In one or more embodiments, the training data development module 3314 may extract the training data (e.g., indexed 2D images and associated 3D sensor data) from the 3D spatial model database 3302 for provision to the model training component 3318 for use in association with generating and/or training one or more 3D self 2D neural network models included in the 3D self 2D model database 3326. In various additional embodiments, the training data development component 3316 may further use the reconstructed 3D spatial model to create training examples for the respective 2D images that were never directly captured by the 3D sensor. For example, in some implementations, the training data development component 3316 may employ the textured 3D mesh of the 3D spatial model included in the 3D model and the alignment data 3304 to generate a 2D image from camera positions where the actual cameras are never placed. For example, the training data development component 3316 may use the capture position/orientation information of the respective images included in the indexed 2D image data 3306 to determine various virtual capture position/orientation combinations that are not represented by the captured 2D images. The training data development component 3316 may further generate a composite image of the 3D model from these virtual capture positions/orientations. In some implementations, training data development component 3316 may generate composite 2D images from various perspectives of the 3D model, the images corresponding to a series of images captured by the virtual camera in association with navigating the 3D spatial model, wherein the navigation aids in capturing the scene as if the user were actually walking in the environment represented by the 3D model while holding the camera and capturing the images along the way.

The training data development component 3316 may also generate other forms of training data associated with the composite 2D image and the actual 2D image in a similar manner. For example, training data development component 3316 may generate IMU measurements, magnetometer or depth sensor data, or the like, as if such sensors were being placed in 3D space or moved therein. Based on the known locations of the points included in the composite image and the virtual camera capture location and orientation relative to the 3D spatial model (from which the virtual camera captures and synthesizes the location and orientation of the image), the training data development component 3316 may generate depth data for the respective pixels, superpixels, objects, etc. included in the composite image. In another example, the training data development component 3316 may determine depth data for a captured 2D image based on aligning visual features of the 2D image with known features of a 2D model for which depth information is available. Other inputs are generated as if a particular sensor were used in the 3D space.

In some embodiments, training data development component 3316 may further employ the 3D spatial model and alignment data 3304 included in the 3D model to create synthetic "ground truth" 3D data from those reconstruction environments in order to match each 2D used to create the 3D spatial model (e.g., included in the indexed 2D image data 3306) and the synthetic 2D images generated from the perspective of the 3D spatial model that were never actually captured by the actual camera from the actual environment. Thus, the composite 3D "ground truth" data for the respective images may exceed the quality of the actual 3D sensor data for the respective image captures (e.g., the actual 3D sensor data for the respective image captures included in the indexed 3D sensor data 3308), thereby improving the training effect. In this regard, because the synthesized 3D data is derived from a 3D model generated based on aligning several images having overlapping or partially overlapping image data to one another using various alignment optimization techniques, the alignment position 3D position of the corresponding point in the images may become more accurate than the 3D sensor data associated with the individual images captured by the 3D sensor. In this regard, the alignment pixels of a single 2D image included in the 3D model will have 3D positions relative to the 3D model that are determined not only based on the captured 3D sensor data associated with the 2D image, but also based on the alignment process used to create the 3D model, with the relative positions of the other images to the 2D image and the 3D coordinate space used to determine the final 3D position of the alignment pixel. Thus, the aligned 3D pixel locations associated with the 3D model may be considered more accurate than the 3D measurements for the pixels captured by the depth sensor.

In one or more additional embodiments, the training data development component 3316 may also extract additional scene information associated with the 3D spatial model, such as semantic tags included in the indexed semantic tag data 3310, and include it with the corresponding 2D images used as training data. In this regard, the training data development component 3316 may use the indexed semantic tag data 3310 to determine semantic tags and associate the semantic tags with 2D images (e.g., indexed 2D images and/or synthesized 2D images) that the model training component 3318 uses to develop and/or train a 3D self-2D neural network model. This allows model training component 3318 to train a 3D self-2D neural network model to predict semantic labels (e.g., walls, ceilings, doors, etc.) without manual annotation of the data set.

In various embodiments, model training component 3318 may employ training data collected and/or generated by training data development component 3316 to train and/or develop one or more 3D from 2D neural network models included in 3D from 2D model database 3326. In some implementations, the 3D self 2D model database 3326 may be, include, or correspond to the 3D self 2D model database 112. For example, in the illustrated embodiment, a 3D self-2D model database may include one or more panoramic models 514 and one or more augmented models 810. In some implementations, model training component 3318 may generate and/or train one or more panoramic models 514 and/or one or more augmented models 810 (discussed above) based at least in part on training data provided by training data development component 3316. The 3D self-2D model database 3326 may also include one or more optimized models 3328. The one or more optimized models 3328 may include one or more 3D self-2D neural network models that have been trained specifically using training data provided by the training data development component 3316. In this regard, the one or more optimized models 3328 may employ various 3D from 2D derivation techniques to derive 3D data from the 2D images discussed herein, including the 3D from 2D derivation techniques discussed with reference to the one or more standard models 114. However, relative to other 3D self-2D models trained on conventional input data, the one or more optimized models 3328 may be configured to generate more accurate and accurate depth-derived results based on training using training data provided by the training data development component 3316. For example, in some embodiments, the optimized model 3328 may comprise a standard 3D self-2D model that has been specially trained using training data provided by the training data development component 3316. Accordingly, the standard model 3D self-2D model may be converted to an optimized 3D self-2D model that is configured to provide more accurate results relative to a standard 3D self-2D model trained based on surrogate training data (e.g., training data not provided by the training data development component 3316).

Fig. 34 presents an exemplary computer-implemented method 3400 for developing and training a 2D self-3D model according to various aspects and embodiments described herein. Repeated descriptions of similar elements employed in the corresponding embodiments are omitted for the sake of brevity.

At 3402, a system (e.g., system 3300) operably coupled to the processor accesses (e.g., from a 3D spatial model database 3302) a 3D model of the object or environment, the model generated based on 2D images of the object or environment captured at different capture locations relative to the object or environment and depth data captured for the 2D images via one or more depth sensor devices (e.g., using training data development component 3316). At 3318, the system determines supplementary training data for the 2D image based on the 3D model. For example, the training data development component 3316 may determine semantic labels for the images and/or synthetic 3D data for the 2D images. Then, at 3406, the system may train one or more 3D self-2D neural networks using the 2D images and the auxiliary training data to derive 3D information from the new 2D images, the auxiliary data being treated as ground truth data in association with training the one or more neural networks using the auxiliary data.

Exemplary operating Environment

In order to provide a context for the various aspects of the disclosed subject matter, fig. 35 and 36, as well as the following discussion, are intended to provide a brief, general description of a suitable environment in which the various aspects of the disclosed subject matter can be implemented.

With reference to fig. 35, a suitable environment 3500 for implementing various aspects of the disclosure includes a computer 3512. The computer 3512 includes a processing unit 3514, a system memory 3516, and a system bus 3518. The system bus 3518 couples system components including, but not limited to, the system memory 3516 to the processing unit 3514. The processing unit 3514 may be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as the processing unit 3514.

The system bus 3518 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any available bus architecture, including, but not limited to, Industrial Standard Architecture (ISA), micro-channel architecture (MSA), extended ISA (eisa), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), card bus, Universal Serial Bus (USB), Advanced Graphics Port (AGP), personal computer memory card international association bus (PCMCIA), firewire (IEEE 1394), and Small Computer Systems Interface (SCSI).

The system memory 3516 includes volatile memory 3520 and nonvolatile memory 3522. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 3512, such as during start-up, is stored in nonvolatile memory 3522. By way of illustration, and not limitation, nonvolatile memory 3522 can include Read Only Memory (ROM), programmable ROM (prom), electrically programmable ROM (eprom), electrically erasable programmable ROM (eeprom), flash memory, or nonvolatile Random Access Memory (RAM) (e.g., ferroelectric RAM (feram)). Volatile memory 3520 includes Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM can be available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Direct Rambus RAM (DRRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus dynamic RAM.

Computer 3512 also includes removable/non-removable, volatile/nonvolatile computer storage media. For example, FIG. 35 illustrates a disk storage 3524. Disk storage 3524 includes, but is not limited to, devices like a magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-100 drive, flash memory card, or memory stick. Disk storage 3524 can also include storage media separately or in combination with other storage media including, but not limited to, an optical disk Drive such as a compact disk ROM device (CD-ROM), CD recordable Drive (CD-R Drive), CD rewritable Drive (CD-RW Drive) or a digital versatile disk ROM Drive (DVD-ROM). To facilitate connection of the disk storage devices 3524 to the system bus 3518, a removable or non-removable interface is typically used such as interface 3526.

FIG. 35 also depicts software that acts as an intermediary between users and the basic computer resources described in suitable operating environment 3500. Such software includes, for example, an operating system 3528. Operating system 3528, which can be stored on disk storage 3524, acts to control and allocate resources of the computer system 3512. System applications 3530 take advantage of the management of resources by operating system 3528 through program modules 3532 and program data 3534 (e.g., stored in system memory 3516 or disk storage 3524). It is to be appreciated that the present disclosure can be implemented with various operating systems or combinations of operating systems.

A user enters commands or information into the computer 3512 through input device 3536. Input devices 3536 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, television tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 3514 through the system bus 3518 via interface port(s) 3538. Interface port(s) 3538 include, for example, a serial port, a parallel port, a game port, and a Universal Serial Bus (USB). Output device(s) 3540 use some of the same type of ports as input device 3536. Thus, for example, a USB port may be used to provide input to computer 3512 and to output information from computer 3512 to an output device 3540. Output adapter 3542 is provided to illustrate that there are some output devices 3540 like monitors, speakers, and printers, among other output devices 3540, which require special adapters. By way of illustration and not limitation, the output adapter 3542 includes video and sound cards that provide a means of connection between the output device 3540 and the system bus 3518. It should be noted that other devices and/or systems of devices provide both input capabilities and output capabilities such as remote computer 3544.

The computer 3512 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 3544. The remote computer 3544 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to the computer 3512. For purposes of brevity, only a memory storage device 3546 is illustrated with remote computer(s) 3544. Remote computer(s) 3544 is logically connected to computer 3512 through a network interface 3548 and then physically connected via communication connection 3550. Network interface 3548 includes wired and/or wireless communication networks such as Local Area Networks (LANs), Wide Area Networks (WANs), cellular networks, and the like. LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), ethernet, token ring, and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).

Communication connection(s) 3550 refers to the hardware/software employed to connect the network interface 3548 to the bus 3518. While communication connection 3550 is shown for illustrative clarity inside computer 3512, it can also be external to computer 3512. The hardware/software necessary for connection to the network interface 3548 includes, for exemplary purposes only, internal and external technologies such as, modems (including regular telephone grade modems, cable modems and DSL modems), ISDN adapters, and ethernet cards.

It should be understood that the computer 3512 can be utilized in connection with implementing one or more of the systems, components, and/or methods shown and described in fig. 1-34. According to various aspects and implementations, computer 3512 may be used to facilitate determining and/or executing commands associated with deriving depth data from 2D images, using the derived depth data for various applications including AR and object tracking, generating training data, and so forth (e.g., by

systems

100, 500, 800, 1300, 3000, 3200, and 3300). Computer 3512 may further provide various processing of the 2D image data and 3D depth data described in association with primary processing component 104, secondary processing component 110, third processing component 114, processing component 420, processing component 1222, and processing component 1908. The computer 3512 may further provide for rendering and/or displaying 2D/3D image data and video data generated by the various 2D/3D panorama capture devices, apparatuses and systems described herein. Computer 3512 includes a component 3506 that can embody one or more of the various components described in association with the various systems, devices, assemblies, and computer-readable media described herein.

FIG. 36 is a schematic block diagram of a sample-computing environment 3600 with which the subject matter of the present disclosure can interact. System 3600 includes one or more clients 3610. The client(s) 3610 can be hardware and/or software (e.g., threads, processes, computing devices). The system 3600 also includes one or more servers 3630. Thus, system 3600 can correspond to a two-tier client server model or a multi-tier model (e.g., client, middle tier server, data server), among other models. The server(s) 3630 can also be hardware and/or software (e.g., threads, processes, computing devices). The servers 3630 can house threads to perform transformations by employing the present disclosure, for example. One possible communication between a client 3610 and a server 3630 may be in the form of a data packet transmitted between two or more computer processes.

System 3600 includes a communication framework 3650 that can be employed to facilitate communications between client(s) 3610 and server(s) 3630. The client 3610 is operatively connected to one or more client data storage devices 3620 that may be used to store information local to the client 3610. Similarly, the server 3630 is operatively connected to one or more server data storage devices 3640 that may be used to store information local to the server 3630.

It is noted that aspects or features of the present disclosure may be utilized in substantially any wireless telecommunications or radio technology, e.g., Wi-Fi; bluetooth; worldwide Interoperability for Microwave Access (WiMAX); enhanced general packet radio service (enhanced GPRS); third generation partnership project (3GPP) Long Term Evolution (LTE); third generation partnership project 2(3GPP2) Ultra Mobile Broadband (UMB); 3GPP Universal Mobile Telecommunications System (UMTS); high Speed Packet Access (HSPA); high Speed Downlink Packet Access (HSDPA); high Speed Uplink Packet Access (HSUPA); GSM (global system for mobile communications) EDGE (enhanced data rates for GSM Evolution) Radio Access Network (GERAN); UMTS Terrestrial Radio Access Network (UTRAN); LTE-advanced (LTE-a), and the like. Additionally, some or all aspects described herein may be utilized in conventional telecommunications technologies (e.g., GSM). In addition, mobile as well as non-mobile networks (e.g., the internet, data services networks such as Internet Protocol Television (IPTV), etc.) may utilize aspects or features described herein.

While the subject matter has been described above in the general context of computer-executable instructions of a computer program that runs on a computer and/or computers, those skilled in the art will recognize that the disclosure also may, or can be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, etc. that perform particular tasks and/or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the inventive methods can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, minicomputers, mainframe computers, as well as personal computers, hand-held computing devices (e.g., PDAs, telephones), microprocessor-based or programmable consumer or industrial electronics, and the like. The illustrated aspects may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. However, some, if not all aspects of the disclosure may be practiced on stand-alone computers. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

As used in this application, the terms "component," "system," "platform," "interface," and the like may refer to and/or may comprise a computer-related entity or an entity associated with an operating machine having one or more specific functions. The entities disclosed herein may be hardware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.

In another example, the respective components can execute from various computer readable media having various data structures stored thereon. The components may communicate via local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems via the signal). As another example, a component may be a device having a specific function provided by mechanical portions operated by electrical or electronic circuitry operated by a software or firmware application executed by a processor. In this case, the processor may be internal or external to the device and may execute at least a portion of a software or firmware application. As yet another example, an element may be a device that provides a particular function through electronic components without mechanical parts, where an electronic component may include a processor or other means for executing software or firmware that imparts, at least in part, functionality to an electronic component. In one aspect, a component may emulate an electronic component via a virtual machine (e.g., in a cloud computing system).

In addition, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise, or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, if X employs A; x is B; or X employs both A and B, then "X employs A or B" is satisfied under any of the foregoing circumstances. In addition, the articles "a" and "an" as used in the subject specification and figures are generally to be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form.

As used herein, the terms "example" and/or "exemplary" are used to mean serving as an example, instance, or illustration. For the avoidance of doubt, the subject matter disclosed herein is not limited by such examples. Moreover, any aspect or design described herein as "exemplary" and/or "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs, nor is it meant to exclude equivalent exemplary structures and techniques known to those of ordinary skill in the art.

Various aspects or features described herein may be implemented as a method, apparatus, system, or article of manufacture using standard programming or engineering techniques. Additionally, various aspects or features disclosed in this disclosure may be implemented by program modules implementing at least one or more of the methods disclosed herein, the program modules being stored in a memory and executed by at least a processor. Other combinations of hardware and software or hardware and firmware may enable or implement the aspects described herein, including the disclosed methods. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or storage media. For example, computer-readable storage media may include, but are not limited to, magnetic storage devices (e.g., hard disk, floppy disk, magnetic strip …), optical disks (e.g., Compact Disk (CD), Digital Versatile Disk (DVD), blu-ray disk (BD) …), smart cards, and flash (flash) memory devices (e.g., card, stick, key drive …), among others.

As used in this specification, the term "processor" may refer to substantially any computing processing unit or device, including, but not limited to: a single core processor; a single processor with software multi-threaded execution capability; a multi-core processor; a multi-core processor having software multi-thread execution capability; a multi-core processor having hardware multithreading; a parallel platform; and parallel platforms with distributed shared memory. Additionally, a processor may refer to an integrated circuit, an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Controller (PLC), a Complex Programmable Logic Device (CPLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. In addition, processors may utilize nanoscale architectures such as, but not limited to, molecular and quantum dot based transistors, switches, and gates, in order to optimize space usage or enhance performance of user equipment. A processor may also be implemented as a combination of computing processing units.

In this disclosure, terms such as "storage," "data storage," "database," and substantially any other information storage component related to the operation and function of the component are used to refer to "memory components," entities embodied in "memory," or components comprising memory. It will be appreciated that the memory and/or memory components described herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.

By way of illustration, and not limitation, nonvolatile memory can include Read Only Memory (ROM), programmable ROM (prom), electrically programmable ROM (eprom), electrically erasable ROM (eeprom), flash memory, or nonvolatile Random Access Memory (RAM) (e.g., ferroelectric RAM (feram)). For example, volatile memory can include RAM which can act as external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Synchronous RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Direct Rambus RAM (DRRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM). Additionally, the memory components of systems or methods disclosed herein are intended to comprise, without being limited to, including these and any other suitable types of memory.

It is to be appreciated and understood that components described with respect to a particular system or method can include the same or similar functionality as corresponding components (e.g., correspondingly named components or similarly named components) described with respect to other systems or methods disclosed herein.

What has been described above includes examples of systems and methods that provide the advantages of the present disclosure. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the present disclosure, but one of ordinary skill in the art may recognize that many further combinations and permutations of the present disclosure are possible. Furthermore, to the extent that the terms "includes," "has," "possessing," and the like are used in either the detailed description, the claims, the appendix and the accompanying drawings, such terms are intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim.

Claims

1. A system, comprising:

a memory storing computer-executable components; and

a processor that executes the computer-executable components stored in the memory, wherein the computer-executable components comprise:

a receiving component configured to receive a two-dimensional image; and

a three-dimensional data derivation component configured to employ one or more three-dimensional data from two-dimensional data, i.e., 3D from a 2D neural network model, to derive three-dimensional data for the two-dimensional image.

2. The system of claim 1, wherein the computer-executable components further comprise:

a modeling component configured to determine an alignment between the two-dimensional images and a common three-dimensional coordinate space based on the three-dimensional data respectively associated with the two-dimensional images.

3. The system of claim 2, wherein the modeling component is further configured to generate a three-dimensional model of an object or environment included in the two-dimensional image based on the alignment.

4. The system of claim 3, wherein the computer-executable components further comprise:

A rendering component configured to facilitate rendering of the three-dimensional model via a display of a device.

5. The system of claim 3, wherein the computer-executable components further comprise:

a navigation component configured to facilitate navigation of a three-dimensional model as rendered via a display of a device.

6. The system of claim 1, wherein the computer-executable components further comprise:

a rendering component configured to facilitate rendering of the three-dimensional data of a respective image of the two-dimensional images via a display of a device.

7. The system of claim 1, wherein the computer-executable components further comprise:

a communication component configured to transmit the two-dimensional image and the three-dimensional data to an external device via a network, wherein based on receiving the two-dimensional image and the three-dimensional data, the external device generates a three-dimensional model of an object or environment included in the two-dimensional image by aligning the two-dimensional images with each other based on the three-dimensional data.

8. The system of claim 1, wherein the two-dimensional image comprises a wide field of view image having a field of view exceeding a minimum threshold and spanning up to 360 degrees.

9. The system of claim 1, wherein the computer-executable components further comprise:

a stitching component configured to combine two or more of the two-dimensional images to generate a second image having a field of view greater than the respective fields of view of the two or more first images, and wherein the three-dimensional data derivation component is configured to employ the one or more 3D self-2D neural network models to derive at least some of the three-dimensional data from the second image.

10. The system of claim 1, wherein the receiving component is further configured to receive depth data of a portion of the two-dimensional image captured by one or more three-dimensional sensors, and wherein the three-dimensional data derivation component is further configured to use the depth data as input to the one or more 3D self-2D neural network models to derive the three-dimensional data of the two-dimensional image.

11. The system of claim 10, wherein the one or more three-dimensional sensors are selected from the group consisting of: structured light sensors, light detection and ranging sensors, i.e., LiDAR sensors, laser rangefinder sensors, time-of-flight sensors, light field camera sensors, and active stereo sensors.

12. The system of claim 10, wherein the two-dimensional image comprises a panoramic color image having a first vertical field of view, wherein the depth data corresponds to a second vertical field of view within the first vertical field of view, and wherein the second vertical field of view comprises a narrower field of view than the first vertical field of view.

13. The system of claim 1, wherein the two-dimensional images comprise a panoramic image pair having a horizontal field of view spanning up to 360 degrees, and wherein respective images included in the panoramic image pair are captured from different vertical positions relative to a same vertical axis, wherein the different vertical positions are offset by a stereoscopic image pair distance.

14. The system of claim 1, wherein the system is located on a device selected from the group consisting of: mobile phones, tablet personal computers, notebook personal computers, standalone cameras, and wearable optical systems.

15. An apparatus, comprising:

a camera configured to capture a two-dimensional image:

a memory storing computer-executable components; and

16. The apparatus of claim 15, wherein the computer-executable components further comprise:

17. The apparatus of claim 15, wherein the computer-executable components further comprise:

a communication component configured to transmit the two-dimensional image and the three-dimensional data to an external device, wherein based on receiving the two-dimensional image and the three-dimensional data, the external device generates a three-dimensional model of an object or environment included in the two-dimensional image by aligning the two-dimensional images with each other based on the three-dimensional data.

18. The apparatus of claim 15, wherein the two-dimensional image comprises a wide field of view image having a field of view exceeding a minimum threshold and spanning up to 360 degrees.

19. The apparatus of claim 15, further comprising:

one or more three-dimensional sensors configured to capture depth data of a portion of the two-dimensional image, and wherein the three-dimensional data derivation component is further configured to use the depth data as input to the one or more 3D self-2D neural network models to derive the three-dimensional data of the two-dimensional image.

20. The apparatus of claim 19, wherein the one or more three-dimensional sensors are selected from the group consisting of: structured light sensors, light detection and ranging sensors or LiDAR) sensors, laser range finder sensors, time-of-flight sensors, light field camera sensors, and active stereo sensors.

21. The device of claim 19, wherein the two-dimensional image comprises a panoramic color image having a first vertical field of view, wherein the three-dimensional sensor is configured to capture the depth data for a second vertical field of view within the first vertical field of view, and wherein the second vertical field of view comprises a narrower field of view than the first vertical field of view.

22. The apparatus of claim 15, wherein the apparatus is selected from the group consisting of: mobile phones, tablet personal computers, notebook personal computers, standalone cameras, and wearable optical devices.

23. A method, comprising:

capturing, by a system comprising a processor, a two-dimensional image of an object or environment; and

transmitting, by the system, the two-dimensional image to a remote device, wherein upon receipt of the two-dimensional image, the remote device employs one or more three-dimensional data from a two-dimensional (3D) from 2D neural network model to derive three-dimensional data for the two-dimensional image, and generates a three-dimensional reconstruction of the object or environment using the two-dimensional image and the three-dimensional data.

24. The method of claim 23, further comprising:

receiving, by the system, the three-dimensional reconstruction from the remote device; and

rendering, by the system, the three-dimensional reconstruction via a display of a device.

25. The method of claim 23, wherein the capturing comprises capturing the two-dimensional image as a panoramic image having a horizontal field of view spanning up to 360 degrees.

26. The method of claim 25, wherein the capturing comprises capturing the panoramic image pair, comprising capturing respective images of the panoramic image pair from different vertical positions relative to a same vertical axis, wherein the different vertical positions are offset from a stereoscopic image pair distance.

27. The method of claim 26, wherein the capturing comprises employing a camera configured to move to the different vertical position to capture the respective image.

28. The method of claim 26, wherein the capturing comprises employing two cameras located at the different vertical positions.