WO2021160097A1 - System and method for object detection for augmented reality - Google Patents

System and method for object detection for augmented reality

Info

Publication number
WO2021160097A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
image
label
bounding box
probability
Prior art date
Application number
PCT/CN2021/076066
Other languages
French (fr)
Inventor
Xiang Li
Yi Xu
Yuan Tian
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN202180013885.7A (published as CN115104135A)
Publication of WO2021160097A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/20 Scenes; Scene-specific elements in augmented reality scenes

Definitions

  • An Augmented Reality (AR) system superimposes virtual content over a user’s view of the real world environment.
  • a user can scan the environment using a smartphone’s camera, and the smartphone performs visual inertial odometry (VIO) in real time. Once the camera pose is tracked continuously, virtual objects can be placed into the AR scene to create an illusion that real objects and virtual objects are merged together.
  • Some augmented-reality applications generate virtual objects intended to appear naturally within the real-world environment. Generating a virtual object (e.g., such as a bed) that is placed out of context (e.g., in a bathroom) prevents the augmented-reality application from generating naturally appearing objects. Thus, it can be important in such systems to identify objects within the real-world environment for improved augmented-reality displays.
  • the present disclosure relates generally to augmented reality, and more specifically, and without limitation, to object detection in neural networks augmented using augmented-reality sensor data.
  • aspects of the present disclosure include methods for object detection in augmented reality environments.
  • the methods include receiving, by a camera of a mobile device, an image of an environment; generating, by the mobile device, a point cloud that indicates a distance of a particular object that is depicted in the image; extracting, from the point cloud, a depth value that indicates a distance between the camera and the particular object; receiving, from a trained neural network, one or more candidate identification profiles, each candidate identification profile including a first bounding box that surrounds a first portion of the image that includes at least a portion of the particular object and an identification probability that indicates a probability that the particular object corresponds to a particular label; receiving scale information that corresponds to the particular label, the scale information indicating a geometry of an arbitrary object that corresponds to the particular label; generating virtual geometry by modifying the scale information using the depth value; generating a scale matching probability by comparing the first bounding box to the virtual geometry; updating the neural network using the scale matching probability; and generating, using the updated neural network, a final identification profile that includes a second bounding box that surrounds a second portion of the image that includes the particular object and a new label that corresponds to the particular object.
  • Another aspect of the present disclosure includes a system comprising one or more processors and a non-transitory computer-readable media that includes instructions that when executed by the one or more processors, cause the one or more processors to perform methods described above.
  • Another aspect of the present disclosure includes a non-transitory computer-readable media that includes instructions that when executed by one or more processors, cause the one or more processors to perform the methods described above.
  • Scale matching using augmented-reality sensor data can be performed during operation of a neural network to modify candidate regions being identified by the neural network. This results in the neural network generating object detection and identification with higher accuracy.
  • the scale matching applied at various stages of the neural network may eliminate some candidate objects from consideration by the neural network, which can reduce the processing resources consumed by the neural network and increase the processing speed of successive layers of the neural network.
  • FIG. 1 illustrates an example of a computer system that includes a depth sensor and a red, green, and blue (RGB) optical sensor for AR applications according to an embodiment of the present invention.
  • FIG. 2 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data according to an embodiment of the present invention.
  • FIG. 3 illustrates an example process for object detection using an augmented neural network of the embodiment of FIG. 2 according to an embodiment of the present invention.
  • FIG. 4 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data according to an alternative embodiment of the present invention.
  • FIG. 5 illustrates an example process for object detection using an augmented neural network of the embodiment of FIG. 4 according to an embodiment of the present invention.
  • FIG. 6 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data, according to a specific embodiment of the present invention.
  • FIG. 7 illustrates an example process for object detection using an augmented neural network of the embodiment of FIG. 6 according to an embodiment of the present invention.
  • FIG. 8 illustrates an example block diagram depicting the execution layers of a scale matching process according to an embodiment of the present invention.
  • FIG. 9 illustrates an example computer system according to an embodiment of the present invention.
  • Embodiments of the present disclosure are directed to, among other things, object detection in neural networks augmented using augmented-reality sensor data. For instance, an image is received from a camera of a mobile device. The mobile device, using the image, may generate a pose (e.g., position and orientation) of the camera and a point cloud of a particular object that is depicted in the image. The mobile device may use the pose of the camera and the point cloud to generate a depth value that indicates a distance between the camera and a particular object. The mobile device may receive, from a neural network, candidate identification profiles for portions of the image.
  • Each profile can include a first bounding box that surrounds a first portion of the image that includes a portion of the particular object, a label that corresponds to the particular object, and a probability that the label corresponds to the particular object.
  • the mobile device may receive scale information that indicates a geometry of an arbitrary object that corresponds to the label. Virtual geometry can be generated by modifying the scale information using the depth value. The mobile device may use the virtual geometry to generate a scale matching probability, which may update the neural network.
  • the updated neural network outputs a final identification profile that includes a second bounding box that surrounds a second portion of the image that includes the particular object and a new label that corresponds to the particular object.
  • an AR application may use object detection to select a particular virtual object and/or the placement of a particular virtual environment within an AR scene.
  • the AR application may select a particular virtual object, or its placement, based on a context of the real-world environment such that the virtual object appears as if it belongs in the real-world environment.
  • Object detection may be performed by the AR application, by a neural network, or by a neural network that is augmented by AR sensor data.
  • an image of an office environment may be captured by a camera of a mobile device.
  • the mobile device generates a point cloud of objects in the image, such as a desk, chair, etc. and computes a depth value that corresponds to at least one object (e.g., the distance between the camera and a chair) .
  • a neural network generates candidate identification profiles of the objects in the image.
  • Each candidate identification profile includes a bounding box that surrounds a portion of the image that corresponds to an object such as the chair.
  • the candidate identification profile may also include a probability that the portion of the image corresponds to a particular label (e.g., chairs) .
  • the mobile device may receive scale information for an arbitrary chair.
  • the scale information includes average dimensions for arbitrary chairs (e.g., between 0.5 and 1 meter high, between 0.5 and 1 meter wide, etc.).
  • the mobile device then generates virtual geometry by modifying the scale information using the depth value (e.g., accounting for the distance of the object from the camera) .
  • the mobile device can then generate a scale matching probability by comparing the bounding box of the candidate identification profile with the virtual geometry.
  • the scale matching probability may indicate the accuracy of an identification profile. For instance, if the bounding box is close to the virtual geometry, then the bounding box is likely accurate. If the bounding box diverges from the virtual geometry, then the identification profile may be inaccurate.
  • the scale matching probability can be used to update the neural network. For instance, the probability of an identification profile may be updated (e.g., becoming more or less probable of corresponding to a particular label), or the profile may be eliminated (e.g., when the scale matching probability indicates that the identification profile is too inaccurate), or the like.
  • the updated neural network provides a final identification profile that includes a second bounding box that surrounds a second portion of the image that includes the chair, and a new label. In some instances, the second portion of the image may be equal to the first portion of the image. In other instances, the second portion of the image may more accurately include the object being detected (e.g., the chair) .
  • the new label may be the same as the old label (if the candidate identification profile included a correct label), or the updated neural network may generate a new label that more accurately corresponds to the object (e.g., office chairs).
  • Scale matching (e.g., the generation or application of the scale matching probability) may be performed at various stages of the object detection by the neural network. For instance, scale matching may be performed after the last layer (e.g., after the non-maximum suppression process) , before the last layer (e.g., before the non-maximum suppression process) , and/or earlier in the neural network processing pipeline (e.g., after candidate region generation, after scale classifier generation, etc. ) .
  • the methods and system described herein increase the accuracy of object detection in neural networks.
  • the neural network uses scale matching from the augmented-reality sensor data to modify candidate objects that the neural network is attempting to identify in an image. This results in the neural network generating improved object detection and identification that has a higher accuracy.
  • the scale matching can, at various stages of the neural network, eliminate some candidate objects from consideration by the neural network, which can reduce the processing resources consumed by the neural network and increase the processing speed of successive layers of the neural network.
  • FIG. 1 illustrates an example of a computer system 110 that includes a depth sensor 112 and an RGB optical sensor 114 for AR applications, according to an embodiment of the present invention.
  • the AR applications can be implemented by an AR module 116 of the computer system 110.
  • the RGB optical sensor 114 generates an RGB image of a real-world environment that includes, for instance, a real-world object 130.
  • the depth sensor 112 generates depth data about the real-world environment, where this data includes, for instance, a depth map that shows depth (s) of the real-world object 130 (e.g., distance (s) between the depth sensor 112 and the real-world object 130) .
  • the AR module 116 renders an AR scene 120 of the real-world environment in the AR session, where this AR scene 120 can be presented at a graphical user interface (GUI) on a display of the computer system 110.
  • the AR scene 120 shows a real-world object representation 122 of the real-world object 130.
  • the AR scene 120 shows a virtual object 124 not present in the real-world environment.
  • the AR module 116 can generate a red, green, blue, and depth (RGBD) image from the RGB image and the depth map that may be processed by the AR system to augment an object detection neural network.
  • the AR application may use a pose of the computer system 110 and a point cloud of objects to obtain scale information of candidate object categories identified by a neural network.
  • the scale information is passed to the neural network to refine the object detection, which increases the accuracy of object detection (and identification) and reduces the processing resources consumed by the neural network.
  • the computer system 110 represents a suitable user device that includes, in addition to the depth sensor 112 and the RGB optical sensor 114, one or more graphical processing units (GPUs) , one or more general purpose processors (GPPs) , and one or more memories storing computer-readable instructions that are executable by at least one of the processors to perform various functionalities of the embodiments of the present disclosure.
  • the computer system 110 can be any of a smartphone, a tablet, an AR headset, or a wearable AR device.
  • the AR module 116 may execute a visual-inertial odometry (VIO) process to track the pose (e.g., position and orientation) of the AR module 116.
  • VIO uses image analysis and an inertial measurement unit (sensor data) to determine changes in the camera’s (e.g., the AR module 116’s) position and/or orientation.
  • Visual odometry can use feature detection in images to identify and correlate features across successive images. The feature detection may be used to generate an optical flow field that estimates the motion of the camera relative to objects depicted in the successive images.
  • the degree of motion between time intervals (e.g., the time interval between successive images), together with the distance and direction of the camera (and sensor data), may be used to track a position and an orientation of the camera at each time interval.
  • Visual odometry may be augmented using an inertial measurement unit that captures directional force values.
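  • For illustration, a minimal sketch of the visual-odometry portion of this process is shown below, assuming OpenCV (cv2) is available and the camera intrinsic matrix K is known; the inertial (IMU) fusion step, which would supply metric scale, is omitted.

```python
# A minimal sketch of the visual-odometry half of VIO, assuming OpenCV and a
# known 3x3 camera intrinsic matrix K; not the disclosure's exact procedure.
import cv2
import numpy as np

def estimate_relative_pose(prev_gray, curr_gray, K):
    """Track features between two successive grayscale frames and recover the
    relative camera rotation R and (unit-scale) translation direction t."""
    # Detect corner features in the previous frame.
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                       qualityLevel=0.01, minDistance=7)
    # Correlate the features across the successive image via sparse optical flow.
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   prev_pts, None)
    good_prev = prev_pts[status.ravel() == 1]
    good_curr = curr_pts[status.ravel() == 1]
    # Estimate the essential matrix from the flow field and decompose it into
    # the camera's change in orientation and direction of travel.
    E, mask = cv2.findEssentialMat(good_curr, good_prev, K,
                                   method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, good_curr, good_prev, K, mask=mask)
    return R, t  # an IMU (sensor data) would normally provide metric scale
```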
  • AR module 116 may execute an implementation of VIO called a simultaneous localization and mapping (SLAM) process.
  • the SLAM process may initiate with a calibration step in which an empty map of the environment may be initialized with the device positioned at the origin of the coordinate system.
  • the SLAM process receives input data such as, but not limited to, image data, control data ct, sensor data st, and a time interval t.
  • the SLAM process then generates an output that may include an approximate location of the device xt for a given time interval (relative to one or more approximate locations at one or more previous time intervals) and a map of the environment mt.
  • the output can be augmented (or verified) using feature detection on images captured at time t and time t+1 to identify and correlate features across the images.
  • the changes between images can be used to verify the movement of the AR module 116, populate the environment mt with objects detected in the images, etc.
  • the SLAM process may update xt and mt.
  • the SLAM process may be an iterative process that updates xt and mt at set time intervals or when new sensor data or image data is detected. For instance, if no sensor change occurs between time intervals t and t+1, then the SLAM process may delay updating the position and map to preserve processing resources.
  • the SLAM process may compute the new position of device xt and update the map mt.
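  • The iterative structure described above can be sketched as follows; predict_pose and refine_map are hypothetical placeholders (stubbed out here) standing in for the actual SLAM estimation steps.

```python
# A structural sketch of the update-or-skip SLAM loop, assuming hypothetical
# helper functions; it illustrates control flow only, not SLAM math.
def slam_step(x_prev, m_prev, image_t, control_t, sensor_t, new_data):
    """One SLAM iteration: return (x_t, m_t) for the current time interval."""
    if not new_data:
        # No sensor or image change between t and t+1: delay the update to
        # preserve processing resources.
        return x_prev, m_prev
    x_t = predict_pose(x_prev, control_t, sensor_t)   # hypothetical motion update
    m_t = refine_map(m_prev, x_t, image_t)            # hypothetical map update
    return x_t, m_t

def predict_pose(x_prev, control_t, sensor_t):
    # Stub: a real implementation would fuse control data and IMU readings.
    return x_prev

def refine_map(m_prev, x_t, image_t):
    # Stub: a real implementation would add landmarks/objects seen in image_t.
    return m_prev
```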
  • the AR module 116 can be implemented as specialized hardware and/or a combination of hardware and software (e.g., general purpose processor and computer-readable instructions stored in memory and executable by the general purpose processor) .
  • the AR module 116 can use the augmented-reality sensor data to improve object detection of the real-world objects.
  • the AR module 116 can determine a distance between itself and the real-world object 130.
  • the AR module 116 may use scale matching of candidate categories of an object (e.g., one or more categories that the neural network has assigned as possibly corresponding to the real-world object 130) to update the neural network.
  • the AR module 116 can obtain dimensions for an object that corresponds to a candidate label. The dimensions are then modified based on the distance between the object and the camera as determined by the AR module 116 (e.g., if the object is far away, the dimensions will be smaller, and vice versa). Then, if the candidate bounding box for a label generated by the neural network is too large or too small given the scale matching, the probability that the bounding box for the label is correct can be reduced (or the bounding box and label may be eliminated from contention), as sketched below.
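  • As a rough illustration of this scale-matching idea, the sketch below assumes a pinhole camera with focal lengths fx/fy in pixels and a hypothetical SCALE_DB of per-label physical dimensions; it is not the disclosure’s exact procedure.

```python
# A rough sketch, assuming a pinhole camera (focal lengths fx, fy in pixels)
# and a hypothetical per-label dimension store SCALE_DB (meters).
SCALE_DB = {
    "chair": {"width": (0.5, 1.0), "height": (0.5, 1.0)},
    "table": {"width": (0.8, 2.0), "height": (0.6, 1.0)},
}

def expected_pixel_size(label, depth_m, fx, fy):
    """Project the label's physical size range to on-screen pixels at the
    measured depth (pinhole model: pixels = focal_length * meters / depth)."""
    dims = SCALE_DB[label]
    w_px = tuple(fx * w / depth_m for w in dims["width"])
    h_px = tuple(fy * h / depth_m for h in dims["height"])
    return w_px, h_px

def adjust_probability(p, bbox_w_px, bbox_h_px, label, depth_m, fx, fy, tol=0.15):
    """Reduce the label probability to zero when the candidate bounding box is
    too large or too small for the label at the measured distance."""
    (w_min, w_max), (h_min, h_max) = expected_pixel_size(label, depth_m, fx, fy)
    too_small = bbox_w_px < (1 - tol) * w_min or bbox_h_px < (1 - tol) * h_min
    too_large = bbox_w_px > (1 + tol) * w_max or bbox_h_px > (1 + tol) * h_max
    return 0.0 if (too_small or too_large) else p
```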
  • a smartphone is used for an AR session that shows the real-world environment.
  • the AR session includes rendering an AR scene that includes a representation of a real-world table on top of which a vase (or some other real-world object) is placed.
  • the augmented-reality session can include rendering of virtual objects that are superimposed onto the AR scene such as virtual object 124.
  • the AR session can include a neural network that identifies objects in the real-world environment (e.g., to select the virtual object 124, position the virtual object 124, etc. ) .
  • the AR module 116 initializes the AR session by using an image to determine an initial pose of the smartphone (e.g., using a SLAM process or the like) and initializing a coordinate system relative to the smartphone.
  • the AR module 116 may generate a point cloud for objects depicted in the image, such as real-world object 130.
  • a point cloud includes two or more points (e.g., coordinates in the coordinate system) , each point corresponding to a discrete point of the object.
  • Point clouds represent the depth of the object (e.g., the distance of the object relative to the smartphone) .
  • the AR module 116 uses the point cloud and the pose of the smartphone to extract the depth value for the object.
  • the smartphone executes a neural network for object detection.
  • the neural network uses the images captured by the camera to detect and identify the objects depicted therein.
  • the neural network may generate one or more candidate identification profiles.
  • Each candidate identification profile includes a bounding box around a portion of the image that includes a portion of the object and a probability that the object corresponds to a particular label of objects.
  • the bounding box may be represented by a multi-dimensional vector such as (c, x, y, z, w, h, p), where x, y, and z are position coordinates, w is the width, h is the height, c is the label, and p is the probability that the label is correct.
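  • One way to hold such a candidate identification profile in code is a simple data structure; the field names below mirror the vector just described and are illustrative only.

```python
# An illustrative data structure for one candidate identification profile.
from dataclasses import dataclass

@dataclass
class CandidateIdentificationProfile:
    c: str        # label (e.g., "table")
    x: float      # bounding-box position in the coordinate system
    y: float
    z: float
    w: float      # width of the bounding box
    h: float      # height of the bounding box
    p: float      # probability that the label c is correct

# Example usage with placeholder values.
profile = CandidateIdentificationProfile("table", 0.2, 0.0, 1.8, 1.2, 0.7, 0.83)
```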
  • the scale information includes dimensions for an arbitrary object that conforms to the label. For instance, if the label is tables, the scale information includes average dimensions of an arbitrary table. Alternatively, scale information can include ranges (min-max in each dimension) for tables.
  • the AR module 116 generates virtual geometry for the arbitrary object (e.g., tables) . The virtual geometry is generated by modifying the scale information with the depth value.
  • the AR module 116 uses a comparison of the bounding box and the virtual geometry to generate a scale matching probability.
  • the scale matching probability may be used to update the neural network by adjusting the probability assigned to the candidate identification profile. For instance, if the bounding box for the table label is larger than the virtual geometry, then the predicted label of the candidate identification profile is likely incorrect because an object with the predicted label, at the distance measured by the AR module 116, should appear smaller. Updating the probabilities can include eliminating candidate identification profiles whose bounding box diverges from the virtual geometry by more than a threshold amount (e.g., 10%, 15%, or any predetermined amount).
  • the updated neural network generates a final identification profile for the real-world objects 130.
  • the final identification profile can include an updated bounding box surrounding a portion of the image that includes the table and an updated label (e.g., if different from the candidate identification profile).
  • the neural network may present the final identification profile via the display of the smartphone. For instance, the neural network (or the AR module 116) can generate bounding boxes 126 and 128, each with a label that corresponds to the respective bounding box, within the AR scene 120 that is displayed.
  • FIG. 2 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data according to an embodiment of the present invention.
  • Object detection may use a combination of an AR framework 204 and an object detection neural network 220.
  • a computing device may execute AR framework 204 and an object detection neural network 220 within an AR session.
  • the computing device may be a mobile device (e.g., smartphone, laptop, tablet, or the like) , a desktop computer, server, thin client, or the like.
  • Object detection may leverage the AR sensor data during operation of the object detection neural network 220 to improve the performance (e.g., accuracy, resource consumption, speed, etc. ) of the object detection neural network 220.
  • the AR framework 204 may operate according to an AR session. Input images may be received and used to continuously track a pose 208 of the computing device (e.g., using SLAM or the like) .
  • the AR framework may also compute a point cloud 212 associated with surfaces and/or objects within the image (e.g., by using depth information included in the RGBD image, a separate depth sensor, or stereo disparity using two images) .
  • the point cloud indicates a depth of each surface (e.g., the distance of the surface from the camera that obtained the image) .
  • the object detection neural network 220 initiates an objected detection process using the same images received by the AR framework 204. In some instances, the object detection neural network 220 initiates in parallel with the AR framework 204 determining pose 208 and generating point cloud 212. In other instances, the object detection neural network 220 initiates before or after the AR framework determines pose 208.
  • the object detection neural network 220 generates candidate identification profiles, which may be refined using regressor 228. Regressor 228 refines the candidate regions (e.g., the bounding boxes of the candidate identification profiles) for categorization by the classifier 224.
  • the classifier 224 generates a probability that the bounding box of a candidate identification profile corresponds to a particular label.
  • the object detection neural network 220 may execute a non-maximum suppression process to eliminate overlapping (e.g., redundant) candidate identification profiles.
  • the label of each remaining candidate identification profile is passed to the AR framework.
  • the AR framework executes a query against the scale information database 216 using the label, and the database returns scale information that corresponds to the label.
  • the scale information includes dimensions (e.g., height, width, and depth) of an arbitrary object that is assigned the label.
  • the scale information may include a minimum value and a maximum value for each dimension. Example scale information can be found in Table 1 below.
  • the scale information may include for each dimension, an average value of the dimension of the object and a standard deviation.
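  • A minimal sketch of such a scale information store is shown below, covering both representations mentioned above (a min/max range per dimension, or a mean with standard deviation); the values are placeholders, not the contents of Table 1.

```python
# An illustrative in-memory scale information store; values are placeholders.
SCALE_INFORMATION = {
    # Representation 1: (min, max) per dimension, in meters.
    "table": {"height": (0.6, 1.0), "width": (0.8, 2.0), "depth": (0.6, 1.2)},
    # Representation 2: mean and standard deviation per dimension, in meters.
    "chair": {"height": {"mean": 0.9, "std": 0.15},
              "width":  {"mean": 0.55, "std": 0.10},
              "depth":  {"mean": 0.55, "std": 0.10}},
}

def query_scale_information(label):
    """Return the dimensions of an arbitrary object assigned to the label,
    analogous to querying the scale information database for that label."""
    return SCALE_INFORMATION.get(label)
```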
  • the scale information may be processed by the AR framework 204 to generate a virtual geometry for the candidate identification profile.
  • the virtual geometry may be generated by modifying the scale information using the measured depth value for the pixels included in the bounding box (e.g., using the pose 208 of the computing device and the point cloud 212) .
  • the virtual geometry may be passed back to the object detection neural network 220 which may perform scale matching 232.
  • Scale matching eliminates candidate identification profiles that have a bounding box that differs beyond a threshold value from the virtual geometry.
  • the scale matching 232 modifies the probability of each candidate identification profile based on the comparison of the virtual geometry to the bounding box (e.g., increasing the probability when the bounding box is close to the virtual geometry and decreasing the probability when the bounding box differs).
  • the neural network selects the candidate identification profile that has the highest probability as the output of the object detection neural network 220.
  • FIG. 3 illustrates an example process for object detection using an augmented neural network of FIG. 2 according to an embodiment of the present invention.
  • one or more images may be received.
  • the images can include color data (e.g., RGB values) and depth values.
  • the images may be received from an RGBD optical sensor (or the like) .
  • an AR framework uses the one or more images to determine a pose of the computing device (e.g., using a SLAM process or the like) and a point cloud of at least one surface depicted in the image.
  • the images can include depth data (e.g., from an RGBD optical sensor) that indicates a distance between a point depicted in an image and the camera.
  • the AR framework uses the depth data to generate a plurality of points, each point corresponding to a discrete coordinate of a surface depicted in the image. Each point may be represented by a coordinate (in three dimensions) within the coordinate system of the computing device.
  • the RGBD optical sensor can generate a multi-dimensional (e.g., two-dimensional) image with each pixel including color data (e.g., RGB values) and a depth value.
  • the AR framework can convert the two-dimensional coordinates of each pixel to three-dimensional coordinates using the depth values of the pixels. These three-dimensional coordinates may make up the points in the point cloud (e.g., where the number of points is equal to the resolution of the two-dimensional image) .
  • the number of points may be less than the resolution of the image.
  • the number of points may be greater than the resolution of an image (e.g., such as when multiple images are used to generate the point cloud using the depth values of each image and/or stereo disparity between the images) .
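  • For illustration, the pixel-to-point conversion described above can be sketched with a standard pinhole-camera unprojection; the intrinsics (fx, fy, cx, cy) are assumed known.

```python
# A sketch of converting an RGBD depth map into a point cloud, assuming a
# pinhole camera with intrinsics (fx, fy, cx, cy); NumPy only.
import numpy as np

def depth_image_to_point_cloud(depth, fx, fy, cx, cy):
    """Convert an HxW depth map (meters per pixel) into an (N, 3) array of
    3D points expressed in the camera coordinate system."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no valid depth
```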
  • a depth value for at least one surface depicted in the image can be extracted by the AR framework.
  • the AR framework may derive a single depth value from the points of the point cloud that correspond to the at least one surface (and optionally the pose) .
  • the AR framework may average the depth values of the points of a surface.
  • the AR framework may select the minimum depth value.
  • the AR framework may select the maximum depth value.
  • the AR framework may select any depth value according to a predefined (e.g., user derived) criteria.
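  • A small sketch of the depth-value extraction options listed above, with the selection criterion left configurable:

```python
# A sketch of reducing the points of one surface to a single depth value;
# the criterion mirrors the options described above (mean, min, max).
import numpy as np

def extract_depth_value(surface_points, criterion="mean"):
    """surface_points: (N, 3) array of points belonging to one surface;
    returns a single depth (z) value according to the chosen criterion."""
    z = surface_points[:, 2]
    if criterion == "mean":
        return float(np.mean(z))
    if criterion == "min":
        return float(np.min(z))
    if criterion == "max":
        return float(np.max(z))
    raise ValueError(f"unknown criterion: {criterion}")
```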
  • the neural network may begin processing the one or more images to perform object detection.
  • the object detection neural network may be a deep neural network such as, but not limited to, a convolutional neural network (CNN) , single-shot multibox detector (SSD) , you only look once (YOLO) , faster region-based CNN (faster R-CNN) , region-based fully convolutional networks (R-FCN) , or the like.
  • the initial layers may identify candidate regions (e.g., potential regions of interest) within the image that may correspond with objects to detect.
  • candidate regions may include a bounding box that surrounds the candidate regions.
  • Subsequent layers may add, remove, or modify the candidate regions as well as determine a probability that the candidate region corresponds to a particular label (e.g., generating a candidate identification profile for the candidate region) .
  • a final layer may include a non-maximum suppression process that eliminates candidate identification profiles of a given candidate region that have a probability less than a threshold amount.
  • the neural network performs operations at one or more successive layers such as, but not limited to: candidate region selection to identify candidate regions of the image; object detection that assigns a label (e.g., a category) to each candidate region; a regressor (e.g., linear, polynomial, logarithmic, or the like) that refines two or more candidate regions into a single candidate region (e.g., by performing a regression analysis); and a classifier (e.g., k-nearest neighbors (k-NN), softmax, or the like) that generates a candidate identification profile that includes the bounding box (e.g., an identification of the candidate region) and a probability that a label corresponds to the portion of the image within the candidate region.
  • the neural network may perform a non-maximum suppression that may eliminate some candidate identification profiles.
  • Non-maximum suppression organizes the candidate identification profiles according to the probability (e.g., from highest to lowest or lowest to highest) . Then the neural network selects the candidate identification profile with the highest probability.
  • the neural network suppresses (e.g., eliminates) all of the candidate identification profiles that have a bounding box that significantly overlaps (e.g., based on a pre-defined threshold level of overlap) with the bounding box of the selected candidate identification profile.
  • non-maximum suppression may also suppress candidate identification profiles that have a probability that does not exceed a threshold value.
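  • For reference, a standard greedy non-maximum suppression over boxes and probabilities looks roughly like the following; the thresholds are illustrative.

```python
# A reference sketch of greedy non-maximum suppression; boxes are
# (x1, y1, x2, y2) and scores are the candidate probabilities.
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def non_maximum_suppression(boxes, scores, iou_threshold=0.5, score_threshold=0.0):
    """Keep the highest-scoring box and suppress boxes that overlap it by more
    than iou_threshold; optionally drop boxes below score_threshold first."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_threshold]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep  # indices of the surviving candidate identification profiles
```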
  • the remaining candidate identification profile (s) are used for scale matching.
  • the AR framework uses the labels of the candidate identification profile (s) to obtain scale information from a database, and augments the scale information with the depth value. For instance, the AR framework may transmit a query using the label to the database.
  • the scale information database stores scale information for each label.
  • the scale information can include, for each dimension (e.g., x, y, and z) , 1) a minimum value and maximum value or 2) an average value and a standard deviation.
  • the scale information database may return the scale information to the AR framework (or to the neural network) .
  • the AR framework may generate a virtual geometry for the candidate identification profile.
  • the virtual geometry is generated by modifying the scale information with the depth value of a surface that is included within the bounding box of the candidate identification profile (e.g., extracted at block 312) .
  • the AR framework receives the general dimensions of an arbitrary object that is assigned the label of the candidate identification profile and modifies those dimensions to account for the distance that the AR framework measures between the object and the computing device. For instance, if the depth value is high, then the virtual geometry will be smaller than the scale information (since the object will appear smaller to a user), and if the depth value is low, then the virtual geometry will be larger than the scale information (since the object will appear larger to a user).
  • the virtual geometry may be projected onto the two-dimensional camera view presented by a display of the computing device.
  • a virtual arbitrary object having the modified dimensions is presented on the display.
  • a bounding box with dimensions that correspond to the modified dimensions is presented on the display.
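  • A sketch of projecting the virtual geometry onto the two-dimensional camera view is shown below, assuming a 3x3 intrinsic matrix K and a fronto-parallel placement at the measured depth (a simplification of the general case).

```python
# A sketch of projecting the virtual geometry (physical width/height at the
# measured depth) onto the 2D camera view; K is the 3x3 intrinsic matrix.
import numpy as np

def project_virtual_geometry(width_m, height_m, depth_m, K):
    """Return the pixel extent (w_px, h_px) that an object of the given
    physical size would occupy when centered at depth_m in front of the camera."""
    half_w, half_h = width_m / 2.0, height_m / 2.0
    # Two opposite corners of a fronto-parallel rectangle at the given depth.
    corners = np.array([[-half_w, -half_h, depth_m],
                        [ half_w,  half_h, depth_m]])
    pixels = (K @ corners.T).T
    pixels = pixels[:, :2] / pixels[:, 2:3]   # perspective divide
    w_px = abs(pixels[1, 0] - pixels[0, 0])
    h_px = abs(pixels[1, 1] - pixels[0, 1])
    return w_px, h_px
```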
  • a scale matching process may be performed.
  • the scale matching process compares the virtual geometry and the bounding box of each of the candidate identification profile (s) .
  • a scale matching probability can be generated based on a comparison of the bounding box with the virtual geometry. Specifically, the more the bounding box overlaps with the virtual geometry, the more likely the bounding box is to be accurate.
  • the value of the scale matching probability depends on a degree to which the bounding box overlaps with the virtual geometry. In some instances, a large degree of overlap increases the value of the scale matching probability and a low degree of overlap decreases it. In other instances, a large degree of overlap decreases the value of the scale matching probability and a low degree of overlap increases it.
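  • One possible (illustrative) scale matching probability that grows with the degree of agreement, computed from the pixel dimensions of the bounding box and the projected virtual geometry, is sketched below; the disclosure does not fix a particular formula.

```python
# An illustrative scale matching probability: 1.0 when the bounding box and
# the projected virtual geometry have the same pixel dimensions, decaying
# toward 0.0 as they diverge.
def scale_matching_probability(bbox_w, bbox_h, virt_w, virt_h):
    w_ratio = min(bbox_w, virt_w) / max(bbox_w, virt_w)
    h_ratio = min(bbox_h, virt_h) / max(bbox_h, virt_h)
    return w_ratio * h_ratio
```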
  • the scale information is processed by the AR framework to generate the virtual geometry and the scale matching probability.
  • the neural network receives the scale information, the pose of the computing device, and the point cloud. The neural network then generates the virtual geometry and the scale matching probability.
  • a combination of the AR framework and the neural network may be used to generate the virtual geometry and scale matching probability.
  • the neural network determines, for each remaining candidate identification profile, an updated probability that the portion of the image within the bounding box corresponds to a particular label.
  • the neural network may receive the candidate identification profile (s) output from the non-maximum suppression process and the corresponding scale matching probability for each candidate identification profile (s) .
  • the scale matching probability may be used to update the probability of a corresponding candidate identification profile. For instance, if the scale matching probability is low (e.g., there is little overlap between the virtual geometry and the bounding box), the probability of the candidate identification profile may be lowered such that the candidate identification profile is effectively removed from contention in favor of a candidate identification profile whose bounding box better overlaps with the virtual geometry.
  • a first candidate identification profile may include a bounding box and a label that corresponds to a table.
  • the depth value for the portion of image indicates the object is further away and thus the object has a virtual geometry that is smaller than the bounding box.
  • the probability of the candidate identification profile would be reduced to reflect that the candidate object’s dimensions are not consistent with the computing device’s measurements.
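  • The disclosure does not fix a particular update formula; a simple illustrative choice is to scale the candidate’s probability by the scale matching probability and eliminate candidates that fall below a threshold.

```python
# An illustrative probability update: p' = p * S, with an elimination cut-off.
def update_candidate_probability(p, s, eliminate_below=0.05):
    """p: candidate probability; s: scale matching probability in [0, 1]."""
    p_updated = p * s
    return 0.0 if p_updated < eliminate_below else p_updated
```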
  • the neural network selects, from the remaining candidate identification profiles, the candidate identification profile that has the highest (modified) probability.
  • the neural network then outputs the bounding box, the label, and (optionally) the probability that the portion of the image surrounded by the bounding box corresponds to the label.
  • the output bounding box is displayed by superimposing the bounding box on a view of the AR scene.
  • the bounding box may be presented with the label and the probability that the portion of the image surrounded by the bounding box corresponds to the label.
  • the AR framework tracks the portion of the image surrounded by the bounding box such that the bounding box continues to be displayed over the proper portion of the image even when a view of the environment changes (e.g., the camera moves) .
  • FIG. 4 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data according to an alternative embodiment of the present invention.
  • the discussion provided in FIG. 2 related to a neural network augmented with augmented-reality sensor data is applicable to FIG. 4 as appropriate.
  • the neural network may be updated with AR sensor data at various stages of neural network.
  • the block diagram of FIG. 2 (and the corresponding flowchart of FIG. 3) shows the modification occurring after the non-maximum suppression process (e.g., at the last layer of the neural network before an output is produced).
  • object detection neural network 420 incorporates the scale matching 408 from the AR sensor data (e.g., AR framework 204, pose 208, point cloud 212 and scale information database 216 as described above in connection to FIG. 2) after object detection layer 404, but prior to the non-maximum suppression 412.
  • the object detection layer 404 of the neural network generates candidate identification profiles for all possible candidate regions of the image without suppressing redundant ones (e.g., overlapping) that have non-maximal probability within a local region.
  • Scale matching 408 may be performed after the operation of the detection layer to filter candidate identification profiles that have a low probability that the portion of the image surrounded by the bounding box of the candidate identification profile corresponds to an assigned label. Since these candidate identification profiles are eliminated prior to the non-maximum suppression, the non-maximum suppression can process fewer candidate identification profiles, thereby increasing the speed at which the non-maximum suppression executes and an output can be obtained. Further, the reduction of candidate identification profiles earlier in the process prevents an inaccurate candidate identification profile from biasing the remaining layers of the neural network (e.g., reducing the accuracy of the neural network).
  • FIG. 5 illustrates an example process for object detection using an augmented neural network of FIG. 4 according to an embodiment of the present invention.
  • the blocks 304-320 are similar to those described in connection to FIG. 3, and the description provided in relation to FIG. 3 is applicable to FIG. 5 as appropriate.
  • the neural network processes the one or more images using the layers of the neural network.
  • the neural network executes an object detection layer (e.g., object detection layer 404 of FIG. 4) that generates an output of one or more candidate identification profiles (e.g., each including a bounding box that surrounds a candidate region of the image and a probability that the candidate region corresponds to a particular label).
  • the label and probability may be output from a trained classifier layer that identifies features in the candidate region that are similar or the same as features in a training dataset.
  • the output of the object detection layer is passed to both categorization at block 508 and scale information retrieval at block 316.
  • the scale information retrieval at block 316 is initiated using all possible candidate regions of the image without suppressing redundant ones (e.g., overlapping) that have non-maximal probability within a local region.
  • the neural network determines, for each remaining candidate identification profile, an updated probability that the portion of the image within the bounding box corresponds to a particular label.
  • the scale matching probability received from block 320 can be used to update the probability of a corresponding candidate identification profile. Since the neural network has not yet applied a non-maximum suppression process, the probability (and label) of all possible candidate regions of the image (e.g., including redundant ones) is updated using the corresponding scale matching information from block 320. In some instances, candidate identification profiles that have a probability lower than a threshold value may be eliminated from further processing.
  • the neural network can apply a non-maximum suppression to the candidate identification profiles.
  • Non-maximum suppression organizes the candidate identification profiles according to the probability (e.g., from highest to lowest or lowest to highest) . Then the neural network selects the candidate identification profile with the highest probability. The neural network then suppresses (e.g., eliminates) all of the candidate identification profiles that have a bounding box that significantly overlaps (e.g., based on a pre-defined threshold level of overlap) with the bounding box of the selected candidate identification profile.
  • non-maximum suppression may suppress candidate identification that have a probability that does not exceed a threshold value.
  • the non-maximum suppression may suppress more candidate identification profiles than the non-maximum suppression performed at block 328 of FIG. 3. This improves the accuracy of the neural network, as fewer (e.g., potentially only one) candidate identification profiles remain after the non-maximum suppression process. If more than one candidate identification profile remains, the neural network selects the candidate identification profile having the highest probability from among the remaining candidate identification profiles.
  • the bounding box, label, and (optionally) probability of the selected candidate identification profile is displayed by superimposing the bounding box, on a view of the AR scene.
  • the AR framework tracks the portion of the image surrounded by the bounding box such that the bounding box continues to be displayed over the proper portion of the image even when a view of the environment changes (e.g., the camera moves) .
  • FIG. 6 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data according to a specific embodiment of the present invention.
  • the neural network may use a two-stage object detection (e.g., faster R-CNN) .
  • the discussion provided in FIGS. 2 and 4 related to a neural network augmented with augmented-reality sensor data is applicable to FIG. 6 as appropriate.
  • the first stage identifies a large number of candidate regions, and the second stage performs object detection/classification on the candidate regions (e.g., using subsequent layers of the neural network).
  • scale matching may be performed after the first stage to reduce the number of candidate regions processed by the subsequent neural network layers (e.g., reducing the processing load and resource consumption of the subsequent neural network layers) .
  • a multiscale feature map may be used to generate classifiers to detect objects at multiple scales (e.g., SSD).
  • scale matching may be used to select classifiers with appropriate scales for detection. For example, for each layer, if the bounding box would appear too small or too big based on the estimated real-world scale (e.g., determined using the point cloud, the pose of the computing device, and scale information to perform scale matching), then the detection classifier can be omitted on that layer.
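  • A sketch of that layer-selection step for a multi-scale detector follows; the per-layer nominal scales and the tolerance are hypothetical values, not values from the disclosure.

```python
# A sketch of omitting multi-scale detection classifiers whose scale is
# inconsistent with the AR-estimated on-screen object size.
def select_scale_classifiers(layer_scales_px, expected_size_px, tolerance=2.0):
    """Keep only the classifier layers whose nominal detection scale (pixels)
    is within a factor of `tolerance` of the expected on-screen object size."""
    selected = []
    for layer_index, scale_px in enumerate(layer_scales_px):
        ratio = max(scale_px, expected_size_px) / min(scale_px, expected_size_px)
        if ratio <= tolerance:
            selected.append(layer_index)
    return selected

# Example: SSD-like layers nominally detecting objects of ~30, 60, 120, 240 px.
layers_to_run = select_scale_classifiers([30, 60, 120, 240], expected_size_px=100)
```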
  • the object detection neural network 620 includes a region proposal layer 604 (e.g., stage one) that generates a large number of candidate regions for the subsequent processing layers.
  • the candidate regions may be reduced using scale matching. For instance, for each candidate region, an assigned label of the candidate region may be used to look up scale information.
  • the scale information may be modified using a depth value (e.g., using the pose 208 and point cloud 212 of the AR framework 204) to generate virtual geometry for the candidate region. If the bounding box of the candidate region is significantly larger or smaller than the virtual geometry (e.g., based on a predetermined threshold value), then the candidate region may be eliminated from further processing by the neural network.
  • the subsequent layers of the neural network may include an object detection layer, a classifier, a non-maximum suppression process 612, combinations thereof, or the like that may be used to output the bounding box and label for a region of the image.
  • object detection neural network 620 includes one or more layers for identifying candidate regions 604 (e.g., instead of an entire stage) .
  • the object detection neural network 620 may include a plurality of scale classifiers 608 that can detect a particular feature in an image at one or more scales (e.g., each scale corresponding to a dimension of the feature at a different distance from the camera of the computing device) .
  • the multi-scale feature map (e.g., with a plurality of scales being included in the map for each feature) may be used to generate (e.g., train) each of the classifiers (e.g., each via a different layer of the neural network) to detect features of a particular scale.
  • Scale matching may be used to determine scales that are not consistent with the AR sensor data (e.g., bounding boxes that are too small or too big, etc. ) .
  • the classifiers that correspond to these scales may be omitted, which may reduce the number of layers of the neural network.
  • the subsequent layers of the neural network may include an object detection layer, a classifier, a non-maximum suppression process 612, combinations thereof, or the like.
  • the subsequent layers may be used to output the bounding box and label for a region of the image.
  • the AR framework 204 may operate according to an AR session.
  • the input images may be received and used to continuously track a pose 208 of the computing device (e.g., using SLAM or the like) .
  • the AR framework may also compute a point cloud 212 associated with surfaces and/or objects within the image (e.g., by using depth information included in the RGBD image, a separate depth sensor, or stereo disparity using two images) .
  • the point cloud indicates a depth of each surface (e.g., the distance of the surface from the camera that obtained the image) .
  • the label of each candidate identification profile may be output to the AR framework.
  • the AR framework may transmit a query using the label to the scale information database 216, which returns scale information that corresponds to the label.
  • the scale information includes dimensions (e.g., height, width, and depth) of an arbitrary object that is assigned the label.
  • the scale information may be processed by the AR framework 204 to generate a virtual geometry for the candidate identification profile.
  • the virtual geometry may be generated by modifying the scale information using the measured depth value for the pixels included in the bounding box (e.g., using the pose 208 of the computing device and the point cloud 212) .
  • the virtual geometry may be passed back to the object detection neural network 220, which may perform scale matching 232.
  • Scale matching 232 may be performed to reduce the number of candidate regions (e.g., candidate identification profiles) to be processed by subsequent layers of object detection neural network 620 or to eliminate scale classifiers that correspond to scales that are inconsistent with the AR sensor data.
  • FIG. 7 illustrates an example process for object detection using an augmented neural network of FIG. 6 according to an embodiment of the present invention.
  • the blocks 304-320 are similar to those described in connection to FIG. 3 and the description provided in relation to FIG. 3 is applicable to FIG. 7 as appropriate.
  • the description of the example process of FIG. 7 begins at block 704, in which the neural network may begin processing input images received from block 304.
  • candidate regions may be generated and/or scale classifiers may be generated. For instance, a two stage object detection neural network generates a large number of candidate regions (e.g., candidate identification profiles) .
  • a single stage object detection neural network may generate candidate regions and use a multi-scale feature map to generate scale-based classifiers.
  • the labels assigned to the candidate identification profiles are used to obtain scale information from a database.
  • the scale information may be augmented by the depth (e.g., from block 312) to generate a virtual geometry (e.g., the dimensions of an arbitrary object assigned to the label and scaled to dimensions consistent with the distance of the portion of the image surrounded by the bounding box from the camera) .
  • scale matching may be performed in which it is determined whether the bounding box for a candidate identification profile is larger, smaller, or equal to virtual geometry.
  • the correspondence between the virtual geometry and the bounding box for a candidate identification profile may be used to eliminate some candidate identification profiles from further processing by the neural network. For instance, if the bounding box is larger than a first predetermined threshold or smaller than a second predetermined threshold, then the candidate identification profile may be eliminated.
  • the scale matching is used to identify scales that are not consistent with the scale of the portion of the image surrounded by the bounding box. Processing of these candidate identification profiles may be limited to scale classifiers (e.g., in subsequent layers) that correspond to a similar scale (dimensions at a particular distance from the camera) as the portion of the image surrounded by the bounding box.
  • the scale matching may be used to determine an updated probability that the portion of the image surrounded by the bounding box corresponds to a particular (or potentially a new) label.
  • the probability may be updated using scale matching information.
  • the candidate identification profiles are processed by subsequent layers of the neural network.
  • the subsequent layers of the neural network may include an object detection layer, a classifier, a non-maximum suppression process 612, combinations thereof, or the like.
  • the neural network may generate an output that includes a bounding box that surrounds a portion of the image and a label that corresponds to the portion of the image.
  • the output may also include the probability that the portion of the image corresponds to the label.
  • the processes of FIGS. 3, 5, and 7 execute in connection with a computer system that is an example of the computer systems described herein above.
  • Some or all of the blocks of the processes can be implemented via specific hardware on the computer system and/or can be implemented as computer-readable instructions stored on a non-transitory computer-readable medium of the computer system.
  • the computer-readable instructions represent programmable modules that include code executable by a processor of the computer system. The execution of such instructions configures the computer system to perform the respective operations.
  • Each programmable module in combination with the processor represents a means for performing a respective block(s). While the blocks are illustrated in a particular order, it should be understood that no particular order is necessary and that one or more operations may be omitted, skipped, and/or reordered.
  • FIG. 8 illustrates an example block diagram depicting the execution layers of a scale matching process according to an embodiment of the present invention.
  • Scale matching may operate using three layers: a user input layer in which a user interacts with the AR framework, a neural network layer, and an AR framework layer.
  • the user input layer includes the scale database.
  • Scale information for each label may be added to a label scale database 802 via user input. For instance, when a new label is defined, a user may generate a new entry in the database that indicates the dimensions of an arbitrary object that is assigned to that label. Alternatively, or additionally, scale information may be added automatically from other networked sources (e.g., the Internet, another database, etc. ) .
  • the label scale database is used (along with label ‘c’ 806 from the bounding polygon of the candidate identification profile 810) to identify the geometry (e.g., scale information) for an arbitrary object that is assigned to the label ‘c’ 806.
  • the AR framework may use the geometry information to generate a virtual geometry that includes a bounding polygon 808 that includes the label ‘c’ 806, vertices (e.g., dimensions), and a confidence value that the bounding polygon is accurate.
  • the dimensions of the bounding polygon 808 are determined based on the geometry 804, the size 824 of the candidate identification profile, the camera pose 816 and the point cloud 812.
  • the bounding polygon 808 can be compared with the size 824 of the bounding polygon of the candidate identification profile, and the comparison can be used by the neural network to generate a scale matching probability ‘S’ 820.
  • the scale matching probability ‘S’ 820 may be used to generate an updated probability P’ 824 for the candidate identification profile (at the neural network layer) using the probability ‘p’ 828 and the scale matching probability ‘S’ 820.
  • the updated probability may then be used in subsequent layers of the neural network to generate an output.
  • FIG. 9 illustrates examples of components of a computer system 900 according to an embodiment of the present invention.
  • the computer system 900 is an example of the computer system described herein above. Although these components are illustrated as belonging to a same computer system 900, the computer system 900 can also be distributed.
  • the computer system 900 includes at least a processor 902, a memory 904, a storage device 906, input/output peripherals (I/O) 908, communication peripherals 910, and an interface bus 912.
  • the interface bus 912 is configured to communicate, transmit, and transfer data, controls, and commands among the various components of the computer system 900.
  • the memory 904 and the storage device 906 include computer-readable storage media, such as RAM, ROM, electrically erasable programmable read-only memory (EEPROM) , hard drives, CD-ROMs, optical storage devices, magnetic storage devices, electronic non-volatile computer storage (for example, flash memory) , and other tangible storage media. Any of such computer readable storage media can be configured to store instructions or program codes embodying aspects of the disclosure.
  • the memory 904 and the storage device 906 also include computer readable signal media.
  • a computer readable signal medium includes a propagated data signal with computer readable program code embodied therein. Such a propagated signal takes any of a variety of forms including, but not limited to, electromagnetic, optical, or any combination thereof.
  • a computer readable signal medium includes any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use in connection with the computer system 900.
  • the memory 904 includes an operating system, programs, and applications.
  • the processor 902 is configured to execute the stored instructions and includes, for example, a logical processing unit, a microprocessor, a digital signal processor, and other processors.
  • the memory 904 and/or the processor 902 can be virtualized and can be hosted within another computer system of, for example, a cloud network or a data center.
  • the I/O peripherals 908 include user interfaces, such as a keyboard, screen (e.g., a touch screen) , microphone, speaker, other input/output devices, and computing components, such as graphical processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals.
  • the I/O peripherals 908 are connected to the processor 902 through any of the ports coupled to the interface bus 912.
  • the communication peripherals 910 are configured to facilitate communication between the computer system 900 and other computing devices over a communications network and include, for example, a network interface controller, modem, wireless and wired interface cards, antenna, and other communication peripherals.
  • a computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs.
  • Suitable computing devices include multipurpose microprocessor-based computer systems accessing stored software that programs or configures the computer system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
  • Embodiments of the methods disclosed herein may be performed in the operation of such computing devices.
  • the order of the blocks presented in the examples above can be varied; for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
  • use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited.
  • use of “based at least in part on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based at least in part on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
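As an illustration of the FIG. 8 flow described in the list above, the following is a minimal sketch of how a classifier probability ‘p’ and a scale matching probability ‘S’ might be fused into an updated probability P’. The multiplicative fusion rule and the size-ratio form of S are assumptions chosen for clarity; the disclosure does not mandate a particular formula.

```python
# Minimal sketch of the FIG. 8 probability update. The multiplicative fusion
# of p and S, and the size-ratio form of S, are assumptions for illustration.

def scale_match_probability(bbox_size: float, expected_size: float) -> float:
    """Score agreement between the detected box size and the virtual
    geometry projected for the label (1.0 means perfect agreement)."""
    if max(bbox_size, expected_size) == 0:
        return 0.0
    return min(bbox_size, expected_size) / max(bbox_size, expected_size)

def update_probability(p: float, s: float) -> float:
    """Fuse classifier probability p with scale matching probability s."""
    return p * s

# Example: a 'chair' candidate whose box is about 20% larger than expected.
p = 0.72                                                       # probability 'p'
s = scale_match_probability(bbox_size=1.2, expected_size=1.0)  # probability 'S'
print(f"S = {s:.2f}, updated probability P' = {update_probability(p, s):.2f}")
```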

Abstract

Methods and systems are disclosed for object detection in augmented reality environments. An image of an environment is received and used to generate a point cloud. Depth values are extracted using the point cloud. One or more candidate identification profiles are received from a neural network. Each candidate identification profile includes a first bounding box that surrounds a first portion of the image that includes at least a portion of a particular object, a label, and an identification probability. Virtual geometry, generated by modifying scale information associated with the candidate identification profile using the depth values, is used to generate a scale matching probability. The scale matching probability updates the neural network. A final identification profile is generated using the neural network. The final identification profile includes a second bounding box that surrounds a second portion of the image that includes the particular object and an updated label.

Description

SYSTEM AND METHOD FOR OBJECT DETECTION FOR AUGMENTED REALITY
BACKGROUND OF THE INVENTION
An Augmented Reality (AR) system superimposes virtual content over a user’s view of the real world environment. With the development of AR software development kits (SDK) , the mobile industry has brought smartphone AR to mainstream. An AR SDK typically provides six degrees-of-freedom (6DoF) tracking capability. A user can scan the environment using a smartphone’s camera, and the smartphone performs visual inertial odometry (VIO) in real time. Once the camera pose is tracked continuously, virtual objects can be placed into the AR scene to create an illusion that real objects and virtual objects are merged together.
Some augmented-reality applications generate virtual objects intended to appear naturally within the real-world environment. Generating a virtual object (e.g., a bed) that is placed out of context (e.g., in a bathroom) prevents the augmented-reality application from generating naturally appearing objects. Thus, it can be important in such systems to identify objects within the real-world environment for improved augmented-reality displays.
SUMMARY OF THE INVENTION
The present disclosure relates generally to augmented reality, and more specifically, and without limitation, to object detection in neural networks augmented using augmented-reality sensor data.
Aspects of the present disclosure include methods for object detection in augmented reality environments. The methods include receiving, by a camera of a mobile device, an image of an environment; generating, by the mobile device, a point cloud that indicates a distance of a particular object that is depicted in the image; extracting, from the point cloud, a depth value that indicates a distance between the camera and the particular object; receiving, from a trained neural network, one or more candidate identification profiles, each candidate identification profile including a first bounding box that surrounds a first portion of the image that includes at least a portion of the particular object and an identification probability that indicates a probability that the particular object corresponds to a particular label; receiving scale information that corresponds to the particular label, the scale information indicating a geometry of an arbitrary object that corresponds to the particular label; generating virtual geometry by modifying the scale information using the depth value; generating a scale matching probability by comparing the first bounding box to the virtual geometry; updating the neural network using the scale matching probability; and generating, using the updated neural network, a final identification profile that includes a second bounding box that surrounds a second portion of the image that includes the particular object and an updated label that corresponds to the particular object.
Another aspect of the present disclosure includes a system comprising one or more processors and a non-transitory computer-readable media that includes instructions that when executed by the one or more processors, cause the one or more processors to perform methods described above.
Another aspect of the present disclosure includes a non-transitory computer-readable media that includes instructions that when executed by one or more processors, cause the one or more processors to perform the methods described above.
Numerous benefits are achieved by way of the present disclosure over conventional techniques. For example, methods and systems are provided that utilize a neural network that generates improved object detection and identification characterized by a higher accuracy than conventional techniques. Scale matching using augmented-reality sensor data can be performed as an operation of a neural network to modify candidate regions being identified by the neural network. This results in the neural network generating improved object detection and identification that has a higher accuracy. In addition, the scale matching applied at various stages of the neural network may eliminate some candidate objects from consideration by the neural network, which can reduce the processing resources consumed by the neural network and increase the processing speed of successive layers of the neural network.
Further areas of applicability of the present disclosure will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating various embodiments, are intended for purposes of illustration only and are not intended to necessarily limit the scope of the disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an example of a computer system that includes a depth sensor and a red, green, and blue (RGB) optical sensor for AR applications according to an embodiment of the present invention.
FIG. 2 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data according to an embodiment of the present invention.
FIG. 3 illustrates an example process for object detection using an augmented neural network of the embodiment of FIG. 2 according to an embodiment of the present invention.
FIG. 4 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data according to an alternative embodiment of the present invention.
FIG. 5 illustrates an example process for object detection using an augmented neural network of the embodiment of FIG. 4 according to an embodiment of the present invention.
FIG. 6 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data, according to a specific embodiment of the present invention.
FIG. 7 illustrates an example process for object detection using an augmented neural network of the embodiment of FIG. 6 according to an embodiment of the present invention.
FIG. 8 illustrates an example block diagram depicting the execution layers of a scale matching process according to an embodiment of the present invention.
FIG. 9 illustrates an example computer system according to an embodiment of the present invention.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.
Embodiments of the present disclosure are directed to, among other things, object detection in neural networks augmented using augmented-reality sensor data. For instance, an image is received from a camera of a mobile device. The mobile device, using the image, may generate a pose (e.g., position and orientation) of the camera and a point cloud of a particular object that is depicted in the image. The mobile device may use the pose of the camera and the point cloud to generate a depth value that indicates a distance between the camera and the particular object. The mobile device may receive, from a neural network, candidate identification profiles for portions of the image. Each profile can include a first bounding box that surrounds a first portion of the image that includes a portion of the particular object, a label that corresponds to the particular object, and a probability that the label corresponds to the particular object. The mobile device may receive scale information that indicates a geometry of an arbitrary object that corresponds to the label. Virtual geometry can be generated by modifying the scale information using the depth value. The mobile device may use the virtual geometry to generate a scale matching probability, which may update the neural network. The updated neural network outputs a final identification profile that includes a second bounding box that surrounds a second portion of the image that includes the particular object and a new label that corresponds to the particular object.
For example, an AR application may use object detection to select a particular virtual object and/or the placement of a particular virtual environment within an AR scene. The AR application may select a particular virtual object, or its placement, based on a context of the real-world environment such that the virtual object appears as if it belongs in the real-world environment. Object detection may be performed by the AR application, by a neural network, or by a neural network that is augmented by AR sensor data. For instance, an image of an office environment may be captured by a camera of a mobile device. The mobile device generates a point cloud of objects in the image, such as a desk, a chair, etc., and computes a depth value that corresponds to at least one object (e.g., the distance between the camera and a chair) . A neural network generates candidate identification profiles of the objects in the image. Each candidate identification profile includes a bounding box that surrounds a portion of the image that corresponds to an object such as the chair. The candidate identification profile may also include a probability that the portion of the image corresponds to a particular label (e.g., chairs) . The mobile device may receive scale information for an arbitrary chair. The scale information includes average dimensions for arbitrary chairs (e.g., between 0.5 and 1 meter high, between 0.5 and 1 meter wide, etc. ) . The mobile device then generates virtual geometry by modifying the scale information using the depth value (e.g., accounting for the distance of the object from the camera) .
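As a rough illustration of how the scale information for a label can be turned into virtual geometry at a measured depth, the following sketch projects physical dimensions into expected pixel extents with a pinhole camera model. The focal length and the specific dimension ranges are hypothetical values, not parameters taken from the disclosure.

```python
# Minimal sketch of generating "virtual geometry" from scale information and
# a measured depth using a pinhole projection. The focal length and the use
# of min/max ranges are illustrative assumptions.

def project_extent(physical_size_m: float, depth_m: float, focal_px: float) -> float:
    """Expected on-screen size (pixels) of an object of physical_size_m metres
    observed at depth_m metres by a camera with focal length focal_px pixels."""
    return focal_px * physical_size_m / depth_m

chair_scale_m = {"min_h": 0.8, "max_h": 1.4, "min_w": 0.45, "max_w": 0.9}
depth_m = 2.5       # depth value extracted from the point cloud
focal_px = 1500.0   # hypothetical smartphone camera focal length

expected_h_px = tuple(project_extent(chair_scale_m[k], depth_m, focal_px)
                      for k in ("min_h", "max_h"))
expected_w_px = tuple(project_extent(chair_scale_m[k], depth_m, focal_px)
                      for k in ("min_w", "max_w"))
print(f"expected height range: {expected_h_px[0]:.0f}-{expected_h_px[1]:.0f} px")
print(f"expected width range: {expected_w_px[0]:.0f}-{expected_w_px[1]:.0f} px")
```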
The mobile device can then generate a scale matching probability by comparing the bounding box of the candidate identification profile with the virtual geometry. The scale matching probability may indicate the accuracy of an identification profile. For instance, if the bounding box is close to the virtual geometry, then the bounding box is likely accurate. If the bounding box diverges from the virtual geometry, then the identification profile may be inaccurate.
The scale matching probability can be used to update the neural network. For instance, the probability of an identification profile may be updated (e.g., being more or less probable of corresponding to a particular label) , or eliminated (e.g., when the scale matching probability indicates that the identification profile is too inaccurate) , or the like. The updated neural network provides a final identification profile that includes a second bounding box that surrounds a second portion of the image that includes the chair, and a new label. In some instances, the second portion of the image may be equal to the first portion of the image. In other instances, the second portion of the image may more accurately include the object being detected (e.g., the chair) . Similarly, the new label may be the same as the old label (if the candidate identification profile included a correct label) or the updated neural network may generate a new label that more accurately corresponds to the object (e.g., office chairs) .
Scale matching (e.g., the generation or application of the scale matching probability) may be performed at various stages of the object detection by the neural network. For instance, scale matching may be performed after the last layer (e.g., after the non-maximum suppression process) , before the last layer (e.g., before the non-maximum suppression process) , and/or earlier in the neural network processing pipeline (e.g., after candidate region generation, after scale classifier generation, etc. ) .
The methods and systems described herein increase the accuracy of object detection in neural networks. The neural network uses scale matching from the augmented-reality sensor data to modify candidate objects that the neural network is attempting to identify in an image. This results in the neural network generating improved object detection and identification that has a higher accuracy. In addition, the scale matching can, at various stages of the neural network, eliminate some candidate objects from consideration by the neural network, which can reduce the processing resources consumed by the neural network and increase the processing speed of successive layers of the neural network.
FIG. 1 illustrates an example of a computer system 110 that includes a depth sensor 112 and an RGB optical sensor 114 for AR applications, according to an embodiment of the present invention. The AR applications can be implemented by an AR module 116 of the computer system 110. Generally, the RGB optical sensor 114 generates an RGB image of a real-world environment that includes, for instance, a real-world object 130. The depth sensor 112 generates depth data about the real-world environment, where this data includes, for instance, a depth map that shows depth (s) of the real-world object 130 (e.g., distance (s) between the depth sensor 112 and the real-world object 130) . Following an initialization of an AR session (where this initialization can include calibration and tracking) , the AR module 116 renders an AR scene 120 of the real-world environment in the AR session, where this AR scene 120 can be presented at a graphical user interface (GUI) on a display of the computer system 110. The AR scene 120 shows a real-world object representation 122 of the real-world object 130. In addition, the AR scene 120 shows a virtual object 124 not present in the real-world environment. The AR module 116 can generate a red, green, blue, and depth (RGBD) image from the RGB image and the depth map that may be processed by the AR system to augment an object detection neural network. For instance, the AR application may use a pose of the computer system 110 and a point cloud of objects to obtain scale information of candidate object categories identified by a neural network. The scale information is passed to the neural network to refine the object detection, which increases the accuracy of object detection (and identification) and reduces the processing resources consumed by the neural network.
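For illustration, a minimal sketch of combining an RGB image with an aligned depth map into a single RGBD array, similar to the RGBD image the AR module 116 is described as generating; the assumption that the two sensors are already registered to the same resolution is made purely to keep the example short.

```python
# Minimal sketch of combining an RGB image and an aligned depth map into a
# single RGBD array. Registration of the two sensors to the same resolution
# is assumed to keep the example short.
import numpy as np

rgb = np.zeros((480, 640, 3), dtype=np.float32)     # stand-in RGB image
depth = np.full((480, 640), 2.0, dtype=np.float32)  # stand-in depth map (metres)

rgbd = np.dstack([rgb, depth])                      # shape (H, W, 4)
print(rgbd.shape)                                   # (480, 640, 4)
```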
In an example, the computer system 110 represents a suitable user device that includes, in addition to the depth sensor 112 and the RGB optical sensor 114, one or more graphical processing units (GPUs) , one or more general purpose processors (GPPs) , and one or more memories storing computer-readable instructions that are executable by at least one of the processors to perform various functionalities of the embodiments of the present disclosure. For instance, the computer system 110 can be any of a smartphone, a tablet, an AR headset, or a wearable AR device.
The AR module 116 may execute a visual-inertial odometry (VIO) process to track the pose (e.g., position and orientation) of the AR module 116. VIO uses image analysis and an inertial measurement unit (sensor data) to determine changes in the camera’s (e.g., the AR module’s 116) position and/or orientation. Visual odometry can use feature detection in images to identify and correlate features across successive images. The feature detection may be used to generate an optical flow field that estimates the motion of the camera relative to objects depicted in the successive images. The degree of motion between time intervals (e.g., the time interval between successive images) may be used to determine the distance and direction the camera has moved during the time interval. The distance and direction of the camera (and sensor data) may be used to track a position and an orientation of the camera at each time interval. Visual odometry may be augmented using an inertial measurement unit that captures directional force values.
For instance, AR module 116 may execute an implementation of VIO called a simultaneous localization and mapping (SLAM) process. For instance, the SLAM process may initiate with a calibration step in which an empty map of the environment may be initialized with the device positioned at the origin of the coordinate system. The SLAM process receives input data such as, but not limited to, image data, control data ct, sensor data st, and time interval t. The SLAM process then generates an output that may include an approximate location of the device xt for a given time interval (relative to one or more approximate locations at one or more previous time intervals) and a map of the environment mt. The output can be augmented (or verified) using feature detection on images captured at time t and time t+1 to identify and correlate features across the images. The changes between images can be used to verify the movement of the AR module 116, populate the environment mt with objects detected in the images, etc.
As the device captures sensor data that indicates movement in a particular direction (and image data from the camera of the device) , the SLAM process may update xt and mt. The SLAM process may be an iterative process that updates xt and mt in set time intervals or when new sensor data or image data is detected. For instance, if no sensor change occurs between time interval t and t+1, then the SLAM process may delay updating the position and map to preserve processing resources. Upon detecting a change in sensor data indicating a high probability that the device has moved from its previous position xt, the SLAM process may compute the new position of the device xt and update the map mt.
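The following is a minimal sketch of the iterative update behavior described above, where the pose xt and map mt are recomputed only when the sensed motion exceeds a threshold. The motion and map-update helpers are placeholders rather than an implementation of any particular SLAM algorithm.

```python
# Minimal sketch of an iterative SLAM-style update loop: the pose x_t and map
# m_t are only recomputed when the sensor data suggests the device has moved.
# The predict/update helpers are placeholders, not a real SLAM implementation.

MOTION_THRESHOLD = 0.05  # hypothetical threshold on sensed motion magnitude

def predict_pose(x_prev, control, sensors):
    """Placeholder motion model: integrate control/IMU data into a new pose."""
    return [xi + ci for xi, ci in zip(x_prev, control)]

def update_map(m_prev, x_t, image):
    """Placeholder observation model: add detected landmarks to the map."""
    return m_prev  # no-op in this sketch

def slam_step(x_prev, m_prev, control, sensors, image):
    motion_magnitude = sum(abs(c) for c in control)
    if motion_magnitude < MOTION_THRESHOLD:
        # Skip the update to preserve processing resources.
        return x_prev, m_prev
    x_t = predict_pose(x_prev, control, sensors)
    m_t = update_map(m_prev, x_t, image)
    return x_t, m_t

x, m = [0.0, 0.0, 0.0], {}
x, m = slam_step(x, m, control=[0.1, 0.0, 0.0], sensors=None, image=None)
print(x)  # pose updated because the sensed motion exceeded the threshold
```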
The AR module 116 can be implemented as specialized hardware and/or a combination of hardware and software (e.g., general purpose processor and computer-readable instructions stored in memory and executable by the general purpose processor) . In addition to initializing an AR session and performing VIO, the AR module 116 can use the augmented-reality sensor data to improve object detection of the real-world objects. The AR module can determine a distance between the AR module 116 and the real-world object 130. During object detection by a neural network, the AR module 116 may use scale matching of candidate categories of an object (e.g., one or more categories that the neural network has assigned as possibly corresponding to the real-world object 130) to update the neural network. For instance, the AR module 116 can obtain dimensions for an object that corresponds to a candidate label. The dimensions are then modified based on the distance between the object and the camera as determined by the AR module 116 (e.g., if the object is far away then the dimensions will be smaller and vice versa) . Then if the candidate bounding box for a label generated by the neural network is too large or too small given the scale matching, the probability that the bounding box for the label is correct can be reduced (or the bounding box and label may be eliminated from contention) .
In an illustrative example of FIG. 1, a smartphone is used for an AR session that shows the real-world environment. In particular, the AR session includes rendering an AR scene that includes a representation of a real-world table on top of which a vase (or some other real-world object) is placed. The augmented-reality session can include rendering of virtual objects that are superimposed onto the AR scene such as virtual object 124. The AR session can include a neural network that identifies objects in the real-world environment (e.g., to select the virtual object 124, position the virtual object 124, etc. ) . The AR module 116 initializes the AR session by using an image to determine an initial pose of the smartphone (e.g., using a SLAM process or the like) and initializing a coordinate system relative to the smartphone.
The AR module 116 may generate a point cloud for objects depicted in the image, such as real-world object 130. A point cloud includes two or more points (e.g., coordinates in the coordinate system) , each point corresponding to a discrete point of the object. Point clouds represent the depth of the object (e.g., the distance of the object relative to the smartphone) . The AR module 116 uses the point cloud and the pose of the smartphone to extract the depth value for the object.
The smartphone executes a neural network for object detection. The neural network uses the images captured by the camera to detect and identify the objects depicted therein. The neural network may generate one or more candidate identification profiles. Each candidate identification profile includes a bounding box around a portion of the image that includes a portion of the object and a probability that the object corresponds to a particular label of objects. The bounding box may be represented by a vector such as (c, x, y, z, w, h, p) , where x, y, and z are position coordinates, w is the width, h is the height, c is the label, and p is the probability that the label is correct. For each candidate identification profile, the AR module 116 uses the label to obtain scale information. The scale information includes dimensions for an arbitrary object that conforms to the label. For instance, if the label is tables, the scale information includes average dimensions of an arbitrary table. Alternatively, scale information can include ranges (min-max in each dimension) for tables. The AR module 116 generates virtual geometry for the arbitrary object (e.g., tables) . The virtual geometry is generated by modifying the scale information with the depth value.
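For illustration, the candidate identification profile and the scale information could be carried in simple data structures such as the following sketch; the field names and the example values are assumptions, with the bounding-box fields mirroring the (c, x, y, z, w, h, p) vector mentioned above.

```python
# Minimal sketch of the data carried by a candidate identification profile and
# by a label scale entry. Field names and example values are illustrative; the
# bounding-box fields mirror the (c, x, y, z, w, h, p) vector mentioned above.
from dataclasses import dataclass

@dataclass
class CandidateIdentificationProfile:
    label: str          # c: candidate label, e.g. "table"
    x: float            # bounding-box position
    y: float
    z: float
    w: float            # bounding-box width
    h: float            # bounding-box height
    probability: float  # p: probability that the label is correct

@dataclass
class LabelScale:
    label: str
    min_l: float
    min_w: float
    min_h: float
    max_l: float
    max_w: float
    max_h: float

candidate = CandidateIdentificationProfile("table", 0.2, 0.4, 1.8, 1.1, 0.8, 0.63)
table_scale = LabelScale("table", 0.6, 0.6, 0.5, 3.0, 1.5, 1.1)  # hypothetical values
print(candidate, table_scale, sep="\n")
```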
The AR module 116 uses a comparison of the bounding box and the virtual geometry to generate a scale matching probability. The scale matching probability may be used to update the neural network by adjusting the probability assigned to the candidate identification profile. For instance, if the bounding box for the label table is larger than the virtual geometry, then the predicted label of the candidate identification profile is likely incorrect because an object with the predicted label at the distance measured by the AR module 116 should appear smaller. Updating the probabilities can include eliminating candidate identification profiles with a bounding box that diverges from the virtual geometry by more than a threshold amount (e.g., 10%, 15%, or any predetermined amount) .
The updated neural network generates a final identification profile for the real-world objects 130. The final identification profile can include an updated bounding box surrounding a portion of the image that includes the table and an updated label (e.g., if different from the candidate identification profile) . The neural network may present the final identification profile via the display of the smartphone. For instance, the neural network (or the AR module 116) can generate bounding boxes 126 and 128, each with a label that corresponds to the respective bounding box within the AR scene 120 that is displayed.
FIG. 2 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data according to an embodiment of the present invention. Object detection may use a combination of an AR framework 204 and an object detection neural network 220. For instance, a computing device may execute AR framework 204 and an object detection neural network 220 within an AR session. The  computing device may be a mobile device (e.g., smartphone, laptop, tablet, or the like) , a desktop computer, server, thin client, or the like. Object detection may leverage the AR sensor data during operation of the object detection neural network 220 to improve the performance (e.g., accuracy, resource consumption, speed, etc. ) of the object detection neural network 220.
The AR framework 204 may operate according to an AR session. Input images may be received and used to continuously track a pose 208 of the computing device (e.g., using SLAM or the like) . The AR framework may also compute a point cloud 212 associated with surfaces and/or objects within the image (e.g., by using depth information included in the RGBD image, a separate depth sensor, or stereo disparity using two images) . The point cloud indicates a depth of each surface (e.g., the distance of the surface from the camera that obtained the image) .
The object detection neural network 220 initiates an object detection process using the same images received by the AR framework 204. In some instances, the object detection neural network 220 initiates in parallel with the AR framework 204 determining pose 208 and generating point cloud 212. In other instances, the object detection neural network 220 initiates before or after the AR framework determines pose 208. The object detection neural network 220 generates candidate identification profiles which may be refined using regressor 228. Regressor 228 refines the candidate regions (e.g., the bounding box of candidate identification profiles) for categorization by the classifier 224. The classifier 224 generates a probability that the bounding box of a candidate identification profile corresponds to a particular label. The object detection neural network 220 may execute a non-maximum suppression process to eliminate overlapping (e.g., redundant) candidate identification profiles.
The label of each remaining candidate identification profile is passed to the AR framework. The AR framework executes a query using the label to the scale information database 216, which returns scale information that corresponds to the label. The scale information includes dimensions (e.g., height, width, and depth) of an arbitrary object that is assigned the label. The scale information may include a minimum value and maximum value for each dimension. Example scale information can be found in Table 1 below. Alternatively, the scale information may include, for each dimension, an average value of the dimension of the object and a standard deviation.
Label      minL (m)   minW (m)   minH (m)   maxL (m)   maxW (m)   maxH (m)
Airplane   3          2.18       1.5        84.0       117.0      24.2
Car        2.4        1.2        1.2        5.6        2.0        2.0
Chair      0.4        0.45       0.8        0.85       0.9        1.4
Mouse      0.13       0.19       0.05       0.08       0.05       0.03
Table 1. Example of label scale database
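A minimal sketch of the label scale database lookup, backed here by an in-memory dictionary populated with two rows of Table 1; an actual deployment might query a networked database as described above.

```python
# Minimal sketch of a label scale database lookup backed by a dictionary,
# using two rows of Table 1 as printed above.

LABEL_SCALE_DB = {
    # label: (minL, minW, minH, maxL, maxW, maxH) in metres
    "car":   (2.4, 1.2, 1.2, 5.6, 2.0, 2.0),
    "chair": (0.4, 0.45, 0.8, 0.85, 0.9, 1.4),
}

def query_scale_information(label: str):
    """Return the stored dimension ranges for a label, or None if unknown."""
    return LABEL_SCALE_DB.get(label.lower())

print(query_scale_information("Chair"))  # (0.4, 0.45, 0.8, 0.85, 0.9, 1.4)
```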
The scale information may be processed by the AR framework 204 to generate a virtual geometry for the candidate identification profile. The virtual geometry may be generated by modifying the scale information using the measured depth value for the pixels included in the bounding box (e.g., using the pose 208 of the computing device and the point cloud 212) . The virtual geometry may be passed back to the object detection neural network 220 which may perform scale matching 232. Scale matching eliminates candidate identification profiles that have a bounding box that differs beyond a threshold value from the virtual geometry. Alternatively or additionally, the scale matching 232 modifies the probability of each candidate identification profile based on the comparison of the virtual geometry to the bounding box (e.g., increasing the probability when the bounding box is close to the virtual geometry and decreasing the probability when the bounding box differs) . The neural network then selects the candidate identification profile that has the highest probability as the output of the object detection neural network 220.
FIG. 3 illustrates an example process for object detection using an augmented neural network of FIG. 2 according to an embodiment of the present invention.
At block 304, one or more images may be received. The images can include color data (e.g., RGB values) and depth values. For instance, the images may be received from an RGBD optical sensor (or the like) .
At block 308, an AR framework (such as AR framework 204 of FIG. 2) uses the one or more images to determine a pose of the computing device (e.g., using a SLAM process or the like) and a point cloud of at least one surface depicted in the image. For instance, the images can include depth data (e.g., from an RGBD optical sensor) that indicates a distance between a point depicted in an image and the camera. The AR framework uses the depth data to generate a plurality of points, each point corresponding to a discrete coordinate of a surface depicted in the image. The point may be represented by a coordinate (in three dimensions) within the coordinate system of the computing device.
For instance, the RGBD optical sensor can generate a multi-dimensional (e.g., two-dimensional) image with each pixel including color data (e.g., RGB values) and a depth value. The AR framework can convert the two-dimensional coordinates of each pixel to three-dimensional coordinates using the depth values of the pixels. These three-dimensional coordinates may make up the points in the point cloud (e.g., where the number of points is equal to the resolution of the two-dimensional image) . In some instances, the number of points may be less than the resolution of the image. In other instances, the number of points may be greater than the resolution of an image (e.g., such as when multiple images are used to generate the point cloud using the depth values of each image and/or stereo disparity between the images) .
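As an illustration of the two-dimensional-to-three-dimensional conversion described above, the following sketch back-projects a pixel and its depth value into camera coordinates with a pinhole model; the intrinsic parameters are hypothetical.

```python
# Minimal sketch of converting a pixel (u, v) with a metric depth value into a
# three-dimensional point via a pinhole camera model. Intrinsics are assumed.
import numpy as np

fx, fy = 1500.0, 1500.0   # focal lengths in pixels (assumed)
cx, cy = 320.0, 240.0     # principal point (assumed)

def unproject(u: int, v: int, depth_m: float) -> np.ndarray:
    """Back-project a pixel with a metric depth value into camera coordinates."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

point = unproject(400, 300, 2.0)
print(point)  # 3D coordinates of that pixel at 2 metres depth
```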
At block 312, a depth value for at least one surface depicted in the image can be extracted by the AR framework. The AR framework may derive a single depth value from the points of the point cloud that correspond to the at least one surface (and optionally the pose) . In some instances, since the point cloud may include multiple points for each surface, the AR framework may average the depth values of the points of a surface. In other instances, the AR framework may select the minimum depth value. In still yet other instances, the AR framework may select the maximum depth value. The AR framework may select any depth value according to a predefined (e.g., user derived) criteria.
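A minimal sketch of reducing the depths of the points belonging to one surface to a single depth value under a configurable criterion (mean, minimum, or maximum), mirroring the predefined-criteria behavior described above.

```python
# Minimal sketch of deriving a single depth value for a surface from its
# point-cloud depths using a configurable criterion.

def extract_depth(depths, criterion="mean"):
    if criterion == "min":
        return min(depths)
    if criterion == "max":
        return max(depths)
    return sum(depths) / len(depths)

print(extract_depth([1.9, 2.0, 2.2]))          # mean: ~2.03
print(extract_depth([1.9, 2.0, 2.2], "min"))   # 1.9
```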
At block 324 (and approximately in parallel with 308 in some embodiments) , the neural network (e.g., such as object detection neural network 220 of FIG. 2) may begin processing the one or more images to perform object detection. The object detection neural network may be a deep neural network such as, but not limited to, a convolutional neural network (CNN) , single-shot multibox detector (SSD) , you only look once (YOLO) , faster region-based CNN (faster R-CNN) , region-based fully convolutional networks (R-FCN) , or the like. Neural networks perform object detection in layers where the processing at a given layer may both dictate the subsequent layer and be used by the subsequent layer. The initial layers may identify candidate regions (e.g., potential regions of interest) within the image that may correspond with objects to detect. Candidate regions may include a bounding box that surrounds the candidate regions. Subsequent layers may add, remove, or modify the candidate regions as well as determine a probability that the candidate region corresponds to a particular label (e.g., generating a candidate identification profile for the candidate region) . A final layer may include a non-maximum suppression process that eliminates candidate identifications of a given candidate region that have a probability less than a threshold amount.
For instance, the neural network performs operations at one or more successive layers such as, but not limited to, candidate region selection to identify candidate regions of the image, object detection that assigns a label (e.g., a category) to each candidate region, a regressor (e.g., linear, polynomial, logarithmic, or the like) that refines two or more candidate regions (e.g., by performing a regression analysis) into a single candidate region, and a classifier (e.g., k-nearest neighbors (k-NN) , softmax, or the like) that generates a candidate identification profile that includes the bounding box (e.g., an identification of the candidate region) and a probability that a label (e.g., a category) corresponds to the portion of the image that corresponds to the candidate region.
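For illustration, candidate bounding boxes, labels, and probabilities can be obtained from an off-the-shelf detector such as the Faster R-CNN model shipped with recent versions of torchvision; this is only one example of the detector families listed above, not the specific network of the disclosure.

```python
# Minimal sketch of obtaining candidate boxes, labels, and probabilities from
# an off-the-shelf two-stage detector (torchvision Faster R-CNN), used here
# purely as an example; it is not the specific network of the disclosure.
import torch
import torchvision

# weights="DEFAULT" downloads pretrained COCO weights (torchvision >= 0.13).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)      # stand-in RGB image with values in [0, 1]
with torch.no_grad():
    outputs = model([image])[0]      # dict with 'boxes', 'labels', 'scores'

# Each (box, label, score) triple plays the role of a candidate identification
# profile before scale matching is applied.
for box, label, score in zip(outputs["boxes"], outputs["labels"], outputs["scores"]):
    if score > 0.5:
        print(label.item(), round(score.item(), 2), box.tolist())
```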
At block 328, the neural network may perform a non-maximum suppression that may eliminate some candidate identification profiles. Non-maximum suppression organizes the candidate identification profiles according to the probability (e.g., from highest to lowest or lowest to highest) . Then the neural network selects the candidate identification profile with the highest probability. The neural network suppresses (e.g., eliminates) all of the candidate identification profiles that have a bounding box that significantly overlaps (e.g., based on a pre-defined threshold level of overlap) with the bounding box of the selected candidate identification profile. Alternatively, or additionally, non-maximum suppression may suppress candidate identification profiles that have a probability that does not exceed a threshold value. The remaining candidate identification profile (s) (or just the labels in some embodiments) are used for scale matching.
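A minimal sketch of the non-maximum suppression step described above: sort by probability, keep the best candidate, and suppress candidates whose boxes overlap it beyond a predefined threshold. Boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
# Minimal sketch of non-maximum suppression over candidate identification
# profiles represented as (box, label, probability) tuples.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def non_maximum_suppression(profiles, overlap_threshold=0.5):
    """Keep the highest-probability candidates, suppressing overlapping ones."""
    remaining = sorted(profiles, key=lambda p: p[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [p for p in remaining if iou(best[0], p[0]) < overlap_threshold]
    return kept

candidates = [((10, 10, 100, 120), "chair", 0.9),
              ((12, 14, 102, 118), "chair", 0.7),    # redundant, suppressed
              ((200, 50, 320, 200), "table", 0.8)]
print(non_maximum_suppression(candidates))
```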
At block 316, the AR framework uses the labels of the candidate identification profile (s) to obtain scale information from a database, and augments the scale information with the depth value. For instance, the AR framework may transmit a query using the label to the database. The scale information database stores scale information for each label. The scale information can include, for each dimension (e.g., x, y, and z) , 1) a minimum value and maximum value or 2) an average value and a standard deviation. The scale information database may return the scale information to the AR framework (or to the neural network) . The AR framework may generate a virtual geometry for the candidate identification profile. The virtual geometry is generated by modifying the scale information with the depth value of a surface that is included within the bounding box of the candidate identification profile (e.g., extracted at block 312) . In other words, the AR framework receives the general dimensions of an arbitrary object that is assigned the label of the candidate identification profile and modifies those dimensions to account for the distance the object is measured to be (by the AR framework) from the computing device. For instance, if the depth value is high, then the virtual geometry will be smaller than the scale information (since the object will appear smaller to a user) , and if the depth value is low, then the virtual geometry will be larger than the scale information (since the object will appear larger to a user) .
The virtual geometry may be projected onto the two-dimensional camera view presented by a display of the computing device. In some instances, a virtual arbitrary object having the modified dimensions is presented on the display. In other instances, a bounding box with dimensions that correspond to the modified dimensions is presented on the display.
At block 320, a scale matching process may be performed. The scale matching process compares the virtual geometry and the bounding box of each of the candidate identification profile (s) . A scale matching probability can be generated based on a comparison of the bounding box with the virtual geometry. Specifically, the more that the bounding box overlaps with the virtual geometry, the more likely the bounding box is to be accurate. The value of the scale matching probability depends on a degree to which the bounding box overlaps with the virtual geometry. In some instances, a large degree of overlap increases the value of the scale matching probability. A low degree of overlap decreases the value of the scale matching probability. In other instances, a large degree of overlap decreases the value of the scale matching probability. A low degree of overlap increases the value of the scale matching probability.
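As one possible realization of block 320, the following sketch scores how well a detected box size agrees with the expected pixel-size range obtained from the virtual geometry; the exact scoring function is an assumption, since the disclosure only requires that the probability reflect the degree of overlap.

```python
# One possible scoring function for block 320: the score is 1.0 when the box
# dimension lies inside the expected pixel range projected from the virtual
# geometry, and decays with the relative deviation otherwise.

def scale_matching_probability(box_w, box_h, expected_w, expected_h):
    """expected_w and expected_h are (min, max) ranges in pixels."""
    def axis_score(value, lo, hi):
        if lo <= value <= hi:
            return 1.0
        ref = lo if value < lo else hi
        return max(0.0, 1.0 - abs(value - ref) / ref)
    return axis_score(box_w, *expected_w) * axis_score(box_h, *expected_h)

# A chair-sized box that is slightly too narrow and too tall for its depth.
print(scale_matching_probability(250, 900, (270, 540), (480, 840)))  # ~0.86
```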
In some instances, the scale information is processed by the AR framework to generate the virtual geometry and the scale matching probability. In other instances, the neural network receives the scale information, the pose of the computing device, and the point cloud. The neural network then generates the virtual geometry and the scale matching probability. In still yet other instances, a combination of the AR framework and the neural network may be used to generate the virtual geometry and scale matching probability.
At block 332, the neural network determines, for each remaining candidate identification profile, an updated probability that the portion of the image within the bounding box corresponds to a particular label. The neural network may receive the candidate identification profile (s) output from the non-maximum suppression process and the corresponding scale matching probability for each candidate identification profile. The scale matching probability may be used to update the probability of the corresponding candidate identification profile. For instance, if the scale matching probability is low (e.g., little overlap between the virtual geometry and the bounding box) , the probability of the candidate identification profile may be lowered such that the candidate identification profile may be effectively removed from contention in favor of a candidate identification profile with a bounding box that better overlaps with the virtual geometry.
For example, a first candidate identification profile may include a bounding box and a label that corresponds to a table. The depth value for the portion of the image indicates the object is further away and thus the object has a virtual geometry that is smaller than the bounding box. The probability of the candidate identification profile would be reduced to reflect that the candidate object’s dimensions are not consistent with the computing device’s measurements.
The neural network selects, from the remaining candidate identification profiles, the candidate identification profile that has the highest (modified) probability. The neural network then outputs the bounding box, the label, and (optionally) the probability that the portion of the image surrounded by the bounding box corresponds to the label.
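A minimal sketch of block 332 as described above: fuse each remaining candidate's probability with its scale matching probability, drop implausible candidates, and output the candidate with the highest updated probability. The multiplicative fusion and the cutoff value are assumptions for illustration.

```python
# Minimal sketch of block 332: fuse each candidate's classifier probability
# with its scale matching probability, drop implausible candidates, and return
# the surviving candidate with the highest updated probability.

def select_final_profile(profiles, scale_probs, cutoff=0.3):
    """profiles: list of (box, label, p); scale_probs: matching list of S values."""
    updated = [(box, label, p * s)
               for (box, label, p), s in zip(profiles, scale_probs)]
    updated = [c for c in updated if c[2] >= cutoff]   # drop implausible candidates
    return max(updated, key=lambda c: c[2]) if updated else None

profiles = [((10, 10, 100, 120), "chair", 0.80),
            ((12, 12, 104, 122), "table", 0.75)]
scale_probs = [0.95, 0.30]   # the 'table' hypothesis is too large for its depth
print(select_final_profile(profiles, scale_probs))  # the 'chair' candidate wins
```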
At block 336, the output bounding box is displayed by superimposing the bounding box on a view of the AR scene. The bounding box may be presented with the label and the probability that the portion of the image surrounded by the bounding box corresponds to the label. In some instances, the AR framework tracks the portion of the image surrounded by the bounding box such that the bounding box continues to be displayed over the proper portion of the image even when a view of the environment changes (e.g., the camera moves) .
FIG. 4 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data according to an alternative embodiment of the present invention. The discussion provided in FIG. 2 related to a neural network augmented with augmented-reality sensor data is applicable to FIG. 4 as appropriate. The neural network may be updated with AR sensor data at various stages of the neural network. The block diagram of FIG. 2 (and the corresponding flowchart of FIG. 3) show the modification occurring after the non-maximum suppression process (e.g., at the last layer of the neural network before an output is produced) . Updating the neural network at an earlier stage may reduce the number of candidate regions processed by the neural network (e.g., eliminating regions and/or labels that have a low probability of being accurate) thereby improving the resource consumption of the neural network and the speed in which the neural network may process images. For instance, object detection neural network 420 incorporates the scale matching 408 from the AR sensor data (e.g., AR framework 204, pose 208, point cloud 212 and scale information database 216 as described above in connection to FIG. 2) after object detection layer 404, but prior to the non-maximum suppression 412.
The object detection layer 404 of the neural network generates candidate identification profiles for all possible candidate regions of the image without suppressing redundant ones (e.g., overlapping) that have non-maximal probability within a local region. Scale matching 408 may be performed after the operation of the detection layer to filter candidate identification profiles that have a low probability that the portion of the image surrounded by the bounding box of the candidate identification profile corresponds to an assigned label. Since these candidate identification profiles are eliminated prior to the non-maximum suppression, the non-maximum suppression can process fewer candidate identification profiles thereby increasing the speed in which the non-maximum suppression executes and output can be obtained. Further, the reduction of candidate identification profiles earlier in the process prevents an inaccurate candidate identification profile from biasing the remaining layers of the neural network (e.g., reducing the accuracy of the neural network) .
FIG. 5 illustrates an example process for object detection using an augmented neural network of FIG. 4 according to an embodiment of the present invention. The blocks 304-320 are similar to those described in connection to FIG. 3 and the description provided in relation to FIG. 3 is applicable to FIG. 5 as appropriate.
At block 504, the neural network processes the one or more images using the layers of the neural network. The neural network executes an object detection layer (e.g., object detection layer 404 of FIG. 4) that generates an output of one or more candidate identification profiles (e.g., each including a bounding box that surrounds a candidate region of the image and a probability that the candidate region corresponds to a particular label) . The label and probability may be output from a trained classifier layer that identifies features in the candidate region that are similar or the same as features in a training dataset. The output of the object detection layer is passed to both categorization at block 508 and scale information retrieval at block 316. In the example process of FIG. 5, the scale information retrieval at block 316 is initiated using all possible candidate regions of the image without suppressing redundant ones (e.g., overlapping) that have non-maximal probability within a local region.
At block 508, the neural network determines, for each remaining candidate identification profile, an updated probability that the portion of the image within the bounding box corresponds to a particular label. The scale matching probability received from block 320 can be used to update the probability of a corresponding candidate identification profile. Since the neural network has not yet applied a non-maximum suppression process, the probability (and label) of all possible candidate regions of the image (e.g., including redundant ones) is updated using the corresponding scale matching information from block 320. In some instances, candidate identification profiles that have a probability lower than a threshold value may be eliminated from further processing.
At block 512, the neural network can apply a non-maximum suppression to the candidate identification profiles. Non-maximum suppression organizes the candidate identification profiles according to the probability (e.g., from highest to lowest or lowest to highest) . Then the neural network selects the candidate identification profile with the highest probability. The neural network then suppresses (e.g., eliminates) all of the candidate identification profiles that have a bounding box that significantly overlaps (e.g., based on a pre-defined threshold level of overlap) with the bounding box of the selected candidate identification profile. Alternatively, or additionally, non-maximum suppression may suppress candidate identification profiles that have a probability that does not exceed a threshold value. By updating the probabilities of the candidate identification profiles (e.g., at block 508) , the non-maximum suppression may suppress more candidate identification profiles than the non-maximum suppression performed at block 328 of FIG. 3. This improves the accuracy of the neural network as fewer (e.g., potentially only one) candidate identification profile remains after the non-maximum suppression process. If more than one candidate identification profile remains, the neural network selects the candidate identification profile having the highest probability from among the candidate identification profiles.
At block 516, the bounding box, label, and (optionally) probability of the selected candidate identification profile is displayed by superimposing the bounding box on a view of the AR scene. In some instances, the AR framework tracks the portion of the image surrounded by the bounding box such that the bounding box continues to be displayed over the proper portion of the image even when a view of the environment changes (e.g., the camera moves) .
FIG. 6 illustrates an example of a block diagram of a neural network augmented with augmented-reality sensor data according to a specific embodiment of the present invention. In some instances, the neural network may use a two-stage object detection (e.g., faster R-CNN) . The discussion provided in FIGS. 2 and 4 related to a neural network augmented with augmented-reality sensor data is applicable to FIG. 6 as appropriate. The first stage identifies a large number of candidate regions, and the second stage performs object detection/classification on the candidate regions (e.g., using subsequent layers of the neural network) . In this instance, scale matching may be performed after the first stage to reduce the number of candidate regions processed by the subsequent neural network layers (e.g., reducing the processing load and resource consumption of the subsequent neural network layers) . In other instances, such as single stage object detection networks, a multiscale feature map may be used to generate classifiers to detect objects at multiple scales (e.g., SSD) . In this instance, scale matching may be used to select classifiers with appropriate scales for detection. For example, for each layer, if the bounding box would appear too small or too big based on estimated real-world scale (e.g., determined using the point cloud, pose of the computing device, and scale information to perform scale matching) , then the detection classifier can be omitted on these layers.
In two stage object detection neural networks, the object detection neural network 620 includes a region proposal layer 604 (e.g., stage one) that generates a large number of candidate regions for the subsequent processing layers. The candidate regions may be reduced using scale matching. For instance, for each candidate region, an assigned label of the candidate region may be used to look up scale information. The scale information may be modified using a depth value (e.g., using the pose 208 and point cloud 212 of AR framework 204) to generate virtual geometry for the candidate region. If the bounding box of the candidate region is significantly larger or smaller than the virtual geometry (e.g., using a predetermined threshold value) then the candidate region may be eliminated from further processing by the neural network. The subsequent layers of the neural network may include an object detection layer, a classifier, a non-maximum suppression process 612, combinations thereof, or the like that may be used to output the bounding box and label for a region of the image.
In single stage object detection neural networks, object detection neural network 620 includes one or more layers for identifying candidate regions 604 (e.g., instead of an entire stage) . The object detection neural network 620 may include a plurality of scale classifiers 608 that can detect a particular feature in an image at one or more scales (e.g., each scale corresponding to a dimension of the feature at a different distance from the camera of the computing device) . The multi-scale feature map (e.g., with a plurality of scales being included in the map for each feature) may be used to generate (e.g., train) a classifier to detect features of a particular scale. Typically, since the scale may be unknown, each of the classifiers (e.g., each via a different layer of the neural network) may be applied to detect features.
Scale matching may be used to determine scales that are not consistent with the AR sensor data (e.g., bounding boxes that are too small or too big, etc. ) . The classifiers that correspond to these scales may be omitted, which may reduce the number of layers of the neural network. The subsequent layers of the neural network may include an object detection layer, a classifier, a non-maximum suppression process 612, combinations thereof, or the like. The subsequent layers may be used to output the bounding box and label for a region of the image.
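For the single-stage case just described, the following sketch selects which scale classifiers (feature-map levels) to evaluate based on the expected on-screen size of the object; the level-to-anchor-size mapping and tolerance are hypothetical.

```python
# Minimal sketch of skipping scale classifiers whose anchor sizes cannot match
# the expected on-screen size of the object. The level-to-size mapping and the
# tolerance are hypothetical.

SCALE_CLASSIFIER_LEVELS = {  # anchor size range (pixels) per classifier layer
    "level_0": (16, 48),
    "level_1": (48, 120),
    "level_2": (120, 320),
    "level_3": (320, 640),
}

def select_scale_classifiers(expected_px_range, tolerance=1.2):
    """Keep only the levels whose anchor sizes overlap the expected pixel range."""
    lo = expected_px_range[0] / tolerance
    hi = expected_px_range[1] * tolerance
    return [name for name, (lvl_lo, lvl_hi) in SCALE_CLASSIFIER_LEVELS.items()
            if lvl_lo <= hi and lvl_hi >= lo]

# Expected object size of roughly 100-250 px at the measured depth.
print(select_scale_classifiers((100, 250)))  # ['level_1', 'level_2']
```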
The AR framework 204 may operate according to an AR session. The input images may be received and used to continuously track a pose 208 of the computing device (e.g., using SLAM or the like) . The AR framework may also compute a point cloud 212 associated with surfaces and/or objects within the image (e.g., by using depth information included in the RGBD image, a separate depth sensor, or stereo disparity using two images) . The point cloud indicates a depth of each surface (e.g., the distance of the surface from the camera that obtained the image) .
The label of each candidate identification profile may be output to the AR framework. The AR framework may transmit a query using the label to the scale information database 216, which returns scale information that corresponds to the label. The scale information includes dimensions (e.g., height, width, and depth) of an arbitrary object that is assigned the label.
The scale information may be processed by the AR framework 204 to generate a virtual geometry for the candidate identification profile. The virtual geometry may be generated by modifying the scale information using the measured depth value for the pixels included in the bounding box (e.g., using the pose 208 of the computing device and the point cloud 212) . The virtual geometry may be passed back to the object detection neural network 220, which may perform scale matching 232. Scale matching 232 may be performed to reduce the number of candidate regions (e.g., candidate identification profiles) to be processed by subsequent layers of object detection neural network 620 or to eliminate scale classifiers that correspond to scales that are inconsistent with the AR sensor data.
FIG. 7 illustrates an example process for object detection using the augmented neural network of FIG. 6 according to an embodiment of the present invention. The blocks 304-320 are similar to those described in connection with FIG. 3, and the description provided in relation to FIG. 3 applies to FIG. 7 as appropriate. The description of the example process of FIG. 7 begins at block 704, in which the neural network may begin processing input images received from block 304.
At block 708, depending on the type of neural network, candidate regions may be generated and/or scale classifiers may be generated. For instance, a two-stage object detection neural network generates a large number of candidate regions (e.g., candidate identification profiles). A single-stage object detection neural network may generate candidate regions and use a multi-scale feature map to generate scale-based classifiers.
At block 316, the labels assigned to the candidate identification profiles are used to obtain scale information from a database. The scale information may be augmented by the depth (e.g., from block 312) to generate a virtual geometry (e.g., the dimensions of an arbitrary object assigned to the label, scaled to dimensions consistent with the distance, from the camera, of the portion of the image surrounded by the bounding box).
At block 320, scale matching may be performed, in which it is determined whether the bounding box for a candidate identification profile is larger than, smaller than, or equal to the virtual geometry. For two-stage object detection neural networks, the correspondence between the virtual geometry and the bounding box for a candidate identification profile may be used to eliminate some candidate identification profiles from further processing by the neural network. For instance, if the bounding box is larger than a first predetermined threshold or smaller than a second predetermined threshold, then the candidate identification profile may be eliminated. For single-stage object detection neural networks, the scale matching is used to identify scales that are not consistent with the scale of the portion of the image surrounded by the bounding box. Processing of these candidate identification profiles may be limited to scale classifiers (e.g., in subsequent layers) that correspond to a similar scale (dimensions at a particular distance from the camera) as the portion of the image surrounded by the bounding box.
At block 712, the scale matching may be used to determine an updated probability that the portion of the image surrounded by the bounding box corresponds to a particular (or potentially a new) label. The probability may be updated using scale matching information.
At block 716, the candidate identification profiles are processed by subsequent layers of the neural network. The subsequent layers of the neural network may include an object detection layer, a classifier, a non-maximum suppression process 612, combinations thereof, or the like.
At block 720, the neural network may generate an output that includes a bounding box that surrounds a portion of the image and a label that corresponds to the portion of the image. The output may also include the probability that the portion of the image corresponds to the label.
The example processes described in FIGS. 3, 5, and 7 execute in connection with a computer system that is an example of the computer systems described herein above. Some or all of the blocks of the processes can be implemented via specific hardware on the computer system and/or can be implemented as computer-readable instructions stored on a non-transitory computer-readable medium of the computer system. As stored, the computer-readable instructions represent programmable modules that include code executable by a processor of the computer system. The execution of such instructions configures the computer system to perform the respective operations. Each programmable module in combination with the processor represents a means for performing the respective block(s). While the blocks are illustrated in a particular order, it should be understood that no particular order is necessary and that one or more operations may be omitted, skipped, and/or reordered.
FIG. 8 illustrates an example block diagram depicting the execution layers of a scale matching process according to an embodiment of the present invention. Scale matching may operate across three layers: a user input layer in which a user interacts with the AR framework, a neural network layer, and an AR framework layer. The user input layer includes the scale database. Scale information for each label may be added to a label scale database 802 via user input. For instance, when a new label is defined, a user may generate a new entry in the database that indicates the dimensions of an arbitrary object that is assigned to that label. Alternatively, or additionally, scale information may be added automatically from other networked sources (e.g., the Internet, another database, etc.). The label scale database is used (along with the label ‘c’ 806 from the bounding polygon of the candidate identification profile 810) to identify the geometry (e.g., scale information) for an arbitrary object that is assigned to the label ‘c’ 806.
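A label scale database of this sort could be as simple as a keyed mapping from label to physical dimensions; the labels and sizes below are invented solely to illustrate the add-and-query flow described above.

    # Illustrative only: a minimal label scale database mapping a label to the
    # (height_m, width_m, depth_m) of an arbitrary object assigned that label.

    label_scale_db = {
        "mug": (0.10, 0.08, 0.08),            # assumed dimensions, in meters
        "office_chair": (0.95, 0.65, 0.65),
    }

    def add_label_scale(db, label, height_m, width_m, depth_m):
        """User-input path: register scale information for a newly defined label."""
        db[label] = (height_m, width_m, depth_m)

    def lookup_scale(db, label):
        """Query path used by the AR framework: return the dimensions, or None."""
        return db.get(label)

    add_label_scale(label_scale_db, "laptop", 0.02, 0.33, 0.23)
    print(lookup_scale(label_scale_db, "laptop"))   # -> (0.02, 0.33, 0.23)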
The AR framework may use the geometry information to generate a virtual geometry that includes a bounding polygon 808 that includes the label ‘c’ 806, vertices (e.g., dimensions), and a confidence value indicating that the bounding polygon is accurate. The dimensions of the bounding polygon 808 are determined based on the geometry 804, the size 824 of the candidate identification profile, the camera pose 816, and the point cloud 812.
The bounding polygon 808 can be compared with the size 824 of the candidate identification profile's bounding polygon by the neural network to compute a scale matching probability ‘S’ 820. The scale matching probability ‘S’ 820 may be used to generate an updated probability P’ 824 for the candidate identification profile (at the neural network layer) using the probability ‘p’ 828 and the scale matching probability ‘S’ 820. In some instances, the updated probability P’ may be calculated as P’ = p * S. The updated probability may then be used in subsequent layers of the neural network to generate an output.
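One way the scale matching probability ‘S’ 820 and the update P’ = p * S might be realized is sketched below, using an assumed soft score that decays as the detected box size departs from the projected virtual size; the exact scoring function is not fixed by the description.

    # Illustrative only: derive a scale matching probability S from the ratio of
    # the detected box size to the projected virtual-geometry size, then compute
    # the updated probability P' = p * S.

    def scale_matching_probability(box_wh, virtual_wh):
        """Soft score in (0, 1]: 1.0 when the sizes agree, smaller as they diverge."""
        rw = box_wh[0] / virtual_wh[0]
        rh = box_wh[1] / virtual_wh[1]
        worst = max(rw, 1.0 / rw, rh, 1.0 / rh)   # symmetric over-/under-size penalty
        return 1.0 / worst

    def updated_probability(p, box_wh, virtual_wh):
        return p * scale_matching_probability(box_wh, virtual_wh)

    # A box twice as wide as expected halves the confidence: 0.9 -> 0.45
    print(updated_probability(0.9, (200.0, 100.0), (100.0, 100.0)))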
FIG. 9 illustrates examples of components of a computer system 900 according to an embodiment of the present invention. The computer system 900 is an example of the computer system described herein above. Although these components are illustrated as belonging to the same computer system 900, the computer system 900 can also be distributed.
The computer system 900 includes at least a processor 902, a memory 904, a storage device 906, input/output peripherals (I/O) 908, communication peripherals 910, and an interface bus 912. The interface bus 912 is configured to communicate, transmit, and transfer data, controls, and commands among the various components of the computer system 900. The memory 904 and the storage device 906 include computer-readable storage media, such as RAM, ROM, electrically erasable programmable read-only memory (EEPROM), hard drives, CD-ROMs, optical storage devices, magnetic storage devices, electronic non-volatile computer storage, for example Flash® memory, and other tangible storage media. Any of such computer-readable storage media can be configured to store instructions or program code embodying aspects of the disclosure. The memory 904 and the storage device 906 also include computer-readable signal media. A computer-readable signal medium includes a propagated data signal with computer-readable program code embodied therein. Such a propagated signal takes any of a variety of forms including, but not limited to, electromagnetic, optical, or any combination thereof. A computer-readable signal medium includes any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use in connection with the computer system 900.
Further, the memory 904 includes an operating system, programs, and applications. The processor 902 is configured to execute the stored instructions and includes, for example, a logical processing unit, a microprocessor, a digital signal processor, and other processors. The memory 904 and/or the processor 902 can be virtualized and can be hosted within another computer system of, for example, a cloud network or a data center. The I/O peripherals 908 include user interfaces, such as a keyboard, screen (e.g., a touch screen), microphone, speaker, other input/output devices, and computing components, such as graphical processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals. The I/O peripherals 908 are connected to the processor 902 through any of the ports coupled to the interface bus 912. The communication peripherals 910 are configured to facilitate communication between the computer system 900 and other computing devices over a communications network and include, for example, a network interface controller, modem, wireless and wired interface cards, antenna, and other communication peripherals.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude the inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Indeed, the methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions, and changes in the form of the methods and systems described herein may be made without departing from the spirit of the present disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the present disclosure.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result  conditioned on one or more inputs. Suitable computing devices include multipurpose microprocessor-based computer systems accessing stored software that programs or configures the computer system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied; for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain examples include, while other examples do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without author input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular example.
The terms “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list. The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Additionally, the use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Similarly, the use of “based at least in part on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based at least in part on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of the present disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed examples. Similarly, the example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed examples.

Claims (20)

  1. A method comprising:
    receiving, by a camera of a mobile device, an image of an environment;
    generating, by the mobile device, a point cloud that indicates a distance of a particular object that is depicted in the image;
    extracting, from the point cloud, a depth value that indicates a distance between the camera and the particular object;
    receiving, from a trained neural network, one or more candidate identification profiles, each candidate identification profile including a first bounding box that surrounds a first portion of the image that includes at least a portion of the particular object, and an identification probability that indicates a probability that the particular object corresponds to a particular label;
    receiving scale information that corresponds to the particular label, the scale information indicating a geometry of an arbitrary object that corresponds to the particular label;
    generating virtual geometry by modifying the scale information using the depth value;
    generating a scale matching probability by comparing the first bounding box to the virtual geometry;
    updating the neural network using the scale matching probability; and
    generating, using the updated neural network, a final identification profile that includes a second bounding box that surrounds a second portion of the image that includes the particular object and an updated label that corresponds to the particular object.
  2. The method of claim 1, wherein the scale information includes a height value, a length value, and a width value for the arbitrary object that corresponds to the label.
  3. The method of claim 1, wherein the first bounding box is displayed by the mobile device.
  4. The method of claim 1, wherein the second bounding box is displayed by the mobile device.
  5. The method of claim 3, wherein a dimension of the second bounding box is based on the scale information.
  6. The method of claim 1, further comprising:
    determining that an updated identification probability is below a threshold; and
    executing the trained neural network to provide a new identification profile.
  7. The method of claim 1, wherein the scale information is received before the trained neural network executes a non-maximum suppression process.
  8. The method of claim 1, wherein the trained neural network generates a plurality of identification profiles, and wherein the scale information is used to eliminate identification profiles that have a probability below a threshold.
  9. The method of claim 1, wherein the trained neural network generates a plurality of identification profiles, and wherein the scale information is used to combine two or more identification profiles.
  10. A method comprising:
    receiving, by a camera of a mobile device, an image of an environment;
    generating, by the mobile device and based on the image, a point cloud that indicates a distance of at least one object within the image from the camera;
    receiving, from a neural network, one or more identification profiles, each identification profile including a bounding box that surrounds a portion of the image, a label that corresponds to the portion of the image, and a probability that the portion of the image corresponds to the label;
    receiving scale information that includes dimensions of an arbitrary object that corresponds to the label;
    updating the neural network using the scale information and the point cloud; and
    generating, using the updated neural network, a second bounding box that surrounds a second portion of the image and a label that corresponds to an object depicted by the second portion of the image.
  11. The method of claim 10, wherein updating the neural network includes modifying the probability of at least one identification profile of the one or more identification profiles.
  12. The method of claim 10, wherein updating the neural network includes:
    eliminating at least one identification profile of the one or more identification profiles based on the scale information and the bounding box of the at least one identification profile.
  13. The method of claim 10, wherein updating the neural network includes:
    merging at least two identification profiles.
  14. The method of claim 10, wherein the neural network is updated before the neural network performs a non-maximum suppression process.
  15. A system comprising:
    one or more processors; and
    a non-transitory machine readable medium that includes instructions that when executed by the one or more processors, cause the one or more processors to perform the operations including:
    receiving, by a camera of a mobile device, an image of an environment;
    generating, by the mobile device, a point cloud that indicates a distance of a particular object that is depicted in the image;
    extracting, from the point cloud, a depth value that indicates a distance between the camera and the particular object;
    receiving, from a trained neural network, one or more candidate identification profiles, each candidate identification profile including a first bounding box that surrounds a first portion of the image that includes at least a portion of the particular object, and an identification probability that indicates a probability that the particular object corresponds to a particular label;
    receiving scale information that corresponds to the particular label, the scale information indicating a geometry of an arbitrary object that corresponds to the particular label;
    generating virtual geometry by modifying the scale information using the depth value;
    generating a scale matching probability by comparing the first bounding box to the virtual geometry;
    updating the neural network using the scale matching probability; and
    generating, using the updated neural network, a final identification profile that includes a second bounding box that surrounds a second portion of the image that includes the particular object and an updated label that corresponds to the particular object.
  16. The system of claim 15, wherein the scale information includes a height value, a length value, and a width value for the arbitrary object that corresponds to the label.
  17. The system of claim 15, wherein the first bounding box is displayed by the mobile device.
  18. A system comprising:
    one or more processors; and
    a non-transitory machine readable medium that includes instructions that when executed by the one or more processors, cause the one or more processors to perform the operations including:
    receiving, by a camera of a mobile device, an image of an environment;
    generating, by the mobile device and based on the image, a point cloud that indicates a distance of at least one object within the image from the camera;
    receiving, from a neural network, one or more identification profiles, each identification profile including a bounding box that surrounds a portion of the image, a label that corresponds to the portion of the image, and a probability that the portion of the image corresponds to the label;
    receiving scale information that includes dimensions of an arbitrary object that corresponds to the label;
    updating the neural network using the scale information and the point cloud; and
    generating, using the updated neural network, a second bounding box that surrounds a second portion of the image and a label that corresponds to an object depicted by the second portion of the image.
  19. The system of claim 18, wherein updating the neural network includes modifying the probability of at least one identification profile of the one or more identification profiles.
  20. The system of claim 18, wherein updating the neural network includes:
    eliminating at least one identification profile of the one or more identification profiles based on the scale information and the bounding box of the at least one identification profile.
PCT/CN2021/076066 2020-02-14 2021-02-08 System and method for object detection for augmented reality WO2021160097A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180013885.7A CN115104135A (en) 2020-02-14 2021-02-08 Object detection system and method for augmented reality

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202062977025P 2020-02-14 2020-02-14
US62/977,025 2020-02-14

Publications (1)

Publication Number Publication Date
WO2021160097A1 true WO2021160097A1 (en) 2021-08-19

Family

ID=77291387

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/076066 WO2021160097A1 (en) 2020-02-14 2021-02-08 System and method for object detection for augmented reality

Country Status (2)

Country Link
CN (1) CN115104135A (en)
WO (1) WO2021160097A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140354690A1 (en) * 2013-06-03 2014-12-04 Christopher L. Walters Display application and perspective views of virtual space
CN109345556A (en) * 2017-07-27 2019-02-15 罗克韦尔柯林斯公司 Neural network prospect for mixed reality separates
CN109964236A (en) * 2016-11-01 2019-07-02 斯纳普公司 Neural network for the object in detection image
CN110221690A (en) * 2019-05-13 2019-09-10 Oppo广东移动通信有限公司 Gesture interaction method and device, storage medium, communication terminal based on AR scene
CN110363058A (en) * 2018-03-26 2019-10-22 国际商业机器公司 It is positioned using the three dimensional object for avoidance of one shot convolutional neural networks
CN110633617A (en) * 2018-06-25 2019-12-31 苹果公司 Plane detection using semantic segmentation
WO2020009806A1 (en) * 2018-07-05 2020-01-09 Optimum Semiconductor Technologies Inc. Object detection using multiple sensors and reduced complexity neural networks


Also Published As

Publication number Publication date
CN115104135A (en) 2022-09-23


Legal Events

Date Code Title Description
121: Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21753812; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122: Ep: pct application non-entry in european phase (Ref document number: 21753812; Country of ref document: EP; Kind code of ref document: A1)