CN115104135A - Object detection system and method for augmented reality - Google Patents

Object detection system and method for augmented reality

Info

Publication number
CN115104135A
Authority
CN
China
Prior art keywords
neural network
image
bounding box
label
probability
Prior art date
Legal status
Pending
Application number
CN202180013885.7A
Other languages
Chinese (zh)
Inventor
李翔
徐毅
田原
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Publication of CN115104135A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/20Scenes; Scene-specific elements in augmented reality scenes

Abstract

Methods and systems for object detection in an augmented reality environment are disclosed. An image of an environment is received and a point cloud is generated from it. A depth value is extracted using the point cloud. One or more candidate recognition profiles are received from a neural network. Each candidate recognition profile includes a first bounding box surrounding a first portion of the image that includes at least a portion of a particular object, a label, and a recognition probability. A virtual geometry is generated by using the depth value to modify scale information associated with the recognition profile, and a scale matching probability is generated using the virtual geometry. The neural network is updated with the scale matching probability. A final recognition profile is generated using the updated neural network. The final recognition profile includes a second bounding box, surrounding a second portion of the image that includes the particular object, and an updated label.

Description

Object detection system and method for augmented reality
Background
Augmented Reality (AR) systems overlay virtual content on a user's view of a real-world environment. With the development of AR software development kits (SDKs), the mobile industry has brought smartphone AR into the mainstream. AR SDKs typically provide six-degrees-of-freedom (6DoF) tracking capability. A user may scan the environment using the camera of a smartphone, which executes visual-inertial odometry (VIO) in real time. Once the camera pose is continuously tracked, virtual objects can be placed into the AR scene to create the illusion that real and virtual objects are fused together.
Some augmented reality applications aim to generate virtual objects that appear naturally in a real-world environment. Generating virtual objects that are out of place in the environment (e.g., a bed in a bathroom) prevents an augmented reality application from rendering objects naturally. It is therefore important for such systems to identify objects in the real-world environment in order to improve the augmented reality display.
Disclosure of Invention
The present disclosure relates generally to augmented reality and more particularly, but not by way of limitation, to object detection using neural networks augmented with augmented reality sensor data.
Aspects of the present disclosure include a method for object detection in an augmented reality environment. The method includes: receiving, by a camera of a mobile device, an image of an environment; generating, by the mobile device, a point cloud indicating the distance of a particular object displayed in the image; extracting a depth value from the point cloud, the depth value indicating a distance between the camera and the particular object; receiving one or more candidate recognition profiles from a trained neural network, each candidate recognition profile including a first bounding box surrounding a first portion of the image that includes at least a portion of the particular object, and a recognition probability indicating a probability that the particular object corresponds to a particular label; receiving scale information corresponding to the particular label, the scale information indicating the geometry of an arbitrary object corresponding to the particular label; generating a virtual geometry by modifying the scale information using the depth value; generating a scale matching probability by comparing the first bounding box with the virtual geometry; updating the neural network using the scale matching probability; and generating a final recognition profile using the updated neural network, the final recognition profile including a second bounding box surrounding a second portion of the image that includes the particular object and an updated label corresponding to the particular object.
Another aspect of the disclosure includes a system comprising one or more processors and a non-transitory computer-readable medium comprising instructions that, when executed by the one or more processors, cause the one or more processors to perform the method described above.
Another aspect of the disclosure includes a non-transitory computer-readable medium comprising instructions, which when executed by one or more processors, cause the one or more processors to perform the above-described method.
The present disclosure achieves a number of advantages over conventional techniques. For example, the provided methods and systems use neural networks that generate object detection and recognition with higher accuracy than conventional techniques. During operation of the neural network, scale matching using augmented reality sensor data may be performed to modify the candidate regions identified by the neural network. This enables the neural network to generate improved object detection and recognition with greater accuracy. In addition, applying scale matching at various stages of the neural network may exclude some candidates from consideration by the neural network, which may reduce the processing resources consumed by the neural network and increase the processing speed of successive layers of the neural network.
Further areas of applicability of the present disclosure will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating various embodiments, are intended for purposes of illustration only and are not intended to necessarily limit the scope of the disclosure.
Drawings
FIG. 1 shows an example of a computer system including a depth sensor and a red-green-blue (RGB) optical sensor for AR applications, according to an embodiment of the invention.
Fig. 2 illustrates an example block diagram of a neural network augmented with augmented reality sensor data in accordance with an embodiment of the present invention.
FIG. 3 illustrates an example process for object detection using the augmented neural network of the embodiment of FIG. 2, according to an embodiment of the present invention.
Fig. 4 illustrates an example block diagram of a neural network augmented with augmented reality sensor data in accordance with an alternative embodiment of the present invention.
FIG. 5 illustrates an example process for object detection using the augmented neural network of the embodiment of FIG. 4, according to an embodiment of the present invention.
Fig. 6 illustrates an example block diagram of a neural network augmented with augmented reality sensor data in accordance with a particular embodiment of the present invention.
FIG. 7 illustrates an example process for object detection using the augmented neural network of the embodiment of FIG. 6, according to an embodiment of the present invention.
FIG. 8 illustrates an example block diagram of the execution layers of a scale matching process in accordance with one embodiment of this disclosure.
FIG. 9 illustrates an example computer system according to an embodiment of the present invention.
Detailed Description
Various embodiments are described below. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the described embodiments.
Embodiments of the present disclosure relate to object detection in neural networks augmented with augmented reality sensor data. For example, an image is received from a camera of a mobile device. Using the image, the mobile device may generate a pose (e.g., position and orientation) of the camera and a point cloud of a particular object displayed in the image. The mobile device may generate a depth value using the pose of the camera and the point cloud, the depth value indicating a distance between the camera and the particular object. The mobile device may receive candidate recognition profiles for portions of the image from the neural network. Each profile may include a first bounding box surrounding a first portion of the image that includes a portion of the particular object, a label corresponding to the particular object, and a probability that the label corresponds to the particular object. The mobile device may receive scale information indicating the geometry of an arbitrary object corresponding to the label. By modifying the scale information using the depth value, a virtual geometry may be generated. The mobile device may use the virtual geometry to generate a scale matching probability, which may be used to update the neural network. The updated neural network outputs a final recognition profile that includes a second bounding box surrounding a second portion of the image that includes the particular object and a new label corresponding to the particular object.
For example, the AR application may use object detection to select the placement of a particular virtual object and/or a particular virtual environment in the AR scene. The AR application selects a particular virtual object, or its placement, based on the context of the real-world environment so that the virtual object appears to belong to the real-world environment. Object detection may be performed by an AR application, a neural network, or a neural network augmented by AR sensor data. For example, a camera of the mobile device may capture an image of an office environment. The mobile device generates a point cloud of objects in the image, such as tables, chairs, etc., and calculates a depth value (e.g., a distance between the camera and a chair) corresponding to at least one object. The neural network generates candidate recognition profiles for objects in the image. Each candidate recognition profile includes a bounding box that encloses a portion of the image corresponding to an object such as a chair. The candidate recognition profile may also include a probability that the portion of the image corresponds to a particular label (e.g., chair). The mobile device may receive scale information for an arbitrary chair. The scale information includes the average size of an arbitrary chair (e.g., 0.5 to 1 meter high, 0.5 to 1 meter wide, etc.). The mobile device then generates a virtual geometry by modifying the scale information (e.g., to account for the distance of the object from the camera) using the depth value.
The mobile device may then generate a scale match probability by comparing the bounding boxes of the candidate recognition profiles with the virtual geometry. The scale match probability may indicate the accuracy of a candidate recognition profile. For example, if the bounding box is close to the virtual geometry, the candidate recognition profile is likely accurate. If the bounding box deviates from the virtual geometry, the candidate recognition profile may be inaccurate.
The scale match probabilities can be used to update the neural network. For example, the probability of a candidate recognition profile may be updated (e.g., made higher or lower depending on the likelihood that it corresponds to a particular label), or a candidate recognition profile may be eliminated (e.g., when the scale match probability indicates that the recognition profile is very inaccurate), and so on. The updated neural network provides a final recognition profile that includes a new label and a second bounding box surrounding a second portion of the image (which includes the chair). In some cases, the second portion of the image may be equal to the first portion of the image. In other cases, the second portion of the image may more accurately enclose the detected object (e.g., the chair). Similarly, the new label may be the same as the old label (if the candidate recognition profile includes the correct label), or the updated neural network may generate a new label that more accurately corresponds to the object (e.g., an office chair).
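For orientation, the flow just described can be summarized as a single rescoring loop. The sketch below is illustrative only: CandidateProfile, lookup_scale_info, make_virtual_geometry, and scale_match are assumed names for the components described in this disclosure, not an implementation of it.

```python
from dataclasses import dataclass

@dataclass
class CandidateProfile:
    box: tuple          # (x, y, w, h) bounding box in image coordinates
    label: str          # e.g., "chair"
    probability: float  # neural-network probability that the label is correct

def rescore_candidates(candidates, depth_value, lookup_scale_info,
                       make_virtual_geometry, scale_match):
    """Re-weight neural-network candidates using the AR-derived depth value."""
    best = None
    for cand in candidates:
        scale_info = lookup_scale_info(cand.label)        # generic dimensions for the label
        virtual = make_virtual_geometry(scale_info, depth_value)
        s = scale_match(cand.box, virtual)                # scale match probability
        p_updated = cand.probability * s                  # re-weighted confidence
        if best is None or p_updated > best[1]:
            best = (cand, p_updated)
    return best  # final recognition profile and its updated probability
```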
The scale matching (e.g., generation or application of scale matching probabilities) may be performed at various stages of object detection by the neural network. For example, the scale matching may be performed after the last layer (e.g., after the non-maxima suppression process), before the last layer (e.g., before the non-maxima suppression process), and/or earlier in the neural network processing (e.g., after candidate region generation, after scale classifier generation, etc.).
The methods and systems described herein improve the accuracy of object detection by neural networks. The neural network uses scale matching from the augmented reality sensor data to modify the candidate objects in the image that the neural network is attempting to identify. This enables the neural network to generate improved object detection and recognition with greater accuracy. In addition, at various stages of the neural network, the scale matching can exclude some candidate objects from further processing, which reduces the processing resources consumed by the neural network and increases the processing speed of the successive layers of the neural network.
FIG. 1 shows an example of a computer system 110, the computer system 110 including a depth sensor 112 and an RGB optical sensor 114 for AR applications, according to an embodiment of the invention. The AR application may be implemented by the AR module 116 of the computer system 110. In general, the RGB optical sensor 114 generates RGB images of a real-world environment that includes, for example, a real-world object 130. The depth sensor 112 generates depth data about the real-world environment, where the data includes, for example, a depth map that shows the depth of the real-world object 130 (e.g., the distance between the depth sensor 112 and the real-world object 130). After the AR session is initialized (which may include calibration and tracking), the AR module 116 renders an AR scene 120 of the real-world environment in the AR session, where the AR scene 120 may be presented on a graphical user interface (GUI) on a display of the computer system 110. The AR scene 120 displays a real-world object representation 122 of the real-world object 130. Further, the AR scene 120 displays a virtual object 124 that does not exist in the real-world environment. The AR module 116 may generate red-green-blue-depth (RGBD) images from the RGB images and the depth map, which may be processed by the AR system to enhance the object detection neural network. For example, the AR application may use the pose of the computer system 110 and a point cloud of objects to obtain scale information for candidate object classes identified by the neural network. The scale information is passed to the neural network to correct the object detection, which improves the accuracy of the object detection (and recognition) and reduces the processing resources consumed by the neural network.
In an example, the computer system 110 represents a suitable user device, and in addition to the depth sensor 112 and the RGB optical sensor 114, the computer system 110 includes one or more Graphics Processing Units (GPUs), one or more General Purpose Processors (GPPs), and one or more memories storing computer readable instructions executable by at least one processor to perform the various functions of the disclosed embodiments. For example, the computer system 110 may be any of a smartphone, a tablet, an AR headset, or a wearable AR device.
The AR module 116 may perform a visual-inertial odometry (VIO) process to track the pose (e.g., position and orientation) of the AR module 116. VIO uses image analysis and inertial measurement unit sensor data to determine changes in the position and/or orientation of the camera (e.g., of the AR module 116). Visual odometry may use feature detection in images to identify and correlate features in successive images. Feature detection may be used to generate an optical flow field that estimates the motion of the camera relative to an object displayed in successive images. The degree of motion between time intervals (e.g., the time intervals between successive images) may be used to determine the distance and direction that the camera moves during the time intervals. The distance and orientation of the camera (and the sensor data) can be used to track the position and orientation of the camera at each time interval. An inertial measurement unit that captures directional force values may be used to augment the visual odometry.
For example, the AR module 116 may perform an implementation of VIO referred to as a simultaneous localization and mapping (SLAM) process. For example, the SLAM process may begin with a calibration step, where an empty map of the environment is initialized with the device at the origin of the coordinate system. The SLAM process receives input data such as, but not limited to, image data, control data c_t, sensor data s_t, and a time interval t. The SLAM process then generates an output that can include the approximate location x_t of the device over a given time interval (relative to one or more approximate locations of one or more previous time intervals) and a map m_t of the environment. The output may be enhanced (or validated) using feature detection on images captured at time t and time t+1 to identify and correlate features in the images. The changes between images may be used to verify movement of the AR module 116, to populate the map m_t with objects detected in the images, and so on.
The SLAM process can update x_t and m_t when the device captures sensor data (and image data from the device camera) indicating a particular direction of motion. The SLAM process may be an iterative process that updates x_t and m_t at set time intervals or when new sensor data or image data is detected. For example, if no sensor change has occurred between time intervals t and t+1, the SLAM process may delay updating the location and map to conserve processing resources. When a detected change in sensor data indicates that the device is likely to have moved from its previous location x_t, the SLAM process may calculate the new location of the device and update the map m_t.
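As a rough illustration of the iterative update described above, the following sketch shows only the control flow; the state representation and the has_changed / slam_update helpers are assumptions introduced for clarity, not the SLAM algorithm itself.

```python
# Skeleton of an iterative pose/map update loop (illustrative assumptions only).
def track(sensor_stream, slam_update, has_changed):
    x_t = (0.0, 0.0, 0.0)   # device pose, starting at the origin of the coordinate system
    m_t = {}                # empty map of the environment
    prev_sensors = None
    for t, (image, control, sensors) in enumerate(sensor_stream):
        # Skip the update when nothing changed, to conserve processing resources.
        if prev_sensors is not None and not has_changed(prev_sensors, sensors):
            prev_sensors = sensors
            continue
        x_t, m_t = slam_update(x_t, m_t, image, control, sensors, t)
        prev_sensors = sensors
    return x_t, m_t
```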
The AR module 116 may be implemented as dedicated hardware and/or a combination of hardware and software (e.g., a general-purpose processor and computer-readable instructions stored in a memory and executable by the general-purpose processor). In addition to initiating an AR session and performing VIO, the AR module 116 may also use augmented reality sensor data to improve object detection of real-world objects. The AR module may determine the distance between the AR module 116 and the real-world object 130. During object detection by the neural network, the AR module 116 may update the neural network with candidate classes of objects (e.g., one or more classes for which the neural network has determined a likelihood of corresponding to the real-world object 130). For example, the AR module 116 may obtain the dimensions of an object corresponding to a candidate label. The dimensions are then modified according to the distance between the object and the camera as determined by the AR module 116 (e.g., if the object is far away, the dimensions will be smaller, and vice versa). Then, if a candidate bounding box for a label generated by the neural network is too large or too small for the given scale match, the probability that the bounding box and label are correct may be reduced (or the bounding box and label may be eliminated).
In the illustrative example of fig. 1, a smartphone is used to display an AR session of a real-world environment. In particular, the AR session comprises rendering an AR scene comprising a representation of a real-world table on which a vase (or some other real-world object) is placed. The augmented reality session may include rendering virtual objects, such as virtual object 124, superimposed onto the AR scene. The AR session may include a neural network that identifies objects in the real-world environment (e.g., to select a virtual object 124, locate a virtual object 124, etc.). The AR module 116 initiates an AR session by using the image to determine an initial pose of the smartphone (e.g., determined using a SLAM process, etc.) and initializing a coordinate system with respect to the smartphone.
The AR module 116 may generate a point cloud for an object displayed in the image (e.g., the real-world object 130). The point cloud includes two or more points (e.g., coordinates in a coordinate system), each point corresponding to a discrete point of the object. The point cloud represents the depth of the object (e.g., the distance of the object relative to the smartphone). The AR module 116 extracts depth values of objects using the point cloud and the pose of the smartphone.
The smartphone executes a neural network for object detection. The neural network uses images captured by the camera to detect and identify objects displayed in the images. The neural network may generate one or more candidate recognition profiles. Each candidate recognition profile includes a bounding box surrounding a portion of the image that includes a portion of the object, and a probability that the object corresponds to a particular label. The bounding box may be represented by a vector such as (c, x, y, z, w, h, p), where x, y, and z are position coordinates, w is the width, h is the height, c is the label, and p is the probability that the label is correct. For each candidate recognition profile, the AR module 116 uses the label to obtain the scale information. The scale information includes the dimensions of an arbitrary object that conforms to the label. For example, if the label is a table, the scale information includes the average size of an arbitrary table. Alternatively, the scale information may include the range (minimum to maximum in each dimension) of a table. The AR module 116 generates a virtual geometry for the arbitrary object (e.g., a table). The virtual geometry is generated by modifying the scale information using the depth values.
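A minimal sketch of the candidate recognition profile representation described above; the field names and the example values are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class RecognitionProfile:
    # Mirrors the vector (c, x, y, z, w, h, p) described above; names are illustrative.
    c: str      # label, e.g., "table"
    x: float    # bounding-box position coordinates
    y: float
    z: float    # depth coordinate (from the point cloud)
    w: float    # width of the bounding box
    h: float    # height of the bounding box
    p: float    # probability that the label is correct

# Example candidate produced by the neural network for a table in the image (made-up values).
candidate = RecognitionProfile(c="table", x=0.4, y=0.3, z=2.1, w=0.25, h=0.18, p=0.72)
```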
The AR module 116 uses the comparison between the bounding box and the virtual geometry to generate a scale match probability. The scale match probabilities may be used to update the neural network by adjusting the probabilities assigned to the candidate recognition profiles. For example, if the bounding box of the label "table" is larger than the virtual geometry, the predicted label of the candidate recognition profile may not be correct, because an object with that label should appear smaller at the distance measured by the AR module 116. Updating the probability may include deleting candidate recognition profiles whose bounding boxes deviate from the virtual geometry by more than a threshold amount (e.g., 10%, 15%, or any predetermined amount).
The updated neural network generates a final recognition profile for the real-world object 130. The final recognition profile may include an updated bounding box surrounding the image portion that includes the table and an updated label (e.g., if different from the candidate recognition profile). The neural network may present the final recognition profile on a display of the smartphone. For example, the neural network (or the AR module 116) may generate bounding boxes 126 and 128 with text labels that indicate the label corresponding to each bounding box within the displayed AR scene 120.
Fig. 2 illustrates an example block diagram of a neural network augmented with augmented reality sensor data in accordance with an embodiment of the present invention. Object detection may use a combination of the AR framework 204 and the object detection neural network 220. For example, a computing device may execute the AR framework 204 and the object detection neural network 220 within an AR session. The computing device may be a mobile device (e.g., a smartphone, laptop, tablet, etc.), a desktop computer, a server, a thin client, etc. Object detection may utilize AR sensor data during operation of the object detection neural network 220 to improve the performance (e.g., accuracy, resource consumption, speed, etc.) of the object detection neural network 220.
The AR framework 204 may operate within an AR session. Input images may be received and used to continuously track the pose 208 of the computing device (e.g., tracking using SLAM, etc.). The AR framework may also calculate a point cloud 212 associated with surfaces and/or objects within the image (e.g., by using depth information included in an RGBD image, a separate depth sensor, or the stereo disparity of two images). The point cloud indicates the depth of each surface (e.g., the distance between the surface and the camera that acquired the image).
The object detection neural network 220 initiates an object detection process using the same images received by the AR framework 204. In some cases, the object detection neural network 220 is initiated in parallel with the AR framework 204 determining the pose 208 and generating the point cloud 212. In other cases, the object detection neural network 220 is initiated before or after the AR framework determines the pose 208. The object detection neural network 220 generates candidate recognition profiles, which may be modified using a regressor 228. The regressor 228 revises the candidate regions (e.g., the bounding boxes of the candidate recognition profiles) for classification by the classifier. The classifier 224 generates a probability that the bounding box of a candidate recognition profile corresponds to a particular label. The object detection neural network 220 may perform a non-maximum suppression process to eliminate overlapping (e.g., redundant) candidate recognition profiles.
The label of each remaining candidate recognition profile is passed to the AR framework. The AR framework queries a scale information database 216 using the label, and the database 216 returns the scale information corresponding to the label. The scale information includes the dimensions (e.g., height, width, depth) of an arbitrary object assigned the label. The scale information may include a minimum value and a maximum value for each dimension. Example scale information is shown in Table 1 below. Alternatively, the scale information may include a mean and a standard deviation for each dimension of the object.
Label      minL(m)   minW(m)   minH(m)   maxL(m)   maxW(m)   maxH(m)
Aircraft   3         2.18      1.5       84.0      117.0     24.2
Car        2.4       1.2       1.2       5.6       2.0       2.0
Chair      0.4       0.45      0.8       0.85      0.9       1.4
Mouse      0.13      0.19      0.05      0.08      0.05      0.03

Table 1: Example of a label scale database
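A minimal sketch of how the scale information in Table 1 could be stored and queried. The dictionary layout, the lowercase keys, and the query_scale_info helper are assumptions for illustration; the numeric values are transcribed from Table 1 as printed.

```python
# Table 1 transcribed into a minimal lookup structure (values in meters).
SCALE_DB = {
    #           (minL, minW, minH, maxL,  maxW,  maxH)
    "aircraft": (3.0,  2.18, 1.5,  84.0,  117.0, 24.2),
    "car":      (2.4,  1.2,  1.2,  5.6,   2.0,   2.0),
    "chair":    (0.4,  0.45, 0.8,  0.85,  0.9,   1.4),
    "mouse":    (0.13, 0.19, 0.05, 0.08,  0.05,  0.03),
}

def query_scale_info(label):
    """Return the stored min/max dimensions for a label, or None if unknown."""
    return SCALE_DB.get(label.lower())
```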
The AR framework 204 may process the scale information to generate virtual geometries for the candidate recognition profiles. A virtual geometry may be generated by modifying the scale information using the measured depth values of the pixels included in the bounding box (e.g., using the pose 208 and the point cloud 212 of the computing device). The virtual geometry may be passed back to the object detection neural network 220, and the object detection neural network 220 may perform scale matching 232. The scale matching deletes candidate recognition profiles whose bounding boxes differ from the virtual geometry by more than a threshold. Alternatively or additionally, the scale matching 232 modifies the probability of each candidate recognition profile based on a comparison of the virtual geometry with the bounding box (e.g., increasing the probability when the bounding box is close to the virtual geometry and decreasing the probability when the bounding box differs substantially from the virtual geometry). The neural network then selects the candidate recognition profile with the highest probability as the output of the object detection neural network 220.
FIG. 3 illustrates an example process for object detection using the augmented neural network of FIG. 2, according to an embodiment of the present invention.
At block 304, one or more images are received. The image may include color data (e.g., RGB values) and depth values. For example, the image may be received from an RGBD optical sensor or the like.
At block 308, an AR framework (e.g., the AR framework 204 of fig. 2) determines a pose of the computing device (e.g., determined using a SLAM process or the like) and a point cloud of at least one surface displayed in the image using the one or more images. For example, the image may include depth data (e.g., from an RGBD optical sensor) that indicates a distance between a point shown in the image and the camera. The AR framework uses the depth data to generate a plurality of points, each point corresponding to discrete coordinates of a surface displayed in the image. The points may be represented by three-dimensional coordinates within a coordinate system of the computing device.
For example, an RGBD optical sensor may generate multi-dimensional (e.g., two-dimensional) data, where each pixel includes color data (e.g., RGB values) and a depth value. The AR framework may convert the two-dimensional coordinates of each pixel into three-dimensional coordinates using the depth value of the pixel. These three-dimensional coordinates may constitute points in the point cloud (e.g., the number of points equals the resolution of the two-dimensional image). In some examples, the number of points may be less than the resolution of the image. In other examples, the number of points may be greater than the resolution of the images (e.g., when the point cloud is generated with multiple images using the depth values of each image and/or the stereo disparity between the images).
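As an illustration of how depth pixels can be back-projected into the point cloud described above, the sketch below assumes a pinhole camera model with intrinsics (fx, fy, cx, cy); the disclosure does not specify the camera model, so this is only one possible realization.

```python
def pixel_to_point(u, v, depth, fx, fy, cx, cy):
    """Back-project one RGBD pixel into a 3D point in the camera frame.

    Assumes a pinhole camera with intrinsics (fx, fy, cx, cy); illustrative only.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def image_to_point_cloud(depth_map, fx, fy, cx, cy):
    """One point per valid depth pixel, so the point count is at most the image resolution."""
    points = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d > 0:  # skip pixels with no valid depth reading
                points.append(pixel_to_point(u, v, d, fx, fy, cx, cy))
    return points
```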
At block 312, the AR framework may extract depth values for at least one surface displayed in the image. The AR framework may derive a single depth value from points in the point cloud corresponding to the at least one surface (and optionally the pose). In some examples, since the point cloud may include multiple points per surface, the AR framework may average the depth values of the points of the surface. In other examples, the AR framework may select the smallest depth value. In other examples, the AR framework may select the largest depth value. The AR framework may select any depth value according to predefined criteria (e.g., user-derived criteria).
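The collapse of a surface's points into a single depth value could look like the following sketch; the function name and the default of averaging are assumptions, since the disclosure leaves the selection criterion to predefined (e.g., user-defined) rules.

```python
def extract_depth(points, mode="average"):
    """Collapse the point-cloud points of one surface into a single depth value.

    `mode` selects among the alternatives described above (average, min, max);
    "average" is used here only as a default assumption.
    """
    depths = [p[2] for p in points]  # z component holds the depth
    if mode == "min":
        return min(depths)
    if mode == "max":
        return max(depths)
    return sum(depths) / len(depths)
```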
At block 324 (in some embodiments substantially in parallel with block 308), a neural network (e.g., the object detection neural network 220 of fig. 2) may begin processing the one or more images to perform object detection. The object detection neural network may be a deep neural network such as, but not limited to, a convolutional neural network (CNN), a single-shot multi-box detector (SSD), you only look once (YOLO), a faster region-based CNN (Faster R-CNN), a region-based fully convolutional network (R-FCN), and the like. The neural network performs object detection in various layers, where the output of a given layer may inform, or be used by, a subsequent layer. The initial layer may identify candidate regions (e.g., potential regions of interest) within the image that may correspond to the object to be detected. A candidate region may include a bounding box surrounding the candidate region. Subsequent layers may add, remove, or modify candidate regions and determine a probability that a candidate region corresponds to a particular label (e.g., generate a candidate recognition profile for the candidate region). The last layer may include a non-maximum suppression process that eliminates redundant candidate recognition profiles for a given candidate region.
For example, the neural network performs operations at one or more successive layers, such as, but not limited to, candidate region selection for identifying candidate regions of the image, object detection that assigns a label (e.g., a class) to each candidate region, a regressor (e.g., linear, polynomial, logarithmic, etc.) that revises two or more candidate regions into a single candidate region (e.g., by performing regression analysis), and a classifier (e.g., k-nearest neighbors (k-NN), softmax, etc.) that generates candidate recognition profiles, each including a bounding box (e.g., an identification of a candidate region) and a probability that a label corresponds to the image portion of the candidate region.
At block 328, the neural network may perform non-maximum suppression, which may delete some candidate recognition profiles. Non-maximum suppression orders the candidate recognition profiles according to probability (e.g., from high to low or low to high). The neural network then selects the candidate recognition profile with the highest probability. The neural network suppresses (e.g., deletes) all candidate recognition profiles whose bounding boxes substantially overlap with the bounding box of the selected candidate recognition profile (e.g., based on a predefined overlap threshold). Alternatively or additionally, non-maximum suppression may suppress candidate recognition profiles whose probability does not exceed a threshold. The remaining candidate recognition profiles (or, in some embodiments, only their labels) are used for scale matching.
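A minimal sketch of the suppression step just described; the tuple layout, the iou helper passed in as a parameter, and the default thresholds are assumptions for illustration.

```python
def non_max_suppression(profiles, iou, overlap_threshold=0.5, min_probability=0.0):
    """Keep the highest-probability profile and suppress overlapping ones.

    `profiles` are (box, label, probability) tuples and `iou` is an
    intersection-over-union helper; thresholds are illustrative assumptions.
    """
    kept = []
    remaining = sorted(profiles, key=lambda p: p[2], reverse=True)
    while remaining:
        best = remaining.pop(0)
        if best[2] < min_probability:
            break  # optionally suppress low-probability candidates as well
        kept.append(best)
        remaining = [p for p in remaining if iou(best[0], p[0]) < overlap_threshold]
    return kept
```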
At block 316, the AR framework uses the labels of the candidate recognition profiles to obtain scale information from the database and supplements the scale information with depth values. For example, the AR framework may use the label to send a query to the database. The scale information database stores scale information for each label. For each dimension (e.g., x, y, z), the scale information may include: 1) a minimum and a maximum, or 2) a mean and a standard deviation. The scale information database may return the scale information to the AR framework (or the neural network). The AR framework may generate a virtual geometry for each candidate recognition profile. The virtual geometry is generated by modifying the scale information using the depth values of the surfaces included within the bounding box of the candidate recognition profile (e.g., the depth values extracted at block 312). In other words, the AR framework receives the general dimensions of an arbitrary object assigned the label of a candidate recognition profile and modifies these dimensions to account for the measured distance of the object from the computing device (AR framework). For example, if the depth value is high, the virtual geometry will be smaller than the scale information (because the object appears smaller to the user), and if the depth value is low, the virtual geometry will be larger than the scale information (because the object appears larger to the user).
The virtual geometry may be projected onto a two-dimensional camera view presented by a display of the computing device. In some cases, a virtual arbitrary object with a modified size is presented on the display. In other cases, bounding boxes having dimensions corresponding to the modified dimensions are displayed on the display.
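One way the modification and projection described above could be computed is sketched below. The pinhole relation pixel_size = f × metric_size / depth is an assumption used for illustration; the disclosure does not give the projection formula.

```python
def virtual_geometry_px(scale_info, depth, focal_length_px):
    """Expected on-screen size (in pixels) of an arbitrary object with the given label.

    `scale_info` is (min_w, min_h, max_w, max_h) in meters; the pinhole relation
    used here is an illustrative assumption.
    """
    min_w, min_h, max_w, max_h = scale_info
    return (
        focal_length_px * min_w / depth,  # smallest expected width in pixels
        focal_length_px * min_h / depth,  # smallest expected height in pixels
        focal_length_px * max_w / depth,  # largest expected width in pixels
        focal_length_px * max_h / depth,  # largest expected height in pixels
    )
```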
At block 320, a scale matching process may be performed. The scale matching process compares the virtual geometry to the bounding box of each candidate recognition profile. The scale match probability may be generated based on a comparison between the bounding box and the virtual geometry. In particular, the more a bounding box covers a virtual geometry, the more accurate the bounding box may be. The value of the scale matching probability depends on the extent to which the bounding box covers the virtual geometry. In some examples, a high degree of coverage increases the value of the scale matching probability. A low degree of coverage reduces the value of the scale matching probability. In other examples, a high degree of coverage may reduce the value of the scale matching probability. A low degree of coverage increases the value of the scale matching probability.
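A minimal sketch of one way to turn the coverage comparison into a scale match probability. The symmetric size-ratio score below is an assumption; the disclosure only requires that the value reflect the degree to which the bounding box covers the virtual geometry.

```python
def scale_match_probability(box_wh, virtual_wh):
    """Score how well a bounding box matches the expected (virtual) geometry.

    Uses a symmetric size-ratio so the score drops whether the box is too large
    or too small; the exact scoring function is an illustrative assumption.
    """
    bw, bh = box_wh
    vw, vh = virtual_wh
    ratio_w = min(bw, vw) / max(bw, vw)
    ratio_h = min(bh, vh) / max(bh, vh)
    return ratio_w * ratio_h  # 1.0 = perfect match, approaches 0.0 as sizes diverge
```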
In some examples, the AR framework processes the scale information to generate virtual geometry and scale matching probabilities. In other examples, a neural network receives the scale information, the computed pose, and the point cloud. The neural network then generates virtual geometry and scale matching probabilities. In other examples, virtual geometry and scale matching probabilities may be generated using a combination of an AR framework and a neural network.
At block 332, for each remaining candidate recognition profile, the neural network determines an updated probability that the image portion within the bounding box corresponds to a particular label. The neural network may receive the candidate recognition profiles output from the non-maximum suppression process and a corresponding scale match probability for each candidate recognition profile. The scale match probabilities may be used to update the probabilities of the corresponding candidate recognition profiles. For example, if the scale match probability is low (e.g., there is little overlap between the virtual geometry and the bounding box), the probability of the candidate recognition profile may be reduced, effectively eliminating that candidate in favor of candidate recognition profiles whose bounding boxes better match the virtual geometry.
For example, a first candidate recognition profile may include a bounding box and a label corresponding to a table. The depth value of the image portion indicates that the object is far away, and therefore the virtual geometry of the object is smaller than the bounding box. The probability of the candidate recognition profile is decreased to reflect that the size of the candidate object is inconsistent with the measurements of the computing device.
The neural network selects the candidate recognition profile with the highest (modified) probability from the remaining candidate recognition profiles. The neural network then outputs the bounding box, the label, and (optionally) the probability that the portion of the image surrounded by the bounding box corresponds to the label.
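The re-weighting and selection could be sketched as follows, consistent with the update P' = P × S discussed later in connection with FIG. 8; the tuple layout and the optional pruning threshold are assumptions for illustration.

```python
def update_and_select(profiles, scale_scores, min_keep=0.0):
    """Apply P' = P * S to each remaining candidate and pick the best one.

    `profiles` are (box, label, P) tuples aligned with `scale_scores`;
    the pruning threshold `min_keep` is an illustrative assumption.
    """
    updated = [
        (box, label, p * s)
        for (box, label, p), s in zip(profiles, scale_scores)
        if p * s >= min_keep
    ]
    return max(updated, key=lambda t: t[2]) if updated else None
```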
At block 336, the output bounding box is displayed by overlaying the bounding box on the view of the AR scene. The bounding box may be presented with the label and the probability that the portion of the image enclosed by the bounding box corresponds to the label. In some cases, the AR framework may track the portion of the image enclosed by the bounding box so that even if the environment view changes (e.g., the camera moves), the bounding box continues to be displayed over the correct portion of the image.
Fig. 4 illustrates an example block diagram of a neural network augmented with augmented reality sensor data in accordance with an alternative embodiment of the present invention. The discussion provided in fig. 2 in relation to a neural network augmented with augmented reality sensor data also applies to fig. 4, where appropriate. The neural network may be updated with AR sensor data at various stages of the neural network. The block diagram of fig. 2 (and the corresponding flow diagram of fig. 3) shows the modifications that occur after the non-maxima suppression process (e.g., the last layer of the neural network before the output is generated). Updating the neural network at an earlier stage may reduce the number of candidate regions processed by the neural network (e.g., eliminate regions and/or labels with low probability of accuracy), thereby improving resource consumption of the neural network and increasing the speed at which the neural network processes images. For example, the object detection neural network 420 incorporates the scale matching 408 from the AR sensor data (e.g., the AR framework 204, pose 208, point cloud 212, and scale information database 216 as described above in connection with fig. 2) after the object detection layer 404, before the non-maxima suppression 412.
The object detection layer 404 of the neural network generates candidate recognition profiles for all possible candidate regions in the image without suppressing redundant (e.g., overlapping) profiles having a non-maximal probability within the local region. The operation of the detection layer may be followed by performing a scale matching 408 to filter out candidate recognition profiles having a lower probability that the portion of the image enclosed by the boundaries of the candidate recognition profile corresponds to the assigned label. Since these candidate identification profiles are deleted prior to non-maximum suppression, non-maximum suppression can process fewer candidate identification profiles, thereby increasing the execution speed of non-maximum suppression and the speed at which output can be obtained. Further, reducing the candidate recognition profiles early in the process may prevent inaccurate candidate recognition profiles from biasing the remaining layers of the neural network (e.g., reducing the accuracy of the neural network).
FIG. 5 illustrates an example process for object detection using the augmented neural network of FIG. 4, according to an embodiment of the present invention. Blocks 304-320 are similar to those described in connection with fig. 3, and the description related to fig. 3 applies to fig. 5 as applicable.
At block 504, the neural network processes one or more images using the various layers of the neural network. The neural network executes an object detection layer (e.g., object detection layer 404 of fig. 4) that generates one or more candidate recognition profiles (e.g., each candidate recognition profile includes a bounding box surrounding a candidate region of the image and a probability that the candidate region corresponds to a particular label). The labels and probabilities may be output from a trained classifier layer that identifies features in the candidate regions that are similar or identical to features in the training data set. The output of the object detection layer is passed to the classification of block 508 and the scale information retrieval of block 316. In the example process of FIG. 5, the scale information retrieval at block 316 is initiated using all possible candidate regions of the image without suppressing redundant (e.g., overlapping) regions having a non-maximum probability within the local region.
At block 508, the neural network determines, for each remaining candidate recognition profile, an updated probability that the image portion within the bounding box corresponds to the particular label. The scale match probabilities retrieved from block 320 may be used to update the probabilities of the corresponding candidate recognition profiles. Since the neural network has not applied the non-maxima suppression process, the probabilities (and labels) of all possible candidate regions (e.g., including redundant regions) of the image are updated using the corresponding scale matching information from block 320. In some examples, candidate recognition profiles with probabilities below a threshold may not be further processed.
At block 512, the neural network may apply non-maximum suppression to the candidate recognition profiles. Non-maximum suppression orders the candidate recognition profiles according to probability (e.g., from high to low or low to high). The neural network then selects the candidate recognition profile with the highest probability. The neural network suppresses (e.g., deletes) all candidate recognition profiles whose bounding boxes substantially overlap with the bounding box of the selected candidate recognition profile (e.g., based on a predefined overlap threshold). Alternatively or additionally, non-maximum suppression may suppress candidate recognition profiles whose probability does not exceed a threshold. Because the probabilities of the candidate recognition profiles were updated (e.g., at block 508), the non-maximum suppression may suppress more candidate recognition profiles than the non-maximum suppression performed at block 328 of fig. 3. This improves the accuracy of the neural network because fewer (e.g., perhaps only one) candidate recognition profiles remain after the non-maximum suppression process. If more than one candidate recognition profile remains, the neural network selects the candidate recognition profile with the highest probability.
At block 516, the bounding box, label, and (optionally) probability of the selected candidate recognition profile are displayed by superimposing the bounding box on the view of the AR scene. In some cases, the AR framework may track the portion of the image enclosed by the bounding box so that even if the environment view changes (e.g., the camera moves), the bounding box continues to be displayed over the correct portion of the image.
Fig. 6 illustrates an example of a block diagram of a neural network augmented with augmented reality sensor data in accordance with a particular embodiment of the present invention. In some examples, the neural network may use two-stage object detection (e.g., Faster R-CNN). The discussion of fig. 2 and 4 regarding neural networks augmented with augmented reality sensor data also applies to fig. 6, where appropriate. The first stage identifies a large number of candidate regions, and the second stage performs object detection/classification on the candidate regions (e.g., using subsequent layers of the neural network). In this example, the scale matching may be performed after the first stage, thereby reducing the number of candidate regions for subsequent neural network layers to process (e.g., reducing the processing load and resource consumption of the subsequent neural network layers). In other examples, such as in a single-stage object detection network (e.g., SSD), a multi-scale feature map may be used to generate classifiers that detect objects at multiple scales. In this example, a classifier with the appropriate scale may be selected for detection using scale matching. For example, for each layer, if the bounding box would appear too small or too large based on the measured real-world dimensions (e.g., determined by performing the scale matching using the point cloud, the pose of the computing device, and the scale information), the layer may omit its detection classifier.
In a two-stage object detection neural network, the object detection neural network 620 includes a region proposal layer 604 (e.g., stage one) that generates a number of candidate regions for subsequent processing layers. The candidate regions may be reduced using scale matching. For example, for each candidate region, the label assigned to the candidate region may be used to look up the scale information. The scale information may be modified using the depth values (e.g., using the pose 208 and the point cloud 212 in the AR framework 204) to generate a virtual geometry for the candidate region. If the bounding box of the candidate region is significantly larger or smaller than the virtual geometry (e.g., based on a predetermined threshold), the neural network may not subsequently process the candidate region. Subsequent layers of the neural network, which may include object detection layers, classifiers, the non-maximum suppression process 612, combinations thereof, and the like, may be used to output bounding boxes and labels for the image regions.
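A minimal sketch of the stage-one filtering described above; the proposal layout, the virtual_geometry_px helper, and the 0.5x/2x bounds are assumed examples of the predetermined thresholds, not values from the disclosure.

```python
def filter_region_proposals(proposals, virtual_geometry_px, lower=0.5, upper=2.0):
    """Drop stage-one proposals whose box size is implausible at the measured depth.

    `proposals` are ((width, height), label) pairs; `virtual_geometry_px(label)`
    returns the expected (width, height) in pixels at the measured depth.
    """
    kept = []
    for (bw, bh), label in proposals:
        vw, vh = virtual_geometry_px(label)
        if lower <= bw / vw <= upper and lower <= bh / vh <= upper:
            kept.append(((bw, bh), label))  # plausible size: pass to stage two
    return kept
```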
In a single-stage object detection neural network, the object detection neural network 620 includes one or more layers (rather than an entire stage) for identifying the candidate region 604. The object detection neural network 620 may include a plurality of scale classifiers 608 that may detect particular features in an image in one or more scales (e.g., each scale corresponding to a size of the feature at a different distance from a camera of a computing device). A multi-scale feature map (e.g., a map including multiple scales for each feature) may be used to generate (e.g., train) a classifier to detect features of a particular scale. In general, since the scale may be unknown, each classifier may be applied to detect features (e.g., each classifier is applied via a different layer of the neural network).
The scale matching may be used to determine a scale that is inconsistent with the AR sensor data (e.g., the bounding box is too small or too large, etc.). Classifiers corresponding to these scales can be omitted, which can reduce the number of layers of the neural network. Subsequent layers of the neural network may include object detection layers, classifiers, non-maxima suppression processes 612, combinations thereof, and the like. Subsequent layers may be used to output bounding boxes and labels for the image region.
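Selecting the scale classifiers that remain consistent with the AR measurement could be sketched as follows; the mapping from nominal feature size to classifier layer and the tolerance value are assumptions for illustration.

```python
def select_scale_classifiers(classifiers, expected_px, tolerance=0.5):
    """Keep only the detection classifiers whose scale fits the AR measurement.

    `classifiers` maps a nominal feature size in pixels to a classifier layer;
    `expected_px` is the on-screen size implied by the depth value and scale
    information. The tolerance is an illustrative assumption.
    """
    selected = {}
    for nominal_px, clf in classifiers.items():
        ratio = min(nominal_px, expected_px) / max(nominal_px, expected_px)
        if ratio >= tolerance:
            selected[nominal_px] = clf   # consistent scale: keep this layer's classifier
        # otherwise the layer omits its detection classifier for this object
    return selected
```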
The AR framework 204 may operate according to an AR session. Input images may be received and used to continuously track the pose 208 of the computing device (e.g., tracking using SLAM or the like). The AR framework may also calculate a point cloud 212 associated with surfaces and/or objects within the image (e.g., by using depth information included in the RGBD image, a separate depth sensor, or using stereo disparity of the two images). The point cloud indicates the depth of each surface (e.g., the distance of the surface from the camera that acquired the image).
The label of each candidate recognition profile may be output to the AR framework. The AR framework may use the label to send a query to the scale information database 216, which returns the scale information corresponding to the label. The scale information includes the dimensions (e.g., height, width, and depth) of an arbitrary object assigned the label.
The AR framework 204 may process the scale information to generate virtual geometries for the candidate recognition profiles. A virtual geometry may be generated by modifying the scale information using the measured depth values of the pixels included in the bounding box (e.g., using the pose 208 and the point cloud 212 of the computing device). The virtual geometry may be passed back to the object detection neural network 620, which may perform scale matching. The scale matching may be performed to reduce the number of candidate regions (e.g., candidate recognition profiles) to be processed by subsequent layers of the object detection neural network 620 or to eliminate scale classifiers corresponding to scales that are inconsistent with the AR sensor data.
FIG. 7 illustrates an example process for object detection using the augmented neural network of FIG. 6, according to an embodiment of the present invention. Blocks 304-320 are similar to those described in conjunction with fig. 3, and the description provided with respect to fig. 3 also applies to fig. 7, where appropriate. The description of the example process of fig. 7 begins at block 704, where the neural network may begin processing the input images received from block 304.
At block 708, depending on the type of neural network, candidate regions may be generated and/or a scale classifier may be generated. For example, a two-stage object detection neural network generates a large number of candidate regions (e.g., candidate recognition profiles). A single stage object detection neural network may generate candidate regions and use a multi-scale feature map to generate a scale-based classifier.
At block 316, scale information is obtained from the database using the label assigned to the candidate recognition profile. The scale information may be supplemented by the depth value (e.g., from block 312) to generate a virtual geometry (e.g., the dimensions of an arbitrary object assigned the label, scaled to be consistent with the distance from the camera to the image portion enclosed by the bounding box).
At block 320, a scale match may be performed in which it is determined whether the bounding box of the candidate recognition profile is greater than, less than, or equal to the virtual geometry. For a two-stage object detection neural network, the correspondence between the virtual geometry and the bounding box of a candidate recognition profile may be used to remove some candidate recognition profiles from further processing by the neural network. For example, a candidate recognition profile may be deleted if its bounding box is greater than a first predetermined threshold or less than a second predetermined threshold. For a single-stage object detection neural network, scale matching is used to identify scales that are inconsistent with the scale of the image portion enclosed by the bounding box. The processing of these candidate recognition profiles may be limited to scale classifiers (e.g., in subsequent layers) whose scales (i.e., dimensions at a particular distance from the camera) are similar to that of the image portion enclosed by the bounding box.
At block 712, the scale match may be used to determine an updated probability that the portion of the image enclosed by the bounding box corresponds to a particular (or possibly new) label. The probabilities may be updated using the scale matching information.
At block 716, the candidate identification profile is processed by subsequent layers of the neural network. Subsequent layers of the neural network may include an object detection layer, classifiers, non-maximum suppression process 612, combinations thereof, and the like.
At block 720, the neural network may generate an output including a bounding box surrounding a portion of the image and a label corresponding to the portion of the image. The output may also include a probability that the image portion corresponds to the label.
The example processes described in fig. 3, 5, and 7 are performed in conjunction with a computer system, which is an example of the computer system described above. Some or all of the blocks of the processes described above may be implemented by specific hardware on a computer system and/or may be implemented as computer readable instructions stored on a non-transitory computer readable medium of a computer system. The stored computer-readable instructions represent programmable modules comprising code executable by a processor of a computer system. Execution of such instructions configures the computer system to perform the corresponding operations. Each programmable module in combination with a processor represents a means for executing a corresponding block. While the blocks are shown in a particular order, it should be understood that a particular order is not required and that one or more operations may be omitted, skipped, and/or reordered.
FIG. 8 illustrates an example block diagram of the layers that execute a scale matching process, in accordance with an embodiment of the present disclosure. The scale matching may use operations across three layers: a user input layer (where the user interacts with the AR framework), a neural network layer, and the AR framework layer. The user input layer includes a label scale database 802. The scale information for each label may be added to the label scale database 802 by user input. For example, when a new label is defined, the user may create a new entry in the database indicating the dimensions of any object assigned to that label. Alternatively or additionally, the scale information may be added automatically from other network sources (e.g., the Internet, another database, etc.). The label scale database 802 (along with the label 'c' 806 of the bounding polygon from the candidate recognition profile 810) is used to identify the geometry 804 (e.g., scale information) of any object assigned the label 'c' 806.
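As a sketch, the label scale database 802 can be modeled as a simple mapping from label to nominal physical dimensions. The labels and dimension values below are illustrative examples only, not data from the disclosure.

```python
# Illustrative label-to-dimensions mapping (height, width, depth in meters).
LABEL_SCALE_DB = {
    "chair":   (0.90, 0.50, 0.50),
    "monitor": (0.35, 0.55, 0.05),
    "cup":     (0.10, 0.08, 0.08),
}

def lookup_scale(label):
    """Return the nominal dimensions stored for a label, or None if unknown."""
    return LABEL_SCALE_DB.get(label)

print(lookup_scale("chair"))  # -> (0.9, 0.5, 0.5)
```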
The AR framework may use this geometry information to generate a virtual geometry that includes a bounding polygon 808, the bounding polygon 808 including the label 'c' 806, vertices (e.g., dimensions), and a confidence value indicating that the bounding polygon is accurate. The size of the bounding polygon 808 is determined based on the geometry 804, the size 824 of the candidate recognition profile, the camera pose 816, and the point cloud 812.
The bounding polygon 808 may be compared to the size 824 of the candidate recognition profile, and the neural network may use this comparison to compute a scale matching probability 'S' 820. The scale matching probability 'S' 820 may then be used (at the neural network layer), together with the probability 'P' 828, to generate an updated probability P' 824 for the candidate recognition profile. In some cases, the updated probability P' may be calculated as P' = P × S. The updated probabilities may then be used in subsequent layers of the neural network to generate an output.
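A short sketch of that probability update follows; the cut-off threshold is an illustrative addition showing how low-scoring candidates might be discarded in subsequent layers, not a value taken from the disclosure.

```python
def updated_probability(p, s, threshold=0.05):
    """Combine the detector probability P with the scale matching
    probability S as P' = P * S; candidates whose updated probability
    falls below the (illustrative) threshold could be discarded."""
    p_prime = p * s
    return p_prime, p_prime >= threshold

print(updated_probability(0.80, 0.90))  # approx. (0.72, True)  -> keep
print(updated_probability(0.60, 0.05))  # approx. (0.03, False) -> discard
```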
Fig. 9 illustrates an example of components of a computer system 900 according to at least one embodiment of the present disclosure. Computer system 900 is an example of the computer system described above. Although these components are shown as belonging to the same computer system 900, the computer system 900 may also be distributed.
Computer system 900 includes at least a processor 902, a memory 904, a storage device 906, an input/output (I/O) peripheral 908, a communication peripheral 910, and an interface bus 912. Interface bus 912 is used to communicate, send and transfer data, control and commands between the various components of computer system 900. Memory 904 and storage 906 include computer-readable storage media such as RAM, ROM, electrically erasable programmable read-only memory (EEPROM), hard drives, CD-ROMs, optical storage devices, magnetic storage devices, electronic non-volatile computer storage devices (e.g., flash memory), and other tangible storage media. Any such computer-readable storage media may be used to store instructions or program code that implement aspects of the present disclosure. Memory 904 and storage 906 also include computer-readable signal media. A computer readable signal medium includes a propagated data signal with computer readable program code embodied therein. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any combination thereof. Computer-readable signal media includes any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use in connection with computer system 900.
Further, the memory 904 includes an operating system, programs, and applications. The processor 902 is configured to execute stored instructions and includes, for example, a logic processing unit, a microprocessor, a digital signal processor, and other processors. Memory 904 and/or processor 902 may be virtualized and may be hosted in another computer system, such as a cloud network or a data center. I/O peripherals 908 include user interfaces such as keyboards, screens (e.g., touch screens), microphones, speakers, other input/output devices, and computing components such as graphics processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals. I/O peripheral devices 908 are connected to processor 902 through any port coupled to interface bus 912. Communications peripheral devices 910 are used to facilitate communications between computer system 900 and other computer systems over a communications network and include, for example, network interface controllers, modems, wireless and wired interface cards, antennas, and other communications peripheral devices.
While the subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it is to be understood that the present disclosure has been presented for purposes of illustration and not limitation, and does not preclude the inclusion of modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Indeed, the methods and systems described herein may be embodied in a variety of other forms. Furthermore, various omissions, substitutions, and changes in the form of the methods and systems described herein may be made without departing from the spirit of the disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the disclosure.
Unless specifically stated otherwise, it is appreciated that, throughout this specification, discussions utilizing terms such as "processing," "computing," "calculating," "determining," and "identifying" refer to the actions and processes of a computer system (e.g., one or more computers or similar electronic computing systems or devices) that manipulates and transforms data represented as physical electronic or magnetic quantities within the computing platform's memories, registers, or other information storage, transmission, or display devices.
The one or more systems discussed herein are not limited to any particular hardware architecture or configuration. The computer system may include any suitable arrangement of components that provides a result that is conditional on one or more inputs. Suitable computer systems include microprocessor-based general-purpose computer systems that access stored software that programs or configures the computer system from a general-purpose computing device to a specific-purpose computing device implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combination of languages may be used to implement the teachings contained herein in software for programming or configuring a computer system.
Embodiments of the methods disclosed herein may be performed in the operation of such a computer system. The order of the blocks presented in the above examples may be changed, e.g., the blocks may be reordered, combined, and/or broken into sub-blocks. Some blocks or processes may be performed in parallel.
As used herein, conditional language, such as "may," "e.g.," and the like, is generally intended to convey that certain examples include, while other examples do not include, certain features, elements, and/or steps, unless expressly stated otherwise or otherwise understood within the context in which it is used. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more examples, or that one or more examples necessarily include logic for deciding, with or without author input or prompting, whether such features, elements, and/or steps are included or are to be performed in any particular example.
The terms "comprising," "having," and the like, are synonymous and are used in an open-ended fashion, and do not exclude other elements, features, acts, operations, and the like. Furthermore, the term "or" is used in its inclusive (and not exclusive) sense, e.g., when used in conjunction with a list of elements, the term "or" indicates one, some, or all of the elements in the list. As used herein, "for" or "configured to" refers to open and inclusive language and does not exclude devices that are used or configured to perform additional tasks or steps. Moreover, the use of "based on" is meant to be open and inclusive, as a process, step, calculation, or other action that is "based on" one or more recited conditions or values may in fact be based on additional conditions or exceeding the recited values. Similarly, the use of "based, at least in part, on" means open and inclusive, in that a process, step, calculation, or other action that is "based, at least in part, on one or more recited conditions or values may, in practice, be based on additional conditions or values than those recited. The headings, lists, and numbers included herein are for ease of explanation only and are not meant to be limiting.
The various features and processes described above may be used independently of one another or may be used in various combinations. All possible combinations and sub-combinations are intended to fall within the scope of the present disclosure. Moreover, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular order, and the blocks or states associated therewith may be performed in other suitable orders. For example, the blocks or states described may be performed in an order different than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in series, in parallel, or in some other manner. Blocks or states may be added to or deleted from the disclosed examples. Similarly, the example systems and components described herein may be configured differently than described. For example, elements may be added, removed, or rearranged in comparison to the disclosed examples.

Claims (20)

1. A method, comprising:
receiving, by a camera of a mobile device, an image of an environment;
generating, by the mobile device, a point cloud indicating a distance of a particular object displayed in the image;
extracting a depth value from the point cloud, the depth value indicating a distance between the camera and the particular object;
receiving one or more candidate recognition profiles from a trained neural network, each candidate recognition profile comprising a first bounding box surrounding a first portion of the image, the first portion of the image comprising at least a portion of the particular object, and a recognition probability indicative of a probability that the particular object corresponds to a particular label;
receiving scale information corresponding to the particular label, the scale information indicating a geometry of any object corresponding to the particular label;
generating a virtual geometry by modifying the scale information using the depth values;
generating a scale matching probability by comparing the first bounding box to the virtual geometry;
updating the neural network using the scale matching probabilities; and
generating a final recognition profile using the updated neural network, the final recognition profile including a second bounding box surrounding a second portion of the image including the particular object and an updated label corresponding to the particular object.
2. The method of claim 1, wherein the scale information comprises a height value, a length value, and a width value of any object corresponding to the label.
3. The method of claim 1, wherein the first bounding box is displayed by the mobile device.
4. The method of claim 1, wherein the second bounding box is displayed by the mobile device.
5. The method of claim 3, wherein a size of the second bounding box is based on the scale information.
6. The method of claim 1, further comprising:
determining that the updated recognition probability is below a threshold; and
executing the trained neural network to provide a new recognition profile.
7. The method of claim 1, wherein the scale information is received before the trained neural network performs a non-maximum suppression process.
8. The method of claim 1, wherein the trained neural network generates a plurality of recognition profiles and the scale information is used to delete recognition profiles having a probability below a threshold.
9. The method of claim 1, wherein the trained neural network generates a plurality of recognition profiles and the scale information is used to combine two or more recognition profiles.
10. A method, comprising:
receiving, by a camera of a mobile device, an image of an environment;
generating, by the mobile device, a point cloud based on the image, the point cloud indicating a distance of at least one object within the image to the camera;
receiving one or more recognition profiles from a neural network, each recognition profile comprising a bounding box surrounding a portion of the image, a label corresponding to the portion of the image, and a probability that the portion of the image corresponds to the label;
receiving scale information, wherein the scale information comprises dimensions of any object corresponding to the label;
updating the neural network using the scale information and the point cloud; and
generating, using the updated neural network, a second bounding box around a second portion of the image and a label corresponding to an object displayed in the second portion of the image.
11. The method of claim 10, wherein updating the neural network comprises modifying a probability of at least one of the one or more recognition profiles.
12. The method of claim 10, wherein updating the neural network comprises:
deleting at least one of the one or more recognition profiles based on the scale information and a bounding box of the at least one recognition profile.
13. The method of claim 10, wherein updating the neural network comprises:
merging at least two recognition profiles.
14. The method of claim 10, wherein the neural network is updated before the neural network performs a non-maximum suppression process.
15. A system, comprising:
one or more processors; and
a non-transitory machine-readable medium comprising instructions that, when executed by the one or more processors, cause the one or more processors to:
receiving an image of an environment through a camera of a mobile device;
generating, by the mobile device, a point cloud indicating a distance of a particular object displayed in the image;
extracting a depth value from the point cloud, the depth value indicating a distance between the camera and the particular object;
receiving one or more candidate recognition profiles from a trained neural network, each candidate recognition profile comprising a first bounding box surrounding a first portion of the image, the first portion of the image comprising at least a portion of the particular object, and a recognition probability indicative of a probability that the particular object corresponds to a particular label;
receiving scale information corresponding to the particular label, the scale information indicating a geometry of any object corresponding to the particular label;
generating a virtual geometry by modifying the scale information using the depth values;
generating a scale matching probability by comparing the first bounding box to the virtual geometry;
updating the neural network using the scale matching probabilities; and
generating a final recognition profile using the updated neural network, the final recognition profile including a second bounding box surrounding a second portion of the image including the particular object and an updated label corresponding to the particular object.
16. The system of claim 15, wherein the scale information comprises a height value, a length value, and a width value of any object corresponding to the label.
17. The system of claim 15, wherein the first bounding box is displayed by the mobile device.
18. A system, comprising:
one or more processors; and
a non-transitory machine-readable medium comprising instructions that, when executed by the one or more processors, cause the one or more processors to:
receiving an image of an environment through a camera of a mobile device;
generating, by the mobile device, a point cloud based on the image, the point cloud indicating a distance of at least one object within the image to the camera;
receiving one or more recognition profiles from a neural network, each recognition profile comprising a bounding box surrounding a portion of the image, a label corresponding to the portion of the image, and a probability that the portion of the image corresponds to the label;
receiving scale information, wherein the scale information comprises dimensions of any object corresponding to the label;
updating the neural network using the scale information and the point cloud; and
generating, using the updated neural network, a second bounding box around a second portion of the image and a label corresponding to an object displayed in the second portion of the image.
19. The system of claim 18, wherein updating the neural network comprises modifying a probability of at least one of the one or more recognition profiles.
20. The system of claim 18, wherein updating the neural network comprises:
deleting at least one of the one or more recognition profiles based on the scale information and a bounding box of the at least one recognition profile.
CN202180013885.7A 2020-02-14 2021-02-08 Object detection system and method for augmented reality Pending CN115104135A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202062977025P 2020-02-14 2020-02-14
US62/977,025 2020-02-14
PCT/CN2021/076066 WO2021160097A1 (en) 2020-02-14 2021-02-08 System and method for object detection for augmented reality

Publications (1)

Publication Number Publication Date
CN115104135A true CN115104135A (en) 2022-09-23

Family

ID=77291387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180013885.7A Pending CN115104135A (en) 2020-02-14 2021-02-08 Object detection system and method for augmented reality

Country Status (2)

Country Link
CN (1) CN115104135A (en)
WO (1) WO2021160097A1 (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9552675B2 (en) * 2013-06-03 2017-01-24 Time Traveler App Llc Display application and perspective views of virtual space
US10346723B2 (en) * 2016-11-01 2019-07-09 Snap Inc. Neural network for object detection in images
US10269159B2 (en) * 2017-07-27 2019-04-23 Rockwell Collins, Inc. Neural network foreground separation for mixed reality
US11618438B2 (en) * 2018-03-26 2023-04-04 International Business Machines Corporation Three-dimensional object localization for obstacle avoidance using one-shot convolutional neural network
CN116778368A (en) * 2018-06-25 2023-09-19 苹果公司 Planar detection using semantic segmentation
WO2020009806A1 (en) * 2018-07-05 2020-01-09 Optimum Semiconductor Technologies Inc. Object detection using multiple sensors and reduced complexity neural networks
CN110221690B (en) * 2019-05-13 2022-01-04 Oppo广东移动通信有限公司 Gesture interaction method and device based on AR scene, storage medium and communication terminal

Also Published As

Publication number Publication date
WO2021160097A1 (en) 2021-08-19

Similar Documents

Publication Publication Date Title
CN111328396B (en) Pose estimation and model retrieval for objects in images
JP7236545B2 (en) Video target tracking method and apparatus, computer apparatus, program
US11176758B2 (en) Overlaying 3D augmented reality content on real-world objects using image segmentation
US10373380B2 (en) 3-dimensional scene analysis for augmented reality operations
US9053571B2 (en) Generating computer models of 3D objects
US11494915B2 (en) Image processing system, image processing method, and program
WO2018089163A1 (en) Methods and systems of performing object pose estimation
JP5713790B2 (en) Image processing apparatus, image processing method, and program
US10311295B2 (en) Heuristic finger detection method based on depth image
US11004221B2 (en) Depth recovery methods and apparatuses for monocular image, and computer devices
EP2671384A2 (en) Mobile camera localization using depth maps
US10867390B2 (en) Computer vision processing
CN112560684B (en) Lane line detection method, lane line detection device, electronic equipment, storage medium and vehicle
CN110648363A (en) Camera posture determining method and device, storage medium and electronic equipment
CN107949851B (en) Fast and robust identification of end points of objects within a scene
KR20220043847A (en) Method, apparatus, electronic device and storage medium for estimating object pose
WO2021068799A1 (en) Occlusion and collision detection for augmented reality applications
CN112528858A (en) Training method, device, equipment, medium and product of human body posture estimation model
CN109785367B (en) Method and device for filtering foreign points in three-dimensional model tracking
JP2017033556A (en) Image processing method and electronic apparatus
CN115104135A (en) Object detection system and method for augmented reality
JP6792195B2 (en) Image processing device
CN115719436A (en) Model training method, target detection method, device, equipment and storage medium
CN114489341A (en) Gesture determination method and apparatus, electronic device and storage medium
CN112950647A (en) Image segmentation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination