WO2020150904A1 - Neural network based obstacle detection for mobile platforms, and associated systems and methods - Google Patents

Neural network based obstacle detection for mobile platforms, and associated systems and methods

Info

Publication number
WO2020150904A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
mobile platform
feature map
obstacle
obstacle detection
Prior art date
Application number
PCT/CN2019/072703
Other languages
French (fr)
Inventor
Xiaozhi Chen
Leijie ZHANG
Cong Zhao
Original Assignee
SZ DJI Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by SZ DJI Technology Co., Ltd. filed Critical SZ DJI Technology Co., Ltd.
Priority to PCT/CN2019/072703 priority Critical patent/WO2020150904A1/en
Priority to CN201980087253.8A priority patent/CN113228043A/en
Publication of WO2020150904A1 publication Critical patent/WO2020150904A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • G06V20/647Three-dimensional objects by matching two-dimensional images to three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads

Definitions

  • the presently disclosed technology is generally directed to detecting obstacles, such as one or more pedestrians, vehicles, buildings, or other obstacle types, in a three-dimensional (3D) environment adjacent to a mobile platform.
  • the environment surrounding a mobile platform can typically be scanned or otherwise detected using one or more sensors.
  • the mobile platform can be equipped with a stereo vision system (e.g., a “stereo camera” ) to sense its surrounding environment.
  • a stereo camera is typically a type of camera with two or more lenses each having a separate image sensor or film frame.
  • depth information (e.g., distance from objects in the scene to the stereo camera) can be determined based on the disparity between images captured by the separate lenses.
  • the mobile platform can be equipped with one or more LiDAR sensors, which typically transmit a pulsed signal (e.g., a laser pulse) and detect the reflected signal to measure distances to objects in the environment.
  • a computer-implemented method for detecting obstacles using one or more sensors carried by a mobile platform includes: obtaining sensor data that indicate at least a portion of an environment surrounding the mobile platform from the one or more sensors; determining, based at least partly on the sensor data, depth information, a feature map, and a plurality of candidate regions, wherein each candidate region indicates at least a portion of an obstacle within the environment; and feeding the depth information, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.
  • the one or more sensors include at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  • the depth information includes a point cloud.
  • the point cloud is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  • the depth information includes a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from at least one of a stereo camera or mono camera.
  • the depth information is generated by feeding the feature map to a depth estimating neural network separate from the obstacle detection neural network.
  • the sensor data includes image data and wherein the feature map is generated by at least feeding the image data to a base neural network separate from the obstacle detection neural network.
  • the sensor data includes a point cloud and wherein the feature map is generated based, at least in part, on projecting the point cloud onto a 2D grid defined in accordance with the image data.
  • the projecting is based, at least in part, on extrinsic and/or intrinsic calibration parameters with respect to at least one sensor that generated the image data.
  • the projected data includes at least one of a height, distance, or angle measurement for individual grid blocks of the 2D grid.
  • the feature map is generated by further feeding the projected data to the base neural network.
  • the feature map is smaller in size than the image data.
  • At least one of the base neural network or the obstacle detection neural network includes one or more convolutional and/or pooling layers.
  • the method further comprises feeding the feature map to an intermediate neural network to generate the plurality of candidate regions defined in accordance with the image data.
  • each candidate region is a two-dimensional (2D) region that includes a respective target pixel of the image data.
  • the respective target pixel is associated with a probability of indicating at least a portion of the obstacle within the environment.
  • At least two of the base neural network, the intermediate neural network, and the obstacle detection neural network are jointly trained.
  • the obstacle detection neural network includes a first subnetwork configured to determine an initial 3D position of an obstacle for each candidate region of the subset based, at least in part, on the depth information and the at least a subset of the plurality of candidate regions.
  • the obstacle detection neural network includes a second subnetwork configured to generate one or more region features for each candidate region of the subset based, at least in part, on the at least a subset of the plurality of candidate regions and the feature map.
  • the obstacle detection neural network includes a third subnetwork configured to predict, for each candidate region of the subset, at least one of a type, pose, orientation, 3D position, or 3D size of an obstacle corresponding to the candidate region, based at least in part on the initial 3D position and the one or more region features.
  • the mobile platform includes at least one of an unmanned aerial vehicle (UAV) , a manned aircraft, an autonomous car, a self-balancing vehicle, a robot, a smart wearable device, a virtual reality (VR) head-mounted display, or an augmented reality (AR) head-mounted display.
  • the method further comprises generating a controlling command based, at least in part, on the predicted at least one status attribute, and controlling a mobility function of the mobile platform based, at least in part, on the controlling command.
  • the method further comprises causing navigation of the mobile platform based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of one or more obstacles.
  • a non-transitory computer-readable medium storing computer-executable instructions that, when executed, cause one or more processors associated with a mobile platform to perform actions comprising: obtaining sensor data that indicate at least a portion of an environment surrounding the mobile platform from one or more sensors carried by the mobile platform; determining, based at least partly on the sensor data, depth information, a feature map, and a plurality of candidate regions, wherein each candidate region indicates at least a portion of an obstacle within the environment; and feeding the depth information, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.
  • the one or more sensors include at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  • the depth information includes a point cloud.
  • the point cloud is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  • the depth information includes a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from at least one of a stereo camera or mono camera.
  • the depth information is generated by feeding the feature map to a depth estimating neural network separate from the obstacle detection neural network.
  • the sensor data includes image data and wherein the feature map is generated by at least feeding the image data to a base neural network separate from the obstacle detection neural network.
  • the sensor data includes a point cloud and wherein the feature map is generated based, at least in part, on projecting the point cloud onto a 2D grid defined in accordance with the image data.
  • the projecting is based, at least in part, on extrinsic and/or intrinsic calibration parameters with respect to at least one sensor that generated the image data.
  • the projected data includes at least one of a height, distance, or angle measurement for individual grid blocks of the 2D grid.
  • the feature map is generated by further feeding the projected data to the base neural network.
  • the feature map is smaller in size than the image data.
  • At least one of the base neural network or the obstacle detection neural network includes one or more convolutional and/or pooling layers.
  • the actions further comprise feeding the feature map to an intermediate neural network to generate the plurality of candidate regions defined in accordance with the image data.
  • each candidate region is a two-dimensional (2D) region that includes a respective target pixel of the image data.
  • the respective target pixel is associated with a probability of indicating at least a portion of the obstacle within the environment.
  • At least two of the base neural network, the intermediate neural network, and the obstacle detection neural network are jointly trained.
  • the obstacle detection neural network includes a first subnetwork configured to determine an initial 3D position of an obstacle for each candidate region of the subset based, at least in part, on the depth information and the at least a subset of the plurality of candidate regions.
  • the obstacle detection neural network includes a second subnetwork configured to generate one or more region features for each candidate region of the subset based, at least in part, on the at least a subset of the plurality of candidate regions and the feature map.
  • the obstacle detection neural network includes a third subnetwork configured to predict, for each candidate region of the subset, at least one of a type, pose, orientation, 3D position, or 3D size of an obstacle corresponding to the candidate region, based at least in part on the initial 3D position and the one or more region features.
  • the mobile platform includes at least one of an unmanned aerial vehicle (UAV) , a manned aircraft, an autonomous car, a self-balancing vehicle, or a robot.
  • the actions further comprise causing navigation of the mobile platform based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles within the environment.
  • a mobile platform including a programmed controller that at least partially controls one or more motions of the mobile platform, wherein the programmed controller includes one or more processors configured to: obtain sensor data that indicate at least a portion of an environment surrounding the mobile platform from one or more sensors; determine, based at least partly on the sensor data, depth information, a feature map, and a plurality of candidate regions, wherein each candidate region indicates at least a portion of an obstacle within the environment; and feed the depth information, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.
  • the one or more sensors include at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  • the depth information includes a point cloud.
  • the point cloud is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  • the depth information includes a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from at least one of a stereo camera or mono camera.
  • the depth information is generated by feeding the feature map to a depth estimating neural network separate from the obstacle detection neural network.
  • the sensor data includes image data and wherein the feature map is generated by at least feeding the image data to a base neural network separate from the obstacle detection neural network.
  • the sensor data includes a point cloud and wherein the feature map is generated based, at least in part, on projecting the point cloud onto a 2D grid defined in accordance with the image data.
  • the projecting is based, at least in part, on extrinsic and/or intrinsic calibration parameters with respect to at least one sensor that generated the image data.
  • the projected data includes at least one of a height, distance, or angle measurement for individual grid blocks of the 2D grid.
  • the feature map is generated by further feeding the projected data to the base neural network.
  • the feature map is smaller in size than the image data.
  • At least one of the base neural network or the obstacle detection neural network includes one or more convolutional and/or pooling layers.
  • the one or more processors are further configured to feed the feature map to an intermediate neural network to generate the plurality of candidate regions defined in accordance with the image data.
  • each candidate region is a two-dimensional (2D) region that includes a respective target pixel of the image data.
  • the respective target pixel is associated with a probability of indicating at least a portion of the obstacle within the environment.
  • At least two of the base neural network, the intermediate neural network, and the obstacle detection neural network are jointly trained.
  • the obstacle detection neural network includes a first subnetwork configured to determine an initial 3D position of an obstacle for each candidate region of the subset based, at least in part, on the depth information and the at least a subset of the plurality of candidate regions.
  • the obstacle detection neural network includes a second subnetwork configured to generate one or more region features for each candidate region of the subset based, at least in part, on the at least a subset of the plurality of candidate regions and the feature map.
  • the obstacle detection neural network includes a third subnetwork configured to predict, for each candidate region of the subset, at least one of a type, pose, orientation, 3D position, or 3D size of an obstacle corresponding to the candidate region, based at least in part on the initial 3D position and the one or more region features.
  • the mobile platform includes at least one of an unmanned aerial vehicle (UAV) , a manned aircraft, an autonomous car, a self-balancing vehicle, or a robot.
  • the one or more processors are further configured to cause navigation of the mobile platform based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles within the environment.
  • Fig. 1 is a schematic illustration of a representative system 100 having elements configured in accordance with some embodiments of the presently disclosed technology.
  • Fig. 2 is a flowchart illustrating a method that uses a hierarchy of artificial neural networks (ANNs) to detect obstacles for a mobile platform, in accordance with some embodiments of the presently disclosed technology.
  • Fig. 3 is a flowchart illustrating another method that uses a hierarchy of ANNs to detect obstacles for a mobile platform, in accordance with some embodiments of the presently disclosed technology.
  • Figs. 4A and 4B illustrate an example of 2D grid and a candidate region identified thereon, in accordance with some embodiments of the presently disclosed technology.
  • Fig. 5 is a flowchart illustrating an obstacle detection process using an obstacle detection network, in accordance with some embodiments of the presently disclosed technology.
  • Fig. 6 illustrates examples of mobile platforms configured in accordance with various embodiments of the presently disclosed technology.
  • Fig. 7 is a block diagram illustrating an example of the architecture for a computer system or other control device that can be utilized to implement various portions of the presently disclosed technology.
  • Fig. 8 is a flowchart illustrating a candidate region determination process using a candidate region network, in accordance with some embodiments of the presently disclosed technology.
  • Fig. 9 illustrates an example process for detecting obstacles using one or more sensors carried by a mobile platform, in accordance with some embodiments of the presently disclosed technology.
  • Fig. 10 illustrates an example process for generating feature maps, in accordance with some embodiments, of the presently disclosed technology.
  • Figs. 11A and 11B illustrate an example of cascaded convolution and pooling layers used in a base neural network as well as example data involved therewith, in accordance with some embodiments of the presently disclosed technology.
  • Fig. 12 illustrates an example implementation for generating feature map (s) , in accordance with some embodiments of the presently disclosed technology.
  • Obstacle detection is an important aspect of automated or unmanned navigation technologies.
  • Image data and/or point cloud data collected by sensors (e.g., cameras or LiDAR sensors) carried by a mobile platform (e.g., an unmanned car, boat, or aircraft) can be analyzed to detect obstacles in the surrounding environment.
  • An obstacle's 2D position (e.g., on an image) , orientation, pose, 3D position and size, and/or other attributes can be useful in various higher level navigation applications.
  • the precision and efficiency of obstacle detection and 3D positioning can determine the safety and reliability of corresponding navigation systems.
  • 3D information that is reconstructed from stereo camera data is less precise than 3D point clouds produced by LiDAR or other emission/detection sensors. Therefore, obstacle detection methods using point clouds produced by LiDAR may not be applicable to stereo camera data.
  • image-based obstacle detection methods typically only output an obstacle's 2D position, and are not capable of determining the obstacle's precise 3D position in the physical world.
  • Representative embodiments of the presently disclosed technology use a hierarchy of artificial neural networks (ANNs) in a region-based method to detect obstacles and determine their status attribute (s) (e.g., positioning information) based on various data collected from sensors.
  • ANNs are computing systems that "learn” (i.e., progressively improve performance on) tasks by considering examples, generally without task-specific programming. For example, in image recognition, an ANN may learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat” or "no cat” and using the results to identify cats in other images.
  • An ANN is typically based on a collection of connected units or nodes called artificial neurons. Each connection between artificial neurons can transmit a signal from one artificial neuron to another. The artificial neuron that receives the signal can process it and then signal artificial neurons connected to it. Typically, in ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is calculated by a non-linear function of the sum of its inputs. Artificial neurons and connections typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that only if the aggregate signal crosses that threshold is the signal sent. Typically, artificial neurons are organized in layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first (input) layer to the last (output) layer, possibly after traversing the layers multiple times.
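  • As a concrete illustration of the neuron model described above, the following minimal Python sketch computes one artificial neuron's output as a non-linear function of the weighted sum of its inputs; the sigmoid activation, the firing threshold, and the example weights are illustrative assumptions rather than details from this disclosure.

```python
import numpy as np

def artificial_neuron(inputs, weights, bias, threshold=0.5):
    """One artificial neuron: a non-linear function of the weighted sum of inputs.

    The sigmoid activation and the firing threshold are illustrative choices;
    the text above only requires a non-linear function of the summed, weighted
    input signals, with an optional threshold.
    """
    z = np.dot(weights, inputs) + bias                      # aggregate (weighted) signal
    activation = 1.0 / (1.0 + np.exp(-z))                   # non-linear transformation (sigmoid)
    return activation if activation >= threshold else 0.0   # signal is sent only above threshold

# Example: three input signals feeding one neuron with assumed weights.
x = np.array([0.2, 0.7, 0.1])
w = np.array([0.4, 0.9, -0.3])
print(artificial_neuron(x, w, bias=0.05))
```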
  • one or more ANNs used by the presently disclosed technology include convolutional neural network (s) (CNN, or ConvNet) .
  • CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.
  • CNNs are also known as shift invariant or space invariant artificial neural networks (SIANN) , based on their shared-weights architecture and translation invariance characteristics.
  • CNNs were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.
  • a base neural network can receive 3D point cloud data, stereo camera image data, and/or mono image data and generate intermediate features (e.g., a feature map) for feeding into one or more other neural networks.
  • a candidate region neural network can receive at least the intermediate features and determine 2D candidate regions that indicate at least a portion of an obstacle within the environment.
  • an obstacle detection neural network can receive environment depth information (e.g., 3D point cloud or depth map data) , the intermediate features, and the candidate regions and predict various attributes of detected obstacles.
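  • The data flow among the three networks described above can be sketched as follows; the three network objects are hypothetical stand-ins for the trained models, and only the wiring mirrors the description.

```python
# A minimal sketch of the data flow among the three networks; base_net,
# candidate_net, and detection_net are hypothetical callables, not the
# actual trained models of this disclosure.

def detect_obstacles(image, point_cloud, base_net, candidate_net, detection_net):
    # 1. Base neural network: image (and point-cloud-derived preliminary
    #    features) -> intermediate feature map.
    feature_map = base_net(image, point_cloud)

    # 2. Candidate region neural network: feature map -> 2D candidate regions,
    #    each indicating at least a portion of an obstacle.
    candidate_regions = candidate_net(feature_map)

    # 3. Obstacle detection neural network: depth information + feature map +
    #    candidate regions -> status attributes (type, pose, orientation,
    #    3D position, 3D size, ...).
    depth_information = point_cloud  # or a depth map estimated from the images
    return detection_net(depth_information, feature_map, candidate_regions)
```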
  • FIGS. 1-9 are provided to illustrate representative embodiments of the presently disclosed technology. Unless provided for otherwise, the drawings are not intended to limit the scope of the claims in the present application.
  • the programmable computer or controller may or may not reside on a corresponding mobile platform.
  • the programmable computer or controller can be an onboard computer of the mobile platform, or a separate but dedicated computer associated with the mobile platform, or part of a network or cloud based computing service.
  • the technology can be practiced on computer or controller systems other than those shown and described below.
  • the technology can be embodied in a special-purpose computer or data processor that is specifically programmed, configured or constructed to perform one or more of the computer-executable instructions described below.
  • the terms "computer” and “controller” as generally used herein refer to any data processor and can include Internet appliances and handheld devices (including palm-top computers, wearable computers, cellular or mobile phones, multi-processor systems, processor-based or programmable consumer electronics, network computers, mini computers and the like) . Information handled by these computers and controllers can be presented on any suitable display medium, including an LCD (liquid crystal display) . Instructions for performing computer- or controller-executable tasks can be stored in or on any suitable computer-readable medium, including hardware, firmware or a combination of hardware and firmware. Instructions can be contained in any suitable memory device, including, for example, a flash drive, USB (universal serial bus) device, and/or other suitable medium. In particular embodiments, the instructions are accordingly non-transitory.
  • Fig. 1 is a schematic illustration of a representative system 100 having elements configured in accordance with some embodiments of the presently disclosed technology.
  • the system 100 includes a mobile platform 110 (e.g., an autonomous vehicle) and a control system 120.
  • the mobile platform 110 can be any suitable type of movable object that can be used in various embodiments, such as an unmanned aerial vehicle, a manned aircraft, an autonomous vehicle, a self-balancing vehicle, or a robot.
  • the mobile platform 110 can include a main body 112 that can carry a payload 114.
  • the payload includes one or more sensors, such as an imaging device or an optoelectronic scanning device.
  • the payload 114 can include a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, a mono camera, a video camera, and/or a still camera.
  • the camera can be sensitive to wavelengths in any of a variety of suitable bands, including visual, ultraviolet, infrared and/or other bands.
  • the payload 114 can also include other types of sensors and/or other types of cargo (e.g., packages or other deliverables) .
  • the payload 114 is supported relative to the main body 112 with a carrying mechanism 116 (e.g., a gimbal, rack, or bar) .
  • the carrying mechanism 116 can allow the payload 114 to be independently positioned relative to the main body 112.
  • the mobile platform 110 can be configured to receive control commands from the control system 120 and/or transmit data to the control system 120.
  • the control system 120 includes some components carried on the mobile platform 110 and/or some components positioned off the mobile platform 110.
  • the control system 120 can include a first controller 122 carried by the mobile platform 110 and/or a second controller 124 (e.g., a human-operated, remote controller) positioned remote from the mobile platform 110 and connected via a communication link 128 (e.g., a wireless link such as a radio frequency (RF) based link) .
  • the first controller 122 can include a computer-readable medium 126 that executes instructions directing the actions of the mobile platform 110, including, but not limited to, operation of various components of the mobile platform including the payload 114 (e.g., a camera) .
  • the second controller 124 can include one or more input/output devices, e.g., a display and control buttons.
  • the operator at least partly manipulates the second controller 124 to control the mobile platform 110 remotely, and receives feedback from the mobile platform 110 via the display and/or other interfaces on the second controller 124.
  • the mobile platform 110 operates autonomously, in which case the second controller 124 can be eliminated, or can be used solely for operator override functions.
  • In order to provide for safe and efficient operation, it may be beneficial for autonomous vehicles, UAVs, and other types of unmanned vehicles to be able to autonomously or semi-autonomously detect obstacles and/or to engage in evasive maneuvers to avoid obstacles. Additionally, sensing environmental objects can be useful for mobile platform functions such as navigation, target tracking, and mapping, particularly when the mobile platform is operating in a semi-autonomous or fully autonomous manner.
  • the mobile platforms described herein can include one or more sensors (e.g., separate and independent from payload-type sensors) configured to detect objects in the environment surrounding the mobile platform.
  • the mobile platform includes one or more sensors (e.g., distance measurement device 140 of Fig. 1) configured to measure the distance between an object and the mobile platform.
  • the distance measurement device can be carried by the mobile platform in various ways, such as above, underneath, on the side (s) of, or within the main body of the mobile platform.
  • the distance measurement device can be coupled to the mobile platform via a gimbal or other carrying mechanism that permits the device to be translated and/or rotated relative to the mobile platform.
  • the distance measurement device is an optical distance measurement device that uses light to measure distance to an object.
  • the optical distance measurement device can be a LiDAR system or a laser rangefinder.
  • the distance measurement device is a camera that can capture image data, from which depth information can be determined. The camera can be a stereo camera or a mono camera.
  • Fig. 9 illustrates an example process 900 for detecting obstacles using one or more sensors carried by a mobile platform, in accordance with some embodiments of the presently disclosed technology.
  • the process 900 includes obtaining sensor data (e.g., point cloud (s) , depth map, image (s) , or the like) that indicate at least a portion of an environment surrounding the mobile platform from the one or more sensors (e.g., LiDAR, radar, Time-of-Flight (ToF) camera, stereo camera, mono camera, or the like) .
  • the process 900 includes determining, based at least partly on the sensor data, depth information (e.g., depth map (s) , point cloud (s) , or the like) , feature map (s) (e.g., 2D grid based features) , and a plurality of candidate regions (e.g., regions defined on a 2D grid such as an image) .
  • each candidate region indicates at least a portion of an obstacle within the environment.
  • Fig. 12 illustrates an example implementation for generating feature map (s) , in accordance with some embodiments of the presently disclosed technology.
  • an image and preliminary features derived from a point cloud can be fed into a base neural network.
  • the image and the point cloud can be obtained at block 910 of process 900.
  • a pre-processing module (e.g., part of a controller associated with the mobile platform) can project 3D points in the point cloud onto a 2D grid (e.g., a 2D plane) defined in accordance with the image, thereby generating a set of 2D grid-based preliminary features including, for example, height, angle, and/or distance values.
  • the base neural network can include multiple layers of transformation, which can transform the image and the preliminary features into one or more feature maps for further processing.
  • the base neural network can include multiple layers of transformation, which can transform an image into one or more feature maps for further processing.
  • the process 900 includes feeding the depth information, the feature map (s) , and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.
  • the status attribute includes at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles within the environment.
  • Fig. 2 is a flowchart illustrating a method 200 that uses a hierarchy of ANNs to detect obstacles for a mobile platform, in accordance with some embodiments of the presently disclosed technology.
  • the method 200 can be implemented by a controller (e.g., an onboard computer of a mobile platform, an associated computing device, and/or an associated computing service) .
  • the controller can obtain point cloud data 202 (or another form of depth information) and image data 204 using one or more sensors carried by the mobile platform.
  • a stereo camera, a LiDAR, a radar, a Time-of-Flight (ToF) camera, a mono camera, or other sensor can provide data for obtaining depth information (e.g., measurements of distance between different portions of a scene and the sensor) for an environment that surrounds or is otherwise adjacent to, but does not necessarily abut, a mobile platform.
  • Point cloud data 202 can be obtained directly from depth detection sensor (s) (e.g., a LiDAR, a radar, or a Time-of-Flight (ToF) camera) or obtained indirectly from reconstruction using a stereo camera or a mono camera.
  • the point cloud data (or another form of depth information) is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, a mono camera, or the like.
  • the depth information can include a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from the image data 204.
  • image data 204 can be provided by either stereo or mono cameras.
  • the controller obtains a series of point clouds and images that are temporally consecutive (e.g., frames of point clouds and images) .
  • a point cloud and an image that correspond to a same timepoint are used in the method 200.
  • the controller feeds the point cloud data 202 into a pre-processing module 210 (e.g., a neural network of one or more layers) , which outputs 2D grid based preliminary features 212.
  • the pre-processing module 210 can be implemented to project the point cloud data 202 onto a 2D grid defined in accordance with the image data to obtain the projected data, where the 2D grid has the same size as the obtained image data 204.
  • Fig. 4A illustrates an example of such a 2D grid. Projecting the point cloud data 202 can be performed based on extrinsic and/or intrinsic calibration parameters associated with the camera (s) that generated the image data 204.
  • the pre-processing module 210 can project scanning points of the point cloud onto 720x1280 individual grid blocks 402 that correspond to the pixels of the image data 204.
  • each pixel can correspond to a grid block 402 of the 2D grid.
  • Features such as height (e.g., z-coordinate of a 3D coordinate) , depth (e.g., distance to the mobile platform) , angle (e.g., a normal vector) or the like can be calculated for individual grid blocks based on projecting the point cloud data 202 onto a 2D grid defined in accordance with the image data.
  • the 2D grid based preliminary features 212 can have 720x1280 grid blocks, and each grid block can include one or more features (e.g., a height, distance, or angle measurement) derived from point cloud data.
  • f_x, f_y are focal lengths and c_x, c_y are optical center coordinates (e.g., f_x, f_y and c_x, c_y can be obtained from the intrinsic calibration parameters associated with the camera (s) )
  • (u, v) is the pixel coordinate after the point is projected. Accordingly, a correspondence or mapping between a three-dimensional point (x, y, z) and a pixel coordinate (u, v) is established.
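  • Stated explicitly, the mapping described above is consistent with the standard pinhole camera model; since the original equations are not reproduced in this text, the following is a reconstruction from the definitions of f_x, f_y, c_x, c_y, (x, y, z), and (u, v) above, assuming the point lies in front of the camera (z > 0):

```latex
u = f_x \frac{x}{z} + c_x, \qquad v = f_y \frac{y}{z} + c_y
```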
  • for each projected point, the controller can perform feature encoding based on its corresponding 3D coordinate (x, y, z) in order to generate the preliminary features 212.
  • the controller can encode the point cloud into a set of three-channel preliminary features.
  • the three feature channels can represent distance, height, and angle, respectively.
  • the feature encoding can be based on the 3D coordinate (x, y, z) associated with each projected pixel.
  • the projection of the point cloud data 202 can generate a set of 2D grid-based preliminary features each including respective distance, height, and angle values.
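  • A minimal NumPy sketch of the projection and three-channel encoding described above is shown below. The exact channel definitions are not reproduced in this text, so the Euclidean range, the negated camera-frame y coordinate, and the elevation angle used here are assumptions for illustration only.

```python
import numpy as np

def encode_point_cloud(points, fx, fy, cx, cy, height=720, width=1280):
    """Project 3D points (N, 3) onto a 2D grid and encode three channels per block.

    The channel definitions below (Euclidean range, height as -y assuming a
    camera frame with y pointing down, and elevation angle) are illustrative
    assumptions; the description only states that distance, height, and angle
    values are encoded for individual grid blocks.
    """
    features = np.zeros((height, width, 3), dtype=np.float32)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    valid = z > 0                              # keep points in front of the camera
    x, y, z = x[valid], y[valid], z[valid]

    u = np.round(fx * x / z + cx).astype(int)  # pinhole projection (column index)
    v = np.round(fy * y / z + cy).astype(int)  # pinhole projection (row index)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, x, y, z = u[inside], v[inside], x[inside], y[inside], z[inside]

    distance = np.sqrt(x ** 2 + y ** 2 + z ** 2)      # channel 0: range to the sensor
    height_value = -y                                 # channel 1: height above the sensor
    angle = np.arctan2(height_value, np.hypot(x, z))  # channel 2: elevation angle

    features[v, u, 0] = distance
    features[v, u, 1] = height_value
    features[v, u, 2] = angle
    return features

# Example: encode a random point cloud against hypothetical intrinsics.
cloud = np.random.rand(1000, 3) * np.array([10.0, 2.0, 30.0])
print(encode_point_cloud(cloud, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0).shape)  # (720, 1280, 3)
```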
  • the controller can feed the preliminary features 212 (e.g., the projected data) and the image data 204 into a base neural network 220 (e.g., including one or more CNNs) .
  • the base neural network 220 can include one or more layers of convolution operations and/or pooling (e.g., down sampling) operations.
  • the base neural network 220 can output feature map 222 based on the preliminary features 212 and image data 204.
  • the feature map 222 can take the form of a 2D grid-based feature map that is smaller in size than the preliminary features 212.
  • the base neural network can include multiple modules (e.g., each including one or more convolution and/or pooling layers) that are cascaded in turn, each of which performs nonlinear feature transformation (e.g., convolution) and/or pooling (e.g., down sampling) on the input features (preliminary features 212 and the image data 204) .
  • the base neural network can output a feature map 222 of a smaller size than the preliminary features 212.
  • Fig. 10 illustrates an example process for generating feature maps.
  • a CNN can include two main types of network layers, namely the convolution layer and the pooling layer.
  • the convolution layer can be utilized to extract various features from the input to the convolution layer (such as an image) .
  • the pooling layer can be utilized to compress the features that are input to the pooling layer, thereby reducing the number of training parameters for the neural network and easing the degree of model over-fitting.
  • In one example, the input preliminary features are 32*32 in size. A convolution layer can transform the preliminary features into a first set of 6 feature maps, each 28*28 in size; a pooling layer can then generate a second set of 6 feature maps, each 14*14 in size.
  • Fig. 11A illustrates an example of cascaded convolution and pooling layers used in the base neural network, in accordance with some embodiments of the presently disclosed technology.
  • the base neural network has 3 convolution layers (e.g., operations leading to C1, C3, and C5, respectively) and 3 pooling layers (e.g., operations leading to S2, S4, and S6, respectively) that are serially connected with one another in a cascaded manner.
  • an input (64*64 size) is transformed into a first set (C1) of 6 feature maps (60*60 size) , and in turn into a second set (S2) of 6 feature maps (30*30 size) , a third set (C3) of 16 feature maps (26*26 size) , a fourth set (S4) of 16 feature maps (13*13 size) , a fifth set (C5) of feature maps (10*10 size) , and a sixth set (S6) of feature maps (5*5 size) as output.
  • the sixth set (S6) of feature maps can be further transformed by fully connected layer (s) and/or Gaussian layer (s) to, for example, a vector as output from the base neural network.
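  • The cascade of Fig. 11A can be reproduced with the following PyTorch sketch. The kernel sizes are inferred from the stated feature-map sizes (64→60 and 30→26 imply 5*5 kernels, 13→10 implies a 4*4 kernel, and each pooling step halves the resolution); the ReLU activations, the single input channel, the C5 channel count, and the output vector length are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class CascadedBaseNet(nn.Module):
    """Sketch of the cascaded convolution/pooling structure of Fig. 11A.

    Kernel sizes follow from the stated sizes (64->60 and 30->26 imply 5x5
    kernels, 13->10 implies a 4x4 kernel; each pooling step halves the
    resolution). The ReLU activations, the single input channel, the 32
    channels of C5, and the 128-dimensional output vector are assumptions.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 6 maps, 60*60
            nn.ReLU(),
            nn.MaxPool2d(2),                   # S2: 6 maps, 30*30
            nn.Conv2d(6, 16, kernel_size=5),   # C3: 16 maps, 26*26
            nn.ReLU(),
            nn.MaxPool2d(2),                   # S4: 16 maps, 13*13
            nn.Conv2d(16, 32, kernel_size=4),  # C5: 32 maps, 10*10
            nn.ReLU(),
            nn.MaxPool2d(2),                   # S6: 32 maps, 5*5
        )
        self.fc = nn.Linear(32 * 5 * 5, 128)   # "full connection" layer -> output vector

    def forward(self, x):
        return self.fc(torch.flatten(self.features(x), start_dim=1))

# Shape check: a 1x1x64x64 input yields a 1x128 vector.
print(CascadedBaseNet()(torch.zeros(1, 1, 64, 64)).shape)
```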
  • Fig. 11B illustrates an example of the input and feature map sets involved with the cascaded structure of Fig. 11A.
  • the feature map 222 can be a common input to multiple neural networks or applicable components to achieve obstacle detection and/or 3D positioning, in accordance with various embodiments of the presently disclosed technology.
  • the controller can feed the feature map 222 into a candidate region neural network 230 (e.g., including one or more CNNs) to generate the plurality of candidate regions defined in accordance with the image data 204.
  • the candidate region neural network 230 can include one or more layers of convolution and/or pooling operations.
  • Fig. 8 is a flowchart illustrating a candidate region determination process 800 using a candidate region neural network 830 (e.g., corresponding to the candidate region neural network 230 of Fig. 2) , in accordance with some embodiments of the presently disclosed technology.
  • the candidate region neural network 830 can include one or more modules 840-870 for feature transformation, likelihood estimation, 2D grid regression, and/or redundancy filtering.
  • the feature transformation module 840 receives the feature map 822 as input, which is further transformed to feed to (a) the likelihood estimation module 850 that can predict the probability that each pixel in the image data 204 (or each grid block in a corresponding 2D grid) “belongs” to an obstacle and (b) the 2D grid regression module 860 that can determine a corresponding 2D region representing the obstacle to which the pixel (or grid block) “belongs. ”
  • Outputs from the likelihood estimation module 850 and the 2D grid regression module 860 are fed to the redundancy filtering module 870 that can filter out redundant 2D regions according to the predicted probabilities and/or overlaps of 2D regions.
  • the redundancy filtering module 870 can then output a smaller number of candidate regions to be included in the candidate region data 832 (e.g., candidate region data 232) .
  • the candidate region neural network 230 can include one or more CNNs.
  • the candidate region neural network 230 can output candidate region data 232 that include or indicate candidate region (s) .
  • Each candidate region can indicate at least a portion of an obstacle within the environment (e.g., portions of an image that show at least some part of the obstacle) .
  • Fig. 4B illustrates an example candidate region 410 identified on a 2D grid.
  • the 2D grid can be defined (e.g., smaller in size) in accordance with the image data 204; in some embodiments, the 2D grid for identifying the candidate regions can be the image data 204 itself.
  • a candidate region can be a 2D region including a group of connected or disconnected grid blocks including a base block 412.
  • Each grid block can correspond to an individual pixel or a block (e.g., 3x3 in size) of pixels in the image data 204.
  • the candidate region neural network 230 can output (1) a likelihood (e.g., estimated probability) that the base block 412 indicates some part of an obstacle and (2) a candidate region 410 that includes the base block 412 and potentially corresponds to at least a portion of the obstacle.
  • various criteria can be applied to the candidate regions and/or their associated likelihoods to select a subset of the data for output by filtering out redundant candidates. For example, a candidate region that exceeds a threshold level of overlap with one or more other candidate regions can be excluded from the output. As another example, candidate regions can be filtered out if the likelihoods of their respective base blocks belonging to an obstacle fall below a threshold value.
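  • One way to implement the filtering described above is standard non-maximum suppression combined with a probability threshold, as in the NumPy sketch below; the specific threshold values and the [x1, y1, x2, y2] box parameterization are illustrative assumptions.

```python
import numpy as np

def filter_candidate_regions(boxes, probs, iou_thresh=0.5, prob_thresh=0.1):
    """Keep high-probability candidate regions and suppress heavily overlapping ones.

    boxes: (N, 4) array of [x1, y1, x2, y2]; probs: (N,) obstacle probabilities.
    The threshold values and box format are illustrative; the description only
    calls for filtering by overlap and by the base block's obstacle probability.
    """
    keep = probs >= prob_thresh                    # drop low-probability candidates
    boxes, probs = boxes[keep], probs[keep]

    order = np.argsort(-probs)                     # most confident region first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(i)
        # Intersection-over-union between the kept region and the remaining ones.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]       # suppress redundant regions
    return boxes[kept], probs[kept]

# Example: the second region overlaps the first too much and is filtered out.
b = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
p = np.array([0.9, 0.8, 0.7])
print(filter_candidate_regions(b, p)[0])
```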
  • the presently disclosed technology can include (1) a data acquisition and pre-processing aspect and (2) a feature map and candidate region determination aspect.
  • the data acquisition and pre-processing aspect can include: obtaining a stereo color image (e.g., with a resolution of 720*1280) acquired by a stereo camera carried by the mobile platform in motion, generating a 3D point cloud via 3D reconstruction based on the stereo image, and determining input features based on the 3D point cloud and the image.
  • preliminary features can be obtained from the 3D point cloud.
  • the point cloud is projected onto a 2D grid (e.g., a plane) of the image using the stereo camera's calibration parameters.
  • the projection can result in the preliminary features (e.g., the projected data) of the same size (e.g., 720*1280) as the image.
  • For individual grid blocks, various features (e.g., a height, distance, or angle measurement) can be calculated; with 3 features (e.g., height, distance, and angle) per grid block, the preliminary features have a dimension of 720*1280*3.
  • (a) the preliminary features (e.g., having a dimension of 720*1280*3) and (b) the left-eye (or right-eye) image of the stereo image can be spliced or otherwise combined.
  • the left-eye image is an RGB image that also has a dimension of 720*1280*3.
  • the combination of (a) and (b) therefore generates input features having a dimension of 720*1280*6, which can serve as input for determining feature map and candidate regions.
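  • For concreteness, the channel-wise combination of (a) and (b) can be sketched as follows (the array names are hypothetical):

```python
import numpy as np

# Hypothetical arrays standing in for (a) the point-cloud-derived preliminary
# features and (b) the left-eye RGB image, both 720*1280*3.
preliminary_features = np.zeros((720, 1280, 3), dtype=np.float32)  # (a)
left_eye_image = np.zeros((720, 1280, 3), dtype=np.float32)        # (b)

# Channel-wise splice -> 720*1280*6 input features for the base neural network.
input_features = np.concatenate([preliminary_features, left_eye_image], axis=-1)
print(input_features.shape)  # (720, 1280, 6)
```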
  • the feature map and candidate region determination aspect can include the use of the base neural network and the candidate region neural network.
  • the neural networks can include layers of convolution and pooling operations based on practical needs and/or computational efficiency.
  • the base neural network receives the input features (e.g., having a dimension of 720*1280*6) as an input.
  • the base neural network can include 4 modules that are cascaded in turn, each of which performs nonlinear feature transformation and 2x down-sampling. Accordingly, after 4 rounds of down-sampling, the base neural network can output a feature map with a resolution of 45*80.
  • the candidate region neural network receives the feature map as an input and predicts, for each pixel in the left-eye image, the probability that the pixel “belongs” to an obstacle as well as a corresponding 2D region representing or indicating the obstacle. Based on the predicted probabilities, the candidate region neural network filters out redundant 2D regions and outputs the remaining 2D regions as candidate regions that represent or indicate obstacles.
  • the number of candidate regions can be hundreds, such as 500, 400, 300, or less.
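  • A candidate region network operating on the 45*80 feature map could be organized as two convolutional branches, one for the per-location obstacle probability and one for 2D region regression, as in the sketch below; the channel counts, the box parameterization, and the mapping back to left-eye-image pixels are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CandidateRegionHead(nn.Module):
    """Sketch of a candidate region network head over a 45x80 feature map.

    One branch estimates, per grid location, the probability of belonging to
    an obstacle; the other regresses a 2D region. The 256-channel input, the
    [x1, y1, x2, y2] box parameterization, and the layer sizes are assumptions.
    """
    def __init__(self, in_channels=256):
        super().__init__()
        self.shared = nn.Sequential(nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU())
        self.objectness = nn.Conv2d(256, 1, 1)   # likelihood estimation branch
        self.region = nn.Conv2d(256, 4, 1)       # 2D grid regression branch

    def forward(self, feature_map):
        x = self.shared(feature_map)
        prob = torch.sigmoid(self.objectness(x))  # (N, 1, 45, 80) obstacle probabilities
        boxes = self.region(x)                    # (N, 4, 45, 80) region parameters
        return prob, boxes

probs, boxes = CandidateRegionHead()(torch.zeros(1, 256, 45, 80))
print(probs.shape, boxes.shape)
```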
  • the controller can feed the point cloud data 202, feature map 222, and candidate region data 232 to an obstacle detection neural network 240 (e.g., including one or more CNNs) .
  • the obstacle detection neural network 240 can output one or more status attributes of detected obstacles 242, including predictions of type, pose, orientation, 3D position, 3D size, and/or other attributes of the one or more detected obstacles within the environment.
  • the controller can output commands or instructions based on the status attribute (s) of detected obstacles 242 to control at least certain motion (e.g., acceleration, deceleration, turning, etc. ) of the mobile platform to avoid contact with the detected obstacles.
  • Fig. 3 is a flowchart illustrating a method 300 that uses a hierarchy of ANNs to detect obstacles for a mobile platform, in accordance with some embodiments of the presently disclosed technology.
  • the method 300 can be implemented by a controller (e.g., an onboard computer of a mobile platform, an associated computing device, and/or an associated computing service) .
  • the method 300 can use various combinations of data obtained by stereo or mono camera (s) at different layers or stages of the hierarchy of ANNs to achieve obstacle detection and 3D positioning.
  • the controller can obtain image data 302 using camera (s) or other visual sensor (s) carried by the mobile platform.
  • image data generated by camera (s) can provide a basis for obtaining depth information (e.g., measurements of distance between different portions of a scene and the sensor) for an environment that surrounds or is otherwise adjacent to, but does not necessarily abut, a mobile platform.
  • image data 302 can be provided by stereo and/or mono camera (s) .
  • the controller obtains stereo images and/or a mono image that correspond to a particular timepoint.
  • the controller obtains a series of images that are temporally consecutive (e.g., frames of images) .
  • the controller can feed the image data 302 into a base neural network 320 (e.g., including one or more CNNs) .
  • the base neural network 320 can be structurally equivalent, similar, or dissimilar to the base neural network 220 used in the method 200 described above with reference to Fig. 2.
  • the base neural network 320 can include one or more layers of convolution and/or pooling operations.
  • the base neural network 320 can output a feature map 322 based on the image data 302.
  • the feature map 322 can take the form of a 2D grid-based feature map smaller in size than individual image (s) included in the image data 302.
  • the controller can feed the feature map 322 into a depth estimating neural network 310 (e.g., including one or more CNNs) , which outputs depth information 312 (e.g., a depth map) .
  • the depth estimating neural network 310 can be implemented to estimate depth information that corresponds to different locations defined within the image data 302. For example, for a target image included in the image data 302, the depth estimating neural network 310 can analyze multiple frames of images before and/or after the target image and output depth information 312 that includes an estimated depth value (e.g., distance from the mobile platform) for each pixel of the target image.
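  • The depth estimating neural network is characterized only functionally above (feature map in, per-pixel depth out), so the following sketch is one plausible arrangement; the layer configuration and the bilinear upsampling are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthHead(nn.Module):
    """Sketch of a depth estimating network operating on the shared feature map.

    The two convolutions plus bilinear upsampling back to the image resolution
    are assumptions; the description only requires that the network take the
    feature map (for one or more frames) and output a depth value per pixel.
    """
    def __init__(self, in_channels=256, image_size=(720, 1280)):
        super().__init__()
        self.image_size = image_size
        self.refine = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 1, 3, padding=1),
        )

    def forward(self, feature_map):
        depth = self.refine(feature_map)          # coarse per-cell depth estimate
        depth = F.interpolate(                    # upsample back to per-pixel depth
            depth, size=self.image_size, mode="bilinear", align_corners=False)
        return torch.relu(depth)                  # depth values are non-negative

print(DepthHead()(torch.zeros(1, 256, 45, 80)).shape)  # torch.Size([1, 1, 720, 1280])
```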
  • the controller can feed the feature map 322 into an intermediate neural network such as a candidate region neural network 330 (e.g., candidate region neural network 830 that can include one or more CNNs) .
  • the candidate region neural network 330 can be structurally equivalent, similar, or dissimilar to the candidate region neural network 230 used in the method 200 as described above with reference to Fig. 2.
  • the candidate region neural network 330 can include one or more layers of convolution and/or pooling operations. For example, individual neurons can apply a respective convolution operation to their inputs, and the outputs of a cluster of neurons at one layer can be combined into a single neuron in the next layer.
  • the candidate region neural network 330 can output candidate region data 332 that include or indicate candidate region (s) . Each candidate region can indicate at least a portion of an obstacle within the environment.
  • a candidate region can be a group of connected or disconnected grid blocks including a base block 412.
  • Each grid block can correspond to an individual pixel or a block (e.g., 2x4 in size) of pixels in the obtained image.
  • the candidate region neural network 330 can output (1) a likelihood (e.g., an estimated probability) that the base block 412 indicates at least some part of an obstacle and (2) a corresponding candidate region 410 that includes the base block 412 and potentially indicates at least a portion of the obstacle.
  • various criteria can be applied to the candidate regions, their associated base blocks and/or likelihoods to select a subset of the data for output.
  • the obtained image has a resolution of 100*50 (i.e., the image has 5000 pixels) and the image includes 2D representations of one or more obstacles including obstacle A.
  • the feature transformation module 840 receives the feature map 822 corresponding to the obtained image as input and transforms it to feed into (a) the likelihood estimation module 850 that can predict the probability that each pixel in the image “belongs” to an obstacle.
  • the likelihood estimation module 850 predicts that 100 pixels in the image “belong” to obstacle A with respective probabilities.
  • the feature transformation module 840 also feeds the transformed feature map to the 2D grid regression module 860 that can determine a corresponding 2D region representing the obstacle to which the pixel (or grid block) “belongs. ”
  • the 2D grid regression module 860 can determine 100 corresponding 2D regions (e.g., 2D frames) that represent or indicate the obstacle A.
  • a non-maximum suppression method (or other suitable filtering methods) can be used to remove a subset of regions (e.g., those that overlap with one another beyond a threshold degree) from the 100 2D regions.
  • the remaining 2D regions can be retained as output candidate regions for obstacle A.
  • the controller can feed the depth information 312, feature map 322, and candidate region data 332 into an obstacle detection neural network 340 (e.g., including one or more CNNs) .
  • the obstacle detection neural network 340 can be structurally equivalent, similar, or dissimilar to the obstacle detection neural network 240 used in the method 200 as described above with reference to Fig. 2.
  • the obstacle detection neural network 340 can output one or more status attributes of detected obstacles 342 including predictions of type, pose, orientation, 3D position, 3D size, and/or other attributes of the detected obstacles.
  • the controller can output commands or instructions based on the status attribute (s) of detected obstacles 342 to control at least certain motion (e.g., acceleration, deceleration, turning, etc. ) of the mobile platform to avoid contact with the detected obstacles.
  • Fig. 5 is a flowchart illustrating an obstacle detection process 500 using an obstacle detection neural network 540 (e.g., corresponding to the obstacle detection neural network 240 used in the method 200 as described above with reference to Fig. 2 or the obstacle detection neural network 340 used in the method 300 as described above with reference to Fig. 3) , in accordance with some embodiments of the presently disclosed technology.
  • the obstacle detection neural network 540 can include an initial position sub-network 510 (e.g., a first subnetwork including one or more ANNs) , a region feature sub-network 520 (e.g., a second subnetwork including one or more ANNs) , and a 3D prediction sub-network 530 (e.g., including one or more ANNs) .
  • the initial position sub-network 510 can receive depth information 502 (e.g., the point cloud data 202 as in method 200 or the depth information 312 as in method 300) and candidate region data 504 (e.g., the candidate region data 232 as in method 200 or the candidate region data 332 outputted from the candidate region neural network 330 as in method 300) as input. If the depth information 502 is not in a form of a point cloud, embodiments of the presently disclosed technology include converting the depth information 502 into point cloud data, for example, based on the extrinsic and/or intrinsic calibration parameters of an associated camera that generated image data 204 or 302.
  • the initial position sub-network 510 can (a) identify a 3D region (e.g., a subset of a point cloud) that corresponds to the candidate region using the depth information 312, and (b) compute and output an initial 3D position for a potential obstacle that includes the identified 3D region based on various statistics (e.g., a mean or median value of 3D coordinates of corresponding scanning points) that characterize the 3D region.
  • the region feature sub-network 520 can receive the candidate region data 504 (e.g., the candidate region data 232 as in method 200 or the candidate region data 332 as in method 300) and feature map 506 (e.g., the feature map 222 as in method 200 or the feature map 322 as in method 300) as input, perform one or more layers of linear and/or non-linear feature transformation, and output region features for individual candidate regions included in the candidate region data 504.
  • the size of region features for each candidate region can be determined based on practical needs and/or computational resource constraints. As an example, the candidate regions can be normalized to a fixed size; a fixed-length feature vector is then obtained based on one or more pooling operations, before multi-layer feature transformation is performed to obtain region features for each candidate region.
  • each candidate region can correspond to a respective 2D region in an original image, which serves as the basis for generating the feature map 506.
  • the controller can identify a respective reduced 2D region on the feature map that corresponds to each candidate region.
  • Various operations can be performed on the feature map within the reduced 2D regions to generate region features.
  • the operations can include pooling and/or feature transformations (e.g., full connection and/or convolution) .
  • each reduced 2D region can be normalized, so that each region feature can be a fixed-length feature vector calculated based on the respective reduced 2D region identified on the feature map.
  • the 3D prediction sub-network 530 can receive outputs from the initial position sub-network 510 and the region feature sub-network 520, and output status attribute (s) of detected obstacles 542. For example, the 3D prediction sub-network 530 can predict and output the type, pose, orientation, 3D position, 3D size, and/or other attributes of the detected obstacles.
  • the 3D prediction sub-network 530 can determine and output a confidence level for each candidate region to indicate a probability that the candidate region belongs to an obstacle. In some embodiments, the output is filtered based on the confidence level. For example, candidate regions whose confidence levels fall below a threshold can be excluded from the output.
  • one or more controllers of a mobile platform can perform automated or semi-automated mapping, navigation, emergency maneuvers, or other actions that control certain movements of the mobile platform using the various status attributes of the detected obstacles.
  • the 3D prediction sub-network 530 includes one or more sub-modules (e.g., neural network branches) that predict at least one of the semantic category (e.g., a type of an obstacle) , 2D region (e.g., 2D region of the image data) , orientation, 3D size, and 3D position of the obstacle (s) .
  • Each sub-module can include a linear transformation process that maps the output (e.g., region features for individual candidate regions) of the region feature sub-network 520 to respective dimensions of the sub-module output.
  • the semantic category prediction sub-module can predict a confidence level that indicates the probability that a candidate region “belongs” to a semantic category (e.g., the probability of “belonging” to a vehicle, pedestrian, bicycle or background) .
  • the 2D region prediction sub-module can use a center point, length, and width to represent a 2D region of a corresponding obstacle.
  • the 2D region prediction sub-module can estimate an offset from each candidate region to a corresponding obstacle's 2D region, thereby obtaining the position (s) of 2D regions that indicate obstacle (s) .
  • the orientation prediction sub-module can divide the angle range between -180 degrees and +180 degrees into multiple intervals (e.g., two intervals of [-180°, 0°] and [0°, 180°] ) , and calculate the center for each interval.
  • the orientation prediction sub-module can predict a specific interval to which an obstacle's orientation angle belongs, and calculate the difference between the obstacle's orientation angle and the center of the interval to which it belongs, thereby obtaining the orientation angle of the obstacle.
  • the 3D size prediction sub-module can perform predictions using the average length, width, and height (or other measurements relating to 3D size) of the 3D representation (e.g., frame) of obstacles of each semantic category.
  • the average measurements can be obtained from training data collected offline.
  • the 3D size prediction sub-module can predict a ratio of the 3D size of an obstacle to the average 3D size of a corresponding category, thereby obtaining the 3D size attribute of the obstacle.
  • the 3D position prediction sub-module can predict the offset between the 3D position of an obstacle and the initial 3D position of a corresponding input candidate region, thereby obtaining the 3D position of the obstacle (a decoding sketch covering the orientation, 3D size, and 3D position sub-modules appears after this list).
  • the 3D prediction sub-network 530 can further filter its output using suitable filtering methods (e.g., non-maximum suppression) to retain only those candidates with a confidence level greater than a certain threshold.
  • status attributes such as the semantic category, 2D region, orientation, 3D size and 3D position of obstacle (s) in the mobile platform's current road scene can be obtained.
  • This output can be provided to downstream applications of the mobile platform, such as route planning and control, to facilitate automated navigation, autonomous driving, or other functionalities.
  • training samples can be pre-collected, each sample containing input data (e.g., point clouds and corresponding images for method 200, stereo images for method 300) and its associated 3D region (s) that are manually identified to represent obstacle (s) .
  • the parameters of the neural network (s) can be learned through a sufficiently large number of training samples.
  • the base neural network 220, candidate region neural network 230, and obstacle detection neural network 240 can be trained separately or jointly.
  • the training data for the different neural networks can be independent from one another (e.g., based on different time and context) .
  • the training data for the different neural networks correspond with one another (e.g., associated with a same series of point cloud and/or image frames) .
  • a part (e.g., the base neural network 220 and the candidate region neural network 230) of the ANN hierarchy used in method 200 is trained jointly while at least another part (e.g., the obstacle detection neural network 240) of the ANN hierarchy is trained separately.
  • the depth estimating neural network 310, the base neural network 320, the candidate region neural network 330, and the obstacle detection neural network 340 can be trained separately or jointly.
  • suitable training methods can include collecting various images and LiDAR point clouds corresponding to the images as training data. Because the LiDAR point clouds provide depth measurements of the environment depicted by the images, the base neural network 320 and the depth estimating neural network 310 can be trained jointly based on the images (as input to the base neural network 320) and their associated depth measurements (as output from the depth estimating neural network 310) . Proper network parameters can be obtained when the training is performed on sufficient data samples.
  • the obstacle detection neural network 540 can be trained separately or jointly with other neural networks disclosed herein.
  • various images and LiDAR point clouds corresponding to the images can be collected.
  • the LiDAR point clouds can include manually marked 3D regions that represent obstacles.
  • Joint training of the neural networks of method 200 can be based on the images and their corresponding LiDAR point clouds (as input) and various attributes of the corresponding marked 3D regions (as output) . Proper network parameters can be obtained when the training is performed on sufficient data samples.
  • Fig. 6 illustrates examples of mobile platforms configured in accordance with various embodiments of the presently disclosed technology.
  • a representative mobile platform as disclosed herein may include at least one of an unmanned aerial vehicle (UAV) 602, a manned aircraft 604, an autonomous car 606, a self-balancing vehicle 608, a terrestrial robot 610, a smart wearable device 612, a virtual reality (VR) head-mounted display 614, or an augmented reality (AR) head-mounted display 616.
  • Fig. 7 is a block diagram illustrating an example of the architecture for a computer system 700 or other control device that can be utilized to implement various portions of the presently disclosed technology.
  • the computer system 700 includes one or more processors 705 and memory 710 connected via an interconnect 725.
  • the interconnect 725 may represent any one or more separate physical buses, point to point connections, or both, connected by appropriate bridges, adapters, or controllers.
  • the interconnect 725 may include, for example, a system bus, a Peripheral Component Interconnect (PCI) bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB) , IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, sometimes referred to as “Firewire. ”
  • the processor (s) 705 may include central processing units (CPUs) to control the overall operation of, for example, the host computer. In certain embodiments, the processor (s) 705 accomplish this by executing software or firmware stored in memory 710.
  • the processor (s) 705 may be, or may include, one or more programmable general-purpose or special-purpose microprocessors, digital signal processors (DSPs) , programmable controllers, application specific integrated circuits (ASICs) , programmable logic devices (PLDs) , or the like, or a combination of such devices.
  • the memory 710 can be or include the main memory of the computer system.
  • the memory 710 represents any suitable form of random access memory (RAM) , read-only memory (ROM) , flash memory, or the like, or a combination of such devices.
  • the memory 710 may contain, among other things, a set of machine instructions which, when executed by processor (s) 705, causes the processor (s) 705 to perform operations to implement embodiments of the presently disclosed technology.
  • the memory 710 can contain an operating system (OS) 730 that manages computer hardware and software resources and provides common services for computer programs.
  • the network adapter 715 provides the computer system 700 with the ability to communicate with remote devices, such as the storage clients, and/or other storage servers, and may be, for example, an Ethernet adapter or Fiber Channel adapter.
  • the techniques introduced above can be implemented by programmable circuitry (e.g., one or more microprocessors) programmed with software and/or firmware, entirely in special-purpose hardwired circuitry, or in a combination of such forms.
  • Special-purpose hardwired circuitry may be in the form of, for example, one or more application-specific integrated circuits (ASICs) , programmable logic devices (PLDs) , field-programmable gate arrays (FPGAs) , etc.
  • A machine-readable storage medium includes any mechanism that can store information in a form accessible by a machine (a machine may be, for example, a computer, network device, cellular phone, personal digital assistant (PDA) , manufacturing tool, any device with one or more processors, etc. ) .
  • a machine-accessible storage medium includes recordable/non-recordable media (e.g., read-only memory (ROM) ; random access memory (RAM) ; magnetic disk storage media; optical storage media; flash memory devices; etc. ) , etc.
  • logic can include, for example, programmable circuitry programmed with specific software and/or firmware, special-purpose hardwired circuitry, or a combination thereof.
  • some embodiments use depth information generated from stereo camera (s)
  • other embodiments can use depth information generated from LiDAR (s) , 3D-ToF, or RGB-D.
  • Still further embodiments can use depth information generated from a combination of sensors.
  • the phrase “and/or” as in “A and/or B” refers to A alone, B alone, and both A and B.
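For illustration, the following sketch shows one way the orientation, 3D size, and 3D position outputs described in the list above could be decoded into final obstacle attributes. It is a minimal example rather than the disclosed implementation: the two-interval angle layout follows the example given above, while the category-average sizes, variable names, and numeric values are assumptions introduced here.

```python
import numpy as np

# Illustrative decoding of sub-module outputs; all constants are assumptions.
ANGLE_INTERVALS = [(-180.0, 0.0), (0.0, 180.0)]            # two example intervals
INTERVAL_CENTERS = [(lo + hi) / 2.0 for lo, hi in ANGLE_INTERVALS]

CATEGORY_AVG_SIZE = {                                      # average (length, width, height) per category,
    "vehicle": np.array([4.5, 1.8, 1.5]),                  # e.g., obtained from offline training data
    "pedestrian": np.array([0.6, 0.6, 1.7]),
}

def decode_orientation(interval_index, angle_offset_deg):
    """Orientation = center of the predicted interval + predicted offset."""
    return INTERVAL_CENTERS[interval_index] + angle_offset_deg

def decode_size(category, size_ratio):
    """3D size = predicted ratio times the category's average size."""
    return CATEGORY_AVG_SIZE[category] * np.asarray(size_ratio)

def decode_position(initial_xyz, offset_xyz):
    """3D position = initial position from the candidate region + predicted offset."""
    return np.asarray(initial_xyz) + np.asarray(offset_xyz)

# Example usage with made-up network outputs:
print(decode_orientation(1, -12.0))                        # -> 78.0 degrees
print(decode_size("vehicle", [1.05, 0.97, 1.00]))
print(decode_position([12.3, -1.1, 0.8], [0.4, 0.1, -0.05]))
```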

Abstract

Detecting obstacles in an environment adjacent to a mobile platform, and associated systems and methods are disclosed herein. A representative method includes: obtaining sensor data that indicate at least a portion of an environment surrounding the mobile platform from one or more sensors carried by the mobile platform; determining, based at least partly on the sensor data, depth information, a feature map, and a plurality of candidate regions, wherein each candidate region indicates at least a portion of an obstacle within the environment; and feeding the depth information, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.

Description

NEURAL NETWORK BASED OBSTACLE DETECTION FOR MOBILE PLATFORMS, AND ASSOCIATED SYSTEMS AND METHODS TECHNICAL FIELD
The presently disclosed technology is generally directed to detecting obstacles, such as one or more pedestrians, vehicles, buildings, or other obstacle types, in a three-dimensional (3D) environment adjacent to a mobile platform.
BACKGROUND
The environment surrounding a mobile platform can typically be scanned or otherwise detected using one or more sensors. For example, the mobile platform can be equipped with a stereo vision system (e.g., a “stereo camera” ) to sense its surrounding environment. A stereo camera is typically a type of camera with two or more lenses each having a separate image sensor or film frame. When taking photos/videos with the two or more lenses at the same time but from different angles, the difference between the corresponding photos/videos provides a basis for calculating depth information (e.g., distance from objects in the scene to the stereo camera) . As another example, the mobile platform can be equipped with one or more LiDAR sensors, which typically transmit a pulsed signal (e.g. laser signal) outwards, detect the pulsed signal reflections, and determine depth information about the environment to facilitate object detection and/or recognition. Automated or unmanned navigation typically requires determining various attributes of obstacles, such as position, orientation, or size. There remains a need for more efficient obstacle detection technologies, which can help improve the performance of various higher level applications.
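As a concrete illustration of the stereo principle described above, the sketch below converts disparity values into depth estimates using the standard rectified-stereo relation depth = focal length × baseline / disparity. The focal length and baseline values are arbitrary examples, and this is a simplified illustration under the assumption of a rectified stereo pair, not the specific reconstruction used by the disclosed technology.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Standard rectified-stereo relation: depth = f * B / d.
    Pixels with zero (or unknown) disparity are mapped to infinity."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    with np.errstate(divide="ignore"):
        return np.where(disparity_px > 0,
                        focal_length_px * baseline_m / disparity_px,
                        np.inf)

# Example: a 700-pixel focal length, 12 cm baseline, and a few disparities
print(disparity_to_depth([35.0, 7.0, 0.0], focal_length_px=700.0, baseline_m=0.12))
# -> approximately [2.4, 12.0, inf] meters
```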
SUMMARY
The following summary is provided for the convenience of the reader and identifies several representative embodiments of the disclosed technology.
In one aspect, a computer-implemented method for detecting obstacles using one or more sensors carried by a mobile platform includes: obtaining sensor data that indicate at least a portion of an environment surrounding the mobile platform from the one or more sensors; determining, based at least partly on the sensor data, depth information, a feature map, and a plurality of candidate regions, wherein each candidate region indicates at least a portion of an obstacle within the environment; and feeding the depth information, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.
In some embodiments, the one or more sensors include at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
In some embodiments, the depth information includes a point cloud.
In some embodiments, the point cloud is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
In some embodiments, the depth information includes a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from at least one of a stereo camera or mono camera.
In some embodiments, the depth information is generated by feeding the feature map to a depth estimating neural network separate from the obstacle detection neural network.
In some embodiments, the sensor data includes image data and wherein the feature map is generated by at least feeding the image data to a base neural network separate from the obstacle detection neural network.
In some embodiments, the sensor data includes a point cloud and wherein the feature map is generated based, at least in part, on projecting the point cloud onto a 2D grid defined in accordance with the image data.
In some embodiments, the projecting is based, at least in part, on extrinsic and/or intrinsic calibration parameters with respect to at least one sensor that generated the image data.
In some embodiments, the projected data includes at least one of a height, distance, or angle measurement for individual grid blocks of the 2D grid.
In some embodiments, the feature map is generated by further feeding the projected data to the base neural network.
In some embodiments, the feature map is smaller in size than the image data.
In some embodiments, at least one of the base neural network or the obstacle detection neural network includes one or more convolutional and/or pooling layers.
In some embodiments, the method further comprises feeding the feature map to an intermediate neural network to generate the plurality of candidate regions defined in accordance with the image data.
In some embodiments, each candidate region is a two-dimensional (2D) region that includes a respective target pixel of the image data.
In some embodiments, the respective target pixel is associated with a probability of indicating at least a portion of the obstacle within the environment.
In some embodiments, at least two of the base neural network, the intermediate neural network, and the obstacle detection neural network are jointly trained.
In some embodiments, the obstacle detection neural network includes a first subnetwork configured to determine an initial 3D position of an obstacle for each candidate region of the subset based, at least in part, on the depth information and the at least a subset of the plurality of candidate regions.
In some embodiments, the obstacle detection neural network includes a second subnetwork configured to generate one or more region features for each candidate region of the subset based, at least in part, on the at least a subset of the plurality of candidate regions and the feature map.
In some embodiments, the obstacle detection neural network includes a third subnetwork configured to predict, for each candidate region of the subset, at least one of a type, pose, orientation, 3D position, or 3D size of an obstacle corresponding to the  candidate region, based at least in part on the initial 3D position and the one or more region features.
In some embodiments, the mobile platform includes at least one of an unmanned aerial vehicle (UAV) , a manned aircraft, an autonomous car, a self-balancing vehicle, a robot, a smart wearable device, a virtual reality (VR) head-mounted display, or an augmented reality (AR) head-mounted display.
In some embodiments, the method further comprises controlling a mobility function of the mobile platform based, at least in part, on the controlling command.
In some embodiments, the method further comprises causing navigation of the mobile platform based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of one or more obstacles.
In another aspect, a non-transitory computer-readable medium stores computer-executable instructions that, when executed, cause one or more processors associated with a mobile platform to perform actions comprising: obtaining sensor data that indicate at least a portion of an environment surrounding the mobile platform from one or more sensors carried by the mobile platform; determining, based at least partly on the sensor data, depth information, a feature map, and a plurality of candidate regions, wherein each candidate region indicates at least a portion of an obstacle within the environment; and feeding the depth information, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.
In some embodiments, the one or more sensors include at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
In some embodiments, the depth information includes a point cloud.
In some embodiments, the point cloud is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
In some embodiments, the depth information includes a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from at least one of a stereo camera or mono camera.
In some embodiments, the depth information is generated by feeding the feature map to a depth estimating neural network separate from the obstacle detection neural network.
In some embodiments, the sensor data includes image data and wherein the feature map is generated by at least feeding the image data to a base neural network separate from the obstacle detection neural network.
In some embodiments, the sensor data includes a point cloud and wherein the feature map is generated based, at least in part, on projecting the point cloud onto a 2D grid defined in accordance with the image data.
In some embodiments, the projecting is based, at least in part, on extrinsic and/or intrinsic calibration parameters with respect to at least one sensor that generated the image data.
In some embodiments, the projected data includes at least one of a height, distance, or angle measurement for individual grid blocks of the 2D grid.
In some embodiments, the feature map is generated by further feeding the projected data to the base neural network.
In some embodiments, the feature map is smaller in size than the image data.
In some embodiments, at least one of the base neural network or the obstacle detection neural network includes one or more convolutional and/or pooling layers.
In some embodiments, the actions further comprise feeding the feature map to an intermediate neural network to generate the plurality of candidate regions defined in accordance with the image data.
In some embodiments, each candidate region is a two-dimensional (2D) region that includes a respective target pixel of the image data.
In some embodiments, the respective target pixel is associated with a probability of indicating at least a portion of the obstacle within the environment.
In some embodiments, at least two of the base neural network, the intermediate neural network, and the obstacle detection neural network are jointly trained.
In some embodiments, the obstacle detection neural network includes a first subnetwork configured to determine an initial 3D position of an obstacle for each candidate region of the subset based, at least in part, on the depth information and the at least a subset of the plurality of candidate regions.
In some embodiments, the obstacle detection neural network includes a second subnetwork configured to generate one or more region features for each candidate region of the subset based, at least in part, on the at least a subset of the plurality of candidate regions and the feature map.
In some embodiments, the obstacle detection neural network includes a third subnetwork configured to predict, for each candidate region of the subset, at least one of a type, pose, orientation, 3D position, or 3D size of an obstacle corresponding to the candidate region, based at least in part on the initial 3D position and the one or more region features.
In some embodiments, the mobile platform includes at least one of an unmanned aerial vehicle (UAV) , a manned aircraft, an autonomous car, a self-balancing vehicle, or a robot.
In some embodiments, the actions further comprise causing navigation of the mobile platform based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles within the environment.
In another aspect, a mobile platform includes a programmed controller that at least partially controls one or more motions of the mobile platform, wherein the programmed controller includes one or more processors configured to: obtain sensor data that indicate at least a portion of an environment surrounding the mobile platform from one or more sensors; determine, based at least partly on the sensor data, depth information, a feature map, and a plurality of candidate regions, wherein each candidate region indicates at least a portion of an obstacle within the environment; and feed the depth information, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.
In some embodiments, the one or more sensors include at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
In some embodiments, the depth information includes a point cloud.
In some embodiments, the point cloud is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
In some embodiments, the depth information includes a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from at least one of a stereo camera or mono camera.
In some embodiments, the depth information is generated by feeding the feature map to a depth estimating neural network separate from the obstacle detection neural network.
In some embodiments, the sensor data includes image data and wherein the feature map is generated by at least feeding the image data to a base neural network separate from the obstacle detection neural network.
In some embodiments, the sensor data includes a point cloud and wherein the feature map is generated based, at least in part, on projecting the point cloud onto a 2D grid defined in accordance with the image data.
In some embodiments, the projecting is based, at least in part, on extrinsic and/or intrinsic calibration parameters with respect to at least one sensor that generated the image data.
In some embodiments, the projected data includes at least one of a height, distance, or angle measurement for individual grid blocks of the 2D grid.
In some embodiments, the feature map is generated by further feeding the projected data to the base neural network.
In some embodiments, the feature map is smaller in size than the image data.
In some embodiments, at least one of the base neural network or the obstacle detection neural network includes one or more convolutional and/or pooling layers.
In some embodiments, the one or more processors are further configured to feed the feature map to an intermediate neural network to generate the plurality of candidate regions defined in accordance with the image data.
In some embodiments, each candidate region is a two-dimensional (2D) region that includes a respective target pixel of the image data.
In some embodiments, the respective target pixel is associated with a probability of indicating at least a portion of the obstacle within the environment.
In some embodiments, at least two of the base neural network, the intermediate neural network, and the obstacle detection neural network are jointly trained.
In some embodiments, the obstacle detection neural network includes a first subnetwork configured to determine an initial 3D position of an obstacle for each candidate region of the subset based, at least in part, on the depth information and the at least a subset of the plurality of candidate regions.
In some embodiments, the obstacle detection neural network includes a second subnetwork configured to generate one or more region features for each candidate region of the subset based, at least in part, on the at least a subset of the plurality of candidate regions and the feature map.
In some embodiments, the obstacle detection neural network includes a third subnetwork configured to predict, for each candidate region of the subset, at least one of a type, pose, orientation, 3D position, or 3D size of an obstacle corresponding to the candidate region, based at least in part on the initial 3D position and the one or more region features.
In some embodiments, the mobile platform includes at least one of an unmanned aerial vehicle (UAV) , a manned aircraft, an autonomous car, a self-balancing vehicle, or a robot.
In some embodiments, the one or more processors are further configured to cause navigation of the mobile platform based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles within the environment.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic illustration of a representative system 100 having elements configured in accordance with some embodiments of the presently disclosed technology.
Fig. 2 is a flowchart illustrating a method that uses a hierarchy of artificial neural networks (ANNs) to detect obstacles for a mobile platform, in accordance with some embodiments of the presently disclosed technology.
Fig. 3 is a flowchart illustrating another method that uses a hierarchy of ANNs to detect obstacles for a mobile platform, in accordance with some embodiments of the presently disclosed technology.
Figs. 4A and 4B illustrate an example of 2D grid and a candidate region identified thereon, in accordance with some embodiments of the presently disclosed technology.
Fig. 5 is a flowchart illustrating an obstacle detection process using an obstacle detection network, in accordance with some embodiments of the presently disclosed technology.
Fig. 6 illustrates examples of mobile platforms configured in accordance with various embodiments of the presently disclosed technology.
Fig. 7 is a block diagram illustrating an example of the architecture for a computer system or other control device that can be utilized to implement various portions of the presently disclosed technology.
Fig. 8 is a flowchart illustrating a candidate region determination process using a candidate region network, in accordance with some embodiments of the presently disclosed technology.
Fig. 9 illustrates an example process for detecting obstacles using one or more sensors carried by a mobile platform, in accordance with some embodiments of the presently disclosed technology.
Fig. 10 illustrates an example process for generating feature maps, in accordance with some embodiments, of the presently disclosed technology.
Figs. 11A and 11B illustrate an example of cascaded convolution and pooling layers used in a base neural network as well as example data involved therewith, in accordance with some embodiments of the presently disclosed technology.
Fig. 12 illustrates an example implementation for generating feature map (s) , in accordance with some embodiments of the presently disclosed technology.
DETAILED DESCRIPTION
1. Overview
Obstacle detection is an important aspect of automated or unmanned navigation technologies. Image data and/or point cloud data collected by sensors (e.g., cameras or LiDAR sensors) carried by a mobile platform (e.g., an unmanned car, boat, or aircraft) can be used as a basis for detecting obstacles in an environment that surrounds or is otherwise observable from the mobile platform. An obstacle's 2D position (e.g., on an image) , orientation, pose, 3D position and size, and/or other attributes can be useful in various higher level navigation applications. The precision and efficiency of obstacle detection and 3D positioning, to some extent, can determine the safety and reliability of corresponding navigation systems.
Typically, 3D information that is reconstructed from stereo camera data (e.g., images) is less precise than 3D point clouds produced by LiDAR or other emission/detection sensors. Therefore, obstacle detection methods using point clouds produced by LiDAR may not be applicable to stereo camera data. On the other hand, image-based obstacle detection methods typically only output an obstacle's 2D position, and are not capable of determining the obstacle's precise 3D position in the physical world. Aspects of the presently disclosed technology use hierarchies of artificial neural networks (ANNs) and region-based methods to detect obstacles and determine their status attribute (s) (e.g., positioning information) based on various data collected from sensors. In some embodiments, among other things, the particular structure of ANN hierarchies that inter-connects multiple ANNs whose input and output are specifically defined contributes to various advantages and improvements (e.g., in computational efficiency, detection accuracy, system robustness, etc. ) of the presently disclosed technology. As those skilled in the art would appreciate, ANNs are computing systems that "learn" (i.e., progressively improve performance on) tasks by considering examples, generally without task-specific programming. For example, in image recognition, an ANN may learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images.
An ANN is typically based on a collection of connected units or nodes called artificial neurons. Each connection between artificial neurons can transmit a signal from one artificial neuron to another. The artificial neuron that receives the signal can process it and then signal artificial neurons connected to it. Typically, in ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is calculated by a non-linear function of the sum of its inputs. Artificial neurons and connections typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that only if the aggregate signal crosses that threshold is the signal sent. Typically, artificial neurons are organized in layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first (input) , to the last (output) layer, possibly after traversing the layers multiple times.
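As a minimal illustration of the weighted-sum-and-activation behavior described above, the sketch below implements a single artificial neuron. The ReLU activation and the example weights are assumptions chosen purely for illustration.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum of incoming signals followed
    by a non-linear activation (here, ReLU as an example)."""
    z = np.dot(inputs, weights) + bias   # weighted sum of the incoming signals
    return max(0.0, z)                   # non-linear activation; only positive aggregates are passed on

# Example: three incoming signals with illustrative learned weights
print(neuron(np.array([0.5, -1.2, 3.0]), np.array([0.8, 0.1, 0.4]), bias=-0.2))  # -> 1.28
```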
In some embodiments, one or more ANNs used by the presently disclosed technology includes convolutional neural network (s) (CNN, or ConvNet) . Typically, CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. CNNs may also be shift invariant or space invariant artificial neural networks (SIANN) ,  based on their shared-weights architecture and translation invariance characteristics. Illustratively, CNNs were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.
In some embodiments, the presently disclosed technology implements various ANNs in a hierarchy and interconnects the ANNs to achieve more precise and/or efficient obstacle detection. In some aspects, a base neural network can receive 3D point cloud data, stereo camera image data, and/or mono image data and generate intermediate features (e.g., a feature map) for feeding into one or more other neural networks. In some aspects, a candidate region neural network can receive at least the intermediate features and determine 2D candidate regions that indicate at least a portion of an obstacle within the environment. In some aspects, an obstacle detection neural network can receive environment depth information (e.g., 3D point cloud or depth map data) , the intermediate features, and the candidate regions and predict various attributes of detected obstacles.
Several details describing structures and/or processes that are well-known and often associated with mobile platforms (e.g., UAVs, car or other types of mobile platforms) and corresponding systems and subsystems, but that may unnecessarily obscure some significant aspects of the presently disclosed technology, are not set forth in the following description for purposes of clarity. Moreover, although the following disclosure sets forth several embodiments of different aspects of the presently disclosed technology, several other embodiments can have different configurations or different components than those described herein. Accordingly, the presently disclosed technology may have other embodiments with additional elements and/or without several of the elements described below with reference to Figs. 1-9.
Figs. 1-9 are provided to illustrate representative embodiments of the presently disclosed technology. Unless provided for otherwise, the drawings are not intended to limit the scope of the claims in the present application.
Many embodiments of the technology described below may take the form of computer-or controller-executable instructions, including routines executed by a programmable computer or controller. The programmable computer or controller may or may not reside on a corresponding mobile platform. For example, the programmable computer or controller can be an onboard computer of the mobile platform, or a separate but dedicated computer associated with the mobile platform, or part of a network or cloud based computing service. Those skilled in the relevant art will appreciate that the technology can be practiced on computer or controller systems other than those shown and described below. The technology can be embodied in a special-purpose computer or data processor that is specifically programmed, configured or constructed to perform one or more of the computer-executable instructions described below. Accordingly, the terms "computer" and "controller" as generally used herein refer to any data processor and can include Internet appliances and handheld devices (including palm-top computers, wearable computers, cellular or mobile phones, multi-processor systems, processor-based or programmable consumer electronics, network computers, mini computers and the like) . Information handled by these computers and controllers can be presented at any suitable display medium, including an LCD (liquid crystal display) . Instructions for performing computer-or controller-executable tasks can be stored in or on any suitable computer-readable medium, including hardware, firmware or a combination of hardware and firmware. Instructions can be contained in any suitable memory device, including, for example, a flash drive, USB (universal serial bus) device, and/or other suitable medium. In particular embodiments, the instructions are accordingly non-transitory.
2. Representative Embodiments
Fig. 1 is a schematic illustration of a representative system 100 having elements configured in accordance with some embodiments of the presently disclosed technology. The system 100 includes a mobile platform 110 (e.g., an autonomous vehicle) and a control system 120. The mobile platform 110 can be any suitable type of movable object that can be used in various embodiments, such as an unmanned aerial vehicle, a manned aircraft, an autonomous vehicle, a self-balancing vehicle, or a robot.
The mobile platform 110 can include a main body 112 that can carry a payload 114. Many different types of payloads can be used in accordance with the embodiments described herein. In some embodiments, the payload includes one or more sensors, such as an imaging device or an optoelectronic scanning device. For example, the payload 114 can include a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, a mono camera, a video camera, and/or a still camera. The camera can be sensitive to wavelengths in any of a variety of suitable bands, including visual, ultraviolet, infrared and/or other bands. The payload 114 can also include other types of sensors and/or other types of cargo (e.g., packages or other deliverables) . In some embodiments, the payload 114 is supported relative to the main body 112 with a carrying mechanism 116 (e.g., a gimbal, rack, or bar) . The carrying mechanism 116 can allow the payload 114 to be independently positioned relative to the main body 112.
The mobile platform 110 can be configured to receive control commands from the control system 120 and/or transmit data to the control system 120. In the embodiment shown in Fig. 1, the control system 120 includes some components carried on the mobile platform 110 and/or some components positioned off the mobile platform 110. For example, the control system 120 can include a first controller 122 carried by the mobile platform 110 and/or a second controller 124 (e.g., a human-operated, remote controller) positioned remote from the mobile platform 110 and connected via a communication link 128 (e.g., a wireless link such as a radio frequency (RF) based link) . The first controller 122 can include a computer-readable medium 126 that executes instructions directing the actions of the mobile platform 110, including, but not limited to, operation of various components of the mobile platform including the payload 162 (e.g., a camera) . The second controller 124 can include one or more input/output devices, e.g., a display and control buttons. In some embodiments, the operator at least partly manipulates the second controller 124 to control the mobile platform 110 remotely, and receives feedback from the mobile platform 110 via the display and/or other interfaces on the second controller 124. In some embodiments, the mobile platform 110 operates autonomously, in which case the second controller 124 can be eliminated, or can be used solely for operator override functions.
In order to provide for safe and efficient operation, it may be beneficial for autonomous vehicles, UAVs, and other types of unmanned vehicles to be able to autonomously or semi-autonomously detect obstacles and/or to engage in evasive maneuvers to avoid obstacles. Additionally, sensing environmental objects can be useful for mobile platform functions such as navigation, target tracking, and mapping, particularly when the mobile platform is operating in a semi-autonomous or fully autonomous manner.
Accordingly, the mobile platforms described herein can include one or more sensors (e.g., separate and independent from payload-type sensors) configured to detect objects in the environment surrounding the mobile platform. In some embodiments, the mobile platform includes one or more sensors (e.g., distance measurement device 140 of Fig. 1) configured to measure the distance between an object and the mobile platform. The distance measurement device can be carried by the mobile platform in various ways, such as above, underneath, on the side (s) of, or within the main body of the mobile platform. Optionally, the distance measurement device can be coupled to the mobile platform via a gimbal or other carrying mechanism that permits the device to be translated and/or rotated relative to the mobile platform. In some embodiments, the distance measurement device is an optical distance measurement device that uses light to measure distance to an object. The optical distance measurement device can be a LiDAR system or a laser rangefinder. In some embodiments, the distance measurement device is a camera that can capture image data, from which depth information can be determined. The camera can be a stereo camera or a mono camera.
Fig. 9 illustrates an example process 900 for detecting obstacles using one or more sensors carried by a mobile platform, in accordance with some embodiments of the presently disclosed technology. At block 910, the process 900 includes obtaining sensor data (e.g., point cloud (s) , depth map, image (s) , or the like) that indicate at least a portion of an environment surrounding the mobile platform from the one or more sensors (e.g., LiDAR, radar, Time-of-Flight (ToF) camera, stereo camera, mono camera, or the like) . At block 920, the process 900 includes determining, based at least partly on the sensor data, depth information (e.g., depth map (s) , point cloud (s) , or the like) , feature map (s) (e.g., 2D grid based features) , and a plurality of candidate regions (e.g., regions defined on a 2D grid such as an image) . Illustratively, each candidate region indicates at least a portion of an obstacle within the environment.
In the context of block 910 and block 920, Fig. 12 illustrates an example implementation for generating feature map (s) , in accordance with some embodiments of the presently disclosed technology. With reference to Fig. 12, an image and preliminary features derived from a point cloud can be fed into a base neural network. As discussed above, the image and the point cloud can be obtained at block 910 of process 900. As will be discussed in detail below with reference to Fig. 2, a pre-processing module (e.g., part of a controller associated with the mobile platform) can project 3D points in the point cloud onto a 2D grid (e.g., a 2D plane) defined in accordance with the image, thereby generating a set of 2D grid-based preliminary features including, for example, height, angle, and/or distance values. As will be discussed in detail below with reference to Figs. 2, 10, 11A, and 11B, the base neural network can include multiple layers of transformation, which can transform the image and the preliminary features into one or more feature maps for further processing. In some embodiments, as will be discussed in detail below with reference to Figs. 3, 10, 11A, and 11B, the base neural network can include multiple layers of transformation, which can transform an image into one or more feature maps for further processing.
Referring back to Fig. 9, at block 930, the process 900 includes feeding the depth information, the feature map (s) , and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment. The status attribute includes at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles within the environment.
More specifically, Fig. 2 is a flowchart illustrating a method 200 that uses a hierarchy of ANNs to detect obstacles for a mobile platform, in accordance with some embodiments of the presently disclosed technology. The method 200 can be implemented by a controller (e.g., an onboard computer of a mobile platform, an associated computing device, and/or an associated computing service) .
With reference to Fig. 2, the controller can obtain point cloud data 202 (or another form of depth information) and image data 204 using one or more sensors carried by the mobile platform. As discussed above, a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, a mono camera, or another sensor can provide data for obtaining depth information (e.g., measurements of distance between different portions of a scene and the sensor) for an environment that surrounds or is otherwise adjacent to, but does not necessarily abut, a mobile platform. Point cloud data 202 can be obtained directly from depth detection sensor (s) (e.g., a LiDAR, a radar, or a Time-of-Flight (ToF) camera) or obtained indirectly from reconstruction using a stereo camera or a mono camera. Illustratively, the point cloud data (or another form of depth information) is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, a mono camera, or the like. In some embodiments, the depth information can include a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from the image data 204.
Illustratively, image data 204 can be provided by either stereo or mono cameras. In some embodiments, the controller obtains a series of point clouds and images that are temporally consecutive (e.g., frames of point clouds and images) . In some embodiments, a point cloud and an image that correspond to a same timepoint are used in the method 200.
The controller feeds the point cloud data 202 into a pre-processing module 210 (e.g., a neural network of one or more layers) , which outputs 2D grid based preliminary features 212. Illustratively, the pre-processing module 210 can be implemented to project the point cloud data 202 onto a 2D grid defined in accordance with the image data to obtain the projected data; the 2D grid has the same size as the obtained image data 204. Fig. 4A illustrates an example of such a 2D grid. Projecting the point cloud data 202 can be performed based on extrinsic and/or intrinsic calibration parameters associated with the camera (s) that generated the image data 204. For example, if the image data 204 has a size of 720x1280 pixels, the pre-processing module 210 can project scanning points of the point cloud onto 720x1280 individual grid blocks 402 that correspond to the pixels of the image data 204. In other words, each pixel can correspond to a grid block 402 of the 2D grid. Features such as height (e.g., z-coordinate of a 3D coordinate) , depth (e.g., distance to the mobile platform) , angle (e.g., a normal vector) or the like can be calculated for individual grid blocks based on projecting the point cloud data 202 onto a 2D grid defined in accordance with the image data. As such, the 2D grid based preliminary features 212 can have 720x1280 grid blocks, and each grid block can include one or more features (e.g., a height, distance, or angle measurement) derived from point cloud data.
In accordance with an example implementation, the point cloud data 202 is projected from a 3D coordinate system associated with the camera (s) that generated the image data 204 to a 2D grid of image data 204. If the number of points in the point cloud data 202 is N, and the 3D coordinate of an individual point is p = (x, y, z) , then the point can be projected based on:
u = f_x · x / z + c_x
v = f_y · y / z + c_y
where f_x, f_y are focal lengths and c_x, c_y are optical center coordinates (e.g., f_x, f_y and c_x, c_y can be obtained from the intrinsic calibration parameters associated with the camera (s) ) , and (u, v) is the pixel coordinate after the point is projected. Accordingly, a correspondence or mapping between a three-dimensional point (x, y, z) and a pixel coordinate (u, v) is established.
For each pixel coordinate (u, v) , the controller can perform feature encoding based on its corresponding 3D coordinate (x, y, z) in order to generate the preliminary features 212. For example, in accordance with some encoding schemes, the controller can encode the point cloud into a set of three-channel preliminary features. The three channels of feature can represent distance, height, and angle, respectively. The feature encoding can be based on:
[The three encoding equations appear as images (PCTCN2019072703-appb-000003 through PCTCN2019072703-appb-000005) in the published application and are not reproduced here.]
where c1, c2, c3 are distance, height, and angle values, respectively, and α_z, α_y are normalization coefficients (which can be predetermined) . As a result, the projection of the point cloud data 202 can generate a set of 2D grid-based preliminary features each including respective distance, height, and angle values.
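The following sketch illustrates the projection and feature-encoding steps described above. The pinhole projection mirrors the relation given in the text; however, because the exact encoding equations are published only as images, the distance, height, and angle formulas and the normalization constants used here are illustrative assumptions rather than the disclosed equations.

```python
import numpy as np

def project_and_encode(points_xyz, fx, fy, cx, cy, height_px, width_px,
                       alpha_z=0.02, alpha_y=0.5):
    """Project camera-frame 3D points onto the image grid (pinhole model) and
    fill a 3-channel (distance, height, angle) preliminary feature grid.
    The channel formulas and normalization constants are assumptions."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    valid = z > 0                                   # keep points in front of the camera
    x, y, z = x[valid], y[valid], z[valid]

    u = np.round(fx * x / z + cx).astype(int)       # pixel column
    v = np.round(fy * y / z + cy).astype(int)       # pixel row
    inside = (u >= 0) & (u < width_px) & (v >= 0) & (v < height_px)
    u, v, x, y, z = u[inside], v[inside], x[inside], y[inside], z[inside]

    grid = np.zeros((height_px, width_px, 3), dtype=np.float32)
    grid[v, u, 0] = np.clip(alpha_z * z, 0.0, 1.0)          # distance channel (assumed normalization)
    grid[v, u, 1] = np.clip(alpha_y * (-y), -1.0, 1.0)      # height channel (camera y points down; assumed)
    grid[v, u, 2] = np.arctan2(y, z)                        # angle channel (assumed definition)
    return grid

# Example with a handful of synthetic camera-frame points
pts = np.array([[1.0, -0.5, 8.0], [-2.0, 0.2, 15.0], [0.0, 0.0, 3.0]])
features = project_and_encode(pts, fx=700.0, fy=700.0, cx=640.0, cy=360.0,
                              height_px=720, width_px=1280)
print(features.shape)  # (720, 1280, 3)
```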
Referring back to Fig. 2, the controller can feed the preliminary features 212 (e.g., the projected data) and the image data 204 into a base neural network 220 (e.g., including one or more CNNs) . In various embodiments, the base neural network 220 can include one or more layers of convolution operations and/or pooling (e.g., down sampling) operations. Illustratively, with multiple layers of feature transformation, the base neural network 220 can output feature map 222 based on the preliminary features 212 and image data 204. In some embodiments, the feature map 222 can take the form of a 2D grid-based feature map that is smaller in size than the preliminary features 212. For example, the base neural network can include multiple modules (e.g., each including one or more convolution and/or pooling layers) that are cascaded in turn, each of which performs nonlinear feature transformation (e.g., convolution) and/or pooling (e.g., down sampling) on the input features (preliminary features 212 and the image data 204) . After the multiple levels of feature transformations and/or pooling operations, the base neural network can output a feature map 222 of a smaller size than the preliminary features 212.
In this regard, Fig. 10 illustrates an example process for generating feature maps. Typically, a CNN can include two main types of network layers, namely the convolution layer and the pooling layer. The convolution layer can be utilized to extract various features from the input (such as an image) to the convolution layer. The pooling layer can be utilized to compress the features that are input to the pooling layer, thereby reducing the number of training parameters for the neural network and easing the degree of model over-fitting. In accordance with Fig. 10, if the input preliminary features are 32*32 in size, then after convolution operations, the preliminary features can be transformed into a first set of 6 feature maps. Each feature map in this first set has a size of 28*28. After pooling operations are performed on the first set of feature maps, a second set of 6 feature maps is generated. Each feature map in the second set has a size of 14*14.
Fig. 11A illustrates an example of cascaded convolution and pooling layers used in the base neural network, in accordance with some embodiments of the presently disclosed technology. As illustrated, the base neural network has 3 convolution layers (e.g., operations leading to C1, C3, and C5, respectively) and 3 pooling layers (e.g., operations leading to S2, S4, and S6, respectively) that are serially connected with one another in a cascaded manner. Using the cascaded convolution and pooling layers, an input (64*64 size) is transformed into a first set (C1) of 6 feature maps (60*60 size) , and in turn into a second set (S2) of 6 feature maps (30*30 size) , a third set (C3) of 16 feature maps (26*26 size) , a fourth set (S4) of 16 feature maps (13*13 size) , a fifth set (C5) of feature maps (10*10 size) , and a sixth set (S6) of feature maps (5*5 size) as output. In some embodiments, the sixth set (S6) of feature maps can be further transformed by full connection layer (s) and/or Gaussian layer (s) to, for example, a vector as output from the base neural network. Fig. 11B illustrates an example of the input and feature map sets involved with the cascaded structure of Fig. 11A.
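The sketch below reproduces the cascaded convolution/pooling structure of Fig. 11A in PyTorch. The kernel sizes and channel counts are assumptions chosen solely so that the intermediate feature-map sizes match those quoted above (64*64 through 5*5); they are not the parameters of the disclosed base neural network.

```python
import torch
import torch.nn as nn

# Illustrative cascade: each convolution has no padding, each pooling halves the size.
base_net = nn.Sequential(
    nn.Conv2d(in_channels=6, out_channels=6, kernel_size=5),   # C1: 6 maps, 60*60
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                               # S2: 6 maps, 30*30
    nn.Conv2d(6, 16, kernel_size=5),                           # C3: 16 maps, 26*26
    nn.ReLU(),
    nn.MaxPool2d(2),                                           # S4: 16 maps, 13*13
    nn.Conv2d(16, 16, kernel_size=4),                          # C5: 16 maps, 10*10
    nn.ReLU(),
    nn.MaxPool2d(2),                                           # S6: 16 maps, 5*5
)

# e.g., an RGB image stacked with the 3-channel preliminary features (assumed layout),
# cropped to a 64*64 toy size for this example
x = torch.randn(1, 6, 64, 64)
print(base_net(x).shape)   # torch.Size([1, 16, 5, 5])
```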
Referring back to Fig. 2, the feature map 222 can be a common input to multiple neural networks or applicable components to achieve obstacle detection and/or 3D positioning, in accordance with various embodiments of the presently disclosed technology. With continued reference to Fig. 2, the controller can feed the feature map 222 into a candidate region neural network 230 (e.g., including one or more CNNs) to generate the plurality of candidate regions defined in accordance with the image data 204. In various embodiments, the candidate region neural network 230 can include one or more layers of convolution and/or pooling operations.
Fig. 8 is a flowchart illustrating a candidate region determination process 800 using a candidate region neural network 830 (e.g., corresponding to the candidate region neural network 230 of Fig. 2) , in accordance with some embodiments of the presently disclosed technology. With reference to Fig. 8, the candidate region neural network 830  can include one or more modules 840-870 for feature transformation, likelihood estimation, 2D grid regression, and/or redundancy filtering.
Illustratively, the feature transformation module 840 receives the feature map 822 as input, which is further transformed to feed to (a) the likelihood estimation module 850 that can predict the probability that each pixel in the image data 204 (or each grid block in a corresponding 2D grid) “belongs” to an obstacle and (b) the 2D grid regression module 860 that can determine a corresponding 2D region representing the obstacle to which the pixel (or grid block) “belongs. ” Outputs from the likelihood estimation module 850 and the 2D grid regression module 860 are fed to the redundancy filtering module 870 that can filter out redundant 2D regions according to the predicted probabilities and/or overlaps of 2D regions. The redundancy filtering module 870 can then output a smaller number of candidate regions to be included in the candidate region data 832 (e.g., candidate region data 232) .
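The likelihood estimation and 2D grid regression modules described above can be sketched as two small prediction heads on the shared feature map. The channel counts, the (cx, cy, w, h) box encoding, and the class/module names below are assumptions for illustration only.

```python
import torch
from torch import nn

class CandidateRegionHeads(nn.Module):
    """Sketch of the feature transformation, likelihood estimation, and
    2D grid regression modules: for every grid block of the feature map,
    predict (a) the probability that the block belongs to an obstacle and
    (b) a 2D region encoded as (cx, cy, w, h)."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.transform = nn.Conv2d(in_channels, in_channels, 3, padding=1)  # feature transformation
        self.likelihood = nn.Conv2d(in_channels, 1, 1)                      # per-block obstacle probability
        self.regression = nn.Conv2d(in_channels, 4, 1)                      # per-block 2D region parameters

    def forward(self, feature_map):
        h = torch.relu(self.transform(feature_map))
        prob = torch.sigmoid(self.likelihood(h))
        boxes = self.regression(h)
        return prob, boxes

heads = CandidateRegionHeads()
prob, boxes = heads(torch.randn(1, 256, 45, 80))
print(prob.shape, boxes.shape)   # (1, 1, 45, 80), (1, 4, 45, 80)
```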
With reference back to Fig. 2, illustratively, the candidate region neural network 230 can include one or more CNNs. The candidate region neural network 230 can output candidate region data 232 that include or indicate candidate region (s) . Each candidate region can indicate at least a portion of an obstacle within the environment (e.g., portions of an image that show at least some part of the obstacle) . For example, Fig. 4B illustrates an example candidate region 410 identified on a 2D grid. Illustratively, the 2D grid can be defined (e.g., smaller in size) in accordance with the image data 204; in some embodiments, the 2D grid for identifying the candidate regions can be the image data 204 itself.
A candidate region can be a 2D region made up of a group of connected or disconnected grid blocks that includes a base block 412. Each grid block can correspond to an individual pixel or a block (e.g., 3x3 in size) of pixels in the image data 204. Using each grid block of the 2D grid as a base block 412, the candidate region neural network 230 can output (1) a likelihood (e.g., estimated probability) that the base block 412 indicates some part of an obstacle and (2) a candidate region 410 that includes the base block 412 and potentially corresponds to at least a portion of the obstacle. In some embodiments, various criteria can be applied to the candidate regions and/or their associated likelihoods to select a subset of the data for output by filtering out redundant candidates. For example, a candidate region that exceeds a threshold level of overlap with one or more other candidate regions can be excluded from the output. As another example, candidate regions can be filtered out if the likelihoods of their respective base blocks belonging to an obstacle fall below a threshold value.
In accordance with the context generally described above, the presently disclosed technology can include (1) a data acquisition and pre-processing aspect and (2) a feature map and candidate region determination aspect. By way of an implementation example, the data acquisition and pre-processing aspect can include: obtaining a stereo color image (e.g., with a resolution of 720*1280) acquired by a stereo camera carried by the mobile platform in motion, generating a 3D point cloud via 3D reconstruction based on the stereo image, and determining input features based on the 3D point cloud and the image.
To determine the input features, preliminary features can be obtained from the 3D point cloud. Illustratively, the point cloud is projected onto a 2D grid (e.g., a plane) of the image using the stereo camera′s calibration parameters. The projection can result in the preliminary features (e.g., the projected data) of the same size (e.g., 720*1280) as the image. For each grid block of the 2D grid, various features (e.g., a height, distance, or angle measurement) can be calculated based on practical needs and/or computational efficiency. For example, 3 features (e.g., height, distance, and angle) associated with each grid block can be calculated, in which case the preliminary features have a dimension of 720*1280*3.
Next, (a) the preliminary features (e.g., having a dimension of 720*1280*3) and (b) the left-eye (or right-eye) image of the stereo image can be spliced or otherwise combined. Illustratively, the left-eye image is an RGB image that also has a dimension of 720*1280*3. The combination of (a) and (b) therefore generates input features having a dimension of 720*1280*6, which can serve as input for determining feature map and candidate regions.
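The splicing step can be sketched as a simple channel-wise concatenation; the zero arrays below are placeholders standing in for the actual projected features and the left-eye RGB image.

```python
import numpy as np

preliminary = np.zeros((720, 1280, 3), dtype=np.float32)  # distance, height, angle
left_rgb = np.zeros((720, 1280, 3), dtype=np.float32)     # left-eye RGB image

input_features = np.concatenate([preliminary, left_rgb], axis=-1)
print(input_features.shape)   # (720, 1280, 6)
```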
In accordance with the specific example, the feature map and candidate region determination aspect can include the use of the base neural network and the candidate region neural network. The neural networks can include layers of convolution and pooling operations based on practical needs and/or computational efficiency.
Illustratively, the base neural network receives the input features (e.g., having a dimension of 720*1280*6) as an input. The base neural network can include 4 modules that are cascaded in turn, each of which performs nonlinear feature transformation and 2x down-sampling. Accordingly, after 4 rounds of down-sampling, the base neural network can output a feature map with a resolution of 45*80. The candidate region neural network receives the feature map as an input and predicts, for each pixel in the left-eye image, the probability that the pixel “belongs” to an obstacle as well as a corresponding 2D region representing or indicating the obstacle. Based on the predicted probabilities, the candidate region neural network filters out redundant 2D regions and outputs the remaining 2D regions as candidate regions that represent or indicate obstacles. The number of candidate regions can be in the hundreds (e.g., 500, 400, or 300) or less.
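The 4 cascaded 2x down-sampling modules can be sketched as follows; the channel widths and the use of a stride-preserving convolution followed by max pooling are assumptions chosen so that a 720*1280*6 input yields a 45*80 feature map.

```python
import torch
from torch import nn

def module(cin, cout):
    # one cascaded module: nonlinear feature transformation + 2x down-sampling
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    )

base = nn.Sequential(module(6, 32), module(32, 64), module(64, 128), module(128, 256))
feature_map = base(torch.randn(1, 6, 720, 1280))
print(feature_map.shape)   # torch.Size([1, 256, 45, 80])
```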
Referring back to Fig. 2, the controller can feed the point cloud data 202, feature map 222, and candidate region data 232 to an obstacle detection neural network 240 (e.g., including one or more CNNs) . As will be discussed in further detail below with reference to Fig. 5, the obstacle detection neural network 240 can output one or more status attributes of detected obstacles 242, including predictions of type, pose, orientation, 3D position, 3D size, and/or other attributes of the one or more detected obstacles within the environment. The controller can output commands or instructions based on the status attribute (s) of detected obstacles 242 to control at least certain motion (e.g., acceleration, deceleration, turning, etc. ) of the mobile platform to avoid contact with the detected obstacles.
Fig. 3 is a flowchart illustrating a method 300 that uses a hierarchy of ANNs to detect obstacles for a mobile platform, in accordance with some embodiments of the presently disclosed technology. The method 300 can be implemented by a controller (e.g., an onboard computer of a mobile platform, an associated computing device, and/or an associated computing service) . In some embodiments, the method 300 can use various combinations of data obtained by stereo or mono camera (s) at different layers or stages of the hierarchy of ANNs to achieve obstacle detection and 3D positioning.
With reference to Fig. 3, the controller can obtain image data 302 using camera (s) or other visual sensor (s) carried by the mobile platform. As discussed above, image data generated by camera (s) can provide a basis for obtaining depth information (e.g., measurements of distance between different portions of a scene and the sensor) for an environment that surrounds or is otherwise adjacent to, but does not necessarily abut, a mobile platform. In various embodiments, image data 302 can be provided by stereo and/or mono camera (s) . In some embodiments, the controller obtains stereo images and/or a mono image that correspond to a particular timepoint. In some embodiments, the controller obtains a series of images that are temporally consecutive (e.g., consecutive image frames) .
More specifically, the controller can feed the image data 302 into a base neural network 320 (e.g., including one or more CNNs) . The base neural network 320 can be structurally equivalent, similar, or dissimilar to the base neural network 220 used in the method 200 described above with reference to Fig. 2. In various embodiments, the base neural network 320 can include one or more layers of convolution and/or pooling operations. Illustratively, with multiple layers of convolution and/or pooling operations, the base neural network 320 can output a feature map 322 based on the image data 302. The feature map 322 can take the form of a 2D grid-based feature map smaller in size than individual image (s) included in the image data 302.
With continued reference to Fig. 3, the controller can feed the feature map 322 into a depth estimating neural network 310 (e.g., including one or more CNNs) , which outputs depth information 312 (e.g., a depth map) . Illustratively, the depth estimating neural network 310 can be implemented to estimate depth information that corresponds to different locations defined within the image data 302. For example, for a target image included in the image data 302, the depth estimating neural network 310 can analyze multiple frames of images before and/or after the target image and output depth information 312 that includes an estimated depth value (e.g., distance from the mobile platform) for each pixel of the target image. Alternatively, depth information (e.g., a depth map) can be determined based on disparity data generated, directly or indirectly, from at least one of a stereo camera or mono camera.
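A depth-estimating head operating on the shared feature map can be sketched as below. The single-convolution head, the 16x upsampling factor (matching a 45*80 feature map and a 720*1280 image), and the channel count are assumptions, not the prescribed architecture.

```python
import torch
from torch import nn

class DepthHead(nn.Module):
    """Sketch of a depth-estimating head: maps the shared feature map to
    a per-pixel depth value at image resolution."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, feature_map):
        depth = torch.relu(self.conv(feature_map))      # coarse depth per grid block
        return nn.functional.interpolate(
            depth, scale_factor=16, mode="bilinear", align_corners=False)

depth_map = DepthHead()(torch.randn(1, 256, 45, 80))
print(depth_map.shape)   # torch.Size([1, 1, 720, 1280])
```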
The controller can feed the feature map 322 into an intermediate neural network such as a candidate region neural network 330 (e.g., candidate region neural network 830 that can include one or more CNNs) . The candidate region neural network 330 can be structurally equivalent, similar, or dissimilar to the candidate region neural network 230 used in the method 200 as described above with reference to Fig. 2. In various embodiments, the candidate region neural network 330 can include one or more layers of convolution and/or pooling operations. For example, individual neurons can apply a respective convolution operation to their inputs, and the outputs of a cluster of neurons at one layer can be combined into a single neuron in the next layer. Illustratively, the candidate region neural network 330 can output candidate region data 332 that include or indicate candidate region (s) . Each candidate region can indicate at least a portion of an obstacle within the environment.
As discussed above with reference to Fig. 4B, a candidate region can be a group of connected or disconnected grid blocks including a base block 412. Each grid block can correspond to an individual pixel or a block (e.g., 2x4 in size) of pixels in the obtained image. Using each grid block of the 2D grid as a base block 412, the candidate region neural network 330 can output (1) a likelihood (e.g., an estimated probability) that the base block 412 indicates at least some part of an obstacle and (2) a corresponding candidate region 410 that includes the base block 412 and potentially indicates at least a portion of the obstacle. As discussed above with reference to Fig. 4B, various criteria can be applied to the candidate regions, their associated base blocks and/or likelihoods to select a subset of the data for output.
By way of an implementation example with reference to Fig. 8, the obtained image has a resolution of 100*50 (i.e., the image has 5000 pixels) and the image includes 2D representations of one or more obstacles including obstacle A. With reference to Fig. 8, the feature transformation module 840 receives the feature map 822 corresponding to the obtained image as input and transforms it to feed into (a) the likelihood estimation module 850 that can predict the probability that each pixel in the image “belongs” to an obstacle. Illustratively, the likelihood estimation module 850 predicts that 100 pixels in the image “belong” to obstacle A with respective probabilities.
With continued reference to Fig. 8, the feature transformation module 840 also feeds the transformed feature map to the 2D grid regression module 860 that can determine a corresponding 2D region representing the obstacle to which the pixel (or grid block) “belongs. ” Illustratively, because 100 pixels “belong” to obstacle A, the 2D grid regression module 860 can determine 100 corresponding 2D regions (e.g., 2D frames) that represent or indicate the obstacle A.
Based on the estimated probabilities that each of the 100 pixels “belongs” to the obstacle A, a non-maximum suppression method (or other suitable filtering methods) can be used to remove a subset of regions (e.g., those that overlap with one another beyond a threshold degree) from the 100 2D regions. The remaining 2D regions can be retained as output candidate regions for obstacle A.
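A standard non-maximum suppression routine of the kind referred to above is sketched below; the (x1, y1, x2, y2) box encoding and the overlap threshold value are illustrative.

```python
import numpy as np

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-probability 2D regions and drop regions whose
    overlap (IoU) with an already-kept region exceeds the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```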
Referring back to Fig. 3, the controller can feed the depth information 312, feature map 322, and candidate region data 332 into an obstacle detection neural network 340 (e.g., including one or more CNNs) . The obstacle detection neural network 340 can be structurally equivalent, similar, or dissimilar to the obstacle detection neural network 240 used in the method 200 as described above with reference to Fig. 2. As will be discussed in further detail below, the obstacle detection neural network 340 can output one or more status attributes of detected obstacles 342 including predictions of type, pose, orientation, 3D position, 3D size, and/or other attributes of the detected obstacles. The controller can output commands or instructions based on the status attribute (s) of detected obstacles 342 to control at least certain motion (e.g., acceleration, deceleration, turning, etc. ) of the mobile platform to avoid contact with the detected obstacles.
Fig. 5 is a flowchart illustrating an obstacle detection process 500 using an obstacle detection neural network 540 (e.g., corresponding to the obstacle detection neural network 240 used in the method 200 as described above with reference to Fig. 2 or the obstacle detection neural network 340 used in the method 300 as described above with reference to Fig. 3) , in accordance with some embodiments of the presently disclosed technology. With reference to Fig. 5, the obstacle detection neural network 540 can include an initial position sub-network 510 (e.g., a first subnetwork including one or more ANNs) , a region feature sub-network 520 (e.g., a second subnetwork including one or more ANNs) , and a 3D prediction sub-network 530 (e.g., including one or more ANNs) .
The initial position sub-network 510 can receive depth information 502 (e.g., the point cloud data 202 as in method 200 or the depth information 312 as in method 300) and candidate region data 504 (e.g., the candidate region data 232 as in method 200 or the candidate region data 332 outputted from the candidate region neural network 330 as in method 300) as input. If the depth information 502 is not in the form of a point cloud, embodiments of the presently disclosed technology include converting the depth information 502 into point cloud data, for example, based on the extrinsic and/or intrinsic calibration parameters of an associated camera that generated image data 204 or 302. For each candidate region (e.g., a 2D region on an obtained image or associated 2D grid) included in the candidate region data 504, the initial position sub-network 510 can (a) identify a 3D region (e.g., a subset of a point cloud) that corresponds to the candidate region using the depth information 502, and (b) compute and output an initial 3D position for a potential obstacle associated with the identified 3D region, based on various statistics (e.g., a mean or median value of the 3D coordinates of the corresponding scanning points) that characterize the 3D region.
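The initial-position computation can be sketched as gathering the scanning points whose image projection falls inside a candidate region and taking a robust statistic of their 3D coordinates. The function name, inputs, and the choice of the median are illustrative assumptions.

```python
import numpy as np

def initial_3d_position(points, pixel_coords, region):
    """points       : (N, 3) point cloud (or points recovered from a depth map).
    pixel_coords : (N, 2) projected (u, v) coordinates of those points.
    region       : (x1, y1, x2, y2) candidate region on the image/2D grid."""
    x1, y1, x2, y2 = region
    inside = ((pixel_coords[:, 0] >= x1) & (pixel_coords[:, 0] <= x2) &
              (pixel_coords[:, 1] >= y1) & (pixel_coords[:, 1] <= y2))
    if not np.any(inside):
        return None                               # no depth support for this region
    return np.median(points[inside], axis=0)      # initial 3D position estimate
```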
The region feature sub-network 520 (e.g., a fully connected neural network) can receive the candidate region data 504 (e.g., the candidate region data 232 as in method 200 or the candidate region data 332 as in method 300) and the feature map 506 (e.g., the feature map 222 as in method 200 or the feature map 322 as in method 300) as input, perform one or more layers of linear and/or non-linear feature transformation, and output region features for individual candidate regions included in the candidate region data 504. The size of the region features for each candidate region can be determined based on practical needs and/or computational resource constraints. As an example, the candidate regions are normalized to a fixed size, a fixed-length feature vector is then obtained based on one or more pooling operations, and multi-layer feature transformation is finally performed to obtain the region features for each candidate region.
Illustratively, each candidate region can correspond to a respective 2D region in an original image, which serves as the basis for generating the feature map 506. Using a relationship (e.g., ratio) between the size of the original image and the size of the feature map, the controller can identify a respective reduced 2D region on the feature map that corresponds to each candidate region. Various operations can be performed on the feature map within the reduced 2D regions to generate region features. In various embodiments, the operations can include pooling and/or feature transformations (e.g., full connection and/or convolution) . As an example, each reduced 2D region can be normalized, so that each region feature is a fixed-length feature vector calculated from the respective reduced 2D region identified on the feature map.
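The mapping of a candidate region onto the smaller feature map, followed by pooling to a fixed size, can be sketched as follows. The 1/16 scale, the 7*7 output size, and the use of adaptive max pooling are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def region_features(feature_map, region, scale, output_size=7):
    """Map a candidate region from image coordinates onto the feature map,
    pool the reduced 2D region to a fixed size, and flatten it into a
    fixed-length feature vector."""
    x1, y1, x2, y2 = [int(round(c * scale)) for c in region]
    crop = feature_map[:, :, y1:y2 + 1, x1:x2 + 1]        # reduced 2D region
    pooled = F.adaptive_max_pool2d(crop, output_size)     # normalize to fixed size
    return pooled.flatten(start_dim=1)                    # fixed-length vector

fmap = torch.randn(1, 256, 45, 80)
vec = region_features(fmap, region=(100, 200, 400, 500), scale=1 / 16)
print(vec.shape)   # torch.Size([1, 256 * 7 * 7])
```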
The 3D prediction sub-network 530 can receive outputs from the initial position sub-network 510 and the region feature sub-network 520, and output status attribute (s) of detected obstacles 542. For example, the 3D prediction sub-network 530 can predict and output the type, pose, orientation, 3D position, 3D size, and/or other attributes of the detected obstacles. The 3D prediction sub-network 530 can determine and output a confidence level for each candidate region to indicate a probability that the candidate region belongs to an obstacle. In some embodiments, the output is filtered based on the confidence level. For example, candidate regions whose confidence levels fall below a threshold can be excluded from the output. Illustratively, one or more controllers of a mobile platform can perform automated or semi-automated mapping, navigation, emergency maneuvers, or other actions that control certain movements of the mobile platform using the various status attributes of the detected obstacles.
In some embodiments, the 3D prediction sub-network 530 includes one or more sub-modules (e.g., neural network branches) that predict at least one of the semantic category (e.g., a type of an obstacle) , 2D region (e.g., 2D region of the image data) , orientation, 3D size, and 3D position of the obstacle (s) . Each sub-module can include a linear transformation process that maps the output (e.g., region features for individual candidate regions) of the region feature sub-network 520 to respective dimensions of the sub-module output.
Illustratively, the semantic category prediction sub-module can predict a confidence level that indicates the probability that a candidate region “belongs” to a  semantic category (e.g., the probability of “belonging” to a vehicle, pedestrian, bicycle or background) .
Illustratively, the 2D region prediction sub-module can use a center point, length, and width to represent a 2D region of a corresponding obstacle. The 2D region prediction sub-module can estimate an offset from each candidate region to a corresponding obstacle′s 2D region, thereby obtaining the position (s) of 2D regions that indicate obstacle (s) .
Illustratively, the orientation prediction sub-module can divide the angle range between -180 degrees and +180 degrees into multiple intervals (e.g., two intervals of [-180°, 0°] and [0°, 180°] ) , and calculate the center for each interval. The orientation prediction sub-module can predict a specific interval to which an obstacle's orientation angle belongs, and calculate the difference between the obstacle's orientation angle and the center of the interval to which it belongs, thereby obtaining the orientation angle of the obstacle.
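The orientation decoding described above can be sketched as follows, assuming the two intervals [-180°, 0°] and [0°, 180°] with centers at -90° and +90°, and assuming the network outputs a bin index plus a residual relative to the bin center.

```python
import numpy as np

BIN_CENTERS = np.array([-90.0, 90.0])   # centers of the two assumed intervals

def decode_orientation(bin_index, residual_deg):
    """Recover the orientation angle from the predicted interval and the
    predicted difference to that interval's center."""
    return BIN_CENTERS[bin_index] + residual_deg

print(decode_orientation(1, -25.0))   # e.g., an obstacle oriented at 65 degrees
```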
Illustratively, the 3D size prediction sub-module can perform predictions using the average length, width, and height (or other measurements relating to 3D size) of the 3D representation (e.g., frame) of obstacles of each semantic category. The average measurements can be obtained from training data collected offline. In the prediction process, the 3D size prediction sub-module can predict a ratio of the 3D size of an obstacle to the average 3D size of a corresponding category, thereby obtaining the 3D size attribute of the obstacle.
Illustratively, the 3D position prediction sub-module can predict the offset between the 3D position of an obstacle and the initial 3D position of a corresponding input candidate region, thereby obtaining the 3D position of the obstacle.
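Decoding the 3D size and 3D position attributes can be sketched as below. The per-category average sizes, the ratio-based size encoding, and the offset-based position encoding follow the description above, but the numbers and names are illustrative assumptions.

```python
import numpy as np

AVG_SIZE = {"vehicle": np.array([4.5, 1.8, 1.6]),      # length, width, height (m), illustrative
            "pedestrian": np.array([0.6, 0.6, 1.7])}

def decode_3d_size(category, size_ratio):
    """3D size = predicted ratio * average size of the semantic category."""
    return AVG_SIZE[category] * np.asarray(size_ratio)

def decode_3d_position(initial_position, offset):
    """3D position = initial position (from depth) + predicted offset."""
    return np.asarray(initial_position) + np.asarray(offset)

print(decode_3d_size("vehicle", [1.1, 0.95, 1.0]))
print(decode_3d_position([12.0, 0.5, 1.2], [0.3, -0.1, 0.05]))
```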
In some embodiments, based on the confidence levels of the semantic categories as predicted, the 3D prediction sub-network 530 can further filter its output to retain only those with a confidence level greater than a certain threshold. Various suitable filtering methods (e.g., non-maximum suppression) can be used based on practical needs and/or computational efficiency.
In accordance with the above description, status attributes such as the semantic category, 2D region, orientation, 3D size and 3D position of obstacle (s) in the mobile platform's current road scene can be obtained. This output can be provided to downstream applications of the mobile platform, such as route planning and control, to facilitate automated navigation, autonomous driving, or other functionalities.
The various ANN components used in accordance with embodiments of the presently disclosed technology can be trained in various ways as deemed proper by those skilled in the art. Illustratively, training samples can be pre-collected, each sample containing input data (e.g., point clouds and corresponding images for method 200, stereo images for method 300) and their associated 3D region (s) that are manually identified to represent obstacle (s) . The parameters of the neural network (s) can be learned through a sufficiently large number of training samples.
Illustratively, the base neural network 220, candidate region neural network 230, and obstacle detection neural network 240 can be trained separately or jointly. When trained separately, the training data for the different neural networks can be independent from one another (e.g., based on different time and context) . When trained jointly, the training data for the different neural networks correspond with one another (e.g., associated with a same series of point cloud and/or image frames) . In some embodiments, a part (e.g., the base neural network 220 and the candidate region neural network 230) of the ANN hierarchy used in method 200 is trained jointly while at least another part (e.g., the obstacle detection neural network 240) of the ANN hierarchy is trained separately.
Similarly, the depth estimating neural network 310, the base neural network 320, the candidate region neural network 330, and the obstacle detection neural network 340 can be trained separately or jointly. For example, suitable training methods can include collecting various images and LiDAR point clouds corresponding to the images as training data. Because the LiDAR point clouds provide depth measurements of the environment depicted by the images, the base neural network 320 and the depth estimating neural network 310 can be trained jointly based on the images (as input to the base neural network 320) and their associated depth measurements (as output from the  depth estimating neural network 310) . Proper network parameters can be obtained when the training is performed on sufficient data samples.
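A minimal sketch of such joint training is given below. The `base` and `depth_head` modules and the data loader are placeholders, and the L1 loss and optimizer settings are assumptions rather than the training recipe prescribed by the disclosure.

```python
import torch
from torch import nn

def train_depth(base, depth_head, loader, epochs=10, lr=1e-4):
    """Jointly train the base network and the depth-estimating head using
    LiDAR-derived depth maps as supervision."""
    params = list(base.parameters()) + list(depth_head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.L1Loss()
    for _ in range(epochs):
        for images, lidar_depth, valid_mask in loader:   # mask marks pixels with LiDAR returns
            pred = depth_head(base(images))
            loss = criterion(pred[valid_mask], lidar_depth[valid_mask])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return base, depth_head
```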
Further, the obstacle detection neural network 540 can be trained separately or jointly with other neural networks disclosed herein. For example, various images and LiDAR point clouds corresponding to the images can be collected. The LiDAR point clouds can include manually marked 3D regions that represent obstacles. Joint training of the neural networks of method 200 can be based on the images and their corresponding LiDAR point clouds (as input) and various attributes of the corresponding marked 3D regions (as output) . Proper network parameters can be obtained when the training is performed on sufficient data samples.
Fig. 6 illustrates examples of mobile platforms configured in accordance with various embodiments of the presently disclosed technology. As illustrated, a representative mobile platform as disclosed herein may include at least one of an unmanned aerial vehicle (UAV) 602, a manned aircraft 604, an autonomous car 606, a self-balancing vehicle 608, a terrestrial robot 610, a smart wearable device 612, a virtual reality (VR) head-mounted display 614, or an augmented reality (AR) head-mounted display 616.
Fig. 7 is a block diagram illustrating an example of the architecture for a computer system 700 or other control device that can be utilized to implement various portions of the presently disclosed technology. In Fig. 7, the computer system 700 includes one or more processors 705 and memory 710 connected via an interconnect 725. The interconnect 725 may represent any one or more separate physical buses, point-to-point connections, or both, connected by appropriate bridges, adapters, or controllers. The interconnect 725, therefore, may include, for example, a system bus, a Peripheral Component Interconnect (PCI) bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB) , IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, sometimes referred to as “Firewire. ”
The processor (s) 705 may include central processing units (CPUs) to control the overall operation of, for example, the host computer. In certain embodiments, the processor (s) 705 accomplish this by executing software or firmware stored in memory 710. The processor (s) 705 may be, or may include, one or more programmable general-purpose or special-purpose microprocessors, digital signal processors (DSPs) , programmable controllers, application specific integrated circuits (ASICs) , programmable logic devices (PLDs) , or the like, or a combination of such devices.
The memory 710 can be or include the main memory of the computer system. The memory 710 represents any suitable form of random access memory (RAM) , read-only memory (ROM) , flash memory, or the like, or a combination of such devices. In use, the memory 710 may contain, among other things, a set of machine instructions which, when executed by processor (s) 705, causes the processor (s) 705 to perform operations to implement embodiments of the presently disclosed technology. In some embodiments, the memory 710 can contain an operating system (OS) 730 that manages computer hardware and software resources and provides common services for computer programs.
Also connected to the processor (s) 705 through the interconnect 725 is an (optional) network adapter 715. The network adapter 715 provides the computer system 700 with the ability to communicate with remote devices, such as the storage clients, and/or other storage servers, and may be, for example, an Ethernet adapter or Fiber Channel adapter.
The techniques described herein can be implemented by, for example, programmable circuitry (e.g., one or more microprocessors) programmed with software and/or firmware, or entirely in special-purpose hardwired circuitry, or in a combination of such forms. Special-purpose hardwired circuitry may be in the form of, for example, one or more application-specific integrated circuits (ASICs) , programmable logic devices (PLDs) , field-programmable gate arrays (FPGAs) , etc.
Software or firmware for use in implementing the techniques introduced here may be stored on a machine-readable storage medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors. A "machine-readable storage medium, ” as the term is used herein, includes any mechanism that can store information in a form accessible by a machine (a machine may be, for example, a computer, network device, cellular phone, personal digital assistant (PDA) , manufacturing tool, any device with one or more processors, etc. ) . For example, a machine-accessible storage medium includes recordable/non-recordable media (e.g., read-only memory (ROM) ; random access memory (RAM) ; magnetic disk storage media; optical storage media; flash memory devices; etc. ) , etc.
The term "logic, ” as used herein, can include, for example, programmable circuitry programmed with specific software and/or firmware, special-purpose hardwired circuitry, or a combination thereof.
Some embodiments of the disclosure have other aspects, elements, features, and/or steps in addition to or in place of what is described above. These potential additions and replacements are described throughout the rest of the specification. Reference in this specification to “various embodiments, ” “certain embodiments, ” or “some embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. These embodiments, even alternative embodiments (e.g., those referenced as “other embodiments” ) , are not mutually exclusive of one another. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not other embodiments. For example, some embodiments use depth information generated from stereo camera (s) , while other embodiments can use depth information generated from LiDAR (s) , 3D-ToF, or RGB-D. Still further embodiments can use depth information generated from a combination of sensors. As used herein, the phrase “and/or” as in “A and/or B” refers to A alone, B alone, and both A and B.
To the extent any materials incorporated by reference herein conflict with the present disclosure, the present disclosure controls.

Claims (68)

  1. A computer-implemented method for detecting obstacles using a laser unit and a camera unit both carried by a common autonomous vehicle, comprising:
    determining preliminary features based, at least in part, on projecting a point cloud obtained by the laser unit onto a two-dimensional grid corresponding to an image obtained by the camera, wherein the point cloud includes three-dimensional measurements of at least a portion of an environment surrounding the autonomous vehicle and wherein the image includes a two-dimensional representation of the portion of the environment surrounding the autonomous vehicle;
    feeding the preliminary features and the image to a base neural network to generate a feature map;
    feeding the feature map to an intermediate neural network to generate a plurality of candidate regions of the image that indicate at least a portion of an obstacle within the environment surrounding the autonomous vehicle;
    feeding the point cloud, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of one or more obstacles within the environment; and
    causing navigation of the autonomous vehicle based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles.
  2. A computer-implemented method for detecting obstacles using a stereo camera unit carried by an autonomous vehicle, comprising:
    feeding image data obtained by the stereo camera unit to a base neural network to generate a feature map, wherein the image data includes a two-dimensional  representation of at least a portion of an environment surrounding the autonomous vehicle;
    feeding the feature map to an intermediate neural network to generate a plurality of candidate regions of the image data that indicate at least a portion of an obstacle within the environment surrounding the autonomous vehicle;
    feeding the feature map to a depth-estimation neural network to generate a depth map corresponding to the image data;
    feeding the depth map, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of one or more obstacles within the environment; and
    causing navigation of the autonomous vehicle based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles.
  3. A computer-implemented method for detecting obstacles using one or more sensors carried by a mobile platform, comprising:
    obtaining sensor data that indicate at least a portion of an environment surrounding the mobile platform from the one or more sensors;
    determining depth information, a feature map, and a plurality of candidate regions based at least partly on the sensor data, wherein each candidate region indicates at least a portion of an obstacle within the environment; and
    feeding the depth information, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.
  4. The method of claim 3, wherein the one or more sensors include at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  5. The method of claim 3, wherein the depth information is determined based, at least in part, on a point cloud.
  6. The method of claim 5, wherein the point cloud is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  7. The method of claim 3, wherein the depth information includes a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from at least one of a stereo camera or mono camera.
  8. The method of claim 3, wherein the depth information is generated by feeding the feature map to a depth estimating neural network separate from the obstacle detection neural network.
  9. The method of claim 3, wherein the sensor data includes one or more images and wherein the feature map is generated by at least feeding the one or more images to a base neural network separate from the obstacle detection neural network.
  10. The method of claim 9, wherein the sensor data includes a point cloud and wherein the feature map is generated based, at least in part, on projecting the point cloud onto a 2D grid defined in accordance with at least one of the one or more images.
  11. The method of claim 10, wherein the projecting is based, at least in part, on extrinsic and/or intrinsic calibration parameters with respect to at least one sensor that generated the one or more images.
  12. The method of claim 10, wherein the projected data includes at least one of a height, distance, or angle measurement for individual grid blocks of the 2D grid.
  13. The method of claim 10, wherein the feature map is generated by further feeding the projected data to the base neural network.
  14. The method of claim 9, wherein the feature map is smaller in size than at least one of the one or more images.
  15. The method of claim 9, wherein at least one of the base neural network or the obstacle detection neural network includes one or more convolutional and/or pooling layers.
  16. The method of claim 9, further comprising feeding the feature map to an intermediate neural network to generate the plurality of candidate regions defined in accordance with the image data.
  17. The method of claim 16, wherein each candidate region is a two-dimensional (2D) region that includes a respective target pixel of the image data.
  18. The method of claim 17, wherein the respective target pixel is associated with a probability of indicating at least a portion of the obstacle within the environment.
  19. The method of claim 16, wherein at least two of the base neural network, the intermediate neural network, and the obstacle detection neural network are jointly trained.
  20. The method of claim 3, wherein the obstacle detection neural network includes a first subnetwork configured to determine an initial 3D position of an obstacle for each candidate region of the subset based, at least in part, on the depth information and the at least a subset of the plurality of candidate regions.
  21. The method of claim 20, wherein the obstacle detection neural network includes a second subnetwork configured to generate one or more region features for each candidate region of the subset based, at least in part, on the at least a subset of the plurality of candidate regions and the feature map.
  22. The method of claim 21, wherein the obstacle detection neural network includes a third subnetwork configured to predict, for each candidate region of the subset, at least one of a type, pose, orientation, 3D position, or 3D size of an obstacle corresponding to the candidate region, based at least in part on the initial 3D position and the one or more region features.
  23. The method of claim 3, wherein the mobile platform includes at least one of an unmanned aerial vehicle (UAV) , a manned aircraft, an autonomous car, a self-balancing vehicle, or a robot.
  24. The method of claim 3, further comprising causing navigation of the mobile platform based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles within the environment.
  25. A non-transitory computer-readable medium storing computer-executable instructions that, when executed, cause one or more processors associated with a mobile platform to perform actions comprising:
    obtaining sensor data that indicate at least a portion of an environment surrounding the mobile platform from one or more sensors carried by the mobile platform;
    determining depth information, a feature map, and a plurality of candidate regions based at least partly on the sensor data, wherein each candidate region indicates at least a portion of an obstacle within the environment; and
    feeding the depth information, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.
  26. The computer-readable medium of claim 25, wherein the one or more sensors include at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  27. The computer-readable medium of claim 25, wherein the depth information is determined based, at least in part, on a point cloud.
  28. The computer-readable medium of claim 27, wherein the point cloud is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  29. The computer-readable medium of claim 25, wherein the depth information includes a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from at least one of a stereo camera or mono camera.
  30. The computer-readable medium of claim 25, wherein the depth information is generated by feeding the feature map to a depth estimating neural network separate from the obstacle detection neural network.
  31. The computer-readable medium of claim 25, wherein the sensor data includes one or more images and wherein the feature map is generated by at least feeding the one or more images to a base neural network separate from the obstacle detection neural network.
  32. The computer-readable medium of claim 31, wherein the sensor data includes a point cloud and wherein the feature map is generated based, at least in part, on projecting the point cloud onto a 2D grid defined in accordance with at least one of the one or more images.
  33. The computer-readable medium of claim 32, wherein the projecting is based, at least in part, on extrinsic and/or intrinsic calibration parameters with respect to at least one sensor that generated the one or more images.
  34. The computer-readable medium of claim 32, wherein the projected data includes at least one of a height, distance, or angle measurement for individual grid blocks of the 2D grid.
  35. The computer-readable medium of claim 32, wherein the feature map is generated by further feeding the projected data to the base neural network.
  36. The computer-readable medium of claim 31, wherein the feature map is smaller in size than at least one of the one or more images.
  37. The computer-readable medium of claim 31, wherein at least one of the base neural network or the obstacle detection neural network includes one or more convolutional and/or pooling layers.
  38. The computer-readable medium of claim 31, wherein the actions further comprise feeding the feature map to an intermediate neural network to generate the plurality of candidate regions defined in accordance with the image data.
  39. The computer-readable medium of claim 38, wherein each candidate region is a two-dimensional (2D) region that includes a respective target pixel of the image data.
  40. The computer-readable medium of claim 39, wherein the respective target pixel is associated with a probability of indicating at least a portion of the obstacle within the environment.
  41. The computer-readable medium of claim 38, wherein at least two of the base neural network, the intermediate neural network, and the obstacle detection neural network are jointly trained.
  42. The computer-readable medium of claim 25, wherein the obstacle detection neural network includes a first subnetwork configured to determine an initial 3D position of  an obstacle for each candidate region of the subset based, at least in part, on the depth information and the at least a subset of the plurality of candidate regions.
  43. The computer-readable medium of claim 42, wherein the obstacle detection neural network includes a second subnetwork configured to generate one or more region features for each candidate region of the subset based, at least in part, on the at least a subset of the plurality of candidate regions and the feature map.
  44. The computer-readable medium of claim 43, wherein the obstacle detection neural network includes a third subnetwork configured to predict, for each candidate region of the subset, at least one of a type, pose, orientation, 3D position, or 3D size of an obstacle corresponding to the candidate region, based at least in part on the initial 3D position and the one or more region features.
  45. The computer-readable medium of claim 25, wherein the mobile platform includes at least one of an unmanned aerial vehicle (UAV) , a manned aircraft, an autonomous car, a self-balancing vehicle, or a robot.
  46. The computer-readable medium of claim 25, wherein the actions further comprise causing navigation of the mobile platform based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles within the environment.
  47. A mobile platform including a programmed controller that at least partially controls one or more motions of the mobile platform, wherein the programmed controller includes one or more processors configured to:
    obtain sensor data that indicate at least a portion of an environment surrounding the mobile platform from one or more sensors carried by the mobile platform;
    determine depth information, a feature map, and a plurality of candidate regions based at least partly on the sensor data, wherein each candidate region indicates at least a portion of an obstacle within the environment; and
    feed the depth information, the feature map, and at least a subset of the plurality of candidate regions to an obstacle detection neural network to predict at least one status attribute of one or more obstacles within the environment.
  48. The mobile platform of claim 47, wherein the one or more sensors include at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  49. The mobile platform of claim 47, wherein the depth information is based, at least in part, on a point cloud.
  50. The mobile platform of claim 49, wherein the point cloud is obtained based on sensor data generated by at least one of a LiDAR, a radar, a Time-of-Flight (ToF) camera, a stereo camera, or a mono camera.
  51. The mobile platform of claim 47, wherein the depth information includes a depth map determined based, at least in part, on disparity data generated, directly or indirectly, from at least one of a stereo camera or mono camera.
  52. The mobile platform of claim 47, wherein the depth information is generated by feeding the feature map to a depth estimating neural network separate from the obstacle detection neural network.
  53. The mobile platform of claim 47, wherein the sensor data includes one or more images and wherein the feature map is generated by at least feeding the one or more images to a base neural network separate from the obstacle detection neural network.
  54. The mobile platform of claim 53, wherein the sensor data includes a point cloud and wherein the feature map is generated based, at least in part, on projecting the  point cloud onto a 2D grid defined in accordance with at least one of the one or more images.
  55. The mobile platform of claim 54, wherein the projecting is based, at least in part, on extrinsic and/or intrinsic calibration parameters with respect to at least one sensor that generated the one or more images.
  56. The mobile platform of claim 54, wherein the projected data includes at least one of a height, distance, or angle measurement for individual grid blocks of the 2D grid.
  57. The mobile platform of claim 54, wherein the feature map is generated by further feeding the projected data to the base neural network.
  58. The mobile platform of claim 53, wherein the feature map is smaller in size than at least one of the one or more images.
  59. The mobile platform of claim 53, wherein at least one of the base neural network or the obstacle detection neural network includes one or more convolutional and/or pooling layers.
  60. The mobile platform of claim 53, wherein the one or more processors are further configured to feed the feature map to an intermediate neural network to generate the plurality of candidate regions defined in accordance with the image data.
  61. The mobile platform of claim 60, wherein each candidate region is a two-dimensional (2D) region that includes a respective target pixel of the image data.
  62. The mobile platform of claim 61, wherein the respective target pixel is associated with a probability of indicating at least a portion of the obstacle within the environment.
  63. The mobile platform of claim 60, wherein at least two of the base neural network, the intermediate neural network, and the obstacle detection neural network are jointly trained.
  64. The mobile platform of claim 47, wherein the obstacle detection neural network includes a first subnetwork configured to determine an initial 3D position of an obstacle for each candidate region of the subset based, at least in part, on the depth information and the at least a subset of the plurality of candidate regions.
  65. The mobile platform of claim 64, wherein the obstacle detection neural network includes a second subnetwork configured to generate one or more region features for each candidate region of the subset based, at least in part, on the at least a subset of the plurality of candidate regions and the feature map.
  66. The mobile platform of claim 65, wherein the obstacle detection neural network includes a third subnetwork configured to predict, for each candidate region of the subset, at least one of a type, pose, orientation, 3D position, or 3D size of an obstacle corresponding to the candidate region, based at least in part on the initial 3D position and the one or more region features.
  67. The mobile platform of claim 47, wherein the mobile platform includes at least one of an unmanned aerial vehicle (UAV) , a manned aircraft, an autonomous car, a self-balancing vehicle, or a robot.
  68. The mobile platform of claim 47, wherein the one or more processors are further configured to cause navigation of the mobile platform based, at least in part, on the predicted at least one of a type, pose, orientation, three-dimensional position, or three-dimensional size of the one or more obstacles within the environment.
PCT/CN2019/072703 2019-01-22 2019-01-22 Neural network based obstacle detection for mobile platforms, and associated systems and methods WO2020150904A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/072703 WO2020150904A1 (en) 2019-01-22 2019-01-22 Neural network based obstacle detection for mobile platforms, and associated systems and methods
CN201980087253.8A CN113228043A (en) 2019-01-22 2019-01-22 System and method for obstacle detection and association of mobile platform based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/072703 WO2020150904A1 (en) 2019-01-22 2019-01-22 Neural network based obstacle detection for mobile platforms, and associated systems and methods

Publications (1)

Publication Number Publication Date
WO2020150904A1 true WO2020150904A1 (en) 2020-07-30

Family

ID=71736473

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/072703 WO2020150904A1 (en) 2019-01-22 2019-01-22 Neural network based obstacle detection for mobile platforms, and associated systems and methods

Country Status (2)

Country Link
CN (1) CN113228043A (en)
WO (1) WO2020150904A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114236553B (en) * 2022-02-23 2022-06-10 杭州蓝芯科技有限公司 Autonomous mobile robot positioning method based on deep learning
TWI817579B (en) * 2022-06-22 2023-10-01 鴻海精密工業股份有限公司 Assistance method for safety driving, electronic device and computer-readable storage medium
TWI817580B (en) * 2022-06-22 2023-10-01 鴻海精密工業股份有限公司 Assistance method for safety driving, electronic device and computer-readable storage medium
TWI817578B (en) * 2022-06-22 2023-10-01 鴻海精密工業股份有限公司 Assistance method for safety driving, electronic device and computer-readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170323154A1 (en) * 2016-05-05 2017-11-09 Brunswick Corporation Person detection in a marine environment
CN108229366A (en) * 2017-12-28 2018-06-29 北京航空航天大学 Deep learning vehicle-installed obstacle detection method based on radar and fusing image data
CN108227710A (en) * 2017-12-29 2018-06-29 商汤集团有限公司 Automatic Pilot control method and device, electronic equipment, program and medium
US20180336425A1 (en) * 2017-05-16 2018-11-22 Nec Laboratories America, Inc. Pruning filters for efficient convolutional neural networks for image recognition in vehicles
CN109145680A (en) * 2017-06-16 2019-01-04 百度在线网络技术(北京)有限公司 A kind of method, apparatus, equipment and computer storage medium obtaining obstacle information


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112363501A (en) * 2020-10-30 2021-02-12 广东杜尼智能机器人工程技术研究中心有限公司 Obstacle avoidance method, device and system of unmanned sweeping vehicle and storage medium
CN112648999A (en) * 2020-11-30 2021-04-13 南京航空航天大学 Brain-like navigation method based on multi-scale grid cell path integral
CN112560258A (en) * 2020-12-10 2021-03-26 中国第一汽车股份有限公司 Test method, device, equipment and storage medium
CN112560258B (en) * 2020-12-10 2023-02-21 中国第一汽车股份有限公司 Test method, device, equipment and storage medium
CN112801271A (en) * 2021-01-22 2021-05-14 北京市商汤科技开发有限公司 Method for generating neural network, data processing method and intelligent driving control method
CN112801271B (en) * 2021-01-22 2023-04-07 北京市商汤科技开发有限公司 Method for generating neural network, data processing method and intelligent driving control method
CN114115231A (en) * 2021-10-25 2022-03-01 南京工业大学 Mobile robot space pose point cloud correction method and system suitable for hospital scene
CN114115231B (en) * 2021-10-25 2023-07-25 南京工业大学 Space pose point cloud correction method and system for mobile robot
CN114677424A (en) * 2022-05-26 2022-06-28 浙江天新智能研究院有限公司 Point cloud data processing method for unattended screw ship unloader
CN115761137A (en) * 2022-11-24 2023-03-07 之江实验室 High-precision curved surface reconstruction method and device based on mutual fusion of normal vector and point cloud data
CN115761137B (en) * 2022-11-24 2023-12-22 之江实验室 High-precision curved surface reconstruction method and device based on mutual fusion of normal vector and point cloud data

Also Published As

Publication number Publication date
CN113228043A (en) 2021-08-06


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19911050

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19911050

Country of ref document: EP

Kind code of ref document: A1