WO2022141262A1

WO2022141262A1 - Object detection

Info

Publication number: WO2022141262A1
Application number: PCT/CN2020/141674
Authority: WO
Inventors: Xiaozhi Chen
Original assignee: SZ DJI Technology Co., Ltd.
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-07-07

Abstract

A three-dimensional (3D) object detecting method includes obtaining a first image containing a target object and a second image containing the target object; estimating an initial 3D bounding box corresponding to the target object based on the first image and the second image; performing local matching based on the initial 3D bounding box; and obtaining information of the target object by optimizing the local matching.

Description

OBJECT DETECTION

TECHNICAL FIELD

The present disclosure relates to the field of image processing and, more particularly, to camera-based three-dimensional object detection.

BACKGROUND

Robot perception has been implemented in many applications of robotics, such as autonomous driving. Three-dimensional (3D) object detection is the key technology of a robot perception system. For example, in the application of autonomous driving, 3D object detection technology is used to obtain 3D information (including 3D coordinates, size, and orientation, etc. ) of obstacles, such as vehicles, pedestrians, etc., in various driving scenarios, e.g., a road scene, so as to provide obstacle information for downstream decision-making and control units.

SUMMARY

In accordance with the disclosure, there is provided a three-dimensional (3D) object detecting method. The method includes obtaining a first image containing a target object and a second image containing the target object; estimating an initial 3D bounding box corresponding to the target object based on the first image and the second image; performing local matching based on the initial 3D bounding box; and obtaining information of the target object by optimizing the local matching.

Also in accordance with the disclosure, there is provided an image processing apparatus. The image processing apparatus includes a processor and a memory storing instructions. When executed by the processor, the instructions causes the processor to obtain a first image containing a target object and a second image containing the target object; estimate an initial 3D bounding box corresponding to the target object based on the first image and the second image; perform local matching based on the initial 3D bounding box; and obtain information of the target object by optimizing the local matching.

Also in accordance with the disclosure, there is provided a mobile platform. The mobile platform includes a first image sensor configured to obtain a first image containing a target object, a second image sensor configured to obtain a second image containing the target object, and a processor. The processor is configured to obtain the first image and the second image through the first image sensor and the second image sensor, respectively; estimate an initial 3D bounding box corresponding to the target object based on the first image and the second image; perform local matching based on the initial 3D bounding box; and obtain information of the target object by optimizing the local matching.

DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chat of an example three-dimensional (3D) object detecting method according to some embodiments of the present disclosure.

FIG. 2 is a flow chat of another example 3D object detecting method according to some embodiments of the present disclosure.

FIG. 3 is a flow chat of another example 3D object detecting method according to some embodiments of the present disclosure.

FIG. 4 a flow chat of another example 3D object detecting method according to some embodiments of the present disclosure.

FIG. 5 a flow chat of another example 3D object detecting method according to some embodiments of the present disclosure.

FIG. 6 schematically shows an example region-determination network according to some embodiments of the present disclosure.

FIG. 7 schematically shows an example regression network according to some embodiments of the present disclosure.

FIG. 8 schematically shows an example relationship between an observation angle and an orientation angle of a target object according to some embodiments of the present disclosure.

FIG. 9 schematically shows an example 3D bonding box according to some embodiments of the present disclosure.

FIG. 10 schematically shows an example application of a 3D object detecting method according to some embodiments of the present disclosure.

FIG. 11 schematically shows another example application of a 3D object detecting method according to some embodiments of the present disclosure.

FIG. 12 schematically shows an example image processing apparatus according to some embodiments of the present disclosure.

FIG. 13 schematically shows an example mobile platform according to some embodiments of the present disclosure.

DETAILED DESCRIPTION

Technical solutions of the present disclosure will be described with reference to the drawings. It will be appreciated that the described embodiments are some rather than all of the embodiments of the present disclosure. Other embodiments conceived by those having ordinary skills in the art on the basis of the described embodiments without inventive efforts should fall within the scope of the present disclosure.

Example embodiments will be described with reference to the accompanying drawings, in which the same numbers refer to the same or similar elements unless otherwise specified.

Unless otherwise defined, all the technical and scientific terms used herein have the same or similar meanings as generally understood by one of ordinary skill in the art. As described herein, the terms used in the specification of the present disclosure are intended to describe example embodiments, instead of limiting the present disclosure. The term “and/or” used herein includes any suitable combination of one or more related items listed.

Those of ordinary skill in the art will appreciate that the example elements and algorithm steps described below can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. One of ordinary skill in the art can use different methods to implement the described functions for different application scenarios, but such implementations should not be considered as beyond the scope of the present disclosure.

Three-dimensional (3D) object detection can be used by a robot perception system of an unmanned vehicle, e.g., an autonomous driving vehicle, to obtain 3D information (including 3D coordinates, size, and orientation, etc. ) of obstacles, such as vehicles, pedestrians, etc., such that a downstream controller of the unmanned vehicle can control the vehicle based on the obtained 3D information of the obstacles.

Laser sensors can be used for object detection as the laser sensors can directly obtain distance information of the obstacles. A laser-based object detection can generally realize relatively high precision of distance measurement. However, the laser sensors are usually expensive, which is not conducive to the mass production of autonomous driving vehicles. Further, the laser point cloud obtained by the laser sensors is usually sparse and lacks detailed information, which is not conducive to determine semantic information of obstacles.

Contrary to the laser-based object detection, a camera-based object detection, through processing images acquired by a camera, can acquire very rich details and semantic information of the obstacles, but are unable to obtain high-precision distance measurement. Technologies using camera-based object detection are mature for two-dimensional (2D) object detection, but it is difficult to recover the 3D information of the object from 2D images. Conventional 3D object detection methods based on binocular images usually rely on explicit calculation of dense pixel-level depth maps, and are very sensitive to illumination changes, weak texture areas, image noise, etc., thereby limiting the application of these methods.

Neural networks have been successfully applied to image classification and 2D image detection. However, how to apply deep learning for obtaining 3D information of the objects in the 2D images remains a problem to be solved.

To solve the above technical problems, one aspect of the present disclosure provides a camera-based 3D object detection method, which combines neural networks and geometric reasoning to realize the 3D positioning of 3D objects. According to the method, an initial 3D bounding box of the 3D objects through neural networks is obtained, and local matching, e.g., local dense matching, local sparse matching, local dense-sparse matching, local feature matching, etc., is performed to directly optimize the position of a 3D bounding box of the object, and this method does not require pixel-level calculation of a depth map. As such, the method of the present disclosure is more robust to lamination changes, weak texture areas, and noise, has higher positioning accuracy, and is applicable to a wider range of application scenarios. This method can achieve high-precision 3D positioning of objects based on cameras, e.g., binocular cameras alone, and this method is thus very suitable for obstacle perception in autonomous driving scenarios, and can also be extended to other applications of robotics, such as unmanned vehicles, logistics robots, cleaning robots, etc.

FIG. 1 flow chat of an example 3D object detecting method according to some embodiments of the present disclosure. As shown in FIG. 1, at 101, a first image and a second image are obtained, where the first image and the second image may both contain a target object. Each of the first image and the second image can be referred to as a 2D image.

The first image and the second image may be obtained via one or more image sensors of a camera, e.g., a binocular camera. In some embodiments, the first image and the second image may correspond to substantially the same photographing scene. The photographing scene may include one or more target objects. A representation of the one or more target objects in the first image matches a representation of the one or more target objects in the second image.

In some embodiments, the first image and the second image may be obtained via a first image sensor and a second image sensor, respectively, of a camera, e.g., a binocular camera. In some embodiments, the first image may be obtained via a left image sensor of the binocular camera, and the second image may be obtained via a right image sensor of the binocular camera.

In some embodiments, the first image and the second image are obtained by processing a first raw image and a second raw image, respectively. The first image sensor (e.g., the left image sensor) may capture the first raw image and the second image sensor (e.g., the right image sensor) may capture the second raw image. An epipolar line of the first raw image may not be aligned with or parallel to an epipolar line of the second raw image. The first raw image and the second raw image may be subjected to one or more processes such as undistortion, stereo rectification, and/or cropping, to obtain the first image and the second image, respectively. An epipolar line of the first image and an epipolar line of the second image may be both aligned with a same horizontal line. The first image and the second image obtained by processing the first raw image and the second raw image, respectively, may have a same orientation and may be translationally displaced with respect to each other. In some embodiments, the first raw image and the second image may be processed through stereo rectification using Bouguet method if internal parameters and distortion parameters of the two image sensors are known. In some embodiments, the first raw image and the second raw image may be processed through stereo rectification using Hartley method if the internal parameters and/or the distortion parameters of the two image sensors are unknown.

At 102, an initial 3D bounding box corresponding to the target object is estimated based on the first image and the second image.

The initial 3D bounding box may be determined based on the target object and the target object may include one of a building, a vehicle, and a pedestrian. For example, a 3D scale and/or a shape of the initial 3D bounding box may be determined based on the target object in the first image and/or the second image. The 3D scale of the initial bounding box may include a dimension (s) in one or more directions, an area (s) of one or more surfaces, or a volume, etc., of the initial bounding box.

In some embodiments, the initial 3D bounding box may have a default shape, e.g., a cuboid. In some other embodiments, a shape of the 3D bounding box may be selected from a plurality of predetermined shapes, such as a cuboid, a sphere, a cylinder, a cone, etc. For example, the shape of the initial bounding box may be selected based on the target object in the first image and/or the second image. The shape of the initial 3D bounding boxes corresponding to different target objects may be different.

At 103, local matching is performed based on the initial 3D bounding box.

Local matching refers to image matching within a first portion of the first image and a second portion of the second image. That is, image matching is not performed between the entire first and second images, but between portions of the first and second images. In some embodiments, the first portion of the first image and the second portion of the second image may both correspond to the initial 3D bounding box. In some embodiments, the first portion of the first image and the second portion of the second image may be obtained according to the initial 3D bounding box. Image matching may be feature-based matching, dense matching, sparse matching, dense-sparse matching etc. The feature-based matching compares and analyses features between images. The dense matching establishes the dense correspondence between images. Scale invariant feature transform (SIFT) algorithm may be used for a dense matching process. The sparse matching utilizes sparse an approximation theory, which deals with sparse solutions for systems of linear equations. Belief propagation (BP) algorithm may be used for a sparse matching process. The dense-sparse matching combines the dense matching and the sparse matching.

At 104, information of the target object is obtained by optimizing the local matching. The information of the target object may include at least one of a depth at a center of the target object, 3D scale information of the target object, an orientation angle of the target object, 3D positioning of the target object, or semantic information of the target object.

In some embodiments, as shown in FIG. 2, performing local matching on the initial 3D object (103) includes projecting the initial 3D bounding box onto one of the first image and the second image to obtain a visible portion of the initial 3D bounding box (201) , and performing local matching based on the visible portion of the initial 3D bounding box (202) .

The initial 3D bounding box may be projected onto the first image or the second image. The projection of the 3D bounding box onto the first image or the second image transforms the 3D bounding box to a 2D projection region in the first image or the second image. The 2D projection region may correspond to a visible portion of the initial 3D bounding box in the first image or the second image.

In some embodiments, the initial 3D bounding box may be projected onto the first image to obtain a 2D projection region as the visible portion of the initial 3D bounding box. The 2D projection region in the first image may be used to determine a corresponding 2D region in the second image based on a correspondence relationship, e.g., a matching relationship, between the first image and the second image, where the corresponding 2D region in the second image may be determined as the visible portion of the initial 3D bounding box in the second image. In some embodiments, the first image and the second image obtained after stereo rectification have only the translation difference, i.e., the first image the second image have the same coordinate in a vertical direction and have same height, and a disparity corresponding to each pixel within the 2D projection region in the first image may be used to determine the corresponding 2D region in the second image. For example, for pixel i with coordinates (u _i, v _i) within the 2D projection region in the first image, based on a disparity d _i corresponding to pixel i in the first image, the corresponding pixel in the second region may be determined to be at (u _i-d _i, v _i) in the second image. Similarly, the initial 3D bounding box may be projected onto the second image to obtain a 2D projection region as the visible portion of the initial 3D bounding box in the second image, and the 2D projection region in the second image may be used to determine a corresponding 2D region in the first image, where the corresponding 2D region in the first image may be determined as the visible portion of the initial 3D bounding box in the first image.

In some other embodiments, the initial 3D bounding box may be projected onto the first image to obtain a first 2D projection region as the visible portion of the initial 3D bounding box in the first image, and the initial 3D bounding box may be projected onto the second image to obtain a second 2D projection region as the visible portion of the initial 3D bounding box in the second image.

In some embodiments, as shown in FIG. 3, performing local matching based on the visible portion of the initial 3D bounding box (202) includes calculating a disparity corresponding to each pixel within the visible portion according to the initial 3D bounding box (301) , and determining a matching error of pixels within the visible portion based on the disparities corresponding to the pixels within the visible portion (302) .

In some embodiments, the disparity corresponding to each pixel within the visible portion may be calculated based on a depth corresponding to each pixel within the visible portion. The depth corresponding to each pixel within the visible portion may be calculated according to coordinates of a center of the initial 3D bounding box and a 3D scale of the initial 3D bounding box.

For example, the initial 3D bounding box may be projected to the first image, where the first image may be obtained via a first image sensor (e.g., a left image sensor) of a binocular camera, and the disparity corresponding to each pixel may be calculated based on the following equations:

z _i = f (z, s) Equation (2)

where, d _i is a disparity corresponding to pixel i in the visible portion; z _i is the depth corresponding to pixel i in the visible portion; f is a focal length of the first image sensor; b is a baseline of the binocular camera and represents a distance between the first image sensor (e.g., the left image sensor) and the second image sensor (e.g., the right image sensor) ; z is a depth at a center of the initial 3D bounding box; s is a 3D scale (e.g., a dimension (s) in one or more directions, an area (s) of one or more surfaces, a volume, etc. ) of the initial 3D bounding box; and f (z, s) represents a mapping relationship between the pixel depth and the center depth and the 3D scale of the initial 3D bounding box.

In some embodiments, based on a mapping relationship between the initial 3D bounding box and the visible portion (e.g., 2D projection region) of the initial 3D bounding box, a location in the initial 3D bounding box corresponding to pixel i in the visible portion may be determined. Based on the location, the depth at a center of the initial 3D bounding box, and a 3D scale (e.g., a dimension (s) in one or more directions, an area (s) of one or more surfaces, a volume, etc. ) of the initial 3D bounding box, the depth corresponding to pixel i in the visible portion may be determined.

In some embodiments, the matching error of pixels within the visible portion of the initial 3D bounding box in the first image and the second image may be a sum of pixel matching errors. A pixel matching error (or simply “matching error” ) refers to a matching error between a pixel within the visible portion of the initial 3D bonding box in the first image and a corresponding pixel in the second image (e.g., a corresponding pixel within the visible portion of the initial 3D bounding box in the second image) .

For example, the first image and the second image obtained after stereo rectification have only the translation difference, i.e., the first image the second image have the same coordinate in a vertical direction and have same height, and the matching error of the pixels within the visible portion of the initial 3D bounding box may be calculated based on the following equation:

where, E is the matching error of the pixels within the visible portion; L is a matching error function; N is total number of pixels within the visible portion; I _l and I _r are representations for the first image and a second image, respectively; (u _i, v _i) represents coordinates of pixel i within the visible portion; and d _i is the disparity corresponding to pixel i.

In some embodiments, I _l and I _r may represent pixel values of pixel i within the visible portion of the initial 3D bounding box in the first image and a second image, respectively. In some other embodiments, I _l and I _r may represent feature maps of the first image and a second image, respectively.

In some embodiments, the matching error function L may be determined according to application scenarios. For example, the matching error function L can be an L1 norm or an L2 norm.

In some embodiments, as shown in FIG. 3, obtaining the information of the target object by optimizing the local matching (104) includes adjusting the initial 3D bounding box according to the matching error to obtain an updated 3D bounding box (e.g., an optimized 3D bounding box) where the updated 3D bounding box corresponds to a lowest matching error (303) , and obtaining the information of the target object based on the updated 3D bounding box (304) .

The initial 3D bounding box may be adjusted to optimize the local matching (e.g., to decrease the matching error) , so as to obtain an updated 3D bounding box that corresponds to a lowest matching error. Based on the optimized local matching, the depth corresponding to a center of the updated 3D bounding box may be obtained, and coordinates at the center of the updated 3D bounding box may also be obtained. As such, the 3D position, 3D scale, and/or orientation of the target object may be determined based on the depth corresponding to the center of the updated 3D bounding box and/or coordinates at the center of the updated 3D bounding box. The 3D scale of the target object may include a dimension (s) of the target object in one or more directions, a volume of the target object, or an area (s) in one or more surfaces of the target objects.

Determining an appropriate initial 3D bounding box may help to reduce the amount of calculation for optimizing the local matching. The appropriate initial 3D bounding box may approximately simulate a shape and/or an occupied space of the target object. In some embodiments, as shown in FIG. 4, determining the initial 3D bounding box (102) further includes obtaining a first region of the first image and a second region of the second image, where the first region and the second region both contain the target object (401) ; estimating initial information of the target object based on the first region and the second region (402) ; and estimating the initial 3D bounding box corresponding to the target object based on the first region, the second region, and/or the initial information (403) .

In some embodiments, the first region and the second region may be 2D bounding boxes, each of the 2D bounding boxes including one or more pixels. The first region and the second region may be obtained based on a first feature map of the first image and a second feature map of the second image, respectively.

In some embodiments, the first image and the second image may be processed through a base network to obtain the first feature map and the second feature map, respectively. The base network may include at least one of convolution processing, pooling processing, or non-linear computation processing. The base network may include a first base network for processing the first image for obtaining the first feature map, and a second base network for processing the second image to obtain the second feature map. The first base network and the second network may have the same network structure and base weight coefficient. The base weight coefficient may be determined through deep learning, e.g., training with a plurality of training images. The first image may be processed through the first base network using the base weight coefficient to obtain the first feature map. The second image may be processed through the second base network using the base weight coefficient to obtain the second feature map.

In some embodiments, a plurality of first candidate regions and a plurality of second candidate regions may be determined in the first image based on the first feature map and the second image based on the second feature map, respectively, where the plurality of first candidate regions have one-to-one correspondence with the plurality of second candidate regions. The first region and the second region may be determined from the plurality of first candidate regions and the plurality of second candidate regions, respectively. In some embodiments, a number of the plurality of first candidate regions or a number of the plurality of second candidate regions is in a range from 99 to 1000. For example, the number of the plurality of first candidate regions or the number of the plurality of second candidate regions may be 300.

In some embodiments, the plurality of regions of the first image and the plurality of regions of the second image may be processed through a region-determination network to determine the plurality of first candidate regions and the plurality of second candidate regions, respectively. The region-determination network includes at least one of convolution processing, pooling processing, or non-linear computation processing. For example, the first feature map and the second feature map may be processed through the region-determination network using a region-determination weight coefficient to determine the plurality of first candidate regions in the first image and the plurality of second candidate regions in the second image, respectively. The region-determination weight coefficient may be determined through deep learning, e.g., training with a plurality of training images.

In some embodiments, the plurality of regions of the first image and the plurality of regions of the second image may be processed through a redundancy filter to obtain the plurality of first candidate regions and the plurality of second candidate regions, respectively. For example, the plurality of regions of the first image and/or the plurality of regions of the second image may be processed through a redundancy filter using a non-maximum suppression method to obtain the plurality of first candidate regions and the plurality of second candidate regions, respectively.

In some embodiments, the first feature map and the second feature map may be processed through a feature transformation network. After the feature transformation network, the plurality of regions in the first image and the plurality of regions in the second image may be predicted to correspond to the target object, and a confidence of each of the plurality of regions in the first image belonging to the target object and a confidence of each of the plurality of regions in the second image belonging to the target object may also be predicted. The plurality of regions of the first image and the plurality of regions of the second image may be processed through a redundancy filter using a non-maximum suppression method based on the predicted confidence to obtain the plurality of first candidate regions and the plurality of second candidate regions, respectively.

For example, a set A contains N regions in in the first image or the second image that are predicted to belong to the target object. The set A may be processed through a redundancy filter using a non-maximum suppression method based on the predicted confidence of each region in the N regions to obtain one or more candidate regions in a set B. The process may include (1) initializing the set B as an empty set; (2) iteratively executing the following steps until the set A is empty: (i) taking a region b with the highest confidence from set A and adding the region b to set B, (ii) calculating a coincidence degree of each of the remaining regions in the set A and the region b in the set B, and (iii) deleting one or more regions in the set A whose coincidence degree is greater than a certain threshold; and (3) outputting the set B.

In an example scenario, each of the first image and the second image has a resolution of 100*50, i.e., each of the firs image and the second image has 5000 pixels. Each of the first image and the second image contains a plurality of objects that include a target object. The region-determination network predicts there are 100 pixels in each of the first image and the second image belonging to the target object, and each of the 100 pixels in the first image and a corresponding one of the 100 pixels in the second image correspond a pair of 2D boxes that predicted to belong to the target object. That is, 100 pairs of 2D boxes are predicted to belong to the target object. A confidence for each of the 100 pixels in the first image belonging to the target object and/or a confidence for each of the 100 pixels in the second image belonging to the target object may be predicted. According to the predicted confidence, the non-maximum suppression method may be used to remove one of more pairs of 2D boxes whose coincidence degree exceed a certain threshold, and thus only a few pairs of 2D boxes are retained as candidate regions (including the first candidate regions and the second candidate regions) of the target object.

In some embodiments, the initial information may include a semantic class of the target object. The semantic class of the target object may be selected from a plurality of semantic classes. For example, the plurality of semantic classes may include a vehicle, a person, a building, a plant, and/or a background. In some embodiments, if the target object is a vehicle, the plurality of semantic classes may further include a plurality of types of the vehicle including a motorcycle, a trailer, a passenger car, a bus, a micro, a sedan, a CRV, a SUV, a hatchback, a roadster, a pickup, a van, a coupe, a supercar, a campervan, a mini truck, a cabriolet, a minivan, a truck, a big truck, etc. In some embodiments, if the target object is a vehicle, the plurality of semantic classes may further include a plurality of brands of the vehicle.

In some embodiments, the initial information may include positional information of the first region and positional information of the second region. For example, the positional information of the first region may include at least one of a width of the first region, a length of the first region, or coordinates of a center of the first region; and the positional information of the second region may include at least one of a width of the second region, a length of the second region, or coordinates of a center of the second region.

In some embodiments, the initial information may include an observation angle with respect to the target object, and an orientation angle of the target object. The orientation angle of the target object may be obtained based on the observation angle with respect to the target object.

In some embodiments, the initial information may include a 3D scale of the target object. The 3D scale of the target object may include a dimension (s) of the target object in one or more directions, a volume of the target object, or an area (s) in one or more surfaces of the target objects. The 3D scale of the target object may be determined based on the semantic class of the target object. For example, a correspondence relationship between averaged scales and semantic classes may be established. Different semantic classes may correspond to different averaged scales. The 3D scale of the target object may be estimated based on the semantic class of the target object and the established correspondence relationship between averaged scales and a plurality of semantic classes.

In some embodiments, the initial information may include one or more key points of the initial 3D bounding box. For example, the one or more key points of the initial 3D bounding box may include three or more corner points of the 3D bounding box.

In some embodiments, the initial 3D bounding box corresponding to the target object may be obtained based on the first region and the second region. In some embodiments, the initial 3D bounding box corresponding to the target object may be obtained further based on the estimated initial information of the target object.

FIG. 5 a flow chat of another example 3D object detecting method according to some embodiments of the present disclosure. As shown in FIG. 5, a first raw image 501 and a second raw image 502 are processed through stereo rectification 503 to obtain a first image 504 and a second image 505, respectively. An epipolar line of the first image 504 and an epipolar line of the second image 505 may be both aligned with a same horizontal line. The first image 504 is processed through a first base network 506 to obtain a first feature map 508. The second image 505 is processed through a second base network 507 to obtain a second feature map 509. The first feature map 508 and the second feature map 509 are input to a region-determination network 510, which outputs a plurality of first candidate regions 512 in the first image 504 and a plurality of second candidate regions 513 in the second image 505. The first feature map 508, the second feature map 509, the plurality of first candidate regions 512 and the plurality of second candidate regions 513 are input to a regression network 511 to output 3D bounding box information 514. Consistent with some embodiments described above, an initial 3D bounding box may be estimated based on the 3D bounding box information 514. Local matching 515 is performed on the initial 3D bounding box and based on the local matching, information of the target object 516 is obtained. The information of the target object 516 includes a depth at a center of the target object, a 3D scale, an orientation angle, 3D positioning, and/or semantic information of the target object.

FIG. 6 schematically shows an example region-determination network according to some embodiments of the present disclosure. As shown in FIG. 6, the first feature map 601 and the second feature map 602 are feature transformed through a feature transformation network 603. After the feature transformation, 2D bounding box prediction 605 is performed for the transformed first feature map and transformed second feature map to obtain a plurality of regions in the first image and a plurality of regions in the second image, respectively, that are predicted to belong to the target object. The plurality of regions in the first image have one-to-one correspondence with the plurality of regions in the second image. Confidence prediction 604 is performed for each of the plurality of regions in the first image to determine a confidence of each of the plurality of regions in the first image belonging to the target object. Confidence prediction 604 is also performed for each of the plurality of regions in the second image to determine a confidence of each of the plurality of regions in the second image belonging to the target object. The plurality of regions in the first image and the plurality of regions in the second image are filtered via a redundancy filter 606 to obtain a plurality of first candidate regions 607 and a plurality of second candidate regions 608, respectively.

FIG. 7 schematically shows an example regression network according to some embodiments of the present disclosure. As shown in FIG. 7, a first feature map 701 and a second feature map 702 are input to a fusion network 704. A plurality of first candidate regions 703, a plurality of second candidate regions 705, and output of the fusion network are input to a candidate region feature network 706. The fusion network 704 and the candidate region feature network 706 may be parts of a feature fusion network, which may be a convolutional neural network.

The plurality of first candidate regions 703 and the plurality of second candidate regions 705 may be normalized to have a same predetermined scale. The first feature map 701 and the second feature map 702 may be processed by a pooling processing in the fusion network 704 to obtain a feature vector with a preset length. Based on the obtained feature vector, the candidate region feature network 706 may perform multiple level of feature transformation to obtain features of the plurality of first candidate regions 703 and the plurality of second candidate regions 705, and fuse the features of the of the plurality of first candidate regions 703 and the plurality of second candidate regions 705 to output fused features that combines information of the first image and the second image.

The fused features may be used to output estimated initial information of the target object and estimated information of the initial 3D bounding box. For example, as shown in FIG. 7, the output of candidate region feature network 706 is input into object semantic class information estimation 707 to obtain an semantic class of the target object. The semantic class of the target object may be selected from a plurality of semantic classes. A confidence of each of the plurality of candidate regions 703 belonging to a certain semantic class and/or each of the plurality of candidate regions 705 belonging to a certain semantic class may be calculated. The semantic class of the target object may be obtained based on the calculated confidence.

As shown in FIG. 7, the output of candidate region feature network 706 is also input into 2D bonding box adjustment 708 to adjust a fist region and a second region (e.g., a first 2D bounding box and a second 2D bounding box corresponding to the target object) in the first image and the second image, respectively. Each 2D bonding box may be represented by coordinates of a center point pixel, and a length and a width of the 2D bounding box. As described above, the first image and the second image have only the translation difference after stereo rectification. For example, the first image the second image have the same coordinate in a vertical direction and have same height. The 2D bonding box adjustment 708 can predict a offset between each candidate regions and a true value of a 2D bonding box corresponding to the target object, so that the estimated 2D bounding box can be adjusted according to the predicted offset to obtain a more accurate 2D bounding box.

The output of candidate region feature network 706 is also input into 3D scale estimation 709 to estimate a 3D scale of the target object. The 3D scale of the target object may include a dimension (s) of the target object in one or more directions, a volume of the target object, or an area (s) in one or more surfaces of the target objects. For example, the 3D scale of the target object may be estimated based on the semantic class of the target object and the established correspondence relationship between averaged scales and a plurality of semantic classes.

The output of candidate region feature network 706 is also input into observation estimation 710 to estimate an observation angle (e.g., viewpoint) of first image or the second image with respect to the target object. An orientation angle of the target object may be estimated based on the observation angle. FIG. 8 schematically shows an example relationship between an observation angle and an orientation angle of a target object according to some embodiments of the present disclosure. As shown in FIG. 8, the camera 810 captures an image containing a target object 820 (e.g., a car) . x _c may represent a vertical direction of the image, and z _c may represent a depth direction of the image that is perpendicular to the vertical direction of the image. z _c may also represent an orientation direction of the camera 810, and x _c may represent a direction perpendicular to the orientation direction of the camera 810.

An angle θ may represent an orientation angle of the target object 820, and the angle θ may be angle between an orientation direction z of the target object 820 and the orientation direction z _c of the camera 810. The orientation direction z of the target object 820 may be a heading direction or forward driving direction of the target object 820. An angle β may represent an azimuth direction of the target object 820, and the angle β may be an angle between an observation direction of the camera 810 with respect to the target object 820 and the orientation direction z _c of the camera 810. The observation direction of the camera 810 with respect to the target object 820 may be a direction along a line from the camera 810 to a center point of the target object 820.

An observation angle α of the target object 820 may be an angle between the observation direction of the camera 810 with respect to the target object 820 and the orientation direction z of the target object 820. As shown in FIG. 8, the relationship between the observation angle α of the camera 810 with respect to the target object 820, the orientation angle θ of the target object 820, and the azimuth direction β of the target object 820 is α = β + θ. Referring to FIG. 7, the regression network can estimate an observation angle α (e.g., viewpoint) of the camera 810 with respect to the target object 820, the orientation angle θ of the target object 820 may be obtained based on the observation angle α of the camera 810.

As shown in FIG. 7, the output of candidate region feature network 706 is also input into key points estimation 711 to estimate key points of an initial 3D bounding box and/or an updated 3D bounding box. FIG. 9 schematically shows an example 3D bonding box 900 according to some embodiments of the present disclosure. The 3D bounding box 900, which may be an initial bounding box, an adjusted initial bounding box, or an updated initial bounding box obtained by optimizing local matching. As shown in FIG. 9, the 3D bounding box 900 includes eight corner points 901-908 and a center point 909 at a center of the 3D bounding box 900. The key points estimation 711 may output three or more points of key points of eight corner points 901-908 and a center point 909 of the 3D bounding box 900.

FIGs. 10 and 11 schematically show example applications of 3D object detecting methods according to some embodiments of the present disclosure. As shown in FIG. 10, an image 1010 may be the first image and/or the second image. The image 1010 includes a plurality of objects including one or more target objects. For example, there are two target objects in the image 1010. After being processed by an neural network or multiple neural networks, the image 1010 may be transformed to an image 1020 and a region 1021 in the image 1020 corresponding to the one or more target objects can be estimated. The region 1021 may be a 2D bounding box containing the one or more target objects. Based on the region 1021, an initial 3D bounding box 1031 corresponding to the one or more target objects maybe obtained. As shown in FIG. 11, an initial 3D bounding box 1111 corresponding to one or more target objects is projected to an image 1120, which maybe one of the first image and the second image to obtain a visible portion 1121 of the initial 3D bounding box 1111. For example, there are one target object in the image 1110. Consistent with the 3D object detection method according to some embodiments of the present disclosure, local matching may be performed within the visible portion 1121 of the initial 3D bounding box 1111 and information of the one or more target objects may be obtained by optimizing local matching.

The 3D object detection method according to some embodiments of the present disclosure combines neural networks and geometric reasoning to realize the 3D positioning of 3D objects and does not require pixel-level matching calculation between entire images. According to the method, local matching is performed to directly optimize the position of a 3D frame of the object. As such, the method of the present disclosure is more robust to lamination changes, weak texture areas, and noise, has higher positioning accuracy, and is applicable to a wider range of application scenarios. This method can achieve high-precision 3D positioning of objects based on cameras, e.g., binocular cameras alone, and provides detailed semantic information of the objects. This method is thus very suitable for obstacle perception in autonomous driving scenarios, and can also be extended to other applications of robotics, such as unmanned vehicles, logistics robots, cleaning robots, etc.

Another aspect of the present disclosure provides an image processing apparatus. FIG. 12 schematically shows an example image processing apparatus 1200 according to some embodiments of the present disclosure. The image processing apparatus 1200 may be a camera, an image processing assembly mounted on an unmanned vehicle, a mobile phone, a tablet, or another apparatus with an image processing function. As shown in FIG. 12, the image processing apparatus 1200 includes one or more of the following components: a processor 1210 and a memory 1220.

The processor 1210 may be configured to control operations, e.g., image processing, of the image processing apparatus 1200. The processor 1210 is configured to execute computer- readable instructions. In addition, the processor 1210 may also include one or more components (not shown) to facilitate interaction between the processor 1210 and the memory 1220.

The memory1220 may store a plurality of computer-readable instructions, images to be processed, images being processed, and images processed by the image processing apparatus 1200. The memory 1220 may be any type of volatile or non-volatile memory or a combination thereof, such as a static random-access memory (SRAM) , an electrically erasable programmable read-only memory (EEPROM) , an erasable programmable read-only memory (EPROM) , a programmable read-only memory (PROM) , a read-only memory (ROM) , a magnetic memory, a flash memory, a disk, or an optical disk.

In some embodiments, the computer-readable instructions can be executed by the processor 1210 to cause the processor 1210 to implement a method consistent with the disclosure, such as the example 3D object detection method described above in connection with FIGs. 1-11.

In some embodiments, the instructions can cause the processor 1210 to obtain a first image and a second image, where the first image and the second image may both contain a target object. Each of the first image and the second image can be referred to as a 2D image.

The first image and the second image may be obtained via one or more image sensors of a camera, e.g., a binocular camera that is communicatively connected to the image processing apparatus 1200. In some embodiments, the first image and the second image may correspond to substantially the same photographing scene. The photographing scene may include one or more target objects. A representation of the one or more target objects in the first image matches a representation of the one or more target objects in the second image.

In some embodiments, the first image and the second image may be obtained via a first image sensor and a second image sensor, respectively, of a camera, e.g., a binocular camera. The camera is communicatively connected to the image processing apparatus 1200. In some embodiments, the first image may be obtained via a left image sensor of the binocular camera, and the second image may be obtained via a right image sensor of the binocular camera.

In some embodiments, the instructions can cause the processor 1210 to estimate an initial 3D bounding box corresponding to the target object based on the first image and the second image. The initial 3D bounding box may be determined based on the target object and the target object may include one of a building, a vehicle, and a pedestrian. For example, a 3D scale and/or a shape of the initial 3D bounding box may be determined based on the target object in the first image and/or the second image. In some embodiments, the initial 3D bounding box may have a default shape, e.g., a cuboid. In some other embodiments, a shape of the 3D bounding box may be selected from a plurality of predetermined shapes, such as a cuboid, a sphere, a cylinder, a cone, etc. For example, the shape of the initial bounding box may be selected based on the target object in the first image and/or the second image. The shape of the initial 3D bounding boxes corresponding to different target objects may be different.

In some embodiments, the instructions can cause the processor 1210 to perform local matching based on the initial 3D bounding box. Local matching refers to image matching within a first portion of the first image and a second portion of the second image. That is, image matching is not performed between the entire first and second images, but between portions of the first and second images. In some embodiments, the first portion of the first image and the second portion of the second image may both correspond to the initial 3D bounding box. In some embodiments, the first portion of the first image and the second portion of the second image may be obtained according to the initial 3D bounding box. Image matching may be feature-based matching, dense matching, etc., where the feature-based matching compares and analyses features between images, and the dense matching establishes the dense correspondence between images.

In some embodiments, the instructions can cause the processor 1210 to obtain information of the target object by optimizing the local matching. The information of the target object may include at least one of a depth at a center of the target object, 3D scale information of the target object, an orientation angle of the target object, 3D positioning of the target object, or semantic information of the target object.

In some embodiments, the instructions can cause the processor 1210 to project the initial 3D bounding box onto one of the first image and the second image to obtain a visible portion of the initial 3D bounding box; and perform local matching based on the visible portion of the initial 3D bounding box. The initial 3D bounding box may be projected onto the first image or the second image. The projection of the 3D bounding box onto the first image or the second image transforms the 3D bounding box to a 2D projection region in the first image or the second image. The 2D projection region may correspond to a visible portion of the initial 3D bounding box in the first image or the second image.

In some embodiments, the initial 3D bounding box may be projected onto the first image to obtain a 2D projection region as the visible portion of the initial 3D bounding box. The 2D projection region in the first image may be used to determine a corresponding 2D region in the second image based on a correspondence relationship, e.g., a matching relationship, between the first image and the second image, where the corresponding 2D region in the second image may be determined as the visible portion of the initial 3D bounding box in the second image. In some embodiments, a disparity corresponding to each pixel within the 2D projection region in the first image may be used to determine the corresponding 2D region in the second image. For example, for pixel i with coordinates (u _i, v _i) within the 2D projection region in the first image, based on a disparity d _i corresponding to pixel i in the first image, the corresponding pixel in the second region may be determined to be at (u _i -d _i, v _i) in the second image. Similarly, the initial 3D bounding box may be projected onto the second image to obtain a 2D projection region as the visible portion of the initial 3D bounding box in the second image, and the 2D projection region in the second image may be used to determine a corresponding 2D region in the first image, where the corresponding 2D region in the first image may be determined as the visible portion of the initial 3D bounding box in the first image.

In some embodiments, the instructions can cause the processor 1210 to calculate a disparity corresponding to each pixel within the visible portion according to the initial 3D bounding box; and determine a matching error of pixels within the visible portion based on the disparity corresponding to each pixel within the visible portion.

In some embodiments, the disparity corresponding to each pixel within the visible portion may be calculated based on a depth corresponding to each pixel within the visible portion. The depth corresponding to each pixel within the visible portion may be calculated according to coordinates of a center of the initial 3D bounding box and a 3D scale of the initial 3D bounding box. The 3D scale of the initial bounding box may include a dimension (s) in one or more directions, an area (s) of one or more surfaces, or a volume, etc.

. In some embodiments, the matching error of pixels within the visible portion of the initial 3D bounding box in the first image and the second image may be a sum of pixel matching errors. A pixel matching error (or simply “matching error” ) refers to a matching error between each pixel within the visible portion of the initial 3D bonding box in the first image and a corresponding pixel in the second image (e.g., a corresponding pixel within the visible portion of the initial 3D bounding box in the second image) .

In some embodiments, the instructions can cause the processor 1210 to adjust the initial 3D bounding box according to the matching error to obtain an updated 3D bounding box (e.g., an optimized 3D bounding box) , where the updated 3D bounding box corresponds to a lowest matching error; and obtain the information of the target object based on the updated 3D bounding box. The initial 3D bounding box may be adjusted to optimize the local matching (e.g., to decrease the matching error) , so as to obtain an updated 3D bounding box that corresponds to a lowest matching error. Based on the optimized local matching, the depth corresponding to a center of the updated 3D bounding box may be obtained, and coordinates at the center of the updated 3D bounding box may also be obtained. As such, the 3D positioning, 3D scale, and/or orientation of the target object may be determined based on the depth corresponding to the center of the updated 3D bounding box and/or coordinates at the center of the updated 3D bounding box.

In some embodiments, the instructions can cause the processor 1210 to obtain a first region of a first image and a second region of the second image may be obtained, where the first region and the second region may both contain the target object. The first region and the second region may be 2D bounding boxes, each of the 2D bounding boxes including one or more pixels. The first region and the second region may be obtained based on a first feature map of the first image and a second feature map of the second image, respectively.

In some embodiments, the instructions can cause the processor 1210 to estimate initial information of the target object based on the first region and the second region. The initial information may include a semantic class of the target object, positional information of the first region and positional information of the second region, an observation angle with respect to the target object, an orientation angle of the target object, 3D scale of the target object, and/or one or more key points of the initial 3D bounding box. The 3D scale of the target object may include a dimension (s) of the target object in one or more directions, a volume of the target object, or an area (s) in one or more surfaces of the target objects.

In some embodiments, the instructions can cause the processor 1210 to obtain the initial 3D bounding box corresponding to the target object based on the first region and the second region. In some embodiments, the initial 3D bounding box corresponding to the target object may be obtained further based on the estimated initial information of the target object.

For detailed description of parts of the image processing apparatus 1200, reference can be made to descriptions of the 3D object detection method, which are not repeated here.

The image processing apparatus consistent with embodiments of the present disclosure utilizes neural networks and geometric reasoning to realize the 3D positioning of 3D objects and is not required to perform pixel-level matching calculation between entire images, but performs local matching to directly optimize the position of a 3D frame of the object. The image processing apparatus can be robust to lamination changes, weak texture areas, and noise, has higher positioning accuracy, and is applicable to a wider range of application scenarios. This image processing apparatus can achieve high-precision 3D positioning of objects based on cameras, e.g., binocular cameras alone, and provides detailed semantic information of the objects. The image processing apparatus is thus very suitable for obstacle perception in autonomous driving scenarios, and can also be extended to other applications of robotics, such as unmanned vehicles, logistics robots, cleaning robots, etc.

Another aspect of the present disclosure provides a mobile platform. FIG. 13 schematically shows an example mobile platform 1300 according to some embodiments of the present disclosure. The mobile platform may be an autonomous driving robot, an unmanned vehicle, a cleaning robot, a logistics robot, etc. As shown in FIG. 13, the mobile platform 1300 includes a processor 1310 and a memory 1320. The processor 1310 can be configured to control operations, e.g., image acquisition, image processing, and image display, etc., of the mobile platform 1300. The processor 1310 is configured to execute computer-readable instructions stored in the memory 1320. The memory 1320 may store a plurality of computer-readable instructions of the mobile platform 1300. The memory 1320 may be any type of volatile or non-volatile memory or a combination thereof, such as a static random-access memory (SRAM) , an electrically erasable programmable read-only memory (EEPROM) , an erasable programmable read-only memory (EPROM) , a programmable read-only memory (PROM) , a read-only memory (ROM) , a magnetic memory, a flash memory, a disk, or an optical disk.

In some embodiments, the computer-readable instructions can be executed by the processor 1310 to cause the processor 1310 to implement a method consistent with the disclosure, such as the example 3D object detection method described above in connection with FIGs. 1-11.

In some embodiments, the instructions can cause the processor 1310 to obtain a first image and a second image, where the first image and the second image may both contain a target object. Each of the first image and the second image can be referred to as a 2D image.

The first image and the second image may be obtained via one or more image sensors of a camera, e.g., binocular camera that is communicatively connected to the mobile platform1300. In some embodiments, the first image and the second image may correspond to substantially the same photographing scene. The photographing scene may include one or more target objects. A representation of the one or more target objects in the first image matches a representation of the one or more target objects in the second image.

In some embodiments, the first image and the second image may be obtained via a first image sensor and a second image sensor, respectively, of a camera, e.g., a binocular camera. The camera is communicatively connected to the mobile platform 1300. In some embodiments, the first image may be obtained via a left image sensor of the binocular camera, and the second image may be obtained via a right image sensor of the binocular camera.

In some embodiments, the first image and the second image are obtained by processing a first raw image and a second raw image, respectively. The first image sensor (e.g., the left image sensor) may capture the first raw image and the second image sensor (e.g., the right image sensor) may capture the second raw image. An epipolar line of the first raw image may not be aligned with or parallel to an epipolar line of the second raw image. The first raw image and the second raw image may thus be subjected to one or more processes such as undistortion, stereo rectification, and/or cropping, to obtain the first image and the second image, respectively. An epipolar line of the first image and an epipolar line of the second image may be both aligned with a same horizontal line. The first image and the second image obtained by processing the first raw image and the second raw image, respectively, may have a same orientation and may be translationally displaced with respect to each other. In some embodiments, the first raw image and the second image may be processed through stereo rectification using Bouguet method if internal parameters and distortion parameters of the two image sensors are known. In some embodiments, the first raw image and the second raw image may be processed through stereo rectification using Hartley method if the internal parameters and/or the distortion parameters of the two image sensors are unknown.

In some embodiments, the instructions can cause the processor 1310 to estimate an initial 3D bounding box corresponding to the target object based on the first image and the second image. The initial 3D bounding box may be determined based on the target object and target object may include one of a building, a vehicle, and a pedestrian. For example, a 3D scale and/or a shape of the initial 3D bounding box may be determined based on the target object in the first image and/or the second image. In some embodiments, the initial 3D bounding box may have a default shape, e.g., a cuboid. In some other embodiments, a shape of the 3D bounding box may be selected from a plurality of predetermined shapes, a cuboid, a sphere, a cylinder, a cone, etc. For example, the shape of the initial bounding box may be selected based on the target object in the first image and/or the second image. The shape of the initial 3D bounding boxes corresponding to different target objects may be different.

In some embodiments, the instructions can cause the processor 1310 to perform local matching based on the initial 3D bounding box. Local matching refers to image matching within only a first portion of the first image and a second portion of the second image. That is, image matching is not performed between the entire first and second images, but between portions of the first and second images. In some embodiments, the first portion of the first image and the second portion of the second image may both correspond to the initial 3D bounding box. In some embodiments, the first portion of the first image and the second portion of the second image may be obtained according to the initial 3D bounding box. Image matching may be feature-based matching, dense matching, etc., where the feature-based matching compares and analyses features between images, and the dense matching establishes the dense correspondence between images.

In some embodiments, the instructions can cause the processor 1310 to obtain information of the target object by optimizing the local matching. The information of the target object may include at least one of a depth at a center of the target object, 3D scale information of the target object, an orientation angle of the target object, 3D positioning of the target object, or semantic information of the target object.

In some embodiments, the instructions can cause the processor 1310 to project the initial 3D bounding box onto one of the first image and the second image to obtain a visible portion of the initial 3D bounding box; and perform local matching based on the visible portion of the initial 3D bounding box. The initial 3D bounding box may be projected onto the first image or the second image. The projection of the 3D bounding box onto the first image or the second image transforms the 3D bounding box to a 2D projection region in the first image or the second image. The 2D projection region may correspond to a visible portion of the initial 3D bounding box in the first image or the second image.

In some embodiments, the instructions can cause the processor 1310 to calculate a disparity corresponding to each pixel within the visible portion according to the initial 3D bounding box; and determine a matching error of pixels within the visible portion based on the disparity corresponding to each pixel within the visible portion. In some embodiments, the disparity corresponding to each pixel within the visible portion may be calculated based on a depth corresponding to each pixel within the visible portion. The depth corresponding to each pixel within the visible portion may be calculated according to coordinates of a center of the initial 3D bounding box and a 3D scale of the initial 3D bounding box.

In some embodiments, the matching error of pixels within the visible portion of the initial 3D bounding box in the first image and the second image may be a sum ofpixel matching errors. A pixel matching error (or simply “matching error” ) refers to a matching error between each pixel within the visible portion of the initial 3D bonding box in the first image and a corresponding pixel in the second image (e.g., a corresponding pixel within the visible portion of the initial 3D bounding box in the second image) .

In some embodiments, the instructions can cause the processor 1310 to adjust the initial 3D bounding box according to the matching error to obtain an updated 3D bounding box (e.g., an optimized 3D bounding box) , where the updated 3D bounding box corresponds to a lowest matching error; and obtain the information of the target object based on the updated 3D bounding box. The initial 3D bounding box may be adjusted to optimize the local matching (e.g., to decrease the matching error) , so as to obtain an updated 3D bounding box that corresponds to a lowest matching error. Based on the optimized local matching, the depth corresponding to a center of the updated 3D bounding box may be obtained, and coordinates at the center of the updated 3D bounding box may also be obtained. As such, the 3D positioning, 3D scale, and/or orientation of the target object may be determined based on the depth corresponding to the center of the updated 3D bounding box and/or coordinates at the center of the updated 3D bounding box.

In some embodiments, the instructions can cause the processor 1310 to obtain a first region of the first image and a second region of the second image may be obtained, where the first region and the second region may both contain the target object. The first region and the second region may be 2D bounding boxes, each of the 2D bounding boxes including one or more pixels. The first region and the second region may be obtained based on a first feature map of the first image and a second feature map of the second image, respectively.

In some embodiments, the instructions can cause the processor 1310 to estimate initial information of the target object based on the first region and the second region. The initial information may include a semantic class of the target object, positional information of the first region and positional information of the second region, an observation angle with respect to the target object, an orientation angle of the target object, 3D scale of the target object, and/or one or more key points of the initial 3D bounding box.

In some embodiments, the instructions can cause the processor 1310 to obtain the initial 3D bounding box corresponding to the target object based on the first region and the second region. In some embodiments, the initial 3D bounding box corresponding to the target object may be obtained further based on the estimated initial information of the target object.

In some embodiments, the processor 1310 can process images sent from an image acquisition apparatus, e.g., a binocular camera, mounted on the mobile platform 1300. The mobile platform 1300 also includes a first image sensor 1330 (e.g., a left image sensor of the binocular camera) and a second image sensor 1340 (e.g., a right image sensor of the binocular camera) . The first image sensor 1330 can be configured to obtain a first raw image and the second image sensor 1340 can be configured to obtain a second raw image. The first raw image and the second raw image may be processed to obtain a first image and a second image.

In some embodiments, as shown in FIG. 13, the mobile platform 1300 also includes an image display unit 1350, configured to display images processed and sent by the processer 1310.

In some embodiments, as shown in FIG. 13, the mobile platform 1300 also includes a user input/output interface 1360. The user input/output interface 1360 may be a display, a touch control display, a keyboard, buttons, or a combination thereof. For example, the user input/output interface 1360 can be a touch control display, and through a screen, the user can input instructions input to the mobile platform 1300.

In addition, the processor 1310 may also include one or more components (not shown) to facilitate interaction between the processor 1310 and the first image sensor 1330 and the second image sensor 1340, the image display unit 1350 and the user input/output interface 1360.

For detailed description of parts of the mobile platform 1300, reference can be made to descriptions of the image processing apparatus 1200, which are not repeated here.

The mobile platform consistent with embodiments of the present disclosure utilizes neural networks and geometric reasoning to realize the precise 3D positioning of 3D objects based on images captured by a binocular camera without assistant of a laser sensor, thereby saving the cost of the laser sensor, which can be very expensive. Further, the mobile platform is not required to perform pixel-level matching calculation between entire images, but performs local matching to directly optimize the position of a 3D frame of the object. The 3D object detection performed by the mobile platform can be robust to lamination changes, weak texture areas, and noise, has higher positioning accuracy, and is applicable to a wider range of application scenarios. This 3D object detection performed by the mobile platform can achieve high-precision 3D positioning of objects based on cameras, e.g., binocular cameras alone, and provides detailed semantic information of the objects. The 3D object detection performed by the mobile platform is thus very suitable for obstacle perception in autonomous driving scenarios. The 3D object detection method implemented by the mobile platform can also be extended to other applications of robotics, such as autonomous driving vehicles, logistics robots, etc. Further, based on the 3D object detection performed by the mobile platform, the mobile platform can not only precisely position obstacles and but also obtain rich detailed information about the obstacles, based on both of which, the mobile platform can better construct a simulated driving environmental including the obstacles and/or better control the mobile platform to avoid the obstacles.

For simplification purposes, detailed descriptions of the operations of apparatus, device, and units may be omitted, and references can be made to the descriptions of the methods.

The disclosed apparatuses, device, and methods may be implemented in other manners not described here. For example, the devices described above are merely illustrative. For example, multiple elements or components may be combined or may be integrated into another system, or some features may be ignored, or not executed. Further, the coupling or direct coupling or communication connection shown or discussed may include a direct connection or an indirect connection or communication connection through one or more interfaces, devices, or units, which may be electrical, mechanical, or in other form.

The elements described as separate components may or may not be physically separate. That is, the units may be located in one place or may be distributed over a plurality of network elements. Some or all of the components may be selected according to the actual needs to achieve the object of the present disclosure.

In addition, the functional units in the various embodiments of the present disclosure may be integrated in one processing unit, or each unit may be an individual physically unit, or two or more units may be integrated in one unit.

A method consistent with the disclosure can be implemented in the form of computer program stored in a non-transitory computer-readable storage medium, which can be sold or used as a standalone product. The computer program can include instructions that enable a computer device, such as a personal computer, a server, or a network device, to perform part or all of a method consistent with the disclosure, such as one of the example methods described above. The storage medium can be any medium that can store program codes, for example, a USB disk, a mobile hard disk, a read-only memory (ROM) , a random-access memory (RAM) , a magnetic disk, or an optical disk.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. It is intended that the specification and embodiments be considered as examples only and not to limit the scope of the disclosure. Any modification and equivalently replacement for the technical solution of the present disclosure should all fall in the spirit and scope of the technical solution of the present disclosure.

Claims

A three-dimensional (3D) object detecting method comprising:

obtaining a first image containing a target object and a second image containing the target object;

estimating an initial 3D bounding box corresponding to the target object based on the first image and the second image;

performing local matching based on the initial 3D bounding box; and

obtaining information of the target object by optimizing the local matching.
The method of claim 1, wherein obtaining the information of the target object includes:

obtaining at least one of a depth at a center of the target object, 3D scale information of the target object, an orientation angle of the target object, 3D positioning of the target object, or semantic information of the target object.
The method of claim 1, wherein performing local matching based on the initial 3D bounding box includes:

projecting the initial 3D bounding box onto one of the first image and the second image to obtain a visible portion of the initial 3D bounding box; and

performing local matching based on the visible portion of the initial 3D bounding box.
The method of claim 3, wherein performing local matching based on the initial 3D bounding box further includes:

calculating a disparity corresponding to each pixel within the visible portion according to the initial 3D bounding box; and

determining a matching error of pixels within the visible portion based on the disparity corresponding to each pixel within the visible portion.
The method of claim 4, obtaining the information of the target object by optimizing the local matching further includes:

adjusting the initial 3D bounding box according to the matching error to obtain an updated 3D bounding box, the updated 3D bounding box corresponding to a lowest matching error; and

obtaining the information of the target object based on the updated 3D bounding box.
The method of claim 4, wherein calculating the disparity includes:

calculating a depth corresponding to each pixel within the visible portion; and

calculating the disparity corresponding to each pixel within the visible portion based on the depth corresponding to each pixel within the visible portion.
The method of claim 6, wherein calculating the depth includes:

calculating the depth corresponding to each pixel within the visible portion according to coordinates of a center of the initial 3D bounding box and a 3D scale of the initial 3D bounding box.
The method of claim 1, wherein estimating the initial 3D bounding box includes:

estimating the initial 3D bounding box corresponding to the target object based on a first region of the first image and a second region of the second image, both the first region and the second region containing the target object.
The method of claim 8, wherein the first region and the second region are two-dimensional (2D) bounding boxes, each of the 2D bounding boxes including one or more pixels.
The method of claim 8, wherein estimating the initial 3D bounding box further includes estimating initial information of the target object based on the first region and the second region.
The method of claim 10, wherein estimating the initial 3D bounding box further includes estimating the initial 3D bounding box based on the first region and the second region and the initial information of the target object.
The method of claim 10, further comprising:

obtaining a first feature map of the first image;

determining the first region containing the target object based on the first feature map;

obtaining a second feature map of the second image; and

determining the second region containing the target object based on the second feature map.
The method of claim 12, wherein:

obtaining the first feature map includes processing the first image through a base network to obtain the first feature map; and

obtaining the second feature map includes processing the second image through the base network to obtain the second feature map.
The method of claim 13, further comprising:

obtaining a base weight coefficient for the base network through training with a plurality of training images;

wherein:

processing the first image includes processing the first image through the base network using the base weight coefficient; and

processing the second image includes processing the second image through the base network using the base weight coefficient.
The method of claim 13, wherein the base network includes at least one of convolution processing, pooling processing, or non-linear computation processing.
The method of claim 12, further comprising:

determining a plurality of first candidate regions in the first image based on the first feature map;

determining a plurality of second candidate regions in the second image based on the second feature map, the plurality of first candidate regions having one-to-one correspondence with the plurality of second candidate regions;

determining the first region from the plurality of first candidate regions; and

determining the second region from the plurality of second candidate regions.
The method of claim 16, wherein:

determining the plurality of first candidate regions includes filtering a plurality of regions in the first image using a non-maximum suppression method to determine the plurality of first candidate regions; or

determining the plurality of second candidate regions includes filtering a plurality of regions in the second image using the non-maximum suppression method to determine the plurality of second candidate regions.
The method of claim 16, wherein a number of the plurality of first candidate regions or a number of the plurality of second candidate regions is in a range from 99 to 1000.
The method of claim 16, wherein:

determining the plurality of first candidate regions includes processing the first feature map through a region-determination network to determine the plurality of first candidate regions; and

determining the plurality of second candidate regions includes processing the second feature map through the region-determination network to determine the plurality of second candidate regions.
The method of claim 19, further comprising:

obtaining a region-determination weight coefficient for the region-determination network through training with a plurality of training images;

wherein:

processing the first feature map includes processing the first feature map through the region-determination network using the region-determination weight coefficient; and

processing the second feature map includes processing the second feature map through the region-determination network using the region-determination weight coefficient.
The method of claim 19, wherein the region-determination network includes at least one of convolution processing, pooling processing, or non-linear computation processing.
The method of claim 10, wherein estimating the initial information of the target object includes estimating a semantic class of the target object.
The method of claim 22, wherein the semantic class of the target object includes at least one of vehicle, person, or background.
The method of claim 10, wherein estimating the initial information of the target object includes determining positional information of the first region and positional information of the second region.
The method of claim 24, wherein:

the positional information of the first region includes at least one of a width of the first region, a length of the first region, or coordinates of a center of the first region; and

the positional information of the second region includes at least one of a width of the second region, a length of the second region, or coordinates of a center of the second region.
The method of claim 10, wherein estimating the initial information of the target object includes estimating an observation angle with respect to the target object.
The method of claim 26, wherein estimating the initial information of the target object further includes estimating an orientation angle of the target object according to the observation angle.
The method of claim 10, wherein estimating the initial information of the target object includes estimating a 3D scale of the target object.
The method of claim 28, wherein estimating the 3D scale of the target object includes estimating the 3D scale of the target object according to a semantic class of the target object.
The method of claim 29, wherein the semantic class of the target object includes at least one of vehicle, person, or background.
The method of claim 29, wherein estimating the 3D scale of the target object further includes:

establishing a correspondence relationship between averaged scales and semantic classes; and

estimating the 3D scale of the target object based on the semantic class of the target object and the correspondence relationship.
The method of claim 10, wherein estimating the initial information of the target object includes determining one or more key points of the initial 3D bounding box.
The method of claim 32, wherein the one or more key points of the initial 3D bounding box includes three or more corner points of the 3D bounding box.
The method of claim 1, wherein obtaining the first image and the second image includes:

obtaining a first raw image and a second raw image;

obtaining the first image by performing stereo rectification on the first raw image; and

obtaining the second image by performing stereo rectification on the second raw image.
The method of claim 34, wherein the first image and the second image have a same orientation and are translationally displaced with respect to each other.
An image processing apparatus, comprising:

a processor; and

a memory storing instructions that, when executed by the processor, cause the processor to:

obtain a first image containing a target object and a second image containing the target object;

estimate an initial 3D bounding box corresponding to the target object based on the first image and the second image;

perform local matching based on the initial 3D bounding box; and

obtain information of the target object by optimizing the local matching.
The image processing apparatus of claim 36, wherein the information of the target object includes at least one of a depth at a center of the target object, 3D scale information of the target object, an orientation angle of the target object, 3D positioning of the target object, or semantic information of the target object.
The image processing apparatus of claim 36, wherein the instructions further cause the processor to:

project the initial 3D bounding box onto one of the first image and the second image to obtain a visible portion of the initial 3D bounding box; and

perform local matching based on the visible portion of the initial 3D bounding box.
The image processing apparatus of claim 38, wherein the instructions further cause the processor to:

calculate a disparity corresponding to each pixel within the visible portion according to the initial 3D bounding box; and

determine a matching error of pixels within the visible portion based on the disparity corresponding to each pixel within the visible portion.
The image processing apparatus of claim 39, wherein the instructions further cause the processor to:

adjust the initial 3D bounding box according to the matching error to obtain an updated 3D bounding box, the updated 3D bounding box corresponding to a lowest matching error; and

obtain the information of the target object based on the updated 3D bounding box.
The image processing apparatus of claim 39, wherein the instructions further cause the processor to:

calculate a depth corresponding to each pixel within the visible portion; and

calculate the disparity corresponding to each pixel within the visible portion based on the depth corresponding to each pixel within the visible portion.
The image processing apparatus of claim 41, wherein the instructions further cause the processor to:

calculate the depth corresponding to each pixel within the visible portion according to coordinates of a center of the initial 3D bounding box and a 3D scale of the initial 3D bounding box.
The image processing apparatus of claim 36, wherein the instructions further cause the processor to:

estimate the initial 3D bounding box corresponding to the target object based on a first region of the first image and a second region of the second image, both the first region and the second region containing the target object.
The image processing apparatus of claim 43, wherein the first region and the second region are two-dimensional (2D) bounding boxes, each of the 2D bounding boxes including one or more pixels.
The image processing apparatus of claim 43, wherein the instructions further cause the processor to:

estimate initial information of the target object based on the first region and the second region.
The image processing apparatus of claim 45, wherein the instructions further cause the processor to:

estimate the initial 3D bounding box based on the first region and the second region and the initial information of the target object.
The image processing apparatus of claim 45, wherein the instructions further cause the processor to:

obtain a first feature map of the first image;

determine the first region containing the target object based on the first feature map;

obtain a second feature map of the second image; and

determine the second region containing the target object based on the second feature map.
The image processing apparatus of claim 47, wherein the instructions further cause the processor to:

obtain the first feature map includes processing the first image through a base network to obtain the first feature map; and

obtain the second feature map includes processing the second image through the base network to obtain the second feature map.
The image processing apparatus of claim 48, the instructions further cause the processor to:

obtain a base weight coefficient for the base network through training with a plurality of training images;

process the first image through the base network using the base weight coefficient; and

process the second image through the base network using the base weight coefficient.
The image processing apparatus of claim 48, wherein base network includes at least one of convolution processing, pooling processing, or non-linear computation processing.
The image processing apparatus of claim 47, wherein the instructions further cause the processor to:

determine a plurality of first candidate regions in the first image according to the first feature map;

determine a plurality of second candidate regions in the second image according to the second feature map, the plurality of first candidate regions having one-to-one correspondence with the plurality of second candidate regions;

determine the first region from the plurality of first candidate regions; and

determine the second region from the plurality of second candidate regions.
The image processing apparatus of claim 51, wherein the instructions further cause the processor to:

determine the plurality of first candidate regions includes filtering a plurality of regions in the first image using a non-maximum suppression method to determine the plurality of first candidate regions; or

determine the plurality of second candidate regions includes filtering a plurality of regions in the second image using the non-maximum suppression method to determine the plurality of second candidate regions.
The image processing apparatus of claim 51, wherein a number of the plurality of first candidate regions or a number of the plurality of second candidate regions is in a range from 99 to 1000.
The image processing apparatus of claim 51, wherein the instructions further cause the processor to:

process the first feature map through a region-determination network to determine the plurality of first candidate regions; and

process the second feature map through the region-determination network to determine the plurality of second candidate regions.
The image processing apparatus of claim 54, wherein the instructions further cause the processor to:

obtain a region-determination weight coefficient for the region-determination network through training with a plurality of training images;

process the first feature map through the region-determination network using the region-determination weight coefficient; and

process the second feature map includes processing the second feature map through the region-determination network using the region-determination weight coefficient.
The image processing apparatus of claim 54, wherein the region-determination network includes at least one of convolution processing, pooling processing, or non-linear computation processing.
The image processing apparatus of claim 45, wherein the instructions further cause the processor to estimate a semantic class of the target object.
The image processing apparatus of claim 57, wherein the semantic class of the target object includes at least one of vehicle, person, or background.
The image processing apparatus of claim 45, wherein the instructions further cause the processor to determine positional information of the first region and positional information of the second region.
The image processing apparatus of claim 59, wherein:

the positional information of the first region includes at least one of a width of the first region, a length of the first region, or coordinates of a center of the first region; and

the positional information of the second region includes at least one of a width of the second region, a length of the second region, or coordinates of a center of the second region.
The image processing apparatus of claim 45, wherein the instructions further cause the processor to estimate an observation angle with respect to the target object.
The image processing apparatus of claim 61, wherein the instructions further cause the processor to estimate an orientation angle of the target object according to the observation angle.
The image processing apparatus of claim 45, wherein the instructions further cause the processor to estimate a 3D scale of the target object.
The image processing apparatus of claim 63, wherein the instructions further cause the processor to estimate the 3D scale of the target object according to a semantic class of the target object.
The image processing apparatus of claim 64, wherein the semantic class of the target object includes at least one of vehicle, person, or background.
The image processing apparatus of claim 64, wherein the instructions further cause the processor to:

establish a correspondence relationship between averaged scales and semantic classes; and

estimate the 3D scale of the target object based on the semantic class of the target object and the correspondence relationship.
The image processing apparatus of claim 45, wherein the instructions further cause the processor to determine one or more key points of the initial 3D bounding box.
The image processing apparatus of claim 67, wherein the one or more key points of the initial 3D bounding box includes three or more corner points of the 3D bounding box.
The image processing apparatus of claim 36, wherein the instructions further cause the processor to:

obtain a first raw image and a second raw image;

obtain the first image by performing stereo rectification on the first raw image; and

obtain the second image by performing the stereo rectification on the second raw image.
The image processing apparatus of claim 69, wherein the first image and the second image have a same orientation and are translationally displaced with respect to each other.
A mobile platform, comprising:

a first image sensor configured to obtain a first image containing a target object;

a second image sensor configured to obtain a second image containing the target object;

a processor configured to:

obtain the first image and the second image through the first image sensor and the second image sensor, respectively;

estimate an initial 3D bounding box corresponding to the target object based on the first image and the second image;

perform local matching based on the initial 3D bounding box; and

obtain information of the target object by optimizing the local matching.
The mobile platform of claim 71, wherein the information of the target object includes at least one of a depth at a center of the target object, 3D scale information of the target object, an orientation angle of the target object, 3D positioning of the target object, or semantic information of the target object.
The mobile platform of claim 71, wherein the processor is further configured to:

project the initial 3D bounding box onto one of the first image and the second image to obtain a visible portion of the initial 3D bounding box; and

perform local matching based on the visible portion of the initial 3D bounding box.
The mobile platform of claim 73, wherein the processor is further configured to:

calculate a disparity corresponding to each pixel within the visible portion according to the initial 3D bounding box; and

determine a matching error of pixels within the visible portion based on the disparity corresponding to each pixel within the visible portion.
The mobile platform of claim 74, wherein the processor is further configured to:

adjust the initial 3D bounding box according to the matching error to obtain an updated 3D bounding box, the updated 3D bounding box corresponding to a lowest matching error; and

obtain the information of the target object based on the updated 3D bounding box.
The mobile platform of claim 74, wherein the processor is further configured to:

calculate a depth corresponding to each pixel within the visible portion; and

calculate the disparity corresponding to each pixel within the visible portion based on the depth corresponding to each pixel within the visible portion.
The mobile platform of claim 76, wherein the processor is further configured to:

calculate the depth corresponding to each pixel within the visible portion according to coordinates of a center of the initial 3D bounding box and a 3D scale of the initial 3D bounding box.
The mobile platform of claim 71, wherein the processor is further configured to:

estimate the initial 3D bounding box corresponding to the target object based on a first region of the first image and a second region of the second image, both the first region and the second region containing the target object.
The mobile platform of claim 78, wherein the first region and the second region are two-dimensional (2D) bounding boxes, each of the 2D bounding boxes including one or more pixels.
The mobile platform of claim 78, wherein the processor is further configured to:

estimate initial information of the target object based on the first region and the second region.
The mobile platform of claim 80, wherein the processor is further configured to:

estimate the initial 3D bounding box based on the first region and the second region and the initial information of the target object.
The mobile platform of claim 80, wherein the processor is further configured to:

obtain a first feature map of the first image;

determine the first region containing the target object based on the first feature map;

obtain a second feature map of the second image; and

determine the second region containing the target object based on the second feature map.
The mobile platform of claim 82, wherein the processor is further configured to:

obtain the first feature map includes processing the first image through a base network to obtain the first feature map; and

obtain the second feature map includes processing the second image through the base network to obtain the second feature map.
The mobile platform of claim 83, wherein the processor is further configured to:

obtain a base weight coefficient for the base network through training with a plurality of training images;

process the first image through the base network using the base weight coefficient; and

process the second image through the base network using the base weight coefficient.
The mobile platform of claim 83, wherein base network includes at least one of convolution processing, pooling processing, or non-linear computation processing.
The mobile platform of claim 82, wherein the processor is further configured to:

determine a plurality of first candidate regions in the first image according to the first feature map;

determine a plurality of second candidate regions in the second image according to the second feature map, the plurality of first candidate regions having one-to-one correspondence with the plurality of second candidate regions;

determine the first region from the plurality of first candidate regions; and

determine the second region from the plurality of second candidate regions.
The mobile platform of claim 86, wherein the processor is further configured to:

determine the plurality of first candidate regions includes filtering a plurality of regions in the first image using a non-maximum suppression method to determine the plurality of first candidate regions; or

determine the plurality of second candidate regions includes filtering a plurality of regions in the second image using the non-maximum suppression method to determine the plurality of second candidate regions.
The mobile platform of claim 86, wherein a number of the plurality of first candidate regions or a number of the plurality of second candidate regions is in a range from 99 to 1000.
The mobile platform of claim 86, wherein the processor is further configured to:

process the first feature map through a region-determination network to determine the plurality of first candidate regions; and

process the second feature map through the region-determination network to determine the plurality of second candidate regions.
The mobile platform of claim 89, wherein the processor is further configured to:

obtain a region-determination weight coefficient for the region-determination network through training with a plurality of training images;

process the first feature map through the region-determination network using the region-determination weight coefficient; and

process the second feature map includes processing the second feature map through the region-determination network using the region-determination weight coefficient.
The mobile platform of claim 89, wherein the region-determination network includes at least one of convolution processing, pooling processing, or non-linear computation processing.
The mobile platform of claim 80, wherein the processor is further configured to estimate a semantic class of the target object.
The mobile platform of claim 92, wherein the semantic class of the target object includes at least one of vehicle, person, or background.
The mobile platform of claim 80, wherein the processor is further configured to determine positional information of the first region and positional information of the second region.
The mobile platform of claim 94, wherein the processor is further configured to:

the positional information of the first region includes at least one of a width of the first region, a length of the first region, or coordinates of a center of the first region; and

the positional information of the second region includes at least one of a width of the second region, a length of the second region, or coordinates of a center of the second region.
The mobile platform of claim 80, wherein the processor is further configured to estimate an observation angle with respect to the target object.
The mobile platform of claim 96, wherein the processor is further configured to estimate an orientation angle of the target object according to the observation angle.
The mobile platform of claim 80, wherein the processor is further configured to estimate a 3D scale of the target object.
The mobile platform of claim 98, wherein the processor is further configured to estimate the 3D scale of the target object according to a semantic class of the target object.
The mobile platform of claim 99, wherein the semantic class of the target object includes at least one of vehicle, person, or background.
The mobile platform of claim 99, wherein the processor is further configured to:

establish a correspondence relationship between averaged scales and semantic classes; and

estimate the 3D scale of the target object based on the semantic class of the target object and the correspondence relationship.
The mobile platform of claim 80, wherein the processor is further configured to determine one or more key points of the initial 3D bounding box.
The mobile platform of claim 102, wherein the one or more key points of the initial 3D bounding box includes three or more corner points of the 3D bounding box.
The mobile platform of claim 71, wherein:

the first image sensor is further configured to capture a fist raw image;

the second image sensor is further configured to capture a second raw image; and

the processor is further configured to:

obtain the first image by performing stereo rectification on the first raw image; and

obtain the second image by performing the stereo rectification on the second raw image.
The mobile platform of claim 104, wherein the first image and the second image have a same orientation and are translationally displaced with respect to each other.
The mobile platform of claim 71, further comprising a display device configured to display the information of the target object.