CN116189150A - Monocular 3D target detection method, device, equipment and medium based on fusion output - Google Patents

Monocular 3D target detection method, device, equipment and medium based on fusion output

Info

Publication number
CN116189150A
CN116189150A (application CN202310193012.XA)
Authority
CN
China
Prior art keywords
detection frame
dimensional
dimensional detection
parameters
prediction result
Prior art date
Legal status
Granted
Application number
CN202310193012.XA
Other languages
Chinese (zh)
Other versions
CN116189150B (en)
Inventor
安超
韦松
张兵
Current Assignee
Jika Intelligent Robot Co ltd
Original Assignee
Jika Intelligent Robot Co ltd
Priority date
Filing date
Publication date
Application filed by Jika Intelligent Robot Co ltd
Priority to CN202310193012.XA
Priority claimed from CN202310193012.XA
Publication of CN116189150A
Application granted
Publication of CN116189150B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 20/58 — Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 10/40 — Extraction of image or video features
    • G06V 20/64 — Three-dimensional objects
    • G06V 2201/07 — Indexing scheme relating to image or video recognition or understanding: target detection

Abstract

The disclosure relates to a monocular 3D target detection method, apparatus, device and medium based on fusion output. The method comprises the following steps: performing feature extraction on an image to be detected to obtain a set of parameters associated with a target to be detected in the image, at least some of the parameters in the set including two-dimensional detection frame parameters and three-dimensional detection frame parameters for the target to be detected; obtaining a first-type three-dimensional prediction result based on depth estimation from at least a first set of sub-parameters in the set of parameters; obtaining a second-type three-dimensional prediction result based on geometric constraints from at least a second set of sub-parameters in the set of parameters, wherein the first set of sub-parameters at least partially overlaps the second set of sub-parameters; and obtaining a target three-dimensional detection result for the target to be detected based on the first-type three-dimensional prediction result and the second-type three-dimensional prediction result. In this way, the detection speed and accuracy can be effectively improved and the probability of missed and false detections reduced.

Description

Monocular 3D target detection method, device, equipment and medium based on fusion output
Technical Field
The present disclosure relates generally to the fields of autonomous driving and computer vision, and in particular to fusion output-based monocular 3D object detection methods, apparatuses, electronic devices and computer-readable storage media.
Background
With the development of deep learning, many computer vision tasks have broken through the limitations of traditional methods and made remarkable progress, and the resulting techniques have been successfully applied in fields such as transportation, security and medical care.
Object detection is one of the important tasks of computer vision; from traditional feature engineering to end-to-end deep learning detection frameworks, it has made major breakthroughs, its aim being to locate a target object with a detection frame and assign it a corresponding category. Two-dimensional object detection only needs to locate the pixel position of an object in the image, whereas three-dimensional object detection needs to locate the object in the real world, so three-dimensional object detection is more difficult and more challenging than the two-dimensional task.
Autonomous driving can fundamentally change the way people live and travel, reduce traffic accidents caused by human error, and improve travel efficiency and driving safety. Academia and industry have therefore devoted great effort to autonomous driving technology in recent years, and three-dimensional object detection, as one of the key technologies in the autonomous driving field, has received a great deal of attention. In particular, 3D object detection based on deep learning has become a research hotspot.
The prior art on monocular 3D object detection includes the following. Chinese patent application CN112070659A proposes a method of three-dimensional information correction using deep convolutional neural networks: the input image data is processed in a detection stage to obtain a preliminary three-dimensional detection result, which is then refined in a correction stage by a residual correction network to obtain an optimized detection result. This is a two-stage model whose inference speed is low, making it difficult to meet the practical requirements of autonomous driving.
The paper "SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation" proposes a monocular 3D detection method based on the projection of the 3D center point: the method directly predicts the projection position of the 3D detection frame center point on the 2D image, the depth corresponding to the projection point, and the size and rotation angle of the three-dimensional detection frame, from which a 3D detection result can be obtained. However, the model has difficulty effectively detecting occluded and truncated targets, and its detection accuracy is not high.
Therefore, a simple and effective monocular 3D object detection method is urgently needed that meets the real-time and high-precision requirements of monocular 3D detection for autonomous driving while effectively detecting occluded and truncated targets and avoiding missed and false detections.
Disclosure of Invention
According to an example embodiment of the present disclosure, a solution for monocular 3D object detection based on fusion output is provided to at least partially solve the problems existing in the prior art.
In a first aspect of the present disclosure, a monocular 3D target detection method based on fusion output is provided. The method comprises the following steps: performing feature extraction on an image to be detected to obtain a set of parameters associated with a target to be detected in the image to be detected, at least some of the parameters in the set including two-dimensional detection frame parameters and three-dimensional detection frame parameters for the target to be detected; obtaining a first-type three-dimensional prediction result based on depth estimation from at least a first set of sub-parameters in the set of parameters; obtaining a second-type three-dimensional prediction result based on geometric constraints from at least a second set of sub-parameters in the set of parameters, wherein the first set of sub-parameters at least partially overlaps the second set of sub-parameters; and obtaining a target three-dimensional detection result for the target to be detected based on the first-type three-dimensional prediction result and the second-type three-dimensional prediction result.
In some embodiments, the set of parameters includes at least one or more of: the position of the center point of the two-dimensional detection frame; the projection position of the center point of the three-dimensional detection frame; the three-dimensional detection frame target class; the offsets from the center point of the two-dimensional detection frame to the projection points of the corner points of the three-dimensional detection frame; the positions of the projection points of the corner points of the three-dimensional detection frame; the size of the three-dimensional detection frame; the depth of the projection point of the center of the three-dimensional detection frame; the direction of the three-dimensional detection frame; and the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame. The projection may be a projection of a corresponding point on the three-dimensional detection frame onto the two-dimensional image, and correspondingly, the projection position may be the position of that projection on the two-dimensional image.
In some embodiments, the first set of sub-parameters includes one or more of: the position of the center point of the two-dimensional detection frame; the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame; the depth of the projection point of the center of the three-dimensional detection frame; the size of the three-dimensional detection frame; the direction of the three-dimensional detection frame; and the three-dimensional detection frame target class. The second set of sub-parameters includes one or more of: the position of the center point of the two-dimensional detection frame; the offsets from the center point of the two-dimensional detection frame to the projection points of the corner points of the three-dimensional detection frame; the positions of the projection points of the corner points of the three-dimensional detection frame; the size of the three-dimensional detection frame; the three-dimensional detection frame target class; and the direction of the three-dimensional detection frame.
In some embodiments, deriving the first-type, depth-estimation-based three-dimensional prediction result based on at least the first set of sub-parameters preferably comprises: obtaining the position of the three-dimensional detection frame for the target to be detected based on the position of the center point of the two-dimensional detection frame, the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame, and the depth of the center projection point of the three-dimensional detection frame in the first set of sub-parameters, in combination with calibrated camera parameters; and obtaining the first-type three-dimensional prediction result based on the position of the three-dimensional detection frame, the three-dimensional detection frame target class, the size of the three-dimensional detection frame and the direction of the three-dimensional detection frame.
In some embodiments, deriving the second-type, geometric-constraint-based three-dimensional prediction result based on at least the second set of sub-parameters preferably comprises: associating the position of the center point of the two-dimensional detection frame, the offsets from the center point of the two-dimensional detection frame to the projection points of the corner points of the three-dimensional detection frame, and the positions of the projection points of the corner points of the three-dimensional detection frame based on the geometric constraints to obtain an association result; and estimating and correcting the three-dimensional detection frame using a nonlinear least squares method based on the association result, in combination with the size of the three-dimensional detection frame, the three-dimensional detection frame target class and the direction of the three-dimensional detection frame, so as to obtain the second-type three-dimensional prediction result. The association may be implemented, for example, in a nearest-neighbor matching manner. In another embodiment, the positions of the projection points of the three-dimensional detection frame can also be obtained directly using the center point of the two-dimensional detection frame and the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame, thereby further improving the model inference speed.
In some embodiments, obtaining the target three-dimensional detection result for the target to be detected based on the first type three-dimensional prediction result and the second type three-dimensional prediction result includes: and performing non-maximum suppression on the first type three-dimensional prediction result and the second type three-dimensional prediction result.
In some embodiments, the method further comprises: determining calibrated camera parameters and processing the annotation detection information of the image to be detected to obtain initial truth values for model training; preprocessing the image to be detected, the calibrated camera parameters and the annotation detection information; and decoding the first-type three-dimensional prediction result and the second-type three-dimensional prediction result and computing losses against the corresponding initial truth values, thereby completing the model training.
In a second aspect of the present disclosure, a monocular 3D object detection device based on a fusion output is provided. The device comprises: an image feature extraction module configured to perform feature extraction on an image to be detected to obtain a set of parameters associated with a target to be detected in the image to be detected, at least some of the set of parameters including two-dimensional detection frame parameters and three-dimensional detection frame parameters for the target to be detected; the first type three-dimensional prediction result acquisition module is configured to obtain a first type three-dimensional prediction result based on depth estimation at least based on a first group of sub-parameters in the group of parameters; a second type three-dimensional prediction result acquisition module configured to obtain a second type three-dimensional prediction result based on geometric constraints based on at least a second set of sub-parameters of the set of parameters, wherein the first set of sub-parameters at least partially coincide with the second set of sub-parameters; and a target three-dimensional detection result acquisition module configured to obtain a target three-dimensional detection result for the target to be detected based on the first type three-dimensional prediction result and the second type three-dimensional prediction result.
In some embodiments, the set of parameters includes at least one or more of: the position of the center point of the two-dimensional detection frame; the projection position of the center point of the three-dimensional detection frame; the three-dimensional detection frame target class; the offsets from the center point of the two-dimensional detection frame to the projection points of the corner points of the three-dimensional detection frame; the positions of the projection points of the corner points of the three-dimensional detection frame; the size of the three-dimensional detection frame; the depth of the projection point of the center of the three-dimensional detection frame; the direction of the three-dimensional detection frame; and the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame.
In some embodiments, the first set of sub-parameters includes one or more of: the position of the center point of the two-dimensional detection frame; the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame; the depth of the projection point of the center of the three-dimensional detection frame; the size of the three-dimensional detection frame; the direction of the three-dimensional detection frame; and the three-dimensional detection frame target class. The second set of sub-parameters includes one or more of: the position of the center point of the two-dimensional detection frame; the offsets from the center point of the two-dimensional detection frame to the projection points of the corner points of the three-dimensional detection frame; the positions of the projection points of the corner points of the three-dimensional detection frame; the size of the three-dimensional detection frame; the three-dimensional detection frame target class; and the direction of the three-dimensional detection frame.
In some embodiments, the first type three-dimensional prediction result acquisition module may be further configured to: based on the position of the center point of the two-dimensional detection frame in the first group of sub-parameters, the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame, the depth of the center projection point of the three-dimensional detection frame and the combination of calibrated camera parameters, obtaining the position of the three-dimensional detection frame aiming at the target to be detected; and obtaining the first type three-dimensional prediction result based on the position of the three-dimensional detection frame, the three-dimensional detection frame target class, the size of the three-dimensional detection frame and the direction of the three-dimensional detection frame.
In some embodiments, the second type three-dimensional prediction result acquisition module may be further configured to: correlating the position of the center point of the two-dimensional detection frame, the offset from the center point of the two-dimensional detection frame to the projection point of the corner point of the three-dimensional detection frame and the position of the projection point of the corner point of the three-dimensional detection frame based on geometric constraint to obtain a correlation result; and estimating and correcting the three-dimensional detection frame based on the association result and the size, direction and category of the three-dimensional detection frame by using a nonlinear least square method so as to obtain the second-type three-dimensional prediction result. Wherein the association may be implemented, for example, in a nearest-neighbor matching manner. In another embodiment, the position of the projection point of the three-dimensional detection frame can also be obtained directly by using the two-dimensional detection frame center point and the offset from the two-dimensional detection frame center point to the three-dimensional detection frame center projection point, so that the model reasoning speed is further improved.
In some embodiments, the target three-dimensional detection result acquisition module may be further configured to: and performing non-maximum suppression on the first type three-dimensional prediction result and the second type three-dimensional prediction result.
In some embodiments, the apparatus may be further configured to: determining calibrated camera parameters and processing the labeling detection information of the image to be detected to obtain an initial true value for model training; preprocessing the image to be detected, the calibrated camera parameters and the labeling detection information; and decoding the first type three-dimensional prediction result and the second type three-dimensional prediction result and performing loss calculation on the corresponding initial true values, thereby completing the model training.
In a third aspect of the present disclosure, an electronic device is provided. The apparatus includes: one or more processors; and storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The medium has stored thereon a computer program which, when executed by a processor, implements a method according to the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, a computer program product is provided. The computer program product comprises computer programs/instructions which, when executed by a processor, implement a method according to the first aspect of the disclosure.
It should be understood that what is described in this Summary is not intended to identify key or essential features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements. The accompanying drawings are included to provide a better understanding of the present disclosure, and are not to be construed as limiting the disclosure, wherein:
FIG. 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a schematic flow diagram of a fusion output-based monocular 3D object detection method, according to some embodiments of the present disclosure;
FIG. 3 illustrates an overall structural schematic of an object detection model according to some embodiments of the present disclosure;
FIG. 4 illustrates a schematic Gaussian map at the projection points corresponding to a three-dimensional detection frame, in accordance with some embodiments of the disclosure;
FIG. 5 illustrates target prediction visualization results according to some embodiments of the present disclosure;
fig. 6 illustrates a schematic block diagram of a fusion output-based monocular 3D object detection device, according to some embodiments of the present disclosure; and
FIG. 7 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described above, existing object detection methods such as CenterNet3D and SMOKE use only a single positive sample point, either the center point of the two-dimensional detection frame or the projection point of the three-dimensional detection frame center point on the two-dimensional image. Such a single positive-sample selection scheme leads to a low recall rate and missed detections. In addition, current object detection methods have low inference speed, making it difficult to meet the practical requirements of autonomous driving; they also struggle to effectively detect occluded and truncated targets, and their low detection accuracy may, in severe cases, affect the safety of autonomous driving.
At least in view of the above problems, various embodiments of the present disclosure provide a solution for object detection. In this solution, based on a sparse depth estimation method, both the center point of the two-dimensional detection frame and the projection point of the three-dimensional detection frame center point on the two-dimensional image are adopted as positive sample points, and the corresponding position, depth, size and rotation angle of the three-dimensional detection frame are predicted at these positive sample points; based on a geometric constraint method, the projection positions of the three-dimensional detection frame corner points on the image are predicted, and a nonlinear least squares method is adopted to estimate the position of the three-dimensional detection frame. The model can further combine the results of both branches and filter redundant three-dimensional detection frames by non-maximum suppression, finally obtaining an accurate three-dimensional detection result. Missed and false detections can thus be effectively avoided, the inference speed is high and the accuracy is high, the accuracy and real-time requirements of object detection in autonomous driving are met, and the method has good practical engineering value.
Exemplary embodiments of the present disclosure will be described below in conjunction with fig. 1-7.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure may be implemented.
As shown in fig. 1, in the example environment 100, a vehicle 110 may be traveling on a roadway, with a plurality of objects or obstacles to be detected present around it, such as other vehicles traveling on the roadway, vehicles parked at the roadside, pedestrians on the roadway, and the like. In such an environment, to ensure driving safety, the vehicle 110 needs to detect the surrounding obstacles accurately so that it can quickly make correct plans during driving. It should be noted that the above-mentioned obstacles are merely exemplary; an obstacle may be any dynamic or static obstacle around the vehicle 110, which is not limited by the present disclosure.
It should be appreciated that the environment 100 shown in FIG. 1 is merely one example environment in which the vehicle 110 may be traveling. In addition to traveling on an outdoor roadway, the vehicle 110 may travel in various environments such as tunnels, outdoor parking lots, building interiors (e.g., indoor parking lots), communities, parks, and the like, which is not limiting to the present disclosure.
In the example of fig. 1, vehicle 110 may be any type of vehicle that may carry a person and/or object and that is moved by a power system such as an engine, including, but not limited to, a car, truck, bus, electric vehicle, motorcycle, caravan, train, and the like. In some embodiments, the vehicle 110 in the environment 100 may be a vehicle having some autonomous capability, such a vehicle also being referred to as an unmanned vehicle or an autonomous vehicle. In some embodiments, vehicle 110 may also be a vehicle with semi-autonomous driving capabilities.
As shown in fig. 1, vehicle 110 may also include a computing device 120. In some embodiments, computing device 120 may be communicatively coupled to vehicle 110. Although shown as a separate entity, computing device 120 may be embedded in vehicle 110. Computing device 120 may also be an entity external to vehicle 110 and may communicate with vehicle 110 via a wireless network. Computing device 120 may be any device having computing capabilities.
As shown in fig. 1, computing device 120 may be any type of fixed, mobile, or portable computing device, including, but not limited to, a desktop computer, a laptop computer, a notebook computer, a netbook computer, a tablet computer, a multimedia computer, a mobile phone, and the like, as non-limiting examples; all or a portion of the components of computing device 120 may be distributed across the cloud. Computing device 120 contains at least a processor, memory, and other components typically found in general purpose computers to perform computing, storage, communication, control, etc. functions.
In some embodiments, computing device 120 may include a system for autonomous vehicle speed planning. The system architecture may include, for example, modules for system input, information processing and system execution. The system input module may include at least a sensing unit and a body unit. The sensing unit may include radar, cameras and the like for acquiring the motion information of the preceding vehicle. The body unit may be used to acquire the kinematic state of the vehicle. The information processing module can be used for processing the motion information of the preceding vehicle to obtain its position and light information in the ego-vehicle coordinate system, and for computing and processing the motion state of the ego vehicle. The information processing module can also comprise an ADAS domain controller, which implements the driving assistance algorithms and outputs vehicle control commands. The system execution module is used for executing the requested instructions. Parameter transmission between the units includes, but is not limited to, CAN bus transmission and Ethernet transmission. Various embodiments according to the present disclosure for determining a vehicle travel trajectory may be included, for example, in at least one of the sensing unit and the information processing module.
Fig. 2 illustrates a schematic flow diagram of a fusion output-based monocular 3D object detection method 200, according to some embodiments of the present disclosure. The method 200 may be implemented, for example, by the computing device 120 shown in fig. 1.
At block 201, feature extraction is performed on an image to be detected to obtain a set of parameters associated with a target to be detected in the image to be detected, at least some of the set of parameters including two-dimensional detection frame parameters and three-dimensional detection frame parameters for the target to be detected.
In some embodiments, calibrated camera parameters may first be determined and the annotation detection information of the image to be detected processed. The annotation detection information may include two-dimensional detection frame annotation information and three-dimensional detection frame annotation information. In some embodiments, the annotation information of the two-dimensional detection frame and the category information of the target can be represented in any suitable manner known in the art; the three-dimensional detection frame in the three-dimensional annotation information can be expressed as a center point position (x, y, z), the height, width and length (h, w, l) of the three-dimensional detection frame, and a rotation angle θ. The rotation matrix R_θ of the 3D annotation information can then be expressed as:
$$R_\theta = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}$$
In some embodiments, the camera intrinsic parameter matrix K, which includes the focal lengths of the camera and the principal point where the optical axis intersects the image plane, may be:
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
In some embodiments, the camera extrinsic matrix may be:
$$[\,R_{3\times 3} \mid T_{3\times 1}\,]$$
where R_{3×3} is the rotation matrix and T_{3×1} is the translation matrix.
In some embodiments, the corner point Cor of the three-dimensional detection frame may be calculated as follows:
$$Cor = R_\theta \begin{bmatrix} l/2 & l/2 & -l/2 & -l/2 & l/2 & l/2 & -l/2 & -l/2 \\ h/2 & h/2 & h/2 & h/2 & -h/2 & -h/2 & -h/2 & -h/2 \\ w/2 & -w/2 & -w/2 & w/2 & w/2 & -w/2 & -w/2 & w/2 \end{bmatrix} + \begin{bmatrix} x \\ y \\ z \end{bmatrix}$$
where R_θ is the rotation matrix; h, w and l are the height, width and length of the three-dimensional detection frame, respectively; and x, y and z are the coordinates of the center point position.
In this way, the projection points of the 8 corner points Cor of the three-dimensional detection frame to the image can be:
$$Cor_{img} = K\,[\,R_{3\times 3} \mid T_{3\times 1}\,] \otimes Cor$$

where $\otimes$ represents the matrix operation, with Cor taken in homogeneous coordinates.
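As a concrete illustration of the corner and projection equations above, the following minimal sketch may be considered. It assumes NumPy, a KITTI-style camera frame with yaw about the y-axis, and a centered, arbitrarily ordered corner layout; these conventions are illustrative assumptions rather than the exact formulation of the disclosure.

```python
# Illustrative sketch only: Cor = R_theta * corner offsets + center, then K [R|T] Cor
# in homogeneous coordinates, following the equations above.
import numpy as np

def label_to_image_corners(x, y, z, h, w, l, theta, K, RT):
    # 8 corner offsets in the object frame (centered box; conventions may differ).
    x_c = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2.0
    y_c = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * h / 2.0
    z_c = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2.0
    R_theta = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                        [ 0.0,           1.0, 0.0          ],
                        [-np.sin(theta), 0.0, np.cos(theta)]])
    cor = R_theta @ np.vstack([x_c, y_c, z_c]) + np.array([[x], [y], [z]])  # (3, 8)
    cor_h = np.vstack([cor, np.ones((1, 8))])                               # homogeneous (4, 8)
    uvw = K @ (RT @ cor_h)                                                  # K [R|T] Cor
    return (uvw[:2] / uvw[2]).T                                             # (8, 2) pixel coordinates
```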
In some embodiments, further as shown in fig. 4, a Gaussian map may be generated at the corresponding projection points as the supervisory truth information for position and category. Fig. 4 illustrates the Gaussian maps at the projection points corresponding to the three-dimensional detection frames according to some embodiments of the present disclosure, where the light white regions in fig. 4 are the Gaussian maps. In some embodiments, the truth information for model training may be obtained by the following equations:
$$(x_{b'},\, y_{b'}) = \left\lfloor \frac{(x_b,\, y_b)}{S} \right\rfloor$$

$$Y_{xy} = \exp\!\left( -\frac{(x - x_{b'})^2 + (y - y_{b'})^2}{2\sigma^2} \right)$$
where (x_{b'}, y_{b'}) is the downsampled position of the original image coordinate (x_b, y_b), S is the downsampling ratio, and σ is the adaptive variance.
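The Gaussian truth maps described above can be generated, for example, as in the following minimal sketch (assumptions: NumPy, a full-map Gaussian rather than a clipped window, and a fixed σ in place of the adaptive variance).

```python
# Illustrative sketch of placing a Gaussian truth peak at a projection point on the
# S-times-downsampled heatmap, following the equations above.
import numpy as np

def draw_gaussian_peak(heatmap, point, S=4, sigma=2.0):
    xb, yb = point
    xb_d, yb_d = int(xb / S), int(yb / S)          # (x_b', y_b'): downsampled peak location
    H, W = heatmap.shape
    ys, xs = np.ogrid[:H, :W]
    g = np.exp(-((xs - xb_d) ** 2 + (ys - yb_d) ** 2) / (2.0 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)            # keep the stronger peak where maps overlap
    return heatmap
```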
It should be appreciated that the above formulas are merely exemplary, and that the determination of calibration camera parameters and the processing of annotation detection information for an image to be detected may be performed in any other suitable manner, and the disclosure is not limited in this regard.
In some embodiments, the image to be detected, the calibrated camera parameters and the annotation detection information may also be preprocessed. Specifically, operations such as resizing, padding and flipping can be performed on the image to be detected, so as to realize data augmentation and to bring the image data into the required size range. When the image is preprocessed, the annotation information and the camera parameters need to be transformed synchronously.
In some embodiments, as shown in fig. 3, the preprocessed image may be fed into a feature extraction network (backbone) for image feature extraction to obtain a set of parameters associated with the object to be detected in the image to be detected. The feature extraction network in fig. 3 may be deployed, for example, at least in part in the computing device 120 shown in fig. 1.
Fig. 3 illustrates an overall structural schematic diagram of an object detection model according to some embodiments of the present disclosure. Referring to fig. 3, the feature extraction network may be, for example, DLA34, and the image yields multi-scale (e.g., 4-fold, 8-fold, 16-fold and 32-fold downsampled) feature maps after passing through the feature extraction network. The multi-scale feature maps are then fused, for example through a feature pyramid network (Feature Pyramid Network, FPN), so that the feature map contains richer contextual and semantic information. The feature pyramid network can output a 4-fold downsampled feature map after feature fusion.
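A minimal sketch of this kind of multi-scale feature fusion is given below (assumptions: PyTorch, placeholder channel counts, and a simplified top-down fusion standing in for the DLA34 + FPN combination described above).

```python
# Illustrative sketch: fuse backbone maps at 4x/8x/16x/32x downsampling into a single
# 4x-downsampled map, in the spirit of the FPN fusion described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPNFusion(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=64):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, feats):
        # Start from the coarsest map and progressively upsample and add finer laterals.
        out = self.lateral[-1](feats[-1])
        for lat, f in zip(reversed(self.lateral[:-1]), reversed(feats[:-1])):
            out = lat(f) + F.interpolate(out, size=f.shape[-2:], mode="nearest")
        return out  # fused 4x-downsampled feature map
```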
With continued reference to FIG. 3, in this embodiment, the feature map is fed into the model task heads, which output prediction maps for the different tasks containing one or more of the following parameters: the position of the center point of the two-dimensional detection frame; the projection position of the center point of the three-dimensional detection frame; the three-dimensional detection frame target class; the offsets from the center point of the two-dimensional detection frame to the projection points of the corner points of the three-dimensional detection frame; the positions of the projection points of the corner points of the three-dimensional detection frame; the size of the three-dimensional detection frame; the depth of the projection point of the center of the three-dimensional detection frame; the direction of the three-dimensional detection frame; and the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame. The projection may be a projection of a corresponding point on the three-dimensional detection frame onto the two-dimensional image, and correspondingly, the projection position may be the position of that projection on the two-dimensional image.
In this embodiment, the different model task heads may output the plurality of parameters in the above parameter information jointly or separately. As shown in fig. 3, the heatmap head may output the two-dimensional center point position, the three-dimensional detection frame center point projection position and the target category, while the center2kpt offset head, kpt head, dim head, depth head, dir head and center2pcenter offset head output, respectively, the offsets from the two-dimensional detection frame center point to the 8 corner projection points of the three-dimensional detection frame, the positions of the 8 corner projection points of the three-dimensional detection frame, the size of the three-dimensional detection frame, the depth of the three-dimensional detection frame, the direction of the three-dimensional detection frame, and the offset from the two-dimensional center point position to the three-dimensional center projection point. It should be appreciated that the output of the above parameters may be achieved in other ways, depending on the overall structure of the detection model. It should also be appreciated that the output parameters of the various task heads described above are optional depending on actual needs. For example, in the case where only the 8 projection points of the three-dimensional detection frame are considered as constraints, parameters such as the offset from the two-dimensional center point position to the three-dimensional center projection point and the three-dimensional detection frame center point projection position may not be output, as described in more detail below.
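The parallel task heads described above can be sketched as follows (assumptions: PyTorch; the head names, channel counts and two-layer convolutional structure are illustrative placeholders, not the exact architecture of the disclosure).

```python
# Illustrative sketch of the multi-task prediction heads applied to the fused feature map.
import torch.nn as nn

def make_head(in_ch, out_ch, mid_ch=64):
    return nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(mid_ch, out_ch, 1))

class DetectionHeads(nn.Module):
    def __init__(self, in_ch=64, num_classes=3):
        super().__init__()
        self.heatmap = make_head(in_ch, num_classes)          # 2D/3D-center heatmaps per class
        self.center2kpt_offset = make_head(in_ch, 8 * 2)      # 2D center -> 8 corner projections
        self.kpt_heatmap = make_head(in_ch, 8)                # corner projection-point heatmaps
        self.dim = make_head(in_ch, 3)                        # 3D box size (h, w, l)
        self.depth = make_head(in_ch, 2)                      # depth and its uncertainty
        self.dir = make_head(in_ch, 2)                        # direction (e.g. sin/cos encoding)
        self.center2pcenter_offset = make_head(in_ch, 2)      # 2D center -> 3D-center projection

    def forward(self, feat):
        return {name: head(feat) for name, head in self.named_children()}
```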
In some embodiments, after obtaining the prediction outputs for the set of parameters, the set of parameters may be decoded and the loss computed against the true values, and the network model parameters are updated by stochastic gradient descent according to the calculated loss values, so as to finally complete the model training.
In one embodiment, the classification loss function may be, for example, a Gaussian Focal Loss function, and the classification loss is calculated by the following equation:
$$L_{pos}(h, h^*) = -(1-h)^{\gamma}\log(h)$$

$$L_{neg}(h, h^*) = -(1-h^*)^{\beta}\,h^{\gamma}\log(1-h)$$

$$L_{cls} = L_{pos} + L_{neg}$$

where h denotes the predicted class value, h^* denotes the class truth value, β is 4.0 and γ is 2.0.
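A minimal sketch of this classification loss, directly following the equations above, might look as follows (assumption: PyTorch; h is the sigmoid heatmap prediction and h_true the Gaussian truth map).

```python
# Illustrative sketch of the Gaussian focal classification loss in the equations above.
import torch

def gaussian_focal_loss(h, h_true, beta=4.0, gamma=2.0, eps=1e-6):
    pos = h_true.eq(1.0).float()                   # peaks of the Gaussian truth map
    neg = 1.0 - pos
    l_pos = -((1.0 - h) ** gamma) * torch.log(h + eps) * pos
    l_neg = -((1.0 - h_true) ** beta) * (h ** gamma) * torch.log(1.0 - h + eps) * neg
    return (l_pos.sum() + l_neg.sum()) / pos.sum().clamp(min=1.0)
```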
In one embodiment, the depth estimation loss function may be, for example, a Laplacian Aleatoric Uncertainty Loss function, and the depth estimation loss is calculated by the following equation:
$$L_{depth} = \frac{\sqrt{2}}{d_u}\,\left|\,d - d^{*}\,\right| + \log(d_u)$$

where d_u is the predicted depth uncertainty, d is the depth prediction after inverse-sigmoid decoding, and d^* is the depth truth value.
In other embodiments, the offset and size loss functions may employ the L2 loss. The L2 loss is continuous and smooth at every point, convenient to differentiate and yields relatively stable solutions, and is not described in detail in this disclosure.
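A minimal sketch of the depth and regression losses discussed above is given below (assumption: PyTorch; the Laplacian form follows the reconstructed equation above).

```python
# Illustrative sketch of the Laplacian aleatoric-uncertainty depth loss and the L2 loss
# used for the offset and size branches.
import torch

def laplacian_uncertainty_depth_loss(d, d_true, d_u):
    # A larger predicted uncertainty d_u down-weights the residual but is penalised by log(d_u).
    return (((2.0 ** 0.5) / d_u) * torch.abs(d - d_true) + torch.log(d_u)).mean()

def l2_loss(pred, target):
    return ((pred - target) ** 2).mean()
```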
At block 203, a first type of three-dimensional prediction result based on the depth estimate is obtained based at least on a first set of sub-parameters in the set of parameters. The first type of three-dimensional prediction result may be a depth estimation-based three-dimensional prediction result.
In one embodiment, the first set of sub-parameters may include one or more of the following: the position of the center point of the two-dimensional detection frame; offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame; the depth of the projection point at the center of the three-dimensional detection frame; the size of the three-dimensional detection frame; the direction of the three-dimensional detection frame; three-dimensional detection frame object category.
In one embodiment, the position of the three-dimensional detection frame for the target to be detected can be obtained based on the position of the two-dimensional detection frame center point in the first group of sub-parameters, the offset from the two-dimensional detection frame center point to the three-dimensional detection frame center projection point, the depth of the three-dimensional detection frame center projection point and the combination of calibration camera parameters, and the first type three-dimensional prediction result is obtained based on the position of the three-dimensional detection frame, the three-dimensional detection frame target type, the size of the three-dimensional detection frame and the direction of the three-dimensional detection frame.
In particular, the depth-estimation-based three-dimensional detection prediction may be derived from the predicted two-dimensional detection frame center point position output_{2dcenter}, the predicted offset from the two-dimensional detection frame center point to the three-dimensional detection frame center projection point output_{2doffset}, the predicted depth of the three-dimensional detection frame center projection point output_{depth}, and the calibration parameters calib, yielding the position det3d_{pos} of the three-dimensional detection frame. Finally, the size, category and direction of the three-dimensional detection frame are combined to obtain the depth-estimation-based three-dimensional detection prediction, i.e. the first-type three-dimensional prediction result. The position det3d_{pos} of the three-dimensional detection frame can be obtained by the following equation:
$$det3d_{pos} = output_{depth} \cdot K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \qquad (u,\, v) = output_{2dcenter} + output_{2doffset}$$
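A minimal sketch of this back-projection is given below (assumption: NumPy; K stands for the calibrated 3×3 intrinsic matrix in calib).

```python
# Illustrative sketch: recover the 3D box position from the 2D center, the predicted offset
# to the 3D-center projection and the predicted depth, matching the equation above.
import numpy as np

def center_from_depth(center2d, offset2d, depth, K):
    u = center2d[0] + offset2d[0]
    v = center2d[1] + offset2d[1]
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # back-project the pixel to a viewing ray
    return depth * ray                               # 3D position in camera coordinates
```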
at block 205, a second type of three-dimensional prediction result based on the geometric constraint is obtained based at least on a second set of sub-parameters of the set of parameters, wherein the first set of sub-parameters at least partially coincide with the second set of sub-parameters. The second type of three-dimensional prediction result may be a geometric constraint-based three-dimensional prediction result.
In one embodiment, the second set of sub-parameters may include one or more of the following: the position of the center point of the two-dimensional detection frame; offset from the center point of the two-dimensional detection frame to the projection point of the corner point of the three-dimensional detection frame; detecting the position of a projection point of a corner point of a frame in three dimensions; the size of the three-dimensional detection frame; the three-dimensional detection frame target class and the three-dimensional detection frame direction.
In one embodiment, the correlation result can be obtained by correlating the position of the center point of the two-dimensional detection frame, the offset from the center point of the two-dimensional detection frame to the projection point of the corner point of the three-dimensional detection frame, and the position of the projection point of the corner point of the three-dimensional detection frame in the second group of sub-parameters based on geometric constraints, and the second type three-dimensional prediction result is obtained by combining the size of the three-dimensional detection frame, the direction of the three-dimensional detection frame, and the three-dimensional detection frame target class based on the correlation result by using a nonlinear least square method. Wherein the association may be implemented, for example, in a nearest-neighbor matching manner. In another embodiment, the position of the projection point of the three-dimensional detection frame can also be obtained directly by using the two-dimensional detection frame center point and the offset from the two-dimensional detection frame center point to the three-dimensional detection frame center projection point, so that the model reasoning speed is further improved.
In yet another embodiment, the second set of sub-parameters may further include an offset of the two-dimensional detection frame center point to the three-dimensional detection frame center projection point and a three-dimensional detection frame center point projection position. When calculating the three-dimensional detection frame prediction result based on geometric constraint, constraint calculation of a center point can be added, and accuracy of the prediction result is further ensured.
In such an embodiment, the first set of sub-parameters and the second set of sub-parameters each include a position of a center point of the two-dimensional detection frame, an offset from the center point of the two-dimensional detection frame to a projection point of the center of the three-dimensional detection frame, a size of the three-dimensional detection frame, a direction of the three-dimensional detection frame, a target class of the three-dimensional detection frame, and other parameters that overlap, so that the projection points of the center point of the two-dimensional detection frame and the center point of the three-dimensional detection frame on the two-dimensional image can be combined as positive sample points, and the problem of model missing detection can be effectively improved.
In one embodiment, specifically, for the geometric-constraint-based three-dimensional detection prediction, the obtained two-dimensional frame center point position, the obtained offsets from the two-dimensional detection frame center point to the 8 corner points of the three-dimensional detection frame, and the obtained positions of the 8 corner projection points of the three-dimensional detection frame are associated to obtain an association result group. Further, a nonlinear least squares method (NLS) can be adopted to estimate the position of the three-dimensional detection frame from the constraints of the 8 corner projection points, and the geometric-constraint-based 3D detection prediction result is obtained after the estimation result is corrected. In one embodiment, as described above, for the correction process, the nonlinear least squares method may be used to obtain the center point position of the three-dimensional detection frame, and the size, direction and class of the three-dimensional detection frame are then combined to obtain the complete three-dimensional detection frame.
In one embodiment, specifically, the position X, Y, Z of the center point of the three-dimensional detection frame can be estimated from the positions of the projection points of the 8 corner points of the three-dimensional detection frame and the camera intrinsic parameters. Finally, the position X, Y, Z of the center point is combined with the size of the three-dimensional detection frame and the direction of the three-dimensional detection frame to obtain the corrected three-dimensional detection frame. Specifically, the center point position of the three-dimensional detection frame based on geometric constraints can be obtained by the following equation:
$$\min_{X,\,Y,\,Z}\ \sum_{i=1}^{8} \left[ \big(kp_{i,x} - u_i(X, Y, Z, w, l, h, \theta)\big)^2 + \big(kp_{i,y} - v_i(X, Y, Z, w, l, h, \theta)\big)^2 \right]$$

where $(u_i, v_i)$ is the image projection of the $i$-th corner through the camera intrinsics.
where X, Y and Z denote the coordinates of the center point position, kp denotes the associated projection point positions, w, l and h denote the width, length and height of the three-dimensional detection frame, θ denotes the direction angle, and the x and y subscripts denote the abscissa and ordinate, respectively.
Further, the association result group and the position det3d_{pos} of the three-dimensional detection frame can be obtained by the following equations:
$$group = \min\big( L2(center2d + offset,\ kpts) \big)$$

$$det3d_{pos} = NLS(kpts,\ dims,\ rots)$$
where center2d represents the two-dimensional frame center position, offset represents the two-dimensional detection frame center to three-dimensional detection frame projection point offset, kpts refers to the predicted 8 corner projection point positions, dims refers to the predicted dimension, and rots refers to the predicted direction.
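A minimal sketch of the nonlinear least-squares step, corresponding to det3d_pos = NLS(kpts, dims, rots) above, is given below (assumptions: NumPy and SciPy; the corner ordering and initial guess are arbitrary illustrative choices, not the exact formulation of the disclosure).

```python
# Illustrative sketch: estimate the 3D-box center from the 8 predicted corner projections
# by minimizing the reprojection error with nonlinear least squares.
import numpy as np
from scipy.optimize import least_squares

def corners_in_camera(center, dims, theta):
    h, w, l = dims
    offsets = np.vstack([np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2.0,
                         np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * h / 2.0,
                         np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2.0])
    R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                  [ 0.0,           1.0, 0.0          ],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    return (R @ offsets).T + center                        # (8, 3) corners in the camera frame

def estimate_center_nls(kpts, dims, theta, K, init=(0.0, 1.5, 20.0)):
    def residual(center):
        pts = corners_in_camera(np.asarray(center), dims, theta)
        uvw = (K @ pts.T).T
        return ((uvw[:, :2] / uvw[:, 2:3]) - kpts).ravel()  # reprojection error on 8 corners
    return least_squares(residual, x0=np.asarray(init, dtype=float)).x
```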
It should be noted that the above embodiment is merely exemplary; for example, the geometric-constraint-based three-dimensional detection frame may also be obtained by using the 8 corner projection points together with the projection point of the three-dimensional detection frame center point on the two-dimensional image, i.e. 9 projection points in total, so as to obtain a more accurate calculation result. In some embodiments, the calculation of the three-dimensional detection frame based on geometric constraints may also be implemented using projection points other than the corner points, which is not limited by the present disclosure.
At block 207, a target three-dimensional detection result for the target to be detected is obtained based on the first type three-dimensional prediction result and the second type three-dimensional prediction result.
In one embodiment, the first-type, depth-estimation-based three-dimensional detection prediction result and the geometric-constraint-based three-dimensional detection prediction result can be merged, and redundant detection frames with a high degree of overlap are filtered out by non-maximum suppression, so as to obtain the target three-dimensional detection result. For the non-maximum suppression filtering algorithm, the Intersection over Union (IoU) of the three-dimensional detection frames can be computed first; IoU measures how accurately the corresponding object is detected on a particular dataset, and a higher IoU indicates a higher degree of overlap. In the embodiment of the disclosure, an IoU threshold may be set per class as needed, and frames with low confidence among the highly overlapping frames are filtered out, so that the finally obtained detection frames are the target three-dimensional detection frames.
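A minimal sketch of such non-maximum suppression over the merged predictions is given below (assumptions: NumPy; iou_3d is any user-supplied 3D IoU routine, e.g. a bird's-eye-view approximation).

```python
# Illustrative sketch: keep the highest-confidence box and drop highly overlapping,
# lower-confidence boxes, as described above.
import numpy as np

def nms_3d(boxes, scores, iou_3d, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]                 # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        ious = np.array([iou_3d(boxes[i], boxes[j]) for j in order[1:]])
        order = order[1:][ious < iou_thresh]         # drop redundant overlapping boxes
    return keep
```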
In this embodiment, the detection result based on sparse depth estimation and the detection result based on geometric constraints are fused, which effectively improves model accuracy; the method offers both high inference speed and high precision and can be readily deployed in practical autonomous driving.
In one embodiment, FIG. 5 illustrates target prediction visualization results according to some embodiments of the present disclosure. Referring to fig. 5, vehicles and obstacles on both sides of a road are marked by two-dimensional frames and three-dimensional frames, and the obstacles are completely and accurately identified, thereby better assisting in the decision-making of automatic driving.
Fig. 6 shows a schematic block diagram of a fusion output-based monocular 3D object detection device 600 according to some embodiments of the present disclosure.
as shown in fig. 6, the apparatus 600 includes an image feature extraction module 601, a first type three-dimensional prediction result acquisition module 603, a second type three-dimensional prediction result acquisition module 605, and a target three-dimensional detection result acquisition module 607.
In the apparatus 600, the image feature extraction module 601 is configured to perform feature extraction on an image to be detected to obtain a set of parameters associated with an object to be detected in the image to be detected, at least some of the set of parameters including two-dimensional detection frame parameters and three-dimensional detection frame parameters for the object to be detected.
In the apparatus 600, the first type three-dimensional prediction result obtaining module 603 is configured to obtain a first type three-dimensional prediction result based on the depth estimation based on at least a first set of sub-parameters of the set of parameters.
In the apparatus 600, the second type three-dimensional prediction result obtaining module 605 is configured to obtain a second type three-dimensional prediction result based on the geometric constraint based on at least a second set of sub-parameters of the set of parameters, wherein the first set of sub-parameters at least partially coincides with the second set of sub-parameters.
In the apparatus 600, the target three-dimensional detection result obtaining module 607 is configured to obtain a target three-dimensional detection result for the target to be detected based on the first type three-dimensional prediction result and the second type three-dimensional prediction result.
In some embodiments, the first type three-dimensional prediction result obtaining module 603 may be further configured to obtain a position of the three-dimensional detection frame for the target to be detected based on the two-dimensional detection frame center point position, the offset of the two-dimensional detection frame center point to the three-dimensional detection frame center projection point, the depth of the three-dimensional detection frame center projection point in the first set of sub-parameters, and in combination with the calibration camera parameters; and obtaining a first type three-dimensional prediction result based on the position of the three-dimensional detection frame, the target class of the three-dimensional detection frame, the size of the three-dimensional detection frame and the direction of the three-dimensional detection frame.
In some embodiments, the second type three-dimensional prediction result obtaining module 605 may be further configured to correlate the two-dimensional detection frame center point position, the offset of the two-dimensional detection frame center point to the three-dimensional detection frame corner projection point, and the three-dimensional detection frame corner projection point position in the second set of sub-parameters based on the geometric constraint to obtain a correlation result; and estimating and correcting the three-dimensional detection frame based on the correlation result by using a nonlinear least square method and combining the size of the three-dimensional detection frame, the direction of the three-dimensional detection frame and the target class of the three-dimensional detection frame so as to obtain a second type three-dimensional prediction result. Wherein the association may be implemented, for example, in a nearest-neighbor matching manner. In another embodiment, the position of the projection point of the three-dimensional detection frame can also be obtained directly by using the two-dimensional detection frame center point and the offset from the two-dimensional detection frame center point to the three-dimensional detection frame center projection point, so that the model reasoning speed is further improved.
In some embodiments, the target three-dimensional detection result acquisition module 607 may be further configured to non-maximally suppress the first type of three-dimensional prediction result and the second type of three-dimensional prediction result.
In some embodiments, the apparatus 600 may be further configured to determine calibration camera parameters and process the annotation detection information of the image to be detected to obtain an initial truth value for model training; preprocessing the image to be detected, calibrating camera parameters and labeling detection information; and decoding the first type three-dimensional prediction result and the second type three-dimensional prediction result and carrying out loss calculation on the corresponding initial true values so as to complete model training.
In summary, in the sparse-depth-estimation branch of the method according to the embodiments of the present disclosure, the two-dimensional detection frame center point and the projection point of the three-dimensional detection frame center on the two-dimensional image are taken as positive sample points, which effectively alleviates the problem of missed detections by the model. In the geometric-constraint branch, the model directly predicts the projection positions of the 8 corner points of the three-dimensional detection frame on the two-dimensional image and the offsets from the two-dimensional detection frame center point to these 8 corner projection points. Since the three-dimensional detection frame has 7 degrees of freedom (the remaining two rotational degrees of freedom are not considered), the obtained positions of the 8 corner projection points can be used to estimate the three-dimensional position by a nonlinear least squares method, and the three-dimensional detection result is obtained after correcting this estimate. Finally, according to the embodiments of the present disclosure, the detection outputs of the two branches are fused and redundant three-dimensional detection frames are filtered out by non-maximum suppression to obtain the final three-dimensional detection result, which significantly improves the model accuracy.
Fig. 7 illustrates a block diagram of a computing device 700 capable of implementing various embodiments of the present disclosure. The device 700 may be used, for example, to implement the computing device 120 of Fig. 1.
As shown in Fig. 7, the device 700 includes a computing unit 701 that can perform various suitable actions and processes according to computer program instructions stored in a Read Only Memory (ROM) 702 or loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as the method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method 200 described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method 200 by any other suitable means (e.g., by means of firmware).
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), etc.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (10)

1. A monocular 3D target detection method based on fusion output, characterized by comprising the following steps:
Extracting features of an image to be detected to obtain a set of parameters associated with a target to be detected in the image to be detected, wherein at least part of the parameters in the set of parameters comprise two-dimensional detection frame parameters and three-dimensional detection frame parameters aiming at the target to be detected;
obtaining a first type three-dimensional prediction result based on depth estimation based on at least a first set of sub-parameters in the set of parameters;
obtaining a second type of three-dimensional prediction result based on geometric constraints based at least on a second set of sub-parameters of the set of parameters, wherein the first set of sub-parameters at least partially coincide with the second set of sub-parameters; and
obtaining a target three-dimensional detection result for the target to be detected based on the first type three-dimensional prediction result and the second type three-dimensional prediction result.
2. The method of claim 1, wherein the set of parameters includes at least one or more of:
the position of the center point of the two-dimensional detection frame;
the projection position of the center point of the three-dimensional detection frame;
three-dimensional detection frame target class;
offset from the center point of the two-dimensional detection frame to the projection point of the corner point of the three-dimensional detection frame;
the position of the projection point of the corner point of the three-dimensional detection frame;
the size of the three-dimensional detection frame;
the depth of the projection point at the center of the three-dimensional detection frame;
the direction of the three-dimensional detection frame; and
the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame.
3. The method of claim 2,
wherein the first set of sub-parameters includes one or more of:
the position of the center point of the two-dimensional detection frame;
the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame;
the depth of the projection point at the center of the three-dimensional detection frame;
the size of the three-dimensional detection frame;
the direction of the three-dimensional detection frame; and
the three-dimensional detection frame target class;
wherein the second set of sub-parameters includes one or more of:
the position of the center point of the two-dimensional detection frame;
the offset from the center point of the two-dimensional detection frame to the projection point of the corner point of the three-dimensional detection frame;
the three-dimensional detection frame corner projection point positions;
the size of the three-dimensional detection frame;
the three-dimensional detection frame target class; and
the direction of the three-dimensional detection frame.
4. The method according to claim 3, wherein obtaining a first type three-dimensional prediction result based on depth estimation based on at least a first set of sub-parameters in the set of parameters comprises:
obtaining the position of the three-dimensional detection frame for the target to be detected based on the position of the center point of the two-dimensional detection frame in the first set of sub-parameters, the offset from the center point of the two-dimensional detection frame to the center projection point of the three-dimensional detection frame and the depth of the center projection point of the three-dimensional detection frame, in combination with the calibrated camera parameters; and
obtaining the first type three-dimensional prediction result based on the position of the three-dimensional detection frame, the three-dimensional detection frame target class, the size of the three-dimensional detection frame and the direction of the three-dimensional detection frame.
5. The method according to claim 3, wherein obtaining a second type three-dimensional prediction result based on geometric constraints based on at least a second set of sub-parameters in the set of parameters comprises:
correlating the position of the center point of the two-dimensional detection frame, the offset from the center point of the two-dimensional detection frame to the projection point of the corner point of the three-dimensional detection frame and the position of the projection point of the corner point of the three-dimensional detection frame based on geometric constraint to obtain a correlation result; and
estimating and correcting the three-dimensional detection frame based on the correlation result by using a nonlinear least squares method and combining the size of the three-dimensional detection frame, the direction of the three-dimensional detection frame and the target class of the three-dimensional detection frame, so as to obtain the second type three-dimensional prediction result.
6. The method of claim 1, wherein obtaining a target three-dimensional detection result for the target to be detected based on the first type three-dimensional prediction result and the second type three-dimensional prediction result comprises:
performing non-maximum suppression on the first type three-dimensional prediction result and the second type three-dimensional prediction result.
7. The method according to claim 1, wherein the method further comprises:
determining calibrated camera parameters and processing the labeling detection information of the image to be detected to obtain an initial true value for model training;
preprocessing the image to be detected, the calibrated camera parameters and the labeling detection information; and
decoding the first type three-dimensional prediction result and the second type three-dimensional prediction result and performing loss calculation against the corresponding initial true values, thereby completing the model training.
8. A monocular 3D target detection device based on fusion output, characterized by comprising:
an image feature extraction module configured to perform feature extraction on an image to be detected to obtain a set of parameters associated with a target to be detected in the image to be detected, at least some of the set of parameters including two-dimensional detection frame parameters and three-dimensional detection frame parameters for the target to be detected;
a first type three-dimensional prediction result acquisition module configured to obtain a first type three-dimensional prediction result based on depth estimation based on at least a first set of sub-parameters in the set of parameters;
a second type three-dimensional prediction result acquisition module configured to obtain a second type three-dimensional prediction result based on geometric constraints based on at least a second set of sub-parameters of the set of parameters, wherein the first set of sub-parameters at least partially coincide with the second set of sub-parameters; and
a target three-dimensional detection result acquisition module configured to obtain a target three-dimensional detection result for the target to be detected based on the first type three-dimensional prediction result and the second type three-dimensional prediction result.
9. An electronic device, the device comprising:
one or more processors; and
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method according to any of claims 1 to 7.
CN202310193012.XA 2023-03-02 Monocular 3D target detection method, device, equipment and medium based on fusion output Active CN116189150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310193012.XA CN116189150B (en) 2023-03-02 Monocular 3D target detection method, device, equipment and medium based on fusion output

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310193012.XA CN116189150B (en) 2023-03-02 Monocular 3D target detection method, device, equipment and medium based on fusion output

Publications (2)

Publication Number Publication Date
CN116189150A true CN116189150A (en) 2023-05-30
CN116189150B CN116189150B (en) 2024-05-17


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126269A (en) * 2019-12-24 2020-05-08 京东数字科技控股有限公司 Three-dimensional target detection method, device and storage medium
CN111583663A (en) * 2020-04-26 2020-08-25 宁波吉利汽车研究开发有限公司 Monocular perception correction method and device based on sparse point cloud and storage medium
US20210397855A1 (en) * 2020-06-23 2021-12-23 Toyota Research Institute, Inc. Monocular depth supervision from 3d bounding boxes
US20210319261A1 (en) * 2020-10-23 2021-10-14 Beijing Baidu Netcom Science and Technology Co., Ltd Vehicle information detection method, method for training detection model, electronic device and storage medium
WO2022161140A1 (en) * 2021-01-27 2022-08-04 上海商汤智能科技有限公司 Target detection method and apparatus, and computer device and storage medium
CN113344998A (en) * 2021-06-25 2021-09-03 北京市商汤科技开发有限公司 Depth detection method and device, computer equipment and storage medium
CN113903028A (en) * 2021-09-07 2022-01-07 武汉大学 Target detection method and electronic equipment
CN114119991A (en) * 2021-09-30 2022-03-01 深圳市商汤科技有限公司 Target detection method and device, electronic equipment and storage medium
CN114066980A (en) * 2021-10-28 2022-02-18 北京百度网讯科技有限公司 Object detection method and device, electronic equipment and automatic driving vehicle
CN114187589A (en) * 2021-12-14 2022-03-15 京东鲲鹏(江苏)科技有限公司 Target detection method, device, equipment and storage medium
CN115063789A (en) * 2022-05-24 2022-09-16 中国科学院自动化研究所 3D target detection method and device based on key point matching
CN115359326A (en) * 2022-08-04 2022-11-18 嬴彻星创智能科技(上海)有限公司 Monocular 3D target detection method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHEN YUHUA ET AL.: "Self-Supervised Learning with Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera", 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 31 December 2019 (2019-12-31), pages 7062-7071 *
LI YUJIE ET AL.: "A Survey of Vision-Based 3D Object Detection Algorithms" [基于视觉的三维目标检测算法研究综述], Computer Engineering and Applications, vol. 56, no. 01, 1 January 2020 (2020-01-01), pages 11-24 *
WANG HUIXING: "Research on 3D Object Detection Technology Based on Depth Estimation" [基于深度估计的三维目标检测技术研究], China Master's Theses Full-text Database (Information Science and Technology Series), 15 January 2023 (2023-01-15), pages 136-1643 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758301A (en) * 2023-08-14 2023-09-15 腾讯科技(深圳)有限公司 Image processing method and related equipment

Similar Documents

Publication Publication Date Title
US11482014B2 (en) 3D auto-labeling with structural and physical constraints
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
US20190370606A1 (en) Virtually boosted training
US11064178B2 (en) Deep virtual stereo odometry
Dewangan et al. Driving behavior analysis of intelligent vehicle system for lane detection using vision-sensor
EP4152204A1 (en) Lane line detection method, and related apparatus
Huang et al. Robust inter-vehicle distance estimation method based on monocular vision
US11475628B2 (en) Monocular 3D vehicle modeling and auto-labeling using semantic keypoints
CN112967283B (en) Target identification method, system, equipment and storage medium based on binocular camera
WO2022151664A1 (en) 3d object detection method based on monocular camera
US20220156483A1 (en) Efficient three-dimensional object detection from point clouds
CN115861632B (en) Three-dimensional target detection method based on visual laser fusion of graph convolution
Xiong et al. Road-Model-Based road boundary extraction for high definition map via LIDAR
US20210049382A1 (en) Non-line of sight obstacle detection
Dev et al. Steering angle estimation for autonomous vehicle
CN112800822A (en) 3D automatic tagging with structural and physical constraints
Jung et al. Intelligent Hybrid Fusion Algorithm with Vision Patterns for Generation of Precise Digital Road Maps in Self-driving Vehicles.
WO2023155903A1 (en) Systems and methods for generating road surface semantic segmentation map from sequence of point clouds
CN116189150B (en) Monocular 3D target detection method, device, equipment and medium based on fusion output
CN114648639B (en) Target vehicle detection method, system and device
US11663807B2 (en) Systems and methods for image based perception
CN113298044B (en) Obstacle detection method, system, device and storage medium based on positioning compensation
CN116189150A (en) Monocular 3D target detection method, device, equipment and medium based on fusion output
Memon et al. Self-driving car using lidar sensing and image processing
CN111765892A (en) Positioning method, positioning device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant