CN115861628A - 3D target detection method, device, equipment and storage medium

Info

Publication number: CN115861628A
Application number: CN202211462545.5A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 王宇龙
Applicant / Assignee: Shanghai Goldway Intelligent Transportation System Co Ltd
Legal status: Pending
Prior art keywords: laser, target, monocular, information, queue

Classifications

  • Traffic Control Systems (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The application discloses a 3D target detection method, a device, equipment and a storage medium, wherein the method comprises the following steps: acquiring a monocular 3D target queue after a camera performs target detection on a 3D space, and a laser 3D target queue and corresponding laser 3D information after a laser radar performs target detection on the 3D space; acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, projecting them onto an image to obtain laser 3D projection frames and monocular 3D projection frames, establishing a matching relation between the laser 3D projection frames and the monocular 3D projection frames, and combining the laser 3D target information with the information of the monocular 3D target queue to obtain a multi-modal feature map; and computing on the multi-modal feature map with a convolutional network, and filtering the computed result based on a confidence threshold to obtain a 3D target queue. Optimization of 3D target detection performance can thereby be achieved.

Description

3D target detection method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of automatic driving, and in particular, to a method, an apparatus, a device, and a storage medium for 3D target detection.
Background
3D target detection and ranging plays a very important role in the field of autonomous driving. Lidar provides rich 3D information and high positioning accuracy, and is therefore often used as a target detection sensor in autonomous driving. However, using the lidar alone as a single sensor for target detection usually leads to more false detections and missed detections, which are caused by the sparsity of the lidar point cloud and the limited performance of the detection model.
Thus, 3D target detection through multi-modal fusion has become one of the mainstream approaches in the perception field of autonomous driving. Laser-image fusion algorithms are generally classified into data-level fusion, feature-level fusion and target-level fusion according to the stage at which fusion occurs. In existing target-level fusion schemes, targets are mostly kept or discarded based on logical judgment, that is, a cost matrix is calculated by associating 3D and 2D detections. Such schemes typically require hand-crafted experience and are affected by manual parameters to some extent, which results in low generalization performance.
CLOCs is a deep-learning-based target-level fusion algorithm, which projects laser 3D targets onto the image and matches and fuses the 3D targets with 2D targets according to the geometric consistency of the 3D projection frame and the image 2D detection frame. However, this geometric consistency is usually broken by the perspective projection of the camera and by occlusion of the target frames, which reduces the overlap between the 3D projection frame and the 2D frame.
Therefore, the prior art has the technical problem of low 3D object detection performance.
Disclosure of Invention
In view of this, embodiments of the present application provide a method, an apparatus, a device and a storage medium for 3D target detection, which aim to solve the technical problem of low 3D detection performance in automatic driving.
The embodiment of the application provides a 3D target detection method, which comprises the following steps:
acquiring a monocular 3D target queue after a camera performs target detection on a 3D space, a laser 3D target queue after a laser radar performs target detection on the 3D space and corresponding laser 3D information, wherein the laser 3D information comprises a laser 3D probability distribution map and a laser 3D target information map;
acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, projecting them onto an image to obtain a laser 3D projection frame and a monocular 3D projection frame, establishing a matching relation between the laser 3D projection frame and the monocular 3D projection frame, and combining the 3D target information in the laser 3D target information map with the information of the monocular 3D target queue to obtain a multi-modal feature map;
and calculating the multi-modal characteristic diagram according to a convolutional network, and filtering and detecting the calculated result based on a confidence threshold value to obtain a 3D target queue.
In a possible embodiment of the present application, the acquiring of the laser 3D target queue and the corresponding laser 3D information after the laser radar performs target detection on the 3D space, where the laser 3D information includes a laser 3D probability distribution map and a laser 3D target information map, includes:
acquiring laser point cloud data of a laser radar for carrying out target detection on a 3D space;
and, based on the CenterPoint framework, using a VoxelNet network as the backbone network and a CenterHead as the detection head, computing on the laser point cloud data to obtain the laser 3D target queue and the corresponding laser 3D information, including the laser 3D probability distribution map and the laser 3D target information map.
The monocular 3D object queue after the acquisition camera performs object detection on the 3D space comprises:
a monocular 3D target queue is obtained based on the SMOKE algorithm, and each monocular 3D target in the queue includes confidence and size-coordinate information.
In a possible implementation manner of the present application, the acquiring of all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, projecting them onto an image to obtain a laser 3D projection frame and a monocular 3D projection frame, establishing a matching relationship between the laser 3D projection frame and the monocular 3D projection frame, and combining the 3D target information in the laser 3D target information map with the information of the monocular 3D target queue to obtain a multi-modal feature map includes:
acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, and projecting them onto an image to obtain the laser 3D projection frames and the monocular 3D projection frames;
calculating the intersection-over-union (IoU) between the laser 3D projection frame and the monocular 3D projection frame to form a first BEV feature map;
acquiring a second BEV feature map formed from the confidence of the laser 3D target;
acquiring the confidence of the monocular 3D target to form a third BEV feature map, wherein the confidence of the monocular 3D target is determined by information in the monocular 3D target queue;
calculating a distance value for the laser 3D target based on the 3D target information of the laser 3D target in the laser 3D target information map, and normalizing it to obtain a fourth BEV feature map;
and splicing the four BEV feature maps to obtain the multi-modal feature map.
In a possible embodiment of the present application, the calculating of the intersection-over-union between the laser 3D projection frame and the monocular 3D projection frame to form the first BEV feature map includes:
traversing the laser 3D projection frames corresponding to the queue formed by the laser 3D targets, and performing the following steps on each traversed laser 3D projection frame:
calculating the IoU between the laser 3D projection frame and each monocular 3D projection frame based on the size information of the monocular 3D projection frame and the laser 3D projection frame, wherein the size information is obtained by calculating the projection information of the monocular 3D target and the laser 3D target;
and selecting the IoU with the largest value among the calculation results as the target IoU to form the first BEV feature map.
In one possible embodiment of the present application, the obtaining the confidence level of the laser 3D target to form a second BEV feature map includes:
and taking a laser 3D probability distribution graph obtained after the laser radar carries out target detection on the 3D space as a second BEV characteristic graph.
In one possible embodiment of the present application, the obtaining the confidence level of the monocular 3D object forms a third BEV feature map, including:
projecting the monocular 3D targets to a new BEV feature map to obtain the position information of each monocular 3D target on the new BEV feature map;
and assigning the confidence of the monocular 3D object to a grid point corresponding to the position information on the new BEV feature map to form a third BEV feature map, wherein the confidence of the monocular 3D object is determined by the information in the monocular 3D object queue.
In a possible implementation manner of the present application, the calculating the multi-modal feature map according to a convolutional network, and filtering the calculated result based on a confidence threshold to obtain a 3D object queue includes:
performing dimension-raising processing on the multi-modal feature map, and assigning channel weights to the features in the multi-modal feature map through an SE (Squeeze-and-Excitation) attention mechanism;
calculating the confidence of the multi-modal feature map based on the channel weights and the convolutional network;
and filtering and detecting the calculated result based on the confidence coefficient threshold value to obtain a 3D target corresponding to the filtered confidence coefficient, and forming a 3D target queue.
The present application further provides a 3D target detection device, the device comprising:
the target queue generating module is used for acquiring a monocular 3D target queue after a camera performs target detection on a 3D space, a laser 3D target queue after a laser radar performs target detection on the 3D space and corresponding laser 3D information, wherein the laser 3D information comprises a laser 3D probability distribution map and a laser 3D target information map;
the feature extraction module is used for acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue to be projected onto an image respectively, acquiring a laser 3D projection frame and a monocular 3D projection frame, establishing a matching relation between the laser 3D projection frame and the monocular 3D projection frame, and combining 3D target information in the laser 3D target information image and information of the monocular 3D target queue to acquire a multi-modal feature image;
and the network prediction module is used for calculating the multi-modal characteristic graph according to a convolutional network, and filtering and detecting the calculated result based on a confidence coefficient threshold value to obtain a 3D target queue.
In a possible implementation manner of the present application, the target queue generating module further includes:
the first acquisition submodule is used for acquiring laser point cloud data of a laser radar for carrying out target detection on a 3D space;
the first calculation submodule is used for computing on the laser point cloud data, based on the CenterPoint framework and using a VoxelNet network as the backbone network and a CenterHead as the detection head, to obtain the laser 3D target queue and the corresponding laser 3D information, wherein the laser 3D information includes the laser 3D probability distribution map and the laser 3D target information map;
and/or, the target queue generating module further comprises:
and the second calculation submodule is used for obtaining a monocular 3D target queue based on the SMOKE algorithm, wherein each monocular 3D target in the queue includes confidence and size-coordinate information.
And/or, the feature extraction module comprises:
the second obtaining submodule is used for acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, and projecting them onto an image to obtain the laser 3D projection frames and the monocular 3D projection frames;
the third calculation submodule is used for calculating the intersection-over-union (IoU) between the laser 3D projection frame and the monocular 3D projection frame to form a first BEV feature map;
the third obtaining sub-module is used for obtaining a second BEV characteristic diagram formed by the confidence degrees of the laser 3D target;
a fourth obtaining sub-module, configured to obtain a confidence of the monocular 3D object to form a third BEV feature map, where the confidence of the monocular 3D object is determined by information in the monocular 3D object queue;
the fourth calculation submodule is used for calculating a distance numerical value of the laser 3D target based on the 3D target information of the laser 3D target in the laser 3D target information graph, and performing normalization to obtain a fourth BEV characteristic graph;
and the feature extraction submodule is used for splicing the four BEV feature maps to obtain a multi-modal feature map.
And/or, the third computation submodule further comprises:
a traversing unit, configured to traverse the laser 3D projection frames corresponding to the queue formed by the laser 3D target, and execute the following steps for each traversed laser 3D projection frame:
the data calculation subunit is configured to calculate the IoU between the laser 3D projection frame and each monocular 3D projection frame based on the size information of the monocular 3D projection frame and the laser 3D projection frame, where the size information is calculated from the projection information of the monocular 3D target and the laser 3D target;
and the data selection subunit is used for selecting the IoU with the largest value among the calculation results as the target IoU to form the first BEV feature map.
And/or the third obtaining sub-module further comprises:
and the characteristic diagram acquisition unit is used for taking a laser 3D probability distribution diagram obtained after the laser radar carries out target detection on the 3D space as a second BEV characteristic diagram.
And/or the third obtaining sub-module further comprises:
the target projection unit is used for projecting the monocular 3D targets onto a new BEV feature map to obtain the position information of each monocular 3D target on the new BEV feature map;
and the assignment unit is used for assigning the confidence coefficient of the monocular 3D object to the grid point corresponding to the position information on the new BEV feature map to form a third BEV feature map, and the confidence coefficient of the monocular 3D object is determined by the information in the monocular 3D object queue.
And/or, the network prediction module further comprises:
the preprocessing submodule is used for performing dimension-raising processing on the multi-modal feature map and assigning channel weights to the features in the multi-modal feature map through an SE (Squeeze-and-Excitation) attention mechanism;
the network prediction sub-module is used for calculating the confidence coefficient of the multi-modal feature map based on the channel weight and the convolutional network;
and the filtering submodule is used for carrying out filtering detection on the calculated result based on the confidence coefficient threshold value to obtain a 3D target corresponding to the filtered confidence coefficient, and a 3D target queue is formed.
The present application further provides a 3D target detection device, the 3D target detection device is an entity node device, the 3D target detection device includes: a memory, a processor and a program of the 3D object detection method stored on the memory and executable on the processor, the program of the 3D object detection method when executed by the processor being operable to implement the steps of the 3D object detection method as described above.
To achieve the above object, there is also provided a computer readable storage medium having a 3D object detection program stored thereon, where the 3D object detection program, when executed by a processor, implements any of the steps of the 3D object detection method described above.
The application provides a 3D target detection method, a device, equipment and a storage medium, wherein a monocular 3D target queue obtained after a camera performs target detection on a 3D space, and a laser 3D target queue and corresponding laser 3D information obtained after a laser radar performs target detection on the 3D space are acquired, the laser 3D information including a laser 3D probability distribution map and a laser 3D target information map; all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue are acquired and projected onto an image to obtain laser 3D projection frames and monocular 3D projection frames, a matching relation is established between the laser 3D projection frames and the monocular 3D projection frames, and the 3D target information in the laser 3D target information map is combined with the information of the monocular 3D target queue to obtain a multi-modal feature map; the multi-modal feature map is computed on with a convolutional network, and the computed result is filtered based on a confidence threshold to obtain a 3D target queue. That is, the monocular 3D target queue after the camera performs target detection on the 3D space, the laser 3D target queue after the laser radar performs target detection on the 3D space, and the corresponding laser 3D information are acquired, and a matching relationship is established between the laser 3D projection frame of the laser 3D target projected onto the image and the monocular 3D projection frame of the monocular 3D target projected onto the image, so that accurate multi-modal features can be obtained. In addition, the multi-modal feature map is computed with a simple convolutional network, context information is effectively exploited for learning, and the generalization performance of target detection is improved. That is, in the present application, the 3D target detection performance is optimized by performing post-fusion on the laser 3D targets and the monocular 3D targets and recalculating the confidence of the 3D targets.
Drawings
Fig. 1 is a schematic flow chart of a 3D object detection method according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of an algorithm flow of a first embodiment of the 3D object detection method of the present application;
fig. 3 is a comparison diagram of the overlap of the laser 3D projection frame with a 2D detection frame and with a monocular 3D projection frame, respectively, in the 3D target detection method of the present application;
FIG. 4 is a schematic structural diagram of a network prediction module in the 3D object detection method of the present application;
fig. 5 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
An embodiment of the present application provides a 3D target detection method, which is applied to a 3D target detection device in an embodiment of the 3D target detection method, and with reference to fig. 1 and fig. 2, the method includes:
step S10, acquiring a monocular 3D target queue after a camera performs target detection on a 3D space, a laser 3D target queue after a laser radar performs target detection on the 3D space and corresponding laser 3D information, wherein the laser 3D information comprises a laser 3D probability distribution map and a laser 3D target information map;
step S20, acquiring all laser 3D targets in a laser 3D target information graph and all monocular 3D targets in a monocular 3D target queue, projecting the laser 3D targets and the monocular 3D projection frames onto an image respectively, establishing a matching relation between the laser 3D projection frames and the monocular 3D projection frames, and combining 3D target information in the laser 3D target information graph and information of the monocular 3D target queue to obtain a multi-modal feature graph;
and S30, calculating the multi-modal characteristic diagram according to a convolutional network, and filtering and detecting the calculated result based on a confidence coefficient threshold value to obtain a 3D target queue.
The present embodiment is intended to: the detection performance of the 3D target in automatic driving is improved.
As an example, in the present application, a laser radar is used to detect targets in a 3D space to obtain a laser 3D target queue, a vehicle-mounted camera is used to detect targets in the 3D space to obtain a monocular 3D target queue, the laser 3D targets and the monocular 3D targets are subjected to target-level fusion, the confidence of the 3D targets is recalculated, and the detection performance of the 3D targets is optimized. Compared with the prior art in which 3D targets and 2D targets are matched and fused through a geometric-consistency mechanism between the laser 3D projection frame and the image 2D detection frame (in an ideal case, the 2D detection frame from image detection and the 3D projection frame from laser 3D detection coincide, but in practice they cannot fully coincide because of occlusion, camera perspective and other problems, so the geometric consistency of the two frames generally fails and the 3D target detection performance is low), the projection frame of the monocular 3D target and the projection frame of the laser 3D target have a higher degree of coincidence. In the present application, the monocular 3D target queue after the camera performs target detection on the 3D space, and the laser 3D target queue and the corresponding laser 3D information after the laser radar performs target detection on the 3D space are acquired, and a matching relation is established between the laser 3D projection frame of the laser 3D target projected onto the image and the monocular 3D projection frame of the monocular 3D target projected onto the image, so that accurate multi-modal features can be obtained. In addition, the multi-modal feature map is computed with a simple convolutional network, context information is effectively exploited for learning, and the generalization performance of target detection is improved. That is, in the application, the 3D target detection performance is optimized by performing post-fusion on the laser 3D targets and the monocular 3D targets and recalculating the confidence of the 3D targets.
As an example, in the present application, the CenterPoint framework is used for detecting laser 3D targets, with a VoxelNet network as the backbone network and a CenterHead as the detection head, so as to obtain the laser 3D information, which includes a laser 3D probability distribution map and a laser 3D target information map; the laser 3D probability distribution map is then fused and updated with the related data of the monocular 3D targets, so as to obtain an accurate 3D target queue. Exemplarily, the laser point cloud data collected by the laser radar is computed on based on the CenterPoint framework to obtain the laser 3D information, comprising the laser 3D probability distribution map and the 3D target size-coordinate information map. The laser 3D probability distribution map is directly used as the second BEV feature map. All the laser 3D target information represented in the laser 3D target size-coordinate map is consistently associated with the features of the monocular 3D targets, so that accurate multi-modal features are obtained.
As an example, in the application, when feature fusion is performed between the 2D detection frame from image detection and the laser 3D projection frame, the overlap between the 2D detection frame and the laser 3D projection frame is low due to occlusion and image perspective problems, which causes the IoU (Intersection over Union) feature to fail. Therefore, the laser 3D targets and the monocular 3D targets are projected onto the image, and a matching relation is established between the obtained laser 3D projection frame and monocular 3D projection frame to obtain the multi-modal features. That is, the monocular 3D detection projection and the laser 3D projection are used to establish the geometric-consistency association between the image and the laser. Illustratively, the IoU between the laser 3D projection frame and the monocular 3D projection frame is calculated to form a first BEV feature map, the confidence of the monocular 3D targets is obtained to form a third BEV feature map, the distance value of the laser 3D target is calculated based on the 3D target information (coordinate information) of the laser 3D target in the laser 3D target information map, and normalization is performed to obtain a fourth BEV feature map. Feature extraction is performed over the four BEV feature-map layers to obtain the multi-modal feature map.
As an example, in the present application, a matching relationship is established between the laser 3D projection frame and the monocular 3D projection frame, and the monocular 3D projection frame with the largest overlap with the laser 3D projection frame is matched, so as to calculate the IoU between the laser 3D projection frame and the monocular 3D projection frames, obtain the IoU feature and establish the consistency association between the image and the laser. Illustratively, each laser 3D projection frame corresponding to the queue formed by the laser 3D targets is traversed, and the following steps are performed on the traversed laser 3D projection frame: calculating the IoU between the laser 3D projection frame and each monocular 3D projection frame based on the size information of the monocular 3D projection frame and the laser 3D projection frame, wherein the size information is obtained by calculating the projection information of the monocular 3D target and the laser 3D target; and selecting the IoU with the largest value among the calculation results as the target IoU to form the first BEV feature map.
As an example, in the present application, in order to implement the monocular 3D detection projection and the laser 3D projection to establish the consistent association between the image and the laser, the monocular 3D object is projected onto the new BEV feature map, and the confidence of the monocular 3D object is obtained for establishing the multi-modal feature map. Illustratively, projecting the monocular 3D object queue onto the new BEV feature map to obtain the position information of each monocular 3D object on the new BEV feature map; and assigning the confidence coefficient of the monocular 3D object to a grid point corresponding to the position information on the new BEV feature map to form a third BEV feature map.
As an example, in the present application, target-level fusion is performed on a multi-modal feature map through a simple convolution network (e.g., 2D convolution), so that context information of the feature is effectively utilized, and generalization performance of 3D target detection can be improved. Illustratively, the multi-modal feature map is subjected to dimension raising processing, and channel weights are distributed to features in the multi-modal feature map through an SE attention mechanism; calculating the confidence coefficient of the multi-modal feature map based on the channel weight and the convolutional network; and filtering and detecting the calculated result based on the confidence coefficient threshold value to obtain a 3D target corresponding to the filtered confidence coefficient, and forming a 3D target queue.
In this embodiment, the specific application scenarios targeted are:
3D target detection and ranging plays an important role in the field of automatic driving, and the lidar has abundant 3D information and high positioning accuracy, and thus is often used as a target detection tool in automatic driving. However, the target detection by simply using the lidar as a single sensor is usually accompanied by more false detections and missed detections, which are caused by the sparsity of the lidar itself and the insufficient performance of the detection model.
For the above reasons, in 3D object detection, geometric consistency of a 3D projection frame and an image detection frame is adopted when an image-detected 3D object and a laser-detected 3D object are subjected to object-level fusion. However, the consistency is usually lost due to the perspective principle of the camera and the occlusion phenomenon of the target frame, which results in the decrease of the coincidence degree of the 3D target frame and the 2D frame. In the target-level fusion scheme, human experience needs to be added in the scheme of calculating the cost matrix through the correlation of 3D and 2D, and the generalization performance is low due to the influence of manual parameters to a certain extent. Thereby causing a problem of low 3D detection performance in automatic driving.
As an example, the 3D target detection method may be applied to a 3D target detection system including a vehicle camera, a laser radar, and a 3D target detection apparatus.
As an example, the 3D object detecting device may be built in a vehicle, or built in other mobile terminals, or be independent of the vehicle and other mobile terminals.
As an example, the 3D (three-dimensional) object may be different types of entities in the 3D space, and may be a car, a bus, a pedestrian, or other obstacles, and the like, without limitation.
As an example, the 3D space refers to a range space of a camera or a laser radar detection target on a vehicle, and may be an environmental space around the vehicle during the vehicle driving process.
The method comprises the following specific steps:
step S10, acquiring a monocular 3D target queue after a camera performs target detection on a 3D space, a laser 3D target queue after a laser radar performs target detection on the 3D space and corresponding laser 3D information, wherein the laser 3D information comprises a laser 3D probability distribution map and a laser 3D target information map;
as an example, a monocular 3D object refers to an object that is obtained from an image obtained by object detection in a camera 3D space, the object being in 3D form. The plurality of monocular 3D objects form a monocular 3D object queue.
As an example, the laser 3D target is a 3D target obtained by detecting an environment around a vehicle by a laser radar sensor and processing laser point cloud data acquired by the laser radar.
As an example, the BEV map is a bird's eye view, and a position map of the 3D object is obtained from a direction of the BEV view, and a plurality of grids are divided on the BEV map, each grid corresponding to one 3D object.
As an example, a laser 3D probability distribution map (W × H × 1), denoted as S_lidar, is obtained by laser radar detection, where W and H are the dimensions of the map, respectively. The value of each point in the probability distribution map is the confidence, which refers to the probability that a 3D target exists at that point.
As an example, the confidence is one-dimensional information data, and the 3D target information in the laser 3D target information map is seven-dimensional information data comprising the coordinate information, size information and yaw-angle information of the 3D target; the information of the i-th laser 3D target is denoted (x_i, y_i, z_i, l_i, w_i, h_i, θ_i).
As an example, a monocular 3D target after target detection is performed on a 3D space by a camera is acquired, and a laser 3D target after target detection is performed on the 3D space by a laser radar is acquired. The two pieces of information are used for detecting information such as the position and the distance of the 3D target vehicle and providing a data base for automatic driving.
Step S20, acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, projecting them onto an image to obtain laser 3D projection frames and monocular 3D projection frames, establishing a matching relation between the laser 3D projection frames and the monocular 3D projection frames, and combining the 3D target information in the laser 3D target information map with the information of the monocular 3D target queue to obtain a multi-modal feature map;
as an example, the multi-modal feature refers to a plurality of features for reflecting the 3D object, one feature represents one entity data, and the accuracy of obtaining the 3D object can be improved by feature data of different modalities.
As an example, a laser 3D projection frame refers to a 2D bounding frame obtained by projecting a laser 3D object onto an image. The monocular 3D projection frame is a 2D surrounding frame obtained by projecting a monocular 3D target on an image.
As an example, the image may also produce a 2D detection frame through 2D detection, and in an ideal case the 2D detection frame coincides with the 3D projection frame (the bounding frame formed by projecting the laser 3D target onto the image, that is, the laser 3D projection frame). However, due to occlusion of the 3D target or camera perspective, in practice the 2D detection frame and the 3D projection frame cannot coincide completely. The monocular 3D target and the laser 3D target, on the other hand, belong to the same 3D space, so their projections on the image are more consistent.
Referring to fig. 3, fig. 3 is a comparison diagram of the overlapping condition of the 2D detection frame and the monocular 3D projection frame with the laser 3D projection frame, respectively. Fig. 3 (a) and (b) show the coincidence of the 2D detection frame and the laser 3D projection frame, and fig. 3 (c) and (D) show the coincidence of the monocular 3D projection frame and the laser 3D projection frame.
Due to occlusion and image perspective problems, the coincidence degree of the 2D detection frame and the 3D projection frame is obviously low, and therefore the IoU characteristic is invalid. And the coincidence degree of the monocular 3D projection frame and the laser 3D projection frame is very high.
Therefore, the monocular 3D detection projection and the laser 3D projection are utilized to establish the consistency association of the image and the laser, the matching relation between the laser 3D projection frame and the monocular 3D projection frame is established, and a more accurate multi-modal characteristic diagram is obtained. That is, design the matching mechanism of monocular 3D and laser 3D projection, effectively solved 3D and 2D geometric inconsistency problem, promote the accuracy to 3D target detection.
As an example, the acquiring of all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, projecting them onto an image to obtain a laser 3D projection frame and a monocular 3D projection frame, establishing a matching relationship between the laser 3D projection frame and the monocular 3D projection frame, and combining the 3D target information in the laser 3D target information map with the information of the monocular 3D target queue to obtain a multi-modal feature map includes:
Step S21, acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, and projecting them onto an image to obtain the laser 3D projection frames and the monocular 3D projection frames;
Step S22, calculating the intersection-over-union (IoU) between the laser 3D projection frame and the monocular 3D projection frame to form a first BEV feature map;
Step S23, obtaining the confidence of the laser 3D target to form a second BEV feature map;
Step S24, obtaining the confidence of the monocular 3D target to form a third BEV feature map, wherein the confidence of the monocular 3D target is determined by information in the monocular 3D target queue;
Step S25, calculating a distance value for the laser 3D target based on the 3D target information of the laser 3D target in the laser 3D target information map, and normalizing it to obtain a fourth BEV feature map;
Step S26, splicing the four BEV feature maps to obtain the multi-modal feature map.
As an example, a laser 3D object and a monocular 3D object are projected onto an image, and the resulting laser 3D projection frame and monocular 3D projection frame refer to 2D bounding frames projected on the image.
As an example, the positional accuracy of 3D target detection is usually measured with the overlap, or Intersection over Union (IoU). The IoU is the quotient of the intersection value between the laser 3D projection frame and the monocular 3D projection frame and the union value between them, that is, the intersection value divided by the union value. When the IoU of the 2D bounding frames projected on the image by the laser 3D projection frame and the monocular 3D projection frame is calculated, the intersection value and the union value correspond respectively to the intersection area and the union area between the laser 3D projection frame and the monocular 3D projection frame.
That is, IoU = Area of Intersection / Area of Union, where Area of Intersection is the intersection area between the 2D frames of the laser 3D projection frame and the monocular 3D projection frame projected on the image, and Area of Union is the union area between those 2D frames.
A first BEV feature map is formed by calculating the IoU between the laser 3D projection frame and the monocular 3D projection frame; the first BEV feature map contains the IoU feature information between the laser 3D targets and the monocular 3D targets.
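For illustration only, the following sketch shows one way the IoU between projection frames on the image could be computed and written into a BEV grid to form the first BEV feature map; the function names, the box format (x1, y1, x2, y2) in pixels and the per-target grid-cell bookkeeping are assumptions of this sketch, not prescribed by the application.

```python
import numpy as np

def iou_2d(box_a, box_b):
    """IoU of two axis-aligned 2D boxes given as (x1, y1, x2, y2) in pixels."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def max_iou_feature(laser_boxes, mono_boxes, laser_cells, shape_wh):
    """First BEV feature map: for each laser target, the largest IoU between its
    image projection frame and all monocular projection frames, written into the
    BEV grid cell occupied by that laser target (cells given as (col, row))."""
    W, H = shape_wh
    feat = np.zeros((H, W), dtype=np.float32)
    for (col, row), lbox in zip(laser_cells, laser_boxes):
        best = max((iou_2d(lbox, mbox) for mbox in mono_boxes), default=0.0)
        feat[row, col] = best
    return feat
```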
As an example, a second BEV feature map formed from the confidence of the laser 3D targets is obtained, that is, the confidence distribution map of the laser 3D targets on the second BEV feature map, S_lidar. The confidence refers to the probability of a laser 3D target appearing in each grid of the second BEV feature map.
As an example, a third BEV feature map is formed by obtaining the confidence of the monocular 3D targets, which is determined by the information in the monocular 3D target queue; that is, the confidence distribution map of the monocular 3D targets is formed on the third BEV feature map. The confidence refers to the probability of a monocular 3D target appearing in each grid of the third BEV feature map.
As an example, based on the 3D target information of the laser 3D target, the distance value d_i from the laser 3D target to the center of the vehicle is calculated. That is, d_i is the normalized distance for the laser 3D target coordinate, computed from the distance between the center point of the laser 3D target and the center point of the vehicle.
Exemplarily, d_i is calculated as d_i = dist_i / dist_max, where dist_i is the Euclidean distance between the i-th laser 3D target and the vehicle, and dist_max is the maximum distance between the center point of a laser 3D target and the center point of the vehicle; a feature layer of distance values is thereby formed.
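A minimal sketch of how this distance feature layer could be filled, assuming the vehicle center is the origin of the lidar frame, that each laser target already carries its BEV grid cell, and that the ground-plane Euclidean distance is used; names and conventions are assumptions of this sketch.

```python
import numpy as np

def distance_feature_map(laser_targets, laser_cells, shape_wh, d_max):
    """Fourth BEV feature map: normalized distance d_i = dist_i / d_max between
    each laser 3D target center (x_i, y_i) and the vehicle center (assumed to be
    the origin of the lidar coordinate frame)."""
    W, H = shape_wh
    feat = np.zeros((H, W), dtype=np.float32)
    for (col, row), (x, y, z, l, w, h, yaw) in zip(laser_cells, laser_targets):
        dist = np.hypot(x, y)               # Euclidean distance in the ground plane
        feat[row, col] = min(dist / d_max, 1.0)
    return feat
```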
As an example, the multi-modal feature map is obtained from the multi-modal features: the IoU feature map (first BEV feature map), the laser 3D probability distribution map (second BEV feature map), the monocular 3D confidence distribution map (third BEV feature map) and the distance distribution map (fourth BEV feature map). That is, a 4-channel feature map (W × H × 4) is constructed from the multi-modal features by stacking the four BEV feature maps along the channel dimension.
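As a sketch (the function name and channel order are assumptions), the stacking step can be as simple as:

```python
import numpy as np

def build_multimodal_feature_map(iou_map, laser_prob_map, mono_conf_map, dist_map):
    """Stack the four (H, W) BEV maps into the (H, W, 4) multi-modal feature map."""
    assert iou_map.shape == laser_prob_map.shape == mono_conf_map.shape == dist_map.shape
    return np.stack([iou_map, laser_prob_map, mono_conf_map, dist_map], axis=-1)
```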
as an example, the obtaining a second BEV feature map formed of the confidence level of the laser 3D target includes:
and S231, taking a laser 3D probability distribution map obtained after the laser radar performs target detection on the 3D space as a second BEV characteristic map.
As an example, the laser 3D probability distribution map obtained by the CenterPoint-based laser 3D target detection is directly used as the second BEV feature map.
As an example, the obtaining the confidence level of the monocular 3D object forms a third BEV feature map, including:
step S241, projecting the monocular 3D targets to a new BEV feature map to obtain position information of each monocular 3D target on the new BEV feature map;
step S242, assigning the confidence of the monocular 3D object to the grid point corresponding to the position information on the new BEV feature map to form a third BEV feature map.
As an example, the monocular 3D target queue is projected onto a new BEV feature map, which has the same size range as the laser 3D detection, W × H. Each grid size on the third BEV map is denoted as p, and p = 2X/W, assuming that the range of the X direction of laser detection is (-X, X).
And projecting the central point of the monocular 3D target on the new BEV characteristic map based on the size coordinate information of the monocular 3D target to obtain the position information corresponding to the monocular 3D target on the new BEV characteristic map, namely finding the grid point of the monocular 3D target on the new BEV characteristic map. And assigning the confidence of the monocular 3D object to the grid point, wherein the assignment modes of other monocular 3D objects are basically the same, and are not described herein again. That is, the confidence of the 3D object is assigned to the grid points where the 3D object exists on the new BEV feature map, thereby obtaining a confidence distribution map of the monocular 3D object, i.e., the third BEV feature map.
As an example, because the number of monocular 3D objects is small, the formed monocular 3D confidence map is too sparse, and therefore, a gaussian blur needs to be performed on each grid point feature on the third BEV feature map. Gaussian blurring (also called gaussian smoothing) is a commonly used technique in image processing, and is mainly used to reduce noise and detail of an image. Namely, each grid point feature is amplified according to the Gaussian radius, the monocular 3D target representation range is expanded, and the features of the whole image are better learned. The larger the gaussian radius is, the more blurred the image is, and the smoother the numerical value is from the numerical point of view. Thus, the gaussian radius is set according to the monocular 3D object size.
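A sketch of steps S241 and S242 together with the Gaussian blurring, assuming vehicle-centered BEV extents (-X, X) and (-Y, Y), a fixed blur sigma (whereas the application sets the Gaussian radius according to the target size), and scipy's gaussian_filter as one possible blur implementation; all names here are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mono_confidence_map(mono_targets, mono_scores, shape_wh, x_range, y_range, sigma=2.0):
    """Third BEV feature map: project each monocular 3D target center onto the BEV
    grid, write its confidence at the corresponding grid point, then apply a
    Gaussian blur so that the sparse detections cover a larger range."""
    W, H = shape_wh
    feat = np.zeros((H, W), dtype=np.float32)
    px = 2.0 * x_range / W            # grid size p = 2X / W, as in the text
    py = 2.0 * y_range / H
    for (x, y, z, l, w, h, yaw), score in zip(mono_targets, mono_scores):
        col = int((x + x_range) / px)
        row = int((y + y_range) / py)
        if 0 <= col < W and 0 <= row < H:
            feat[row, col] = max(feat[row, col], score)
    return gaussian_filter(feat, sigma=sigma)
```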
And S30, calculating the multi-modal characteristic diagram according to a convolutional network, and filtering and detecting the calculated result based on a confidence coefficient threshold value to obtain a 3D target queue.
As an example, a convolutional network includes a neural network of 2D convolution, 3D convolution, etc., which is generally composed of convolutional layers, activation layers, and pooling layers, inputs image data, and outputs a specific feature space of an image.
As an example, a new (W × H × 1) probability distribution map of the 3D targets is calculated by predicting on the multi-modal feature map with a simple set of 2D convolutions, and the laser 3D probability distribution map obtained from the initial laser 3D detection is corrected by this new probability distribution map. The final 3D detection target queue is obtained through confidence-threshold and NMS filtering. It can be understood that a confidence threshold is obtained, the points in the probability distribution map that are larger than the confidence threshold are selected, and the 3D targets corresponding to those points form the 3D target queue.
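A sketch of this final filtering step under the assumptions of the earlier sketches (each laser target carries its BEV grid cell); the threshold value is illustrative and the NMS pass over the surviving boxes is omitted.

```python
import numpy as np

def filter_targets(new_prob_map, laser_targets, laser_cells, conf_thresh=0.3):
    """Replace the initial laser confidences with the fused probabilities and keep
    only the laser 3D targets whose new confidence exceeds the threshold."""
    kept = []
    for target, (col, row) in zip(laser_targets, laser_cells):
        score = float(new_prob_map[row, col])
        if score > conf_thresh:
            kept.append((target, score))
    return kept
```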
As an example, the calculating the multi-modal feature map according to a convolutional network, and filtering the calculated result based on a confidence threshold to obtain a 3D object queue includes:
step S31, performing dimension raising processing on the multi-modal feature map, and distributing channel weights to features in the multi-modal feature map through an SE (selective emitter) attention mechanism;
step S32, calculating the confidence of the multi-modal feature map based on the channel weight and the convolution network;
and S33, filtering and detecting the calculated result based on the confidence coefficient threshold value to obtain a 3D target corresponding to the filtered confidence coefficient, and forming a 3D target queue.
As an example, referring to fig. 4, the feature map is raised in dimension, making the 4-channel multi-modal feature map richer, for example raising it to 16-dimensional features. Different weights are assigned to different feature channels by the SE feature attention mechanism, and the weights are obtained adaptively when the SE feature attention mechanism computes on the data. A (W × H × 1) feature is then computed through a series of 2D convolutions, the series including two 3 × 3 2D convolutions, and the final probability distribution map is obtained through a sigmoid operation. The initial 3D detection distribution map is replaced by this probability distribution map, and the final 3D detection target queue is obtained through confidence-threshold and NMS filtering.
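A PyTorch sketch of such a network prediction module; the 1 × 1 convolution used for dimension raising, the SE reduction ratio and the exact layer arrangement are assumptions of this sketch rather than the precise structure of fig. 4.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: adaptively re-weights feature channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (N, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))         # squeeze -> per-channel weights
        return x * w.view(x.size(0), -1, 1, 1)  # excite

class FusionHead(nn.Module):
    """Raise the 4-channel multi-modal map to 16 channels, apply SE channel
    attention, then two 3x3 2D convolutions and a sigmoid to predict the new
    (W x H x 1) probability distribution map."""
    def __init__(self, in_ch=4, mid_ch=16):
        super().__init__()
        self.lift = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.se = SEBlock(mid_ch)
        self.head = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, x):                       # x: (N, 4, H, W)
        return self.head(self.se(self.lift(x)))
```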
In this embodiment, through the fusion of the monocular 3D targets and the laser 3D targets, the matching degree in terms of geometric consistency is higher, and the detection performance can be improved with simple 2D convolution layers: the 2D convolutions compute the features and effectively exploit the context information of the BEV feature space. The SE attention mechanism assigns different weights to different feature channels, which improves the network's attention to them and makes the learned features more effective and accurate. That is, the 3D detection performance is improved through target-level fusion based on monocular 3D and lidar 3D detection.
The application provides a 3D target detection method, a device, equipment and a storage medium. Compared with the low 3D detection performance in current autonomous driving, in the application the monocular 3D target queue after the camera performs target detection on the 3D space, and the laser 3D target queue and the corresponding laser 3D information after the laser radar performs target detection on the 3D space are acquired, a matching relation is established between the laser 3D projection frame of the laser 3D target projected onto the image and the monocular 3D projection frame of the monocular 3D target projected onto the image, and accurate multi-modal features can be obtained. In addition, the multi-modal feature map is computed with a simple convolutional network, context information is effectively exploited for learning, and the generalization performance of target detection is improved. That is, in the present application, the 3D target detection performance is optimized by performing post-fusion on the laser 3D targets and the monocular 3D targets and recalculating the confidence of the 3D targets.
Based on the first embodiment of the 3D object detection method, a second embodiment of the 3D object detection method is proposed.
The acquiring of the laser 3D target queue and the corresponding laser 3D information after the laser radar performs target detection on the 3D space, where the laser 3D information includes a laser 3D probability distribution map and a laser 3D target information map, includes:
a11, acquiring laser point cloud data of a laser radar for carrying out target detection on a 3D space;
and step A12, based on the CenterPoint framework, using a VoxelNet network as the backbone network and a CenterHead as the detection head, computing on the laser point cloud data to obtain the laser 3D target queue and the corresponding laser 3D information, including the laser 3D probability distribution map and the laser 3D target information map.
The laser point cloud data is computed on based on the CenterPoint model to obtain the laser 3D probability distribution map and the laser 3D target information map, wherein the value corresponding to each 3D target in the laser 3D probability distribution map is the confidence of the laser 3D target, and each 3D target in the laser 3D target information map corresponds to the 3D target information of that laser 3D target;
as an example, the confidence is one-dimensional information data, the 3D target information is seven-dimensional information data, and the information comprises coordinate information, size information and yaw angle information of the 3D target, and the ith laser 3D target information
Figure BDA0003953781270000161
And (4) showing.
As an example, centerPoint is a model based on two-phase detection-coupled tracking of Center-based. In the first stage, the center of the 3D target is detected using a keypoint detector (e.g., lidar) and the 3D dimensions, 3D direction, and speed of the detection frame are regressed. In the second stage, a memory refining module is designed, and the detection frame generated in the first stage is refined by using additional point characteristics.
As an example, laser 3D target detection adopts the CenterPoint framework: the laser point cloud data from the laser radar's target detection on the 3D space is voxelized to generate voxel features, and features are extracted from them. Illustratively, the backbone part of the CenterPoint framework uses VoxelNet as a 3D encoder to extract features and obtain a feature map, and an anchor-free detection head directly outputs a probability distribution map (W × H × 1), denoted S_lidar. The value corresponding to each point in the probability distribution map is the first confidence of the laser 3D target. Another branch of the CenterPoint framework regresses the size-coordinate information of the target frame, that is, the 3D target information, recorded as (x, y, z, l, w, h, θ), where (x, y, z) is the position information of the 3D target, (l, w, h) is the length, width and height information of the 3D target, and θ is the yaw angle of the 3D target.
As an example, W and H are the dimensions of the probability distribution map, respectively, with the value of each point in the map being the confidence of the laser 3D object. Each point in the probability distribution map is projected into the BEV map, each point corresponding to a grid position of the BEV, with each grid size denoted as p. Assuming that the range of the X direction of laser detection is (-X, X), p =2X/W.
As an example, the laser 3D object detection result includes a laser 3D probability distribution map, and a laser 3D object information map. The probability distribution map corresponds to the size of the 3D object information map. Therefore, a total of (W × H) laser 3D objects obtained by laser detection form a laser 3D object queue.
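For illustration (the dictionary layout, the array conventions and the optional confidence floor are assumptions of this sketch, not the CenterPoint interface), the (W × H) laser outputs could be collected into a target queue as follows:

```python
import numpy as np

def laser_target_queue(prob_map, box_map, keep_thresh=0.0):
    """Collect the (W x H) laser candidates into a laser 3D target queue.

    prob_map: (H, W) laser 3D probability distribution map (confidences).
    box_map:  (H, W, 7) laser 3D target information map (x, y, z, l, w, h, yaw).
    keep_thresh: optional confidence floor; 0.0 keeps every grid cell, as in the text.
    """
    H, W = prob_map.shape
    queue = []
    for row in range(H):
        for col in range(W):
            score = float(prob_map[row, col])
            if score >= keep_thresh:
                x, y, z, l, w, h, yaw = box_map[row, col]
                queue.append({"cell": (col, row), "score": score,
                              "box": (float(x), float(y), float(z),
                                      float(l), float(w), float(h), float(yaw))})
    return queue
```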
In this embodiment, a laser 3D target in a 3D space is obtained through CenterPoint frame calculation, and target level fusion is performed in combination with a monocular 3D target, so as to obtain an accurate 3D target queue.
As an example, the acquiring a monocular 3D object queue after object detection is performed on the 3D space by the camera includes:
step A13, obtaining monocular 3D target queues based on a Smoke algorithm, wherein each monocular 3D target queue comprises confidence coefficient and size coordinate information.
The image acquired by the camera over the 3D space is detected to obtain the monocular 3D target queue. Illustratively, a SMOKE network framework is adopted to detect the image, and the final n monocular 3D targets are obtained through an NMS algorithm to form the monocular 3D target queue. The confidence and size-coordinate information of each monocular 3D target is computed directly by the SMOKE network framework, and the size-coordinate information of the j-th monocular 3D target is denoted (x_j, y_j, z_j, l_j, w_j, h_j, θ_j).
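Both queues can be held in a common record; the following dataclass is an illustrative assumption of what such a target entry could look like, drawn from the seven-dimensional description above plus the confidence.

```python
from dataclasses import dataclass

@dataclass
class Target3D:
    """A single 3D target as used by either branch: seven-dimensional
    size-coordinate information plus a detection confidence."""
    x: float                 # center position
    y: float
    z: float
    l: float                 # length, width, height
    w: float
    h: float
    yaw: float               # yaw angle theta
    score: float             # confidence
    source: str = "lidar"    # "lidar" or "mono"
```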
Based on the first embodiment or the second embodiment of the 3D object detection method described above, a third embodiment of the 3D object detection method is proposed.
The calculating of the intersection-over-union (IoU) between the laser 3D projection frame and the monocular 3D projection frame to form the first BEV feature map includes:
Step B11, traversing each laser 3D projection frame corresponding to the queue formed by the laser 3D targets, and performing the following steps on the traversed laser 3D projection frame:
Step B12, calculating the IoU between the laser 3D projection frame and each monocular 3D projection frame based on the size information of the monocular 3D projection frame and the laser 3D projection frame, wherein the size information is obtained by calculating the projection information of the monocular 3D target and the laser 3D target;
Step B13, selecting the IoU with the largest value among the calculation results as the target IoU to form the first BEV feature map.
As an example, the IoU here refers to the intersection-over-union of the 2D bounding frames generated by projecting the laser 3D target and the monocular 3D target onto the image. In order to establish the matching relationship between the monocular 3D projection frame and the laser 3D projection frame and obtain the 3D target, the IoU value is calculated between the laser 3D projection frame and every monocular 3D projection frame; the larger the overlap between the laser 3D projection frame and a monocular 3D projection frame, the larger the IoU value. Therefore, the first BEV feature map, i.e., the IoU feature map, is created by taking the IoU with the largest value among the IoU values calculated between the laser 3D projection frame and all monocular 3D projection frames.
As an example, each laser 3D projection frame is traversed, and the following steps are performed on the traversed laser 3D projection frame:
and calculating the intersection ratio between the laser 3D projection frame and each monocular 3D projection frame based on the size information of the monocular 3D projection frame and the laser 3D projection frame, wherein the size information is obtained by calculating the projection information of the monocular 3D object and the laser 3D object.
For example, taking a laser 3D projection frame for explanation: a laser 3D bounding box has 8 corner points, which yield 8 projected coordinates when projected onto the image. The coordinates of each corner point of a laser 3D target can be obtained from the 3D target information of that target, and with the intrinsic and extrinsic parameters of the camera that captured the image, the 8 projected coordinates on the image can be computed. The 4 outermost projected points are then selected to form an enclosing 2D frame, which is recorded as the laser 3D projection frame. The monocular 3D projection frame is obtained in essentially the same way and is not described again here.
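A minimal sketch of this corner-projection step, assuming a pinhole camera model with intrinsic matrix K and an extrinsic rotation/translation (R, t) from the lidar frame to the camera frame; the box parameterisation (centre, size, yaw) and helper names are illustrative assumptions:

```python
import numpy as np

def box3d_corners(center, size, yaw):
    """Return the 8 corner points (3x8) of a 3D box given centre xyz, size lwh, yaw."""
    l, w, h = size
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    y = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2
    z = np.array([-h, -h, -h, -h,  h,  h,  h,  h]) / 2
    rot = np.array([[np.cos(yaw), -np.sin(yaw), 0],
                    [np.sin(yaw),  np.cos(yaw), 0],
                    [0,            0,           1]])
    return rot @ np.vstack([x, y, z]) + np.asarray(center).reshape(3, 1)

def projection_frame(corners, K, R, t):
    """Project the 3D corners onto the image plane and take the enclosing
    2D frame (the 'projection frame' described above)."""
    cam = R @ corners + t.reshape(3, 1)      # lidar frame -> camera frame
    uv = K @ cam
    uv = uv[:2] / uv[2:3]                    # perspective division
    return np.array([uv[0].min(), uv[1].min(), uv[0].max(), uv[1].max()])

# Toy example with an identity extrinsic and a simple intrinsic matrix.
K = np.array([[800.0, 0, 640], [0, 800.0, 360], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)
corners = box3d_corners(center=(2.0, 0.0, 10.0), size=(4.0, 2.0, 1.5), yaw=0.1)
print(projection_frame(corners, K, R, t))    # (x1, y1, x2, y2) on the image
```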
Therefore, the intersection-over-union between the laser 3D projection frame and each monocular 3D projection frame is calculated from the size information of the two frames, and the largest value among the results is selected as the target intersection-over-union. The first BEV feature map is then formed from at least one target intersection-over-union and contains the IoU feature information between the laser 3D targets and the monocular 3D targets.
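Continuing the sketch, the value written into the first BEV feature map for one laser 3D target could be obtained as below; the box format (x1, y1, x2, y2) and the function names are assumptions:

```python
import numpy as np

def iou_2d(a, b):
    """IoU of two axis-aligned 2D boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def target_iou(laser_box, mono_boxes):
    """Largest IoU between one laser 3D projection frame and all monocular
    3D projection frames; 0 if there is no monocular target to match."""
    if len(mono_boxes) == 0:
        return 0.0
    return max(iou_2d(laser_box, m) for m in mono_boxes)

# Toy example: the best-matching monocular frame determines the value
# written into the first BEV feature map at this laser target's grid cell.
laser_box = np.array([100, 100, 200, 220], float)
mono_boxes = [np.array([110, 105, 205, 230], float),
              np.array([400, 300, 450, 380], float)]
print(round(target_iou(laser_box, mono_boxes), 3))
```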
In this embodiment, establishing a consistency association between the image and the laser for the monocular 3D projection frames and the laser 3D projection frames resolves the failure of geometric consistency caused by camera perspective, occlusion of target frames and similar problems, and improves the accuracy of 3D target detection.
Referring to fig. 5, fig. 5 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present application.
As shown in fig. 5, the 3D object detecting apparatus may include: a processor 1001, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection communication between the processor 1001 and the memory 1005.
Optionally, the 3D object detection device may further include a user interface, a network interface, a camera, an RF (Radio Frequency) circuit, a sensor, a WiFi module, and the like. The user interface may comprise a display screen (Display) and an input sub-module such as a keyboard (Keyboard); optionally, the user interface may also comprise a standard wired interface or a wireless interface. The network interface may include a standard wired interface or a wireless interface (e.g., a Wi-Fi interface).
Those skilled in the art will appreciate that the 3D object detecting device configuration shown in fig. 5 does not constitute a limitation of the 3D object detecting device and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 5, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, and a 3D object detection program. The operating system is a program that manages and controls the hardware and software resources of the 3D object detection device, supporting the execution of the 3D object detection program as well as other software and/or programs. The network communication module is used to enable communication between the various components within the memory 1005, as well as with other hardware and software in the 3D object detection system.
In the 3D object detection device shown in fig. 5, the processor 1001 is configured to execute a 3D object detection program stored in the memory 1005 to implement the steps of the 3D object detection method according to any one of the above.
The specific implementation of the 3D object detection device in the present application is substantially the same as that of each embodiment of the 3D object detection method, and is not described herein again.
The present application further provides a 3D object detection apparatus, the apparatus including:
the target queue generating module is used for acquiring a monocular 3D target queue after a camera performs target detection on a 3D space, a laser 3D target queue after a laser radar performs target detection on the 3D space and corresponding laser 3D information, wherein the laser 3D information comprises a laser 3D probability distribution map and a laser 3D target information map;
the feature extraction module is used for acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, respectively projecting the laser 3D projection frames and the monocular 3D projection frames onto an image, establishing a matching relation between the laser 3D projection frames and the monocular 3D projection frames, and combining 3D target information in the laser 3D target information graph and information of the monocular 3D target queue to obtain a multi-mode feature graph;
and the network prediction module is used for calculating the multi-modal characteristic graph according to a convolutional network, and filtering and detecting the calculated result based on a confidence coefficient threshold value to obtain a 3D target queue.
In a possible implementation manner of the present application, the target queue generating module further includes:
the first acquisition submodule is used for acquiring laser point cloud data of a laser radar for carrying out target detection on a 3D space;
the first calculation submodule is used for calculating the laser point cloud data by using a VoxelNet network as a backbone network and using a Centerhead as a detection head based on a CenterPoint frame to obtain a laser 3D target queue and corresponding laser 3D information, wherein the laser 3D target queue comprises a laser 3D probability distribution map and a laser 3D target information map;
and/or, the target queue generating module further comprises:
and the second calculation submodule is used for obtaining monocular 3D object queues based on a Smoke algorithm, and each monocular 3D object queue comprises confidence coefficient and size coordinate information.
And/or, the feature extraction module comprises:
the second obtaining sub-module is used for obtaining all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue to project to an image to obtain a laser 3D projection frame and a monocular 3D projection frame;
the third calculation submodule is used for calculating the intersection ratio between the laser 3D projection frame and the monocular 3D projection frame to form a first BEV characteristic diagram;
the third obtaining sub-module is used for obtaining a second BEV characteristic diagram formed by the confidence degrees of the laser 3D target;
a fourth obtaining sub-module, configured to obtain a confidence of the monocular 3D object to form a third BEV feature map, where the confidence of the monocular 3D object is determined by information in the monocular 3D object queue;
the fourth calculation submodule is used for calculating a distance numerical value of the laser 3D target based on the 3D target information of the laser 3D target in the laser 3D target information graph, and performing normalization to obtain a fourth BEV characteristic graph;
and the feature extraction submodule is used for splicing the four BEV feature maps to obtain a multi-modal feature map.
And/or, the third computation submodule further comprises:
a traversing unit, configured to traverse the laser 3D projection frames corresponding to the queue formed by the laser 3D object, and perform the steps for each traversed laser 3D projection frame:
the data calculation subunit is configured to calculate an intersection ratio between the laser 3D projection frame and each monocular 3D projection frame based on size information of the monocular 3D projection frame and the laser 3D projection frame, where the size information is calculated from projection information of the monocular 3D object and the laser 3D object;
and the data selection subunit is used for selecting the intersection-parallel ratio with the largest value in the calculation results as a target intersection-parallel ratio to form a first BEV characteristic diagram.
And/or the third obtaining sub-module further comprises:
and the characteristic diagram acquisition unit is used for taking a laser 3D probability distribution diagram obtained after the laser radar carries out target detection on the 3D space as a second BEV characteristic diagram.
And/or the third obtaining submodule further comprises:
the target projection unit is used for projecting the monocular 3D targets onto a new BEV feature map to obtain the position information of each monocular 3D target on the new BEV feature map;
and the assignment unit is used for assigning the confidence coefficient of the monocular 3D object to the grid point corresponding to the position information on the new BEV feature map to form a third BEV feature map, and the confidence coefficient of the monocular 3D object is determined by the information in the monocular 3D object queue.
And/or, the network prediction module further comprises:
the preprocessing submodule is used for performing dimensionality-up processing on the multi-modal feature map and distributing channel weights to features in the multi-modal feature map through an SE (sequence analysis) attention mechanism;
the network prediction submodule is used for calculating the confidence coefficient of the multi-modal feature map based on the channel weight and the convolution network;
and the filtering submodule is used for carrying out filtering detection on the calculated result based on the confidence coefficient threshold value to obtain a 3D target corresponding to the filtered confidence coefficient, and a 3D target queue is formed.
The specific implementation of the 3D object detection apparatus of the present application is substantially the same as the embodiments of the 3D object detection method, and is not described herein again.
The present application provides a computer-readable storage medium, and the computer-readable storage medium stores one or more programs, which may be executed by one or more processors to implement the steps of the 3D object detection method described in any one of the above.
The specific implementation of the storage medium of the present application is substantially the same as the embodiments of the 3D object detection method, and is not described herein again.
The present application also provides a computer program product, comprising a computer program which, when executed by a processor, realizes the steps of the above-described 3D object detection method.
The specific implementation of the computer program product of the present application is substantially the same as that of the embodiments of the 3D object detection method, and is not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another like element in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or by hardware alone, but in many cases the former is the better implementation. Based on such understanding, the technical solutions of the present invention, or the portions thereof contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (11)

1. A method of 3D object detection, the method comprising:
acquiring a monocular 3D target queue after a camera performs target detection on a 3D space, a laser 3D target queue after a laser radar performs target detection on the 3D space and corresponding laser 3D information, wherein the laser 3D information comprises a laser 3D probability distribution map and a laser 3D target information map;
acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, respectively projecting the laser 3D projection frames and the monocular 3D projection frames onto an image, establishing a matching relation between the laser 3D projection frames and the monocular 3D projection frames, and combining 3D target information in the laser 3D target information graph and information of the monocular 3D target queue to obtain a multi-modal characteristic graph;
and calculating the multi-modal characteristic diagram according to a convolutional network, and filtering and detecting the calculated result based on a confidence threshold value to obtain a 3D target queue.
2. The 3D target detection method of claim 1, wherein the acquiring a laser 3D target queue after the laser radar performs target detection on the 3D space and corresponding laser 3D information, the laser 3D information comprising a laser 3D probability distribution map and a laser 3D target information map, comprises:
acquiring laser point cloud data of a laser radar for carrying out target detection on a 3D space;
and on the basis of the CenterPoint frame, a VoxelNet network is used as a main network, and the Centerhead is used as a detection head to calculate the laser point cloud data to obtain a laser 3D target queue and corresponding laser 3D information, including a laser 3D probability distribution map and a laser 3D target information map.
The monocular 3D object queue after the acquisition camera performs object detection on the 3D space comprises:
and acquiring monocular 3D object queues based on a Smoke algorithm, wherein each monocular 3D object queue comprises confidence coefficient and size coordinate information.
3. The 3D object detection method according to claim 1, wherein the acquiring all laser 3D objects in the laser 3D object queue and all monocular 3D objects in the monocular 3D object queue, projecting them respectively onto an image to obtain a laser 3D projection frame and a monocular 3D projection frame, establishing a matching relationship between the laser 3D projection frame and the monocular 3D projection frame, and obtaining a multi-modal feature map by combining the 3D object information in the laser 3D object information map and the information of the monocular 3D object queue, comprises:
acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, and projecting them onto an image to obtain a laser 3D projection frame and a monocular 3D projection frame;
calculating the intersection-over-union between the laser 3D projection frame and the monocular 3D projection frame to form a first BEV feature map;
acquiring a second BEV characteristic diagram formed by the confidence degree of the laser 3D target;
obtaining the confidence coefficient of the monocular 3D object to form a third BEV characteristic diagram, wherein the confidence coefficient of the monocular 3D object is determined by information in the monocular 3D object queue;
calculating a distance numerical value of the laser 3D target based on the 3D target information of the laser 3D target in the laser 3D target information graph, and performing normalization to obtain a fourth BEV characteristic graph;
and splicing the four BEV feature maps to obtain a multi-modal feature map.
4. The 3D object detection method of claim 3, wherein the calculating the intersection-over-union between the laser 3D projection frame and the monocular 3D projection frame to form a first BEV feature map comprises:
traversing the laser 3D projection frames corresponding to the queue formed by the laser 3D targets, and performing the following steps on each traversed laser 3D projection frame:
calculating the intersection-over-union between the laser 3D projection frame and each monocular 3D projection frame based on the size information of the monocular 3D projection frame and the laser 3D projection frame, wherein the size information is calculated from the projection information of the monocular 3D target and the laser 3D target;
and selecting the intersection-over-union with the largest value among the calculation results as a target intersection-over-union to form the first BEV feature map.
5. The 3D object detection method of claim 3, wherein the obtaining a second BEV feature map of the confidence level of the laser 3D object comprises:
and taking a laser 3D probability distribution graph obtained after the laser radar carries out target detection on the 3D space as a second BEV characteristic graph.
6. The 3D object detection method of claim 3, wherein the obtaining the confidence level of the monocular 3D object forms a third BEV feature map comprising:
projecting the monocular 3D targets to a new BEV feature map to obtain the position information of each monocular 3D target on the new BEV feature map;
and assigning the confidence of the monocular 3D object to a grid point corresponding to the position information on the new BEV feature map to form a third BEV feature map, wherein the confidence of the monocular 3D object is determined by the information in the monocular 3D object queue.
7. The 3D object detection method of claim 1, wherein the computing the multi-modal feature map according to a convolutional network, and filtering the computed result based on a confidence threshold to obtain a 3D object queue, comprises:
performing dimension-raising processing on the multi-modal feature map, and assigning channel weights to features in the multi-modal feature map through an SE (Squeeze-and-Excitation) attention mechanism;
calculating the confidence of the multi-modal feature map based on the channel weights and the convolutional network;
and filtering and detecting the calculated result based on the confidence coefficient threshold value to obtain a 3D target corresponding to the filtered confidence coefficient, and forming a 3D target queue.
8. A 3D object detection apparatus, characterized in that the apparatus comprises:
the target queue generating module is used for acquiring a monocular 3D target queue after a camera performs target detection on a 3D space, a laser 3D target queue after a laser radar performs target detection on the 3D space and corresponding laser 3D information, wherein the laser 3D information comprises a laser 3D probability distribution map and a laser 3D target information map;
the feature extraction module is used for acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, respectively projecting the laser 3D projection frames and the monocular 3D projection frames onto an image, establishing a matching relation between the laser 3D projection frames and the monocular 3D projection frames, and combining 3D target information in the laser 3D target information graph and information of the monocular 3D target queue to obtain a multi-mode feature graph;
and the network prediction module is used for calculating the multi-modal characteristic graph according to a convolutional network, and filtering and detecting the calculated result based on a confidence coefficient threshold value to obtain a 3D target queue.
9. The 3D object detection apparatus of claim 8, wherein the target queue generating module further comprises:
the first acquisition submodule is used for acquiring laser point cloud data of a laser radar for carrying out target detection on a 3D space;
the first calculation submodule is used for calculating the laser point cloud data based on the CenterPoint framework, with a VoxelNet network as the backbone network and CenterHead as the detection head, to obtain a laser 3D target queue and corresponding laser 3D information, wherein the laser 3D information comprises a laser 3D probability distribution map and a laser 3D target information map;
and/or, the target queue generating module further comprises:
and the second calculation submodule is used for obtaining monocular 3D object queues based on a Smoke algorithm, and each monocular 3D object queue comprises confidence coefficient and size coordinate information.
And/or, the feature extraction module comprises:
the second obtaining submodule is used for acquiring all laser 3D targets in the laser 3D target queue and all monocular 3D targets in the monocular 3D target queue, and projecting them onto an image to obtain a laser 3D projection frame and a monocular 3D projection frame;
the third calculation submodule is used for calculating the intersection ratio between the laser 3D projection frame and the monocular 3D projection frame to form a first BEV characteristic diagram;
the third obtaining sub-module is used for obtaining a second BEV characteristic diagram formed by the confidence degrees of the laser 3D target;
a fourth obtaining sub-module, configured to obtain a confidence of the monocular 3D object to form a third BEV feature map, where the confidence of the monocular 3D object is determined by information in the monocular 3D object queue;
the fourth calculation submodule is used for calculating a distance numerical value of the laser 3D target based on the 3D target information of the laser 3D target in the laser 3D target information graph, and performing normalization to obtain a fourth BEV characteristic graph;
and the feature extraction submodule is used for splicing the four BEV feature maps to obtain a multi-modal feature map.
And/or, the third computation submodule further comprises:
a traversing unit, configured to traverse the laser 3D projection frames corresponding to the queue formed by the laser 3D object, and perform the steps for each traversed laser 3D projection frame:
the data calculation subunit is configured to calculate the intersection ratio between the laser 3D projection frame and each monocular 3D projection frame based on the size information of the monocular 3D projection frame and the laser 3D projection frame, wherein the size information is calculated from the projection information of the monocular 3D target and the laser 3D target;
and the data selection subunit is used for selecting the intersection-parallel ratio with the largest value in the calculation results as a target intersection-parallel ratio to form a first BEV characteristic diagram.
And/or the third obtaining sub-module further comprises:
and the characteristic diagram acquisition unit is used for taking a laser 3D probability distribution diagram obtained after the laser radar carries out target detection on the 3D space as a second BEV characteristic diagram.
And/or the third obtaining sub-module further comprises:
the target projection unit is used for projecting the monocular 3D targets onto a new BEV feature map to obtain the position information of each monocular 3D target on the new BEV feature map;
and the assignment unit is used for assigning the confidence coefficient of the monocular 3D object to the grid point corresponding to the position information on the new BEV feature map to form a third BEV feature map, and the confidence coefficient of the monocular 3D object is determined by the information in the monocular 3D object queue.
And/or, the network prediction module further comprises:
the preprocessing submodule is used for performing dimension-raising processing on the multi-modal feature map and assigning channel weights to features in the multi-modal feature map through an SE (Squeeze-and-Excitation) attention mechanism;
the network prediction submodule is used for calculating the confidence coefficient of the multi-modal feature map based on the channel weight and the convolution network;
and the filtering submodule is used for filtering and detecting the calculated result based on the confidence coefficient threshold value to obtain a 3D target corresponding to the filtered confidence coefficient and form a 3D target queue.
10. A 3D object detection device, characterized in that the 3D object detection device comprises a memory, a processor and a 3D object detection program stored on the memory and executable on the processor, the processor implementing the steps of the 3D object detection method according to any one of claims 1 to 7 when executing the 3D object detection program.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a 3D object detection program, which 3D object detection program, when executed by a processor, implements the steps of the 3D object detection method according to any one of claims 1 to 7.
CN202211462545.5A 2022-11-21 2022-11-21 3D target detection method, device, equipment and storage medium Pending CN115861628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211462545.5A CN115861628A (en) 2022-11-21 2022-11-21 3D target detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211462545.5A CN115861628A (en) 2022-11-21 2022-11-21 3D target detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115861628A true CN115861628A (en) 2023-03-28

Family

ID=85664671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211462545.5A Pending CN115861628A (en) 2022-11-21 2022-11-21 3D target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115861628A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination