CN112990129B - Three-dimensional object detection method and system based on combination of vision and laser radar - Google Patents

Three-dimensional object detection method and system based on combination of vision and laser radar

Info

Publication number
CN112990129B
Authority
CN
China
Prior art keywords
depth
visual
point
candidate
dimensional object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110456751.4A
Other languages
Chinese (zh)
Other versions
CN112990129A (en)
Inventor
李金波
周启龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Wanwei Robot Co ltd
Original Assignee
Changsha Wanwei Robot Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Wanwei Robot Co ltd filed Critical Changsha Wanwei Robot Co ltd
Priority to CN202110456751.4A priority Critical patent/CN112990129B/en
Publication of CN112990129A publication Critical patent/CN112990129A/en
Application granted granted Critical
Publication of CN112990129B publication Critical patent/CN112990129B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 17/00 Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S 17/86 Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Electromagnetism (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a three-dimensional object detection method and system based on the combination of vision and laser radar. The method comprises the following steps: acquiring a current point cloud frame and a current video frame; performing visual detection on the current video frame to obtain visual detection results; performing depth judgment on each visual detection result to obtain a visual approximate depth; converting the current point cloud frame into a sparse depth image and extracting candidate depth detection frames in the sparse depth image according to the visual detection results; constructing candidate point clusters of the current point cloud frame according to the candidate depth detection frames and the visual approximate depths; and constructing three-dimensional object detection results based on each visual detection result and the corresponding candidate point cluster. The method can detect three-dimensional objects at both medium-short and long range, handles errors caused by scene background and adjacent objects falling inside the two-dimensional detection frames as well as occlusion in the scene, and thereby extends the perception range of the mobile platform and improves the three-dimensional object detection effect.

Description

Three-dimensional object detection method and system based on combination of vision and laser radar
Technical Field
The invention relates to the technical field of three-dimensional object detection, in particular to a three-dimensional object detection method and system based on combination of vision and laser radar.
Background
Three-dimensional object detection plays an important role in environmental perception for applications such as automatic driving and mobile robots. Three-dimensional measurement data are obtained from sensors such as multi-line laser radar, depth cameras, stereo vision, and monocular vision combined with depth estimation.
The three-dimensional point cloud obtained by multi-line laser radar measurement provides spatial coordinates and depth information and is widely adopted in autonomous vehicles and mobile robots, but the vertical resolution of the laser radar point cloud is low and the laser measurement points become sparse as the measurement distance increases, so the point cloud is insufficient for measuring small or distant objects. Millimeter wave radar can directly measure the radial velocity of an object, but its measurement points are sparse and noise is significant. A depth camera directly obtains a scene depth map and point cloud by emitting infrared light, but it is affected by sunlight outdoors and its range is typically less than twenty meters. Stereo vision obtains a dense depth map by computing a disparity map, and monocular vision can estimate a depth map from a single image through the training and inference of a deep neural network, but the error of visual depth estimation grows as the measurement distance in the scene increases. Therefore, the multi-line laser radar is the sensor most commonly used for environment perception on various mobile platforms.
Recent methods for three-dimensional object detection with multi-line laser radar are based on voxels, multiple views, or point features. Multi-sensor combinations such as vision with millimeter wave radar or vision with laser radar are also used for three-dimensional object detection, but detection performance on long-distance objects remains lower than on medium- and short-distance objects. The three-dimensional object detection performance of multi-line laser radar, monocular vision and binocular vision all degrades as the measurement distance increases and when occlusion is present.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a three-dimensional object detection method and system based on the combination of vision and laser radar, which combine the characteristics of the two types of sensors to extend the perception range of a mobile platform and improve the effect of three-dimensional object detection.
In order to achieve the above object, the present invention provides a three-dimensional object detection method based on the combination of vision and laser radar, comprising the following steps:
step 1, acquiring a current point cloud frame of a multi-line laser radar and a current video frame of a vision measurement unit, wherein visual objects in the video frame correspond to laser objects in the point cloud frame one to one;
step 2, performing visual detection on a current video frame of a visual measurement unit to obtain a visual detection result of each visual object in the current video frame, wherein the visual detection result comprises a two-dimensional surrounding frame, a two-dimensional category and a two-dimensional confidence coefficient;
step 3, depth judgment is carried out on the visual detection result of each visual object to obtain the visual approximate depth of each visual object;
step 4, converting the current point cloud frame of the multi-line laser radar into a sparse depth image, and extracting a candidate depth detection frame of each laser object in the sparse depth image according to the visual detection result of each visual object;
step 5, constructing a candidate point cluster of the laser object in the current point cloud frame according to the candidate depth detection frame of each laser object and the visual approximate depth of the corresponding visual object;
and 6, constructing a three-dimensional object detection result based on the visual detection result of each visual object and the candidate point cluster corresponding to the laser object.
In one embodiment, in step 5, the constructing a candidate point cluster of the laser object in the current point cloud frame according to the candidate depth detection frame of each laser object and the visual approximate depth of the corresponding visual object specifically includes:
according to depth information of depth map pixel points in a candidate depth detection frame of a laser object, removing noise, depth pixel points belonging to a scene background and depth pixel points belonging to adjacent objects, and extracting a candidate depth pixel point set corresponding to a visual detection result;
and carrying out densification on the depth pixel point set which meets the preset quantity threshold and distance judgment conditions in the candidate depth pixel point set so as to construct a candidate point cluster of the laser object in the current point cloud frame.
In one embodiment, the removing, according to depth information of depth map pixel points in a candidate depth detection frame of a laser object, noise, depth pixel points belonging to a scene background, and depth pixel points belonging to an adjacent object, and extracting a candidate depth pixel point set corresponding to a visual detection result specifically includes:
judging whether the absolute value of the difference between the depth value of each depth pixel point in the candidate depth detection frame and the visual approximate depth of the corresponding visual object is greater than a preset depth difference threshold value or not, if so, marking the depth pixel point as a background object depth pixel point, and otherwise, marking the depth pixel point as a foreground object depth pixel point;
clustering all foreground object depth pixel points in the candidate depth detection frame, and taking an average value in a depth value cluster as a foreground object reference depth value;
judging whether the absolute value of the difference between the depth value of each depth pixel point in the candidate depth detection frame and the reference depth value of the foreground object is greater than a self-adaptive depth difference threshold value or not, if so, judging that the depth pixel point belongs to noise or belongs to a scene background or belongs to a depth pixel point of an adjacent object in the candidate depth detection frame, and marking the depth pixel point as unused;
and taking a set of depth pixel points which are not marked as unused in the candidate depth detection frame as a candidate depth pixel point set.
In one embodiment, the preset depth difference threshold is selected according to the visual approximate depth and the precision of monocular depth estimation, specifically:
if the visual approximate depth is less than or equal to 20 meters, the preset depth difference threshold value is 0.2 meter;
if the visual approximate depth is greater than 20 meters and less than or equal to 40 meters, the preset depth difference threshold value is 0.3 meter;
if the visual approximate depth is greater than 40 meters, the preset depth difference threshold value is 0.5 meters.
In one embodiment, the adaptive depth difference threshold corresponds one-to-one to each depth pixel point in the candidate depth detection frame; specifically, the threshold is d × s, where d is the depth value of the corresponding depth pixel point in the candidate depth detection frame and s is a proportionality coefficient.
In one embodiment, the preset number threshold is the number of rows and columns of each depth pixel point in the candidate depth pixel point set in the sparse depth image;
the method comprises the following steps of performing densification on a depth pixel point set which meets preset quantity threshold and distance judgment conditions in a candidate depth pixel point set to construct a candidate point cluster of a laser object in a current point cloud frame, and specifically comprises the following steps:
and carrying out densification on the depth pixel points of which the reference depth value of the foreground object in the candidate depth pixel point set is less than or equal to 40 m and the number of lines in the sparse depth image is greater than or equal to 2 so as to construct a candidate point cluster of the laser object in the current point cloud frame.
In one embodiment, in step 6, constructing a three-dimensional object detection result based on the visual detection result of each visual object and the candidate point cluster of the corresponding laser object specifically includes:
carrying out three-dimensional object detection based on the point clusters for the candidate point clusters meeting the preset geometric and quantity threshold values, and constructing a first subset of three-dimensional object detection results;
for candidate point clusters which do not meet the preset geometric and quantity threshold, constructing a second subset of the three-dimensional object detection results based on the set information of the candidate point clusters and the corresponding visual detection results;
and taking the union of the first subset and the second subset of the three-dimensional object detection result as the three-dimensional object detection result, wherein the three-dimensional object detection result comprises the central point, the object category and the confidence coefficient information of the three-dimensional object.
In one embodiment, the preset geometric and quantity thresholds include a point cluster vertical amplitude threshold, a point cluster horizontal amplitude threshold, a point cluster limit distance threshold, and a point cluster point set quantity threshold;
for candidate point clusters meeting preset geometric and quantity thresholds, carrying out three-dimensional object detection based on the point clusters, and constructing a first subset of three-dimensional object detection results, specifically:
and performing three-dimensional object detection based on the point clusters to construct a first subset of three-dimensional object detection results, wherein the vertical amplitude of the point clusters is greater than or equal to a threshold of the vertical amplitude of the point clusters, the horizontal amplitude of the point clusters is greater than or equal to a threshold of the horizontal amplitude of the point clusters, the limit distance of the point clusters is greater than or equal to a threshold of the limit distance of the point clusters, and the number of the point clusters is greater than or equal to a threshold of the number of the point clusters.
In one embodiment, for candidate point clusters that do not satisfy the preset geometric and quantity thresholds, a second subset of three-dimensional object detection results is constructed based on the set information of the candidate point clusters and the corresponding visual detection results, specifically:
for each candidate point cluster which does not meet the preset geometric and quantity threshold, the following processing is carried out:
taking the coordinate of the center point of the current candidate point cluster as the center point of the current three-dimensional object detection, taking the two-dimensional category corresponding to the visual detection result as the object category of the three-dimensional object detection result, and taking the two-dimensional confidence corresponding to the visual detection result as confidence information of the three-dimensional object detection;
and summarizing the processing results to obtain a second subset.
In one embodiment, in step 1, the acquiring a current point cloud frame of the multiline lidar and a current video frame of the vision measurement unit specifically includes:
and aligning the time axis of the vision measuring unit and the multi-line laser radar, and acquiring the nearest current point cloud frame and current video frame on the time axis.
In order to achieve the above object, the present invention further provides a three-dimensional object detection system based on the combination of vision and lidar, which includes a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above method when executing the computer program.
Compared with the prior art, the three-dimensional object detection method and system based on the combination of vision and laser radar provided by the invention have the following beneficial technical effects:
1. the three-dimensional object detection can be effectively carried out in a medium-short distance and a long distance;
2. the method can effectively process the errors of scene background and adjacent objects contained in the detection frame in the visual two-dimensional object detection result;
3. occlusion phenomena in the scene can be handled.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a three-dimensional object detection method based on the combination of vision and lidar in the embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that all directional indicators in the embodiments of the present invention (such as up, down, left, right, front and rear) are only used to explain the relative positional relationship and movement of components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicator changes accordingly.
In addition, the descriptions related to "first", "second", etc. in the present invention are only for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "connected," "secured," and the like are to be construed broadly, and for example, "secured" may be a fixed connection, a removable connection, or an integral part; the connection can be mechanical connection, electrical connection, physical connection or wireless communication connection; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In addition, the technical solutions in the embodiments of the present invention may be combined with each other, but it must be based on the realization of those skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination of technical solutions should not be considered to exist, and is not within the protection scope of the present invention.
Fig. 1 shows a three-dimensional object detection method based on the combination of vision and lidar disclosed in this embodiment, which is mainly applied to environment sensing of mobile platforms such as autopilot and mobile robot, and the method includes the following steps:
step 1, obtaining a current point cloud frame of the multi-line laser radar and a current video frame of a vision measurement unit, wherein the vision measurement unit is a camera, internal references of the vision measurement unit can be calibrated by adopting an existing correction calibration method, external reference coordinate transformation matrixes of the vision measurement unit and the multi-line laser radar are determined by adopting an existing external reference calibration method, and the description of the external reference coordinate transformation matrixes is omitted in this embodiment. According to the field angle of the vision measuring unit, the field angle of the multi-line laser radar and the external reference coordinate transformation matrix, the overlapping field angle range of the vision measuring unit and the multi-line laser radar can be calculated. It should be noted that the calculation process of the overlapping view field angle range of the vision measurement unit and the multi-line lidar is a conventional technical means in the art, and therefore, the detailed description thereof is omitted in this embodiment.
As a preferred embodiment, a 360 degree field angle of the mobile platform perimeter may be formed by configuring a plurality of vision measurement units. Namely, the current point cloud frame and the current video frame cover the range of 360 degrees around the mobile platform.
In this embodiment, the current point cloud frame of the multi-line laser radar and the current video frame of the vision measurement unit are acquired as follows: the time axes of the vision measurement unit and the multi-line laser radar are aligned, and the point cloud frame and video frame that are closest on the time axis are taken as the current frames, which ensures a high degree of matching between the current point cloud frame and the current video frame.
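As an illustration only, the following Python sketch pairs each point cloud frame with the video frame closest on the aligned time axis; the timestamp lists, their units and the function name are assumptions for illustration, not part of the patent.

```python
from bisect import bisect_left

def pair_nearest(cloud_stamps, video_stamps):
    """Pair each point cloud frame with the video frame whose timestamp is closest.

    cloud_stamps, video_stamps: sorted lists of frame timestamps (e.g. seconds).
    Returns a list of (cloud_index, video_index) pairs.
    """
    pairs = []
    for ci, t in enumerate(cloud_stamps):
        j = bisect_left(video_stamps, t)
        # candidate neighbours on the video time axis
        candidates = [k for k in (j - 1, j) if 0 <= k < len(video_stamps)]
        vi = min(candidates, key=lambda k: abs(video_stamps[k] - t))
        pairs.append((ci, vi))
    return pairs
```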
Step 2, performing visual detection on the current video frame of the vision measurement unit to obtain a visual detection result for each visual object in the current video frame, where the visual detection result comprises a two-dimensional surrounding frame (bounding box), a two-dimensional category and a two-dimensional confidence. In a specific implementation, a pre-trained YOLOv4 deep neural network model can be used to perform visual detection on the current video frame of the vision measurement unit, obtaining the two-dimensional surrounding frame, center position, two-dimensional category and two-dimensional confidence of each visual object in the current video frame.
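For illustration, the visual detection result described above can be held in a small data structure like the following Python sketch; the field names and example values are assumptions, not definitions from the patent.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class VisualDetection:
    box: Tuple[int, int, int, int]   # two-dimensional surrounding frame (x_min, y_min, x_max, y_max) in pixels
    center: Tuple[float, float]      # center position of the frame in the image
    category: str                    # two-dimensional category, e.g. "car"
    confidence: float                # two-dimensional confidence in [0, 1]

# Example of one detection as a 2D detector such as YOLOv4 might report it (values illustrative only):
det = VisualDetection(box=(320, 180, 400, 260), center=(360.0, 220.0), category="car", confidence=0.87)
```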
Step 3, performing depth judgment on the visual detection result of each visual object to obtain the visual approximate depth of each visual object.
Specifically, the visual approximate depth of each visual object can be obtained according to a monocular depth estimation method FADNet for a two-dimensional visual object, or according to a depth estimation model of a visual measurement unit, or by adopting other methods capable of calculating the visual approximate depth.
More specifically, the visual approximate depth of each visual object can be obtained with a depth estimation model of the vision measurement unit. When the vision measurement unit is a monocular vision unit, the depth estimation model may be any monocular depth model whose computational cost meets the requirement of real-time detection, such as FastDepth, DepthNet Nano, PyDNet or SGM-MDE. When the vision measurement unit is a binocular vision unit, the depth estimation model can be a binocular ranging algorithm whose computational cost meets the real-time requirement, such as SGBM or R3SGM, or a convolutional neural network for binocular depth estimation, such as the disparity estimation branch of the PSMNet pyramid stereo matching network, SDNet, or YOLOStereo3D. The specific algorithms of these depth estimation models are not described in detail in this embodiment.
In a specific implementation, after the visual approximate depth of each visual object is obtained, the visual objects can be classified by depth type, which facilitates the design of the preset depth difference threshold in the subsequent calculation. For example, if the visual approximate depth is less than or equal to 20 meters, the depth type of the visual object is judged to be near; if the visual approximate depth is greater than 20 meters and less than or equal to 40 meters, the depth type is medium-short distance; if the visual approximate depth is greater than 40 meters and less than or equal to 50 meters, the depth type is medium distance; and if the visual approximate depth is greater than 50 meters, the depth type is long distance.
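This depth-type classification can be written as a small helper; the thresholds are the example values above, and the class labels are hypothetical names chosen for illustration.

```python
def depth_type(approx_depth_m: float) -> str:
    """Map a visual approximate depth (meters) to the depth class used when
    selecting the preset depth difference threshold later on."""
    if approx_depth_m <= 20:
        return "near"
    if approx_depth_m <= 40:
        return "medium-short"
    if approx_depth_m <= 50:
        return "medium"
    return "long"
```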
Step 4, converting the current point cloud frame of the multi-line laser radar into a sparse depth image. Because the extrinsic coordinate transformation matrix between the vision measurement unit and the multi-line laser radar has been determined, the candidate depth detection frame of each laser object in the sparse depth image can be extracted according to the visual detection result of each visual object.
The current point cloud frame of the multi-line laser radar is converted into a sparse depth image as follows (a code sketch of these sub-steps is given after step 4.3):
step 4.1, determining a ground point cloud subset in a current point cloud frame by using a RANSAC method;
step 4.2, filtering a ground point cloud subset in the current point cloud frame to obtain non-ground laser point cloud;
Step 4.3, converting the non-ground laser point cloud into the sparse depth image.
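The NumPy sketch below illustrates steps 4.1 to 4.3 under simplifying assumptions: a plain RANSAC plane fit stands in for the ground segmentation, and the projection parameters (vertical field of view, 16 rows, 0.2° horizontal resolution) are illustrative values rather than values specified in the patent.

```python
import numpy as np

def ransac_ground(points, n_iter=100, dist_thresh=0.15):
    """Step 4.1: roughly segment the ground plane with RANSAC. points: (N, 3) array."""
    best_inliers = np.zeros(len(points), dtype=bool)
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p2 - p1, p3 - p1)
        norm = np.linalg.norm(normal)
        if norm < 1e-6:
            continue
        normal /= norm
        dist = np.abs((points - p1) @ normal)
        inliers = dist < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

def to_sparse_depth_image(points, v_fov=(-15.0, 15.0), rows=16, h_res_deg=0.2):
    """Steps 4.2-4.3: project non-ground points into a sparse depth image whose pixel
    value is the range of the laser point with the same elevation and azimuth."""
    cols = int(round(360.0 / h_res_deg))
    depth = np.zeros((rows, cols), dtype=np.float32)      # 0 means "no measurement"
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    azimuth = np.degrees(np.arctan2(y, x)) % 360.0
    elevation = np.degrees(np.arcsin(z / np.maximum(r, 1e-6)))
    row = np.clip(((elevation - v_fov[0]) / (v_fov[1] - v_fov[0]) * (rows - 1)).round().astype(int), 0, rows - 1)
    col = np.clip((azimuth / h_res_deg).astype(int), 0, cols - 1)
    depth[row, col] = r
    return depth

# usage: drop the ground subset, then build the sparse depth image
# cloud = ...                                  # (N, 3) current point cloud frame
# depth_img = to_sparse_depth_image(cloud[~ransac_ground(cloud)])
```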
Step 5, constructing a candidate point cluster of the laser object in the current point cloud frame according to the candidate depth detection frame of each laser object and the visual approximate depth of the corresponding visual object, wherein the specific process is as follows:
step 5.1, according to depth information of depth map pixel points in a candidate depth detection frame of a laser object, removing noise, depth pixel points belonging to a scene background and depth pixel points belonging to an adjacent object, and extracting a candidate depth pixel point set corresponding to a visual detection result, wherein the method specifically comprises the following steps:
and 5.1.1, judging whether the absolute value of the difference between the depth value of each depth pixel point in the candidate depth detection frame and the visual approximate depth of the corresponding visual object is greater than a preset depth difference threshold value, if so, marking the depth pixel point as a background object depth pixel point, and otherwise, marking the depth pixel point as a foreground object depth pixel point. And constructing a depth image of corresponding length and width parameters according to the horizontal resolution and the vertical resolution of the multi-line laser radar, wherein the depth value of a depth pixel point refers to the distance value of a laser measuring point with the same elevation angle and horizontal angle in the multi-line laser radar.
Step 5.1.2, clustering each foreground object depth pixel point in the candidate depth detection frame to obtain a depth pixel cluster of the foreground object depth pixel points, and taking the average value in the depth value cluster of the depth pixel clusters as a foreground object reference depth value;
Step 5.1.3, judging whether the absolute value of the difference between the depth value of each depth pixel point in the candidate depth detection frame and the foreground object reference depth value is greater than an adaptive depth difference threshold; if so, the depth pixel point is judged to be noise, scene background, or part of an adjacent object within the candidate depth detection frame, and is marked as unused;
and taking a set of depth pixel points which are not marked as unused in the candidate depth detection frame as a candidate depth pixel point set.
In step 5.1.1, the preset depth difference threshold is selected according to the visual approximate depth and the precision of monocular depth estimation, for example:
if the visual approximate depth is less than or equal to 20 meters, the preset depth difference threshold value is 0.2 meter;
if the visual approximate depth is greater than 20 meters and less than or equal to 40 meters, the preset depth difference threshold value is 0.3 meter;
if the visual approximate depth is greater than 40 meters, the preset depth difference threshold value is 0.5 meters.
It should be noted that the preset depth difference threshold in this embodiment may also be set in proportion to the visual approximate depth.
In step 5.1.3, the adaptive depth difference threshold corresponds one-to-one to each depth pixel point in the candidate depth detection frame; specifically, the threshold is d × s, where d is the depth value of the corresponding depth pixel point in the candidate depth detection frame (i.e., the range value of the laser measurement point) and s is a scaling factor used to determine the adaptive depth difference threshold, for example s = 0.01. Alternatively, s can be chosen adaptively according to the range value of the measurement point.
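A minimal sketch of the candidate pixel extraction in step 5.1, using the example preset thresholds (0.2/0.3/0.5 m) and the adaptive threshold d × s with s = 0.01; the gap-based one-dimensional clustering and the choice of the largest cluster are assumptions, since the embodiment does not name a specific clustering algorithm.

```python
import numpy as np

def preset_depth_threshold(approx_depth_m):
    """Preset depth difference threshold chosen by the visual approximate depth (step 5.1.1)."""
    if approx_depth_m <= 20:
        return 0.2
    if approx_depth_m <= 40:
        return 0.3
    return 0.5

def extract_candidate_pixels(depth_img, box, approx_depth, s=0.01, cluster_gap=0.5):
    """Step 5.1: keep only depth pixels in the candidate depth detection frame that
    belong to the foreground object.

    depth_img: sparse depth image (0 = no measurement); box: (r0, r1, c0, c1) region of
    the candidate depth detection frame; approx_depth: visual approximate depth (m).
    Returns (rows, cols, depths) of the candidate depth pixel point set.
    """
    r0, r1, c0, c1 = box
    region = depth_img[r0:r1, c0:c1]
    rows, cols = np.nonzero(region)          # pixels that carry a laser measurement
    depths = region[rows, cols]

    # 5.1.1: foreground/background split against the visual approximate depth
    foreground = np.abs(depths - approx_depth) <= preset_depth_threshold(approx_depth)
    if not foreground.any():
        return rows[:0], cols[:0], depths[:0]

    # 5.1.2: cluster the foreground depths (simple gap-based 1-D clustering, an assumption)
    fg = np.sort(depths[foreground])
    clusters = np.split(fg, np.where(np.diff(fg) > cluster_gap)[0] + 1)
    ref_depth = float(max(clusters, key=len).mean())      # foreground object reference depth value

    # 5.1.3: adaptive threshold d * s removes noise, background and adjacent-object pixels
    keep = np.abs(depths - ref_depth) <= depths * s
    return rows[keep] + r0, cols[keep] + c0, depths[keep]
```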
Step 5.2, densifying the depth pixel point sets in the candidate depth pixel point set that meet the preset quantity threshold and distance judgment conditions, so as to construct the candidate point cluster of the laser object in the current point cloud frame. For example, an improved bilinear interpolation method can be used to densify the depth pixel points.
In a specific implementation process, the preset quantity threshold is the number of rows and the number of columns of each depth pixel point in the candidate depth pixel point set in the sparse depth image. Therefore, in this embodiment, the densification is performed on the depth pixel point set in the candidate depth pixel point set that meets the preset number threshold and distance determination condition to construct a candidate point cluster of the laser object in the current point cloud frame, specifically:
and carrying out densification on the foreground object reference depth value in the candidate depth pixel point set which is less than or equal to 40 meters and the depth pixel points of which the line number is greater than or equal to 2 in the sparse depth image so as to construct a candidate point cluster of the laser object in the current point cloud frame, wherein the line number in the multi-line laser radar measurement point cloud is usually obviously less than the line number, so that the line number is limited to be greater than or equal to 2.
Step 6, constructing a three-dimensional object detection result based on the visual detection result of each visual object and the candidate point cluster of the corresponding laser object, where the three-dimensional object detection result comprises the center point, object category and confidence information of the three-dimensional object. The three-dimensional object detection result is obtained as follows:
and 6.1, carrying out three-dimensional object detection based on the point clusters for the candidate point clusters meeting the preset geometric and quantity thresholds, and constructing a first subset of the three-dimensional object detection result. The preset geometric and quantity threshold values comprise a point cluster vertical amplitude threshold value, a point cluster horizontal amplitude threshold value, a point cluster limiting distance threshold value and a point cluster point set quantity threshold value. For example, the threshold value of the vertical amplitude of the point cluster is 0.4 m, the threshold value of the horizontal amplitude of the point cluster is 0.3 m, the threshold value of the limiting distance of the point cluster is 40 m and the threshold value of the number of point cluster sets of the point cluster is 3 for the 16-line laser radar; and performing three-dimensional object detection based on the point clusters to construct a first subset of three-dimensional object detection results, wherein the vertical amplitude of the point clusters is greater than or equal to 0.4 m, the horizontal amplitude of the point clusters is greater than or equal to 0.3 m, the limit distance of the point clusters is greater than or equal to 40 m, and the number of the point clusters is greater than or equal to 3. In a specific implementation process, the pre-trained PointNet + + can be adopted to perform three-dimensional object detection on the candidate point cluster, so as to obtain the center point, the object category and the confidence information of the three-dimensional object.
Step 6.2, for candidate point clusters that do not meet the preset geometric and quantity thresholds, constructing a second subset of the three-dimensional object detection results based on the set information of the candidate point clusters and the corresponding visual detection results. In a specific implementation, each candidate point cluster that does not meet the preset geometric and quantity thresholds is processed as follows: the coordinates of the center point of the current candidate point cluster are taken as the center point of the current three-dimensional object detection, the two-dimensional category of the corresponding visual detection result is taken as the object category of the three-dimensional object detection result, and the two-dimensional confidence of the corresponding visual detection result is taken as the confidence information of the three-dimensional object detection. Finally, the processing results are aggregated to obtain the second subset.
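A sketch of this fallback, reusing the VisualDetection structure sketched in step 2; the dictionary layout of the result is an assumption for illustration.

```python
def fallback_detection(cluster_xyz, visual_det):
    """Step 6.2: for a candidate point cluster that fails the step 6.1 gate, build a 3D
    detection from the cluster center point and the matched 2D visual detection result.
    cluster_xyz: (N, 3) NumPy array; visual_det: a VisualDetection as sketched in step 2."""
    center = cluster_xyz.mean(axis=0)                  # cluster center point as the 3D detection center
    return {
        "center": tuple(float(v) for v in center),
        "category": visual_det.category,               # reuse the two-dimensional category
        "confidence": visual_det.confidence,           # reuse the two-dimensional confidence
    }
```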
Step 6.3, taking the union of the first subset and the second subset as the final three-dimensional object detection result.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A three-dimensional object detection method based on vision and laser radar combination is characterized by comprising the following steps:
step 1, acquiring a current point cloud frame of a multi-line laser radar and a current video frame of a vision measurement unit, wherein visual objects in the video frame correspond to laser objects in the point cloud frame one to one;
step 2, performing visual detection on a current video frame of a visual measurement unit to obtain a visual detection result of each visual object in the current video frame, wherein the visual detection result comprises a two-dimensional surrounding frame, a two-dimensional category and a two-dimensional confidence coefficient;
step 3, depth judgment is carried out on the visual detection result of each visual object to obtain the visual approximate depth of each visual object, wherein the visual approximate depth is obtained according to a depth estimation model;
step 4, converting the current point cloud frame of the multi-line laser radar into a sparse depth image, and extracting a candidate depth detection frame of each laser object in the sparse depth image according to the visual detection result of each visual object;
step 5, constructing a candidate point cluster of the laser object in the current point cloud frame according to the candidate depth detection frame of each laser object and the visual approximate depth of the corresponding visual object;
step 6, constructing a three-dimensional object detection result based on the visual detection result of each visual object and the candidate point cluster corresponding to the laser object;
in step 5, constructing a candidate point cluster of the laser object in the current point cloud frame according to the candidate depth detection frame of each laser object and the visual approximate depth of the corresponding visual object, specifically:
according to depth information of depth map pixel points in a candidate depth detection frame of a laser object, removing noise, depth pixel points belonging to a scene background and depth pixel points belonging to adjacent objects, and extracting a candidate depth pixel point set corresponding to a visual detection result;
carrying out densification on a depth pixel point set which meets preset quantity threshold and distance judgment conditions in the candidate depth pixel point set so as to construct a candidate point cluster of the laser object in the current point cloud frame;
the method comprises the steps of removing noise and depth pixel points belonging to a scene background and depth pixel points belonging to adjacent objects according to depth information of depth map pixel points in a candidate depth detection frame of a laser object, and extracting a candidate depth pixel point set corresponding to a visual detection result, and specifically comprises the following steps:
judging whether the absolute value of the difference between the depth value of each depth pixel point in the candidate depth detection frame and the visual approximate depth of the corresponding visual object is greater than a preset depth difference threshold value or not, if so, marking the depth pixel point as a background object depth pixel point, and otherwise, marking the depth pixel point as a foreground object depth pixel point;
clustering all foreground object depth pixel points in the candidate depth detection frame, and taking an average value in a depth value cluster as a foreground object reference depth value;
judging whether the absolute value of the difference between the depth value of each depth pixel point in the candidate depth detection frame and the reference depth value of the foreground object is greater than a self-adaptive depth difference threshold value or not, if so, judging that the depth pixel point belongs to noise or belongs to a scene background or belongs to a depth pixel point of an adjacent object in the candidate depth detection frame, and marking the depth pixel point as unused;
and taking a set of depth pixel points which are not marked as unused in the candidate depth detection frame as a candidate depth pixel point set.
2. The three-dimensional object detection method based on vision and laser radar combination according to claim 1, wherein the preset depth difference threshold is selected according to the visual approximate depth and the precision of monocular depth estimation, specifically:
if the visual approximate depth is less than or equal to 20 meters, the preset depth difference threshold value is 0.2 meter;
if the visual approximate depth is greater than 20 meters and less than or equal to 40 meters, the preset depth difference threshold value is 0.3 meter;
if the visual approximate depth is greater than 40 meters, the preset depth difference threshold value is 0.5 meters.
3. The three-dimensional object detection method based on vision and laser radar combination according to claim 1, wherein the adaptive depth difference threshold corresponds one-to-one to each depth pixel point in the candidate depth detection frame; specifically, the threshold is d × s, where d is the depth value of the corresponding depth pixel point in the candidate depth detection frame and s is a proportionality coefficient.
4. The three-dimensional object detection method based on vision and laser radar combination according to claim 1, wherein the preset quantity threshold is the number of rows and the number of columns of each depth pixel point in the candidate depth pixel point set in the sparse depth image;
the method comprises the following steps of performing densification on a depth pixel point set which meets preset quantity threshold and distance judgment conditions in a candidate depth pixel point set to construct a candidate point cluster of a laser object in a current point cloud frame, and specifically comprises the following steps:
and carrying out densification on the depth pixel points of which the reference depth value of the foreground object in the candidate depth pixel point set is less than or equal to 40 m and the number of lines in the sparse depth image is greater than or equal to 2 so as to construct a candidate point cluster of the laser object in the current point cloud frame.
5. The three-dimensional object detection method based on vision and laser radar combination according to any one of claims 1 to 4, wherein in step 6, constructing the three-dimensional object detection result based on the visual detection result of each visual object and the candidate point cluster of the corresponding laser object comprises:
carrying out three-dimensional object detection based on the point clusters for the candidate point clusters meeting the preset geometric and quantity threshold values, and constructing a first subset of three-dimensional object detection results;
for candidate point clusters which do not meet the preset geometric and quantity threshold, constructing a second subset of the three-dimensional object detection results based on the set information of the candidate point clusters and the corresponding visual detection results;
and taking the union of the first subset and the second subset of the three-dimensional object detection result as the three-dimensional object detection result, wherein the three-dimensional object detection result comprises the central point, the object category and the confidence coefficient information of the three-dimensional object.
6. The three-dimensional object detection method based on vision and laser radar combination according to claim 5, wherein the preset geometric and quantity thresholds comprise a point cluster vertical amplitude threshold, a point cluster horizontal amplitude threshold, a point cluster limit distance threshold and a point cluster point-set quantity threshold;
for candidate point clusters meeting preset geometric and quantity thresholds, carrying out three-dimensional object detection based on the point clusters, and constructing a first subset of three-dimensional object detection results, specifically:
and performing three-dimensional object detection based on the point clusters to construct a first subset of three-dimensional object detection results, wherein the vertical amplitude of the point clusters is greater than or equal to a threshold of the vertical amplitude of the point clusters, the horizontal amplitude of the point clusters is greater than or equal to a threshold of the horizontal amplitude of the point clusters, the limit distance of the point clusters is greater than or equal to a threshold of the limit distance of the point clusters, and the number of the point clusters is greater than or equal to a threshold of the number of the point clusters.
7. The three-dimensional object detection method based on vision and laser radar combination according to claim 5, wherein for candidate point clusters that do not satisfy the preset geometric and quantity thresholds, a second subset of three-dimensional object detection results is constructed based on the set information of the candidate point clusters and the corresponding visual detection results, specifically:
for each candidate point cluster which does not meet the preset geometric and quantity threshold, the following processing is carried out:
taking the coordinate of the center point of the current candidate point cluster as the center point of the current three-dimensional object detection, taking the two-dimensional category corresponding to the visual detection result as the object category of the three-dimensional object detection result, and taking the two-dimensional confidence corresponding to the visual detection result as confidence information of the three-dimensional object detection;
and summarizing the processing results to obtain a second subset.
8. A three-dimensional object detection system based on vision and laser radar combination, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
CN202110456751.4A 2021-04-27 2021-04-27 Three-dimensional object detection method and system based on combination of vision and laser radar Active CN112990129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110456751.4A CN112990129B (en) 2021-04-27 2021-04-27 Three-dimensional object detection method and system based on combination of vision and laser radar

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110456751.4A CN112990129B (en) 2021-04-27 2021-04-27 Three-dimensional object detection method and system based on combination of vision and laser radar

Publications (2)

Publication Number Publication Date
CN112990129A CN112990129A (en) 2021-06-18
CN112990129B CN112990129B (en) 2021-07-20

Family

ID=76340332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110456751.4A Active CN112990129B (en) 2021-04-27 2021-04-27 Three-dimensional object detection method and system based on combination of vision and laser radar

Country Status (1)

Country Link
CN (1) CN112990129B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110706202A (en) * 2019-09-05 2020-01-17 苏州浪潮智能科技有限公司 Atypical target detection method, atypical target detection device and computer readable storage medium
CN110942449A (en) * 2019-10-30 2020-03-31 华南理工大学 Vehicle detection method based on laser and vision fusion
CN111627054A (en) * 2019-06-24 2020-09-04 长城汽车股份有限公司 Method and device for predicting depth completion error map of high-confidence dense point cloud
CN112305554A (en) * 2020-11-23 2021-02-02 中国科学院自动化研究所 Laser odometer method, system and device based on directed geometric points and sparse frames

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9674504B1 (en) * 2015-12-22 2017-06-06 Aquifi, Inc. Depth perceptive trinocular camera system
US10580158B1 (en) * 2017-11-03 2020-03-03 Zoox, Inc. Dense depth estimation of image data
US10706505B2 (en) * 2018-01-24 2020-07-07 GM Global Technology Operations LLC Method and system for generating a range image using sparse depth data
CN108873915B (en) * 2018-10-12 2021-08-20 长沙万为机器人有限公司 Dynamic obstacle avoidance method and omnidirectional security robot thereof
CN111598770B (en) * 2020-05-15 2023-09-19 汇智机器人科技(深圳)有限公司 Object detection method and device based on three-dimensional data and two-dimensional image
CN112346073B (en) * 2020-09-25 2023-07-11 中山大学 Dynamic vision sensor and laser radar data fusion method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111627054A (en) * 2019-06-24 2020-09-04 长城汽车股份有限公司 Method and device for predicting depth completion error map of high-confidence dense point cloud
CN110706202A (en) * 2019-09-05 2020-01-17 苏州浪潮智能科技有限公司 Atypical target detection method, atypical target detection device and computer readable storage medium
CN110942449A (en) * 2019-10-30 2020-03-31 华南理工大学 Vehicle detection method based on laser and vision fusion
CN112305554A (en) * 2020-11-23 2021-02-02 中国科学院自动化研究所 Laser odometer method, system and device based on directed geometric points and sparse frames

Also Published As

Publication number Publication date
CN112990129A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN110426051B (en) Lane line drawing method and device and storage medium
CN112001958B (en) Virtual point cloud three-dimensional target detection method based on supervised monocular depth estimation
CN109446892B (en) Human eye attention positioning method and system based on deep neural network
CN109308718B (en) Space personnel positioning device and method based on multiple depth cameras
CN105069804B (en) Threedimensional model scan rebuilding method based on smart mobile phone
CN110745140A (en) Vehicle lane change early warning method based on continuous image constraint pose estimation
CN112258590B (en) Laser-based depth camera external parameter calibration method, device and storage medium thereof
CN112097732A (en) Binocular camera-based three-dimensional distance measurement method, system, equipment and readable storage medium
CN113205604A (en) Feasible region detection method based on camera and laser radar
CN110969064A (en) Image detection method and device based on monocular vision and storage equipment
CN113920183A (en) Monocular vision-based vehicle front obstacle distance measurement method
CN112106111A (en) Calibration method, calibration equipment, movable platform and storage medium
CN111178193A (en) Lane line detection method, lane line detection device and computer-readable storage medium
CN110751836A (en) Vehicle driving early warning method and system
CN114325634A (en) Method for extracting passable area in high-robustness field environment based on laser radar
CN110764110A (en) Path navigation method, device and computer readable storage medium
CN111998862A (en) Dense binocular SLAM method based on BNN
CN113140002B (en) Road condition detection method and system based on binocular stereo camera and intelligent terminal
CN112990129B (en) Three-dimensional object detection method and system based on combination of vision and laser radar
CN110992291B (en) Ranging method, system and storage medium based on three-eye vision
CN102542563A (en) Modeling method of forward direction monocular vision of mobile robot
CN115937325B (en) Vehicle-end camera calibration method combined with millimeter wave radar information
CN111553342A (en) Visual positioning method and device, computer equipment and storage medium
CN112902911A (en) Monocular camera-based distance measurement method, device, equipment and storage medium
CN115908551A (en) Vehicle distance measuring method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant