CN113985445A - 3D target detection algorithm based on data fusion of camera and laser radar

3D target detection algorithm based on data fusion of camera and laser radar

Info

Publication number
CN113985445A
CN113985445A (application CN202110972750.5A)
Authority
CN
China
Prior art keywords
point cloud
target
camera
laser radar
cloud data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110972750.5A
Other languages
Chinese (zh)
Inventor
安旭阳
苏治宝
杜志岐
李兆冬
陈佳琪
宋威龙
余雪玮
李冀川
索旭东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intelligent Mobile Robot Zhongshan Research Institute
China North Vehicle Research Institute
Original Assignee
Intelligent Mobile Robot Zhongshan Research Institute
China North Vehicle Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intelligent Mobile Robot Zhongshan Research Institute and China North Vehicle Research Institute
Priority to CN202110972750.5A
Publication of CN113985445A
Legal status: Pending

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/88Lidar systems specially adapted for specific applications
    • G01S17/93Lidar systems specially adapted for specific applications for anti-collision purposes
    • G01S17/931Lidar systems specially adapted for specific applications for anti-collision purposes of land vehicles
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/86Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Electromagnetism (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Optical Radar Systems And Details Thereof (AREA)

Abstract

In the 3D target detection algorithm based on camera and lidar data fusion, a YOLOv5 model is selected as the 2D target detector, and target detection and tracking are carried out on the 2D image layer to obtain the 2D bounding box of the detected target. A mapping relation between the image data collected by the camera and the point cloud data collected by the lidar is constructed according to the calibration principle of the camera and the lidar, and the two data streams are synchronized. Taking the point cloud data collected by the lidar as input, the background point cloud and the foreground point cloud are segmented using a ground segmentation algorithm and point cloud clipping. The segmented foreground point cloud is projected onto the camera imaging plane according to the mapping relation between the image data and the point cloud data, the foreground points inside the target's 2D bounding box are taken as the target's point cloud candidate region, and the target's 3D bounding box is extracted by Euclidean clustering, realizing 3D detection of the target. 3D target detection in a dynamic scene can thus be achieved without depending on the position of the guide person.

Description

3D target detection algorithm based on data fusion of camera and laser radar
Technical Field
The invention belongs to the technical field of environment perception for ground unmanned platforms, and in particular relates to a 3D target detection algorithm based on camera and lidar data fusion.
Background
In practical applications a ground unmanned platform must follow a guide person, planning local task paths and avoiding obstacles in real time according to the guide's position; its following capability therefore depends on whether the on-board perception algorithm can output the guide's position accurately. Current image-based target detection algorithms provide only two-dimensional information and cannot support autonomous navigation of a ground unmanned platform; binocular stereo vision can perceive depth, but its perception range is limited and its accuracy is low. Lidar, another typical sensor on ground unmanned platforms, provides scene contour and position information. Target features can be extracted by applying a series of 3D convolutions to the raw lidar point cloud, or after converting it into a regular representation, but the computation is slow and cannot meet the platform's strict real-time requirements. Furthermore, lidar-based person detection faces two main problems: 1) the guide person is small in scale, and the point cloud is sparse in distant regions; 2) the guide's behaviour is unpredictable and easily confused with flowers and trees in the background.
Disclosure of Invention
The invention overcomes one of the defects of the prior art by providing a 3D target detection algorithm based on camera and lidar data fusion that realizes 3D target detection in dynamic scenes.
According to an aspect of the present disclosure, the invention provides a 3D target detection algorithm based on camera and lidar data fusion, the algorithm comprising:
selecting a YOLOv5 model as the 2D target detector, and detecting and tracking the target on the 2D image layer to obtain a 2D bounding box of the detected target;
constructing a mapping relation between the image data collected by the camera and the point cloud data collected by the lidar according to the calibration principle of the camera and the lidar, and synchronizing the two data streams;
taking the point cloud data collected by the lidar as input, and segmenting the background point cloud and the foreground point cloud in the point cloud data using a ground segmentation algorithm and point cloud clipping; and
projecting the segmented foreground point cloud onto the camera imaging plane according to the mapping relation between the image data and the point cloud data, taking the foreground points inside the target's 2D bounding box as the target's point cloud candidate region, and extracting the target's 3D bounding box by Euclidean clustering to realize 3D detection of the target.
In a possible implementation, detecting and tracking the target on the 2D image layer to obtain the 2D bounding box of the detected target includes:
constructing a constant-velocity motion model of the target;
fusing, according to the constant-velocity motion model, the motion features and appearance features of the target in the current frame and the adjacent next frame, and calculating the similarity between the predicted trajectory and the detected trajectory of the target; and
performing association matching of targets with the Hungarian algorithm according to the direction similarity between the predicted and detected trajectories, thereby realizing detection and tracking of the target on the 2D image layer and obtaining the 2D bounding box of the detected target.
In one possible implementation, synchronizing the image data collected by the camera and the point cloud data collected by the lidar includes:
taking the sampling time of the lidar as the synchronization time node, and linearly interpolating the 2D bounding boxes detected from the camera images.
In one possible implementation, taking the point cloud data collected by the lidar as input and segmenting the background point cloud and the foreground point cloud in the point cloud data includes:
downsampling the point cloud data with a voxel filtering algorithm;
calculating the angle between each point and the lidar x-axis and the horizontal distance r to the lidar centre, and dividing the point cloud into sector areas according to the horizontal angular resolution of the lidar;
dividing each sector area radially into several sub-areas according to the horizontal distance r, so that the points are stored in order in the form of rays; and
fitting a straight line to each pair of adjacent points on the same ray: if the slope of the line satisfies the local slope threshold of the two adjacent points, the points are background points; otherwise they are foreground points. In this way the background and foreground point clouds in the point cloud data are segmented.
In one possible implementation, the similarity between the predicted trajectory and the detected trajectory of the target includes the direction similarity and the position similarity of the two trajectories.
In the 3D target detection algorithm based on camera and lidar data fusion, a YOLOv5 model is selected as the 2D target detector, and target detection and tracking are carried out on the 2D image layer to obtain the 2D bounding box of the detected target; a mapping relation between the image data collected by the camera and the point cloud data collected by the lidar is constructed according to the calibration principle of the camera and the lidar, and the two data streams are synchronized; taking the point cloud data collected by the lidar as input, the background point cloud and the foreground point cloud are segmented using a ground segmentation algorithm and point cloud clipping; and the segmented foreground point cloud is projected onto the camera imaging plane according to the mapping relation between the image data and the point cloud data, the foreground points inside the target's 2D bounding box are taken as the target's point cloud candidate region, and the target's 3D bounding box is extracted by Euclidean clustering, realizing 3D detection of the target. 3D target detection in a dynamic scene can thus be achieved without depending on the position of the guide person.
Drawings
The accompanying drawings are included to provide a further understanding of the technology of the present application and are incorporated in and constitute a part of this specification. They illustrate embodiments of the present application and serve to explain its technical solutions, and should not be construed as limiting those solutions.
FIG. 1 shows a flowchart of the 3D target detection algorithm based on camera and lidar data fusion according to an embodiment of the disclosure;
FIG. 2 shows a schematic structural diagram of the YOLOv5 model according to an embodiment of the present disclosure;
FIG. 3 shows a detailed flowchart of step S1 according to an embodiment of the present disclosure;
FIG. 4 shows a flowchart of target cascade matching according to an embodiment of the present disclosure;
FIG. 5a shows a schematic diagram of lidar and camera synchronization according to an embodiment of the present disclosure;
FIG. 5b shows a schematic diagram of lidar and camera synchronization according to another embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of the transverse and radial division of the laser point cloud according to another embodiment of the present disclosure.
Detailed Description
The following describes embodiments of the present invention in detail with reference to the accompanying drawings and examples, so that how the technical means are applied to solve the technical problems and achieve the corresponding technical effects can be fully understood and implemented. Provided there is no conflict, the embodiments and the features of the embodiments may be combined with one another, and the resulting technical solutions all fall within the scope of the present invention.
Additionally, the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.
Fig. 1 shows a flowchart of the 3D target detection algorithm based on camera and lidar data fusion according to an embodiment of the disclosure. As shown in fig. 1, the algorithm may include:
Step S1: selecting a YOLOv5 model as the 2D target detector, and detecting and tracking the target on the 2D image layer to obtain the 2D bounding box of the detected target.
The 2D image layer may be the image data layer acquired by the camera.
Fig. 2 shows a schematic structural diagram of the YOLOv5 model according to an embodiment of the present disclosure.
As shown in fig. 2, the YOLOv5 model extracts target features with its backbone network, which contains the Focus structure and CSP structures, performs multi-scale feature fusion with a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN), and finally suppresses low-confidence candidate boxes with DIoU-NMS, thereby detecting the class attribute and bounding box of each target in the scene.
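For orientation, the following minimal sketch shows how such a 2D detector might be queried. It assumes a pretrained YOLOv5 model loaded through the ultralytics torch.hub entry point; the weight choice ("yolov5s"), the confidence threshold and the image path are illustrative and not taken from the patent.

```python
# Hedged sketch: run a pretrained YOLOv5 model and keep only "person" boxes,
# since the target of interest here is the guide person.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.4                              # confidence threshold (illustrative)

results = model("street_scene.jpg")           # path, URL or numpy image
boxes = results.xyxy[0].cpu().numpy()         # columns: x1, y1, x2, y2, conf, class
person_boxes = boxes[boxes[:, 5] == 0]        # COCO class 0 is "person"
print(person_boxes[:, :4])                    # 2D bounding boxes in pixel coordinates
```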
FIG. 3 shows a detailed flowchart of step S1 according to an embodiment of the present disclosure.
In an example, as shown in fig. 3, detecting and tracking the target on the 2D image layer and obtaining the 2D bounding box of the detected target may include:
Step S11: constructing a constant-velocity motion model of the target.
For example, it is assumed that the guide person moves at an approximately constant velocity in the application scene of the ground unmanned platform, and a constant-velocity motion model of the target is constructed accordingly.
The bounding box of the target to be tracked at time t-1 is defined as (c_x, c_y, w, h), where (c_x, c_y) is the box centre and (w, h) its width and height, so that the state can be represented by the 8-dimensional vector

x = (c_x, c_y, w, h, \dot{c}_x, \dot{c}_y, \dot{w}, \dot{h})^T.

The prior position of the target to be tracked at time t is predicted with a standard Kalman filter, as shown in formulas (1) and (2):

\hat{x}'_t = F \hat{x}_{t-1}    (1)

P' = F P_{t-1} F^T + Q    (2)

where P_{t-1} is the posterior estimate covariance at time t-1, representing the state uncertainty; P' is the prior estimate covariance at time t; F is the state transition matrix; Q is the process noise covariance; \hat{x}_{t-1} is the posterior state estimate at time t-1; and \hat{x}'_t is the prior state estimate at time t.
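As a concrete illustration of formulas (1) and (2), the sketch below implements the constant-velocity prediction step for the 8-dimensional state; the frame interval dt and the covariance values are illustrative assumptions.

```python
import numpy as np

dt = 0.1                               # frame interval in seconds (illustrative)
F = np.eye(8)                          # constant-velocity state transition matrix
F[:4, 4:] = dt * np.eye(4)             # (cx, cy, w, h) advance by dt times their rates
Q = 1e-2 * np.eye(8)                   # process noise covariance (illustrative)

def predict(x_post: np.ndarray, P_post: np.ndarray):
    """Formulas (1) and (2): prior state and prior covariance at time t."""
    x_prior = F @ x_post               # x'_t = F x_{t-1}
    P_prior = F @ P_post @ F.T + Q     # P'   = F P_{t-1} F^T + Q
    return x_prior, P_prior

# Example: a 60x120 pixel box centred at (320, 240), initially at rest.
x0 = np.array([320.0, 240.0, 60.0, 120.0, 0.0, 0.0, 0.0, 0.0])
x_prior, P_prior = predict(x0, np.eye(8))
```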
Step S12: and predicting the track of the next frame of the target by using a Kalman filtering algorithm according to the target constant-speed motion model, fusing the motion characteristics and frame characteristic information of the current frame of the target and the adjacent next frame of the current frame of the target, and calculating the similarity of the predicted track and the detected track of the target. The similarity between the predicted track and the detected track of the target comprises the following steps: the direction similarity and the position similarity of the predicted trajectory and the detected trajectory of the target. For example, appearance (frame) features of a target can be extracted by using a residual network model trained offline on a pedestrian re-recognition data set, the appearance (frame) features are represented by 128-dimensional normalized feature descriptors, and the direction similarity of a predicted track and a detected track is measured by adopting cosine distance; and measuring the position similarity of the target by calculating the Mahalanobis distance of the predicted track and the detected track.
Step S13: and performing correlation matching on the target by adopting a Hungarian algorithm according to the direction similarity of the predicted track and the detected track of the target, realizing the detection and tracking of the target on a 2D image layer, and obtaining a 2D bounding box for target detection.
FIG. 4 shows a flow diagram of target cascade matching according to an embodiment of the present disclosure.
For example, if a new target appears in the dynamic scene, a corresponding tracker is assigned to it. Each tracker maintains a parameter time_update as the priority for cascade matching: if matching succeeds in the current frame, time_update is reset to 0; otherwise it is incremented by 1. As shown in fig. 4, in subsequent cascade matching, trackers with a small time_update are matched first and trackers with a large time_update are matched later.
To improve the cascade matching of the target to be detected, the Intersection over Union (IoU) between the predicted track of the next frame and the detection bounding box of the current frame is also computed, and the Hungarian algorithm is again used for association matching and for assigning the corresponding IDs. Finally, the position of the target to be tracked is updated with the actual detection z_t = [c_x, c_y, w, h] at time t, as shown in formulas (3), (4) and (5):

K_t = P' H^T (H P' H^T + R)^{-1}    (3)

\hat{x}_t = \hat{x}'_t + K_t (z_t - H \hat{x}'_t)    (4)

P_t = (I - K_t H) P'    (5)

where H is the transformation matrix from the state variables to the observed variables, R is the measurement noise covariance, z_t is the observation, K_t is the Kalman gain, and \hat{x}_t is the posterior state estimate at time t. The 2D bounding box of the target to be detected is obtained through the above steps.
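The sketch below illustrates this association and update: SciPy's Hungarian solver (linear_sum_assignment) applied to a (1 - IoU) cost matrix, followed by the Kalman update of formulas (3) to (5); the IoU gate of 0.3 is an illustrative value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def match(pred_boxes, det_boxes, min_iou=0.3):
    """Hungarian association on a (1 - IoU) cost matrix; returns (track, detection) pairs."""
    cost = np.array([[1.0 - iou(p, d) for d in det_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - min_iou]

def kalman_update(x_prior, P_prior, z, H, R):
    """Formulas (3)-(5): Kalman gain, posterior state, posterior covariance."""
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)                # formula (3)
    x_post = x_prior + K @ (z - H @ x_prior)            # formula (4)
    P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior   # formula (5)
    return x_post, P_post
```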
Step S2: and constructing a mapping relation between image data acquired by the camera and point cloud data acquired by the laser radar according to the calibration principle of the camera and the laser radar, and synchronizing the image data acquired by the camera and the point cloud data acquired by the laser radar. The sampling time of the laser radar can be used as a synchronous time node, and linear interpolation can be performed on the 2D bounding box based on the target detection acquired by the camera.
For example, a camera imaging model and a distortion model are first constructed according to the pinhole imaging principle, as shown in formulas (6) to (8):

Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & T \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} = M_1 M_2 P_w    (6)

where f is the focal length of the camera; (f_x, f_y) are the equivalent focal lengths in the x and y directions, respectively; (u_0, v_0) is the principal point; and M_1 and M_2 are the intrinsic and extrinsic matrices of the camera, respectively.

x_d = x (1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \quad y_d = y (1 + k_1 r^2 + k_2 r^4 + k_3 r^6)    (7)

where (x, y) is the ideal image point of the target's actual position, r is the distance from the image point to the centre point, and k_1, k_2 and k_3 are the radial distortion coefficients.

x_d = x + 2 p_1 x y + p_3 (r^2 + 2 x^2), \quad y_d = y + p_1 (r^2 + 2 y^2) + 2 p_3 x y    (8)

where p_1 and p_3 are the tangential distortion coefficients.
For example, a checkerboard with 50 mm x 50 mm squares arranged 9 x 7 may be selected as the intrinsic calibration board. The board is moved to the leftmost, rightmost, uppermost and lowermost parts of the camera's field of view, the distance and tilt angle between the camera and the board are varied, and 25 pictures are collected. The equivalent focal lengths (f_x, f_y), the radial distortion coefficients (k_1, k_2, k_3) and the tangential distortion coefficients (p_1, p_3) of the camera imaging model and the distortion model are then solved by the least-squares method.
A checkerboard with 108 mm x 108 mm squares arranged 9 x 7 is then selected as the extrinsic calibration board. It is moved in turn to nine positions at the near, middle and far parts of the camera's field of view; at each position its attitude is varied over five poses (pitch up, pitch down, yaw left, yaw right and centred), each pose being held for 3 seconds so that the board does not blur, and 20 valid point cloud frames and image frames are selected. The position and orientation of each checkerboard frame in the camera coordinate system are captured with the AUTOWARE toolbox, the point cloud data on the checkerboard are extracted at the same time, and the rotation and translation matrices of the camera relative to the lidar are computed according to formula (9).
(R, T) = \arg\min_{R(\alpha,\beta,\gamma),\, T(x,y,z)} \sum_i \sum_j \left( n_i^T \left( R(\alpha,\beta,\gamma)\, q_{i,j} + T(x,y,z) - p_i \right) \right)^2    (9)

where (α, β, γ) are the rotation Euler angles from the camera to the lidar; R(α, β, γ) is the camera-to-lidar rotation matrix; (x, y, z) are the camera-to-lidar translation values; T(x, y, z) is the camera-to-lidar translation vector; p_i is the position of the i-th checkerboard in the camera coordinate system; n_i is the normal vector of the i-th checkerboard in the camera coordinate system; and q_{i,j} is the position of the j-th point of the i-th checkerboard in the lidar coordinate system.
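For the intrinsic part of this calibration, the sketch below uses OpenCV's checkerboard tools as a stand-in; the image path pattern, the inner-corner count and the use of OpenCV itself are assumptions, and OpenCV reports the distortion coefficients in the order (k1, k2, p1, p2, k3), which differs from the (p1, p3) naming used above. The lidar-camera extrinsic optimization of formula (9) is not reproduced here.

```python
# Hedged intrinsic-calibration sketch with OpenCV; a 9x7-square board with 50 mm
# squares has 8x6 inner corners, which is what findChessboardCorners expects.
import glob
import cv2
import numpy as np

pattern = (8, 6)                                   # inner corners per row / column
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 0.05  # metres

obj_points, img_points, size = [], [], None
for path in glob.glob("calib/*.png"):              # illustrative image location
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)
        size = gray.shape[::-1]

# K holds (fx, fy, u0, v0); dist holds (k1, k2, p1, p2, k3) in OpenCV's ordering.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points, size, None, None)
print("reprojection RMS:", rms)
```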
In an example, synchronizing the image data acquired by the camera and the point cloud data acquired by the lidar may include:
taking the sampling time of the lidar as the synchronization time node, and linearly interpolating the 2D bounding boxes detected from the camera images.
Fig. 5a and 5b show schematic diagrams of lidar and camera synchronization, respectively, according to an embodiment of the disclosure.
The sampling time node of the lidar sensor is selected as the synchronization insertion point; since the position of the target in the images acquired by the camera changes continuously, the camera data can be interpolated at the lidar time node. For example, the lidar and camera samples may be placed in their respective buffers in arrival order, the timestamp of the lidar sample is taken as the synchronization time, and the camera data buffer is searched for the entries adjacent to that time. As shown in fig. 5a, if the synchronization time synctime is earlier than the oldest entry of the camera data buffer, data synchronization is considered to have failed; the current lidar point cloud frame is discarded and the next lidar data frame is awaited. If synctime is later than the timestamp of camera buffer entry buf[1], the buffer is considered to contain enough data for synchronization; the earlier entries of the camera buffer are discarded in a loop and the remaining entries move forward in turn until data synchronization succeeds.
As shown in fig. 5b, if synctime already lies between buf[0] and buf[1] of the camera buffer, but the time difference between synctime and buf[0] or buf[1] exceeds the maximum inference-time threshold of the camera target detector, the camera buffer is considered to have lost data; the current lidar point cloud frame and the buf[0] entry of the camera buffer are discarded, the subsequent entries move forward, and the next lidar frame is awaited for synchronization. Finally, the 2D bounding box detected from the camera data is linearly interpolated, as shown in formula (10):

syncbuf.x = buf[0].x + \frac{synctime - buf[0].time}{buf[1].time - buf[0].time} \left( buf[1].x - buf[0].x \right)    (10)

where x denotes the top-left and bottom-right pixel coordinates of the 2D bounding box, time is the sampling time of the camera buffer entry, and syncbuf is the result of the camera-lidar data synchronization.
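A simplified sketch of this buffering and interpolation logic follows; the camera buffer is modelled as a time-sorted list of (timestamp, box) pairs and the inference-time threshold is an illustrative value.

```python
import numpy as np

def interpolate_box(sync_time, t0, box0, t1, box1):
    """Formula (10): linearly interpolate the [x1, y1, x2, y2] box between the two
    camera samples that bracket the lidar sampling time."""
    w = (sync_time - t0) / (t1 - t0)
    return box0 + w * (box1 - box0)

def synchronize(sync_time, cam_buf, max_infer_dt=0.1):
    """cam_buf: time-sorted list of (timestamp, numpy box).  Returns the interpolated
    box, or None when the current lidar frame has to be dropped."""
    while len(cam_buf) >= 2 and cam_buf[1][0] < sync_time:
        cam_buf.pop(0)                      # discard camera data older than needed
    if len(cam_buf) < 2 or sync_time < cam_buf[0][0]:
        return None                         # synchronization failed: wait for next lidar frame
    (t0, b0), (t1, b1) = cam_buf[0], cam_buf[1]
    if max(sync_time - t0, t1 - sync_time) > max_infer_dt:
        cam_buf.pop(0)                      # treat as camera data loss
        return None
    return interpolate_box(sync_time, t0, b0, t1, b1)
```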
Step S3: and taking the point cloud data acquired by the laser radar as input, and partitioning the background point cloud and the foreground point cloud in the point cloud data by adopting a ground partitioning algorithm and a point cloud cutting technology.
Specifically, a voxel filtering algorithm is used to downsample the point cloud data. For example, voxel filtering may divide the input point cloud into 0.2 m x 0.2 m x 0.2 m cubes, with the centroid of each cube representing all the points inside it, thereby downsampling the input point cloud. Because the target to be tracked usually moves in front of the ground unmanned platform, the downsampled point cloud is clipped to remove points outside the road area and behind the platform. In addition, points very close to the lidar are filtered out to eliminate reflections from the ground unmanned platform itself. The ground segmentation algorithm then converts the unordered input point cloud into an ordered output: the points (x, y, z) are reduced to (x, y), and the angle α between each point and the lidar x-axis and the horizontal distance r to the lidar centre are calculated according to formulas (11) and (12):

\alpha = \arctan\left( \frac{y}{x} \right)    (11)

r = \sqrt{x^2 + y^2}    (12)
Fig. 6 shows a schematic diagram of the transverse and radial division of the laser point cloud according to another embodiment of the present disclosure. As shown in fig. 6, the point cloud is divided into sector areas according to the horizontal angular resolution of the lidar, and each sector area is divided radially into several sub-areas according to the distance r, so that the points are stored in order in the form of rays. A local slope threshold for two adjacent points on the same ray is introduced, and a straight line is fitted to each pair of adjacent points on the same ray: if the slope of the line satisfies the local slope threshold, the two points are background points (ground points); otherwise they are foreground points (non-ground points). In this way the background and foreground point clouds in the point cloud data are segmented, i.e. the ground and non-ground point cloud data are separated.
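The sketch below illustrates the voxel downsampling and the sector/ray-based slope test; Open3D is assumed for the voxel grid, the radial sub-binning is simplified to a sort along each ray with the nearest point taken as a ground seed, and the angular resolution and slope threshold are illustrative values.

```python
import numpy as np
import open3d as o3d

def segment_ground(xyz: np.ndarray, angular_res_deg=0.2, slope_thresh=0.15, voxel=0.2):
    """Return (foreground, background) point arrays, background meaning ground."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(xyz))
    xyz = np.asarray(pcd.voxel_down_sample(voxel).points)     # voxel downsampling

    alpha = np.arctan2(xyz[:, 1], xyz[:, 0])                  # formula (11)
    r = np.hypot(xyz[:, 0], xyz[:, 1])                        # formula (12)
    sector = np.floor(np.degrees(alpha) / angular_res_deg).astype(int)

    ground = np.zeros(len(xyz), dtype=bool)
    for s in np.unique(sector):                               # one "ray" per sector
        idx = np.where(sector == s)[0]
        idx = idx[np.argsort(r[idx])]                         # order points outward
        ground[idx[0]] = True                                 # nearest point as ground seed
        prev_r, prev_z = r[idx[0]], xyz[idx[0], 2]
        for i in idx[1:]:
            dr = r[i] - prev_r
            slope = abs(xyz[i, 2] - prev_z) / dr if dr > 1e-3 else np.inf
            if slope < slope_thresh:                          # locally flat: ground point
                ground[i] = True
                prev_r, prev_z = r[i], xyz[i, 2]
    return xyz[~ground], xyz[ground]
```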
Step S4: and projecting the segmented foreground point cloud to a camera imaging plane according to the mapping relation between the image data and the point cloud data, taking the foreground point cloud in the 2D boundary frame of the target as a point cloud candidate area of the target, and extracting a 3D boundary frame of the target through Euclidean clustering to realize 3D detection of the target.
Specifically, a mapping relation from a three-dimensional point cloud coordinate in a laser radar coordinate system to an image point in an image pixel coordinate system is established through the combined calibration of the laser radar and the camera in the step S2, the segmented foreground point cloud (non-ground point cloud) is projected and a data index is constructed, and the three-dimensional point cloud information is unified to the image coordinate. And temporarily assuming the point cloud falling in the 2D boundary frame of the target detection after projection as a spatial position area of the target to be tracked, carrying out Euclidean clustering on the point cloud in the current area, and simultaneously adopting a kd-tree algorithm to accelerate the data search rate so as to reduce the calculated amount. The category of the cluster center closest to the center of the 2D boundary frame of the target detection can be regarded as the point cloud of the area range of the target to be tracked, the boundary of the point cloud is the 3D boundary frame of the target to be detected, and the center position coordinate of the target to be tracked can be effectively calculated according to the mapping relation between the 2D boundary frame and the 3D boundary frame of the target detection, so that the 3D target detection is realized.
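A sketch of this projection-and-clustering step follows; DBSCAN from scikit-learn is used as a kd-tree-backed stand-in for Euclidean clustering, and the calibration matrices K, R and t are assumed to come from the joint calibration of step S2.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_3d(foreground, K, R, t, box2d, eps=0.3):
    """Project foreground lidar points into the image, keep those inside one 2D box,
    cluster them and return the axis-aligned 3D box of the cluster nearest the box centre."""
    cam = (R @ foreground.T + t.reshape(3, 1)).T          # lidar -> camera coordinates
    cam = cam[cam[:, 2] > 0]                              # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                           # pixel coordinates

    x1, y1, x2, y2 = box2d
    inside = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    candidate = cam[inside]                               # point cloud candidate region
    if len(candidate) == 0:
        return None

    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(candidate)
    centre = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    best, best_d = None, np.inf
    for lab in set(labels) - {-1}:                        # ignore noise points
        pts = candidate[labels == lab]
        c = K @ pts.mean(axis=0)
        d = np.linalg.norm(c[:2] / c[2] - centre)         # distance to 2D box centre
        if d < best_d:
            best, best_d = pts, d
    if best is None:
        return None
    return np.vstack([best.min(axis=0), best.max(axis=0)])  # 3D box corners (camera frame)
```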
In the 3D target detection algorithm based on camera and lidar data fusion, a YOLOv5 model is selected as the 2D target detector, and target detection and tracking are carried out on the 2D image layer to obtain the 2D bounding box of the detected target; a mapping relation between the image data collected by the camera and the point cloud data collected by the lidar is constructed according to the calibration principle of the camera and the lidar, and the two data streams are synchronized; taking the point cloud data collected by the lidar as input, the background point cloud and the foreground point cloud are segmented using a ground segmentation algorithm and point cloud clipping; and the segmented foreground point cloud is projected onto the camera imaging plane according to the mapping relation between the image data and the point cloud data, the foreground points inside the target's 2D bounding box are taken as the target's point cloud candidate region, and the target's 3D bounding box is extracted by Euclidean clustering, realizing 3D detection of the target. 3D target detection in a dynamic scene can thus be achieved without depending on the position of the guide person.
Although embodiments of the present invention have been described above, the description is provided only to facilitate understanding of the invention and is not intended to limit it. Those skilled in the art may make various changes in form and detail without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A 3D target detection algorithm based on camera and lidar data fusion, the algorithm comprising:
selecting a YOLOv5 model as the 2D target detector, and detecting and tracking the target on the 2D image layer to obtain a 2D bounding box of the detected target;
constructing a mapping relation between the image data collected by the camera and the point cloud data collected by the lidar according to the calibration principle of the camera and the lidar, and synchronizing the image data collected by the camera and the point cloud data collected by the lidar;
taking the point cloud data collected by the lidar as input, and segmenting the background point cloud and the foreground point cloud in the point cloud data using a ground segmentation algorithm and point cloud clipping; and
projecting the segmented foreground point cloud onto the camera imaging plane according to the mapping relation between the image data and the point cloud data, taking the foreground points inside the target's 2D bounding box as the target's point cloud candidate region, and extracting the target's 3D bounding box by Euclidean clustering to realize 3D detection of the target.
2. The 3D target detection algorithm according to claim 1, wherein detecting and tracking the target on the 2D image layer to obtain the 2D bounding box of the detected target comprises:
constructing a constant-velocity motion model of the target;
predicting the target's trajectory in the next frame with a Kalman filtering algorithm according to the constant-velocity motion model, fusing the motion features and appearance features of the target in the current frame and the adjacent next frame, and calculating the similarity between the predicted trajectory and the detected trajectory of the target; and
performing association matching of targets with the Hungarian algorithm according to the direction similarity between the predicted and detected trajectories, thereby realizing detection and tracking of the target on the 2D image layer and obtaining the 2D bounding box of the detected target.
3. The 3D target detection algorithm according to claim 1, wherein synchronizing the image data collected by the camera and the point cloud data collected by the lidar comprises:
taking the sampling time of the lidar as the synchronization time node, and linearly interpolating the 2D bounding boxes detected from the camera images.
4. The 3D target detection algorithm according to claim 1, wherein taking the point cloud data collected by the lidar as input and segmenting the background point cloud and the foreground point cloud in the point cloud data comprises:
downsampling the point cloud data with a voxel filtering algorithm;
calculating the angle between each point and the lidar x-axis and the horizontal distance r to the lidar centre, and dividing the point cloud into sector areas according to the horizontal angular resolution of the lidar;
dividing each sector area radially into several sub-areas according to the horizontal distance r, so that the points are stored in order in the form of rays; and
fitting a straight line to each pair of adjacent points on the same ray: if the slope of the line satisfies the local slope threshold of the two adjacent points, the points are background points; otherwise they are foreground points, whereby the background and foreground point clouds in the point cloud data are segmented.
5. The 3D target detection algorithm according to claim 2, wherein the similarity between the predicted trajectory and the detected trajectory of the target comprises the direction similarity and the position similarity of the predicted trajectory and the detected trajectory.
CN202110972750.5A 2021-08-24 2021-08-24 3D target detection algorithm based on data fusion of camera and laser radar Pending CN113985445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110972750.5A CN113985445A (en) 2021-08-24 2021-08-24 3D target detection algorithm based on data fusion of camera and laser radar

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110972750.5A CN113985445A (en) 2021-08-24 2021-08-24 3D target detection algorithm based on data fusion of camera and laser radar

Publications (1)

Publication Number Publication Date
CN113985445A true CN113985445A (en) 2022-01-28

Family

ID=79735204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110972750.5A Pending CN113985445A (en) 2021-08-24 2021-08-24 3D target detection algorithm based on data fusion of camera and laser radar

Country Status (1)

Country Link
CN (1) CN113985445A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022262594A1 (en) * 2021-06-15 2022-12-22 同方威视技术股份有限公司 Method and apparatus for following target, robot, and computer-readable storage medium
CN114488073A (en) * 2022-02-14 2022-05-13 中国第一汽车股份有限公司 Method for processing point cloud data acquired by laser radar
CN114565644A (en) * 2022-03-02 2022-05-31 湖南中科助英智能科技研究院有限公司 Three-dimensional moving object detection method, device and equipment
CN114494248A (en) * 2022-04-01 2022-05-13 之江实验室 Three-dimensional target detection system and method based on point cloud and images under different visual angles
CN115019270A (en) * 2022-05-31 2022-09-06 电子科技大学 Automatic driving night target detection method based on sparse point cloud prior information
CN115019270B (en) * 2022-05-31 2024-04-19 电子科技大学 Automatic driving night target detection method based on sparse point cloud priori information
WO2023236872A1 (en) * 2022-06-09 2023-12-14 劢微机器人(深圳)有限公司 Unloading method based on fusion of radar and camera, and detection apparatus and storage medium
CN115546749B (en) * 2022-09-14 2023-05-30 武汉理工大学 Pavement pothole detection, cleaning and avoiding method based on camera and laser radar
CN115546749A (en) * 2022-09-14 2022-12-30 武汉理工大学 Road surface depression detection, cleaning and avoidance method based on camera and laser radar
CN115861957B (en) * 2023-01-19 2023-06-16 中国科学技术大学 Novel dynamic object segmentation method based on sensor fusion
CN115861957A (en) * 2023-01-19 2023-03-28 中国科学技术大学 Novel dynamic object segmentation method based on sensor fusion
CN116071400A (en) * 2023-04-06 2023-05-05 浙江光珀智能科技有限公司 Target track tracking method based on laser radar equipment
CN116071400B (en) * 2023-04-06 2023-07-18 浙江光珀智能科技有限公司 Target track tracking method based on laser radar equipment
CN116580066A (en) * 2023-07-04 2023-08-11 广州英码信息科技有限公司 Pedestrian target tracking method under low frame rate scene and readable storage medium
CN116580066B (en) * 2023-07-04 2023-10-03 广州英码信息科技有限公司 Pedestrian target tracking method under low frame rate scene and readable storage medium
CN117557599A (en) * 2024-01-12 2024-02-13 上海仙工智能科技有限公司 3D moving object tracking method and system and storage medium
CN117557599B (en) * 2024-01-12 2024-04-09 上海仙工智能科技有限公司 3D moving object tracking method and system and storage medium

Similar Documents

Publication Publication Date Title
CN113985445A (en) 3D target detection algorithm based on data fusion of camera and laser radar
CN112435325B (en) VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method
CN110097553B (en) Semantic mapping system based on instant positioning mapping and three-dimensional semantic segmentation
CN108564616B (en) Fast robust RGB-D indoor three-dimensional scene reconstruction method
CN112634451B (en) Outdoor large-scene three-dimensional mapping method integrating multiple sensors
CN111583369B (en) Laser SLAM method based on facial line angular point feature extraction
US10366538B2 (en) Method and device for illustrating a virtual object in a real environment
CN108898676B (en) Method and system for detecting collision and shielding between virtual and real objects
WO2022188094A1 (en) Point cloud matching method and apparatus, navigation method and device, positioning method, and laser radar
EP2874097A2 (en) Automatic scene parsing
CN108229416B (en) Robot SLAM method based on semantic segmentation technology
KR20210005621A (en) Method and system for use in coloring point clouds
Wang et al. An overview of 3d object detection
CN112419497A (en) Monocular vision-based SLAM method combining feature method and direct method
CN113223045A (en) Vision and IMU sensor fusion positioning system based on dynamic object semantic segmentation
CN110223380A (en) Fusion is taken photo by plane and the scene modeling method of ground multi-view image, system, device
CN113674400A (en) Spectrum three-dimensional reconstruction method and system based on repositioning technology and storage medium
CN116449384A (en) Radar inertial tight coupling positioning mapping method based on solid-state laser radar
CN111998862A (en) Dense binocular SLAM method based on BNN
CN112651994A (en) Ground multi-target tracking method
Ferguson et al. A 2d-3d object detection system for updating building information models with mobile robots
KR100574227B1 (en) Apparatus and method for separating object motion from camera motion
CN116597122A (en) Data labeling method, device, electronic equipment and storage medium
CN116147618B (en) Real-time state sensing method and system suitable for dynamic environment
WO2023030062A1 (en) Flight control method and apparatus for unmanned aerial vehicle, and device, medium and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination