CN109948661B - 3D vehicle detection method based on multi-sensor fusion - Google Patents
- Publication number
- CN109948661B (grant of application CN201910144580.4A)
- Authority
- CN
- China
- Prior art keywords
- point cloud
- vehicle
- frame
- laser radar
- anchor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Image Analysis (AREA)
- Traffic Control Systems (AREA)
- Length Measuring Devices By Optical Means (AREA)
- Optical Radar Systems And Details Thereof (AREA)
Abstract
The invention discloses a 3D vehicle detection method based on multi-sensor fusion, comprising the following steps: step 1, obtaining semantic information about the scene (an RGB image) through a camera mounted on the vehicle, and scanning the vehicle's surroundings with a laser radar mounted on the roof of the vehicle to obtain accurate depth information of the environment (a laser radar point cloud); step 2, preprocessing the laser radar point cloud: according to the height of an automobile, taking the Z-axis range [0, 2.5] m and equally dividing the point cloud into 5 height slices along the Z-axis direction; step 3, generating 3D vehicle regions of interest on the laser radar point cloud; step 4, extracting features from the processed radar point cloud and the RGB image respectively and generating the corresponding feature maps; step 5, mapping the 3D vehicle regions of interest onto the feature maps of the point cloud and the RGB image respectively; and step 6, fusing the mapped portions of the feature maps from step 5 to finally achieve 3D localization and detection of vehicle targets.
Description
Technical Field
The invention belongs to the field of automatic driving, and particularly relates to a 3D vehicle detection method based on multi-sensor fusion.
Background
An intelligent vehicle is a complex system comprising sensing, decision-making and control technologies. Environment sensing provides the basic information for path planning and decision-making control, and vehicle detection is the most critical task in the environment sensing system of an autonomous vehicle. The mainstream obstacle detection sensors are the camera and the laser radar. Existing vision-based vehicle detection works well: the camera is inexpensive and can capture the texture and color of a target, so it is widely used in intelligent driving. However, the camera is highly sensitive to illumination and shadow, cannot provide accurate and sufficient position information, and often suffers from poor real-time performance or poor robustness. The laser radar can obtain the target distance and three-dimensional position information, has a long detection range and is unaffected by illumination, but it cannot determine the texture and color of a target. A single sensor therefore cannot meet the requirements of autonomous driving. For this reason, the invention fuses data from the laser radar and the camera to complete the vehicle detection and tracking task, reducing the dependence on the detection performance of any single sensor and achieving a high 3D vehicle detection rate.
Disclosure of Invention
The invention aims to better detect surrounding vehicles, so as to provide basic information for intelligent-vehicle path planning and decision making, and provides a three-dimensional (3D) vehicle detection method based on multi-sensor fusion that achieves a higher 3D vehicle detection rate.
The technical scheme adopted by the 3D vehicle detection method based on multi-sensor fusion provided by the invention comprises the following steps:
A 3D vehicle detection method based on multi-sensor fusion comprises the following steps:
step 1, obtaining semantic information (an RGB image) through a camera mounted on the vehicle, and scanning the vehicle's surroundings with a laser radar mounted on the roof of the vehicle to obtain accurate depth information of the environment (a laser radar point cloud);
step 2, preprocessing the laser radar point cloud: establishing a coordinate system whose origin is the point where the vertical through the laser radar meets the ground, with the vehicle's direction of travel as the positive X axis, the driver's left as the positive Y axis, and vertically upward from the ground as the positive Z axis; according to the height of an automobile, taking the Z-axis range [0, 2.5] m and equally dividing the point cloud into 5 height slices along the Z-axis direction;
step 3, generating a 3D vehicle region of interest on the laser radar point cloud;
step 4, respectively extracting the features of the processed radar point cloud and the processed RGB image and generating corresponding feature maps;
step 5, mapping the 3D vehicle regions of interest onto the feature maps of the point cloud and the RGB image respectively;
and step 6, fusing the mapped portions of the feature maps from step 5 to finally achieve 3D localization and detection of vehicle targets.
Further, the preprocessing of step 2 comprises generating a bird's eye view (BEV) of the point cloud:
The bird's eye view (BEV) of the point cloud is obtained by projecting the point cloud data onto a 2D grid on the ground (Z = 0). To retain detailed height information, the BEV is taken over the lateral range [-40, 40] m and the forward range [0, 70] m, with the laser radar position as the center point. According to the actual height of an automobile, the Z-axis range [0, 2.5] m is taken and the point cloud is equally divided into 5 height slices along the Z-axis direction; each slice is projected onto the ground (Z = 0) 2D grid, and the height feature of each slice is the maximum height of the point cloud data projected into each grid cell. The point cloud density M is the number of points in each cell, and the value of each grid cell is normalized:
where N is the number of points in the unit grid cell.
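A minimal NumPy sketch of this slicing and projection is given below. The grid cell size (0.1 m by default here) is an assumption, since the text does not state the BEV resolution, and the density channel is left as a raw point count because the normalization formula is not reproduced in the text.

```python
import numpy as np

def point_cloud_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                       z_range=(0.0, 2.5), n_slices=5, cell=0.1):
    """Project a lidar point cloud (N, 3) onto a ground grid.

    Returns per-slice maximum-height maps plus a raw density channel,
    following the slicing described above. The cell size is an assumption.
    """
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    slice_h = (z_range[1] - z_range[0]) / n_slices

    # Keep only points inside the BEV volume.
    m = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
         (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
         (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[m]

    ix = ((pts[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / cell).astype(int)
    iz = ((pts[:, 2] - z_range[0]) / slice_h).astype(int).clip(0, n_slices - 1)

    height = np.zeros((n_slices, nx, ny))
    density = np.zeros((nx, ny))
    for x, y, s, z in zip(ix, iy, iz, pts[:, 2]):
        height[s, x, y] = max(height[s, x, y], z)  # max height per slice
        density[x, y] += 1.0                       # point count per cell

    return height, density
```

The height channel of each slice keeps only the tallest return per cell, which is what makes the BEV a compact yet height-aware representation.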
Further, the specific steps of step 3 are:
3D vehicle regions of interest are generated on the point cloud for classifying and localizing targets. Taking the bird's eye view (BEV) as input, a series of 3D candidate frames is first generated; empty frames are removed to reduce the computation, and a binary label is assigned to the content of each remaining frame (a positive label denotes a target vehicle, a negative label denotes background). A positive label is assigned to two types of anchor boxes, determined by computing the IOU overlap between each anchor box and the real bounding boxes:
1) the anchor box with the highest IOU overlap with a real bounding box (even when that IOU is less than 0.5);
2) any anchor box whose IOU overlap with a real bounding box is greater than 0.5.
One real bounding box may thus assign positive labels to multiple anchor boxes. A negative label (background) is assigned to any anchor box whose IOU with every real bounding box is below 0.3; anchor boxes that are neither positive nor negative have no effect on the training objective and are ignored in subsequent processing. After the positively labeled anchor boxes are obtained, a preliminary 3D regression optimization is applied to them. Each 3D prediction box is represented by (x, y, z, h, w, d), where (x, y, z) is the center point of the box and (h, w, d) is its size. By computing the differences (Δx, Δy, Δz, Δh, Δw, Δd) in center point and size between the boxes of foreground target regions and the real bounding boxes, the centroid and size of the 3D box in the laser radar coordinate system are discriminated and preliminarily localized for the ROI that will later be generated by mapping onto the feature maps. A 3D anchor box is represented by (x_a, y_a, z_a, h_a, w_a, d_a) and a 3D real bounding box by (x*, y*, z*, h*, w*, d*). Let t_i denote the offset of the prediction box relative to the 3D anchor box, with the 6 parameterized coordinates t_i = (t_x, t_y, t_z, t_h, t_w, t_d), and let t_i* denote the offset of the 3D real bounding box relative to the 3D anchor box, with parameterized coordinates t_i* = (t_x*, t_y*, t_z*, t_h*, t_w*, t_d*). Then:
t_x = (x - x_a)/h_a    t_y = (y - y_a)/w_a
t_z = (z - z_a)/d_a    t_h = log(h/h_a)
t_w = log(w/w_a)    t_d = log(d/d_a)
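The parameterization above can be mirrored by a small encode/decode pair. This is an illustrative sketch; the function names are ours, not the patent's.

```python
import numpy as np

def encode_box_offsets(anchor, box):
    """Parameterize a 3D box against a 3D anchor, mirroring the equations
    above: centers are normalized by the anchor extents (h_a, w_a, d_a)
    and sizes are encoded as log ratios. Boxes are (x, y, z, h, w, d)."""
    xa, ya, za, ha, wa, da = anchor
    x, y, z, h, w, d = box
    return np.array([(x - xa) / ha, (y - ya) / wa, (z - za) / da,
                     np.log(h / ha), np.log(w / wa), np.log(d / da)])

def decode_box_offsets(anchor, t):
    """Invert the parameterization to recover a box from predicted offsets."""
    xa, ya, za, ha, wa, da = anchor
    tx, ty, tz, th, tw, td = t
    return np.array([tx * ha + xa, ty * wa + ya, tz * da + za,
                     np.exp(th) * ha, np.exp(tw) * wa, np.exp(td) * da])
```

Encoding and decoding are exact inverses, so a regressor trained on t_i* can be decoded back into laser radar coordinates.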
the cross-entropy function was used to calculate the target object loss:where n is the number of bounding boxes present in the target region.
3D box regression is performed by computing the differences in centroid and size between the 3D anchor boxes and the 3D real bounding boxes, and the 3D regions of interest in the point cloud are finally output.
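The labeling rules of step 3 can be sketched as follows. For brevity this sketch uses axis-aligned BEV rectangles; the patent's candidate frames are 3D boxes evaluated by IOU in the bird's eye view, so a rotated-box IOU would be the more faithful choice.

```python
import numpy as np

def bev_iou(a, b):
    """Axis-aligned IOU of two BEV rectangles (x1, y1, x2, y2);
    a simplifying assumption for this sketch."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign_labels(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.3):
    """1 = target vehicle, 0 = background, -1 = ignored, per the rules above."""
    labels = -np.ones(len(anchors), dtype=int)
    ious = np.array([[bev_iou(a, g) for g in gt_boxes] for a in anchors])
    best = ious.max(axis=1)
    labels[best > pos_thr] = 1            # rule 2: IOU above 0.5
    labels[best < neg_thr] = 0            # IOU below 0.3 with every gt box
    for j in range(len(gt_boxes)):        # rule 1: best anchor per gt box,
        labels[ious[:, j].argmax()] = 1   # even if its IOU is under 0.5
    return labels
```

Rule 1 is applied last so that every real bounding box keeps at least one positive anchor even when no anchor clears the 0.5 threshold.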
Further, the specific process of step 4 includes:
step 4.1, assuming the input RGB image or BEV image has size H × W × D, the first three convolutional layers of the VGG-16 network are used in the downsampling stage, producing an output feature map whose resolution is 8 times smaller than the corresponding input; at this stage the output feature map therefore has spatial size (H/8) × (W/8);
and step 4.2, the high-level, low-resolution semantic feature maps (of both the laser radar point cloud and the RGB image) are upsampled by 2X so that each matches the size of the corresponding downsampling-stage feature map, and the feature maps are fused by a 3X3 convolution, so that a full-resolution feature map is obtained in the last layer of the feature extraction framework.
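Step 4.2's upsample-then-fuse operation can be sketched as below. This is an illustrative NumPy sketch under stated assumptions: single-channel maps, nearest-neighbour 2X upsampling, and element-wise summation before the 3X3 convolution (the text specifies a 3X3 convolution fusion but not the merge operator).

```python
import numpy as np

def conv3x3(x, k):
    """'Same' 3x3 convolution of a single-channel map x with kernel k."""
    xp = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def upsample_and_fuse(low_res, skip, k):
    """One decoder step of the scheme above: 2X nearest-neighbour
    upsampling of the low-resolution map, then a 3x3 convolution over
    its sum with the same-size downsampling-stage map."""
    up = low_res.repeat(2, axis=0).repeat(2, axis=1)
    assert up.shape == skip.shape, "skip map must match upsampled size"
    return conv3x3(up + skip, k)
```

Repeating this step three times undoes the 8X downsampling and yields the full-resolution feature map mentioned in the text.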
Further, the specific method of steps 5 and 6 is as follows:
and (4) obtaining a 3D region of interest on a bird-eye view (BEV) of the laser radar point cloud obtained in the step (3), respectively mapping the region of interest obtained on the radar point cloud onto the feature maps of the radar point cloud and the RGB image according to the coordinate relation between the laser radar point cloud and the RGB image, and finally obtaining the coordinate position of a corresponding frame on the feature map, wherein the size of the frame finally mapped on the feature map is different, so that the fusion processing cannot be carried out, the size of the obtained feature map is fixed to be 3X3, and then the pixel average fusion is carried out on the feature maps obtained by mapping in the BEV and the RGB.
1. Beneficial effects of the invention: the invention uses a laser radar and a camera to sense the surrounding environment, making full use of the accurate depth information acquired by the laser radar and the detailed semantic information retained by the camera. The accuracy of 3D detection of surrounding vehicles is thereby greatly improved.
2. By contrast, the traditional method of detecting vehicles with a single sensor acquires limited information and is constrained by the performance of that single sensor.
3. The invention first extracts the vehicle targets of interest and only then performs pixel-average fusion on those parts of the feature maps, which greatly reduces the computation and improves the real-time performance of vehicle detection.
Drawings
FIG. 1 is a flow chart of a method for vehicle detection based on multi-sensor fusion in accordance with the present invention;
FIG. 2 is a Bird's Eye View (BEV) of a point cloud equally divided into 5 height slices along the Z-axis direction;
(a) shows the radar point cloud in the Z-axis range [0, 0.5]; (b) in [0.5, 1.0]; (c) in [1.0, 1.5]; (d) in [1.5, 2.0]; (e) in [2.0, 2.5];
FIG. 3 shows the framework for extracting features from the point cloud and the RGB image.
Detailed Description
The invention will be further explained with reference to the drawings.
The invention provides a 3D vehicle detection method based on multi-sensor fusion; the detection flow is shown in FIG. 1 and specifically comprises the following steps:
(1) Point cloud data are collected by the laser radar and RGB image information by the camera, and the collected point cloud is preprocessed. The bird's eye view (BEV) of the point cloud is used as the point cloud input; it is obtained by projecting the point cloud data onto a 2D grid on the ground (Z = 0). To retain detailed height information, the BEV is taken over the lateral range [-40, 40] m and the forward range [0, 70] m, with the laser radar position as the center point. According to the actual height of an automobile, the Z-axis range [0, 2.5] m is taken and the point cloud is equally divided into 5 height slices along the Z-axis direction; each slice is projected onto the ground (Z = 0) 2D grid, and the height feature of each slice is the maximum height of the point cloud data projected into each grid cell. The point cloud density M is the number of points in each cell, and the value of each grid cell is normalized:
where N is the number of points in the unit grid cell.
(2) 3D vehicle regions of interest are generated on the point cloud for classifying and localizing targets. Taking the bird's eye view (BEV) as input, a series of 3D candidate frames is first generated; empty frames are removed to reduce the computation, and a binary label, target vehicle or background, is assigned to the content of each remaining frame. A positive label is assigned to two types of anchor boxes, determined by computing the IOU overlap between each anchor box and the real bounding boxes:
(1) the anchor box with the highest IOU overlap with a real bounding box (even when that IOU is less than 0.5);
(2) any anchor box whose IOU overlap with a real bounding box is greater than 0.5.
One real bounding box may thus assign positive labels to multiple anchor boxes. A negative label (background) is assigned to any anchor box whose IOU with every real bounding box is below 0.3; anchor boxes that are neither positive nor negative have no effect on the training objective and are ignored in subsequent processing. After the positively labeled anchor boxes are obtained, a preliminary 3D regression optimization is applied to them. Each 3D prediction box is represented by (x, y, z, h, w, d), where (x, y, z) is the center point of the box and (h, w, d) is its size. By computing the differences (Δx, Δy, Δz, Δh, Δw, Δd) in center point and size between the foreground ROI and the real bounding box, the center of mass and the size of the 3D box in the laser radar coordinate system are discriminated and preliminarily localized for the ROI that will later be generated by mapping onto the feature maps. A 3D anchor box is represented by (x_a, y_a, z_a, h_a, w_a, d_a) and a 3D real bounding box by (x*, y*, z*, h*, w*, d*). Let t_i denote the offset of the prediction box relative to the 3D anchor box, with the 6 parameterized coordinates t_i = (t_x, t_y, t_z, t_h, t_w, t_d), and let t_i* denote the offset of the 3D real bounding box relative to the 3D anchor box, with parameterized coordinates t_i* = (t_x*, t_y*, t_z*, t_h*, t_w*, t_d*). Then:
t_x = (x - x_a)/h_a    t_y = (y - y_a)/w_a
t_z = (z - z_a)/d_a    t_h = log(h/h_a)
t_w = log(w/w_a)    t_d = log(d/d_a)
and performing 3D frame regression by calculating the difference between the centroid and the size between the 3D anchor frame and the 3D real boundary frame, and finally outputting a 3D region of interest in the point cloud.
(3) The bird's eye view (BEV) of the point cloud, obtained by the projection and height slicing described in step (1), is used as the point cloud input to feature extraction.
(4) As shown in FIG. 3, to make full use of the information in the original lowest-level feature map, the upsampled upper-level features and the bottom-level information are fused by a 3X3 convolution operation, so as to obtain rich feature information and a high-resolution map. The feature extractor is based on the VGG-16 architecture. Assuming the input RGB image or BEV map has size H × W × D, using the first three convolutional layers of the VGG-16 network in the downsampling stage yields an output feature map whose resolution is 8 times smaller than the corresponding input, so at this stage the output feature map has spatial size (H/8) × (W/8). The downsampled feature map is passed through a 1X1 convolution to match the channel count of the corresponding upsampling-stage feature map, so that the 3X3 convolution fusion can be performed, yielding a full-resolution feature map in the last layer of the feature extraction framework.
(5) The 3D regions of interest in the point cloud from step (2) are mapped onto the feature maps of the point cloud and the RGB image respectively, and the coordinate position of the corresponding box on each feature map is obtained from the coordinate transformation between the BEV and the RGB image. Because the boxes finally mapped onto the feature maps differ in size, they cannot be fused directly; the mapped feature regions are therefore resized to a fixed 3X3 size, and the feature regions obtained from the BEV and RGB mappings are then fused. The position and size of the surrounding vehicles are finally determined.
The above detailed description is only a specific description of possible embodiments of the present invention; it is not intended to limit the scope of the invention, and equivalent embodiments or modifications that do not depart from the technical spirit of the invention are included within its scope.
Claims (7)
1. A 3D vehicle detection method based on multi-sensor fusion, characterized by comprising the following steps:
step 1, obtaining an RGB image of a vehicle and laser radar point cloud information of the surrounding environment of the vehicle;
step 2, preprocessing the laser radar point cloud information, taking a Z axis [0,2.5] m according to the height of the automobile, and equally dividing the laser radar point cloud into 5 height slices along the Z axis direction;
step 3, generating a 3D vehicle region of interest on the laser radar point cloud; the method comprises the following specific steps:
taking a point cloud aerial view as input, first generating a series of 3D candidate frames, removing the empty candidate frames, and assigning a binary label to the content of each remaining candidate frame, wherein a positive label represents a target vehicle and a negative label represents background; a positive label is assigned to the following two types of anchor frames by computing the IOU overlap between each anchor frame and the real bounding frames:
1) the anchor frame having the highest IOU overlap with a real bounding frame, even when that IOU is less than 0.5;
2) any anchor frame whose IOU overlap with a real bounding frame is greater than 0.5;
assigning a negative label to any anchor frame whose IOU with every real bounding frame is lower than 0.3, wherein anchor frames that are neither positive nor negative have no effect on the training target and are ignored in subsequent processing;
after the positively labeled anchor frames are obtained, performing a preliminary 3D regression optimization on them, wherein each 3D prediction frame is represented by (x, y, z, h, w, d), (x, y, z) being the center point of the frame and (h, w, d) its size; in the laser radar coordinate system, the differences (Δx, Δy, Δz, Δh, Δw, Δd) in center point and size between the foreground ROI and the real bounding frame are computed to discriminate and preliminarily localize the ROI later generated by mapping onto the feature maps; a 3D anchor frame is represented by (x_a, y_a, z_a, h_a, w_a, d_a) and a 3D real bounding frame by (x*, y*, z*, h*, w*, d*); t_i denotes the offset of the prediction frame relative to the 3D anchor frame, with 6 parameterized coordinates t_i = (t_x, t_y, t_z, t_h, t_w, t_d), and t_i* denotes the offset of the 3D real bounding frame relative to the 3D anchor frame, with 6 parameterized coordinates t_i* = (t_x*, t_y*, t_z*, t_h*, t_w*, t_d*); then:
t_x = (x - x_a)/h_a    t_y = (y - y_a)/w_a
t_z = (z - z_a)/d_a    t_h = log(h/h_a)
t_w = log(w/w_a)    t_d = log(d/d_a)
wherein n is the number of bounding boxes in the target area;
performing 3D frame regression by calculating the difference between the centroid and the size between the 3D anchor frame and the 3D real bounding box, and finally outputting a 3D region of interest in the point cloud;
step 4, respectively extracting the features of the processed radar point cloud and the processed RGB image and generating corresponding feature maps;
step 5, respectively mapping the 3D vehicle region of interest to a radar point cloud and a feature map of an RGB image;
and step 6, fusing the mapped portions of the feature maps in step 5, finally realizing the 3D localization and detection of the vehicle target.
2. The multi-sensor fusion-based 3D vehicle detection method according to claim 1, wherein in step 1, the RGB images are obtained by a camera mounted on the vehicle, and the laser radar point cloud is acquired by scanning the surrounding environment with a laser radar located on the roof of the vehicle.
3. The multi-sensor fusion-based 3D vehicle detection method according to claim 1, wherein the preprocessing method in step 2 comprises a processing method of a point cloud aerial view, wherein the point cloud aerial view is obtained by projecting point cloud data to a ground (Z = 0) 2D grid.
4. The multi-sensor fusion-based 3D vehicle detection method according to claim 3, wherein the processing method of the point cloud aerial view is as follows:
taking the position of the laser radar as the center point, taking the lateral range [-40, 40] m and the forward range [0, 70] m of the point cloud aerial view; according to the actual height of an automobile, taking the Z-axis range [0, 2.5] m and equally dividing the point cloud into 5 height slices along the Z-axis direction; projecting each slice onto the ground (Z = 0) 2D grid, the height feature of each slice being the maximum height of the point cloud data projected into the grid map; the point cloud density M being the number of points in each unit grid, the value of each grid is normalized:
where N is the number of points in the unit grid cell.
5. The multi-sensor fusion based 3D vehicle detection method according to claim 1, wherein the specific steps of step 4 comprise:
step 4.1, assuming the size of the input RGB image or point cloud aerial view is H × W × D, using the first three convolutional layers of the VGG-16 network in the downsampling stage makes the resolution of the output feature map 8 times smaller than the corresponding input; at this stage, the output feature map has spatial size (H/8) × (W/8);
step 4.2, performing 2X upsampling on the high-level, low-resolution semantic feature map so that it has the same size as the feature map of the corresponding downsampling stage, performing 3X3 convolution fusion on the feature maps, and obtaining a full-resolution feature map in the last layer of the feature extraction framework; the semantic information comprises the laser radar point cloud and the RGB image.
6. The multi-sensor fusion-based 3D vehicle detection method according to claim 1, wherein the specific mapping method of step 5 is as follows: the 3D regions of interest acquired on the radar point cloud are respectively mapped onto the feature maps of the radar point cloud and the RGB image according to the corresponding coordinate relation between the radar point cloud and the RGB image.
7. The multi-sensor fusion-based 3D vehicle detection method according to claim 6, wherein the specific fusion method of step 6 comprises: performing pixel-average fusion on the feature regions obtained in step 5 by mapping onto the radar point cloud and RGB image feature maps.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910144580.4A CN109948661B (en) | 2019-02-27 | 2019-02-27 | 3D vehicle detection method based on multi-sensor fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109948661A CN109948661A (en) | 2019-06-28 |
CN109948661B true CN109948661B (en) | 2023-04-07 |
Family
ID=67006948
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109948661B (en) |
Families Citing this family (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11165462B2 (en) * | 2018-11-07 | 2021-11-02 | Samsung Electronics Co., Ltd. | Motion assisted leakage removal for radar applications |
US10699563B1 (en) * | 2019-07-12 | 2020-06-30 | GM Global Technology Operations LLC | Multi-sensor multi-object tracking |
CN110363158B (en) * | 2019-07-17 | 2021-05-25 | 浙江大学 | Millimeter wave radar and visual cooperative target detection and identification method based on neural network |
CN110458112B (en) * | 2019-08-14 | 2020-11-20 | 上海眼控科技股份有限公司 | Vehicle detection method and device, computer equipment and readable storage medium |
CN110543858A (en) * | 2019-09-05 | 2019-12-06 | 西北工业大学 | Multi-mode self-adaptive fusion three-dimensional target detection method |
CN110738121A (en) * | 2019-09-17 | 2020-01-31 | 北京科技大学 | front vehicle detection method and detection system |
CN110827202A (en) * | 2019-11-07 | 2020-02-21 | 上海眼控科技股份有限公司 | Target detection method, target detection device, computer equipment and storage medium |
CN110991534A (en) * | 2019-12-03 | 2020-04-10 | 上海眼控科技股份有限公司 | Point cloud data processing method, device, equipment and computer readable storage medium |
CN110929692B (en) * | 2019-12-11 | 2022-05-24 | 中国科学院长春光学精密机械与物理研究所 | Three-dimensional target detection method and device based on multi-sensor information fusion |
CN111144304A (en) * | 2019-12-26 | 2020-05-12 | 上海眼控科技股份有限公司 | Vehicle target detection model generation method, vehicle target detection method and device |
CN113678136A (en) * | 2019-12-30 | 2021-11-19 | 深圳元戎启行科技有限公司 | Obstacle detection method and device based on unmanned technology and computer equipment |
CN111209825B (en) * | 2019-12-31 | 2022-07-01 | 武汉中海庭数据技术有限公司 | Method and device for dynamic target 3D detection |
US11227404B2 (en) * | 2020-02-25 | 2022-01-18 | Zebra Technologies Corporation | Transporter segmentation for data capture system |
CN111291714A (en) * | 2020-02-27 | 2020-06-16 | 同济大学 | Vehicle detection method based on monocular vision and laser radar fusion |
CN113378605B (en) * | 2020-03-10 | 2024-04-09 | 北京京东乾石科技有限公司 | Multi-source information fusion method and device, electronic equipment and storage medium |
CN111239706B (en) * | 2020-03-30 | 2021-10-01 | 许昌泛网信通科技有限公司 | Laser radar data processing method |
CN111476822B (en) * | 2020-04-08 | 2023-04-18 | 浙江大学 | Laser radar target detection and motion tracking method based on scene flow |
CN111583663B (en) * | 2020-04-26 | 2022-07-12 | 宁波吉利汽车研究开发有限公司 | Monocular perception correction method and device based on sparse point cloud and storage medium |
CN111352112B (en) * | 2020-05-08 | 2022-11-29 | 泉州装备制造研究所 | Target detection method based on vision, laser radar and millimeter wave radar |
CN113674346B (en) * | 2020-05-14 | 2024-04-16 | 北京京东乾石科技有限公司 | Image detection method, image detection device, electronic equipment and computer readable storage medium |
CN113705279B (en) * | 2020-05-21 | 2022-07-08 | 阿波罗智联(北京)科技有限公司 | Method and device for identifying position of target object |
CN113763465A (en) * | 2020-06-02 | 2021-12-07 | 中移(成都)信息通信科技有限公司 | Garbage determination system, model training method, determination method and determination device |
CN111833358A (en) * | 2020-06-26 | 2020-10-27 | 中国人民解放军32802部队 | Semantic segmentation method and system based on 3D-YOLO |
CN112001226B (en) * | 2020-07-07 | 2024-05-28 | 中科曙光(南京)计算技术有限公司 | Unmanned 3D target detection method, device and storage medium |
CN113761999B (en) * | 2020-09-07 | 2024-03-05 | 北京京东乾石科技有限公司 | Target detection method and device, electronic equipment and storage medium |
CN112183393A (en) * | 2020-09-30 | 2021-01-05 | 深兰人工智能(深圳)有限公司 | Laser radar point cloud target detection method, system and device |
CN112287859A (en) * | 2020-11-03 | 2021-01-29 | 北京京东乾石科技有限公司 | Object recognition method, device and system, computer readable storage medium |
CN112184539A (en) * | 2020-11-27 | 2021-01-05 | 深兰人工智能(深圳)有限公司 | Point cloud data processing method and device |
CN114556427A (en) * | 2020-12-16 | 2022-05-27 | 深圳市大疆创新科技有限公司 | Point cloud processing method, point cloud processing apparatus, movable platform, and computer storage medium |
CN112711034B (en) * | 2020-12-22 | 2022-10-14 | 中国第一汽车股份有限公司 | Object detection method, device and equipment |
US11482007B2 (en) | 2021-02-10 | 2022-10-25 | Ford Global Technologies, Llc | Event-based vehicle pose estimation using monochromatic imaging |
CN112946689A (en) * | 2021-03-08 | 2021-06-11 | 苏州岭纬智能科技有限公司 | Integrated laser radar system and detection method thereof |
CN112990229A (en) * | 2021-03-11 | 2021-06-18 | 上海交通大学 | Multi-modal 3D target detection method, system, terminal and medium |
CN113011317B (en) * | 2021-03-16 | 2022-06-14 | 青岛科技大学 | Three-dimensional target detection method and detection device |
CN113128348B (en) * | 2021-03-25 | 2023-11-24 | 西安电子科技大学 | Laser radar target detection method and system integrating semantic information |
CN113222111A (en) * | 2021-04-01 | 2021-08-06 | 上海智能网联汽车技术中心有限公司 | Automatic driving 4D perception method, system and medium suitable for all-weather environment |
CN113192091B (en) * | 2021-05-11 | 2021-10-22 | 紫清智行科技(北京)有限公司 | Long-distance target sensing method based on laser radar and camera fusion |
CN113408454B (en) * | 2021-06-29 | 2024-02-06 | 上海高德威智能交通系统有限公司 | Traffic target detection method, device, electronic equipment and detection system |
CN113655497B (en) * | 2021-08-30 | 2023-10-27 | 杭州视光半导体科技有限公司 | Method for scanning region of interest based on FMCW solid-state scanning laser radar |
CN114463660A (en) * | 2021-12-14 | 2022-05-10 | 江苏航天大为科技股份有限公司 | Vehicle type judging method based on video radar fusion perception |
CN115909034A (en) * | 2022-11-29 | 2023-04-04 | 白城师范学院 | Point cloud target identification method and device based on scene density perception and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107533630A (en) * | 2015-01-20 | 2018-01-02 | 索菲斯研究股份有限公司 | Real-time machine vision and point cloud analysis for remote sensing and vehicle control |
CN109100741A (en) * | 2018-06-11 | 2018-12-28 | 长安大学 | Object detection method based on 3D laser radar and image data |
- 2019-02-27: Application CN201910144580.4A filed in China; granted as patent CN109948661B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN109948661A (en) | 2019-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109948661B (en) | 3D vehicle detection method based on multi-sensor fusion | |
CN112287860B (en) | Training method and device of object recognition model, and object recognition method and system | |
CN111583337A (en) | Omnibearing obstacle detection method based on multi-sensor fusion | |
CN111209825B (en) | Method and device for dynamic target 3D detection | |
CN113192091B (en) | Long-distance target sensing method based on laser radar and camera fusion | |
CN112825192B (en) | Object identification system and method based on machine learning | |
CN115049700A (en) | Target detection method and device | |
Wang et al. | An overview of 3d object detection | |
CN112097732A (en) | Binocular camera-based three-dimensional distance measurement method, system, equipment and readable storage medium | |
CN113658257B (en) | Unmanned equipment positioning method, device, equipment and storage medium | |
CN115187964A (en) | Automatic driving decision-making method based on multi-sensor data fusion and SoC chip | |
CN114331986A (en) | Dam crack identification and measurement method based on unmanned aerial vehicle vision | |
TWI745204B (en) | High-efficiency LiDAR object detection method based on deep learning | |
CN112287859A (en) | Object recognition method, device and system, computer readable storage medium | |
CN116486287A (en) | Target detection method and system based on environment self-adaptive robot vision system | |
CN117274749B (en) | Fused 3D target detection method based on 4D millimeter wave radar and image | |
CN113408324A (en) | Target detection method, device and system and advanced driving assistance system | |
CN113095152A (en) | Lane line detection method and system based on regression | |
CN112395962A (en) | Data augmentation method and device, and object identification method and system | |
CN117111055A (en) | Vehicle state sensing method based on thunder fusion | |
CN115457358A (en) | Image and point cloud fusion processing method and device and unmanned vehicle | |
TWI673190B (en) | Vehicle detection method based on optical radar | |
CN113219472B (en) | Ranging system and method | |
Eraqi et al. | Static free space detection with laser scanner using occupancy grid maps | |
CN113611008B (en) | Vehicle driving scene acquisition method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||