CN111291714A - Vehicle detection method based on monocular vision and laser radar fusion

Vehicle detection method based on monocular vision and laser radar fusion

Info

Publication number
CN111291714A
Authority
CN
China
Prior art keywords
point cloud
image
fusion
bounding box
candidate
Prior art date
Legal status
Pending
Application number
CN202010124991.XA
Other languages
Chinese (zh)
Inventor
张立军
孟德建
黄露莹
张状
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202010124991.XA priority Critical patent/CN111291714A/en
Publication of CN111291714A publication Critical patent/CN111291714A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/584 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention relates to a vehicle detection method based on monocular vision and laser radar fusion, which comprises the following steps: S1: acquiring an image feature map; S2: acquiring a point cloud feature map; S3: extracting a point cloud feature vector f_lidar and an image feature vector f_RGB from the point cloud feature map and the image feature map, respectively; S4: performing feature fusion on the point cloud feature vector f_lidar and the image feature vector f_RGB to obtain a fusion feature f_L; S5: obtaining a 3D bounding box of the vehicle and the corresponding category parameters according to the fusion feature f_L; S6: removing overlapping 3D bounding boxes to obtain the final 3D bounding box and corresponding parameters, completing vehicle detection. Compared with the prior art, the method helps to solve both the difficulty of effectively estimating the vehicle position from monocular vision and the missed detections of the laser radar caused by sparse long-range point clouds, thereby further improving the three-dimensional vehicle detection result.

Description

Vehicle detection method based on monocular vision and laser radar fusion
Technical Field
The invention relates to the field of automatic driving environment perception, in particular to a vehicle detection method based on monocular vision and laser radar fusion.
Background
Vehicle detection is an indispensable component of an automatic driving environment perception system, and target detection is also a fundamental problem in computer vision. Despite the tremendous advances researchers have made in this area in recent years, developing a high-accuracy, high-efficiency, robust target detection system that can be used for autonomous driving remains a significant challenge. Vehicle detection is realized through three-dimensional target detection, which outputs a three-dimensional bounding box containing the position and pose of the target vehicle in the three-dimensional environment; only with this information can the automatic driving decision system make further driving decisions, so three-dimensional detection is of particular importance for automatic driving.
Because the various sensors each have their own advantages and disadvantages, information fusion across multi-modal sensors has become a necessary choice. Fusion compensates for the limited environmental information obtained by any single sensor and, by combining the strengths of different sensors, provides a perception system with stronger fault tolerance and higher safety; it can greatly improve the reliability, accuracy and adaptability of the vehicle environment perception system, especially in complex road traffic conditions.
For the three-dimensional target detection task of a vehicle, a camera has the advantage of capturing more detail, and an image contains richer semantic information; the point cloud obtained by a laser radar sensor is sparse compared with the image but carries high-precision three-dimensional position information. In theory, fusing the visible-light image with the laser radar point cloud can therefore yield a more accurate three-dimensional perception result.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a vehicle detection method based on monocular vision and laser radar fusion, which helps to solve both the difficulty of effectively estimating the vehicle position from monocular vision and the missed detections of the laser radar caused by sparse long-range point clouds, thereby further improving the three-dimensional vehicle detection result.
The purpose of the invention can be realized by the following technical scheme:
a vehicle detection method based on monocular vision and laser radar fusion comprises the following steps:
S1: acquiring an image feature map;
S2: acquiring a point cloud feature map;
S3: extracting a point cloud feature vector f_lidar and an image feature vector f_RGB from the point cloud feature map and the image feature map, respectively;
S4: performing feature fusion on the point cloud feature vector f_lidar and the image feature vector f_RGB to obtain a fusion feature f_L;
S5: obtaining a 3D bounding box of the vehicle and the corresponding parameters according to the fusion feature f_L;
S6: removing overlapping 3D bounding boxes to obtain the final 3D bounding box and corresponding parameters, completing vehicle detection.
The step S3 specifically includes:
S301: extracting 3D candidate regions from the point cloud feature map;
S302: projecting each 3D candidate region into the image feature map and the point cloud feature map respectively to obtain regions of interest (RoI);
S303: using the regions of interest to crop image region features and point cloud region features from the image feature map and the point cloud feature map, respectively;
S304: scaling the image region features and the point cloud region features to the same set size to obtain the equal-length point cloud feature vector f_lidar and image feature vector f_RGB.
The projecting of the 3D candidate region into the image feature map specifically includes: projecting the 3D candidate region into the image feature map using the projection formula from point cloud coordinates to image coordinates. A point (x, y, z) in the laser radar coordinate system is projected onto the image plane to obtain the image coordinate (u, v); the projection formula from point cloud coordinates to image coordinates is:

[u, v, 1]^T ∝ P_rect^(2) · R_rect^(0) · T_velo^cam · [x, y, z, 1]^T

where R_rect^(0) is the correction rotation matrix from the reference camera 0 (left grayscale camera) to the image plane of camera 2 (left color camera), of size 4 × 4; P_rect^(2) is the corrective projection matrix of camera 2; and T_velo^cam is the rotation and translation matrix from the laser radar coordinate system to the camera coordinate system.
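For illustration only, the following Python sketch shows how such a projection could be computed, assuming KITTI-style calibration matrices; the names P2, R0_rect and Tr_velo_to_cam, and the helper function itself, are illustrative rather than taken from the patent.

```python
import numpy as np

def project_lidar_to_image(points_xyz, P2, R0_rect, Tr_velo_to_cam):
    """Project Nx3 lidar points (x, y, z) to pixel coordinates (u, v).

    P2             : 3x4 corrective projection matrix of the left color camera.
    R0_rect        : 4x4 correction rotation matrix of the reference camera.
    Tr_velo_to_cam : 4x4 rotation-translation matrix, lidar frame -> camera frame.
    """
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])   # Nx4 homogeneous points
    cam = R0_rect @ Tr_velo_to_cam @ pts_h.T           # 4xN, rectified camera frame
    img = P2 @ cam                                     # 3xN, homogeneous image coords
    uv = img[:2] / img[2]                              # perspective divide
    return uv.T                                        # Nx2 pixel coordinates (u, v)
```

Projecting the eight corners of a 3D candidate box this way and taking the minimum and maximum of (u, v) gives the corresponding image-plane region of interest.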
The projecting the 3D candidate region into the point cloud feature map specifically includes: firstly, projecting the 3D candidate area on a bird's-eye view, and then obtaining corresponding coordinates on the point cloud characteristic map in proportion through the bird's-eye view coordinates.
The feature fusion adopts pre-fusion, which specifically comprises: fusing the point cloud feature vector f_lidar and the image feature vector f_RGB at the input stage.
The formula of the feature fusion is:

f_L = H_L(H_(L-1)(… H_1(f_lidar ⊕ f_RGB) …))

where f_L is the fused output, {H_l, l = 1, …, L} are feature transformation functions, and ⊕ is the fusion operation; the fusion operation includes concatenation, summation or element-wise averaging.
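As a rough sketch of this pre-fusion (not part of the patent text), the PyTorch snippet below fuses the two equal-length feature vectors by element-wise averaging and passes the result through a stack of fully connected transformations H_l; the layer sizes and depth are assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Pre-fusion: combine f_lidar and f_RGB at the input, then apply H_1 ... H_L."""
    def __init__(self, feat_dim=7 * 7 * 32, hidden=512, num_layers=3):
        super().__init__()
        layers, in_dim = [], feat_dim              # element-wise mean keeps the length
        for _ in range(num_layers):                # H_1 ... H_L as fully connected layers
            layers += [nn.Linear(in_dim, hidden), nn.ReLU(inplace=True)]
            in_dim = hidden
        self.mlp = nn.Sequential(*layers)

    def forward(self, f_lidar, f_rgb):
        fused_in = 0.5 * (f_lidar + f_rgb)         # element-wise averaging as the fusion op
        return self.mlp(fused_in)                  # f_L
```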
The step S5 specifically includes:
S501: inputting the fusion feature f_L into a detection network;
S502: obtaining the 3D bounding box of the vehicle, and performing regression separately on the class, the coordinates and size, and the direction vector of the 3D bounding box.
The point cloud feature map is obtained through a VoxelNet network, the 3D candidate regions are extracted by the region proposal network of the VoxelNet network, and the detection network consists of three fully connected networks.
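Purely as an illustration of such a detection stage, the sketch below uses one linear layer per branch; the output dimensions (a two-class score, a 7-dimensional box residual, a 2-dimensional direction vector) are assumptions rather than values specified in the patent.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Three parallel fully connected branches operating on the fusion feature f_L."""
    def __init__(self, in_dim=512):
        super().__init__()
        self.cls_branch = nn.Linear(in_dim, 2)   # vehicle / background score
        self.box_branch = nn.Linear(in_dim, 7)   # (x, y, z, l, w, h, theta) residuals
        self.dir_branch = nn.Linear(in_dim, 2)   # direction vector

    def forward(self, f_L):
        return self.cls_branch(f_L), self.box_branch(f_L), self.dir_branch(f_L)
```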
When training the model, end-to-end training is performed by minimizing the loss function L_FINAL, whose expression is:

L_FINAL = L_VoxelNet_RPN + L_DET    (1)

L_VoxelNet_RPN = α·(1/N_pos)·Σ_i L_cls(p_i^pos, 1) + β·(1/N_neg)·Σ_j L_cls(p_j^neg, 0) + (1/N_pos)·Σ_i L_reg(u_i, u_i*)    (2)

L_DET = (1/N_cls)·Σ_k L_cls(q_k, q_k*) + λ·(1/N_pos)·Σ_j [L_reg(u_j, u_j*) + L_ang(v_j, v_j*)]    (3)

In equation (1), L_VoxelNet_RPN is the loss function of the VoxelNet region proposal network and L_DET is the multi-task loss function of the detection network.

In equation (2), p_i^pos and p_j^neg are the confidence-map outputs corresponding to the positive and negative sample anchor boxes, and N_pos and N_neg are the numbers of positive and negative sample anchor boxes. For a vehicle target, an anchor box is considered a positive sample when its intersection-over-union with any ground-truth bounding box is greater than 0.6, or when, among all anchor boxes, it has the largest intersection-over-union with some ground-truth bounding box; an anchor box is considered a negative sample when its intersection-over-union with every ground-truth bounding box is less than 0.45; anchor boxes whose intersection-over-union with all ground-truth bounding boxes lies between 0.45 and 0.6 are ignored. The classification loss L_cls is the cross-entropy loss. u_i = (u_ix, u_iy, u_iz, u_il, u_iw, u_ih, u_iθ) is the vector of normalized differences between a predicted bounding box and the corresponding positive-sample anchor box, while u_i* is the vector of differences between the corresponding ground-truth bounding box and that positive-sample anchor box. The regression loss L_reg is the Smooth L1 function, and the hyper-parameters α and β balance the classification losses of the positive and negative samples; here α = 1.5 and β = 1.

In equation (3), k is the index, within the mini-batch, of a candidate region input to the detection network, and q_k is the predicted probability, output by the bounding-box classification-regression branch, that bounding box k is a vehicle. q_k* is the ground-truth label: a candidate region k is considered a positive sample with q_k* = 1 when its intersection-over-union with any ground-truth bounding box is greater than 0.65, and a negative sample with q_k* = 0 otherwise. The classification loss L_cls is again the cross-entropy loss. u_j and u_j* are defined as in equation (2): they are, respectively, the vector of differences between a predicted bounding box and the corresponding positive-sample candidate bounding box, and the vector of differences between the corresponding ground-truth bounding box and that positive-sample candidate bounding box. L_reg and L_ang both use the Smooth L1 function, N_pos is the number of positive-sample anchor boxes, v_j is the direction-vector difference between a predicted bounding box and the corresponding positive-sample candidate bounding box, v_j* is the direction-vector difference between the corresponding ground-truth bounding box and that positive-sample candidate bounding box, λ is a hyper-parameter balancing the classification loss and the regression loss, and N_cls, the sum of the numbers of positive and negative sample anchor boxes in the mini-batch, normalizes the classification loss.
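For illustration only, the sketch below implements the positive/negative anchor assignment rule used in equation (2), with the 0.6 and 0.45 thresholds given above; it assumes an anchor-by-ground-truth intersection-over-union matrix has already been computed, and the function name is illustrative.

```python
import numpy as np

def assign_anchors(iou_matrix, pos_thr=0.6, neg_thr=0.45):
    """iou_matrix[i, j]: IoU between anchor i and ground-truth box j. Returns 1/0/-1 labels."""
    labels = np.full(iou_matrix.shape[0], -1, dtype=np.int8)   # -1 = ignored (0.45..0.6 band)
    best_iou = iou_matrix.max(axis=1)
    labels[best_iou < neg_thr] = 0                             # negative samples
    labels[best_iou > pos_thr] = 1                             # positive samples
    labels[iou_matrix.argmax(axis=0)] = 1                      # best anchor per ground truth is positive
    return labels
```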
The removing of overlapping 3D bounding boxes specifically comprises: using 0.01 as the intersection-over-union threshold, removing overlapping 3D bounding boxes by 2D non-maximum suppression in the top view.
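A minimal, illustrative sketch of this top-view non-maximum suppression is given below; it assumes axis-aligned top-view rectangles in (x1, y1, x2, y2) form and uses the 0.01 threshold stated above.

```python
import numpy as np

def nms_bev(boxes_xyxy, scores, iou_thr=0.01):
    """boxes_xyxy: Nx4 top-view rectangles; scores: N confidences. Returns kept indices."""
    order, keep = scores.argsort()[::-1], []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # intersection of box i with every remaining box
        x1 = np.maximum(boxes_xyxy[i, 0], boxes_xyxy[rest, 0])
        y1 = np.maximum(boxes_xyxy[i, 1], boxes_xyxy[rest, 1])
        x2 = np.minimum(boxes_xyxy[i, 2], boxes_xyxy[rest, 2])
        y2 = np.minimum(boxes_xyxy[i, 3], boxes_xyxy[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes_xyxy[i, 2] - boxes_xyxy[i, 0]) * (boxes_xyxy[i, 3] - boxes_xyxy[i, 1])
        area_r = (boxes_xyxy[rest, 2] - boxes_xyxy[rest, 0]) * (boxes_xyxy[rest, 3] - boxes_xyxy[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]                 # drop boxes overlapping the kept one
    return keep
```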
When the 3D candidate regions are extracted, the point cloud feature map is reduced by convolution to the same number of channels as the image feature map; during model training, 1024 candidate regions are retained after non-maximum suppression, while during detection only the first 300 candidate regions are retained after non-maximum suppression.
Compared with the prior art, the invention has the following advantages:
1) by fusing the image obtained by monocular vision with the point cloud obtained by the laser radar, the invention addresses both the difficulty of effectively estimating the vehicle position from monocular vision and the missed detections of the laser radar caused by sparse long-range point clouds, further improving the three-dimensional vehicle detection result;
2) the invention proposes a technical route of extracting image features with a feature pyramid, extracting point cloud features with a VoxelNet network, extracting region-of-interest features based on a region proposal network, fusing features with a pre-fusion strategy and refining the detection result with non-maximum suppression, providing a new idea for three-dimensional vehicle target detection;
3) the invention adopts a two-stage target detection structure similar to Faster R-CNN, in which the region proposal network's screening of easy samples better suppresses the sample imbalance problem;
4) retaining different numbers of 3D candidate regions during training and detection, and reducing the dimensionality of the point cloud feature map, preserve the accuracy and reliability of the detection model while reducing the runtime, memory footprint and computation of detection, improving vehicle detection efficiency;
5) removing redundant overlapping 3D bounding boxes by non-maximum suppression eliminates overlapping detection results, so the detection results are more accurate, closer to the real scene and more practical.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is an overall flow chart of the present invention;
FIG. 3 is a schematic representation of feature fusion.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
A vehicle detection method based on monocular vision and laser radar fusion comprises: extracting an image feature map with a feature pyramid network; obtaining a point cloud feature map with a VoxelNet network and extracting 3D candidate regions; extracting the point cloud features and image features of the regions of interest based on the candidate regions; fusing the image features and point cloud features with a pre-fusion strategy to obtain fusion features; estimating the target class and 3D bounding box from the fusion features; and removing redundant overlapping bounding boxes with non-maximum suppression. The overall flow of the method is shown in fig. 2 and specifically comprises the following steps:
Step 1: normalizing the input image: the mean values of the training-set images on the three R/G/B channels are first calculated, and these means are then subtracted from the pixel values of the image on the three R/G/B channels during both training and detection;
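For illustration, a minimal sketch of this per-channel mean subtraction, assuming the training set is available as a list of H × W × 3 RGB arrays (the helper names are illustrative):

```python
import numpy as np

def channel_means(train_images):
    """Average the per-image R/G/B means over the whole training set."""
    return np.mean([img.reshape(-1, 3).mean(axis=0) for img in train_images], axis=0)

def normalize_image(img_rgb, mean_rgb):
    """Subtract the training-set channel means from an H x W x 3 image."""
    return img_rgb.astype(np.float32) - mean_rgb.astype(np.float32)
```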
Step 2: for the image information, a feature pyramid is used as the feature extraction network; the image is input into the network to obtain an image convolution feature map of size 360 × 1200 × 32;
Step 3: for the laser radar point cloud, a VoxelNet network is first used to obtain a point cloud feature map of size 200 × 176 × 768;
Step 4: extracting 3D candidate regions with the region proposal network of the VoxelNet network, where the orientations of the output candidate regions are all 0° or 90°; the loss function used by the VoxelNet region proposal network is:

L_VoxelNet_RPN = α·(1/N_pos)·Σ_i L_cls(p_i^pos, 1) + β·(1/N_neg)·Σ_j L_cls(p_j^neg, 0) + (1/N_pos)·Σ_i L_reg(u_i, u_i*)

where p_i^pos and p_j^neg are the confidence-map outputs corresponding to the positive and negative sample anchor boxes, and N_pos and N_neg are the numbers of positive and negative sample anchor boxes. For a vehicle target, an anchor box is considered a positive sample when its intersection-over-union with any ground-truth bounding box is greater than 0.6, or when, among all anchor boxes, it has the largest intersection-over-union with some ground-truth bounding box; an anchor box is considered a negative sample when its intersection-over-union with every ground-truth bounding box is less than 0.45; anchor boxes whose intersection-over-union with all ground-truth bounding boxes lies between 0.45 and 0.6 are ignored. The classification loss L_cls is the cross-entropy loss; u_i = (u_ix, u_iy, u_iz, u_il, u_iw, u_ih, u_iθ) is the vector of normalized differences between a predicted bounding box and the corresponding positive-sample anchor box, while u_i* is the vector of differences between the corresponding ground-truth bounding box and that positive-sample anchor box. The regression loss L_reg is the Smooth L1 function, and the hyper-parameters α and β balance the classification losses of the positive and negative samples, set here to α = 1.5 and β = 1.
In the training process, the candidate regions output by the VoxelNet are subjected to non-maximum suppression (the 2D IoU threshold value is 0.8), 1024 candidate regions are reserved and input into a subsequent detection network. In the detection process, in order to reduce the operation time, the non-maximum suppression reserves the first 300 candidate regions. Because the number of channels of the image feature map is different from that of the point cloud feature map, in order to facilitate subsequent fusion and reduce memory occupation and calculation amount during inference, the dimension of the point cloud feature map is reduced to 200 × 176 × 32 by using 1 × 1 convolution, so that the number of channels is equal to that of the image feature map.
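As an illustration of this channel reduction (variable names are illustrative), a 1 × 1 convolution in PyTorch maps the 768-channel point cloud feature map to 32 channels so it matches the image feature map; PyTorch uses NCHW layout:

```python
import torch
import torch.nn as nn

reduce_dim = nn.Conv2d(in_channels=768, out_channels=32, kernel_size=1)  # 1x1 convolution

point_cloud_fm = torch.randn(1, 768, 200, 176)   # dummy 200x176x768 point cloud feature map
reduced_fm = reduce_dim(point_cloud_fm)          # shape: (1, 32, 200, 176)
```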
Step 5: projecting the 3D candidate regions obtained by the VoxelNet network onto the image feature map and the point cloud feature map respectively to obtain the regions of interest (RoI).
The method for projecting onto the image feature map uses the projection formula from point cloud coordinates to image coordinates:

[u, v, 1]^T ∝ P_rect^(2) · R_rect^(0) · T_velo^cam · [x, y, z, 1]^T

where R_rect^(0) is the correction rotation matrix from the reference camera 0 (left grayscale camera) to the image plane of camera 2 (left color camera), of size 4 × 4; P_rect^(2) is the corrective projection matrix of camera 2; and T_velo^cam is the rotation and translation matrix from the laser radar coordinate system to the camera coordinate system.
Because the image feature map obtained by the feature pyramid network has the same size as the original image, the image region can be mapped directly onto the feature map to obtain the image region of interest.
the method for projecting the point cloud characteristic diagram comprises the following steps: firstly, projecting the 3D candidate area on a bird's-eye view, wherein the convolution intermediate layer of the VoxelNet network mainly aggregates the features in the height direction, the space structure in the top view direction is still reserved in the subsequent convolution operation, and the corresponding coordinates on the point cloud feature map can be obtained in proportion through the bird's-eye view coordinates.
Using the regions of interest (RoI), region features can be cropped from the image feature map and the point cloud feature map respectively. Because the two region features may differ in size and are difficult to fuse directly, the cropped region features are each resized to 7 × 7 with bilinear interpolation, finally yielding the equal-length point cloud feature vector f_lidar and image feature vector f_RGB.
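The crop-and-resize step could, for example, be realized with torchvision's roi_align, which performs bilinear interpolation to a fixed 7 × 7 grid; the sketch below is illustrative, with the feature map and RoI values chosen only to match the sizes given in this embodiment.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 32, 360, 1200)               # image feature map, NCHW layout
rois = torch.tensor([[0, 100.0, 150.0, 220.0, 260.0]])    # (batch_index, x1, y1, x2, y2)
f_rgb = roi_align(feature_map, rois, output_size=(7, 7))  # -> (1, 32, 7, 7), bilinear sampling
f_rgb = f_rgb.flatten(1)                                  # fixed-length feature vector
```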
Step 6: fusing the point cloud feature vector f_lidar and the image feature vector f_RGB using the pre-fusion strategy, as shown in fig. 3.
The pre-fusion strategy is as follows: assuming the fusion network has L layers, pre-fusion combines f_lidar and f_RGB at the input stage:

f_L = H_L(H_(L-1)(… H_1(f_lidar ⊕ f_RGB) …))

where f_L is the fused feature output after fusing the point cloud feature vector f_lidar and the image feature vector f_RGB, {H_l, l = 1, …, L} are the feature transformation functions, in this embodiment fully connected layers; ⊕ denotes the fusion operation (such as concatenation or summation), for which element-wise averaging is used in this embodiment.
Step 7: inputting the fusion feature f_L into three fully connected networks, which regress the bounding box coordinates and size, the category and the direction vector respectively. The multi-task loss function of the detection network is:

L_DET = (1/N_cls)·Σ_k L_cls(q_k, q_k*) + λ·(1/N_pos)·Σ_j [L_reg(u_j, u_j*) + L_ang(v_j, v_j*)]

where k is the index, within the mini-batch, of a candidate region input to the detection network, and q_k is the predicted probability, output by the bounding-box classification-regression branch, that bounding box k is a vehicle. q_k* is the ground-truth label: a candidate region k is considered a positive sample with q_k* = 1 when its intersection-over-union with any ground-truth bounding box is greater than 0.65, and a negative sample with q_k* = 0 otherwise. The classification loss L_cls is again the cross-entropy loss. u_j and u_j* denote, respectively, the difference between a predicted bounding box and the corresponding positive-sample candidate bounding box and the difference between the corresponding ground-truth bounding box and that positive-sample candidate bounding box. L_reg and L_ang both use the Smooth L1 function, N_pos is the number of positive-sample anchor boxes, v_j is the direction-vector difference between a predicted bounding box and the corresponding positive-sample candidate bounding box, v_j* is the direction-vector difference between the corresponding ground-truth bounding box and that positive-sample candidate bounding box, λ is a hyper-parameter balancing the classification loss and the regression loss, and N_cls, the sum of the numbers of positive and negative sample anchor boxes in the mini-batch, normalizes the classification loss.
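Purely as an illustration of this multi-task loss (the function signature and reduction scheme are assumptions, not the patent's implementation), a PyTorch sketch combining cross-entropy for the class scores with Smooth L1 terms for the box residuals and direction vectors of the positive candidates:

```python
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets,     # scores and 0/1 labels for all candidates
                   box_pred, box_target,        # residuals for positive candidates only
                   dir_pred, dir_target, lam=1.0):
    n_cls = cls_logits.shape[0]                 # positives + negatives (normalises L_cls)
    n_pos = max(box_pred.shape[0], 1)           # number of positive candidates
    l_cls = F.cross_entropy(cls_logits, cls_targets, reduction='sum') / n_cls
    l_reg = F.smooth_l1_loss(box_pred, box_target, reduction='sum') / n_pos
    l_ang = F.smooth_l1_loss(dir_pred, dir_target, reduction='sum') / n_pos
    return l_cls + lam * (l_reg + l_ang)
```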
Step 8: the final overall loss function L_FINAL is the sum of the loss functions of the VoxelNet region proposal network and the detection network; this objective function is optimized to minimize its value, carrying out end-to-end learning and completing model training:

L_FINAL = L_VoxelNet_RPN + L_DET
this process is the inverse algorithm of maximum likelihood estimation.
Step 9: multiple candidate regions may regress to the same or heavily overlapping bounding-box regions in the top view, which is generally impossible in a real road scene. To avoid this, excessively overlapping 3D bounding boxes are removed by 2D non-maximum suppression in the top view with 0.01 as the intersection-over-union threshold, eliminating duplicate detections and yielding the final vehicle detection result.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and those skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A vehicle detection method based on monocular vision and laser radar fusion is characterized by comprising the following steps:
S1: acquiring an image feature map;
S2: acquiring a point cloud feature map;
S3: extracting a point cloud feature vector f_lidar and an image feature vector f_RGB from the point cloud feature map and the image feature map, respectively;
S4: performing feature fusion on the point cloud feature vector f_lidar and the image feature vector f_RGB to obtain a fusion feature f_L;
S5: obtaining a 3D bounding box of the vehicle and the corresponding parameters according to the fusion feature f_L;
S6: removing overlapping 3D bounding boxes to obtain the final 3D bounding box and corresponding parameters, completing vehicle detection.
2. The method for detecting a vehicle based on the fusion of monocular vision and lidar according to claim 1, wherein the step S3 specifically comprises:
S301: extracting 3D candidate regions from the point cloud feature map;
S302: projecting each 3D candidate region into the image feature map and the point cloud feature map respectively to obtain regions of interest (RoI);
S303: using the regions of interest to crop image region features and point cloud region features from the image feature map and the point cloud feature map, respectively;
S304: scaling the image region features and the point cloud region features to the same set size to obtain the equal-length point cloud feature vector f_lidar and image feature vector f_RGB.
3. The method according to claim 2, wherein the projecting of the 3D candidate region into the image feature map specifically comprises: projecting the 3D candidate region into the image feature map using the projection formula from point cloud coordinates to image coordinates; a point (x, y, z) in the laser radar coordinate system is projected onto the image plane to obtain the image coordinate (u, v), and the projection formula from point cloud coordinates to image coordinates is:

[u, v, 1]^T ∝ P_rect^(2) · R_rect^(0) · T_velo^cam · [x, y, z, 1]^T

where R_rect^(0) is the correction rotation matrix from the left grayscale camera to the left color camera image plane, P_rect^(2) is the corrective projection matrix of the left color camera, and T_velo^cam is the rotation and translation matrix from the laser radar coordinate system to the camera coordinate system;
the projecting of the 3D candidate region into the point cloud feature map specifically comprises: first projecting the 3D candidate region onto a bird's-eye view, and then obtaining the corresponding coordinates on the point cloud feature map from the bird's-eye-view coordinates by proportional scaling.
4. The method for vehicle detection based on monocular vision and lidar fusion as claimed in claim 2, wherein the feature fusion employs pre-fusion, specifically comprising: fusing the point cloud feature vector f_lidar and the image feature vector f_RGB at the input stage.
5. The method for vehicle detection based on the fusion of monocular vision and lidar according to claim 4, wherein the formula of the feature fusion is:

f_L = H_L(H_(L-1)(… H_1(f_lidar ⊕ f_RGB) …))

where f_L is the fused output, {H_l, l = 1, …, L} are feature transformation functions, and ⊕ is the fusion operation, the fusion operation comprising concatenation, summation or element-wise averaging.
6. The method for detecting a vehicle based on the fusion of monocular vision and lidar according to claim 2, wherein the step S5 specifically comprises:
S501: inputting the fusion feature f_L into a detection network;
S502: obtaining the 3D bounding box of the vehicle, and performing regression separately on the class, the coordinates and size, and the direction vector of the 3D bounding box.
7. The vehicle detection method based on the fusion of monocular vision and laser radar as claimed in claim 6, wherein the point cloud feature map is obtained through a VoxelNet network, the 3D candidate regions are extracted by the region proposal network of the VoxelNet network, and the detection network consists of three fully connected networks.
8. The method of claim 7, wherein, when training the model, end-to-end training is performed by minimizing a loss function L_FINAL, the expression of the loss function L_FINAL being:

L_FINAL = L_VoxelNet_RPN + L_DET

L_VoxelNet_RPN = α·(1/N_pos)·Σ_i L_cls(p_i^pos, 1) + β·(1/N_neg)·Σ_j L_cls(p_j^neg, 0) + (1/N_pos)·Σ_i L_reg(u_i, u_i*)

L_DET = (1/N_cls)·Σ_k L_cls(q_k, q_k*) + λ·(1/N_pos)·Σ_j [L_reg(u_j, u_j*) + L_ang(v_j, v_j*)]

wherein L_VoxelNet_RPN is the loss function of the VoxelNet region proposal network, L_DET is the multi-task loss function of the detection network, p_i^pos and p_j^neg are the confidence-map outputs corresponding to the positive and negative sample anchor boxes respectively, N_pos and N_neg are the numbers of positive and negative sample anchor boxes, L_cls is the classification loss function, u_i = (u_ix, u_iy, u_iz, u_il, u_iw, u_ih, u_iθ) is the vector of normalized differences between a predicted bounding box and the corresponding positive-sample anchor box, u_i* is the vector of differences between the corresponding ground-truth bounding box and the positive-sample anchor box, L_reg is the regression loss function, α and β are hyper-parameters, k is the index of a candidate region input to the detection network, q_k is the predicted probability, output by the 3D bounding-box classification-regression branch, that bounding box k is a vehicle, q_k* is the ground-truth label, L_ang and L_reg are Smooth L1 functions, N_pos is the number of positive-sample anchor boxes, v_j is the direction-vector difference between a predicted bounding box and the corresponding positive-sample candidate bounding box, v_j* is the direction-vector difference between the corresponding ground-truth bounding box and the positive-sample candidate bounding box, λ is a hyper-parameter balancing the classification loss function and the regression loss function, and N_cls is the sum of the numbers of positive and negative sample anchor boxes.
9. The method for vehicle detection based on monocular vision and laser radar fusion of claim 1, wherein the removing of the overlapping 3D bounding boxes specifically comprises: using 0.01 as the intersection-over-union threshold, removing overlapping 3D bounding boxes by 2D non-maximum suppression in the top view.
10. The method for vehicle detection based on monocular vision and laser radar fusion of claim 1, wherein the point cloud feature map is reduced in dimension by convolution to the same number of channels as the image feature map when extracting the 3D candidate regions; during model training, 1024 candidate regions are retained through non-maximum suppression, and during detection the first 300 candidate regions are retained through non-maximum suppression.
CN202010124991.XA 2020-02-27 2020-02-27 Vehicle detection method based on monocular vision and laser radar fusion Pending CN111291714A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010124991.XA CN111291714A (en) 2020-02-27 2020-02-27 Vehicle detection method based on monocular vision and laser radar fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010124991.XA CN111291714A (en) 2020-02-27 2020-02-27 Vehicle detection method based on monocular vision and laser radar fusion

Publications (1)

Publication Number Publication Date
CN111291714A true CN111291714A (en) 2020-06-16

Family

ID=71029510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010124991.XA Pending CN111291714A (en) 2020-02-27 2020-02-27 Vehicle detection method based on monocular vision and laser radar fusion

Country Status (1)

Country Link
CN (1) CN111291714A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886215A (en) * 2019-02-26 2019-06-14 常熟理工学院 The cruise of low speed garden unmanned vehicle and emergency braking system based on machine vision
CN109948661A (en) * 2019-02-27 2019-06-28 江苏大学 A kind of 3D vehicle checking method based on Multi-sensor Fusion
CN110738121A (en) * 2019-09-17 2020-01-31 北京科技大学 front vehicle detection method and detection system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIN ZHOU ET AL: "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection", arXiv:1711.06396v1 [cs.CV] *
王也 (Wang Ye): "Research on Vehicle Recognition and State Estimation Based on Deep Learning and Virtual Data", China Excellent Doctoral and Master's Dissertations Full-text Database (Doctoral), Engineering Science and Technology II *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111722245B (en) * 2020-06-22 2023-03-10 阿波罗智能技术(北京)有限公司 Positioning method, positioning device and electronic equipment
US11713970B2 (en) 2020-06-22 2023-08-01 Beijing Baidu Netcom Science Technology Co., Ltd. Positioning method, electronic device and computer readable storage medium
CN111722245A (en) * 2020-06-22 2020-09-29 北京百度网讯科技有限公司 Positioning method, positioning device and electronic equipment
EP3842749A2 (en) * 2020-06-22 2021-06-30 Beijing Baidu Netcom Science Technology Co., Ltd. Positioning method, positioning device and electronic device
CN112183578A (en) * 2020-09-01 2021-01-05 国网宁夏电力有限公司检修公司 Target detection method, medium and system
CN112183578B (en) * 2020-09-01 2023-05-23 国网宁夏电力有限公司检修公司 Target detection method, medium and system
CN113762001A (en) * 2020-10-10 2021-12-07 北京京东乾石科技有限公司 Target detection method and device, electronic equipment and storage medium
CN113762001B (en) * 2020-10-10 2024-04-19 北京京东乾石科技有限公司 Target detection method and device, electronic equipment and storage medium
CN114638996A (en) * 2020-12-01 2022-06-17 广州视源电子科技股份有限公司 Model training method, device, equipment and storage medium based on counterstudy
CN112200851A (en) * 2020-12-09 2021-01-08 北京云测信息技术有限公司 Point cloud-based target detection method and device and electronic equipment thereof
CN112200851B (en) * 2020-12-09 2021-02-26 北京云测信息技术有限公司 Point cloud-based target detection method and device and electronic equipment thereof
CN112712129B (en) * 2021-01-11 2024-04-19 深圳力维智联技术有限公司 Multi-sensor fusion method, device, equipment and storage medium
CN112712129A (en) * 2021-01-11 2021-04-27 深圳力维智联技术有限公司 Multi-sensor fusion method, device, equipment and storage medium
CN113066124A (en) * 2021-02-26 2021-07-02 华为技术有限公司 Neural network training method and related equipment
CN112990229A (en) * 2021-03-11 2021-06-18 上海交通大学 Multi-modal 3D target detection method, system, terminal and medium
CN112990050B (en) * 2021-03-26 2021-10-08 清华大学 Monocular 3D target detection method based on lightweight characteristic pyramid structure
CN112990050A (en) * 2021-03-26 2021-06-18 清华大学 Monocular 3D target detection method based on lightweight characteristic pyramid structure
US11532151B2 (en) 2021-05-10 2022-12-20 Tsinghua University Vision-LiDAR fusion method and system based on deep canonical correlation analysis
CN113111974A (en) * 2021-05-10 2021-07-13 清华大学 Vision-laser radar fusion method and system based on depth canonical correlation analysis
CN113111974B (en) * 2021-05-10 2021-12-14 清华大学 Vision-laser radar fusion method and system based on depth canonical correlation analysis
CN113468950A (en) * 2021-05-12 2021-10-01 东风汽车股份有限公司 Multi-target tracking method based on deep learning in unmanned driving scene
CN113724335B (en) * 2021-08-01 2023-12-19 国网江苏省电力有限公司徐州供电分公司 Three-dimensional target positioning method and system based on monocular camera
CN113724335A (en) * 2021-08-01 2021-11-30 国网江苏省电力有限公司徐州供电分公司 Monocular camera-based three-dimensional target positioning method and system
CN113674421B (en) * 2021-08-25 2023-10-13 北京百度网讯科技有限公司 3D target detection method, model training method, related device and electronic equipment
CN113674421A (en) * 2021-08-25 2021-11-19 北京百度网讯科技有限公司 3D target detection method, model training method, related device and electronic equipment
CN113822910A (en) * 2021-09-30 2021-12-21 上海商汤临港智能科技有限公司 Multi-target tracking method and device, electronic equipment and storage medium
CN114118125A (en) * 2021-10-08 2022-03-01 南京信息工程大学 Multi-modal input and space division three-dimensional target detection method
CN114359891A (en) * 2021-12-08 2022-04-15 华南理工大学 Three-dimensional vehicle detection method, system, device and medium
CN114359891B (en) * 2021-12-08 2024-05-28 华南理工大学 Three-dimensional vehicle detection method, system, device and medium
WO2024139375A1 (en) * 2022-12-30 2024-07-04 华为技术有限公司 Data processing method and computer device

Similar Documents

Publication Publication Date Title
CN111291714A (en) Vehicle detection method based on monocular vision and laser radar fusion
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN110119148B (en) Six-degree-of-freedom attitude estimation method and device and computer readable storage medium
WO2020062433A1 (en) Neural network model training method and method for detecting universal grounding wire
CN113378686B (en) Two-stage remote sensing target detection method based on target center point estimation
CN113076871B (en) Fish shoal automatic detection method based on target shielding compensation
CN112884064A (en) Target detection and identification method based on neural network
WO2023019875A1 (en) Vehicle loss detection method and apparatus, and electronic device and storage medium
CN112084869B (en) Compact quadrilateral representation-based building target detection method
CN112950645B (en) Image semantic segmentation method based on multitask deep learning
CN111797688A (en) Visual SLAM method based on optical flow and semantic segmentation
CN111208818B (en) Intelligent vehicle prediction control method based on visual space-time characteristics
CN112308921B (en) Combined optimization dynamic SLAM method based on semantics and geometry
CN114937083B (en) Laser SLAM system and method applied to dynamic environment
CN112767478B (en) Appearance guidance-based six-degree-of-freedom pose estimation method
Fang et al. Sewer defect instance segmentation, localization, and 3D reconstruction for sewer floating capsule robots
WO2021175434A1 (en) System and method for predicting a map from an image
CN112949635B (en) Target detection method based on feature enhancement and IoU perception
CN115482518A (en) Extensible multitask visual perception method for traffic scene
CN117593548A (en) Visual SLAM method for removing dynamic feature points based on weighted attention mechanism
CN113160117A (en) Three-dimensional point cloud target detection method under automatic driving scene
Zhang et al. Front vehicle detection based on multi-sensor fusion for autonomous vehicle
CN117671647B (en) Multitasking road scene perception method
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113012191B (en) Laser mileage calculation method based on point cloud multi-view projection graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200616