CN116486368A - Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene - Google Patents
Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene
- Publication number
- CN116486368A (Application CN202310357033.0A)
- Authority
- CN
- China
- Prior art keywords
- point cloud
- features
- bev
- image
- detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a multi-modal fusion robust three-dimensional object detection method based on an autonomous driving scene, which comprises the following steps. Step 1: acquire the point cloud data and image data of the current frame. Step 2: input the point cloud into a point cloud feature extraction network and convert the current-frame point cloud into bird's eye view (BEV) features. Step 3: input the image data into an image feature extraction network to obtain multi-scale image features. Step 4: input the BEV features into a BEV detection module to obtain a preliminary three-dimensional detection result. Step 5: feed the point cloud and image features output in steps 2 and 3, together with the preliminary detection result from step 4, into an interleaved fusion module, which adaptively fuses the image and point cloud features; the fused result is then used to refine the preliminary detection result of step 4. Compared with existing methods, the invention achieves complementarity between modalities through the interleaved fusion architecture, shows better robustness under various kinds of lidar noise, and has stronger recall capability, thereby improving detection accuracy.
Description
Technical Field
The invention relates to computer vision and pattern recognition technology, and in particular to a multi-modal fusion robust three-dimensional object detection method based on an autonomous driving scene.
Background
With the rapid development of sensor technology and deep learning, three-dimensional object detection based on multi-modal fusion has improved remarkably. Object detection methods that rely on a single sensor have inherent limitations in complex everyday scenarios because of the imaging characteristics of that sensor. An optical camera provides dense, rich instance-level information under good illumination and reflects the texture and color of objects. However, when lighting is poor, for example at night or in rain and fog, the imaging quality of an optical camera degrades sharply, so a purely vision-based three-dimensional detection model generally cannot reach satisfactory accuracy under such conditions. Lidar, a laser sensing technology that captures objects and three-dimensional surfaces in space, has the advantage over optical cameras of providing high-accuracy distance information without being limited by lighting conditions. However, compared with the regular grid structure of an RGB image, lidar point clouds are sparse and unordered, so the convolutional neural networks used in conventional two-dimensional object detection cannot extract point cloud features effectively.
To obtain accurate three-dimensional detection results, the literature (Vora S, Lang A H, Helou B, et al. PointPainting: Sequential fusion for 3D object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 4604-4612.) proposes using features of the RGB image as a prior to establish the correlation between point cloud data and the RGB image: a two-dimensional semantic segmentation network first produces segmentation results in image space, these results are then projected into three-dimensional space with the camera projection matrix to supplement the point cloud features, and the augmented points are finally fed into a point-cloud-based three-dimensional detector. The literature (Xu S, Zhou D, Fang J, et al. FusionPainting: Multimodal fusion with adaptive attention for 3D object detection[C]//2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021: 3047-3054.) uses semantic segmentation networks for both the image and the point cloud to obtain segmentation results as priors, and an adaptive fusion module fuses the dense image information into the point cloud information before detection, which alleviates the boundary-blurring problem caused by the segmentation network. These segmentation-guided fusion methods improve detection accuracy, but they depend too heavily on the point cloud features and do not fully exploit the dense image features; if the point cloud is noisy, the overall result is strongly affected. To increase the weight of image features in the multi-modal fusion model, the invention proposes a multi-modal fusion architecture based on interleaved fusion.
Disclosure of Invention
To overcome the shortcoming of the prior art that detection depends too heavily on point cloud features, the invention provides a multi-modal fusion robust three-dimensional object detection method based on an autonomous driving scene.
The invention uses a rotating lidar and six optical cameras surrounding the vehicle as sensors, which respectively acquire three-dimensional point cloud information and RGB image information. The geometric information of objects around the vehicle is thereby perceived effectively, and targets in different road scenes can be recognized adaptively, for example pedestrians, roadblocks, cars, engineering vehicles, and buses.
The multi-modal fusion robust three-dimensional object detection method based on the autonomous driving scene comprises the following steps:
Step 1: acquire the point cloud data and image data of the current frame.
Step 2: input the point cloud into a point cloud feature extraction network and convert the current-frame point cloud into bird's eye view (BEV) features.
Step 3: input the image data into an image feature extraction network to obtain multi-scale image features.
Step 4: input the BEV features into a BEV detection module to obtain a preliminary three-dimensional detection result.
Step 5: feed the point cloud and image features output in steps 2 and 3, together with the preliminary detection result from step 4, into an interleaved fusion module to adaptively fuse the image and point cloud features; then use the fused result to refine the preliminary detection result of step 4.
The specific flow of step 2 is as follows:
Step 2-1: define the voxel size and detection range. The detection range is manually set to [-54 m, 54 m] along the X and Y axes and [-5 m, 3 m] along the Z axis, and the voxel size is defined as (0.075 m, 0.2 m).
Step 2-2: voxelize the point cloud. The points in space are rasterized according to the defined voxel size to obtain a set of voxels. Within every non-empty voxel, N points are randomly sampled; if fewer than N points exist, the voxel is padded with zeros. A fully connected neural network lifts the sampled points to a higher dimension to obtain the per-point features inside voxel i, and max pooling yields the voxel feature V_i ∈ R^C, where C is the number of feature channels. Empty voxels that contain no points are likewise filled with zeros.
Step 2-3: extract voxel features. The voxel features obtained above are downsampled with sparse three-dimensional convolutions to obtain the BEV feature F_B ∈ R^(C×W×H); conventional two-dimensional convolutions then produce multi-scale BEV features, and an FPN fuses the multi-scale information into the final BEV feature.
The specific flow of step 3 is as follows: first, ResNet-50 extracts the features of the six images at the current time; then an FPN fuses the multi-scale information and outputs feature maps at different scales.
The specific flow of step 4 is as follows:
Step 4-1: dimensionality reduction. A 3×3 convolution reduces the dimensionality of the BEV features to save computation.
Step 4-2: result prediction. The detection task is treated as a set-matching problem: given a group of learnable object query vectors, the BEV features are decoded with a cross-attention mechanism, and the object query vectors serve as containers for the decoded BEV features. Finally, the object query vectors are fed into the regression branch and the classification branch to obtain the detection result in BEV space.
The specific flow of step 5 is as follows; the decoder of the interleaved fusion module is the decoder of step 4-2:
Step 5-1: initialization. The initialization is divided into two parts. First, the object query vectors of step 4-2 are reused as the object query vectors of the interleaved fusion module, and the points inside the preliminary detection boxes are sampled to generate an enhancement vector that is added to the object query vectors. Second, the center points of the detection results of step 4-2 are taken as the "reference points" of the module.
Step 5-2: fuse image features. The object query vector is first fed into a fully connected layer to generate six learnable offsets relative to the reference point. The sampling positions computed from these offsets are used for bilinear interpolation sampling in the input multi-scale image feature maps, while the weights corresponding to the six points are generated with a fully connected layer. Finally, the features of the six points are weighted and summed to update the object query vector.
Step 5-3: fuse point cloud features. The three-dimensional coordinates of the reference point are first converted into coordinates in the BEV view; the BEV feature at the reference point is then obtained by bilinear sampling; finally, the object query vector is updated.
The advantages of the invention are as follows:
1. The invention provides a novel lidar-camera fusion model for three-dimensional object detection. Complementarity between the modalities is achieved through the interleaved fusion architecture, and the model shows better robustness under various kinds of lidar noise.
2. The invention provides a pluggable feature enhancement operation: by adding enhancement vectors generated from the points inside the initial detection results, the network is encouraged to learn difficult samples, which improves the perception results.
3. Compared with conventional fusion methods, the invention better mines and exploits dense image features, so the network has stronger recall capability, missed detections of three-dimensional objects are reduced, and driving safety is better guaranteed.
Drawings
Fig. 1 is a schematic overall flow diagram of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings; the overall flow of the invention is shown in Fig. 1.
The multi-modal fusion robust three-dimensional object detection method based on the autonomous driving scene makes better use of the dense semantic information in the image, thereby achieving better robustness and higher detection accuracy. The specific steps are as follows:
Step 1: acquire the point cloud data and image data of the current frame.
Step 2: input the point cloud into a point cloud feature extraction network and convert the current-frame point cloud into bird's eye view (BEV) features.
Step 3: input the image data into an image feature extraction network to obtain multi-scale image features.
Step 4: input the BEV features into a BEV detection module to obtain a preliminary three-dimensional detection result.
Step 5: feed the image and point cloud features output in steps 2 and 3, together with the preliminary detection result from step 4, into an interleaved fusion module to adaptively fuse the image and point cloud features; then use the fused result to refine the preliminary detection result of step 4.
The specific flow of step 1 is as follows:
Step 1-1: the trigger frequency of the lidar and the six on-board cameras is manually set to 20 Hz; the point cloud data and image data that correspond to each other are found through their timestamps and used as model input.
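The timestamp association of step 1-1 can be illustrated with a small nearest-neighbour matching routine. This is a hypothetical sketch, not part of the patent: the function name, the 25 ms tolerance and the data layout are assumptions for illustration.

```python
# Hypothetical sketch of step 1-1: pair each 20 Hz lidar sweep with the camera
# frames whose timestamps are closest to it. Names and tolerances are illustrative.
import numpy as np

def match_frames(lidar_ts, camera_ts, max_gap_s=0.025):
    """lidar_ts: (L,) lidar trigger times in seconds.
    camera_ts: dict {camera_name: (K,) frame times in seconds}.
    Returns one dict per sweep mapping camera name -> frame index (or None)."""
    matches = []
    for t in lidar_ts:
        frame = {}
        for cam, ts in camera_ts.items():
            idx = int(np.argmin(np.abs(ts - t)))
            frame[cam] = idx if abs(ts[idx] - t) <= max_gap_s else None
        matches.append(frame)
    return matches

# Two sensors triggered at 20 Hz with a small clock offset.
lidar = np.arange(0.0, 1.0, 0.05)
cams = {"CAM_FRONT": np.arange(0.01, 1.0, 0.05), "CAM_BACK": np.arange(0.02, 1.0, 0.05)}
print(match_frames(lidar, cams)[0])
```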
Step 1-2: and obtaining an internal reference matrix of the camera and an external reference matrix from a radar coordinate system to a camera coordinate system by adopting a mature camera calibration method.
The specific flow of step 2 is as follows:
Step 2-1: define the voxel size and detection range. The detection range is manually set to [-54 m, 54 m] along the X and Y axes and [-5 m, 3 m] along the Z axis, and the voxel size is defined as (0.075 m, 0.2 m).
Step 2-2: and (5) voxelization of the point cloud. And rasterizing the point cloud in the space according to the defined voxel size to obtain a plurality of voxels. Secondly, sampling point clouds in each non-empty voxel, randomly selecting N point clouds, and if the point clouds are insufficient, complementing the point clouds with 0. Performing dimension-increasing operation on the sampled point cloud by using the fully connected neural network to obtain the point cloud characteristics in the voxel i, wherein the point cloud characteristics are as followsAnd obtaining the voxel characteristic V by using a maximum pooling method i ∈R C . Wherein C is the number of characteristic channels. In addition, 0-padding is also used for empty voxels where no point cloud exists.
Step 2-3: voxel features are extracted. Downsampling the voxel features obtained by the steps by using sparse three-dimensional convolution to obtain BEV features F B ∈R C×W×H And then obtaining multi-scale BEV features by using traditional two-dimensional convolution and obtaining final BEV features by fusing multi-scale information through FPN.
The specific flow of step 3 is as follows: first, a ResNet-50 feature extractor pre-trained on ImageNet extracts the features of the six images at the current time; then an FPN fuses the multi-scale information and outputs feature maps F_s ∈ R^(C_s×W_s×H_s) at different scales, where C_s, W_s and H_s denote the number of feature channels, the width and the height of the image features at scale s, respectively.
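A hedged sketch of step 3 is given below: ResNet-50 stages provide four feature maps per image and 1×1 lateral convolutions with top-down upsampling form a small FPN. The output channel width of 256, the input resolution and the use of randomly initialised weights (ImageNet-pretrained weights would be loaded in practice) are assumptions for illustration.

```python
# Sketch of step 3: extract multi-scale features for the six camera images with
# ResNet-50 stages and merge them with a simple FPN top-down pathway.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ImageFPN(nn.Module):
    def __init__(self, out_c=256):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)   # pretrained weights in practice
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_c, 1) for c in (256, 512, 1024, 2048)])

    def forward(self, imgs):                            # imgs: (6, 3, H, W)
        x, feats = self.stem(imgs), []
        for stage in self.stages:                       # C2 ... C5
            x = stage(x)
            feats.append(x)
        outs = [self.lateral[-1](feats[-1])]            # start from the coarsest level
        for lat, f in zip(self.lateral[-2::-1], feats[-2::-1]):
            up = F.interpolate(outs[0], size=f.shape[-2:], mode="nearest")
            outs.insert(0, lat(f) + up)                 # top-down merge
        return outs                                     # multi-scale maps F_s

imgs = torch.rand(6, 3, 224, 416)
print([tuple(f.shape) for f in ImageFPN()(imgs)])
```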
The specific flow of step 4 is as follows:
Step 4-1: dimensionality reduction. Because the point cloud branch is computationally heavy and time consuming, a 3×3 convolution is used to reduce the dimensionality of the BEV features in order to lower the cost of back-propagation, which saves computation in the subsequent stages and speeds up model inference.
Step 4-2: and predicting a result. The detection task is regarded as a set matching problem, a set of learnable object query vectors is given, the BEV features are decoded using a cross-attention mechanism, and the object query vectors are used as containers after the BEV features are decoded. Finally, the object query vector is input into the regression branch and the classification branch to obtain the detection result in the BEV space.
The specific flow of step 5 is as follows; the decoder of the interleaved fusion module is the decoder of step 4-2:
Step 5-1: initialization. The initialization is divided into two parts. First, the object query vectors of step 4-2 are reused as the object query vectors of the interleaved fusion module. In addition, as shown in equation (1), I three-dimensional boxes are obtained from the preliminary detection result, Z points P_rand are randomly sampled from the point cloud inside the boxes, and the mapping f_bp produces the d_pc-dimensional high-dimensional point cloud feature P_box. Then, as shown in equation (2), a max-pooling operation generates the enhancement vector P_e, which is added to the object query vectors; a sketch of this operation follows equations (1) and (2). Finally, the center points of the detection results of step 4-2 are taken as the "reference points" of the module.
P_box = f_bp(P_rand)  (1)
P_e = MaxPool(P_box)  (2)
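A minimal sketch of equations (1) and (2) follows, assuming Z = 32 sampled points per box, d_pc = 256 and a two-layer MLP as the mapping f_bp; these values and the class name are illustrative.

```python
# Sketch of the feature enhancement in step 5-1: P_box = f_bp(P_rand) per box,
# then P_e = MaxPool(P_box) is added to the matching object query vectors.
import torch
import torch.nn as nn

Z, D_PC = 32, 256

class BoxPointEncoder(nn.Module):
    def __init__(self, in_dim=3, d_pc=D_PC):
        super().__init__()
        self.f_bp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, d_pc))

    def forward(self, pts_in_boxes):           # (I, Z, 3): Z points sampled in each of I boxes
        p_box = self.f_bp(pts_in_boxes)        # equation (1), shape (I, Z, d_pc)
        return p_box.max(dim=1).values         # equation (2), shape (I, d_pc)

p_rand = torch.rand(200, Z, 3)                 # points randomly sampled inside I = 200 boxes
print(BoxPointEncoder()(p_rand).shape)         # torch.Size([200, 256])
```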
Step 5-2: and fusing the image features. First, as shown in equation (3), C is calculated using a calibration matrix composed of camera internal parameters and external parameters in Each center x of (a) i ,y i ,z i Projecting the two-dimensional center points u to a camera coordinate system to obtain corresponding two-dimensional center points u in the RGB image i ,v i Where d is a scale factor (expressed in the current condition as a depth value in the world coordinate system). Then, as shown in formula (4), the object query vector is sent to the full connection layer to generate six learnable biases Deltax based on the reference point lkp . Next, the sampling position r is calculated by the offset c +△x lkp Thereby bilinear interpolation sampling in the input multi-scale image feature mapSimultaneously, the weight A corresponding to the six points is generated by utilizing the full connection layer lkp . Finally, the characteristics of the six points are weighted and summed for updating the object query vector +.>Where M represents the number of interleaved encoder blocks. k is the index of the sample point, Δx lkp And A lkp Respectively representing the sampling offset and the attention weight of the kth sampling point in the ith feature layer of the jth camera.
Step 5-3: fusing point cloud features. First, the three-dimensional coordinates of the reference point are converted into coordinates at the BEV viewing angle. Second, by bilinear sampling algorithm F L And calculating to obtain BEV characteristics corresponding to the reference points. Finally, as shown in equation (5), the feature is used to update the object query vectorWhere k is the index of the sample point, deltax k And A k The sample offset and the attention weight of the kth sample point are represented, respectively.
Step 6: outputting the detection result. As shown in formulas (6) and (7), the query vectors are input into classifiers f composed of linear layers, respectively class And regression f reg Obtaining a classification prediction result P of the three-dimensional target class And locating the prediction result P res 。
Step 7: when the laser radar is affected by strong light or reflective materials, distortion of laser radar data and loss of point clouds can be caused, and the quality of BEV characteristics is finally affected. In this case, the interlacing fusion module in step 5 will adaptively adjust the attention weights in step 5-2 and step 5-3, reduce the update weight of the BEV feature on the query vector in step 5-3, and increase the image feature update weight in step 5-2, thereby increasing the weight of the image modality in the model prediction process. The step utilizes abundant semantic information in the image space to compensate negative effects caused by laser radar modal data distortion, so that the prediction accuracy of the model working in a laser radar signal noise environment is improved.
The embodiments described in this specification are merely examples of implementations of the inventive concept. The protection scope of the present invention should not be construed as being limited to the specific forms set forth in the embodiments; it also covers equivalent technical means that can be conceived by those skilled in the art based on the inventive concept.
Claims (2)
1. A multi-modal fusion robust three-dimensional object detection method based on an autonomous driving scene, comprising the following specific steps:
step 1: acquiring current frame point cloud data and image data;
step 2: inputting the point cloud into a point cloud feature extraction network and converting the current-frame point cloud into bird's eye view (BEV) features, which is divided into three steps:
step 2-1: defining the voxel size and detection range; the detection range is manually set to [-54 m, 54 m] along the X and Y axes and [-5 m, 3 m] along the Z axis, and the voxel size is defined as (0.075 m, 0.2 m);
step 2-2: voxelizing the point cloud; the points in space are rasterized according to the defined voxel size to obtain a set of voxels; within every non-empty voxel, N points are randomly sampled, and the voxel is padded with zeros if fewer than N points exist; a fully connected neural network lifts the sampled points to a higher dimension to obtain the per-point features inside voxel i, and max pooling yields the voxel feature V_i ∈ R^C, where C is the number of feature channels; empty voxels that contain no points are likewise filled with zeros;
step 2-3: extracting voxel features; the voxel features obtained above are downsampled with sparse three-dimensional convolutions to obtain the BEV feature F_B ∈ R^(C×W×H); conventional two-dimensional convolutions then produce multi-scale BEV features, and an FPN fuses the multi-scale information into the final BEV feature;
step 3: inputting the image data into the image feature extraction network ResNet-50 to extract the features of the six images at the current time, and then fusing the multi-scale information with an FPN to obtain feature maps at different scales;
step 4: inputting the BEV features into a BEV detection module to obtain a preliminary three-dimensional detection result, which is specifically divided into:
step 4-1: dimensionality reduction; a 3×3 convolution reduces the dimensionality of the BEV features to save computation;
step 4-2: result prediction; the detection task is treated as a set-matching problem: given a group of learnable object query vectors, the BEV features are decoded with a cross-attention mechanism, and the object query vectors serve as containers for the decoded BEV features; finally, the object query vectors are fed into the regression branch and the classification branch to obtain the detection result in BEV space;
step 5: feeding the point cloud and image features output in steps 2 and 3, together with the preliminary detection result output in step 4, into an interleaved fusion module to adaptively fuse the image and point cloud features; and refining the preliminary detection result of step 4 with the fused result.
2. The multi-modal fusion robust three-dimensional object detection method based on an autonomous driving scene according to claim 1, wherein the specific flow of step 5 is as follows:
step 5-1: initialization; the initialization is divided into two parts: first, the object query vectors of step 4-2 are used as the object query vectors of the interleaved fusion module, and the points inside the preliminary detection boxes are sampled to generate an enhancement vector that is added to the object query vectors; second, the center points of the detection results of step 4-2 are taken as the reference points of the module;
step 5-2: fusing image features; the object query vector is first fed into a fully connected layer to generate six learnable offsets relative to the reference point; the sampling positions computed from these offsets are used for bilinear interpolation sampling in the input multi-scale image feature maps, while the weights corresponding to the six points are generated with a fully connected layer; finally, the features of the six points are weighted and summed to update the object query vector;
step 5-3: fusing point cloud features; the three-dimensional coordinates of the reference point are first converted into coordinates in the BEV view; the BEV feature at the reference point is then obtained by bilinear sampling; finally, the object query vector is updated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310357033.0A CN116486368A (en) | 2023-04-03 | 2023-04-03 | Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310357033.0A CN116486368A (en) | 2023-04-03 | 2023-04-03 | Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116486368A true CN116486368A (en) | 2023-07-25 |
Family
ID=87214767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310357033.0A Pending CN116486368A (en) | 2023-04-03 | 2023-04-03 | Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116486368A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116740668A (en) * | 2023-08-16 | 2023-09-12 | 之江实验室 | Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium |
CN116740668B (en) * | 2023-08-16 | 2023-11-14 | 之江实验室 | Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium |
CN117058646A (en) * | 2023-10-11 | 2023-11-14 | 南京工业大学 | Complex road target detection method based on multi-mode fusion aerial view |
CN117058646B (en) * | 2023-10-11 | 2024-02-27 | 南京工业大学 | Complex road target detection method based on multi-mode fusion aerial view |
CN117392393A (en) * | 2023-12-13 | 2024-01-12 | 安徽蔚来智驾科技有限公司 | Point cloud semantic segmentation method, computer equipment, storage medium and intelligent equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113128348B (en) | Laser radar target detection method and system integrating semantic information | |
CN111339977B (en) | Small target intelligent recognition system based on remote video monitoring and recognition method thereof | |
CN110570429B (en) | Lightweight real-time semantic segmentation method based on three-dimensional point cloud | |
CN116486368A (en) | Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene | |
CN111046781B (en) | Robust three-dimensional target detection method based on ternary attention mechanism | |
CN112731436B (en) | Multi-mode data fusion travelable region detection method based on point cloud up-sampling | |
CN111292366B (en) | Visual driving ranging algorithm based on deep learning and edge calculation | |
CN116129233A (en) | Automatic driving scene panoramic segmentation method based on multi-mode fusion perception | |
CN115019043B (en) | Cross-attention mechanism-based three-dimensional object detection method based on image point cloud fusion | |
CN116612468A (en) | Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism | |
CN117274749B (en) | Fused 3D target detection method based on 4D millimeter wave radar and image | |
CN117058646B (en) | Complex road target detection method based on multi-mode fusion aerial view | |
CN114495064A (en) | Monocular depth estimation-based vehicle surrounding obstacle early warning method | |
CN114639115B (en) | Human body key point and laser radar fused 3D pedestrian detection method | |
CN117975436A (en) | Three-dimensional target detection method based on multi-mode fusion and deformable attention | |
CN117173399A (en) | Traffic target detection method and system of cross-modal cross-attention mechanism | |
CN117237919A (en) | Intelligent driving sensing method for truck through multi-sensor fusion detection under cross-mode supervised learning | |
CN117111055A (en) | Vehicle state sensing method based on thunder fusion | |
CN116310673A (en) | Three-dimensional target detection method based on fusion of point cloud and image features | |
CN116486396A (en) | 3D target detection method based on 4D millimeter wave radar point cloud | |
CN115100741B (en) | Point cloud pedestrian distance risk detection method, system, equipment and medium | |
CN115115917A (en) | 3D point cloud target detection method based on attention mechanism and image feature fusion | |
CN115082902B (en) | Vehicle target detection method based on laser radar point cloud | |
CN117372697A (en) | Point cloud segmentation method and system for single-mode sparse orbit scene | |
CN116778449A (en) | Detection method for improving detection efficiency of three-dimensional target of automatic driving |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |