CN116778449A - Detection method for improving the detection efficiency of three-dimensional targets in autonomous driving

Detection method for improving the detection efficiency of three-dimensional targets in autonomous driving

Info

Publication number
CN116778449A
CN116778449A (application CN202310490811.3A)
Authority
CN
China
Prior art keywords
features
point
voxel
key
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310490811.3A
Other languages
Chinese (zh)
Inventor
韩世豪
曹杰程
居世豪
卞伟涛
陶重犇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University of Science and Technology
Original Assignee
Suzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University of Science and Technology
Publication of CN116778449A
Legal status: Pending


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00 Arrangements for image or video recognition or understanding
                    • G06V 10/20 Image preprocessing
                        • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
                        • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
                    • G06V 10/40 Extraction of image or video features
                        • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
                            • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
                    • G06V 10/70 Arrangements using pattern recognition or machine learning
                        • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
                            • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
                            • G06V 10/761 Proximity, similarity or dissimilarity measures
                        • G06V 10/764 Using classification, e.g. of video objects
                        • G06V 10/766 Using regression, e.g. by projecting features on hyperplanes
                        • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                            • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
                            • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                                • G06V 10/806 Fusion of extracted features
                        • G06V 10/82 Using neural networks
                • G06V 20/00 Scenes; Scene-specific elements
                    • G06V 20/50 Context or environment of the image
                        • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
                            • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
                    • G06V 20/60 Type of objects
                        • G06V 20/64 Three-dimensional objects
                • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
                    • G06V 2201/07 Target detection
                    • G06V 2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a detection method for improving the detection efficiency of three-dimensional targets in autonomous driving, comprising the following steps: extracting a group of key points from the raw point cloud with farthest point sampling; quantizing each key point to be queried into a voxel and obtaining its neighboring voxel feature set by computing Manhattan distances; performing feature enhancement through matching and aggregation; obtaining the initial key point features by concatenating the raw point cloud features, the multi-scale voxel aggregation features and the bird's-eye-view features; optimizing the key point features; uniformly partitioning each region of interest with a set of virtual grid points, setting a key point threshold and a set-abstraction radius, and screening the grid points; correcting the direction and boundary of each proposal by fitting a minimum bounding rectangle and using the weighted key point features to obtain a refined 3D box; and repeating the above steps until all key points have been traversed, yielding the final three-dimensional target detection result. The invention improves prediction accuracy and reduces computation.

Description

Detection method for improving the detection efficiency of three-dimensional targets in autonomous driving
Technical Field
The invention relates to three-dimensional target detection, in particular to target detection for autonomous driving, and more specifically to a three-dimensional target detection method that improves detection efficiency through point-voxel fusion.
Background
Various modules built on 3D object detection technology play an important role in fields such as autonomous driving, robotics and path planning. Traditional 3D object detection methods focus mainly on the raw point cloud acquired by the LiDAR sensor. Although the raw point cloud provides very accurate depth information, its sparsity inevitably limits the range of applications. Recent research can be divided into three categories: point-based methods, voxel-based methods, and methods that combine points and voxels. All three categories use convolutional neural networks (CNNs) to extract a 3D representation; the difference between them lies in how the 3D representation of the target object is formed.
Point-based methods directly take the raw point cloud as the processing object. They extract point-based features and can therefore provide accurate 3D spatial representations. The family of methods based on PointNet solves the disorder problem of the raw point cloud by applying a symmetric function over transformed elements of the set, but extracting features directly from the raw point cloud inevitably brings a large amount of computation. To address this, current point-based methods adopt a two-stage pipeline. Taking PointRCNN as an example (see Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li, "PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-779, 2019), the first stage estimates 3D object proposals from the segmented foreground points, and the second stage refines the 3D proposals with more accurate point-based features to generate the final predictions. Point-based approaches, while achieving high prediction performance, still cannot avoid the expensive computational cost of extracting features from individual points.
Voxel-based methods convert the irregular point cloud into an ordered voxel grid, which addresses the problems caused by the sparsity and irregularity of the raw point cloud. In early studies, the point cloud was projected into a bird's-eye view (BEV) to obtain a denser representation, but such a projection preserves only part of the 3D information of the target object. To solve this problem, more recent approaches use 3D convolution to extract voxel features from the voxel representation. VoxelNet (see Zhou Y., Tuzel O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4490-4499) was the first to apply PointNet to a LiDAR point cloud and process it jointly through 3D convolutional layers, a 2D backbone and a detection head. Its drawback is also apparent: VoxelNet runs slowly. A series of subsequent improvements increased the running speed and allowed voxel-based methods to be widely developed and deployed. However, these methods are still limited by the quantization error introduced when partitioning voxels. In particular, voxel-based methods have two main drawbacks: first, a large amount of fine-grained 3D structural information is lost; second, the size of the voxel grid greatly affects the performance of the algorithm.
To simultaneously address the high computational cost and the loss of fine-grained structural information, recent methods choose to combine point-based and voxel-based approaches. PV-RCNN (see Shi S., Guo C., Jiang L., et al. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 10529-10538) extends SECOND by adding point-based features. PV-RCNN also projects to a 2D view, but it differs in that multi-scale 3D voxel features are acquired with a voxel set abstraction module and integrated into neighboring key points; each 3D region proposal then extracts key point features through ROI-grid pooling to refine the final proposal. Detection methods that combine points and voxels inherit the advantages of both voxel-based and point-based methods, but the problems of large computation, low speed and insufficient use of complementary information still cannot be completely avoided. Moreover, all three categories of methods suffer from the size ambiguity problem (Size Ambiguity Problem): they may ignore the proposal's boundary size, which can degrade algorithm performance in some complex scenes.
3D object detection needs to estimate 7 degrees of freedom (3D position, dimensions and heading direction) of surrounding objects in a real, complex 3D space. Most current 3D object detection methods fit objects directly with axis-aligned boxes, and the sparsity of the point cloud leaves large regions of 3D space without measurements, so the 3D boxes in the output may fail to align with the global coordinate system. Conventional anchor-based methods can also fit rotated objects with incorrect orientations.
Therefore, in order to improve the efficiency of three-dimensional target detection and meet the application requirements of autonomous driving, the problems of the prior art, namely large computation, size ambiguity, angle deviation and the imbalance of positive and negative samples among the generated key points, need to be solved.
Disclosure of Invention
The invention aims to provide a detection method for improving the detection efficiency of three-dimensional targets in autonomous driving that improves prediction accuracy and reduces computation.
To achieve the aim of the invention, the following technical scheme is adopted: a detection method for improving the detection efficiency of three-dimensional targets in autonomous driving, comprising the steps of:
(1) Acquiring original point cloud data to be detected;
(2) Extracting a group of key points P from the raw point cloud using farthest point sampling, where each key point is denoted p_i and i is the index of the key point;
(3) Quantizing the key point p_i to be queried into its corresponding voxel, then judging whether each surrounding non-empty voxel is a neighboring voxel of the queried key point by computing the Manhattan distance between that voxel and the key point's voxel, and obtaining the neighboring voxel feature set S_i^(K) of key point p_i, where K denotes the multiple of the current downsampling;
(4) Computing the matching probability between the point-based feature of key point p_i and each voxel-based feature in its neighboring voxel feature set, selecting the k most similar voxel features according to the matching probability and aggregating them with the corresponding matching probabilities for feature enhancement, and finally generating the key point feature f_i^(pv_K) from the k aggregated voxel features with a PointNet-block, where K is 1×, 2× or 4×;
(5) Concatenating the raw point cloud feature f_i^(raw), the multi-scale voxel aggregation features f_i^(pv_K) and the bird's-eye-view feature f_i^(bev) to obtain the initial key point feature f_i^(p);
(6) Respectively computing the mean feature-matching probabilities between the key point feature and its neighboring raw point cloud features and its multi-scale voxel aggregation features, and weighting the initial key point feature with these means together with the predicted foreground probability to obtain the updated key point feature;
(7) Uniformly partitioning each region of interest (ROI) with a set of virtual grid points, setting a key point threshold λ and a set-abstraction radius r_g, and screening the grid points; correcting the direction and boundary of the proposal by fitting a minimum bounding rectangle and using the weighted key point features to obtain a refined 3D box;
(8) Repeating steps (3) to (7) until all key points have been traversed, obtaining the final three-dimensional target detection result.
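For illustration only, a minimal NumPy sketch of the farthest point sampling used in step (2) is given below. The array shapes and the 2048-key-point setting are taken from the embodiment described later; the sketch is not the claimed implementation.

import numpy as np

def farthest_point_sampling(points: np.ndarray, num_keypoints: int) -> np.ndarray:
    """Greedy farthest point sampling.

    points: (N, 3) array of raw point-cloud coordinates.
    Returns indices of the selected key points, shape (num_keypoints,).
    """
    n = points.shape[0]
    selected = np.zeros(num_keypoints, dtype=np.int64)
    dist = np.full(n, np.inf)          # distance of every point to the selected set
    selected[0] = 0                    # start from an arbitrary point
    for k in range(1, num_keypoints):
        # Update distances with the most recently selected point.
        diff = points - points[selected[k - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        # Pick the point farthest from the current selected set.
        selected[k] = int(np.argmax(dist))
    return selected

if __name__ == "__main__":
    cloud = np.random.rand(10000, 3) * 70.0                 # synthetic LiDAR-like cloud
    keypoint_idx = farthest_point_sampling(cloud, 2048)     # 2048 key points as in the KITTI setup
    print(cloud[keypoint_idx].shape)                        # (2048, 3)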
In the above technical solution, step (3) includes:
The level-K voxel feature set obtained by 3D sparse convolution and its corresponding actual coordinates are expressed as:
F^(K) = { f_1^(K), ..., f_i^(K) },  V^(K) = { v_1^(K), ..., v_i^(K) }
where i denotes the number of non-empty voxels at level K;
the key point to be queried is taken as the origin of a local coordinate system, and offsets along the three axes are added to locate the surrounding non-empty voxels.
All non-empty voxels within the Manhattan threshold D_K are then sampled, and for any key point p_i a neighboring voxel feature set is obtained, namely:
S_i^(K) = { [f_kj^(K) ; v_kj^(K) - p_i] : || v_kj^(K) - p_i ||_1 ≤ D_K }
where v_kj^(K) - p_i denotes the relative position of the semantic voxel feature; kj denotes the j-th neighboring non-empty voxel of key point p_i at level K; and || v_kj^(K) - p_i ||_1 denotes the Manhattan distance between the semantic voxel feature and the corresponding key point.
in the above technical solution, step (4) includes:
The matching probability between the point-based feature of key point p_i and the voxel-based features in its neighboring voxel feature set is computed from the similarity f(V_n, p_i), where f(V_n, p_i) denotes the similarity between key point p_i and the n-th voxel-based feature in the set.
The k most similar voxel features are selected according to the matching probability and aggregated with their corresponding matching probabilities, and the k aggregated voxel features finally generate the key point feature through a PointNet-block,
f_i^(pv_K) = max{ M( S'_i^(K) ) }
where S'_i^(K) denotes the set of the k aggregated voxel features, M(·) denotes a multi-layer perceptron network for encoding the key-point voxel features, and max(·) denotes the max-pooling operation along the channel.
In the above technical solution, the feature of key point p_i obtained in step (5) is
f_i^(p) = [ f_i^(pv_1), f_i^(pv_2), f_i^(pv_3), f_i^(raw), f_i^(bev) ]
i.e. the concatenation of the multi-scale voxel aggregation features, the raw point cloud feature f_i^(raw) and the bird's-eye-view feature f_i^(bev).
in the step (6), for any one of the key points p i A set of adjacent original point clouds may be obtained,
wherein ,respectively are provided withPoint cloud (Point) based features and actual coordinates representing an original Point cloud;representing the relative position of the original point cloud; r is (r) raw Representing a set radius range; c (C) raw ,F raw Respectively representing a feature set of the original point cloud and corresponding actual coordinates thereof; n represents the number of adjacent original point clouds;
the matching probability between the point-based feature of key point p_i and the point-based features of its neighboring raw points is then computed;
the feature probability of the key point, i.e. the predicted probability that the key point belongs to the foreground, is obtained through a 3-layer multi-layer perceptron network followed by a Sigmoid function; the mean feature-matching probability between the key point and its neighboring raw points is computed through SA; and the mean feature-matching probability between the key point and its neighboring voxels over multiple scales is computed through VPFA. The weight of key points from the foreground region is then increased through averaging and weighting operations, and the re-weighted key point feature f'_i^(p) is obtained by weighting the initial key point feature f_i^(p) with the average of these three probabilities.
in the above technical solution, step (7) includes:
for each 3D proposal, M × M × M virtual grid points uniformly distributed along the length, width and height are provided, and the coordinates of the virtual grid points are normalized together with the coordinates of the real points; SA is then used at each virtual grid point to select the key point features to be aggregated, a key point threshold λ and a set-abstraction radius r_g are set for the virtual grid points, and all weighted key points within the set-abstraction radius are aggregated, so that for any virtual grid point g_i a set of neighboring weighted key point features can be obtained,
S_i^(g) = { [f'_j^(p) ; p_j - g_i] : || p_j - g_i || < r_g, j = 1, ..., n' }
where f'_j^(p) denotes the j-th neighboring weighted key point feature, p_j - g_i denotes the relative position of the neighboring weighted key point, and n' denotes the total number of neighboring weighted key points;
if fewer than λ weighted key points can be found within the sphere of radius r_g around any grid point g_i, that virtual grid point is deleted;
the virtual grid points are also distributed on the proposal surface, so they can capture neighboring weighted feature points that lie outside the 3D bounding box of the target proposal but inside the set-abstraction radius;
a set of virtual grid points for each proposal is obtained,
G = { g_m ∈ R^3 | m ∈ [0, M^3 - 1] }
a minimum bounding rectangle is then fitted around all remaining points in the virtual grid point set to correct the proposal direction and boundary, and the virtual grid features are obtained by interpolating the features in the neighboring weighted feature set, where d(·) denotes the L2 distance used in the interpolation;
the features of all virtual grid points in one proposal are added to a grid feature set, the grid feature set is passed through an MLP with channel dimensions [C+3, 256, 128, 128] and a global max pool to obtain the ROI feature of each proposal, the Intersection-over-Union (IoU) estimate of each box is then obtained through another MLP, and finally a 2-layer multi-layer perceptron predicts the foreground confidence of the target proposal and the refined 3D box, respectively.
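To illustrate this detection head, a minimal PyTorch sketch is given below. The channel dimensions follow the [C+3, 256, 128, 128] description above; the assumed value of C, the output layout and the module name are illustrative assumptions, not the claimed implementation.

import torch
import torch.nn as nn

class GridRoIHead(nn.Module):
    """Pools virtual-grid-point features into an ROI feature and predicts an
    IoU estimate, a foreground confidence, and a refined 3D box."""

    def __init__(self, grid_feat_dim: int):          # grid_feat_dim corresponds to C+3 in the text
        super().__init__()
        self.shared_mlp = nn.Sequential(              # channel dims [C+3, 256, 128, 128]
            nn.Linear(grid_feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.iou_head = nn.Linear(128, 1)             # IoU estimate of each box
        self.cls_head = nn.Sequential(                # 2-layer MLP: foreground confidence
            nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))
        self.reg_head = nn.Sequential(                # 2-layer MLP: refined 3D box (x, y, z, l, w, h, heading)
            nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 7))

    def forward(self, grid_features: torch.Tensor):
        # grid_features: (num_rois, num_grid_points, C+3)
        x = self.shared_mlp(grid_features)            # (num_rois, num_grid_points, 128)
        roi_feat = x.max(dim=1).values                # global max pool over grid points
        return self.iou_head(roi_feat), self.cls_head(roi_feat), self.reg_head(roi_feat)

if __name__ == "__main__":
    head = GridRoIHead(grid_feat_dim=128 + 3)         # assumed C = 128
    iou, conf, box = head(torch.randn(4, 125, 131))   # 4 proposals, 5*5*5 grid points each
    print(iou.shape, conf.shape, box.shape)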
The key point threshold may be selected empirically, and the preferred solution is that the key point threshold λ=3.
Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:
1. Aiming at the large computation of traditional voxel-point feature aggregation, the invention provides a feature aggregation method that uses offsets and feature-matching probabilities: neighboring voxels are queried through offsets and voxel features are screened with the voxel-key point matching probability, which improves prediction accuracy and reduces computation.
2. Aiming at the size ambiguity and angle deviation present in many target detection algorithms, a pooling module based on groups of virtual grid points is provided, and the boundaries and angles of the generated bounding box proposals are optimized through a group of uniformly distributed virtual grid points.
3. Aiming at the imbalance between positive and negative samples among the generated key points, a key point analysis module based on multi-scale probability weighting is provided; whether a key point lies in the background or the foreground is judged by computing the mean of the matching probabilities over multiple scales, which ultimately alleviates the imbalance between positive and negative samples.
Drawings
FIG. 1 is a framework diagram of the method of embodiment one;
FIG. 2 is a schematic diagram of the neighboring voxel query in embodiment one;
FIG. 3 is a schematic diagram of the KP Analysis module in embodiment one;
FIG. 4 is a schematic diagram of common problems of 3D object detection algorithms;
FIG. 5 illustrates virtual grid point screening in embodiment one;
FIG. 6 is an example of qualitative results on the KITTI dataset in embodiment one;
FIG. 7 is an example of qualitative results on the NuScenes dataset in embodiment one;
FIG. 8 shows the effect of the number of key points on the 3D mAP and inference time;
FIG. 9 shows the effect of different key point thresholds on the PR curve (IoU = 0.7);
FIG. 10 compares performance at different distances on the actual test platform in embodiment one.
Detailed Description
The invention is further described below with reference to the accompanying drawings and examples:
Embodiment one: a detection method for improving the detection efficiency of three-dimensional targets in autonomous driving, referred to as APVR. The raw point cloud is used as input, 3D sparse convolution is used as the backbone network to extract voxel features, offset addition and matching-probability computation are then used to fully exploit the complementary information, and the IoU-Grid Pooling module is applied to further refine the generated proposals according to the screened key point features.
Referring to FIG. 1, the overall framework of the method is shown; the specific modules involved are as follows:
1. VPFA module
Conventional set-abstraction (SA) based local feature extraction has been widely used in recent 3D object detection algorithms. Although multi-scale features can be extracted simply by designing different radius thresholds, the large computation cannot be avoided. Unlike the raw point cloud, which is irregularly distributed, voxels in the quantized space are regularly arranged, so a ball-query search similar to SA is unnecessarily expensive. To further reduce the computational cost, the method instead adds offsets to query neighboring voxels.
The raw point cloud is taken as input and a set of key points P is screened from it by farthest point sampling (FPS). As shown in FIG. 2, the key points to be queried are first quantized to their corresponding voxels; whether a surrounding non-empty voxel is a neighboring voxel of the queried key point is then judged by computing the Manhattan distance between that voxel and the key point's voxel.
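A small NumPy sketch of this quantization and Manhattan-distance test follows. The voxel size, point-cloud range and threshold value below are placeholders chosen for illustration; the actual values follow the dataset configuration given later in this embodiment.

import numpy as np

def quantize_to_voxel(points: np.ndarray, voxel_size: np.ndarray, pc_range_min: np.ndarray) -> np.ndarray:
    """Map 3D coordinates to integer voxel indices (the quantization applied to key points)."""
    return np.floor((points - pc_range_min) / voxel_size).astype(np.int64)

def is_neighbor_voxel(keypoint_voxel: np.ndarray, voxel_idx: np.ndarray, manhattan_threshold: int) -> bool:
    """A non-empty voxel is a neighbor of the key point if the Manhattan distance
    between its index and the key point's voxel index is within the threshold D_K."""
    return int(np.abs(voxel_idx - keypoint_voxel).sum()) <= manhattan_threshold

if __name__ == "__main__":
    voxel_size = np.array([0.05, 0.05, 0.1])            # illustrative values only
    pc_min = np.array([0.0, -40.0, -3.0])
    kp_voxel = quantize_to_voxel(np.array([12.3, 4.7, -1.2]), voxel_size, pc_min)
    some_voxel = quantize_to_voxel(np.array([12.5, 4.6, -1.1]), voxel_size, pc_min)
    print(is_neighbor_voxel(kp_voxel, some_voxel, manhattan_threshold=4))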
Specifically, the level-K voxel feature set obtained by 3D sparse convolution and its corresponding actual coordinates are expressed as:
F^(K) = { f_1^(K), ..., f_i^(K) },  V^(K) = { v_1^(K), ..., v_i^(K) }
where i denotes the number of non-empty voxels at level K.
The key point to be queried is taken as the origin of a local coordinate system, and offsets along the three axes are added to locate the surrounding non-empty voxels.
All non-empty voxels within the Manhattan threshold D_K are then sampled. For any key point p_i a neighboring voxel feature set is obtained, namely:
S_i^(K) = { [f_kj^(K) ; v_kj^(K) - p_i] : || v_kj^(K) - p_i ||_1 ≤ D_K }
where v_kj^(K) - p_i denotes the relative position of the semantic voxel feature; kj denotes the j-th neighboring non-empty voxel of key point p_i at level K; and || v_kj^(K) - p_i ||_1 denotes the Manhattan distance between the semantic voxel feature and the corresponding key point.
Compared with the traditional SA method, the computation is roughly an order of magnitude lower, so voxel queries can be performed more efficiently and with lower time complexity than with conventional methods.
After 3D sparse convolution, the raw point cloud is progressively downsampled into 1×, 2× and 4× 3D feature volumes. By setting 3 different Manhattan thresholds, feature aggregation can be performed over voxels of different scales.
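The offset-based neighbor query can be sketched as follows: the non-empty voxels of one feature volume are kept in a hash map from integer voxel index to feature, and only the integer offsets within the Manhattan threshold D_K are enumerated around the key point's voxel. The table layout, feature dimension and shapes below are assumptions for illustration.

import itertools
import numpy as np

def query_neighbor_voxels(keypoint_voxel, voxel_table, manhattan_threshold):
    """Collect [feature ; relative position] pairs for all non-empty voxels whose
    integer index lies within the Manhattan threshold D_K of the key point's voxel.

    keypoint_voxel: (3,) integer voxel index of the key point.
    voxel_table: dict mapping (ix, iy, iz) -> feature vector of a non-empty voxel.
    """
    neighbors = []
    d = manhattan_threshold
    # Enumerate integer offsets with |dx| + |dy| + |dz| <= D_K instead of a ball query.
    for dx, dy, dz in itertools.product(range(-d, d + 1), repeat=3):
        if abs(dx) + abs(dy) + abs(dz) > d:
            continue
        idx = (keypoint_voxel[0] + dx, keypoint_voxel[1] + dy, keypoint_voxel[2] + dz)
        feat = voxel_table.get(idx)
        if feat is not None:                       # only non-empty voxels contribute
            rel_pos = np.array([dx, dy, dz], dtype=np.float32)
            neighbors.append(np.concatenate([feat, rel_pos]))
    return np.stack(neighbors) if neighbors else np.empty((0, 0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Sparse set of non-empty voxels with 16-dim features (the level-1 feature dimension).
    table = {tuple(rng.integers(0, 50, 3)): rng.standard_normal(16).astype(np.float32)
             for _ in range(2000)}
    kp_voxel = np.array(list(table.keys())[0])
    feats = query_neighbor_voxels(kp_voxel, table, manhattan_threshold=2)
    print(feats.shape)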
To further enhance the relevant voxel features, the matching probability between the point-based feature of key point p_i and the voxel-based features in its neighboring voxel feature set is also computed from the similarity f(V_n, p_i), where f(V_n, p_i) denotes the similarity between key point p_i and the n-th voxel-based feature.
The k most similar voxel features are selected according to the matching probabilities and aggregated with the corresponding matching probabilities. The aggregated features contain more accurate fine-grained information than the original voxel-based features; they act as a feature enhancement and provide a more discriminative 3D representation. Finally, the k aggregated voxel features generate the key point feature through a PointNet-block,
f_i^(pv_K) = max{ M( S'_i^(K) ) }
where S'_i^(K) denotes the set of the k aggregated voxel features, M(·) denotes a multi-layer perceptron network for encoding the key-point voxel features, and max(·) denotes the along-channel max-pooling operation.
By setting different Manhattan thresholds, the VPFA module lets each key point obtain context information at 3 scales, and the aggregated features give the key point feature richer fine-grained information.
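The matching-and-aggregation step can be illustrated with the sketch below: each neighboring voxel feature is scored against the key point feature, the top-k are kept, weighted by their matching probabilities, and encoded with a small MLP followed by max pooling. The dot-product similarity, the softmax normalization and the random stand-in weights are illustrative assumptions; the text only specifies that a similarity f(V_n, p_i) is turned into a matching probability.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_keypoint_feature(kp_feat, neighbor_feats, k, rng=None):
    """Probability-weighted top-k aggregation of neighboring voxel features.

    kp_feat: (C,) key point feature; neighbor_feats: (N, C) neighboring voxel features.
    Returns a (C,) aggregated key point feature f_i^(pv_K).
    """
    sim = neighbor_feats @ kp_feat                       # assumed dot-product similarity
    prob = softmax(sim)                                  # matching probabilities
    top = np.argsort(prob)[-k:]                          # k most similar voxel features
    weighted = prob[top, None] * neighbor_feats[top]     # aggregate with matching probabilities
    # PointNet-block style encoding: shared MLP M(.) then max pooling.
    rng = rng or np.random.default_rng(0)
    w = rng.standard_normal((neighbor_feats.shape[1], neighbor_feats.shape[1])) * 0.1  # stand-in for learned weights
    encoded = np.maximum(weighted @ w, 0.0)              # one linear layer + ReLU as M(.)
    return encoded.max(axis=0)                           # max over the k aggregated features

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    f_kp = rng.standard_normal(16)
    f_vox = rng.standard_normal((30, 16))
    print(aggregate_keypoint_feature(f_kp, f_vox, k=8).shape)   # (16,)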
2. KP Analysis module
The key point feature is obtained by concatenating the raw point cloud feature f_i^(raw), the multi-scale voxel aggregation features f_i^(pv_K) and the bird's-eye-view feature f_i^(bev). The bird's-eye-view feature is mainly obtained by bilinear interpolation on the downsampled bird's-eye-view feature map. The feature of key point p_i can therefore be written as
f_i^(p) = [ f_i^(pv_1), f_i^(pv_2), f_i^(pv_3), f_i^(raw), f_i^(bev) ]
the selection of different features of the keypoints for connection can enable the features of the keypoints to contain a large amount of fine-grained information, so that 3D structure information of the whole target scene can be reserved. In order to solve the problem of foreground and background keypoints mixing, the invention proposes a keypoint analysis module (KP Analysis Module).
The raw point cloud generated by the LiDAR is distributed on the surfaces of both background and foreground objects, so a key point analysis module weighted by multi-scale matching probabilities is designed. Specifically, unlike the traditional approach of using only a multi-layer perceptron, the method computes the matching probabilities between the key point and the raw point cloud features and between the key point and the multi-scale voxel features separately, and uses this additional feature enhancement to judge whether the key point lies in the foreground or the background. The real environment around key points in the background region is more complex than in the foreground region, and the mean matching probability of key points in the foreground region is higher than that of key points in the background region.
The original features of the key points are obtained with a conventional SA module, so for any key point p_i a set of neighboring raw points can also be obtained,
S_i^(raw) = { [f_j^(raw) ; c_j^(raw) - p_i] : || c_j^(raw) - p_i || < r_raw, f_j^(raw) ∈ F_raw, c_j^(raw) ∈ C_raw, j = 1, ..., n }
where f_j^(raw) and c_j^(raw) respectively denote the point-based feature and the actual coordinates of a raw point; c_j^(raw) - p_i denotes the relative position of the raw point; r_raw denotes the set radius range; F_raw and C_raw respectively denote the feature set of the raw point cloud and its corresponding actual coordinates; and n denotes the number of neighboring raw points.
The matching probability between the point-based feature of key point p_i and the point-based features of its neighboring raw points is then computed.
the KP Analysis module is shown in fig. 3. The module can make full use of complementary information between the Voxel-based and Point-based. Specifically, different key point information is acquired by the following 3 ways: the feature probability of the key point is obtained through a 3-layer multi-layer perceptron network and a Sigmod function, namely the prediction probability of the key point belonging to the foregroundCalculating feature matching probability mean value of key points and adjacent original point clouds through SA>Calculating feature matching probability mean value of neighboring voxels in the key point and multiple scales through VPFA>And then, the weight of key points from the foreground area is increased through averaging and weighting operation, and finally, the aim of optimizing proposal is achieved.
The re-weighted key point feature f'_i^(p) can be expressed as the initial key point feature f_i^(p) weighted by the average of the predicted foreground probability and the two mean matching probabilities.
compared with the mode of using a multi-layer perceptron only, the method of the invention has larger improvement on the accuracy of judging the position of the key point or the resolution of the key point characteristics.
In the training stage, the ground-truth 3D boxes are used to generate segmentation labels for each key point, and Focal Loss is selected to address the imbalance between the numbers of foreground and background key points.
3. IoU-Grid Pooling module
As shown in FIG. 4, current 3D object detection generally suffers from two problems: size ambiguity, and incorrect direction fitting caused by 3D boxes that cannot align with the global coordinate system. First, note that what is used here is not a traditional 2D image but a pseudo-2D bird's-eye view obtained by downsampling 3D features with 3D sparse convolution. Because of the sparsity and irregularity of the raw point cloud, position information as dense as in a 2D image cannot be provided, which results in proposals for which accurate size information is not available. Specifically, as shown in FIG. 4(a), the middle box and the outer box enclose the same points, so they have similar feature representations, but the lack of size information makes the same set of points appear to have different classification and regression targets.
The invention uses a conventional RPN module to generate the ROIs (regions of interest). For each class, 3D anchor boxes with 2 different orientations are designed, and on the bird's-eye view only the two directions 0° and 90° are evaluated. However, this approach can produce an incorrect rotation direction for the object: it performs well for objects moving along a straight line, but when an object is rotated, the anchor-based method cannot fit it with an axis-aligned bounding box.
To solve these problems, the invention provides an IoU-Grid Pooling module. After the preceding series of modules, a set of weighted key point features and the 3D ROIs produced from the 3D sparse convolution have been obtained. Considering that conventional R-CNN models ignore point-to-point spacing information, the invention exploits this ignored information by setting virtual grid points.
For each 3D proposal, M × M × M virtual grid points uniformly distributed along the length, width and height are provided, and the coordinates of the virtual grid points are normalized together with the coordinates of the real points. SA is then used at each virtual grid point to select the key point features to be aggregated; unlike the conventional approach, however, a key point threshold λ and a set-abstraction radius r_g are set for the virtual grid points. Specifically, all weighted key points within the radius are aggregated, so for any virtual grid point g_i a set of neighboring weighted key point features can be obtained,
S_i^(g) = { [f'_j^(p) ; p_j - g_i] : || p_j - g_i || < r_g, j = 1, ..., n' }
where f'_j^(p) denotes the j-th neighboring weighted key point feature, p_j - g_i denotes the relative position of the neighboring weighted key point, and n' denotes the total number of neighboring weighted key points.
As shown in FIG. 5, if fewer than λ weighted key points can be found within the sphere of radius r_g around a grid point g_i, that virtual grid point is deleted. In practice the key point threshold needs to be tuned empirically; this embodiment selects λ = 3. In particular, the virtual grid points are also distributed on the proposal surface, so neighboring weighted feature points that lie outside the 3D bounding box of the target proposal but within the set-abstraction radius can still be captured by the IoU-Grid Pooling module. After virtual grid point screening, the virtual grid point set of each proposal is obtained,
G = { g_m ∈ R^3 | m ∈ [0, M^3 - 1] }
then, a minimum bounding rectangle is fitted around all the rest points in the virtual grid point set, so that the proposal direction and boundary can be corrected. By interpolating features in the set of adjacently weighted voxel features, features of the virtual grid can be obtained,
wherein d (·) represents the L2 distance.
The invention adds the features of all virtual grid points in one proposal to a grid feature set and passes it through an MLP with channel dimensions [C+3, 256, 128, 128] and a global max pool to obtain the ROI feature of each proposal. The Intersection-over-Union (IoU) estimate of each box is then obtained through another MLP. Finally, a 2-layer multi-layer perceptron predicts the foreground confidence of the target proposal and the refined 3D box, respectively.
In general, by adding virtual grid points, accurate boundary information of a proposal can be obtained by judging whether the number of neighboring weighted key points of the proposal's boundary grid points is above the key point threshold, and the 3D box direction is corrected by fitting a minimum bounding rectangle.
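The grid-point generation, screening and feature interpolation can be sketched as follows. The inverse-L2-distance interpolation scheme is an assumption (the text only states that features are interpolated using the L2 distance d(·)), and the proposal is simplified to an axis-aligned box for brevity.

import numpy as np

def make_virtual_grid(box_center, box_size, m=5):
    """Uniformly place m*m*m virtual grid points inside (and on the surface of) a proposal box."""
    lin = [np.linspace(-0.5, 0.5, m) * s + c for c, s in zip(box_center, box_size)]
    gx, gy, gz = np.meshgrid(*lin, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)       # (m^3, 3)

def screen_and_interpolate(grid_points, keypoints, kp_feats, r_g=0.6, lam=3):
    """Keep grid points with at least lam weighted key points within radius r_g and
    interpolate their features by inverse L2 distance (assumed scheme)."""
    kept_points, kept_feats = [], []
    for g in grid_points:
        d = np.linalg.norm(keypoints - g, axis=1)
        mask = d < r_g
        if mask.sum() < lam:                                    # fewer than lambda neighbors: delete
            continue
        w = 1.0 / (d[mask] + 1e-6)                              # inverse-distance weights
        kept_feats.append((w[:, None] * kp_feats[mask]).sum(0) / w.sum())
        kept_points.append(g)
    return np.array(kept_points), np.array(kept_feats)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    grid = make_virtual_grid(box_center=(10.0, 2.0, -1.0), box_size=(4.0, 1.8, 1.6), m=5)
    kps = rng.normal((10.0, 2.0, -1.0), 1.0, size=(200, 3))
    feats = rng.standard_normal((200, 128))
    g_pts, g_feats = screen_and_interpolate(grid, kps, feats, r_g=0.6, lam=3)
    print(grid.shape, g_pts.shape, g_feats.shape)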
Training targets
The proposed model is trained with 4 loss terms: regression (L_reg), classification (L_cls), the region proposal network (L_rpn) and grid update (L_grid), combined with balance parameters,
L_total = λ_reg · L_reg + λ_cls · L_cls + λ_rpn · L_rpn + λ_grid · L_grid
where N_fg denotes the number of generated anchors belonging to the foreground region; λ denotes the balance parameter corresponding to each loss term; and λ_reg = λ_rpn = λ_cls = λ_grid = 1.
To address the imbalance between the foreground and background regions of the generated key points, Focal Loss is used as the classification term L_cls, and Huber Loss is selected for the box regression term L_reg and the RPN loss term L_rpn.
For the grid update term L_grid, the L2 loss function is used, where ||·|| denotes the L2 norm.
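A compact PyTorch sketch of how the four terms could be combined is shown below. The focal-loss and Huber-loss choices follow the text; the normalization by the number of foreground anchors and the tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss for foreground/background classification (standard form)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

def total_loss(cls_logits, cls_targets, box_preds, box_targets,
               rpn_preds, rpn_targets, grid_preds, grid_targets,
               lam_reg=1.0, lam_cls=1.0, lam_rpn=1.0, lam_grid=1.0):
    n_fg = cls_targets.sum().clamp(min=1.0)                     # foreground anchor count (assumed normalizer)
    l_cls = focal_loss(cls_logits, cls_targets) / n_fg          # classification: Focal Loss
    l_reg = F.smooth_l1_loss(box_preds, box_targets, reduction="sum") / n_fg   # box regression: Huber loss
    l_rpn = F.smooth_l1_loss(rpn_preds, rpn_targets)            # RPN term: Huber loss
    l_grid = F.mse_loss(grid_preds, grid_targets)               # grid update: L2 loss
    return lam_reg * l_reg + lam_cls * l_cls + lam_rpn * l_rpn + lam_grid * l_grid

if __name__ == "__main__":
    torch.manual_seed(0)
    loss = total_loss(torch.randn(100), torch.randint(0, 2, (100,)).float(),
                      torch.randn(100, 7), torch.randn(100, 7),
                      torch.randn(100, 7), torch.randn(100, 7),
                      torch.randn(64, 3), torch.randn(64, 3))
    print(loss.item())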
The APVR method of this embodiment was tested on the KITTI, NuScenes and Waymo datasets.
Introduction of related datasets:
(1) KITTI dataset: KITTI has received extensive attention as the most classical 3D object detection benchmark. It provides 7481 training samples and 7518 test samples covering subjects such as cars, pedestrians and cyclists. In the KITTI dataset, objects are divided into three difficulty levels (easy, moderate and hard) according to object size, occlusion level and truncation level. In most cases, the raw training data is split into a training set of 3712 samples and a validation set of 3769 samples, but the invention randomly selects 80% of the training samples as the training set and the remaining 20% as the validation set.
(2) NuScenes dataset: as a recent large-scale autonomous driving dataset, NuScenes collects 1000 complex driving scenes from Boston and Singapore, split into 700 training scenes, 150 validation scenes and 150 test scenes. Unlike the KITTI dataset, the NuScenes data were acquired with 6 multi-view cameras and one 32-beam LiDAR.
(3) Waymo dataset: the Waymo dataset, currently the largest open dataset, comprises 798 training sequences and 202 validation sequences. It contains RGB images obtained by high-resolution cameras and 3D point clouds generated by a 64-beam LiDAR. The official benchmark also provides performance breakdowns for LEVEL_1 (more than 5 LiDAR points in the box) and LEVEL_2 (at least one LiDAR point in the box).
Network architecture:
The 3D backbone of this embodiment has 3 levels with feature dimensions of 16, 32 and 64, and the Manhattan thresholds used in the VPFA module are 2, 4 and 6, respectively. The set-abstraction radii used on the raw point cloud are (0.4 m, 0.8 m). In the IoU-Grid Pooling module, 5 × 5 × 5 virtual grid points are uniformly sampled for each proposal, the virtual grid point key point threshold is λ = 3, and the set-abstraction radius is r_g = 0.6 m.
The point cloud range of the 3D scene for the KITTI dataset is specified as [(0, 70.4) m, (-40, 40) m, (-3, 1) m], and (0.1 m, 0.15 m) is used as the voxel size. For the NuScenes dataset, the 3D point cloud is clipped along the X, Y and Z axes: the X and Y axes are both [-49.6, 49.6] m and the Z axis is [-5, 3] m, with a voxel size of (0.1 m, 0.2 m). For the Waymo dataset, the detection range of both the X and Y axes is [-75.2, 75.2] m and the Z axis is [-2, 4] m, with a voxel size of (0.1 m).
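For reference, the settings listed above can be collected into a single configuration, as in the illustrative snippet below; the values are copied from the text, while the dictionary layout itself is only an assumption for presentation.

APVR_CONFIG = {
    "backbone": {"levels": 3, "feature_dims": [16, 32, 64]},
    "vpfa": {"manhattan_thresholds": [2, 4, 6]},
    "raw_point_sa_radii_m": [0.4, 0.8],
    "iou_grid_pooling": {"grid_points": (5, 5, 5), "keypoint_threshold": 3, "sa_radius_m": 0.6},
    "datasets": {
        "kitti": {"range_m": {"x": (0.0, 70.4), "y": (-40.0, 40.0), "z": (-3.0, 1.0)},
                  "voxel_size_m": (0.1, 0.15)},      # as stated in the text
        "nuscenes": {"range_m": {"x": (-49.6, 49.6), "y": (-49.6, 49.6), "z": (-5.0, 3.0)},
                     "voxel_size_m": (0.1, 0.2)},
        "waymo": {"range_m": {"x": (-75.2, 75.2), "y": (-75.2, 75.2), "z": (-2.0, 4.0)},
                  "voxel_size_m": (0.1,)},
    },
}

if __name__ == "__main__":
    print(APVR_CONFIG["datasets"]["kitti"]["voxel_size_m"])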
Training and inference:
all models of this embodiment are end-to-end trained with PyTorch. For the KITTI data set, the network trains 100 epochs, and the Batch Size of the network is 8. Whereas for Nuscenes dataset, the network trained 80 epochs, batch size 6. The learning rate for both data sets is initialized to 3e-3. The learning rate is then updated herein by a cosine annealing technique.
In the inference phase, the IoU threshold in the RPN is set to 0.7. After refinement, redundant prediction boxes are removed with non-maximum suppression (NMS). Notably, different IoU thresholds are designed for the class-aware predictions of the three classes car, pedestrian and bicycle, namely 0.7, 0.5 and 0.5, respectively.
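Class-aware post-processing with per-class IoU thresholds can be sketched as follows. The sketch uses torchvision's axis-aligned 2D NMS on bird's-eye-view boxes purely for illustration; the actual method operates on rotated 3D boxes, and only the threshold mapping is taken from the text.

import torch
from torchvision.ops import nms

CLASS_IOU_THRESHOLDS = {"car": 0.7, "pedestrian": 0.5, "bicycle": 0.5}

def class_aware_nms(bev_boxes, scores, labels):
    """Per-class NMS with class-specific IoU thresholds.

    bev_boxes: (N, 4) axis-aligned BEV boxes (x1, y1, x2, y2), a simplification of rotated boxes.
    scores:    (N,) confidence scores.
    labels:    length-N list of class names ('car', 'pedestrian', 'bicycle').
    Returns a LongTensor with the indices of the kept boxes.
    """
    kept = []
    for cls, thr in CLASS_IOU_THRESHOLDS.items():
        idx = torch.tensor([i for i, l in enumerate(labels) if l == cls], dtype=torch.long)
        if idx.numel() == 0:
            continue
        keep_local = nms(bev_boxes[idx], scores[idx], iou_threshold=thr)
        kept.append(idx[keep_local])
    return torch.cat(kept) if kept else torch.empty(0, dtype=torch.long)

if __name__ == "__main__":
    torch.manual_seed(0)
    xy = torch.rand(6, 2) * 50
    boxes = torch.cat([xy, xy + torch.rand(6, 2) * 4 + 1], dim=1)   # valid (x1, y1, x2, y2) boxes
    scores = torch.rand(6)
    labels = ["car", "car", "pedestrian", "bicycle", "car", "pedestrian"]
    print(class_aware_nms(boxes, scores, labels))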
Experimental results on KITTI dataset:
the APVR framework of this embodiment is first evaluated through the KITTI dataset. As shown in table 1, APVR was compared to several other latest approaches in the KITTI test set. By fully using complementary information based on Voxel and Point, APVR is significantly improved in performance. In order to compare the proposed model performance, the latest Voxle-based approach, point-based approach, and RGB-and LiDAR-fusion-based approach have also been reported. From the table, it can be observed that: (1) The present embodiment method is fastest in all two-stage methods. Specifically, the real-time processing frame rate (the real time processing frame rate) of the present embodiment method is improved by more than 3 times over the PV-RCNN. (2) The average accuracy (average precision, AP) of the method of the present embodiment at all difficulty levels of different categories is significantly better than other methods. For example, the moderate and difficult levels are raised by 5.07%,4.25%,5.04%, respectively, relative to the simple, medium and difficult levels of the HVPR class of automobiles. This demonstrates that model performance in complex environments can be effectively improved by making full use of complementary information for 3D object detection.
Table 1 Performance comparison on the KITTI test set
FIG. 6 shows example qualitative results of APVR on the KITTI dataset, displaying the ground-truth and predicted bounding boxes for the pedestrian and car categories, together with 2D bounding boxes projected from the 3D detection results.
Experimental results on NuScenes dataset:
this example was also tested with the more challenging NuScenes dataset. The effectiveness and generalization of APVR can be further demonstrated by test results in different data sets. In NuScenes dataset, this embodiment uses a completely new measurement index, namely NuScenes detection score (NuScenes detection score, NDS).
Table 2 shows the comparison between the method of this embodiment and other state-of-the-art methods on the NuScenes dataset. The abbreviations "CV", "Ped", "Moto", "Bicy", "TC" and "Bar" denote construction vehicle, pedestrian, motorcycle, bicycle, traffic cone and barrier, respectively. Detailed AP data and other metrics for all categories are given in Table 2. Compared with other state-of-the-art LiDAR methods, APVR achieves large improvements in mean average precision (mAP) and NDS. Specifically, compared with CVCNet, the method of this embodiment improves the AP of the car class by approximately 2% and achieves a 2.8% improvement in mAP and a 1.7% improvement in NDS.
Table 2 Performance comparison on the NuScenes test set
Fig. 7 is an example of qualitative results of APVR on NuScenes dataset.
Experimental results on the Waymo dataset:
APVR was evaluated on the LEVEL_1 and LEVEL_2 settings of the Waymo dataset and compared with the latest methods. In Table 3, this embodiment uses the 3D bounding box mean average precision (mAP) and mAP weighted by heading accuracy (mAPH) for evaluation. The method of this embodiment shows significant improvement on both LEVEL_1 and LEVEL_2. Specifically, Table 3 reports the detection results for the 3D vehicle and pedestrian categories on the test sequences. Compared with the latest CenterPoint method, the method of this embodiment improves the mAP of the pedestrian category at the two difficulty levels by 3.8% and 2.7%, respectively. This shows that the method also performs well in detecting small objects beyond the vehicle category.
Table 3 3D detection results for the vehicle and pedestrian categories on the Waymo test set
Table 4 shows the vehicle-class detection results of the proposed method on the validation sequences of the Waymo dataset, evaluated with the 3D AP and the BEV AP. Under the LEVEL_1 3D mAP metric, the method of this embodiment achieves state-of-the-art performance with 77.2% mAP. Notably, within the 50 m-Inf distance range, the method shows significant improvement on both LEVEL_1 and LEVEL_2; the high detection performance at long range demonstrates that the proposed method can effectively detect objects with sparse points.
Table 4 Vehicle detection results on the Waymo dataset validation sequences
Ablation experiment:
the performance of each proposed module is analyzed in detail below by showing more experimental data. All models were evaluated on val set of the car class of the KITTI dataset.
VPFA module: this embodiment proposes a new module for local feature fusion using offsets and feature-matching probabilities. The performance of the VPFA under different configurations is shown in Table 5. It can be seen from the table that, compared with conventional SA, performing the voxel query around the key point by adding offsets reduces the computational cost while maintaining detection accuracy. A further performance improvement can be observed after adding the feature-matching probability, especially a 1.91% gain on the hard AP.
Table 5 Effect of the Voxel-Point Feature Aggregation module on the KITTI val set
The KP Analysis module re-weights the key point features using the raw-point matching probability, the feature probability and the multi-scale matching probability. The number of key points selected on the KITTI dataset is 2048. As shown in FIG. 8, changing the number of sampled key points cannot significantly improve detection performance: when the number of key points is too large, performance barely improves while the computation increases greatly, and decreasing the number of key points reduces performance. This embodiment therefore designs the KP Analysis module to improve performance instead.
Table 6 shows the impact of adding different matching probabilities and re-weighting the key point features on vehicle-class detection performance. Compared with PV-RCNN, the method of this embodiment improves the 3D AP at the easy, moderate and hard levels by 2.93%, 0.87% and 1.16%, respectively. This suggests that combining matching probabilities from different information sources helps improve the AP performance of the algorithm.
Table 6 Influence of the KP Analysis module on vehicle-class detection performance on the KITTI val set
IoU-Grid Pooling module: FIG. 9 reports the effect of different key point thresholds λ on the PR curve. The PR curve is a standard way of assessing prediction quality. From the PR curves it can be observed that the quality of the predicted objects is significantly improved when λ = 3, while a key point threshold that is too large or too small negatively impacts prediction quality.
The effect of the virtual grid point interpolation feature and the minimum bounding rectangle constraint is shown in table 7. The performance can be further improved by inserting different sub-modules in the IoU-Grid Pooling module without changing the keypoint threshold.
TABLE 7 sub-module effectiveness investigation
In the ablation experiments, the effectiveness of all individual modules was investigated. Table 8 summarizes the performance differences of the proposed method under different configurations. Specifically, the proposed modules together improve the easy, moderate and hard difficulty levels by 4.14%, 4.75% and 3.84%, respectively, while the real-time processing frame rate reaches 27.4 Hz.
Table 8 Performance under different configurations on the KITTI val set
Practical platform research:
in order to test the effectiveness of the method of the present embodiment in a real scenario, a series of tests were performed on an actual vehicle-mounted platform. This is a platform that is integrated by multiple sensors. In addition to the basic 16-wire radar, a Tele-15 radar and a millimeter wave radar are added. The three types of radars have different purposes, wherein the basic 16-wire radar is mainly responsible for detection tasks; the Tele-15 radar is used as an additional safety redundancy to increase the safety of the vehicle; the millimeter wave radar is responsible for measuring the speed and distance of the vehicle in front.
On the experimental platform, the performance of APVR and existing 3D detection methods in a real complex environment was tested. FIG. 10 shows the differences in 3D mAP and bird's-eye-view mAP between the method of this embodiment and the representative PointPillars and PV-RCNN at different distance ranges. It can be observed from the figure that, compared with PV-RCNN, the 3D mAP and the bird's-eye-view mAP are improved by about 2% over all distance ranges.
To intuitively demonstrate the advantages of the proposed method, the sensor factor is additionally considered. It can be seen from Table 9 that the method of this embodiment outperforms the other compared methods in detecting the different vehicle classes, reaching 88.25%, 81.26% and 75.38%, respectively. These data show that the method retains its advantages across methods using different sensors.
Table 9 3D test performance comparison for car, van and bus
Experiments on 3 public datasets demonstrate that the APVR of this embodiment has good generalization and portability. Compared with other 3D target detection methods, it achieves clear improvements in both detection accuracy and computational efficiency.

Claims (7)

1. A detection method for improving the detection efficiency of three-dimensional targets in autonomous driving, comprising the steps of:
(1) Acquiring original point cloud data to be detected;
(2) Extracting a group of key points P from the raw point cloud using farthest point sampling, where each key point is denoted p_i and i is the index of the key point;
(3) Quantizing the key point p_i to be queried into its corresponding voxel, then judging whether each surrounding non-empty voxel is a neighboring voxel of the queried key point by computing the Manhattan distance between that voxel and the key point's voxel, and obtaining the neighboring voxel feature set S_i^(K) of key point p_i, where K denotes the multiple of the current downsampling;
(4) Computing the matching probability between the point-based feature of key point p_i and each voxel-based feature in its neighboring voxel feature set, selecting k similar voxel features according to the matching probability and aggregating them with the corresponding matching probabilities for feature enhancement, and finally generating the key point feature f_i^(pv_K) from the k aggregated voxel features with a PointNet-block, where K is 1×, 2× or 4×;
(5) Concatenating the raw point cloud feature f_i^(raw), the multi-scale voxel aggregation features f_i^(pv_K) and the bird's-eye-view feature f_i^(bev) to obtain the initial key point feature f_i^(p);
(6) Respectively computing the mean feature-matching probabilities between the key point feature and its neighboring raw point cloud features and its multi-scale voxel aggregation features, and weighting the initial key point feature with these means together with the predicted foreground probability to obtain the updated key point feature;
(7) Uniformly partitioning each region of interest with a set of virtual grid points, setting a key point threshold λ and a set-abstraction radius r_g, and screening the grid points; correcting the direction and boundary of the proposal by fitting a minimum bounding rectangle and using the weighted key point features to obtain a refined 3D box;
(8) Repeating steps (3) to (7) until all key points have been traversed, obtaining the final three-dimensional target detection result.
2. The detection method for improving the detection efficiency of three-dimensional targets in autonomous driving according to claim 1, wherein step (3) comprises:
expressing the level-K voxel feature set obtained by 3D sparse convolution and its corresponding actual coordinates as
F^(K) = { f_1^(K), ..., f_i^(K) },  V^(K) = { v_1^(K), ..., v_i^(K) }
where i denotes the number of non-empty voxels at level K;
taking the key point to be queried as the origin of a local coordinate system and adding offsets along the three axes to locate the surrounding non-empty voxels;
then sampling all non-empty voxels within the Manhattan threshold D_K, so that for any key point p_i a neighboring voxel feature set is obtained, namely
S_i^(K) = { [f_kj^(K) ; v_kj^(K) - p_i] : || v_kj^(K) - p_i ||_1 ≤ D_K }
where v_kj^(K) - p_i denotes the relative position of the semantic voxel feature, kj denotes the j-th neighboring non-empty voxel of key point p_i at level K, and || v_kj^(K) - p_i ||_1 denotes the Manhattan distance between the semantic voxel feature and the corresponding key point.
3. The detection method for improving the detection efficiency of three-dimensional targets in autonomous driving according to claim 1, wherein step (4) comprises:
computing the matching probability between the point-based feature of key point p_i and the voxel-based features in its neighboring voxel feature set from the similarity f(V_n, p_i), where f(V_n, p_i) denotes the similarity between key point p_i and the n-th voxel-based feature;
selecting the k most similar voxel features according to the matching probability, aggregating them with the corresponding matching probabilities, and finally generating the key point feature from the k aggregated voxel features through a PointNet-block,
f_i^(pv_K) = max{ M( S'_i^(K) ) }
where S'_i^(K) denotes the set of the k aggregated voxel features, M(·) denotes a multi-layer perceptron network for encoding the key-point voxel features, and max(·) denotes the max-pooling operation along the channel.
4. The detection method for improving the detection efficiency of three-dimensional targets in autonomous driving according to claim 1, characterized in that the feature of key point p_i obtained in step (5) is
f_i^(p) = [ f_i^(pv_1), f_i^(pv_2), f_i^(pv_3), f_i^(raw), f_i^(bev) ]
i.e. the concatenation of the multi-scale voxel aggregation features, the raw point cloud feature f_i^(raw) and the bird's-eye-view feature f_i^(bev).
5. The detection method for improving the detection efficiency of an automatic driving three-dimensional target according to claim 1, characterized in that in step (6), for any key point p_i a set of adjacent original points can be obtained,
wherein the first two terms denote the point-based feature and the actual coordinates of an original point, respectively; the next term denotes the relative position of the original point; r_raw denotes the set radius range; F_raw and C_raw denote the feature set of the original point cloud and its corresponding actual coordinates, respectively; and n denotes the number of adjacent original points;
the matching probability between the point-cloud-based feature of key point p_i and the point-based features of its adjacent original points is calculated as follows,
the foreground probability of the key point, i.e. the predicted probability that the key point belongs to the foreground, is obtained through a 3-layer multi-layer perceptron network and a Sigmoid function; the mean feature-matching probability between the key point and its adjacent original points is calculated through SA, and the mean feature-matching probabilities between the key point and its adjacent voxels at multiple scales are calculated through VPFA; the weights of key points from foreground regions are then increased by averaging and weighting operations, and the re-weighted key point feature is expressed as:
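A sketch of the re-weighting in this claim, under the assumption that the foreground probability is multiplied by the averaged matching probabilities (the precise combination rule is not recoverable from the extracted text):

```python
import numpy as np

def reweight_keypoint(f_key, p_foreground, match_probs_raw, match_probs_voxel):
    """f_key: (C,) initial key-point feature; p_foreground: scalar in [0, 1];
    match_probs_raw / match_probs_voxel: matching probabilities with the adjacent
    original points (via SA) and with the multi-scale adjacent voxels (via VPFA)."""
    mean_match = 0.5 * (np.mean(match_probs_raw) + np.mean(match_probs_voxel))
    weight = p_foreground * mean_match      # assumed weighting rule
    return weight * f_key                   # re-weighted key-point feature
```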
6. The detection method for improving the detection efficiency of an automatic driving three-dimensional target according to claim 1, wherein step (7) comprises:
for each 3D proposal, M × M × M virtual grid points uniformly distributed along the length, width and height are provided, and the coordinates of the virtual grid points are normalized with the coordinates of the real points; set abstraction (SA) is then used at each virtual grid point to select the key point features to be aggregated, with the key point threshold λ and the set-abstraction radius r_g of the virtual grid points; all weighted key points within the set-abstraction radius are aggregated, and for any virtual grid point g_i a set of adjacent weighted key-point features can be obtained,
wherein p_j − g_i denotes the relative position of an adjacent weighted key point, and n′ denotes the total number of adjacent weighted key points;
if fewer than λ weighted key points can be found within the sphere of radius r_g around a grid point g_i, that virtual grid point is deleted;
the virtual grid points are also distributed on the proposal surface, so they can capture adjacent weighted feature points that lie outside the 3D bounding box of the target proposal but inside the set-abstraction radius;
the set of virtual grid points of each proposal is obtained as
G = {g_m ∈ R^3 | m ∈ [0, M^3 − 1]}
a minimum bounding rectangle is then fitted around all remaining points in the virtual grid point set to correct the proposal orientation and boundary, and the features in the adjacent weighted feature set are interpolated to obtain the features of the virtual grid points,
wherein d(·) denotes the L2 distance;
the features of all virtual grid points in one proposal are collected into a grid feature set, which is passed through an MLP with channel dimensions [C+3, 256, 128, 128] and a global max pooling to obtain the region-of-interest feature of each proposal; the intersection-over-union estimate of each box is then obtained through another MLP, and finally the foreground confidence of the target proposal and the refined 3D box are respectively predicted using 2-layer multi-layer perceptrons.
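An illustrative sketch of the virtual-grid screening and inverse-distance interpolation described in this claim; the axis-aligned box parameterization, the grid placement and the interpolation weights are assumptions (the proposal heading angle, for instance, is omitted):

```python
import numpy as np

def roi_grid_features(box, keypoints_xyz, keypoint_feats, M=6, r_g=1.6, lam=3):
    """box: (cx, cy, cz, l, w, h); keypoints_xyz: (N, 3) weighted key-point positions;
    keypoint_feats: (N, C) weighted key-point features."""
    cx, cy, cz, l, w, h = box
    # M x M x M virtual grid points uniformly distributed inside the proposal.
    lin = (np.arange(M) + 0.5) / M - 0.5
    gx, gy, gz = np.meshgrid(lin * l, lin * w, lin * h, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) + np.array([cx, cy, cz])

    feats = []
    for g in grid:
        d = np.linalg.norm(keypoints_xyz - g, axis=1)   # L2 distances to weighted key points
        near = d < r_g
        if near.sum() < lam:                            # fewer than lambda key points: drop the grid point
            continue
        wgt = 1.0 / (d[near] + 1e-6)                    # inverse-distance interpolation weights
        feats.append((keypoint_feats[near] * wgt[:, None]).sum(axis=0) / wgt.sum())
    return np.asarray(feats)                            # features of the surviving virtual grid points
```

The sketch stops at the interpolated grid features; the minimum-bounding-rectangle fitting and the [C+3, 256, 128, 128] MLP head with global max pooling described in the claim are not reproduced here.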
7. The detection method for improving the detection efficiency of an automatic driving three-dimensional target according to claim 6, characterized in that the key point threshold λ = 3.
CN202310490811.3A 2022-04-29 2023-05-04 Detection method for improving detection efficiency of three-dimensional target of automatic driving Pending CN116778449A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210463525 2022-04-29
CN2022104635253 2022-04-29

Publications (1)

Publication Number Publication Date
CN116778449A true CN116778449A (en) 2023-09-19

Family

ID=88008919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310490811.3A Pending CN116778449A (en) 2022-04-29 2023-05-04 Detection method for improving detection efficiency of three-dimensional target of automatic driving

Country Status (1)

Country Link
CN (1) CN116778449A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination