CN114155524A - Single-stage 3D point cloud target detection method and device, computer equipment and medium - Google Patents

Single-stage 3D point cloud target detection method and device, computer equipment and medium

Info

Publication number
CN114155524A
Authority
CN
China
Prior art keywords
sampling
point
point cloud
module
spatial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111271651.0A
Other languages
Chinese (zh)
Inventor
王伟平
李鸿宇
周宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202111271651.0A priority Critical patent/CN114155524A/en
Publication of CN114155524A publication Critical patent/CN114155524A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a single-stage 3D point cloud target detection method and device, computer equipment, and a medium. The method comprises the following steps: according to the different distances between targets in the 3D point cloud and the sampling sensor that acquired it, setting different clustering radius parameters and clustering the 3D point cloud, enclosing each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and down-sampling the point set in each minimum 3D bounding box to obtain augmented point cloud data; sampling the augmented point cloud data by farthest-distance sampling within a sphere and extracting features of the sampled points to obtain semantic and spatial features; predicting a score for each point from the spatial and semantic features, then sampling points in descending order of score; fusing the features of the sampled points to obtain a fused feature map; and predicting the position and category of the targets in the point cloud with a regression prediction network from the fused feature map.

Description

Single-stage 3D point cloud target detection method and device, computer equipment and medium
Technical Field
The invention relates to a single-stage 3D point cloud target detection method and device based on DBSCAN cluster data augmentation, computer equipment and a medium, and belongs to the technical field of computer software.
Background
Target detection and recognition in outdoor point cloud scenes has been a research hotspot in recent years. Target detection is the core of the whole pipeline: its task is to correctly localize three-dimensional targets in space and identify their categories from point clouds that are irregularly distributed, sparse, and uneven. With the development of deep learning, point cloud-based 3D target detection has improved rapidly. Inspired by target detection methods for 2D images, current mainstream point cloud 3D target detection methods focus on how a backbone network can extract representative, regularly arranged features from the point cloud, so that convolution operations designed for 2D images can be applied directly to 3D point cloud features. The rich semantic features produced by the backbone network are then fed into a head network, which predicts the spatial position and category of the targets in the point cloud; the corresponding losses are computed against the 3D labels to provide the supervision signal that trains the neural network.
The above conventional scheme has the following disadvantages:
1. Existing methods detect and recognize nearby objects with densely distributed points well, but recognize distant objects with sparsely distributed points poorly.
2. A point cloud contains a very large number of points, and extracting features for all of them would require a great deal of time and computation. To balance performance against resources, existing single-stage point cloud 3D target detection methods rely on a sampling process; during encoding, however, only the maximum-magnitude feature is kept as the candidate feature, and the rich contextual information among the features obtained by spherical sampling is ignored, so semantic features with stronger representational power cannot be obtained.
3. Existing methods base the final detection on high-level semantic features, which benefits recognition, but because much of the low-level spatial information is lost during convolution, prediction of the spatial position and rotation direction is poor.
Disclosure of Invention
To address these problems in the prior art, the invention provides a single-stage 3D point cloud target detection method and device based on DBSCAN-clustering data augmentation, computer equipment, and a medium. The detection method introduces a new data augmentation method, a new mask sampling method, and a new feature fusion module, further improving 3D point cloud target detection.
The technical scheme of the invention is as follows:
a single-stage 3D point cloud target detection method comprises the following steps:
training the backbone network:
for each 3D point cloud in the training sample set, a data augmentation module sets different clustering radius parameters according to the different distances between the target instances in the 3D point cloud and the sampling sensor that acquired it, clusters the 3D point cloud, encloses each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and puts the minimum 3D bounding boxes into a set of 3D boxes db_boxes; the intersection-over-union (IoU) between each minimum bounding box in db_boxes and the ground-truth 3D boxes gt_boxes of the corresponding target instances in the point cloud is computed; if the IoU is larger than a set threshold, the corresponding minimum bounding box is stored in a saved_boxes set and the points inside it are removed from the 3D point cloud; the points in the saved_boxes set are then down-sampled to obtain the augmented point cloud data;
the backbone network samples the augmented point cloud data by farthest-distance sampling within a sphere, extracts features of the sampled points, and feeds the extracted semantic and spatial features into a mask sampling module of the backbone network;
the mask sampling module predicts a score for each point from the input spatial and semantic features, samples points in descending order of score, and feeds them into a feature fusion module of the backbone network;
the feature fusion module decouples the features of the input sampled points into semantic features and spatial information; the decoupled semantic features are convolved to obtain compressed semantic features, which are passed through a Sigmoid function to obtain a semantic attention map; the decoupled spatial information is convolved to obtain compressed spatial features, which are passed through a Sigmoid function to obtain a spatial attention map; the semantic and spatial attention maps are added element-wise to obtain a compressed attention map; the compressed attention map is multiplied element-wise with the features of the input sampled points to obtain an activated feature map; the dimensions of the feature map of the input sampled points are adjusted to match those of the activated feature map, and the two maps are added element-wise to obtain a fused feature map;
the regression prediction network of the backbone network predicts the position and category of the target instances in the point cloud from the fused feature map; the parameters of the backbone network are then updated based on the prediction results and the set loss function;
an application stage:
the 3D point cloud to be processed is input into the trained backbone network; the backbone network samples the 3D point cloud data to be processed by farthest-distance sampling within a sphere, extracts features of the sampled points, and processes the extracted semantic and spatial features through the mask sampling module and the feature fusion module in sequence to obtain a fused feature map; the regression prediction network then predicts from the fused feature map to obtain the position of the targets in the 3D point cloud to be processed.
Optionally, the mask sampling module is a binary classification network composed of a feature encoding module and a feature decoding module.
Optionally, the mask sampling module concatenates the input spatial and semantic features along the feature dimension and feeds the resulting feature map into the feature encoding module for down-sampling, obtaining down-sampled feature maps at different stages; the down-sampled feature map of the last stage is fed into the feature decoding module for up-sampling, the up-sampled feature map obtained at each stage is concatenated with the down-sampled feature map of the same stage before the next stage of up-sampling, the score of each point is predicted from the concatenation of the last-stage up-sampled feature map and the first-stage down-sampled feature map, and points are then sampled in descending order of score.
Optionally, the clustering operation is a DBSCAN clustering operation.
Optionally, the feature encoding module comprises two 3x3 convolution units connected in sequence, each 3x3 convolution unit being followed by a batch normalization unit, a rectified linear unit (ReLU), and a max-pooling down-sampling unit; the feature decoding module comprises two 2x2 transposed convolution units and two 3x3 convolution units: the first 2x2 transposed convolution unit processes the input of the feature decoding module and feeds it into the first 3x3 convolution unit, the output of the first 3x3 convolution unit passes through a batch normalization unit and a ReLU unit and is fed into the second 3x3 convolution unit, and the output of the second 3x3 convolution unit passes through a batch normalization unit and a ReLU unit and is fed into the second 2x2 transposed convolution unit.
Optionally, the sampling sensor is a laser radar sensor.
Optionally, the regression prediction network is an anchor-free regression head network.
The invention also provides a single-stage 3D point cloud target detection device, which is characterized by comprising the following components:
the data augmentation module, which sets different clustering radius parameters according to the different distances between the target instances in the 3D point cloud and the sampling sensor that acquired it, clusters the 3D point cloud, encloses each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and puts the minimum 3D bounding boxes into a set of 3D boxes db_boxes; the intersection-over-union (IoU) between each minimum bounding box in db_boxes and the ground-truth 3D boxes gt_boxes of the corresponding target instances in the point cloud is computed; if the IoU is larger than a set threshold, the corresponding minimum bounding box is stored in a saved_boxes set and the points inside it are removed from the 3D point cloud; the points in the saved_boxes set are then down-sampled to obtain the augmented point cloud data;
the feature extraction module, which samples the augmented point cloud data by farthest-distance sampling within a sphere, extracts features of the sampled points, and feeds the extracted semantic and spatial features into the mask sampling module;
the mask sampling module, which predicts a score for each point from the input spatial and semantic features, samples points in descending order of score, and feeds them into the feature fusion module;
the feature fusion module, which decouples the features of the input sampled points into semantic features and spatial information; the decoupled semantic features are convolved to obtain compressed semantic features, which are passed through a Sigmoid function to obtain a semantic attention map; the decoupled spatial information is convolved to obtain compressed spatial features, which are passed through a Sigmoid function to obtain a spatial attention map; the semantic and spatial attention maps are added element-wise to obtain a compressed attention map; the compressed attention map is multiplied element-wise with the features of the input sampled points to obtain an activated feature map; the dimensions of the feature map of the input sampled points are adjusted to match those of the activated feature map, and the two maps are added element-wise to obtain a fused feature map;
and the regression prediction network, which predicts the position and category of the targets in the point cloud from the fused feature map.
The invention also provides a computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method as described above.
The invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the computer program realizes the steps of the above-mentioned method when being executed by a processor.
The invention has the following advantages:
Points in a 3D point cloud target detection task are irregularly distributed in space, and point density differs greatly between regions: targets close to the sampling sensor are densely sampled and their spatial features are well characterized, while targets far from the sensor are sparsely sampled and poorly characterized, and poor detection and recognition of these distant targets is the performance bottleneck of current 3D point cloud target detection. The invention provides a new DBSCAN-based data augmentation module that obtains regions of different density classes through point cloud clustering, derives the corresponding minimum bounding box from each set of points of the same class, and then down-samples the points inside the minimum 3D bounding box to simulate the sparse distribution of distant point clouds, improving the robustness of the model to such point clouds.
A mask sampling module based on the U-Net architecture performs the sampling of points in the point cloud. Unlike the traditional method, the sampled points contain no repetitions, and they provide more contextual information for subsequent feature processing.
A feature fusion module based on a compression-activation architecture better fuses low-level position information with high-level semantic features: it fuses the spatial and semantic features, better preserving the spatial features that benefit spatial regression and the semantic features that benefit recognition, and provides more representative features for the head network's 3D point cloud target detection.
Experiments show that the method achieves superior performance on the KITTI data set and can be seamlessly embedded into existing 3D point cloud target detection methods that take point clouds as input.
Drawings
Fig. 1 is a diagram of a network architecture.
FIG. 2 shows the data augmentation downsampling result of DBSCAN clustering.
FIG. 3 is a diagram of the Density-SA block structure.
Detailed Description
The invention will be described in further detail with reference to the following drawings, which are given by way of example only for the purpose of illustrating the invention and are not intended to limit the scope of the invention.
The invention provides a density-aware 3D point cloud target detection method (Density-Net), the overall structure of which is shown in FIG. 1. Density-Net is a PointNet-based 3D target detection model that introduces three new modules:
(1) a data augmentation module based on DBSCAN clustering;
(2) a mask sampling module based on U-Net;
(3) a feature fusion module based on compression-activation.
Data augmentation module based on DBSCAN clustering:
The lidar rotates to collect laser signals returned from points at different distances, forming a 360-degree point cloud. According to the different distances between targets and the lidar sensor, the DBSCAN clustering data augmentation module sets different clustering radius parameters and clusters the point cloud, encloses each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and down-samples the point set inside the box to obtain the augmented point cloud data. The specific procedure is as follows:
the algorithm is as follows: DBSCAN data augmentation
Inputting:
DBSCAN clustering radius _ scales [ s ]1,...,sM1}
Scaling factor s of cluster radius1sM1Down-sampling coefficient dsscales={ds1,...,dsM2},ds∈[0,1]0 indicates total deletion, and 1 indicates total dot retention
Point cloud set P ═ { P ═ P1,...,pN}
Real 3D stereo frame tag G ═ { G) of object in point cloud1,....,gN}
And (3) outputting:
the point cloud set C after data augmentation is { C ═ C1,...,cN}
Training process:
(1) Only the foreground part of the 360-degree point cloud (i.e., the driver's field of view) is kept, and the points are divided into several sectors according to their distance from the lidar: (0, 20m], (20m, 40m], and (40m, +∞).
(2) The clustering radii r_scales are used as the DBSCAN radius and the minimum number of points per cluster in DBSCAN is set to 5; DBSCAN clustering is then performed to obtain the cluster assignments of the points within each sector range. For each set of points of the same class, a 3D box enclosing all points of that class is constructed, giving the minimum bounding box. If the center coordinate of the bounding box lies within the current sector, the minimum bounding box is retained; otherwise it is discarded. Processing the clustered points in this way yields the set of 3D boxes db_boxes.
(3) The intersection-over-union (IoU, i.e., the Jaccard coefficient) between each box in the set db_boxes obtained in step (2) and the ground-truth 3D boxes gt_boxes of the target instances in the point cloud is computed. If the Jaccard coefficient exceeds 0.01, the 3D box is kept in saved_boxes and the points inside it are removed from the point cloud formed by the points collected by the sensor; the points in saved_boxes are then down-sampled in turn according to the down-sampling coefficients ds_scales, and the points retained after down-sampling are added back into the point cloud.
(4) This yields the augmented point cloud C, which is used as the input of the model. The result of point cloud down-sampling is shown in FIG. 2; depending on the size and position of the boxes generated by DBSCAN clustering, the down-sampling is either complete or partial. In this way more point cloud distributions of distant instances can be simulated, the number of samples that can participate in training is increased, and the robustness of the model is improved.
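As a concrete illustration, the following is a minimal Python sketch of the augmentation procedure described above. It assumes axis-aligned bounding boxes, the scikit-learn DBSCAN implementation, and illustrative per-sector clustering radii and down-sampling ratios; only the sector ranges, the minimum cluster size of 5, and the IoU threshold of 0.01 come from the description, and the helper names are hypothetical.

```python
# Minimal sketch of the DBSCAN-based augmentation; boxes are axis-aligned for
# simplicity, whereas the actual method may use oriented 3D boxes.
import numpy as np
from sklearn.cluster import DBSCAN

SECTORS = [(0.0, 20.0), (20.0, 40.0), (40.0, np.inf)]  # distance ranges from the lidar (from the text)
RADII   = [0.4, 0.8, 1.2]                               # assumed per-sector DBSCAN eps values
DS      = [0.3, 0.6, 1.0]                               # assumed per-sector keep ratios (ds coefficients)

def aabb(pts):
    """Axis-aligned minimum bounding box (x1, y1, z1, x2, y2, z2) of a point set."""
    return np.concatenate([pts.min(0), pts.max(0)])

def aabb_iou(a, b):
    """IoU of two axis-aligned 3D boxes in (x1, y1, z1, x2, y2, z2) form."""
    lo, hi = np.maximum(a[:3], b[:3]), np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda x: np.prod(x[3:] - x[:3])
    return inter / (vol(a) + vol(b) - inter + 1e-9)

def dbscan_augment(points, gt_boxes, iou_thr=0.01):
    """points: (N, 3) foreground point cloud; gt_boxes: iterable of axis-aligned gt boxes."""
    dist = np.linalg.norm(points[:, :2], axis=1)        # horizontal distance to the lidar
    keep_mask = np.ones(len(points), dtype=bool)
    extra = []
    for (lo, hi), eps, ds in zip(SECTORS, RADII, DS):
        sel = (dist >= lo) & (dist < hi)
        if sel.sum() < 5:
            continue
        labels = DBSCAN(eps=eps, min_samples=5).fit(points[sel]).labels_
        for c in set(labels) - {-1}:                     # -1 marks DBSCAN noise points
            cluster = points[sel][labels == c]
            box = aabb(cluster)
            # keep the cluster only if its box overlaps a ground-truth box (IoU > 0.01)
            if max((aabb_iou(box, g) for g in gt_boxes), default=0.0) <= iou_thr:
                continue
            # remove the original cluster points, then re-insert a down-sampled copy
            idx = np.where(sel)[0][labels == c]
            keep_mask[idx] = False
            n_keep = max(1, int(len(cluster) * ds))
            extra.append(cluster[np.random.choice(len(cluster), n_keep, replace=False)])
    return np.concatenate([points[keep_mask]] + extra, axis=0)
```

The axis-aligned boxes and fixed constants only illustrate the overall flow of clustering, IoU filtering, and re-insertion of down-sampled clusters.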
Mask sampling (MaskSample) module based on U-Net:
To balance performance against computational cost, the point cloud must be down-sampled. The current mainstream approach is farthest-distance sampling within a sphere, but it samples distant points poorly because they are sparsely distributed and few in number. Many points are discarded during down-sampling; this has little effect on target instances that are close to the sensor and contain many points, but a large effect on distant instances that contain few points, whose points may not be among those finally sampled. When the backbone network extracts semantic and spatial features from points sampled in this way, the features lack discriminative power, which reduces both the accuracy of the 3D box position regression predicted by the head network and the recall of target instance points.
In current single-stage 3D point cloud target detectors, point sampling is mostly based on farthest-distance sampling within a sphere: a sphere of known radius is constructed and farthest point sampling is performed on the points it contains. In regions with dense points this extracts many distinct, well-characterizing points; in regions with sparse points, however, the number of points is far smaller than the number of required samples, the usual remedy is repeated sampling, and the resulting points characterize the scene poorly. Statistics show that 12% of the points obtained by such resampling are repetitions, and points belonging to target instances account for only 33.28% of all sampled points.
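For reference, a minimal NumPy sketch of this baseline scheme (ball query with repetition padding, followed by farthest point sampling) is shown below; the function signatures and padding behaviour are illustrative assumptions rather than a specific existing implementation.

```python
# Baseline in-sphere farthest-distance sampling: ball query pads by repetition
# when a neighbourhood is sparse, which is the repetition problem noted above.
import numpy as np

def ball_query(points, center, radius, k):
    """Indices of up to k points within `radius` of `center`."""
    idx = np.where(np.linalg.norm(points - center, axis=1) < radius)[0]
    if len(idx) == 0:
        return np.full(k, -1)
    if len(idx) < k:                                   # sparse region: repeated sampling
        idx = np.concatenate([idx, np.random.choice(idx, k - len(idx))])
    return idx[:k]

def farthest_point_sampling(points, m):
    """Iteratively pick the point farthest from the already chosen set."""
    chosen = [0]
    d = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        chosen.append(int(d.argmax()))
        d = np.minimum(d, np.linalg.norm(points - points[chosen[-1]], axis=1))
    return np.asarray(chosen)
```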
The invention constructs a U-Net-based mask sampling module, whose overall structure is shown in the upper half of the network in FIG. 1. It is a binary classification network composed of a feature encoding module and a feature decoding module: it outputs, for each point, a prediction score for whether the point lies inside a ground-truth 3D box, and then samples points in descending order of score (i.e., selects the N points with the highest scores). The structure of the mask sampling module is as follows. The spatial features and semantic features are first concatenated along the feature dimension, and the resulting feature map is fed into the encoder. The encoder consists of two 3x3 convolutions, each followed by batch normalization (BN), a rectified linear unit (ReLU), and a 2x2 max-pooling down-sampling with stride 2, producing down-sampled feature maps at the different stages. The decoder consists of two 2x2 transposed convolutions and 3x3 convolution units, each 3x3 convolution followed by batch normalization (BN) and a ReLU, producing up-sampled feature maps at the different stages.
The processing flow of the mask sampling module is as follows. The concatenated spatial and semantic features are the input of the first 3x3 convolution in the encoder; the output features undergo batch normalization (BN), ReLU, and a 2x2 max-pooling with stride 2, and this operation is repeated twice to obtain the encoder output. The encoder output feature map is the input of a 2x2 transposed convolution in the decoder; its output features are concatenated with the encoder features of the same stage (the width, height, and channel dimensions of the feature maps at the same stage are identical) and fed into a 3x3 convolution, and this operation is repeated twice to obtain the decoder output. Finally, a 1x1 convolution maps to the prediction scores, and the first 1024 points are selected according to the prediction score of each candidate point.
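A minimal PyTorch sketch of a mask sampling module with this two-stage encoder/decoder layout might look as follows; the channel widths, the assumption that the point features are arranged as a 2D map with height and width divisible by 4, and the class name are illustrative choices, not values fixed by the description.

```python
# U-Net style mask sampling sketch: two encoder stages (3x3 conv + BN + ReLU +
# 2x2 max-pool), two decoder stages (2x2 transposed conv + skip concat + 3x3
# conv + BN + ReLU), a 1x1 conv scoring every point, and top-k selection.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # 3x3 convolution followed by batch normalization and ReLU
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class MaskSample(nn.Module):
    def __init__(self, cin=16, n_keep=1024):
        super().__init__()
        self.n_keep = n_keep
        self.enc1, self.enc2 = conv_block(cin, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.up2 = nn.ConvTranspose2d(64, 64, 2, stride=2)
        self.dec2 = conv_block(64 + 64, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(32 + 32, 32)
        self.score = nn.Conv2d(32, 1, 1)          # foreground/background score per point

    def forward(self, feat):                      # feat: (B, C, H, W), H*W >= n_keep, H and W divisible by 4
        e1 = self.enc1(feat)                      # stage-1 encoder features (full resolution)
        e2 = self.enc2(self.pool(e1))             # stage-2 encoder features (1/2 resolution)
        bottom = self.pool(e2)                    # encoder output (1/4 resolution)
        d2 = self.dec2(torch.cat([self.up2(bottom), e2], dim=1))  # skip connection, stage 2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))      # skip connection, stage 1
        scores = self.score(d1).flatten(1)        # (B, H*W) prediction score for every point
        keep = scores.topk(self.n_keep, dim=1).indices             # indices of the sampled points
        return scores, keep
```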
Residual structure based compression-activated feature fusion (Density-SA) module:
The set abstraction module in PointNet++ consists mainly of three layers: a sampling layer, a grouping layer, and a PointNet layer. The goal here is to obtain better high-level semantic features while preserving more accurate spatial features, so that more discriminative information is available for classification and position regression. Inspired by the compression-activation (squeeze-and-excitation) architecture, a new compression-activation feature fusion module based on a residual structure, named Density-SA, is designed; its structure is shown in FIG. 3. First, the semantic features and spatial information in the initial features are decoupled, i.e., the spatial position information of the points is separated from their semantic information; each is then fed into a 1x1 convolution to obtain the compressed spatial and semantic features, every 1x1 convolution being followed by batch normalization (BN) and ReLU. The spatial and semantic compressed features are passed through a Sigmoid function to obtain a spatial attention map and a semantic attention map, which are added element-wise to obtain the final compressed attention map. The compressed attention map is multiplied element-wise with the initial features to obtain the activated feature map, and the initial features are fed into a 1x1 convolution that adjusts their dimensions to match those of the activated feature map. Finally, the activated feature map and the adjusted initial features are added element-wise to obtain the final fused feature map.
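A minimal PyTorch sketch of such a compression-activation fusion block is given below; it assumes the first three feature channels carry the point coordinates and the remaining channels the semantic features, and the channel widths and exact placement of BN/ReLU are illustrative assumptions.

```python
# Compression-activation (Density-SA style) fusion sketch with a residual branch.
import torch
import torch.nn as nn

class DensitySA(nn.Module):
    def __init__(self, c_in, c_spatial=3):
        super().__init__()
        def squeeze(c):      # 1x1 conv + BN + ReLU compressing one branch to c_in channels
            return nn.Sequential(nn.Conv1d(c, c_in, 1),
                                 nn.BatchNorm1d(c_in), nn.ReLU(inplace=True))
        self.sq_spatial = squeeze(c_spatial)
        self.sq_semantic = squeeze(c_in - c_spatial)
        self.adjust = nn.Conv1d(c_in, c_in, 1)    # 1x1 conv adjusting the initial features
        self.c_spatial = c_spatial

    def forward(self, feat):                      # feat: (B, C, N) features of the sampled points
        xyz, sem = feat[:, :self.c_spatial], feat[:, self.c_spatial:]
        # spatial and semantic attention maps, added element-wise
        att = torch.sigmoid(self.sq_spatial(xyz)) + torch.sigmoid(self.sq_semantic(sem))
        activated = att * feat                    # element-wise activation of the initial features
        return activated + self.adjust(feat)      # residual fusion
```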
The whole process of the invention is as follows:
First, the laser reflection signals of the target instances are acquired by the lidar sensor and a point cloud data structure is constructed.
The input point cloud is processed by the DBSCAN clustering augmentation method to obtain augmented point cloud data, increasing the amount of point cloud data available for training.
The point cloud is fed into the backbone network, where the mask sampling (MaskSample) module performs the point sampling. The sampled point cloud passes through the residual-structure compression-activation feature fusion (Density-SA) module to extract fused spatial and semantic features, which are fed into an anchor-free regression head network (3DSSD, CVPR 2020).
The anchor-free regression head network predicts the position and class of the 3D boxes in the point cloud, and the whole model supervises the training of the network by computing the target class cross entropy loss (Cross Entropy), the spatial position smooth L1 loss (Smooth L1), the target size smooth L1 loss (Smooth L1), the mask sampling binary classification cross entropy loss (Binary Cross Entropy), and the angle regression loss. The angle regression loss is as follows:
L_angle = L_c(d_c, t_c) + D(d_r, t_r)
The 360 degrees are first divided uniformly into 12 interval categories; the interval category corresponding to the angle is predicted first, and then the distance from the interval center, i.e., the residual, is predicted. Here d_c and d_r denote the predicted angle category and the corresponding residual, and t_c and t_r are their target values. L_c denotes the cross entropy loss (Cross Entropy) and D denotes the smooth L1 loss (Smooth L1).
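A minimal PyTorch sketch of this bin-based angle loss (cross entropy over 12 bins plus smooth L1 on the within-bin residual) might look as follows; the tensor shapes and the residual parameterization relative to the bin centre are assumptions for illustration.

```python
# Bin-based angle regression loss: classify the orientation bin, regress the residual.
import math
import torch
import torch.nn.functional as F

NUM_BINS = 12
BIN_SIZE = 2 * math.pi / NUM_BINS

def angle_loss(bin_logits, res_pred, gt_angle):
    """bin_logits: (B, 12) predicted bin scores; res_pred: (B, 12) per-bin residuals;
    gt_angle: (B,) ground-truth yaw in [0, 2*pi)."""
    t_c = (gt_angle / BIN_SIZE).long().clamp(max=NUM_BINS - 1)    # target bin index
    t_r = gt_angle - (t_c.float() + 0.5) * BIN_SIZE               # residual from the bin centre
    d_r = res_pred.gather(1, t_c.unsqueeze(1)).squeeze(1)         # predicted residual of the target bin
    return F.cross_entropy(bin_logits, t_c) + F.smooth_l1_loss(d_r, t_r)
```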
We performed extensive experiments to evaluate Density-Net. The model was trained on KITTI, a widely used 3D outdoor autonomous-driving target detection dataset, and tested on the KITTI validation set. KITTI contains 7481 training samples and 7518 test samples; the 7481 training samples are further divided into a training set (3712 samples) and a validation set (3769 samples). We mainly evaluate the model on the car class and follow the official KITTI evaluation protocol: a target is considered correctly detected if its spatial IoU with a ground-truth 3D box exceeds 0.7 and the category is correct.
Table 1 shows the results of comparative experiments for each module
[Table 1: module-by-module comparison; provided as an image in the original document.]
Table 1 compares the effect of each module of our model. The results show that the DBSCAN clustering data augmentation provided by the invention increases the sparse, low-density regions in the point cloud, simulates the point cloud distribution of distant targets, and improves model performance; the proposed mask sampling module performs repetition-free down-sampling of the point cloud by ranking the values in the mask from high to low; and the proposed Density-SA module yields optimized, more discriminative features. The baseline performance is improved on the KITTI dataset, demonstrating the effectiveness of the method.
Table 2 compares the invention with other mainstream methods on the KITTI test dataset; the detection results on the KITTI dataset show that the invention can accurately detect and recognize distant targets.
Table 2 shows the performance comparison of Density-Net and other methods on the KITTI test set
[Table 2: comparison with other methods on the KITTI test set; provided as an image in the original document.]
Where '-' indicates that this category was not tested.
Although specific embodiments of the invention have been disclosed for purposes of illustration, and for purposes of aiding in the understanding of the contents of the invention and its implementation, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A single-stage 3D point cloud target detection method comprises the following steps:
training the backbone network:
for each 3D point cloud in the training sample set, a data augmentation module sets different clustering radius parameters according to the different distances between the target instances in the 3D point cloud and the sampling sensor that acquired it, clusters the 3D point cloud, encloses each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and puts the minimum 3D bounding boxes into a set of 3D boxes db_boxes; the intersection-over-union (IoU) between each minimum bounding box in db_boxes and the ground-truth 3D boxes gt_boxes of the corresponding target instances in the point cloud is computed; if the IoU is larger than a set threshold, the corresponding minimum bounding box is stored in a saved_boxes set and the points inside it are removed from the 3D point cloud; the points in the saved_boxes set are then down-sampled to obtain the augmented point cloud data;
the backbone network samples the augmented point cloud data by farthest-distance sampling within a sphere, extracts features of the sampled points, and feeds the extracted semantic and spatial features into a mask sampling module of the backbone network;
the mask sampling module predicts a score for each point from the input spatial and semantic features, samples points in descending order of score, and feeds them into a feature fusion module of the backbone network;
the feature fusion module decouples the features of the input sampled points into semantic features and spatial information; the decoupled semantic features are convolved to obtain compressed semantic features, which are passed through a Sigmoid function to obtain a semantic attention map; the decoupled spatial information is convolved to obtain compressed spatial features, which are passed through a Sigmoid function to obtain a spatial attention map; the semantic and spatial attention maps are added element-wise to obtain a compressed attention map; the compressed attention map is multiplied element-wise with the features of the input sampled points to obtain an activated feature map; the dimensions of the feature map of the input sampled points are adjusted to match those of the activated feature map, and the two maps are added element-wise to obtain a fused feature map;
the regression prediction network of the backbone network predicts the position and category of the target instances in the point cloud from the fused feature map; the parameters of the backbone network are then updated based on the prediction results and the set loss function;
an application stage:
the 3D point cloud to be processed is input into the trained backbone network; the backbone network samples the 3D point cloud data to be processed by farthest-distance sampling within a sphere, extracts features of the sampled points, and processes the extracted semantic and spatial features through the mask sampling module and the feature fusion module in sequence to obtain a fused feature map; the regression prediction network then predicts from the fused feature map to obtain the position of the targets in the 3D point cloud to be processed.
2. The method of claim 1, wherein the mask sampling module is a binary classification network composed of a feature encoding module and a feature decoding module.
3. The method according to claim 2, wherein the mask sampling module concatenates the input spatial and semantic features along the feature dimension and feeds the resulting feature map into the feature encoding module for down-sampling, obtaining down-sampled feature maps at different stages; the down-sampled feature map of the last stage is fed into the feature decoding module for up-sampling, the up-sampled feature map obtained at each stage is concatenated with the down-sampled feature map of the same stage before the next stage of up-sampling, the score of each point is predicted from the concatenation of the last-stage up-sampled feature map and the first-stage down-sampled feature map, and points are then sampled in descending order of score.
4. The method according to claim 2 or 3, wherein the feature encoding module comprises two 3x3 convolution units connected in sequence, each 3x3 convolution unit being followed by a batch normalization unit, a rectified linear unit (ReLU), and a max-pooling down-sampling unit; the feature decoding module comprises two 2x2 transposed convolution units and two 3x3 convolution units: the first 2x2 transposed convolution unit processes the input of the feature decoding module and feeds it into the first 3x3 convolution unit, the output of the first 3x3 convolution unit passes through a batch normalization unit and a ReLU unit and is fed into the second 3x3 convolution unit, and the output of the second 3x3 convolution unit passes through a batch normalization unit and a ReLU unit and is fed into the second 2x2 transposed convolution unit.
5. The method of claim 1, wherein the clustering operation is a DBSCAN clustering operation.
6. The method of claim 1, wherein the sampling sensor is a lidar sensor.
7. The method of claim 1, wherein the regression prediction network is an anchor-free regression head network.
8. A single stage 3D point cloud target detection apparatus, the apparatus comprising:
the data augmentation module, which sets different clustering radius parameters according to the different distances between the target instances in the 3D point cloud and the sampling sensor that acquired it, clusters the 3D point cloud, encloses each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and puts the minimum 3D bounding boxes into a set of 3D boxes db_boxes; the intersection-over-union (IoU) between each minimum bounding box in db_boxes and the ground-truth 3D boxes gt_boxes of the corresponding target instances in the point cloud is computed; if the IoU is larger than a set threshold, the corresponding minimum bounding box is stored in a saved_boxes set and the points inside it are removed from the 3D point cloud; the points in the saved_boxes set are then down-sampled to obtain the augmented point cloud data;
the feature extraction module, which samples the augmented point cloud data by farthest-distance sampling within a sphere, extracts features of the sampled points, and feeds the extracted semantic and spatial features into the mask sampling module;
the mask sampling module, which predicts a score for each point from the input spatial and semantic features, samples points in descending order of score, and feeds them into the feature fusion module;
the feature fusion module, which decouples the features of the input sampled points into semantic features and spatial information; the decoupled semantic features are convolved to obtain compressed semantic features, which are passed through a Sigmoid function to obtain a semantic attention map; the decoupled spatial information is convolved to obtain compressed spatial features, which are passed through a Sigmoid function to obtain a spatial attention map; the semantic and spatial attention maps are added element-wise to obtain a compressed attention map; the compressed attention map is multiplied element-wise with the features of the input sampled points to obtain an activated feature map; the dimensions of the feature map of the input sampled points are adjusted to match those of the activated feature map, and the two maps are added element-wise to obtain a fused feature map;
and the regression prediction network, which predicts the position and category of the targets in the point cloud from the fused feature map.
9. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202111271651.0A 2021-10-29 2021-10-29 Single-stage 3D point cloud target detection method and device, computer equipment and medium Pending CN114155524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111271651.0A CN114155524A (en) 2021-10-29 2021-10-29 Single-stage 3D point cloud target detection method and device, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111271651.0A CN114155524A (en) 2021-10-29 2021-10-29 Single-stage 3D point cloud target detection method and device, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN114155524A true CN114155524A (en) 2022-03-08

Family

ID=80459112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111271651.0A Pending CN114155524A (en) 2021-10-29 2021-10-29 Single-stage 3D point cloud target detection method and device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN114155524A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627346A (en) * 2022-03-15 2022-06-14 电子科技大学 Point cloud data down-sampling method capable of retaining important features

Similar Documents

Publication Publication Date Title
CN111652217B (en) Text detection method and device, electronic equipment and computer storage medium
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN111126202A (en) Optical remote sensing image target detection method based on void feature pyramid network
CN110675408A (en) High-resolution image building extraction method and system based on deep learning
CN110210431B (en) Point cloud semantic labeling and optimization-based point cloud classification method
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN112085072B (en) Cross-modal retrieval method of sketch retrieval three-dimensional model based on space-time characteristic information
CN112905828B (en) Image retriever, database and retrieval method combining significant features
CN114049356B (en) Method, device and system for detecting structure apparent crack
CN113012177A (en) Three-dimensional point cloud segmentation method based on geometric feature extraction and edge perception coding
CN114463736A (en) Multi-target detection method and device based on multi-mode information fusion
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN114155524A (en) Single-stage 3D point cloud target detection method and device, computer equipment and medium
CN117011274A (en) Automatic glass bottle detection system and method thereof
CN113408651B (en) Unsupervised three-dimensional object classification method based on local discriminant enhancement
CN116071557A (en) Long tail target detection method, computer readable storage medium and driving device
CN114494893B (en) Remote sensing image feature extraction method based on semantic reuse context feature pyramid
CN115937520A (en) Point cloud moving target segmentation method based on semantic information guidance
CN114882490A (en) Unlimited scene license plate detection and classification method based on point-guided positioning
CN114241470A (en) Natural scene character detection method based on attention mechanism
CN112598055A (en) Helmet wearing detection method, computer-readable storage medium and electronic device
CN113569600A (en) Method and device for identifying weight of object, electronic equipment and storage medium
Zhou et al. FENet: Fast Real-time Semantic Edge Detection Network
CN113822375B (en) Improved traffic image target detection method
CN115359346B (en) Small micro-space identification method and device based on street view picture and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination