CN114155524A - Single-stage 3D point cloud target detection method and device, computer equipment and medium - Google Patents

Single-stage 3D point cloud target detection method and device, computer equipment and medium

Info

Publication number
CN114155524A
Authority
CN
China
Prior art keywords
sampling
point
point cloud
module
spatial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111271651.0A
Other languages
Chinese (zh)
Inventor
王伟平
李鸿宇
周宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202111271651.0A priority Critical patent/CN114155524A/en
Publication of CN114155524A publication Critical patent/CN114155524A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a single-stage 3D point cloud target detection method and device, computer equipment, and a medium. The method comprises the following steps: according to the different distances between targets in the 3D point cloud and the sampling sensor that acquired it, setting different clustering radius parameters and clustering the 3D point cloud, enclosing each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and down-sampling the point set in each minimum 3D bounding box to obtain augmented point cloud data; sampling the augmented point cloud data by farthest-distance sampling within a sphere and extracting features of the sampled points to obtain semantic and spatial features; predicting a score for each point from the spatial and semantic features, then sampling points in descending order of score; fusing the features of the sampled points to obtain a fused feature map; and predicting the position and category of the targets in the point cloud with a regression prediction network from the fused feature map.

Description

Single-stage 3D point cloud target detection method and device, computer equipment and medium
Technical Field
The invention relates to a single-stage 3D point cloud target detection method and device based on DBSCAN cluster data augmentation, computer equipment and a medium, and belongs to the technical field of computer software.
Background
Target detection and recognition in outdoor point cloud scenes has been a research hotspot in recent years. Target detection is the core of the whole pipeline: its task is to correctly localize three-dimensional targets in space and identify their categories from point clouds that are irregularly distributed, sparse, and uneven. With the development of deep learning, point cloud-based 3D target detection has improved rapidly. Inspired by target detection methods for 2D images, current mainstream point cloud 3D target detection methods focus on how a backbone network can extract representative, regularly arranged features from the point cloud, so that convolution operations designed for 2D images can be applied directly to 3D point cloud features. The rich semantic features produced by the backbone network are then fed into a head network, which predicts the spatial position and category of the targets in the point cloud; the corresponding losses are computed against the 3D labels to provide the supervision signal that trains the neural network.
The above conventional scheme has the following disadvantages:
1. Existing methods detect and recognize nearby objects with densely distributed points well, but recognize distant objects with sparsely distributed points poorly.
2. A point cloud contains a very large number of points, and extracting features for all of them would require a great deal of time and computation. To balance performance against resources, existing single-stage point cloud 3D target detection methods rely on a sampling process; during encoding, however, only the maximum-magnitude feature is kept as the candidate feature, and the rich contextual information among the features obtained by spherical sampling is ignored, so semantic features with stronger representational power cannot be obtained.
3. Existing methods base the final detection on high-level semantic features, which benefits recognition, but because much of the low-level spatial information is lost during convolution, prediction of the spatial position and rotation direction is poor.
Disclosure of Invention
To address these problems in the prior art, the invention provides a single-stage 3D point cloud target detection method and device based on DBSCAN-clustering data augmentation, computer equipment, and a medium. The detection method introduces a new data augmentation method, a new mask sampling method, and a new feature fusion module, further improving 3D point cloud target detection.
The technical scheme of the invention is as follows:
a single-stage 3D point cloud target detection method comprises the following steps:
training the backbone network:
for each 3D point cloud in the training sample set, a data augmentation module sets different clustering radius parameters according to the different distances between the target instances in the 3D point cloud and the sampling sensor that acquired it, clusters the 3D point cloud, encloses each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and puts the minimum 3D bounding boxes into a set of 3D boxes db_boxes; the intersection-over-union (IoU) between each minimum bounding box in db_boxes and the ground-truth 3D boxes gt_boxes of the corresponding target instances in the point cloud is computed; if the IoU is larger than a set threshold, the corresponding minimum bounding box is stored in a saved_boxes set and the points inside it are removed from the 3D point cloud; the points in the saved_boxes set are then down-sampled to obtain the augmented point cloud data;
the backbone network samples the augmented point cloud data by farthest-distance sampling within a sphere, extracts features of the sampled points, and feeds the extracted semantic and spatial features into a mask sampling module of the backbone network;
the mask sampling module predicts a score for each point from the input spatial and semantic features, samples points in descending order of score, and feeds them into a feature fusion module of the backbone network;
the feature fusion module decouples the features of the input sampled points into semantic features and spatial information; the decoupled semantic features are convolved to obtain compressed semantic features, which are passed through a Sigmoid function to obtain a semantic attention map; the decoupled spatial information is convolved to obtain compressed spatial features, which are passed through a Sigmoid function to obtain a spatial attention map; the semantic and spatial attention maps are added element-wise to obtain a compressed attention map; the compressed attention map is multiplied element-wise with the features of the input sampled points to obtain an activated feature map; the dimensions of the feature map of the input sampled points are adjusted to match those of the activated feature map, and the two maps are added element-wise to obtain a fused feature map;
the regression prediction network of the backbone network predicts the position and category of the target instances in the point cloud from the fused feature map; the parameters of the backbone network are then updated based on the prediction results and the set loss function;
an application stage:
the 3D point cloud to be processed is input into the trained backbone network; the backbone network samples the 3D point cloud data to be processed by farthest-distance sampling within a sphere, extracts features of the sampled points, and processes the extracted semantic and spatial features through the mask sampling module and the feature fusion module in sequence to obtain a fused feature map; the regression prediction network then predicts from the fused feature map to obtain the position of the targets in the 3D point cloud to be processed.
Optionally, the mask sampling module is a binary classification network composed of a feature encoding module and a feature decoding module.
Optionally, the mask sampling module concatenates the input spatial and semantic features along the feature dimension and feeds the resulting feature map into the feature encoding module for down-sampling, obtaining down-sampled feature maps at different stages; the down-sampled feature map of the last stage is fed into the feature decoding module for up-sampling, the up-sampled feature map obtained at each stage is concatenated with the down-sampled feature map of the same stage before the next stage of up-sampling, the score of each point is predicted from the concatenation of the last-stage up-sampled feature map and the first-stage down-sampled feature map, and points are then sampled in descending order of score.
Optionally, the clustering operation is a DBSCAN clustering operation.
Optionally, the feature encoding module comprises two 3x3 convolution units connected in sequence, each 3x3 convolution unit being followed by a batch normalization unit, a rectified linear unit (ReLU), and a max-pooling down-sampling unit; the feature decoding module comprises two 2x2 transposed convolution units and two 3x3 convolution units: the first 2x2 transposed convolution unit processes the input of the feature decoding module and feeds it into the first 3x3 convolution unit, the output of the first 3x3 convolution unit passes through a batch normalization unit and a ReLU unit and is fed into the second 3x3 convolution unit, and the output of the second 3x3 convolution unit passes through a batch normalization unit and a ReLU unit and is fed into the second 2x2 transposed convolution unit.
Optionally, the sampling sensor is a laser radar sensor.
Optionally, the regression prediction network is an anchor-free regression head network.
The invention also provides a single-stage 3D point cloud target detection device, which is characterized by comprising the following components:
the data augmentation module, which sets different clustering radius parameters according to the different distances between the target instances in the 3D point cloud and the sampling sensor that acquired it, clusters the 3D point cloud, encloses each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and puts the minimum 3D bounding boxes into a set of 3D boxes db_boxes; the intersection-over-union (IoU) between each minimum bounding box in db_boxes and the ground-truth 3D boxes gt_boxes of the corresponding target instances in the point cloud is computed; if the IoU is larger than a set threshold, the corresponding minimum bounding box is stored in a saved_boxes set and the points inside it are removed from the 3D point cloud; the points in the saved_boxes set are then down-sampled to obtain the augmented point cloud data;
the feature extraction module, which samples the augmented point cloud data by farthest-distance sampling within a sphere, extracts features of the sampled points, and feeds the extracted semantic and spatial features into the mask sampling module;
the mask sampling module, which predicts a score for each point from the input spatial and semantic features, samples points in descending order of score, and feeds them into the feature fusion module;
the feature fusion module, which decouples the features of the input sampled points into semantic features and spatial information; the decoupled semantic features are convolved to obtain compressed semantic features, which are passed through a Sigmoid function to obtain a semantic attention map; the decoupled spatial information is convolved to obtain compressed spatial features, which are passed through a Sigmoid function to obtain a spatial attention map; the semantic and spatial attention maps are added element-wise to obtain a compressed attention map; the compressed attention map is multiplied element-wise with the features of the input sampled points to obtain an activated feature map; the dimensions of the feature map of the input sampled points are adjusted to match those of the activated feature map, and the two maps are added element-wise to obtain a fused feature map;
and the regression prediction network, which predicts the position and category of the targets in the point cloud from the fused feature map.
The invention also provides a computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method as described above.
The invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the computer program realizes the steps of the above-mentioned method when being executed by a processor.
The invention has the following advantages:
Points in a 3D point cloud target detection task are irregularly distributed in space, and point density differs greatly between regions: targets close to the sampling sensor are densely sampled and their spatial features are well characterized, while targets far from the sensor are sparsely sampled and poorly characterized, and poor detection and recognition of these distant targets is the performance bottleneck of current 3D point cloud target detection. The invention provides a new DBSCAN-based data augmentation module that obtains regions of different density classes through point cloud clustering, derives the corresponding minimum bounding box from each set of points of the same class, and then down-samples the points inside the minimum 3D bounding box to simulate the sparse distribution of distant point clouds, improving the robustness of the model to such point clouds.
A mask sampling module based on the U-Net architecture performs the sampling of points in the point cloud. Unlike the traditional method, the sampled points contain no repetitions, and they provide more contextual information for subsequent feature processing.
A feature fusion module based on a compression-activation architecture better fuses low-level position information with high-level semantic features: it fuses the spatial and semantic features, better preserving the spatial features that benefit spatial regression and the semantic features that benefit recognition, and provides more representative features for the head network's 3D point cloud target detection.
Experiments show that the method achieves superior performance on the KITTI data set and can be seamlessly embedded into existing 3D point cloud target detection methods that take point clouds as input.
Drawings
Fig. 1 is a diagram of a network architecture.
FIG. 2 shows the data augmentation downsampling result of DBSCAN clustering.
FIG. 3 is a diagram of the Density-SA block structure.
Detailed Description
The invention will be described in further detail with reference to the following drawings, which are given by way of example only for the purpose of illustrating the invention and are not intended to limit the scope of the invention.
The invention provides a density-aware 3D point cloud target detection method (Density-Net), the overall structure of which is shown in FIG. 1. Density-Net is a PointNet-based 3D target detection model that introduces three new modules:
(1) a data augmentation module based on DBSCAN clustering;
(2) a mask sampling module based on U-Net;
(3) a feature fusion module based on compression-activation.
Data augmentation module based on DBSCAN clustering:
The lidar rotates to collect laser signals returned from points at different distances, forming a 360-degree point cloud. According to the different distances between targets and the lidar sensor, the DBSCAN clustering data augmentation module sets different clustering radius parameters and clusters the point cloud, encloses each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and down-samples the point set inside the box to obtain the augmented point cloud data. The specific procedure is as follows:
the algorithm is as follows: DBSCAN data augmentation
Inputting:
DBSCAN clustering radius _ scales [ s ]1,...,sM1}
Scaling factor s of cluster radius1sM1Down-sampling coefficient dsscales={ds1,...,dsM2},ds∈[0,1]0 indicates total deletion, and 1 indicates total dot retention
Point cloud set P ═ { P ═ P1,...,pN}
Real 3D stereo frame tag G ═ { G) of object in point cloud1,....,gN}
And (3) outputting:
the point cloud set C after data augmentation is { C ═ C1,...,cN}
Training process:
(1) Only the foreground part of the 360-degree point cloud (i.e., the driver's field of view) is kept, and the points are divided into several sectors according to their distance from the lidar: (0, 20m], (20m, 40m], and (40m, +∞).
(2) The clustering radii r_scales are used as the DBSCAN radius and the minimum number of points per cluster in DBSCAN is set to 5; DBSCAN clustering is then performed to obtain the cluster assignments of the points within each sector range. For each set of points of the same class, a 3D box enclosing all points of that class is constructed, giving the minimum bounding box. If the center coordinate of the bounding box lies within the current sector, the minimum bounding box is retained; otherwise it is discarded. Processing the clustered points in this way yields the set of 3D boxes db_boxes.
(3) The intersection-over-union (IoU, i.e., the Jaccard coefficient) between each box in the set db_boxes obtained in step (2) and the ground-truth 3D boxes gt_boxes of the target instances in the point cloud is computed. If the Jaccard coefficient exceeds 0.01, the 3D box is kept in saved_boxes and the points inside it are removed from the point cloud formed by the points collected by the sensor; the points in saved_boxes are then down-sampled in turn according to the down-sampling coefficients ds_scales, and the points retained after down-sampling are added back into the point cloud.
(4) This yields the augmented point cloud C, which is used as the input of the model. The result of point cloud down-sampling is shown in FIG. 2; depending on the size and position of the boxes generated by DBSCAN clustering, the down-sampling is either complete or partial. In this way more point cloud distributions of distant instances can be simulated, the number of samples that can participate in training is increased, and the robustness of the model is improved.
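As a concrete illustration, the following is a minimal Python sketch of the augmentation procedure described above. It assumes axis-aligned bounding boxes, the scikit-learn DBSCAN implementation, and illustrative per-sector clustering radii and down-sampling ratios; only the sector ranges, the minimum cluster size of 5, and the IoU threshold of 0.01 come from the description, and the helper names are hypothetical.

```python
# Minimal sketch of the DBSCAN-based augmentation; boxes are axis-aligned for
# simplicity, whereas the actual method may use oriented 3D boxes.
import numpy as np
from sklearn.cluster import DBSCAN

SECTORS = [(0.0, 20.0), (20.0, 40.0), (40.0, np.inf)]  # distance ranges from the lidar (from the text)
RADII   = [0.4, 0.8, 1.2]                               # assumed per-sector DBSCAN eps values
DS      = [0.3, 0.6, 1.0]                               # assumed per-sector keep ratios (ds coefficients)

def aabb(pts):
    """Axis-aligned minimum bounding box (x1, y1, z1, x2, y2, z2) of a point set."""
    return np.concatenate([pts.min(0), pts.max(0)])

def aabb_iou(a, b):
    """IoU of two axis-aligned 3D boxes in (x1, y1, z1, x2, y2, z2) form."""
    lo, hi = np.maximum(a[:3], b[:3]), np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda x: np.prod(x[3:] - x[:3])
    return inter / (vol(a) + vol(b) - inter + 1e-9)

def dbscan_augment(points, gt_boxes, iou_thr=0.01):
    """points: (N, 3) foreground point cloud; gt_boxes: iterable of axis-aligned gt boxes."""
    dist = np.linalg.norm(points[:, :2], axis=1)        # horizontal distance to the lidar
    keep_mask = np.ones(len(points), dtype=bool)
    extra = []
    for (lo, hi), eps, ds in zip(SECTORS, RADII, DS):
        sel = (dist >= lo) & (dist < hi)
        if sel.sum() < 5:
            continue
        labels = DBSCAN(eps=eps, min_samples=5).fit(points[sel]).labels_
        for c in set(labels) - {-1}:                     # -1 marks DBSCAN noise points
            cluster = points[sel][labels == c]
            box = aabb(cluster)
            # keep the cluster only if its box overlaps a ground-truth box (IoU > 0.01)
            if max((aabb_iou(box, g) for g in gt_boxes), default=0.0) <= iou_thr:
                continue
            # remove the original cluster points, then re-insert a down-sampled copy
            idx = np.where(sel)[0][labels == c]
            keep_mask[idx] = False
            n_keep = max(1, int(len(cluster) * ds))
            extra.append(cluster[np.random.choice(len(cluster), n_keep, replace=False)])
    return np.concatenate([points[keep_mask]] + extra, axis=0)
```

The axis-aligned boxes and fixed constants only illustrate the overall flow of clustering, IoU filtering, and re-insertion of down-sampled clusters.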
Mask sampling (MaskSample) module based on U-Net:
To balance performance against computational cost, the point cloud must be down-sampled. The current mainstream approach is farthest-distance sampling within a sphere, but it samples distant points poorly because they are sparsely distributed and few in number. Many points are discarded during down-sampling; this has little effect on target instances that are close to the sensor and contain many points, but a large effect on distant instances that contain few points, whose points may not be among those finally sampled. When the backbone network extracts semantic and spatial features from points sampled in this way, the features lack discriminative power, which reduces both the accuracy of the 3D box position regression predicted by the head network and the recall of target instance points.
In current single-stage 3D point cloud target detectors, point sampling is mostly based on farthest-distance sampling within a sphere: a sphere of known radius is constructed and farthest point sampling is performed on the points it contains. In regions with dense points this extracts many distinct, well-characterizing points; in regions with sparse points, however, the number of points is far smaller than the number of required samples, the usual remedy is repeated sampling, and the resulting points characterize the scene poorly. Statistics show that 12% of the points obtained by such resampling are repetitions, and points belonging to target instances account for only 33.28% of all sampled points.
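For reference, a minimal NumPy sketch of this baseline scheme (ball query with repetition padding, followed by farthest point sampling) is shown below; the function signatures and padding behaviour are illustrative assumptions rather than a specific existing implementation.

```python
# Baseline in-sphere farthest-distance sampling: ball query pads by repetition
# when a neighbourhood is sparse, which is the repetition problem noted above.
import numpy as np

def ball_query(points, center, radius, k):
    """Indices of up to k points within `radius` of `center`."""
    idx = np.where(np.linalg.norm(points - center, axis=1) < radius)[0]
    if len(idx) == 0:
        return np.full(k, -1)
    if len(idx) < k:                                   # sparse region: repeated sampling
        idx = np.concatenate([idx, np.random.choice(idx, k - len(idx))])
    return idx[:k]

def farthest_point_sampling(points, m):
    """Iteratively pick the point farthest from the already chosen set."""
    chosen = [0]
    d = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        chosen.append(int(d.argmax()))
        d = np.minimum(d, np.linalg.norm(points - points[chosen[-1]], axis=1))
    return np.asarray(chosen)
```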
The invention constructs a U-Net-based mask sampling module, whose overall structure is shown in the upper half of the network in FIG. 1. It is a binary classification network composed of a feature encoding module and a feature decoding module: it outputs, for each point, a prediction score for whether the point lies inside a ground-truth 3D box, and then samples points in descending order of score (i.e., selects the N points with the highest scores). The structure of the mask sampling module is as follows. The spatial features and semantic features are first concatenated along the feature dimension, and the resulting feature map is fed into the encoder. The encoder consists of two 3x3 convolutions, each followed by batch normalization (BN), a rectified linear unit (ReLU), and a 2x2 max-pooling down-sampling with stride 2, producing down-sampled feature maps at the different stages. The decoder consists of two 2x2 transposed convolutions and 3x3 convolution units, each 3x3 convolution followed by batch normalization (BN) and a ReLU, producing up-sampled feature maps at the different stages.
The processing flow of the mask sampling module is as follows. The concatenated spatial and semantic features are the input of the first 3x3 convolution in the encoder; the output features undergo batch normalization (BN), ReLU, and a 2x2 max-pooling with stride 2, and this operation is repeated twice to obtain the encoder output. The encoder output feature map is the input of a 2x2 transposed convolution in the decoder; its output features are concatenated with the encoder features of the same stage (the width, height, and channel dimensions of the feature maps at the same stage are identical) and fed into a 3x3 convolution, and this operation is repeated twice to obtain the decoder output. Finally, a 1x1 convolution maps to the prediction scores, and the first 1024 points are selected according to the prediction score of each candidate point.
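A minimal PyTorch sketch of a mask sampling module with this two-stage encoder/decoder layout might look as follows; the channel widths, the assumption that the point features are arranged as a 2D map with height and width divisible by 4, and the class name are illustrative choices, not values fixed by the description.

```python
# U-Net style mask sampling sketch: two encoder stages (3x3 conv + BN + ReLU +
# 2x2 max-pool), two decoder stages (2x2 transposed conv + skip concat + 3x3
# conv + BN + ReLU), a 1x1 conv scoring every point, and top-k selection.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # 3x3 convolution followed by batch normalization and ReLU
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class MaskSample(nn.Module):
    def __init__(self, cin=16, n_keep=1024):
        super().__init__()
        self.n_keep = n_keep
        self.enc1, self.enc2 = conv_block(cin, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.up2 = nn.ConvTranspose2d(64, 64, 2, stride=2)
        self.dec2 = conv_block(64 + 64, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(32 + 32, 32)
        self.score = nn.Conv2d(32, 1, 1)          # foreground/background score per point

    def forward(self, feat):                      # feat: (B, C, H, W), H*W >= n_keep, H and W divisible by 4
        e1 = self.enc1(feat)                      # stage-1 encoder features (full resolution)
        e2 = self.enc2(self.pool(e1))             # stage-2 encoder features (1/2 resolution)
        bottom = self.pool(e2)                    # encoder output (1/4 resolution)
        d2 = self.dec2(torch.cat([self.up2(bottom), e2], dim=1))  # skip connection, stage 2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))      # skip connection, stage 1
        scores = self.score(d1).flatten(1)        # (B, H*W) prediction score for every point
        keep = scores.topk(self.n_keep, dim=1).indices             # indices of the sampled points
        return scores, keep
```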
Residual structure based compression-activated feature fusion (Density-SA) module:
The set abstraction module in PointNet++ consists mainly of three layers: a sampling layer, a grouping layer, and a PointNet layer. The goal here is to obtain better high-level semantic features while preserving more accurate spatial features, so that more discriminative information is available for classification and position regression. Inspired by the compression-activation (squeeze-and-excitation) architecture, a new compression-activation feature fusion module based on a residual structure, named Density-SA, is designed; its structure is shown in FIG. 3. First, the semantic features and spatial information in the initial features are decoupled, i.e., the spatial position information of the points is separated from their semantic information; each is then fed into a 1x1 convolution to obtain the compressed spatial and semantic features, every 1x1 convolution being followed by batch normalization (BN) and ReLU. The spatial and semantic compressed features are passed through a Sigmoid function to obtain a spatial attention map and a semantic attention map, which are added element-wise to obtain the final compressed attention map. The compressed attention map is multiplied element-wise with the initial features to obtain the activated feature map, and the initial features are fed into a 1x1 convolution that adjusts their dimensions to match those of the activated feature map. Finally, the activated feature map and the adjusted initial features are added element-wise to obtain the final fused feature map.
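A minimal PyTorch sketch of such a compression-activation fusion block is given below; it assumes the first three feature channels carry the point coordinates and the remaining channels the semantic features, and the channel widths and exact placement of BN/ReLU are illustrative assumptions.

```python
# Compression-activation (Density-SA style) fusion sketch with a residual branch.
import torch
import torch.nn as nn

class DensitySA(nn.Module):
    def __init__(self, c_in, c_spatial=3):
        super().__init__()
        def squeeze(c):      # 1x1 conv + BN + ReLU compressing one branch to c_in channels
            return nn.Sequential(nn.Conv1d(c, c_in, 1),
                                 nn.BatchNorm1d(c_in), nn.ReLU(inplace=True))
        self.sq_spatial = squeeze(c_spatial)
        self.sq_semantic = squeeze(c_in - c_spatial)
        self.adjust = nn.Conv1d(c_in, c_in, 1)    # 1x1 conv adjusting the initial features
        self.c_spatial = c_spatial

    def forward(self, feat):                      # feat: (B, C, N) features of the sampled points
        xyz, sem = feat[:, :self.c_spatial], feat[:, self.c_spatial:]
        # spatial and semantic attention maps, added element-wise
        att = torch.sigmoid(self.sq_spatial(xyz)) + torch.sigmoid(self.sq_semantic(sem))
        activated = att * feat                    # element-wise activation of the initial features
        return activated + self.adjust(feat)      # residual fusion
```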
The whole process of the invention is as follows:
First, the laser reflection signals of the target instances are acquired by the lidar sensor and a point cloud data structure is constructed.
The input point cloud is processed by the DBSCAN clustering augmentation method to obtain augmented point cloud data, increasing the amount of point cloud data available for training.
The point cloud is fed into the backbone network, where the mask sampling (MaskSample) module performs the point sampling. The sampled point cloud passes through the residual-structure compression-activation feature fusion (Density-SA) module to extract fused spatial and semantic features, which are fed into an anchor-free regression head network (3DSSD, CVPR 2020).
The anchor-free regression head network predicts the position and class of the 3D boxes in the point cloud, and the whole model supervises the training of the network by computing the target class cross entropy loss (Cross Entropy), the spatial position smooth L1 loss (Smooth L1), the target size smooth L1 loss (Smooth L1), the mask sampling binary classification cross entropy loss (Binary Cross Entropy), and the angle regression loss. The angle regression loss is as follows:
L_angle = L_c(d_c, t_c) + D(d_r, t_r)
The 360 degrees are first divided uniformly into 12 interval categories; the interval category corresponding to the angle is predicted first, and then the distance from the interval center, i.e., the residual, is predicted. Here d_c and d_r denote the predicted angle category and the corresponding residual, and t_c and t_r are their target values. L_c denotes the cross entropy loss (Cross Entropy) and D denotes the smooth L1 loss (Smooth L1).
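A minimal PyTorch sketch of this bin-based angle loss (cross entropy over 12 bins plus smooth L1 on the within-bin residual) might look as follows; the tensor shapes and the residual parameterization relative to the bin centre are assumptions for illustration.

```python
# Bin-based angle regression loss: classify the orientation bin, regress the residual.
import math
import torch
import torch.nn.functional as F

NUM_BINS = 12
BIN_SIZE = 2 * math.pi / NUM_BINS

def angle_loss(bin_logits, res_pred, gt_angle):
    """bin_logits: (B, 12) predicted bin scores; res_pred: (B, 12) per-bin residuals;
    gt_angle: (B,) ground-truth yaw in [0, 2*pi)."""
    t_c = (gt_angle / BIN_SIZE).long().clamp(max=NUM_BINS - 1)    # target bin index
    t_r = gt_angle - (t_c.float() + 0.5) * BIN_SIZE               # residual from the bin centre
    d_r = res_pred.gather(1, t_c.unsqueeze(1)).squeeze(1)         # predicted residual of the target bin
    return F.cross_entropy(bin_logits, t_c) + F.smooth_l1_loss(d_r, t_r)
```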
We performed extensive experiments to evaluate Density-Net. The model was trained on KITTI, a widely used 3D outdoor autonomous-driving target detection dataset, and tested on the KITTI validation set. KITTI contains 7481 training samples and 7518 test samples; the 7481 training samples are further divided into a training set (3712 samples) and a validation set (3769 samples). We mainly evaluate the model on the car class and follow the official KITTI evaluation protocol: a target is considered correctly detected if its spatial IoU with a ground-truth 3D box exceeds 0.7 and the category is correct.
Table 1 shows the results of comparative experiments for each module
[Table 1: module-by-module comparison; provided as an image in the original document.]
Table 1 compares the effect of each module of our model. The results show that the DBSCAN clustering data augmentation provided by the invention increases the sparse, low-density regions in the point cloud, simulates the point cloud distribution of distant targets, and improves model performance; the proposed mask sampling module performs repetition-free down-sampling of the point cloud by ranking the values in the mask from high to low; and the proposed Density-SA module yields optimized, more discriminative features. The baseline performance is improved on the KITTI dataset, demonstrating the effectiveness of the method.
Table 2 compares the invention with other mainstream methods on the KITTI test dataset; the detection results on the KITTI dataset show that the invention can accurately detect and recognize distant targets.
Table 2 shows the performance comparison of Density-Net and other methods on the KITTI test set
[Table 2: comparison with other methods on the KITTI test set; provided as an image in the original document.]
Where '-' indicates that this category was not tested.
Although specific embodiments of the invention have been disclosed for purposes of illustration, and for purposes of aiding in the understanding of the contents of the invention and its implementation, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A single-stage 3D point cloud target detection method comprises the following steps:
training the backbone network:
for each 3D point cloud in the training sample set, a data augmentation module sets different clustering radius parameters according to the different distances between the target instances in the 3D point cloud and the sampling sensor that acquired it, clusters the 3D point cloud, encloses each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and puts the minimum 3D bounding boxes into a set of 3D boxes db_boxes; the intersection-over-union (IoU) between each minimum bounding box in db_boxes and the ground-truth 3D boxes gt_boxes of the corresponding target instances in the point cloud is computed; if the IoU is larger than a set threshold, the corresponding minimum bounding box is stored in a saved_boxes set and the points inside it are removed from the 3D point cloud; the points in the saved_boxes set are then down-sampled to obtain the augmented point cloud data;
the backbone network samples the augmented point cloud data by farthest-distance sampling within a sphere, extracts features of the sampled points, and feeds the extracted semantic and spatial features into a mask sampling module of the backbone network;
the mask sampling module predicts a score for each point from the input spatial and semantic features, samples points in descending order of score, and feeds them into a feature fusion module of the backbone network;
the feature fusion module decouples the features of the input sampled points into semantic features and spatial information; the decoupled semantic features are convolved to obtain compressed semantic features, which are passed through a Sigmoid function to obtain a semantic attention map; the decoupled spatial information is convolved to obtain compressed spatial features, which are passed through a Sigmoid function to obtain a spatial attention map; the semantic and spatial attention maps are added element-wise to obtain a compressed attention map; the compressed attention map is multiplied element-wise with the features of the input sampled points to obtain an activated feature map; the dimensions of the feature map of the input sampled points are adjusted to match those of the activated feature map, and the two maps are added element-wise to obtain a fused feature map;
the regression prediction network of the backbone network predicts the position and category of the target instances in the point cloud from the fused feature map; the parameters of the backbone network are then updated based on the prediction results and the set loss function;
an application stage:
the 3D point cloud to be processed is input into the trained backbone network; the backbone network samples the 3D point cloud data to be processed by farthest-distance sampling within a sphere, extracts features of the sampled points, and processes the extracted semantic and spatial features through the mask sampling module and the feature fusion module in sequence to obtain a fused feature map; the regression prediction network then predicts from the fused feature map to obtain the position of the targets in the 3D point cloud to be processed.
2. The method of claim 1, wherein the mask sampling module is a binary classification network composed of a feature encoding module and a feature decoding module.
3. The method according to claim 2, wherein the mask sampling module concatenates the input spatial and semantic features along the feature dimension and feeds the resulting feature map into the feature encoding module for down-sampling, obtaining down-sampled feature maps at different stages; the down-sampled feature map of the last stage is fed into the feature decoding module for up-sampling, the up-sampled feature map obtained at each stage is concatenated with the down-sampled feature map of the same stage before the next stage of up-sampling, the score of each point is predicted from the concatenation of the last-stage up-sampled feature map and the first-stage down-sampled feature map, and points are then sampled in descending order of score.
4. The method according to claim 2 or 3, wherein the feature encoding module comprises two 3x3 convolution units connected in sequence, each 3x3 convolution unit being followed by a batch normalization unit, a rectified linear unit (ReLU), and a max-pooling down-sampling unit; the feature decoding module comprises two 2x2 transposed convolution units and two 3x3 convolution units: the first 2x2 transposed convolution unit processes the input of the feature decoding module and feeds it into the first 3x3 convolution unit, the output of the first 3x3 convolution unit passes through a batch normalization unit and a ReLU unit and is fed into the second 3x3 convolution unit, and the output of the second 3x3 convolution unit passes through a batch normalization unit and a ReLU unit and is fed into the second 2x2 transposed convolution unit.
5. The method of claim 1, wherein the clustering operation is a DBSCAN clustering operation.
6. The method of claim 1, wherein the sampling sensor is a lidar sensor.
7. The method of claim 1, wherein the regression prediction network is an anchor-free regression head network.
8. A single stage 3D point cloud target detection apparatus, the apparatus comprising:
the data augmentation module, which sets different clustering radius parameters according to the different distances between the target instances in the 3D point cloud and the sampling sensor that acquired it, clusters the 3D point cloud, encloses each cluster of points of the same class in a bounding box to obtain a minimum 3D bounding box, and puts the minimum 3D bounding boxes into a set of 3D boxes db_boxes; the intersection-over-union (IoU) between each minimum bounding box in db_boxes and the ground-truth 3D boxes gt_boxes of the corresponding target instances in the point cloud is computed; if the IoU is larger than a set threshold, the corresponding minimum bounding box is stored in a saved_boxes set and the points inside it are removed from the 3D point cloud; the points in the saved_boxes set are then down-sampled to obtain the augmented point cloud data;
the feature extraction module, which samples the augmented point cloud data by farthest-distance sampling within a sphere, extracts features of the sampled points, and feeds the extracted semantic and spatial features into the mask sampling module;
the mask sampling module, which predicts a score for each point from the input spatial and semantic features, samples points in descending order of score, and feeds them into the feature fusion module;
the feature fusion module, which decouples the features of the input sampled points into semantic features and spatial information; the decoupled semantic features are convolved to obtain compressed semantic features, which are passed through a Sigmoid function to obtain a semantic attention map; the decoupled spatial information is convolved to obtain compressed spatial features, which are passed through a Sigmoid function to obtain a spatial attention map; the semantic and spatial attention maps are added element-wise to obtain a compressed attention map; the compressed attention map is multiplied element-wise with the features of the input sampled points to obtain an activated feature map; the dimensions of the feature map of the input sampled points are adjusted to match those of the activated feature map, and the two maps are added element-wise to obtain a fused feature map;
and the regression prediction network, which predicts the position and category of the targets in the point cloud from the fused feature map.
9. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202111271651.0A 2021-10-29 2021-10-29 Single-stage 3D point cloud target detection method and device, computer equipment and medium Pending CN114155524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111271651.0A CN114155524A (en) 2021-10-29 2021-10-29 Single-stage 3D point cloud target detection method and device, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111271651.0A CN114155524A (en) 2021-10-29 2021-10-29 Single-stage 3D point cloud target detection method and device, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN114155524A true CN114155524A (en) 2022-03-08

Family

ID=80459112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111271651.0A Pending CN114155524A (en) 2021-10-29 2021-10-29 Single-stage 3D point cloud target detection method and device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN114155524A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627346A (en) * 2022-03-15 2022-06-14 电子科技大学 Point cloud data down-sampling method capable of retaining important features

Similar Documents

Publication Publication Date Title
CN111652217B (en) Text detection method and device, electronic equipment and computer storage medium
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN111126202A (en) Optical remote sensing image target detection method based on void feature pyramid network
CN110675408A (en) High-resolution image building extraction method and system based on deep learning
CN110210431B (en) Point cloud semantic labeling and optimization-based point cloud classification method
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN112085072B (en) Cross-modal retrieval method of sketch retrieval three-dimensional model based on space-time characteristic information
CN112905828B (en) Image retriever, database and retrieval method combining significant features
CN114049356B (en) Method, device and system for detecting structure apparent crack
CN113012177A (en) Three-dimensional point cloud segmentation method based on geometric feature extraction and edge perception coding
CN114463736A (en) Multi-target detection method and device based on multi-mode information fusion
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN114155524A (en) Single-stage 3D point cloud target detection method and device, computer equipment and medium
CN117011274A (en) Automatic glass bottle detection system and method thereof
CN113408651B (en) Unsupervised three-dimensional object classification method based on local discriminant enhancement
CN116071557A (en) Long tail target detection method, computer readable storage medium and driving device
CN114494893B (en) Remote sensing image feature extraction method based on semantic reuse context feature pyramid
CN115937520A (en) Point cloud moving target segmentation method based on semantic information guidance
CN114882490A (en) Unlimited scene license plate detection and classification method based on point-guided positioning
CN114241470A (en) Natural scene character detection method based on attention mechanism
CN112598055A (en) Helmet wearing detection method, computer-readable storage medium and electronic device
CN113569600A (en) Method and device for identifying weight of object, electronic equipment and storage medium
Zhou et al. FENet: Fast Real-time Semantic Edge Detection Network
CN113822375B (en) Improved traffic image target detection method
CN115359346B (en) Small micro-space identification method and device based on street view picture and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination