CN113706480B - Point cloud 3D target detection method based on key point multi-scale feature fusion - Google Patents

Point cloud 3D target detection method based on key point multi-scale feature fusion Download PDF

Info

Publication number
CN113706480B
CN113706480B CN202110928928.6A
Authority
CN
China
Prior art keywords
point
voxel
point cloud
feature
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110928928.6A
Other languages
Chinese (zh)
Other versions
CN113706480A (en)
Inventor
张旭
柏琳娟
杨艳
廖敏
张振杰
冯梅
李济
万勤
苟宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Productivity Promotion Center
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing Productivity Promotion Center
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Productivity Promotion Center, Chongqing University of Post and Telecommunications filed Critical Chongqing Productivity Promotion Center
Priority to CN202110928928.6A priority Critical patent/CN113706480B/en
Publication of CN113706480A publication Critical patent/CN113706480A/en
Application granted granted Critical
Publication of CN113706480B publication Critical patent/CN113706480B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of 3D target detection, and particularly relates to a point cloud 3D target detection method based on key point multi-scale feature fusion, which comprises the following steps: acquiring the point cloud data to be detected at the current moment, and inputting the acquired point cloud data into a trained point cloud 3D target detection model to obtain a target detection result. The extraction algorithms for the distance-sampling global features and the feature-sampling global features are improved in the point cloud 3D target detection model, so that target detection efficiency and accuracy are improved. The invention adds a feature farthest point sampling sequence extraction module, which applies feature-based farthest point sampling to the different voxel sparse convolution layers to obtain features of different scales and reduces the influence of background points on target detection.

Description

Point cloud 3D target detection method based on key point multi-scale feature fusion
Technical Field
The invention belongs to the field of 3D target detection, and particularly relates to a point cloud 3D target detection method based on key point multi-scale feature fusion.
Background
With the rapid development of 3D scene acquisition technologies, 3D detectors such as 3D scanners, radar detectors, and depth cameras have become cheaper and better, which paves the way for the large-scale use of 3D detectors in the field of autonomous driving, and laser radar (LiDAR) sensors have come into wide use. The large-scale data collected with a LiDAR sensor is referred to as a point cloud; the data typically include the three-dimensional coordinates of surrounding objects located by the emitted laser beams, together with the return intensity of each beam.
In recent years, two-dimensional (2D) object detection under camera systems has achieved extraordinary success, but object detection from images also has problems: image quality is limited by the weather, the environment, and the lighting at acquisition time. LiDAR, in contrast, is insensitive to changes in weather, environment and lighting; its beams easily penetrate rain, fog and dust, and it works both day and night, even under glare and shadow.
Point cloud-based object detection methods have been extensively studied. The representative VoxelNet network eliminates manual feature engineering for 3D point clouds by unifying feature extraction and target box prediction into a single-stage, end-to-end trainable deep network: the point cloud is divided into equally spaced 3D voxels, the group of points inside each voxel is converted into a uniform feature representation by a voxel feature encoding layer, and this representation is fed to a region proposal network to generate candidate boxes. SECOND adds 3D sparse convolution on top of VoxelNet to avoid diffusing 3D convolution features through the empty voxels that arise when the voxelization interval is too small.
Another representative line of work, PointNet, directly extracts features from unordered points with a neural network: it takes the raw point cloud as input and uses a multilayer perceptron to map low-dimensional features into a high-dimensional feature space to ensure the translation invariance of the network. F-PointNet was the first to apply PointNet to three-dimensional target detection, cropping the point cloud with two-dimensional image bounding boxes; 3DSSD selects keypoint samples by feature-distance-based sampling from the point cloud to classify and regress the target boxes respectively.
Although these methods have made remarkable progress, their detection accuracy is not high when they are applied to sparse point cloud target detection scenes. The main reasons are: (1) a large number of background points are retained when the point cloud is sampled; (2) the information distinguishing foreground points from background points is not exploited; (3) the correlation between features of different scales is ignored, which hurts detection accuracy on severely occluded objects.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a point cloud 3D target detection method based on key point multi-scale feature fusion, which comprises the following steps: acquiring point cloud data to be detected at the current moment, and inputting the acquired point cloud data into a trained point cloud 3D target detection model to obtain a target detection result;
the process of training the point cloud 3D target detection model comprises the following steps:
S1: acquiring original point cloud data, and selecting from the original point cloud data with a farthest distance sampling method to obtain a point cloud sequence;
S2: dividing the original point cloud data into voxel blocks with equal intervals, and extracting the initial features of the voxel blocks;
S3: inputting the point cloud sequence and the initial features of the voxel blocks into a 3D sparse convolutional neural network to obtain a voxel feature space; mapping the position information of the keypoints in the point cloud sequence to the voxel feature space at the corresponding position of each sparse convolution layer, and updating the keypoint position information;
S4: extracting the features of the keypoints in each layer's voxel feature space with a distance farthest point sampling sequence extraction method to obtain the distance-sampling local features of the point cloud sequence;
S5: sampling the distance-sampling local features of the point cloud sequence with a feature farthest point sampling method to obtain the local feature keypoint features;
S6: fusing the distance-sampling local features of each sparse convolution layer with a fusion strategy to obtain the distance-sampling global features; fusing the local feature keypoint features of each sparse convolution layer with a fusion strategy to obtain the feature-sampling global features;
S7: converting the voxel feature space into a 2D bird's-eye view, and extracting dense features of the bird's-eye view by bilinear interpolation; processing the dense features with a region feature extraction method to generate 3D suggestion frames;
S8: performing region-of-interest pooling on the distance-sampling global features and the feature-sampling global features according to the 3D suggestion frames to obtain the target detection result;
S9: calculating the loss function of the model from the obtained result, adjusting the parameters of the model, and finishing training when the loss function is minimal.
After the target detection result is obtained, the grid points under the distance keypoints and the feature keypoints are updated according to the detection result to obtain the regression target frame and the classification target frame for the next round of target detection.
Preferably, the process of distance feature sampling of the original point cloud data includes: randomly initializing a point in the original point cloud data, and acquiring distance key points from all the point cloud data by taking the point as the initial point and adopting a distance farthest point sampling method to obtain a point cloud sequence.
Further, the spatial distance between two points in the point cloud sequence is measured as:

D-Distance(X, Y) = Sqrt( Σ_k (x_k − y_k)² )

where D-Distance denotes the L2 distance between the two points, X and Y denote the coordinates and reflection intensities of the two points, and Sqrt denotes the non-negative square root function.
Preferably, the process of extracting the initial features of a voxel block includes: equally dividing the input point cloud into voxel blocks with equal intervals, the length, width and height of each voxel block being L, W and H respectively; and computing the mean coordinates and the mean reflection intensity of the points in each voxel block, which are taken as the initial feature of that voxel block.
Preferably, the process of acquiring the voxel feature space includes: allocating a buffer in advance according to the number of divided voxel blocks; traversing the point cloud sequence, assigning each point to its associated voxel, and storing the voxel coordinates and the number of points of each voxel; building a hash table during the traversal and checking through it whether a voxel already contains points; if the voxel associated with a point exists, incrementing the number of points in that voxel by one, and if it does not exist, selecting another point for the query; obtaining the actual number of voxels from the coordinates of all voxels and the number of points in each voxel; inspecting the obtained voxels and deleting all empty voxels to obtain dense voxels; and performing convolution on the dense voxels with GEMM to obtain the voxel feature space.
Preferably, the process of obtaining the distance sampling local features of the point cloud sequence includes: obtaining the position information of a key point dp through distance sampling, mapping the position information of the key point dp to a voxel characteristic space of a corresponding position of each sparse convolution according to the index of the key point position information so as to ensure that the key point has one and only one corresponding voxel at different layers, and updating the position information of the key point according to the characteristic of the voxel; abstracting each voxel into a point, and extracting the voxel characteristics by adopting a PointNet + + sequence extraction method to obtain the characteristics of distance key points after sparse convolution; and fusing the features subjected to sparse convolution by adopting a local feature fusion strategy to obtain the distance sampling local features.
Preferably, the process of obtaining the local feature keypoint features includes: mapping the position information of the keypoints dp obtained by distance sampling to the voxel feature space at the corresponding position of each sparse convolution layer, so that every keypoint has one and only one corresponding voxel at each layer; obtaining a feature keypoint sequence fp of length q by feature farthest point sampling, which satisfies the constraint fp ⊆ dp; abstracting each voxel into a point and extracting the voxel features with the PointNet++ sequence extraction method to obtain the features of the feature keypoints after sparse convolution; and fusing the sparse-convolved features with the feature fusion formula to obtain the local feature keypoint features.
Preferably, the dense features are obtained from the bird's-eye-view features by bilinear interpolation: the voxel feature space is projected along the Z axis onto a 2D bird's-eye view, and the interpolation uses the neighbouring voxel features:

f(x, y) ≈ [ f(Q11)(x2 − x)(y2 − y) + f(Q21)(x − x1)(y2 − y) + f(Q12)(x2 − x)(y − y1) + f(Q22)(x − x1)(y − y1) ] / [ (x2 − x1)(y2 − y1) ]

where f(x, y) denotes the feature at the current interpolation coordinate, x denotes the abscissa of the point, y denotes the ordinate of the point, and f(Q11), f(Q21), f(Q12), f(Q22) denote the features of the neighbouring voxels Q11 = (x1, y1), Q21 = (x2, y1), Q12 = (x1, y2), Q22 = (x2, y2).
Preferably, the fusion strategy comprises a feature key point fusion strategy and a feature splicing strategy of the feature key points;
the feature key point fusion strategy is as follows:
fp = fp_conv1 ∪ fp_conv2 ∪ fp_conv3 ∪ fp_conv4 ∪ fp_bev

the characteristic splicing of the feature keypoints is:

ff = concat( f_fp^conv1, f_fp^conv2, f_fp^conv3, f_fp^conv4, f_fp^bev )

where fp denotes the union of the feature sampling points of the sparse convolution layers, fp_conv1 denotes the feature keypoints sampled after the first sparse convolution layer, fp_bev denotes the feature keypoints sampled after the dense features obtained from the bird's-eye-view features by bilinear interpolation, ff denotes the global feature of the feature keypoints, f_fp^conv1 denotes the local keypoint features sampled after the first sparse convolution layer, and f_fp^bev denotes the local keypoint features sampled after the dense features obtained from the bird's-eye-view features by bilinear interpolation.
Preferably, the process of region-of-interest pooling of the distance-sampling global features and the feature-sampling global features includes: dividing the distance-sampling global features and the feature-sampling global features with the 3D suggestion frames, and generating 6 × 6 grid points at equal intervals inside each 3D suggestion frame; acquiring the features of the grid points from the keypoints by a sequence extraction operation; and obtaining the target frame regression result and the target frame classification prediction result from the features of the grid points.
Preferably, the loss function of the model comprises a suggestion-box generation network loss function and a network target-box loss function;
the expression of the proposed box to generate the network loss function is:
L_rpn = L_cls + Σ_{r ∈ {x, y, z, l, h, w, θ}} L_Smooth-L1( Δr̂_a, Δr_a )

wherein L_rpn represents the suggestion-box generation network loss function, L_cls represents the classification loss calculated with the Focal loss, x, y, z represent the three-dimensional coordinates of the target frame, l, h, w represent the length, width and height of the target frame, θ represents the direction angle of the target frame, L_Smooth-L1 represents the Smooth-L1 loss calculation method, Δr̂_a represents the classification prediction residual, and Δr_a represents the regression residual;
the expression of the network target box loss function is:
L_rcnn = L_iou + Σ_{r ∈ {x, y, z, l, h, w, θ}} L_Smooth-L1( Δr̂_p, Δr_p )

wherein L_rcnn represents the network target-box loss function, L_iou represents the loss between the prediction and truth boxes calculated with the Focal loss, Δr̂_p represents the predicted target frame residual, and Δr_p represents the regression residual.
The invention has the advantages that:
1) The method adds a feature farthest point sampling (Feature-FPS) sequence extraction module (a voxel set abstraction module), which applies feature-based farthest point sampling to the different voxel sparse convolution layers to obtain features of different scales and reduces the influence of background points on target detection;
2) The invention designs a key point-based multi-scale feature fusion method, which is used for carrying out 3D target detection on a point cloud scene and is beneficial to detecting samples which are difficult to detect.
Drawings
FIG. 1 is a network flow diagram of the present invention;
FIG. 2 is a model overview framework diagram of the present invention;
FIG. 3 is a graph of the test results of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the field of point cloud target detection, a point cloud scene contains tens of thousands of points, and directly using all of them for model prediction and regression wastes enormous resources and time. Most target detection algorithms iteratively apply the farthest point sampling (FPS) of PointNet++ to generate keypoints and use the adjacency between the keypoints and their surrounding points to generate feature vectors. However, points sampled by distance contain a large number of background points and lack useful foreground points. Keypoints that include background points may help the classification of the target box but harm the regression of the target. The design of the point selection strategy is therefore a key problem for improving target detection accuracy.
In the field of point cloud target detection, different models adopt different feature fusion strategies; the most common is to combine the features of all the different convolution layers, the features from different viewing angles, or the features obtained from different modalities. These approaches lack a strong rationale, and adopting their fusion strategies leads to a large amount of computation.
A point cloud 3D target detection method based on key point multi-scale feature fusion comprises the following steps: and acquiring point cloud data to be detected at the current moment, and inputting the acquired point cloud data into a trained point cloud 3D target detection model to obtain a target detection result.
As shown in fig. 1, the process of training the point cloud 3D target detection model includes:
S1: acquiring original point cloud data, and selecting from the original point cloud data with a farthest distance sampling method to obtain a point cloud sequence;
S2: dividing the original point cloud data into voxel blocks with equal intervals, and extracting the initial features of the voxel blocks;
S3: inputting the point cloud sequence and the initial features of the voxel blocks into a 3D sparse convolutional neural network to obtain a voxel feature space; mapping the position information of the keypoints in the point cloud sequence to the voxel feature space at the corresponding position of each sparse convolution layer, and updating the keypoint position information;
S4: extracting the features of the keypoints in each layer's voxel feature space with a distance farthest point sampling sequence extraction method to obtain the distance-sampling local features of the point cloud sequence;
S5: sampling the distance-sampling local features of the point cloud sequence with a feature farthest point sampling method to obtain the local feature keypoint features;
S6: fusing the distance-sampling local features of each sparse convolution layer with a fusion strategy to obtain the distance-sampling global features; fusing the local feature keypoint features of each sparse convolution layer with a fusion strategy to obtain the feature-sampling global features;
S7: converting the voxel feature space into a 2D bird's-eye view, and extracting dense features of the bird's-eye view by bilinear interpolation; processing the dense features with a region feature extraction method to generate 3D suggestion frames;
S8: performing region-of-interest pooling on the distance-sampling global features and the feature-sampling global features according to the 3D suggestion frames to obtain the target detection result;
S9: calculating the loss function of the model from the obtained result, adjusting the parameters of the model, and finishing training when the loss function is minimal.
The data set employed by the present invention is the KITTI data set, a computer vision algorithm evaluation data set widely used in the field of autonomous driving. The data set contains multiple tasks such as 3D object detection and multi-object tracking and segmentation. The 3D object detection benchmark consists of 7481 training images and 7518 test images with the corresponding point clouds. The training samples are divided into a training set (3712 samples) and a validation set (3769 samples).
One execution of the gradient descent algorithm over all the training data is called a round; the parameters of the model are updated every round, and the maximum number of rounds is set to 80. During the 80 training rounds, the model and parameters that achieve the least error on the test data set are saved.
The model structure comprises a raw point cloud data acquisition module, a feature farthest point sampling sequence extraction module, a farthest point sampling sequence extraction module, a 3D voxelization module, a 3D sparse convolution module, a bird's-eye-view projection and suggestion-frame generation module, and a region-of-interest pooling module; the connections between the modules are shown in fig. 2.
In the process of sampling key points of the input point cloud P by adopting a distance characteristic sampling method, a point cloud sequence is selected from the point cloud by using a distance farthest point sampling method (D-FPS). The measurement mode of the space distance is as follows:
D-Distance(X_i, Y_j) = Sqrt( Σ_k (x_k − y_k)² ), X_i ∈ P, Y_j ∈ P

where D-Distance represents the L2 distance between two points, X and Y represent the coordinates and reflection intensities of different points, Sqrt represents the mathematical square root, X_i represents any point of the point cloud P, Y_j represents any point of P other than X, i and j represent the indices of the points, P represents the point cloud, and the summation runs over the spatial dimensions of the point cloud.
After the D-FPS computation, a distance keypoint sequence dp = {p_1, p_2, p_3, …, p_p} of length p is obtained.
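A minimal NumPy sketch of the distance farthest point sampling described above (the function name and the choice of a random starting point are illustrative assumptions, not taken verbatim from the patent):

```python
import numpy as np

def distance_fps(points, num_samples):
    """Greedy farthest point sampling on raw coordinates (D-FPS).

    points: (N, 3) array of XYZ coordinates.
    Returns the indices of num_samples keypoints that greedily
    maximise the L2 distance to the already selected set.
    """
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    dist_to_set = np.full(n, np.inf)          # distance to the sampled set so far
    selected[0] = np.random.randint(n)        # random initial point, as described
    for i in range(1, num_samples):
        diff = points - points[selected[i - 1]]
        d = np.sqrt((diff ** 2).sum(axis=1))  # Sqrt(sum (x_k - y_k)^2)
        dist_to_set = np.minimum(dist_to_set, d)
        selected[i] = int(dist_to_set.argmax())   # farthest point from the set
    return selected

# usage sketch: dp = points[distance_fps(points, p)] gives the distance keypoint sequence
```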
The input point cloud P is equally divided into voxel blocks of size L × W × H, where L, W and H denote the length, width and height of a voxel block. The means of the coordinates and reflection intensities of the points falling into each voxel block are used as the initial feature of that voxel block:

f_voxel = [ ΣX/N, ΣY/N, ΣZ/N, ΣR/N ]^T

where [X, Y, Z] denote the three-dimensional coordinates of the points falling within the voxel, R denotes the reflection intensities of those points, N denotes the number of points in the voxel, so that the entries are the mean three-dimensional coordinates and the mean reflection intensity of the points falling into the voxel, and T denotes the transpose.
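A sketch of this initial-feature computation (the voxel size used below is an illustrative assumption):

```python
import numpy as np

def voxel_initial_features(points, voxel_size=(0.05, 0.05, 0.1)):
    """points: (N, 4) array of [x, y, z, reflectance].
    Returns the integer coordinates of the non-empty voxels and the
    mean [x, y, z, r] of the points falling into each of them."""
    coords = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)  # group by voxel
    counts = np.bincount(inverse, minlength=uniq.shape[0])
    feats = np.zeros((uniq.shape[0], 4))
    for c in range(4):
        feats[:, c] = np.bincount(inverse, weights=points[:, c],
                                  minlength=uniq.shape[0]) / counts
    return uniq, feats   # feats = [mean X, mean Y, mean Z, mean R] per voxel
```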
The process of extracting point cloud features from the voxelized point cloud with a 3D sparse convolutional neural network comprises: pre-allocating a buffer according to the voxel number limit; traversing the point cloud, assigning each point to its associated voxel, and storing the coordinates of the voxels and the number of points in each voxel. A hash table is built during the iteration to check whether points already exist in a voxel; if the voxel associated with a point exists, the number of points in that voxel is incremented by one. The coordinates of all voxels and the number of points in each voxel finally give the actual number of voxels. Because of the sparsity of the point cloud, empty voxels are unavoidable; an aggregation operation over the sparse voxels (i.e. deleting the empty voxels) yields dense voxel features. GEMM-based convolution is then applied to the dense voxels to obtain dense output features, which are mapped to the sparse output features through the constructed input-output index rule matrix.
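The buffer-and-hash-table voxel assignment described above can be sketched as follows; the buffer limits (max_voxels, max_points) are illustrative assumptions:

```python
import numpy as np

def assign_points_to_voxels(points, voxel_size, max_voxels=20000, max_points=32):
    """Pre-allocate buffers, walk the point cloud once, and use a hash
    table (dict) from voxel coordinate to buffer slot; empty voxels are
    never materialised, so only the actual voxels are returned."""
    voxel_coords = np.zeros((max_voxels, 3), dtype=np.int64)          # pre-allocated buffer
    num_points = np.zeros(max_voxels, dtype=np.int64)
    voxel_points = np.zeros((max_voxels, max_points, points.shape[1]), dtype=points.dtype)
    table = {}                                                        # hash table: coord -> slot
    voxel_size = np.asarray(voxel_size)
    for p in points:
        key = tuple(np.floor(p[:3] / voxel_size).astype(np.int64))
        slot = table.get(key)
        if slot is None:                      # voxel seen for the first time
            if len(table) >= max_voxels:
                continue                      # buffer full, drop the point
            slot = len(table)
            table[key] = slot
            voxel_coords[slot] = key
        if num_points[slot] < max_points:
            voxel_points[slot, num_points[slot]] = p
            num_points[slot] += 1             # "number of points plus one"
    n = len(table)                            # actual voxel number
    return voxel_coords[:n], voxel_points[:n], num_points[:n]
```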
The position information of the keypoints obtained by distance sampling is indexed into the voxel feature space at the corresponding position of each sparse convolution layer, so that every keypoint has one and only one corresponding voxel at each layer. The keypoint features are then updated from the features of those voxels. Each voxel is regarded as a point, and the sequence extraction method proposed by PointNet++ is applied to aggregate the voxel-wise features.
The process of sampling the keypoints in each layer's voxel feature space by the distance farthest point sampling method to obtain the distance-sampling local features of the point cloud sequence comprises: the position information of the keypoints dp = {p_1, p_2, p_3, …, p_p} obtained by distance sampling is mapped by index to the voxel feature space at the corresponding position of each sparse convolution layer, so that every keypoint has one and only one corresponding voxel at each layer; the keypoint features are updated from the voxel features; each voxel is abstracted into a point, and the sequence extraction method proposed by PointNet++ is used to extract the voxel features, giving the features of the distance keypoints after sparse convolution; the sparse-convolved features of the distance keypoints are fused with a local feature fusion strategy to obtain the distance-sampling local features. The local feature fusion strategy gathers, for each distance keypoint, the kth-layer voxel features within a fixed radius:

f_i^(dp_k) = { f_j^(k) : ‖ v_j^(k) − p_i ‖ ≤ r_k, v_j^(k) ∈ M_k(dp) }

where f_i^(dp_k) denotes the features of the ith distance keypoint at the kth layer, f_j^(k) denotes the features of the kth-layer voxels after 3D sparse convolution, M_k(dp) denotes the mapping of the distance keypoints onto the kth-layer voxel space, and r_k denotes the fixed radius of feature extraction.
PointNet is used to generate the features of the distance keypoints after sparse convolution:

f'_i^(dp_k) = max{ G( S( f_i^(dp_k) ) ) }

where f'_i^(dp_k) denotes the sequence-extraction feature of the ith distance keypoint of the kth layer after the multi-layer sparse convolution, S(·) denotes randomly sampling a fixed number of distance keypoint neighbourhood features, f_i^(dp_k) denotes the features of the ith distance keypoint of the kth layer, G(·) denotes feature encoding with a multilayer perceptron, (l_k) denotes the kth layer, and max(·) denotes the maximum pooling function.
On the basis of the distance keypoints and their features, the feature keypoints are acquired by feature farthest point sampling. Specifically, one distance keypoint is randomly initialized, and starting from it a point cloud sequence is iteratively selected from the distance keypoints with the feature farthest point sampling method (F-FPS).
The feature-space distance is measured as:

F-Distance(X, Y) = Sqrt( Σ_k (f_X,k − f_Y,k)² )

where F-Distance represents the L2 feature distance between the keypoints of two feature samples, and X and Y represent the multi-scale features of different distance keypoints extracted by the sparse convolution sequence extraction. Feature farthest point sampling (F-FPS) then yields a feature keypoint sequence fp = {p_1, p_2, p_3, …, p_q} of length q.
The process of obtaining the feature-sampling local features from the feature keypoints comprises: the position information of the keypoints dp = {p_1, p_2, p_3, …, p_p} obtained by distance sampling is mapped by index to the voxel feature space at the corresponding position of each sparse convolution layer, so that every keypoint has one and only one corresponding voxel at each layer; feature farthest point sampling (F-FPS) then yields a feature keypoint sequence fp = {p_1, p_2, p_3, …, p_q} of length q satisfying the constraint fp ⊆ dp, i.e. the feature keypoints are a subset of the distance keypoints; each voxel is abstracted into a point, and the voxel features are extracted with the PointNet++ sequence extraction method to obtain the features of the feature keypoints after sparse convolution; the sparse-convolved features are fused with the feature fusion formula to obtain the local feature keypoint features. The feature fusion formula is:
f_i^(fp_k) = { f_j^(k) : ‖ v_j^(k) − p_i ‖ ≤ r_k, v_j^(k) ∈ M_k(fp) }

where f_i^(fp_k) denotes the features of the ith feature keypoint at the kth layer, f_j^(k) denotes the features of the kth-layer voxels after 3D sparse convolution, M_k(fp) denotes the mapping of the keypoints onto the kth-layer voxel space, and r_k denotes the fixed radius of feature extraction; PointNet is used to generate the features of the feature keypoints after sparse convolution.
The sequence-extraction features of the feature keypoints after the multi-layer sparse convolution are computed as:

f'_i^(fp_k) = max{ G( S( f_i^(fp_k) ) ) }

where f'_i^(fp_k) denotes the sequence-extraction feature of the ith feature keypoint of the kth layer after the multi-layer sparse convolution, S(·) denotes randomly sampling a fixed number of feature keypoint neighbourhood features, G(·) denotes feature encoding with a multilayer perceptron, and max(·) denotes the maximum pooling function.
The 3D voxel features after the multi-layer sparse convolution are projected along the Z axis into a 2D bird's-eye view, dense features are obtained from the bird's-eye-view features by bilinear interpolation, and the 3D suggestion frames are generated with the SECOND region feature extraction method. The bilinear interpolation is:

f(x, y) ≈ [ f(Q11)(x2 − x)(y2 − y) + f(Q21)(x − x1)(y2 − y) + f(Q12)(x2 − x)(y − y1) + f(Q22)(x − x1)(y − y1) ] / [ (x2 − x1)(y2 − y1) ]

where f(x, y) represents the feature at the current interpolation coordinate, x represents the abscissa of the point, y represents the ordinate of the point, and f(Q11), f(Q21), f(Q12), f(Q22) represent the features of the neighbouring voxels Q11 = (x1, y1), Q21 = (x2, y1), Q12 = (x1, y2), Q22 = (x2, y2).
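A sketch of reading a dense BEV feature at a keypoint's projected (x, y) location with this interpolation (pixel-coordinate convention is an assumption made for the example):

```python
import numpy as np

def bilinear_interpolate(bev, x, y):
    """bev: (H, W, C) bird's-eye-view feature map; (x, y) in pixel coordinates.
    Returns the feature interpolated from the four neighbouring cells
    Q11, Q21, Q12, Q22."""
    x1, y1 = int(np.floor(x)), int(np.floor(y))
    x2, y2 = min(x1 + 1, bev.shape[1] - 1), min(y1 + 1, bev.shape[0] - 1)
    q11, q21 = bev[y1, x1], bev[y1, x2]
    q12, q22 = bev[y2, x1], bev[y2, x2]
    wx, wy = x - x1, y - y1
    top = q11 * (1 - wx) + q21 * wx       # interpolate along x at y1
    bottom = q12 * (1 - wx) + q22 * wx    # interpolate along x at y2
    return top * (1 - wy) + bottom * wy   # interpolate along y
```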
The 3D proposal-frame RPN architecture consists of three stages. Each stage starts with a downsampling convolutional layer, and BatchNorm and ReLU layers follow every convolutional layer. The output of each stage is then upsampled to the same size and the feature maps are concatenated into one. Finally, a fully connected layer performs category prediction and position regression for each voxel location. The Top-k suggestion boxes with a high intersection area with the truth boxes and a high classification confidence are selected as candidate boxes.
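A PyTorch sketch of this three-stage proposal network; channel counts, kernel sizes and the anchor count are illustrative assumptions, and the 1 × 1 convolution heads act as per-location fully connected layers.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class BEVProposalNet(nn.Module):
    """Three downsampling stages; each output is upsampled back to a common
    size and concatenated, followed by class / box heads per location."""
    def __init__(self, cin=256, num_anchors=2, box_dim=7):
        super().__init__()
        self.stages = nn.ModuleList([conv_block(cin, 128, 1),
                                     conv_block(128, 128, 2),
                                     conv_block(128, 256, 2)])
        self.ups = nn.ModuleList([
            nn.Identity(),
            nn.ConvTranspose2d(128, 128, 2, stride=2),
            nn.ConvTranspose2d(256, 128, 4, stride=4)])
        self.cls_head = nn.Conv2d(128 * 3, num_anchors, 1)              # category confidence
        self.reg_head = nn.Conv2d(128 * 3, num_anchors * box_dim, 1)    # (x, y, z, l, h, w, theta)

    def forward(self, bev):
        feats, x = [], bev
        for stage, up in zip(self.stages, self.ups):
            x = stage(x)
            feats.append(up(x))
        fused = torch.cat(feats, dim=1)
        return self.cls_head(fused), self.reg_head(fused)
```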
Distance-sampling and feature-sampling keypoints are obtained, and the following feature fusion strategies are used respectively. For the distance keypoints, the features of each layer are concatenated; for the feature keypoints, the following notation denotes the feature keypoint sequences of the different layers. Taking 4 convolution layers and 1 projection layer as an example:

fp_conv1 = {p_1, p_2, p_3, …, p_q}
fp_conv2 = {p_1, p_2, p_3, …, p_q}
fp_conv3 = {p_1, p_2, p_3, …, p_q}
fp_conv4 = {p_1, p_2, p_3, …, p_q}
fp_bev = {p_1, p_2, p_3, …, p_q}
the feature key point fusion strategy is as follows:
fp = fp_conv1 ∪ fp_conv2 ∪ fp_conv3 ∪ fp_conv4 ∪ fp_bev

The characteristic splicing of the feature keypoints is:

ff = concat( f_fp^conv1, f_fp^conv2, f_fp^conv3, f_fp^conv4, f_fp^bev )

where fp represents the union of the feature sampling points of the sparse convolution layers, fp_conv1 represents the feature keypoints sampled after the first sparse convolution layer, fp_bev represents the feature keypoints sampled after the dense features obtained from the bird's-eye-view features by bilinear interpolation, ff represents the global feature of the feature keypoints, f_fp^conv1 represents the local keypoint features sampled after the first sparse convolution layer, and f_fp^bev represents the local keypoint features sampled after the dense features obtained from the bird's-eye-view features by bilinear interpolation.
The fusion strategy of the distance feature key points is the same as the fusion strategy of the feature key points.
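A compact sketch of this concatenation-based fusion; the assumption that all per-layer keypoint feature tensors share the same keypoint ordering is made here for illustration:

```python
import torch

def fuse_keypoint_features(layer_feats):
    """layer_feats: list of (Q, C_k) tensors, one per sparse convolution
    layer plus the BEV features, all indexed by the same keypoint sequence.
    Returns the concatenated global feature ff of shape (Q, sum_k C_k)."""
    return torch.cat(layer_feats, dim=1)

# usage sketch: ff = fuse_keypoint_features([f_conv1, f_conv2, f_conv3, f_conv4, f_bev])
# the same call yields df for the distance keypoints
```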
The distance-sampling features and the feature-sampling features are obtained through multi-scale keypoint feature fusion, and region-of-interest pooling is performed on the two different keypoint sequences. A multilayer perceptron (MLP) is applied to the obtained ff and df for class prediction and target box regression, respectively. Specifically, a 3-layer fully connected network predicts the confidence of the target frame class, another 3-layer fully connected network regresses the position of the target frame, and x, y, z, l, h, w and θ respectively denote the centre coordinates of the target frame, its length, width and height, and its direction angle in the bird's-eye view.
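A sketch of the region-of-interest grid-point generation; the 6 × 6 grid per proposal box follows the description, while the axis-aligned box and the function name are assumptions made to keep the example short (the real proposals also carry the orientation angle θ):

```python
import numpy as np

def roi_grid_points(box, grid=6):
    """box: (x, y, z, l, h, w) proposal centre and size (axis-aligned here).
    Returns grid*grid equally spaced grid points inside the box footprint;
    their features are then gathered from the keypoints by set abstraction."""
    x, y, z, l, h, w = box
    off = (np.arange(grid) + 0.5) / grid - 0.5        # cell centres in (-0.5, 0.5)
    gx, gy = np.meshgrid(off * l, off * w, indexing="ij")
    pts = np.stack([x + gx.ravel(), y + gy.ravel(),
                    np.full(grid * grid, z)], axis=1)
    return pts
```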
The loss function of the model comprises a suggestion box generation network loss function and a network target box loss function;
the expression of the proposed box to generate the network loss function is:
L_rpn = L_cls + Σ_{r ∈ {x, y, z, l, h, w, θ}} L_Smooth-L1( Δr̂_a, Δr_a )

wherein L_rpn represents the suggestion-box generation network loss function, L_cls represents the classification loss calculated with the Focal loss, x, y, z represent the three-dimensional coordinates of the target frame, l, h, w represent the length, width and height of the target frame, θ represents the direction angle of the target frame, L_Smooth-L1 represents the Smooth-L1 loss calculation method, Δr̂_a represents the classification prediction residual, and Δr_a represents the regression residual;
the expression of the network target box loss function is:
L_rcnn = L_iou + Σ_{r ∈ {x, y, z, l, h, w, θ}} L_Smooth-L1( Δr̂_p, Δr_p )

wherein L_rcnn represents the network target-box loss function, L_iou represents the loss between the prediction and truth boxes calculated with the Focal loss, Δr̂_p represents the predicted target frame residual, and Δr_p represents the regression residual.
Evaluation indexes: the average precision (mAP) with 40 recall positions is used to evaluate model performance on three difficulty levels: "easy", "medium" and "difficult". To evaluate the degree of overlap between the detected target box and the truth box, the same criteria as the official evaluation are used. Specifically, for cars the bounding box overlap required on easy, medium and difficult objects is 70%, 50% and 50%, respectively; for pedestrians and cyclists it is 50%, 25% and 25%, respectively.
As shown in FIG. 3, samples are selected from the KITTI test set and target detection is performed directly on them; the results show that the model detects targets well for occluded objects and difficult samples.
This embodiment is implemented in the Python programming language and can run on mainstream computer platforms. The operating system used in the implementation is CentOS 6.5; the CPU is required to be an Intel i7, the memory more than 16 GB, and the hard disk space more than 60 GB; the GPU is an NVIDIA GTX 1080Ti with 11 GB of video memory.
The present invention was carried out based on the PyTorch 1.0 framework.
The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present invention, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A point cloud 3D target detection method based on key point multi-scale feature fusion is characterized by comprising the following steps: acquiring point cloud data to be detected at the current moment, and inputting the acquired point cloud data into a trained point cloud 3D target detection model to obtain a target detection result;
the process of training the point cloud 3D target detection model comprises the following steps:
S1: acquiring original point cloud data, and selecting from the original point cloud data with a farthest distance sampling method to obtain a point cloud sequence;
S2: dividing the original point cloud data into voxel blocks with equal intervals, and extracting the initial features of the voxel blocks;
S3: inputting the point cloud sequence and the initial features of the voxel blocks into a 3D sparse convolutional neural network to obtain a voxel feature space; mapping the position information of the keypoints in the point cloud sequence to the voxel feature space at the corresponding position of each sparse convolution layer, and updating the keypoint position information;
S4: extracting the features of the keypoints in each layer's voxel feature space with a distance farthest point sampling sequence extraction method to obtain the distance-sampling local features of the point cloud sequence; obtaining the position information of the keypoints dp by distance sampling, mapping the position information of the keypoints dp to the voxel feature space at the corresponding position of each sparse convolution layer according to the index of the keypoint position information, so that every keypoint has one and only one corresponding voxel at each layer, and updating the keypoint position information according to the features of the voxels; abstracting each voxel into a point, and extracting the voxel features with the PointNet++ sequence extraction method to obtain the features of the distance keypoints after sparse convolution; fusing the sparse-convolved features with a local feature fusion strategy to obtain the distance-sampling local features;
S5: sampling the distance-sampling local features of the point cloud sequence with a feature farthest point sampling method to obtain the local feature keypoint features; mapping the position information of the keypoints dp obtained by distance sampling to the voxel feature space at the corresponding position of each sparse convolution layer, so that every keypoint has one and only one corresponding voxel at each layer; obtaining a feature keypoint sequence fp of length q by feature farthest point sampling, the feature keypoint sequence satisfying the constraint fp ⊆ dp; abstracting each voxel into a point, and extracting the voxel features with the PointNet++ sequence extraction method to obtain the features of the feature keypoints after sparse convolution; and fusing the sparse-convolved features with the feature fusion formula to obtain the local feature keypoint features;
s6: fusing the distance sampling local features of each sparse convolution layer by adopting a fusion strategy to obtain distance sampling global features; fusing the local feature key point features of each sparse convolution layer by adopting a fusion strategy to obtain feature sampling global features;
s7: converting the voxel characteristic space into a 2D aerial view, and extracting dense characteristics of the aerial view by adopting a bilinear interpolation method; processing the dense features by adopting a regional feature extraction method to generate a 3D suggestion frame;
s8: performing region-of-interest pooling on the distance sampling global features and the feature sampling global features according to the 3D suggestion frame to obtain a target detection result;
s9: and calculating a loss function of the model according to the obtained result, adjusting parameters of the model, and finishing the training of the model when the loss function is minimum.
2. The method for detecting a point cloud 3D target based on keypoint multi-scale feature fusion according to claim 1, wherein the process of distance feature sampling of the original point cloud data comprises: randomly initializing a point in the original point cloud data, and, taking this point as the initial point, acquiring distance keypoints from all the point cloud data with the distance farthest point sampling method to obtain the point cloud sequence; the formula of the distance farthest point sampling method is:

D-Distance(X_i, Y_j) = Sqrt( Σ_k (x_k − y_k)² ), X_i ∈ P, Y_j ∈ P

wherein D-Distance represents the L2 distance between two points, X and Y represent the coordinates and reflection intensities of the two points, Sqrt represents the non-negative square root function, X_i represents any point of the point cloud P, Y_j represents any point of P other than X, i and j represent the indices of the points, P represents the point cloud, and the summation runs over the spatial dimensions of the point cloud.
3. The method for detecting the point cloud 3D target based on the key point multi-scale feature fusion as claimed in claim 1, wherein the process of extracting the initial features of the voxel block comprises: equally dividing the input point cloud into voxel blocks with equal intervals, wherein the length, width and height of each voxel block are L, W and H respectively; and calculating the distance average value and the reflection intensity average value of each point in each voxel block, and taking the distance average value and the reflection intensity average value of each point as the initial characteristics of the voxel block.
4. The method for detecting a point cloud 3D target based on keypoint multi-scale feature fusion according to claim 1, wherein the process of acquiring the voxel feature space comprises: pre-allocating a buffer according to the number of the divided voxel blocks; traversing the point cloud sequence, assigning each point to its associated voxel, and storing the voxel coordinates and the number of points of each voxel; building a hash table during the traversal of the point cloud sequence and checking through the hash table whether a voxel already contains points; if the voxel associated with a point exists, incrementing the number of points in that voxel by one, and if it does not exist, re-selecting another point for the query; obtaining the actual number of voxels from the obtained coordinates of all voxels and the number of points in each voxel; inspecting the obtained voxels and deleting all empty voxels to obtain dense voxels; and performing convolution on the dense voxels with GEMM to obtain the voxel feature space.
5. The method for detecting a point cloud 3D target based on keypoint multi-scale feature fusion according to claim 1, wherein the dense features are obtained from the bird's-eye-view features by bilinear interpolation: the voxel feature space is projected along the Z axis onto a 2D bird's-eye view, and the interpolation uses the neighbouring voxel features:

f(x, y) ≈ [ f(Q11)(x2 − x)(y2 − y) + f(Q21)(x − x1)(y2 − y) + f(Q12)(x2 − x)(y − y1) + f(Q22)(x − x1)(y − y1) ] / [ (x2 − x1)(y2 − y1) ]

wherein f(x, y) represents the feature at the current interpolation coordinate, x represents the abscissa of the point, y represents the ordinate of the point, and f(Q11), f(Q21), f(Q12), f(Q22) represent the features of the neighbouring voxels at the coordinates Q11 = (x1, y1), Q21 = (x2, y1), Q12 = (x1, y2), Q22 = (x2, y2).
6. The method for detecting the point cloud 3D target based on the multi-scale feature fusion of the key points as claimed in claim 1, wherein the fusion strategy comprises a feature key point fusion strategy and a feature splicing strategy of the feature key points;
feature key point fusion strategy:
fp = fp_conv1 ∪ fp_conv2 ∪ fp_conv3 ∪ fp_conv4 ∪ fp_bev

characteristic splicing of the feature keypoints:

ff = concat( f_fp^conv1, f_fp^conv2, f_fp^conv3, f_fp^conv4, f_fp^bev )

where fp represents the union of the feature sampling points of the sparse convolution layers, fp_conv1 represents the feature keypoints sampled after the first sparse convolution layer, fp_bev represents the feature keypoints sampled after the dense features obtained from the bird's-eye-view features by bilinear interpolation, ff represents the global feature of the feature keypoints, f_fp^conv1 represents the local keypoint features sampled after the first sparse convolution layer, and f_fp^bev represents the local keypoint features sampled after the dense features obtained from the bird's-eye-view features by bilinear interpolation.
7. The method for detecting a point cloud 3D target based on keypoint multi-scale feature fusion according to claim 1, wherein the process of region-of-interest pooling of the distance-sampling global features and the feature-sampling global features comprises: dividing the distance-sampling global features and the feature-sampling global features with the 3D suggestion frames, and generating 6 × 6 grid points at equal intervals in each 3D suggestion frame; acquiring the features of the grid points from the keypoints by a sequence extraction operation; and obtaining the target frame regression result and the target frame classification prediction result from the features of the grid points.
8. The method for detecting the point cloud 3D target based on the key point multi-scale feature fusion as claimed in claim 1, wherein the loss function of the model comprises a suggestion box generation network loss function and a network target box loss function;
the expression of the proposed box to generate the network loss function is:
L_rpn = L_cls + Σ_{r ∈ {x, y, z, l, h, w, θ}} L_Smooth-L1( Δr̂_a, Δr_a )

wherein L_rpn represents the suggestion-box generation network loss function, L_cls represents the classification loss calculated with the Focal loss, x, y, z represent the three-dimensional coordinates of the target frame, l, h, w represent the length, width and height of the target frame, θ represents the direction angle of the target frame, L_Smooth-L1 represents the Smooth-L1 loss calculation method, Δr̂_a represents the classification prediction residual, and Δr_a represents the regression residual;
the expression of the network target box loss function is:
L_rcnn = L_iou + Σ_{r ∈ {x, y, z, l, h, w, θ}} L_Smooth-L1( Δr̂_p, Δr_p )

wherein L_rcnn represents the network target-box loss function, L_iou represents the loss between the prediction and truth boxes calculated with the Focal loss, Δr̂_p represents the predicted target frame residual, and Δr_p represents the regression residual.
CN202110928928.6A 2021-08-13 2021-08-13 Point cloud 3D target detection method based on key point multi-scale feature fusion Active CN113706480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110928928.6A CN113706480B (en) 2021-08-13 2021-08-13 Point cloud 3D target detection method based on key point multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110928928.6A CN113706480B (en) 2021-08-13 2021-08-13 Point cloud 3D target detection method based on key point multi-scale feature fusion

Publications (2)

Publication Number Publication Date
CN113706480A CN113706480A (en) 2021-11-26
CN113706480B true CN113706480B (en) 2022-12-09

Family

ID=78652592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110928928.6A Active CN113706480B (en) 2021-08-13 2021-08-13 Point cloud 3D target detection method based on key point multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN113706480B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266992A (en) * 2021-12-13 2022-04-01 北京超星未来科技有限公司 Target detection method and device and electronic equipment
CN114299243A (en) * 2021-12-14 2022-04-08 中科视语(北京)科技有限公司 Point cloud feature enhancement method and device based on multi-scale fusion
CN114359660B (en) * 2021-12-20 2022-08-26 合肥工业大学 Multi-modal target detection method and system suitable for modal intensity change
CN114494609B (en) * 2022-04-02 2022-09-06 中国科学技术大学 3D target detection model construction method and device and electronic equipment
CN114913519B (en) * 2022-05-16 2024-04-19 华南师范大学 3D target detection method and device, electronic equipment and storage medium
CN115375731B (en) * 2022-07-29 2023-07-04 大连宗益科技发展有限公司 3D point cloud single-target tracking method for association points and voxels and related device
CN115578393B (en) * 2022-12-09 2023-03-10 腾讯科技(深圳)有限公司 Key point detection method, key point training method, key point detection device, key point training device, key point detection equipment, key point detection medium and key point detection medium
CN116665003B (en) * 2023-07-31 2023-10-20 安徽大学 Point cloud three-dimensional target detection method and device based on feature interaction and fusion

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160214A (en) * 2019-12-25 2020-05-15 电子科技大学 3D target detection method based on data fusion
CN111199206A (en) * 2019-12-30 2020-05-26 上海眼控科技股份有限公司 Three-dimensional target detection method and device, computer equipment and storage medium
CN111242041A (en) * 2020-01-15 2020-06-05 江苏大学 Laser radar three-dimensional target rapid detection method based on pseudo-image technology
CN111429514A (en) * 2020-03-11 2020-07-17 浙江大学 Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds
WO2020151109A1 (en) * 2019-01-22 2020-07-30 中国科学院自动化研究所 Three-dimensional target detection method and system based on point cloud weighted channel feature
CN111968133A (en) * 2020-07-31 2020-11-20 上海交通大学 Three-dimensional point cloud data example segmentation method and system in automatic driving scene
CN112347987A (en) * 2020-11-30 2021-02-09 江南大学 Multimode data fusion three-dimensional target detection method
CN112731339A (en) * 2021-01-04 2021-04-30 东风汽车股份有限公司 Three-dimensional target detection system based on laser point cloud and detection method thereof

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020151109A1 (en) * 2019-01-22 2020-07-30 中国科学院自动化研究所 Three-dimensional target detection method and system based on point cloud weighted channel feature
CN111160214A (en) * 2019-12-25 2020-05-15 电子科技大学 3D target detection method based on data fusion
CN111199206A (en) * 2019-12-30 2020-05-26 上海眼控科技股份有限公司 Three-dimensional target detection method and device, computer equipment and storage medium
CN111242041A (en) * 2020-01-15 2020-06-05 江苏大学 Laser radar three-dimensional target rapid detection method based on pseudo-image technology
CN111429514A (en) * 2020-03-11 2020-07-17 浙江大学 Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds
CN111968133A (en) * 2020-07-31 2020-11-20 上海交通大学 Three-dimensional point cloud data example segmentation method and system in automatic driving scene
CN112347987A (en) * 2020-11-30 2021-02-09 江南大学 Multimode data fusion three-dimensional target detection method
CN112731339A (en) * 2021-01-04 2021-04-30 东风汽车股份有限公司 Three-dimensional target detection system based on laser point cloud and detection method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space;Charles R. Qi 等;《31st Conference on Neural Information Processing Systems (NIPS 2017)》;20171231;全文 *
PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection;Shaoshuai Shi 等;《2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》;20200619;全文 *

Also Published As

Publication number Publication date
CN113706480A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN113706480B (en) Point cloud 3D target detection method based on key point multi-scale feature fusion
Fernandes et al. Point-cloud based 3D object detection and classification methods for self-driving applications: A survey and taxonomy
CN111027401B (en) End-to-end target detection method with integration of camera and laser radar
CN110032962B (en) Object detection method, device, network equipment and storage medium
CN111242041B (en) Laser radar three-dimensional target rapid detection method based on pseudo-image technology
CN113412505B (en) Processing unit and method for ordered representation and feature extraction of a point cloud obtained by a detection and ranging sensor
EP3252615A1 (en) Method and system for determining cells crossed by a measuring or viewing axis
CN115082674A (en) Multi-mode data fusion three-dimensional target detection method based on attention mechanism
CN113761999A (en) Target detection method and device, electronic equipment and storage medium
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
Waqas et al. Deep learning-based obstacle-avoiding autonomous UAVs with fiducial marker-based localization for structural health monitoring
CN116246119A (en) 3D target detection method, electronic device and storage medium
Kukolj et al. Road edge detection based on combined deep learning and spatial statistics of LiDAR data
Ballouch et al. Toward a deep learning approach for automatic semantic segmentation of 3D lidar point clouds in urban areas
Qayyum et al. Deep convolutional neural network processing of aerial stereo imagery to monitor vulnerable zones near power lines
Diaz et al. Real-time ground filtering algorithm of cloud points acquired using Terrestrial Laser Scanner (TLS)
CN116503602A (en) Unstructured environment three-dimensional point cloud semantic segmentation method based on multi-level edge enhancement
Caros et al. Object segmentation of cluttered airborne lidar point clouds
Acun et al. D3NET (divide and detect drivable area net): deep learning based drivable area detection and its embedded application
US20230105331A1 (en) Methods and systems for semantic scene completion for sparse 3d data
CN115240168A (en) Perception result obtaining method and device, computer equipment and storage medium
Sajjad et al. A Comparative Analysis of Camera, LiDAR and Fusion Based Deep Neural Networks for Vehicle Detection
Huu et al. Development of Volumetric Image Descriptor for Urban Object Classification Using 3D LiDAR Based on Convolutional Neural Network
Foster et al. RGB pixel-block point-cloud fusion for object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant