CN113378854A - Point cloud target detection method integrating original point cloud and voxel division - Google Patents
Point cloud target detection method integrating original point cloud and voxel division
- Publication number: CN113378854A
- Application number: CN202110651776.XA
- Authority: CN (China)
- Prior art keywords: point cloud, point, voxel, layer, feature
- Prior art date: 2021-06-11
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/253: Pattern recognition; fusion techniques of extracted features
- G06N3/045: Neural networks; combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a point cloud target detection method that fuses the original point cloud with voxel division. First, local detail features and semantic features of the point cloud are extracted with the lossless feature extraction network PointNet++; a loss function is then constructed to further improve the perception of local neighborhood information by the lossless feature extraction network PointNet++. The local detail features and semantic features, extracted without information loss, are embedded into a voxel-division-based point cloud target detection network by trilinear interpolation at the voxel feature initialization stage and the sparse convolution perception stage, and finally the preset detection anchor boxes are classified and regressed by a two-dimensional RPN to obtain the final detection targets. By embedding the lossless multi-scale encoding of the point cloud into the voxel method, the invention gives the detection network multi-scale, multi-level information fusion perception capability; it integrates the two kinds of point cloud target detection methods, based on the original point cloud and on voxel division, and thus possesses both efficient point cloud perception capability and lossless feature encoding capability.
Description
Technical Field
The invention belongs to the technical field of 3D point cloud target detection, and particularly relates to a point cloud target detection method fusing original point cloud and voxel division.
Background
With the continuous upgrading of vehicle-mounted lidar technology, a vehicle-mounted lidar can quickly and conveniently acquire point cloud data of the current scene, and targets in the scene can be extracted from the geometric structure information of the scene point cloud; this technology has spread into industries such as smart city construction, automatic driving and unmanned delivery. Because laser point clouds are random and disordered and vary greatly in density and sparsity, traditional target detection algorithms that apply uniform hand-crafted feature extraction to massive point cloud data cannot adapt to the shape changes of targets in complex autonomous-driving road scenes. Therefore, point cloud target detection algorithms based on deep learning have developed rapidly and are being applied in autonomous driving scenes.
Current mainstream deep-learning-based point cloud target detection methods fall into two categories: target detection based on the original point cloud and point cloud target detection based on voxel division.
3D target detection algorithms based on the original point cloud apply no preprocessing to the scene point cloud. The coordinates of the original points and the corresponding reflectivity values are fed directly into a neural network built from multilayer perceptrons (MLPs); the point cloud scene is sampled layer by layer, from shallow to deep, with Farthest Point Sampling (FPS); local detail features and semantic features are extracted by a local point set feature extraction module (Set Abstraction); and finally the detail and semantic features are assigned to all points of the original scene through a feature propagation layer (Feature Propagation) using trilinear interpolation. This approach loses no information, but the perception capability of multilayer perceptrons for disordered point clouds is lower than that of structures built from convolutional neural networks in voxel-division-based methods.
Point cloud target detection based on voxel division partitions the scene point cloud into uniform voxel grids according to the point cloud density scanned by lidars with different numbers of beams, extracts features for each voxel with a voxel feature extraction scheme adapted to different voxel sizes, extracts semantic information from the initialized voxel scene with 3D convolution or 3D sparse convolution while gradually compressing the height dimension to one dimension, and then builds a region proposal network (RPN) with two-dimensional convolutions to classify and regress the anchor boxes preset for each grid point under the top view of the scene. In autonomous driving point cloud scenes, this kind of method can quickly and efficiently classify objects that deform little and have high point density; however, voxel division geometrically deforms the original point cloud structure, and for small objects such as pedestrians and bicycles in particular, the deformation caused by voxel division loses local detail information, so the detected classification and regression results deviate from the real targets.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a point cloud target detection method that fuses the original point cloud with voxel division. First, local detail features and semantic features of the point cloud are extracted with the lossless feature extraction network PointNet++; a loss function is then constructed to further improve the perception of local neighborhood information by the lossless feature extraction network PointNet++; the local detail features and semantic features, extracted without information loss, are then embedded by trilinear interpolation into a voxel-division-based point cloud target detection network at the voxel feature initialization stage and the sparse convolution perception stage; finally, each preset detection anchor box is classified and regressed by a two-dimensional RPN to obtain the final detection targets.
In order to achieve the aim, the technical scheme provided by the invention is a point cloud target detection method fusing original point cloud and voxel division, which comprises the following steps:
step 1, extracting local detail features and semantic features of the point cloud by using the lossless feature extraction network PointNet++;
step 1.1, constructing a multilayer encoder;
step 1.2, extracting local detail features and semantic features of each layer of the point cloud through a Set Abstraction (SA) module without information loss;
step 1.3, endowing the detail features and the semantic features extracted in the step 1.2 to all points in an original scene through a feature transfer layer by adopting trilinear interpolation;
step 2, constructing a loss function to supervise the feature extraction in step 1 and improve the perception of feature information by the lossless feature extraction network PointNet++;
step 3, embedding the local detail features and semantic features without information loss into a point cloud target detection network based on voxel division;
step 3.1, initializing voxel characteristics by using the local detail characteristics extracted in the step 1;
step 3.2, performing feature extraction on the voxel scene semantic information initialized in the step 3.1 by using 3D sparse convolution;
step 3.3, converting the semantic features obtained in the step 1 into voxel features by adopting trilinear interpolation;
step 3.4, fusing the semantic features subjected to sparse convolution sensing in the step 3.2 with the voxel features obtained by conversion in the step 3.3 by adopting an attention mechanism mode to obtain semantic information fusing two sensing modes;
step 4, projecting the semantic features fused in step 3 onto a two-dimensional top view, building a region proposal network (RPN) with two-dimensional convolutions, and classifying and regressing the detection anchor boxes preset for each pixel under the top view of the scene to obtain the final detection targets;
step 4.1, setting an RPN network structure and a predefined detection anchor frame;
and 4.2, designing a point cloud target detection loss function.
Moreover, in step 1.1, the multilayer encoder is constructed by first sampling N points from the original point cloud with a farthest point sampling (FPS) strategy as the input point cloud, and then using FPS to sample a progressively smaller number of points layer by layer from the input point cloud data, forming a 4-layer encoder in which the point cloud input to each layer is the point set output by the previous layer.
Moreover, the input of each SA module layer in step 1.2 is the fixed-size point set obtained by FPS sampling in the previous layer. Let p_i be the i-th point obtained by FPS sampling in the current layer and N(p_i) the set of points of the previous layer lying inside a sphere neighborhood of radius r centered on p_i. The output feature of point p_i is computed in the following steps:

Step 1.2.1, randomly sample a fixed number of points from the sphere neighborhood N(p_i);

Step 1.2.2, perform feature fusion extraction on the points sampled in step 1.2.1 through a multilayer perceptron, with the calculation formula

f(p_i) = max_{p_j ∈ N(p_i)} MLP(p_j)

where MLP denotes the high-dimensional mapping of the point features by the multilayer perceptron, max() denotes taking the maximum value over the feature dimension of the point set, and f(p_i) is the output feature of point p_i;
and 1.2.3, repeatedly carrying out FPS sampling on each layer of input point cloud to obtain a corresponding number of point clouds, and aggregating neighborhood characteristics on the sampled points through the step 1.2.2, thereby completing the characteristic extraction without information loss. The first layer extracts local detail features, and the last three layers extract semantic features.
Moreover, the feature transfer in step 1.3 is the reverse process of feature extraction: it starts from the last layer of the encoder and proceeds layer by layer back to the shallower layers until features have been transferred to all N points of the input point cloud. Taking the transfer from a deeper layer to the next shallower layer as an example, assume point p_i is a point of the shallower layer that needs to receive a feature, φ(p_i) denotes the combination of the k points of the deeper layer closest to p_i in Euclidean space, and p_j denotes a point of φ(p_i). The trilinear interpolation feature transfer is computed as

f(p_i) = Σ_{j=1}^{k} w_ij f(p_j)

where f(p_i) is the feature to be transferred, f(p_j) denotes the feature of the j-th point p_j in the neighborhood of p_i, and w_ij denotes the feature weighting of the j-th point p_j in the neighborhood of point p_i.

The feature of each transferred point is thus obtained as a Euclidean-distance-weighted summation of the features of its k neighbors in the deeper layer, and the features can be propagated forward layer by layer to every point of the scene, so that the point cloud carries lossless information features.
In step 2, the point cloud coordinates of the original scene are used as supervision information and the Smooth-L1 loss is used as the loss function, computed as

L_point = (1 / |φ(p)|) Σ_{p ∈ φ(p)} SmoothL1(r′ − r)

where r′ and r denote, respectively, the point cloud spatial coordinates predicted by the lossless feature extraction network and the spatial coordinates of the original point cloud, and φ(p) denotes the point cloud set of the whole original scene. Under the supervision of this loss function, the perception of local neighborhood information by the lossless feature extraction network PointNet++ is further improved.
In step 3.1, the initialization uniformly divides the point cloud space into a voxel grid, retains the voxels containing points, discards the voxels containing no points, and initializes the retained voxels with the local detail features obtained in step 1. Assume the output of the first encoder layer in step 1 is the set of pairs (P_i, F_i^P), where P_i denotes a point of the original point cloud space whose feature needs to be transferred and F_i^P is the feature of point P_i. Let V_j denote a voxel center and F_j^V the feature that needs to be assigned to it; M voxel centers need to be assigned in total. The voxel center features are assigned through a trilinear interpolation function: let ψ(V_j) denote the combination of the k points closest to V_j in Euclidean space and P_t a point of ψ(V_j); then F_j^V is computed as

F_j^V = Σ_{t=1}^{k} w_tj F_t^P

where F_t^P denotes the feature of the t-th point P_t in the neighborhood of voxel center V_j and w_tj denotes the feature weighting of the t-th point P_t in the neighborhood of voxel center V_j.
Furthermore, step 3.2 stacks 4 sparse convolution modules built with the Spconv library, where each sparse convolution module contains two submanifold convolution layers and one point cloud sparse convolution layer with a downsampling stride of 2. Assuming the input voxel feature tensor is represented as L × W × H × C, where L, W, H and C denote the length, width and height of the voxel scene and the feature dimension of each voxel respectively, the output of the 4 sparse convolution layers can be represented as L′ × W′ × H′ × C′, where the spatial dimensions are reduced by the successive downsampling and C′ denotes the feature dimension after feature extraction.
Furthermore, in step 3.3, assume the three layers of semantic features extracted in step 1 are represented as F^{4×}, F^{8×} and F^{16×}, where 4× denotes four-times downsampling (and likewise for 8× and 16×). Let V_j′ denote a voxel center after sparse convolution and F_j^{V′} the feature that needs to be assigned to it. The semantic features of the points are converted to the voxel center representation by trilinear interpolation: let ψ(V_j′) denote the combination of the k points closest to V_j′ in Euclidean space at a given scale, with P_{t,4×}, P_{t,8×} and P_{t,16×} the corresponding points at the three scales; then for the four-times-downsampled layer

F_j^{V′,4×} = Σ_{t=1}^{k} w_{tj,4×} F_{t,4×}^P

and analogously for the 8× and 16× layers, the interpolated features of the three scales together forming F_j^{V′}. Here V_j′ denotes the voxel center after 3D sparse convolution, P_{t,4×}, P_{t,8×} and P_{t,16×} denote the spatial points used for feature weighting, F_{t,4×}^P denotes the feature from the four-times-downsampled layer of the t-th point in the neighborhood of voxel center V_j′, and w_{tj,4×} denotes its feature weighting.
In step 3.4, the two kinds of semantic information are concatenated along the feature dimension. Assume the dimension of the voxel features obtained by conversion in step 3.3 is M1 and the dimension of the voxel features obtained by sparse convolution perception is M2; then the feature dimension of the superposed voxel is M1 + M2, and a single-layer multilayer perceptron is then used to map the M1 + M2-dimensional features back to M1 dimensions.
In step 4.1, the RPN is built from a four-layer two-dimensional convolutional neural network with layer-by-layer outputs in a U-Net style structure, and each layer uses 3 × 3 convolutions to reduce the number of learnable parameters. This encoding-decoding network structure further abstracts the fused features; a corresponding detection anchor box is preset for each pixel of the final feature map, and the preset detection anchor boxes are classified and regressed to obtain the objects detected by the RPN. A three-dimensional detection anchor box can be expressed as {x, y, z, l, w, h, r}, where (x, y, z) denotes the center position of the anchor box, l, w and h correspond to its length, width and height, and r is the rotation angle in the x-y plane. The voxel features after 3D sparse convolution and semantic information fusion are compressed along the height dimension into the feature dimension to obtain a two-dimensional top-view feature map, and a predefined set of detection anchor boxes is laid out over every pixel of this feature map.
Furthermore, the classification loss function L_cls in step 4.2 uses the cross entropy loss function, namely

L_cls = −(1/n) Σ_{i=1}^{n} [ Q(a_i) log P(a_i) + (1 − Q(a_i)) log(1 − P(a_i)) ]

where n denotes the number of preset detection anchor boxes, P(a_i) denotes the predicted score of the i-th detection anchor box, and Q(a_i) denotes the true label of that anchor box.

The regression loss function L_reg uses the Smooth-L1 loss function, namely

L_reg = (1/n) Σ_{i=1}^{n} SmoothL1(v_i′ − v_i)

where n denotes the number of preset detection anchor boxes, v denotes the true values of the detection anchor boxes, and v′ denotes the values of the detection anchor boxes predicted by the RPN.
Through the combined supervision of the classification loss function and the regression loss function, the network can finally learn the capability of detecting the point cloud target.
Compared with the prior art, the invention has the following advantages: (1) it combines the advantages of current point cloud target detection methods based on voxel division and on the original point cloud, possessing both efficient point cloud perception capability and lossless feature encoding capability; (2) by embedding the lossless encoding of the point cloud into the voxel method in a multi-scale, multi-level manner, the detection network gains multi-scale, multi-level information fusion perception capability.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a diagram of an example of detection according to an embodiment of the present invention, in which fig. 2(a) is an input point cloud, and fig. 2(b) is a point cloud detection anchor box.
Detailed Description
The invention provides a point cloud target detection method that fuses the original point cloud with voxel division. First, local detail features and semantic features of the point cloud are extracted with the lossless feature extraction network PointNet++; a loss function is then constructed to further improve the perception of local neighborhood information by the lossless feature extraction network PointNet++; the local detail features and semantic features, extracted without information loss, are then embedded by trilinear interpolation into a voxel-division-based point cloud target detection network at the voxel feature initialization stage and the sparse convolution perception stage; finally, each preset detection anchor box is classified and regressed by a two-dimensional RPN to obtain the final detection targets.
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
Step 1, extract local detail features and semantic features of the point cloud with the lossless feature extraction network PointNet++.
First, a fixed number N of input points is collected; then a local point set feature extraction module (Set Abstraction, SA) is built and applied with layer-by-layer sampling to extract features of local scenes; trilinear interpolation is then used to assign the local detail features and semantic features to all points of the original scene through the feature propagation layer (Feature Propagation). This step comprises the following substeps:
and 1.1, constructing a multilayer encoder.
First, N points are sampled from the original point cloud with a farthest point sampling (FPS) strategy as the input point cloud; FPS is then used to sample a progressively smaller number of points layer by layer from the input point cloud data, forming a 4-layer encoder in which the point cloud input to each layer is the point set output by the previous layer.
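As an illustration of the FPS sampling used to build the encoder, the following NumPy sketch is not taken from the patent text; the function name, the choice of starting index 0 and the per-layer sizes are assumptions.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """Greedily pick n_samples indices so that each newly chosen point is the
    one farthest from the points selected so far (illustrative sketch)."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(n, np.inf)          # distance of every point to the selected set
    selected[0] = 0                    # assumption: start from the first point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = int(np.argmax(dist))
    return selected

# Build the 4-layer encoder inputs by repeated FPS; the layer sizes below are
# placeholders, since the patent text does not fix them at this point.
scene = np.random.rand(20000, 3).astype(np.float32)
current = scene
for m in (4096, 1024, 256, 64):
    current = current[farthest_point_sampling(current, m)]
```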
And 1.2, extracting the local detail features and semantic features of each layer of point cloud through an SA module without information loss.
The input of each SA module layer is the fixed-size point set obtained by FPS sampling in the previous layer. Let p_i be the i-th point obtained by FPS sampling in the current layer and N(p_i) the set of points of the previous layer lying inside a sphere neighborhood of radius r centered on p_i. The output feature of p_i is computed in the following steps:

Step 1.2.1, randomly sample a fixed number of points from the sphere neighborhood N(p_i).

Step 1.2.2, perform feature fusion extraction on the points sampled in step 1.2.1 through a multilayer perceptron to obtain the output feature of point p_i.

First, a multilayer perceptron is applied to the point set randomly sampled in step 1.2.1 to extract local detail features, giving a high-dimensional mapping of each point; max pooling over the feature dimension then yields the maximal information representation, and this pooled high-dimensional feature is the output feature of point p_i. The calculation formula is

f(p_i) = max_{p_j ∈ N(p_i)} MLP(p_j)

where MLP denotes the high-dimensional mapping of the point features by the multilayer perceptron, max() denotes taking the maximum value over the feature dimension of the point set, and f(p_i) is the output feature of point p_i.
And 1.2.3, repeatedly carrying out FPS sampling on each layer of input point cloud to obtain a corresponding number of point clouds, and aggregating neighborhood characteristics on the sampled points through the step 1.2.2, thereby completing the characteristic extraction without information loss. The first layer extracts local detail features, and the last three layers extract semantic features.
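The SA step described above can be sketched in PyTorch as follows; here the k nearest neighbours stand in for the radius-r ball query and random sampling, and the layer widths and k are assumptions, so this is only an illustration of the shared-MLP-plus-max-pooling idea rather than the patented module itself.

```python
import torch
import torch.nn as nn

class SetAbstractionSketch(nn.Module):
    def __init__(self, in_dim=3, hidden=64, out_dim=128, k=32):
        super().__init__()
        self.k = k
        # shared MLP applied to every neighbour independently
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim), nn.ReLU())

    def forward(self, points, centers):
        # points: (N, 3) previous-layer points; centers: (M, 3) FPS-sampled points
        idx = torch.cdist(centers, points).topk(self.k, largest=False).indices  # (M, k)
        neigh = points[idx] - centers.unsqueeze(1)   # centre-relative coordinates
        feat = self.mlp(neigh)                       # (M, k, out_dim)
        return feat.max(dim=1).values                # max over the neighbourhood -> (M, out_dim)

# usage: feats = SetAbstractionSketch()(prev_layer_xyz, sampled_xyz)
```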
And step 1.3, endowing the detail features and the semantic features extracted in the step 1.2 to all points in the original scene through a feature transfer layer by adopting tri-linear interpolation.
The feature transfer is the reverse process of feature extraction: it starts from the last layer of the encoder and proceeds layer by layer back to the shallower layers until features have been transferred to all N points of the input point cloud. Taking the transfer from a deeper layer to the next shallower layer as an example, assume point p_i is a point of the shallower layer that needs to receive a feature, φ(p_i) denotes the combination of the k points of the deeper layer closest to p_i in Euclidean space, and p_j denotes a point of φ(p_i). The trilinear interpolation feature transfer is computed as

f(p_i) = Σ_{j=1}^{k} w_ij f(p_j)

where f(p_i) is the feature to be transferred, f(p_j) denotes the feature of the j-th point p_j in the neighborhood of p_i, and w_ij denotes the feature weighting of the j-th point p_j in the neighborhood of point p_i.

The feature of each transferred point is thus obtained as a Euclidean-distance-weighted summation of the features of its k neighbors in the deeper layer, and the features can be propagated forward layer by layer to every point of the scene, so that the point cloud carries lossless information features.
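A sketch of this transfer step: each point of the denser layer receives a distance-weighted average of the features of its k nearest points in the sparser layer. The inverse-distance weighting and k = 3 follow common PointNet++ practice and are assumptions here, not a quotation of the patent's exact weights.

```python
import torch

def propagate_features(dst_xyz, src_xyz, src_feat, k=3, eps=1e-8):
    """dst_xyz: (N, 3) points receiving features; src_xyz: (M, 3); src_feat: (M, C)."""
    dist, idx = torch.cdist(dst_xyz, src_xyz).topk(k, largest=False)  # k nearest sources
    w = 1.0 / (dist + eps)
    w = w / w.sum(dim=1, keepdim=True)               # normalised weights w_ij
    return (w.unsqueeze(-1) * src_feat[idx]).sum(1)  # f(p_i) = sum_j w_ij f(p_j)
```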
And 2, constructing a loss function, supervising the execution of the feature extraction in the step 1, and promoting the lossless feature extraction of the network Pointnet + + perception feature information.
The point cloud coordinates of the original scene are used as supervision information and the Smooth-L1 loss is used as the loss function, computed as

L_point = (1 / |φ(p)|) Σ_{p ∈ φ(p)} SmoothL1(r′ − r)

where r′ and r denote, respectively, the point cloud spatial coordinates predicted by the lossless feature extraction network and the spatial coordinates of the original point cloud, and φ(p) denotes the point cloud set of the whole original scene. Under the supervision of this loss function, the perception of local neighborhood information by the lossless feature extraction network PointNet++ is further improved.
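A minimal sketch of this supervision, assuming the network outputs a reconstructed coordinate for every point of the scene and the loss is averaged over the whole point set:

```python
import torch
import torch.nn.functional as F

def coordinate_reconstruction_loss(pred_xyz: torch.Tensor, gt_xyz: torch.Tensor) -> torch.Tensor:
    # pred_xyz, gt_xyz: (N, 3); Smooth-L1 between predicted and original coordinates
    return F.smooth_l1_loss(pred_xyz, gt_xyz, reduction='mean')
```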
And 3, embedding the local detail features and semantic features without information loss into a point cloud target detection network based on voxel division.
Firstly, dividing original point cloud into voxels, initializing the voxel characteristics by using the local detail characteristics extracted in the step 1, then perceiving a point cloud space structure through sparse 3D convolution, and then fusing the semantic characteristics extracted in the step 1 at a semantic level, wherein the method comprises the following substeps:
and 3.1, initializing the voxel characteristics by using the local detail characteristics extracted in the step 1.
First, the point cloud space is uniformly divided into a voxel grid, voxels containing points are retained and voxels containing no points are discarded, and the retained voxels are then initialized with the local detail features obtained in step 1. Assume the output of the first encoder layer in step 1 is the set of pairs (P_i, F_i^P), where P_i denotes a point of the original point cloud space whose feature needs to be transferred and F_i^P is the feature of point P_i. Let V_j denote a voxel center and F_j^V the feature that needs to be assigned to it; M voxel centers need to be assigned in total. The voxel center features are assigned through a trilinear interpolation function: let ψ(V_j) denote the combination of the k points closest to V_j in Euclidean space and P_t a point of ψ(V_j); then F_j^V is computed as

F_j^V = Σ_{t=1}^{k} w_tj F_t^P

where F_t^P denotes the feature of the t-th point P_t in the neighborhood of voxel center V_j and w_tj denotes the feature weighting of the t-th point P_t in the neighborhood of voxel center V_j.
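A sketch of this initialization, under the assumptions that the voxel size is a free parameter and that the propagate_features helper sketched in step 1.3 is reused for the interpolation onto voxel centres:

```python
import torch

def init_voxel_features(xyz, point_feat, voxel_size=0.1, k=3):
    """xyz: (N, 3) points; point_feat: (N, C) local detail features."""
    coords = torch.floor(xyz / voxel_size).long()
    occupied = torch.unique(coords, dim=0)                 # keep non-empty voxels only
    centers = (occupied.float() + 0.5) * voxel_size        # voxel centre coordinates V_j
    center_feat = propagate_features(centers, xyz, point_feat, k=k)   # F_j^V
    return occupied, centers, center_feat
```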
And 3.2, performing feature extraction on the voxel scene semantic information initialized in the step 3.1 by using 3D sparse convolution.
Four sparse convolution modules built with the Spconv library are stacked, where each sparse convolution module contains two submanifold convolution layers and one point cloud sparse convolution layer with a downsampling stride of 2. Assuming the input voxel feature tensor is represented as L × W × H × C, where L, W, H and C denote the length, width and height of the voxel scene and the feature dimension of each voxel respectively, the output of the 4 sparse convolution layers can be represented as L′ × W′ × H′ × C′, where the spatial dimensions are reduced by the successive downsampling and C′ denotes the feature dimension after feature extraction.
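A sketch of such a 4-module sparse backbone, assuming the spconv v2 PyTorch interface; the channel widths are illustrative and not taken from the patent.

```python
import torch.nn as nn
import spconv.pytorch as spconv   # assumption: spconv v2 API

def sparse_block(cin, cout, key):
    # two submanifold convolutions followed by one stride-2 sparse convolution
    return spconv.SparseSequential(
        spconv.SubMConv3d(cin, cout, 3, padding=1, indice_key=key), nn.ReLU(),
        spconv.SubMConv3d(cout, cout, 3, padding=1, indice_key=key), nn.ReLU(),
        spconv.SparseConv3d(cout, cout, 3, stride=2, padding=1), nn.ReLU(),
    )

# 4 stacked modules, each halving the spatial resolution of the voxel grid
backbone = spconv.SparseSequential(
    sparse_block(16, 32, 'sub1'),
    sparse_block(32, 64, 'sub2'),
    sparse_block(64, 64, 'sub3'),
    sparse_block(64, 128, 'sub4'),
)
```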
And 3.3, converting the semantic features obtained in the step 1 into voxel features by adopting trilinear interpolation.
Assume the three layers of semantic features obtained in step 1 are represented as F^{4×}, F^{8×} and F^{16×}, where 4× denotes four-times downsampling (and likewise for 8× and 16×), and let V_j′ denote a voxel center after sparse convolution and F_j^{V′} the feature that needs to be assigned to it. The semantic features of the points are converted to the voxel center representation by trilinear interpolation: let ψ(V_j′) denote the combination of the k points closest to V_j′ in Euclidean space at a given scale, with P_{t,4×}, P_{t,8×} and P_{t,16×} the corresponding points at the three scales; then for the four-times-downsampled layer

F_j^{V′,4×} = Σ_{t=1}^{k} w_{tj,4×} F_{t,4×}^P

and analogously for the 8× and 16× layers, the interpolated features of the three scales together forming F_j^{V′}. Here V_j′ denotes the voxel center after 3D sparse convolution, P_{t,4×}, P_{t,8×} and P_{t,16×} denote the spatial points used for feature weighting, F_{t,4×}^P denotes the feature from the four-times-downsampled layer of the t-th point in the neighborhood of voxel center V_j′, and w_{tj,4×} denotes its feature weighting.
And 3.4, fusing the semantic features subjected to sparse convolution sensing in the step 3.2 with the voxel features obtained by conversion in the step 3.3 by adopting an attention mechanism mode to obtain semantic information fusing two sensing modes.
First, the two kinds of semantic information are concatenated along the feature dimension. Assume the dimension of the voxel features obtained by conversion in step 3.3 is M1 and the dimension of the voxel features obtained by sparse convolution perception is M2; then the feature dimension of the superposed voxel is M1 + M2, and a single-layer multilayer perceptron is then used to map the M1 + M2-dimensional features back to M1 dimensions.
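A sketch of this fusion: the two per-voxel feature sets are concatenated along the channel dimension and mapped back to M1 channels with a single-layer perceptron; the sigmoid gate used here is just one simple way to realise the attention-style weighting mentioned above and is an assumption.

```python
import torch
import torch.nn as nn

class VoxelFeatureFusion(nn.Module):
    def __init__(self, m1: int, m2: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(m1 + m2, m1 + m2), nn.Sigmoid())
        self.proj = nn.Linear(m1 + m2, m1)

    def forward(self, interp_feat, conv_feat):
        # interp_feat: (V, M1) interpolated semantic features,
        # conv_feat:   (V, M2) sparse-convolution features
        x = torch.cat([interp_feat, conv_feat], dim=-1)   # (V, M1 + M2)
        x = x * self.gate(x)                              # attention-style reweighting
        return self.proj(x)                               # map back to M1 channels
```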
Step 4, project the semantic features fused in step 3 onto a two-dimensional top view, build a region proposal network (RPN) with two-dimensional convolutions, and classify and regress the detection anchor boxes preset for each pixel under the top view of the scene to obtain the final detection targets. This step comprises the following substeps:
step 4.1, RPN network structure and predefined box setting.
The RPN is built from a four-layer two-dimensional convolutional neural network with layer-by-layer outputs in a U-Net style structure, and each layer uses 3 × 3 convolutions to reduce the number of learnable parameters. This encoding-decoding network structure further abstracts the fused features; a corresponding detection anchor box is preset for each pixel of the final feature map, and the preset detection anchor boxes are classified and regressed to obtain the objects detected by the RPN. A three-dimensional detection anchor box can be expressed as {x, y, z, l, w, h, r}, where (x, y, z) denotes the center position of the anchor box, l, w and h correspond to its length, width and height, and r is the rotation angle in the x-y plane. The voxel features after 3D sparse convolution and semantic information fusion are compressed along the height dimension into the feature dimension to obtain a two-dimensional top-view feature map, and a predefined set of detection anchor boxes is laid out over every pixel of this feature map.
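A sketch of the top-view projection and anchor layout described above; the two anchor rotations (0 and π/2), the anchor size and the placement of one anchor pair per BEV pixel are illustrative assumptions, not values given in the patent.

```python
import torch

def to_bev(voxel_features: torch.Tensor) -> torch.Tensor:
    # voxel_features: (C, H, L, W) dense tensor after sparse convolution;
    # fold the height axis into the channel axis -> (C * H, L, W)
    c, h, l, w = voxel_features.shape
    return voxel_features.reshape(c * h, l, w)

def make_anchors(l: int, w: int, size=(3.9, 1.6, 1.56), z=-1.0):
    # {x, y, z, l, w, h, r} anchors: one pair (rotation 0 and pi/2) per BEV cell
    ys, xs = torch.meshgrid(torch.arange(l, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing='ij')
    centers = torch.stack([xs, ys, torch.full_like(xs, z)], dim=-1)   # (l, w, 3)
    sizes = torch.tensor(size).expand(l, w, 3)
    anchors = [torch.cat([centers, sizes, torch.full((l, w, 1), rot)], dim=-1)
               for rot in (0.0, 1.5708)]
    return torch.stack(anchors, dim=2).reshape(-1, 7)                 # (l * w * 2, 7)
```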
And 4.2, designing a point cloud target detection loss function.
And classifying and regressing the preset detection anchor frame by utilizing a classification loss function and a regression loss function so as to obtain the object detected by the RPN.
The classification loss function L_cls uses the cross entropy loss function, namely

L_cls = −(1/n) Σ_{i=1}^{n} [ Q(a_i) log P(a_i) + (1 − Q(a_i)) log(1 − P(a_i)) ]

where n denotes the number of preset detection anchor boxes, P(a_i) denotes the predicted score of the i-th detection anchor box, and Q(a_i) denotes the true label of that anchor box.

The regression loss function L_reg uses the Smooth-L1 loss function, namely

L_reg = (1/n) Σ_{i=1}^{n} SmoothL1(v_i′ − v_i)

where n denotes the number of preset detection anchor boxes, v denotes the true values of the detection anchor boxes, and v′ denotes the values of the detection anchor boxes predicted by the RPN.
Through the combined supervision of the classification loss function and the regression loss function, the network can finally learn the capability of detecting the point cloud target.
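A sketch of this joint supervision, combining a binary cross-entropy term over anchor scores with Smooth-L1 regression on the 7 box parameters of the anchors matched to ground truth; the equal weighting of the two terms is an assumption.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_labels, box_pred, box_target, pos_mask):
    """cls_logits: (n,) anchor scores; cls_labels: (n,) 0/1 ground-truth labels;
    box_pred, box_target: (n, 7) anchor parameters; pos_mask: (n,) bool positives."""
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_labels.float())
    if pos_mask.any():
        l_reg = F.smooth_l1_loss(box_pred[pos_mask], box_target[pos_mask])
    else:
        l_reg = box_pred.sum() * 0.0   # no positive anchors in this batch
    return l_cls + l_reg
```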
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Claims (10)
1. A point cloud target detection method fusing an original point cloud and voxel division is characterized by comprising the following steps:
step 1, extracting local detail features and semantic features of the point cloud by using the lossless feature extraction network PointNet++;
step 2, constructing a loss function to supervise the feature extraction in step 1 and improve the perception of feature information by the lossless feature extraction network PointNet++;
step 3, embedding the local detail features and semantic features without information loss into a point cloud target detection network based on voxel division;
step 3.1, initializing voxel characteristics by using the local detail characteristics extracted in the step 1;
step 3.2, performing feature extraction on the voxel scene semantic information initialized in the step 3.1 by using 3D sparse convolution;
step 3.3, converting the semantic features obtained in the step 1 into voxel features by adopting trilinear interpolation;
step 3.4, fusing the semantic features subjected to sparse convolution sensing in the step 3.2 with the voxel features obtained by conversion in the step 3.3 by adopting an attention mechanism mode to obtain semantic information fusing two sensing modes;
and step 4, projecting the semantic features fused in step 3 onto a two-dimensional top view, building a region proposal network (RPN) with two-dimensional convolutions, and classifying and regressing the detection anchor boxes preset for each pixel under the top view of the scene to obtain the final detection targets.
2. The method for point cloud target detection by fusing original point cloud and voxel division according to claim 1, wherein the method comprises the following steps: the step 1 comprises the following substeps:
step 1.1, constructing a multilayer encoder;
collecting N points from the original point cloud with a farthest point sampling strategy as the input point cloud, and then using the farthest point sampling strategy to sample a progressively smaller number of points layer by layer from the input point cloud data, forming a 4-layer encoder in which the point cloud input to each layer is the point set output by the previous layer;
step 1.2, extracting the local detail features and semantic features of each layer of point cloud through a local point set feature extraction module without information loss;
and step 1.3, endowing the detail features and the semantic features extracted in the step 1.2 to all points in the original scene through a feature transfer layer by adopting tri-linear interpolation.
3. The point cloud target detection method fusing original point cloud and voxel division according to claim 2, wherein: the input of each layer of the local point set feature extraction module in step 1.2 is the fixed-size point set obtained in the previous layer by the farthest point sampling strategy; let p_i be the i-th point sampled by the farthest point sampling strategy in the current layer and N(p_i) the set of points of the previous layer lying inside a sphere neighborhood of radius r centered on p_i; the output feature of point p_i is computed in the following steps:

step 1.2.1, randomly sampling a fixed number of points from the sphere neighborhood N(p_i);

step 1.2.2, performing feature fusion extraction on the points sampled in step 1.2.1 through a multilayer perceptron, with the calculation formula

f(p_i) = max_{p_j ∈ N(p_i)} MLP(p_j)

where MLP denotes the high-dimensional mapping of the point features by the multilayer perceptron, max() denotes taking the maximum value over the feature dimension of the point set, and f(p_i) is the output feature of point p_i;
step 1.2.3, a farthest point sampling strategy is repeated for each layer of input point cloud to sample point cloud with a corresponding number, and neighborhood features are aggregated for the sampled points through step 1.2.2, so that feature extraction without information loss is completed, wherein local detail features are extracted from the first layer, and semantic features are extracted from the last three layers.
4. The point cloud target detection method fusing original point cloud and voxel division according to claim 2, wherein: the feature transfer in step 1.3 is the reverse process of feature extraction, starting from the last layer of the encoder and proceeding layer by layer back to the shallower layers until features have been transferred to all N points of the input point cloud; taking the transfer from a deeper layer to the next shallower layer as an example, assume point p_i is a point of the shallower layer that needs to receive a feature, φ(p_i) denotes the combination of the k points of the deeper layer closest to p_i in Euclidean space, and p_j denotes a point of φ(p_i); the trilinear interpolation feature transfer is computed as

f(p_i) = Σ_{j=1}^{k} w_ij f(p_j)

where f(p_i) is the feature to be transferred, f(p_j) denotes the feature of the j-th point p_j in the neighborhood of p_i, and w_ij denotes the feature weighting of the j-th point p_j in the neighborhood of point p_i.
5. The point cloud target detection method fusing original point cloud and voxel division according to claim 1, wherein: in step 2, the point cloud coordinates of the original scene are used as supervision information and the Smooth-L1 loss is used as the loss function, computed as

L_point = (1 / |φ(p)|) Σ_{p ∈ φ(p)} SmoothL1(r′ − r)

where r′ and r denote, respectively, the point cloud spatial coordinates predicted by the lossless feature extraction network and the spatial coordinates of the original point cloud, and φ(p) denotes the point cloud set of the whole original scene; under the supervision of this loss function, the perception of local neighborhood information by the lossless feature extraction network PointNet++ is further improved.
6. The point cloud target detection method fusing original point cloud and voxel division according to claim 1, wherein: the initialization in step 3.1 uniformly divides the point cloud space into a voxel grid, retains the voxels containing points, discards the voxels containing no points, and then initializes the retained voxels with the local detail features obtained in step 1; let ψ(V_j) denote the combination of the k points closest to voxel center V_j in Euclidean space and P_t a point of ψ(V_j); the voxel center feature F_j^V is then computed as

F_j^V = Σ_{t=1}^{k} w_tj F_t^P

where F_t^P denotes the feature of the t-th point P_t in the neighborhood of voxel center V_j and w_tj denotes the feature weighting of the t-th point P_t in the neighborhood of voxel center V_j.
7. The point cloud target detection method fusing original point cloud and voxel division according to claim 1, wherein: step 3.2 stacks 4 sparse convolution modules built with the Spconv library, where each sparse convolution module contains two submanifold convolution layers and one point cloud sparse convolution layer with a downsampling stride of 2; assuming the input voxel feature tensor is represented as L × W × H × C, where L, W, H and C denote the length, width and height of the voxel scene and the feature dimension of each voxel respectively, the output of the 4 sparse convolution layers can be represented as L′ × W′ × H′ × C′, where the spatial dimensions are reduced by the successive downsampling and C′ denotes the feature dimension after feature extraction.
8. The point cloud target detection method fusing original point cloud and voxel division according to claim 1, wherein: in step 3.3, assume the three layers of semantic features extracted in step 1 are represented as F^{4×}, F^{8×} and F^{16×}, where 4× denotes four-times downsampling (and likewise for 8× and 16×); let ψ(V_j′) denote the combination of the k points closest to the sparse-convolved voxel center V_j′ in Euclidean space at a given scale, with P_{t,4×}, P_{t,8×} and P_{t,16×} the corresponding points at the three scales; then for the four-times-downsampled layer the sparse-convolved voxel center feature is

F_j^{V′,4×} = Σ_{t=1}^{k} w_{tj,4×} F_{t,4×}^P

and analogously for the 8× and 16× layers, the interpolated features of the three scales together forming F_j^{V′}, where V_j′ denotes the voxel center after 3D sparse convolution, P_{t,4×}, P_{t,8×} and P_{t,16×} denote the spatial points used for feature weighting, F_{t,4×}^P denotes the feature from the four-times-downsampled layer of the t-th point in the neighborhood of voxel center V_j′, and w_{tj,4×} denotes its feature weighting.
9. The method for point cloud target detection by fusing original point cloud and voxel division according to claim 1, wherein the method comprises the following steps: the step 4 comprises the following two substeps:
step 4.1, setting an RPN network structure and a predefined detection anchor frame;
the RPN is built from a four-layer two-dimensional convolutional neural network with layer-by-layer outputs in a U-Net style structure, each layer using 3 × 3 convolutions to reduce the number of learnable parameters; this encoding-decoding network structure further abstracts the fused features; a corresponding detection anchor box is preset for each pixel of the final feature map, and the objects detected by the RPN are obtained by classifying and regressing the preset detection anchor boxes; a three-dimensional detection anchor box can be expressed as {x, y, z, l, w, h, r}, where (x, y, z) denotes the center position of the anchor box, l, w and h correspond to its length, width and height, and r is the rotation angle in the x-y plane; the voxel features after 3D sparse convolution and semantic information fusion are compressed along the height dimension into the feature dimension to obtain a two-dimensional top-view feature map, and a predefined set of detection anchor boxes is laid out over every pixel of this feature map;
and 4.2, designing a point cloud target detection loss function.
10. The point cloud target detection method fusing original point cloud and voxel division according to claim 9, wherein: the point cloud target detection loss function in step 4.2 comprises a classification loss function and a regression loss function; the classification loss function L_cls uses the cross entropy loss function, namely

L_cls = −(1/n) Σ_{i=1}^{n} [ Q(a_i) log P(a_i) + (1 − Q(a_i)) log(1 − P(a_i)) ]

where n denotes the number of preset detection anchor boxes, P(a_i) denotes the predicted score of the i-th detection anchor box, and Q(a_i) denotes the true label of that anchor box;

the regression loss function L_reg uses the Smooth-L1 loss function, namely

L_reg = (1/n) Σ_{i=1}^{n} SmoothL1(v_i′ − v_i)

where n denotes the number of preset detection anchor boxes, v denotes the true values of the detection anchor boxes, and v′ denotes the values of the detection anchor boxes predicted by the RPN.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110651776.XA CN113378854A (en) | 2021-06-11 | 2021-06-11 | Point cloud target detection method integrating original point cloud and voxel division |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110651776.XA CN113378854A (en) | 2021-06-11 | 2021-06-11 | Point cloud target detection method integrating original point cloud and voxel division |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113378854A true CN113378854A (en) | 2021-09-10 |
Family
ID=77573977
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110651776.XA Pending CN113378854A (en) | 2021-06-11 | 2021-06-11 | Point cloud target detection method integrating original point cloud and voxel division |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378854A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109118564A (en) * | 2018-08-01 | 2019-01-01 | 湖南拓视觉信息技术有限公司 | A kind of three-dimensional point cloud labeling method and device based on fusion voxel |
CN111160214A (en) * | 2019-12-25 | 2020-05-15 | 电子科技大学 | 3D target detection method based on data fusion |
CN112052860A (en) * | 2020-09-11 | 2020-12-08 | 中国人民解放军国防科技大学 | Three-dimensional target detection method and system |
CN112418084A (en) * | 2020-11-23 | 2021-02-26 | 同济大学 | Three-dimensional target detection method based on point cloud time sequence information fusion |
Non-Patent Citations (2)
Title |
---|
CHARLES R. QI ET AL.: "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space", 《ADVANCES 31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NIPS 2017)》 * |
TIANYUAN JIANG,ET AL.: "VIC-Net: Voxelization Information Compensation Network for Point Cloud 3D Object Detection", 《2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2021)》 * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113900119B (en) * | 2021-09-29 | 2024-01-30 | 苏州浪潮智能科技有限公司 | Method, system, storage medium and equipment for laser radar vehicle detection |
CN113900119A (en) * | 2021-09-29 | 2022-01-07 | 苏州浪潮智能科技有限公司 | Laser radar vehicle detection method, system, storage medium and equipment |
CN114155524A (en) * | 2021-10-29 | 2022-03-08 | 中国科学院信息工程研究所 | Single-stage 3D point cloud target detection method and device, computer equipment and medium |
CN114120115A (en) * | 2021-11-19 | 2022-03-01 | 东南大学 | Point cloud target detection method for fusing point features and grid features |
CN114120115B (en) * | 2021-11-19 | 2024-08-23 | 东南大学 | Point cloud target detection method integrating point features and grid features |
CN114463736A (en) * | 2021-12-28 | 2022-05-10 | 天津大学 | Multi-target detection method and device based on multi-mode information fusion |
CN114494183A (en) * | 2022-01-25 | 2022-05-13 | 哈尔滨医科大学附属第一医院 | Artificial intelligence-based automatic acetabular radius measurement method and system |
CN114494183B (en) * | 2022-01-25 | 2024-04-02 | 哈尔滨医科大学附属第一医院 | Automatic acetabular radius measurement method and system based on artificial intelligence |
CN114638953A (en) * | 2022-02-22 | 2022-06-17 | 深圳元戎启行科技有限公司 | Point cloud data segmentation method and device and computer readable storage medium |
CN114638953B (en) * | 2022-02-22 | 2023-12-22 | 深圳元戎启行科技有限公司 | Point cloud data segmentation method and device and computer readable storage medium |
CN114821033A (en) * | 2022-03-23 | 2022-07-29 | 西安电子科技大学 | Three-dimensional information enhanced detection and identification method and device based on laser point cloud |
CN114882495A (en) * | 2022-04-02 | 2022-08-09 | 华南理工大学 | 3D target detection method based on context-aware feature aggregation |
CN114882495B (en) * | 2022-04-02 | 2024-04-12 | 华南理工大学 | 3D target detection method based on context-aware feature aggregation |
CN115222988A (en) * | 2022-07-17 | 2022-10-21 | 桂林理工大学 | Laser radar point cloud data urban ground feature PointEFF fine classification method |
CN115375731A (en) * | 2022-07-29 | 2022-11-22 | 大连宗益科技发展有限公司 | 3D point cloud single-target tracking method of associated points and voxels and related device |
CN115471513A (en) * | 2022-11-01 | 2022-12-13 | 小米汽车科技有限公司 | Point cloud segmentation method and device |
CN116664874B (en) * | 2023-08-02 | 2023-10-20 | 安徽大学 | Single-stage fine-granularity light-weight point cloud 3D target detection system and method |
CN116664874A (en) * | 2023-08-02 | 2023-08-29 | 安徽大学 | Single-stage fine-granularity light-weight point cloud 3D target detection system and method |
CN117058402B (en) * | 2023-08-15 | 2024-03-12 | 北京学图灵教育科技有限公司 | Real-time point cloud segmentation method and device based on 3D sparse convolution |
CN117058402A (en) * | 2023-08-15 | 2023-11-14 | 北京学图灵教育科技有限公司 | Real-time point cloud segmentation method and device based on 3D sparse convolution |
CN117475410B (en) * | 2023-12-27 | 2024-03-15 | 山东海润数聚科技有限公司 | Three-dimensional target detection method, system, equipment and medium based on foreground point screening |
CN117475410A (en) * | 2023-12-27 | 2024-01-30 | 山东海润数聚科技有限公司 | Three-dimensional target detection method, system, equipment and medium based on foreground point screening |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113378854A (en) | Point cloud target detection method integrating original point cloud and voxel division | |
Zamanakos et al. | A comprehensive survey of LIDAR-based 3D object detection methods with deep learning for autonomous driving | |
CN112529015B (en) | Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping | |
CN109410307B (en) | Scene point cloud semantic segmentation method | |
CN114937151B (en) | Lightweight target detection method based on multiple receptive fields and attention feature pyramid | |
Ye et al. | 3d recurrent neural networks with context fusion for point cloud semantic segmentation | |
CN111242041B (en) | Laser radar three-dimensional target rapid detection method based on pseudo-image technology | |
CN112488210A (en) | Three-dimensional point cloud automatic classification method based on graph convolution neural network | |
CN113850270B (en) | Semantic scene completion method and system based on point cloud-voxel aggregation network model | |
CN114255238A (en) | Three-dimensional point cloud scene segmentation method and system fusing image features | |
CN113345082A (en) | Characteristic pyramid multi-view three-dimensional reconstruction method and system | |
CN112347987A (en) | Multimode data fusion three-dimensional target detection method | |
Cheng et al. | S3Net: 3D LiDAR sparse semantic segmentation network | |
CN112560865B (en) | Semantic segmentation method for point cloud under outdoor large scene | |
CN113870160B (en) | Point cloud data processing method based on transformer neural network | |
CN114373104A (en) | Three-dimensional point cloud semantic segmentation method and system based on dynamic aggregation | |
Ahmad et al. | 3D capsule networks for object classification from 3D model data | |
CN115147601A (en) | Urban street point cloud semantic segmentation method based on self-attention global feature enhancement | |
CN112488117B (en) | Point cloud analysis method based on direction-induced convolution | |
Hazer et al. | Deep learning based point cloud processing techniques | |
CN117765258A (en) | Large-scale point cloud semantic segmentation method based on density self-adaption and attention mechanism | |
CN112132207A (en) | Target detection neural network construction method based on multi-branch feature mapping | |
CN111860668A (en) | Point cloud identification method of deep convolution network for original 3D point cloud processing | |
CN116894940A (en) | Point cloud semantic segmentation method based on feature fusion and attention mechanism | |
CN115424225A (en) | Three-dimensional real-time target detection method for automatic driving system |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210910 |