CN116152579A - Point cloud 3D target detection method and model based on discrete Transformer - Google Patents

Point cloud 3D target detection method and model based on discrete Transformer

Info

Publication number
CN116152579A
CN116152579A
Authority
CN
China
Prior art keywords
voxel
discrete
point cloud
module
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310307131.3A
Other languages
Chinese (zh)
Inventor
李志恒
黄迪和
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University
Priority to CN202310307131.3A
Publication of CN116152579A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 - Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 - Image mosaicing, e.g. composing plane images from plane sub-images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10028 - Range image; Depth image; 3D point clouds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a point cloud 3D target detection method and model based on a discrete Transformer. The method comprises the following steps: S1, acquiring a point cloud data frame of an object in real time; S2, voxelizing the point cloud data frame to obtain initial voxels; S3, extracting voxel features containing dynamic information and static information from the initial voxels through a 3D backbone network based on the discrete Transformer; S4, mapping the voxel features finally output in step S3 to the BEV space to obtain corresponding 2D BEV features; and S5, sending the 2D BEV features through a Neck network to a 3D target detector for 3D target detection, obtaining object attribute information of the object in 3D space.

Description

Point cloud 3D target detection method and model based on discrete Transformer
Technical Field
The invention relates to the field of object perception for autonomous driving, and in particular to a point cloud 3D target detection method and model based on a discrete Transformer.
Background
An autonomous vehicle is a complex unmanned system that perceives its environment and performs decision and control by means of on-board sensors. To realize the decision making and control required for autonomous driving, sensors (usually including a lidar and cameras) are needed to perceive the surrounding environment, and the sensor data are processed to obtain 3D semantic information about the objects in that environment.
Lidar-based 3D object detection is a key technology for the environment perception problem in autonomous driving: the 3D semantic information of objects is obtained by encoding and decoding point cloud data frames acquired in real time with a neural network. To achieve high efficiency, voxel-based point cloud 3D target detection algorithms are currently widely used in the autonomous driving field. These methods first quantize the point cloud into voxels and then extract voxel features with a backbone network based on 3D sparse convolution. However, since a point cloud is an irregular, unstructured and discrete data structure, static convolutions with fixed weights often struggle to extract the diverse geometric structure information of objects. Recently, VoTr proposed using a voxel Transformer to extract the dynamic characteristics of voxels. That algorithm generates a voxel hash table from the voxel coordinates to index the voxel features, and then pads the key voxels of each query voxel to a specified length before applying a conventional full-attention mechanism. However, this Transformer ignores the sparse, discrete nature of the point cloud: requiring a specified number of key voxel features for every query voxel feature significantly increases the computation and the time consumed, and the large receptive field used by VoTr is detrimental to the detection of small objects. In addition, VoTr is composed entirely of Transformers, which makes it difficult to extract the static features of the point cloud. To improve the precision and recall of point cloud 3D target detection algorithms, the existing 3D backbone networks need to be improved so that the dynamic and static characteristics of the point cloud are both effectively retained.
Disclosure of Invention
In order to solve the problem of efficiently extracting the dynamic and static characteristics of the point cloud in existing point cloud 3D detection technology, the invention provides a point cloud 3D target detection method and model based on a discrete Transformer, so that the network can efficiently retain both the static and dynamic characteristics of the point cloud.
According to an embodiment of the present invention, a point cloud 3D target detection method based on a discrete Transformer is provided, comprising the following steps: S1, acquiring a point cloud data frame of an object in real time; S2, voxelizing the point cloud data frame to obtain initial voxels; S3, extracting voxel features containing dynamic information and static information from the initial voxels through a 3D backbone network based on the discrete Transformer; S4, mapping the voxel features finally output in step S3 to the BEV space to obtain corresponding 2D BEV features; and S5, sending the 2D BEV features through a Neck network to a 3D target detector for 3D target detection, obtaining object attribute information of the object in 3D space.
According to another embodiment of the present invention, a point cloud 3D target detection model based on a discrete Transformer is provided, comprising: a point cloud voxelization module, used for voxelizing the point cloud data frame of an object and outputting initial voxels; a 3D backbone network based on the discrete Transformer, connected to the output of the point cloud voxelization module and used for extracting voxel features containing dynamic information and static information from the initial voxels; a voxel feature mapping module, connected to the 3D backbone network based on the discrete Transformer and used for mapping the voxel features containing dynamic and static information to the BEV space to obtain corresponding 2D BEV features; and a Neck network, connected to the output of the voxel feature mapping module and used for sending the 2D BEV features to a 3D object detector for 3D target detection to obtain object attribute information of the object in 3D space.
The invention provides a general grid-based point cloud feature extraction backbone network that can be applied to all existing grid-based point cloud 3D detectors. Compared with the prior art (algorithms such as CenterPoint, PV-RCNN, Focals, Voxel-RCNN, SST, PillarNet and PointPillars), the detection method provided by the invention can effectively extract the dynamic and static features of the point cloud and retain richer 3D geometric information, thereby greatly improving the precision and recall of the 3D target detection algorithm and the perception capability of an autonomous vehicle with respect to its surroundings.
Drawings
Fig. 1 is a schematic flow chart of point cloud 3D target detection performed by a point cloud 3D target detection model based on a discrete Transformer according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of voxel processing by the discrete Transformer module according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart of voxel processing by the multi-scale discrete Transformer module according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of the discrete attention mechanism according to an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and the detailed description. It should be understood that the examples are provided for the purpose of illustration only and are not intended to limit the scope of the invention.
An embodiment of the present invention provides a point cloud 3D target detection model based on a discrete Transformer. The network architecture of the model, shown in Fig. 1, consists of the following components connected in sequence: a point cloud voxelization module, a sub-manifold 3D sparse convolution 10, a discrete Transformer module 20, a multi-scale discrete Transformer module 30, a multi-scale discrete Transformer module 40, a BEV mapping module for voxel features, a Neck network, and a 3D object detector. BEV stands for bird's-eye view.
The sub-manifold 3D sparse convolution 10, the discrete Transformer module 20, the multi-scale discrete Transformer module 30 and the multi-scale discrete Transformer module 40 form the 3D backbone network of the model and are mainly responsible for extracting voxel features containing dynamic and static information from the voxelized point cloud. The 3D backbone network based on the discrete Transformer uses the discrete Transformer module to extract the static and dynamic information of the voxelized point cloud: the module first obtains downsampled voxels through a downsampling 3D sparse convolution, which serve as the query features of the discrete attention mechanism; it then extracts the static features of the voxels through a sub-manifold 3D sparse convolution and the dynamic features through the discrete attention mechanism, and concatenates the dynamic and static features along the channel dimension as the output features. A structural sketch of the module is shown below.
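The module structure can be summarised by the following sketch. It is a minimal illustration rather than the authors' implementation: the sparse convolutions are assumed to be supplied as callables returning (features, coordinates) pairs, the static branch is assumed to act on the downsampled voxels, and all names are illustrative.

```python
import torch

def discrete_transformer_module(feat_in, coords_in,
                                downsample_conv,     # stride-2 3D sparse conv
                                subm_conv_query,     # sub-manifold conv -> query features
                                subm_conv_static,    # sub-manifold conv -> static branch
                                attention_branch):   # discrete attention (dynamic branch)
    feat_ds, coords_ds = downsample_conv(feat_in, coords_in)         # v1: (N, 2C)
    feat_q, _ = subm_conv_query(feat_ds, coords_ds)                   # v2: (N, C), queries
    # dynamic branch: discrete attention between the queries and the input voxels
    feat_dyn = attention_branch(feat_q, coords_ds, feat_in, coords_in)  # F_attention: (N, C)
    # static branch: a second sub-manifold sparse convolution
    feat_static, _ = subm_conv_static(feat_ds, coords_ds)             # v3: (N, C)
    out = torch.cat([feat_dyn, feat_static], dim=1)                   # v4: (N, 2C)
    return out, coords_ds
```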
The discrete Transformer module 20 and the multi-scale discrete Transformer modules 30 and 40 are each composed of a downsampling 3D sparse convolution, a sub-manifold 3D sparse convolution and a 3D discrete attention mechanism. The main differences between the multi-scale discrete Transformer modules (30, 40) and the discrete Transformer module 20 are: 1) the inputs to the multi-scale discrete Transformer modules (30, 40) are voxels of two different scales; 2) the attention calculation differs (detailed later).
Another embodiment of the present invention provides a point cloud 3D target detection method based on a discrete Transformer. The flow of the method is shown in Fig. 1 and comprises: acquiring a point cloud data frame of an object in real time using a lidar; voxelizing the point cloud data frame by mean voxelization or by dynamic voxelization based on a multi-layer perceptron to obtain initial voxels V0 (each voxel contains voxel features and voxel coordinates); enlarging the receptive field of the initial voxels V0 with the sub-manifold 3D sparse convolution 10, the output voxels being denoted V1; sending V1 to the discrete Transformer module 20 to obtain voxels V2; sending V1 and V2 to the multi-scale discrete Transformer module 30 to obtain voxels V3; sending V2 and V3 to the multi-scale discrete Transformer module 40 to obtain voxels V4; mapping V4 to the BEV space to obtain the corresponding 2D BEV features, denoted F_bev; and finally sending F_bev through the Neck network to the 3D target detector for 3D target detection, obtaining object attribute information such as the position of the object in 3D space, the three-dimensional size of its bounding box and its heading angle. The overall data flow is also sketched below.
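For orientation, the data flow can be written compactly over abstract module callables; the names mirror Fig. 1 and are purely illustrative, not a published implementation.

```python
def backbone_forward(v0, subm_conv_10, dt_module_20, msdt_module_30, msdt_module_40):
    # v0: initial voxels (features + coordinates) from the voxelization step
    v1 = subm_conv_10(v0)          # sub-manifold 3D sparse conv: enlarge receptive field
    v2 = dt_module_20(v1)          # discrete Transformer module
    v3 = msdt_module_30(v1, v2)    # multi-scale module: inputs at two scales
    v4 = msdt_module_40(v2, v3)    # multi-scale module: inputs at two scales
    return v4                      # v4 is subsequently mapped to the BEV space
```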
In some embodiments of the present invention, the convolution kernel size of the sub-manifold 3D sparse convolution 10 may be, for example, 3×3×3.
The specific steps by which the voxels V1 are sent to the discrete Transformer module 20 to obtain the voxels V2 are as follows. Referring to Fig. 2, V1 corresponds to voxel v0 and its voxel features have dimension M×C. A voxel hash table is first generated from the voxel coordinates of V1; each row of the table stores a voxel value together with the index id of that voxel. The voxel value is computed as follows: denoting the i-th voxel coordinate as (xi, yi, zi) and the maximum of all voxel coordinates as (xmax, ymax, zmax), the value of the i-th voxel is xi*ymax*zmax + yi*zmax + zi. Then V1 (v0 in Fig. 2) is passed through a downsampling 3D sparse convolution to obtain voxels v1 with feature dimension N×2C, and v1 is passed through the sub-manifold 3D sparse convolution 11 to obtain voxels v2 (feature dimension N×C). Referring also to Fig. 4, attention is then computed between V1 (v0 in Fig. 2) and v2: for each voxel of v2, using a 3×3×3 search space, the hash table of V1 is searched for the voxels in the corresponding range, which serve as the key voxels of the attention calculation, yielding a key index table and a query index table (for the i-th voxel of v2 with coordinates (xi, yi, zi), the voxels of V1 whose coordinates satisfy 2xi-1 ≤ x ≤ 2xi+1, 2yi-1 ≤ y ≤ 2yi+1 and 2zi-1 ≤ z ≤ 2zi+1 are searched, giving the query index table and the key index table). Continuing with Fig. 4, the query features (dimension K×C) are gathered from v2 according to the query index table, and the key features (dimension K×C) are gathered from V1 according to the key index table. The query and key features are multiplied element-wise and summed along the feature dimension to give a K×1 tensor; a discrete Softmax grouped by the query index table then yields K×1 attention scores; the attention scores are multiplied with the query features to give a K×C feature, which is summed discretely according to the query index table to obtain the feature F_attention of dimension N×C. Finally, F_attention is concatenated along the feature dimension with the voxels v3 produced by the sub-manifold 3D sparse convolution 12, giving the output voxels V2 of dimension N×2C (voxel v4 in Fig. 2). A code sketch of this procedure is given below.
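The hash-table construction, the 3×3×3 key search and the discrete Softmax/summation can be illustrated with the following minimal PyTorch sketch. Function names such as build_hash_table and discrete_attention are illustrative and not from the patent; coordinates are assumed to be integer tensors of shape (*, 3), batching is ignored, and the weighted sum uses the gathered key features as the attention values, which is the conventional choice and an assumption here.

```python
import torch

def voxel_value(coords, y_max, z_max):
    # Hash-table key: x * y_max * z_max + y * z_max + z (as defined above).
    return coords[:, 0] * y_max * z_max + coords[:, 1] * z_max + coords[:, 2]

def build_hash_table(coords, y_max, z_max):
    # Map each non-empty voxel's value to its row index in the feature tensor.
    return {int(v): i for i, v in enumerate(voxel_value(coords, y_max, z_max))}

def search_index_tables(query_coords, key_hash, y_max, z_max):
    # For the i-th query voxel at (x, y, z) on the coarser grid, collect the
    # non-empty key voxels on the finer grid with 2x-1 <= x' <= 2x+1 per axis.
    q_idx, k_idx = [], []
    for qi, (x, y, z) in enumerate(query_coords.tolist()):
        for xx in range(2 * x - 1, 2 * x + 2):
            for yy in range(2 * y - 1, 2 * y + 2):
                for zz in range(2 * z - 1, 2 * z + 2):
                    v = xx * y_max * z_max + yy * z_max + zz
                    if v in key_hash:               # only non-empty voxels kept
                        q_idx.append(qi)
                        k_idx.append(key_hash[v])
    return torch.tensor(q_idx), torch.tensor(k_idx)  # query / key index tables

def discrete_attention(query_feat, key_feat, q_idx, k_idx, num_queries):
    q = query_feat[q_idx]                        # (K, C) gathered query features
    k = key_feat[k_idx]                          # (K, C) gathered key features
    logits = (q * k).sum(dim=1)                  # (K,) per-pair dot products
    exp = torch.exp(logits - logits.max())       # shift for numerical stability
    denom = torch.zeros(num_queries).index_add_(0, q_idx, exp)
    attn = exp / denom[q_idx]                    # discrete Softmax per query voxel
    out = torch.zeros(num_queries, key_feat.size(1))
    out.index_add_(0, q_idx, attn.unsqueeze(1) * k)  # discrete summation -> (N, C)
    return out                                   # F_attention
```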
The specific steps by which the voxels V1 and V2 are sent to the multi-scale discrete Transformer module 30 to obtain the voxels V3 are as follows. Referring to Fig. 3, which shows the processing inside a multi-scale discrete Transformer module, V2 plays the role of voxel v0 in Fig. 3 and is processed exactly as in Fig. 2, so this is not repeated here; V1 and V2 together serve as the "multi-scale voxels" in Fig. 3. The internal network architecture of the multi-scale discrete Transformer modules 30 and 40 (Fig. 3) is identical to that of the discrete Transformer module 20 (Fig. 2); the main difference is that the input to the discrete attention calculation also contains multi-scale voxels. Compared with the attention calculation in the preceding step "V1 is sent to the discrete Transformer module 20 to obtain V2", the difference is the following: key index tables and query index tables are computed for both V1 and V2 (two of each), two key features with dimensions K1×C and K2×C and two query features with dimensions K1×C and K2×C are gathered, and the two query features and the two key features are each concatenated to obtain a query feature and a key feature of dimension (K1+K2)×C. In other words, (K1+K2) takes the place of K in the "key feature" and "query feature" dimensions of Fig. 4; the subsequent attention calculation steps are then repeated as before to obtain the feature F_attention' of dimension N×C (see the sketch after this paragraph). Continuing with Fig. 3, the feature F_attention' produced by the discrete attention mechanism is concatenated with the voxels v3 along the feature dimension to obtain the output voxels V3 of dimension N×2C (voxel v4 in Fig. 3).
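Under the same assumptions, the multi-scale variant can be sketched by building index tables against both input scales and concatenating them before running the same discrete attention. The helpers are the illustrative ones from the previous sketch, and the per-scale search-window arithmetic is simplified to the same 3×3×3 rule for both scales.

```python
import torch

def multi_scale_discrete_attention(query_feat, query_coords,
                                   key_feats, key_coords, grid_maxes,
                                   num_queries):
    # Build a (query, key) index table against each input scale, shift the key
    # indices into one concatenated key set, and run the same discrete attention.
    q_parts, k_parts, feat_parts, offset = [], [], [], 0
    for feat, coords, (y_max, z_max) in zip(key_feats, key_coords, grid_maxes):
        table = build_hash_table(coords, y_max, z_max)
        q_idx, k_idx = search_index_tables(query_coords, table, y_max, z_max)
        q_parts.append(q_idx)
        k_parts.append(k_idx + offset)          # shift into the concatenated key set
        feat_parts.append(feat)
        offset += feat.size(0)
    q_idx = torch.cat(q_parts)                  # length K1 + K2
    k_idx = torch.cat(k_parts)
    key_feat = torch.cat(feat_parts, dim=0)
    return discrete_attention(query_feat, key_feat, q_idx, k_idx, num_queries)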
The specific steps by which the voxels V2 and V3 are sent to the multi-scale discrete Transformer module 40 to obtain the voxels V4 are as follows. Since this step and the previous one are both implemented with a multi-scale discrete Transformer module, the processing steps and principles are the same; still with reference to Fig. 3, the only differences in this step are that V2 and V3 serve as the "multi-scale voxels" in Fig. 3, V2 plays the role of voxel v0 in Fig. 3, the feature output by the discrete attention mechanism is a feature F_attention'' of dimension N×C, and V4 corresponds to voxel v4 in Fig. 3.
The specific steps by which the voxels V4 are mapped to the BEV space to obtain the corresponding 2D BEV features F_bev are as follows: V4 is further downsampled by a 3D sparse convolution, the features along the height dimension are then concatenated onto the channel dimension, and the mapping from 3D features to BEV features is completed, giving the 2D features F_bev of the BEV space (a sketch follows).
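A minimal sketch of this height-to-channel folding, assuming the sparse voxels have already been densified into a (B, C, D, H, W) tensor by whatever sparse-convolution library is used:

```python
import torch

def to_bev(dense_voxels: torch.Tensor) -> torch.Tensor:
    # (B, C, D, H, W) -> (B, C*D, H, W): the height axis D is folded into the
    # channel axis, giving the 2D BEV feature map F_bev.
    b, c, d, h, w = dense_voxels.shape
    return dense_voxels.reshape(b, c * d, h, w)

# For example, a 128-channel grid of height 2 over a 180x180 BEV plane becomes
# a 256-channel BEV map:
# to_bev(torch.zeros(1, 128, 2, 180, 180)).shape == (1, 256, 180, 180)
```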
Finally, F_bev is sent through the Neck network to the 3D object detector for 3D target detection, obtaining object attributes such as the position of the object in 3D space, the three-dimensional size of its bounding box, its heading angle and its category. It should be understood that the 3D object detector of the embodiments of the present invention may be a 3D detection head such as CenterHead, PV-RCNN Head or Voxel-RCNN Head; the present invention is not limited in this respect, and different 3D detection heads may output different object attributes.
In the detection method and model provided by the embodiments of the invention, the use of the discrete attention mechanism allows the Transformer to be applied to point clouds efficiently; the discrete Transformer enables the dynamic and static features of the point cloud to be extracted efficiently; and the detection method and model of the embodiments of the invention can be applied to all grid-based point cloud 3D target detection algorithms.
The detection method and model provided by the embodiments of the invention can be applied to environment perception in automatic control scenarios such as autonomous vehicles and robots. Without adding extra time consumption, they improve the performance of existing point cloud 3D target detection algorithms, the perception capability of autonomous vehicles with respect to their environment, and the precision and recall of target detection.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several equivalent substitutions and obvious modifications can be made without departing from the spirit of the invention, and the same should be considered to be within the scope of the invention.

Claims (10)

1. A point cloud 3D target detection method based on a discrete Transformer, characterized by comprising the following steps:
s1, acquiring a point cloud data frame of an object in real time;
s2, carrying out point cloud voxelization on the point cloud data frame to obtain an initial voxel;
s3, extracting voxel characteristics containing dynamic information and static information from the initial voxels through a 3D backbone network based on a discrete transducer;
s4, mapping the voxel characteristics finally output in the step S3 to BEV space to obtain corresponding 2DBEV characteristics;
and S5, the 2DBEV features are sent to a 3D target detector through a Neck network, 3D target detection is carried out, and object attribute information of an object in a 3D space is obtained.
2. The discrete Transformer-based point cloud 3D target detection method of claim 1, wherein: in step S2, the initial voxel is obtained through mean voxelization or dynamic voxelization based on a multi-layer perceptron.
3. The discrete Transformer-based point cloud 3D target detection method of claim 1, wherein step S3 specifically comprises:
s31, inputting the initial voxel into a first sub-manifold 3D sparse convolution to obtain a first voxel;
s32, inputting the first voxel into a discrete transducer module to obtain a second voxel;
s33, inputting the first voxel and the second voxel into a first multi-scale discrete transducer module to obtain a third voxel;
s34, inputting the second voxel and the third voxel into a second multi-scale discrete transducer module to obtain a fourth voxel, wherein the fourth voxel is used as the voxel characteristic finally output in the step S3.
4. The discrete Transformer-based point cloud 3D target detection method of claim 3, wherein: the discrete Transformer module, the first multi-scale discrete Transformer module and the second multi-scale discrete Transformer module are each composed of a downsampling 3D sparse convolution, a second sub-manifold 3D sparse convolution and a 3D discrete attention mechanism.
5. The discrete Transformer-based point cloud 3D target detection method of claim 1, wherein: the 3D backbone network based on the discrete Transformer uses a discrete Transformer module to extract the static information and dynamic information of the voxelized point cloud;
the discrete Transformer module first obtains downsampled voxels through a downsampling 3D sparse convolution to serve as the query features of a discrete attention mechanism, then extracts the static features of the voxels through a sub-manifold 3D sparse convolution and the dynamic features of the voxels through the discrete attention mechanism, and concatenates the dynamic features and the static features along the channel dimension as the output features.
6. The discrete Transformer-based point cloud 3D target detection method of claim 1, wherein step S4 specifically comprises:
passing the voxel features finally output in step S3 through a 3D sparse convolution, then concatenating the height-dimension features onto the channel dimension, and completing the mapping from 3D space to the BEV space to obtain the 2D BEV features.
7. The discrete Transformer-based point cloud 3D object detection method of claim 1, wherein: the object attribute information includes the position of the object in the 3D space, the three-dimensional size of the bounding box, the heading angle of the object and the category.
8. A discrete Transformer-based point cloud 3D object detection model, comprising:
the point cloud voxelization module is used for carrying out point cloud voxelization on the point cloud data frame of the object and outputting an initial voxel;
the 3D backbone network based on the discrete Transformer is connected to the output end of the point cloud voxelization module and is used for extracting voxel characteristics containing dynamic information and static information from the initial voxels;
the voxel feature mapping module is connected with the 3D backbone network based on the discrete Transformer and is used for mapping the voxel features containing dynamic information and static information to the BEV space to obtain corresponding 2D BEV features;
and the Neck network is connected with the output end of the voxel feature mapping module and is used for sending the 2D BEV features to a 3D object detector for 3D target detection to obtain object attribute information of an object in 3D space.
9. The discrete Transformer-based point cloud 3D object detection model of claim 8, wherein: the 3D backbone network based on the discrete Transformer comprises a first sub-manifold 3D sparse convolution, a discrete Transformer module, a first multi-scale discrete Transformer module and a second multi-scale discrete Transformer module which are connected in sequence;
the first sub-manifold 3D sparse convolution takes the initial voxel as input and outputs a first voxel; the discrete Transformer module takes the first voxel as input and outputs a second voxel; the first multi-scale discrete Transformer module takes the first voxel and the second voxel as input and outputs a third voxel; and the second multi-scale discrete Transformer module takes the second voxel and the third voxel as input and outputs a fourth voxel.
10. The discrete Transformer-based point cloud 3D object detection model of claim 9, wherein: the discrete Transformer module, the first multi-scale discrete Transformer module and the second multi-scale discrete Transformer module are each composed of a downsampling 3D sparse convolution, a second sub-manifold 3D sparse convolution and a 3D discrete attention mechanism.
CN202310307131.3A 2023-03-27 2023-03-27 Point cloud 3D target detection method and model based on discrete Transformer Pending CN116152579A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310307131.3A CN116152579A (en) 2023-03-27 2023-03-27 Point cloud 3D target detection method and model based on discrete Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310307131.3A CN116152579A (en) 2023-03-27 2023-03-27 Point cloud 3D target detection method and model based on discrete Transformer

Publications (1)

Publication Number Publication Date
CN116152579A (en) 2023-05-23

Family

ID=86340892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310307131.3A Pending CN116152579A (en) 2023-03-27 2023-03-27 Point cloud 3D target detection method and model based on discrete Transformer

Country Status (1)

Country Link
CN (1) CN116152579A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237830A (en) * 2023-11-10 2023-12-15 湖南工程学院 Unmanned aerial vehicle small target detection method based on dynamic self-adaptive channel attention
CN117237830B (en) * 2023-11-10 2024-02-20 湖南工程学院 Unmanned aerial vehicle small target detection method based on dynamic self-adaptive channel attention


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination