CN115861944A - Traffic target detection system based on laser radar - Google Patents


Info

Publication number
CN115861944A
CN115861944A (application CN202211692170.1A)
Authority
CN
China
Prior art keywords
voxel
network
point
detection system
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211692170.1A
Other languages
Chinese (zh)
Inventor
王秉路
张磊
胡世超
李宁
王小旭
赵永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Architecture and Technology
Original Assignee
Xian University of Architecture and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Architecture and Technology
Priority to CN202211692170.1A
Publication of CN115861944A
Legal status: Pending


Abstract

The invention discloses a traffic target detection system based on a laser radar, and relates to the technical field of traffic target detection systems. The system comprises a radar voxel-point set feature extraction network, which in turn comprises: a voxel-based multi-scale sparse convolutional network; a feature extraction network based on keypoint sampling; and a feature pooling network for regions of interest. The invention combines the voxel-based method and the point-based method, so that the model has both the efficiency of a voxel-based model and the accuracy of a point-based model. Meanwhile, to further optimize model training, a self-distillation scheme is used to improve the training target during training, so that the model converges faster and achieves higher accuracy.

Description

Traffic target detection system based on laser radar
Technical Field
The invention relates to the technical field of traffic target detection systems, in particular to a traffic target detection system based on a laser radar.
Background
With the rapid development of 3D target detection on laser radar point clouds in the field of automatic driving, environmental perception in intelligent driving systems is continuously expanding. In practical applications, obstacle detection is a basic function of the environment sensing system, and the obstacle types mainly include common objects on structured roads such as pedestrians, vehicles and bicycles. Unlike a camera, the laser radar can directly measure the real distance and size of a target, which makes it a main sensor in automatic driving schemes.
However, the point cloud data generated by the laser radar are unordered and much sparser than images, and their density gradually decreases with distance. Aiming at the sparse and unordered characteristics of laser radar point cloud data, two radar point cloud feature processing approaches have emerged: voxel-based and point-based.
However, the voxel-based approach inevitably loses information because it aggregates point set features into voxels, while the point-based approach suffers from low computational efficiency because features must be processed point by point. There is therefore a need for a lidar-based traffic target detection system that overcomes the above deficiencies.
Disclosure of Invention
The invention aims to solve the defects in the prior art, and provides a traffic target detection system based on a laser radar.
In order to achieve the purpose, the invention adopts the following technical scheme:
a traffic target detection system based on laser radar comprises a radar voxel-point set characteristic extraction network, wherein the radar voxel-point set characteristic extraction network comprises:
a voxel-based multi-scale sparse convolutional network;
extracting a network based on the feature of the key point sampling;
a feature pooling network of the region of interest;
the feature extraction and correction method of the radar voxel-point set feature extraction network comprises the following steps of:
s1: firstly, the point cloud scene is divided into voxels, which are fed into a multi-scale 3D sparse convolution network to extract voxel features; the voxel features are converted into BEV (bird's-eye-view) features, from which target categories and target frames are predicted and proposals are generated;
s2: secondly, keypoints are sampled by FPS, and a VSA module extracts multi-scale voxel, raw point cloud and BEV features for the keypoints;
s3: finally, each proposal obtained by the voxel-based multi-scale sparse convolution network is divided into a number of grids, the keypoint features are pooled onto the grid points, and these features are used to refine the target frame.
Preferably: in the detection system, a 3D sparse convolution network performs voxel feature extraction and serves as the backbone to generate 3D proposals; wherein:
3D voxel CNN network: the point cloud input P is divided into L×H×W voxel grids, where the feature of a non-empty voxel grid is the average of the features of the points it contains; the voxelized point cloud features are passed through a series of multi-scale 3D convolution networks to obtain local and global point cloud features, the down-sampling scales of the 3D convolution networks being 1×, 2×, 4× and 8× respectively;
3D proposal generation: the 8×-down-sampled 3D voxel features are stacked along the Z axis to obtain an L/8×W/8 BEV feature map representation; candidate anchor frames are then generated for each class of target in the point cloud scene, with 2×L/8×W/8 3D anchor frames per class; the anchor size is the average size of targets of that class, with two orientations, 0° and 90°.
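The voxel-averaging step described above can be sketched as follows (an illustrative numpy sketch under assumed array layouts, not the patent's implementation; all names are hypothetical):

```python
import numpy as np

def voxelize_mean(points, voxel_size, pc_range):
    """Average the features of all points falling in each voxel.

    points:     (N, C) array; columns 0:3 are x, y, z, the rest are
                extra attributes such as reflection intensity.
    voxel_size: (3,) edge lengths of one voxel.
    pc_range:   [xmin, ymin, zmin, xmax, ymax, zmax].
    Returns (coords, feats): integer indices of the non-empty voxels
    and the per-voxel mean point feature.
    """
    pts = np.asarray(points, dtype=np.float64)
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    idx = np.floor((pts[:, :3] - lo) / np.asarray(voxel_size)).astype(np.int64)
    # Group points by voxel index, then average their features per voxel.
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = np.asarray(inverse).reshape(-1)
    counts = np.bincount(inverse, minlength=len(coords)).astype(np.float64)
    feats = np.zeros((len(coords), pts.shape[1]))
    for c in range(pts.shape[1]):
        feats[:, c] = np.bincount(inverse, weights=pts[:, c],
                                  minlength=len(coords)) / counts
    return coords, feats
```

The non-empty voxels and their mean features would then be fed to the sparse 3D convolution backbone.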
And further: in the detection system, the multi-scale voxel features are gathered onto a small number of keypoints, so that the keypoints become a bridge between the 3D voxel features and the proposal refinement network;
n keypoints K = {p1, p2, …, pn} are selected from the point cloud scene by the Furthest-Point-Sampling (FPS) algorithm; using FPS for keypoint sampling ensures that the keypoints are distributed over the whole point cloud scene, so that their features can represent the feature information of the whole scene;
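Furthest-Point-Sampling itself is compact enough to sketch (illustrative numpy, not the patent's code; the deterministic `start` index is an assumption):

```python
import numpy as np

def farthest_point_sampling(points, n_samples, start=0):
    """Furthest-Point-Sampling (FPS): repeatedly pick the point that is
    farthest from all points chosen so far, so the samples spread over
    the whole scene.  points: (N, 3) xyz; returns the chosen indices."""
    pts = np.asarray(points, dtype=np.float64)
    chosen = [start]
    # dist[i] = distance from point i to its nearest chosen keypoint
    dist = np.linalg.norm(pts - pts[start], axis=1)
    for _ in range(1, n_samples):
        nxt = int(np.argmax(dist))          # farthest remaining point
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return chosen
```

Because each new sample maximizes the distance to the current set, the keypoints cover the scene rather than clustering in dense regions.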
and, with each keypoint position as a reference point, a VSA module is used to gather the voxel features around the keypoint.
Further preferred is: in the detection system, $F^{(l_k)} = \{f_1^{(l_k)}, \ldots, f_{N_k}^{(l_k)}\}$ denotes the 3D voxel features at the k-th scale; $V^{(l_k)} = \{v_1^{(l_k)}, \ldots, v_{N_k}^{(l_k)}\}$ denotes the corresponding 3D voxel coordinates, computed from the voxel indices and the voxel size at that scale; $N_k$ denotes the number of non-empty voxels at the k-th scale; for each keypoint $p_i$, its neighboring non-empty voxels within radius $r_k$ are first determined as:

$$S_i^{(l_k)} = \left\{ \left[ f_j^{(l_k)};\; v_j^{(l_k)} - p_i \right]^{\mathsf T} : \left\| v_j^{(l_k)} - p_i \right\| < r_k,\; \forall v_j^{(l_k)} \in V^{(l_k)},\; \forall f_j^{(l_k)} \in F^{(l_k)} \right\}$$

In this process, the relative coordinates $v_j^{(l_k)} - p_i$ are concatenated with the voxel features to encode the relative position between the two; a PointNet block then generates the feature representation of keypoint $p_i$:

$$f_i^{(pv_k)} = \max\left\{ G\left( M\left( S_i^{(l_k)} \right) \right) \right\}$$

where $M$ randomly samples the features of at most $T_k$ voxels from $S_i^{(l_k)}$ to save computation, and $G$ denotes an MLP network that encodes the voxel features and relative positions; several radii are selected simultaneously for the VSA operation; gathering the voxel features of all scales onto the keypoints, the feature of keypoint $p_i$ is:

$$f_i^{(pv)} = \left[ f_i^{(pv_1)},\, f_i^{(pv_2)},\, f_i^{(pv_3)},\, f_i^{(pv_4)} \right]$$
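A toy version of this set-abstraction step may help fix ideas: for one keypoint and one scale, voxels within the radius are gathered as [feature; relative offset], subsampled to at most T_k, passed through a stand-in for the MLP G, and max-pooled (all names here are illustrative; a learned network would replace `mlp`):

```python
import numpy as np

def vsa_gather(keypoint, voxel_xyz, voxel_feats, radius, t_k, mlp, seed=0):
    """Gather voxel features around one keypoint (one scale of VSA).

    Builds [f_j ; v_j - p_i] for every non-empty voxel within `radius`
    of the keypoint, subsamples at most t_k of them, pushes each row
    through `mlp` and max-pools along the set dimension.
    """
    diff = voxel_xyz - keypoint
    mask = np.linalg.norm(diff, axis=1) < radius
    if not mask.any():                      # no neighbors: zero feature
        return np.zeros(mlp(np.zeros(voxel_feats.shape[1] + 3)).shape)
    s = np.concatenate([voxel_feats[mask], diff[mask]], axis=1)
    if len(s) > t_k:                        # random subsample, saves compute
        rng = np.random.default_rng(seed)
        s = s[rng.choice(len(s), t_k, replace=False)]
    return np.max(np.stack([mlp(row) for row in s]), axis=0)
```

The max-pool yields a fixed-length vector regardless of how many voxels fall inside the radius.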
as a preferable aspect of the present invention: in the detection system, the expanded VSA module: in addition to gathering multi-scale voxel characteristics on key points, gathering characteristics of BEVs obtained by octave down-sampling of original point clouds on the key points, projecting the key points pi onto a BEV view, and then gathering adjacent BEV characteristics fi (BEVs) on the key points by a bilinear interpolation method; finally, the keypoint features are represented by the following formula:
Figure BDA0004021703560000044
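The bilinear gathering of BEV features onto a projected keypoint can be sketched as (illustrative numpy; the (H, W, C) layout is an assumption):

```python
import numpy as np

def bilinear_bev_feature(bev, x, y):
    """Bilinearly interpolate a BEV feature map at a continuous
    (x, y) grid position.  bev: (H, W, C); x indexes W, y indexes H."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, bev.shape[1] - 1)
    y1 = min(y0 + 1, bev.shape[0] - 1)
    wx, wy = x - x0, y - y0                 # fractional offsets
    return ((1 - wy) * ((1 - wx) * bev[y0, x0] + wx * bev[y0, x1])
            + wy * ((1 - wx) * bev[y1, x0] + wx * bev[y1, x1]))
```

A keypoint's (x, y) here would come from projecting its 3D position into the BEV grid at the 8×-down-sampled resolution.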
further preferred as the invention: in the detection system, a Predicted Keypoint Weighting module is used for predicting the weight of key points; the PKW module takes the label of the 3D bounding box as supervision, and the supervision label of the key point contained in the 3D target frame is the foreground key point; finally, after weight prediction network processing, the key point characteristics are shown as follows:
Figure BDA0004021703560000051
a represents a three-layer MLP network and a sigmoid function for foreground confidence prediction; the PKW network is trained over focal local.
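A minimal sketch of the keypoint re-weighting and of binary focal loss, assuming a scalar foreground logit per keypoint (names illustrative; the three-layer MLP A is reduced to its final sigmoid here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pkw_reweight(keypoint_feat, fg_logit):
    """Scale a keypoint feature by its predicted foreground confidence."""
    return sigmoid(fg_logit) * keypoint_feat

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss used to train the foreground classifier;
    down-weights easy examples to counter fg/bg imbalance."""
    pt = p if y == 1 else 1.0 - p
    a = alpha if y == 1 else 1.0 - alpha
    return -a * (1.0 - pt) ** gamma * np.log(pt)
```

Because most keypoints land on background, the focal term (1 - pt)^gamma keeps confident easy negatives from dominating the gradient.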
As a still further scheme of the invention: in the detection system, a region-of-interest pooling network is used to gather the keypoint features onto the grid points of the region of interest over several receptive fields simultaneously; each 3D proposal is divided into 6 × 6 × 6 grid points, represented as:

$$G = \{g_1, \ldots, g_{216}\}$$
the neighboring keypoints for each grid point are determined by:
$$\tilde{\Psi} = \left\{ \left[ f_j^{(p)};\; p_j - g_i \right]^{\mathsf T} : \left\| p_j - g_i \right\| < \tilde{r},\; \forall p_j \in K \right\}$$
where $p_j - g_i$ preserves the relative position between the grid point and the neighboring keypoint $p_j$; a PointNet module is then adopted to aggregate and encode the neighboring keypoint features onto grid point $g_i$:

$$\tilde{f}_i^{(g)} = \max\left\{ G\left( M\left( \tilde{\Psi} \right) \right) \right\}$$
keypoint features of different receptive fields are gathered with several radii $\tilde{r}$ and combined together; the vectorized feature is then converted by a two-layer MLP network into a 256-dimensional feature to represent the proposal.
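The grid-point construction and keypoint pooling above can be sketched as (illustrative numpy; axis-aligned boxes only, whereas real proposals are oriented):

```python
import numpy as np

def roi_grid_points(center, size, n=6):
    """Uniform n*n*n grid of cell centers inside an axis-aligned 3D box."""
    offs = (np.arange(n) + 0.5) / n - 0.5       # cell centers in [-0.5, 0.5)
    gx, gy, gz = np.meshgrid(offs, offs, offs, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    return np.asarray(center) + grid * np.asarray(size)

def pool_to_grid(grid_pts, kp_xyz, kp_feats, radius):
    """Max-pool [f_j ; p_j - g_i] over keypoints within `radius` of
    each grid point (the relative offsets preserve position info)."""
    out = np.zeros((len(grid_pts), kp_feats.shape[1] + 3))
    for i, g in enumerate(grid_pts):
        diff = kp_xyz - g
        mask = np.linalg.norm(diff, axis=1) < radius
        if mask.any():
            out[i] = np.max(np.concatenate([kp_feats[mask], diff[mask]],
                                           axis=1), axis=0)
    return out
```

Pooling with two or more radii and concatenating the results would mimic the multi-receptive-field variant described above.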
On the basis of the scheme: in the detection system, a refinement network predicts the size and position information of the 3D proposal from the proposal features; the whole refinement network consists of two branches, a confidence prediction branch and a box regression branch, each composed of a two-layer MLP network;
the confidence coefficient prediction network adopts the ROI of the 3D interested region and the 3D IoU between the corresponding GT as training targets, and for the kth 3D interested region, the confidence coefficient training target yk is as follows:
Y k =min(1,max(0,2IoU k -0.5))
then, the confidence gt and the predicted confidence score are subjected to loss calculation:
$$\mathcal{L}_{\mathrm{iou}} = -\, y_k \log \tilde{y}_k - \left(1 - y_k\right) \log\left(1 - \tilde{y}_k\right)$$

where $\tilde{y}_k$ is the predicted confidence score.
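The IoU-based confidence target, the cross-entropy confidence loss, and the smooth-L1 loss used for box regression can be sketched as (illustrative numpy helpers, not the patent's code):

```python
import numpy as np

def confidence_target(iou):
    """y_k = min(1, max(0, 2*IoU_k - 0.5)): IoU below 0.25 maps to 0,
    above 0.75 maps to 1, linear in between."""
    return np.clip(2.0 * np.asarray(iou) - 0.5, 0.0, 1.0)

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy between confidence target and prediction."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def smooth_l1(x):
    """Smooth L1 applied to residual-based box regression targets."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)
```

The soft target makes the confidence branch rank proposals by localization quality rather than by a hard positive/negative split.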
the regression target of the target frame is obtained through a traditional residual-based mode, and smooth L1 loss is used for optimization.
On the basis of the foregoing scheme, it is preferable that: in the detection system, for a model with input x and a K-dimensional one-hot supervision target y, feeding x into the model yields the logit vector $z(x) = [z_1(x), \ldots, z_K(x)]$; a softmax function then gives the prediction confidence $P(x) = [P_1(x), \ldots, P_K(x)]$; the confidence is softened as:

$$\tilde{P}_i(x) = \frac{\exp\left(z_i(x)/\tau\right)}{\sum_{j=1}^{K} \exp\left(z_j(x)/\tau\right)}$$
where $\tau$ denotes the temperature coefficient of temperature scaling; the outputs of the teacher model and the student model are passed through this softened softmax to obtain $\tilde{P}^{T}(x)$ and $\tilde{P}^{S}(x)$; for the student model, the training objective is:

$$\mathcal{L} = (1-\alpha)\, H\left(y,\, P^{S}(x)\right) + \alpha\, \tau^{2}\, H\left(\tilde{P}^{T}(x),\, \tilde{P}^{S}(x)\right)$$

where H denotes the cross-entropy; when the temperature coefficient $\tau$ is 1, the objective degenerates to the cross-entropy of $P^{S}(x)$ against the soft supervision target.
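The temperature-scaled softmax can be sketched as (illustrative numpy; larger τ yields a flatter distribution):

```python
import numpy as np

def soften(logits, tau):
    """Temperature-scaled softmax: larger tau flattens the distribution,
    exposing the relative confidence carried by the small logits."""
    z = np.asarray(logits, dtype=np.float64) / tau
    z -= z.max()                    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

With tau = 1 this is the ordinary softmax, matching the degenerate case noted above.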
It is further preferable on the basis of the foregoing scheme that: in the detection system, the knowledge self-distillation model acquires knowledge from the model itself so as to improve the generalization ability of the model; at the t-th epoch, with the self-distillation prediction for x denoted $P_t^{S}(x)$, the objective function is:

$$\mathcal{L}_t = (1-\alpha)\, H\left(y,\, P_t^{S}(x)\right) + \alpha\, H\left(\tilde{P}_{t-1}^{S}(x),\, \tilde{P}_t^{S}(x)\right)$$

For the model at epoch t, the training target is $(1-\alpha)\, y + \alpha\, P_{t-1}^{S}(x)$; the parameter α reflects the degree of trust placed in the teacher model.
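The mixed training target can be sketched as (illustrative numpy; `alpha` is the trust placed in the previous-epoch teacher):

```python
import numpy as np

def self_distill_target(one_hot, prev_epoch_pred, alpha):
    """Soft target for epoch t: (1 - alpha) * y + alpha * P_{t-1}^S(x).
    The model's own prediction from the previous epoch plays the
    teacher; alpha = 0 recovers plain one-hot supervision."""
    y = np.asarray(one_hot, dtype=np.float64)
    p = np.asarray(prev_epoch_pred, dtype=np.float64)
    return (1.0 - alpha) * y + alpha * p
```

Since both inputs are probability vectors, the mixture is itself a valid distribution, so it can be plugged into an ordinary cross-entropy loss.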
The invention has the beneficial effects that:
1. the invention combines the voxel-based method and the point-based method, so that the model has both the efficiency of a voxel-based model and the accuracy of a point-based model; meanwhile, to further optimize model training, a self-distillation scheme is used to improve the training target during training, so that the model converges faster and achieves higher accuracy.
2. Detection performance comparison experiments were carried out against several laser radar point cloud target detection models; the improved model achieves excellent performance on the KITTI dataset. Ablation experiments on the self-distillation module show that it plays a significant role in the training process.
Drawings
Fig. 1 is a flowchart of a traffic target detection system based on a laser radar according to the present invention.
Detailed Description
The technical solution of the present patent will be described in further detail with reference to the following embodiments.
Example 1:
a traffic target detection system based on laser radar comprises a radar voxel-point set characteristic extraction network, wherein the radar voxel-point set characteristic extraction network comprises:
a voxel-based multi-scale sparse convolutional network;
extracting a network based on the feature of the key point sampling;
a feature pooling network of the region of interest;
the feature extraction and correction method of the radar voxel-point set feature extraction network comprises the following steps:
s1: firstly, the point cloud scene is divided into voxels, which are fed into a multi-scale 3D sparse convolution network to extract voxel features; the voxel features are converted into BEV (bird's-eye-view) features, from which target categories and target frames are predicted and proposals are generated;
s2: secondly, sampling key points in an FPS mode, and extracting multi-scale voxels, original point clouds and BEV characteristics for the key points by using a VSA module;
s3: and finally, dividing the proposal obtained by the multi-scale sparse convolution network based on the voxels into a plurality of grids, extracting key point features to grid points through pooling operation, and further performing fine correction on the target frame by using the features.
In the detection system, a 3D sparse convolution network performs voxel feature extraction and serves as the backbone to generate 3D proposals; wherein:
3D voxel CNN network: the point cloud input P is divided into L×H×W voxel grids, where the feature of a non-empty voxel grid is the average of the features of the points it contains, the features generally comprising the 3D coordinates and reflection intensity attributes of the points; the voxelized point cloud features are aggregated by a series of multi-scale 3D convolution networks into local and global point cloud features, the down-sampling scales of the 3D convolution networks being 1×, 2×, 4× and 8× respectively;
3D proposal generation: the 8×-down-sampled 3D voxel features are stacked along the Z axis to obtain an L/8×W/8 BEV feature map representation; candidate anchor frames are then generated for each class of target in the point cloud scene, with 2×L/8×W/8 3D anchor frames per class; the anchor size is the average size of targets of that class, with two orientations, 0° and 90°; compared with PointNet-based methods, the 3D voxel CNN network and anchor frame strategy achieve a higher recall rate;
and (5) a discussion of fine modification of the detection target. On the one hand, the refinement and correction of propofol directly on the 3D voxelized feature or the 2D feature map brings many problems. Firstly, point cloud characteristics processed by a backbone downsampling network cannot complete fine positioning of a target frame, and the fine correction effect of the target frame is influenced; secondly, even if the features are subjected to upsampling processing by means of linear interpolation and the like, the features are sparse, and fine correction of the target frame cannot be realized.
On the other hand, the set interaction operation proposed in the PointNet network can extract features from surrounding points in an arbitrary radius range; therefore, the effective fine correction work of the target box is realized by the operation of the set iteration. The problems of memory occupation and computational efficiency caused by extracting all the voxel characteristics are avoided. And selecting partial key points in the point cloud scene, and extracting the voxel characteristics to the partial key points. By the method, the problem that the voxelization features are too sparse can be avoided, the implementation problems of memory occupation and the like caused by extraction of all the voxelization features can also be avoided, and the fine correction effect of the target box is greatly improved.
In the detection system, the multi-scale voxel features are gathered onto a small number of keypoints, so that the keypoints become a bridge between the 3D voxel features and the proposal refinement network;
Keypoint sampling: firstly, n keypoints K = {p1, p2, …, pn} are selected from the point cloud scene by the Furthest-Point-Sampling (FPS) algorithm; using FPS for keypoint sampling ensures that the keypoints are distributed over the whole point cloud scene, so that their features can represent the feature information of the whole scene;
VSA module: the selected keypoints contain only a small part of the feature information of the whole point cloud scene; therefore, with each keypoint as a reference point, a VSA module is used to gather the voxel features around the keypoint;
$F^{(l_k)} = \{f_1^{(l_k)}, \ldots, f_{N_k}^{(l_k)}\}$ denotes the 3D voxel features at the k-th scale; $V^{(l_k)} = \{v_1^{(l_k)}, \ldots, v_{N_k}^{(l_k)}\}$ denotes the corresponding 3D voxel coordinates, computed from the voxel indices and the voxel size at that scale; $N_k$ denotes the number of non-empty voxels at the k-th scale. For each keypoint $p_i$, its neighboring non-empty voxels within radius $r_k$ are first determined as:

$$S_i^{(l_k)} = \left\{ \left[ f_j^{(l_k)};\; v_j^{(l_k)} - p_i \right]^{\mathsf T} : \left\| v_j^{(l_k)} - p_i \right\| < r_k,\; \forall v_j^{(l_k)} \in V^{(l_k)},\; \forall f_j^{(l_k)} \in F^{(l_k)} \right\}$$
in this process, the relative coordinates $v_j^{(l_k)} - p_i$ are concatenated with the voxel features to encode the relative position between the two; a PointNet block then generates the feature representation of keypoint $p_i$:

$$f_i^{(pv_k)} = \max\left\{ G\left( M\left( S_i^{(l_k)} \right) \right) \right\}$$
where $M$ randomly samples the features of at most $T_k$ voxels from $S_i^{(l_k)}$ to save computation, and $G$ denotes an MLP network that encodes the voxel features and relative positions; since different keypoints have different numbers of voxels in their neighborhoods, the max-pooling along the channel dimension effectively resolves this mismatch between keypoints; to gather multi-scale semantic information, several radii are selected simultaneously for the VSA operation;
Through multiple VSA operations, the voxel features of several scales are gathered onto the keypoints; after concatenation, the feature representation of keypoint $p_i$ is:

$$f_i^{(pv)} = \left[ f_i^{(pv_1)},\, f_i^{(pv_2)},\, f_i^{(pv_3)},\, f_i^{(pv_4)} \right]$$
Extended VSA module: besides the multi-scale voxel features, the features of the raw point cloud and of the BEV map obtained by 8× down-sampling are also gathered onto the keypoints; the raw point cloud aggregation follows formula (2); for BEV feature aggregation, keypoint $p_i$ is projected onto the BEV view, and the neighboring BEV features $f_i^{(bev)}$ are then gathered onto the keypoint by bilinear interpolation; finally, the keypoint feature is represented as:

$$f_i^{(p)} = \left[ f_i^{(pv)},\, f_i^{(raw)},\, f_i^{(bev)} \right], \quad i = 1, \ldots, n$$
predicting the weight of the key point:
now, a scene feature representation encoded by a small set of keypoints has been obtained, and these keypoint features are to be used to refine the target frame; the keypoints sampled by the FPS algorithm are distributed over the whole point cloud scene, covering both foreground and background regions; to improve the refinement precision, keypoints in foreground regions should be given larger weights and keypoints in background regions smaller ones;
Therefore, keypoint weight prediction is performed by a Predicted Keypoint Weighting (PKW) module; the PKW module is supervised by the labels of the 3D bounding boxes, i.e. keypoints contained in a 3D target frame are labelled as foreground keypoints; finally, after the weight prediction network, the keypoint features become:

$$\tilde{f}_i^{(p)} = A\left( f_i^{(raw)} \right) \cdot f_i^{(p)}$$

where A denotes a three-layer MLP network with a sigmoid function for foreground confidence prediction; the PKW network is trained with focal loss, whose hyper-parameters are set to address the imbalance between foreground and background points.
Region-of-interest feature pooling network: a region-of-interest pooling network gathers the keypoint features onto the grid points of the region of interest over several receptive fields; each 3D proposal is divided into 6 × 6 × 6 grid points, represented as:

$$G = \{g_1, \ldots, g_{216}\}$$
the neighboring keypoints for each grid point are determined by:
$$\tilde{\Psi} = \left\{ \left[ f_j^{(p)};\; p_j - g_i \right]^{\mathsf T} : \left\| p_j - g_i \right\| < \tilde{r},\; \forall p_j \in K \right\}$$
where $p_j - g_i$ preserves the relative position between the grid point and the neighboring keypoint $p_j$; a PointNet module is then adopted to aggregate and encode the neighboring keypoint features onto grid point $g_i$:

$$\tilde{f}_i^{(g)} = \max\left\{ G\left( M\left( \tilde{\Psi} \right) \right) \right\}$$
wherein M and G are as in formula (2); keypoint features of different receptive fields are gathered with several radii $\tilde{r}$ and combined together; the vectorized feature is then converted by a two-layer MLP network into a 256-dimensional feature to represent the proposal;
Compared with previous voxel-based region-of-interest feature pooling operations, this network obtains richer semantic information and allows more flexible receptive-field selection.
Proposal refinement and confidence prediction: using the proposal features, the refinement network predicts the size and position information of the 3D proposal. The whole refinement network consists of two branches, a confidence prediction branch and a box regression branch, each composed of a two-layer MLP network.
The confidence prediction network takes as training target the 3D IoU between the 3D region of interest (RoI) and its corresponding ground-truth (GT) box; for the k-th 3D region of interest, the confidence training target $y_k$ is:

$$y_k = \min\left(1,\, \max\left(0,\, 2\,\mathrm{IoU}_k - 0.5\right)\right)$$

then, a loss is computed between the confidence ground truth and the predicted confidence score:
$$\mathcal{L}_{\mathrm{iou}} = -\, y_k \log \tilde{y}_k - \left(1 - y_k\right) \log\left(1 - \tilde{y}_k\right)$$

where $\tilde{y}_k$ is the predicted confidence score.
the regression target of the target frame is obtained through a traditional residual-based mode, and smooth L1 loss is used for optimization.
Self-distillation network: knowledge distillation with a soft supervision target transfers the knowledge of one model (the teacher) to another model (the student), usually from a large model to a small one. Besides the one-hot supervision target, the student model also learns from the information provided by the teacher model. The smaller student model, trained under the teacher, can finally reach performance consistent with the teacher's; if the two models are of the same size and scale, the student's performance can even be better.
For a model with input x and a K-dimensional one-hot supervision target y, feeding x into the model yields the logit vector $z(x) = [z_1(x), \ldots, z_K(x)]$; a softmax function then gives the prediction confidence $P(x) = [P_1(x), \ldots, P_K(x)]$. For better knowledge distillation, the confidence is softened:

$$\tilde{P}_i(x) = \frac{\exp\left(z_i(x)/\tau\right)}{\sum_{j=1}^{K} \exp\left(z_j(x)/\tau\right)}$$
where $\tau$ denotes the temperature coefficient of temperature scaling; the outputs of the teacher model and the student model are passed through this softened softmax to obtain $\tilde{P}^{T}(x)$ and $\tilde{P}^{S}(x)$; for the student model, the training objective is:

$$\mathcal{L} = (1-\alpha)\, H\left(y,\, P^{S}(x)\right) + \alpha\, \tau^{2}\, H\left(\tilde{P}^{T}(x),\, \tilde{P}^{S}(x)\right)$$

where H denotes the cross-entropy; when the temperature coefficient $\tau$ is 1, the objective degenerates to the cross-entropy of $P^{S}(x)$ against the soft supervision target.
Knowledge is distilled from the prediction of the previous stage: the self-distillation model acquires knowledge from the model itself to improve its generalization ability; at the t-th epoch, with the self-distillation prediction for x denoted $P_t^{S}(x)$, the objective function is:

$$\mathcal{L}_t = (1-\alpha)\, H\left(y,\, P_t^{S}(x)\right) + \alpha\, H\left(\tilde{P}_{t-1}^{S}(x),\, \tilde{P}_t^{S}(x)\right)$$

Compared with the traditional knowledge distillation model, the teacher of the self-distillation model changes dynamically: any model from a past training epoch can serve as the teacher of the current student. To obtain the most valuable information, the model at epoch t-1 is chosen as the teacher. For the model at epoch t, the training target is $(1-\alpha)\, y + \alpha\, P_{t-1}^{S}(x)$; the parameter α reflects the degree of trust placed in the teacher model.
The training target is improved in a self-distillation mode in the model training process, so that the model can be converged more quickly, and the accuracy is improved.
The above description covers only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or change that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention, according to the technical solutions of the present invention and the inventive concept thereof, shall fall within the scope of the present invention.

Claims (10)

1. A traffic target detection system based on laser radar is characterized by comprising a radar voxel-point set feature extraction network, wherein the radar voxel-point set feature extraction network comprises:
a voxel-based multi-scale sparse convolutional network;
extracting a network based on the feature of the key point sampling;
a feature pooling network of regions of interest;
the feature extraction and correction method of the radar voxel-point set feature extraction network comprises the following steps:
s1: firstly, the point cloud scene is divided into voxels, which are fed into a multi-scale 3D sparse convolution network to extract voxel features; the voxel features are converted into BEV (bird's-eye-view) features, from which target categories and target frames are predicted and proposals are generated;
s2: secondly, sampling key points in an FPS mode, and extracting multi-scale voxels, original point clouds and BEV characteristics for the key points by using a VSA module;
s3: and finally, dividing the proposal obtained by the multi-scale sparse convolution network based on the voxels into a plurality of grids, extracting key point features to grid points through pooling operation, and further performing fine correction on the target frame by using the features.
2. The system of claim 1, wherein in the detection system, a 3D sparse convolution network performs voxel feature extraction and serves as the backbone to generate 3D proposals; wherein:
3D voxel CNN network: the point cloud input P is divided into L×H×W voxel grids, where the feature of a non-empty voxel grid is the average of the features of the points it contains; the voxelized point cloud features are passed through a series of multi-scale 3D convolution networks to obtain local and global point cloud features, the down-sampling scales of the 3D convolution networks being 1×, 2×, 4× and 8× respectively;
3D proposal generation: the 8×-down-sampled 3D voxel features are stacked along the Z axis to obtain an L/8×W/8 BEV feature map representation; candidate anchor frames are then generated for each class of target in the point cloud scene, with 2×L/8×W/8 3D anchor frames per class; the anchor size is the average size of targets of that class, with two orientations, 0° and 90°.
3. The lidar-based traffic target detection system according to claim 2, wherein in the detection system, the multi-scale voxel features are gathered onto a small number of keypoints, so that the keypoints become a bridge between the 3D voxel features and the proposal refinement network;
n keypoints K = {p1, p2, …, pn} are selected from the point cloud scene by the Furthest-Point-Sampling (FPS) algorithm; using FPS for keypoint sampling ensures that the keypoints are distributed over the whole point cloud scene, so that their features can represent the feature information of the whole scene;
and, with each keypoint position as a reference point, a VSA module gathers the voxel features around the keypoint.
4. The lidar-based traffic target detection system of claim 3, wherein in the detection system, F (lk) ={f 1 (lk) ,...,f Nk (lk) Denotes the K-th scale 3D voxel characteristics; v () = { V = 1 (lk) ,...,v Nk (lk) Denotes the 3D coordinates of the corresponding acceleration feature, which is measured by the voxel index and the size of the corresponding scale voxelCalculated, nk represents the number of non-empty voxels in the k-th scale; for each keypoint pi, its neighboring non-empty voxels are first determined by rk, which is denoted as:
$$S_i^{(l_k)} = \left\{ \left[ f_j^{(l_k)};\; v_j^{(l_k)} - p_i \right]^{\mathsf T} \;\middle|\; \left\| v_j^{(l_k)} - p_i \right\|^2 < r_k,\; \forall v_j^{(l_k)} \in \mathcal{V}^{(l_k)},\; \forall f_j^{(l_k)} \in \mathcal{F}^{(l_k)} \right\}$$
in this process, the relative coordinates of $p_i$ with respect to $v_j^{(l_k)}$ are concatenated with the voxel features to encode the relative positional relationship between the two; the feature representation of keypoint $p_i$ is then generated by a PointNet block:
$$f_i^{(pv_k)} = \max\left\{ G\!\left( \mathcal{M}\!\left( S_i^{(l_k)} \right) \right) \right\}$$
$\mathcal{M}$ denotes random sampling of at most $T_k$ voxel features from $S_i^{(l_k)}$ to save computation; $G$ denotes an MLP network that encodes the voxel features and relative coordinates; the VSA operation is performed with several radii simultaneously, so that voxel features of multiple scales are aggregated onto the keypoints, and the feature of keypoint $p_i$ is represented as:
$$f_i^{(pv)} = \left[ f_i^{(pv_1)},\; f_i^{(pv_2)},\; f_i^{(pv_3)},\; f_i^{(pv_4)} \right]$$
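A single-radius VSA step can be sketched as below; the tanh "encoder" stands in for the learned MLP $G$ (whose weights are not given here), and all names are illustrative:

```python
import numpy as np

def vsa_single_radius(keypoint, voxel_xyz, voxel_feat, radius, t_max, rng):
    """Gather voxel features within `radius` of a keypoint, append relative
    coordinates, randomly keep at most t_max of them (the M(.) operation),
    encode them (stand-in for the MLP G) and max-pool over the set."""
    d = np.linalg.norm(voxel_xyz - keypoint, axis=1)
    idx = np.where(d < radius)[0]
    if len(idx) > t_max:
        idx = rng.choice(idx, t_max, replace=False)
    rel = voxel_xyz[idx] - keypoint                     # relative positions
    encoded = np.tanh(np.concatenate([voxel_feat[idx], rel], axis=1))
    return encoded.max(axis=0)                          # max-pool over the set

rng = np.random.default_rng(0)
kp = np.zeros(3)
vxyz = np.array([[0.1, 0.0, 0.0], [0.2, 0.0, 0.0], [5.0, 0.0, 0.0]])
vfeat = np.array([[1.0], [2.0], [3.0]])
f_pv = vsa_single_radius(kp, vxyz, vfeat, radius=1.0, t_max=16, rng=rng)
```

Only the two voxels within the radius contribute; the distant voxel at x = 5 is excluded, and max-pooling makes the result invariant to the neighbour ordering.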
5. The lidar-based traffic target detection system of claim 4, wherein the extended VSA module: in addition to aggregating the multi-scale voxel features onto the keypoints, the features of the raw point cloud and of the 8× down-sampled BEV map are also aggregated onto the keypoints; each keypoint $p_i$ is projected onto the BEV view, and the neighbouring BEV features $f_i^{(bev)}$ are aggregated onto the keypoint by bilinear interpolation; finally, the keypoint feature is represented by the following formula:
$$f_i^{(p)} = \left[ f_i^{(pv)},\; f_i^{(raw)},\; f_i^{(bev)} \right]$$
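The bilinear sampling of BEV features at a projected keypoint can be sketched as follows (a minimal version; the map and location are toy values):

```python
import numpy as np

def bilinear_bev(bev, x, y):
    """Sample a C x H x W BEV map at continuous (x, y) by bilinear
    interpolation of the four surrounding cells."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * bev[:, y0,     x0]
          + dx       * (1 - dy) * bev[:, y0,     x0 + 1]
          + (1 - dx) * dy       * bev[:, y0 + 1, x0]
          + dx       * dy       * bev[:, y0 + 1, x0 + 1])

bev = np.array([[[0.0, 1.0],
                 [2.0, 3.0]]])          # one channel, 2 x 2 map
f_bev = bilinear_bev(bev, 0.5, 0.5)     # centre of the four cells
```

At the exact centre each of the four cells contributes a quarter, giving (0 + 1 + 2 + 3) / 4 = 1.5.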
6. The lidar-based traffic target detection system according to claim 5, wherein in the detection system, a Predicted Keypoint Weighting (PKW) module is used to predict a weight for each keypoint; the PKW module is supervised by the 3D bounding-box labels, keypoints contained in a 3D target box being labelled as foreground keypoints; after the weight prediction network, the keypoint features are as follows:
$$\tilde{f}_i^{(p)} = \mathcal{A}\!\left( f_i^{(p)} \right) \cdot f_i^{(p)}$$
$\mathcal{A}$ denotes a three-layer MLP network followed by a sigmoid function that predicts the foreground confidence; the PKW network is trained with focal loss.
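A minimal sketch of the PKW weighting and its focal-loss supervision, under stated assumptions (the MLP is reduced to a single logit, and the focal-loss constants are the common defaults, not values from the patent):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pkw_weighting(keypoint_feat, logit):
    """Scale a keypoint feature by its predicted foreground confidence."""
    return sigmoid(logit) * keypoint_feat

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples via (1 - pt)^gamma."""
    pt = p if y == 1 else 1.0 - p
    a = alpha if y == 1 else 1.0 - alpha
    return -a * (1.0 - pt) ** gamma * np.log(pt)

weighted = pkw_weighting(np.array([2.0, 4.0]), 0.0)   # sigmoid(0) = 0.5
```

A confident correct prediction (p = 0.9 for a foreground point) incurs far less loss than an uncertain one (p = 0.5), which is the focusing effect that lets the module cope with the foreground/background imbalance.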
7. The lidar-based traffic target detection system according to claim 6, wherein the detection system uses a region-of-interest pooling network that aggregates keypoint features onto the RoI grid points over several receptive fields simultaneously; each 3D proposal is divided into 6 × 6 × 6 grid points, represented as:
$$G = \{ g_1, \ldots, g_{216} \}$$
the neighboring keypoints for each grid point are determined by:
$$\tilde{\Psi} = \left\{ \left[ f_j^{(p)};\; p_j - g_i \right]^{\mathsf T} \;\middle|\; \left\| p_j - g_i \right\|^2 < \tilde{r},\; \forall p_j \in \mathcal{K} \right\}$$
where $p_j - g_i$ preserves the relative position between the grid point and its neighbouring keypoint $p_j$; the features of the neighbouring keypoints are then aggregated onto the grid point $g_i$ by a PointNet module:
$$\tilde{f}_i^{(g)} = \max\left\{ G\!\left( \mathcal{M}\!\left( \tilde{\Psi} \right) \right) \right\}$$
keypoint features are gathered over different receptive fields by using several radii $r$, and the features from the different receptive fields are concatenated; the vectorized feature is then mapped by a two-layer MLP network into a 256-dimensional feature that represents the proposal.
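The 6 × 6 × 6 lattice of grid points inside an axis-aligned proposal can be sketched as follows (an illustrative version that ignores the proposal's yaw rotation):

```python
import numpy as np

def roi_grid_points(center, size, n=6):
    """Divide an axis-aligned 3D proposal into an n x n x n lattice of
    grid points (6 x 6 x 6 = 216 points, matching the claim)."""
    steps = (np.arange(n) + 0.5) / n - 0.5      # cell centres in (-0.5, 0.5)
    gx, gy, gz = np.meshgrid(*[steps] * 3, indexing="ij")
    offsets = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    return center + offsets * size

grid = roi_grid_points(np.array([0.0, 0.0, 0.0]), np.array([4.0, 2.0, 2.0]))
```

The grid is symmetric about the proposal centre, so the mean of all 216 points recovers the centre itself.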
8. The lidar-based traffic target detection system according to claim 7, wherein the refinement and correction network predicts the size and position of the 3D proposal from the proposal features; the whole refinement network consists of two branches, a confidence prediction branch and a box regression branch, each composed of a two-layer MLP network;
the confidence prediction network takes as its training target the 3D IoU between the 3D region of interest (RoI) and its corresponding ground-truth box; for the k-th 3D RoI, the confidence training target $y_k$ is:
$$y_k = \min\left(1,\; \max\left(0,\; 2\,\mathrm{IoU}_k - 0.5\right)\right)$$
the loss is then computed between the ground-truth confidence and the predicted confidence score:
$$\mathcal{L}_{iou} = -\, y_k \log \tilde{y}_k - \left( 1 - y_k \right) \log\left( 1 - \tilde{y}_k \right)$$
the regression target of the target box is obtained in the conventional residual-based manner and optimized with smooth-L1 loss.
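The IoU-guided confidence target and its binary cross-entropy loss can be sketched directly from the two formulas above:

```python
import numpy as np

def confidence_target(iou):
    """y_k = min(1, max(0, 2*IoU_k - 0.5)): IoU <= 0.25 maps to 0,
    IoU >= 0.75 maps to 1, and it is linear in between."""
    return min(1.0, max(0.0, 2.0 * iou - 0.5))

def bce(y, y_hat):
    """Binary cross-entropy between the soft target and the predicted score."""
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```

For example, an RoI with IoU 0.5 gets a soft target of 0.5 rather than a hard 0/1 label, so the confidence branch learns to rank proposals by localization quality.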
9. The lidar-based traffic target detection system of claim 8, wherein, for a model with input $x$ and a K-dimensional one-hot supervision target $y$, feeding $x$ into the model yields a logit vector $z(x) = [z_1(x), \ldots, z_K(x)]$; a prediction confidence $P(x) = [p_1(x), \ldots, p_K(x)]$ is obtained through a softmax function; the confidence is softened as:
$$p_i(x) = \frac{\exp\left( z_i(x) / \tau \right)}{\sum_{j=1}^{K} \exp\left( z_j(x) / \tau \right)}$$
$\tau$ is the temperature coefficient of temperature scaling; the outputs of the teacher model and the student model after softmax are $P^T(x)$ and $P^S(x)$; for the student model, the training objective is:
$$\mathcal{L} = \left( 1 - \alpha \right) \mathcal{H}\!\left( y,\; P^S(x) \right) + \alpha\, \tau^2\, \mathcal{H}\!\left( P_\tau^T(x),\; P_\tau^S(x) \right)$$
when the temperature coefficient $\tau$ is 1, the objective function degenerates to the cross-entropy of $P^S(x)$ with respect to the soft supervision target $(1-\alpha)\,y + \alpha\, P^T(x)$.
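A sketch of the temperature-scaled softmax and the distillation objective, under the assumption (reconstructed from the τ = 1 degeneration noted above) that the objective is the standard mix of hard-label cross-entropy and a τ²-weighted soft-teacher term:

```python
import numpy as np

def softmax_t(z, tau):
    """Temperature-scaled softmax: larger tau flattens the distribution."""
    e = np.exp((z - z.max()) / tau)     # shift by max for numerical stability
    return e / e.sum()

def cross_entropy(target, pred, eps=1e-12):
    return -np.sum(target * np.log(pred + eps))

def kd_loss(y_onehot, z_student, z_teacher, alpha, tau):
    """(1-a) * H(y, P_S)  +  a * tau^2 * H(P_T_tau, P_S_tau)."""
    p_s = softmax_t(z_student, 1.0)
    p_s_tau = softmax_t(z_student, tau)
    p_t_tau = softmax_t(z_teacher, tau)
    return ((1.0 - alpha) * cross_entropy(y_onehot, p_s)
            + alpha * tau ** 2 * cross_entropy(p_t_tau, p_s_tau))

z = np.array([2.0, 1.0, 0.0])
p1, p5 = softmax_t(z, 1.0), softmax_t(z, 5.0)   # higher tau -> softer
```

With α = 0 the distillation term vanishes and the loss reduces to plain cross-entropy on the hard label, which is a quick sanity check on the formula.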
10. The lidar-based traffic target detection system of claim 9, wherein in the detection system, the distilled knowledge is obtained from the model itself so as to improve the generalization ability of the model; at the t-th epoch, with the self-distillation prediction for $x$ denoted $P_t^S(x)$, the objective function is:
$$\mathcal{L}_t = \mathcal{H}\!\left( \left( 1 - \alpha \right) y + \alpha\, P_{t-1}^S(x),\; P_t^S(x) \right)$$
for the model at the t-th epoch, the training target is $(1-\alpha)\,y + \alpha\, P_{t-1}^S(x)$; the parameter $\alpha$ is the degree of confidence placed in the teacher model.
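The progressive self-distillation target update is a one-liner; the sketch below uses illustrative values (the hypothetical epoch t-1 prediction is made up):

```python
import numpy as np

def self_distill_target(y_onehot, prev_pred, alpha):
    """Epoch-t soft target: (1 - alpha) * hard label + alpha * the model's
    own softmax prediction from epoch t-1."""
    return (1.0 - alpha) * y_onehot + alpha * prev_pred

y = np.array([0.0, 1.0, 0.0])
prev = np.array([0.2, 0.6, 0.2])        # hypothetical epoch t-1 prediction
target = self_distill_target(y, prev, alpha=0.5)
```

Since both the hard label and the previous prediction are probability vectors, any convex mix of them is still a valid probability vector, so the cross-entropy objective above stays well defined at every epoch.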
CN202211692170.1A 2022-12-28 2022-12-28 Traffic target detection system based on laser radar Pending CN115861944A (en)

Publications (1)

Publication Number Publication Date
CN115861944A true CN115861944A (en) 2023-03-28

Family

ID=85655309



Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116259028A (en) * 2023-05-06 2023-06-13 杭州宏景智驾科技有限公司 Abnormal scene detection method for laser radar, electronic device and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination