CN115908829A - Point-pillar-based second-order multi-attention mechanism 3D point cloud target detection method - Google Patents

Point-pillar-based second-order multi-attention mechanism 3D point cloud target detection method

Info

Publication number
CN115908829A
Authority
CN
China
Prior art keywords
order
point
attention mechanism
pseudo
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211104980.0A
Other languages
Chinese (zh)
Inventor
严一尔 (Yan Yi'er)
李鑫 (Li Xin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202211104980.0A
Publication of CN115908829A
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a point-pillar-based second-order multi-attention mechanism 3D point cloud target detection method comprising the following steps: providing a method for realizing target detection through a point-pillar-based second-order point attention mechanism, a second-order channel attention mechanism and a pseudo-image spatial attention mechanism; providing a network composed mainly of the second-order point attention mechanism, a pillar feature network, the second-order channel attention mechanism, a backbone network, the pseudo-image spatial attention mechanism and an SSD detection head; voxelizing the point cloud, applying the second-order point attention mechanism to the point cloud and converting it into pseudo-image features; applying the second-order channel attention mechanism to the pseudo-image features and outputting pseudo-space features; and applying the pseudo-image spatial attention mechanism to the pseudo-space features and outputting the detection result. The method guarantees a relatively high detection speed together with high feature extraction accuracy.

Description

Point-pillar-based second-order multi-attention mechanism 3D point cloud target detection method
Technical Field
The invention belongs to the field of 3D target detection from pure lidar point clouds, and particularly relates to a method that realizes target detection through three point-pillar-based mechanisms: a second-order point attention mechanism, a second-order channel attention mechanism and a pseudo-image spatial attention mechanism.
Background
Currently, 3D point cloud target detection methods are increasingly widely used in computer vision, autonomous driving, robotics, virtual reality and related fields. Compared with target detection on two-dimensional images, lidar provides more reliable depth information, locates objects more accurately and supplies shape information. However, 3D point clouds lack texture, suffer from occlusion and truncation, and exhibit uneven reflectance; lidar point clouds are sparse and their density varies greatly, which often degrades the precision of traditional 3D target detection methods based on hand-crafted features. In recent years, deep neural networks have shown excellent feature extraction capability and can process high-dimensional data, so the precision of deep-learning-based 3D point cloud target detection has improved to a certain extent. Nevertheless, due to the high sparsity and intrinsic irregularity of point clouds, there is still considerable room to improve the detection accuracy of some categories.
Li et al. proposed VeloFCN in 2016, which converts the point cloud into a front-view feature representation and then applies an off-the-shelf detector (see B. Li, T. Zhang, and T. Xia, "Vehicle detection from 3D lidar using fully convolutional network," in Robotics: Science and Systems, 2016). Qi et al. proposed PointNet in 2017, which for the first time feeds raw point cloud data directly into a deep neural network for training (see C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in CVPR, 2017). In 2018, Martin Simon et al. introduced Complex-YOLO, which projects the point cloud onto a two-dimensional plane and applies image-based target detection, thereby accelerating network inference; however, the projection is limited by the sparsity of the point cloud, so convolution cannot extract features well (see M. Simon, S. Milz, K. Amende, and H.-M. Gross, "Complex-YOLO: Real-time 3D object detection on point clouds," arXiv:1803.06199, 2018). To alleviate the occlusion caused by overlap in the front view, Yang et al. proposed PIXOR, which rasterizes the point cloud into a more compact BEV representation; its obvious disadvantage is that features must be extracted manually, and such hand-crafted designs cannot fully exploit the three-dimensional information of an object and do not generalize well to other lidar sensors (see B. Yang, W. Luo, and R. Urtasun, "PIXOR: Real-time 3D object detection from point clouds," in CVPR, 2018). In 2018, Zhou et al. first proposed VoxelNet, an end-to-end trainable network and a general 3D detection framework; unlike most previous work, VoxelNet learns information-rich feature representations and can learn different feature representations from the point cloud simultaneously, but 3D convolution is too time-consuming and computationally heavy, resulting in slow network inference (see Y. Zhou and O. Tuzel, "VoxelNet: End-to-end learning for point cloud based 3D object detection," in CVPR, 2018). Yan et al. then proposed SECOND, which reduces memory consumption and speeds up computation through sparse convolution operations (see Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, 18(10), 2018). In 2019, A. H. Lang et al. proposed PointPillars, which encodes the point cloud into vertical pillars, essentially a special partition of voxels, in order to improve inference speed with a standard 2D convolutional detection pipeline (see A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in CVPR, 2019).
Furthermore, the method proposed in the prior-art paper A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in CVPR, 2019, is implemented by the following steps. First, the input raw point cloud is divided into regions; the point cloud is voxelized and then converted into a sparse pseudo-image. A fixed number of points is randomly retained in each pillar, and in this step the feature dimension of the points in a pillar is augmented from the original 4-dimensional lidar information to 9 dimensions, so that every point carries a 9-dimensional feature. In the backbone network, features are learned with a 2D network. The backbone network consists of two sub-networks: a top-down network that produces features at progressively smaller spatial resolution, and a second network that upsamples and concatenates the top-down features. The final output feature is the concatenation of all features that originate from the same dimension at different strides. In the detection head module, an SSD detection head performs bounding-box regression. A 2D intersection over union (IoU) is used to match prior boxes to the ground truth; the height and elevation of the box are not used for matching but serve as additional regression targets. Although the PointPillars network speeds up detection by pillarizing the point cloud, feature information of the input is usually lost in the down-sampling process of the backbone network. Moreover, the points inside a voxel are correlated with each other, so processing the points of the point cloud in isolation inevitably loses part of the useful geometric information and degrades detection precision. In the backbone network, processing each channel separately and independently ignores the correlation between channels, so part of the useful information is lost and detection precision decreases. After the pseudo-image is generated, the features of the pseudo-space are all processed identically; since not all features of the pseudo-space contribute equally to the detection task, and regions more relevant to the task are more important, treating them identically also reduces the final detection precision. A real-time and accurate 3D point cloud target detection method that achieves a dynamic balance between speed and precision is therefore urgently needed.
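For illustration only, the following is a minimal sketch of the pillar feature augmentation described above, in which raw 4-dimensional lidar points are expanded to 9 dimensions per pillar. The function name, the choice of 32 as the maximum number of points per pillar and the NumPy implementation are assumptions made for the example, not details taken from the prior-art paper or from the invention.

```python
import numpy as np

def augment_pillar_points(points, pillar_center_xy, max_points=32):
    """Illustrative sketch: augment raw lidar points (x, y, z, reflectance)
    inside one pillar to a 9-dimensional PointPillars-style encoding.

    points:            (n, 4) array of raw points in the pillar
    pillar_center_xy:  (2,) x/y center of the pillar grid cell
    Returns an array of shape (max_points, 9); pillars with fewer points
    are zero-padded, pillars with more points are randomly sub-sampled.
    """
    n = points.shape[0]
    if n > max_points:                        # random sub-sampling of excess points
        points = points[np.random.choice(n, max_points, replace=False)]
        n = max_points

    xyz = points[:, :3]
    centroid = xyz.mean(axis=0)               # mean of all points in the pillar
    offset_c = xyz - centroid                 # offsets to the pillar centroid
    offset_p = xyz[:, :2] - pillar_center_xy  # x/y offsets to the pillar center

    feats = np.concatenate([points, offset_c, offset_p], axis=1)  # (n, 9)

    out = np.zeros((max_points, 9), dtype=np.float32)
    out[:n] = feats
    return out
```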
Disclosure of Invention
In view of the above defects of the prior art, an object of the present invention is to provide a real-time and accurate 3D point cloud target detection method that achieves a dynamic balance between speed and accuracy, and to solve the problem that existing methods cannot perform high-accuracy target detection in real time, through three point-pillar-based mechanisms: a second-order point attention mechanism, a second-order channel attention mechanism and a pseudo-image spatial attention mechanism.
The technical problems solved by the invention are as follows:
Firstly, in the feature extraction network, feature information of the input is usually lost during down-sampling in the backbone network, and the points inside a voxel are correlated with each other, so processing the points of the point cloud in isolation inevitably loses part of the useful geometric information and degrades detection precision. The invention provides a point-pillar-based second-order point attention mechanism which connects the points within the same voxel with one another, retaining more useful information and extracting finer feature information at a relatively small cost in inference speed.
Secondly, in the backbone network, processing each channel in isolation ignores the correlation between the channels, so part of the useful information is lost and detection precision decreases. The invention provides a point-pillar-based second-order channel attention mechanism which links the channels, retains more useful feature information and improves the overall detection precision.
Thirdly, after the pseudo-image is generated, the features of the pseudo-space are all processed identically. Since not all features of the pseudo-space contribute equally to the detection task, and regions more relevant to the task are more important, treating them identically affects the detection accuracy. In view of this, the invention provides a point-pillar-based pseudo-image spatial attention mechanism which assigns a different weight to each pixel of the pseudo-space according to the importance of its region to the task, so as to obtain a more accurate detection result.
The technical scheme adopted by the invention to solve the above technical problems is as follows: a point-pillar-based second-order multi-attention mechanism 3D point cloud target detection method comprising the following steps:
S1: providing a method for realizing target detection through a point-pillar-based second-order point attention mechanism, a second-order channel attention mechanism and a pseudo-image spatial attention mechanism, respectively;
S2: providing, based on S1, a network composed mainly of the second-order point attention mechanism, a pillar feature network, the second-order channel attention mechanism, a backbone network, the pseudo-image spatial attention mechanism and an SSD detection head, the network further comprising a second-order attention module that is instantiated as a second-order point attention module and a second-order channel attention module;
S3: voxelizing the point cloud, then applying the second-order point attention mechanism to the point cloud and converting it into pseudo-image features;
S4: applying the second-order channel attention mechanism to the pseudo-image features and outputting pseudo-space features;
S5: applying the pseudo-image spatial attention mechanism to the pseudo-space features and outputting the detection result;
wherein the SSD detection head predicts the three-dimensional bounding box of the object from the backbone features; the second-order attention module consists of global max pooling, covariance pooling and row convolution; in S3, when point features are fed to the second-order attention module, the resulting second-order point attention weights are its output, and this process constitutes the second-order point attention module; when channel features are fed to the second-order attention module, second-order channel attention weights are obtained, and this process constitutes the second-order channel attention module. An overall sketch of this pipeline is given below.
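To make the flow of S1 to S5 concrete, a minimal PyTorch skeleton of the forward pass is sketched below. The class and module names are illustrative placeholders, and the internals of each module are omitted; this is not the actual implementation of the invention.

```python
import torch.nn as nn

class SecondOrderMultiAttentionDetector(nn.Module):
    """Illustrative skeleton of the S1-S5 pipeline (module internals omitted)."""
    def __init__(self, sopa, pillar_net, soca, backbone, sapi, ssd_head):
        super().__init__()
        self.sopa = sopa              # second-order point attention (S3)
        self.pillar_net = pillar_net  # pillar feature network -> pseudo-image
        self.soca = soca              # second-order channel attention (S4)
        self.backbone = backbone      # 2D backbone on the pseudo-image
        self.sapi = sapi              # pseudo-image spatial attention (S5)
        self.ssd_head = ssd_head      # SSD detection head -> 3D boxes

    def forward(self, voxel_points):
        x = self.sopa(voxel_points)               # weight points inside each voxel
        pseudo_image = self.pillar_net(x)         # scatter pillar features into a pseudo-image
        pseudo_image = self.soca(pseudo_image)    # weight the channels
        feats = self.backbone(pseudo_image)       # pseudo-space features
        feats = self.sapi(feats)                  # weight the spatial locations
        return self.ssd_head(feats)               # detection result
```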
In a given K-th voxel, all points in the voxel are denoted X ∈ R^(N×C), where N is the maximum number of points and C is the number of channels. After global max pooling, a vector g ∈ R^(N×1) composed of the maximum value of each point over its dimensions is obtained, where N×1 denotes a vector of N rows and 1 column. The vector g is fed into a fully connected layer W1 to obtain a vector u ∈ R^(t×1), where t is the reduced number of points after the W1 fully connected layer; a ReLU activation function follows W1. The covariance matrix between the points in the same voxel, Cov ∈ R^(t×t), is then computed, where t is the number of points in the second-order point attention mechanism (and the number of channels in the second-order channel attention mechanism) and t×t is its dimension. Convolving the covariance matrix row by row yields a vector v ∈ R^(t×1), which is fed into the fully connected layer W2 and passed through a Sigmoid activation function to obtain the N-dimensional attention vector s ∈ R^(N×1).
In S3, the second-order point attention mechanism is expressed as:
s = δ(W2 RC(Cov(σ(W1(GMP(X))))))
where Cov(·) computes the covariance matrix of the points, RC(·) is the row convolution, GMP(·) is global max pooling, σ is the ReLU activation function, δ is the Sigmoid activation function, W1 and W2 are two different fully connected layers, and X ∈ R^(N×C) is the set of points in the given K-th voxel.
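A rough PyTorch sketch of the second-order attention module (global max pooling, fully connected reduction with ReLU, covariance pooling, row convolution, fully connected expansion with Sigmoid) is given below for illustration. The exact covariance-pooling formulation is not fully specified in the text, so a simple outer-product second-order statistic is used as a stand-in, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SecondOrderAttention(nn.Module):
    """Sketch of the second-order attention module under the dimension
    interpretation given in the text above; not the actual implementation."""
    def __init__(self, n, t):
        super().__init__()
        self.fc1 = nn.Linear(n, t)   # W1: reduce N points to t
        # row-by-row convolution over the t x t covariance matrix
        self.row_conv = nn.Conv1d(t, t, kernel_size=t, groups=t)
        self.fc2 = nn.Linear(t, n)   # W2: expand back to N attention weights

    def forward(self, x):
        # x: (B, N, C) -- points of one voxel (or channels, for the channel variant)
        g = x.max(dim=2).values                    # global max pooling -> (B, N)
        u = torch.relu(self.fc1(g))                # ReLU(W1 g) -> (B, t)
        # second-order statistic standing in for the point covariance
        centered = u - u.mean(dim=1, keepdim=True)
        cov = centered.unsqueeze(2) * centered.unsqueeze(1)   # (B, t, t)
        v = self.row_conv(cov).squeeze(-1)         # row convolution -> (B, t)
        s = torch.sigmoid(self.fc2(v))             # Sigmoid(W2 v) -> (B, N)
        return x * s.unsqueeze(-1)                 # re-weight the points
```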
The second-order channel attention mechanism is similar to the second-order point attention mechanism: the channel features produce analogous weights after passing through the second-order attention module. In S4, the second-order channel attention mechanism is expressed as:
M = δ(W2 RC(Cov(σ(W1(GMP(Y))))))
where Y ∈ R^(C×H×W) denotes the features of the pseudo-image, and H and W are the height and width of the pseudo-image.
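Under the same assumptions as the previous sketch, the channel variant can reuse that module by letting the channels of the pseudo-image play the role of the points. The helper below is illustrative only; its name and interface are not taken from the text.

```python
import torch

def second_order_channel_attention(soca_module, pseudo_image):
    """pseudo_image: (B, C, H, W); soca_module: SecondOrderAttention(n=C, t=reduced)."""
    b, c, h, w = pseudo_image.shape
    y = pseudo_image.view(b, c, h * w)   # (B, C, H*W): channels act as the "point" axis
    weighted = soca_module(y)            # per-channel second-order attention weights applied
    return weighted.view(b, c, h, w)
```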
According to the importance of each region of the pseudo-space to the task, a different weight is assigned to every pixel of the pseudo-space so as to obtain a more accurate detection result. The pseudo-space feature P and the signal G are taken as inputs, and the spatial attention weight produced as the final output is S; the pseudo-image spatial attention mechanism thus maps P and G to the weight S.
The relationship between P and G is established through two linear transformation operations, each implemented as a 1×1 convolution: one performs a 1×1 convolution (linear transformation) on P, and the other performs a 1×1 convolution (linear transformation) on G.
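Because the exact combination of P and G is not recoverable from the text, the sketch below assumes an additive attention-gate style fusion of the two 1×1-convolved inputs purely for illustration; the layer names and the fusion rule are assumptions, not the formulation of the invention.

```python
import torch
import torch.nn as nn

class PseudoImageSpatialAttention(nn.Module):
    """Hypothetical sketch of the pseudo-image spatial attention."""
    def __init__(self, channels_p, channels_g, inter_channels):
        super().__init__()
        self.w_p = nn.Conv2d(channels_p, inter_channels, kernel_size=1)  # 1x1 conv on P
        self.w_g = nn.Conv2d(channels_g, inter_channels, kernel_size=1)  # 1x1 conv on G
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)           # one weight per pixel

    def forward(self, p, g):
        # p: (B, Cp, H, W) pseudo-space feature; g: (B, Cg, H, W) guiding signal
        fused = torch.relu(self.w_p(p) + self.w_g(g))   # assumed additive fusion
        s = torch.sigmoid(self.psi(fused))              # (B, 1, H, W) spatial weights S
        return p * s                                    # re-weight each pixel of P
```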
The three-dimensional ground-truth box is parameterized as (x, y, z, w, l, h, θ), where (x, y, z) is the center position, (w, l, h) is the size of the bounding box and θ is its heading angle. The localization regression residuals between the ground truth and the anchor are defined as follows:
Δx = (x_gt - x_a)/d_a, Δy = (y_gt - y_a)/d_a, Δz = (z_gt - z_a)/h_a,
Δw = log(w_gt/w_a), Δl = log(l_gt/l_a), Δh = log(h_gt/h_a),
Δθ = sin(θ_gt - θ_a),
where d_a = √((w_a)² + (l_a)²), gt denotes the ground-truth value and a denotes the anchor-box parameters; (x_gt, y_gt, z_gt) are the center coordinates of the 3D ground-truth box, (l_gt, w_gt, h_gt) are its length, width and height, and θ_gt is its yaw angle around the Z axis; (x_a, y_a, z_a) are the center coordinates of the anchor box, (l_a, w_a, h_a) are its length, width and height, and θ_a is its yaw angle around the Z axis.
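A small illustrative helper for the residual encoding above is given below; the dictionary-based interface is an assumption made for the example.

```python
import math

def encode_box_residuals(gt, anchor):
    """Sketch of the localization residual encoding described above.
    gt, anchor: dicts with keys x, y, z, w, l, h, theta (assumed layout)."""
    d_a = math.sqrt(anchor["w"] ** 2 + anchor["l"] ** 2)   # anchor box diagonal
    return {
        "dx": (gt["x"] - anchor["x"]) / d_a,
        "dy": (gt["y"] - anchor["y"]) / d_a,
        "dz": (gt["z"] - anchor["z"]) / anchor["h"],
        "dw": math.log(gt["w"] / anchor["w"]),
        "dl": math.log(gt["l"] / anchor["l"]),
        "dh": math.log(gt["h"] / anchor["h"]),
        "dtheta": math.sin(gt["theta"] - anchor["theta"]),  # sin residual cannot separate flipped boxes
    }
```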
the regression loss is expressed as:
L_loc = Σ_{b ∈ (x, y, z, w, l, h, θ)} SmoothL1(Δb)
where SmoothL1 is the SmoothL1 loss function. Since the angle localization loss cannot distinguish flipped boxes, a softmax classification loss L_dir over the discretized heading directions is used, which enables the network to learn the orientation. Focal loss is used as the object classification loss:
L_cls = -a(1-p)^r log p
where p is the probability of a correct detection box and r and a are preset parameters. The total loss is finally expressed as:
L = (1/N_pos)(β_loc·L_loc + β_cls·L_cls + β_dir·L_dir)
where N_pos is the number of correct detection boxes and β_loc, β_cls and β_dir are preset values.
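The following sketch combines the three loss terms as described. The β defaults follow the embodiment given later in the description, while the focal-loss parameters a and r (written alpha and gamma below) are common defaults assumed for the example rather than values taken from the text.

```python
import torch
import torch.nn.functional as F

def detection_loss(loc_pred, loc_target, cls_prob, cls_target,
                   dir_logits, dir_target, num_pos,
                   beta_loc=2.0, beta_cls=1.0, beta_dir=0.2,
                   alpha=0.25, gamma=2.0):
    """Illustrative combination of the regression, classification and direction losses."""
    # SmoothL1 regression loss over the 7 box residuals (x, y, z, w, l, h, theta)
    l_loc = F.smooth_l1_loss(loc_pred, loc_target, reduction="sum")

    # focal loss: -a * (1 - p)^r * log(p), with p the probability of the true class
    p_t = torch.where(cls_target.bool(), cls_prob, 1.0 - cls_prob)
    l_cls = (-alpha * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))).sum()

    # softmax classification over discretized heading directions
    l_dir = F.cross_entropy(dir_logits, dir_target, reduction="sum")

    return (beta_loc * l_loc + beta_cls * l_cls + beta_dir * l_dir) / max(num_pos, 1)
```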
The advantages and beneficial effects of the invention are as follows:
The second-order point attention mechanism of the invention takes into account the correlation between the points inside a voxel; compared with existing methods that process the points in a voxel in isolation, it retains more useful geometric feature information and improves detection accuracy. Similarly, the second-order channel attention mechanism considers the correlation between channels and further improves detection precision. The pseudo-image spatial attention mechanism considers that not all features of the pseudo-space contribute equally to the detection task and that regions more relevant to the task are more important, so it assigns a different weight to each pixel of the pseudo-space features, further improving feature extraction. Together, the three point-pillar-based mechanisms guarantee a relatively high detection speed and high extraction accuracy.
Drawings
FIG. 1 is a diagram of one embodiment of the second-order attention module architecture of the present invention;
FIG. 2 is an overall framework diagram of the point-pillar-based second-order multi-attention mechanism 3D point cloud target detection method.
Detailed Description
The following examples illustrate the present invention in detail and give specific embodiments and procedures of its implementation, but the scope of protection of the present invention is not limited to the following examples.
The invention provides a method for realizing target detection through a point-pillar-based Second-Order Point Attention mechanism (SOPA), Second-Order Channel Attention mechanism (SOCA) and Pseudo-Image Spatial Attention mechanism (SAPI), comprising the following steps:
S1: providing a method for realizing target detection through the point-pillar-based second-order point attention mechanism, the second-order channel attention mechanism and the pseudo-image spatial attention mechanism, respectively;
S2: providing, based on S1, a network composed mainly of the second-order point attention mechanism, a pillar feature network, the second-order channel attention mechanism, a backbone network, the pseudo-image spatial attention mechanism and an SSD detection head, the network further comprising a second-order attention module that is instantiated as a second-order point attention module and a second-order channel attention module;
S3: voxelizing the point cloud, then applying the second-order point attention mechanism to the point cloud and converting it into pseudo-image features;
S4: applying the second-order channel attention mechanism to the pseudo-image features and outputting pseudo-space features;
S5: applying the pseudo-image spatial attention mechanism to the pseudo-space features and outputting the detection result;
wherein the SSD detection head predicts the three-dimensional bounding box of the object from the backbone features; the second-order attention module consists of global max pooling, covariance pooling and row convolution; in S3, when point features are fed to the second-order attention module, the resulting second-order point attention weights are its output, and this process constitutes the second-order point attention module; when channel features are fed to the second-order attention module, second-order channel attention weights are obtained, and this process constitutes the second-order channel attention module.
In a given K-th voxel, all points in the voxel are denoted X ∈ R^(N×C), where N is the maximum number of points and C is the number of channels. After global max pooling, a vector g ∈ R^(N×1) composed of the maximum value of each point over its dimensions is obtained, where N×1 denotes a vector of N rows and 1 column. The vector g is fed into a fully connected layer W1 to obtain a vector u ∈ R^(t×1), where t is the reduced number of points after the W1 fully connected layer; a ReLU activation function follows W1. The covariance matrix between the points in the same voxel, Cov ∈ R^(t×t), is then computed, where t is the number of points in the second-order point attention mechanism (and the number of channels in the second-order channel attention mechanism) and t×t is its dimension. Convolving the covariance matrix row by row yields a vector v ∈ R^(t×1), which is fed into the fully connected layer W2 and passed through a Sigmoid activation function to obtain the N-dimensional attention vector s ∈ R^(N×1).
In S3, the second-order point attention mechanism is expressed as:
s = δ(W2 RC(Cov(σ(W1(GMP(X))))))
where Cov(·) computes the covariance matrix of the points, RC(·) is the row convolution, GMP(·) is global max pooling, σ is the ReLU activation function, δ is the Sigmoid activation function, W1 and W2 are two different fully connected layers, and X ∈ R^(N×C) is the set of points in the given K-th voxel.
The second-order channel attention mechanism is similar to the second-order point attention mechanism: the channel features produce analogous weights after passing through the second-order attention module. In S4, the second-order channel attention mechanism is expressed as:
M = δ(W2 RC(Cov(σ(W1(GMP(Y))))))
where Y ∈ R^(C×H×W) denotes the features of the pseudo-image, and H and W are the height and width of the pseudo-image.
According to the importance of each region of the pseudo-space to the task, a different weight is assigned to every pixel of the pseudo-space so as to obtain a more accurate detection result. The pseudo-space feature P and the signal G are taken as inputs, and the spatial attention weight produced as the final output is S; the pseudo-image spatial attention mechanism thus maps P and G to the weight S.
The relationship between P and G is established through two linear transformation operations, each implemented as a 1×1 convolution: one performs a 1×1 convolution (linear transformation) on P, and the other performs a 1×1 convolution (linear transformation) on G.
The three-dimensional ground-truth box is parameterized as (x, y, z, w, l, h, θ), where (x, y, z) is the center position, (w, l, h) is the size of the bounding box and θ is its heading angle. The localization regression residuals between the ground truth and the anchor are defined as follows:
Δx = (x_gt - x_a)/d_a, Δy = (y_gt - y_a)/d_a, Δz = (z_gt - z_a)/h_a,
Δw = log(w_gt/w_a), Δl = log(l_gt/l_a), Δh = log(h_gt/h_a),
Δθ = sin(θ_gt - θ_a),
where d_a = √((w_a)² + (l_a)²), gt denotes the ground-truth value and a denotes the anchor-box parameters; (x_gt, y_gt, z_gt) are the center coordinates of the 3D ground-truth box, (l_gt, w_gt, h_gt) are its length, width and height, and θ_gt is its yaw angle around the Z axis; (x_a, y_a, z_a) are the center coordinates of the anchor box, (l_a, w_a, h_a) are its length, width and height, and θ_a is its yaw angle around the Z axis.
the regression loss is expressed as:
L_loc = Σ_{b ∈ (x, y, z, w, l, h, θ)} SmoothL1(Δb)
where SmoothL1 is the SmoothL1 loss function. Since the angle localization loss cannot distinguish flipped boxes, a softmax classification loss L_dir over the discretized heading directions is used, which enables the network to learn the orientation. Focal loss is used as the object classification loss:
L_cls = -a(1-p)^r log p
where p is the probability of a correct detection box and r and a are preset parameters. The total loss is finally expressed as:
L = (1/N_pos)(β_loc·L_loc + β_cls·L_cls + β_dir·L_dir)
where N_pos is the number of correct detection boxes. Here β_loc = 2, β_cls = 1 and β_dir = 0.2.
The foregoing is a detailed description of the preferred embodiments of the invention. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concept. Therefore, the technical solutions that those skilled in the art can obtain through logical analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (4)

1. A point-pillar-based second-order multi-attention mechanism 3D point cloud target detection method, characterized by comprising the following steps:
S1: providing a method for realizing target detection through a point-pillar-based second-order point attention mechanism, a second-order channel attention mechanism and a pseudo-image spatial attention mechanism, respectively;
S2: providing, based on S1, a network composed mainly of the second-order point attention mechanism, a pillar feature network, the second-order channel attention mechanism, a backbone network, the pseudo-image spatial attention mechanism and an SSD detection head, the network further comprising a second-order attention module that is instantiated as a second-order point attention module and a second-order channel attention module;
S3: voxelizing the point cloud, then applying the second-order point attention mechanism to the point cloud and converting it into pseudo-image features;
S4: applying the second-order channel attention mechanism to the pseudo-image features and outputting pseudo-space features;
S5: applying the pseudo-image spatial attention mechanism to the pseudo-space features and outputting the detection result;
wherein the SSD detection head predicts the three-dimensional bounding box of the object from the backbone features; the second-order attention module consists of global max pooling, covariance pooling and row convolution; in S3, when point features are fed to the second-order attention module, the resulting second-order point attention weights are its output, and this process constitutes the second-order point attention module; when channel features are fed to the second-order attention module, second-order channel attention weights are obtained, and this process constitutes the second-order channel attention module.
2. The point-pillar-based second-order multi-attention mechanism 3D point cloud target detection method according to claim 1, characterized in that:
in a given K-th voxel, all points in the voxel are denoted X ∈ R^(N×C), where N is the maximum number of points and C is the number of channels; after global max pooling, a vector g ∈ R^(N×1) composed of the maximum value of each point over its dimensions is obtained, where N×1 denotes a vector of N rows and 1 column; the vector g is fed into a fully connected layer W1 to obtain a vector u ∈ R^(t×1), where t is the reduced number of points after the W1 fully connected layer, and a ReLU activation function follows W1; the covariance matrix between the points in the same voxel, Cov ∈ R^(t×t), is then computed, where t is the number of points in the second-order point attention mechanism, t is the number of channels in the second-order channel attention mechanism, and t×t is its dimension; convolving the covariance matrix row by row yields a vector v ∈ R^(t×1), which is fed into the fully connected layer W2 and passed through a Sigmoid activation function to obtain the N-dimensional attention vector s ∈ R^(N×1);
in S3, the second-order point attention mechanism is expressed as:
s = δ(W2 RC(Cov(σ(W1(GMP(X))))))
where Cov(·) computes the covariance matrix of the points, RC(·) is the row convolution, GMP(·) is global max pooling, σ is the ReLU activation function, δ is the Sigmoid activation function, W1 and W2 are two different fully connected layers, and X ∈ R^(N×C) is the set of points in the given K-th voxel;
the second-order channel attention mechanism is similar to the second-order point attention mechanism: the channel features produce analogous weights after passing through the second-order attention module, and in S4 the second-order channel attention mechanism is expressed as:
M = δ(W2 RC(Cov(σ(W1(GMP(Y))))))
where Y ∈ R^(C×H×W) denotes the features of the pseudo-image, and H and W are the height and width of the pseudo-image.
3. The point-pillar-based second-order multi-attention mechanism 3D point cloud target detection method according to claim 2, characterized in that: according to the importance of each region of the pseudo-space to the task, a different weight is assigned to every pixel of the pseudo-space so as to obtain a more accurate detection result; the pseudo-space feature P and the signal G are taken as inputs, and the spatial attention weight produced as the final output is S, so that the pseudo-image spatial attention mechanism maps P and G to the weight S; the relationship between P and G is established through two linear transformation operations, each implemented as a 1×1 convolution, one performing a linear transformation on P and the other performing a linear transformation on G.
4. The point-pillar-based second-order multi-attention mechanism 3D point cloud target detection method according to claim 3, characterized in that: the three-dimensional ground-truth box is parameterized as (x, y, z, w, l, h, θ), where (x, y, z) is the center position, (w, l, h) is the size of the bounding box and θ is its heading angle; the localization regression residuals between the ground truth and the anchor are defined as:
Δx = (x_gt - x_a)/d_a, Δy = (y_gt - y_a)/d_a, Δz = (z_gt - z_a)/h_a,
Δw = log(w_gt/w_a), Δl = log(l_gt/l_a), Δh = log(h_gt/h_a),
Δθ = sin(θ_gt - θ_a),
where d_a = √((w_a)² + (l_a)²), gt denotes the ground-truth value and a denotes the anchor-box parameters; (x_gt, y_gt, z_gt) are the center coordinates of the 3D ground-truth box, (l_gt, w_gt, h_gt) are its length, width and height, and θ_gt is its yaw angle around the Z axis; (x_a, y_a, z_a) are the center coordinates of the anchor box, (l_a, w_a, h_a) are its length, width and height, and θ_a is its yaw angle around the Z axis;
the regression loss is expressed as:
L_loc = Σ_{b ∈ (x, y, z, w, l, h, θ)} SmoothL1(Δb)
where SmoothL1 is the SmoothL1 loss function; since the angle localization loss cannot distinguish flipped boxes, a softmax classification loss L_dir over the discretized heading directions is used, which enables the network to learn the orientation; focal loss is used as the object classification loss:
L_cls = -a(1-p)^r log p
where p is the probability of a correct detection box and r and a are preset parameters; the total loss is finally expressed as:
L = (1/N_pos)(β_loc·L_loc + β_cls·L_cls + β_dir·L_dir)
where N_pos is the number of correct detection boxes and β_loc, β_cls and β_dir are preset values.
CN202211104980.0A 2022-09-09 2022-09-09 Point column-based two-order multi-attention mechanism 3D point cloud target detection method Pending CN115908829A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211104980.0A CN115908829A (en) 2022-09-09 2022-09-09 Point column-based two-order multi-attention mechanism 3D point cloud target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211104980.0A CN115908829A (en) 2022-09-09 2022-09-09 Point column-based two-order multi-attention mechanism 3D point cloud target detection method

Publications (1)

Publication Number Publication Date
CN115908829A true CN115908829A (en) 2023-04-04

Family

ID=86488691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211104980.0A Pending CN115908829A (en) 2022-09-09 2022-09-09 Point column-based two-order multi-attention mechanism 3D point cloud target detection method

Country Status (1)

Country Link
CN (1) CN115908829A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117111013A (en) * 2023-08-22 2023-11-24 南京慧尔视智能科技有限公司 Radar target tracking track starting method, device, equipment and medium
CN117111013B (en) * 2023-08-22 2024-04-30 南京慧尔视智能科技有限公司 Radar target tracking track starting method, device, equipment and medium
CN116863433A (en) * 2023-09-04 2023-10-10 深圳大学 Target detection method based on point cloud sampling and weighted fusion and related equipment
CN116863433B (en) * 2023-09-04 2024-01-09 深圳大学 Target detection method based on point cloud sampling and weighted fusion and related equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination