CN117974990A

CN117974990A - Point cloud target detection method based on attention mechanism and feature enhancement structure

Info

Publication number: CN117974990A
Application number: CN202410379713.7A
Authority: CN
Inventors: 孙畅; 李月华
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2024-03-29
Filing date: 2024-03-29
Publication date: 2024-05-03
Anticipated expiration: 2044-03-29
Also published as: CN117974990B

Abstract

The specification discloses a point cloud target detection method based on an attention mechanism and a feature enhancement structure, which can input cloud data to be detected into a point cloud detection model, so as to determine point attention weights corresponding to the cloud data to be detected through a pseudo image conversion module in the point cloud detection model, determine weighted dense tensors according to the point attention weights, map the weighted dense tensors into basic pseudo images, carry out channel attention weighting on the basic pseudo images to obtain channel attention weighting results, and carry out space attention weighting on the channel attention weighting results to obtain target pseudo images. And then, inputting the target pseudo image into a feature extraction module for feature enhancement in a point cloud detection model to obtain point cloud features corresponding to the point cloud data to be detected, so that the point cloud detection model carries out target detection on the point cloud data to be detected according to the point cloud features, and the accuracy of point cloud detection is improved.

Description

Point cloud target detection method based on attention mechanism and feature enhancement structure

Technical Field

The present disclosure relates to the field of neural networks and point cloud target detection, and in particular, to a point cloud target detection method based on an attention mechanism and a feature enhancement structure.

Background

Currently, object detection is a fundamental technology in computer vision, which is widely used in many fields, such as autopilot, smart city, robot, etc. At present, the 2D target detection algorithm based on the image has great effect, and the scientific research direction and the application direction accumulate the achievements with influence. 3D target detection algorithms (including image-based, point cloud-based, multi-modal-based, etc.) are in rapid development, and many researchers have been devoted to exploring 3D target detection algorithms, especially point cloud-based 3D target detection algorithms. Compared with 2D target detection, the 3D target detection based on the point cloud can acquire the geometric and depth information of the target, so that the target can be positioned better.

In the prior art, the original point cloud can be divided into three-dimensional voxels, such as Pillars (strut), so that the processing efficiency of the point cloud characteristics is improved, wherein the original point cloud is converted into a series of Pillars in the voxelization process by using a PointPillars model, pointPillars, and then a pseudo image is obtained, and finally 2D convolution and deconvolution operations are performed on the pseudo image, so that the characteristics are extracted, and a 3D detection result is output. The reasons for the relatively low detection accuracy of PointPillars mainly include: (1) PointPillars has certain information loss phenomenon when generating a pseudo image; (2) The manner used in the pseudo-image feature extraction stage is relatively simple and does not fully exploit the context information.

Disclosure of Invention

The present disclosure provides a point cloud target detection method based on an attention mechanism and a feature enhancement structure, so as to partially solve the above-mentioned problems in the prior art.

The technical scheme adopted in the specification is as follows:

the specification provides a point cloud target detection method based on an attention mechanism and a feature enhancement structure, which comprises the following steps:

acquiring cloud data of a to-be-detected point;

Inputting the cloud data of the to-be-detected point into a point cloud detection model trained in advance, determining point attention weights corresponding to the cloud data of the to-be-detected point through a pseudo-image conversion module in the point cloud detection model, determining weighted dense tensors according to the point attention weights, mapping the weighted dense tensors into basic pseudo-images, carrying out channel attention weighting on the basic pseudo-images to obtain channel attention weighting results, and carrying out space attention weighting on the channel attention weighting results to obtain target pseudo-images;

Inputting the target pseudo image into a feature extraction module for feature enhancement in the point cloud detection model to obtain point cloud features corresponding to the point cloud data to be detected, so that the point cloud detection model carries out target detection on the point cloud data to be detected according to the point cloud features.

Optionally, determining, by a pseudo image conversion module in the point cloud detection model, a point attention weight corresponding to the cloud data to be detected, and determining a weighted dense tensor according to the point attention weight, including:

discretizing the region where the cloud data of the to-be-detected points are located into grids uniformly spaced in an x-y plane to obtain a series of struts with infinite space range in the z direction;

Determining relevant information corresponding to each point in the cloud data of the point to be detected, wherein the relevant information comprises original three-dimensional coordinates of the point in the cloud, laser reflectivity, center coordinates of a pillar where the point is positioned and distances deviating from the center of the pillar on x and y axes;

Converting the cloud data to be detected into dense tensors S with the sizes of (D, P, N) according to the relevant information corresponding to each point in the cloud data to be detected, wherein D is the number of dimensions of the relevant information, P is the number of struts, and N is the number of points in each strut;

Global pooling is carried out on the dense tensor S in the D and P dimensions, and a pooling result with the size of (1, N) is obtained;

Inputting the pooling result into a two-layer convolution network, and determining point attention weights, wherein the point attention weights represent attention weights corresponding to different points in one strut;

And weighting the dense tensor S according to the point attention weight to obtain the weighted dense tensor.

Optionally, mapping the weighted dense tensor into a base pseudo image, and performing channel attention weighting on the base pseudo image to obtain a channel attention weighting result, which specifically includes:

Inputting the weighted dense tensor into a linear layer, mapping the relevant information corresponding to each point in the weighted dense tensor into a high-dimensional feature to obtain a mapping result with the size of (C, P, N), wherein C is the number of feature dimensions of the mapping;

Taking the maximum value on the N channel of the mapping result to obtain tensors with the sizes of (C, P), and dispersing the tensors according to the positions of the struts where each point is located to obtain a basic pseudo image with the sizes of (C, H, W), wherein H and W are the length and the width of the pseudo image;

Changing the size of the basic pseudo image into (C, H multiplied by W) to obtain a change result, transposing the change result to obtain a transposed result with the size of (H multiplied by W, C), performing matrix multiplication on the change result and the transposed result to obtain tensors with the size of (C, C), and inputting the tensors with the size of (C, C) into an activation function Softmax layer to obtain channel attention weights, wherein the channel attention weights are used for representing weights corresponding to different dimensions of a high-dimensional feature;

and carrying out channel attention weighting on the basic pseudo image according to the channel attention weight to obtain a channel attention weighting result.

Optionally, spatial attention weighting is performed on the channel attention weighting result to obtain a target pseudo image, which specifically includes:

Inputting the channel attention weighted result into three cavity convolution layers with cavity rates of 1, 3 and 5 respectively to obtain three cavity convolution results;

splicing the three cavity convolution results to obtain a splicing result, and inputting the splicing result into a compression network to obtain a compression result, wherein the compression network comprises a convolution layer with the convolution kernel size of (1, 1), a BN layer and an activation function Relu layer;

Inputting the compression result into an activation function Softmax layer to obtain a spatial attention weight;

And carrying out spatial attention weighting on the channel attention weighting result according to the spatial attention weight to obtain a target pseudo image.

Optionally, inputting the target pseudo image into a feature extraction module for feature enhancement in the point cloud detection model to obtain point cloud features corresponding to the point cloud data to be detected, which specifically comprises;

Inputting the target pseudo image into a first sub-module of a feature extraction module for feature enhancement in the point cloud detection model to obtain a plurality of feature images, wherein the feature images are determined by a plurality of sequentially ordered convolution modules in the first sub-module in sequence, and the feature images which are arranged behind the feature images are smaller;

inputting the feature images into a second sub-module in the feature extraction module, and fusing a convolution result obtained by convolving the feature image with a deconvolution result corresponding to a later feature image of the feature image by the second sub-module for each feature image to obtain a fusion result;

And inputting each fusion result into a third sub-module in the characteristic extraction module, so that the sizes of the fusion results are unified through the third sub-module, and the point cloud characteristics corresponding to the cloud data of the to-be-detected points are obtained.

Optionally, training the point cloud detection model specifically includes:

Acquiring point cloud samples and marking information corresponding to the point cloud samples;

Inputting the point cloud sample into a point cloud detection model to be trained, determining a point attention weight corresponding to the point cloud sample through a pseudo-image conversion module in the point cloud detection model, determining a dense tensor after attention weighting according to the point attention weight, mapping the dense tensor after attention weighting into a two-dimensional basic pseudo-image, and carrying out channel attention weighting and space attention weighting on the basic pseudo-image to obtain a target pseudo-image corresponding to the point cloud sample;

Inputting the target pseudo image corresponding to the point cloud sample into a feature extraction module for feature enhancement in the point cloud detection model to obtain point cloud features corresponding to the point cloud sample, and determining a positioning result, an angle positioning result and a classification result of a target object in the point cloud sample according to the point cloud features corresponding to the point cloud sample;

And determining positioning loss, angle positioning loss and classification loss according to the labeling information corresponding to the point cloud sample, the positioning result, the angle positioning result and the classification result of the target object in the point cloud sample, and training the point cloud detection model by taking the positioning loss, the angle positioning loss and the classification loss as optimization targets.

The present specification provides a point cloud object detection apparatus based on an attention mechanism and a feature enhancement structure, including:

The acquisition module is used for acquiring cloud data of a to-be-detected point;

The pseudo image conversion module is used for inputting the cloud data of the to-be-detected point into a point cloud detection model trained in advance, determining point attention weights corresponding to the cloud data of the to-be-detected point through the pseudo image conversion module in the point cloud detection model, determining weighted dense tensors according to the point attention weights, mapping the weighted dense tensors into basic pseudo images, carrying out channel attention weighting on the basic pseudo images to obtain channel attention weighting results, and carrying out space attention weighting on the channel attention weighting results to obtain target pseudo images;

The feature extraction module is used for inputting the target pseudo image into the feature extraction module for feature enhancement in the point cloud detection model to obtain point cloud features corresponding to the point cloud data to be detected, so that the point cloud detection model carries out target detection on the point cloud data to be detected according to the point cloud features.

Optionally, the feature extraction module is specifically configured to input the target pseudo image into a first sub-module of the feature extraction module for feature enhancement in the point cloud detection model, so as to obtain a plurality of feature graphs, where the plurality of feature graphs are determined by sequentially passing through a plurality of sequentially ordered convolution modules in the first sub-module, and the feature graphs that are more arranged behind the feature graphs are smaller; inputting the feature images into a second sub-module in the feature extraction module, and fusing a convolution result obtained by convolving the feature image with a deconvolution result corresponding to a later feature image of the feature image by the second sub-module for each feature image to obtain a fusion result; and inputting each fusion result into a third sub-module in the characteristic extraction module, so that the sizes of the fusion results are unified through the third sub-module, and the point cloud characteristics corresponding to the cloud data of the to-be-detected points are obtained.

The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above described point cloud target detection method based on an attention mechanism and a feature enhancement structure.

The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-described point cloud target detection method based on an attention mechanism and a feature enhancement structure when executing the program.

The above-mentioned at least one technical scheme that this specification adopted can reach following beneficial effect:

According to the point cloud target detection method based on the attention mechanism and the feature enhancement structure, the cloud data to be detected can be acquired, the cloud data to be detected is input into a point cloud detection model trained in advance, the point attention weight corresponding to the cloud data to be detected is determined through a pseudo image conversion module in the point cloud detection model, the weighted dense tensor is determined according to the point attention weight, the weighted dense tensor is mapped into a basic pseudo image, the basic pseudo image is subjected to channel attention weighting, a channel attention weighting result is obtained, and the channel attention weighting result is subjected to spatial attention weighting, so that the target pseudo image is obtained. And inputting the target pseudo image into a feature extraction module for feature enhancement in a point cloud detection model to obtain point cloud features corresponding to the point cloud data to be detected, so that the point cloud detection model carries out target detection on the point cloud data to be detected according to the point cloud features.

From the above, it can be seen that, in the method, by using multiple attention mechanisms to obtain the target pseudo-image, information loss after converting the point cloud data into the pseudo-image is reduced, and feature enhancement can be achieved in a certain way when feature extraction is performed on the pseudo-image, so that context information is effectively utilized, and accuracy of point cloud detection is improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:

fig. 1 is a schematic flow chart of a point cloud object detection method based on an attention mechanism and a feature enhancement structure provided in the present specification;

Fig. 2 is a schematic structural diagram of a point cloud detection model provided in the present specification;

FIG. 3 is a schematic flow diagram of a feature enhancement provided herein;

fig. 4 is a schematic diagram of a point cloud object detection device based on an attention mechanism and a feature enhancement structure provided in the present specification;

Fig. 5 is a schematic view of the electronic device corresponding to fig. 1 provided in the present specification.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.

The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of a point cloud object detection method based on an attention mechanism and a feature enhancement structure provided in the present specification, specifically including the following steps:

s100: and acquiring cloud data of a to-be-detected point.

S102: inputting the cloud data of the to-be-detected point into a pre-trained point cloud detection model, determining point attention weights corresponding to the cloud data of the to-be-detected point through a pseudo-image conversion module in the point cloud detection model, determining weighted dense tensors according to the point attention weights, mapping the weighted dense tensors into basic pseudo-images, carrying out channel attention weighting on the basic pseudo-images to obtain channel attention weighting results, and carrying out space attention weighting on the channel attention weighting results to obtain target pseudo-images.

S104: inputting the target pseudo image into a feature extraction module for feature enhancement in the point cloud detection model to obtain point cloud features corresponding to the point cloud data to be detected, so that the point cloud detection model carries out target detection on the point cloud data to be detected according to the point cloud features.

In the present specification, a method for performing object detection on a point cloud is provided, in which a module for converting the point cloud into a pseudo image and a module for performing feature extraction on the pseudo image are mainly improved in a point cloud detection model, and an explanation for training the point cloud detection model is provided.

Firstly, the step of performing target detection on point cloud data for a point cloud detection model is described:

The server can acquire cloud data of the to-be-detected points, input the cloud data of the to-be-detected points into a point cloud detection model trained in advance, determine point attention weights corresponding to the cloud data of the to-be-detected points through a pseudo-image conversion module in the point cloud detection model, determine weighted dense tensors according to the point attention weights, map the weighted dense tensors into basic pseudo-images, carry out channel attention weighting on the basic pseudo-images to obtain channel attention weighting results, and carry out space attention weighting on the channel attention weighting results to obtain target pseudo-images. And then, inputting the target pseudo image into a feature extraction module for feature enhancement in the point cloud detection model to obtain the point cloud features corresponding to the cloud data to be detected, so that the point cloud detection model carries out target detection on the cloud data to be detected according to the point cloud features.

It can be seen that two more important modules in the point cloud detection model are a pseudo-image conversion module and a feature extraction module, wherein the pseudo-image conversion module is mainly used for obtaining a pseudo-image with less information loss by determining various attention weights, and the feature extraction module is mainly used for enhancing the extracted image features when extracting the features of the pseudo-image so as to consider the context information of the pseudo-image. The pseudo image conversion module mainly comprises a point attention module, a channel attention module and a space attention module, and the characteristic extraction module mainly comprises three sub-networks, as shown in fig. 2.

Fig. 2 is a schematic structural diagram of a point cloud detection model provided in the present specification.

In the point attention module, the cloud data to be detected may be divided into pillars (Pillars) so as to obtain a pseudo image, that is, the area where the cloud data to be detected is located may be discretized into a grid with uniform intervals in the x-y plane, to obtain a series of Pillars with infinite space range in the z direction, and then, relevant information corresponding to each point in the cloud data to be detected is determined, where the relevant information includes original three-dimensional coordinates (x, y, z) of the point in the point cloud, the laser reflectivity r, central coordinates (xc, yc, zc) of the pillars where the laser reflectivity r is located, and distances (xp and yp) deviating from the centers of the pillars in the x and y axes.

Then, according to the relevant information corresponding to each point in the cloud data of the to-be-detected point, the cloud data of the to-be-detected point can be converted into a dense tensor S with the size of (D, P, N), wherein D is the number of dimensions of the relevant information, D= 9,P is the number of struts, N is the number of points in each strut, and then global pooling can be carried out on the dense tensor S in the D and P dimensions to obtain a pooling result with the size of (1, N). The pooling result is input into the two-layer convolution network, so that point attention weights can be determined, that is, the point attention weights represent attention weights corresponding to different points in one strut, and the point attention weights corresponding to different struts are consistent, and the dense tensor S is weighted through the determined point attention weights, so that the weighted dense tensor can be obtained.

The point attention weights and the weighted dense tensors can be expressed by the following formula:

Wherein the point attention weight is The weighted dense tensor is/>Global represents Global average pooling, and the mean is calculated in the D and P dimensions of the S tensor, resulting in tensors of size (1, N); w1 and W2 represent two 2D convolutions with convolution kernel size (1, 1) and step size 1; delta represents Relu activation functions; sigma represents a Sigmoid function; /(I)Representation will/>Is expanded to be the same size as the dense tensor S.

Then, for the channel attention module, the weighted dense tensor may be input into the linear layer, so that the relevant information corresponding to each point in the weighted dense tensor is mapped into a high-dimensional feature, to obtain a mapping result with a size of (C, P, N), where C is the number of feature dimensions of the mapping (herein, the channel attention weight refers to the attention weight of different feature dimensions in the dimension of C).

And then, taking the maximum value on the N channel of the mapping result to obtain tensors with the sizes of (C, P), and dispersing the tensors according to the positions of the struts where each point is located to obtain a basic pseudo image with the sizes of (C, H, W), wherein H and W are the length and the width of the pseudo image.

Changing the size of the basic pseudo image into (C, H multiplied by W) to obtain a change result, transposing the change result to obtain a transposed result with the size of (H multiplied by W, C), performing matrix multiplication on the change result and the transposed result to obtain a tensor with the size of (C, C), and inputting the tensor with the size of (C, C) into an activation function Softmax layer to obtain the channel attention weight. Wherein matrix multiplication is used to determine the channel attention weight in this dimension of C, softmax is used to normalize the calculated weights. And then, carrying out channel attention weighting on the basic pseudo image according to the determined channel attention weight to obtain a channel attention weighting result.

The formulas for determining and weighting channel attention weights may be as follows:

Wherein g represents the whole operation of mapping the discrete point cloud into a pseudo image, and the whole operation comprises the operation of taking the maximum value on a linear layer and an N channel; representing a pseudo image of size (C, H, W), H and W representing the height and width of the feature; /(I) Representing the determined channel attention weight, beta being the scale parameter,/>The result is weighted for channel attention.

It should be noted that, the determined channel attention weight is a tensor with a size (C, C), and thus the weighting process is: at the position ofAnd shape-changing/>(I.e., the above-mentioned change results) to generate a matrix multiplication of the size (C, H)Tensor of W). The shape of the tensor is changed again to change its size to (C, H, W). Multiplying the tensor after changing shape by the scale parameter β and by the base pseudo-image/>Element-wise summation to generate a channel attention weighting based pseudo-image/>。

Aiming at the spatial attention module, the channel attention weighting result can be input into three cavity convolution layers with the cavity rates of 1, 3 and 5 respectively to obtain three cavity convolution results, then the three cavity convolution results can be spliced to obtain a splicing result (the splicing result is (3C, H and W)), the splicing result is input into a compression network to obtain a compression result, the compression network comprises a convolution layer with the convolution kernel size of (1, 1), a BN layer and an activation function Relu layer, finally the compression result is input into an activation function Softmax layer to obtain spatial attention weight, and the channel attention weighting result is subjected to spatial attention weighting according to the spatial attention weight to obtain a target pseudo image.

Since the above-mentioned three hole convolution results are spliced to obtain tensors with the size (3C, H, W), the compression network is used for compressing the tensors with the size (3C, H, W) into tensors with the size (C, H, W), and the compression network is also used for calculating the spatial attention weights, wherein the spatial attention weights represent weights on the (H, W) distribution, that is, the weights on different pixels are determined if the target pseudo-image is regarded as a similar image.

The specific way of carrying out the spatial attention weighting is as follows: weighting spatial attentionAnd/>Multiplying by element, and multiplying the result by scale parameter/>Finally and/>Element-by-element summation to generate a target pseudo-image/>, based on spatial attention weighting。

Wherein,Representing a cavity convolution structure, d representing an expansion rate; /(I)Is a scale parameter; /(I)Representing a cavity convolution result; /(I)A compressed network is represented that includes a convolutional layer, a BN layer, and an activation function Relu.

After the target pseudo image is obtained, feature extraction can be performed on the target pseudo image, and the feature extraction module is mainly used for improving the extraction capability of the context information by compensating the deconvolution result of the smaller feature image into the convolution result of the larger feature image, wherein the process is shown in fig. 3.

FIG. 3 is a schematic flow chart of a feature enhancement provided in the present specification.

Specifically, the target pseudo image may be input into a first sub-module of a feature extraction module (feature enhancement module) for feature enhancement in the point cloud detection model, so as to obtain a plurality of feature images, where the plurality of feature images are determined by sequentially passing through a plurality of sequentially ordered convolution modules in the first sub-module, and the more ordered feature images in the plurality of feature images are smaller.

For example, three feature maps F11, F21, and F31 may be generated, the dimensions being (C, H/2, W/2), (2C, H/4, W/4), and (4C, H/8,W/8), respectively. The first sub-module includes a series of blocks (S, L, F). S represents the Block step size (determined from the input target pseudo-image), L represents the number of 3 x3 2D convolutional layers in a convolutional module, each followed by BN layers and an activation function Relu, and F represents the output channel width.

Wherein,、/>And/>Each representing a convolution module (Block) in the first subnetwork in the feature enhancement module (i.e., the first sub-module described above).

And then, the feature images can be input into a second sub-module in the feature extraction module, so that a convolution result obtained by convolving the feature image and a deconvolution result corresponding to the latter feature image are fused by the second sub-module for each feature image to obtain a fusion result.

The smaller the size of the subsequent feature map, the larger the receptive field obtained by convolution, because the purpose of fusing the convolution result of one feature map with the deconvolution result of the feature map smaller than this feature map is to strengthen the large feature map by the small feature map, but the smallest feature map does not need to be fused with the latter feature map, and therefore, the convolution result may be directly obtained for the smallest feature map as its corresponding fusion result.

Continuing with the example of generating three feature maps, the three feature maps F11, F21, and F31 output above may be transmitted to a second sub-network of the feature enhancement module (i.e., the second sub-module) to generate feature maps F12, F22, and F32 with dimensions (C, H/2, w/2), (2C, H/4, w/4), and (4C, H/8,W/8) corresponding to the feature maps F11, F21, and F31, respectively. The feature map F22 is obtained by element-wise addition of the feature map F31 after the deconvolution layer and the feature map F21 after the convolution layer, and the feature map F12 is obtained by element-wise addition of the feature map F21 after the deconvolution layer and the feature map F11 after the convolution layer. The feature map F32 is obtained from the feature map F31 after passing through the convolution layer. All of the above described convolution and deconvolution layers are followed by BN layers and activation functions Relu.

Wherein,And/>Representing Block consisting of deconvolution layer, BN and Relu,/>A Block consisting of convolutional layers, BN and Relu is shown.

And finally, inputting each fusion result into a third sub-module in the characteristic extraction module, so that the sizes of the fusion results are unified and spliced through the third sub-module, and the point cloud characteristics corresponding to the cloud data of the to-be-detected point are obtained.

Specifically, the three output feature maps F12, F22 and F32 may be transferred into the third sub-network of the feature enhancement module, so as to generate three feature maps F13, F23 and F33 with the same size, which are (2 c, h/2, w/2). The feature maps F13, F23, and F33 are obtained by deconvolution of feature maps F12, F22, and F32, respectively, all three deconvolution layers followed by BN layer and activation function Relu.

Wherein,A Block consisting of deconvolution layers, BN and Relu is shown.

And splicing the three feature maps F13, F23 and F33 with the same size together on a Channel to form a final output point cloud feature for subsequent target positioning and classification.

The foregoing is mainly a description of a forward propagation process of the point cloud detection model, and the foregoing process is adopted when the point cloud detection model is trained, whether the point cloud detection model is used for target detection or the point cloud detection model is trained.

The point cloud detection model performs target detection according to the extracted point cloud features, and may output a positioning result, a classification result and an angle positioning result for the target object, where the positioning result indicates a position of the target object, the classification result indicates a classification of the target object, and the angle positioning result may indicate an orientation of the target object.

Therefore, during training, three losses of the positioning result, the classifying result and the angle positioning result need to be determined, and the point cloud detection model needs to be trained with minimized positioning loss, angle positioning loss and classifying loss as optimization targets, wherein the three losses are shown in the following formula:

Wherein, Representing the number of positive a priori boxes,/>Representing loss of positioning,/>Representing classification loss,/>Representing angular positioning loss,/>，/>，/>。

Specifically, in the training stage, a point cloud sample and labeling information corresponding to the point cloud sample can be obtained; and inputting the point cloud sample into a point cloud detection model to be trained, determining the point attention weight corresponding to the point cloud sample through a pseudo-image conversion module in the point cloud detection model, determining the dense tensor after attention weighting according to the point attention weight, mapping the dense tensor after attention weighting into a two-dimensional basic pseudo-image, and carrying out channel attention weighting and space attention weighting on the basic pseudo-image to obtain a target pseudo-image corresponding to the point cloud sample. Inputting the target pseudo image corresponding to the point cloud sample into a feature extraction module for feature enhancement in the point cloud detection model to obtain point cloud features corresponding to the point cloud sample, and determining a positioning result, an angle positioning result and a classification result of the target object in the point cloud sample according to the point cloud features corresponding to the point cloud sample. And determining positioning loss, angle positioning loss and classification loss according to the labeling information corresponding to the point cloud sample, the positioning result, the angle positioning result and the classification result of the target object in the point cloud sample, and training the point cloud detection model by taking the minimized positioning loss, the angle positioning loss and the classification loss as optimization targets.

Model reasoning is carried out on KITTI verification sets, model 3D detection results are output, and detection effect assessment is carried out on an assessment program. The evaluation index used was the average Precision (THE AVERAGE Precision, AP), and the evaluation target selected was the Car class. The specific results are shown in Table 1 (Attention-pillars and comparison of the results of other algorithms on the KITTI validation set).

Wherein Easy represents that the minimum bounding box height of the target participating in the evaluation is 40 pixels, no occlusion exists, and the maximum truncation is 15%; moderate denotes that the minimum bounding box height of the object involved in the evaluation is 25 pixels, the partial occlusion, and the maximum cutoff is 30%; hard means that the minimum bounding box height of the object involved in the evaluation is 25 pixels, the maximum occlusion is difficult to see, and the maximum cutoff is 50%. As can be seen from the observation of Table 1, the detection accuracy of the method provided by the application is improved by 1.30%, 1.68% and 7.06% respectively compared with that of the Baseline model PointPillars under the three evaluation modes Easy, moderate and Hard.

TABLE 1

Model reasoning is carried out on nuScenes verification sets, model 3D detection results are output, and detection effect assessment is carried out on an assessment program. The evaluation indexes are mAP and NuScenes Detection Score (NDS), and the total number of the selected evaluation target categories is 10. Specific results are shown in Table 2 (comparison of detection results of the Attention-pillars method and other methods on nuScenes verification set), and the observation of Table 2 shows that under the two evaluation indexes of mAP and NDS, the detection accuracy of the method disclosed by the application is respectively improved by 11.09% and 5.31% compared with that of a Baseline model PointPillars.

TABLE 2

For convenience of description, the execution subject for executing the method will be described as a server, and the execution subject of the method may be a computer, a controller, a server, or the like, which is not limited herein. The features of the following examples and embodiments may be combined with each other without any conflict.

The above is a point cloud object detection method based on the attention mechanism and the feature enhancement structure, and based on the same thought, the present disclosure further provides a point cloud object detection device based on the attention mechanism and the feature enhancement structure, as shown in fig. 4.

Fig. 4 is a schematic diagram of a point cloud object detection device based on an attention mechanism and a feature enhancement structure provided in the present specification, including:

An acquisition module 401, configured to acquire cloud data of a point to be detected;

The pseudo image conversion module 402 is configured to input the cloud data to be detected into a point cloud detection model trained in advance, determine a point attention weight corresponding to the cloud data to be detected through the pseudo image conversion module in the point cloud detection model, determine a weighted dense tensor according to the point attention weight, map the weighted dense tensor into a basic pseudo image, and perform channel attention weighting on the basic pseudo image to obtain a channel attention weighting result, and perform spatial attention weighting on the channel attention weighting result to obtain a target pseudo image;

The feature extraction module 403 is configured to input the target pseudo image to a feature extraction module for feature enhancement in the point cloud detection model, so as to obtain a point cloud feature corresponding to the point cloud data to be detected, so that the point cloud detection model performs target detection on the point cloud data to be detected according to the point cloud feature.

Optionally, the pseudo-image conversion module 402 is specifically configured to discretize an area where the cloud data of the to-be-detected point is located into a grid with uniform intervals in an x-y plane, so as to obtain a series of struts with infinite spatial range in the z direction; determining relevant information corresponding to each point in the cloud data of the point to be detected, wherein the relevant information comprises original three-dimensional coordinates of the point in the cloud, laser reflectivity, center coordinates of a pillar where the point is positioned and distances deviating from the center of the pillar on x and y axes; converting the cloud data to be detected into dense tensors S with the sizes of (D, P, N) according to the relevant information corresponding to each point in the cloud data to be detected, wherein D is the number of dimensions of the relevant information, P is the number of struts, and N is the number of points in each strut; global pooling is carried out on the dense tensor S in the D and P dimensions, and a pooling result with the size of (1, N) is obtained; inputting the pooling result into a two-layer convolution network, and determining point attention weights, wherein the point attention weights represent attention weights corresponding to different points in one strut; and weighting the dense tensor S according to the point attention weight to obtain the weighted dense tensor.

Optionally, the pseudo image conversion module 402 is specifically configured to input the weighted dense tensor into a linear layer, map relevant information corresponding to each point in the weighted dense tensor into a high-dimensional feature, and obtain a mapping result with a size of (C, P, N), where C is a mapped feature dimension number; taking the maximum value on the N channel of the mapping result to obtain tensors with the sizes of (C, P), and dispersing the tensors according to the positions of the struts where each point is located to obtain a basic pseudo image with the sizes of (C, H, W), wherein H and W are the length and the width of the pseudo image; changing the size of the basic pseudo image into (C, H multiplied by W) to obtain a change result, transposing the change result to obtain a transposed result with the size of (H multiplied by W, C), performing matrix multiplication on the change result and the transposed result to obtain tensors with the size of (C, C), and inputting the tensors with the size of (C, C) into an activation function Softmax layer to obtain channel attention weights, wherein the channel attention weights are used for representing weights corresponding to different dimensions of a high-dimensional feature; and carrying out channel attention weighting on the basic pseudo image according to the channel attention weight to obtain a channel attention weighting result.

Optionally, the pseudo image conversion module 402 is specifically configured to input the channel attention weighting result into three hole convolution layers with hole rates of 1,3, and 5, to obtain three hole convolution results; splicing the three cavity convolution results to obtain a splicing result, and inputting the splicing result into a compression network to obtain a compression result, wherein the compression network comprises a convolution layer with the convolution kernel size of (1, 1), a BN layer and an activation function Relu layer; inputting the compression result into an activation function Softmax layer to obtain a spatial attention weight; and carrying out spatial attention weighting on the channel attention weighting result according to the spatial attention weight to obtain a target pseudo image.

Optionally, the method comprises the step of. The feature extraction module 403 is specifically configured to input the target pseudo image into a first sub-module of the feature extraction module for feature enhancement in the point cloud detection model, to obtain a plurality of feature graphs, where the plurality of feature graphs are determined sequentially by a plurality of sequentially ordered convolution modules in the first sub-module, and the feature graphs that are more arranged behind the feature graphs are smaller; inputting the feature images into a second sub-module in the feature extraction module, and fusing a convolution result obtained by convolving the feature image with a deconvolution result corresponding to a later feature image of the feature image by the second sub-module for each feature image to obtain a fusion result; and inputting each fusion result into a third sub-module in the characteristic extraction module, so that the sizes of the fusion results are unified through the third sub-module, and the point cloud characteristics corresponding to the cloud data of the to-be-detected points are obtained.

Optionally, the apparatus further comprises:

The training module 404 is configured to obtain a point cloud sample and labeling information corresponding to the point cloud sample;

inputting the point cloud sample into a point cloud detection model to be trained, determining a point attention weight corresponding to the point cloud sample through a pseudo-image conversion module in the point cloud detection model, determining a dense tensor after attention weighting according to the point attention weight, mapping the dense tensor after attention weighting into a two-dimensional basic pseudo-image, and carrying out channel attention weighting and space attention weighting on the basic pseudo-image to obtain a target pseudo-image corresponding to the point cloud sample; inputting the target pseudo image corresponding to the point cloud sample into a feature extraction module for feature enhancement in the point cloud detection model to obtain point cloud features corresponding to the point cloud sample, and determining a positioning result, an angle positioning result and a classification result of a target object in the point cloud sample according to the point cloud features corresponding to the point cloud sample; and determining positioning loss, angle positioning loss and classification loss according to the labeling information corresponding to the point cloud sample, the positioning result, the angle positioning result and the classification result of the target object in the point cloud sample, and training the point cloud detection model by taking the positioning loss, the angle positioning loss and the classification loss as optimization targets.

The present specification also provides a computer readable storage medium storing a computer program operable to perform the above-described point cloud object detection method based on an attention mechanism and a feature enhancement structure.

The present specification also provides a schematic structural diagram of the electronic device shown in fig. 5. At the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, as illustrated in fig. 5, although other hardware required by other services may be included. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to realize the point cloud target detection method based on the attention mechanism and the characteristic enhancement structure.

Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.

In the 90 s of the 20 th century, improvements to one technology could clearly be distinguished as improvements in hardware (e.g., improvements to circuit structures such as diodes, transistors, switches, etc.) or software (improvements to the process flow). However, with the development of technology, many improvements of the current method flows can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain corresponding hardware circuit structures by programming improved method flows into hardware circuits. Therefore, an improvement of a method flow cannot be said to be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., field programmable gate array (Field Programmable GATE ARRAY, FPGA)) is an integrated circuit whose logic functions are determined by user programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate application-specific integrated circuit chips. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler (logic compiler)" software, which is similar to the software compiler used in program development and writing, and the original code before being compiled is also written in a specific programming language, which is called hardware description language (Hardware Description Language, HDL), but HDL is not just one, but a plurality of kinds, such as ABEL（Advanced Boolean Expression Language）、AHDL（Altera Hardware Description Language）、Confluence、CUPL（Cornell University Programming Language）、HDCal、JHDL（Java Hardware Description Language）、Lava、Lola、MyHDL、PALASM、RHDL（Ruby Hardware Description Language）, and VHDL (Very-High-SPEED INTEGRATED Circuit Hardware Description Language) and Verilog are currently most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.

The controller may be implemented in any suitable manner, for example, the controller may take the form of, for example, a microprocessor or processor and a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, application SPECIFIC INTEGRATED Circuits (ASICs), programmable logic controllers, and embedded microcontrollers, examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, atmel AT91SAM, microchip PIC18F26K20, and Silicone Labs C8051F320, the memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in a pure computer readable program code, it is well possible to implement the same functionality by logically programming the method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller may thus be regarded as a kind of hardware component, and means for performing various functions included therein may also be regarded as structures within the hardware component. Or even means for achieving the various functions may be regarded as either software modules implementing the methods or structures within hardware components.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.

It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims

1. The point cloud target detection method based on the attention mechanism and the characteristic enhancement structure is characterized by comprising the following steps of:

acquiring cloud data of a to-be-detected point;

2. The method of claim 1, wherein determining, by a pseudo image conversion module in the point cloud detection model, a point attention weight corresponding to the cloud data to be detected, and determining a weighted dense tensor according to the point attention weight, specifically includes:

3. The method of claim 1, wherein mapping the weighted dense tensor into a base pseudo-image, and performing channel attention weighting on the base pseudo-image to obtain a channel attention weighted result, specifically comprises:

4. The method according to claim 1, wherein spatially attention weighting the channel attention weighted result to obtain a target pseudo image, specifically comprising:

5. The method of claim 1, wherein inputting the target pseudo image into a feature extraction module for feature enhancement in the point cloud detection model to obtain a point cloud feature corresponding to the point cloud data to be detected, specifically comprising;

and inputting each fusion result into a third sub-module in the characteristic extraction module, so that the sizes of the fusion results are unified through the third sub-module, and the point cloud characteristics corresponding to the cloud data of the to-be-detected point are obtained.

6. The method of claim 1, wherein training the point cloud detection model specifically comprises:

7. A point cloud object detection apparatus based on an attention mechanism and a feature enhancement structure, comprising:

8. The apparatus of claim 7, wherein the feature extraction module is specifically configured to input the target pseudo-image into a first sub-module of the feature extraction module for feature enhancement in the point cloud detection model, to obtain a plurality of feature maps, where the plurality of feature maps are determined sequentially by a plurality of sequentially ordered convolution modules in the first sub-module, and a feature map that is more ranked among the plurality of feature maps is smaller; inputting the feature images into a second sub-module in the feature extraction module, and fusing a convolution result obtained by convolving the feature image with a deconvolution result corresponding to a later feature image of the feature image by the second sub-module for each feature image to obtain a fusion result; and inputting each fusion result into a third sub-module in the characteristic extraction module, so that the sizes of the fusion results are unified through the third sub-module, and the point cloud characteristics corresponding to the cloud data of the to-be-detected points are obtained.

9. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-6.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-6 when executing the program.