CN113706480B - Point cloud 3D target detection method based on key point multi-scale feature fusion - Google Patents
- Publication number: CN113706480B (application CN202110928928.6A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/0002 — Image analysis; inspection of images, e.g. flaw detection
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
- G06T2207/10028 — Range image; depth image; 3D point clouds
Abstract
The invention belongs to the field of 3D target detection, and particularly relates to a point cloud 3D target detection method based on key point multi-scale feature fusion, which comprises the following steps: acquiring point cloud data to be detected at the current moment, and inputting the acquired point cloud data into a trained point cloud 3D target detection model to obtain a target detection result. The extraction algorithms for the distance sampling global feature and the feature sampling global feature are improved in the point cloud 3D target detection model, so that target detection efficiency and accuracy are improved. The invention adds a feature farthest point sampling sequence extraction module, which applies feature-based farthest point sampling to different voxel sparse convolution layers to obtain features at different scales and reduces the influence of background points on target detection.
Description
Technical Field
The invention belongs to the field of 3D target detection, and particularly relates to a point cloud 3D target detection method based on key point multi-scale feature fusion.
Background
With the rapid development of 3D scene acquisition technologies, 3D detectors such as 3D scanners, radar detectors, and depth cameras have become cheaper and more capable, which paves the way for the large-scale use of 3D detectors in the field of autonomous driving; laser radar (LiDAR) sensors in particular have come into wide public view. The large-scale data collected by a LiDAR sensor is referred to as a point cloud; such a data set typically includes the three-dimensional coordinates of surrounding objects located by the laser beams emitted by the LiDAR, together with the return intensity of each beam.
In recent years, two-dimensional (2D) object detection under camera systems has achieved extraordinary success, but object detection using images also has problems: image quality is limited by the weather, environment, and lighting conditions at acquisition time. LiDAR, by contrast, is insensitive to changes in weather, environment, and lighting; its beams easily penetrate rain, fog, and dust, and it works both day and night, even under glare and shadow.
Point cloud-based object detection methods have been extensively studied. The representative VoxelNet network eliminates the need for manual feature engineering of 3D point clouds, unifying feature extraction and target box prediction into a single-stage, end-to-end trainable deep network. The point cloud is divided into equally spaced 3D voxels, the group of points in each voxel is converted into a uniform feature representation by a voxel feature encoding layer, and this representation is then fed to a region proposal network to generate candidate boxes. SECOND introduces 3D sparse convolution on top of VoxelNet to avoid diffusing 3D convolution features through the empty voxels that arise when the point cloud voxelization interval is too small.
Another representative line of work, PointNet, proposes using a neural network to directly extract features from unordered points: it takes the raw point cloud as input and uses a multilayer perceptron to map low-dimensional features into a high-dimensional feature space, ensuring that the network is invariant to the ordering of points. F-PointNet first applied PointNet to three-dimensional target detection, cropping the point cloud based on two-dimensional image bounding boxes; 3DSSD selects key point samples from the point cloud by feature-distance farthest point sampling to classify and localize the target boxes respectively.
Although these methods have made remarkable progress, their detection accuracy is not high when applied to sparse point cloud target detection scenes. The main reasons are: (1) when the point cloud is sampled, information distinguishing foreground points from background points is lacking; (2) the correlation among features at different scales is ignored; (3) as a result, detection accuracy on severely occluded objects suffers.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a point cloud 3D target detection method based on key point multi-scale feature fusion, which comprises the following steps: acquiring point cloud data to be detected at the current moment, and inputting the acquired point cloud data into a trained point cloud 3D target detection model to obtain a target detection result;
the process of training the point cloud 3D target detection model comprises the following steps:
s1: acquiring original point cloud data, and selecting the original point cloud data by adopting a farthest distance sampling method to obtain a point cloud sequence;
s2: dividing original point cloud data into voxel blocks with equal intervals, and extracting initial features of the voxel blocks;
s3: inputting the initial characteristics of the point cloud sequence and the voxel block into a 3D sparse convolution neural network to obtain a voxel characteristic space; mapping the position information of the key points in the point cloud sequence to the voxel characteristic space of the corresponding position of each layer of sparse convolution, and updating the position information of the key points;
s4: extracting the characteristics of the key points in each layer's voxel characteristic space by adopting a distance farthest point sampling sequence extraction method to obtain the distance sampling local characteristics of the point cloud sequence;
s5: sampling distance sampling local features of the point cloud sequence by adopting a feature farthest point sampling method to obtain local feature key point features;
s6: fusing the distance sampling local features of each sparse convolution layer by adopting a fusion strategy to obtain a distance sampling global feature; fusing the local feature key point features of each sparse convolution layer by adopting a fusion strategy to obtain feature sampling global features;
s7: converting the voxel characteristic space into a 2D aerial view, and extracting dense characteristics of the aerial view by adopting a bilinear interpolation method; processing the dense features by adopting a regional feature extraction method to generate a 3D suggestion frame;
s8: performing sensitive area pooling on the distance sampling feature and the feature sampling feature according to the 3D suggestion frame to obtain a target detection result;
s9: and calculating a loss function of the model according to the obtained result, adjusting parameters of the model, and finishing the training of the model when the loss function is minimum.
And after the target detection result is obtained, updating the grid points under the distance key points and the feature key points according to the target detection result to obtain a regression target frame and a classification target frame for the next target detection.
Preferably, the process of distance feature sampling of the original point cloud data includes: randomly initializing a point in the original point cloud data, and acquiring distance key points from all the point cloud data by taking the point as the initial point and adopting a distance farthest point sampling method to obtain a point cloud sequence.
Further, the spatial distance between two points in the point cloud sequence is measured as:

D-Distance(X, Y) = Sqrt( Σ_d (x_d − y_d)² )

where D-Distance represents the L2 distance between the two points, X and Y represent the coordinates and reflection intensities of the two points, and Sqrt represents the non-negative square root function.
Preferably, the process of extracting the initial feature of the voxel block includes: equally dividing the input point cloud into voxel blocks with equal intervals, wherein the length, width and height of each voxel block are L, W and H respectively; and calculating the distance average value and the reflection intensity average value of each point in each voxel block, and taking the distance average value and the reflection intensity average value of each point as the initial characteristics of the voxel block.
Preferably, the process of acquiring the voxel feature space includes: allocating a buffer area in advance according to the number of divided voxel blocks; traversing the point cloud sequence, assigning each point to its associated voxel, and storing the voxel coordinates and the number of points in each voxel; building a hash table during the traversal iteration to check whether a point's voxel already contains points; if a voxel associated with the point exists, incrementing the number of points in that voxel by one, and otherwise creating a new voxel entry before moving to the next point; obtaining the actual number of voxels from the coordinates of all voxels and the number of points in each voxel; checking the obtained voxels and deleting all empty voxels to obtain dense voxels; and performing a convolution operation on the dense voxels using GEMM to obtain the voxel feature space.
Preferably, the process of obtaining the distance sampling local features of the point cloud sequence includes: obtaining the position information of the key points dp through distance sampling, mapping the position information of dp by index to the voxel feature space at the corresponding position of each sparse convolution layer so that each key point has one and only one corresponding voxel at each layer, and updating the position information of the key points according to the voxel features; abstracting each voxel into a point and extracting the voxel features with the PointNet++ sequence extraction method to obtain the features of the distance key points after sparse convolution; and fusing the sparse-convolved features with a local feature fusion strategy to obtain the distance sampling local features.
Preferably, the process of obtaining the local feature key point features includes: mapping the position information of the key points dp obtained by distance sampling to the voxel feature space at the corresponding position of each sparse convolution layer, so that each key point has one and only one corresponding voxel at each layer; using feature farthest point sampling to obtain a feature key point sequence fp of length q satisfying the constraint fp ⊆ dp; abstracting each voxel into a point and extracting the voxel features with the PointNet++ sequence extraction method to obtain the features of the feature key points after sparse convolution; and fusing the sparse-convolved features with the feature fusion formula to obtain the local feature key point features.
Preferably, the dense features are obtained from the bird's-eye view features using bilinear interpolation: the voxel feature space is projected along the Z axis onto a 2D bird's-eye view, and interpolation is performed with the neighboring voxel features. Assuming unit grid spacing, the operation formula is:

f(x, y) = f(Q11)(1 − x)(1 − y) + f(Q21)·x·(1 − y) + f(Q12)(1 − x)·y + f(Q22)·x·y

where f(x, y) represents the feature at the current interpolated coordinate, x represents the abscissa of the point, y represents the ordinate of the point, f(Q11) represents the feature at coordinate Q11, and Q11, Q21, Q12, Q22 respectively represent the neighboring voxel features.
Preferably, the fusion strategy comprises a feature key point fusion strategy and a feature splicing strategy of the feature key points;
the feature key point fusion strategy is as follows:
fp=fp conv1 ∪fp conv2 ∪fp conv3 ∪fp conv4 ∪fp bev
the characteristic splicing of the characteristic key points comprises the following steps:
wherein fp represents the union of the characteristic sampling points of each sparse convolution layer, fp conv1 Representing the feature key points, fp, sampled after the first layer of sparse convolution bev Representation of the acquisition of thickness from bird's-eye view features by bilinear interpolationThe feature key points sampled after the dense feature, ff represents the global feature of the feature key points,representing the local key point characteristics sampled after the first layer of sparse convolution,and the local key point characteristics sampled after dense characteristics are obtained from the bird's-eye view characteristics through bilinear interpolation are shown.
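As a concrete illustration, the channel-wise splicing of the per-layer key point features described above can be sketched as follows; the number of key points and the per-layer channel widths are illustrative assumptions, not values from the patent.

```python
import numpy as np

# Hypothetical per-keypoint features from four sparse convolution layers
# plus the bird's-eye-view branch; m keypoints, varying channel widths.
m = 5
layer_feats = [np.random.rand(m, c) for c in (16, 32, 64, 64, 128)]

# Channel-wise concatenation yields one multi-scale global feature per
# keypoint, mirroring ff = [f_conv1; f_conv2; f_conv3; f_conv4; f_bev].
fused = np.concatenate(layer_feats, axis=1)
# Each keypoint now carries 16 + 32 + 64 + 64 + 128 channels.
```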
Preferably, the process of pooling the distance sampling features and the feature sampling features over the region of interest includes: dividing the distance sampling global features and the feature sampling global features with the 3D suggestion frames, and generating 6 × 6 × 6 grid points at equal intervals in each 3D suggestion frame, each grid point denoted g_i; acquiring the features of the grid points from the key points by the sequence extraction operation; and obtaining a target frame regression result and a target frame classification prediction result from the features of the grid points.
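The equal-interval grid points inside a 3D suggestion frame can be sketched as below; the box is taken as axis-aligned for simplicity (the patent's frames also carry a direction angle θ), and the function name is an illustrative assumption.

```python
import numpy as np

def box_grid_points(center, size, n=6):
    """Return n^3 equally spaced grid points inside an axis-aligned 3D box.

    center: (3,) box center; size: (3,) box length, width, height.
    Points are placed at cell centers, matching equal-interval division.
    """
    center = np.asarray(center, dtype=float)
    size = np.asarray(size, dtype=float)
    # Cell-center offsets in [-0.5, 0.5), symmetric about the box center.
    offsets = (np.arange(n) + 0.5) / n - 0.5
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    return center + grid * size
```

Each 3D suggestion frame then contributes 216 grid points whose features are gathered from the surrounding key points.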
Preferably, the loss function of the model is: the loss function of the model comprises a suggested frame generation network loss function and a network target frame loss function;
the expression of the proposed box to generate the network loss function is:
wherein L is rpn Representing the suggestion box to generate a network loss function, L cls Represents the classification loss calculated by using the Focal loss, x, y, z respectively represent the three-dimensional coordinates of the target frame, l, h, w respectively represent the length, width and height of the target frame, theta represents the direction angle of the target frame,representing the Smooth-L1 loss calculation method,denotes the classified prediction residual, Δ r a Representing regression residuals;
the expression of the network target box loss function is:
wherein L is rcnn Representing the network target Box loss function, L iou The prediction and truth boxes are shown to calculate the loss using the Focal loss,representing the predicted target frame residual, Δ r p Representing the regression residual.
The invention has the advantages that:
1) The method adds a Feature-FPS sequence extraction module (a voxel set abstraction module) that applies feature-based farthest point sampling to different voxel sparse convolution layers to obtain features at different scales, reducing the influence of background points on target detection;
2) The invention designs a key point-based multi-scale feature fusion method, which is used for carrying out 3D target detection on a point cloud scene and is beneficial to detecting samples which are difficult to detect.
Drawings
FIG. 1 is a network flow diagram of the present invention;
FIG. 2 is a model overview framework diagram of the present invention;
FIG. 3 is a graph of the test results of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the field of point cloud target detection, a point cloud scene contains tens of thousands of points, and directly using all of them for model prediction and regression wastes enormous resources and time. Most target detection algorithms iteratively apply PointNet++ farthest point sampling (FPS) to generate key points, and use the adjacency between key points and surrounding points to generate feature vectors. However, points sampled by distance contain a large number of background points and lack useful foreground points. Key points containing background points may help target box classification but hurt target box regression. Therefore, the design of the point selection strategy is a key problem for improving target detection accuracy.
In the field of point cloud target detection, different models use different feature fusion strategies; the most common is to combine features from all the different convolution layers, features from different viewing angles, or features obtained from different modalities. These approaches lack a strong rationale, and adopting their fusion strategies incurs a large amount of computation.
A point cloud 3D target detection method based on key point multi-scale feature fusion comprises the following steps: and acquiring point cloud data to be detected at the current moment, and inputting the acquired point cloud data into a trained point cloud 3D target detection model to obtain a target detection result.
As shown in fig. 1, the process of training the point cloud 3D target detection model includes:
s1: acquiring original point cloud data, and selecting the original point cloud data by adopting a farthest distance sampling method to obtain a point cloud sequence;
s2: dividing original point cloud data into voxel blocks with equal intervals, and extracting initial features of the voxel blocks;
s3: inputting the initial characteristics of the point cloud sequence and the voxel block into a 3D sparse convolution neural network to obtain a voxel characteristic space; mapping the position information of the key points in the point cloud sequence to the voxel characteristic space of the corresponding position of each layer of sparse convolution, and updating the position information of the key points;
s4: extracting the characteristics of the key points in each layer's voxel characteristic space by adopting a distance farthest point sampling sequence extraction method to obtain the distance sampling local characteristics of the point cloud sequence;
s5: sampling distance sampling local features of the point cloud sequence by adopting a feature farthest point sampling method to obtain local feature key point features;
s6: fusing the distance sampling local features of each sparse convolution layer by adopting a fusion strategy to obtain distance sampling global features; fusing the local characteristic key point characteristics of each sparse convolution layer by adopting a fusion strategy to obtain a characteristic sampling global characteristic;
s7: converting the voxel characteristic space into a 2D aerial view, and extracting dense characteristics of the aerial view by adopting a bilinear interpolation method; processing the dense features by adopting a regional feature extraction method to generate a 3D suggestion frame;
s8: performing sensitive area pooling on the distance sampling feature and the feature sampling feature according to the 3D suggestion frame to obtain a target detection result;
s9: and calculating a loss function of the model according to the obtained result, adjusting parameters of the model, and finishing the training of the model when the loss function is minimum.
The data set employed by the present invention is a KITTI data set, which is a computer vision algorithm assessment data set widely used in the field of autonomous driving. The data set contains a plurality of tasks such as 3D object detection and multi-object tracking and segmentation. The 3D object detection reference consists of 7481 training images and 7518 test images and corresponding point clouds. The training samples are roughly divided into a training set (3712 samples) and a validation set (3769 samples).
One pass of the gradient descent algorithm over all the training data is called an epoch; the model parameters are updated every epoch, and the maximum number of epochs is set to 80. Over the 80 training epochs, the model and parameters that achieve the smallest error on the test data set are saved.
The model structure comprises an original point cloud data acquisition module, a feature farthest point sampling sequence extraction module, a farthest point sampling sequence extraction module, a 3D voxelization module, a 3D sparse convolution module, a bird's-eye view projection and suggestion frame generation module, and a sensitive-region pooling module; the connection of the various modules is shown in figure 2.
In the process of sampling key points from the input point cloud P by the distance feature sampling method, a point cloud sequence is selected from the point cloud using distance farthest point sampling (D-FPS). The spatial distance is measured as:

D-Distance(X_i, Y_j) = Sqrt( Σ_d (x_d − y_d)² )

wherein D-Distance represents the L2 distance between two points, X_i and Y_j represent the coordinates and reflection intensity of two different points of the point cloud P, Sqrt represents the mathematical square root, and i, j are the indices of points in P.

After the D-FPS calculation, a distance key point sequence dp = {p_1, p_2, p_3, …, p_p} of length p is obtained.
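A minimal sketch of the D-FPS procedure described above, operating on xyz coordinates only; the function and parameter names are illustrative, not from the patent.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, seed=0):
    """Iteratively pick the point farthest (L2) from the already-chosen set.

    points: (N, 3) array of xyz coordinates.
    Returns the indices of the sampled key points.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = rng.integers(n)  # random initial point
    # Distance from every point to the nearest already-selected point.
    min_dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, num_samples):
        selected[i] = int(np.argmax(min_dist))  # farthest remaining point
        new_dist = np.linalg.norm(points - points[selected[i]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected
```

Because each step maximizes the distance to the chosen set, near-duplicate points are rarely selected together, which spreads the key points over the whole scene.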
The input point cloud P is equally divided into equally spaced voxel blocks of size L × W × H, where L, W and H respectively represent the length, width and height of a voxel block. The average coordinates and average reflection intensity of the points falling into each voxel block are used as the initial feature of that voxel block:

f_init = [ X̄, Ȳ, Z̄, R̄ ]ᵀ

wherein [X, Y, Z] represents the three-dimensional coordinates of the point cloud falling within the voxel, R represents the sum of the reflection intensities of the point cloud falling within the voxel, X̄, Ȳ, Z̄ represent the average three-dimensional coordinates of the point cloud falling within the voxel, R̄ represents the average reflection intensity of the point cloud falling within the voxel, and ᵀ represents the transpose.
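The per-voxel initial feature (mean coordinates plus mean reflection intensity) can be sketched as follows; a dictionary keyed by voxel grid coordinate stands in for the hash table, and all names are illustrative assumptions.

```python
import numpy as np

def voxel_initial_features(points, voxel_size):
    """points: (N, 4) array of [x, y, z, reflectance].

    Assigns each point to a voxel grid cell and returns, for each
    occupied voxel, the mean coordinates and mean reflectance as the
    initial feature [x_mean, y_mean, z_mean, r_mean].
    """
    coords = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    buckets = {}
    for c, p in zip(map(tuple, coords), points):
        buckets.setdefault(c, []).append(p)  # hash-table-like grouping
    # Empty voxels never appear in the dict, so only dense voxels remain.
    return {c: np.mean(np.stack(ps), axis=0) for c, ps in buckets.items()}
```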
The process of extracting point cloud features from the voxelized point cloud with a 3D sparse convolutional neural network comprises: pre-allocating a buffer according to the voxel number limit; traversing the point cloud, assigning each point to its associated voxel, and storing the voxel coordinates and the number of points in each voxel. A hash table is built during this iteration to check whether points already exist in a voxel; if a voxel associated with a point exists, the number of points in that voxel is incremented by one. The coordinates of all voxels and the number of points in each voxel finally give the actual voxel count. Because of the sparsity of the point cloud, empty voxels are unavoidable, so the sparse voxels are aggregated into dense voxel features by deleting the empty voxels. A convolution operation is then performed on the dense voxels using GEMM to obtain dense output features, and the dense output features are mapped to sparse output features through the constructed input-output index rule matrix.
The position information of the key points obtained by distance sampling is indexed into the voxel feature space at the corresponding position of each sparse convolution layer, to ensure that each key point has one and only one corresponding voxel at each layer. The key point features are updated based on the voxel features. Each voxel is regarded as a point, and the sequence extraction (set abstraction) method proposed by PointNet++ is applied to aggregate the voxel-wise features.
The process of sampling the key points in each layer's voxel feature space by the distance farthest point sampling method to obtain the distance sampling local features of the point cloud sequence comprises: the key points dp = {p_1, p_2, p_3, …, p_p} obtained by distance sampling are mapped by index to the voxel feature space at the corresponding position of each sparse convolution layer, so that each key point has one and only one corresponding voxel at each layer; the key point features are updated according to the voxel features; each voxel is abstracted into a point, and the sequence extraction method proposed by PointNet++ is used to extract the voxel features, giving the features of the key points after sparse convolution; the sparse-convolved features of the distance key points are fused by a local feature fusion strategy to obtain the distance sampling local features. The local feature fusion strategy gathers, for each key point, the neighboring voxel features within a fixed radius:

S_i^(dk) = { [ f_j^(k) ; v_j − p_i ]ᵀ : ‖v_j − p_i‖ ≤ r_k }

wherein S_i^(dk) collects the features of the i-th distance key point at the k-th layer, f_j^(k) represents the feature of the j-th voxel of the k-th layer after 3D sparse convolution, v_j − p_i represents the offset of the voxel relative to the mapped distance key point in the k-th layer voxel space, and r_k represents the fixed radius of feature extraction.
PointNet is used to generate the features of the distance key points after sparse convolution:

f_i^(pk) = max{ G( M( S_i^(dk) ) ) }

wherein f_i^(pk) represents the sequence-extracted feature of the i-th distance key point at the k-th layer after multi-layer sparse convolution, M(·) represents random sampling of a fixed number of distance key point neighbor features, G(·) represents feature encoding with a multilayer perceptron over the k-th layer (l_k), and max(·) represents the max pooling function.
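The radius-based neighbor gathering, shared-MLP encoding G, and max pooling above can be sketched for a single key point as follows; the one-layer linear-plus-ReLU encoder is a stand-in for the multilayer perceptron, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def set_abstraction(keypoint, voxel_centers, voxel_feats, radius, weight):
    """Aggregate voxel features around one keypoint (PointNet++-style).

    keypoint: (3,) position; voxel_centers: (N, 3); voxel_feats: (N, C).
    Neighbors within `radius` are encoded by a shared linear layer + ReLU
    (standing in for the multilayer perceptron G) and then max-pooled.
    """
    dist = np.linalg.norm(voxel_centers - keypoint, axis=1)
    neighbors = voxel_feats[dist <= radius]
    if neighbors.size == 0:
        return np.zeros(weight.shape[1])       # no voxels in the ball
    encoded = np.maximum(neighbors @ weight, 0.0)  # shared MLP layer + ReLU
    return encoded.max(axis=0)                 # max pooling over the set
```

Max pooling makes the aggregated feature independent of the number and order of neighboring voxels.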
On the basis of obtaining the distance key points and their features, the feature key points are obtained using feature farthest point sampling. Specifically, a distance key point is randomly initialized, and starting from it a point cloud sequence is iteratively selected from the distance key points using feature farthest point sampling (F-FPS).
The spatial feature distance is measured as:

F-Distance(X, Y) = Sqrt( Σ_c (x_c − y_c)² )

where F-Distance represents the L2 feature distance between two feature-sampled key points, and X and Y represent features of different distance key points at the different scales extracted by the sparse convolution sequence extraction. Feature farthest point sampling (F-FPS) yields a feature key point sequence fp = {p_1, p_2, p_3, …, p_q} of length q.
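F-FPS is the same iterative farthest-point procedure as D-FPS, but with distances measured between feature vectors rather than spatial coordinates; a sketch under that reading, with illustrative names:

```python
import numpy as np

def feature_fps(features, num_samples, start=0):
    """Farthest point sampling in feature space (F-FPS).

    features: (N, C) per-keypoint feature vectors; distances are L2
    distances between feature vectors rather than xyz coordinates.
    Returns indices into the input (i.e., into the dp sequence).
    """
    selected = [start]
    min_dist = np.linalg.norm(features - features[start], axis=1)
    while len(selected) < num_samples:
        nxt = int(np.argmax(min_dist))  # most feature-dissimilar point
        selected.append(nxt)
        min_dist = np.minimum(
            min_dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected
```

Because F-FPS selects among the distance key points dp, the returned indices index into dp, so the constraint fp ⊆ dp holds by construction.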
The process of obtaining the feature sampling local features from the feature key points comprises: the key points dp = {p_1, p_2, p_3, …, p_p} obtained by distance sampling are mapped by index to the voxel feature space at the corresponding position of each sparse convolution layer, so that each key point has one and only one corresponding voxel at each layer; feature farthest point sampling (F-FPS) yields a feature key point sequence fp = {p_1, p_2, p_3, …, p_q} of length q satisfying the constraint fp ⊆ dp, i.e., the feature key points are a subset of the distance key points; each voxel is abstracted into a point, and the PointNet++ sequence extraction method is used to extract the voxel features, giving the features of the feature key points after sparse convolution; the sparse-convolved features are fused by the feature fusion formula to obtain the local feature key point features. The feature fusion formula is:
S_i^(fk) = { [ f_j^(k) ; v_j − p_i ]ᵀ : ‖v_j − p_i‖ ≤ r_k }

wherein S_i^(fk) collects the features of the i-th feature key point at the k-th layer, f_j^(k) represents the feature of the j-th voxel of the k-th layer after 3D sparse convolution, v_j − p_i represents the offset of the voxel relative to the mapped key point in the k-th layer voxel space, and r_k represents the fixed radius of feature extraction; PointNet is then used to generate the features of the feature key points after sparse convolution.
The features of the feature keypoint sequence after multi-layer sparse convolution are extracted by:

where the terms denote, respectively, the set-extracted feature of the i-th feature keypoint at layer k after multi-layer sparse convolution and the features of a fixed number of randomly sampled feature keypoints; G(·) denotes feature encoding with a multi-layer perceptron, and max(·) denotes the max pooling function.
The 3D voxel features after multi-layer sparse convolution are projected along the Z axis into a 2D bird's-eye view. Dense features are obtained from the bird's-eye-view features using bilinear interpolation, and 3D proposal boxes are generated with the SECOND region feature extraction method. The bilinear interpolation is:
f(x, y) = [ f(Q11)(x2 − x)(y2 − y) + f(Q21)(x − x1)(y2 − y) + f(Q12)(x2 − x)(y − y1) + f(Q22)(x − x1)(y − y1) ] / [ (x2 − x1)(y2 − y1) ]

where f(x, y) denotes the feature at the current interpolation coordinate, x denotes the abscissa of the point, y denotes the ordinate of the point, f(Q11) denotes the feature at coordinate Q11, and Q11, Q21, Q12, Q22 denote the four neighbouring voxels, with Qij located at (xi, yj).
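The interpolation can be written directly from the four corner features; a small sketch (argument order and names are illustrative):

```python
def bilinear_interp(fq11, fq21, fq12, fq22, x, y, x1, x2, y1, y2):
    """Bilinearly interpolate a BEV feature at (x, y) from the four
    neighbouring grid features, where the feature fqij sits at (xi, yj)."""
    w = (x2 - x1) * (y2 - y1)                       # cell area
    return (fq11 * (x2 - x) * (y2 - y)
            + fq21 * (x - x1) * (y2 - y)
            + fq12 * (x2 - x) * (y - y1)
            + fq22 * (x - x1) * (y - y1)) / w
```

At the cell centre the result reduces to the mean of the four corner features.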
The RPN architecture for 3D proposal generation consists of three stages. Each stage starts with a down-sampling convolutional layer; BatchNorm and ReLU layers follow every convolutional layer. The output of each stage is then up-sampled to a common size and the feature maps are concatenated into one. Finally, a fully connected layer performs category prediction and position regression for each voxel. The Top-k proposal boxes with high overlap with the ground-truth boxes and high classification confidence are selected as candidate boxes.
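One stage of such an RPN could be sketched in PyTorch as below (channel counts, layer depth, and strides are illustrative; the patent follows the SECOND-style design rather than exactly this code):

```python
import torch
import torch.nn as nn

class RPNStage(nn.Module):
    """One RPN stage: a stride-2 down-sampling convolution followed by
    Conv-BatchNorm-ReLU layers, plus a transposed convolution that
    up-samples the stage output back to a common resolution so the
    per-stage maps can be concatenated."""
    def __init__(self, c_in, c_out, n_layers, up_stride):
        super().__init__()
        layers = [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                  nn.BatchNorm2d(c_out), nn.ReLU()]
        for _ in range(n_layers):
            layers += [nn.Conv2d(c_out, c_out, 3, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU()]
        self.block = nn.Sequential(*layers)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(c_out, c_out, up_stride, stride=up_stride),
            nn.BatchNorm2d(c_out), nn.ReLU())

    def forward(self, x):
        x = self.block(x)            # down-sampled stage output
        return x, self.up(x)         # (next-stage input, up-sampled map)
```

The up-sampled maps of the three stages would be concatenated along the channel axis before the prediction heads.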
After the distance-sampling and feature-sampling keypoints are obtained, the following fusion strategies are applied. For the distance keypoints, the per-layer features are concatenated. For the feature keypoints, taking four convolution layers and one projection layer as an example, the keypoint sequences of the different layers are:
fp_conv1 = {p_1, p_2, p_3, …, p_q}
fp_conv2 = {p_1, p_2, p_3, …, p_q}
fp_conv3 = {p_1, p_2, p_3, …, p_q}
fp_conv4 = {p_1, p_2, p_3, …, p_q}
fp_bev = {p_1, p_2, p_3, …, p_q}
The feature keypoint fusion strategy is:

fp = fp_conv1 ∪ fp_conv2 ∪ fp_conv3 ∪ fp_conv4 ∪ fp_bev
the characteristic splicing of the characteristic key points is as follows:
wherein fp represents the union of the characteristic sampling points of each sparse convolution layer, fp conv1 Representing the feature key points, fp, sampled after the first layer of sparse convolution bev Representing feature key points sampled after obtaining dense features from the bird's-eye view features by bilinear interpolation, ff representing global features of the feature key points,representing the local key point characteristics sampled after the first layer of sparse convolution,and representing the local key point characteristics sampled after dense characteristics are obtained from the bird's-eye view characteristics through bilinear interpolation.
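The concatenation that produces the global keypoint feature ff can be sketched as follows (a trivial NumPy stand-in for the fusion step; array shapes are illustrative):

```python
import numpy as np

def fuse_keypoint_features(layer_feats):
    """Concatenate the per-layer local keypoint features (each of shape
    (q, C_k)) along the channel axis to form the global feature of
    shape (q, C_conv1 + ... + C_bev)."""
    return np.concatenate(layer_feats, axis=1)
```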
The fusion strategy for the distance keypoints is the same as that for the feature keypoints.
Distance sampling features and feature sampling features are obtained through multi-scale keypoint feature fusion, and region-of-interest pooling is performed over the two different keypoint sequences. Multi-layer perceptrons (MLPs) then perform class prediction and target-box regression on the resulting ff and df. Specifically, a 3-layer fully connected network predicts the confidence of the target-box class, and another 3-layer fully connected network regresses the position of the target box; x, y, z, l, h, w, and θ denote, respectively, the centre coordinates of the target box, its length, width, and height, and its orientation angle in the bird's-eye view.
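Each 3-layer fully connected branch might look like the following PyTorch sketch (the hidden width of 256 is an assumption; the output size is the number of classes for the confidence branch and 7 values x, y, z, l, h, w, θ for the regression branch):

```python
import torch.nn as nn

def make_head(c_in, c_out, c_mid=256):
    """3-layer fully connected head, usable both for class confidence
    (c_out = number of classes) and for box regression (c_out = 7)."""
    return nn.Sequential(
        nn.Linear(c_in, c_mid), nn.ReLU(),
        nn.Linear(c_mid, c_mid), nn.ReLU(),
        nn.Linear(c_mid, c_out))
```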
The loss function of the model comprises a suggestion box generation network loss function and a network target box loss function;
the expression of the proposed box to generate the network loss function is:
wherein L is rpn Representing the suggestion box to generate a network loss function, L cls Represents the classification loss calculated by using the Focal loss, x, y, z represent the three-dimensional coordinates of the target frame, l, h, w represent the length, width and height of the target frame, theta represents the direction angle of the target frame,showing the Smooth-L1 loss calculation method,representing the classified prediction residual, Δ r a Representing regression residuals;
the expression of the network target box loss function is:
wherein L is rcnn Representing the network target Box loss function, L iou The prediction and truth boxes are shown to calculate the loss using the Focal loss,representing the predicted target frame residual, Δ r p Representing the regression residual.
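The two building blocks of these losses, the focal loss for classification and the Smooth-L1 loss for the box residuals, can be sketched as below (scalar versions with the usual defaults α = 0.25, γ = 2, β = 1; the exact weighting used in the patent is not reproduced here):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p and label y:
    the (1 - p_t)^gamma factor down-weights easy examples."""
    pt = p if y == 1 else 1.0 - p
    a = alpha if y == 1 else 1.0 - alpha
    return -a * (1.0 - pt) ** gamma * math.log(pt)

def smooth_l1(x, beta=1.0):
    """Smooth-L1 (Huber) loss on one residual x: quadratic near zero,
    linear beyond beta, so large residuals do not dominate."""
    x = abs(x)
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta
```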
Evaluation metrics: the mean average precision (mAP) with 40 recall positions is used to evaluate model performance at three difficulty levels: easy, moderate, and hard. To evaluate the overlap between detected boxes and ground-truth boxes, the official evaluation criteria are used. Specifically, for cars, the required bounding-box overlap on easy, moderate, and hard objects is 70%, 50%, and 50%, respectively; for pedestrians and cyclists, it is 50%, 25%, and 25%, respectively.
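The R40 metric averages interpolated precision over 40 recall positions; a sketch of that computation (input arrays are illustrative and assumed sorted by recall):

```python
import numpy as np

def ap_r40(recalls, precisions):
    """Average precision with 40 recall positions: sample the
    interpolated precision max{prec(r') : r' >= r} at the 40 recall
    thresholds 1/40, 2/40, ..., 1 and average (the KITTI R40 metric)."""
    ap = 0.0
    for r in np.linspace(1.0 / 40, 1.0, 40):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 40
```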
As shown in FIG. 3, target detection is performed directly on samples selected from the KITTI test set; the results show that the model detects targets well even for occluded objects and hard samples.
This embodiment is implemented in the Python programming language and can run on mainstream computer platforms. The operating system used is CentOS 6.5; an Intel i7 CPU, more than 16 GB of memory, more than 60 GB of disk space, and an NVIDIA GTX 1080Ti GPU with 11 GB of video memory are required.
The present invention was carried out based on the PyTorch 1.0 framework.
The above embodiments further illustrate the objects, technical solutions, and advantages of the present invention. It should be understood that they are only preferred embodiments and should not be construed as limiting the invention; any modifications, equivalent substitutions, or improvements made within the spirit and principle of the present invention shall fall within its protection scope.
Claims (8)
1. A point cloud 3D target detection method based on key point multi-scale feature fusion is characterized by comprising the following steps: acquiring point cloud data to be detected at the current moment, and inputting the acquired point cloud data into a trained point cloud 3D target detection model to obtain a target detection result;
the process of training the point cloud 3D target detection model comprises the following steps:
s1: acquiring original point cloud data, and selecting the original point cloud data by adopting a farthest distance sampling method to obtain a point cloud sequence;
s2: dividing original point cloud data into voxel blocks with equal intervals, and extracting initial features of the voxel blocks;
s3: inputting the initial characteristics of the point cloud sequence and the voxel block into a 3D sparse convolution neural network to obtain a voxel characteristic space; mapping the position information of the key points in the point cloud sequence to the voxel characteristic space of the corresponding position of each layer of sparse convolution, and updating the key position information;
s4: extracting the features of the keypoints in each layer of the voxel feature space by the distance farthest point sampling set extraction method to obtain the distance-sampling local features of the point cloud sequence; obtaining the position information of the keypoints dp through distance sampling, and mapping it by its index into the voxel feature space at the corresponding position of each sparse convolution layer, so that each keypoint has one and only one corresponding voxel at every layer, and updating the keypoint position information according to the voxel features; abstracting each voxel into a point, and extracting the voxel features by the PointNet++ set extraction method to obtain the features of the distance keypoints after sparse convolution; and fusing the sparse-convolved features by a local feature fusion strategy to obtain the distance-sampling local features;
s5: sampling the distance-sampling local features of the point cloud sequence by the feature farthest point sampling method to obtain the local feature keypoint features; mapping the position information of the keypoints dp obtained by distance sampling into the voxel feature space at the corresponding position of each sparse convolution layer, so that each keypoint has one and only one corresponding voxel at every layer; obtaining a feature keypoint sequence fp of length q by feature farthest point sampling, the sequence satisfying the constraint that the feature keypoints are a subset of the distance keypoints; abstracting each voxel into a point, and extracting the voxel features by the PointNet++ set extraction method to obtain the features of the feature keypoints after sparse convolution; and fusing the sparse-convolved features by the feature fusion formula to obtain the local feature keypoint features;
s6: fusing the distance sampling local features of each sparse convolution layer by adopting a fusion strategy to obtain distance sampling global features; fusing the local feature key point features of each sparse convolution layer by adopting a fusion strategy to obtain feature sampling global features;
s7: converting the voxel characteristic space into a 2D aerial view, and extracting dense characteristics of the aerial view by adopting a bilinear interpolation method; processing the dense features by adopting a regional feature extraction method to generate a 3D suggestion frame;
s8: performing region-of-interest pooling on the distance sampling global features and the feature sampling global features according to the 3D suggestion frame to obtain a target detection result;
s9: and calculating a loss function of the model according to the obtained result, adjusting parameters of the model, and finishing the training of the model when the loss function is minimum.
2. The method for detecting the point cloud 3D target based on the multi-scale feature fusion of the key points as claimed in claim 1, wherein the process of distance feature sampling of the original point cloud data comprises: randomly initializing a point in original point cloud data, and acquiring distance key points from all the point cloud data by adopting a distance farthest point sampling method by taking the point as the initial point to obtain a point cloud sequence; the formula of the farthest-distance point sampling method is as follows:
D-Distance(X, Y) = sqrt( Σ_d (X_d − Y_d)² )

where D-Distance denotes the L2 distance between two points; X and Y denote the coordinates and reflection intensities of the two points; sqrt denotes the square-root function over non-negative numbers; i and j denote the indices of the points; P denotes the point cloud; and the remaining symbols denote arbitrary points drawn from the spatial dimensions of the point cloud.
3. The method for detecting the point cloud 3D target based on the key point multi-scale feature fusion as claimed in claim 1, wherein the process of extracting the initial features of the voxel block comprises: equally dividing the input point cloud into voxel blocks with equal intervals, wherein the length, width and height of each voxel block are L, W and H respectively; and calculating the distance average value and the reflection intensity average value of each point in each voxel block, and taking the distance average value and the reflection intensity average value of each point as the initial characteristics of the voxel block.
4. The method for detecting the point cloud 3D target based on the key point multi-scale feature fusion as claimed in claim 1, wherein the process of obtaining the voxel feature space comprises: pre-allocating a buffer according to the number of divided voxel blocks; traversing the point cloud sequence, assigning each point to its associated voxel, and storing the voxel coordinates and the number of points in each voxel; building a hash table during the traversal of the point cloud sequence to check whether a point already falls in an existing voxel; if the voxel associated with a point exists, incrementing the number of points in that voxel by one, and if it does not exist, selecting another point for the query; obtaining the actual number of voxels from the coordinates of all voxels and the number of points in each voxel; deleting all empty voxels from the obtained voxels to obtain dense voxels; and performing the convolution operation on the dense voxels with GEMM to obtain the voxel feature space.
5. The method for detecting the point cloud 3D target based on the key point multi-scale feature fusion as claimed in claim 1, wherein the dense features are obtained from the bird's-eye view features by using bilinear interpolation: projecting the voxel characteristic space to a 2D aerial view through a Z axis, and performing interpolation operation by using adjacent voxel characteristics, wherein the operation formula is as follows:
f(x, y) = [ f(Q11)(x2 − x)(y2 − y) + f(Q21)(x − x1)(y2 − y) + f(Q12)(x2 − x)(y − y1) + f(Q22)(x − x1)(y − y1) ] / [ (x2 − x1)(y2 − y1) ]

where f(x, y) denotes the feature at the current interpolation coordinate, x denotes the abscissa of the point, y denotes the ordinate of the point, f(Q11) denotes the feature at coordinate Q11, and Q11, Q21, Q12, Q22 denote the coordinates of the four neighbouring voxels.
6. The method for detecting the point cloud 3D target based on the multi-scale feature fusion of the key points as claimed in claim 1, wherein the fusion strategy comprises a feature key point fusion strategy and a feature splicing strategy of the feature key points;
feature key point fusion strategy:
fp = fp_conv1 ∪ fp_conv2 ∪ fp_conv3 ∪ fp_conv4 ∪ fp_bev
characteristic splicing of characteristic key points:
where fp denotes the union of the feature sampling points of all sparse convolution layers; fp_conv1 denotes the feature keypoints sampled after the first sparse convolution layer; fp_bev denotes the feature keypoints sampled after dense features are obtained from the bird's-eye-view features by bilinear interpolation; ff denotes the global feature of the feature keypoints; and the remaining terms denote the local keypoint features sampled after the first sparse convolution layer and after bilinear interpolation of the bird's-eye-view features, respectively.
7. The method for detecting the point cloud 3D target based on the key point multi-scale feature fusion as claimed in claim 1, wherein the process of pooling the region of interest of the distance-sampling global features and the feature-sampling global features comprises: dividing the distance-sampling global features and the feature-sampling global features by the 3D proposal boxes, and generating 6 × 6 grid points at equal intervals in each 3D proposal box; acquiring the features of the grid points from the keypoints by the set extraction operation; and obtaining the target-box regression result and the target-box classification prediction result from the features of the grid points.
8. The method for detecting the point cloud 3D target based on the key point multi-scale feature fusion as claimed in claim 1, wherein the loss function of the model comprises a suggestion box generation network loss function and a network target box loss function;
the expression of the proposed box to generate the network loss function is:
where L_rpn denotes the proposal-generation network loss function; L_cls denotes the classification loss computed with the focal loss; x, y, z denote the three-dimensional coordinates of the target box, l, h, w its length, width, and height, and θ its orientation angle; the Smooth-L1 loss is used for regression, applied to the predicted classification residual and the anchor regression residual Δr_a;
the expression of the network target box loss function is:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110928928.6A CN113706480B (en) | 2021-08-13 | 2021-08-13 | Point cloud 3D target detection method based on key point multi-scale feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113706480A CN113706480A (en) | 2021-11-26 |
CN113706480B true CN113706480B (en) | 2022-12-09 |
Family
ID=78652592
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110928928.6A Active CN113706480B (en) | 2021-08-13 | 2021-08-13 | Point cloud 3D target detection method based on key point multi-scale feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113706480B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114266992A (en) * | 2021-12-13 | 2022-04-01 | 北京超星未来科技有限公司 | Target detection method and device and electronic equipment |
CN114299243A (en) * | 2021-12-14 | 2022-04-08 | 中科视语(北京)科技有限公司 | Point cloud feature enhancement method and device based on multi-scale fusion |
CN114359660B (en) * | 2021-12-20 | 2022-08-26 | 合肥工业大学 | Multi-modal target detection method and system suitable for modal intensity change |
CN114494609B (en) * | 2022-04-02 | 2022-09-06 | 中国科学技术大学 | 3D target detection model construction method and device and electronic equipment |
CN114913519B (en) * | 2022-05-16 | 2024-04-19 | 华南师范大学 | 3D target detection method and device, electronic equipment and storage medium |
CN115375731B (en) * | 2022-07-29 | 2023-07-04 | 大连宗益科技发展有限公司 | 3D point cloud single-target tracking method for association points and voxels and related device |
CN115578393B (en) * | 2022-12-09 | 2023-03-10 | 腾讯科技(深圳)有限公司 | Key point detection method, key point training method, key point detection device, key point training device, key point detection equipment, key point detection medium and key point detection medium |
CN116665003B (en) * | 2023-07-31 | 2023-10-20 | 安徽大学 | Point cloud three-dimensional target detection method and device based on feature interaction and fusion |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111160214A (en) * | 2019-12-25 | 2020-05-15 | 电子科技大学 | 3D target detection method based on data fusion |
CN111199206A (en) * | 2019-12-30 | 2020-05-26 | 上海眼控科技股份有限公司 | Three-dimensional target detection method and device, computer equipment and storage medium |
CN111242041A (en) * | 2020-01-15 | 2020-06-05 | 江苏大学 | Laser radar three-dimensional target rapid detection method based on pseudo-image technology |
CN111429514A (en) * | 2020-03-11 | 2020-07-17 | 浙江大学 | Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds |
WO2020151109A1 (en) * | 2019-01-22 | 2020-07-30 | 中国科学院自动化研究所 | Three-dimensional target detection method and system based on point cloud weighted channel feature |
CN111968133A (en) * | 2020-07-31 | 2020-11-20 | 上海交通大学 | Three-dimensional point cloud data example segmentation method and system in automatic driving scene |
CN112347987A (en) * | 2020-11-30 | 2021-02-09 | 江南大学 | Multimode data fusion three-dimensional target detection method |
CN112731339A (en) * | 2021-01-04 | 2021-04-30 | 东风汽车股份有限公司 | Three-dimensional target detection system based on laser point cloud and detection method thereof |
Non-Patent Citations (2)
Title |
---|
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space;Charles R. Qi 等;《31st Conference on Neural Information Processing Systems (NIPS 2017)》;20171231;全文 * |
PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection;Shaoshuai Shi 等;《2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》;20200619;全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN113706480A (en) | 2021-11-26 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |