CN111612059B - Construction method of multi-plane coding point cloud feature deep learning model based on PointPillars - Google Patents

Construction method of multi-plane coding point cloud feature deep learning model based on PointPillars

Info

Publication number
CN111612059B
Authority
CN
China
Prior art keywords
point cloud
plane
coordinate
feature
empty
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010425656.3A
Other languages
Chinese (zh)
Other versions
CN111612059A (en)
Inventor
周洋
吕精灵
李小毛
彭艳
蒲华燕
谢少荣
罗均
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202010425656.3A
Publication of CN111612059A
Application granted
Publication of CN111612059B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The invention belongs to the technical field of computer vision and discloses a construction method of a multi-plane coding point cloud feature deep learning model based on PointPillars. The construction method comprises the following steps: acquiring training samples, and training the multi-plane coding point cloud feature deep learning model with the training samples, so that inputting the point cloud data of a training sample into the trained model yields a recognition result, namely the bounding box coordinates of the detection target in the point cloud data and the probability that a target exists within those bounding box coordinates. The multi-plane coding point cloud feature deep learning model constructed by the invention samples the point cloud data over the full three-dimensional space and learns and fuses the pillar point cloud features sampled on three planes. This solves the loss of spatial information in existing point cloud sampling, reduces the loss of detection precision caused by the differing angles of the point cloud in each spatial direction, and yields good robustness and high detection accuracy.

Description

Construction method of multi-plane coding point cloud feature deep learning model based on PointPillars
Technical Field
The invention belongs to the technical field of computer vision and particularly relates to a construction method of a multi-plane coding point cloud feature deep learning model based on PointPillars.
Background
Target detection is an important task in computer vision, whose aim is to identify the category of a target and locate it. Traditional two-dimensional target detection is now mature in the computer vision field, but it operates at the image level and only contains the planar information of an object. With the rapid development of the autonomous driving industry, target detection increasingly focuses on the three-dimensional information of objects, and three-dimensional target detection technology based on deep learning has developed rapidly. Current three-dimensional target detection mainly relies on images and LiDAR point clouds for environmental perception. From these two kinds of data, the spatial structure of an object can be extracted, including its posture, size, motion direction, shape and so on. Recognizing objects from LiDAR point cloud data is the core problem of current three-dimensional target detection; because point cloud data are sparse, unordered and unstructured, and recognition is difficult in extreme environments, three-dimensional target detection from LiDAR point clouds remains an open problem.
In recent years, scholars at home and abroad have proposed various three-dimensional target detection algorithms, mainly applied to point clouds acquired by LiDAR sensors in autonomous driving scenes. These methods include fusing bird's-eye-view images and 2D camera images after processing each with a 2D CNN, such as AVOD, and voxelizing the three-dimensional scene by converting the point cloud into regular 3D voxels and learning features with three-dimensional convolutions, which however is computationally heavy and slow. The PointPillars paper proposed a novel encoding scheme in which PointNet learns a vertical-column (pillar) representation of the point cloud; the encoded features can then be learned with a mature 2D convolution framework, giving higher speed at lower computational cost: the method runs at 62 Hz, and a fast version reaches 105 Hz.
MV3D is a multi-view 3D object recognition network proposed by scholars, which takes multi-modal data as input and predicts targets in 3D space, using RGB images, the LiDAR bird's-eye view and the LiDAR front view as network inputs to achieve accurate vehicle recognition and 3D box regression.
Scholars have also proposed three-dimensional target detection methods based on monocular, binocular and depth-camera vision. For detection in indoor scenes, the scene scale is small, the long-distance targets of outdoor scenes do not appear, and the object categories are more diverse, so richer input information is needed; methods based on binocular/depth cameras are therefore more suitable. A depth map channel is added, that is, an image or image channel containing information about the distance from the viewpoint to the surfaces of scene objects, whose pixel values are the actual distances from the sensor to the objects, and multiple features such as image texture features and depth features are fused, as in algorithms such as Depth RCNN, AD3D and 2D-ivedrn; however, the improvement in detection effect rests on the effectiveness of the model for 2D target detection. Moreover, indoor scenes are cluttered and contain many small targets and many occluding objects, which often degrades detection precision. In 2012, Fidler et al. extended DPM to three-dimensional object detection under monocular vision, expressing each object class as a deformable three-dimensional cuboid and, through the transformation relationship between object parts and the faces of the three-dimensional detection box, effectively realized three-dimensional detection of some indoor objects with obvious shape characteristics, such as beds and tables with clear cuboid shapes. To improve the detection accuracy of multiple targets in indoor scenes, Zhuo et al. proposed an end-to-end monocular three-dimensional target detection network combining a depth estimation network and a 3D RPN.
For three-dimensional target detection in outdoor scenes, the three-dimensional geometric information of a target can be regressed from a monocular vision sensor by combining methods such as prior-information fusion, geometric constraints, three-dimensional model matching and monocular depth estimation networks. Chen et al. proposed the Mono3D target detection method in 2016, but when 3D detection boxes are extracted using complex prior information, errors accumulate in the energy loss calculation, so the detection performance is not outstanding; there is a gap compared with 2D detectors, and end-to-end training is not possible. Mousavian et al. proposed the Deep3DBox 3D target detection method using the learning experience of 2D target detector networks. The method extends a 2D target detector network and obtains the three-dimensional size and heading angle of the target by regression, which greatly reduces the computational cost and improves speed, but due to the lack of depth information there is no substantial improvement in detection accuracy. Monocular three-dimensional target detection algorithms thus suffer from drawbacks such as poor localization of small and occluded targets, and the estimation bias of depth information is the main reason for low detection accuracy, especially when locating distant and occluded targets. Binocular/depth cameras rely on the advantage of accurate depth information, and for target detection and localization tasks in three-dimensional space they improve detection precision markedly compared with monocular vision algorithms.
In recent years, with the development of deep learning and artificial intelligence, these techniques have been applied to more and more fields. Application scenes in fields such as autonomous driving are complex and changeable; traditional two-dimensional target detection algorithms have obvious limitations, and improving detection accuracy and precision is essential to guarantee driver safety, which poses great challenges to the precision and speed of three-dimensional target detection. However, autonomous driving scenes are open and spacious, the point cloud collected by LiDAR is non-uniform, and distant points are very sparse, so deep learning methods that use the spatial point cloud need to capture complete spatial information.
Disclosure of Invention
Aiming at the problems and deficiencies in the prior art, the invention aims to provide a construction method of a multi-plane coding point cloud feature deep learning model based on PointPillars.
In order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows:
The invention first provides a construction method of a multi-plane coding point cloud feature deep learning model based on PointPillars, which comprises the following steps: acquiring a training sample, wherein the training sample comprises point cloud data containing a detection target and annotation information corresponding to the point cloud data, and the annotation information indicates the bounding box coordinates of the detection target in the point cloud data and the classification label of the detection target within the bounding box coordinates; and training the multi-plane coding point cloud feature deep learning model with the training sample, so that inputting the point cloud data of the training sample into the trained model yields a recognition result, namely the bounding box coordinates of the detection target in the point cloud data and the probability that a target exists within the bounding box coordinates.
According to the construction method, preferably, the multi-plane coding point cloud feature deep learning model is an improvement of the PointPillars algorithm; specifically, a multi-plane fusion feature encoding network replaces the feature encoder network of the PointPillars algorithm. The multi-plane coding point cloud feature deep learning model consists of the multi-plane fusion feature encoding network, a Backbone network and a Detection Head network, where the Backbone network and the Detection Head network are the original Backbone and Detection Head networks of the PointPillars algorithm with unchanged structures. The input of the multi-plane fusion feature encoding network is point cloud data, and its output is a sparse pseudo image converted from the fused features of the point cloud; the input of the Backbone network is the sparse pseudo image, and its output is a convolutional feature map of the sparse pseudo image; the input of the Detection Head network is the convolutional feature map output by the Backbone network, and its output is the predicted bounding box coordinates of the detection target in the point cloud data and the probability that a target exists in the predicted bounding box.
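The overall data flow through the three networks can be summarized in a short skeleton. The sketch below is illustrative only, assuming PyTorch modules; the concrete layer configurations of the encoder, Backbone and Detection Head are those described in the embodiments, not part of this snippet.

```python
import torch.nn as nn

class MultiPlanePointPillars(nn.Module):
    """Skeleton of the three-stage model described above; the sub-modules are
    placeholders, not the patent's concrete layer configuration."""
    def __init__(self, encoder, backbone, detection_head):
        super().__init__()
        self.encoder = encoder                 # multi-plane fusion feature encoding network
        self.backbone = backbone               # original PointPillars Backbone (2D CNN)
        self.detection_head = detection_head   # original PointPillars Detection Head

    def forward(self, point_cloud):
        pseudo_image = self.encoder(point_cloud)          # point cloud -> sparse pseudo image (3C, H, W)
        feature_map = self.backbone(pseudo_image)         # pseudo image -> convolutional feature map
        boxes, scores = self.detection_head(feature_map)  # feature map -> predicted boxes and probabilities
        return boxes, scores
```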
According to the construction method, preferably, the specific steps of training the multi-plane coding point cloud feature deep learning model by using the training samples are as follows:
(1) Inputting a training sample into the multi-plane fusion feature encoding network, which fusion-encodes the features of the x-y, x-z and y-z planes of the point cloud data in the training sample to obtain the fused point cloud features of the x-y plane and converts them into a sparse pseudo image;
(2) Inputting the sparse pseudo image into the Backbone network for feature extraction to obtain a convolutional feature map of the sparse pseudo image;
(3) Inputting the convolutional feature map of the sparse pseudo image into the Detection Head network to obtain the predicted bounding box coordinates of the detection target in the point cloud data and the probability that a target exists in the predicted bounding box;
(4) Taking the predicted bounding box coordinates obtained in step (3) as the predicted result and the bounding box coordinates labelled in the training sample as the true result, constructing a loss function from the predicted and true results, adopting a squared-error loss function as the loss function, optimizing the network parameters of the multi-plane coding point cloud feature deep learning model by stochastic gradient descent to reduce the value of the loss function, and iterating this process to optimize the network parameters until the loss function stops decreasing; the training process of the multi-plane coding point cloud feature deep learning model then ends, yielding the trained multi-plane coding point cloud feature deep learning model.
According to the above construction method, preferably, the specific operation of step (1) is:
(1a) Dispersing the point cloud data in the training sample onto a grid of uniform spacing on the x-y plane, with no limit in the z direction, thereby creating a series of pillars on the x-y plane; the point cloud contained in each pillar is expanded with the features r, x_c, y_c, z_c, x_p, y_p to obtain the expanded point cloud feature (x, y, z, r, x_c, y_c, z_c, x_p, y_p), whose dimension is D = 9; where x, y, z are the initial coordinate values of the point; r is the point reflectivity; x_c, y_c, z_c are the coordinate values obtained as the arithmetic mean of the coordinates of all points in the pillar; and x_p, y_p are the deviations of each point in the pillar from the coordinate center position in the coordinate system of the current plane;
(1b) On the x-y plane, adjusting the number of points contained in all non-empty pillars to be consistent, and then creating a dense tensor (D, P, N) from the number of non-empty pillars on the plane, the number of points contained in each non-empty pillar, and the features of the points in the non-empty pillars, obtaining the features (D, P, N) of each non-empty pillar on the x-y plane, where D is the feature dimension of the points in a non-empty pillar, P is the number of non-empty pillars on the x-y plane, and N is the number of points contained in a non-empty pillar;
(1c) Performing feature learning on the features (D, P, N) of the non-empty pillars on the x-y plane with a PointNet network, obtaining the final features (C, P, N) of the point cloud in each non-empty pillar on the x-y plane; C is the new feature dimension obtained after the points are learned by the PointNet network, P is the number of non-empty pillars on the x-y plane, and N is the number of points contained in a non-empty pillar;
(1d) Dispersing the point cloud data in the training sample onto a grid of uniform spacing on the x-z plane, with no limit in the y direction, thereby creating a series of pillars on the x-z plane; the point cloud contained in each pillar is expanded with the features r, x_c, y_c, z_c, x_p, z_p to obtain the expanded point cloud feature (x, y, z, r, x_c, y_c, z_c, x_p, z_p); where x, y, z are the initial coordinate values of the point; r is the point reflectivity; x_c, y_c, z_c are the coordinate values obtained as the arithmetic mean of the coordinates of all points in the pillar; and x_p, z_p are the deviations of each point in the pillar from the coordinate center position in the coordinate system of the current plane; then, following the operations of steps (1b) to (1c), obtaining the final point cloud features (C, P, N) of each non-empty pillar on the x-z plane;
(1e) Dispersing the point cloud data in the training sample onto a grid of uniform spacing on the y-z plane, with no limit in the x direction, thereby creating a series of pillars on the y-z plane; the point cloud contained in each pillar is expanded with the features r, x_c, y_c, z_c, y_p, z_p to obtain the expanded point cloud feature (x, y, z, r, x_c, y_c, z_c, y_p, z_p); where x, y, z are the initial coordinate values of the point; r is the point reflectivity; x_c, y_c, z_c are the coordinate values obtained as the arithmetic mean of the coordinates of all points in the pillar; and y_p, z_p are the deviations of each point in the pillar from the coordinate center position in the coordinate system of the current plane; then, following the operations of steps (1b) to (1c), obtaining the final point cloud features (C, P, N) of each non-empty pillar on the y-z plane;
(1f) Stacking the final point cloud features of the non-empty pillars of the x-z plane and the y-z plane with the final point cloud features of the non-empty pillars of the x-y plane to obtain the fused point cloud features (3C, P, N) of the x-y plane, processing the fused features (3C, P, N) with a max pooling operation to obtain a tensor (3C, P), and then creating a sparse pseudo image (3C, H, W) from the tensor (3C, P), where H is the height of the sparse pseudo image and W is its width.
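A minimal sketch of the pillar decoration of step (1a) is given below, assuming NumPy arrays and reading the x_p, y_p offsets as deviations from the geometric centre of the pillar on the sampling plane, which is one interpretation of "the coordinate center position in the coordinate system of the current plane". The detection range and pillar size are illustrative defaults, not values fixed by the patent; steps (1d) and (1e) follow the same pattern with the roles of the axes exchanged.

```python
import numpy as np

def decorate_pillars_xy(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                        pillar_size=0.16):
    """Group points (N, 4) = (x, y, z, r) into x-y pillars (step (1a)) and expand
    each point to the 9-D feature (x, y, z, r, x_c, y_c, z_c, x_p, y_p)."""
    x, y = points[:, 0], points[:, 1]
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    points = points[keep]
    n_x = int(np.ceil((x_range[1] - x_range[0]) / pillar_size))
    ix = ((points[:, 0] - x_range[0]) / pillar_size).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) / pillar_size).astype(np.int64)
    pillar_id = iy * n_x + ix                               # z is left unrestricted

    decorated = {}
    for pid in np.unique(pillar_id):                        # only non-empty pillars appear
        pts = points[pillar_id == pid]
        center = pts[:, :3].mean(axis=0)                    # arithmetic mean -> x_c, y_c, z_c
        # geometric centre of this pillar on the x-y sampling plane
        gx = x_range[0] + (pid % n_x + 0.5) * pillar_size
        gy = y_range[0] + (pid // n_x + 0.5) * pillar_size
        decorated[pid] = np.hstack([
            pts,                                            # x, y, z, r
            np.repeat(center[None, :], len(pts), axis=0),   # x_c, y_c, z_c
            pts[:, 0:1] - gx,                               # x_p
            pts[:, 1:2] - gy,                               # y_p
        ])                                                  # shape (n_points_in_pillar, 9)
    return decorated
```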
According to the above construction method, the x-z plane and the y-z plane are preferably the same size as the x-y plane.
According to the construction method, preferably, in step (2) the Backbone network is used to extract features from the sparse pseudo image; a convolution kernel traverses the whole sparse pseudo image from left to right and from top to bottom, and the dimensions of the feature map output after each convolutional layer applied to the input sparse pseudo image are:
W_2 = (W_1 - F + 2P) / S + 1    (I)
H_2 = (H_1 - F + 2P) / S + 1    (II)
D_2 = K    (III)
where W_1, H_1, D_1 are the width, height and depth of the feature map input to the convolutional layer; W_2, H_2, D_2 are the width, height and depth of the output feature map after convolution; K is the number of convolution kernels; F is the convolution kernel size of the layer; P is the amount of zero padding of the layer's input feature map; and S is the stride.
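Formulas (I) to (III) can be checked with a small helper; the layer parameters in the example call are illustrative only, not the patent's Backbone configuration.

```python
def conv_output_shape(w_in, h_in, kernel_size, padding, stride, n_kernels):
    """Output width, height and depth of one convolutional layer,
    following formulas (I)-(III): W2 = (W1 - F + 2P)/S + 1, H2 likewise, D2 = K."""
    w_out = (w_in - kernel_size + 2 * padding) // stride + 1
    h_out = (h_in - kernel_size + 2 * padding) // stride + 1
    return w_out, h_out, n_kernels

# e.g. a 3x3 convolution with padding 1 and stride 2 roughly halves a 432 x 496 map
print(conv_output_shape(432, 496, kernel_size=3, padding=1, stride=2, n_kernels=64))
# -> (216, 248, 64)
```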
According to the above construction method, preferably, the specific operation of step (3) is:
(3a) Inputting the convolutional feature map of the sparse pseudo image into the Detection Head network, and finding the center coordinates of each positional feature of the feature map on the x-y sampling plane according to the receptive-field mapping relationship; setting 3D preset boxes at the center coordinates in the x-y sampling plane, with two 3D preset boxes of different angles at each center coordinate, each the same size as the average size of the labelled detection target bounding boxes in the training samples; then projecting the 3D preset boxes and the labelled detection target bounding boxes onto the x-y plane and computing their IoU, comparing the computed IoU with a set threshold, and screening 3D candidate boxes out of the 3D preset boxes; a 3D preset box whose IoU is greater than the set threshold is a 3D candidate box, and the initial position coordinates of a 3D candidate box are (G_x, G_y, G_z, G_w, G_h, G_l, G_θ);
(3b) Performing box regression on the 3D candidate boxes screened in step (3a) to obtain the coordinate correction offsets (d_x, d_y, d_z, d_w, d_h, d_l, d_θ) of each 3D candidate box, calculating the position coordinates (R_x, R_y, R_z, R_w, R_h, R_l, R_θ) of the predicted bounding box of the detection target from the initial position coordinates of the 3D candidate box and the coordinate correction offsets obtained by box regression, outputting them, and simultaneously outputting the probability that a detection target exists in the predicted bounding box.
According to the above construction method, preferably, the angles of the two 3D preset boxes set at each center coordinate in step (3a) are 0 degrees and 90 degrees, respectively.
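The anchor layout and the IoU screening of step (3a) can be sketched as follows. The sketch assumes NumPy, an average car-sized box, and, for brevity, treats both boxes as axis-aligned on the x-y projection (exact for the 0-degree and 90-degree preset boxes, an approximation for labelled boxes with other yaw angles); the mean box size, the z centre and the orientation convention are assumptions, not values from the patent.

```python
import numpy as np

def make_anchors(centers_xy, mean_size=(1.6, 3.9, 1.56), z_center=-1.0):
    """One 0-degree and one 90-degree 3D preset box per feature-map cell centre;
    mean_size = (w, l, h) stands in for the average labelled box."""
    w, l, h = mean_size
    anchors = []
    for cx, cy in centers_xy:
        anchors.append((cx, cy, z_center, w, l, h, 0.0))
        anchors.append((cx, cy, z_center, w, l, h, np.pi / 2))
    return np.asarray(anchors)  # (2 * n_cells, 7): x, y, z, w, l, h, theta

def bev_iou_axis_aligned(a, b):
    """IoU of two boxes projected on the x-y plane; yaw other than 0/90 degrees
    is ignored here to keep the sketch short."""
    def to_corners(box):
        cx, cy, _, w, l, _, th = box
        dx, dy = (l, w) if abs(th) < 1e-3 else (w, l)   # length along x at 0 rad; 90 deg swaps the footprint
        return cx - dx / 2, cy - dy / 2, cx + dx / 2, cy + dy / 2
    ax1, ay1, ax2, ay2 = to_corners(a)
    bx1, by1, bx2, by2 = to_corners(b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A preset box is then kept as a 3D candidate box whenever its IoU against some labelled box exceeds the set threshold.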
According to the above construction method, preferably, the specific calculation process of the position coordinates of the predicted bounding box of the detection target in step (3b) is as follows: calculating the position coordinates of the predicted bounding box according to formulas (IV) to (X) from the initial position coordinates of the 3D candidate box and the coordinate correction offsets of the candidate box output by the box regression network;
R_x = G_x × d_x + G_x    (Ⅳ)
R_y = G_y × d_y + G_y    (Ⅴ)
R_z = G_z × d_z + G_z    (Ⅵ)
R_w = G_w × e^(d_w)    (Ⅶ)
R_h = G_h × e^(d_h)    (Ⅷ)
R_l = G_l × e^(d_l)    (Ⅸ)
R_θ = G_θ × d_θ + G_θ    (Ⅹ)
where G_x is the abscissa of the 3D candidate box center position, G_y is the ordinate of the 3D candidate box center position, G_z is the z-coordinate of the 3D candidate box center position, G_w is the width of the 3D candidate box, G_h is the height of the 3D candidate box, G_l is the length of the 3D candidate box, and G_θ is the angle of the 3D candidate box; d_x is the offset of the abscissa of the 3D candidate box center position, d_y is the offset of the ordinate of the 3D candidate box center position, and d_z is the offset of the z-coordinate of the 3D candidate box center position; d_w is the offset of the 3D candidate box width, d_h is the offset of the 3D candidate box height, d_l is the offset of the 3D candidate box length, and d_θ is the offset of the 3D candidate box angle; R_x is the abscissa of the predicted bounding box center position, R_y is the ordinate of the predicted bounding box center position, R_z is the z-coordinate of the predicted bounding box center position, R_w is the width of the predicted bounding box, R_h is the height of the predicted bounding box, R_l is the length of the predicted bounding box, and R_θ is the angle of the predicted bounding box.
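A direct transcription of formulas (IV) to (X) into code is shown below, using G_z in formula (VI) as written above; note that the centre coordinates are updated multiplicatively by the regressed offsets exactly as the formulas state, which differs from the additive encoding used by some other detectors.

```python
import numpy as np

def decode_box(anchor, offsets):
    """Apply formulas (IV)-(X) to one 3D candidate box.
    anchor  = (G_x, G_y, G_z, G_w, G_h, G_l, G_theta)
    offsets = (d_x, d_y, d_z, d_w, d_h, d_l, d_theta) from the box-regression head."""
    gx, gy, gz, gw, gh, gl, gth = anchor
    dx, dy, dz, dw, dh, dl, dth = offsets
    return (
        gx * dx + gx,          # R_x      (IV)
        gy * dy + gy,          # R_y      (V)
        gz * dz + gz,          # R_z      (VI)
        gw * np.exp(dw),       # R_w      (VII)
        gh * np.exp(dh),       # R_h      (VIII)
        gl * np.exp(dl),       # R_l      (IX)
        gth * dth + gth,       # R_theta  (X)
    )
```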
According to the above construction method, preferably, the specific operation of obtaining a training sample is: collecting point cloud data containing detection targets, framing the bounding boxes of all detection targets in the point cloud data with a labelling tool, labelling the position coordinates (x, y, z, w, l, h, θ) of each bounding box in space and the classification label of the detection target in each bounding box, and taking the labelled point cloud data as the training sample; where x is the x-axis coordinate of the bounding box center, y is the y-axis coordinate of the bounding box center, z is the z-axis coordinate of the bounding box center, w is the width of the bounding box, l is the length of the bounding box, h is the height of the bounding box, and θ is the angle of the projection of the bounding box onto the x-y plane. More preferably, the detection target is any one of a vehicle, a pedestrian, or a bicycle.
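For illustration, one labelled box of a training sample can be held in a small record such as the following; the field names are hypothetical, only the seven geometric values and the class label are prescribed above.

```python
from dataclasses import dataclass

@dataclass
class BoxLabel:
    """One labelled 3D bounding box (x, y, z, w, l, h, theta) plus its class label."""
    x: float      # x-axis coordinate of the box centre
    y: float      # y-axis coordinate of the box centre
    z: float      # z-axis coordinate of the box centre
    w: float      # width of the box
    l: float      # length of the box
    h: float      # height of the box
    theta: float  # angle of the box projected onto the x-y plane
    label: str    # classification label, e.g. "vehicle", "pedestrian" or "bicycle"
```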
The invention also provides a method for detecting targets in point cloud data using the multi-plane coding point cloud feature deep learning model constructed by the above construction method.
Compared with the prior art, the invention has the following positive beneficial effects:
In existing point cloud target detection methods, when Pillars sampling is performed on spatial point cloud data, the data is sampled only on the x-y plane. Pillars obtained by sampling only on the x-y plane do not contain the complete spatial information of the point cloud data, and when these pillars are subsequently detected and analysed, point cloud spatial information is lost, so target detection accuracy and precision are low. The invention constructs a novel PointPillars-based multi-plane coding point cloud feature deep learning model, which samples the x-y, x-z and y-z planes of the spatial point cloud data separately to obtain the point cloud features in the pillars of the three planes, learns these features with PointNet, fuses the learned point cloud features of the pillars of the three planes, and then analyses and detects the fused point cloud features. The PointPillars-based multi-plane coding point cloud feature deep learning model therefore samples the point cloud data over the full three-dimensional space and learns and fuses the point cloud features of the pillars of the three planes obtained by sampling, which solves the spatial information loss of existing point cloud sampling, enhances the acquisition of point cloud feature information in multiple directions of the whole space, and reduces the loss of detection precision caused by the differing angles of the point cloud in each spatial direction; by fusing the point cloud features extracted in the pillars on three planes, the shape and position characteristics of objects in different directions are captured better, thereby improving the robustness and detection accuracy of the detection model.
Drawings
FIG. 1 is a flowchart of the training process of the PointPillars-based multi-plane coding point cloud feature deep learning model of the present invention.
FIG. 2 is a schematic structural diagram of the Backbone network.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the scope of the present invention is not limited thereto.
Example 1:
a construction method of a multi-plane coding point cloud feature deep learning model based on pointpilars comprises the following steps:
the method comprises the following steps: and acquiring a training sample, wherein the training sample comprises point cloud data containing a detection target and marking information corresponding to the point cloud data, and the marking information is used for indicating a boundary box coordinate of the detection target in the point cloud data and a classification label of the detection target in the boundary box coordinate.
Step two: and training the multi-plane coding point cloud characteristic deep learning model by adopting the training sample, so that the point cloud data in the training sample is input into the trained multi-plane coding point cloud characteristic deep learning model to obtain a recognition result, namely the position boundary box coordinates of the detection target in the point cloud data and the existence probability of the target in the boundary box coordinates.
The specific operation of obtaining the training sample in the first step is as follows:
collecting point cloud data containing detection targets, wherein the detection targets are vehicles, adopting a marking tool frame to output boundary frames of all the detection targets in the point cloud data, marking position coordinates (x, y, z, w, l, h and theta) of each boundary frame in space and classification labels of the detection targets in each boundary frame, and then taking the marked point cloud data as training samples; wherein x is the x-axis coordinate of the center of the bounding box, y is the y-axis coordinate of the center of the bounding box, z is the z-axis coordinate of the center of the bounding box, w is the width of the bounding box, l represents the length of the bounding box, h is the height of the bounding box, and theta is the angle of the projection of the bounding box to the x-y plane.
In step two, the multi-plane coding point cloud feature deep learning model is an improvement of the PointPillars algorithm; specifically, a multi-plane fusion feature encoding network replaces the feature encoder network of the PointPillars algorithm. The multi-plane coding point cloud feature deep learning model consists of the multi-plane fusion feature encoding network, a Backbone network and a Detection Head network, where the Backbone network and the Detection Head network are the original Backbone and Detection Head networks of the PointPillars algorithm with unchanged structures. The input of the multi-plane fusion feature encoding network is point cloud data, and its output is a sparse pseudo image converted from the fused features of the point cloud; the input of the Backbone network is the sparse pseudo image, and its output is a convolutional feature map of the sparse pseudo image; the input of the Detection Head network is the convolutional feature map output by the Backbone network, and its output is the predicted bounding box coordinates of the detection target in the point cloud data and the probability that a target exists in the predicted bounding box.
In step two, the specific steps of training the multi-plane coding point cloud feature deep learning model with the training samples are as follows (as shown in FIG. 1):
(1) Inputting a training sample into the multi-plane fusion feature encoding network, which fusion-encodes the features of the x-y, x-z and y-z planes of the point cloud data in the training sample to obtain the fused point cloud features of the x-y plane and converts them into a sparse pseudo image.
The specific operation of step (1) is as follows:
(1a) Dispersing the point cloud data in the training sample onto a grid of uniform spacing on the x-y plane, with no limit in the z direction, thereby creating a series of pillars on the x-y plane; the point cloud contained in each pillar is expanded with the features r, x_c, y_c, z_c, x_p, y_p to obtain the expanded point cloud feature (x, y, z, r, x_c, y_c, z_c, x_p, y_p), whose dimension is D = 9; where x, y, z are the initial coordinate values of the point; r is the point reflectivity; x_c, y_c, z_c are the coordinate values obtained as the arithmetic mean of the coordinates of all points in the pillar; and x_p, y_p are the deviations of each point in the pillar from the coordinate center position in the coordinate system of the current plane;
(1b) On the x-y plane, adjusting the number of points contained in all non-empty pillars to be consistent, and then creating a dense tensor (D, P, N) from the number of non-empty pillars on the plane, the number of points contained in each non-empty pillar, and the features of the points in the non-empty pillars, obtaining the features (D, P, N) of each non-empty pillar on the x-y plane, where D is the feature dimension of the points in a non-empty pillar, P is the number of non-empty pillars on the x-y plane, and N is the number of points contained in a non-empty pillar;
(1c) Performing feature learning on the features (D, P, N) of the non-empty pillars on the x-y plane with a PointNet network, obtaining the final features (C, P, N) of the point cloud in each non-empty pillar on the x-y plane; C is the new feature dimension obtained after the points are learned by the PointNet network, P is the number of non-empty pillars on the x-y plane, and N is the number of points contained in a non-empty pillar;
(1d) Dispersing the point cloud data in the training sample onto a grid of uniform spacing on the x-z plane, with no limit in the y direction, the x-z plane being the same size as the x-y plane, thereby creating a series of pillars on the x-z plane; the point cloud contained in each pillar is expanded with the features r, x_c, y_c, z_c, x_p, z_p to obtain the expanded point cloud feature (x, y, z, r, x_c, y_c, z_c, x_p, z_p), whose dimension is D = 9; where x, y, z are the initial coordinate values of the point; r is the point reflectivity; x_c, y_c, z_c are the coordinate values obtained as the arithmetic mean of the coordinates of all points in the pillar; and x_p, z_p are the deviations of each point in the pillar from the coordinate center position in the coordinate system of the current plane; then, following the operations of steps (1b) to (1c), obtaining the final point cloud features (C, P, N) of each non-empty pillar on the x-z plane;
(1e) Dispersing the point cloud data in the training sample onto a grid of uniform spacing on the y-z plane, with no limit in the x direction, the y-z plane being the same size as the x-y plane, thereby creating a series of pillars on the y-z plane; the point cloud contained in each pillar is expanded with the features r, x_c, y_c, z_c, y_p, z_p to obtain the expanded point cloud feature (x, y, z, r, x_c, y_c, z_c, y_p, z_p), whose dimension is D = 9; where x, y, z are the initial coordinate values of the point; r is the point reflectivity; x_c, y_c, z_c are the coordinate values obtained as the arithmetic mean of the coordinates of all points in the pillar; and y_p, z_p are the deviations of each point in the pillar from the coordinate center position in the coordinate system of the current plane; then, following the operations of steps (1b) to (1c), obtaining the final point cloud features (C, P, N) of each non-empty pillar on the y-z plane;
(1f) Stacking the final point cloud features of the non-empty pillars of the x-z plane and the y-z plane with the final point cloud features of the non-empty pillars of the x-y plane to obtain the fused point cloud features (3C, P, N) of the x-y plane, processing the fused features (3C, P, N) with a max pooling operation to obtain a tensor (3C, P), and then creating a sparse pseudo image (3C, H, W) from the tensor (3C, P), where H is the height of the sparse pseudo image and W is its width.
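Before moving on to step (2), step (1f) can be sketched as follows, assuming PyTorch tensors and assuming that the three planes have been adjusted to the same number of non-empty pillars P and points N so that their (C, P, N) features can be stacked; how the x-z and y-z pillar features are aligned with the x-y pillars is likewise an assumption, since the embodiment only states that the features are stacked.

```python
import torch

def fuse_and_scatter(feat_xy, feat_xz, feat_yz, xy_pillar_coords, H, W):
    """Sketch of step (1f).
    feat_*           : (C, P, N) learned point features of the non-empty pillars of one plane
    xy_pillar_coords : (P, 2) integer (row, col) of each x-y pillar on the H x W grid."""
    fused = torch.cat([feat_xy, feat_xz, feat_yz], dim=0)     # stack -> (3C, P, N)
    pooled = fused.max(dim=2).values                          # max pool over points -> (3C, P)
    pseudo_image = torch.zeros(pooled.shape[0], H, W)         # sparse pseudo image (3C, H, W)
    rows, cols = xy_pillar_coords[:, 0], xy_pillar_coords[:, 1]
    pseudo_image[:, rows, cols] = pooled                      # scatter the non-empty pillars back
    return pseudo_image
```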
(2) Inputting the sparse pseudo image into the Backbone network for feature extraction to obtain a convolutional feature map of the sparse pseudo image. The Backbone network is the original Backbone network of the PointPillars algorithm (shown in FIG. 2), a network structure known to those skilled in the art.
In step (2), the Backbone network is used to extract features from the sparse pseudo image; a convolution kernel traverses the whole sparse pseudo image from left to right and from top to bottom, and the dimensions of the feature map output after each convolutional layer applied to the input sparse pseudo image are:
W_2 = (W_1 - F + 2P) / S + 1    (I)
H_2 = (H_1 - F + 2P) / S + 1    (II)
D_2 = K    (III)
where W_1, H_1, D_1 are the width, height and depth of the feature map input to the convolutional layer; W_2, H_2, D_2 are the width, height and depth of the output feature map after convolution; K is the number of convolution kernels; F is the convolution kernel size of the layer; P is the amount of zero padding of the layer's input feature map; and S is the stride.
(3) Inputting the convolutional feature map of the sparse pseudo image into the Detection Head network to obtain the predicted bounding box coordinates of the detection target in the point cloud data and the probability that a target exists in the predicted bounding box. The Detection Head network is the original Detection Head network of the PointPillars algorithm, a network structure known to those skilled in the art.
The specific operation of the step (3) is as follows:
(3a) Inputting the convolutional feature map of the sparse pseudo image into the Detection Head network, and finding the center coordinates of each positional feature of the feature map on the x-y sampling plane according to the receptive-field mapping relationship. Setting 3D preset boxes at the center coordinates in the x-y sampling plane, with two 3D preset boxes of different angles, 0 degrees and 90 degrees, at each center coordinate, each the same size as the average size of the labelled detection target bounding boxes in the training samples. Then projecting the 3D preset boxes and the labelled detection target bounding boxes onto the x-y plane, computing their IoU, comparing the computed IoU with a set threshold, and screening 3D candidate boxes out of the 3D preset boxes; a 3D preset box whose IoU is greater than the set threshold is a 3D candidate box.
(3b) Performing box regression on the 3D candidate boxes screened in step (3a) to obtain the coordinate correction offsets (d_x, d_y, d_z, d_w, d_h, d_l, d_θ) of each 3D candidate box, calculating the position coordinates (R_x, R_y, R_z, R_w, R_h, R_l, R_θ) of the predicted bounding box of the detection target according to formulas (IV) to (X) from the initial position coordinates of the 3D candidate box and the coordinate correction offsets obtained by box regression, outputting them, and simultaneously outputting the probability that a detection target exists in the predicted bounding box.
R_x = G_x × d_x + G_x    (Ⅳ)
R_y = G_y × d_y + G_y    (Ⅴ)
R_z = G_z × d_z + G_z    (Ⅵ)
R_w = G_w × e^(d_w)    (Ⅶ)
R_h = G_h × e^(d_h)    (Ⅷ)
R_l = G_l × e^(d_l)    (Ⅸ)
R_θ = G_θ × d_θ + G_θ    (Ⅹ)
where G_x is the abscissa of the 3D candidate box center position, G_y is the ordinate of the 3D candidate box center position, G_z is the z-coordinate of the 3D candidate box center position, G_w is the width of the 3D candidate box, G_h is the height of the 3D candidate box, G_l is the length of the 3D candidate box, and G_θ is the angle of the 3D candidate box; d_x is the offset of the abscissa of the 3D candidate box center position, d_y is the offset of the ordinate of the 3D candidate box center position, and d_z is the offset of the z-coordinate of the 3D candidate box center position; d_w is the offset of the 3D candidate box width, d_h is the offset of the 3D candidate box height, d_l is the offset of the 3D candidate box length, and d_θ is the offset of the 3D candidate box angle; R_x is the abscissa of the predicted bounding box center position, R_y is the ordinate of the predicted bounding box center position, R_z is the z-coordinate of the predicted bounding box center position, R_w is the width of the predicted bounding box, R_h is the height of the predicted bounding box, R_l is the length of the predicted bounding box, and R_θ is the angle of the predicted bounding box.
(4) Taking the predicted bounding box coordinates obtained in step (3) as the predicted result and the bounding box coordinates labelled in the training sample as the true result, constructing a loss function from the predicted and true results, adopting a cross-entropy loss function (well known in the field) as the loss function, optimizing the network parameters of the multi-plane coding point cloud feature deep learning model by stochastic gradient descent to reduce the value of the loss function, and iterating this process to optimize the network parameters until the loss function stops decreasing; the training process of the multi-plane coding point cloud feature deep learning model then ends, yielding the trained multi-plane coding point cloud feature deep learning model.
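A hedged sketch of the training loop of step (4) is given below; the model interface, the smooth-L1 box-regression term and the hyperparameters are assumptions, and only the cross-entropy classification loss and stochastic gradient descent are taken from the embodiment.

```python
import torch
from torch import nn, optim

def train(model, loader, epochs=80, lr=2e-4):
    """Sketch of step (4): iterate stochastic gradient descent on a loss built
    from the predicted boxes and the labelled boxes. `model(points)` is assumed
    to return per-anchor regressed offsets and class logits during training."""
    cls_loss_fn = nn.CrossEntropyLoss()   # classification / objectness term (from the embodiment)
    reg_loss_fn = nn.SmoothL1Loss()       # box-offset term (a common choice, an assumption here)
    opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        for points, gt_offsets, gt_labels in loader:
            box_offsets, logits = model(points)   # logits: (n_anchors, n_classes), gt_labels: (n_anchors,)
            loss = reg_loss_fn(box_offsets, gt_offsets) + cls_loss_fn(logits, gt_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()                    # repeat until the loss stops decreasing
```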
Example 2:
A method for detecting targets in point cloud data using the PointPillars-based multi-plane coding point cloud feature deep learning model constructed in Embodiment 1: the collected point cloud data is input into the multi-plane coding point cloud feature deep learning model for calculation, and the model finally outputs the bounding box coordinates of the detection targets in the point cloud data and the probability that a detection target exists within those bounding box coordinates.
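A minimal usage sketch of this detection method, assuming the model interface used in the training sketch of Embodiment 1 and an illustrative probability threshold:

```python
import torch

def detect(model, point_cloud, score_threshold=0.5):
    """Run the trained model on one collected point cloud and return the boxes
    whose existence probability exceeds the threshold (the threshold is an assumption)."""
    model.eval()
    with torch.no_grad():
        boxes, scores = model(point_cloud)   # (M, 7) box coordinates, (M,) probabilities
    keep = scores > score_threshold
    return boxes[keep], scores[keep]
```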
The above description covers only preferred embodiments of the present invention and is not intended to limit it; all modifications, equivalents and improvements falling within the spirit and scope of the present invention are intended to be covered.

Claims (7)

1. A construction method of a multi-plane coding point cloud feature deep learning model based on PointPillars, characterized by comprising the following steps: acquiring a training sample, wherein the training sample comprises point cloud data containing a detection target and annotation information corresponding to the point cloud data, and the annotation information indicates the bounding box coordinates of the detection target in the point cloud data and the classification label of the detection target within the bounding box coordinates; and training the multi-plane coding point cloud feature deep learning model with the training sample, so that the recognition result obtained by inputting the point cloud data of the training sample into the trained multi-plane coding point cloud feature deep learning model is the bounding box coordinates of the detection target in the point cloud data and the probability that a target exists within the bounding box coordinates;
the multi-plane coding point cloud feature deep learning model is an improvement of the PointPillars algorithm; specifically, a multi-plane fusion feature encoding network replaces the feature encoder network of the PointPillars algorithm, and the multi-plane coding point cloud feature deep learning model consists of the multi-plane fusion feature encoding network, a Backbone network and a Detection Head network;
the specific steps of training the multi-plane coding point cloud feature deep learning model with the training sample are as follows:
(1) Inputting a training sample into the multi-plane fusion feature encoding network, which fusion-encodes the features of the x-y, x-z and y-z planes of the point cloud data in the training sample to obtain the fused point cloud features of the x-y plane and converts them into a sparse pseudo image;
(2) Inputting the sparse pseudo image into the Backbone network for feature extraction to obtain a convolutional feature map of the sparse pseudo image;
(3) Inputting the convolutional feature map of the sparse pseudo image into the Detection Head network to obtain the predicted bounding box coordinates of the detection target in the point cloud data and the probability that a target exists in the predicted bounding box;
(4) Taking the predicted bounding box coordinates obtained in step (3) as the predicted result and the bounding box coordinates labelled in the training sample as the true result, constructing a loss function from the predicted and true results, adopting a cross-entropy loss function as the loss function, optimizing the network parameters of the multi-plane coding point cloud feature deep learning model by stochastic gradient descent to reduce the value of the loss function, and iterating this process to optimize the network parameters until the loss function stops decreasing; the training process then ends, yielding the trained multi-plane coding point cloud feature deep learning model;
the specific operation of the step (1) is as follows:
(1a) Dispersing the point cloud data in the training sample onto a grid of uniform spacing on the x-y plane, with no limit in the z direction, thereby creating a series of pillars on the x-y plane; the point cloud contained in each pillar is expanded with the features r, x_c, y_c, z_c, x_p, y_p to obtain the expanded point cloud feature (x, y, z, r, x_c, y_c, z_c, x_p, y_p), whose dimension is D = 9; where x, y, z are the initial coordinate values of the point; r is the point reflectivity; x_c, y_c, z_c are the coordinate values obtained as the arithmetic mean of the coordinates of all points in the pillar; and x_p, y_p are the deviations of each point in the pillar from the coordinate center position in the coordinate system of the current plane;
(1b) On the x-y plane, adjusting the number of points contained in all non-empty pillars to be consistent, and then creating a dense tensor (D, P, N) from the number of non-empty pillars on the plane, the number of points contained in each non-empty pillar, and the features of the points in the non-empty pillars, obtaining the features (D, P, N) of each non-empty pillar on the x-y plane, where D is the feature dimension of the points in a non-empty pillar, P is the number of non-empty pillars on the x-y plane, and N is the number of points contained in a non-empty pillar;
(1c) Performing feature learning on the features (D, P, N) of the non-empty pillars on the x-y plane with a PointNet network, obtaining the final features (C, P, N) of the point cloud in each non-empty pillar on the x-y plane; C is the new feature dimension obtained after the points are learned by the PointNet network, P is the number of non-empty pillars on the x-y plane, and N is the number of points contained in a non-empty pillar;
(1d) Dispersing the point cloud data in the training sample onto a grid of uniform spacing on the x-z plane, with no limit in the y direction, thereby creating a series of pillars on the x-z plane; the point cloud contained in each pillar is expanded with the features r, x_c, y_c, z_c, x_p, z_p to obtain the expanded point cloud feature (x, y, z, r, x_c, y_c, z_c, x_p, z_p); where x, y, z are the initial coordinate values of the point; r is the point reflectivity; x_c, y_c, z_c are the coordinate values obtained as the arithmetic mean of the coordinates of all points in the pillar; and x_p, z_p are the deviations of each point in the pillar from the coordinate center position in the coordinate system of the current plane; then, following the operations of steps (1b) to (1c), obtaining the final point cloud features (C, P, N) of each non-empty pillar on the x-z plane;
(1e) Dispersing the point cloud data in the training sample onto a grid of uniform spacing on the y-z plane, with no limit in the x direction, thereby creating a series of pillars on the y-z plane; the point cloud contained in each pillar is expanded with the features r, x_c, y_c, z_c, y_p, z_p to obtain the expanded point cloud feature (x, y, z, r, x_c, y_c, z_c, y_p, z_p); where x, y, z are the initial coordinate values of the point; r is the point reflectivity; x_c, y_c, z_c are the coordinate values obtained as the arithmetic mean of the coordinates of all points in the pillar; and y_p, z_p are the deviations of each point in the pillar from the coordinate center position in the coordinate system of the current plane; then, following the operations of steps (1b) to (1c), obtaining the final point cloud features (C, P, N) of each non-empty pillar on the y-z plane;
(1f) Stacking the final point cloud features of the non-empty pillars of the x-z plane and the y-z plane with the final point cloud features of the non-empty pillars of the x-y plane to obtain the fused point cloud features (3C, P, N) of the x-y plane, processing the fused features (3C, P, N) with a max pooling operation to obtain a tensor (3C, P), and then creating a sparse pseudo image (3C, H, W) from the tensor (3C, P), where H is the height of the sparse pseudo image and W is its width.
2. The method of claim 1, wherein the x-z plane and the y-z plane are the same size as the x-y plane.
3. The construction method according to claim 1, wherein in step (2) the Backbone network is used to extract features from the sparse pseudo image; a convolution kernel traverses the whole sparse pseudo image from left to right and from top to bottom, and the dimensions of the feature map output after each convolutional layer applied to the input sparse pseudo image are:
W_2 = (W_1 - F + 2P) / S + 1    (I)
H_2 = (H_1 - F + 2P) / S + 1    (II)
D_2 = K    (III)
where W_1, H_1, D_1 are the width, height and depth of the feature map input to the convolutional layer; W_2, H_2, D_2 are the width, height and depth of the output feature map after convolution; K is the number of convolution kernels; F is the convolution kernel size of the layer; P is the amount of zero padding of the layer's input feature map; and S is the stride.
4. The construction method according to any one of claims 1 to 3, wherein the specific operation of step (3) is:
(3a) Inputting the convolutional feature map of the sparse pseudo image into the Detection Head network, and finding the center coordinates of each positional feature of the feature map on the x-y sampling plane according to the receptive-field mapping relationship; setting 3D preset boxes at the center coordinates in the x-y sampling plane, with two 3D preset boxes of different angles at each center coordinate, each the same size as the average size of the labelled detection target bounding boxes in the training samples; then projecting the 3D preset boxes and the labelled detection target bounding boxes onto the x-y plane and computing their IoU, comparing the computed IoU with a set threshold, and screening 3D candidate boxes out of the 3D preset boxes, where a 3D preset box whose IoU is greater than the set threshold is a 3D candidate box;
(3b) Performing box regression on the 3D candidate boxes screened in step (3a) to obtain the coordinate correction offsets of each 3D candidate box, calculating the position coordinates of the predicted bounding box of the detection target from the initial position coordinates of the 3D candidate box and the coordinate correction offsets obtained by box regression, outputting the position coordinates of the predicted bounding box of the detection target, and simultaneously outputting the probability that a detection target exists in the predicted bounding box.
5. The construction method according to claim 4, wherein the angles of the two 3D preset boxes set at each center coordinate in step (3a) are 0 degrees and 90 degrees, respectively.
6. The construction method according to claim 5, wherein the specific calculation process of the position coordinates of the predicted bounding box of the detection target in step (3b) is as follows: calculating the position coordinates of the predicted bounding box according to formulas (IV) to (X) from the initial position coordinates of the 3D candidate box and the coordinate correction offsets of the candidate box output by the box regression network;
R_x = G_x × d_x + G_x    (Ⅳ)
R_y = G_y × d_y + G_y    (Ⅴ)
R_z = G_z × d_z + G_z    (Ⅵ)
R_w = G_w × e^(d_w)    (Ⅶ)
R_h = G_h × e^(d_h)    (Ⅷ)
R_l = G_l × e^(d_l)    (Ⅸ)
R_θ = G_θ × d_θ + G_θ    (Ⅹ)
where G_x is the abscissa of the 3D candidate box center position, G_y is the ordinate of the 3D candidate box center position, G_z is the z-coordinate of the 3D candidate box center position, G_w is the width of the 3D candidate box, G_h is the height of the 3D candidate box, G_l is the length of the 3D candidate box, and G_θ is the angle of the 3D candidate box; d_x is the offset of the abscissa of the 3D candidate box center position, d_y is the offset of the ordinate of the 3D candidate box center position, and d_z is the offset of the z-coordinate of the 3D candidate box center position; d_w is the offset of the 3D candidate box width, d_h is the offset of the 3D candidate box height, d_l is the offset of the 3D candidate box length, and d_θ is the offset of the 3D candidate box angle; R_x is the abscissa of the predicted bounding box center position, R_y is the ordinate of the predicted bounding box center position, R_z is the z-coordinate of the predicted bounding box center position, R_w is the width of the predicted bounding box, R_h is the height of the predicted bounding box, R_l is the length of the predicted bounding box, and R_θ is the angle of the predicted bounding box.
7. A method for detecting a target in point cloud data by using the pointpilars-based multi-plane coding point cloud feature deep learning model constructed by the construction method according to any one of claims 1 to 6.
CN202010425656.3A 2020-05-19 2020-05-19 Construction method of multi-plane coding point cloud feature deep learning model based on pointpilars Active CN111612059B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010425656.3A CN111612059B (en) 2020-05-19 2020-05-19 Construction method of multi-plane coding point cloud feature deep learning model based on pointpilars

Publications (2)

Publication Number Publication Date
CN111612059A CN111612059A (en) 2020-09-01
CN111612059B (en) 2022-10-21

Family

ID=72204944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010425656.3A Active CN111612059B (en) 2020-05-19 2020-05-19 Construction method of multi-plane coding point cloud feature deep learning model based on pointpilars

Country Status (1)

Country Link
CN (1) CN111612059B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418084B (en) * 2020-11-23 2022-12-16 同济大学 Three-dimensional target detection method based on point cloud time sequence information fusion
CN112613378B (en) * 2020-12-17 2023-03-28 上海交通大学 3D target detection method, system, medium and terminal
CN112668469A (en) * 2020-12-28 2021-04-16 西安电子科技大学 Multi-target detection and identification method based on deep learning
CN112883789A (en) * 2021-01-15 2021-06-01 福建电子口岸股份有限公司 Bowling prevention method and system based on laser vision fusion and deep learning
CN113111974B (en) 2021-05-10 2021-12-14 清华大学 Vision-laser radar fusion method and system based on depth canonical correlation analysis
CN114397877A (en) * 2021-06-25 2022-04-26 南京交通职业技术学院 Intelligent automobile automatic driving system
CN113421305B (en) * 2021-06-29 2023-06-02 上海高德威智能交通系统有限公司 Target detection method, device, system, electronic equipment and storage medium
CN113903029B (en) * 2021-12-10 2022-03-22 智道网联科技(北京)有限公司 Method and device for marking 3D frame in point cloud data
CN114005110B (en) * 2021-12-30 2022-05-17 智道网联科技(北京)有限公司 3D detection model training method and device, and 3D detection method and device
WO2024007268A1 * 2022-07-07 2024-01-11 Oppo广东移动通信有限公司 Point cloud encoding method, point cloud decoding method, codec, and computer storage medium
CN115131619B (en) * 2022-08-26 2022-11-22 北京江河惠远科技有限公司 Extra-high voltage part sorting method and system based on point cloud and image fusion
CN115147834B (en) * 2022-09-06 2023-05-05 南京航空航天大学 Point cloud-based plane feature extraction method, device and equipment for airplane stringer
CN116863433B (en) * 2023-09-04 2024-01-09 深圳大学 Target detection method based on point cloud sampling and weighted fusion and related equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10872269B2 (en) * 2018-10-26 2020-12-22 Volvo Car Corporation Methods and systems for the fast estimation of three-dimensional bounding boxes and drivable surfaces using LIDAR point clouds

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171217A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of three-dimension object detection method based on converged network
CN110060288A (en) * 2019-03-15 2019-07-26 华为技术有限公司 Generation method, device and the storage medium of point cloud characteristic pattern
CN110111328A (en) * 2019-05-16 2019-08-09 上海中认尚科新能源技术有限公司 A kind of blade crack of wind driven generator detection method based on convolutional neural networks

Also Published As

Publication number Publication date
CN111612059A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN111612059B (en) Construction method of multi-plane coding point cloud feature deep learning model based on pointpilars
CN111798475B (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
CN110956651B (en) Terrain semantic perception method based on fusion of vision and vibrotactile sense
Benenson et al. Stixels estimation without depth map computation
CN110175576A (en) A kind of driving vehicle visible detection method of combination laser point cloud data
CN110097553A (en) The semanteme for building figure and three-dimensional semantic segmentation based on instant positioning builds drawing system
CN108389256B (en) Two-three-dimensional interactive unmanned aerial vehicle electric power tower inspection auxiliary method
CN110335337A (en) A method of based on the end-to-end semi-supervised visual odometry for generating confrontation network
CN112785643A (en) Indoor wall corner two-dimensional semantic map construction method based on robot platform
CN111880191B (en) Map generation method based on multi-agent laser radar and visual information fusion
CN113139453A (en) Orthoimage high-rise building base vector extraction method based on deep learning
CN113516664A (en) Visual SLAM method based on semantic segmentation dynamic points
Zelener et al. Cnn-based object segmentation in urban lidar with missing points
CN111402632B (en) Risk prediction method for pedestrian movement track at intersection
CN114049572A (en) Detection method for identifying small target
CN114943870A (en) Training method and device of line feature extraction model and point cloud matching method and device
Xia et al. Gesture tracking and recognition algorithm for dynamic human motion using multimodal deep learning
Nagy et al. 3D CNN based phantom object removing from mobile laser scanning data
CN114463713A (en) Information detection method and device of vehicle in 3D space and electronic equipment
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
Wang et al. DRR-LIO: A dynamic-region-removal-based LiDAR inertial odometry in dynamic environments
CN116664851A (en) Automatic driving data extraction method based on artificial intelligence
CN116573017A (en) Urban rail train running clearance foreign matter sensing method, system, device and medium
Lai et al. 3D semantic map construction system based on visual SLAM and CNNs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant