CN112766100A - 3D target detection method based on key points - Google Patents

3D target detection method based on key points

Info

Publication number
CN112766100A
Authority
CN
China
Prior art keywords
target
point
key points
point cloud
information
Prior art date
Legal status
Pending
Application number
CN202110017625.9A
Other languages
Chinese (zh)
Inventor
张益新
Current Assignee
Shanghai Xuehu Technology Co ltd
Original Assignee
Shanghai Xuehu Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xuehu Technology Co ltd filed Critical Shanghai Xuehu Technology Co ltd
Priority to CN202110017625.9A
Publication of CN112766100A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/64 - Three-dimensional objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Abstract

The invention relates to the technical field of laser radar data processing and target identification, and in particular to a 3D target detection method based on key points. The method comprises a first stage, in which a model training process is carried out to generate a model, and a second stage, in which the model generated by the training process is used to perform a prediction process and generate 3D target information, thereby completing 3D target detection. The invention constructs a key-point-based 3D detection network structure for LiDAR point clouds, i.e. an anchor-free 3D detection method based on key points. First, a branch is provided for regressing the key points of a 3D target; then, an auxiliary training module for the connection relation between the key points is provided. By regressing the connection relation between points belonging to the same 3D target, the module achieves accurate localization of the bounding box, and it allows the 3D detection method to obtain better performance without additional cost.

Description

3D target detection method based on key points
Technical Field
The invention relates to the technical field of laser radar data processing and target identification, and in particular to a 3D target detection method based on key points.
Background
Vision-based target detection is one of the important research directions in the fields of point cloud data processing and computer vision, and can be applied to the detection of vehicles, pedestrians, traffic signs and the like in automatic driving systems, to abnormal event analysis in video monitoring, to service robots and to other fields. In recent years, with the development of deep neural networks, research on point cloud data classification, target detection and semantic segmentation has advanced remarkably. In the field of target detection in particular, two-stage network frameworks represented by R-CNN, Fast R-CNN and Mask R-CNN and one-stage network frameworks represented by YOLO and SSD have appeared. With either type of framework, the accuracy and real-time performance of deep-learning-based two-dimensional target detection are greatly improved compared with conventional feature-based machine learning methods. However, since two-dimensional target detection only regresses the pixel coordinates of the target and lacks physical-world parameters such as depth and size, it has certain limitations in practical applications; in particular, for the perception of autonomous vehicles and service robots, it is often necessary to implement a multi-modal fusion algorithm in combination with sensors such as laser radar and millimeter-wave radar to enhance the reliability of the perception system.
Therefore, researchers have proposed three-dimensional target detection methods, which aim to obtain geometric information such as the position, size and posture of a target in three-dimensional space. Existing three-dimensional target detection algorithms can be roughly divided into three types according to the sensor used: vision, laser point cloud, and multi-modal fusion. Vision methods are widely used in the field of target detection owing to their low cost and rich texture features, and can be divided into monocular vision and binocular/depth vision according to the type of camera. The key problem of the former is that depth information cannot be acquired directly, so the positioning error of the target in three-dimensional space is large; the latter not only provides abundant texture information but also more accurate depth information, and at present achieves higher detection precision than the former.
For a vehicle-mounted intelligent computing platform that supports an automatic driving function, the laser radar is an important device for sensing the environment around the vehicle. The laser radar is also an important means by which vehicles and robots perceive their surroundings, and comprises a laser transmitting system, a laser receiving system and a rotating assembly. The transmitting system generally consists of a single-beam multi-line narrow-band laser that emits laser pulses at a certain frequency; a pulse that hits the surface of an object within the attenuation distance is reflected back and finally received by the receiving system. The rotating assembly rotates continuously so that the single-beam multi-line laser pulses acquire 360-degree information about the surrounding environment; the transmitter can emit millions of pulses per second, and the receiver receives the laser points reflected by these pulses within the corresponding time, so a large number of laser points form a point cloud feature map that outlines the surrounding environment. Any single point is represented as pi = (xi, yi, zi, ri), where xi, yi and zi are the spatial coordinate values along the X, Y and Z axes respectively, and ri is the reflection intensity. Described by such a large set of coordinates, the point cloud data can be fed to different perception methods to achieve 3D perception of the surrounding environment.
According to the working characteristics of the laser radar, the laser pulse travels along a straight line; since the speed of light is known, the straight-line distance between the object surface and the emitting point can be obtained from the time difference between emission and reception. Meanwhile, if the center of the laser radar is taken as the origin of the coordinate system, accurate X, Y, Z relative coordinates of the laser reflection point can be obtained, so that accurate spatial information about the surrounding environment can be restored. However, owing to the special sensing characteristics of the laser radar, the point cloud data it generates are sparse, unordered and noisy; the sparseness appears in two aspects:
on one hand, because the number of laser pulses generated by the laser radar per unit time is limited, the surrounding obstacles are sampled discretely, and this discrete sampling makes the local point cloud formed on an object surface sparse;
on the other hand, the obstacles in a scene are sparse compared with the whole space. In addition, unlike the dense, ordered data generated by a sensor such as a camera, point cloud data carry no inherent ordering of points when describing spatial characteristics.
Moreover, when the laser radar receives laser pulses, a small number of noisy reflections occur, so the resulting point cloud contains isolated, erroneous noise points. It is therefore very important to design a suitable perception algorithm for 3D target detection.
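As a simple illustration of the time-of-flight geometry described above, the following sketch converts one emitted/received pulse pair into a point record pi = (xi, yi, zi, ri); the helper name and the spherical-to-Cartesian convention are illustrative assumptions, not part of the patented method:

```python
# Minimal sketch (assumed helper, not from the patent): recovering the range
# and one point record p_i = (x_i, y_i, z_i, r_i) from a single lidar return.
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def lidar_return_to_point(t_emit, t_receive, azimuth, elevation, reflectance):
    """Convert one time-of-flight measurement into a Cartesian point.

    The pulse travels to the surface and back, so the one-way range is
    c * (t_receive - t_emit) / 2, measured from the sensor origin.
    """
    rng = C * (t_receive - t_emit) / 2.0
    x = rng * np.cos(elevation) * np.cos(azimuth)
    y = rng * np.cos(elevation) * np.sin(azimuth)
    z = rng * np.sin(elevation)
    return np.array([x, y, z, reflectance], dtype=np.float32)
```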
Currently, there are three types of point cloud representations as 3D detector inputs:
1) a point-based representation. The original point cloud is directly processed, and a bounding box is predicted according to each point.
2) A voxel-based representation. The raw point cloud is converted into a compact representation using 2D/3D voxelization.
3) A mixed point-and-voxel representation. In these methods, both points and voxels are used as input, and their features are fused at different stages of the network for bounding box prediction. Different methods consume different types of point cloud representation.
Existing one-stage 3D object detection methods can achieve real-time performance; however, they are dominated by anchor-based detectors, which are inefficient and require additional post-processing. In this context, we eliminate anchors and model an object as a single point: the center point of its bounding box. Based on the center point, we propose an anchor-free 3D detection network that performs 3D object detection without anchor points; our 3D detector uses keypoint estimation to find the center point and directly regresses the 3D bounding box. However, because of the inherent sparsity of the point cloud, the center point of a 3D object is likely to lie in an empty region, which makes it difficult to estimate an accurate boundary. To solve this problem, we propose an auxiliary connection-relation regression module that forces the CNN backbone to pay more attention to the connection relation between key points, so that a more accurate bounding box can be obtained. In addition, our 3D detector does not rely on non-maximum suppression, which makes it more effective and simpler.
Disclosure of Invention
In view of the above technical problems, the present invention provides a 3D object detection method based on key points.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A 3D target detection method based on key points is characterized by comprising the following steps:
the first stage is as follows: carrying out a model training process to generate a model;
and a second stage: and performing a prediction process by using a model generated by the training process to generate 3D target information, thereby completing the 3D target detection.
The above 3D target detection method based on the keypoints is characterized in that, in the first stage, the model is a convolutional neural network model.
The above 3D target detection method based on key points is characterized in that the second stage specifically includes:
step 1: acquiring point cloud data to be detected, and processing the data;
step 2: and detecting the point cloud data through the trained convolutional neural network model to obtain a detection result.
The above 3D target detection method based on key points is characterized in that, in step 2, the detection result includes the key point position information and the key point connection relation information of the 3D target.
The above 3D target detection method based on key points is characterized in that the key points comprise the start point, the middle point and the end point of a diagonal of the 3D target.
The technical scheme has the following advantages or beneficial effects:
the invention provides a 3D target detection method based on key points, namely a key-point-based 3D detection network structure constructed for LiDAR point clouds and an anchor-free, key-point-based 3D detection method. First, a branch is provided for regressing the key points of a 3D target; then, an auxiliary training module for the connection relation between the key points is provided. By regressing the connection relation between points belonging to the same 3D target, this module achieves accurate localization of the bounding box, and it allows the 3D detection method to obtain better performance without additional cost.
Drawings
The invention and its features, aspects and advantages will become more apparent from reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings. Like reference symbols in the various drawings indicate like elements. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
FIG. 1 is a basic flow diagram of a keypoint-based 3D object detection method;
FIG. 2 is a network architecture diagram of a keypoint-based 3D object detection method;
fig. 3 is a vector diagram of a keypoint-based 3D object detection method.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1 and 2, the present invention discloses a 3D target detection method based on key points, and the specific scheme is as follows:
the first stage is as follows: carrying out a model training process to generate a model; such as convolutional neural network models.
And a second stage: and performing a prediction process by using a model generated by the training process to generate 3D target information, thereby completing the 3D target detection. Specifically, step 1: acquiring point cloud data to be detected, and processing the data;
step 2: the point cloud data are detected by the trained convolutional neural network model to obtain a detection result. The detection result comprises the key point position information and the key point connection relation information of the 3D target, and the key points comprise the start point, the middle point and the end point of a diagonal of the 3D target.
In the embodiment of the invention, the point cloud data are detected by the trained convolutional neural network model to obtain a detection result, so that the specific features characterizing the point cloud data can be extracted effectively, which makes the method suitable for 3D target recognition in a variety of complex scenes. The point cloud data to be detected are then obtained according to the detection result; the recognition process runs stably, the detection rate and recognition accuracy of the 3D target are effectively improved, and a good 3D target recognition effect is obtained. The flow shown in FIG. 1 is described in detail below:
step 101, point cloud data containing to-be-detected points are obtained and data processing is carried out.
Specifically, the obtained point cloud data is processed in the space where the 3D target to be detected is located.
Step 102: the point cloud data are detected by the trained convolutional neural network model to obtain a detection result. Specifically, the detection result is the key point position information and key point connection relation information of the 3D target to be detected, wherein the key points at least comprise the start point, the middle point and the end point of a diagonal of the 3D target.
Since the key points are important for describing the point cloud data, the specific characteristics influencing the point cloud data can be effectively extracted by extracting the key point information of the point cloud data through the trained convolutional neural network model, and the method is suitable for 3D target identification under various specific complex scenes.
Step 103: according to the detection result, at inference time the center predictions in the center heatmap of each class are extracted independently; a location is kept only if its value is greater than or equal to all of its 8-connected neighbours, and only the center predictions whose values exceed a predefined threshold are retained as detected center points; redundant key points are then removed to obtain the point cloud data to be detected. Specifically, the point cloud data corresponding to the detection result are found according to the preset correspondence between key point information and point cloud data, where the key point information comprises the key point position information and the key point connection relation information. Because this correspondence is preset, the corresponding point cloud data can be found directly from the detection result; identifying the 3D target in this way makes the recognition process stable, effectively improves the detection rate and recognition accuracy of the 3D target, and yields a good 3D target recognition effect.
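The peak-extraction rule described above (keep a location only when its score is not smaller than any of its 8-connected neighbours and exceeds a threshold) can be written compactly with a 3 x 3 max pooling. This is a minimal sketch; the threshold value is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def extract_center_points(center_heatmap, score_thresh=0.3):
    """center_heatmap: (C, H, W) per-class center scores in [0, 1].

    A location is kept as a detected center when its score equals the maximum
    of its 3 x 3 neighbourhood (i.e. it is >= all 8-connected neighbours) and
    it exceeds the predefined threshold.
    """
    heat = center_heatmap.unsqueeze(0)                    # (1, C, H, W)
    local_max = F.max_pool2d(heat, kernel_size=3, stride=1, padding=1)
    peaks = (heat == local_max) & (heat >= score_thresh)
    cls, ys, xs = torch.nonzero(peaks[0], as_tuple=True)  # class, row, column
    scores = heat[0, cls, ys, xs]
    return cls, ys, xs, scores
```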
In the embodiment of the invention, point cloud data to be detected is acquired and processed, specifically as follows:
in this embodiment, the point cloud data to be detected are obtained, and a known projection matrix is used to project the region of each category in each point cloud into the LiDAR point cloud space, so that the corresponding region in the LiDAR point cloud space carries a category attribute consistent with that of the original region. Then the points related to vehicles, pedestrians and cyclists are screened out of the original point cloud to form viewing cones (frustums), which pass through three steps: Reduce, Encode and VFE;
Reduce: using the original radar data, the picture size information and the sensor calibration information, the viewing cone projected into the image range within 0.001-100 m is obtained;
A vertex is constructed in the pixel coordinate system according to the image size, transformed to the camera-2 coordinate system using the matrix C, then transformed to the rectified camera coordinate system using R and T, and finally transformed to the radar coordinate system using (R0_rect * Tr_velo_to_cam)_inv, where z takes the two frustum vertex depths 0.001 m and 100 m. The original point cloud is then filtered by an index indicating whether each point lies inside the viewing cone, and only the points inside the viewing cone are kept.
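The point filtering at the end of the Reduce step can be sketched as follows; the calibration helper `project_velo_to_image` and its interface are assumptions standing in for the matrix chain described above, not an API defined by the patent:

```python
import numpy as np

def reduce_to_frustum(points_velo, calib, img_h, img_w, z_near=0.001, z_far=100.0):
    """Keep only the lidar points whose image projection falls inside the
    picture and whose depth lies in [z_near, z_far] metres.

    `calib.project_velo_to_image` is an assumed helper that applies the
    camera-intrinsic / R, T / R0_rect * Tr_velo_to_cam chain of the sensor
    calibration and returns pixel coordinates plus camera depth.
    """
    pts_img, depth = calib.project_velo_to_image(points_velo[:, :3])
    in_view = (
        (pts_img[:, 0] >= 0) & (pts_img[:, 0] < img_w) &
        (pts_img[:, 1] >= 0) & (pts_img[:, 1] < img_h) &
        (depth >= z_near) & (depth <= z_far)
    )
    return points_velo[in_view]
```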
Encode: a K x T x 4 tensor (named voxels), initialized to all zeros, is created to store the input features of all voxels. For each point in the point cloud (e.g. a point with coordinates (x1, y1, z1, r1)), we check whether the corresponding voxel already exists; the operation ends after all points have been traversed, or when voxel_idx reaches K-1 and the point count num of every voxel grid reaches T;
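A minimal sketch of this grouping step is given below; the grid-origin and voxel-size arguments, and the policy of silently dropping points once the K-voxel or T-point budgets are full, are assumptions used only for illustration:

```python
import numpy as np

def encode_voxels(points, voxel_size, grid_min, K, T):
    """Group raw points (N, 4) into at most K voxels of at most T points each,
    returning the K x T x 4 'voxels' tensor, the integer voxel coordinates and
    the per-voxel point counts (num_points_per_voxel)."""
    voxels = np.zeros((K, T, 4), dtype=np.float32)
    num_points_per_voxel = np.zeros(K, dtype=np.int32)
    coors = np.zeros((K, 3), dtype=np.int32)
    coor_to_idx = {}

    for p in points:                                      # p = (x, y, z, r)
        c = tuple(((p[:3] - grid_min) // voxel_size).astype(np.int32))
        idx = coor_to_idx.get(c)
        if idx is None:                                   # a new voxel grid
            if len(coor_to_idx) == K:                     # voxel budget reached
                continue
            idx = len(coor_to_idx)
            coor_to_idx[c] = idx
            coors[idx] = c
        n = num_points_per_voxel[idx]
        if n < T:                                         # per-voxel point budget
            voxels[idx, n] = p
            num_points_per_voxel[idx] = n + 1

    return voxels, coors, num_points_per_voxel
```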
VFE: Step 11, based on voxels and num_points_per_voxel, the average of all actual points in each valid voxel grid is computed in every feature dimension, giving a K x 4 tensor in which each row stores (x_bar, y_bar, z_bar, r_bar), the feature information of one voxel grid;
Step 12, for each row of the K x 4 tensor obtained in step 11, (x_bar^2 + y_bar^2)^(1/2) is computed;
Step 13, feature stitching: the (x_bar^2 + y_bar^2)^(1/2) computed in step 12 and the z_bar, r_bar computed in step 11 are concatenated into the final feature expression voxels (dimension K x 3, each row being ((x_bar^2 + y_bar^2)^(1/2), z_bar, r_bar)), which is the feature of a valid voxel grid;
Step 14, the voxels obtained in step 13 and the coors obtained in step 3 are returned.
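Steps 11 to 13 amount to a per-voxel average followed by a change of representation; a minimal sketch is shown below, assuming the voxels tensor is zero-padded as produced by the Encode step:

```python
import numpy as np

def vfe_mean_features(voxels, num_points_per_voxel):
    """voxels: (K, T, 4) zero-padded point groups; returns the K x 3 voxel
    features ((x_bar^2 + y_bar^2)^(1/2), z_bar, r_bar) of steps 11-13."""
    counts = np.maximum(num_points_per_voxel, 1).reshape(-1, 1)
    mean = voxels.sum(axis=1) / counts                       # (K, 4): x, y, z, r averages
    rho = np.sqrt(mean[:, 0] ** 2 + mean[:, 1] ** 2)         # planar distance of the voxel
    return np.stack([rho, mean[:, 2], mean[:, 3]], axis=1)   # (K, 3)
```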
In the embodiment of the invention, the network structure is built with PyTorch, and the deep target detection network comprises three parts: a grid-based point cloud feature extractor, a convolutional intermediate extraction layer, and a regional preselection network (RPN). In the grid point cloud feature extractor, the whole viewing cone is cut orderly into 3D grids of a set size, and all points in each grid are sent to the grid feature extractor, which consists of a linear layer, a batch normalization layer (BatchNorm) and a nonlinear activation layer (ReLU). In the convolutional intermediate layer, three convolutional intermediate modules are used; each module is formed by a 3D convolutional layer, a batch normalization layer and a nonlinear activation layer connected in sequence, takes the output of the grid point cloud extractor as input, and converts the 3D-structured features into 2D pseudo-image features as output. The input of the regional preselection network RPN is provided by the convolutional intermediate layer; the RPN architecture consists of three fully convolutional modules, each containing a downsampling convolutional layer followed by two convolutional layers that keep the feature map size, with BatchNorm and ReLU applied after each convolutional layer. The output of each block is then upsampled to feature maps of the same size, and these feature maps are concatenated into a whole.
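The three parts of the network described above can be sketched in PyTorch as follows; the channel widths, kernel sizes and strides are illustrative assumptions, since the description fixes only the layer types and their ordering:

```python
import torch.nn as nn

class GridFeatureExtractor(nn.Module):
    """Per-grid feature extractor: linear layer + BatchNorm + ReLU."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_ch, out_ch),
                                 nn.BatchNorm1d(out_ch), nn.ReLU())

    def forward(self, x):                 # x: (K, in_ch) valid-grid features
        return self.net(x)

class ConvMiddleBlock(nn.Module):
    """3D convolution + BatchNorm + ReLU; three such blocks compress the
    sparse 3D feature volume toward the 2D pseudo-image fed to the RPN."""
    def __init__(self, in_ch, out_ch, stride=(2, 1, 1)):
        super().__init__()
        self.net = nn.Sequential(nn.Conv3d(in_ch, out_ch, 3, stride, 1),
                                 nn.BatchNorm3d(out_ch), nn.ReLU())

    def forward(self, x):                 # x: (B, C, D, H, W)
        return self.net(x)

class RPNBlock(nn.Module):
    """One RPN stage: a downsampling convolution followed by two convolutions
    that keep the feature-map size, each with BatchNorm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        layers, c = [], in_ch
        for i in range(3):
            stride = 2 if i == 0 else 1
            layers += [nn.Conv2d(c, out_ch, 3, stride, 1),
                       nn.BatchNorm2d(out_ch), nn.ReLU()]
            c = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):                 # x: (B, C, H, W) pseudo-image features
        return self.net(x)
```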
The output of the 3D detection network consists of a center classification module, a box regression module and a connection relation regression module. The center classification module predicts the probability that a location is the center point of an object of each class. The box regression module generates an eight-channel feature map to regress the box attributes of an object: an offset regression that regresses the discretization error of the center 2D position caused by the downsampling step; a z-coordinate regression that predicts the center position along the Z axis; a size regression that regresses the 3D size of the object; and a direction regression that regresses the rotation angle around the Z axis. The connection relation regression module predicts the probability of a connection between pairs of key points.
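The three output modules can likewise be sketched as 1 x 1 convolutional heads over the concatenated RPN feature map; only the eight box-regression channels are fixed by the description above, and the channel count of the connection-relation (PAF) head is an assumption:

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Center classification, 8-channel box regression
    (dx, dy, z, l, w, h, cos r, sin r) and the connection-relation map used
    as an auxiliary training branch."""
    def __init__(self, in_ch, num_classes, paf_ch=2):
        super().__init__()
        self.center = nn.Conv2d(in_ch, num_classes, 1)
        self.box = nn.Conv2d(in_ch, 8, 1)
        self.paf = nn.Conv2d(in_ch, paf_ch, 1)

    def forward(self, feat):
        return {"center": torch.sigmoid(self.center(feat)),  # per-class center probability
                "box": self.box(feat),
                "paf": self.paf(feat)}
```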
The center point predictions are processed by max pooling to obtain the final bounding boxes, without IoU-based non-maximum suppression (NMS); the connection relation regression module does not participate in the prediction process and is only used during training.
In a specific embodiment of the present invention, the center classification module: the center classification outputs center heatmaps through a number of convolutional layers, each heatmap corresponding to one category; each category is defined manually;
The predicted center heatmap has one channel per center point type and values in [0, 1]; R is the down-sampling stride, and C is the number of center point types, equal to the number of detected classes.
The ground-truth value of the key point information is obtained as follows: a point cloud data set with annotation information is acquired; the annotation information is read and subjected to a Gaussian transformation to generate L2 data carrying the key point position information of the 3D target and L1 data carrying the key point connection relation information; the L2 data and the L1 data are the ground-truth values of the key point information. The annotation information comprises at least the key point information and the category information of the 3D target.
More specifically, the annotation information is subjected to a Gaussian transformation, after which each annotation becomes a response region with a Gaussian distribution; the Gaussian transformation follows
Y_xy = exp(-((x - p_x)^2 + (y - p_y)^2) / (2 * sigma_p^2)),
where (p_x, p_y) is the annotated key point and sigma_p is determined by the Gaussian radius. The size of the Gaussian radius varies with the size of the labeled 3D object. In this way, the detection rate of 3D targets under occlusion can be effectively improved, and false detections can be effectively suppressed.
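A minimal sketch of splatting one annotated center onto its class heatmap is given below; the relation sigma = radius / 3 is a common convention and an assumption here, since the description only states that the radius follows the size of the labeled object:

```python
import numpy as np

def draw_gaussian(heatmap, cx, cy, radius):
    """Write a Gaussian response region of the given radius (in heatmap cells)
    around the annotated center (cx, cy), keeping the element-wise maximum so
    that overlapping objects do not erase each other."""
    sigma = max(radius, 1) / 3.0              # assumed radius-to-sigma relation
    ys, xs = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    g = np.exp(-(xs * xs + ys * ys) / (2.0 * sigma * sigma))
    h, w = heatmap.shape
    t, b = max(0, cy - radius), min(h, cy + radius + 1)
    l, r = max(0, cx - radius), min(w, cx + radius + 1)
    patch = g[t - (cy - radius):b - (cy - radius),
              l - (cx - radius):r - (cx - radius)]
    heatmap[t:b, l:r] = np.maximum(heatmap[t:b, l:r], patch)
    return heatmap
```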
In an embodiment of the present invention, the box regression module: box regression is applied only to the features at positive center points, and each bounding box is an eight-dimensional vector. For each box, the 8-dimensional vector [dx, dy, z, l, w, h, cos(r), sin(r)] is regressed to represent an object instance in the lidar point cloud.
dx, dy represent the discretization offset of the center point on the final feature map;
z is the offset along the Z axis;
l, w, h are the size of the 3D target: length, width and height;
cos(r), sin(r) are the values of the trigonometric functions of the rotation angle r around the z-axis;
thus box regression includes offset regression, z-coord regression, size regression and direction regression.
Offset Regression:
Offset regression is used to predict the center point offsets; an offset feature map is predicted for each center point. All classes share the same offset prediction, and during training both the predicted value and the ground-truth offset lie between 0 and 1. Offset regression uses the L1 loss function:
L_off = (1/N) * sum_i |o_hat_i - o_i|, the L1 loss over the N positive center points, where o_hat and o are the predicted and ground-truth offsets.
Direction Regression:
Direction prediction regresses the rotation angle around the Z axis; during inference, each rotation angle is encoded and then decoded. The direction regression is trained with the L1 loss:
L_dir = (1/N) * sum_i |(cos r_hat_i, sin r_hat_i) - (cos r_i, sin r_i)|, the L1 loss over the N positive center points.
Z-coord Regression:
The z-coord regression predicts the offset of the center position along the Z axis; it outputs a z-coord feature map for each center point. All classes share the same z-coord prediction. However, since the regression target z-coord is sensitive to outliers, these outliers, which can be regarded as hard samples, would produce overly large gradients and disturb the training process of the model. To account for this problem, the L1 loss is used to train the z-coord regression:
L_z = (1/N) * sum_i |z_hat_i - z_i|, the L1 loss over the N positive center points.
Size Regression:
Predicting the length, width and height is the task of the size regression; like the z-coord regression, the size regression also easily introduces training imbalance. The loss function of the size regression is also the L1 loss:
L_size = (1/N) * sum_i |(l_hat_i, w_hat_i, h_hat_i) - (l_i, w_i, h_i)|, the L1 loss over the N positive center points.
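All four regression branches therefore share the same L1 form; a minimal sketch is given below, assuming the predictions and targets have already been gathered at the N positive center locations (the dictionary keys are naming assumptions):

```python
import torch.nn.functional as F

def box_regression_losses(pred, target):
    """pred / target: dicts of (N, k) tensors gathered at the positive center
    locations: 'offset' -> (dx, dy), 'z' -> z, 'size' -> (l, w, h),
    'direction' -> (cos r, sin r). Each branch uses a mean L1 loss."""
    return {key: F.l1_loss(pred[key], target[key], reduction="mean")
            for key in ("offset", "z", "size", "direction")}
```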
in an embodiment of the present invention, the connection relation regression module: the regression of the connection relation uses a PAF (Part Affinity Fields, i.e. local association field) mechanism, which constructs a pixel-level direction vector between each pair of points; the 3D vectors encode the position and orientation of the 3D object, and the PAFs form a 3D vector field that preserves both position and orientation.
As shown in fig. 3, a 3D object is represented by two key points 0 and 2, and each location on the 3D object carries a 3D unit vector pointing from one key point to the next. During the prediction phase, given the key points dj0 and dj2 and the PAF, the line integral of the PAF along the segment formed by the key-point pair is computed to measure the degree of association (affinity) of the pair, and thus to determine whether key points 0 and 2 belong to the same 3D target.
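The line integral can be approximated by sampling the PAF at a few locations along the candidate segment and averaging its alignment with the segment direction; the sketch below works on a 2-channel field over the feature map for simplicity (the description speaks of a 3D vector field), and the sample count is an assumption:

```python
import numpy as np

def paf_affinity(paf, p0, p2, num_samples=10):
    """Approximate the line integral of the predicted PAF along the segment
    between candidate key points p0 and p2 (pixel coordinates). A high value
    indicates that the two points likely belong to the same 3D object.
    paf: (H, W, 2) field of direction vectors."""
    p0, p2 = np.asarray(p0, dtype=float), np.asarray(p2, dtype=float)
    seg = p2 - p0
    norm = np.linalg.norm(seg)
    if norm < 1e-6:
        return 0.0
    u = seg / norm                                    # unit direction p0 -> p2
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (p0 + t * seg).astype(int)             # sample location on the segment
        score += float(np.dot(paf[y, x], u))          # alignment of field and segment
    return score / num_samples
```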
Those skilled in the art will appreciate that modifications can be implemented by combining the prior art with the above embodiments; such variations do not affect the essence of the present invention and are not described in detail herein.
The above description is of the preferred embodiments of the invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that devices and structures not described in detail are to be understood as being implemented in a manner common in the art. Those skilled in the art can make many possible variations and modifications to the disclosed embodiments, or modify them into equivalent embodiments, using the methods and technical content disclosed above, without departing from the scope of the invention. Therefore, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.

Claims (5)

1. A 3D target detection method based on key points is characterized by comprising the following steps:
the first stage is as follows: carrying out a model training process to generate a model;
and a second stage: and performing a prediction process by using a model generated by the training process to generate 3D target information, thereby completing 3D target detection.
2. The keypoint-based 3D object detection method of claim 1, wherein in the first stage the model is a convolutional neural network model.
3. The method for detecting the 3D target based on the key points according to claim 2, wherein the second stage specifically comprises:
step 1: acquiring point cloud data to be detected, and processing the data;
step 2: and detecting the point cloud data through the trained convolutional neural network model to obtain a detection result.
4. The method according to claim 3, wherein in step 2, the detection result includes key point position information and key point connection relationship information of the 3D object information.
5. The method of claim 4, wherein the key points comprise a start point, a middle point and an end point of a diagonal of the 3D object.
CN202110017625.9A 2021-01-07 2021-01-07 3D target detection method based on key points Pending CN112766100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110017625.9A CN112766100A (en) 2021-01-07 2021-01-07 3D target detection method based on key points

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110017625.9A CN112766100A (en) 2021-01-07 2021-01-07 3D target detection method based on key points

Publications (1)

Publication Number Publication Date
CN112766100A true CN112766100A (en) 2021-05-07

Family

ID=75700613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110017625.9A Pending CN112766100A (en) 2021-01-07 2021-01-07 3D target detection method based on key points

Country Status (1)

Country Link
CN (1) CN112766100A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537234A (en) * 2021-06-10 2021-10-22 浙江大华技术股份有限公司 Quantity counting method and device, electronic device and computer equipment


Similar Documents

Publication Publication Date Title
CN111201451B (en) Method and device for detecting object in scene based on laser data and radar data of scene
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
CN111798475B (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
Chen et al. Lidar-histogram for fast road and obstacle detection
US11482014B2 (en) 3D auto-labeling with structural and physical constraints
CN112149550B (en) Automatic driving vehicle 3D target detection method based on multi-sensor fusion
CN111429514A (en) Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds
CN112184589B (en) Point cloud intensity completion method and system based on semantic segmentation
CN111563415A (en) Binocular vision-based three-dimensional target detection system and method
CN114842438A (en) Terrain detection method, system and readable storage medium for autonomous driving vehicle
CN113313763B (en) Monocular camera pose optimization method and device based on neural network
US20220051425A1 (en) Scale-aware monocular localization and mapping
CN113378760A (en) Training target detection model and method and device for detecting target
Benedek et al. Positioning and perception in LIDAR point clouds
CN114325634A (en) Method for extracting passable area in high-robustness field environment based on laser radar
Song et al. End-to-end learning for inter-vehicle distance and relative velocity estimation in adas with a monocular camera
CN113743385A (en) Unmanned ship water surface target detection method and device and unmanned ship
CN116229408A (en) Target identification method for fusing image information and laser radar point cloud information
EP4053734A1 (en) Hand gesture estimation method and apparatus, device, and computer storage medium
CN113688738A (en) Target identification system and method based on laser radar point cloud data
Sakic et al. Camera-LIDAR object detection and distance estimation with application in collision avoidance system
CN113255779B (en) Multi-source perception data fusion identification method, system and computer readable storage medium
CN110909656A (en) Pedestrian detection method and system with integration of radar and camera
CN112766100A (en) 3D target detection method based on key points
CN117237919A (en) Intelligent driving sensing method for truck through multi-sensor fusion detection under cross-mode supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination