CN111145174A

CN111145174A - 3D target detection method for point cloud screening based on image semantic features

Info

Publication number: CN111145174A
Application number: CN202010000186.6A
Authority: CN
Inventors: 吴飞; 杨永光; 荆晓远; 葛琦; 季一木
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2020-01-02
Filing date: 2020-01-02
Publication date: 2020-05-12
Anticipated expiration: 2040-01-02
Also published as: CN111145174B

Abstract

The invention provides a 3D target detection method for point cloud screening based on image semantic features. The method comprises the following steps: firstly, a 2D semantic segmentation method is used for segmenting image data to obtain semantic prediction. And projecting the generated semantic prediction into a LIDAR point cloud space through a known projection matrix, so that each point in the point cloud can obtain the semantic category attribute of the corresponding image position. We extract points related to vehicles, pedestrians, cyclists from the original point cloud and form the viewing cones. Secondly, the viewing cones are used as input of the depth 3D target detector, and a loss function which accords with the characteristics of the viewing cones is designed to conduct network training. The invention designs a 3D target detection algorithm for point cloud screening based on image semantic features, thereby greatly reducing the time and the calculation requirements of 3D detection. Finally, the performance of the method on a reference data set KITTI of 3D target detection shows that the method has good real-time target detection performance.

Description

3D target detection method for point cloud screening based on image semantic features

Technical Field

The invention relates to 3D target detection, in particular to a 3D target detection algorithm for point cloud screening based on image semantic features, and belongs to the field of pattern recognition.

Background

Point cloud based 3D object detection serves as an important character in real life for many applications, such as autopilot, home robot, augmented reality, and virtual reality. Compared to traditional image-based target detection methods, LIDAR point clouds provide more accurate depth information that can be used to locate objects and delineate the shape of objects. However, LIDAR point clouds are more sparse and have large differences in density of parts, due to factors such as non-uniform 3D spatial sampling, effective range of the sensor, and object occlusion and relative position, unlike conventional images. To solve the above problem, many methods process a 3D point cloud into features that can be processed by a corresponding target detector using an artificially designed feature extraction method. However, these methods using all point clouds as input require a lot of computing resources and cannot achieve real-time detection.

Disclosure of Invention

The purpose of the invention is: aiming at the problems in the prior art, a 3D target detection algorithm for carrying out point cloud screening based on image semantic features is provided, the algorithm is an end-to-end depth 3D target detection method, a 2D image semantic segmentation method is simultaneously adopted to obtain the category attribute of each pixel in an image under the same scene, the prediction result is used as the prior category attribute, each point in the point cloud is marked through a known projection matrix, and the points of which the categories are automobiles, pedestrians and riders are all extracted from the point cloud to form a viewing cone which is used as the input of a 3D target detection network. At the same time we have designed a 3D object detector that handles cones. In addition to the basic components of the object detector, i.e., the point cloud feature extractor using the mesh, the convolutional intermediate extraction layer, and the regional pre-selection network (RPN), we also optimize the loss function to make the entire network more sensitive to objects in the view frustum that lack references. Our algorithm includes the following steps:

step (1): performing semantic segmentation on the two-dimensional image on the image data to obtain semantic prediction;

step (2): projecting the semantic prediction into a point cloud space, and screening points of a specific category to form a view cone;

and (3): building a 3D target detection network, and taking a viewing cone as the input of a 3D target detector;

and (4): enhancing the sensitivity of the loss function to the position of the 3D target frame;

and (5): and obtaining a total objective function and carrying out algorithm optimization.

Further, the specific method for performing semantic segmentation on the image data in the step (1) is as follows:

images were segmented using the deplab v3+ semantic segmentation method: firstly, manually labeling the image part of a training set in a data set; then, 200 epochs are pre-trained on a Cityscapes data set by DeepLabv3+, and then 50 epochs are finely adjusted on a manually marked semantic label data set; the resulting semantic segmentation network is trained to classify each pixel in the picture as one of 19 classes.

Further, in the step (2), based on the result predicted by the 2D semantic segmentation method, projecting the region of each category in each image into the LIDAR point cloud space by using a known projection matrix, wherein the region corresponding to the LIDAR point cloud space has a category attribute consistent with the image region; points about vehicles, pedestrians and cyclists are then screened from the original point cloud and extracted to form viewing cones.

Further, in step (3), a deep object detection network is constructed by using the pytorech, and the network comprises three parts: point cloud feature extractor using mesh, convolution intermediate extraction layer and regional pre-selection network RPN:

in a grid point cloud feature extractor, orderly cutting the whole view cone by using a 3D grid with a set size, and sending all points in each grid to the grid feature extractor, wherein the grid feature extractor consists of a linear layer, a batch normalization layer BatchNorm and a nonlinear activation layer ReLU;

in the convolution intermediate layer, 3 convolution intermediate modules are used, each convolution intermediate module is formed by sequentially connecting a 3D convolution layer, a batch normalization layer and a nonlinear activation layer, the output of the grid point cloud extractor is used as the input, and the feature with the 3D structure is converted into a 2D pseudo-graph feature which is used as the output;

the input of the regional preselection network RPN is provided by a convolution intermediate layer, the architecture of the PRN consists of three full convolution modules, each module containing a downsampled convolution layer followed by two convolution layers corresponding to the characteristic image size, after each convolution layer the BatchNorm and ReLU operations are applied; then, the output of each block is up-sampled to feature maps with the same size, and the feature maps are connected into a whole; finally, three 1 × 1 2D convolutional layers are applied to the desired learning objective to generate: (1) probability score plot, (2) regression bias, and (3) directional prediction.

Further, in step (4), an overall loss function L is added to the model_totalAs follows:

L_total＝β₁L_cls+β₂(L_{reg_θ}+L_{reg_other})+β₃L_dir+β₄L_corner

wherein L is_clsTo predict the loss of classification, L_{reg_θ}Predicted angle loss for 3D bounding box, L_{reg_other}To predict the loss of correction for the remaining parameters of the 3D bounding box, L_dirTo predict loss of direction, L_cornerTo predict the loss of vertex coordinates for the 3D bounding box β₁,β₂β₃,β₄Are hyper-parameters, set to 1.0, 2.0, 0.2 and 0.5, respectively;

for L_{reg_θ}And L_{reg_other}The following variables were used:

Δθ＝θ^g-θ^a

wherein

w^g,l^g,h^g,θ^gThe parameters for each bounding box provided for the tag,

is a prediction parameter of an anchor point, where x_c,y_c,z_cW, l, h and theta respectively refer to the central coordinate of the three-dimensional bounding box, the length, the width, the height and the overlooking course angle of the three-dimensional bounding box; wherein d is^a＝(l^a)²+(w^a)²The length of the diagonal line of the anchor point floor; for the predicted angle theta_pAngle loss L_{reg_θ}The concrete expression is as follows:

L_{reg_θ}＝SnoothL1(sin(θ_p-Δθ))

correcting the loss L for a parameter_{reg_other}In particular, it is the SmoothL1 function of the differences Δ x, Δ y, Δ z, Δ w, Δ L, Δ h, Δ θ, while the loss of coordinates of the vertices L of the 3D bounding box_cornerThe composition of (A) is as follows:

wherein NS, NH traverses all bounding boxes; p, P^*,P^**Denotes the predicted bounding box vertex, the vertex of the label bounding box, the vertex of the inverse bounding box, delta_ijTo balance the coefficients, i, j are the indices of the targets generated by the final profile.

Further, in step (4), the positive and negative anchor point balance is adjusted using focal length:

FL(p_t)＝-α_t(1-p_t)^γlog(p_t)

wherein p is_tIs the estimated probability of the model, α_tAnd γ is a super parameter adjustment coefficient, set to 0.5 and 2, respectively.

Further, in step (5), the whole model is trained according to steps (2), (3) and (4), that is, the 3D target detection network is trained on the KITTI data set, and the specific parameters and implementation method are as follows: training 20 ten thousand times, 160epochs, on a 1080Ti GPU using a random gradient descent SGD and Adam optimizer; the initial learning rate was set to 0.0002, the exponential decay factor was 0.8 and decayed every 15 epochs.

Has the advantages that: in the invention, the image is subjected to semantic segmentation, the obtained prediction result is used as a prior classification mark, and points in the corresponding point cloud are screened out to form a viewing cone. The operation greatly reduces the characteristic of high complexity of the previous input, so that the 3D object detector can obtain good effect while keeping real-time detection.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The invention provides a 3D target detection method for point cloud screening based on image semantic features. The method comprises the following steps: firstly, a 2D semantic segmentation method is used for segmenting image data to obtain semantic prediction. And projecting the generated semantic prediction into a LIDAR point cloud space through a known projection matrix, so that each point in the point cloud can obtain the semantic category attribute of the corresponding image position. We extract points related to vehicles, pedestrians, cyclists from the original point cloud and form the viewing cones. Secondly, the viewing cones are used as input of the depth 3D target detector, and a loss function which accords with the characteristics of the viewing cones is designed to conduct network training. Due to the existence of a large amount of background and noise in the point cloud, the original unstructured point cloud data is very difficult to process, and a large amount of special consideration needs to be carried out, so that a large amount of computing resources and training reasoning time need to be consumed. The invention designs a 3D target detection algorithm for point cloud screening based on image semantic features, thereby greatly reducing the time and the calculation requirements of 3D detection. Finally, the performance of the method on a reference data set KITTI of 3D target detection shows that the method has good real-time target detection performance.

The invention is further described with reference to the following figures and examples.

Examples

The invention provides a 3D target detection algorithm for point cloud screening based on image semantic features, and the specific flow is shown in figure 1.

Step (1): we segmented the image using the currently outstanding semantic segmentation method, deplab v3 +. The image data of the 3D object detection data set does not contain markers for segmentation. We first hand label the image portion of the training set in the dataset. We pre-train the deplab v3+ now cityscaps dataset for 200 epochs and then perform a 50epoch fine-tuning on the manually labeled semantic tags. The resulting semantic segmentation network is trained to classify each pixel in the picture as one of 19 classes.

Step (2): projecting the semantic prediction into a point cloud space, and screening points of a specific category to form a view cone, wherein the specific method comprises the following steps: based on the result predicted by the 2D semantic segmentation method, the region of each category in each image is projected into the point cloud space by using a known projection matrix, so that the region of the corresponding point cloud space has the category attribute consistent with the image region. Then we screen the points about the car, pedestrian, cyclist from the original point cloud, forming the viewing cone.

And (3): we build a deep target detection network using the pytorech depth framework. The network comprises three parts: point cloud feature extractor using mesh, convolution intermediate extraction layer and regional pre-selection network (RPN).

In the grid point cloud feature extractor, a viewing cone is firstly cut in order by using a 3D grid with a set size. And all of each mesh is taken as input to the mesh feature extractor. Our mesh feature extractor consists of a linear layer, a batch normalization layer (BatchNorm) and a nonlinear activation layer (ReLU).

In the convolution middle layer we use 3 convolution middle blocks in order to increase the receptive field to get more context. Each convolution intermediate module consists of a 3D convolution layer, a batch normalization layer (BatchNorm) and a nonlinear activation layer (ReLU) connected in sequence. It takes the output of the mesh point cloud extractor as input and converts this feature with 3D structure into a 2D pseudo-graph feature, which is taken as the final output.

The input to the Regional Preselection Network (RPN) is provided by the convolution interlayer. The architecture of the PRN consists of three full-volume modules. Each module contains a downsampled convolutional layer followed by two convolutional layers corresponding to the feature image size. After each convolutional layer, we apply the BatchNorm and ReLU operations. Then we upsample the output of each block to a feature map of the same size and concatenate these feature maps into one whole. Finally, three 2D convolutional layers are applied to the desired learning objective to generate: (1) probability score plot, (2) regression bias, and (3) directional prediction.

And (4): the screening process of the point cloud causes the view cones not to have original context information. The target point cloud data without reference makes the detection task more difficult, and a special loss function is added into the model to strengthen the sensitivity of the model to the target. Overall loss function L_totalAs follows:

L_total＝β₁L_cls+β₂(L_{reg_θ}+L_{reg_other})+β₃L_dir+β₄L_corner

wherein L is_clsTo classify the loss, L_{reg_θ}Is the angle loss of the 3D bounding box, L_{reg_other}Correction of losses for the remaining parameters of the 3Dbounding box, L_dirFor directional loss, L_cornerLoss of vertex coordinates for 3D bounding box β₁,β₂β₃,β₄For the hyper-parameter, are set to 1.0, 2.0, 0.2 and 0.5, respectively.

For L_{reg_θ}And L_{reg_other}Can beTo be determined from the following variables:

Δθ＝θ^g-θ^a

wherein

Figure 462170DEST_PATH_FDA0002398050340000023

w^g,l^g,h^g,θ^gThe tag is provided with parameters describing the corresponding bounding box,

prediction parameters being anchor points, where x_c,y_c,z_cAnd w, l, h and theta respectively refer to the central coordinate of the three-dimensional bounding box, the length, the width, the height and the overlooking course angle. Wherein d is^a＝(l^a)²+(w^a)²The length of the diagonal of the frame is detected for the anchor point look-down direction. For the predicted angle theta_pAngle loss L_{reg_θ}Specifically, it can be expressed as:

L_{reg_θ}＝SnoothL1(sin(θ_p-Δθ))

correcting the loss L for a parameter_{reg_other}Is a SmoothL1 function of the differences Δ x, Δ y, Δ z, Δ w, Δ l, Δ h, Δ θ. Loss of vertex coordinates L of 3D bounding box_cornerThe composition of (A) is as follows:

where NS, NH traverses all bounding boxes. P, P^*,P^**Indicating that the vertex of the bounding box is predicted, the vertex of the bounding box is labeled, and the label is inverted. In addition to the above loss function based on bounding box prediction, to solve the problem of imbalance of positive and negative anchor points existing in the RPNWe add focal length to address these drawbacks:

FL(p_t)＝-α_t(1-p_t)^γlog(p_t)

wherein p is_tIs the estimated probability of the model, α_tAnd gamma is a super parameter adjustment coefficient, which is set to 0.5 and 2 respectively; log (p)_t) The base numbers of the cross entropy are adopted, and the base numbers are both e and 10.

And (5): obtaining a total objective function and carrying out algorithm optimization:

the entire model is trained according to the previous steps 2, 3, 4. We train the 3D target detection network on the KITTI dataset. We trained on a 1080Ti GPU using random gradient descent (SGD) and Adam optimizers. Our model was trained 20 ten thousand times (160 epochs). The initial learning rate was set to 0.0002, the exponential decay factor was 0.8 and decayed every 15 epochs.

And (4) analyzing results:

to verify the superiority of the algorithm, we compared the proposed method with several of the most advanced target tests published recently, including MV3D, MV3D (LIDAR), F-PointNet, AVOD, AVOD-FCN and VoxelNet. As shown in tables 1 and 2, our method achieved the best performance in the most difficult target tests. Furthermore, table 3 provides a time-efficient comparison of the respective methods, and our method is also a real-time target detection method considering that it itself has used a 2D semantic segmentation method, which consumes too much time.

The experimental results are as follows:

table 1 counts the AP (%) values of 3D detection on the KITTI dataset.

Table 2 counts the AP (%) value of BEV detection on the KITTI dataset.

Table 3 counts the time(s) required for each method to process a scene over the KITTI dataset.

TABLE 1 AP-value comparison of 3D detection on KITTI data set

TABLE 2 AP-value comparison of BEV detection on KITTI data set

TABLE 3 comparison of time spent by each method on KITTI data sets

MV3D	MV3D(LIDAR)	F-PointNet	AVOD	AVOD-FCN	VoxelNet	ours
							0.36	0.24	0.17	0.08	0.10	0.23	0.18

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. A3D target detection algorithm for carrying out point cloud screening based on image semantic features is characterized by comprising the following steps:

2. The 3D target detection algorithm for point cloud screening based on image semantic features as claimed in claim 1, wherein: the specific method for performing semantic segmentation on the image data in the step (1) is as follows:

3. The 3D object detection algorithm for point cloud screening based on image semantic features as claimed in claim 1, wherein in step (2), based on the result predicted by the 2D semantic segmentation method, the region of each category in each image is projected into the LIDAR point cloud space by using a known projection matrix, and the region corresponding to the LIDAR point cloud space has a category attribute consistent with the image region; points about vehicles, pedestrians and cyclists are then screened from the original point cloud and extracted to form viewing cones.

4. The 3D object detection algorithm for point cloud screening based on image semantic features as claimed in claim 1, wherein in step (3), a deep object detection network is constructed by using a pytorech, and the network comprises three parts: point cloud feature extractor using mesh, convolution intermediate extraction layer and regional pre-selection network RPN:

5. The 3D target detection algorithm for point cloud screening based on image semantic features as claimed in claim 1, wherein in step (4), an overall loss function L is added to the model_totalAs follows:

L_total＝β₁L_cls+β₂(L_{reg_θ}+L_{reg_other})+β₃L_dir+β₄L_corner

wherein L is_clsTo predict the loss of classification, L_{reg_θ}Predicted angle loss for 3D bounding box, L_{reg_other}To predict the loss of correction for the remaining parameters of the 3Dbounding box, L_dirTo predict loss of direction, L_cornerTo predict the loss of coordinates of the vertices of the 3D bounding box β₁,β₂β₃,β₄Are hyper-parameters, set to 1.0, 2.0, 0.2 and 0.5, respectively;

for L_{reg_θ}And L_{reg_other}The following variables were used:

Δθ＝θ^g-θ^a

wherein the content of the first and second substances,

w^g,l^g,h^g,θ^gthe parameters for each bounding box provided for the tag,

w^a,l^a,h^a,θ^ais a prediction parameter of an anchor point, where x_c,y_c,z_cW, l, h and theta respectively refer to the central coordinate of the three-dimensional bounding box, the length, the width, the height and the overlooking course angle of the three-dimensional bounding box; wherein d is^a＝(l^a)²+(w^a)²The length of the diagonal line of the anchor point floor; for the predicted angle theta_pAngle loss L_{reg_θ}The concrete expression is as follows:

L_{reg_θ}＝SnoothL1(sin(θ_p-Δθ))

6. The 3D target detection algorithm for point cloud screening based on image semantic features as claimed in claim 1, wherein in step (4), the positive and negative anchor point balance is adjusted using focal local:

FL(p_t)＝-α_t(1-p_t)^γlog(p_t)

7. The 3D object detection algorithm for point cloud screening based on image semantic features as claimed in claim 1, wherein in step (5), the whole model is trained according to steps (2), (3), (4), that is, the 3D object detection network is trained on KITTI data set, and the specific parameters and implementation method are as follows: training 20 ten thousand times, 160epochs, on a 1080Ti GPU using a random gradient descent SGD and Adam optimizer; the initial learning rate was set to 0.0002, the exponential decay factor was 0.8 and decayed every 15 epochs.