CN113657246A - Three-dimensional point cloud two-stage target detection method based on self-supervision learning - Google Patents

Three-dimensional point cloud two-stage target detection method based on self-supervision learning

Info

Publication number
CN113657246A
CN113657246A
Authority
CN
China
Prior art keywords: scene, twin, dimensional, target, region
Prior art date
Legal status
Granted
Application number
CN202110931081.7A
Other languages
Chinese (zh)
Other versions
CN113657246B (en)
Inventor
冯鸿超
夏桂华
何芸倩
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202110931081.7A
Publication of CN113657246A
Application granted
Publication of CN113657246B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a two-stage three-dimensional point cloud target detection method based on self-supervised learning, which comprises the following steps: (1) generate a reconstructed scene from the original point cloud scene; (2) voxelize both scenes; (3) extract features from both scenes with 3D sparse convolution and submanifold convolution, and project the resulting 3D feature maps onto 2D; (4) generate proposal regions from the 2D features of the original scene; (5) complete the first self-supervised proxy task; (6) complete the second self-supervised proxy task; (7) extract region-of-interest features according to the proposal regions from step (4) and refine the predicted target positions; (8) train the self-supervised tasks and the target detection task jointly with normalized loss coefficients. The method dynamically reconstructs the original data to generate a reconstructed scene, which is then used by the subsequent self-supervised learning tasks and the target detection task to improve the network's feature representation of the point cloud.

Description

Three-dimensional point cloud two-stage target detection method based on self-supervision learning
Technical Field
The invention belongs to the field of three-dimensional point cloud processing in computer vision, and particularly relates to a two-stage three-dimensional point cloud target detection method based on self-supervised learning.
Background
Computer vision faces increasingly complex application scenarios such as autonomous driving and robot navigation. Because two-dimensional target detection cannot provide accurate position information in such scenarios, three-dimensional target detection based on lidar point clouds has attracted more and more attention.
In terms of point cloud representation, the methods that most researchers focus on can be divided into three broad categories: voxel-based methods, point-based methods, and voxel-point fusion methods. VoxelNet, SECOND, and similar networks are typical voxel-based paradigms: they divide three-dimensional Euclidean space into a regular voxel grid, apply 3D convolution or 3D sparse convolution to obtain a feature representation, and feed it to a Region Proposal Network (RPN). Inspired by PointNet, other researchers have proposed a series of point-based works such as 3D-SSD and PointRCNN, which iteratively sample sub-points and group neighboring points, and then extract features directly from the raw points. Subsequently, to combine the advantages of the voxel-based and point-based approaches, some studies convert the voxel space to a point space and apply a point-based method to a particular module, or perform this conversion in the reverse direction.
However, none of the above methods takes full advantage of the three-dimensional bounding box information and the point cloud attributes. In particular, the targets and the environment in a point cloud are isolated from each other, which provides an opportunity to reconstruct a scene using the physical transformations employed in self-supervised approaches. By exploiting the differences and connections between a target and its transformed counterpart, richer feature representations can be explored. No similar attempt has been made in previous approaches.
Disclosure of Invention
The invention aims to provide a two-stage three-dimensional point cloud target detection method based on self-supervised learning, which uses normalized loss weights to train the self-supervised proxy tasks and the main target detection task jointly.
The object of the invention is achieved as follows:
A two-stage three-dimensional point cloud target detection method based on self-supervised learning comprises the following steps:
Step one: for a given iteration, an original point cloud scene is input; a subset of the targets in the original scene is randomly selected, and each selected target is randomly rotated by a different angle about its own local coordinate system (that is, each target receives a different rotation angle); the scene obtained after rotation is the reconstructed scene corresponding to the original scene. A target before rotation and its rotated counterpart are called twin targets, the original scene and the corresponding reconstructed scene are called twin scenes, and the twin scenes are output to the next module.
Step two: point cloud space voxelization. The twin scenes are partitioned with a fixed voxel size, and the partitioned point cloud space is converted into a regular three-dimensional voxel space.
Step three: 3D sparse feature extraction and 2D feature map generation. Features are extracted from the regular voxel space with sparse convolution and submanifold convolution, and the feature maps are repeatedly downsampled by stacked convolution layers to obtain 1×, 2×, 4×, and 8× downsampled 3D feature maps. The 8× downsampled feature maps of the twin scenes are concatenated along the z-axis in the feature dimension to obtain the 2D feature maps of the twin scenes.
Step four: proposal region generation in the original scene. Based on the 2D feature map of the original scene, 2D convolution generates position and class predictions of proposal regions for each pixel of the feature map.
Step five: the structure imagination task predicts the positions and classes of the proposal regions in the reconstructed scene from the 2D feature map of the reconstructed scene, using the same convolutional network as in step four.
Step six: according to the proposal regions generated in steps four and five, twin proposal regions matching the twin targets are generated; the corresponding features of the twin targets are extracted from the 3D features of the twin scenes according to the mapping of the twin proposal regions, and these features are concatenated along the feature dimension to obtain the difference features of the twin targets. Finally, a fully connected layer predicts the heading angle difference of the twin targets from the difference features.
Step seven: regions of interest are selected from the proposal regions of the original scene generated in step four based on their position and class predictions, and the features of each region of interest are extracted from the 3D features of the original scene according to the mapping of the region of interest. The target is then predicted from the region-of-interest features, and the class and bounding box information of the predicted target are output.
Step eight: to avoid the conflict that can arise between the main task and the proxy tasks in conventional self-supervised learning, joint training with normalized losses is performed, and normalized loss coefficients control the contribution of the proxy tasks.
Compared with the prior art, the invention has the beneficial effects that:
1. The method introduces self-supervised learning into the field of 3D target detection for the first time and improves the prediction accuracy of the detection network by means of a reconstructed scene and two proxy tasks.
2. Given the complexity of the target detection task, the joint training scheme with normalized losses effectively avoids conflicts between the target detection task and the proxy tasks and prevents the proxy tasks from suppressing the target detection task.
3. Following the self-supervised learning paradigm, a corresponding reconstructed scene is generated automatically for the original scene in each iteration, without additional manual labeling effort.
4. Since self-supervised learning participates only in the training process, the method improves the prediction accuracy of the detection network without adding any extra computational burden at the inference stage.
Drawings
FIG. 1 is an explanatory diagram of the overall structure of an object detection network according to the present invention;
FIG. 2a is a diagram illustrating a conventional method of learning by self-supervision according to the present invention;
FIG. 2b is a diagram illustrating a conventional method of learning by self-supervision according to the present invention;
FIG. 3 is an illustration of the dynamic scene reconstruction operation of the present invention.
Detailed Description
The following further describes the embodiments of the present invention with reference to the drawings.
Step one: dynamic scene reconstruction. For a given iteration, an original point cloud scene is input; a subset of the targets in the original scene is randomly selected, and each selected target is randomly rotated by a different angle about its own local coordinate system (that is, each target receives a different rotation angle); the scene obtained after rotation is the reconstructed scene corresponding to the original scene. For example, if the original scene contains three targets, two of them may be randomly selected and rotated by 10° and 20° respectively, while the environment points of the original scene are left unchanged; the result is the reconstructed scene. Apart from the different orientations of the targets in the original and reconstructed scenes, all remaining information, including the class, the bounding box center position, and the bounding box dimensions, is identical.
A target before rotation and its rotated counterpart are called twin targets, the original scene and the corresponding reconstructed scene are called twin scenes, and the twin scenes are output to the next module. The twin scenes provide the basis for the subsequent self-supervised learning and target detection training.
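The scene reconstruction of step one can be illustrated with a short Python sketch. It assumes the scene points are stored as an (N, 3) array and each target is described by a 7-parameter box (x, y, z, l, w, h, yaw) together with the indices of the points that belong to it; the helper names, the box layout, and the rotation range are illustrative assumptions rather than the patent's actual implementation.

import numpy as np

def rotate_z(points, angle):
    """Rotate an (M, 3) array of points about the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T

def reconstruct_scene(points, boxes, box_point_indices, select_ratio=0.5, max_angle=np.pi / 4):
    """Randomly rotate a subset of targets about their own box centers.

    points:            (N, 3) scene points
    boxes:             (K, 7) boxes as (x, y, z, l, w, h, yaw)
    box_point_indices: list of K index arrays, the points belonging to each box
    Returns the reconstructed points, the updated boxes, and the per-target angles.
    """
    new_points, new_boxes = points.copy(), boxes.copy()
    angles = np.zeros(len(boxes))
    for k in np.flatnonzero(np.random.rand(len(boxes)) < select_ratio):
        angle = np.random.uniform(-max_angle, max_angle)  # a different angle per target
        center = boxes[k, :3]
        idx = box_point_indices[k]
        # rotate the target's points in its local coordinate system (about the box center)
        new_points[idx] = rotate_z(points[idx] - center, angle) + center
        new_boxes[k, 6] += angle  # only the heading changes
        angles[k] = angle
    return new_points, new_boxes, angles

Because only the selected targets are rotated about their own centers, the environment points and every box attribute except the heading remain identical between the twin scenes, which is what the twin-target supervision in the later steps relies on.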
Step two: point cloud space voxelization. The twin scenes are partitioned with a fixed voxel size, and the partitioned twin scene point clouds are converted into a regular three-dimensional voxel space. This reduces the data complexity on the one hand and facilitates the subsequent feature extraction on the other.
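A minimal voxelization sketch is given below; the voxel size and point cloud range are placeholder values, since the patent only states that a fixed voxel size is used.

import numpy as np

def voxelize(points, voxel_size=(0.05, 0.05, 0.1),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Convert an (N, 3) point cloud into integer voxel coordinates on a fixed grid.

    Points outside pc_range are discarded; the unique occupied voxels form the
    sparse, regular 3D voxel space consumed by the feature extraction in step three.
    """
    pts = np.asarray(points, dtype=np.float32)
    low = np.array(pc_range[:3], dtype=np.float32)
    high = np.array(pc_range[3:], dtype=np.float32)
    size = np.array(voxel_size, dtype=np.float32)
    inside = np.all((pts >= low) & (pts < high), axis=1)
    coords = np.floor((pts[inside] - low) / size).astype(np.int64)  # (M, 3) voxel indices
    occupied, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    return occupied, point_to_voxel, pts[inside]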
Step three: 3D sparse feature extraction and 2D feature map generation. The regular voxel space is taken as input, features are extracted from it with sparse convolution and submanifold convolution, and the feature maps are repeatedly downsampled by stacked convolution layers to obtain 1×, 2×, 4×, and 8× downsampled 3D feature maps. The 8× downsampled 3D feature maps of the twin scenes obtained in the previous step are then concatenated along the z-axis in the feature dimension to obtain the 2D feature maps of the twin scenes.
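The concatenation along the z-axis at the end of step three amounts to folding the remaining height slices into the channel dimension, which can be written as a short tensor reshape; the sketch below uses PyTorch, the tensor shapes are placeholders, and the sparse 3D backbone itself (stacked sparse and submanifold convolutions) is only indicated in the comment.

import torch

def to_2d_map(feat_3d: torch.Tensor) -> torch.Tensor:
    """Collapse a dense 3D feature volume (B, C, D, H, W) into a 2D map (B, C*D, H, W).

    feat_3d is assumed to be the densified, 8x downsampled output of the sparse
    backbone; stacking the D height slices into the channel dimension yields the
    2D feature map consumed by the proposal generation in step four.
    """
    b, c, d, h, w = feat_3d.shape
    return feat_3d.reshape(b, c * d, h, w)

# Example: a 64-channel volume with 2 remaining height slices becomes a 128-channel 2D map.
bev = to_2d_map(torch.randn(1, 64, 2, 176, 200))
print(bev.shape)  # torch.Size([1, 128, 176, 200])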
Step four: proposal region generation in the original scene. The 2D feature map of the original scene is taken as input, and 2D convolution generates position and class predictions of proposal regions for each pixel of the feature map. During training, the RPN (Region Proposal Network) loss L_RPN is generated at the same time.
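A proposal head in the spirit of step four might look like the following sketch: a small 2D convolutional trunk followed by per-pixel class and box branches. The channel counts, the number of anchors, and the 7-parameter box encoding are assumptions for illustration, not the architecture specified by the patent.

import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    """Per-pixel proposal prediction on the collapsed 2D feature map."""

    def __init__(self, in_channels=128, num_anchors=2, num_classes=3, box_dim=7):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
        )
        # one class score set and one (x, y, z, l, w, h, yaw) box per anchor per pixel
        self.cls_head = nn.Conv2d(128, num_anchors * num_classes, 1)
        self.box_head = nn.Conv2d(128, num_anchors * box_dim, 1)

    def forward(self, feat_2d):
        x = self.trunk(feat_2d)
        return self.cls_head(x), self.box_head(x)

head = ProposalHead()
cls_pred, box_pred = head(torch.randn(1, 128, 176, 200))
print(cls_pred.shape, box_pred.shape)  # (1, 6, 176, 200) (1, 14, 176, 200)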
Step five: the structure imagination task takes the 2D feature map of the reconstructed scene as input and predicts the positions and classes of the proposal regions in the reconstructed scene with the same convolutional network as in step four. During training, the structure imagination task loss L_SI is generated at the same time.
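Because step five reuses the network of step four, the structure imagination task can be sketched as applying the same head instance (shared weights) to the reconstructed scene's feature map; rpn_loss below is a placeholder for the detector's usual classification and box regression loss, whose exact form the patent does not spell out.

def structure_imagination_losses(head, rpn_loss, feat_2d_original, feat_2d_reconstructed,
                                 targets_original, targets_reconstructed):
    """Apply the shared proposal head to both twin scenes and return (L_RPN, L_SI).

    The targets of the reconstructed scene differ from the original ones only in
    the heading angles of the rotated twin targets.
    """
    preds_original = head(feat_2d_original)            # step four: original scene
    preds_reconstructed = head(feat_2d_reconstructed)  # step five: same weights, reconstructed scene
    loss_rpn = rpn_loss(*preds_original, targets_original)
    loss_si = rpn_loss(*preds_reconstructed, targets_reconstructed)
    return loss_rpn, loss_si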
Step six: in the angle awareness task, twin proposal regions matching the twin targets are generated from the proposal regions produced in steps four and five. A target before rotation has a certain number of proposal regions in the original scene, and the one with the largest 3D IoU against the target is taken as its best-matching proposal region. Likewise, the rotated target has a certain number of proposal regions in the reconstructed scene, and its best-matching proposal region is found in the same way. The twin proposal regions are this pair of best-matching proposal regions of the twin targets.
The corresponding features of the twin targets are then extracted from the 3D features of the twin scenes according to the mapping of the twin proposal regions, and the extracted feature pair is concatenated along the feature dimension to obtain the difference features of the twin targets. Finally, a fully connected layer predicts the heading angle difference of the twin targets from the difference features. During training, the angle awareness task loss L_AA is generated at the same time.
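The angle awareness branch might be sketched as follows: select the proposal with the largest 3D IoU against each twin target (a box_iou_3d helper is assumed and not shown), concatenate the two pooled proposal features, and regress the heading difference with fully connected layers. The feature sizes and layer widths are illustrative.

import torch
import torch.nn as nn

def best_matching_proposal(target_box, proposals, box_iou_3d):
    """Index of the proposal with the largest 3D IoU against the target box.

    box_iou_3d(a, b) is an assumed helper returning the IoU of two 7-parameter boxes.
    """
    ious = torch.tensor([float(box_iou_3d(target_box, p)) for p in proposals])
    return int(torch.argmax(ious))

class AngleAwareHead(nn.Module):
    """Predict the heading-angle difference of a twin-target pair from concatenated features."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),  # scalar angle difference per twin pair
        )

    def forward(self, feat_original, feat_reconstructed):
        diff_feat = torch.cat([feat_original, feat_reconstructed], dim=-1)  # difference features
        return self.mlp(diff_feat).squeeze(-1)

angle_head = AngleAwareHead()
pred_delta = angle_head(torch.randn(4, 256), torch.randn(4, 256))  # 4 twin pairs
print(pred_delta.shape)  # torch.Size([4])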
Step seven: the proposal regions of the original scene generated in step four, together with their position and class predictions, are taken as input; regions of interest are selected from the proposal regions, and the features of each region of interest are extracted from the 3D features of the original scene according to the mapping of the region of interest. The target is then predicted from the region-of-interest features, and the class and bounding box information of the predicted target are output. During training, the RoI (Region of Interest) loss L_RoI is generated at the same time.
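A refinement head for step seven could pass the pooled region-of-interest features through shared fully connected layers and then branch into class scores and box residuals, as sketched below. The pooling that maps a region of interest back into the 3-dimensional features is detector-specific and is represented here only by the pooled feature tensor; all dimensions are assumptions.

import torch
import torch.nn as nn

class RoIRefineHead(nn.Module):
    """Second stage: refined class scores and box residuals from pooled RoI features."""

    def __init__(self, roi_feat_dim=256, num_classes=3, box_dim=7):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(roi_feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.cls_branch = nn.Linear(256, num_classes)  # refined class prediction
        self.reg_branch = nn.Linear(256, box_dim)      # residual relative to the proposal box

    def forward(self, roi_feats):
        x = self.shared(roi_feats)
        return self.cls_branch(x), self.reg_branch(x)

refine = RoIRefineHead()
cls_out, box_out = refine(torch.randn(100, 256))  # 100 regions of interest
print(cls_out.shape, box_out.shape)  # (100, 3) (100, 7)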
Step eight: joint training with normalized losses.
During training, the losses generated in steps four, five, six, and seven are used to supervise the network. To avoid the conflict that can arise between the main task and the proxy tasks in conventional self-supervised learning, the main task and the proxy tasks are trained jointly, and the contribution of the proxy tasks is controlled with normalized loss coefficients, as shown in the following formula:
L = αL_SI + (1 - α)L_RPN + βL_AA + (1 - β)L_RoI
where L is the total loss used to supervise the network, and α and β are the normalized loss coefficients of the two self-supervised proxy tasks. The structure imagination task acts on the first stage of the detection network, and the angle awareness task acts on the second stage.
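The normalized combination is a direct weighted sum; the sketch below assumes the four component losses are already available as scalars and uses example coefficient values, which the patent does not fix.

def total_loss(l_si, l_rpn, l_aa, l_roi, alpha=0.2, beta=0.2):
    """Normalized joint loss: each proxy task shares a unit weight budget with the
    detection loss of its stage (structure imagination with the RPN loss, angle
    awareness with the RoI loss)."""
    return alpha * l_si + (1.0 - alpha) * l_rpn + beta * l_aa + (1.0 - beta) * l_roi

print(total_loss(l_si=0.8, l_rpn=1.2, l_aa=0.5, l_roi=0.9))

Keeping α and β well below 1 shifts each stage's weight budget toward its detection loss, which matches the stated goal of preventing the proxy tasks from suppressing the main task.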
The inference stage does not involve the self-supervised proxy tasks, so the method improves the detection accuracy of the detector without adding any computational cost at inference time. At inference, steps four, five, six, and seven no longer generate loss values but directly produce the corresponding prediction results; the final prediction is produced by step seven, which outputs the class label and 3D bounding box of each predicted target.

Claims (1)

1. A two-stage three-dimensional point cloud target detection method based on self-supervised learning, characterized by comprising the following steps:
Step one: for a given iteration, an original point cloud scene is input; a subset of the targets in the original scene is randomly selected, and each selected target is randomly rotated by a different angle about its own local coordinate system (that is, each target receives a different rotation angle); the scene obtained after rotation is the reconstructed scene corresponding to the original scene; a target before rotation and its rotated counterpart are called twin targets, the original scene and the corresponding reconstructed scene are called twin scenes, and the twin scenes are output to the next module.
Step two: point cloud space voxelization, in which the twin scenes are partitioned with a fixed voxel size and the partitioned point cloud space is converted into a regular three-dimensional voxel space.
Step three: 3D sparse feature extraction and 2D feature map generation, in which features are extracted from the regular voxel space with sparse convolution and submanifold convolution, the feature maps are repeatedly downsampled by stacked convolution layers to obtain 1×, 2×, 4×, and 8× downsampled 3D feature maps, and the 8× downsampled feature maps of the twin scenes are concatenated along the z-axis in the feature dimension to obtain the 2D feature maps of the twin scenes.
Step four: proposal region generation in the original scene, in which 2D convolution is applied to the 2D feature map of the original scene to generate position and class predictions of proposal regions for each pixel of the feature map.
Step five: a structure imagination task, in which the positions and classes of the proposal regions in the reconstructed scene are predicted from the 2D feature map of the reconstructed scene with the same convolutional network as in step four.
Step six: according to the proposal regions generated in steps four and five, twin proposal regions matching the twin targets are generated; the corresponding features of the twin targets are extracted from the 3D features of the twin scenes according to the mapping of the twin proposal regions, and these features are concatenated along the feature dimension to obtain the difference features of the twin targets; finally, a fully connected layer predicts the heading angle difference of the twin targets from the difference features.
Step seven: regions of interest are selected from the proposal regions of the original scene generated in step four based on their position and class predictions, the features of each region of interest are extracted from the 3D features of the original scene according to the mapping of the region of interest, the target is then predicted from the region-of-interest features, and the class and bounding box information of the predicted target are output.
Step eight: to avoid the conflict that can arise between the main task and the proxy tasks in conventional self-supervised learning, joint training with normalized losses is performed, and normalized loss coefficients control the contribution of the proxy tasks.
CN202110931081.7A 2021-08-13 2021-08-13 Three-dimensional point cloud two-stage target detection method based on self-supervision learning Active CN113657246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110931081.7A CN113657246B (en) 2021-08-13 2021-08-13 Three-dimensional point cloud two-stage target detection method based on self-supervision learning


Publications (2)

Publication Number Publication Date
CN113657246A (en) 2021-11-16
CN113657246B CN113657246B (en) 2023-11-21

Family

ID=78479885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110931081.7A Active CN113657246B (en) 2021-08-13 2021-08-13 Three-dimensional point cloud two-stage target detection method based on self-supervision learning

Country Status (1)

Country Link
CN (1) CN113657246B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210042929A1 (en) * 2019-01-22 2021-02-11 Institute Of Automation, Chinese Academy Of Sciences Three-dimensional object detection method and system based on weighted channel features of a point cloud
CN110930452A (en) * 2019-10-23 2020-03-27 同济大学 Object pose estimation method based on self-supervision learning and template matching
CN111476822A (en) * 2020-04-08 2020-07-31 浙江大学 Laser radar target detection and motion tracking method based on scene flow
CN113221962A (en) * 2021-04-21 2021-08-06 哈尔滨工程大学 Three-dimensional point cloud single-stage target detection method for decoupling classification and regression tasks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宋一凡; 张鹏; 宗立波; 马波; 刘立波: "Improved 3D object detection method based on redundant point filtering" (改进的基于冗余点过滤的3D目标检测方法), Journal of Computer Applications (计算机应用), no. 09

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494609A (en) * 2022-04-02 2022-05-13 中国科学技术大学 3D target detection model construction method and device and electronic equipment
CN114494609B (en) * 2022-04-02 2022-09-06 中国科学技术大学 3D target detection model construction method and device and electronic equipment

Also Published As

Publication number Publication date
CN113657246B (en) 2023-11-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant