CN113657246A - Three-dimensional point cloud two-stage target detection method based on self-supervision learning - Google Patents
Three-dimensional point cloud two-stage target detection method based on self-supervision learning
- Publication number
- CN113657246A (application CN202110931081.7A)
- Authority
- CN
- China
- Prior art keywords
- scene
- twin
- dimensional
- target
- region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a three-dimensional point cloud two-stage target detection method based on self-supervised learning, comprising the following steps: (1) generate a reconstructed scene from the original point cloud scene; (2) voxelize the two scenes; (3) perform feature extraction on the two scenes using 3D sparse convolution and submanifold sparse convolution, and project the resulting 3-dimensional feature maps to 2 dimensions; (4) generate proposed regions from the 2-dimensional features of the original scene; (5) complete the first self-supervised agent task; (6) complete the second self-supervised agent task; (7) extract region-of-interest features from the proposed regions of step (4) and refine the predicted target positions; (8) train the self-supervised tasks and the target detection task using normalized loss coefficients. The invention dynamically generates a reconstructed scene from the original data and uses it, together with the original scene, for the subsequent self-supervised learning tasks and the target detection task, so as to improve the network's capability to represent point cloud features.
Description
Technical Field
The invention belongs to the field of computer vision three-dimensional point cloud processing, and particularly relates to a three-dimensional point cloud two-stage target detection method based on self-supervision learning.
Background
Computer vision faces increasingly complex application scenarios such as autonomous driving and robot navigation. Three-dimensional target detection based on lidar point clouds has gained more and more attention because two-dimensional target detection cannot provide accurate position information in such scenarios.
In terms of point cloud representation, the methods of interest to most researchers can be divided into three broad categories: voxel-based methods, point-based methods, and voxel-point fusion methods. VoxelNet, SECOND, and similar networks are typical of the voxel-based paradigm: they divide three-dimensional Euclidean space into a regular voxel space, apply three-dimensional convolution or three-dimensional sparse convolution to obtain a feature representation, and feed it to a Region Proposal Network (RPN). Inspired by PointNet, other researchers have proposed a series of point-based works such as 3DSSD and PointRCNN, which iteratively sample subsets of points and group neighboring points, then extract features directly from the raw points. Subsequently, to combine the advantages of both families, some studies convert the voxel space into a point space and apply point-based processing to particular modules, or repeat the above process in reverse.
However, none of the above methods takes full advantage of the three-dimensional bounding-box information and the point cloud attributes. In particular, the targets and the environment in a point cloud are isolated from each other, which provides an opportunity to reconstruct a scene using the physical transformations employed in self-supervised approaches. By exploiting the differences and connections between a target and its transformed counterpart, richer feature representations can be explored, but no similar attempt has been made in previous approaches.
Disclosure of Invention
The invention aims to provide a three-dimensional point cloud two-stage target detection method based on self-supervised learning, which uses normalized loss weights to jointly train the self-supervised agent tasks and the main target detection task.
The purpose of the invention is realized as follows:
A three-dimensional point cloud two-stage target detection method based on self-supervised learning comprises the following steps:
Step one: in a given iteration, an original point cloud scene is input; part of the targets in the original scene are randomly selected, and each selected target is randomly rotated by a different angle about its own local coordinate system. The scene obtained after rotation is the reconstructed scene corresponding to the original scene. The targets before and after rotation are referred to as twin targets, the original scene and its reconstructed scene are referred to as twin scenes, and the twin scenes are output to the next module.
Step two: point cloud space voxelization. The twin scenes are divided according to a fixed voxel size, converting the point cloud space into a regular three-dimensional voxel space.
Step three: 3-dimensional sparse feature extraction and 2-dimensional feature map generation. Feature extraction is performed on the regular voxel space using sparse convolution and submanifold sparse convolution, and the feature map is repeatedly down-sampled by stacked convolution layers to obtain 1x, 2x, 4x, and 8x down-sampled 3-dimensional feature maps. The 8x down-sampled feature maps of the twin scenes are concatenated along the z-axis in the feature dimension to obtain the 2-dimensional feature maps of the twin scenes.
Step four: proposal generation in the original scene. Based on the 2-dimensional feature map of the original scene, 2-dimensional convolution is used to generate a position and class prediction of a proposed region for each pixel of the feature map.
Step five: the structure imagination task. Using the same convolutional network as in step four, the positions and classes of the proposed regions in the reconstructed scene are predicted based on the 2-dimensional feature map of the reconstructed scene.
Step six: according to the proposed regions generated in steps four and five, twin proposed regions matching the twin targets are generated; the corresponding features of the twin targets are extracted from the 3-dimensional features of the twin scenes according to the mapping of the twin proposed regions, and are then concatenated along the feature dimension to obtain the difference feature of the twin targets. Finally, a fully connected layer predicts the direction-angle difference of the twin targets from the difference feature.
Step seven: regions of interest are extracted from the proposed regions of the original scene generated in step four based on their position and class predictions, and the corresponding region-of-interest features are extracted from the 3-dimensional features of the original scene according to the mapping of each region of interest. The target is then predicted from the region-of-interest features, and the class and bounding-box information of the predicted target are output.
Step eight: to avoid the conflict between the main task and the agent tasks that can arise in conventional self-supervised learning, joint training with normalized loss is performed, and the influence of the agent tasks is controlled by the normalized loss coefficients.
Compared with the prior art, the invention has the beneficial effects that:
1. The method introduces self-supervised learning into the field of 3D target detection for the first time, and improves the prediction accuracy of the detection network by means of the reconstructed scene and two agent tasks.
2. Given the complexity of the target detection task, the joint training with normalized loss effectively avoids conflict between the target detection task and the agent tasks, and prevents the agent tasks from suppressing the target detection task.
3. Using the self-supervised learning concept, a corresponding reconstructed scene is generated automatically for the original scene in each iteration, without additional manual annotation cost.
4. Since self-supervised learning participates only in the training process, the method improves the prediction accuracy of the detection network without adding any computational burden in the inference stage.
Drawings
FIG. 1 is an explanatory diagram of the overall structure of an object detection network according to the present invention;
FIG. 2a is a schematic diagram of conventional self-supervised learning;
FIG. 2b is a schematic diagram of self-supervised learning in the present invention;
FIG. 3 is an illustration of the dynamic reconfiguration scene operation of the present invention;
Detailed Description
The following further describes the embodiments of the present invention with reference to the drawings.
The method comprises the following steps. Step one: in a given iteration, an original point cloud scene is input; part of the targets in the original scene are randomly selected, and each selected target is randomly rotated by a different angle about its own local coordinate system. The scene obtained after rotation is the reconstructed scene corresponding to the original scene. For example, if the original scene contains three targets, two of them are randomly selected and rotated by 10 degrees and 20 degrees respectively, while the environment points of the original scene are left unchanged; this yields the reconstructed scene. Apart from the orientation of the rotated targets, all remaining information (class, bounding-box center position, bounding-box dimensions, and so on) is identical between the original and reconstructed scenes.
The objects before and after rotation are referred to as twin objects, the original scene and the corresponding reconstructed scene are referred to as twin scenes, and the twin scenes are output to the next module. The twin scene provides support for subsequent self-supervised learning and target detection task training.
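The dynamic reconstruction of step one can be sketched as follows. The fraction of targets rotated, the angle range, and the `(mask, center)` representation of a target are illustrative assumptions, not details given in the patent.

```python
import numpy as np

def rotate_target(points, center, angle_rad):
    """Rotate a target's points about the z-axis of its own local
    coordinate system (origin at the bounding-box center)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return (points - center) @ rot.T + center

def reconstruct_scene(scene_points, targets, rng):
    """Build the reconstructed twin scene: randomly pick a subset of
    targets and rotate each by its own random angle; environment
    points are left untouched. `targets` is a list of
    (boolean point mask, box center) pairs (assumed representation)."""
    recon = scene_points.copy()
    n_pick = max(1, len(targets) // 2)  # rotated fraction is an assumption
    for idx in rng.choice(len(targets), n_pick, replace=False):
        mask, center = targets[idx]
        angle = rng.uniform(-np.pi, np.pi)  # per-target random angle
        recon[mask] = rotate_target(recon[mask], center, angle)
    return recon
```

Because each rotation is rigid and applied in the target's local frame, the twin targets share class, center, and box size and differ only in direction angle, exactly the invariants step one relies on.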
Step two: point cloud space voxelization divides a twin scene according to a fixed voxel size, and the divided twin scene point cloud space is converted into a regular three-dimensional voxel space. On one hand, the data complexity is reduced, and on the other hand, the subsequent feature extraction operation is facilitated.
Step three: and 3-dimensional sparse feature extraction and 2-dimensional feature map generation take a regular voxel space as input, feature extraction is carried out on the regular voxel space by using sparse convolution and sub-stream convolution, and the feature map is continuously subjected to down-sampling by using stacking of convolution layers to respectively obtain 1 x, 2 x, 4 x and 8 x times of down-sampled 3-dimensional feature maps. And performing characteristic dimension splicing on the 8 multiplied down-sampling 3-dimensional characteristic diagram of the twin scene obtained in the last step along the z axis to obtain a 2-dimensional characteristic diagram of the twin scene.
Step four: and generating a suggested region in the original scene, inputting a 2-dimensional feature map of the original scene, and then generating a position and category prediction result of the suggested region for each super pixel point in the feature map by using 2-dimensional convolution. Generating RPN (regional precursor sales network) loss L at the same time in the training stageRPN。
Step five: the structure imagination task inputs a 2-dimensional feature map of a reconstruction scene, and the position and the category of a suggested area in the reconstruction scene are predicted by using the same convolution network in the fourth step. Generating structural imagination task loss L at the same time in the training phaseSI。
Step six: and generating a twin suggestion region matching the twin target according to the suggestion regions generated in the fourth step and the fifth step in the angle-aware task. The target before rotation has a certain number of suggested regions in the original scene, and one of the suggested regions having the maximum 3DIoU value with the target is found as the most matched suggested region. Similar rotated targets also have a certain number of suggested regions in the reconstructed scene, and the corresponding best matching suggested region can also be found. The twin advice region refers to the pair of best matching advice regions of the twin target.
The corresponding features of the twin targets are then extracted from the 3-dimensional features of the twin scenes according to the mapping of the twin proposed regions, and the extracted feature pair is concatenated along the feature dimension to obtain the difference feature of the twin targets. Finally, a fully connected layer predicts the direction-angle difference of the twin targets from the difference feature. In the training stage, the angle-aware task loss L_AA is generated at the same time.
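The matching and feature pairing of step six can be sketched as below. The patent matches with rotated 3D IoU; the axis-aligned box IoU here is a simplification to keep the sketch short, and the `(xmin, ymin, zmin, xmax, ymax, zmax)` box encoding is an assumption.

```python
import numpy as np

def aabb_iou_3d(a, b):
    """IoU of two axis-aligned 3-D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol = lambda box: np.prod(box[3:] - box[:3])
    return inter / (vol(a) + vol(b) - inter)

def best_proposal(target_box, proposals):
    """Index of the proposed region with maximum IoU against the target,
    i.e. one half of a twin proposed-region pair."""
    return int(np.argmax([aabb_iou_3d(target_box, p) for p in proposals]))

def difference_feature(feat_orig, feat_recon):
    """Concatenate the twin targets' region features along the feature
    dimension; a fully connected head then regresses the angle gap."""
    return np.concatenate([feat_orig, feat_recon], axis=-1)
```

Running `best_proposal` once per scene of the twin pair yields the two best-matching proposals whose features are concatenated into the difference feature.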
Step seven: and predicting and inputting a suggested region in the original scene generated in the fourth step based on the position and the category of the region of interest, extracting the region of interest from the suggested region, and enabling the region of interest to be in accordance with the feelingAnd extracting the characteristics of the corresponding interested region from the 3-dimensional characteristics of the interesting region mapped to the original scene. Then, the target is predicted based on the characteristics of the region of interest, and the category and bounding box information of the predicted target are output. Simultaneous generation of RoI (regionsofinterest) loss L in training phaseRoI。
Step eight: joint training with normalized loss.
And in the training stage, the loss generated in the fourth, fifth, sixth and seventh steps is used for supervising the network. In order to avoid the possible conflict problem of the main task and the agent task in the traditional self-supervision learning, the patent performs the joint training of the main task and the agent task. And the action strength of the agent task is controlled by using the normalized loss coefficient as shown in the following formula:
L = αL_SI + (1-α)L_RPN + βL_AA + (1-β)L_RoI
where L is the loss finally used to supervise the network, and α and β are the normalized loss coefficients of the two self-supervised agent tasks. The structure imagination task acts on the first stage of the detection network, and the angle-aware task acts on the second stage.
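The formula above is a direct convex combination per stage, so each agent task's weight trades off against the detection loss of its own stage; a one-line sketch:

```python
def total_loss(l_si, l_rpn, l_aa, l_roi, alpha, beta):
    """Joint loss with normalized coefficients: each agent-task weight
    (alpha for structure imagination, beta for angle awareness) is
    paired with its stage's detection loss so the pair sums to 1."""
    return alpha * l_si + (1 - alpha) * l_rpn + beta * l_aa + (1 - beta) * l_roi
```

Setting α = β = 0 recovers pure detection training, which is how the normalized coefficients cap how strongly the agent tasks can suppress the main task.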
The inference stage does not involve the self-supervised agent tasks, so the method improves the detection accuracy of the detector without adding any computational cost at inference. In the inference stage, steps four, five, six, and seven do not generate loss values but directly produce the corresponding prediction results; the final prediction result is produced by step seven, which outputs the class label and 3D bounding box of the predicted target.
Claims (1)
1. A three-dimensional point cloud two-stage target detection method based on self-supervised learning, characterized by comprising the following steps:
Step one: in a given iteration, an original point cloud scene is input; part of the targets in the original scene are randomly selected, and each selected target is randomly rotated by a different angle about its own local coordinate system; the scene obtained after rotation is the reconstructed scene corresponding to the original scene. The targets before and after rotation are referred to as twin targets, the original scene and its reconstructed scene are referred to as twin scenes, and the twin scenes are output to the next module.
Step two: point cloud space voxelization. The twin scenes are divided according to a fixed voxel size, converting the point cloud space into a regular three-dimensional voxel space.
Step three: 3-dimensional sparse feature extraction and 2-dimensional feature map generation. Feature extraction is performed on the regular voxel space using sparse convolution and submanifold sparse convolution, and the feature map is repeatedly down-sampled by stacked convolution layers to obtain 1x, 2x, 4x, and 8x down-sampled 3-dimensional feature maps. The 8x down-sampled feature maps of the twin scenes are concatenated along the z-axis in the feature dimension to obtain the 2-dimensional feature maps of the twin scenes.
Step four: proposal generation in the original scene. Based on the 2-dimensional feature map of the original scene, 2-dimensional convolution is used to generate a position and class prediction of a proposed region for each pixel of the feature map.
Step five: the structure imagination task. Using the same convolutional network as in step four, the positions and classes of the proposed regions in the reconstructed scene are predicted based on the 2-dimensional feature map of the reconstructed scene.
Step six: according to the proposed regions generated in steps four and five, twin proposed regions matching the twin targets are generated; the corresponding features of the twin targets are extracted from the 3-dimensional features of the twin scenes according to the mapping of the twin proposed regions, and are then concatenated along the feature dimension to obtain the difference feature of the twin targets. Finally, a fully connected layer predicts the direction-angle difference of the twin targets from the difference feature.
Step seven: regions of interest are extracted from the proposed regions of the original scene generated in step four based on their position and class predictions, and the corresponding region-of-interest features are extracted from the 3-dimensional features of the original scene according to the mapping of each region of interest. The target is then predicted from the region-of-interest features, and the class and bounding-box information of the predicted target are output.
Step eight: to avoid the conflict between the main task and the agent tasks that can arise in conventional self-supervised learning, joint training with normalized loss is performed, and the influence of the agent tasks is controlled by the normalized loss coefficients.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110931081.7A CN113657246B (en) | 2021-08-13 | 2021-08-13 | Three-dimensional point cloud two-stage target detection method based on self-supervision learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110931081.7A CN113657246B (en) | 2021-08-13 | 2021-08-13 | Three-dimensional point cloud two-stage target detection method based on self-supervision learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113657246A true CN113657246A (en) | 2021-11-16 |
CN113657246B CN113657246B (en) | 2023-11-21 |
Family
ID=78479885
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110931081.7A Active CN113657246B (en) | 2021-08-13 | 2021-08-13 | Three-dimensional point cloud two-stage target detection method based on self-supervision learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113657246B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114494609A (en) * | 2022-04-02 | 2022-05-13 | 中国科学技术大学 | 3D target detection model construction method and device and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110930452A (en) * | 2019-10-23 | 2020-03-27 | 同济大学 | Object pose estimation method based on self-supervision learning and template matching |
CN111476822A (en) * | 2020-04-08 | 2020-07-31 | 浙江大学 | Laser radar target detection and motion tracking method based on scene flow |
US20210042929A1 (en) * | 2019-01-22 | 2021-02-11 | Institute Of Automation, Chinese Academy Of Sciences | Three-dimensional object detection method and system based on weighted channel features of a point cloud |
CN113221962A (en) * | 2021-04-21 | 2021-08-06 | 哈尔滨工程大学 | Three-dimensional point cloud single-stage target detection method for decoupling classification and regression tasks |
-
2021
- 2021-08-13 CN CN202110931081.7A patent/CN113657246B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210042929A1 (en) * | 2019-01-22 | 2021-02-11 | Institute Of Automation, Chinese Academy Of Sciences | Three-dimensional object detection method and system based on weighted channel features of a point cloud |
CN110930452A (en) * | 2019-10-23 | 2020-03-27 | 同济大学 | Object pose estimation method based on self-supervision learning and template matching |
CN111476822A (en) * | 2020-04-08 | 2020-07-31 | 浙江大学 | Laser radar target detection and motion tracking method based on scene flow |
CN113221962A (en) * | 2021-04-21 | 2021-08-06 | 哈尔滨工程大学 | Three-dimensional point cloud single-stage target detection method for decoupling classification and regression tasks |
Non-Patent Citations (1)
Title |
---|
宋一凡; 张鹏; 宗立波; 马波; 刘立波: "Improved 3D target detection method based on redundant point filtering" (改进的基于冗余点过滤的3D目标检测方法), 计算机应用 (Journal of Computer Applications), no. 09 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114494609A (en) * | 2022-04-02 | 2022-05-13 | 中国科学技术大学 | 3D target detection model construction method and device and electronic equipment |
CN114494609B (en) * | 2022-04-02 | 2022-09-06 | 中国科学技术大学 | 3D target detection model construction method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN113657246B (en) | 2023-11-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6745328B2 (en) | Method and apparatus for recovering point cloud data | |
CN110288695B (en) | Single-frame image three-dimensional model surface reconstruction method based on deep learning | |
Xie et al. | Point clouds learning with attention-based graph convolution networks | |
CN112488210A (en) | Three-dimensional point cloud automatic classification method based on graph convolution neural network | |
CN107818580A (en) | 3D reconstructions are carried out to real object according to depth map | |
WO2022199135A1 (en) | Supine position and prone position breast image registration method based on deep learning | |
Li et al. | ADR-MVSNet: A cascade network for 3D point cloud reconstruction with pixel occlusion | |
US20220277581A1 (en) | Hand pose estimation method, device and storage medium | |
WO2024060395A1 (en) | Deep learning-based high-precision point cloud completion method and apparatus | |
CN113516663B (en) | Point cloud semantic segmentation method and device, electronic equipment and storage medium | |
CN113052955A (en) | Point cloud completion method, system and application | |
CN113989340A (en) | Point cloud registration method based on distribution | |
CN111709270B (en) | Three-dimensional shape recovery and attitude estimation method and device based on depth image | |
US20220414974A1 (en) | Systems and methods for reconstructing a scene in three dimensions from a two-dimensional image | |
CN112163990A (en) | Significance prediction method and system for 360-degree image | |
Tong et al. | Normal assisted pixel-visibility learning with cost aggregation for multiview stereo | |
CN111340935A (en) | Point cloud data processing method, intelligent driving method, related device and electronic equipment | |
CN113657246A (en) | Three-dimensional point cloud two-stage target detection method based on self-supervision learning | |
Rao et al. | In-vehicle object-level 3D reconstruction of traffic scenes | |
Ahn et al. | Projection-based point convolution for efficient point cloud segmentation | |
CN113240584A (en) | Multitask gesture picture super-resolution method based on picture edge information | |
EP4207089A1 (en) | Image processing method and apparatus | |
Zhu et al. | CED-Net: contextual encoder–decoder network for 3D face reconstruction | |
Li et al. | SRIF-RCNN: Sparsely represented inputs fusion of different sensors for 3D object detection | |
CN116152800A (en) | 3D dynamic multi-target detection method, system and storage medium based on cross-view feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |