CN113657246A - Three-dimensional point cloud two-stage target detection method based on self-supervision learning - Google Patents

Three-dimensional point cloud two-stage target detection method based on self-supervision learning

Info

Publication number
CN113657246A
CN113657246A
Authority
CN
China
Prior art keywords: scene, twin, dimensional, target, region
Prior art date
Legal status
Granted
Application number
CN202110931081.7A
Other languages
Chinese (zh)
Other versions
CN113657246B (en)
Inventor
冯鸿超
夏桂华
何芸倩
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202110931081.7A
Publication of CN113657246A
Application granted
Publication of CN113657246B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a two-stage three-dimensional point cloud target detection method based on self-supervised learning, which comprises the following steps: (1) generate a reconstructed scene from the original point cloud scene; (2) voxelize both scenes; (3) extract features from both scenes with 3D sparse convolution and submanifold convolution, and project the resulting 3D feature maps onto 2D; (4) generate proposal regions from the 2D features of the original scene; (5) complete the first self-supervised proxy task; (6) complete the second self-supervised proxy task; (7) extract region-of-interest features according to the proposal regions from step (4) and refine the predicted target positions; (8) train the self-supervised tasks and the target detection task jointly with normalized loss coefficients. The method dynamically reconstructs the original data to generate a reconstructed scene, which is then used by the subsequent self-supervised learning tasks and the target detection task to improve the network's feature representation of the point cloud.

Description

Three-dimensional point cloud two-stage target detection method based on self-supervision learning
Technical Field
The invention belongs to the field of three-dimensional point cloud processing in computer vision, and particularly relates to a two-stage three-dimensional point cloud target detection method based on self-supervised learning.
Background
Computer vision faces increasingly complex application scenarios such as autonomous driving and robot navigation. Because two-dimensional target detection cannot provide accurate position information in such scenarios, three-dimensional target detection based on lidar point clouds has attracted more and more attention.
In terms of point cloud representation, the methods that most researchers focus on can be divided into three broad categories: voxel-based methods, point-based methods, and voxel-point fusion methods. VoxelNet, SECOND, and similar networks are typical voxel-based paradigms: they divide three-dimensional Euclidean space into a regular voxel grid, apply 3D convolution or 3D sparse convolution to obtain a feature representation, and feed it to a Region Proposal Network (RPN). Inspired by PointNet, other researchers have proposed a series of point-based works such as 3D-SSD and PointRCNN, which iteratively sample sub-points and group neighboring points, and then extract features directly from the raw points. Subsequently, to combine the advantages of the voxel-based and point-based approaches, some studies convert the voxel space to a point space and apply a point-based method to a particular module, or perform this conversion in the reverse direction.
However, none of the above methods takes full advantage of the three-dimensional bounding box information and the point cloud attributes. In particular, the targets and the environment in a point cloud are isolated from each other, which provides an opportunity to reconstruct a scene using the physical transformations employed in self-supervised approaches. By exploiting the differences and connections between a target and its transformed counterpart, richer feature representations can be explored. No similar attempt has been made in previous approaches.
Disclosure of Invention
The invention aims to provide a two-stage three-dimensional point cloud target detection method based on self-supervised learning, which uses normalized loss weights to train the self-supervised proxy tasks and the main target detection task jointly.
The object of the invention is achieved as follows:
A two-stage three-dimensional point cloud target detection method based on self-supervised learning comprises the following steps:
Step one: for a given iteration, an original point cloud scene is input; a subset of the targets in the original scene is randomly selected, and each selected target is randomly rotated by a different angle about its own local coordinate system (that is, each target receives a different rotation angle); the scene obtained after rotation is the reconstructed scene corresponding to the original scene. A target before rotation and its rotated counterpart are called twin targets, the original scene and the corresponding reconstructed scene are called twin scenes, and the twin scenes are output to the next module.
Step two: point cloud space voxelization. The twin scenes are partitioned with a fixed voxel size, and the partitioned point cloud space is converted into a regular three-dimensional voxel space.
Step three: 3D sparse feature extraction and 2D feature map generation. Features are extracted from the regular voxel space with sparse convolution and submanifold convolution, and the feature maps are repeatedly downsampled by stacked convolution layers to obtain 1×, 2×, 4×, and 8× downsampled 3D feature maps. The 8× downsampled feature maps of the twin scenes are concatenated along the z-axis in the feature dimension to obtain the 2D feature maps of the twin scenes.
Step four: proposal region generation in the original scene. Based on the 2D feature map of the original scene, 2D convolution generates position and class predictions of proposal regions for each pixel of the feature map.
Step five: the structure imagination task predicts the positions and classes of the proposal regions in the reconstructed scene from the 2D feature map of the reconstructed scene, using the same convolutional network as in step four.
Step six: according to the proposal regions generated in steps four and five, twin proposal regions matching the twin targets are generated; the corresponding features of the twin targets are extracted from the 3D features of the twin scenes according to the mapping of the twin proposal regions, and these features are concatenated along the feature dimension to obtain the difference features of the twin targets. Finally, a fully connected layer predicts the heading angle difference of the twin targets from the difference features.
Step seven: regions of interest are selected from the proposal regions of the original scene generated in step four based on their position and class predictions, and the features of each region of interest are extracted from the 3D features of the original scene according to the mapping of the region of interest. The target is then predicted from the region-of-interest features, and the class and bounding box information of the predicted target are output.
Step eight: to avoid the conflict that can arise between the main task and the proxy tasks in conventional self-supervised learning, joint training with normalized losses is performed, and normalized loss coefficients control the contribution of the proxy tasks.
Compared with the prior art, the invention has the beneficial effects that:
1. The method introduces self-supervised learning into the field of 3D target detection for the first time and improves the prediction accuracy of the detection network by means of a reconstructed scene and two proxy tasks.
2. Given the complexity of the target detection task, the joint training scheme with normalized losses effectively avoids conflicts between the target detection task and the proxy tasks and prevents the proxy tasks from suppressing the target detection task.
3. Following the self-supervised learning paradigm, a corresponding reconstructed scene is generated automatically for the original scene in each iteration, without additional manual labeling effort.
4. Since self-supervised learning participates only in the training process, the method improves the prediction accuracy of the detection network without adding any extra computational burden at the inference stage.
Drawings
FIG. 1 is an explanatory diagram of the overall structure of an object detection network according to the present invention;
FIG. 2a is a diagram illustrating a conventional method of learning by self-supervision according to the present invention;
FIG. 2b is a diagram illustrating a conventional method of learning by self-supervision according to the present invention;
FIG. 3 is an illustration of the dynamic scene reconstruction operation of the present invention.
Detailed Description
The following further describes the embodiments of the present invention with reference to the drawings.
Step one: dynamic scene reconstruction. For a given iteration, an original point cloud scene is input; a subset of the targets in the original scene is randomly selected, and each selected target is randomly rotated by a different angle about its own local coordinate system (that is, each target receives a different rotation angle); the scene obtained after rotation is the reconstructed scene corresponding to the original scene. For example, if the original scene contains three targets, two of them may be randomly selected and rotated by 10° and 20° respectively, while the environment points of the original scene are left unchanged; the result is the reconstructed scene. Apart from the different orientations of the targets in the original and reconstructed scenes, all remaining information, including the class, the bounding box center position, and the bounding box dimensions, is identical.
A target before rotation and its rotated counterpart are called twin targets, the original scene and the corresponding reconstructed scene are called twin scenes, and the twin scenes are output to the next module. The twin scenes provide the basis for the subsequent self-supervised learning and target detection training.
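The scene reconstruction of step one can be illustrated with a short Python sketch. It assumes the scene points are stored as an (N, 3) array and each target is described by a 7-parameter box (x, y, z, l, w, h, yaw) together with the indices of the points that belong to it; the helper names, the box layout, and the rotation range are illustrative assumptions rather than the patent's actual implementation.

import numpy as np

def rotate_z(points, angle):
    """Rotate an (M, 3) array of points about the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T

def reconstruct_scene(points, boxes, box_point_indices, select_ratio=0.5, max_angle=np.pi / 4):
    """Randomly rotate a subset of targets about their own box centers.

    points:            (N, 3) scene points
    boxes:             (K, 7) boxes as (x, y, z, l, w, h, yaw)
    box_point_indices: list of K index arrays, the points belonging to each box
    Returns the reconstructed points, the updated boxes, and the per-target angles.
    """
    new_points, new_boxes = points.copy(), boxes.copy()
    angles = np.zeros(len(boxes))
    for k in np.flatnonzero(np.random.rand(len(boxes)) < select_ratio):
        angle = np.random.uniform(-max_angle, max_angle)  # a different angle per target
        center = boxes[k, :3]
        idx = box_point_indices[k]
        # rotate the target's points in its local coordinate system (about the box center)
        new_points[idx] = rotate_z(points[idx] - center, angle) + center
        new_boxes[k, 6] += angle  # only the heading changes
        angles[k] = angle
    return new_points, new_boxes, angles

Because only the selected targets are rotated about their own centers, the environment points and every box attribute except the heading remain identical between the twin scenes, which is what the twin-target supervision in the later steps relies on.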
Step two: point cloud space voxelization. The twin scenes are partitioned with a fixed voxel size, and the partitioned twin scene point clouds are converted into a regular three-dimensional voxel space. This reduces the data complexity on the one hand and facilitates the subsequent feature extraction on the other.
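A minimal voxelization sketch is given below; the voxel size and point cloud range are placeholder values, since the patent only states that a fixed voxel size is used.

import numpy as np

def voxelize(points, voxel_size=(0.05, 0.05, 0.1),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Convert an (N, 3) point cloud into integer voxel coordinates on a fixed grid.

    Points outside pc_range are discarded; the unique occupied voxels form the
    sparse, regular 3D voxel space consumed by the feature extraction in step three.
    """
    pts = np.asarray(points, dtype=np.float32)
    low = np.array(pc_range[:3], dtype=np.float32)
    high = np.array(pc_range[3:], dtype=np.float32)
    size = np.array(voxel_size, dtype=np.float32)
    inside = np.all((pts >= low) & (pts < high), axis=1)
    coords = np.floor((pts[inside] - low) / size).astype(np.int64)  # (M, 3) voxel indices
    occupied, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    return occupied, point_to_voxel, pts[inside]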
Step three: 3D sparse feature extraction and 2D feature map generation. The regular voxel space is taken as input, features are extracted from it with sparse convolution and submanifold convolution, and the feature maps are repeatedly downsampled by stacked convolution layers to obtain 1×, 2×, 4×, and 8× downsampled 3D feature maps. The 8× downsampled 3D feature maps of the twin scenes obtained in the previous step are then concatenated along the z-axis in the feature dimension to obtain the 2D feature maps of the twin scenes.
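The concatenation along the z-axis at the end of step three amounts to folding the remaining height slices into the channel dimension, which can be written as a short tensor reshape; the sketch below uses PyTorch, the tensor shapes are placeholders, and the sparse 3D backbone itself (stacked sparse and submanifold convolutions) is only indicated in the comment.

import torch

def to_2d_map(feat_3d: torch.Tensor) -> torch.Tensor:
    """Collapse a dense 3D feature volume (B, C, D, H, W) into a 2D map (B, C*D, H, W).

    feat_3d is assumed to be the densified, 8x downsampled output of the sparse
    backbone; stacking the D height slices into the channel dimension yields the
    2D feature map consumed by the proposal generation in step four.
    """
    b, c, d, h, w = feat_3d.shape
    return feat_3d.reshape(b, c * d, h, w)

# Example: a 64-channel volume with 2 remaining height slices becomes a 128-channel 2D map.
bev = to_2d_map(torch.randn(1, 64, 2, 176, 200))
print(bev.shape)  # torch.Size([1, 128, 176, 200])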
Step four: proposal region generation in the original scene. The 2D feature map of the original scene is taken as input, and 2D convolution generates position and class predictions of proposal regions for each pixel of the feature map. During training, the RPN (Region Proposal Network) loss L_RPN is generated at the same time.
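A proposal head in the spirit of step four might look like the following sketch: a small 2D convolutional trunk followed by per-pixel class and box branches. The channel counts, the number of anchors, and the 7-parameter box encoding are assumptions for illustration, not the architecture specified by the patent.

import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    """Per-pixel proposal prediction on the collapsed 2D feature map."""

    def __init__(self, in_channels=128, num_anchors=2, num_classes=3, box_dim=7):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
        )
        # one class score set and one (x, y, z, l, w, h, yaw) box per anchor per pixel
        self.cls_head = nn.Conv2d(128, num_anchors * num_classes, 1)
        self.box_head = nn.Conv2d(128, num_anchors * box_dim, 1)

    def forward(self, feat_2d):
        x = self.trunk(feat_2d)
        return self.cls_head(x), self.box_head(x)

head = ProposalHead()
cls_pred, box_pred = head(torch.randn(1, 128, 176, 200))
print(cls_pred.shape, box_pred.shape)  # (1, 6, 176, 200) (1, 14, 176, 200)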
Step five: the structure imagination task takes the 2D feature map of the reconstructed scene as input and predicts the positions and classes of the proposal regions in the reconstructed scene with the same convolutional network as in step four. During training, the structure imagination task loss L_SI is generated at the same time.
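Because step five reuses the network of step four, the structure imagination task can be sketched as applying the same head instance (shared weights) to the reconstructed scene's feature map; rpn_loss below is a placeholder for the detector's usual classification and box regression loss, whose exact form the patent does not spell out.

def structure_imagination_losses(head, rpn_loss, feat_2d_original, feat_2d_reconstructed,
                                 targets_original, targets_reconstructed):
    """Apply the shared proposal head to both twin scenes and return (L_RPN, L_SI).

    The targets of the reconstructed scene differ from the original ones only in
    the heading angles of the rotated twin targets.
    """
    preds_original = head(feat_2d_original)            # step four: original scene
    preds_reconstructed = head(feat_2d_reconstructed)  # step five: same weights, reconstructed scene
    loss_rpn = rpn_loss(*preds_original, targets_original)
    loss_si = rpn_loss(*preds_reconstructed, targets_reconstructed)
    return loss_rpn, loss_si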
Step six: in the angle awareness task, twin proposal regions matching the twin targets are generated from the proposal regions produced in steps four and five. A target before rotation has a certain number of proposal regions in the original scene, and the one with the largest 3D IoU against the target is taken as its best-matching proposal region. Likewise, the rotated target has a certain number of proposal regions in the reconstructed scene, and its best-matching proposal region is found in the same way. The twin proposal regions are this pair of best-matching proposal regions of the twin targets.
The corresponding features of the twin targets are then extracted from the 3D features of the twin scenes according to the mapping of the twin proposal regions, and the extracted feature pair is concatenated along the feature dimension to obtain the difference features of the twin targets. Finally, a fully connected layer predicts the heading angle difference of the twin targets from the difference features. During training, the angle awareness task loss L_AA is generated at the same time.
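The angle awareness branch might be sketched as follows: select the proposal with the largest 3D IoU against each twin target (a box_iou_3d helper is assumed and not shown), concatenate the two pooled proposal features, and regress the heading difference with fully connected layers. The feature sizes and layer widths are illustrative.

import torch
import torch.nn as nn

def best_matching_proposal(target_box, proposals, box_iou_3d):
    """Index of the proposal with the largest 3D IoU against the target box.

    box_iou_3d(a, b) is an assumed helper returning the IoU of two 7-parameter boxes.
    """
    ious = torch.tensor([float(box_iou_3d(target_box, p)) for p in proposals])
    return int(torch.argmax(ious))

class AngleAwareHead(nn.Module):
    """Predict the heading-angle difference of a twin-target pair from concatenated features."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),  # scalar angle difference per twin pair
        )

    def forward(self, feat_original, feat_reconstructed):
        diff_feat = torch.cat([feat_original, feat_reconstructed], dim=-1)  # difference features
        return self.mlp(diff_feat).squeeze(-1)

angle_head = AngleAwareHead()
pred_delta = angle_head(torch.randn(4, 256), torch.randn(4, 256))  # 4 twin pairs
print(pred_delta.shape)  # torch.Size([4])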
Step seven: the proposal regions of the original scene generated in step four, together with their position and class predictions, are taken as input; regions of interest are selected from the proposal regions, and the features of each region of interest are extracted from the 3D features of the original scene according to the mapping of the region of interest. The target is then predicted from the region-of-interest features, and the class and bounding box information of the predicted target are output. During training, the RoI (Region of Interest) loss L_RoI is generated at the same time.
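A refinement head for step seven could pass the pooled region-of-interest features through shared fully connected layers and then branch into class scores and box residuals, as sketched below. The pooling that maps a region of interest back into the 3-dimensional features is detector-specific and is represented here only by the pooled feature tensor; all dimensions are assumptions.

import torch
import torch.nn as nn

class RoIRefineHead(nn.Module):
    """Second stage: refined class scores and box residuals from pooled RoI features."""

    def __init__(self, roi_feat_dim=256, num_classes=3, box_dim=7):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(roi_feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.cls_branch = nn.Linear(256, num_classes)  # refined class prediction
        self.reg_branch = nn.Linear(256, box_dim)      # residual relative to the proposal box

    def forward(self, roi_feats):
        x = self.shared(roi_feats)
        return self.cls_branch(x), self.reg_branch(x)

refine = RoIRefineHead()
cls_out, box_out = refine(torch.randn(100, 256))  # 100 regions of interest
print(cls_out.shape, box_out.shape)  # (100, 3) (100, 7)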
Step eight: joint training with normalized losses.
During training, the losses generated in steps four, five, six, and seven are used to supervise the network. To avoid the conflict that can arise between the main task and the proxy tasks in conventional self-supervised learning, the main task and the proxy tasks are trained jointly, and the contribution of the proxy tasks is controlled with normalized loss coefficients, as shown in the following formula:
L = αL_SI + (1 - α)L_RPN + βL_AA + (1 - β)L_RoI
where L is the total loss used to supervise the network, and α and β are the normalized loss coefficients of the two self-supervised proxy tasks. The structure imagination task acts on the first stage of the detection network, and the angle awareness task acts on the second stage.
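The normalized combination is a direct weighted sum; the sketch below assumes the four component losses are already available as scalars and uses example coefficient values, which the patent does not fix.

def total_loss(l_si, l_rpn, l_aa, l_roi, alpha=0.2, beta=0.2):
    """Normalized joint loss: each proxy task shares a unit weight budget with the
    detection loss of its stage (structure imagination with the RPN loss, angle
    awareness with the RoI loss)."""
    return alpha * l_si + (1.0 - alpha) * l_rpn + beta * l_aa + (1.0 - beta) * l_roi

print(total_loss(l_si=0.8, l_rpn=1.2, l_aa=0.5, l_roi=0.9))

Keeping α and β well below 1 shifts each stage's weight budget toward its detection loss, which matches the stated goal of preventing the proxy tasks from suppressing the main task.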
The inference stage does not involve the self-supervised proxy tasks, so the method improves the detection accuracy of the detector without adding any computational cost at inference time. At inference, steps four, five, six, and seven no longer generate loss values but directly produce the corresponding prediction results; the final prediction is produced by step seven, which outputs the class label and 3D bounding box of each predicted target.

Claims (1)

1. A two-stage three-dimensional point cloud target detection method based on self-supervised learning, characterized by comprising the following steps:
Step one: for a given iteration, an original point cloud scene is input; a subset of the targets in the original scene is randomly selected, and each selected target is randomly rotated by a different angle about its own local coordinate system (that is, each target receives a different rotation angle); the scene obtained after rotation is the reconstructed scene corresponding to the original scene; a target before rotation and its rotated counterpart are called twin targets, the original scene and the corresponding reconstructed scene are called twin scenes, and the twin scenes are output to the next module.
Step two: point cloud space voxelization, in which the twin scenes are partitioned with a fixed voxel size and the partitioned point cloud space is converted into a regular three-dimensional voxel space.
Step three: 3D sparse feature extraction and 2D feature map generation, in which features are extracted from the regular voxel space with sparse convolution and submanifold convolution, the feature maps are repeatedly downsampled by stacked convolution layers to obtain 1×, 2×, 4×, and 8× downsampled 3D feature maps, and the 8× downsampled feature maps of the twin scenes are concatenated along the z-axis in the feature dimension to obtain the 2D feature maps of the twin scenes.
Step four: proposal region generation in the original scene, in which 2D convolution is applied to the 2D feature map of the original scene to generate position and class predictions of proposal regions for each pixel of the feature map.
Step five: a structure imagination task, in which the positions and classes of the proposal regions in the reconstructed scene are predicted from the 2D feature map of the reconstructed scene with the same convolutional network as in step four.
Step six: according to the proposal regions generated in steps four and five, twin proposal regions matching the twin targets are generated; the corresponding features of the twin targets are extracted from the 3D features of the twin scenes according to the mapping of the twin proposal regions, and these features are concatenated along the feature dimension to obtain the difference features of the twin targets; finally, a fully connected layer predicts the heading angle difference of the twin targets from the difference features.
Step seven: regions of interest are selected from the proposal regions of the original scene generated in step four based on their position and class predictions, the features of each region of interest are extracted from the 3D features of the original scene according to the mapping of the region of interest, the target is then predicted from the region-of-interest features, and the class and bounding box information of the predicted target are output.
Step eight: to avoid the conflict that can arise between the main task and the proxy tasks in conventional self-supervised learning, joint training with normalized losses is performed, and normalized loss coefficients control the contribution of the proxy tasks.
CN202110931081.7A 2021-08-13 2021-08-13 Three-dimensional point cloud two-stage target detection method based on self-supervision learning Active CN113657246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110931081.7A CN113657246B (en) 2021-08-13 2021-08-13 Three-dimensional point cloud two-stage target detection method based on self-supervision learning


Publications (2)

Publication Number Publication Date
CN113657246A (en) 2021-11-16
CN113657246B CN113657246B (en) 2023-11-21

Family

ID=78479885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110931081.7A Active CN113657246B (en) 2021-08-13 2021-08-13 Three-dimensional point cloud two-stage target detection method based on self-supervision learning

Country Status (1)

Country Link
CN (1) CN113657246B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210042929A1 (en) * 2019-01-22 2021-02-11 Institute Of Automation, Chinese Academy Of Sciences Three-dimensional object detection method and system based on weighted channel features of a point cloud
CN110930452A (en) * 2019-10-23 2020-03-27 同济大学 Object pose estimation method based on self-supervision learning and template matching
CN111476822A (en) * 2020-04-08 2020-07-31 浙江大学 Laser radar target detection and motion tracking method based on scene flow
CN113221962A (en) * 2021-04-21 2021-08-06 哈尔滨工程大学 Three-dimensional point cloud single-stage target detection method for decoupling classification and regression tasks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宋一凡; 张鹏; 宗立波; 马波; 刘立波: "Improved 3D object detection method based on redundant point filtering" (改进的基于冗余点过滤的3D目标检测方法), Journal of Computer Applications (计算机应用), no. 09

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494609A (en) * 2022-04-02 2022-05-13 中国科学技术大学 3D target detection model construction method and device and electronic equipment
CN114494609B (en) * 2022-04-02 2022-09-06 中国科学技术大学 3D target detection model construction method and device and electronic equipment

Also Published As

Publication number Publication date
CN113657246B (en) 2023-11-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant