CN117649515A - Digital twinning-based semi-supervised 3D target detection method, system and equipment


Info

Publication number
CN117649515A
CN117649515A
Authority
CN
China
Prior art keywords
face
module
distribution
point
uncertainty
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311546436.6A
Other languages
Chinese (zh)
Inventor
张天柱
杨文飞
潘晓扬
张哲�
王诗良
吴枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deep Space Exploration Laboratory Tiandu Laboratory
Original Assignee
Deep Space Exploration Laboratory Tiandu Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deep Space Exploration Laboratory Tiandu Laboratory filed Critical Deep Space Exploration Laboratory Tiandu Laboratory
Priority to CN202311546436.6A priority Critical patent/CN117649515A/en
Publication of CN117649515A publication Critical patent/CN117649515A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a digital twinning-based semi-supervised 3D target detection method, system and equipment, belonging to the field of computer vision. The method comprises: receiving point clouds and inputting them into a preset teacher model and a preset student model for preprocessing, obtaining an uncertainty prediction result from the teacher model and from the student model respectively; performing pseudo-label screening and weight assignment on the teacher model's uncertainty prediction results obtained from preprocessing; supervising the unlabeled data of the student model with the pseudo-labels and weight scores, supervising the labeled data of the student model with the Ground-Truth, and applying IoU-guided NMS (non-maximum suppression) to the teacher model's uncertainty prediction results to obtain the final detection results. By designing a face-aware uncertainty estimation method and a pseudo-label screening strategy, the invention ultimately helps the model locate object instances and identify object categories accurately and efficiently.

Description

Digital twinning-based semi-supervised 3D target detection method, system and equipment
Technical Field
The invention relates to the field of computer vision, in particular to a semi-supervised 3D target detection method, system and equipment based on digital twinning.
Background
3D target detection is a fundamental task for 3D scene understanding, aiming to predict the semantic label and spatial bounding box of each object in a point cloud scene. With the popularity of AR/VR, 3D indoor scanning, and autonomous driving, 3D object detection has become a key technology for scene understanding. Over the past decades, many fully supervised 3D object detection methods have been proposed, bringing significant progress to this field. However, these methods rely on large amounts of carefully annotated 3D scene data, which are expensive and time-consuming to collect. To reduce the high annotation cost of fully supervised 3D object detection methods, semi-supervised methods that combine a small amount of labeled data with a large amount of unlabeled data for model training have gained increasing attention.
Currently, semi-supervised 3D target detection methods fall broadly into two categories: consistency-constraint-based methods and pseudo-label-based methods. The core idea of consistency-constraint-based methods is to encourage consistent predictions for data under different data augmentations. Specifically, differently augmented copies of the data are fed into the teacher model and the student model respectively, each model produces predictions for its augmented input, and a consistency loss constrains them. Pseudo-label-based methods aim to select high-quality pseudo-labels from the model's predictions on unlabeled data, which are then combined with labeled data for model training. Existing pseudo-label-based methods use global metric scores (IoU, classification confidence, voting scores, etc.) to select pseudo-labels; however, a pseudo-label with a higher global metric score may not cover every face of an object well, while a pseudo-label with a lower global metric score may still provide correct predictions for certain faces. A digital twinning-based semi-supervised 3D object detection method is therefore proposed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a digital twinning-based semi-supervised 3D target detection method that optimizes the three processes of pseudo-label prediction, quality evaluation, and screening, achieving high-precision 3D target detection results.
The aim of the invention can be achieved by the following technical scheme:
In a first aspect, the present application proposes a digital twinning-based semi-supervised 3D target detection method, comprising:
receiving point clouds and inputting them into a preset teacher model and a preset student model for preprocessing, obtaining an uncertainty prediction result from the teacher model and from the student model respectively, wherein the preprocessing comprises the following steps:
receiving an input point cloud, and obtaining the spatial positions and features of candidate points P_prop through the point cloud feature extraction network PointNet, a translation operation, and a voting mechanism;
based on the features and spatial positions of the candidate points P_prop, using a multi-layer perceptron to predict the category of each object in the scene and the spatial distribution of each face of its bounding box;
obtaining the geometric features of each face using a face-based pooling operation combined with the spatial positions of the predicted bounding boxes;
concatenating the geometric features and the discrete probability distribution features of each face, feeding them into a multi-layer perceptron, and outputting an uncertainty prediction result;
performing pseudo-label screening and weight assignment on the teacher model's uncertainty prediction results obtained from preprocessing;
supervising the unlabeled data of the student model with the pseudo-labels and weight scores, and supervising the labeled data of the student model with the Ground-Truth;
applying IoU-guided NMS to the teacher model's uncertainty prediction results to obtain the final detection results.
In some embodiments, the input point cloud yields the spatial positions and features of the candidate points through the point cloud feature extraction network PointNet, the translation operation, and the voting mechanism;
wherein seed points P_seed are obtained from the input point cloud by furthest point sampling, and each seed point P_seed aggregates the features of surrounding points using k nearest neighbors and a multi-layer perceptron; the feature-aggregated seed points serve as the input point cloud of the next stage, and this is repeated twice;
the features of the finally output seed points P_seed are used to predict the probability that each is a foreground object point and its distance to the object center, and its spatial position is translated to the object center. The translated seed points P_seed are called candidate points P_prop.
In some embodiments, using a multi-layer perceptron to predict the spatial distribution of each face of the bounding box and the category of objects in the scene comprises the following steps:
We design a face-aware parameterization method to represent bounding boxes. Specifically, given the position and features of a candidate point P_prop, instead of predicting the offset to the object center and the object size, we predict the distance from the candidate point P_prop to each face. Based on the observation that a predicted probability distribution can measure uncertainty, we change the bounding box parameterization from deterministic to probabilistic. We divide the range of distances from each face to the candidate point P_prop into discrete bins, predict the probability that the face falls in each bin, and obtain the spatial position of the face as the expectation of the distribution:
ŷ_s = Σ_i s_i · P(s = s_i)
where s_i is the distance value represented by each bin, P is the predicted probability, and ŷ_s is the spatial position of the predicted face;
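As an illustrative sketch (not the patent's reference implementation), the distance-distribution head and its expectation can be written in PyTorch; the class name, feature dimension, bin count, and distance range below are assumed hyperparameters:

```python
import torch
import torch.nn as nn

class FaceDistributionHead(nn.Module):
    """Predicts a discrete distribution over candidate-to-face distances
    for each of the 6 bounding-box faces, then takes its expectation."""
    def __init__(self, feat_dim=128, num_bins=12, max_dist=3.0):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 6 * num_bins),  # 6 faces x num_bins logits
        )
        # s_i: the distance value each bin represents (uniform spacing assumed)
        self.register_buffer("bin_values", torch.linspace(0.0, max_dist, num_bins))
        self.num_bins = num_bins

    def forward(self, f_prop):  # f_prop: (N, feat_dim) candidate features
        logits = self.mlp(f_prop).view(-1, 6, self.num_bins)
        probs = logits.softmax(dim=-1)  # P(s = s_i) per face
        # Expected face distance: y_hat_s = sum_i s_i * P(s = s_i)
        y_hat = (probs * self.bin_values).sum(dim=-1)  # (N, 6)
        return y_hat, probs
```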
In some embodiments, the face-based pooling operation yields the geometric features of the faces, comprising the following steps:
The face-based pooling operation requires selecting face points P_side, which are virtual grid points covering a particular face of the object. More specifically, taking the front face of the object as an example, we divide the object's width and height into D segments each. We then generate 2×D×D grid points in front of and behind this face. For each face point we find its k nearest neighbors and propagate features from the seed points P_seed to the face point P_side by distance-weighted interpolation. We then feed all face points P_side into the point cloud feature extraction network PointNet to obtain the geometric feature F_geo.
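The grid generation and distance-weighted interpolation can be sketched as follows; the axis directions, the inverse-distance weights, and the small offset along the face normal are assumptions the patent leaves open:

```python
import torch

def face_pooling(face_center, width_dir, height_dir, normal, size_wh,
                 seed_xyz, seed_feat, D=4, k=3, offset=0.05):
    """Generate 2*D*D virtual grid points around one face and interpolate
    seed-point features onto them by inverse-distance weighting."""
    w = torch.linspace(-0.5, 0.5, D)
    h = torch.linspace(-0.5, 0.5, D)
    gw, gh = torch.meshgrid(w, h, indexing="ij")
    plane = (gw.reshape(-1, 1) * size_wh[0] * width_dir
             + gh.reshape(-1, 1) * size_wh[1] * height_dir)  # (D*D, 3)
    # Grid points slightly in front of and behind the face.
    pts = torch.cat([face_center + plane + offset * normal,
                     face_center + plane - offset * normal], dim=0)  # (2*D*D, 3)
    # k nearest seed points for each grid point.
    dist = torch.cdist(pts, seed_xyz)              # (2*D*D, S)
    knn_d, knn_i = dist.topk(k, largest=False)
    wgt = 1.0 / (knn_d + 1e-8)
    wgt = wgt / wgt.sum(dim=-1, keepdim=True)      # normalized interpolation weights
    feat = (seed_feat[knn_i] * wgt.unsqueeze(-1)).sum(dim=1)  # (2*D*D, C)
    return pts, feat  # subsequently fed into PointNet to produce F_geo
```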
In some embodiments, the geometric features and the discrete probability distribution features of the faces are concatenated, fed into a multi-layer perceptron, and an uncertainty prediction result is output, comprising the following steps:
The discrete probability distribution features are statistics of the face's discrete distribution; we choose the mean of the k largest probability values, the variance of the distribution, and all probability values of the discrete distribution as the distribution feature F_dist of the face.
The distribution feature F_dist of the face and the geometric feature F_geo of the face are concatenated;
the fused features are then fed into a multi-layer perceptron, and the uncertainty measure u_s of the face is output through a Sigmoid activation layer, with the formula:
u_s = Sigmoid(MLP(Cat(F_geo, F_dist)))
The uncertainty measure of each face can be obtained with the above computation. To guide the training of the uncertainty estimation module, we use the absolute error between the predicted face and the real face as the uncertainty label û_s for the constraint; the label is computed as:
û_s = MIN(α · |ŷ_s − y_s|, 1)
where y_s is the spatial position of the real face, ŷ_s is the spatial position of the predicted face, MIN takes the minimum, and α is a scaling coefficient, set to 4 in the invention. Furthermore, the invention uses the mean square error between the absolute error of the predicted and real faces and the predicted uncertainty u_s as the uncertainty regression loss L_uncer:
L_uncer = (1/|U|) · Σ_{u_s ∈ U} (û_s − u_s)²
where U = {u_s | s ∈ B} is the set of uncertainty estimates for all faces of the object's bounding box B.
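A hedged PyTorch sketch of the uncertainty head and the regression loss L_uncer; the feature dimensions, and our clamp-at-1 reading of the MIN/α label, are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyHead(nn.Module):
    """Predicts a per-face uncertainty u_s in [0, 1] from concatenated
    geometric (F_geo) and distribution (F_dist) features."""
    def __init__(self, geo_dim=128, dist_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(geo_dim + dist_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, f_geo, f_dist):
        u_s = torch.sigmoid(self.mlp(torch.cat([f_geo, f_dist], dim=-1)))
        return u_s.squeeze(-1)

def uncertainty_loss(u_pred, y_pred, y_true, alpha=4.0):
    """L_uncer: MSE between the scaled, clamped absolute face error
    (the uncertainty label) and the predicted uncertainty."""
    label = torch.clamp(alpha * (y_pred - y_true).abs(), max=1.0)  # assumed label form
    return F.mse_loss(u_pred, label.detach())
```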
In some embodiments, the pseudo-label screening and weight assignment comprises the following steps:
The pseudo-label screening and weight assignment module consists of three parts: a class-specific filter, an IoU-guided low-half NMS policy, and face-aware weight assignment. The class-specific filter uses the class confidence, foreground confidence, and IoU prediction to filter out low-quality pseudo-labels. The IoU-guided low-half NMS discards only the half of overlapping predictions with the lower predicted IoU. Face-aware weight assignment uses the uncertainty of each face to assign a weight to each face of the pseudo-label: the face uncertainty u_s is mapped through a scale coefficient α_2 to a quality score (weight) q_s, and the loss is then weighted with the quality scores.
q_B is the mean of the q_s values and reflects the global localization quality of the bounding box. In this way we reduce the interference of poorly localized faces in model training.
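The patent's weight formula itself is given as an image not reproduced here; as one loudly assumed stand-in that maps low face uncertainty to a weight near 1, an exponential decay illustrates the weighting:

```python
import torch

def face_quality_scores(u_s, alpha2=2.0):
    """Map per-face uncertainty u_s in [0, 1] to quality scores q_s.
    The exponential form is an assumed stand-in for the patent's formula."""
    q_s = torch.exp(-alpha2 * u_s)   # low uncertainty -> weight near 1
    q_b = q_s.mean(dim=-1)           # global box quality: mean over faces
    return q_s, q_b
```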
In some embodiments, supervising the unlabeled data of the student model with pseudo-labels and weight scores and supervising the labeled data of the student model with the Ground-Truth comprises the following steps:
Since we decouple the bounding box into a separate regression problem for each face, generic bounding box regression losses do not fit the localization problem posed by the invention well. We therefore use a rotated IoU loss and a face-aware smooth L1 loss in the face-aware network.
The face-aware smooth L1 loss L_reg(s) focuses on the local localization of each face, which benefits the subsequent uncertainty prediction. The rotated IoU loss L_IoU(B) focuses on the global localization of the bounding box and is robust to changes in shape and scale.
For unlabeled data, we use the pseudo-labels as supervisory signals, weighted with the quality scores.
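A sketch of the quality-weighted supervision on unlabeled data; `rotated_iou_loss` is a hypothetical helper standing in for the rotated IoU loss, whose implementation the patent does not give:

```python
import torch
import torch.nn.functional as F

def unlabeled_face_loss(pred_faces, pseudo_faces, q_s, q_b,
                        pred_boxes=None, pseudo_boxes=None, lam=1.0):
    """Quality-weighted face-aware smooth L1 loss (plus an optional rotated
    IoU term via an assumed helper) against teacher pseudo-labels."""
    # Per-face smooth L1, weighted by the per-face quality score q_s.
    per_face = F.smooth_l1_loss(pred_faces, pseudo_faces, reduction="none")
    loss = (q_s * per_face).sum() / q_s.sum().clamp(min=1e-8)
    if pred_boxes is not None:
        # Hypothetical rotated-IoU loss, weighted by the global score q_b.
        loss = loss + lam * (q_b * rotated_iou_loss(pred_boxes, pseudo_boxes)).mean()
    return loss
```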
In some embodiments, applying the IoU-guided NMS to the output of the teacher model in the inference stage comprises the following steps:
The detection model outputs the bounding box of each object, the object's category, the uncertainty of each face of the object, and the object's IoU prediction. The invention uses the object's IoU prediction as an additional reference value for the non-maximum suppression algorithm: the object's IoU prediction is multiplied by its class confidence to form the object's total confidence, which is fed into the non-maximum suppression algorithm, filtering out lower-quality duplicate predictions and yielding high-precision detection results.
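A minimal greedy sketch of the IoU-guided NMS; `pairwise_iou_3d` is a hypothetical helper for oriented-box overlap, and the threshold is a placeholder:

```python
import torch

def iou_guided_nms(boxes, cls_conf, iou_pred, iou_thresh=0.25):
    """Rank boxes by total confidence = cls_conf * iou_pred, then apply
    standard greedy NMS. `pairwise_iou_3d` is an assumed helper."""
    total_conf = cls_conf * iou_pred
    order = total_conf.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        ious = pairwise_iou_3d(boxes[i:i + 1], boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thresh]
    return torch.tensor(keep, dtype=torch.long)
```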
In a second aspect, the present application also proposes a semi-supervised 3D object detection system based on face uncertainty estimation, comprising:
a preprocessing module for receiving point clouds, inputting them into a preset teacher model and a preset student model for preprocessing, and obtaining an uncertainty prediction result from the teacher model and from the student model respectively; the preprocessing module comprises a candidate point feature extraction module, a face spatial distribution prediction module, a face geometric feature acquisition module, and a face discrete distribution prediction module;
the candidate point feature extraction module is used to receive the input point cloud and obtain the spatial positions and features of candidate points P_prop through the point cloud feature extraction network PointNet, the translation operation, and the voting mechanism;
the face spatial distribution prediction module is used to predict, based on the features and spatial positions of the candidate points P_prop, the category of each object in the scene and the spatial distribution of each face of its bounding box using a multi-layer perceptron;
the face geometric feature acquisition module is used to obtain the face geometric features using the face-based pooling operation combined with the spatial positions of the predicted bounding boxes;
the face discrete distribution prediction module is used to concatenate the geometric features and the discrete probability distribution features of the faces, feed them into the multi-layer perceptron, and output an uncertainty prediction result;
the pseudo-label screening and weight assignment module is used to perform pseudo-label screening and weight assignment on the teacher model's uncertainty prediction results obtained from preprocessing;
and the output module is used to supervise the unlabeled data of the student model with the pseudo-labels and weight scores, and to supervise the labeled data of the student model with the Ground-Truth.
In a third aspect, the present application further proposes a terminal device, comprising a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein when the processor loads and executes the computer program, the digital twinning-based semi-supervised 3D object detection method described above is employed.
In a fourth aspect, the present application further proposes a computer-readable storage medium in which a computer program is stored, wherein when the computer program is loaded and executed by a processor, the digital twinning-based semi-supervised 3D object detection method described above is employed.
The invention has the beneficial effects that:
the patent provides a semi-supervised 3D target detection method, system and equipment based on digital twinning. In order to solve the problems of poor quality of pseudo labels generated by a teacher model and adverse model training, a prediction module and an uncertainty prediction module of discrete distribution of one surface are designed to ensure that the positioning problem of an object boundary box can be decoupled into the positioning problem of a plurality of surfaces, and the prediction reliability evaluation of each surface can be given. In addition, a pseudo tag screening and weight distribution module is designed to inhibit interference of false positioning of the face in the pseudo tag on a model training process so as to obtain better target detection performance. Through the design of this patent, the unreasonable and insufficient problem of effective information utilization of dataset of pseudo-label selection in the 3D target has been solved to a great extent, and accurate, the efficient detection object instance of model is finally helped, discernment object class.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a block diagram of a digital twinning-based semi-supervised 3D object detection system of the present application;
FIG. 2 is a flow chart of face aware pooling operations and uncertainty prediction in an embodiment of the present application;
FIG. 3 is a flow chart of the digital twinning-based semi-supervised 3D object detection method.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Referring to FIGS. 2-3, the present application proposes a digital twinning-based semi-supervised 3D target detection method, comprising:
receiving point clouds and inputting them into a preset teacher model and a preset student model for preprocessing, obtaining an uncertainty prediction result from the teacher model and from the student model respectively, wherein the preprocessing comprises the following steps:
receiving an input point cloud, and obtaining the spatial positions and features of candidate points P_prop through the point cloud feature extraction network PointNet, a translation operation, and a voting mechanism;
based on the features and spatial positions of the candidate points P_prop, using a multi-layer perceptron to predict the category of each object in the scene and the spatial distribution of each face of its bounding box;
obtaining the geometric features of each face using a face-based pooling operation combined with the spatial positions of the predicted bounding boxes;
concatenating the geometric features and the discrete probability distribution features of each face, feeding them into a multi-layer perceptron, and outputting an uncertainty prediction result;
performing pseudo-label screening and weight assignment on the teacher model's uncertainty prediction results obtained from preprocessing;
supervising the unlabeled data of the student model with the pseudo-labels and weight scores, and supervising the labeled data of the student model with the Ground-Truth;
applying IoU-guided NMS to the teacher model's uncertainty prediction results to obtain the final detection results.
In some embodiments, the input point cloud yields the spatial positions and features of the candidate points through the point cloud feature extraction network PointNet, the translation operation, and the voting mechanism;
wherein seed points P_seed are obtained from the input point cloud by furthest point sampling, and each seed point P_seed aggregates the features of surrounding points using k nearest neighbors and a multi-layer perceptron; the feature-aggregated seed points serve as the input point cloud of the next stage, and this is repeated twice;
the features of the finally output seed points P_seed are used to predict the probability that each is a foreground object point and its distance to the object center, and its spatial position is translated to the object center. The translated seed points P_seed are called candidate points P_prop.
In some embodiments, using a multi-layer perceptron to predict the spatial distribution of each face of the bounding box and the category of objects in the scene comprises the following steps:
We design a face-aware parameterization method to represent bounding boxes. Specifically, given the position and features of a candidate point P_prop, instead of predicting the offset to the object center and the object size, we predict the distance from the candidate point P_prop to each face. Based on the observation that a predicted probability distribution can measure uncertainty, we change the bounding box parameterization from deterministic to probabilistic. We divide the range of distances from each face to the candidate point P_prop into discrete bins, predict the probability that the face falls in each bin, and obtain the spatial position of the face as the expectation of the distribution:
ŷ_s = Σ_i s_i · P(s = s_i)
where s_i is the distance value represented by each bin, P is the predicted probability, and ŷ_s is the spatial position of the predicted face;
In some embodiments, the face-based pooling operation yields the geometric features of the faces, comprising the following steps:
The face-based pooling operation requires selecting face points P_side, which are virtual grid points covering a particular face of the object. More specifically, taking the front face of the object as an example, we divide the object's width and height into D segments each. We then generate 2×D×D grid points in front of and behind this face. For each face point we find its k nearest neighbors and propagate features from the seed points P_seed to the face point P_side by distance-weighted interpolation. We then feed all face points P_side into the point cloud feature extraction network PointNet to obtain the geometric feature F_geo.
In some embodiments, the geometric features and the discrete probability distribution features of the faces are concatenated, fed into a multi-layer perceptron, and an uncertainty prediction result is output, comprising the following steps:
The discrete probability distribution features are statistics of the face's discrete distribution; we choose the mean of the k largest probability values, the variance of the distribution, and all probability values of the discrete distribution as the distribution feature F_dist of the face.
The distribution feature F_dist of the face and the geometric feature F_geo of the face are concatenated;
the fused features are then fed into a multi-layer perceptron, and the uncertainty measure u_s of the face is output through a Sigmoid activation layer, with the formula:
u_s = Sigmoid(MLP(Cat(F_geo, F_dist)))
The uncertainty measure of each face can be obtained with the above computation. To guide the training of the uncertainty estimation module, we use the absolute error between the predicted face and the real face as the uncertainty label û_s for the constraint; the label is computed as:
û_s = MIN(α · |ŷ_s − y_s|, 1)
where y_s is the spatial position of the real face, ŷ_s is the spatial position of the predicted face, MIN takes the minimum, and α is a scaling coefficient, set to 4 in the invention. Furthermore, the invention uses the mean square error between the absolute error of the predicted and real faces and the predicted uncertainty u_s as the uncertainty regression loss L_uncer:
L_uncer = (1/|U|) · Σ_{u_s ∈ U} (û_s − u_s)²
where U = {u_s | s ∈ B} is the set of uncertainty estimates for all faces of the object's bounding box B.
In some embodiments, the pseudo-label screening and weight assignment comprises the following steps:
The pseudo-label screening and weight assignment module consists of three parts: a class-specific filter, an IoU-guided low-half NMS policy, and face-aware weight assignment. The class-specific filter uses the class confidence, foreground confidence, and IoU prediction to filter out low-quality pseudo-labels. The IoU-guided low-half NMS discards only the half of overlapping predictions with the lower predicted IoU. Face-aware weight assignment uses the uncertainty of each face to assign a weight to each face of the pseudo-label: the face uncertainty u_s is mapped through a scale coefficient α_2 to a quality score (weight) q_s, and the loss is then weighted with the quality scores.
q_B is the mean of the q_s values and reflects the global localization quality of the bounding box. In this way we reduce the interference of poorly localized faces in model training.
In some embodiments, supervising the unlabeled data of the student model with pseudo-labels and weight scores and supervising the labeled data of the student model with the Ground-Truth comprises the following steps:
Since we decouple the bounding box into a separate regression problem for each face, generic bounding box regression losses do not fit the localization problem posed by the invention well. We therefore use a rotated IoU loss and a face-aware smooth L1 loss in the face-aware network.
The face-aware smooth L1 loss L_reg(s) focuses on the local localization of each face, which benefits the subsequent uncertainty prediction. The rotated IoU loss L_IoU(B) focuses on the global localization of the bounding box and is robust to changes in shape and scale.
For unlabeled data, we use the pseudo-labels as supervisory signals, weighted with the quality scores.
In some embodiments, applying the IoU-guided NMS to the output of the teacher model in the inference stage comprises the following steps:
The detection model outputs the bounding box of each object, the object's category, the uncertainty of each face of the object, and the object's IoU prediction. The invention uses the object's IoU prediction as an additional reference value for the non-maximum suppression algorithm: the object's IoU prediction is multiplied by its class confidence to form the object's total confidence, which is fed into the non-maximum suppression algorithm, filtering out lower-quality duplicate predictions and yielding high-precision detection results.
The embodiment of the application discloses a semi-supervised 3D target detection system based on face uncertainty estimation, where a 3D target is the semantic label and bounding box of each object in the point cloud scene input to the system. The system comprises a candidate point feature extraction module, a face discrete distribution prediction module, an uncertainty prediction module, and a pseudo-label screening and weight assignment module.
Yet another embodiment of the present invention provides a semi-supervised 3D target detection system based on face uncertainty estimation, comprising:
a preprocessing module for receiving point clouds, inputting them into a preset teacher model and a preset student model for preprocessing, and obtaining an uncertainty prediction result from the teacher model and from the student model respectively; the preprocessing module comprises a candidate point feature extraction module, a face spatial distribution prediction module, a face geometric feature acquisition module, and a face discrete distribution prediction module;
the candidate point feature extraction module is used to receive the input point cloud and obtain the spatial positions and features of candidate points P_prop through the point cloud feature extraction network PointNet, the translation operation, and the voting mechanism;
the face spatial distribution prediction module is used to predict, based on the features and spatial positions of the candidate points P_prop, the category of each object in the scene and the spatial distribution of each face of its bounding box using a multi-layer perceptron;
the face geometric feature acquisition module is used to obtain the face geometric features using the face-based pooling operation combined with the spatial positions of the predicted bounding boxes;
the face discrete distribution prediction module is used to concatenate the geometric features and the discrete probability distribution features of the faces, feed them into the multi-layer perceptron, and output an uncertainty prediction result;
the pseudo-label screening and weight assignment module is used to perform pseudo-label screening and weight assignment on the teacher model's uncertainty prediction results obtained from preprocessing;
and the output module is used to supervise the unlabeled data of the student model with the pseudo-labels and weight scores, and to supervise the labeled data of the student model with the Ground-Truth.
The overall execution flow of the system is shown in FIG. 1. Suppose the input point cloud has N points, each containing position (x, y, z) information. First, we use the candidate point feature extraction module, based on the point cloud feature extraction network PointNet, to encode the position information and obtain the candidate point features F_prop and positions P_prop. Next, we feed the feature-aggregated candidate points P_prop into the face discrete distribution prediction module, obtaining for each candidate point the probability distributions of the object's six faces, the object's orientation angle, and the object category. Then, we feed the distributions of all faces of the object, together with the geometric features obtained by the face-based pooling operation, into the uncertainty prediction module, obtaining an uncertainty measure for each face of the object. Further, we use the pseudo-label screening and weight assignment module to screen the teacher model's output, using high-quality pseudo-label information to train the student model. The supervision of the final student model is the GT constraint on labeled data and the teacher model's pseudo-label constraint on unlabeled data; the parameters of the teacher model are dynamically updated from the parameters of the student model. In the inference stage, the teacher model outputs the final prediction results after IoU-guided NMS removes duplicate prediction boxes. The modules are detailed as follows:
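The patent states only that the teacher's parameters are dynamically updated from the student's; a common realization of such coupling is an exponential moving average (EMA), sketched below under that assumption:

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.999):
    """Dynamically update teacher parameters from the student.
    The EMA form is an assumption; the patent only states that the
    teacher is updated from the student's parameters."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```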
and a candidate point feature extraction module. The candidate point feature extraction module mainly comprises three parts: the point feature extraction network module PointNet, translation operation and voting operation.
The point feature extraction network module PointNet: the point feature extraction network module aims at aggregating seed points P seed Mainly using a multi-layer perceptron and the furthest point sampling to realize multi-scale feature extraction;
where max is the maximum pooling, MLP is the multi-layer perceptron, FPS is the furthest point sample, and K represents the K nearest neighbors of the seed point.
The translation operation feeds the seed point features into a feedforward network and outputs the offset from the current position to the object center, yielding the candidate point positions P_prop:
P_prop = P_seed + F_seed · W_1 + B_1
The voting operation feeds the seed point features into a feedforward network and outputs the probability P_in that each point lies inside an object:
P_in = F_seed · W_2 + B_2
where W_1, W_2 are the weights of the linear layers and B_1, B_2 are the biases of the linear layers.
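These two heads can be written as linear layers in PyTorch; the feature dimension is assumed, and the sigmoid on the vote output is our assumption, since the patent's formula is linear while the text describes a probability:

```python
import torch
import torch.nn as nn

class TranslateAndVote(nn.Module):
    """Linear heads realizing P_prop = P_seed + F_seed*W1 + B1 (translation)
    and P_in = F_seed*W2 + B2 (foreground vote)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.offset_head = nn.Linear(feat_dim, 3)  # W1, B1
        self.vote_head = nn.Linear(feat_dim, 1)    # W2, B2

    def forward(self, p_seed, f_seed):             # (N, 3), (N, feat_dim)
        p_prop = p_seed + self.offset_head(f_seed)  # translate toward object center
        p_in = torch.sigmoid(self.vote_head(f_seed)).squeeze(-1)  # foreground prob.
        return p_prop, p_in
```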
Face discrete distribution prediction module. The face discrete distribution prediction module mainly consists of two parts: an object position prediction module and an object category prediction module.
Object position prediction module: given the position and features of a candidate point P_prop, instead of predicting the offset to the object center and the object size, we predict the distance from the candidate point P_prop to each face. Based on the observation that a predicted probability distribution can measure uncertainty, we change the bounding box parameterization from deterministic to probabilistic. We divide the range of distances from each face to the candidate point P_prop into discrete bins and predict the probability that the face falls in each bin:
P(s = s_i) = MLP(F_prop)
We then calculate the expected position of the face using the spatial probability distribution:
ŷ_s = Σ_i s_i · P(s = s_i)
where s_i is the distance value represented by each bin, P is the predicted probability, and ŷ_s is the spatial position of the predicted face;
Object category prediction module: given a candidate point P_prop, we predict the object category as:
P_cls = MLP(F_prop)
That is, through the above modules we obtain the category and spatial probability distribution of each candidate point.
Uncertainty prediction module. The uncertainty prediction module mainly consists of two parts: a face geometric feature extraction module and an uncertainty prediction network module.
Face geometric feature extraction module: this requires selecting face points P_side, which are virtual grid points covering a particular face of the object. More specifically, taking the front face of the object as an example, we divide the object's width and height into D segments each. We then generate 2×D×D grid points in front of and behind this face. For each face point we find its k nearest neighbors and propagate features from the seed points P_seed by distance-weighted interpolation to obtain the face point features F_side:
F_side = Σ_i F_i^seed · W_i
where F_i^seed is the feature of the i-th nearest neighbor and W_i is the weight of the distance-weighted interpolation. After obtaining the spatial positions and features of the face points, they are fed into the point feature extraction network PointNet to aggregate the geometric feature F_geo of the faces of the prediction box.
Uncertainty prediction network module: the discrete probability distribution features are statistics of the face's discrete distribution; we choose the mean of the k largest probability values, the variance of the distribution, and all probability values of the discrete distribution as the distribution feature F_dist of the face.
The distribution feature F_dist of the face and the geometric feature F_geo of the face are concatenated; the fused features are then fed into a multi-layer perceptron, and the uncertainty measure u_s of the face is output through a Sigmoid activation layer, with the formula:
u_s = Sigmoid(MLP(Cat(F_geo, F_dist)))
The uncertainty measure of each face can be obtained with the above computation. To guide the training of the uncertainty estimation module, we use the absolute error between the predicted face and the real face as the uncertainty label û_s for the constraint; the label is computed as:
û_s = MIN(α · |ŷ_s − y_s|, 1)
where y_s is the spatial position of the real face, ŷ_s is the spatial position of the predicted face, MIN takes the minimum, and α is a scaling coefficient, set to 4 in the invention. Furthermore, the invention uses the mean square error between the absolute error of the predicted and real faces and the predicted uncertainty u_s as the uncertainty regression loss L_uncer:
L_uncer = (1/|U|) · Σ_{u_s ∈ U} (û_s − u_s)²
where U = {u_s | s ∈ B} is the set of uncertainty estimates for all faces of the object's bounding box B.
Pseudo-label screening and weight assignment module. The pseudo-label screening and weight assignment module mainly consists of three parts: a class-specific filter, an IoU-guided low-half NMS module, and a face-aware weight assignment module.
Class-specific filter: it uses the class confidence, foreground confidence, and IoU prediction to filter out low-quality pseudo-labels. The thresholds on the classification score, foreground score, and IoU score are denoted τ_cls, τ_obj, and τ_IoU; a selected pseudo-label must satisfy:
P_cls > τ_cls, P_obj > τ_obj, P_IoU > τ_IoU
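A hedged sketch of the class-specific filtering step; the threshold values are placeholders, and per-class threshold tensors could be substituted for the scalars shown:

```python
import torch

def class_specific_filter(p_cls, p_obj, p_iou,
                          tau_cls=0.9, tau_obj=0.9, tau_iou=0.25):
    """Keep only predictions whose class, foreground, and IoU scores all
    exceed their thresholds (threshold values here are placeholders)."""
    keep = (p_cls > tau_cls) & (p_obj > tau_obj) & (p_iou > tau_iou)
    return keep.nonzero(as_tuple=True)[0]
```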
IoU-guided low-half NMS policy: noise in the pseudo-labels caused by duplicate bounding box predictions is suppressed by an IoU-guided low-half non-maximum suppression strategy. For a group of highly overlapping pseudo-labels, we discard only the half of the proposals with the lower predicted IoU;
Face-aware weight assignment module: the uncertainty of each face is used to assign a weight to each face of the pseudo-label: the face uncertainty u_s is mapped through a scale coefficient α_2 to a quality score (weight) q_s of the face, and the loss is finally weighted using the quality scores.
q_B is the mean of the q_s values and reflects the global localization quality of the bounding box. In this way we reduce the interference of poorly localized faces in model training.
The method can be widely applied in systems in fields such as autonomous driving, robotic arm grasping, and augmented reality, accurately locating and identifying objects in point cloud scenes. In practice, the method can be installed as software on front-end devices, robots, and autonomous vehicles to provide real-time object detection; it can also be installed on a back-end server to provide object localization and recognition results for large numbers of 3D point cloud scenes.
Table 1: comparison of experimental results on ScannetV2 dataset
As shown in Table 1, we compared our method with the current best-performing methods on the ScanNetV2 dataset. The results show that our method achieves the best performance on both AP@50 and AP@25 under different amounts of labeled data, verifying the effectiveness of our method.
The embodiment of the application also discloses a terminal device comprising a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein when the processor executes the computer program, any of the digital twinning-based semi-supervised 3D target detection methods described above is employed.
The terminal device may be a computer device such as a desktop computer, a notebook computer, or a cloud server. The terminal device includes, but is not limited to, a processor and a memory; for example, it may further include input/output devices, network access devices, a bus, and the like.
The processor may be a central processing unit (CPU) or, depending on actual use, another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; the general-purpose processor may be a microprocessor or any conventional processor, which is not limited in this application.
The memory may be an internal storage unit of the terminal device, for example a hard disk or memory of the terminal device; it may also be an external storage device of the terminal device, for example a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, or a flash memory card (FC) equipped on the terminal device; it may also be a combination of the internal storage unit and the external storage device. The memory is used to store the computer program and other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output, which is not limited in this application.
By storing any of the digital twinning-based semi-supervised 3D object detection methods of the embodiments in the memory of the terminal device and loading and executing it on the processor of the terminal device, the method is convenient to use.
The embodiment of the application also discloses a computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, any of the digital twinning-based semi-supervised 3D object detection methods of the embodiments is employed.
The computer program may be stored in a computer-readable medium. The computer program includes computer program code, which may be in source code form, object code form, executable file form, some intermediate form, etc. The computer-readable medium includes any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like; the computer-readable medium includes, but is not limited to, the above components.
By storing the digital twinning-based semi-supervised 3D object detection method of any of the embodiments in the computer-readable storage medium and loading and executing it on a processor, the storage and application of the method are facilitated.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims.

Claims (10)

1. A digital twinning-based semi-supervised 3D target detection method, characterized by comprising the following steps:
receiving point clouds and inputting them into a preset teacher model and a preset student model for preprocessing, obtaining an uncertainty prediction result from the teacher model and from the student model respectively, wherein the preprocessing comprises the following steps:
receiving an input point cloud, and obtaining the spatial positions and features of candidate points P_prop through the point cloud feature extraction network PointNet, a translation operation, and a voting mechanism;
based on the features and spatial positions of the candidate points P_prop, using a multi-layer perceptron to predict the category of each object in the scene and the spatial distribution of each face of its bounding box;
obtaining the geometric features of each face using a face-based pooling operation combined with the spatial positions of the predicted bounding boxes;
concatenating the geometric features and the discrete probability distribution features of each face, feeding them into a multi-layer perceptron, and outputting an uncertainty prediction result;
performing pseudo-label screening and weight assignment on the teacher model's uncertainty prediction results obtained from preprocessing;
supervising the unlabeled data of the student model with the pseudo-labels and weight scores, and supervising the labeled data of the student model with the Ground-Truth;
applying IoU-guided NMS to the teacher model's uncertainty prediction results to obtain the final detection results.
2. The digital twinning-based semi-supervised 3D target detection method according to claim 1, characterized in that obtaining the spatial positions and features of the candidate points from the input point cloud through the point cloud feature extraction network PointNet, the translation operation, and the voting mechanism comprises the following steps:
seed points are obtained from the input point cloud by furthest point sampling, and each seed point aggregates the features of surrounding points using k nearest neighbors and a multi-layer perceptron; the feature-aggregated seed points serve as the input point cloud of the next stage, and the above operations are repeated twice to obtain the finally output seed points;
the features of the finally output seed points are used to predict the probability that each is a foreground object point and its distance to the object center, and its spatial position is translated to the object center; the translated seed points are called candidate points P_prop.
3. The digital twinning-based semi-supervised 3D target detection method according to claim 1, characterized in that using the multi-layer perceptron to predict the spatial distribution of each face of the bounding box and the category of objects in the scene comprises the following:
given the position and features of a candidate point P_prop, the distance from the candidate point P_prop to each face is predicted; based on the observation that a predicted probability distribution measures uncertainty, the bounding box parameterization is changed from a deterministic to a probabilistic method; the range of distances from each face to the candidate point P_prop is divided into discrete bins, the probability that the face falls in each bin is predicted, and the spatial position of the face is obtained as the expectation of the distribution:
ŷ_s = Σ_i s_i · P(s = s_i)
where s_i is the distance value represented by each bin, P is the predicted probability, and ŷ_s is the spatial position of the predicted face.
4. The digital twinning-based semi-supervised 3D target detection method according to claim 1, characterized in that the face-based pooling operation yields the face geometric features, comprising the following steps:
the face-based pooling operation requires selecting face points P_side, where the face points P_side are virtual grid points covering a particular face of the object; specifically, for the front face of the object, the object's width and height are each divided into D segments; 2×D×D grid points are generated in front of and behind this face; for each face point P_side, its k nearest neighbors are found, and features are propagated from the seed points P_seed to the face point P_side by distance-weighted interpolation; all face points P_side are fed into the point cloud feature extraction network PointNet to obtain the geometric feature F_geo.
5. The digital twinning-based semi-supervised 3D target detection method according to claim 1, characterized in that the geometric features and the discrete probability distribution features of the faces are concatenated, fed into a multi-layer perceptron, and an uncertainty prediction result is output, comprising the following steps:
the discrete probability distribution features are statistics of the face's discrete distribution; the mean of the k largest probability values, the variance of the distribution, and all probability values of the discrete distribution are selected as the distribution feature F_dist of the face;
the distribution feature F_dist of the face and the geometric feature F_geo of the face are concatenated;
the fused features are fed into a multi-layer perceptron, and the uncertainty measure u_s of the face is output through a Sigmoid activation layer, with the formula:
u_s = Sigmoid(MLP(Cat(F_geo, F_dist)))
the obtained uncertainty measure of the face is constrained using the absolute error between the predicted face and the real face as the uncertainty label û_s, guiding the training of the uncertainty estimation module; the label is computed as:
û_s = MIN(α · |ŷ_s − y_s|, 1)
where y_s is the spatial position of the real face, ŷ_s is the spatial position of the predicted face, MIN takes the minimum, and α is a scaling coefficient; the mean square error between the absolute error of the predicted and real faces and the predicted uncertainty u_s is used as the uncertainty regression loss L_uncer:
L_uncer = (1/|U|) · Σ_{u_s ∈ U} (û_s − u_s)²
where U = {u_s | s ∈ B} is the set of uncertainty estimates for all faces of the object's bounding box B.
6. The digital twinning-based semi-supervised 3D target detection method of claim 1, wherein said pseudo tag screening and weight distribution is implemented by a pseudo tag screening and weight distribution module, said pseudo tag screening and weight distribution module class specific screener, ioU guided low half NMS policy and face aware weight distribution; the category-specific filter uses category confidence, foreground confidence, and IoU predictor to filter out low quality false labels; ioU guided low half NMS strategy the low half NMS guided by IoU discards half of the predicted IoU lower prediction results; the face perception weight distribution is carried out by using the uncertainty of the face as each face of the pseudo tag, and the weight distribution formula is as follows:
wherein q is s Is the mass fraction of the face, alpha 2 Is a scale factor, the loss is weighted using the quality score:
wherein q_B, the mean of q_s over the faces, reflects the global localization quality of the bounding box.
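A hypothetical sketch of the three steps of claim 6: a class-specific confidence filter, discarding the lower half of predictions by predicted IoU, and a per-face quality score derived from the face uncertainty. The mapping q_s = 1 − min(α_2 · u_s, 1) is an assumption, since the patent's exact formula for q_s is not reproduced in this extraction, as are the thresholds and function name:

```python
import torch

def screen_pseudo_labels(cls_conf, fg_conf, iou_pred, face_unc,
                         cls_thr=0.9, fg_thr=0.8, alpha2: float = 1.0):
    keep = (cls_conf > cls_thr) & (fg_conf > fg_thr)       # class-specific filter
    # IoU-guided lower-half suppression: drop the half with lower predicted IoU.
    if keep.any():
        median_iou = iou_pred[keep].median()
        keep = keep & (iou_pred >= median_iou)
    q_s = 1.0 - torch.clamp(alpha2 * face_unc, max=1.0)    # per-face quality score (assumed form)
    q_b = q_s.mean(dim=-1)                                 # q_B: box-level localization quality
    return keep, q_s, q_b
```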
7. The digital twinning-based semi-supervised 3D object detection method according to claim 1, wherein using pseudo labels and weight scores to supervise the student model on unlabeled data, and using Ground-Truth to supervise the student model on labeled data, comprises the steps of:
for the labeled data, a rotated IoU loss and a face-awareness-based smooth L1 loss are used in the face-aware network:
pseudo labels are used as the supervision signal for the unlabeled data, with the loss weighted by the quality scores:
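A minimal sketch of the two supervision branches: labeled data uses the Ground-Truth targets directly, while unlabeled data uses the screened pseudo labels with each face's loss scaled by its quality score q_s. The smooth L1 term here is standard; the rotated IoU term is omitted for brevity, and the function name is a stand-in:

```python
import torch
import torch.nn.functional as F

def face_supervision_loss(pred_faces, target_faces, q_s=None):
    per_face = F.smooth_l1_loss(pred_faces, target_faces, reduction="none")  # (N, 6) per-face errors
    if q_s is not None:                  # unlabeled branch: weight each face by its quality score
        per_face = per_face * q_s
    return per_face.mean()
```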
8. A semi-supervised 3D object detection system based on face uncertainty estimation, comprising:
the preprocessing module is used for receiving the point cloud and inputting it into a preset teacher model and a preset student model for preprocessing, obtaining the uncertainty prediction result of the teacher model and the uncertainty prediction result of the student model, respectively; the preprocessing module comprises a candidate point feature extraction module, a face spatial distribution prediction module, a face geometric feature acquisition module, and a face discrete distribution prediction module;
the candidate point feature extraction module is used for receiving the input point cloud and obtaining the spatial positions and features of the candidate points P_prop through the point cloud feature extraction network PointNet, a translation operation, and a voting mechanism;
the face spatial distribution prediction module is used for predicting, from the features and spatial positions of the candidate points P_prop using a multi-layer perceptron, the class of the object in the space and the spatial distribution of each face of its bounding box;
the face geometric feature acquisition module is used for obtaining the geometric features of the faces through the face-based pooling operation, combined with the spatial positions of the predicted bounding box;
the face discrete distribution prediction module is used for concatenating the geometric features and the discrete probability distribution features of the face, inputting them into the multi-layer perceptron, and outputting the uncertainty prediction result;
the pseudo label screening and weight assignment module is used for performing pseudo label screening and weight assignment on the uncertainty prediction result of the teacher model obtained through preprocessing;
and the output module is used for supervising the student model on unlabeled data using the pseudo labels and weight scores, and supervising the student model on labeled data using the Ground-Truth.
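A high-level sketch of how the modules of claim 8 could be wired into one training step under the teacher-student scheme. All function names and dictionary keys are the hypothetical helpers sketched earlier in this section, and the one-to-one alignment between teacher and student predictions is a simplification; a real system would match proposals between the two models:

```python
import torch

def training_step(student, teacher, labeled_pc, gt, unlabeled_pc):
    # Labeled branch: Ground-Truth supervision of the student model.
    pred_l = student(labeled_pc)
    loss_sup = face_supervision_loss(pred_l["faces"], gt["faces"])

    # Unlabeled branch: teacher predictions are screened into quality-weighted
    # pseudo labels that supervise the student model.
    with torch.no_grad():
        t = teacher(unlabeled_pc)
    keep, q_s, _ = screen_pseudo_labels(t["cls_conf"], t["fg_conf"],
                                        t["iou_pred"], t["face_unc"])
    pred_u = student(unlabeled_pc)
    loss_unsup = face_supervision_loss(pred_u["faces"][keep], t["faces"][keep],
                                       q_s[keep])
    return loss_sup + loss_unsup
```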
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when loading and executing the computer program, implements the digital twinning-based semi-supervised 3D object detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored therein, characterized in that the computer program, when loaded and executed by a processor, implements the digital twinning-based semi-supervised 3D object detection method according to any one of claims 1 to 7.
CN202311546436.6A 2023-11-20 2023-11-20 Digital twinning-based semi-supervised 3D target detection method, system and equipment Pending CN117649515A (en)

Priority Applications (1)

Application Number: CN202311546436.6A · Priority Date: 2023-11-20 · Filing Date: 2023-11-20 · Title: Digital twinning-based semi-supervised 3D target detection method, system and equipment


Publications (1)

Publication Number: CN117649515A · Publication Date: 2024-03-05

Family

ID=90042587


Country Status (1)

Country: CN — CN117649515A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
CN117975241A * — Priority Date: 2024-03-29 · Publication Date: 2024-05-03 · Assignee: Xiamen University · Title: Directional target segmentation-oriented semi-supervised learning method



Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination