CN114972758A - Instance segmentation method based on point cloud weak supervision - Google Patents

Instance segmentation method based on point cloud weak supervision

Info

Publication number
CN114972758A
Authority
CN
China
Prior art keywords
point cloud
point
points
image
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210629786.8A
Other languages
Chinese (zh)
Inventor
李怡康
石博天
李想
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202210629786.8A priority Critical patent/CN114972758A/en
Publication of CN114972758A publication Critical patent/CN114972758A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 - Proximity, similarity or dissimilarity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/7753 - Incorporation of unlabelled data, e.g. multiple instance learning [MIL]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations

Abstract

The invention relates to an instance segmentation method based on point cloud weak supervision, which comprises the following steps: projecting a lidar point cloud onto an image plane to form a projected point cloud; refining the projected point cloud by removing overlapped points caused by the parallax between the lidar and the camera, to obtain a refined point cloud; assigning foreground or background labels to the points in the refined point cloud; and training a segmenter using the refined point cloud with the foreground/background labels as a supervision signal, and performing instance segmentation and mask prediction with the trained segmenter.

Description

Instance segmentation method based on point cloud weak supervision
Technical Field
The invention relates to the field of artificial intelligence, in particular to a point cloud weak supervision-based instance segmentation method.
Background
In recent years, autonomous driving systems have attracted increasing attention in both academia and industry. Existing image instance segmentation techniques generally train a deep learning model with supervised learning, so that an instance mask representing the instance segmentation result of an image can be generated at inference time. High-quality instance segmentation can provide significant assistance to an autonomous driving system; for example, some algorithms use instance segmentation results to fuse lidar and image data, thereby improving the performance of cross-modal three-dimensional object detection.
However, to implement supervised learning, the cost of instance segmentation annotation on the training data set is extremely high; in an autonomous driving scene in particular, an image usually contains a large number of instances of people, vehicles, non-motorized vehicles and other obstacles. Producing high-quality annotations for millions of training samples therefore requires a great deal of manpower and material resources. Existing supervised instance segmentation methods, such as Mask R-CNN and CondInst, rely heavily on the quality and quantity of manual annotation, which makes it difficult for these methods to exploit larger-scale data.
To reduce this cost, weakly supervised instance segmentation methods such as BoxInst and PointSup have been developed. These methods use only partial manual annotation; although the cost is lower, additional manpower and time are still required for manual labeling, and their performance is limited. There is therefore a need to investigate instance segmentation techniques that are both low-cost and effective.
Disclosure of Invention
The object of the invention is to provide an instance segmentation method based on point cloud weak supervision, which can directly use the point cloud acquired by a lidar to guide the weakly supervised training of an instance segmentation model, without full mask annotation of the image, thereby improving the performance of the weakly supervised instance segmentation model without introducing additional manual annotation cost.
In a first aspect, to solve the problems existing in the prior art, the invention provides an instance segmentation method based on point cloud weak supervision, comprising:
projecting a lidar point cloud onto an image plane to form a projected point cloud;
refining the projected point cloud by removing overlapped points caused by the parallax between the lidar and the camera, to obtain a refined point cloud;
assigning foreground or background labels to the points in the refined point cloud; and
training a segmenter using the refined point cloud with the foreground/background labels as a supervision signal, and performing instance segmentation and mask prediction with the segmenter.
In one embodiment of the invention, before the step of projecting the lidar point cloud onto the image plane to form a projected point cloud, the method further comprises: inputting an image and extracting image features with an image feature extractor, wherein the image features serve as input features for training the segmenter; and
labeling a three-dimensional bounding box of each object in the image.
In one embodiment of the invention, a point error loss function and a graph consistency loss function are used to constrain the output of the segmenter when training the segmenter.
In one embodiment of the invention, projecting the lidar point cloud onto an image plane to form a projected point cloud comprises:
expressing the lidar point cloud in homogeneous coordinates as P_3d ∈ R^(4×N);
projecting the lidar point cloud from the lidar coordinate system to the camera coordinate system with a transformation matrix T ∈ R^(4×4), and then projecting it further onto the image plane with a camera matrix K ∈ R^(3×4), to form the projected point cloud:
P_2d = K · T · P_3d,
where P_2d is the set of points, expressed in homogeneous coordinates, obtained by projecting the lidar point cloud onto the image plane.
In one embodiment of the invention, refining the projected point cloud by removing overlapped points caused by the parallax between the lidar and the camera to obtain a refined point cloud comprises:
forming a sparse depth map D from each pixel P_2d obtained by projection onto the image plane and the ground-truth depth of the corresponding lidar point; and
traversing the entire sparse depth map with a two-dimensional sliding window w, wherein within each window the projected point cloud is divided according to relative depth into near points P_near and far points P_far, a point whose relative depth exceeds a depth threshold being a far point in P_far and a point whose relative depth does not exceed the depth threshold being a near point in P_near:
p(x, y) ∈ P_near  if  (d(x, y) − d_min) / (d_max − d_min) ≤ τ_depth,  and  p(x, y) ∈ P_far  otherwise,
where p(x, y) denotes the point in the lidar point cloud corresponding to the pixel with coordinates (x, y) in the two-dimensional sliding window w, τ_depth denotes a depth threshold with which relatively distant points can be filtered out, d(x, y) denotes the depth value of the pixel with coordinates (x, y), and d_min and d_max denote the minimum and maximum depth values within the two-dimensional sliding window w, respectively;
computing the minimum envelope of the near points P_near and removing, as overlapped points, the far points that fall within this envelope, to obtain the refined point cloud P_refine, wherein the overlapped points P_overlap are:
P_overlap = { p(x, y) ∈ P_far | x_min ≤ x ≤ x_max, y_min ≤ y ≤ y_max },
where x_min and x_max are the minimum and maximum values of the near points P_near on the x-axis, and y_min and y_max are the minimum and maximum values of the near points P_near on the y-axis.
In one embodiment of the invention, assigning foreground/background labels to the points in the refined point cloud comprises:
dividing the refined point cloud P_refine, according to its positional relationship with the three-dimensional bounding box, into points P_in inside the three-dimensional bounding box and points P_out outside the three-dimensional bounding box;
taking the points P_in inside the three-dimensional bounding box as positive samples and assigning them foreground labels, and taking a subset of the points of P_out around the three-dimensional bounding box as negative samples and assigning them background labels, the total number of positive and negative samples being s; and
propagating the pseudo labels of the positive and negative samples to the surrounding 8 pixels according to image feature similarity.
In one embodiment of the invention, taking a subset of the points of P_out around the three-dimensional bounding box as negative samples and assigning them background labels comprises:
first projecting the 8 vertices of the three-dimensional bounding box onto the image plane and then computing their minimum enclosing rectangle b; and
sampling, from the points of P_out whose projections fall within the enclosing rectangle b, the negative samples P_neg.
In one embodiment of the invention, propagating the pseudo labels of the positive and negative samples to the surrounding 8 pixels according to image feature similarity comprises:
when the image feature similarity exceeds a similarity threshold, propagating the label of a candidate point p_c selected from the positive and negative samples to the 8 pixels surrounding it on the image, so that these 8 pixels carry the same class label as the candidate point p_c, wherein the label propagation criterion is:
l(q) = l(p_c)  if  f(q)^T f(p_c) > τ_dense,  for q ∈ N_8(p_c),
where l(p) is the pseudo label assigned to point p, N_8(p_c) is the set of 8 pixels surrounding the candidate point p_c on the image, f(p) is the image feature of point p extracted by the pre-trained image feature extractor, and τ_dense is the similarity threshold.
In one embodiment of the invention, when training the segmenter, constraining the output of the segmenter with the point error loss function and the graph consistency loss function comprises:
constructing the point error loss function by bilinear interpolation, wherein the loss between the predicted mask and the pseudo labels is measured by the point error loss function:
L_point = −(1 / (K·|S|)) Σ_k Σ_s [ l_ks · log m̂(p_ks) + (1 − l_ks) · log(1 − m̂(p_ks)) ],
where K is the total number of instances in the image, S is the set of all points carrying pseudo labels, p_ks is the s-th point of the k-th instance, l_ks is the pseudo label of point p_ks, m̂(p_ks) is the predicted mask value at p_ks obtained by bilinear interpolation, and L_point denotes the point error loss.
In one embodiment of the invention, when training the segmenter, constraining the output of the segmenter with the point error loss function and the graph consistency loss function comprises:
constructing an undirected graph G = <V, E> from the refined point cloud P_refine, wherein the points of the refined point cloud P_refine serve as the nodes forming V and E is the set of edges, and whether an edge is formed between two nodes, and hence whether the two nodes carry the same pseudo label, is determined by the image feature similarity and the three-dimensional geometric feature similarity between the two nodes, the weighted sum of which is:
W_ij = w_1 · S_image(i, j) + w_2 · S_geometry(i, j),
where w_1 and w_2 are balance weights for the image feature similarity and the three-dimensional geometric feature similarity, S_image(i, j) and S_geometry(i, j) denote, respectively, the image feature similarity and the three-dimensional geometric feature similarity between nodes p_i and p_j, and W_ij denotes the overall graph similarity; when the overall graph similarity W_ij is greater than a similarity threshold τ, an edge is formed between the two nodes and they carry the same pseudo label, otherwise there is no connecting edge between the two nodes and their pseudo labels differ; and
constraining the output of the segmenter with the graph consistency loss function, based on the requirement that the predicted masks of the segmenter be close whenever a connecting edge exists between two nodes:
L_consistency = −(1 / N²) Σ_i Σ_j e_ij [ m̂_j · log m̂_i + (1 − m̂_j) · log(1 − m̂_i) ],
where N = |V| is the number of nodes in the undirected graph, e_ij indicates whether nodes p_i and p_j are connected by an edge, m̂_i and m̂_j are the predicted mask values of nodes p_i and p_j, respectively, and L_consistency denotes the graph consistency loss.
The invention has at least the following beneficial effects: the point cloud acquired by the lidar can directly guide the weakly supervised training of the instance segmenter without full mask annotation of the image, the performance of the weakly supervised instance segmentation model is further improved without introducing additional manual annotation cost, and the method has the potential to train the instance segmenter with massive unlabeled lidar data.
Drawings
To further clarify the above and other advantages and features of embodiments of the present invention, a more particular description of embodiments of the invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope.
FIG. 1 shows a schematic diagram of the instance segmentation process of the prior-art BoxInst method;
FIG. 2 shows a schematic diagram of point sampling during instance segmentation according to the prior-art PointSup method;
FIG. 3 illustrates the flow of an instance segmentation method based on point cloud weak supervision according to one embodiment of the invention; and
FIG. 4 shows a schematic diagram of a point tag assignment module, according to one embodiment of the invention.
Detailed Description
It should be noted that the components in the figures may be exaggerated and not necessarily to scale for illustrative purposes.
In the present invention, the embodiments are only intended to illustrate the aspects of the present invention, and should not be construed as limiting.
In the present invention, the terms "a" and "an" do not exclude the presence of a plurality of elements, unless otherwise specified.
It is further noted herein that in embodiments of the present invention, only a portion of the components or assemblies may be shown for clarity and simplicity, but those of ordinary skill in the art will appreciate that, given the teachings of the present invention, required components or assemblies may be added as needed in a particular scenario.
It is also noted herein that, within the scope of the present invention, the terms "same", "equal", and the like do not mean that the two values are absolutely equal, but allow some reasonable error, that is, the terms also encompass "substantially the same", "substantially equal".
It should also be noted herein that in the description of the present invention, the terms "central", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc., indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In addition, the embodiments of the present invention describe the process steps in a specific order, however, this is only for convenience of distinguishing the steps, and does not limit the order of the steps.
Existing image instance segmentation techniques generally train a deep learning model with supervised learning, so that an instance mask (Instance Mask) representing the instance segmentation result of an image can be generated at inference time. In an autonomous driving scene, an image usually contains a large number of instances of people, vehicles, non-motorized vehicles and other obstacles, and the cost of instance segmentation annotation of a training data set for supervised learning is extremely high.
Because supervised learning depends heavily on training-data annotation, the prior art turns to weakly supervised learning, applying only low-cost weak annotations to the raw data and completing the weakly supervised instance segmentation task with weakly supervised learning techniques built on these weak annotations. Such techniques mainly include the BoxInst method and the PointSup method.
Fig. 1 shows a schematic diagram of the instance segmentation process of the prior-art BoxInst method.
As shown in fig. 1, the BoxInst method implements weakly supervised instance segmentation using only the bounding box of each instance as the supervision signal, and it achieves about 90% of the performance of supervised instance segmentation on public data sets. Although the performance of BoxInst cannot match supervised instance segmentation, its extremely low annotation cost gives it the ability to exploit massive data.
Specifically, the core of BoxInst is built on one assumption: when the bounding box tightly encloses the object, at least one pixel on the bounding box overlaps the object (regarded as a positive sample), while the region outside the detection box must be unrelated to the object (regarded as a negative sample). The method then aligns the predicted result with the bounding box by introducing a reprojection error.
The purpose of introducing the reprojection error is to penalize the difference between the projections of the predicted mask and of the ground-truth bounding box mask along the x-axis and y-axis, respectively, so that after projection the predicted mask is as close as possible to the ground-truth bounding box mask. The loss function of the reprojection error can be expressed as:
L_proj = L(proj_x(m̂), proj_x(b)) + L(proj_y(m̂), proj_y(b)),
where m̂ denotes the predicted mask, b denotes the ground-truth bounding box mask, proj_x(m̂) and proj_y(m̂) denote the projections of the predicted mask onto the x-axis and y-axis, respectively, and proj_x(b) and proj_y(b) denote the projections of the ground-truth bounding box mask onto the x-axis and y-axis, respectively. L(X, Y) denotes the Dice loss (Dice Loss) between the two terms (predicted mask projection and ground-truth bounding box mask projection):
L(X, Y) = 1 − 2·Σ_i x_i·y_i / (Σ_i x_i² + Σ_i y_i²),
where x_i and y_i are the elements of X and Y.
in addition, to constrain that the prediction mask does not completely become a three-dimensional bounding box during training, the BoxInst method also utilizes pairwise penalties to constrain the prediction mask. The pairwise loss function is based on the following assumptions: if two pixels are similar in color, their class labels are likely to be the same. The pair-wise loss can then be expressed as:
Figure BDA0003679272530000072
Figure BDA0003679272530000073
wherein
Figure BDA0003679272530000074
The prediction class, P (y), representing a point at coordinates x, y e 1) represents whether the point at the coordinate x, y position and k, l two points belong to the same category (either foreground or background).
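As background, the following PyTorch sketch illustrates the two BoxInst-style losses described above: the projection (Dice) term and the color-similarity pairwise term. The tensor shapes, the max-based axis projection, the right-neighbor-only pairwise edges, and the threshold value are simplifications assumed here for illustration; they are not taken from this patent.

```python
import torch

def dice_loss(x, y, eps=1e-6):
    # Dice loss between two 1-D projections (or flattened masks).
    inter = (x * y).sum()
    return 1.0 - 2.0 * inter / (x.pow(2).sum() + y.pow(2).sum() + eps)

def box_projection_loss(pred_mask, box_mask):
    # pred_mask, box_mask: (H, W) tensors in [0, 1].
    # Compare the x- and y-axis projections (max over the other axis).
    loss_x = dice_loss(pred_mask.max(dim=0).values, box_mask.max(dim=0).values)
    loss_y = dice_loss(pred_mask.max(dim=1).values, box_mask.max(dim=1).values)
    return loss_x + loss_y

def pairwise_loss(pred_mask, color_sim, tau=0.3):
    # Encourage neighboring pixels with similar color to share a label.
    # color_sim: (H, W-1) similarity of each pixel with its right neighbor
    # (a single neighbor direction is used here to keep the sketch short).
    p = pred_mask[:, :-1]
    q = pred_mask[:, 1:]
    same_label_prob = p * q + (1 - p) * (1 - q)        # P(y_e = 1)
    log_prob = torch.log(same_label_prob.clamp(min=1e-6))
    edges = (color_sim >= tau).float()                 # only confident color edges
    return -(edges * log_prob).sum() / edges.sum().clamp(min=1.0)

# Toy usage
H, W = 8, 8
pred = torch.rand(H, W)
box = torch.zeros(H, W); box[2:6, 2:6] = 1.0
sim = torch.rand(H, W - 1)
total = box_projection_loss(pred, box) + pairwise_loss(pred, sim)
```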
Fig. 2 shows a schematic diagram of point sampling during instance segmentation according to the prior-art PointSup method.
Using only bounding boxes as the supervision signal achieves only about 85% of the performance of supervised learning methods on some public data sets. As shown in fig. 2, the PointSup method builds on box supervision (the BoxInst method) by adding several points that are randomly sampled within the bounding box and manually labeled with a class (foreground/background), and uses these points as an additional weak supervision signal to train the weakly supervised instance segmentation model.
Through this low-cost additional annotation, the PointSup method greatly improves model performance; experiments show that it not only far exceeds the performance of the BoxInst method, but also reaches about 97% of the performance of a supervised learning method.
An autonomous driving system typically carries a lidar (LiDAR) to acquire point cloud data, which, as data reflecting ground-truth depth with high accuracy, can provide strong supervision for instance segmentation. In particular, the point cloud captures the contour of an object of interest, so when the lidar point cloud is projected onto a two-dimensional image it naturally provides a point-level supervision signal. Furthermore, the three-dimensional geometric features can provide additional information for instance segmentation.
The invention provides a method for automatically labeling the points of a lidar point cloud and projecting them into two-dimensional space as sample points that provide supervision signals, thereby realizing low-cost weakly supervised instance segmentation in an autonomous driving scene.
The instance segmentation based on point cloud weak supervision mainly comprises a point label assignment (Point Label Assignment) module and a graph-based consistency regularization (Graph-based Consistency Regularization) module. The point label assignment module assigns foreground/background labels to the lidar point cloud through a series of rules. The graph consistency regularization module further constrains the segmenter's predictions by jointly encoding geometric consistency and image-feature consistency, so as to generate high-quality masks.
FIG. 3 shows the flow of an instance segmentation method based on point cloud weak supervision according to an embodiment of the invention.
As shown in fig. 3, the instance segmentation method based on point cloud weak supervision comprises an image instance segmentation branch (upper part) and a point cloud processing branch (lower part). The image instance segmentation branch contains an existing image-based weakly supervised instance segmentation model, whose flow is: an image is input, image features are extracted with an image feature extractor, a segmenter is trained with the image features as input features, and finally the segmenter predicts masks for the image. In existing image-based weakly supervised instance segmentation models, the bounding box (Bounding Box) of each object in the image must be manually annotated, and several points with manually labeled classes inside the bounding box are used as supervision signals during training. The point cloud processing branch aims to replace the manually labeled points required by the existing image-based weakly supervised instance segmentation model, using the point cloud acquired by the lidar as the supervision signal. To this end, a point label assignment module and a graph consistency regularization module are designed. The lidar provides additional weak supervision signals through the point label assignment module and the graph consistency regularization module, and these signals finally complete the point cloud supervised training of the existing weakly supervised instance segmentation model in the form of a point error loss function and a graph consistency loss function. Throughout the process, no additional manual annotation of the point cloud is required, so no additional labor cost is introduced.
The point label assignment module takes the lidar point cloud and the three-dimensional bounding boxes as input and outputs pseudo labels for the points of the point cloud. Specifically, the lidar point cloud is first projected onto the image plane, and noise points (overlapped points) caused by the parallax between the lidar and the camera are filtered out (because the lidar is mounted higher than the camera, part of the lidar point cloud has no corresponding pixels in the image). A set of rules is then used to assign a binary foreground/background label to each point of the lidar point cloud. Finally, the point labels are propagated to neighboring pixels weighted by feature similarity.
FIG. 4 shows a schematic diagram of a point tag assignment module, according to one embodiment of the invention.
To exploit the supervision information provided by the lidar, the inventors designed a point label assignment module that assigns a binary label to the three-dimensional points of the lidar point cloud so that they can serve as labels for training the segmenter. As shown in fig. 4, the original point cloud (lidar point cloud) is projected onto the image plane to form a projected point cloud, the projected point cloud is refined by deleting some overlapped points to obtain a refined point cloud, and finally the points of the refined point cloud are assigned foreground/background labels according to rules.
Point cloud projection
The process of projecting the original point cloud onto the image plane to form a projected point cloud is called point cloud projection. A point cloud containing N points in three-dimensional space can be represented in a homogeneous coordinate system (homogeneous coordinate system) as P_3d ∈ R^(4×N). A transformation matrix T ∈ R^(4×4) projects the lidar point cloud from the lidar coordinate system to the camera coordinate system, and a camera matrix K ∈ R^(3×4) then projects it further onto the image plane. The two-dimensional point set (projected point cloud) obtained by projecting the original point cloud onto the image plane can thus be represented as:
P_2d = K · T · P_3d,
where P_2d is the set of points, expressed in homogeneous coordinates, obtained after the original point cloud is projected onto the image plane.
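The projection step above amounts to two matrix multiplications followed by a perspective division. A minimal NumPy sketch is given below; the matrix names T and K follow the description, while the concrete toy values, the positive-depth filter, and the return format are assumptions made for illustration.

```python
import numpy as np

def project_lidar_to_image(points_xyz, T, K):
    """Project lidar points onto the image plane.

    points_xyz: (N, 3) lidar points in the lidar frame.
    T:          (4, 4) lidar-to-camera transformation matrix.
    K:          (3, 4) camera projection matrix.
    Returns (N_kept, 2) pixel coordinates and the indices of the kept points.
    """
    N = points_xyz.shape[0]
    # Homogeneous coordinates P_3d in R^{4 x N}.
    p3d_h = np.hstack([points_xyz, np.ones((N, 1))]).T
    # Lidar frame -> camera frame -> image plane: P_2d = K * T * P_3d.
    img_h = K @ (T @ p3d_h)                      # (3, N), homogeneous pixels
    # Keep only points in front of the camera (positive depth).
    keep = img_h[2] > 1e-6
    uv = (img_h[:2, keep] / img_h[2, keep]).T    # (N_kept, 2) pixel coordinates
    return uv, np.flatnonzero(keep)

# Toy usage with an identity extrinsic and a simple pinhole intrinsic.
T = np.eye(4)
K = np.array([[700.0, 0.0, 320.0, 0.0],
              [0.0, 700.0, 240.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
pts = np.random.rand(100, 3) * [10, 10, 30] + [0, 0, 1]
uv, idx = project_lidar_to_image(pts, T, K)
```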
Depth-guided point refinement
The process of refining the projected point cloud by deleting some overlapped points is called depth-guided point refinement. In many autonomous driving systems the lidar is mounted on the roof of the vehicle while the camera is mounted at the front of the vehicle or behind the windshield, which produces a parallax between the two sensors. Because of this parallax, some pixels that appear as foreground after projection onto the image plane do not necessarily correspond to foreground points in the three-dimensional point cloud space. These overlapped points are eliminated with a depth-guided point refinement method, based on the assumption that the depth of surface points of the same object should not change abruptly.
Specifically, each pixel P_2d obtained by projection onto the image plane and the ground-truth depth (z-axis) of the corresponding three-dimensional point (lidar point) first form a sparse depth map D. The map is a sparse image: if a coordinate position has a corresponding three-dimensional point (lidar point), the value at that position is the depth of the pixel P_2d projected onto the image plane; if there is no corresponding three-dimensional point, the value at that position is 0. A two-dimensional sliding window w is then used to traverse the entire sparse depth map. Within each window, the projected point cloud is divided according to relative depth into two sets, near points P_near and far points P_far:
p(x, y) ∈ P_near  if  (d(x, y) − d_min) / (d_max − d_min) ≤ τ_depth  and  d(x, y) ≠ 0,
where p(x, y) denotes the point in the point cloud corresponding to the pixel with coordinates (x, y) in the two-dimensional sliding window w; τ_depth denotes a depth threshold with which all relatively distant points can be filtered out; d(x, y) denotes the depth value of the pixel with coordinates (x, y), and when d(x, y) ≠ 0, p(x, y) denotes the three-dimensional point corresponding to that pixel position; and d_min and d_max denote the minimum and maximum depth values within the two-dimensional sliding window w, respectively.
Correspondingly, all points whose relative depth exceeds the depth threshold are far points P_far. However, not all far points are overlapped points, so the minimum envelope of the near points P_near is computed and the far points that fall within this envelope are filtered out as overlapped points:
P_overlap = { p(x, y) ∈ P_far | x_min ≤ x ≤ x_max, y_min ≤ y ≤ y_max },
where x_min, x_max, y_min, y_max are the minimum and maximum values of all near points P_near on the x-axis and y-axis. The rationale is that within a small two-dimensional sliding window the depth of foreground points should not change drastically; when a point with a larger depth value is surrounded by points with smaller depth values, that point is very likely an overlapped point. The refined point cloud obtained after removing the overlapped points can be represented as:
P_refine = P_near ∪ (P_far \ P_overlap),
where P_overlap denotes the overlapped points, P_near the near points, P_far the far points, and P_refine the refined point cloud.
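The refinement procedure can be sketched as follows in NumPy. The window size, the stride, and the exact normalized form of the relative-depth test are assumptions made for this illustration; only the near/far split and the minimum-envelope filtering of far points follow the description above.

```python
import numpy as np

def purify_points(uv, depth, img_h, img_w, win=16, tau_depth=0.4):
    """Remove overlapped (occluded) projected points via depth guidance.

    uv:    (N, 2) integer pixel coordinates of projected lidar points.
    depth: (N,)   depth of each projected point.
    Returns a boolean mask over the N points: True = kept.
    """
    # Sparse depth map D: 0 where no lidar point projects.
    D = np.zeros((img_h, img_w), dtype=np.float32)
    idx_map = -np.ones((img_h, img_w), dtype=np.int64)
    D[uv[:, 1], uv[:, 0]] = depth
    idx_map[uv[:, 1], uv[:, 0]] = np.arange(len(depth))

    keep = np.ones(len(depth), dtype=bool)
    for y0 in range(0, img_h, win):
        for x0 in range(0, img_w, win):
            patch = D[y0:y0 + win, x0:x0 + win]
            ids = idx_map[y0:y0 + win, x0:x0 + win]
            valid = patch > 0
            if valid.sum() < 2:
                continue
            d = patch[valid]
            d_min, d_max = d.min(), d.max()
            if d_max - d_min < 1e-6:
                continue
            rel = (d - d_min) / (d_max - d_min)      # relative depth in the window
            ys, xs = np.nonzero(valid)
            near = rel <= tau_depth
            far = ~near
            if near.sum() == 0 or far.sum() == 0:
                continue
            # Minimum envelope of the near points inside this window.
            x_lo, x_hi = xs[near].min(), xs[near].max()
            y_lo, y_hi = ys[near].min(), ys[near].max()
            # Far points falling inside the envelope are treated as overlapped.
            overlap = far & (xs >= x_lo) & (xs <= x_hi) & (ys >= y_lo) & (ys <= y_hi)
            keep[ids[ys[overlap], xs[overlap]]] = False
    return keep

# Toy usage
uv = np.random.randint(0, 64, size=(200, 2))
depth = np.random.rand(200) * 50 + 1
kept = purify_points(uv, depth, img_h=64, img_w=64)
```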
Label assignment
After the projected point cloud has been refined, the points of the remaining refined point cloud are assigned as positive and negative samples (foreground/background); this is label assignment.
First, according to the positional relationship between the refined point cloud and the three-dimensional bounding boxes of all instances, the refined point cloud P_refine is divided into two subsets: P_in denotes the points inside the three-dimensional bounding boxes, and P_out denotes the points outside the three-dimensional bounding boxes. All points in P_in, i.e. inside a three-dimensional bounding box, are taken as positive samples and assigned foreground labels. In general only a small fraction of points belong to P_in and can serve as positive samples, while most points belong to P_out, i.e. lie outside the three-dimensional bounding boxes. To reduce the amount of computation, usually only a subset of the points of P_out around the three-dimensional bounding boxes participates in training as negative samples. The sampling proceeds as follows: the 8 fixed points of the three-dimensional bounding box (its 8 vertices) are first projected onto the image plane, and their minimum enclosing rectangle b is then computed; the enclosing rectangle acts as a relaxed two-dimensional bounding box (because the projection of a three-dimensional bounding box onto the two-dimensional plane generally does not overlap the instance in the image exactly and is slightly larger). Finally, points of P_out whose projections fall within the enclosing rectangle b are sampled as the negative samples, denoted P_neg.
Specifically, the label assignment strategy for each candidate point p of P_refine is:
l(p) = 1 if p ∈ P_in;  l(p) = 0 if p ∈ P_neg;  l(p) = −1 otherwise,
where 1 indicates that a point is assigned as a positive sample, 0 indicates that it is assigned as a negative sample, and −1 indicates that the point is ignored. To allow parallel acceleration during training, the total number of positive and negative samples is fixed to s, and s points are jointly sampled from P_in and P_neg at given positive and negative sampling rates. If P_in and P_neg together contain fewer than s points, the shortfall is supplemented by Gaussian-distributed random sampling from the refined point cloud; otherwise s points are sampled from P_in and P_neg. This finally yields s points together with their pseudo labels.
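A simplified NumPy sketch of this label assignment rule follows. The axis-aligned box test (real three-dimensional bounding boxes are oriented), the fixed sampling ratio, and the parameter names are assumptions introduced for illustration; only the positive/negative/ignore rule itself follows the description.

```python
import numpy as np

def assign_point_labels(points_xyz, box_min, box_max, uv, rect, s=64, pos_ratio=0.5):
    """Assign foreground(1)/background(0)/ignore(-1) pseudo labels.

    points_xyz: (N, 3) refined lidar points (lidar frame).
    box_min, box_max: (3,) corners of an axis-aligned 3D box (simplification).
    uv:   (N, 2) projected pixel coordinates of the same points.
    rect: (x_min, y_min, x_max, y_max) minimum enclosing rectangle b of the
          projected box vertices (the relaxed 2D box).
    """
    labels = -np.ones(len(points_xyz), dtype=np.int64)
    inside_3d = np.all((points_xyz >= box_min) & (points_xyz <= box_max), axis=1)
    x0, y0, x1, y1 = rect
    inside_2d = (uv[:, 0] >= x0) & (uv[:, 0] <= x1) & (uv[:, 1] >= y0) & (uv[:, 1] <= y1)

    pos_idx = np.flatnonzero(inside_3d)                  # foreground candidates (P_in)
    neg_idx = np.flatnonzero(~inside_3d & inside_2d)     # background candidates near the box

    n_pos = min(len(pos_idx), int(s * pos_ratio))
    n_neg = min(len(neg_idx), s - n_pos)
    labels[np.random.choice(pos_idx, n_pos, replace=False)] = 1
    labels[np.random.choice(neg_idx, n_neg, replace=False)] = 0
    return labels

# Toy usage
pts = np.random.rand(500, 3) * 20
uv = np.random.randint(0, 640, size=(500, 2))
labels = assign_point_labels(pts, np.array([5, 5, 5]), np.array([10, 10, 10]),
                             uv, rect=(100, 100, 300, 300))
```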
Label propagation
The points retained by label assignment are very sparse after projection onto the image, so the pseudo labels of the s points are further propagated to surrounding pixels according to image feature similarity (label propagation) in order to provide dense supervision signals. A candidate point p_c is selected from the s positive and negative samples, and the image feature similarity is used to decide whether the pseudo label of the candidate point p_c is propagated to the 8 pixels surrounding it on the image; this is repeated over all s positive and negative samples. The label propagation criterion is:
l(q) = l(p_c)  if  f(q)^T f(p_c) > τ_dense,  for q ∈ N_8(p_c),
where l(p) is the pseudo label assigned to point p, N_8(p_c) is the set of 8 pixels surrounding the candidate point p_c on the image, f(p) is the image feature of point p extracted by the pre-trained image feature extractor, and τ_dense is the similarity threshold. When the image feature similarity exceeds the similarity threshold, the label of the candidate point p_c is propagated to the 8 surrounding pixels on the image, so that these 8 pixels carry the same class label as the candidate point p_c. Otherwise the label is not propagated, because when the image feature similarity of two points is low it cannot be judged whether they belong to the same class. Label propagation finally yields a dense set of points with pseudo labels.
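The propagation rule can be sketched as below in NumPy. The dot-product similarity and the threshold value are assumptions consistent with the description above rather than exact values from the patent, as are the feature resolution and the per-pixel normalization.

```python
import numpy as np

def propagate_labels(seed_uv, seed_labels, feat_map, tau_dense=0.9):
    """Spread each seed label to its 8 neighboring pixels if features agree.

    seed_uv:     (S, 2) pixel coordinates of the labeled seed points.
    seed_labels: (S,)   pseudo labels (1 foreground / 0 background).
    feat_map:    (C, H, W) L2-normalized image features from the extractor.
    Returns a dict {(x, y): label} of dense pseudo labels.
    """
    C, H, W = feat_map.shape
    dense = {}
    offsets = [(-1, -1), (0, -1), (1, -1), (-1, 0), (1, 0), (-1, 1), (0, 1), (1, 1)]
    for (x, y), lab in zip(seed_uv, seed_labels):
        dense[(x, y)] = lab
        f_c = feat_map[:, y, x]                        # feature of the candidate point
        for dx, dy in offsets:
            nx, ny = x + dx, y + dy
            if not (0 <= nx < W and 0 <= ny < H):
                continue
            sim = float(f_c @ feat_map[:, ny, nx])     # dot-product similarity
            if sim > tau_dense:                        # propagate only when similar enough
                dense[(nx, ny)] = lab
    return dense

# Toy usage
feat = np.random.rand(16, 32, 32).astype(np.float32)
feat /= np.linalg.norm(feat, axis=0, keepdims=True)    # normalize per pixel
seeds = np.array([[5, 5], [20, 17]])
labels = np.array([1, 0])
dense = propagate_labels(seeds, labels, feat)
```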
Lidar point loss
For instance segmentation methods that use masks, the mask output by the segmenter can be expressed as m̂ ∈ [0, 1]^(h×w), where h and w are the resolution of the mask output by the segmenter. The prediction m̂(p) at the position of a point p is obtained by bilinear interpolation, and a point-based binary cross entropy loss function (called the point error loss function) is constructed:
L_point = −(1 / (K·|S|)) Σ_k Σ_s [ l_ks · log m̂(p_ks) + (1 − l_ks) · log(1 − m̂(p_ks)) ],
where K is the total number of instances in the image, S is the set of all points carrying pseudo labels, p_ks is the s-th point of the k-th instance, l_ks is the pseudo label of point p_ks, and L_point denotes the point error loss. Because the predicted mask value m̂(p) is obtained by interpolating the pixels immediately surrounding the point p, the loss function not only optimizes the prediction at the current point but also propagates the error back to the pixels adjacent to that point. The point-based binary cross entropy loss function thus yields instance segmentation masks with sharper edges.
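A PyTorch sketch of this point error loss is given below, using grid_sample for the bilinear interpolation. The tensor layout and the normalization over points are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def point_error_loss(pred_masks, points, labels):
    """Binary cross entropy at sparse pseudo-labeled points.

    pred_masks: (K, 1, h, w) predicted mask probabilities, one per instance.
    points:     (K, S, 2) point coordinates normalized to [-1, 1].
    labels:     (K, S)    pseudo labels in {0, 1}.
    """
    # Bilinear interpolation of the mask at each labeled point.
    grid = points.unsqueeze(2)                         # (K, S, 1, 2)
    sampled = F.grid_sample(pred_masks, grid, mode='bilinear',
                            align_corners=False)       # (K, 1, S, 1)
    probs = sampled.squeeze(1).squeeze(-1)             # (K, S)
    return F.binary_cross_entropy(probs.clamp(1e-6, 1 - 1e-6), labels.float())

# Toy usage
K, S, h, w = 2, 16, 56, 56
masks = torch.rand(K, 1, h, w)
pts = torch.rand(K, S, 2) * 2 - 1                      # normalized coordinates
lbls = torch.randint(0, 2, (K, S))
loss = point_error_loss(masks, pts, lbls)
```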
Although the point label assignment module described above can provide refined pseudo labels, inaccurate labels may still exist, for two reasons: (1) systematic errors caused by calibration inaccuracies, for example at the edges of some objects, where points acquired by the lidar may be projected onto the background in the two-dimensional image plane; and (2) the low reflectivity and high transmissivity of materials such as vehicle windshields, which may let the lidar beam pass through the glass and detect the background. Both can cause the point label assignment module to assign wrong labels to these erroneous points. To address this problem, a graph consistency regularization module is designed. The graph consistency regularization module constrains the segmenter to generate reasonable predicted masks by exploring similarity relationships between spatially neighboring points. The module first builds an undirected graph with each point of the point cloud as a node and the weighted sum of the three-dimensional geometric feature similarity and the image feature similarity as the edges. This graph-based similarity regularizes the predicted masks of the instance segmentation. The graph consistency regularization module supervises the training of the segmenter through a graph consistency loss function, thereby improving the performance of the segmenter. The graph consistency regularization module comprises two parts: similarity-based graph construction and consistency regularization.
Similarity-based graph construction
Given the point set P_refine obtained from the point label assignment module, an undirected graph G = <V, E> is constructed, where V consists of the points of P_refine and the edges E are measured by the image feature similarity and the three-dimensional geometric feature similarity between two nodes:
W_ij = w_1 · S_image(i, j) + w_2 · S_geometry(i, j),
where w_1 and w_2 are balance weights for the image feature similarity and the three-dimensional geometric feature similarity, S_image(i, j) and S_geometry(i, j) denote, respectively, the image feature similarity and the three-dimensional geometric feature similarity between nodes p_i and p_j, and W_ij denotes the overall graph similarity.
To construct the image feature similarity, the feature map F of the image extracted by a convolutional neural network model (the image feature extractor) is used, and the point image feature f(p) is obtained by bilinear interpolation. The image feature similarity between two nodes p_i and p_j can then be expressed as:
S_image(i, j) = f(p_i)^T f(p_j).
For the three-dimensional geometric feature similarity, the three-dimensional point P_3d in the original lidar coordinate system corresponding to each pixel of the point set on the image is considered, and S_geometry(i, j) is computed from the 2-norm ||·||_2 of the difference between the corresponding three-dimensional points P_3d^i and P_3d^j, normalized by a constant m, where P_3d^i is the i-th point and P_3d^j the j-th point of P_3d. S_image(i, j) and S_geometry(i, j) are then combined by the weights to form the weight of the connecting edge between the two points.
Consistency regularization
In weakly supervised learning, the consistency prior assumes that points within the same structure (usually the same cluster or manifold) are more likely to have similar labels. The larger the overall graph similarity W_ij, the more similar the two points are, and the more they should share the same label. A similarity threshold τ is defined to decide whether an edge is formed between two points:
e_ij = 1 if W_ij > τ, and e_ij = 0 otherwise,
where e_ij ∈ E. When the overall graph similarity W_ij between two nodes p_i and p_j is greater than the similarity threshold τ, a connecting edge exists between the two points, i.e. the pseudo labels of nodes p_i and p_j are the same; otherwise there is no connecting edge between nodes p_i and p_j and their pseudo labels differ. Similarly to the point label assignment module, the undirected graph G is used here to constrain the segmenter to predict consistent labels for similar points.
The consistency rule is expressed as follows: when the edge e_ij = 1, the masks m̂_i and m̂_j predicted by the segmenter should be as close as possible. This can be defined through a binary cross entropy loss function (the graph consistency loss):
L_consistency = −(1 / N²) Σ_i Σ_j e_ij [ m̂_j · log m̂_i + (1 − m̂_j) · log(1 − m̂_i) ],
where N = |V| is the number of nodes in the undirected graph, m̂_i and m̂_j are the predicted mask values of nodes p_i and p_j, respectively, and L_consistency denotes the graph consistency loss. The formula expresses that when two nodes p_i and p_j have no connecting edge (i.e. e_ij = 0) they impose no constraint, and when a connecting edge exists between two points their predictions should be as similar as possible.
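A PyTorch sketch of the consistency term follows. The pairwise binary-cross-entropy form matches the description; normalizing by the number of connected pairs is a practical assumption of this sketch, since the original normalization is not recoverable from the text.

```python
import torch

def graph_consistency_loss(pred, edges, eps=1e-6):
    """Pairwise BCE between mask predictions of connected graph nodes.

    pred:  (N,)   predicted mask probability at each graph node.
    edges: (N, N) binary adjacency e_ij (1 if W_ij > tau, else 0).
    """
    p_i = pred.unsqueeze(1)                   # (N, 1)
    p_j = pred.unsqueeze(0)                   # (1, N)
    # Treat p_j as a soft target for p_i; only connected pairs contribute.
    bce = -(p_j * torch.log(p_i.clamp(min=eps))
            + (1 - p_j) * torch.log((1 - p_i).clamp(min=eps)))
    n_edges = edges.sum().clamp(min=1.0)
    return (edges * bce).sum() / n_edges

# Toy usage
pred = torch.rand(10)
W = torch.rand(10, 10)
edges = (W > 0.8).float()
loss = graph_consistency_loss(pred, edges)
```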
Finally, the two loss functions are merged together as a total loss function to supervise the training of the segmenter:
L = L_point + L_consistency.
With this approach, the lidar point cloud is used as a supervision signal to constrain the segmenter's predictions without any additional annotation of the point cloud, which improves the prediction performance of the segmenter and completes the weakly supervised instance segmentation task.
The instance segmentation method based on point cloud weak supervision was added to the PointSup method and the BoxInst method for experimental verification, and the experimental results are shown in Table 1. The instance segmentation results were evaluated with the standard instance segmentation metrics, including average precision (AP), average precision at an intersection-over-union threshold of 50% (AP_50), average precision at an intersection-over-union threshold of 75% (AP_75), and the average precision for small (AP_s), medium (AP_m) and large (AP_l) objects. The larger the values of AP, AP_50, AP_75, AP_s, AP_m and AP_l, the better the performance of the segmenter (model). Performance was verified on a proprietary annotated data set. The lidar point cloud was added as a supervision signal to the training of the segmenter of the existing weakly supervised method, and the trained segmenter was used to predict instance segmentation masks.
Table 1 compares the experimental results of the instance segmentation method based on point cloud weak supervision with the instance segmentation results of existing weakly supervised methods and of supervised methods.
(Table 1 is provided as an image in the original publication.)
Compared with existing weakly supervised methods, the instance segmentation method based on point cloud weak supervision introduces no additional manual annotation cost at all and can serve as a complement to other weakly supervised methods. Superimposed on existing methods, it further improves the overall performance of the weakly supervised instance segmenter, approaches the performance of supervised learning methods at extremely low cost, and has the potential to train the segmenter at scale with massive data.
Although some embodiments of the present invention have been described herein, those skilled in the art will appreciate that they have been presented by way of example only. Numerous variations, substitutions and modifications will occur to those skilled in the art in light of the teachings of the present invention without departing from the scope thereof. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims (10)

1. An instance segmentation method based on point cloud weak supervision, comprising:
projecting a lidar point cloud onto an image plane to form a projected point cloud;
refining the projected point cloud by removing overlapped points caused by the parallax between the lidar and the camera, to obtain a refined point cloud;
assigning foreground or background labels to the points in the refined point cloud; and
training a segmenter using the refined point cloud with the foreground/background labels as a supervision signal, and performing instance segmentation and mask prediction with the segmenter.
2. The instance segmentation method based on point cloud weak supervision of claim 1, further comprising, before the step of projecting the lidar point cloud onto the image plane to form a projected point cloud: inputting an image and extracting image features with an image feature extractor, wherein the image features serve as input features for training the segmenter; and
labeling a three-dimensional bounding box of each object in the image.
3. The instance segmentation method based on point cloud weak supervision of claim 2, wherein a point error loss function and a graph consistency loss function are used to constrain the output of the segmenter when training the segmenter.
4. The instance segmentation method based on point cloud weak supervision of claim 3, wherein projecting the lidar point cloud onto an image plane to form a projected point cloud comprises:
expressing the lidar point cloud in homogeneous coordinates as P_3d ∈ R^(4×N);
projecting the lidar point cloud from the lidar coordinate system to the camera coordinate system with a transformation matrix T ∈ R^(4×4), and then projecting it further onto the image plane with a camera matrix K ∈ R^(3×4), to form the projected point cloud:
P_2d = K · T · P_3d,
wherein P_2d is the set of points, expressed in homogeneous coordinates, obtained by projecting the lidar point cloud onto the image plane.
5. The instance segmentation method based on point cloud weak supervision of claim 3, wherein refining the projected point cloud by removing overlapped points caused by the parallax between the lidar and the camera to obtain a refined point cloud comprises:
forming a sparse depth map D from each pixel P_2d obtained by projection onto the image plane and the ground-truth depth of the corresponding lidar point; and
traversing the entire sparse depth map with a two-dimensional sliding window w, wherein within each window the projected point cloud is divided according to relative depth into near points P_near and far points P_far, a point whose relative depth exceeds a depth threshold being a far point in P_far and a point whose relative depth does not exceed the depth threshold being a near point in P_near:
p(x, y) ∈ P_near  if  (d(x, y) − d_min) / (d_max − d_min) ≤ τ_depth,  and  p(x, y) ∈ P_far  otherwise,
wherein p(x, y) denotes the point in the lidar point cloud corresponding to the pixel with coordinates (x, y) in the two-dimensional sliding window w, τ_depth denotes a depth threshold with which relatively distant points can be filtered out, d(x, y) denotes the depth value of the pixel with coordinates (x, y), and d_min and d_max denote the minimum and maximum depth values within the two-dimensional sliding window w, respectively; and
computing the minimum envelope of the near points P_near and removing, as overlapped points, the far points that fall within this envelope, to obtain the refined point cloud P_refine, wherein the overlapped points P_overlap are:
P_overlap = { p(x, y) ∈ P_far | x_min ≤ x ≤ x_max, y_min ≤ y ≤ y_max },
wherein x_min and x_max are the minimum and maximum values of the near points P_near on the x-axis, and y_min and y_max are the minimum and maximum values of the near points P_near on the y-axis.
6. The instance segmentation method based on point cloud weak supervision of claim 5, wherein assigning foreground/background labels to the points in the refined point cloud comprises:
dividing the refined point cloud P_refine, according to its positional relationship with the three-dimensional bounding box, into points P_in inside the three-dimensional bounding box and points P_out outside the three-dimensional bounding box;
taking the points P_in inside the three-dimensional bounding box as positive samples and assigning them foreground labels, and taking a subset of the points of P_out around the three-dimensional bounding box as negative samples and assigning them background labels, the total number of positive and negative samples being s; and
propagating the pseudo labels of the positive and negative samples to the surrounding 8 pixels according to image feature similarity.
7. The instance segmentation method based on point cloud weak supervision of claim 6, wherein taking a subset of the points of P_out around the three-dimensional bounding box as negative samples and assigning them background labels comprises:
first projecting the 8 vertices of the three-dimensional bounding box onto the image plane and then computing their minimum enclosing rectangle b; and
sampling, from the points of P_out whose projections fall within the enclosing rectangle b, the negative samples P_neg.
8. The instance segmentation method based on point cloud weak supervision of claim 6, wherein propagating the pseudo labels of the positive and negative samples to the surrounding 8 pixels according to image feature similarity comprises:
when the image feature similarity exceeds a similarity threshold, propagating the label of a candidate point p_c selected from the positive and negative samples to the 8 pixels surrounding it on the image, so that these 8 pixels carry the same class label as the candidate point p_c, wherein the label propagation criterion is:
l(q) = l(p_c)  if  f(q)^T f(p_c) > τ_dense,  for q ∈ N_8(p_c),
wherein l(p) is the pseudo label assigned to point p, N_8(p_c) is the set of 8 pixels surrounding the candidate point p_c on the image, f(p) is the image feature of point p extracted by the pre-trained image feature extractor, and τ_dense is the similarity threshold.
9. The instance segmentation method based on point cloud weak supervision of claim 6, wherein, when training the segmenter, constraining the output of the segmenter with the point error loss function and the graph consistency loss function comprises:
constructing the point error loss function by bilinear interpolation, wherein the loss between the predicted mask and the pseudo labels is measured by the point error loss function:
L_point = −(1 / (K·|S|)) Σ_k Σ_s [ l_ks · log m̂(p_ks) + (1 − l_ks) · log(1 − m̂(p_ks)) ],
wherein K is the total number of instances in the image, S is the set of all points carrying pseudo labels, p_ks is the s-th point of the k-th instance, l_ks is the pseudo label of point p_ks, m̂(p_ks) is the predicted mask value at p_ks obtained by bilinear interpolation, and L_point denotes the point error loss.
10. The instance segmentation method based on point cloud weak supervision of claim 9, wherein, when training the segmenter, constraining the output of the segmenter with the point error loss function and the graph consistency loss function comprises:
constructing an undirected graph G = <V, E> from the refined point cloud P_refine, wherein the points of the refined point cloud P_refine serve as the nodes forming V and E is the set of edges, and whether an edge is formed between two nodes, and hence whether the two nodes carry the same pseudo label, is determined by the image feature similarity and the three-dimensional geometric feature similarity between the two nodes, the weighted sum of which is:
W_ij = w_1 · S_image(i, j) + w_2 · S_geometry(i, j),
wherein w_1 and w_2 are balance weights for the image feature similarity and the three-dimensional geometric feature similarity, S_image(i, j) and S_geometry(i, j) denote, respectively, the image feature similarity and the three-dimensional geometric feature similarity between nodes p_i and p_j, and W_ij denotes the overall graph similarity; when the overall graph similarity W_ij is greater than a similarity threshold τ, an edge is formed between the two nodes and they carry the same pseudo label, otherwise there is no connecting edge between the two nodes and their pseudo labels differ; and
constraining the output of the segmenter with the graph consistency loss function, based on the requirement that the predicted masks of the segmenter be close whenever a connecting edge exists between two nodes:
L_consistency = −(1 / N²) Σ_i Σ_j e_ij [ m̂_j · log m̂_i + (1 − m̂_j) · log(1 − m̂_i) ],
wherein N = |V| is the number of nodes in the undirected graph, e_ij indicates whether nodes p_i and p_j are connected by an edge, m̂_i and m̂_j are the predicted mask values of nodes p_i and p_j, respectively, and L_consistency denotes the graph consistency loss.
CN202210629786.8A 2022-06-06 2022-06-06 Instance segmentation method based on point cloud weak supervision Pending CN114972758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210629786.8A CN114972758A (en) 2022-06-06 2022-06-06 Instance segmentation method based on point cloud weak supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210629786.8A CN114972758A (en) 2022-06-06 2022-06-06 Instance segmentation method based on point cloud weak supervision

Publications (1)

Publication Number Publication Date
CN114972758A true CN114972758A (en) 2022-08-30

Family

ID=82960452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210629786.8A Pending CN114972758A (en) 2022-06-06 2022-06-06 Instance segmentation method based on point cloud weak supervision

Country Status (1)

Country Link
CN (1) CN114972758A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116703952A (en) * 2023-08-09 2023-09-05 深圳魔视智能科技有限公司 Method and device for filtering occlusion point cloud, computer equipment and storage medium
CN116703952B (en) * 2023-08-09 2023-12-08 深圳魔视智能科技有限公司 Method and device for filtering occlusion point cloud, computer equipment and storage medium
CN117058384A (en) * 2023-08-22 2023-11-14 山东大学 Method and system for semantic segmentation of three-dimensional point cloud
CN117058384B (en) * 2023-08-22 2024-02-09 山东大学 Method and system for semantic segmentation of three-dimensional point cloud

Similar Documents

Publication Publication Date Title
CN111553859B (en) Laser radar point cloud reflection intensity completion method and system
CN111462275B (en) Map production method and device based on laser point cloud
US10867190B1 (en) Method and system for lane detection
CN110148196B (en) Image processing method and device and related equipment
CN111461245B (en) Wheeled robot semantic mapping method and system fusing point cloud and image
CN108509820B (en) Obstacle segmentation method and device, computer equipment and readable medium
CN112581612B (en) Vehicle-mounted grid map generation method and system based on fusion of laser radar and all-round-looking camera
CN108470174B (en) Obstacle segmentation method and device, computer equipment and readable medium
CN114972758A (en) Instance segmentation method based on point cloud weak supervision
CN112967283B (en) Target identification method, system, equipment and storage medium based on binocular camera
CN111753698A (en) Multi-mode three-dimensional point cloud segmentation system and method
CN113706480B (en) Point cloud 3D target detection method based on key point multi-scale feature fusion
CN112258519B (en) Automatic extraction method and device for way-giving line of road in high-precision map making
CN115049700A (en) Target detection method and device
CN113516664A (en) Visual SLAM method based on semantic segmentation dynamic points
CN110619299A (en) Object recognition SLAM method and device based on grid
CN113255444A (en) Training method of image recognition model, image recognition method and device
CN112257668A (en) Main and auxiliary road judging method and device, electronic equipment and storage medium
CN117037103A (en) Road detection method and device
CN116071729A (en) Method and device for detecting drivable area and road edge and related equipment
CN115100741A (en) Point cloud pedestrian distance risk detection method, system, equipment and medium
CN115147798A (en) Method, model and device for predicting travelable area and vehicle
CN114550116A (en) Object identification method and device
CN112507891B (en) Method and device for automatically identifying high-speed intersection and constructing intersection vector
CN116597122A (en) Data labeling method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination