CN114638954A - Point cloud segmentation model training method, point cloud data segmentation method and related device

Point cloud segmentation model training method, point cloud data segmentation method and related device

Info

Publication number
CN114638954A
Authority
CN
China
Prior art keywords
voxel
point cloud
real
information
sparse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210163274.7A
Other languages
Chinese (zh)
Other versions
CN114638954B
Inventor
许双杰
万锐
邹晓艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DeepRoute AI Ltd
Original Assignee
DeepRoute AI Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DeepRoute AI Ltd
Priority to CN202210163274.7A
Publication of CN114638954A
Application granted
Publication of CN114638954B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00: Manipulating 3D models or images for computer graphics
    • G06T19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Architecture (AREA)
  • Computer Graphics (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a point cloud segmentation model training method, a point cloud data segmentation method and a related device. The method comprises: acquiring a training point cloud, wherein the data points in the training point cloud are grouped by voxel to obtain corresponding real voxels, and each real voxel is labeled with real information; inputting the training point cloud into a point cloud segmentation model to obtain predicted voxels output by the point cloud segmentation model and detection information corresponding to the predicted voxels; determining at least one voxel pair, wherein each voxel pair comprises a corresponding predicted voxel and real voxel; and adjusting network parameters of the point cloud segmentation model according to the difference between the real information and the detection information of the voxel pair. In this way, conversions from voxel features to data point features can be reduced, the loss incurred in the conversion process is reduced, and the amount of computation is reduced, thereby improving the segmentation accuracy of the point cloud segmentation model.

Description

Point cloud segmentation model training method, point cloud data segmentation method and related device
Technical Field
The present application relates to the field of point cloud data processing technologies, and in particular, to a training method for a point cloud segmentation model, a point cloud data segmentation method, and a related apparatus.
Background
Three-dimensional scene segmentation is essential for many robotic applications, particularly autonomous driving. In processing three-dimensional laser point cloud data, three-dimensional sparse convolution is currently often used to process voxelized point cloud features directly.
However, the three-dimensional sparse features are usually mapped into dense features so that the loss function can be calculated conveniently during training. This greatly increases the amount of computation, and the sparse features extracted by the three-dimensional sparse convolution also lose some of their effectiveness in the mapping process.
Disclosure of Invention
The main technical problem addressed by the present application is to provide a point cloud segmentation model training method, a point cloud data segmentation method and a related device, which can reduce the conversion from voxel features to data point features, reduce the loss in the conversion process and reduce the amount of computation, thereby improving the segmentation accuracy of the point cloud segmentation model.
In order to solve the above technical problem, one technical solution adopted by the present application is to provide a point cloud segmentation model training method, comprising: acquiring a training point cloud, wherein the data points in the training point cloud are grouped by voxel to obtain corresponding real voxels, and each real voxel is labeled with real information; inputting the training point cloud into a point cloud segmentation model to obtain predicted voxels output by the point cloud segmentation model and detection information corresponding to the predicted voxels; determining at least one voxel pair, each voxel pair comprising a corresponding predicted voxel and real voxel; and adjusting network parameters of the point cloud segmentation model according to the difference between the real information and the detection information of the voxel pair.
Wherein determining at least one voxel pair comprises: acquiring a first coordinate of each real voxel and a second coordinate of each predicted voxel; determining a target first coordinate and a target second coordinate that are identical; and determining the real voxel at the target first coordinate and the predicted voxel at the target second coordinate as a voxel pair.
Wherein acquiring the first coordinate of each real voxel and the second coordinate of each predicted voxel comprises: hash-coding the first coordinates of all real voxels to obtain a corresponding first hash table; and hash-coding the second coordinates of all predicted voxels to obtain a corresponding second hash table. Determining the target first coordinate and the target second coordinate that are identical comprises: traversing all the first hash tables with the second hash table of each predicted voxel, and determining the target first coordinate and the target second coordinate that are identical.
Wherein the point cloud segmentation model includes: a sparse voxel feature encoding network, a three-dimensional sparse network, a point voxel network, a first supervision network, a second supervision network, a third supervision network and a fourth supervision network. Inputting the training point cloud into the point cloud segmentation model to obtain the predicted voxels output by the point cloud segmentation model and the detection information corresponding to the predicted voxels comprises: inputting the training point cloud into the sparse voxel feature encoding network for feature extraction to obtain voxel features; inputting the voxel features into the three-dimensional sparse network for sparse voxel feature extraction to obtain sparse voxel features; inputting the sparse voxel features into the point voxel network for feature conversion to obtain data point features; inputting the sparse voxel features into the first supervision network for semantic segmentation learning on three-dimensional voxels to obtain first detection information corresponding to the predicted voxels; inputting the sparse voxel features into the second supervision network for three-dimensional heatmap learning to obtain second detection information corresponding to the predicted voxels; inputting the data point features into the third supervision network for point-level semantic segmentation learning to obtain third detection information corresponding to each data point; and inputting the data point features into the fourth supervision network for point-level offset supervision learning to obtain a predicted offset corresponding to each data point.
The real information comprises first real information, second real information, third real information and fourth real information; the first real information represents semantic information of real voxels; the second real information represents object centroid information of real voxels; the third real information represents semantic information of the data point; the fourth real information characterizes offset information of the data point.
The method for adjusting the network parameters of the point cloud segmentation model according to the difference between the real information and the detection information of the voxel pair comprises the following steps: determining the difference between the first detection information and the first real information to obtain a first loss value; determining the difference between the second detection information and the second real information to obtain a second loss value; determining the difference between the third detection information and the third real information to obtain a third loss value; determining the difference between the predicted offset and the fourth real information to obtain a fourth loss value; and adjusting the network parameters of the point cloud segmentation model by using the first loss value, the second loss value, the third loss value and the fourth loss value.
Wherein determining the difference between the first detection information and the first real information to obtain the first loss value comprises: determining a first sub-loss value between the first detection information and the first real information by using a Lovász-Softmax loss function; determining a second sub-loss value between the first detection information and the first real information by using a cross-entropy loss function; and summing the first sub-loss value and the second sub-loss value to obtain the first loss value.
Wherein, before determining the difference between the second detection information and the second real information to obtain the second loss value, the method comprises: inputting the sparse voxel features into the second supervision network for three-dimensional heatmap learning to obtain the probability that each sparse voxel feature belongs to an object centroid; taking the probability as the second detection information; and determining the second real information by using each object centroid and the sparse voxel features within a preset distance of it.
Wherein determining a difference between the second detected information and the second actual information to obtain a second loss value comprises: and determining the difference between the second detection information and the second real information by utilizing a Focal loss function to obtain a second loss value.
Wherein determining the difference between the third detection information and the third real information to obtain the third loss value comprises: determining a third sub-loss value between the third detection information and the third real information by using a Lovász-Softmax loss function; determining a fourth sub-loss value between the third detection information and the third real information by using a cross-entropy loss function; and summing the third sub-loss value and the fourth sub-loss value to obtain the third loss value.
Determining the difference between the predicted offset and the fourth real information to obtain a fourth loss value includes: determining the real object centroid by using the fourth real information of the data points; determining the real offset between each data point and the real object centroid; and obtaining the fourth loss value from the predicted offset and the real offset.
In order to solve the technical problem, the other technical scheme adopted by the application is as follows: provided is a point cloud data segmentation method, which comprises the following steps: acquiring point cloud data; and inputting the point cloud data into a point cloud segmentation model, and outputting the segmented point cloud data, wherein the point cloud segmentation model is obtained by utilizing the method provided by the technical scheme.
In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided a processing apparatus for point cloud data, the processing apparatus comprising a processor and a memory coupled to the processor, the memory being used for storing a computer program, and the processor being used for executing the computer program to implement the method provided in the above technical solution.
In order to solve the technical problem, the other technical scheme adopted by the application is as follows: there is provided a computer readable storage medium for storing a computer program which, when executed by a processor, is adapted to carry out the method as provided in the above-mentioned solution.
The beneficial effects of the embodiments of the present application are as follows. Different from the prior art, the point cloud segmentation model training method provided by the present application comprises: acquiring a training point cloud, wherein the data points in the training point cloud are grouped by voxel to obtain corresponding real voxels, and each real voxel is labeled with real information; inputting the training point cloud into a point cloud segmentation model to obtain predicted voxels output by the model and detection information corresponding to the predicted voxels; determining at least one voxel pair, each voxel pair comprising a corresponding predicted voxel and real voxel; and adjusting the network parameters of the point cloud segmentation model according to the difference between the real information and the detection information of the voxel pair. In this way, the corresponding predicted voxel and real voxel are determined through the voxel pairs, and the network parameters are then adjusted using the difference between the real information and the detection information of each voxel pair, so that conversions from voxel features to data point features can be reduced, the loss in the conversion process is reduced, the amount of computation is reduced, and the segmentation accuracy of the point cloud segmentation model is thereby improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts. Wherein:
FIG. 1 is a schematic flow chart diagram illustrating a first embodiment of a point cloud segmentation model training method provided in the present application;
FIG. 2 is a schematic flow chart diagram illustrating a second embodiment of a point cloud segmentation model training method provided in the present application;
FIG. 3 is a flowchart illustrating a third embodiment of a point cloud segmentation model training method provided in the present application;
fig. 4 and 5 are schematic diagrams of determining voxel pairs provided in the present application;
FIG. 6 is a schematic flowchart of a fourth embodiment of a point cloud segmentation model training method provided in the present application;
FIG. 7 is a flowchart illustrating an embodiment of step 603 provided herein;
FIG. 8 is a schematic flow chart diagram illustrating an embodiment of step 607 provided herein;
FIG. 9 is a schematic flow chart diagram illustrating an embodiment of a process prior to step 610 provided herein;
FIG. 10 is a schematic flow chart diagram illustrating an embodiment of step 612 provided herein;
FIG. 11 is a schematic flow chart diagram illustrating one embodiment of step 614 provided herein;
FIG. 12 is a schematic structural diagram of a point cloud segmentation model provided in the present application;
FIG. 13 is a schematic flowchart illustrating a point cloud data segmentation method according to a first embodiment of the present disclosure;
FIG. 14 is a schematic structural diagram of an embodiment of a device for processing point cloud data provided in the present application;
FIG. 15 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Referring to fig. 1, fig. 1 is a schematic flowchart of a first embodiment of a point cloud segmentation model training method provided in the present application. The method comprises the following steps:
step 11: acquiring a training point cloud; the data points in the training point cloud are classified according to voxels to obtain corresponding real voxels, and real information is marked on each real voxel.
It is to be understood that the training point cloud may be collected manually or automatically. For example, a point cloud within the coverage of a radar sensor may be collected by manually controlling the radar sensor, or collected automatically during movement by a radar sensor mounted on an autonomous mobile device. The training point cloud includes objects in the scene, such as buildings, trees, people and vehicles.
In some embodiments, the relevant information in the training point cloud, such as the type corresponding to each point, is labeled manually. The data points in the training point cloud are grouped by voxel to obtain the corresponding real voxels; each real voxel is then labeled, the type corresponding to the real voxel and the centroid of the real voxel are determined, and the offset between each data point and the centroid of the object it belongs to is labeled.
The labeled information is used as real information to participate in subsequent supervised learning.
In some embodiments, after the training point cloud is obtained, it is preprocessed, for example by operations such as random rotation, mirror flipping, random blurring and random cropping, to obtain a plurality of training point clouds corresponding to these operations, and the labels of the training point clouds are changed according to the actual operation. The number of training point clouds is thus greatly expanded, and training can be completed without collecting additional training point clouds.
A voxel is the 3D-space counterpart of a pixel: the point cloud is quantized into cells of fixed size, each cell having a fixed size and discrete coordinates. The voxel size may be set in advance, for example a cube of 0.1 mm or 0.2 mm on each side, so that one voxel can contain several data points of the point cloud data.
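As a purely illustrative sketch (not the patented implementation), the grouping of data points into fixed-size voxels described above can be written in PyTorch roughly as follows; the function name, tensor layout and default voxel size are assumptions.

    import torch

    def voxelize(points: torch.Tensor, voxel_size: float = 0.1):
        # points: (N, 4) tensor of (x, y, z, intensity) values.
        # Quantize each point's coordinates to a discrete voxel cell.
        coords = torch.floor(points[:, :3] / voxel_size).long()
        # Unique non-empty cells play the role of the "real voxels";
        # point_to_voxel maps every data point to its voxel index.
        voxels, point_to_voxel = torch.unique(coords, dim=0, return_inverse=True)
        return voxels, point_to_voxel

    # Example: five random points grouped into voxels of the chosen size.
    pts = torch.rand(5, 4)
    voxels, mapping = voxelize(pts)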
Step 12: and inputting the training point cloud into the point cloud segmentation model to obtain a predicted voxel output by the point cloud segmentation model and detection information corresponding to the predicted voxel.
In some embodiments, the point cloud segmentation model includes a feature extraction network, configured to perform voxel feature extraction on an input training point cloud to obtain corresponding voxel features and detection information corresponding to the voxel features. These voxel characteristics may be used as prediction voxels.
In some embodiments, a multi-layered perceptron, sub-manifold sparse convolution and sparse convolution, etc. may be employed to build the point cloud segmentation model.
Step 13: at least one voxel pair is determined, each voxel pair comprising a corresponding predicted voxel and a real voxel.
After the point cloud segmentation model outputs the predicted voxels, they cannot be directly put into correspondence with the real voxels because the predicted voxels are unordered; and without such a correspondence, the voxel-level difference between the predicted voxels and the real voxels cannot be determined.
For this reason, a way of matching predicted voxels to real voxels is provided: a correspondence is established, and the difference comparison is carried out on the basis of this correspondence.
In some embodiments, the fact that a predicted voxel and a real voxel share the same coordinate may be used to match them. If only a predicted voxel exists at a coordinate, the predicted voxel is discarded. If only a real voxel exists at a coordinate, the detection information of the corresponding predicted voxel is taken to be 0, which indicates that the accuracy of the point cloud segmentation model is still low and training needs to continue.
Step 14: and adjusting network parameters of the point cloud segmentation model according to the difference between the real information and the detection information of the voxel pair.
In some embodiments, the number of training iterations of the point cloud segmentation model may be adjusted according to the difference between the real information and the detection information of the voxel pair, thereby adjusting the network parameters of the point cloud segmentation model. For example, if the real information is A but the detection information is B, the number of training iterations of the point cloud segmentation model can be increased and its network parameters adjusted; and if the real information is A and the detection information is also A but its confidence is lower than a set threshold, the number of training iterations is likewise increased to further adjust the network parameters.
In some embodiments, the network parameters of the point cloud segmentation model may be adjusted according to the difference between the real information and the detection information of the voxel pair; for example, if the point cloud segmentation model contains a convolutional neural network, the number of convolution kernels, their stride and padding may be set, the activation function adjusted, the parameters of the pooling layer adjusted, and so on.
In some embodiments, a loss value may be calculated according to data of the real information and the detection information of the voxel pair, and if the loss value is different from a preset loss threshold, a network parameter of the point cloud segmentation model is adjusted.
In this embodiment, a training point cloud is acquired, in which the data points are grouped by voxel to obtain corresponding real voxels and each real voxel is labeled with real information; the training point cloud is input into the point cloud segmentation model to obtain the predicted voxels output by the model and the detection information corresponding to the predicted voxels; at least one voxel pair is determined, each voxel pair comprising a corresponding predicted voxel and real voxel; and the network parameters of the point cloud segmentation model are adjusted according to the difference between the real information and the detection information of the voxel pair. By determining the corresponding predicted voxel and real voxel through the voxel pairs and then adjusting the network parameters with the difference between the real information and the detection information of each voxel pair, conversions from voxel features to data point features can be reduced, the loss in the conversion process is reduced, and the amount of computation is reduced, thereby improving the segmentation accuracy of the point cloud segmentation model.
Referring to fig. 2, fig. 2 is a schematic flowchart of a second embodiment of a point cloud segmentation model training method provided in the present application. The method comprises the following steps:
step 21: acquiring training point cloud; the data points in the training point cloud are classified according to voxels to obtain corresponding real voxels, and real information is marked on each real voxel.
Step 22: and inputting the training point cloud into the point cloud segmentation model to obtain a predicted voxel output by the point cloud segmentation model and detection information corresponding to the predicted voxel.
Steps 21 to 22 have the same or similar technical solutions as those of the above embodiments, and are not described herein again.
Step 23: a first coordinate of each real voxel and a second coordinate of each predicted voxel are obtained.
It will be appreciated that the first coordinates of each real voxel may be determined at the time of labeling. The second coordinate of each prediction voxel may be determined when determining the prediction voxel.
Step 24: a target first coordinate and a target second coordinate having the same coordinate are determined.
At this time, the target first coordinate and the target second coordinate having the same coordinate may be determined.
Step 25: and determining the real voxel of the first coordinate of the target and the predicted voxel corresponding to the second coordinate of the target as a voxel pair.
For example, the first coordinate of the real voxel A is a, of the real voxel B is b, of the real voxel C is c, of the real voxel D is d, of the real voxel E is e, and of the real voxel F is f. The second coordinate of the predicted voxel A' is a, of the predicted voxel B' is c, of the predicted voxel C' is d, of the predicted voxel D' is b, of the predicted voxel E' is e, of the predicted voxel F' is f, of the predicted voxel G' is g, and of the predicted voxel H' is h.
The real voxel A and the predicted voxel A' form a voxel pair, the real voxel B and the predicted voxel D' form a voxel pair, the real voxel C and the predicted voxel B' form a voxel pair, the real voxel D and the predicted voxel C' form a voxel pair, the real voxel E and the predicted voxel E' form a voxel pair, and the real voxel F and the predicted voxel F' form a voxel pair.
The remaining unpaired prediction voxels may then be discarded.
Step 26: and adjusting network parameters of the point cloud segmentation model according to the difference between the real information and the detection information of the voxel pair.
In this embodiment, the corresponding predicted voxel and real voxel are determined by using the determined voxel pair, and then the network parameters of the point cloud segmentation model are adjusted by using the difference between the real information and the detection information of the voxel pair, so that the conversion from the voxel characteristics to the data point characteristics can be reduced, the loss in the conversion process is reduced, the calculation amount is reduced, and the segmentation accuracy of the point cloud segmentation model is improved.
Referring to fig. 3, fig. 3 is a schematic flowchart of a third embodiment of a point cloud segmentation model training method provided in the present application. The method comprises the following steps:
step 31: acquiring training point cloud; the data points in the training point cloud are classified according to voxels to obtain corresponding real voxels, and each real voxel is marked with real information.
Step 32: and inputting the training point cloud into the point cloud segmentation model to obtain a predicted voxel output by the point cloud segmentation model and detection information corresponding to the predicted voxel.
Steps 31 to 32 have the same or similar technical solutions as any of the above embodiments, and are not described herein again.
Step 33: and carrying out Hash coding on the first coordinates of all the real voxels to obtain a corresponding first Hash table.
Hash coding is performed on the first coordinates of all real voxels to form key-value pairs: each first coordinate is used as the value in a key-value pair and is assigned a key, so that the first hash table is formed.
For example, the first coordinate of the real voxel A is a, the first coordinate of the real voxel B is b, the first coordinate of the real voxel C is c, the first coordinate of the real voxel D is d, the first coordinate of the real voxel E is e, and the first coordinate of the real voxel F is f. The corresponding first hash table can be represented as {(0, a), (1, b), (2, c), (3, d), (4, e), (5, f)}.
Step 34: and carrying out hash coding on the second coordinates of all the predicted voxels to obtain a corresponding second hash table.
And similarly, performing hash coding on the second coordinates of all the predicted voxels to obtain a corresponding second hash table.
Step 35: and traversing by using the second hash table of each predicted voxel and all the first hash tables, and determining a target first coordinate and a target second coordinate with the same coordinate.
For example, according to the data in the second hash table, each data is subjected to traversal matching with the data in all the first hash tables, and a target first coordinate and a target second coordinate having the same coordinate are determined.
In other embodiments, the target first coordinate and the target second coordinate having the same coordinate may be determined by traversing the first hash table and all the second hash tables for each real voxel.
Step 36: and determining the real voxel of the first coordinate of the target and the predicted voxel corresponding to the second coordinate of the target as a voxel pair.
The following description is made with reference to fig. 4 and 5:
as shown in fig. 4, the sparse voxel-level prediction is denoted as an N × C matrix, where N is the number of non-empty voxels and C is the feature dimension, and the sparse voxel-level ground truth is denoted as an N1 × C1 matrix. The numbers of predicted and real voxels differ (N versus N1), and so does their ordering: for example, the feature at voxel coordinate (x, y, z) may be the 2nd row of N × C but the 4th row of N1 × C1, or absent altogether. This is a characteristic of sparse features, i.e. random arrangement, unlike images, which have a one-to-one coordinate correspondence. The reason is that, since the points distributed in space are sparse, some voxels are necessarily empty and hence have no value after the whole space is voxelized; the sparse representation discards these empty values, so the features become misaligned.
The present solution is provided for operations between real voxels and predicted voxels, such as computing a cross-entropy loss or an L1 loss. First, a hash table is built to encode the coordinates of all valid real voxels, with the hash value serving as the key of each position coordinate. The hash values of the coordinates are then used to query the features at the corresponding coordinates in the target, so that the features at positions where the real voxels and the predicted voxels share the same coordinate can be obtained, after which an ordinary loss-function calculation can be used for supervision.
Specifically, as shown in fig. 5, given two sparse features S and S', their coordinate values are hashed separately to obtain hash tables X and Y.
For each value s in S, the corresponding entry in X is matched against Y by hash matching.
When a match is found, the feature of S' at the corresponding position is taken, and the loss function is computed between it and the feature of S.
When no match is found, the entry is skipped and the next one is processed.
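A minimal sketch of the coordinate-matching procedure above, assuming integer voxel coordinates; the dictionary used as the hash table and the function name are illustrative, not the patent's own code.

    import torch

    def match_voxel_pairs(real_coords: torch.Tensor, pred_coords: torch.Tensor):
        # real_coords: (N1, 3) integer coordinates of the labelled real voxels.
        # pred_coords: (N, 3) integer coordinates of the predicted non-empty voxels.
        # Hash table: coordinate tuple -> row index of the real voxel.
        real_table = {tuple(c.tolist()): i for i, c in enumerate(real_coords)}
        pairs = []
        for i, c in enumerate(pred_coords):
            j = real_table.get(tuple(c.tolist()))
            if j is not None:          # same coordinate found: a voxel pair
                pairs.append((i, j))
            # no match: skip to the next predicted voxel
        return pairs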
Step 37: and adjusting network parameters of the point cloud segmentation model according to the difference between the real information and the detection information of the voxel pair.
In this embodiment, the predicted voxels and the real voxels are encoded in a hash encoding manner, the determined voxel pairs are further utilized to determine the corresponding predicted voxels and real voxels, and then the difference between the real information and the detected information of the voxel pairs is utilized to adjust the network parameters of the point cloud segmentation model, so that the conversion from the voxel characteristics to the data point characteristics can be reduced, the loss in the conversion process is reduced, the calculation amount is reduced, and the segmentation accuracy of the point cloud segmentation model is improved.
Referring to fig. 6, fig. 6 is a schematic flow chart diagram of a fourth embodiment of a point cloud segmentation model training method provided in the present application. The point cloud segmentation model comprises the following steps: the system comprises a sparse voxel characteristic coding network, a three-dimensional sparse network, a point voxel network, a first supervision network, a second supervision network, a third supervision network and a fourth supervision network. The method comprises the following steps:
step 601: acquiring training point cloud; the data points in the training point cloud are classified according to voxels to obtain corresponding real voxels, and real information is marked on each real voxel.
In some embodiments, each data point is acquired by a radar sensor, and the data point has four-dimensional data, i.e., a three-dimensional coordinate and a reflection intensity corresponding to the radar sensor.
Step 602: and inputting the training point cloud into a sparse voxel characteristic coding network for characteristic extraction to obtain voxel characteristics.
The training point cloud is subjected to voxelized feature extraction by the sparse voxel feature encoding network to obtain point-level feature vectors and voxel-level voxel features. First, feature extraction is performed on each data point in the training point cloud by a point-wise multilayer perceptron (MLP) in the sparse voxel feature encoding network to obtain the point cloud feature of each data point, for example a two-layer linear multilayer perceptron with 32 or 64 output channels per layer.
And then dividing the training point cloud according to the size of the voxel to obtain a data point corresponding to each voxel. Because the data points all obtain corresponding point cloud characteristics, the point cloud characteristics of the data points can be aggregated to form voxel characteristics.
Specifically, all data points in the target voxel are determined, and the operation of taking the maximum value, or taking the minimum value, or taking the average value is performed on the point cloud features corresponding to the data points to obtain a target point cloud feature. And taking the target point cloud characteristic as the voxel characteristic of the target voxel.
Further, the voxel feature can be combined again with the point cloud features of all the data points in the target voxel, and the combined features are passed through a multilayer perceptron for feature extraction, so that the final point cloud feature of each data point carries voxel feature information, i.e. the point cloud features contain the context information of the voxel features.
And then, carrying out maximum value taking or minimum value taking or average value taking on the point cloud characteristics corresponding to all the data points in the target voxel to obtain a target point cloud characteristic. And taking the target point cloud characteristic as the voxel characteristic of the target voxel.
At this time, the voxel characteristic and the point cloud characteristic have stronger correlation.
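The encoding flow just described (per-point MLP, max aggregation within each voxel, and concatenation of the voxel feature back to its points) could be sketched as below; the layer sizes are assumptions, and the sketch relies on a recent PyTorch that provides Tensor.scatter_reduce.

    import torch
    import torch.nn as nn

    class SparseVoxelFeatureEncoder(nn.Module):
        def __init__(self, in_dim=4, hidden=32, out_dim=64):
            super().__init__()
            self.mlp1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.mlp2 = nn.Sequential(nn.Linear(2 * hidden, out_dim), nn.ReLU())

        @staticmethod
        def voxel_max(f, point_to_voxel, num_voxels):
            # Aggregate the point features of each voxel by taking the maximum.
            out = f.new_full((num_voxels, f.shape[1]), float('-inf'))
            idx = point_to_voxel[:, None].expand_as(f)
            return out.scatter_reduce(0, idx, f, reduce='amax', include_self=True)

        def forward(self, points, point_to_voxel, num_voxels):
            f = self.mlp1(points)                              # point-wise features
            v = self.voxel_max(f, point_to_voxel, num_voxels)  # voxel features
            # Concatenate each point feature with its voxel feature (context info).
            f = self.mlp2(torch.cat([f, v[point_to_voxel]], dim=-1))
            v = self.voxel_max(f, point_to_voxel, num_voxels)  # final voxel features
            return f, v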
Step 603: and inputting the voxel characteristics into a three-dimensional sparse network for sparse voxel characteristic extraction to obtain sparse voxel characteristics.
In some embodiments, the three-dimensional sparse network comprises: a first network block, a second network block, a third network block, a fourth network block, and a fusion layer. The first network block comprises 2 basic units, the second network block comprises 2 basic units, the third network block comprises 3 basic units, the fourth network block comprises 4 basic units, and each basic unit comprises two layers of sub-manifold sparse convolution and one layer of sparse convolution.
In an application scenario, referring to fig. 7, step 603 may be the following process:
step 6031: and performing sparse feature extraction on the voxel features by utilizing the first network block to obtain first sparse voxel features.
And performing sparse feature extraction on the voxel features by using the sub-manifold sparse convolution and the sparse convolution in the 2 basic units in the first network block to obtain first sparse voxel features.
Step 6032: and performing sparse feature extraction on the first sparse voxel feature by using a second network block to obtain a second sparse voxel feature.
And performing sparse feature extraction on the first sparse voxel feature by utilizing the sub-manifold sparse convolution and the sparse convolution in the 2 basic units in the second network block to obtain a second sparse voxel feature.
Step 6033: and performing sparse feature extraction on the second sparse voxel feature by using a third network block to obtain a third sparse voxel feature.
And performing sparse feature extraction on the second sparse voxel feature by using sub-manifold sparse convolution and sparse convolution in 3 basic units in the third network block to obtain a third sparse voxel feature.
Step 6034: and performing sparse feature extraction on the third sparse voxel feature by using a fourth network block to obtain a fourth sparse voxel feature.
And performing sparse feature extraction on the third sparse voxel feature by utilizing the sub-manifold sparse convolution and the sparse convolution in the 4 basic units in the fourth network block to obtain a fourth sparse voxel feature.
Step 6035: and splicing and fusing the second sparse voxel characteristic, the third sparse voxel characteristic and the fourth sparse voxel characteristic by utilizing the fusion layer to obtain a fifth sparse voxel characteristic.
The fifth sparse voxel characteristic has more information.
In the above process, the sub-manifold sparse convolution maintains the sparsity of the features during computation, while the sparse convolution dilates the activated region so that the features can diffuse outward and cover the true object centroid, which may itself contain no data point. The combined use of sub-manifold sparse convolution and sparse convolution is therefore well suited to sparse point clouds that are distributed only on object surfaces.
Specifically, the sub-manifold sparse convolutions in each basic unit are used for feature extraction, and the sparse convolution is used to form a shortcut connection between the input and the output of the basic unit and complete the concatenation.
In some embodiments, the first network block and the second network block employ sub-manifold sparse max pooling to expand the voxel receptive field.
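One basic unit of the three-dimensional sparse network could be sketched with the open-source spconv package as below; the use of spconv, the channel numbers and the kernel sizes are assumptions for illustration, not a statement of the patented implementation.

    import torch.nn as nn
    import spconv.pytorch as spconv

    def basic_unit(in_ch, out_ch, indice_key):
        # Two sub-manifold sparse convolutions keep the sparsity pattern unchanged,
        # and one sparse convolution is allowed to dilate the active region.
        return spconv.SparseSequential(
            spconv.SubMConv3d(in_ch, out_ch, 3, padding=1, indice_key=indice_key),
            nn.ReLU(),
            spconv.SubMConv3d(out_ch, out_ch, 3, padding=1, indice_key=indice_key),
            nn.ReLU(),
            spconv.SparseConv3d(out_ch, out_ch, 3, stride=1, padding=1),
            nn.ReLU(),
        )

    # A "first network block" with 2 basic units could then be chained, e.g.:
    # block1 = spconv.SparseSequential(basic_unit(64, 64, 'b1a'), basic_unit(64, 64, 'b1b'))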
Step 604: and inputting the sparse voxel characteristics into a point voxel network for characteristic conversion to obtain data point characteristics.
In some embodiments, feature transformation may be performed on the point voxel network based on the second sparse voxel feature, the third sparse voxel feature, the fourth sparse voxel feature, and the point cloud feature output in the sparse voxel feature coding network, so as to obtain a data point feature.
Specifically, the second sparse voxel feature, the third sparse voxel feature, the fourth sparse voxel feature and the point cloud feature are combined to obtain the data point feature. The data point features thus carry voxel feature information at different scales, i.e. the point-voxel features contain the context information of the voxel features.
Step 605: and inputting the sparse voxel characteristics into a first supervision network to carry out semantic segmentation learning on the three-dimensional voxels, so as to obtain first detection information corresponding to the predicted voxels.
Step 606: at least one voxel pair is determined, each voxel pair comprising a corresponding predicted voxel and a real voxel.
Step 606 has the same or similar technical solutions as any of the above embodiments, and is not described herein.
Step 607: and determining the difference between the first detection information and the first real information to obtain a first loss value.
In some embodiments, referring to fig. 8, step 607 may be the following flow:
step 6071: a first sub-loss value between the first detection information and the first real information is determined using the Lovász-Softmax loss function.
The Lovász-Softmax loss function can be expressed as follows:
$L_{LV} = \frac{1}{|C|}\sum_{c \in C}\overline{\Delta_{J_c}}\big(m(c)\big)$
where $C$ is the set of semantic classes, $m(c)$ is the vector of prediction errors for class $c$, and $\overline{\Delta_{J_c}}$ is the Lovász extension of the Jaccard index for class $c$.
step 6072: a second sub-loss value between the first detection information and the first real information is determined using the cross-entropy loss function.
The cross-entropy loss function can be expressed as follows:
$L_{CE} = -\sum_{c \in C} y_c \log(p_c)$
where $y_c$ is the one-hot ground-truth label for class $c$ and $p_c$ is the predicted probability of class $c$.
step 6073: and summing the first sub-loss value and the second sub-loss value to obtain the first loss value.
Step 608: and inputting the sparse voxel characteristics into a second supervision network to carry out three-dimensional thermodynamic diagram learning, so as to obtain second detection information corresponding to the prediction voxel.
Step 609: at least one voxel pair is determined, each voxel pair comprising a corresponding predicted voxel and a real voxel.
Step 609 has the same or similar technical scheme as any of the above embodiments, and is not described herein.
Step 610: and determining the difference between the second detection information and the second real information to obtain a second loss value.
And determining the difference between the second detection information and the second real information by utilizing a Focal loss function to obtain a second loss value.
Wherein the Focal loss function is expressed as follows:
$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$
where $p_t$ is the predicted probability assigned to the ground-truth value and $\gamma$ is the focusing parameter.
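An illustrative sketch of how the two per-voxel losses above could be computed; lovasz_softmax is an assumed helper (for example, taken from a public reference implementation), and the focal-loss form simply follows the formula above.

    import torch
    import torch.nn.functional as F

    def first_loss(logits, labels, lovasz_softmax):
        # First loss value: Lovasz-Softmax term plus cross-entropy term (summed).
        return lovasz_softmax(logits.softmax(dim=-1), labels) + F.cross_entropy(logits, labels)

    def focal_loss(heatmap_pred, heatmap_gt, gamma=2.0, eps=1e-6):
        # Second loss value: FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the
        # probability the model assigns to the ground-truth outcome of each voxel.
        p_t = torch.where(heatmap_gt > 0.5, heatmap_pred, 1.0 - heatmap_pred)
        return (-(1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(eps))).mean()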
in some embodiments, referring to fig. 9, prior to step 610, the following may be a flow:
step 81: and inputting the sparse voxel features into the second supervision network to carry out three-dimensional heatmap learning, so as to obtain the probability that each sparse voxel feature belongs to an object centroid.
Step 82: and taking the probability as second detection information.
Step 83: and determining second real information by using the centroid of each object and the sparse voxel characteristics within the preset distance.
Step 611: and inputting the data point characteristics to a third supervision network for semantic segmentation learning at the point level to obtain third detection information corresponding to each data point.
Step 612: and determining the difference between the third detection information and the third real information to obtain a third loss value.
In some embodiments, referring to fig. 10, step 612 may be the following flow:
step 6121: and determining a third sub-loss value between the third detection information and the third real information by using the Lovász-Softmax loss function.
step 6122: and determining a fourth sub-loss value between the third detection information and the third real information by using the cross-entropy loss function.
Step 6123: and summing the third sub-loss value and the fourth sub-loss value to obtain a third loss value.
Step 613: and inputting the data point characteristics into a fourth monitoring network to perform point-level offset monitoring learning to obtain a predicted offset corresponding to each data point.
Step 614: and determining the difference between the prediction offset and the fourth real information to obtain a fourth loss value.
In some embodiments, referring to fig. 11, step 614 may be the following flow:
step 6141: and determining the centroid of the real object by using the fourth real information of the data point.
It is understood that the real object centroid may not be a data point.
Step 6142: and determining the real offset of the data point and the center of mass of the real object.
In some embodiments, this true offset may be determined in advance at the time of annotation.
Step 6143: and obtaining a fourth loss value by using the predicted offset and the real offset.
In some embodiments, a smooth L1 loss function may be employed to obtain the fourth loss value.
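A hedged sketch of the fourth loss value, assuming that a per-point centroid target and an object mask are available; all names are illustrative.

    import torch
    import torch.nn.functional as F

    def offset_loss(pred_offset, points, point_centroid, object_mask):
        # pred_offset, points, point_centroid: (N, 3); object_mask: (N,) bool.
        # True offset = real object centroid minus the point coordinate; only
        # object points contribute to the smooth-L1 regression.
        true_offset = point_centroid - points
        mask = object_mask.float().unsqueeze(-1)
        return F.smooth_l1_loss(pred_offset * mask, true_offset * mask)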
Step 615: and adjusting the network parameters of the point cloud segmentation model by using the first loss value, the second loss value, the third loss value and the fourth loss value.
The network parameters of the point cloud segmentation model can be adjusted using a gradient descent algorithm or a stochastic gradient descent algorithm.
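A minimal sketch of one parameter update from the four loss values, assuming a stochastic gradient descent optimizer; the variable names are illustrative.

    import torch

    def train_step(model, optimizer, losses):
        # losses: the list [loss1, loss2, loss3, loss4] for one training point cloud.
        total_loss = sum(losses)
        optimizer.zero_grad()
        total_loss.backward()     # gradients w.r.t. all network parameters
        optimizer.step()          # stochastic gradient descent update
        return total_loss.item()

    # optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)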
In an application scenario, the following is explained with reference to fig. 12: the point cloud segmentation model comprises the following steps: the system comprises a sparse voxel characteristic coding network, a three-dimensional sparse network, a point voxel network, a first supervision network, a second supervision network, a third supervision network and a fourth supervision network.
The point voxel network comprises a first point cloud feature extraction layer, a first point voxel feature extraction layer, a second point voxel feature extraction layer and a third point voxel feature extraction layer. The three-dimensional sparse network includes: a first network block, a second network block, a third network block, a fourth network block, and a fusion layer.
The second supervision network is a three-dimensional heatmap network, the third supervision network is a point-by-point semantic network, and the fourth supervision network is a point-by-point offset network.
Inputting the training point cloud data into a sparse voxel characteristic coding network for characteristic extraction, and correspondingly obtaining the point cloud characteristic and the voxel characteristic of each data point.
And inputting the voxel characteristics to a first network block for characteristic extraction to obtain first sparse voxel characteristics. And inputting the first voxel characteristic into a second network block for characteristic extraction to obtain a second sparse voxel characteristic.
And inputting the second voxel characteristic into a third network block for characteristic extraction to obtain a third sparse voxel characteristic.
And inputting the third voxel characteristic into a fourth network block for characteristic extraction to obtain a fourth sparse voxel characteristic.
And inputting the point cloud characteristics to a first point cloud characteristic extraction layer for characteristic extraction, and correspondingly obtaining the first point cloud characteristics.
And inputting the first point cloud feature and the second sparse voxel feature into a first point voxel feature extraction layer for feature extraction and fusion to obtain a first point voxel feature.
And inputting the first point voxel characteristic and the third sparse voxel characteristic into a second point voxel characteristic extraction layer for characteristic extraction and fusion to obtain a second point voxel characteristic.
And inputting the second point voxel characteristic and the fourth voxel characteristic into a third point voxel characteristic extraction layer for characteristic extraction and fusion to obtain a third point voxel characteristic.
And respectively inputting the third point voxel characteristics into a point-by-point semantic network and a point-by-point offset network. Semantic information and offsets for each data point are obtained.
And inputting the second sparse voxel characteristic, the third sparse voxel characteristic and the fourth sparse voxel characteristic into the fusion layer for splicing and fusion to obtain a fifth sparse voxel characteristic.
And inputting the fifth sparse voxel feature into the three-dimensional heatmap network for heatmap learning to obtain the second detection information corresponding to the fifth sparse voxel feature.
And inputting the second sparse voxel characteristic, the third sparse voxel characteristic and the fourth sparse voxel characteristic into a first supervision network for semantic segmentation learning on the three-dimensional voxels, so as to obtain first detection information corresponding to the predicted voxels.
Specifically, the hybrid sparse supervision consists of four supervision networks responsible for different tasks: a point-by-point semantic network for predicting amorphous regions; a point-by-point offset network; a 3D class-independent sparsely encoded centroid heatmap network for object clustering; and an auxiliary sparse voxel semantic network for better feature learning, i.e. the first supervision network. The four networks share the backbone network, are trained end to end, and effectively support the joint learning of semantic and instance segmentation.
The point-by-point semantic network consists of a series of linear layers, as used in many previous works. It is supervised with the sum of the Lovász-Softmax loss and the cross-entropy loss; this loss is denoted $L_{SP}$.
The point-by-point offset network supervises the offset of each data point. The offset prediction for the points belonging to objects is denoted $O$, i.e. the predicted offset. The shifted points, obtained by adding $O$ to the original coordinates of the point cloud, are expected to be distributed around the object centroids. For the ground truth, an instance tensor $I_P = \{R_P \cdot M_I, I\}$ is established, where $I$ denotes the instance segmentation ground-truth labels and $M_I$ is a ground-truth binary mask that keeps only object points. $R_C$ denotes the centroid ground truth of the object points. To obtain $R_C$, the $F_{P \to V}$ operator with the averaging operator $\Phi$ is applied to $I_P$ to obtain the per-instance centroids, and the $F_{V \to P}$ operator then assigns each centroid back to the points of its instance, as expressed by the following equation:
$R_C = F_{V \to P}(F_{P \to V}(I_P, \Phi))$
The offset is regressed with a smooth L1 loss, in which only object points participate, expressed by the following formula:
$L_O = L_{\mathrm{SmoothL1}}\big(O - (R_C - R_P \cdot M_I)\big)$
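The averaging-then-gathering step that builds the per-point centroid target $R_C$ could look like the sketch below; it is an assumption-laden illustration of the scatter-mean and gather operations, not the patented code.

    import torch

    def per_point_instance_centroid(points: torch.Tensor, instance_ids: torch.Tensor):
        # points: (N, 3) coordinates of object points; instance_ids: (N,) instance labels.
        uniq, inv = torch.unique(instance_ids, return_inverse=True)
        sums = points.new_zeros((uniq.shape[0], 3)).index_add_(0, inv, points)
        counts = points.new_zeros(uniq.shape[0]).index_add_(
            0, inv, torch.ones_like(inv, dtype=points.dtype))
        centroids = sums / counts.clamp_min(1.0).unsqueeze(-1)  # per-instance mean
        return centroids[inv]                                   # broadcast back to each point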
3D class-independent sparsely encoded centroid heatmap network. Denote the number of activated voxels as $N_V$. The heatmap network models the probability $H$ that each 3D voxel is an object centroid. The sparsely encoded ground truth $H_{GT}$ is therefore computed from the inverse of the distance between each object centroid and its surrounding activated voxels. Specifically, the following formulas are used:
$R_H = F_{P \to V}(I_P, \Phi), \quad V_I = F_{P \to V}(\{R_P \cdot M_I, I\}, \Phi')$
where $\Phi$ denotes the averaging operator and $\Phi'$ denotes the operator that takes the most frequent instance label; $V_I$ is the instance label $t$ of each voxel, and $R_H$ is the centroid $v_c$ corresponding to each instance label $t$. To compute $R_H$ and $V_I$ efficiently, an instance tensor $I_P = \{I, R_P \cdot M_I\}$ is established, where $I$ denotes the instance labels.
In addition, in $H_{GT}$ the voxel containing a centroid is set to 1 and the voxels around the centroid are set to 0.8, which ensures that the ground truth covers the true centroid. Moreover, SC convolution is adopted in the sparse convolution layers of the sparsely encoded centroid heatmap network, so that the heatmap features can diffuse to the true object centroid. Accordingly, the $F_{V \to V}$ operator needs to be applied here to align the mismatched $H$ and $H_{GT}$. The loss is computed with the focal loss, expressed by the following formula:
$L_H = L_{\mathrm{focal}}(F_{V \to V}(H, H_{GT}))$
sparse voxel semantic network. Sparse voxel features from multiple levels in a backbone network
are each input to a sparse voxel semantic network containing a series of SSC convolution layers that keep the activated regions unchanged; these multi-level features are denoted F_V^i. Denote the i-th level sparse voxel prediction as S_V^i and the corresponding ground truth as S_GT^i, the latter being a sparse-encoded tensor holding the majority point class within each activated voxel. After aligning S_V^i and S_GT^i with the F_V→V operator, the loss is calculated using the following equation:
L_SV = Σ_i [ L_LV(F_V→V(S_V^i, S_GT^i)) + L_CE(F_V→V(S_V^i, S_GT^i)) ],
where L_LV denotes the Lovász-Softmax loss and L_CE denotes the cross-entropy loss. The sparse voxel semantic network serves as auxiliary supervision so that more sufficient feature learning is obtained in the joint training with the other networks.
The overall loss of the point cloud segmentation model is the sum of the above, expressed by the following formula:
L = L_Sp + L_O + L_H + L_SV.
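The following sketch illustrates how the four supervisions could be combined in code. The Lovász-Softmax term of the auxiliary voxel loss is omitted for brevity, equal loss weights are assumed, and all function names are chosen here; the present application does not specify a weighting.

import torch.nn.functional as F

def sparse_voxel_semantic_loss(voxel_logits, voxel_labels):
    """Auxiliary loss over multi-level sparse voxel predictions: each entry pairs an
    aligned level-i prediction (N_i, C) with its majority-class voxel labels (N_i,).
    Only the cross-entropy term is shown; the Lovász-Softmax term is omitted."""
    return sum(F.cross_entropy(logits, labels) for logits, labels in zip(voxel_logits, voxel_labels))

def total_loss(l_sp, l_o, l_h, l_sv, weights=(1.0, 1.0, 1.0, 1.0)):
    """Overall training loss as a weighted sum of the four supervisions (equal weights assumed)."""
    return weights[0] * l_sp + weights[1] * l_o + weights[2] * l_h + weights[3] * l_sv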
the operation operators in the above process are described below.
For most voxel-based methods, feature alignment is a common operation, whether in a voxel feature encoder or when gathering point-wise features, in order to pass information between point and voxel features. However, previous work has considered only two cases: 1. voxelizing point features into voxel features, F_P→V; 2. collecting point features from voxels, F_V→P. Neither of these solves the alignment problem between unmatched voxel features. In order to supervise the sparse voxel features, the present application introduces a new operator, F_V→V.
Data of unordered points and sparse voxels (including predictions and labels) are consolidated into one sparse representation. The sparse tensor is represented as:
S = {C, F}, C = {c_k = (x, y, z), k ∈ [1, N]},
where C is the spatial coordinate of a 3D voxel or point and F is the feature vector corresponding to that coordinate. More specifically, the point cloud segmentation network operates on two broad classes of tensors: the point cloud tensor T = {R_P, F_P} and the sparse voxel tensor S = {R_V, F_V}. T and S are transformed into each other to align features between points and voxels.
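As a minimal illustration of this representation, a small container type could look as follows; the class name and field names are assumptions of this sketch.

from dataclasses import dataclass
import torch

@dataclass
class SparseTensor:
    """Minimal container for the sparse representation S = {C, F}: integer voxel
    (or point) coordinates C and the feature vector F attached to each coordinate."""
    coords: torch.Tensor   # (N, 3) integer coordinates c_k = (x, y, z)
    feats: torch.Tensor    # (N, D) feature vector per coordinate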
1) F_P→V: given a point cloud tensor T, the F_P→V operator converts it into a sparse voxel tensor S:
{R_V, F_V} = F_P→V({R_P, F_P}, Φ);
R_V = unique(⌊R_P / s⌋);
F_V(v) = Φ({F_P^k : ⌊R_P^k / s⌋ = v}) for each voxel coordinate v in R_V,
where s denotes the voxel size and Φ by default denotes the operator taking the maximum value. F_P→V therefore voxelizes coordinates and features simultaneously.
2) F_V→P: to derive a point tensor T from a sparse voxel tensor S, the F_V→P operator assigns to each point the feature of the voxel in which it lies, which is represented as:
{R_P, F_P} = F_V→P({R_V, F_V});
F_P^k = F_V(⌊R_P^k / s⌋), i.e., each point takes the feature of the voxel whose coordinate equals its own quantized coordinate.
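A hedged sketch of these two operators in PyTorch follows. The function names are assumptions, Φ is realized here as a max or mean reduction, and scatter_reduce requires PyTorch 1.12 or later; the present application does not prescribe a particular implementation.

import torch

def f_p2v(point_xyz: torch.Tensor, point_feat: torch.Tensor,
          voxel_size: float, reduce: str = "max"):
    """F_P->V: voxelize point coordinates and features simultaneously.

    Points falling into the same voxel are reduced with Phi (max or mean here).
    Returns unique voxel coordinates R_V, reduced voxel features F_V, and the
    per-point voxel index used later by F_V->P."""
    voxel_coords = torch.floor(point_xyz / voxel_size).long()          # quantize by voxel size s
    uniq_coords, inverse = torch.unique(voxel_coords, dim=0, return_inverse=True)
    n_voxel, n_dim = uniq_coords.shape[0], point_feat.shape[1]
    if reduce == "mean":
        voxel_feat = point_feat.new_zeros(n_voxel, n_dim)
        voxel_feat.index_add_(0, inverse, point_feat)                  # sum features per voxel
        counts = torch.bincount(inverse, minlength=n_voxel).clamp(min=1)
        voxel_feat = voxel_feat / counts.unsqueeze(1).float()
    else:  # "max"
        voxel_feat = point_feat.new_full((n_voxel, n_dim), float("-inf"))
        voxel_feat = voxel_feat.scatter_reduce(0, inverse.unsqueeze(1).expand(-1, n_dim),
                                               point_feat, reduce="amax")
    return uniq_coords, voxel_feat, inverse

def f_v2p(voxel_feat: torch.Tensor, point_to_voxel: torch.Tensor):
    """F_V->P: give every point the feature of the voxel it falls into.
    `point_to_voxel` is the inverse index returned by f_p2v, so this is a plain gather."""
    return voxel_feat[point_to_voxel]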
3) F_V→V: the F_P→V and F_V→P operators above only consider transformations between points and voxels and cannot handle the alignment or supervision of sparse voxel tensors. Given two tensors S and S' whose coordinates may not match, F_V→V matches their features by corresponding coordinates in a greedy manner: a hash table is first constructed to encode the coordinates of all activated voxels, and the coordinates of the target sparse tensor are then used as keys to look up the matching features.
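The F_V→V alignment could be sketched as below, with a Python dict standing in for the hash table over activated voxel coordinates; the greedy per-coordinate lookup mirrors the description above, and the fill value for unmatched voxels is an assumption of this sketch.

import torch

def f_v2v(src_coords: torch.Tensor, src_feats: torch.Tensor,
          tgt_coords: torch.Tensor, fill_value: float = 0.0):
    """F_V->V: align a sparse voxel tensor {src_coords, src_feats} onto the
    coordinate set of a target sparse tensor whose coordinates may not match."""
    table = {tuple(c.tolist()): i for i, c in enumerate(src_coords)}   # hash all source voxel coords
    out = src_feats.new_full((tgt_coords.shape[0], src_feats.shape[1]), fill_value)
    for j, c in enumerate(tgt_coords):
        i = table.get(tuple(c.tolist()))                               # target coordinate used as key
        if i is not None:                                              # coordinate match found
            out[j] = src_feats[i]
    return out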
The working principles of the sparse voxel feature encoding network, the three-dimensional sparse network, and the point voxel network are introduced below in turn.
The sparse voxel feature encoding network assigns each data point of the training point cloud to voxels evenly distributed in space, and extracts point-wise features and sparse voxel features at the same time. For a point cloud {R_P, F}, the input is formed as F ← {F, c_m, v_m}, i.e., the original feature F is concatenated with the centroid c_m of the points in its voxel and with the voxel center coordinate v_m. After several linear layers, the F_P→V and F_V→P operators are jointly used to extract the output of each layer, as represented by the following formulas:
F = MLP(F), F_V = F_P→V(F, Φ);
F ← F ⊕ F_V→P(F_V);
where ⊕ denotes the feature concatenation (concat) operation and Φ denotes the averaging operator. In the sparse voxel feature encoding network, the point-wise features thereby incorporate the geometric context of their voxels, while the sparse voxel features F_V are fed into the next three-dimensional sparse network.
Two kinds of sparse convolution (SC and SSC) are used together in the three-dimensional sparse network. SSC keeps the features sparse during computation and is widely used throughout the network, while SC dilates the active region and is used only in the heat map network head to spread features outward so that they cover the true instance centroid, which may otherwise lie in a voxel containing no points. This combination is well suited to sparse point clouds, which are distributed only over object surfaces.
The three-dimensional sparse network includes four network blocks. The basic block SUBM is defined as a basic unit including two SSC layers with convolution kernel size 3 and one SC layer with convolution kernel size 1; the former are used for feature extraction and the latter short-circuits the input and output of the unit. Network blocks 1 to 4 contain 2, 2, 3 and 4 basic units, respectively. In addition, the first two network blocks employ submanifold sparse max pooling to expand the voxel receptive field. Given the input sparse feature F_V, the output features of the network blocks are denoted F_V^i, where i equals 1 to 4.
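A sketch of the basic SUBM unit is given below, assuming the spconv.pytorch library as the sparse convolution backend (the patent does not name a specific library). The kernel-1 SC shortcut is realized here as a per-voxel linear layer, which acts like a 1x1x1 convolution, and the channel width is illustrative; blocks 1 to 4 would stack 2, 2, 3 and 4 such units, with sparse max pooling after the first two blocks.

import torch.nn as nn
import spconv.pytorch as spconv   # assumed sparse convolution backend

class BasicUnitSUBM(nn.Module):
    """Two submanifold (SSC) convs with kernel size 3 for feature extraction, plus a
    kernel-1 shortcut that short-circuits the unit's input to its output."""
    def __init__(self, channels: int, indice_key: str):
        super().__init__()
        self.ssc = spconv.SparseSequential(
            spconv.SubMConv3d(channels, channels, 3, padding=1, bias=False, indice_key=indice_key),
            nn.BatchNorm1d(channels), nn.ReLU(),
            spconv.SubMConv3d(channels, channels, 3, padding=1, bias=False, indice_key=indice_key),
            nn.BatchNorm1d(channels), nn.ReLU(),
        )
        # a kernel-1 convolution acts on each active voxel independently,
        # so a per-voxel linear layer is used here as an equivalent shortcut
        self.shortcut = nn.Linear(channels, channels)

    def forward(self, x: spconv.SparseConvTensor) -> spconv.SparseConvTensor:
        out = self.ssc(x)                                   # SSC path keeps the active set unchanged
        return out.replace_feature(out.features + self.shortcut(x.features))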
Point voxel network. The multi-level sparse features and the point-wise features are jointly encoded in the point voxel network; such joint encoding is a very efficient aggregation of features. However, whereas the related art only indexes the non-empty voxels around neighborhood key points, the extraction in the present application covers the entire point cloud through the F_V→P operator, expressed by the following formula:
P = {F, F_V→P(F_V^2), F_V→P(F_V^3), F_V→P(F_V^4)}.
In this way, the sparse voxel features of the last three network blocks are aggregated with the point-wise features output by the sparse voxel feature encoding network, so that the output P of the point branch integrates shallow geometric information and deep contextual information.
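A minimal sketch of this aggregation follows: the features of the last three blocks are gathered back to the points with F_V→P and concatenated with the point branch. The per-level point-to-voxel index lists are assumed to be available from the corresponding F_P→V steps.

import torch

def point_voxel_aggregate(point_feat, voxel_feats_per_level, point_to_voxel_per_level):
    """Gather multi-level sparse voxel features back to points (F_V->P) and concatenate
    them with the point-wise features, so shallow geometry and deep context end up in
    a single point feature P."""
    gathered = [feats[idx] for feats, idx in zip(voxel_feats_per_level, point_to_voxel_per_level)]
    return torch.cat([point_feat] + gathered, dim=1)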
Further, when the trained point cloud segmentation model is used for panoptic instance inference, the object centroids are inferred as follows:
At inference time, to obtain the centroid prediction C_P ∈ R^{K×3}, a 3D sparse max pooling layer (SMP) with kernel size Γ is first applied to the activated voxels in H, and the coordinates of the voxels whose values are unchanged before and after pooling, i.e., the local maxima of H, are retained. Since many noisy predictions remain, the present application sets a threshold T to filter out predictions of low or medium confidence; the K most confident centroids are then taken as the final centroid prediction.
Class-independent instance label assignment. Using the K predicted centroids C_P and the point-wise offsets O, each shifted data point is assigned to its nearest centroid prediction by:
I_L = argmin_k ‖(R_I + O) - C_P(k)‖,
where R_I = R_P·M_I denotes the coordinates of the predicted object points, and I_L ∈ {0, ..., K-1} denotes the predicted instance ID. Because K should be set larger than the maximum number of objects in a single scene, some predicted centroids are never assigned any point; these centroids are discarded during inference. Further, the instance ID of points of the amorphous-surface classes is set to 0.
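A minimal sketch of this class-independent assignment, assuming the object-point coordinates, their predicted offsets and the K predicted centroids are already available:

import torch

def assign_instances(object_xyz, offsets, centroids):
    """Shift each object point by its predicted offset and assign it to the
    nearest predicted centroid; returns instance IDs in {0, ..., K-1}."""
    shifted = object_xyz + offsets            # points moved toward their instance centroid
    dist = torch.cdist(shifted, centroids)    # (N_obj, K) pairwise distances
    return dist.argmin(dim=1)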
The final panoptic segmentation result is obtained by fusing the class-independent instance segmentation result with the point-wise semantic result. The present application adopts a parallelizable fusion strategy: for each centroid c ∈ C_P, its semantic label s_c is obtained by voting over the semantic predictions s_P of the set of points assigned to that centroid, and the class with the most votes is set as the semantic label of the centroid. The labels s_P of that set of points are then modified to s_c. This operation allows the semantic prediction and the instance prediction to improve each other.
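The majority-vote fusion could be sketched as follows; semantic predictions are assumed to be non-negative integer class indices and instance IDs to run from 0 to K-1.

import torch

def fuse_semantic_and_instance(sem_pred, instance_ids, num_instances):
    """Parallelizable fusion: for every predicted instance, vote over the semantic
    predictions of its points and overwrite them with the winning class."""
    fused = sem_pred.clone()
    for k in range(num_instances):
        mask = instance_ids == k
        if mask.any():
            votes = torch.bincount(sem_pred[mask])   # count semantic labels inside this instance
            fused[mask] = votes.argmax()             # majority class becomes the instance label
    return fused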
Referring to fig. 13, fig. 13 is a schematic flow chart of a point cloud data segmentation method according to a first embodiment of the present disclosure. The method comprises the following steps:
step 131: and acquiring point cloud data.
Step 132: and inputting the point cloud data into the point cloud segmentation model, and outputting the segmented point cloud data.
The point cloud segmentation model is obtained by training by using the method provided by the technical scheme.
Thus, the present embodiment can be applied to panoptic segmentation scenarios such as autonomous vehicles and robots.
Referring to fig. 14, fig. 14 is a schematic structural diagram of an embodiment of a device for processing point cloud data provided by the present application. The processing device 140 comprises a processor 141 and a memory 142 coupled to the processor 141, the memory 142 being configured to store computer programs, the processor 141 being configured to execute the computer programs to implement the following method:
acquiring a training point cloud, wherein data points in the training point cloud are divided according to voxels to obtain corresponding real voxels, and each real voxel is marked with real information; inputting the training point cloud into a point cloud segmentation model to obtain predicted voxels output by the point cloud segmentation model and detection information corresponding to the predicted voxels; determining at least one voxel pair, each voxel pair comprising a corresponding predicted voxel and real voxel; and adjusting network parameters of the point cloud segmentation model according to the difference between the real information and the detection information of the voxel pair;
or acquiring point cloud data; and inputting the point cloud data into the point cloud segmentation model, and outputting the segmented point cloud data.
The point cloud segmentation model is obtained by training by using the training method of the point cloud segmentation model in any embodiment.
It can be understood that the processor 141 is further configured to execute a computer program to implement the technical solution of any of the above embodiments, which is not described herein again.
Referring to fig. 15, fig. 15 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application. The computer-readable storage medium 150 is for storing a computer program 151, the computer program 151, when executed by a processor, being for implementing the method of:
acquiring a training point cloud, wherein data points in the training point cloud are divided according to voxels to obtain corresponding real voxels, and each real voxel is marked with real information; inputting the training point cloud into a point cloud segmentation model to obtain predicted voxels output by the point cloud segmentation model and detection information corresponding to the predicted voxels; determining at least one voxel pair, each voxel pair comprising a corresponding predicted voxel and real voxel; and adjusting network parameters of the point cloud segmentation model according to the difference between the real information and the detection information of the voxel pair;
or acquiring point cloud data; and inputting the point cloud data into the point cloud segmentation model, and outputting the segmented point cloud data.
The point cloud segmentation model is obtained by training by using the training method of the point cloud segmentation model in any embodiment.
It can be understood that, when being executed by the processor, the computer program 151 is further configured to implement the technical solution of any of the above embodiments, which is not described herein again.
In conclusion, the point cloud segmentation model provided by the present application addresses the fact that sparse point clouds are distributed only on object surfaces by directly regressing a 3D sparse-encoded voxel centroid heat map and point-wise offsets toward the centroids. Two types of sparse convolution are used together, so that accurate centroid regression is obtained with high computational efficiency. In addition, sparse supervision of voxel features at multiple levels allows better feature learning. To further enable the efficient hybrid sparse supervision to be used on hybrid feature representations, three sparse alignment operators are proposed, covering the inter-conversion between point-wise features and voxel features as well as the alignment of mismatched voxel features.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described device embodiments are merely illustrative, and for example, the division of the circuits or units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The above description is only for the purpose of illustrating embodiments of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made according to the content of the present specification and the accompanying drawings, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (14)

1. A method for training a point cloud segmentation model, the method comprising:
acquiring training point cloud; the data points in the training point cloud are classified according to voxels to obtain corresponding real voxels, and real information is marked on each real voxel;
inputting the training point cloud into a point cloud segmentation model to obtain a predicted voxel output by the point cloud segmentation model and detection information corresponding to the predicted voxel;
determining at least one voxel pair, each voxel pair comprising a corresponding one of the predicted voxels and the real voxel;
adjusting network parameters of the point cloud segmentation model according to a difference between the real information and the detected information of the voxel pair.
2. The method of claim 1, wherein determining at least one voxel pair comprises:
obtaining a first coordinate of each of the real voxels and a second coordinate of each of the predicted voxels;
determining a target first coordinate and a target second coordinate having the same coordinate;
and determining the real voxel of the target first coordinate and the predicted voxel corresponding to the target second coordinate as the voxel pair.
3. The method of claim 2, wherein said obtaining a first coordinate of each of said real voxels and a second coordinate of each of said predicted voxels comprises:
performing hash coding on the first coordinates of all the real voxels to obtain a corresponding first hash table;
performing hash coding on the second coordinates of all the predicted voxels to obtain a corresponding second hash table;
the determining a target first coordinate and a target second coordinate having the same coordinate includes:
and traversing the first hash table by using the second hash table of each predicted voxel, and determining a target first coordinate and a target second coordinate having the same coordinate.
4. The method of claim 1, wherein the point cloud segmentation model comprises: the system comprises a sparse voxel characteristic coding network, a three-dimensional sparse network, a point voxel network, a first supervision network, a second supervision network, a third supervision network and a fourth supervision network;
inputting the training point cloud into a point cloud segmentation model to obtain a prediction voxel output by the point cloud segmentation model and detection information corresponding to the prediction voxel, wherein the method comprises the following steps:
inputting the training point cloud into a sparse voxel characteristic coding network for characteristic extraction to obtain voxel characteristics;
inputting the voxel characteristics into a three-dimensional sparse network for sparse voxel characteristic extraction to obtain sparse voxel characteristics;
inputting the sparse voxel characteristics into a point voxel network for characteristic conversion to obtain data point characteristics;
inputting the sparse voxel characteristics to the first supervision network to perform semantic segmentation learning on the three-dimensional voxels, so as to obtain first detection information corresponding to the predicted voxels;
inputting the sparse voxel characteristics into the second supervision network to carry out three-dimensional heat map learning, so as to obtain second detection information corresponding to the predicted voxel;
inputting the data point characteristics to the third supervision network for semantic segmentation learning at a point level to obtain third detection information corresponding to each data point;
and inputting the data point characteristics to the fourth monitoring network for point-level offset monitoring learning to obtain the predicted offset corresponding to each data point.
5. The method of claim 4, wherein the real information comprises first real information, second real information, third real information, and fourth real information; the first real information characterizes semantic information of the real voxels; the second real information characterizes object centroid information of the real voxels; the third real information characterizes semantic information of the data point; the fourth real information characterizes offset information of the data point.
6. The method of claim 5, wherein the adjusting network parameters of the point cloud segmentation model according to differences between the real information and the detected information of the voxel pair comprises:
determining a difference between the first detection information and the first real information to obtain a first loss value;
determining a difference between the second detection information and the second real information to obtain a second loss value;
determining a difference between the third detection information and the third real information to obtain a third loss value;
determining a difference between the predicted offset and the fourth real information to obtain a fourth loss value;
and adjusting the network parameters of the point cloud segmentation model by using the first loss value, the second loss value, the third loss value and the fourth loss value.
7. The method of claim 6, wherein determining the difference between the first detected information and the first actual information to obtain a first loss value comprises:
determining a first sub-loss value between the first detection information and the first real information using a Lovasz-Softmax loss function;
determining a second sub-loss value between the first detection information and the first real information by using a cross entropy loss function;
and summing the first sub-loss value and the second sub-loss value to obtain the first loss value.
8. The method of claim 6, wherein before the determining the difference between the second detection information and the second real information to obtain the second loss value, the method comprises:
inputting the sparse voxel characteristics into the second supervision network to carry out three-dimensional heat map learning, and obtaining the probability that each sparse voxel characteristic belongs to an object centroid; taking the probability as the second detection information;
and determining the second real information by using the centroid of each object and the sparse voxel characteristics within a preset distance.
9. The method of claim 8, wherein determining the difference between the second detected information and the second actual information to obtain a second loss value comprises:
and determining the difference between the second detection information and the second real information by utilizing a Focal loss function to obtain the second loss value.
10. The method of claim 6, wherein determining the difference between the third detected information and the third real information to obtain a third loss value comprises:
determining a third sub-loss value between the third detection information and the third real information using a Lovasz-Softmax loss function;
determining a fourth sub-loss value between the third detection information and the third real information by using a cross entropy loss function;
and summing the third sub-loss value and the fourth sub-loss value to obtain the third loss value.
11. The method of claim 6, wherein determining the difference between the predicted offset and the fourth real information to obtain a fourth loss value comprises:
determining a centroid of a real object by using the fourth real information of the data point;
determining a real offset of the data point from the center of mass of the real object;
and obtaining the fourth loss value by using the predicted offset and the real offset.
12. A point cloud data segmentation method, the method comprising:
acquiring point cloud data;
inputting the point cloud data into a point cloud segmentation model, and outputting segmented point cloud data, wherein the point cloud segmentation model is obtained by training according to the method of any one of claims 1-11.
13. A device for processing point cloud data, comprising a processor and a memory coupled to the processor, the memory being configured to store a computer program, the processor being configured to execute the computer program to implement the method according to any one of claims 1 to 12.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program which, when being executed by a processor, is used to carry out the method according to any one of claims 1-12.
CN202210163274.7A 2022-02-22 2022-02-22 Training method of point cloud segmentation model, point cloud data segmentation method and related device Active CN114638954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210163274.7A CN114638954B (en) 2022-02-22 2022-02-22 Training method of point cloud segmentation model, point cloud data segmentation method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210163274.7A CN114638954B (en) 2022-02-22 2022-02-22 Training method of point cloud segmentation model, point cloud data segmentation method and related device

Publications (2)

Publication Number Publication Date
CN114638954A true CN114638954A (en) 2022-06-17
CN114638954B CN114638954B (en) 2024-04-19

Family

ID=81946515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210163274.7A Active CN114638954B (en) 2022-02-22 2022-02-22 Training method of point cloud segmentation model, point cloud data segmentation method and related device

Country Status (1)

Country Link
CN (1) CN114638954B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937265A (en) * 2022-07-25 2022-08-23 深圳市商汤科技有限公司 Point cloud detection method, model training method, device, equipment and storage medium
CN116091777A (en) * 2023-02-27 2023-05-09 阿里巴巴达摩院(杭州)科技有限公司 Point Yun Quanjing segmentation and model training method thereof and electronic equipment
CN116152267A (en) * 2023-04-24 2023-05-23 中国民用航空飞行学院 Point cloud instance segmentation method based on contrast language image pre-training technology
CN116721399A (en) * 2023-07-26 2023-09-08 之江实验室 Point cloud target detection method and device for quantitative perception training

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462324A (en) * 2020-05-18 2020-07-28 南京大学 Online spatiotemporal semantic fusion method and system
CN111626217A (en) * 2020-05-28 2020-09-04 宁波博登智能科技有限责任公司 Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion
CN112287939A (en) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 Three-dimensional point cloud semantic segmentation method, device, equipment and medium
CN112444784A (en) * 2019-08-29 2021-03-05 北京市商汤科技开发有限公司 Three-dimensional target detection and neural network training method, device and equipment
CN112927234A (en) * 2021-02-25 2021-06-08 中国工商银行股份有限公司 Point cloud semantic segmentation method and device, electronic equipment and readable storage medium
CN113034675A (en) * 2021-03-26 2021-06-25 鹏城实验室 Scene model construction method, intelligent terminal and computer readable storage medium
CN113705631A (en) * 2021-08-10 2021-11-26 重庆邮电大学 3D point cloud target detection method based on graph convolution
CN113989797A (en) * 2021-10-26 2022-01-28 清华大学苏州汽车研究院(相城) Three-dimensional dynamic target detection method and device based on voxel point cloud fusion

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112444784A (en) * 2019-08-29 2021-03-05 北京市商汤科技开发有限公司 Three-dimensional target detection and neural network training method, device and equipment
CN111462324A (en) * 2020-05-18 2020-07-28 南京大学 Online spatiotemporal semantic fusion method and system
CN111626217A (en) * 2020-05-28 2020-09-04 宁波博登智能科技有限责任公司 Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion
CN112287939A (en) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 Three-dimensional point cloud semantic segmentation method, device, equipment and medium
CN112927234A (en) * 2021-02-25 2021-06-08 中国工商银行股份有限公司 Point cloud semantic segmentation method and device, electronic equipment and readable storage medium
CN113034675A (en) * 2021-03-26 2021-06-25 鹏城实验室 Scene model construction method, intelligent terminal and computer readable storage medium
CN113705631A (en) * 2021-08-10 2021-11-26 重庆邮电大学 3D point cloud target detection method based on graph convolution
CN113989797A (en) * 2021-10-26 2022-01-28 清华大学苏州汽车研究院(相城) Three-dimensional dynamic target detection method and device based on voxel point cloud fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHUANGJIE XU et al.: "Sparse Cross-scale Attention Network for Efficient LiDAR Panoptic Segmentation", arXiv:2201.05972v1 [cs.CV], pages 1-9 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937265A (en) * 2022-07-25 2022-08-23 深圳市商汤科技有限公司 Point cloud detection method, model training method, device, equipment and storage medium
CN116091777A (en) * 2023-02-27 2023-05-09 阿里巴巴达摩院(杭州)科技有限公司 Point Yun Quanjing segmentation and model training method thereof and electronic equipment
CN116152267A (en) * 2023-04-24 2023-05-23 中国民用航空飞行学院 Point cloud instance segmentation method based on contrast language image pre-training technology
CN116721399A (en) * 2023-07-26 2023-09-08 之江实验室 Point cloud target detection method and device for quantitative perception training
CN116721399B (en) * 2023-07-26 2023-11-14 之江实验室 Point cloud target detection method and device for quantitative perception training

Also Published As

Publication number Publication date
CN114638954B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN114638954A (en) Point cloud segmentation model training method, point cloud data segmentation method and related device
Lin et al. Color-, depth-, and shape-based 3D fruit detection
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
CN112132893B (en) Visual SLAM method suitable for indoor dynamic environment
CN111190981B (en) Method and device for constructing three-dimensional semantic map, electronic equipment and storage medium
CN109614874B (en) Human behavior recognition method and system based on attention perception and tree skeleton point structure
Gong et al. Photogrammetry and deep learning
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
Li et al. Transmission line detection in aerial images: An instance segmentation approach based on multitask neural networks
CN113284144A (en) Tunnel detection method and device based on unmanned aerial vehicle
CN114972794A (en) Three-dimensional object recognition method based on multi-view Pooll transducer
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN112348033B (en) Collaborative saliency target detection method
Fan et al. PT-ResNet: Perspective transformation-based residual network for semantic road image segmentation
CN115937259A (en) Moving object detection method and device, flight equipment and storage medium
CN115471833A (en) Dynamic local self-attention convolution network point cloud analysis system and method
CN114638953B (en) Point cloud data segmentation method and device and computer readable storage medium
CN115115917A (en) 3D point cloud target detection method based on attention mechanism and image feature fusion
Jandhyala et al. Forest Fire Classification and Detection in Aerial Images using Inception-V3 and SSD Models
CN114648762A (en) Semantic segmentation method and device, electronic equipment and computer-readable storage medium
CN111401370A (en) Method, model and system for identifying junk images and assigning and managing tasks
CN115984583B (en) Data processing method, apparatus, computer device, storage medium, and program product
Fan et al. Faster 3D Reconstruction by Fusing 2D Object Detection and Self-Supervised Monocular Depth Estimation
KR20240048762A (en) Method and apparatus for 3d object recognition and pose estimation based on graph convolutional network
Wang et al. A graphical convolutional network-based method for 3d point cloud classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant