CN114882224B - Model structure, model training method, singulation method, device and medium - Google Patents

Model structure, model training method, singulation method, device and medium Download PDF

Info

Publication number
CN114882224B
CN114882224B (application CN202210629730.2A)
Authority
CN
China
Prior art keywords
feature
feature vector
point cloud
ground
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210629730.2A
Other languages
Chinese (zh)
Other versions
CN114882224A
Inventor
谭可成
刘昊
何维
刘承照
许强红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PowerChina Zhongnan Engineering Corp Ltd
Original Assignee
PowerChina Zhongnan Engineering Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PowerChina Zhongnan Engineering Corp Ltd filed Critical PowerChina Zhongnan Engineering Corp Ltd
Priority to CN202210629730.2A
Publication of CN114882224A
Application granted
Publication of CN114882224B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06F 18/2415: Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20132: Image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a model structure, a model training method, a singulation method, a device and a medium. The model training method comprises: obtaining original three-dimensional point cloud data of ground objects in a large scene; converting the original three-dimensional point cloud data into a standard sample format file; preprocessing the point cloud samples in the standard sample format file to generate a PKL-format sample file; constructing a large-scene ground object singulation model comprising an encoding module, a backbone network, a target generation module, a feature fusion module, a Point-RoIAlign module and an instance prediction network; and training the large-scene ground object singulation model with the point cloud samples in the PKL-format sample file to obtain a trained model. The invention predicts individual ground objects by minimizing a matching cost function and achieves the final ground object segmentation through point mask prediction, thereby effectively avoiding the shortcomings of traditional processing means such as clustering.

Description

Model structure, model training method, singulation method, device and medium
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a model structure, a training method for a large-scene ground object singulation model, a large-scene ground object singulation method, an electronic device and a computer-readable storage medium.
Background
Oblique-photography three-dimensional modeling has become an important means of large-area, large-scene three-dimensional reconstruction because of its high efficiency, strong realism and low production cost. Limited by its data structure, however, an oblique-photography three-dimensional model does not allow individual ground objects to be selected independently, which reduces the value and practicality of the model data. Taking land statistics as an example, the areas of houses, farmland, forest land and the like are mainly obtained either by large numbers of field investigators through on-site visits and measurements or by manual sketching on satellite images, both of which are extremely laborious. An oblique-photography singulation technique is therefore a bottleneck that needs to be broken through.
Ground object recognition based on remote sensing images suffers from several problems: only the eave-level roof area of a house can be extracted, cement roofs are difficult to distinguish from cement ground, and roofs may be occluded by trees. Moreover, a two-dimensional image contains only RGB colour information and cannot be linked to the three-dimensional model in use.
Compared with two-dimensional images, a three-dimensional point cloud carries richer spatial structure information and is more advantageous for capturing local detail features in the oblique-photography singulation process. With the application of deep learning to three-dimensional point clouds, singulation based on point cloud data has become a new way of approaching the problem.
The patent application with publication number CN113822914A, entitled "Oblique photogrammetry model singulation method, computer device, product, and medium", realizes large-scene ground object singulation of three-dimensional point clouds by clustering. In practice, however, it is very difficult to cluster a point cloud directly into multiple instance objects, for the following reasons:
(1) A point cloud typically contains a very large number of points, so clustering is extremely slow;
(2) The number of instances varies greatly between different 3D scenes, and a clustering algorithm cannot adapt its parameters automatically;
(3) The scale difference between instances is significant: ground objects of the same class may be very small or very large, and a clustering algorithm struggles to extract each instance in its entirety;
(4) Each point carries only very weak features, namely its 3D coordinates and colour, leaving a huge semantic gap between a point and the definition of an instance.
As a result, such a singulation method tends to over-segment or under-segment large-scene ground objects, and the technical route is too idealized to be applied in practice.
Disclosure of Invention
The invention aims to provide a model structure, a model training method, a singulation method, a device and a medium, so as to solve the difficulty of singulating and segmenting small target ground objects in large three-dimensional point cloud scenes, as well as the low efficiency and poor accuracy of clustering algorithms for ground object singulation in large scenes.
The invention solves the technical problems by the following technical scheme: a structure of a model, comprising:
the encoding module is used for encoding the large scene ground feature point cloud in the PKL format into an input vector;
the backbone network is used for extracting the characteristics of the input vector to obtain a first characteristic vector;
the target generation module is used for carrying out feature extraction on the first feature vector to obtain a global feature vector, and carrying out feature extraction on the global feature vector to obtain a second feature vector; calculating the second feature vector to obtain a third feature vector, and carrying out normalization processing on each element in the third feature vector to obtain the confidence score of each candidate frame; calculating the second feature vector to obtain a fifth feature vector, wherein each (1, 6) dimension of the fifth feature vector represents a maximum coordinate point and a minimum coordinate point of a candidate frame; splicing the maximum coordinate point, the minimum coordinate point and the corresponding confidence score of the candidate frame to obtain a parameter vector of the candidate frame;
the feature fusion module is used for extracting features of the first feature vector to obtain a sixth feature vector; splicing the sixth feature vector and the global feature vector, and extracting features to obtain an eighth feature vector;
the Point-RoIAlign module is used for carrying out coordinate mapping processing on the parameter vector and the eighth feature vector of the candidate frame to obtain a Point cloud set corresponding to each candidate frame;
and the example prediction network is used for outputting a prediction Point cloud set of a single ground feature according to the Point cloud set of each candidate frame output by the Point-RoIAlign module.
Further, the backbone network adopts a RandLA-Net structure.
Further, the target generation module comprises a first feature extraction layer, a second feature extraction layer, a prediction branch, a regression branch and a splicing layer;
the first feature extraction layer comprises 1 MLP layer, and the first feature extraction layer performs feature extraction on the first feature vector by using the 1 MLP layer to obtain a global feature vector;
the second feature extraction layer comprises 2 MLP layers, and the second feature extraction layer performs feature extraction on the global feature vector by using the 2 MLP layers to obtain a second feature vector;
the prediction branch comprises a first full-connection layer and a first activation layer, the second feature vector is calculated through the first full-connection layer to obtain a third feature vector, and each element in the third feature vector is normalized through the first activation layer to obtain the confidence score of each candidate frame;
the regression branch comprises a second full-connection layer, and the second feature vector is calculated through the second full-connection layer to obtain a fifth feature vector;
and the splicing layer is used for splicing the maximum coordinate point, the minimum coordinate point and the corresponding confidence score of the candidate frame to obtain the parameter vector of the candidate frame.
Further, the feature fusion module comprises a third feature extraction layer, a splicing layer and a fourth feature extraction layer;
the third feature extraction layer comprises 2 MLP layers, the third feature extraction layer performs feature extraction on the first feature vector by using 1 MLP layer to obtain a point feature vector, and then performs feature extraction on the point feature vector by using another 1 MLP layer to obtain a sixth feature vector;
the splicing layer is used for splicing the sixth feature vector and the global feature vector to obtain a seventh feature vector;
the fourth feature extraction layer comprises 2 MLP layers, and the fourth feature extraction layer performs depth feature extraction on the seventh feature vector by using the 2 MLP layers to obtain an eighth feature vector.
Further, the instance prediction network comprises a fifth feature extraction layer, a Mask prediction branch and an instance output layer;
the fifth feature extraction layer adopts a PointNet network structure, and performs feature extraction on the Point cloud set of the candidate frame output by the Point-RoIAlign module by using the PointNet network structure to obtain a ninth feature vector;
the Mask prediction branch comprises an MLPs layer and a second activation layer, and the ninth feature vector is calculated through the MLPs layer and the second activation layer to obtain a prediction Mask of the ground object;
the example output layer is configured to reject noise points in the ninth feature vector by using a prediction mask to obtain a tenth feature vector; and calculating the tenth feature vector through the MLPs layer and the third activation layer to obtain the confidence score of each ground feature, selecting the category with the highest confidence score as the prediction category of the ground feature, and outputting the prediction point cloud set of the ground features of different categories.
The invention also provides a large-scene ground feature monomer model training method, which comprises the following steps:
acquiring original three-dimensional point cloud data of a large-scene ground object;
manufacturing the original three-dimensional point cloud data into a standard sample format file;
preprocessing the point cloud sample in the standard sample format file to generate a PKL format sample file;
constructing a large-scene ground feature monomerization model, wherein the large-scene ground feature monomerization model comprises:
the encoding module is used for encoding the large scene ground feature point cloud in the PKL format into an input vector;
the backbone network is used for extracting the characteristics of the input vector to obtain a first characteristic vector;
the target generation module is used for carrying out feature extraction on the first feature vector to obtain a global feature vector, and carrying out feature extraction on the global feature vector to obtain a second feature vector; calculating the second feature vector to obtain a third feature vector, and carrying out normalization processing on each element in the third feature vector to obtain the confidence score of each candidate frame; calculating the second feature vector to obtain a fifth feature vector, wherein each (1, 6) dimension of the fifth feature vector represents a maximum coordinate point and a minimum coordinate point of a candidate frame; splicing the maximum coordinate point, the minimum coordinate point and the corresponding confidence score of the candidate frame to obtain a parameter vector of the candidate frame;
the feature fusion module is used for extracting features of the first feature vector to obtain a sixth feature vector; splicing the sixth feature vector and the global feature vector, and extracting features to obtain an eighth feature vector;
the Point-RoIAlign module is used for carrying out coordinate mapping processing on the parameter vector and the eighth feature vector of the candidate frame to obtain a Point cloud set corresponding to each candidate frame;
an example prediction network, configured to output a prediction Point cloud set of a single ground feature according to a Point cloud set of each candidate frame output by the Point-RoIAlign module;
and training the large-scene ground feature monomerization model by using the point cloud sample in the PKL format sample file to obtain a trained large-scene ground feature monomerization model.
Further, the specific implementation process of manufacturing the original three-dimensional point cloud data into the standard sample format file comprises the following steps:
importing the original three-dimensional point cloud data into CloudCompare software, and manually dividing each real ground object by utilizing a cutting function of the CloudCompare software;
labeling each real ground object with a classification label mask, merging all real ground objects carrying their classification label masks, and exporting a txt-format point cloud file;
and converting the txt-format point cloud file into the Semantic3D data set format to obtain the standard sample format file.
Further, the specific implementation process of preprocessing the point cloud sample of the standard sample format file is as follows:
performing grid sampling on the point cloud samples in the standard sample format file;
and carrying out normalization processing on the sampled sample data, and establishing a data index structure on the sample data subjected to normalization processing by using a Kd tree algorithm to generate a PKL format sample file.
Further, the specific implementation process for training the large-scene ground feature monomerization model comprises the following steps:
constructing an objective function and solving for an optimal matching index matrix, the objective function being expressed as:

$$A^{*}=\arg\min_{A}\sum_{i=1}^{H}\sum_{j=1}^{T}A_{ij}C_{ij},\qquad \text{s.t.}\;\sum_{i=1}^{H}A_{ij}=1\;(j=1,\dots,T),\;\;\sum_{j=1}^{T}A_{ij}\leq 1\;(i=1,\dots,H)$$

wherein A is the optimal allocation index matrix, H is the number of candidate frames, T is the number of bounding boxes of real ground objects, A_ij is the matching coefficient of the i-th candidate frame and the j-th bounding box (A_ij = 1 means the i-th candidate frame is associated with the j-th bounding box and A_ij = 0 means it is not), and C_ij is the association cost of assigning the i-th candidate frame to the j-th bounding box;
searching for the corresponding candidate frame of each bounding box according to the optimal matching index matrix, obtaining T candidate frames matched one-to-one with the T bounding boxes;
performing parameter optimization on the T candidate frames through a loss function so that the coordinate values of each candidate frame approach the coordinate values of the bounding box it matches, the loss function being:

$$l_{box}=\sum_{t=1}^{T}C_{tt}$$

wherein C_tt is the association cost of assigning the t-th matched candidate frame to the t-th bounding box;
optimizing the confidence scores of the T matched candidate frames with a confidence-score optimizing function so that they approach 1, and setting the confidence scores of the remaining H-T candidate frames to 0, wherein B_s^t denotes the confidence score assigned to the t-th matched candidate frame;
training the prediction mask according to the prediction mask calculated by the instance prediction network and the classification label mask to obtain a trained mask, the prediction-mask training loss function being:

$$L_{mask}=\frac{1}{N_{ins}}\sum_{i=1}^{N_{ins}}\operatorname{sign}(iou_{i}>0.5)\cdot\frac{1}{N_{i}}\sum_{j=1}^{N_{i}}\Big[-y_{j}\log\hat{y}_{j}-(1-y_{j})\log\big(1-\hat{y}_{j}\big)\Big]$$

wherein N_ins is the number of ground object instances, N_i is the number of points of the i-th ground object instance, iou_i is the intersection-over-union of the i-th ground object instance, L_mask is the mask loss value, y_j is the label of a point in the ground object instance (1 for a positive label, 0 for a negative label), and ŷ_j is the predicted probability that the point is a positive label of the ground object instance; sign() is an indicator function with sign(iou_i > 0.5) = 1 when iou_i > 0.5 and sign(iou_i > 0.5) = 0 when iou_i ≤ 0.5;
And removing noise points by using the trained mask, calculating the confidence score of the ground object, selecting the category with the highest confidence score as the prediction category of the ground object, and outputting the prediction point cloud set of the ground objects of different categories.
The invention also provides a large-scene ground feature monomerization method, which comprises the following steps:
acquiring original three-dimensional point cloud data of a ground object of a target scene;
converting and preprocessing the original three-dimensional point cloud data to generate a PKL format file;
classifying and predicting the point clouds in the PKL format file by using the large-scene ground feature monomerization model trained by the large-scene ground feature monomerization model training method to obtain a classification label of each point cloud;
and outputting a point cloud set of a single ground object according to the classification label of each point cloud, so as to realize the ground object individualization.
Further, the specific implementation process of converting and preprocessing the original three-dimensional point cloud data is as follows:
converting the original three-dimensional point cloud data into the Semantic3D data set format;
performing grid sampling and normalization on the three-dimensional point cloud data in Semantic3D data set format, building an index structure on the normalized data with a Kd-tree algorithm, and generating the PKL-format file.
The invention also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the large-scene ground feature monomalization model training method when running the computer program.
The present invention also provides a computer-readable storage medium, which is a non-volatile storage medium or a non-transitory storage medium, having stored thereon a computer program which, when executed by a processor, performs the steps of the large-scene ground feature monomerization model training method as described above.
Advantageous effects
Compared with the prior art, the invention has the advantages that:
according to the model structure, the model training method, the singulation method, the device and the medium, the single ground feature is predicted by minimizing the matching cost function, and the final ground feature segmentation is realized by the point mask prediction, so that defects of traditional processing means such as clustering are effectively eliminated, and compared with the traditional means, the method has higher precision and efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawing in the description below is only one embodiment of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a large scene ground feature monomer model training method in an embodiment of the invention;
FIG. 2 is a network structure diagram of a large scene ground feature monomer model in an embodiment of the invention;
FIG. 3 is a network configuration diagram of a target generation module in an embodiment of the invention;
FIG. 4 is a diagram of an example predictive network architecture in an embodiment of the invention;
FIG. 5 is original three-dimensional point cloud data for scene one in an embodiment of the invention;
FIG. 6 is original three-dimensional point cloud data for scene two in an embodiment of the invention;
FIG. 7 is a diagram showing the recognition result of scene one by the method according to the embodiment of the present invention;
FIG. 8 is a diagram showing the recognition result of scene two by the method according to the embodiment of the present invention;
FIG. 9 is an enlarged view of the recognition result of scene two in the embodiment of the present invention;
fig. 10 is an enlarged view of the connected object recognition result of scene two in the embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made more apparent and fully by reference to the accompanying drawings, in which it is shown, however, only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The technical scheme of the present application is described in detail below with specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
As shown in fig. 1, the method for training the large-scene ground feature monomerized model provided by the embodiment of the invention comprises the following steps:
S1, acquiring the original three-dimensional point cloud data of the large-scene ground objects.
S2, converting the original three-dimensional point cloud data obtained in step S1 into a standard sample format file.
The original three-dimensional point cloud data are in PLY format; each point carries six-dimensional information (x, y, z, r, g, b), where x, y, z are the three-dimensional coordinates of the point and r, g, b its RGB information. To train a model with the original three-dimensional point cloud data, a standard sample (namely a sample of singulated ground objects) is first produced from the original three-dimensional point cloud data, as follows:
S11, importing the original three-dimensional point cloud data into CloudCompare software and manually segmenting each real ground object with the cutting function of CloudCompare, i.e. drawing a bounding box with the cutting tool and cutting out each real ground object;
S12, labeling each real ground object with a classification label mask, merging all real ground objects carrying their classification label masks, and exporting a txt-format point cloud file;
S13, extracting the first 7 columns of the txt-format point cloud file, storing columns 1-6 in a txt-format point cloud data file and the 7th column in a txt-format label file, i.e. converting the txt-format point cloud file into the Semantic3D data set format to obtain the standard sample format file. Each row of the txt-format point cloud file represents one point and has N columns; the first 7 columns are x, y, z, r, g, b, label, where label is the class label, represented by the numbers 1 to N.
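Purely by way of illustration (and not as part of the claimed method), the column split of step S13 can be sketched as follows, assuming the merged txt export carries the point fields in its first seven columns; the function and file names are hypothetical.

```python
# Hedged sketch of step S13: split an exported txt point cloud (x, y, z, r, g, b, label, ...)
# into a Semantic3D-style pair of files. File names are illustrative only.
import numpy as np

def txt_to_semantic3d(src_txt, points_txt, labels_txt):
    data = np.loadtxt(src_txt)             # one point per row, first 7 columns used
    points = data[:, 0:6]                   # x, y, z, r, g, b
    labels = data[:, 6].astype(np.int64)    # class label 1..N
    np.savetxt(points_txt, points, fmt="%.6f")
    np.savetxt(labels_txt, labels, fmt="%d")

# txt_to_semantic3d("scene_merged.txt", "scene_points.txt", "scene_labels.txt")
```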
S3, preprocessing the point cloud samples in the standard sample format file of step S2 to generate a PKL-format sample file that meets the input data format required by the model, as follows:
S31, performing grid sampling on the point cloud samples in the standard sample format file; in this embodiment the sampling rate is set to 0.06;
S32, normalizing the sampled sample data, building a data index structure on the normalized sample data with a Kd-tree algorithm, and generating the PKL-format sample file.
In this embodiment, processing the normalized sample data with the Kd-tree algorithm and generating the PKL-format sample file follow the prior art; see Hu, Qingyong, et al. "RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
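The preprocessing of steps S31 and S32 can be illustrated with the rough sketch below; interpreting the 0.06 sampling rate as a voxel size, the centring-based normalisation and the use of scikit-learn's KDTree are assumptions made for illustration only.

```python
# Minimal preprocessing sketch: voxel-grid sampling, simple normalisation, KD-tree index, PKL dump.
import pickle
import numpy as np
from sklearn.neighbors import KDTree

def preprocess(points, colors, labels, grid=0.06, out_pkl="sample.pkl"):
    # keep one point per occupied voxel of size `grid` (assumed meaning of the 0.06 sampling rate)
    voxel = np.floor(points / grid).astype(np.int64)
    _, keep = np.unique(voxel, axis=0, return_index=True)
    pts, cols, labs = points[keep], colors[keep], labels[keep]

    pts = pts - pts.mean(axis=0)            # centre the coordinates (normalisation choice assumed)
    cols = cols / 255.0                     # scale RGB to [0, 1]

    tree = KDTree(pts)                      # spatial index for neighbour queries
    with open(out_pkl, "wb") as f:
        pickle.dump({"points": pts, "colors": cols, "labels": labs, "kdtree": tree}, f)

preprocess(np.random.rand(1000, 3),
           np.random.randint(0, 256, (1000, 3)),
           np.random.randint(1, 5, 1000))
```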
S4, constructing a large-scene ground feature monomerization model
As shown in fig. 2, the structure of the large-scene ground feature monomerization model includes an encoding module, a backbone network, a target generation module, a feature fusion module, a Point-RoIAlign module, and an instance prediction network.
The encoding module encodes the large-scene ground feature point cloud samples in PKL format into input vectors (N, d), where N is the number of points and d is the feature dimension of each point; in this embodiment d is 6, i.e. (x, y, z, r, g, b) for each point.
The backbone network adopts the RandLA-Net structure (see Hu, Qingyong, et al. "RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020) to extract point cloud features; its random sampling strategy and feature aggregation module make it well suited to feature extraction from large-scale point cloud data. The backbone network performs feature extraction on the input vector (N, 6) to obtain a first feature vector (N/256, 512).
As shown in fig. 3, the target generation module comprises 3 MLP layers, a prediction branch, a regression branch and a splicing layer. The target generation module first performs feature extraction on the first feature vector (N/256, 512) with 1 MLP layer to obtain a global feature vector (1, k), a one-dimensional vector of 1 x k in which k is the feature dimension determined by the structure of the MLP layer. It then performs feature extraction on the global feature vector (1, k) with 2 MLP layers to obtain a second feature vector (1, 256), which is fed into both the prediction branch and the regression branch: the prediction branch predicts the confidence score B_s^i of each candidate frame, i.e. the confidence score of a single predicted ground object, and the regression branch predicts the maximum and minimum coordinate points that determine the extent of the corresponding candidate frame. The prediction branch comprises a fully connected layer and an activation layer: the second feature vector (1, 256) is passed through the fully connected layer fc to obtain a third feature vector (1, H), where H is the number of candidate frames (i.e. the number of predicted ground objects), and each element of the third feature vector (1, H) is normalized to the interval [0, 1] by the sigmoid activation layer to give the confidence score B_s^i of each candidate frame. The regression branch comprises a fully connected layer: the second feature vector (1, 256) is passed through the fully connected layer fc to obtain a fifth feature vector (1, 6H), each (1, 6) slice of which represents the maximum and minimum coordinate points of one candidate frame. The splicing layer concatenates the maximum coordinate point, the minimum coordinate point and the corresponding confidence score of each candidate frame to obtain the candidate frame parameter vector (x_max^i, y_max^i, z_max^i, x_min^i, y_min^i, z_min^i, B_s^i), where (x_max^i, y_max^i, z_max^i) are the coordinates of the maximum coordinate point of the candidate frame and (x_min^i, y_min^i, z_min^i) the coordinates of its minimum coordinate point.
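For illustration only, the target generation module described above might be sketched in PyTorch roughly as follows; the global max pooling used to collapse the (N/256, 512) point features into the (1, k) global feature vector, the layer widths and the number of candidate boxes H are assumptions, not taken from the embodiment.

```python
# Hedged sketch of the target generation module (shapes follow the text; pooling is assumed).
import torch
import torch.nn as nn

class TargetGeneration(nn.Module):
    def __init__(self, in_dim=512, k=256, num_boxes=64):   # H = num_boxes (illustrative)
        super().__init__()
        self.mlp_global = nn.Sequential(nn.Linear(in_dim, k), nn.LeakyReLU())
        self.mlp_second = nn.Sequential(nn.Linear(k, 256), nn.LeakyReLU(),
                                        nn.Linear(256, 256), nn.LeakyReLU())
        self.score_fc = nn.Linear(256, num_boxes)           # prediction branch
        self.box_fc = nn.Linear(256, 6 * num_boxes)         # regression branch

    def forward(self, feats):                                # feats: (N/256, 512)
        global_feat = self.mlp_global(feats).max(dim=0, keepdim=True)[0]   # (1, k), pooling assumed
        second = self.mlp_second(global_feat)                              # (1, 256)
        scores = torch.sigmoid(self.score_fc(second))                      # (1, H) confidence scores
        boxes = self.box_fc(second).view(-1, 6)                            # (H, 6): max/min corners
        params = torch.cat([boxes, scores.view(-1, 1)], dim=1)             # (H, 7) candidate parameters
        return params, global_feat
```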
The feature fusion module comprises 5 MLP layers and a splicing layer, wherein the feature extraction is firstly carried out on a first feature vector (N/256, 512) by using 1 MLP layer to obtain a point feature vector (N/256, k), and then the feature extraction is carried out on the point feature vector (N/256, k) by using 1 MLP layer to obtain a sixth feature vector (N/256,256); and splicing the sixth feature vector (N/256,256) and the global feature vector (1, k) (k=256 in the embodiment) by using a splicing layer to obtain a seventh feature vector (N/256, 512), and performing deep feature extraction on the seventh feature vector (N/256, 512) through two MLPs to obtain an eighth feature vector (N/256,128).
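A corresponding hedged sketch of the feature fusion module, under the same assumptions about layer widths, is given below; the global feature vector is simply broadcast to every point before concatenation.

```python
# Sketch of the feature fusion module: point MLPs, concatenation with the global feature, reduction to 128 dims.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, in_dim=512, k=256):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(in_dim, k), nn.LeakyReLU(),
                                       nn.Linear(k, 256), nn.LeakyReLU())
        self.fuse_mlp = nn.Sequential(nn.Linear(256 + k, 256), nn.LeakyReLU(),
                                      nn.Linear(256, 128), nn.LeakyReLU())

    def forward(self, feats, global_feat):                   # feats: (M, 512), global_feat: (1, k)
        point_feat = self.point_mlp(feats)                   # sixth feature vector (M, 256)
        rep = global_feat.expand(feats.shape[0], -1)         # broadcast global feature to every point
        fused = torch.cat([point_feat, rep], dim=1)          # seventh feature vector (M, 256 + k)
        return self.fuse_mlp(fused)                          # eighth feature vector (M, 128)
```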
The Point-RoIAlign module performs coordinate mapping on the candidate frame parameter vectors and the eighth feature vector (N/256, 128) to obtain the point cloud set of each candidate frame, i.e. the point cloud set of each predicted ground object.
In this example, the coordinate mapping process is prior art, see Li Yi, "GSPN: generative Shape Proposal Network for 3D Instance Segmentation in Point Cloud",2019IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp.3942-3951.
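The idea behind the coordinate mapping can be illustrated with the following simplified sketch, which merely gathers the points (and their fused features) lying inside each candidate box; the fixed-size resampling used in the cited GSPN work is omitted here.

```python
# Illustrative crop of points inside each candidate box (max corner, min corner, score).
import numpy as np

def point_roi_crop(coords, feats, boxes):
    """coords: (M, 3); feats: (M, 128); boxes: (H, 7) = (max_xyz, min_xyz, score)."""
    crops = []
    for box in boxes:
        p_max, p_min = box[0:3], box[3:6]
        inside = np.all((coords <= p_max) & (coords >= p_min), axis=1)
        crops.append((coords[inside], feats[inside]))
    return crops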
As shown in fig. 4, the instance prediction network comprises a fifth feature extraction layer, a Mask prediction branch and an instance output layer. The fifth feature extraction layer adopts the PointNet network structure (see Qi C. R., Su H., Mo K., et al. "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation" [C]) and uses it to perform feature extraction on the point cloud set (N, 6) of each candidate frame output by the Point-RoIAlign module, obtaining a ninth feature vector (N, 256). The Mask prediction branch comprises an MLPs layer and a sigmoid activation layer: the ninth feature vector (N, 256) is passed through the MLPs layer and the second activation layer to obtain the prediction mask of the ground object. The instance output layer performs a Hadamard product between the prediction mask and the ninth feature vector (N, 256) to remove noise points and obtain a tenth feature vector (N1, 256); finally, the tenth feature vector (N1, 256) is passed through an MLPs layer and a sigmoid activation layer to obtain the confidence score of each ground object class, the class with the highest confidence score is selected as the predicted class of the ground object, and the prediction point cloud sets of ground objects of different classes are output.
Each MLP layer comprises a plurality of fully connected layers fc and an activation function LRelu, and each MLPs layer comprises a plurality of fully connected layers fc.
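For illustration, the instance prediction network might be sketched roughly as follows; the per-point feature extractor below is only a stand-in for the PointNet structure, and the pooling before classification and the 0.5 mask threshold are assumptions.

```python
# Hedged sketch of the instance prediction head: per-point features, mask branch, class branch.
import torch
import torch.nn as nn

class InstanceHead(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.point_net = nn.Sequential(nn.Linear(6, 128), nn.LeakyReLU(),
                                       nn.Linear(128, 256), nn.LeakyReLU())   # PointNet stand-in
        self.mask_branch = nn.Sequential(nn.Linear(256, 64), nn.LeakyReLU(),
                                         nn.Linear(64, 1))
        self.cls_branch = nn.Sequential(nn.Linear(256, 64), nn.LeakyReLU(),
                                        nn.Linear(64, num_classes))

    def forward(self, roi_points):                        # roi_points: (N, 6) for one candidate box
        feat = self.point_net(roi_points)                  # ninth feature vector (N, 256)
        mask = torch.sigmoid(self.mask_branch(feat))       # per-point foreground probability (N, 1)
        kept = feat * (mask > 0.5)                         # Hadamard-style suppression of noise points
        logits = self.cls_branch(kept.max(dim=0)[0])       # pool kept features, then classify
        scores = torch.sigmoid(logits)                     # per-class confidence scores
        return mask.squeeze(1), scores
```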
And S5, training the large-scene ground feature monomerization model by using the point cloud sample in the PKL format sample file to obtain a trained large-scene ground feature monomerization model.
Training the large-scene ground feature monomerization model involves optimizing and adjusting the candidate frames, optimizing the instance masks and optimizing the instance class confidence, so that each candidate frame predicted by the model is correspondingly associated with a bounding box of a real ground object; that is, the training process is converted into an optimal matching problem. Let A be a binary matching index matrix, A = {A_ij | i = 1, 2, 3, ..., H; j = 1, 2, 3, ..., T}, where H is the number of candidate frames and T is the number of bounding boxes of real ground objects; A_ij = 1 if and only if the i-th candidate frame is assigned to (i.e. associated or matched with) the j-th bounding box, and A_ij = 0 otherwise. Let C be an association cost matrix, C = {C_ij | i = 1, 2, 3, ..., H; j = 1, 2, 3, ..., T}, in which each element C_ij corresponds one-to-one with A_ij and is the association cost of assigning the i-th candidate frame to the j-th bounding box. The closer the i-th candidate frame is to the j-th bounding box, the smaller C_ij becomes; the association cost is computed from the distances between the maximum coordinate point p_max^i and minimum coordinate point p_min^i of the candidate frame and the maximum coordinate point g_max^j and minimum coordinate point g_min^j of the bounding box.
The optimal matching problem between candidate frames and bounding boxes is thus converted into finding the optimal matching index matrix A with the minimum total association cost, so the constructed objective function is:

$$A^{*}=\arg\min_{A}\sum_{i=1}^{H}\sum_{j=1}^{T}A_{ij}C_{ij},\qquad \text{s.t.}\;\sum_{i=1}^{H}A_{ij}=1\;(j=1,\dots,T),\;\;\sum_{j=1}^{T}A_{ij}\leq 1\;(i=1,\dots,H)$$

wherein the first constraint indicates that every bounding box must have exactly one candidate frame associated or matched with it, and the second indicates that some candidate frames may remain unassociated with any bounding box.
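The assignment between candidate boxes and real bounding boxes can be illustrated with the sketch below, which builds a cost matrix from corner distances (a Euclidean corner distance is assumed here, since the exact cost formula is not reproduced) and solves the assignment with the Hungarian algorithm; the box loss over the matched pairs, described next, is also shown.

```python
# Illustrative candidate-to-bounding-box matching via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_candidates(cand_boxes, gt_boxes):
    """cand_boxes: (H, 6), gt_boxes: (T, 6); each row = (max_xyz, min_xyz)."""
    cost = np.linalg.norm(cand_boxes[:, None, 0:3] - gt_boxes[None, :, 0:3], axis=2) \
         + np.linalg.norm(cand_boxes[:, None, 3:6] - gt_boxes[None, :, 3:6], axis=2)   # (H, T)
    cand_idx, gt_idx = linear_sum_assignment(cost)     # each bounding box gets exactly one candidate
    return cand_idx, gt_idx, cost

H, T = 8, 3
cand = np.random.rand(H, 6)
gt = np.random.rand(T, 6)
cand_idx, gt_idx, cost = match_candidates(cand, gt)
l_box = cost[cand_idx, gt_idx].sum()                   # box loss over the T matched pairs
```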
Corresponding candidate frames are then found for each bounding box according to the obtained optimal matching index matrix, giving T candidate frames matched one-to-one with the T bounding boxes. The number of selected candidate frames equals the number of bounding boxes, namely T, and parameter tuning is performed on these T candidate frames through the loss function of formula (4):

$$l_{box}=\sum_{t=1}^{T}C_{tt}$$

wherein C_tt denotes the association cost of assigning the t-th matched candidate frame to the t-th bounding box. By minimizing the loss value l_box, the coordinate values of each candidate frame approach those of the bounding box it matches.
During model training, the confidence scores B_s^i obtained from the prediction branch are assigned one-to-one to the H candidate frames predicted by the regression branch. The confidence scores of the T candidate frames associated with the T bounding boxes are optimized with formula (5) so that they approach 1, the confidence scores of the remaining H-T candidate frames are set to 0, and the T candidate frames with high confidence scores are retained as the subsequent input of the Point-RoIAlign module. In formula (5), B_s^t is the confidence score of the t-th matched candidate frame; by minimizing the loss value l_score, the confidence scores of the T candidate frames approach 1.
Intercepting the corresponding point cloud set with a candidate frame may incorrectly include parts belonging to other ground object instances, leading to inaccurate instance predictions. An instance prediction network is therefore proposed to further refine the instances: the point cloud set (N, 6) intercepted by a candidate frame is taken as the input of the instance prediction network, and semantic features are extracted with the PointNet++ network to obtain a ninth feature vector (N, 256); the ninth feature vector (N, 256) is then used as the input of the mask branch, and a prediction mask, which is a binary vector, is obtained through the MLPs layer and the sigmoid activation layer. The prediction mask is trained by computing the IoU value between the prediction mask and the classification label mask (i.e. the ratio of their intersection to their union) as a constraint: prediction masks whose IoU value is higher than 0.5 are used as training samples, with the portion overlapping the label mask assigned positive labels and the remaining portion assigned negative labels, while masks whose IoU is 0.5 or below are ignored and do not take part in the prediction-mask training process. The prediction-mask training loss function is as follows:

$$L_{mask}=\frac{1}{N_{ins}}\sum_{i=1}^{N_{ins}}\operatorname{sign}(iou_{i}>0.5)\cdot\frac{1}{N_{i}}\sum_{j=1}^{N_{i}}\Big[-y_{j}\log\hat{y}_{j}-(1-y_{j})\log\big(1-\hat{y}_{j}\big)\Big]$$

wherein N_ins is the number of ground object instances, N_i is the number of points of the i-th ground object instance, iou_i is the intersection-over-union of the i-th ground object instance, L_mask is the mask loss value, y_j is the label of a point in the ground object instance (1 for a positive label, 0 for a negative label), and ŷ_j is the predicted probability that the point is a positive label of the ground object instance; sign() is an indicator function with sign(iou_i > 0.5) = 1 when iou_i > 0.5 and sign(iou_i > 0.5) = 0 when iou_i ≤ 0.5.
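A minimal sketch of this IoU-gated mask loss, assuming a plain binary cross-entropy per point, is given below for illustration.

```python
# Illustrative IoU-gated mask loss: only instances whose predicted mask overlaps the label mask
# with IoU > 0.5 contribute a per-point binary cross-entropy term.
import torch
import torch.nn.functional as F

def mask_loss(pred_masks, label_masks):
    """pred_masks / label_masks: lists of per-instance tensors of shape (N_i,) in [0, 1] / {0, 1}."""
    terms = []
    for pred, label in zip(pred_masks, label_masks):
        inter = ((pred > 0.5) & (label > 0.5)).sum().float()
        union = ((pred > 0.5) | (label > 0.5)).sum().float().clamp(min=1.0)
        if inter / union > 0.5:                                   # sign(iou_i > 0.5)
            terms.append(F.binary_cross_entropy(pred, label.float()))
        else:
            terms.append(torch.zeros(()))                         # ignored instance
    return torch.stack(terms).mean()
```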
A Hadamard product between the prediction mask and the ninth feature vector (N, 256) removes noise points and yields a tenth feature vector (N1, 256), i.e. the instance feature vector; the tenth feature vector (N1, 256) is passed through the MLPs layer and the sigmoid activation layer to obtain the confidence score of the ground object instance. The IoU value between the prediction mask and the label mask is taken as a measure of the quality of the prediction mask, and the prediction mask is used to improve the accuracy of the instance confidence score.
Noise points are removed according to the prediction mask, and class prediction training is performed on the remaining features with the following formula:

$$L_{cls}=-\frac{1}{N_{ins}}\sum_{i=1}^{N_{ins}}iou_{i}\sum_{C=1}^{M}y_{iC}\log\big(p_{iC}\big)$$

wherein L_cls is the class confidence loss of the ground object instances, N_ins is the number of ground object instances, iou_i is the intersection-over-union of the i-th ground object instance, p_iC is the predicted probability that the i-th ground object instance belongs to class C, M is the number of classes, and y_iC is an indicator (0 or 1) that takes 1 if the true class of the i-th ground object instance equals C and 0 otherwise.
And sorting the prediction categories of the ground object examples according to the confidence scores, and selecting the category with the highest confidence score as the prediction category of the ground object examples, namely outputting the point cloud set of the target ground object examples.
The embodiment of the invention also provides a large-scene ground feature monomerization method, which comprises the following steps:
step 1: the original three-dimensional point cloud data of the ground object of the target scene is obtained, as shown in fig. 5 and 6, fig. 5 is the original three-dimensional point cloud data of the first scene, and fig. 6 is the original three-dimensional point cloud data of the second scene.
Step 2: converting and preprocessing original three-dimensional point cloud data to generate PKL format files, wherein the specific implementation process is as follows:
step 2.1: importing the original three-dimensional point cloud data into CloudCompare software, and then exporting txt format point cloud files;
step 2.2: extracting the first 6 columns of the txt-format point cloud file and storing them in a txt-format point cloud data file, i.e. converting the txt-format point cloud file into the Semantic3D data set format;
step 2.3: performing grid sampling and normalization on the three-dimensional point cloud data in Semantic3D data set format, building an index structure on the normalized data with the Kd-tree algorithm, and generating the PKL-format file.
Step 3: and (3) carrying out classification prediction on the point clouds in the PKL format file in the step (2) by using the large-scene ground feature monomerization model trained by the large-scene ground feature monomerization model training method to obtain classification labels of each point cloud.
Step 4: and outputting a point cloud set of a single ground object according to the classification label of each point cloud to realize ground object singulation, wherein as shown in fig. 7 and 8, fig. 7 is a recognition result of a first scene, and fig. 8 is a recognition result of a second scene. Fig. 9 is an enlarged view of the recognition result, and the black frame in fig. 9 represents the small target, which indicates that the invention can accurately recognize and monomer the small target, and solves the problem of poor recognition effect of the small target object in the prior algorithm. The black dotted line box in fig. 10 indicates that the method can effectively monomer the adhesion target, and solves the problem that the conventional clustering algorithm has poor segmentation effect on the connected objects.
The foregoing disclosure is merely illustrative of specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art will readily recognize that changes and modifications are possible within the scope of the present invention.

Claims (13)

1. A structure of a model, comprising:
the encoding module is used for encoding the large scene ground feature point cloud in the PKL format into an input vector;
the backbone network is used for extracting the characteristics of the input vector to obtain a first characteristic vector;
the target generation module is used for carrying out feature extraction on the first feature vector to obtain a global feature vector, and carrying out feature extraction on the global feature vector to obtain a second feature vector; calculating the second feature vector to obtain a third feature vector, and carrying out normalization processing on each element in the third feature vector to obtain the confidence score of each candidate frame; calculating the second feature vector to obtain a fifth feature vector, wherein each (1, 6) dimension of the fifth feature vector represents a maximum coordinate point and a minimum coordinate point of a candidate frame; splicing the maximum coordinate point, the minimum coordinate point and the corresponding confidence score of the candidate frame to obtain a parameter vector of the candidate frame;
the feature fusion module is used for extracting features of the first feature vector to obtain a sixth feature vector; splicing the sixth feature vector and the global feature vector, and extracting features to obtain an eighth feature vector;
the Point-RoIAlign module is used for carrying out coordinate mapping processing on the parameter vector and the eighth feature vector of the candidate frame to obtain a Point cloud set corresponding to each candidate frame;
and the example prediction network is used for outputting a prediction Point cloud set of a single ground feature according to the Point cloud set of each candidate frame output by the Point-RoIAlign module.
2. The structure of the model of claim 1, wherein: the backbone network adopts a RandLA-Net structure.
3. The structure of the model of claim 1, wherein: the target generation module comprises a first feature extraction layer, a second feature extraction layer, a prediction branch, a regression branch and a splicing layer;
the first feature extraction layer comprises 1 MLP layer, and the first feature extraction layer performs feature extraction on the first feature vector by using the 1 MLP layer to obtain a global feature vector;
the second feature extraction layer comprises 2 MLP layers, and the second feature extraction layer performs feature extraction on the global feature vector by using the 2 MLP layers to obtain a second feature vector;
the prediction branch comprises a first full-connection layer and a first activation layer, the second feature vector is calculated through the first full-connection layer to obtain a third feature vector, and each element in the third feature vector is normalized through the first activation layer to obtain the confidence score of each candidate frame;
the regression branch comprises a second full-connection layer, and the second feature vector is calculated through the second full-connection layer to obtain a fifth feature vector;
and the splicing layer is used for splicing the maximum coordinate point, the minimum coordinate point and the corresponding confidence score of the candidate frame to obtain the parameter vector of the candidate frame.
4. The structure of the model of claim 1, wherein: the feature fusion module comprises a third feature extraction layer, a splicing layer and a fourth feature extraction layer;
the third feature extraction layer comprises 2 MLP layers, the third feature extraction layer performs feature extraction on the first feature vector by using 1 MLP layer to obtain a point feature vector, and then performs feature extraction on the point feature vector by using another 1 MLP layer to obtain a sixth feature vector;
the splicing layer is used for splicing the sixth feature vector and the global feature vector to obtain a seventh feature vector;
the fourth feature extraction layer comprises 2 MLP layers, and the fourth feature extraction layer performs depth feature extraction on the seventh feature vector by using the 2 MLP layers to obtain an eighth feature vector.
5. The structure of a model according to any one of claims 1 to 4, characterized in that: the instance prediction network comprises a fifth feature extraction layer, mask prediction branches and an instance output layer;
the fifth feature extraction layer adopts a PointNet network structure, and performs feature extraction on the Point cloud set of the candidate frame output by the Point-RoIAlign module by using the PointNet network structure to obtain a ninth feature vector;
the Mask prediction branch comprises an MLPs layer and a second activation layer, and the ninth feature vector is calculated through the MLPs layer and the second activation layer to obtain a prediction Mask of the ground object;
the example output layer is configured to reject noise points in the ninth feature vector by using a prediction mask to obtain a tenth feature vector; and calculating the tenth feature vector through the MLPs layer and the third activation layer to obtain the confidence score of each ground feature, selecting the category with the highest confidence score as the prediction category of the ground feature, and outputting the prediction point cloud set of the ground features of different categories.
6. The large-scene ground feature monomer model training method is characterized by comprising the following steps of:
acquiring original three-dimensional point cloud data of a large-scene ground object;
manufacturing the original three-dimensional point cloud data into a standard sample format file;
preprocessing the point cloud sample in the standard sample format file to generate a PKL format sample file;
constructing a large-scene ground feature monomerization model, wherein the large-scene ground feature monomerization model comprises:
the encoding module is used for encoding the large scene ground feature point cloud in the PKL format into an input vector;
the backbone network is used for extracting the characteristics of the input vector to obtain a first characteristic vector;
the target generation module is used for carrying out feature extraction on the first feature vector to obtain a global feature vector, and carrying out feature extraction on the global feature vector to obtain a second feature vector; calculating the second feature vector to obtain a third feature vector, and carrying out normalization processing on each element in the third feature vector to obtain the confidence score of each candidate frame; calculating the second feature vector to obtain a fifth feature vector, wherein each (1, 6) dimension of the fifth feature vector represents a maximum coordinate point and a minimum coordinate point of a candidate frame; splicing the maximum coordinate point, the minimum coordinate point and the corresponding confidence score of the candidate frame to obtain a parameter vector of the candidate frame;
the feature fusion module is used for extracting features of the first feature vector to obtain a sixth feature vector; splicing the sixth feature vector and the global feature vector, and extracting features to obtain an eighth feature vector;
the Point-RoIAlign module is used for carrying out coordinate mapping processing on the parameter vector and the eighth feature vector of the candidate frame to obtain a Point cloud set corresponding to each candidate frame;
an example prediction network, configured to output a prediction Point cloud set of a single ground feature according to a Point cloud set of each candidate frame output by the Point-RoIAlign module;
and training the large-scene ground feature monomerization model by using the point cloud sample in the PKL format sample file to obtain a trained large-scene ground feature monomerization model.
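As an illustration of the candidate-frame parameter vector that the target generation module recited above is described as producing (box corner coordinates spliced with a normalized confidence score), the following sketch assumes sigmoid normalization and a [min corner, max corner] ordering, neither of which is specified by the claim:

```python
import torch

def build_candidate_parameters(box_coords, box_scores_raw):
    """Illustrative only.  box_coords: (H, 6) tensor holding [x_min, y_min, z_min,
    x_max, y_max, z_max] for each candidate frame (the fifth feature vector);
    box_scores_raw: (H,) unnormalized elements of the third feature vector.
    Returns the (H, 7) candidate-frame parameter vectors of claim 6."""
    confidence = torch.sigmoid(box_scores_raw)              # assumed normalization step
    return torch.cat([box_coords, confidence.unsqueeze(-1)], dim=-1)

# usage sketch with 20 hypothetical candidate frames
params = build_candidate_parameters(torch.rand(20, 6), torch.randn(20))
```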
7. The large-scene ground feature monomerization model training method according to claim 6, wherein the specific implementation process of making the original three-dimensional point cloud data into a standard sample format file is as follows:
importing the original three-dimensional point cloud data into CloudCompare software, and manually segmenting each real ground object by using the cutting function of the CloudCompare software;
labeling each real ground object with a classification label mask, merging all the real ground objects carrying classification label masks, and exporting a txt-format point cloud file;
and converting the txt-format point cloud file into the Semantic3D data set format to obtain the standard sample format file.
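The txt-to-Semantic3D conversion step could look roughly like the sketch below; the column order of the exported txt file (x, y, z, r, g, b, label), the zero intensity value and the output file naming are assumptions, since the claim only names the target data set format:

```python
import numpy as np

def txt_to_semantic3d(txt_path, out_prefix):
    """Illustrative sketch: split an exported 'x y z r g b label' txt file into
    a Semantic3D-style point file and a companion .labels file."""
    data = np.loadtxt(txt_path)                     # (N, 7): x, y, z, r, g, b, label
    labels = data[:, 6].astype(int)
    intensity = np.zeros((data.shape[0], 1))        # placeholder intensity column
    # Semantic3D point rows carry x, y, z, intensity, r, g, b.
    points = np.hstack([data[:, :3], intensity, data[:, 3:6]])
    np.savetxt(f"{out_prefix}.txt", points, fmt="%.3f %.3f %.3f %d %d %d %d")
    np.savetxt(f"{out_prefix}.labels", labels, fmt="%d")

# usage sketch (file names are hypothetical)
txt_to_semantic3d("merged_ground_objects.txt", "scene_sample")
```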
8. The large-scene ground feature monomerization model training method according to claim 6, wherein the specific implementation process of preprocessing the point cloud sample in the standard sample format file is as follows:
performing grid sampling on the point cloud samples in the standard sample format file;
and normalizing the sampled sample data, and establishing a data index structure on the normalized sample data by using a Kd-tree algorithm to generate a PKL format sample file.
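A minimal preprocessing sketch along the lines of claim 8 is given below; the voxel-style grid sampling, min-max coordinate normalization, the scipy Kd-tree and the use of pickle for the PKL file are assumptions about one reasonable realization, not the patented procedure itself:

```python
import pickle
import numpy as np
from scipy.spatial import cKDTree

def preprocess_to_pkl(points, labels, pkl_path, grid_size=0.1):
    """points: (N, 3) xyz array; labels: (N,) class labels per point."""
    # Grid sampling: keep one point per occupied voxel of edge length grid_size.
    voxel_idx = np.floor(points / grid_size).astype(np.int64)
    _, keep = np.unique(voxel_idx, axis=0, return_index=True)
    pts, lbl = points[keep], labels[keep]
    # Normalization: rescale the sampled coordinates into the unit cube.
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    pts_norm = (pts - mins) / np.maximum(maxs - mins, 1e-6)
    # Kd-tree index over the normalized points (cKDTree instances can be pickled).
    tree = cKDTree(pts_norm)
    with open(pkl_path, "wb") as f:
        pickle.dump({"points": pts_norm, "labels": lbl, "kdtree": tree}, f)

# usage sketch with random stand-in data
preprocess_to_pkl(np.random.rand(10000, 3) * 50.0,
                  np.random.randint(0, 5, size=10000), "sample.pkl")
```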
9. The large-scene ground feature monomerization model training method according to any one of claims 6 to 8, wherein the training of the large-scene ground feature monomerization model is specifically implemented as follows:
constructing an objective function, and solving an optimal matching index matrix, wherein the specific expression of the objective function is as follows:
min Σ_{i=1..H} Σ_{j=1..T} A_ij · C_ij, subject to Σ_{i=1..H} A_ij = 1 for each j and A_ij ∈ {0, 1},
wherein A is the optimal matching index matrix, H is the number of candidate frames, T is the number of boundary frames of the real ground objects, A_ij is the matching coefficient of the i-th candidate frame and the j-th boundary frame, A_ij = 1 indicating that the i-th candidate frame is associated with the j-th boundary frame and A_ij = 0 indicating that the i-th candidate frame is not associated with the j-th boundary frame, and C_ij is the associated cost of assigning the i-th candidate frame to the j-th boundary frame;
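Solving for the optimal matching index matrix is a linear assignment problem; a sketch using the Hungarian solver shipped with scipy (the claim does not name a particular solver) is:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_matching(cost):
    """cost: (H, T) array where cost[i, j] is C_ij, the cost of assigning
    candidate frame i to boundary frame j.  Returns the (H, T) 0/1 matrix A
    that minimizes sum(A * cost) with one candidate frame per boundary frame."""
    rows, cols = linear_sum_assignment(cost)     # Hungarian algorithm
    A = np.zeros(cost.shape, dtype=int)
    A[rows, cols] = 1
    return A

# usage sketch: 5 candidate frames, 3 boundary frames of real ground objects
A = optimal_matching(np.random.rand(5, 3))
```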
searching the corresponding candidate frame for each boundary frame according to the optimal matching index matrix to obtain the T candidate frames matched with the T boundary frames;
performing parameter optimization on the T matched candidate frames through a loss function, so that the coordinate value of each candidate frame approximates the coordinate value of the boundary frame matched with it, the loss function being expressed as:
(1/T) · Σ_{t=1..T} C_tt,
wherein C_tt is the associated cost of assigning the t-th candidate frame to the t-th boundary frame;
optimizing the confidence scores of the T matched candidate frames so that they approach 1, and setting the target confidence scores of the remaining H-T candidate frames to 0, the confidence score optimization function being expressed as:
-(1/H) · [ Σ_{t=1..T} log(p_t) + Σ_{l=T+1..H} log(1 - p_l) ],
wherein p_l is the confidence score assigned to the l-th candidate frame;
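Under this reading, the two optimization terms can be sketched as follows; the mean reduction and the reordering of matched candidates to the first T positions are assumptions consistent with, but not dictated by, the definitions above:

```python
import torch
import torch.nn.functional as F

def box_and_score_losses(matched_costs, scores, T):
    """matched_costs: (T,) tensor of C_tt values for the matched candidate frames;
    scores: (H,) predicted confidence of every candidate frame, with the matched
    candidates assumed to occupy indices 0..T-1 after reordering."""
    l_box = matched_costs.mean()          # pull matched box coordinates toward their boundary frames
    target = torch.zeros_like(scores)
    target[:T] = 1.0                      # matched scores toward 1, remaining H-T toward 0
    l_score = F.binary_cross_entropy(scores, target)
    return l_box, l_score
```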
training the prediction mask according to the prediction mask calculated by the instance prediction network and the classification label mask to obtain a trained mask, the prediction mask training loss function being expressed as:
L_mask = (1/N_ins) · Σ_{i=1..N_ins} sign(iou_i > 0.5) · ( -(1/N_i) · Σ_{j=1..N_i} [ y_j·log(p_j) + (1 - y_j)·log(1 - p_j) ] ),
wherein N_ins is the number of ground object instances, N_i is the number of points of the i-th ground object instance, iou_i is the intersection-over-union of the i-th ground object instance, L_mask is the mask loss value, y_j is the label of a point in the ground object instance (a positive label is 1 and a negative label is 0), p_j is the probability that the point is predicted as a positive label of the ground object instance, and sign() is the sign function, with sign(iou_i > 0.5) = 1 when iou_i > 0.5 and sign(iou_i > 0.5) = 0 when iou_i ≤ 0.5;
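A per-instance sketch consistent with these variable definitions (per-point binary cross-entropy gated by the sign(iou_i > 0.5) term) is shown below; the exact weighting in the patented formula may differ:

```python
import torch
import torch.nn.functional as F

def mask_training_loss(pred_probs, gt_masks, ious):
    """pred_probs: list of (N_i,) tensors of predicted positive-label probabilities;
    gt_masks: list of (N_i,) tensors of 0/1 point labels y_j;
    ious: (N_ins,) tensor of per-instance intersection-over-union values."""
    per_instance = []
    for probs, y, iou in zip(pred_probs, gt_masks, ious):
        bce = F.binary_cross_entropy(probs, y.float())   # per-point cross-entropy, averaged
        gate = (iou > 0.5).float()                       # sign(iou_i > 0.5)
        per_instance.append(gate * bce)
    return torch.stack(per_instance).mean()              # average over the N_ins instances
```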
and removing noise points by using the trained mask, calculating the confidence score of each ground object, selecting the category with the highest confidence score as the prediction category of the ground object, and outputting the prediction point cloud set of ground objects of different categories.
10. A large-scene ground object singulation method, characterized by comprising the following steps:
acquiring original three-dimensional point cloud data of a ground object of a target scene;
converting and preprocessing the original three-dimensional point cloud data to generate a PKL format file;
classifying and predicting point clouds in the PKL format file by using a large-scene ground feature monomerization model trained by the large-scene ground feature monomerization model training method according to any one of claims 6 to 9 to obtain a classification label of each point cloud;
and outputting a point cloud set of a single ground object according to the classification label of each point cloud, thereby realizing ground object singulation.
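End to end, the singulation method of claim 10 could be exercised roughly as in the sketch below; the `model.predict` interface and the PKL field names are placeholders for the trained model's actual inference call, which the claim does not specify:

```python
import pickle
from collections import defaultdict

def singulate(pkl_path, model):
    """Illustrative only: predict a classification label per point and group the
    points by label so that each ground object comes out as its own point set."""
    with open(pkl_path, "rb") as f:
        sample = pickle.load(f)
    # `model.predict` stands in for the trained monomerization model's inference
    # call; it is assumed to return one classification label per input point.
    labels = model.predict(sample["points"])
    objects = defaultdict(list)
    for point, label in zip(sample["points"], labels):
        objects[int(label)].append(point)
    return dict(objects)   # {label: list of points forming one ground object's cloud}
```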
11. The large-scene ground object singulation method according to claim 10, wherein the specific implementation process of converting and preprocessing the original three-dimensional point cloud data is as follows:
converting the original three-dimensional point cloud data into the Semantic3D data set format;
and performing grid sampling and normalization processing on the three-dimensional point cloud data in the Semantic3D data set format, establishing an index structure on the normalized data by using a Kd-tree algorithm, and generating the PKL format file.
12. An electronic device, characterized in that: the electronic device comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and when running the computer program, the processor executes the steps of the large-scene ground feature monomerization model training method according to any one of claims 6 to 9.
13. A computer-readable storage medium, the computer-readable storage medium being a non-volatile storage medium or a non-transitory storage medium having a computer program stored thereon, characterized in that: the computer program, when run by a processor, performs the steps of the large-scene ground feature monomerization model training method according to any one of claims 6 to 9.
CN202210629730.2A 2022-06-06 2022-06-06 Model structure, model training method, singulation method, device and medium Active CN114882224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210629730.2A CN114882224B (en) 2022-06-06 2022-06-06 Model structure, model training method, singulation method, device and medium


Publications (2)

Publication Number Publication Date
CN114882224A CN114882224A (en) 2022-08-09
CN114882224B (en) 2024-04-05

Family

ID=82679613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210629730.2A Active CN114882224B (en) 2022-06-06 2022-06-06 Model structure, model training method, singulation method, device and medium

Country Status (1)

Country Link
CN (1) CN114882224B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11004202B2 (en) * 2017-10-09 2021-05-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and methods for semantic segmentation of 3D point clouds

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110660062A (en) * 2019-08-31 2020-01-07 南京理工大学 Point cloud instance segmentation method and system based on PointNet
CN111968121A (en) * 2020-08-03 2020-11-20 电子科技大学 Three-dimensional point cloud scene segmentation method based on instance embedding and semantic fusion
WO2022088676A1 (en) * 2020-10-29 2022-05-05 平安科技(深圳)有限公司 Three-dimensional point cloud semantic segmentation method and apparatus, and device and medium
CN114120067A (en) * 2021-12-03 2022-03-01 杭州安恒信息技术股份有限公司 Object identification method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LiDAR point cloud ground object classification method based on multi-scale features and PointNet; Zhao Zhongyang; Cheng Yinglei; Shi Xiaosong; Qin Xianxiang; Li Xin; Laser & Optoelectronics Progress; 2018-10-07 (05); full text *
Three-dimensional object recognition and model segmentation method based on point cloud data; Niu Chengeng; Liu Yujie; Li Zongmin; Li Hua; Journal of Graphics; 2019-04-15 (02); full text *

Also Published As

Publication number Publication date
CN114882224A (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN111310861B (en) License plate recognition and positioning method based on deep neural network
CN111062282B (en) Substation pointer instrument identification method based on improved YOLOV3 model
CN109919108B (en) Remote sensing image rapid target detection method based on deep hash auxiliary network
CN108564097B (en) Multi-scale target detection method based on deep convolutional neural network
CN112418117B (en) Small target detection method based on unmanned aerial vehicle image
CN104599275B (en) The RGB-D scene understanding methods of imparametrization based on probability graph model
CN105701502B (en) Automatic image annotation method based on Monte Carlo data equalization
CN106845430A (en) Pedestrian detection and tracking based on acceleration region convolutional neural networks
CN113033520B (en) Tree nematode disease wood identification method and system based on deep learning
CN110033002A (en) Detection method of license plate based on multitask concatenated convolutional neural network
CN102324038B (en) Plant species identification method based on digital image
CN105956560A (en) Vehicle model identification method based on pooling multi-scale depth convolution characteristics
CN110222767B (en) Three-dimensional point cloud classification method based on nested neural network and grid map
CN103984953A (en) Cityscape image semantic segmentation method based on multi-feature fusion and Boosting decision forest
CN114821014B (en) Multi-mode and countermeasure learning-based multi-task target detection and identification method and device
CN110334584B (en) Gesture recognition method based on regional full convolution network
CN112016605A (en) Target detection method based on corner alignment and boundary matching of bounding box
CN112200846A (en) Forest stand factor extraction method fusing unmanned aerial vehicle image and ground radar point cloud
Lin et al. Building damage assessment from post-hurricane imageries using unsupervised domain adaptation with enhanced feature discrimination
CN112396655B (en) Point cloud data-based ship target 6D pose estimation method
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN104050460B (en) The pedestrian detection method of multiple features fusion
CN111652273A (en) Deep learning-based RGB-D image classification method
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN114140665A (en) Dense small target detection method based on improved YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant