CN116630622A - Urban area vegetation point cloud semantic segmentation method based on HPCT model - Google Patents

Urban area vegetation point cloud semantic segmentation method based on HPCT model

Info

Publication number
CN116630622A
Authority
CN
China
Prior art keywords
point cloud
model
layer
hpct
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310577928.5A
Other languages
Chinese (zh)
Inventor
黄方
强晓勇
何伟丙
陈胜亿
吕清哲
葛镔赋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202310577928.5A
Publication of CN116630622A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/40 - Extraction of image or video features
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the technical field of forestry remote sensing, and specifically relates to an urban area vegetation point cloud semantic segmentation method based on a deep learning model. The invention constructs an HPCT model using deep learning techniques: the model is built as a plurality of layers, each of which processes features of the three-dimensional point cloud at a different level, so as to capture semantic features at different spatial scales and the relative relationships of ground objects to vegetation; a self-attention mechanism weights each part of the input data according to its importance and captures long-range dependencies between different positions in the three-dimensional point cloud, markedly improving the expressive power and understanding capability of the HPCT model. At the same time, urban area point cloud data from three data sources are collected and semantically labeled to form a vegetation point cloud dataset, which is used for training and prediction of the model, provides a data basis for HPCT model training, and also alleviates the current scarcity of point cloud datasets.

Description

Urban area vegetation point cloud semantic segmentation method based on HPCT model
Technical Field
The invention belongs to the technical field of forestry remote sensing, and specifically relates to an urban area vegetation point cloud semantic segmentation method based on a deep learning model.
Background
The purpose of vegetation point cloud semantic segmentation is to extract the tree point cloud from the complex distribution of ground objects in urban areas; in forestry remote sensing, vegetation point cloud semantic segmentation is an important upstream task and a precondition for subsequent processing. With the rapid development of deep learning, deep neural networks have become the mainstream technique for image processing tasks owing to their powerful high-dimensional feature extraction capability. However, point cloud data are spatially scattered and unordered, so a model based on two-dimensional images cannot be migrated to point clouds directly, and the way the point cloud is organized must therefore be studied. Three point cloud organization methods are commonly used in point cloud processing tasks: multi-view projection (View projection), three-dimensional reconstruction (Voxel) and direct processing of the point cloud (Point-set).
Multi-view projection reduces the point cloud to a set of two-dimensional depth images, which can then be processed with a two-dimensional convolutional neural network. SnapNet, for example, reduces the number of points by preprocessing, computes features and generates a mesh; multi-view images of the mesh are generated through virtual cameras, and image semantic segmentation is finally used to complete the point cloud extraction and project it back into three-dimensional space. Applying multi-view projection to point cloud extraction has two obvious defects: first, it loses geometric structure, because a multi-view projection is only an approximation of the point cloud; second, for complex scenes it is difficult for the multi-view images to contain all points.
Three-dimensional reconstruction rebuilds the point cloud into a regular three-dimensional data structure, which is then processed with a three-dimensional fully convolutional neural network. SegCloud, for example, reconstructs the point cloud into a regular voxel matrix in the preprocessing stage and then completes the point cloud extraction task with a three-dimensional fully convolutional neural network, interpolation and a fully connected conditional random field; the redundant three-dimensional structure, however, causes a huge computation and memory burden. To address this problem, OctNet and O-CNN use the more structured octree to complete the reconstruction, greatly reducing the redundancy. VV-Net reconstructs the point cloud, based on an auto-encoder structure, into a structure carrying richer information than a voxel matrix. MinkowskiNet provides a four-dimensional convolutional network that extracts spatio-temporal features for processing three-dimensional point cloud video and can also be used in point cloud extraction tasks. Methods based on three-dimensional reconstruction have two serious defects: the loss of spatial information during reconstruction, and the redundancy of the reconstructed space.
Processing the point cloud directly makes full use of its semantic information and is a very active and promising research direction. PointNet is the most influential framework in this direction, and its design rests on three ideas. First, to handle the unordered nature of point clouds, the whole network is a symmetric structure based on a max-pooling layer and multi-layer perceptrons, so that the input order of the points does not affect the result. Second, for feature extraction, a mechanism combining global features with per-point features is set up for the point cloud extraction task. Third, to eliminate the influence of geometric transformations (such as translation and rotation) on the result, a small network, T-Net, is designed to perform an affine coordinate transformation, and a loss function pushes the affine transformation matrix to approximate an orthogonal matrix. PointNet, however, has a significant drawback: regional semantic features are not considered in the feature extraction. To solve this problem, PointNet++ builds on it with a hierarchical network structure to extract regional semantic features. The limitation of this type of framework is that it extracts features at a fixed spatial scale (sampling rate) and cannot effectively extract semantic information at different levels.
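The following is a minimal PyTorch sketch of the permutation-invariant design described above (illustrative only, not code from PointNet or from the invention): a shared MLP is applied to every point and followed by a symmetric max-pooling, so the input order of the points cannot affect the global feature; the layer sizes are arbitrary assumptions.

    import torch
    import torch.nn as nn

    class SharedMLPMaxPool(nn.Module):
        def __init__(self, in_channels=3, feat_channels=64):
            super().__init__()
            # The same weights are applied to every point (a "shared" MLP).
            self.mlp = nn.Sequential(
                nn.Linear(in_channels, feat_channels),
                nn.ReLU(),
                nn.Linear(feat_channels, feat_channels),
            )

        def forward(self, pts):                    # pts: (B, N, 3)
            per_point = self.mlp(pts)              # (B, N, C) per-point features
            global_feat, _ = per_point.max(dim=1)  # symmetric max over the point axis
            return global_feat                     # (B, C), independent of point order

    x = torch.rand(2, 1024, 3)
    net = SharedMLPMaxPool()
    perm = torch.randperm(1024)
    assert torch.allclose(net(x), net(x[:, perm]))  # permutation invariance holds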
Disclosure of Invention
Aiming at the above problems and deficiencies, the invention provides an urban area vegetation point cloud semantic segmentation method based on a deep learning model, to solve the poor accuracy that results when two-dimensional image models are migrated to point cloud organizations in existing vegetation point cloud semantic segmentation.
The specific technical scheme of the invention is as follows:
An urban area vegetation point cloud semantic segmentation method based on an HPCT model comprises the following steps:
step 1, for a target urban area, collecting three-source point cloud data of the same time phase: a vehicle-mounted laser radar point cloud, an unmanned aerial vehicle oblique-image reconstruction point cloud and an unmanned aerial vehicle laser radar point cloud;
step 2, labeling the three-source point cloud data acquired in step 1 with point-by-point semantic tags using point cloud processing software (such as CloudCompare), extracting the vegetation point cloud and labeling it 1 and labeling all other points 0, thereby obtaining a point cloud dataset for model training and testing;
step 3, because heterogeneous point clouds have different spatial scales, the invention provides a processing method for unifying the spatial scale of the heterogeneous data input; the training dataset from step 2 is processed by this method, the specific process being as follows:
step 3.1, converting all point cloud coordinates into a world coordinate system, in meters. The world coordinate system is a geocentric coordinate system: its origin is the centroid of the Earth, and its three axes are defined relative to the shape and orientation of the Earth. The world coordinate system is a Cartesian coordinate system, i.e. points in space are expressed in rectangular coordinates;
step 3.2, inputting the training point cloud data set obtained in the step 2, and simultaneously giving a region segmentation parameter bs, a sampling point number ns and a sampling parameter sr;
step 3.3, normalizing the coordinates of the point cloud dataset (coordinate range 0-1), and randomly selecting a point (x_c, y_c, z_c) as the sampling center, obtaining the set P of all points whose x and y coordinates lie within bs/2 of this point:
P = {(x, y, z) : |x - x_c| ≤ bs/2, |y - y_c| ≤ bs/2} (1)
step 3.4, randomly sampling ns points from the point set P as the input of the HPCT model; the sampling parameter sr is used to control the number of samples N_iteration per epoch in each iteration round:
N_iteration = N_training / ns × sr (2)
where N_training denotes the total number of points in the training set.
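A minimal sketch of this sampling procedure follows, assuming a straightforward NumPy implementation (the patent gives no code; whether bs is measured before or after the 0-1 normalization is not specified, and it is assumed here to share the units of the normalized coordinates).

    import numpy as np

    def sample_block(points, bs, ns):
        # points: (N, 3) array of world coordinates. Returns an (ns, 3) model input.
        mins, maxs = points.min(0), points.max(0)
        xyz = (points - mins) / (maxs - mins)                # normalize coordinates to 0-1
        center = xyz[np.random.randint(len(xyz))]            # random center (x_c, y_c, z_c)
        mask = (np.abs(xyz[:, 0] - center[0]) <= bs / 2) & \
               (np.abs(xyz[:, 1] - center[1]) <= bs / 2)     # x/y window of formula (1)
        block = xyz[mask]
        idx = np.random.choice(len(block), ns, replace=len(block) < ns)
        return block[idx]

    def samples_per_epoch(n_training, ns, sr):
        return int(n_training / ns * sr)                     # N_iteration of formula (2)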
Step 4, the invention provides an HPCT model suitable for vegetation point cloud semantic segmentation; constructing an HPCT model by combining deep learning technologies Hierarchical Structure and Transformer Block, setting model parameters, and inputting the training data set processed in the step 3 into the model for training;
the Hierarchical Structure is to construct a model into a plurality of layers, and each layer can process the characteristics of the three-dimensional point cloud at different levels so as to capture the semantic characteristics of different spatial scales and the relative relationship of the ground object relative to the vegetation; transformer Block by adopting a Self-Attention mechanism (Self-Attention), the method can carry out different weighting according to the importance of each part of input data, capture long-distance dependency relationship between different positions in the three-dimensional point cloud, thereby remarkably improving the expression capability and understanding capability of the model on the characteristics of large spatial scale, complex ground object relationship and the like of the three-dimensional point cloud in the urban area, and simultaneously, the transducer has the characteristic of unchanged sequence arrangement and is suitable for the three-dimensional point cloud with the characteristics of dispersion and disorder in space.
And step 5, inputting the test dataset into the model trained in step 4, automatically segmenting the vegetation point cloud of the input data, and checking the vegetation point cloud semantic segmentation result.
Further, the structure of the HPCT model in step 4 is specifically as follows (as shown in fig. 2):
The backbone consists of three cascaded feature extraction modules at different spatial scales; the Linear Embedding and Grid Merging layers in the downsampling link of each module are realized by combining Farthest Point Sampling (FPS), Ball Query and linear layers.
Let the input be F ∈ R^(N×C). The FPS selects a subset {x_i1, x_i2, …, x_im}, where the x_ij are the m points farthest apart in the linear space of all channels; compared with random sampling, FPS represents the original point set better for the same number of sampled points.
Thereafter, Ball Query finds, for each x_ij, all points within a specified radius r in the linear space, and K randomly sampled points among them are stacked along the channel dimension.
Finally, a linear layer applies a channel transformation to the downsampled output; through the downsampling transformation, the dimensionality of the point cloud goes from (N, C) to (m, C').
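A hedged sketch of this downsampling link follows: plain NumPy re-implementations of FPS and Ball Query as described above, for illustration only. The centers are assumed to be drawn from the point set itself, so every center has at least one neighbor within radius r (itself).

    import numpy as np

    def farthest_point_sampling(xyz, m):
        # xyz: (N, 3). Returns the indices of m mutually distant points.
        n = xyz.shape[0]
        chosen = [np.random.randint(n)]
        dist = np.full(n, np.inf)
        for _ in range(m - 1):
            # Distance from every point to its nearest already-chosen point.
            dist = np.minimum(dist, np.linalg.norm(xyz - xyz[chosen[-1]], axis=1))
            chosen.append(int(dist.argmax()))     # next: the point farthest from the chosen set
        return np.array(chosen)

    def ball_query(xyz, centers, r, k):
        # For each center, randomly keep k points within radius r (repeat if fewer).
        groups = []
        for c in centers:
            idx = np.where(np.linalg.norm(xyz - c, axis=1) <= r)[0]
            groups.append(np.random.choice(idx, k, replace=len(idx) < k))
        return np.stack(groups)                   # (m, k) indices, one row per center

    xyz = np.random.rand(2048, 3)
    centers = xyz[farthest_point_sampling(xyz, 512)]    # FPS, then Ball Query
    neighbors = ball_query(xyz, centers, r=0.1, k=32)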
Point Transformer Blocks at the same spatial scale are connected in cascade, which can be expressed by formula (3):
F_i = AT_i(F_{i-1}), i = 1, 2, …, M_j (3)
where AT_i denotes the i-th Attention layer and F_{i-1} the output of layer i-1; layer 0 is the output of the Linear Embedding or Grid Merging layer. Because the point cloud itself carries coordinate information, the Positional Embedding module of the Transformer is omitted in this design.
Furthermore, as shown in fig. 3, the Attention layer adopts Offset-Attention, which computes the semantic similarity between different point cloud features to achieve semantic modeling; predicting the residual rather than the features themselves also yields a better training effect. Let the Query, Key and Value be Q, K and V respectively; the Offset-Attention principle is given by formula (4):
(Q, K, V) = F_in · (W_q, W_k, W_v) (4)
where W_q, W_k and W_v are learnable linear transformations shared within the layer; d_e = C_j, d_a = d_e / R, and R is an adjustable hyperparameter. N_j and C_j are respectively the number of feature points and the feature dimensionality of each spatial scale layer. The output F_out of the Attention layer is computed as in formula (5):
A = Softmax(Q · K^T)
F_sa = A · V (5)
F_out = LBR(F_in - F_sa) + F_in
where A denotes the attention score and LBR denotes the combination of a linear layer, a BatchNorm layer and a ReLU layer.
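A minimal PyTorch sketch of formulas (3) to (5) follows; the channel sizes are assumptions, since the text does not fix them. Q, K and V come from shared linear maps, the attention output F_sa is subtracted from the input (the offset), passed through a Linear-BatchNorm-ReLU block realized as a 1×1 convolution, and added back as a residual; blocks at one scale are then cascaded as in formula (3).

    import torch
    import torch.nn as nn

    class OffsetAttention(nn.Module):
        def __init__(self, channels, r=4):
            super().__init__()
            d_a = channels // r                        # d_a = d_e / R of formula (4)
            self.w_q = nn.Linear(channels, d_a, bias=False)
            self.w_k = nn.Linear(channels, d_a, bias=False)
            self.w_v = nn.Linear(channels, channels, bias=False)
            self.lbr = nn.Sequential(                  # LBR: linear (1x1 conv) + BatchNorm + ReLU
                nn.Conv1d(channels, channels, 1),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            )

        def forward(self, f_in):                       # f_in: (B, N, C)
            q, k, v = self.w_q(f_in), self.w_k(f_in), self.w_v(f_in)
            attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)  # A = Softmax(Q.K^T)
            f_sa = attn @ v                                      # F_sa = A.V
            offset = (f_in - f_sa).transpose(1, 2)               # channels-first for Conv1d
            return self.lbr(offset).transpose(1, 2) + f_in       # F_out = LBR(F_in - F_sa) + F_in

    blocks = nn.ModuleList(OffsetAttention(64) for _ in range(2))  # M_j = 2 cascaded blocks
    f = torch.rand(4, 1024, 64)
    for at in blocks:                                  # F_i = AT_i(F_{i-1}), formula (3)
        f = at(f)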
The adjustable hyperparameters of the HPCT model are: the number of input points N_i per layer, the number of Point Transformer Blocks M_i per layer, the number of FPS neighborhood points K_i per layer (i = 1, 2, 3), the number of channels C input to the first layer, and the Offset-Attention adjustable parameter R.
The classification head is a combination of a max-pooling layer and a linear layer; the classes use one-hot coding and the loss function uses cross-entropy, as in formula (6):
L = -Σ_i Σ_j y_ij · log h_θ(x_i)_j (6)
where y_ij denotes the label and h_θ(x_i)_j the predicted value.
The segmentation head adopts skip connections: same-layer semantic information is passed to the decoding part to strengthen the segmentation effect, and a Point Interpolate layer is used in the upsampling link;
First, upsampling of the point cloud is completed through the distance-weighted features of the k nearest neighbors, as in formula (7):
f(x) = Σ_{i=1..k} w_i(x) · f_i / Σ_{i=1..k} w_i(x), with w_i(x) = 1 / d(x, x_i)^2 (7)
where d(x, x_i) is the Euclidean distance from point x to its i-th neighbor.
The output is then obtained by combining the same-layer decoder features along the feature dimension; the feature combination uses the same Offset-Attention as the backbone network, and the loss function uses cross-entropy. It is worth noting that HPCT is a general backbone network that can be used for different visual tasks (e.g. classification, detection, semantic segmentation).
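A hedged sketch of the Point Interpolate upsampling of formula (7) follows; k = 3 and the squared-distance weights follow the common PointNet++ choice and are assumptions, since the exact values are not fixed by the text.

    import numpy as np

    def interpolate_features(dense_xyz, coarse_xyz, coarse_feat, k=3, eps=1e-8):
        # Propagate coarse features to dense points by inverse-distance weighting.
        out = np.empty((len(dense_xyz), coarse_feat.shape[1]))
        for i, x in enumerate(dense_xyz):
            d = np.linalg.norm(coarse_xyz - x, axis=1)   # Euclidean d(x, x_i)
            nearest = np.argsort(d)[:k]                  # k nearest coarse points
            w = 1.0 / (d[nearest] ** 2 + eps)            # w_i(x) = 1 / d(x, x_i)^2
            out[i] = (w[:, None] * coarse_feat[nearest]).sum(0) / w.sum()
        return out                                       # (N_dense, C) upsampled features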
Further, the specific process of step 4 is as follows:
step 4.1, implementing the code of the HPCT model, with a GPU graphics card used for model training and testing;
step 4.2, setting model training parameters: initial learning rate, weight decay rate, batch size, number of input points per layer, number of Point Transformer Blocks per layer, number of FPS neighborhood points per layer, number of channels input to the first layer, and the Offset-Attention adjustable parameter; when the training dataset is input, the RGB information of the point cloud is ignored and only the xyz coordinates are input; the region segmentation parameter and the number of sampling points are set, and cross-entropy with label smoothing is used as the loss function.
Step 4.3, data augmentation greatly affects the accuracy of the semantic segmentation task; to improve the segmentation accuracy as much as possible, the invention applies three augmentation methods in sequence: the augmentation scheme used by the original PointNet++, namely random rotation, spatial scaling and shift; the scale transformation of Point-BERT, which resamples 1024 points at random from the original point cloud; and, following RandLA-Net and Point Transformer, loading the entire scene when training the semantic segmentation task.
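An assumed implementation sketch of these augmentations follows (rotation about the vertical axis, scaling, shift and Point-BERT style resampling); the parameter ranges are illustrative and not taken from the patent.

    import numpy as np

    def augment(xyz):
        theta = np.random.uniform(0, 2 * np.pi)           # random rotation about z
        rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                        [np.sin(theta),  np.cos(theta), 0.0],
                        [0.0,            0.0,           1.0]])
        xyz = xyz @ rot.T
        xyz = xyz * np.random.uniform(0.8, 1.2)           # random spatial scaling
        xyz = xyz + np.random.uniform(-0.1, 0.1, size=3)  # random shift
        idx = np.random.choice(len(xyz), 1024)            # Point-BERT style resampling
        return xyz[idx]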
Step 4.4, adopting a more effective optimization recipe during training also helps improve model performance; the improvements to the optimization mainly comprise using AdamW instead of Adam, using cosine learning-rate decay instead of step decay, and using label smoothing (Label Smoothing).
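A minimal PyTorch sketch of this recipe follows, with the learning rate, weight decay and epoch count taken from the embodiment described later; the label-smoothing factor is an assumption, as the text does not state it.

    import torch

    model = torch.nn.Linear(64, 2)            # stand-in for the HPCT network
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing factor assumed

    for epoch in range(50):
        # ... per-batch: forward, criterion(logits, labels), backward, optimizer.step() ...
        scheduler.step()                      # cosine learning-rate decay once per epoch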
And step 4.5, inputting the training dataset for training to obtain a trained model file, in which the parameters of the model are recorded.
Applying computer vision techniques to vegetation point cloud semantic segmentation involves three main difficulties: (1) compared with the CAD (Computer-Aided Design) model point clouds (such as the ModelNet40 and ShapeNet-Part datasets) and enclosed indoor point clouds (such as the S3DIS dataset) commonly processed in the field of computer vision, static point cloud data of urban areas have a larger spatial scale and more complex relative relationships between ground objects, so the model must have a stronger ability to understand semantic features at different levels over a large spatial scale; (2) urban point cloud data are acquired in the following three ways: vehicle-mounted laser radar point cloud, unmanned aerial vehicle oblique-image reconstruction point cloud and unmanned aerial vehicle laser radar point cloud; these heterogeneous point clouds come from different acquisition platforms (such as unmanned aerial vehicles and vehicles) and sensors (such as laser radars and cameras), and therefore have different spatial scales and distribution characteristics; (3) a scene-rich urban area vegetation semantic segmentation dataset is lacking.
The invention combines deep learning techniques such as the Hierarchical Structure and the Transformer Block, and provides an HPCT (Hierarchical Point Cloud Transformer) model suited to vegetation point cloud semantic segmentation. The Hierarchical Structure builds the model as a plurality of layers, each layer processing features of the three-dimensional point cloud at a different level, so as to capture semantic features at different spatial scales and the relative relationships of ground objects to vegetation. The Transformer Block adopts a self-attention mechanism (Self-Attention), which weights each part of the input data according to its importance and captures long-range dependencies between different positions in the three-dimensional point cloud, thereby markedly improving the model's ability to express and understand characteristics of urban area three-dimensional point clouds such as their large spatial scale and complex ground-object relationships; at the same time, the Transformer is invariant to the ordering of its input sequence, which suits three-dimensional point clouds, whose points are spatially scattered and unordered.
The spatial scale is crucial for the model's understanding of ground-object characteristics and their distribution relationships in urban area three-dimensional point clouds. To realize training and prediction of heterogeneous point clouds with a unified model, the invention provides a sampling scheme that unifies the spatial scale of the heterogeneous data input. Self-acquired vehicle-mounted laser radar point clouds, airborne laser radar point clouds and oblique-photogrammetry reconstruction point clouds are labeled to produce an urban area vegetation semantic segmentation dataset, providing a data basis for HPCT model training. Finally, the invention applies computer vision research results in the field of forestry remote sensing, broadens the application scenarios of vegetation point cloud semantic segmentation, and raises the accuracy to the level of manual point-by-point segmentation.
In summary, on the basis of existing deep learning techniques, the invention creatively provides the HPCT model for semantic segmentation of vegetation point clouds in urban areas, realizes accurate segmentation of urban area vegetation point clouds, improves the degree of automation, and is applicable to heterogeneous point cloud data: point cloud data acquired in different ways do not affect its training and prediction. The urban area vegetation point cloud semantic segmentation method based on the HPCT model has many uses; for example, it can assist the statistics and computation of vegetation information such as the three-dimensional green volume and the urban greening rate, and it plays an important role in the field of forestry remote sensing. At the same time, because datasets for semantic segmentation of vegetation point clouds in urban areas are currently lacking, the invention collects urban area point clouds and produces a vegetation point cloud dataset before training the HPCT model, providing a data basis for other related research.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic view of an HPCT model architecture;
FIG. 3 is an Offset-Attention schematic;
FIG. 4 is a schematic diagram of vehicle-mounted lidar point cloud data according to an embodiment;
fig. 5 is a schematic diagram of reconstruction point cloud data of an oblique image of an unmanned aerial vehicle according to an embodiment;
FIG. 6 is a schematic diagram of unmanned airborne laser radar point cloud data of an embodiment;
FIG. 7 is a schematic diagram of the visualized three-source point cloud semantic segmentation results of the HPCT model for regions 1 and 3 of the embodiment;
FIG. 8 is a schematic diagram of the visualized three-source point cloud semantic segmentation results of the HPCT model for region 2 of the embodiment.
Detailed Description
In order to intuitively express the advantages of the invention, an implementation case of urban area vegetation point cloud semantic segmentation based on the HPCT model is illustrated by combining actual data with figures of the experimental results. The implementation process is as follows:
and step 1, selecting three typical urban areas as a zone 1, a zone 2 and a zone 3, and collecting three-source point cloud data of the same time phase, wherein the three-source point cloud data are respectively vehicle-mounted laser radar point cloud, unmanned aerial vehicle inclined image reconstruction point cloud and unmanned aerial vehicle laser radar point cloud.
Step 1.1, acquiring the vehicle-mounted laser radar point cloud data with a 128-line iScan-S-Z laser radar. The laser radar is fixed on the roof of the acquisition vehicle, and the supporting equipment comprises a vehicle speed sensor, a point cloud box and a static differential base station set up on the ground. The acquisition vehicle is driven along the road to collect the raw point cloud data, GNSS (Global Navigation Satellite System) satellite positioning data, odometer data and IMU (Inertial Measurement Unit) inertial navigation data. The raw GNSS data are converted using the StaticToRinex64 software. The experimental parameters, including the geographic coordinates of the control points, the mounting data of the vehicle-mounted equipment and the POS (Position and Orientation System) sampling interval, are input into the Inertial Explorer software, and IE post-processing is carried out with the converted GNSS data, IMU data and odometer data to obtain the POS file of the driving path. The POS file and the raw point cloud data are input into the mmsconvert software, and the corresponding parameters are adjusted according to the experiment to obtain the final static point cloud of the whole scene, as shown in fig. 4.
Step 1.2, acquiring the unmanned aerial vehicle oblique-image reconstruction point cloud data: image data are first acquired with a DJI Matrice 300 RTK carrying a Zenmuse P1 image sensor, and the point cloud data are then obtained from the images by three-dimensional reconstruction. When the unmanned aerial vehicle collects the oblique images, the matched operation software DJ Pro is used to set the flight area and the relevant flight parameters; the key parameters include a speed of 3 m/s, a course overlap of 80%, a side overlap of 70% and a flight altitude of 60 m. The operation software automatically generates the flight plan and photographing plan from these parameters, and the operation can then be started with one key to complete the data acquisition. The resolution of the acquired images is 8192×5460, and each image carries information such as the camera intrinsic parameters and the GPS (Global Positioning System) coordinates of the exposure point. To improve the quality of the three-dimensional reconstruction, image control points are additionally laid out during data acquisition: their spatial positions are measured with an RTK (Real-Time Kinematic) base station, they are distributed as uniformly as possible over the test area, and calibrated points are selected. After the data acquisition is completed, the three-dimensional reconstruction is completed with the ContextCapture software to obtain the static point cloud. First, the unmanned aerial vehicle images, POS information and camera parameters are input into the software, the image control points are marked manually in the images, and the GPS information is added. A sparse point cloud is obtained through aerial triangulation, and a dense reconstruction task is submitted to obtain the final static point cloud. When submitting the task, note: select regular planar grid tiling in the spatial framework and adjust the tile size to reduce memory usage; in the processing settings, set hole filling to fill all holes except tile boundaries; in the newly created reconstruction project, set the desired product to a three-dimensional point cloud; and in the point cloud sampling format, set the sampling interval to 0.05 m. The result is shown in fig. 5.
Step 1.3, acquiring the unmanned aerial vehicle laser radar point cloud data with a DJI Matrice 300 RTK carrying a Zenmuse L1; the flight operation for data acquisition is similar to that of the oblique-image reconstruction in the previous section. After the raw data are obtained, the raw laser radar files collected by the Zenmuse L1 sensor are processed with the DJI Terra software to generate three-dimensional point cloud data in the las format. The basic flow for generating the three-dimensional point cloud comprises three steps: importing the raw data, setting the reconstruction parameters and reconstructing the point cloud. The dynamic point cloud file generated by the Zenmuse L1 sensor is imported first, the base station center point is set to the coordinates of the base station center during data acquisition, and the point cloud density is then set so that both time and precision are taken into account. Next, the precision inspection is used to import the previously acquired image control points and check the accuracy of the point cloud; an accuracy report is generated automatically after the inspection is completed. Finally, the WGS 84 coordinate system is selected as the output coordinate system, and the LAS format is selected as the output format. Point cloud reconstruction is started to obtain the static point cloud data, as shown in fig. 6.
Step 2, labeling the three-source point cloud data acquired in step 1 with point-by-point semantic tags using the CloudCompare software, extracting the vegetation point cloud and labeling it 1 and labeling all other points 0, thereby obtaining a dataset for model training and testing.
Step 2.1, opening the CloudCompare software and reading the acquired vehicle-mounted laser radar point cloud data.
Step 2.2, manually segmenting all tree point clouds in the vehicle-mounted laser radar point cloud data with the Segment function of the software, adding a new attribute set to 1 for the segmented point clouds and to 0 for all other points.
Step 2.3, merging the point clouds segmented and labeled in step 2.2 into a complete dataset, and repeating the above steps for the unmanned aerial vehicle oblique-image reconstruction point cloud data and the unmanned aerial vehicle laser radar point cloud data to complete the production of the point cloud dataset.
Step 3, processing the dataset from step 2 to unify the spatial scale of the heterogeneous data input; the specific process is given in step 3 of the disclosure.
Step 4, constructing the HPCT model, setting the model parameters, and inputting the dataset processed in step 3 into the model for training.
Step 4.1, implementing the code of the HPCT model based on PyTorch; a 40 GB NVIDIA A100 graphics card is used for model training and testing.
Step 4.2, setting the model training parameters: initial learning rate lr = 0.001, weight decay rate 10^-4, batch size 128; number of input points per layer N_0 = 2N_1 = 4N_2 = 1024; number of Point Transformer Blocks per layer M_1 = M_2 = M_3 = 2; equal numbers of FPS neighborhood points per layer, K_1 = K_2 = K_3; first-layer input channel number C = 64; Offset-Attention adjustable parameter R = 4. The RGB information of the point cloud is ignored and only the xyz coordinates are input; the region segmentation parameter bs is 10 m and the number of sampling points ns is 4098; the loss function uses cross-entropy with label smoothing, and 50 epochs are run.
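For reference, the embodiment's hyperparameters gathered into one assumed configuration dictionary; the per-layer FPS neighborhood count is not recoverable from the text, so the value below is a placeholder.

    config = dict(
        lr=1e-3, weight_decay=1e-4, batch_size=128, epochs=50,
        n_points=(1024, 512, 256),       # N_0 = 2*N_1 = 4*N_2 = 1024
        blocks_per_layer=(2, 2, 2),      # M_1 = M_2 = M_3 = 2
        fps_neighbors=(32, 32, 32),      # K_1 = K_2 = K_3; values assumed
        channels=64, r=4,                # first-layer channels C, Offset-Attention R
        bs=10.0, ns=4098,                # region size (m) and points per sample
    )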
Step 4.3, data augmentation affects the accuracy of the semantic segmentation task to a great extent; to improve the segmentation accuracy as much as possible, the three augmentation methods used in this embodiment comprise the augmentation scheme used by the original PointNet++, namely random rotation, spatial scaling and shift; the scale transformation of Point-BERT, which resamples 1024 points at random from the original point cloud; and, following RandLA-Net and Point Transformer, loading the entire scene when training the semantic segmentation task.
Step 4.4, a more effective optimization recipe is also adopted during training to help improve model performance; the improvements to the optimization mainly comprise using AdamW instead of Adam, using cosine learning-rate decay instead of step decay, and using label smoothing (Label Smoothing).
Step 4.5, inputting the training dataset for training to obtain a trained model file, in which the parameters of the model are recorded.
Step 5, inputting the test dataset into the model trained in step 4.5 and checking the vegetation point cloud semantic segmentation results.
Through the above steps, the task of urban area vegetation point cloud semantic segmentation is completed; PointNet++ and PCT are then trained with the same experimental configuration for a lateral comparison with HPCT. The test results are shown in Table 1, from which it can be seen that: (1) the average Precision of the HPCT model exceeds 96% on all three data sources, at 96.90%, 99.42% and 97.42% respectively, and the average IoU exceeds 95%, at 96.21%, 98.37% and 95.75% respectively, i.e. the segmentation accuracy is high; (2) the HPCT model exceeds PointNet++ and PCT in average Precision and average IoU on the three-source data, and exceeds PointNet++ and PCT in Precision and IoU in most individual regions.
TABLE 1. Three-source point cloud semantic segmentation results based on HPCT, PCT and PointNet++
Figs. 7 and 8 show the visualized results of the three-source point cloud semantic segmentation experiments of the HPCT model. From top to bottom are regions 1, 2 and 3; within each region, from top to bottom, are the vehicle-mounted laser radar, the unmanned aerial vehicle laser radar and the unmanned aerial vehicle oblique-photography reconstruction; from left to right are the original scene, the ground-truth labels, the PointNet++ prediction, the PCT prediction and the HPCT prediction. In the prediction results, gray objects and the black background represent correctly predicted tree point clouds and non-tree point clouds respectively, and the boxes mark positions where, compared with the PointNet++ and PCT predictions, the HPCT prediction is clearly more complete, i.e. fewer tree points are missed and fewer other ground-object points are identified as tree points.
As can be seen from the above embodiment, the accuracy of the vegetation point cloud semantic segmentation results of the HPCT model constructed by the invention reaches the level of manual point-by-point segmentation on the three-source data, with an average Precision above 96% and an average IoU above 95%. In the field of forestry remote sensing, the method can play an important role in computing the three-dimensional green volume, counting the urban greening rate, and similar tasks.

Claims (4)

1. An urban area vegetation point cloud semantic segmentation method based on an HPCT model, characterized by comprising the following steps:
step 1, for a target urban area, collecting three-source point cloud data of the same time phase: a vehicle-mounted laser radar point cloud, an unmanned aerial vehicle oblique-image reconstruction point cloud and an unmanned aerial vehicle laser radar point cloud;
step 2, labeling the three-source point cloud data acquired in step 1 with point-by-point semantic tags using point cloud processing software, extracting the vegetation point cloud and labeling it 1 and labeling all other points 0, thereby obtaining a point cloud dataset for model training and testing;
step 3, carrying out spatial-scale unification of the heterogeneous data input on the training dataset from step 2;
step 3.1, converting all point cloud coordinates into a world coordinate system, in meters, using rectangular coordinates to represent points in space;
step 3.2, inputting the training point cloud data set obtained in the step 2, and simultaneously giving a region segmentation parameter bs, a sampling point number ns and a sampling parameter sr;
step 3.3, normalizing the coordinates of the point cloud dataset to the range 0-1, and randomly selecting a point (x_c, y_c, z_c) as the sampling center, obtaining the set P of all points whose x and y coordinates lie within bs/2 of this point:
P = {(x, y, z) : |x - x_c| ≤ bs/2, |y - y_c| ≤ bs/2} (1)
step 3.4, randomly sampling ns points from the point set P as the input of the HPCT model, the sampling parameter sr being used to control the number of samples N_iteration per epoch in each iteration round:
N_iteration = N_training / ns × sr (2)
where N_training denotes the total number of points in the training set;
step 4, constructing an HPCT model by combining the deep learning techniques Hierarchical Structure and Transformer Block, setting the model parameters, and inputting the training dataset processed in step 3 into the HPCT model for training;
the Hierarchical Structure builds the model as a plurality of layers, each layer processing features of the three-dimensional point cloud at a different level so as to capture semantic features at different spatial scales and the relative relationships of ground objects to vegetation;
the Transformer Block adopts the self-attention mechanism Self-Attention, weighting each part of the input data according to its importance and capturing long-range dependencies between different positions in the three-dimensional point cloud;
and step 5, inputting the test dataset into the model trained in step 4, automatically segmenting the vegetation point cloud of the input data, and checking the vegetation point cloud semantic segmentation result.
2. The urban area vegetation point cloud semantic segmentation method based on the HPCT model as claimed in claim 1, wherein:
the structure of the HPCT model in the step 4 is specifically as follows:
the model consists of three cascaded feature extraction modules at different spatial scales, the Linear Embedding and Grid Merging layers in the downsampling link of each module being realized by combining FPS, Ball Query and linear layers;
let the input be F ∈ R^(N×C); the FPS selects a subset {x_i1, x_i2, …, x_im}, wherein the x_ij are the m points farthest apart in the linear space of all channels;
thereafter, ball Query first calculates to find x ij All points within the specified radius r of the linear space are obtained by channel superposition of K points sampled randomly
finally, a linear layer applies a channel transformation to the downsampled output; through the downsampling transformation, the dimensionality of the point cloud goes from (N, C) to (m, C');
Point Transformer Blocks at the same spatial scale are connected in cascade, expressed by formula (3):
F_i = AT_i(F_{i-1}), i = 1, 2, …, M_j (3)
where AT_i denotes the i-th Attention layer and F_{i-1} the output of layer i-1, layer 0 being the output of the Linear Embedding or Grid Merging layer.
3. The urban area vegetation point cloud semantic segmentation method based on the HPCT model as claimed in claim 2, wherein:
the Attention layer adopts Offset-Attention to calculate semantic similarity among different point cloud features to realize semantic modeling;
let Query, key and Value be Q, K and V, respectively, the Offset-attribute principle is as in formula (4):
(6,K,V)=F in ·(W q ,W ,W v )(4)
wherein ,a learnable linear transformation shared for the layer; c (C) e =C j ,C a =C e R, R is an adjustable super parameter; n (N) j and Cj The number of feature points and the number of dimensions of each spatial dimension layer are respectively input F to the Attention layer out The calculation is shown as formula (5):
A=Softmax(6·K T )
F sa =A·V
F out =LBR(F on -F sa )+F in (5)
a represents the Attention Score, LBR represents the combination of the linear layer, the BathNorm layer and the ReLU layer;
the adjustable hyperparameters of the HPCT model are as follows: the number of input points N_i per layer, the number of Point Transformer Blocks M_i per layer, the number of FPS neighborhood points K_i per layer, i = 1, 2, 3; the number of channels C input to the first layer; and the Offset-Attention adjustable parameter R;
the classification head is a combination of a max-pooling layer and a linear layer, the classes use one-hot coding, and the loss function uses cross-entropy, as in formula (6):
L = -Σ_i Σ_j y_ij · log h_θ(x_i)_j (6)
where y_ij denotes the label and h_θ(x_i)_j the predicted value;
the segmentation head adopts skip connections, same-layer semantic information being passed to the decoding part to strengthen the segmentation effect, and a Point Interpolate layer is used in the upsampling link;
first, upsampling of the point cloud is completed through the distance-weighted features of the k nearest neighbors, as in formula (7):
f(x) = Σ_{i=1..k} w_i(x) · f_i / Σ_{i=1..k} w_i(x), with w_i(x) = 1 / d(x, x_i)^2 (7)
where d(x, x_i) is the Euclidean distance from point x to its i-th neighbor;
then combining the same-layer decoder features along the feature dimension to obtain the output, the feature combination using the same Offset-Attention as the backbone network; the loss function uses cross-entropy.
4. The urban area vegetation point cloud semantic segmentation method based on the HPCT model as set forth in claim 3, wherein the specific process of step 4 is as follows:
step 4.1, implementing the code of the HPCT model, a GPU graphics card being used for model training and testing;
step 4.2, setting model training parameters: initial learning rate, weight decay rate, batch size, number of input points per layer, number of Point Transformer Blocks per layer, number of FPS neighborhood points per layer, number of channels input to the first layer, and Offset-Attention adjustable parameters; when the training dataset is input, the RGB information of the point cloud is ignored and only the xyz coordinates are input; the region segmentation parameters and sampling points are set, and the loss function uses cross-entropy with label smoothing;
step 4.3, three data augmentation methods are used in sequence: the augmentation scheme used by the original PointNet++, namely random rotation, spatial scaling and shift; the scale transformation of Point-BERT, which resamples 1024 points at random from the original point cloud; and, following RandLA-Net and Point Transformer, loading the entire scene when training the semantic segmentation task;
step 4.4, optimizing during training, the optimization method comprising using AdamW instead of Adam, using cosine learning-rate decay instead of step decay, and using label smoothing (Label Smoothing);
and 4.5, inputting the training data set for training to obtain a trained model file, wherein the model file records various parameters of the model.
CN202310577928.5A 2023-05-22 2023-05-22 Urban area vegetation point cloud semantic segmentation method based on HPCT model Pending CN116630622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310577928.5A CN116630622A (en) 2023-05-22 2023-05-22 Urban area vegetation point cloud semantic segmentation method based on HPCT model


Publications (1)

Publication Number Publication Date
CN116630622A (en) 2023-08-22

Family

ID=87620745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310577928.5A Pending CN116630622A (en) 2023-05-22 2023-05-22 Urban area vegetation point cloud semantic segmentation method based on HPCT model

Country Status (1)

Country Link
CN (1) CN116630622A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination