CN114792372B - Three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention - Google Patents

Three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention

Info

Publication number
CN114792372B
CN114792372B (application CN202210709918.8A)
Authority
CN
China
Prior art keywords
point cloud
point
plant
dimensional
semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210709918.8A
Other languages
Chinese (zh)
Other versions
CN114792372A (en)
Inventor
潘丹
罗琳
曾安
廖清青
杨宝瑶
张逸群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202210709918.8A priority Critical patent/CN114792372B/en
Publication of CN114792372A publication Critical patent/CN114792372A/en
Application granted granted Critical
Publication of CN114792372B publication Critical patent/CN114792372B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/005Tree description, e.g. octree, quadtree

Abstract

The invention provides a three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention, comprising the following steps: acquiring 2D sequence images of a plant and performing three-dimensional reconstruction to obtain a 3D point cloud of the plant; preprocessing and manually labeling the plant 3D point cloud data; and, in consideration of the complexity of plant structural morphology, constructing a multi-head two-stage attention semantic segmentation network based on an attention mechanism, which acquires the geometric features of the point cloud hierarchically, predicts the semantic label of each point directly from the fully labeled point cloud data, and finally obtains the segmentation result of the plant organs. By constructing this semantic segmentation network, the method and system provide the agricultural field with a data-driven, end-to-end deep learning model that directly processes unordered 3D point clouds and can automatically and efficiently perform organ-level segmentation of plant three-dimensional point clouds.

Description

Three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention
Technical Field
The invention relates to the technical field of three-dimensional point cloud segmentation, in particular to a multi-head two-stage attention-based three-dimensional point cloud semantic segmentation method and system.
Background
Current segmentation tasks for plants include methods based on 2D plant images and methods based on 3D plant point cloud data. The 2D-image-based methods are mainly classified into segmentation based on color indices, segmentation based on thresholds, and segmentation based on learning (including supervised and unsupervised machine learning methods). Compared with computer vision and machine learning methods that rely on hand-crafted features, deep learning models have been applied in many agricultural phenotyping studies owing to their strong feature extraction and autonomous learning capabilities; classical models such as SegNet, U-Net and Mask R-CNN have been used for semantic segmentation of plants and outperform the feature-based methods. To overcome the dimensional constraints of 2D images and their inability to handle overlap and occlusion between leaves, three-dimensional reconstruction of plant point clouds is used to acquire complete, occlusion-free three-dimensional geometric information.
In previous three-dimensional plant segmentation research, segmentation of plant 3D point clouds has mostly relied on the following approaches: segmentation methods based on local surface features, such as local covariance matrices, tensors and surface curvatures; semantic segmentation of plant 3D point clouds with supervised learning methods, such as support vector machines and random forests; and deep-learning-based plant 3D point cloud segmentation, including multi-view-based methods, in which several two-dimensional projections are generated from the three-dimensional point cloud, a 2D-image-based deep learning segmentation method is applied, and the two-dimensional segmentation results from different angles are then combined into a final three-dimensional segmentation result; voxel-based methods, such as the VCNN designed for the classification and segmentation of corn stems and leaves; and point-based methods that directly process unordered three-dimensional point clouds, such as PointNet and PointNet++. Although point-based 3D deep learning is developing rapidly, there has been little research on plant segmentation.
The prior art discloses a PointNet-based point cloud instance segmentation method and system, in which a point cloud data preprocessing module performs partitioning, sampling, translation and normalization; a PointNet neural network training module extracts a point cloud feature matrix through a PointNet neural network; a matrix calculation module trains a similarity network, a confidence network and a semantic segmentation network, extracting a similarity matrix, a confidence matrix and a semantic segmentation matrix of the point cloud features through the three network branches; and, after determining the valid segmentation instance groups, a clustering and merging module performs denoising and de-duplication to complete segmentation of the instance objects. Although this scheme achieves instance segmentation of point cloud data, it is not suited to the field of three-dimensional plant segmentation, and its application to three-dimensional plant phenotype analysis has obvious shortcomings.
Disclosure of Invention
In order to solve at least one technical defect, the invention provides a three-dimensional point cloud semantic segmentation method and a three-dimensional point cloud semantic segmentation system based on multi-head two-stage attention, which can automatically and efficiently realize high-precision segmentation of plant three-dimensional point cloud organ levels.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a three-dimensional point cloud semantic segmentation method based on multi-head two-stage attention comprises the following steps:
s1: constructing an image acquisition platform, and acquiring a high-precision and multi-angle plant 2D sequence image through a camera;
s2: performing three-dimensional reconstruction according to the collected 2D sequence image of the plant to obtain a 3D point cloud of the plant;
s3: preprocessing and manually labeling the 3D point cloud of the plant to obtain a labeled point cloud;
s4: and constructing a multi-head two-stage attention three-dimensional point cloud semantic segmentation network, taking the marked point cloud as input, and performing semantic label prediction by the three-dimensional point cloud semantic segmentation network to finish the segmentation of the three-dimensional point cloud semantic.
In the scheme, the image acquisition platform consists of a camera shed, a support, a white circular turntable and a camera; wherein: the camera shed (0.8 m × 0.8 m) comprises a white background plate of the same size and 3 LED lamp tubes that supplement the light when illumination is insufficient; the support fixes the position of the camera; the white circular turntable is used for placing the target plant to be three-dimensionally reconstructed, and the plant rotates with the turntable along a specific motion track; the camera photographs the plant to collect the plant 2D sequence images. To ensure shooting stability, the camera is set to autofocus mode throughout the shooting process, and all camera parameters are kept unchanged. The scheme builds a platform for automatically acquiring plant 2D sequence images and performs three-dimensional reconstruction on the uncalibrated 2D sequence images captured from multiple viewpoints; it is subject to few environmental constraints, is self-correcting, and places low demands on the camera: unlike a Kinect-type 3D camera, only an ordinary RGB camera is needed, so the robustness is strong.
In the scheme, by constructing the semantic segmentation network, a data-driven, end-to-end deep learning model that directly processes unordered 3D point clouds is provided in the agricultural field, and organ-level segmentation can be performed automatically and efficiently on plant three-dimensional point clouds.
Wherein, the step S2 specifically includes the following steps:
s21: extracting feature points from the plant 2D sequence images by adopting a Scale Invariant Feature Transform (SIFT) operator, establishing a K-dimensional space binary tree model by using a nearest field search algorithm, and calculating Euclidean distance between the feature points of every two plant 2D sequence images through the K-dimensional space binary tree model to perform stereo matching of the feature points to obtain matching points;
s22: solving the camera attitude by adopting a consistency algorithm, screening the matching points, and providing wrong or outlier matching points;
s23: based on the obtained camera pose, recovering three-dimensional point coordinates corresponding to the matching points by using a triangulation algorithm, and then performing iterative optimization on the camera pose and the three-dimensional point coordinates by using a beam adjustment algorithm to obtain a sparse point cloud;
s24: and expanding pixels around the feature points of the sparse point cloud by adopting CMVS (Cluster Multi View selector) and PMVS (batch-based Multi View selector) algorithms to form dense point cloud, namely obtaining the 3D point cloud of the plant.
In the scheme, the CMVS algorithm is a multi-view three-dimensional clustering algorithm that clusters the input images to reduce the image data volume; the PMVS algorithm is a multi-view stereo vision algorithm that generates a dense point cloud through the three steps of matching, expansion and filtering.
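As an illustration of the feature-matching step S21, the following NumPy/SciPy sketch matches SIFT-style descriptors between two images with a k-d tree and a nearest-neighbor distance test. The function name, the 128-dimensional descriptor data and the ratio test are assumptions for the example, not part of the patented method:

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Match descriptors of image A against image B via a k-d tree.

    desc_a, desc_b: (N, 128) arrays of SIFT-style descriptors (hypothetical data).
    Returns (i, j) index pairs whose nearest neighbour clearly beats the
    second-nearest one (Lowe's ratio test), mimicking the Euclidean-distance
    stereo matching of feature points described in step S21.
    """
    tree = cKDTree(desc_b)                  # K-dimensional space binary tree over image B
    dist, idx = tree.query(desc_a, k=2)     # two nearest neighbours per query descriptor
    keep = dist[:, 0] < ratio * dist[:, 1]  # keep only unambiguous matches
    return [(int(i), int(idx[i, 0])) for i in np.flatnonzero(keep)]
```

Wrong or outlier matches that survive this test would then be removed by the consistency screening of step S22 (e.g. a RANSAC-style algorithm).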
In the scheme, during point cloud acquisition, noise is easily introduced by environmental factors, the acquisition equipment, human disturbance and the like, which biases the segmentation and measurement results. The acquired 3D point cloud therefore needs to be preprocessed to improve the accuracy of subsequent prediction. Accordingly, in the step S3, the preprocessing specifically includes a background removal process and a point cloud filtering process; wherein:
in the background removing process, the color features of the 3D point cloud of the plant obtained in the step S2 are taken as different bases, a method based on a color threshold value is used, a green plant point cloud and a red paper point cloud are extracted by utilizing an RGB channel, and irrelevant background parts of the point cloud are removed; the color threshold is specifically set as: G-B is more than or equal to 15 or R-B is more than or equal to 15 and R is more than or equal to 120 and R-G is more than or equal to 45 and R-B is more than or equal to 45; then, in the point cloud filtering process, filtering the point cloud with the irrelevant background part removed by using the statistical outlierremove in the PCL point cloud library, and the specific process is as follows:
All points in the point cloud are traversed, and for each point the average distance to its K nearest neighbor points is computed. The mean μ and standard deviation σ of all these average distances are then calculated. It is assumed here that the average distances follow a normal distribution whose shape is determined by the mean μ and the standard deviation σ, so points whose average distance exceeds a threshold are defined as outliers. The distance threshold is computed as:

threshold = μ + α·σ

where α is a constant. Finally, all points are traversed again, and every point whose average distance to its K nearest neighbors exceeds the distance threshold is removed.
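A minimal NumPy sketch of the two preprocessing stages (the RGB color thresholds and the statistical outlier filter) may clarify the computation. This is an illustrative re-implementation for small clouds, not PCL's StatisticalOutlierRemoval itself; the function names and the brute-force neighbor search are assumptions:

```python
import numpy as np

def remove_background(points, colors):
    """Keep green-plant and red-paper points via the RGB thresholds in the text."""
    # cast to int so uint8 color channels cannot underflow when subtracted
    r, g, b = (colors[:, i].astype(int) for i in range(3))
    green = (g - b) >= 15
    red = (r - b >= 15) & (r >= 120) & (r - g >= 45) & (r - b >= 45)
    return points[green | red]

def statistical_outlier_removal(points, k=8, alpha=1.0):
    """Drop points whose mean k-NN distance exceeds threshold = mu + alpha * sigma."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    knn = np.sort(d, axis=1)[:, :k]            # k nearest-neighbour distances per point
    mean_d = knn.mean(axis=1)
    mu, sigma = mean_d.mean(), mean_d.std()
    return points[mean_d <= mu + alpha * sigma]
```

For real clouds, PCL (or at least a k-d tree) would replace the O(N²) distance matrix, but the thresholding logic is the same.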
In the step S3, the manual labeling specifically includes:
importing the point cloud to be manually labeled into CloudCompare software, using a segmentation tool to perform the semantic segmentation and labeling process on the point cloud, and assigning each point in the cloud a category label of leaf, stem or non-plant, thereby completing the manual labeling.
In the above scheme, since the subsequently constructed semantic segmentation network needs to be trained before use, a data set for training the semantic segmentation network can be obtained through preprocessing and manual labeling before actual segmentation, and the data set is divided into a training set and a test set according to a ratio of 3.
In the step S4, the three-dimensional point cloud semantic segmentation network adopts an encoder-decoder structure, and specifically performs the following steps:
s41: inputting the manually marked point cloud into a semantic segmentation network, and performing dimensionality increasing operation on the point cloud by a Full Connected (FC) layer, wherein dimensionality is changed into (9 → 32);
s42: encoding the point clouds subjected to dimensionality increase through four encoders, gradually reducing the number of the point clouds and increasing the dimensionality of each point cloud; each encoder consists of a down-sampling module and a multi-head two-stage attention module, the point cloud is down-sampled at four times of sampling rate, each layer only retains 25% of point characteristics, namely the cardinality of the generated point cloud is changed intoN→N/4→N/16→N/64→N/256,NThe number of representative points; meanwhile, the multi-head two-stage attention module is used for acquiring the geometric features of the plant point cloud in a layered mode, the feature dimension of each layer is gradually increased to reserve more information, namely the feature transformation is 32 → 64 → 128 → 256 → 512;
s43: after passing through the encoder, four decoders are used to restore the point number of the point cloud intoN(ii) a For each layer in the decoder, an up-sampling module and a multi-layer perceptron are included; the up-sampling module firstly uses a KNN algorithm to inquire K nearest neighbor points of each point cloud, and then up-samples the point cloud through a nearest neighbor interpolation algorithm; then, splicing the up-sampled features and the intermediate features generated by the corresponding encoders through jump connection to obtain a fused feature map; finally, inputting the fused feature map into a multilayer perceptron to obtain output;
s44: with three shared fully connected layers, namely dimension change to (N, 128) → (N, 32) → (N, C), and a random loss rate set applied after the first fully connected layer
Figure 799206DEST_PATH_IMAGE006
The drop layer(s) of (a),
Figure 262549DEST_PATH_IMAGE006
is a natural number less than 1; the output of the semantic segmentation network isN×CThe semantic tag of (1), whereinCRepresenting the number of categories.
In the scheme, the down-sampling module down-samples the input point cloud, reducing the number of points while increasing the feature dimension of each point. The scheme uses farthest point sampling: the point farthest from the existing sample set is iteratively selected to obtain the sampled point cloud; then, for each sampled point, a ball query with a fixed radius finds all points within that radius, generating several local regions; each local region is passed through a multilayer perceptron and finally max-pooled, and the resulting features are the output of the down-sampling module.
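The farthest point sampling loop described above can be sketched in a few lines of NumPy; the function name and the fixed seed point are assumptions for illustration:

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Iteratively pick the point farthest from the already-selected sample set."""
    chosen = [0]                                            # arbitrary seed point
    min_d = np.linalg.norm(points - points[0], axis=1)      # distance to sample set
    for _ in range(m - 1):
        nxt = int(np.argmax(min_d))                         # farthest from current set
        chosen.append(nxt)
        min_d = np.minimum(min_d, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]
```

On a line of 10 evenly spaced points, sampling 3 returns the two endpoints plus a near-middle point, showing how the method spreads samples over the whole cloud.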
In the scheme, to extract deep semantic features of the point cloud, a multi-head two-stage attention module is designed, consisting of a local attention module and a global attention module. The local attention module uses a ball query to search local regions, focusing on important neighbor features and giving them greater weight. The global attention module captures the feature dependencies among all points based on a self-attention mechanism. Finally, the output features of the local and global attention modules are concatenated to obtain the output of the whole attention module. In addition, a multi-head attention mechanism is introduced to obtain more complete feature information and improve the generalization capability of the network.
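To make the two-stage structure concrete, here is a toy single-head NumPy sketch in which the local stage masks self-attention to ball-query neighborhoods and the global stage attends over all points, with the two outputs concatenated. The learned projections (W_q, W_k, W_v) of a real attention module are replaced by identities, so this shows only the data flow, not the patented module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stage_attention(xyz, feats, radius=0.5):
    """Single-head sketch: local (ball-query-masked) + global self-attention.

    xyz: (n, 3) point coordinates; feats: (n, c) point features.
    Returns (n, 2c): local and global attention outputs concatenated.
    """
    n, c = feats.shape
    scores = feats @ feats.T / np.sqrt(c)          # scaled dot-product scores
    global_out = softmax(scores) @ feats           # global stage: attend to all points
    d = np.linalg.norm(xyz[:, None] - xyz[None, :], axis=-1)
    masked = np.where(d <= radius, scores, -np.inf)  # local stage: ball-query mask
    local_out = softmax(masked) @ feats
    return np.concatenate([local_out, global_out], axis=1)
```

With a radius large enough to cover the whole cloud, the local stage degenerates to the global one, which makes a handy sanity check.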
In the above scheme, for dense tasks such as semantic segmentation, which classify every point in the point cloud, the point set needs to be up-sampled back to the original number of points. Similar to the deconvolution operation in a convolutional neural network, an up-sampling module is used to transition features from the shape level to the point level: the KNN algorithm finds the K nearest neighbor points of each point, and interpolation in three-dimensional space is based on the Euclidean distances between the point and its K neighbors.
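A small NumPy sketch of the interpolation performed by the up-sampling module: each dense-level point receives a weighted average of the features of its K nearest sparse-level points. The inverse-distance weighting and the function name are assumptions made for the example (the text only specifies KNN plus nearest-neighbor interpolation in 3D space):

```python
import numpy as np

def knn_interpolate(dense_xyz, sparse_xyz, sparse_feats, k=3, eps=1e-8):
    """Upsample features from a sparse point set onto a dense one.

    dense_xyz: (n, 3); sparse_xyz: (m, 3); sparse_feats: (m, c).
    Each dense point takes an inverse-distance weighted average of the
    features of its k nearest sparse points, before the skip connection.
    """
    d = np.linalg.norm(dense_xyz[:, None] - sparse_xyz[None, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]                 # k nearest sparse points
    nd = np.take_along_axis(d, idx, axis=1)
    w = 1.0 / (nd + eps)
    w /= w.sum(axis=1, keepdims=True)                  # normalised inverse-distance weights
    return (sparse_feats[idx] * w[..., None]).sum(axis=1)
```

A dense point coinciding with a sparse point recovers that point's feature almost exactly, while a point midway between two sparse points gets their average.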
The scheme also provides a three-dimensional point cloud semantic segmentation system based on multi-head two-stage attention, which comprises an image acquisition platform, a three-dimensional reconstruction module, a preprocessing module, a manual labeling module and a semantic segmentation network construction module; wherein:
the image acquisition platform acquires a high-precision and multi-angle plant 2D sequence image through a camera;
the three-dimensional reconstruction module is used for performing three-dimensional reconstruction according to the collected 2D sequence images of the plants to obtain 3D point clouds of the plants;
the preprocessing module is used for preprocessing the 3D point cloud of the plant;
the manual marking module is used for manually marking the preprocessed point cloud;
the semantic segmentation network construction module is used for constructing a semantic segmentation network, the marked point cloud is used as input, and the semantic label prediction is carried out by the semantic segmentation network to obtain the segmentation result of the plant organ level.
The three-dimensional reconstruction module comprises a matching point acquisition unit, a matching point screening unit, a sparse point cloud acquisition unit and a dense point cloud acquisition unit; wherein:
the matching point obtaining unit adopts a Scale-Invariant Feature Transform (SIFT) operator to extract feature points from the plant 2D sequence images, and establishes a K-dimensional space binary tree model with a nearest neighbor search algorithm to calculate the Euclidean distance between the feature points of every two plant 2D sequence images for stereo matching of the feature points, obtaining matching points;
the matching point screening unit adopts a consistency algorithm to solve the camera attitude, screens the matching points and eliminates wrong or outlier matching points;
the sparse point cloud obtaining unit recovers three-dimensional point coordinates corresponding to the matching points by using a triangulation algorithm based on the obtained camera pose, and then performs iterative optimization on the camera pose and the three-dimensional point coordinates by using a light beam adjustment algorithm to obtain a sparse point cloud;
and the dense point cloud obtaining unit adopts CMVS and PMVS algorithms to expand the pixels around the feature points of the sparse point cloud to form dense point cloud, namely obtaining the 3D point cloud of the plant.
The preprocessing module comprises a background removing unit and a point cloud filtering unit; wherein:
in the background removal unit, based on the different color features, a color-threshold-based method uses the RGB (red, green and blue) channels to extract the green plant point cloud and the red paper point cloud from the plant 3D point cloud, removing the irrelevant background portion of the point cloud;
in the point cloud filtering unit, the point cloud with the irrelevant background removed is filtered with the StatisticalOutlierRemoval filter in the PCL point cloud library, as follows:
All points in the point cloud are traversed, and for each point the average distance to its K nearest neighbor points is computed. The mean μ and standard deviation σ of all these average distances are then calculated. It is assumed here that the average distances follow a normal distribution whose shape is determined by the mean μ and the standard deviation σ, so points whose average distance exceeds a threshold are defined as outliers, and the points to be filtered out can thereby be determined. The distance threshold is computed as:

threshold = μ + α·σ

where α is a constant. Finally, the point cloud is traversed again, and every point whose average distance to its K nearest neighbors exceeds the distance threshold is removed.
The manual labeling module is implemented with CloudCompare software, and specifically includes:
importing the point cloud to be manually labeled into CloudCompare, using a segmentation tool to perform the semantic segmentation and labeling process on the point cloud, and assigning each point one of the labels leaf, stem or non-plant, thereby completing the manual labeling.
The semantic segmentation network constructed by the semantic segmentation network construction module specifically executes the following operations: predicting the category labels of the points in the point cloud to complete semantic label prediction, wherein:
the three-dimensional point cloud semantic segmentation network adopts an encoder-decoder structure and specifically executes the following steps:
inputting the manually labeled point cloud into the semantic segmentation network, where a Fully Connected (FC) layer performs a dimension-raising operation (9 → 32) on the point cloud;
encoding the up-dimensioned point cloud through four encoders, gradually reducing the number of points while increasing the feature dimension of each point; each encoder consists of a down-sampling module and a multi-head two-stage attention module. The point cloud is down-sampled at a 4× rate, each layer retaining only 25% of the point features, i.e., the cardinality of the generated point cloud changes as N → N/4 → N/16 → N/64 → N/256, where N is the number of points; meanwhile, the multi-head two-stage attention module acquires the geometric features of the plant point cloud hierarchically, and the feature dimension of each layer is gradually increased to retain more information, i.e., the features transform as 32 → 64 → 128 → 256 → 512;
after the encoder, four decoders are used to restore the number of points in the point cloud to N. Each decoder layer contains an up-sampling module and a multilayer perceptron. The up-sampling module first queries the K nearest neighbor points of each point with the KNN algorithm and then up-samples the point cloud through a nearest-neighbor interpolation algorithm; next, the up-sampled features are concatenated with the intermediate features generated by the corresponding encoder through a skip connection to obtain a fused feature map; finally, the fused feature map is input into the multilayer perceptron to obtain the output;
the output of a dropout layer semantic segmentation network with three shared FC layers, namely dimension change of (N, 128) → (N, 32) → (N, C), and a random loss rate set to 0.5 applied after the first fully connected layer isN×CThe semantic tag of (1), whereinCRepresenting the number of categories.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention, and provides a deep learning model for directly processing disordered 3D point cloud from end to end based on data drive in the agricultural field by constructing a semantic segmentation network, so that organ-level segmentation can be automatically and efficiently carried out on plant three-dimensional point cloud.
Drawings
FIG. 1 is a schematic flow chart of the method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of the step S2 according to an embodiment of the present invention;
FIG. 3 is a diagram of an image acquisition platform constructed in the step S1 according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the manual labeling in the step S3 according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a semantic segmentation network according to an embodiment of the present invention;
FIG. 6 is a diagram of a multi-head two-level attention module in the semantic segmentation network according to an embodiment of the present invention;
fig. 7 is a visualization diagram of the plant three-dimensional point cloud segmentation result in an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the embodiment is a complete use example and has rich content
For the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a three-dimensional point cloud semantic segmentation method based on multi-head two-stage attention includes the following steps:
s1: constructing an image acquisition platform, and acquiring a high-precision and multi-angle plant 2D sequence image through a camera;
s2: performing three-dimensional reconstruction according to the collected 2D sequence image of the plant to obtain a 3D point cloud of the plant;
s3: preprocessing and manually marking the 3D point cloud of the plant;
s4: and constructing a semantic segmentation network, taking the marked point cloud as input, and performing semantic label prediction by the semantic segmentation network to obtain a segmentation result of the plant organs.
In a specific implementation process, as shown in fig. 3, the image acquisition platform consists of a camera shed, a support, a white circular turntable and a camera; wherein: the camera shed (0.8 m × 0.8 m) comprises a white background plate of the same size and 3 LED lamp tubes that supplement the light when illumination is insufficient; the support fixes the position of the camera; the white circular turntable is used for placing the target plant to be three-dimensionally reconstructed, and the plant rotates with the turntable along a specific motion track; the camera photographs the plant to collect the plant 2D sequence images. To ensure shooting stability, the camera is set to autofocus mode throughout the shooting process, and all camera parameters are kept unchanged. The scheme builds a platform for automatically acquiring plant 2D sequence images and performs three-dimensional reconstruction on the uncalibrated 2D sequence images captured from multiple viewpoints; it is subject to few environmental constraints, is self-correcting, and places low demands on the camera: unlike a Kinect-type 3D camera, only an ordinary RGB camera is needed, so the robustness is strong.
In the specific implementation process, the method provides a deep learning model for directly processing unordered 3D point clouds from end to end based on data driving by constructing a semantic segmentation network in the agricultural field, and can automatically and efficiently perform organ-level segmentation on the plant three-dimensional point clouds.
More specifically, as shown in fig. 2, the step S2 specifically includes the following steps:
s21: extracting feature points from the plant 2D sequence images by adopting a Scale Invariant Feature Transform (SIFT) operator, then establishing a K-dimensional space binary tree model by using a nearest neighbor search algorithm to calculate the Euclidean distance between the feature points of every two plant 2D sequence images so as to perform stereo matching on the feature points, and obtaining matching points;
s22: solving the camera attitude by adopting a consistency algorithm, screening the matching points, and eliminating wrong or outlier matching points;
s23: based on the obtained camera pose, recovering three-dimensional point coordinates corresponding to the matching points by using a triangulation algorithm, and performing iterative optimization on the camera pose and the three-dimensional point coordinates by using a beam adjustment algorithm to obtain sparse point cloud;
s24: expanding pixels around the feature points of the sparse point cloud with the CMVS (Cluster-based Multi-View Stereo) and PMVS (Patch-based Multi-View Stereo) algorithms to form a dense point cloud, i.e., the 3D point cloud of the plant.
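As a concrete illustration of the stereo-matching in step S21, the following is a minimal numpy sketch of matching SIFT-like descriptors between two images by Euclidean distance. It uses a brute-force distance matrix rather than a K-dimensional tree, and the ratio test used to reject ambiguous matches is an assumption of this sketch (the function name and threshold are illustrative, not from the original):

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Match descriptors of two images by Euclidean distance.

    A match (i, j) is kept only if the best distance is clearly
    smaller than the second-best (Lowe's ratio test, an assumption
    added here to filter ambiguous matches)."""
    # Pairwise squared Euclidean distances, shape (n_a, n_b)
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(-1)
    matches = []
    for i, row in enumerate(d2):
        order = np.argsort(row)
        best, second = order[0], order[1]
        if row[best] < (ratio ** 2) * row[second]:
            matches.append((i, int(best)))
    return matches
```

A KD-tree (e.g. `scipy.spatial.cKDTree`) would replace the brute-force distance matrix for large descriptor sets, as the original text describes.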
In a specific implementation process, the CMVS algorithm is a multi-view three-dimensional clustering algorithm that clusters massive images to reduce the image data volume; the PMVS algorithm is a multi-view stereo vision algorithm that generates a dense point cloud through the three steps of matching, expansion and filtering.
In the specific implementation process, the point cloud acquisition procedure is easily affected by environmental factors, the acquisition equipment, human disturbance and the like, which introduces noise and biases the segmentation and measurement results. Therefore, the obtained 3D point cloud needs to be preprocessed to improve the accuracy of subsequent prediction. More specifically, in the step S3, the preprocessing specifically includes a background removal process and a point cloud filtering process; wherein:
in the background removal process, based on the differences in color characteristics, a color-threshold method operating on the RGB channels is used to extract the plant from the 3D point cloud obtained in the step S2 and remove the irrelevant red paper background part of the point cloud; the color thresholds are specifically set as: G − B ≥ 15 or R − B ≥ 15, and R ≥ 120 and R − G ≥ 45 and R − B ≥ 45;
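The color-threshold background removal can be sketched as below. The threshold values are those quoted above; the exact grouping of the conditions into a "red background" test and a "plant" test is this sketch's assumption, as is the function name:

```python
import numpy as np

def remove_red_background(points, colors):
    """Separate plant points from the red-paper background using
    RGB thresholds (values from the text; condition grouping is an
    assumption of this sketch)."""
    r = colors[:, 0].astype(int)
    g = colors[:, 1].astype(int)
    b = colors[:, 2].astype(int)
    # Red-paper background: strong, dominant red channel
    is_red = (r >= 120) & (r - g >= 45) & (r - b >= 45)
    # Plant points: green/blue-dominant colors, excluding the background
    is_plant = ((g - b >= 15) | (r - b >= 15)) & ~is_red
    return points[is_plant], colors[is_plant]
```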
in the point cloud filtering process, the StatisticalOutlierRemoval filter in the PCL point cloud library is used to filter the point cloud obtained in the last step. Its implementation principle is as follows: the average neighbor distances are assumed to follow a normal distribution whose shape is determined by the mean μ and the standard deviation σ, so an outlier is defined as a point whose average distance exceeds a threshold. All points in the point cloud are traversed and the average distance between each point and its K nearest neighboring points is calculated; then the mean μ and the standard deviation σ of all the average distances are calculated. The distance threshold is computed by the following formula:

threshold = μ + α · σ

where α is a constant; finally, all points are traversed again, and every point whose average distance to its K nearest neighboring points is greater than the distance threshold is removed.
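The filtering principle just described can be sketched directly in numpy (a minimal brute-force version of what PCL's StatisticalOutlierRemoval does internally; parameter names are illustrative):

```python
import numpy as np

def statistical_outlier_removal(points, k=3, alpha=1.0):
    """Remove points whose mean distance to their k nearest
    neighbours exceeds mu + alpha * sigma, where mu and sigma are
    the mean and standard deviation of all such mean distances."""
    # Full pairwise distance matrix (fine for small clouds)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Drop the zero self-distance (column 0 after sorting), keep k nearest
    knn = np.sort(d, axis=1)[:, 1:k + 1]
    mean_dist = knn.mean(axis=1)
    mu, sigma = mean_dist.mean(), mean_dist.std()
    return points[mean_dist <= mu + alpha * sigma]
```

For real data, PCL's filter (or Open3D's `remove_statistical_outlier`) uses a KD-tree instead of the full distance matrix.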
More specifically, in the step S3, the manual labeling is implemented with the CloudCompare software, specifically:
the point cloud to be manually labeled is input into the CloudCompare software, and a segmentation tool is used to carry out the semantic segmentation and labeling process on the point cloud, assigning each point one of three category labels: Leaf, Stem or Non-plant. Manual labeling is thereby completed; as shown in fig. 4, the numbers in the figure serve as labels.
In a specific implementation process, because a subsequently constructed semantic segmentation network needs to be trained first for use, a data set for training the semantic segmentation network can be obtained through preprocessing and manual labeling before actual segmentation, the data set is divided into a training set and a test set according to the proportion of 3.
More specifically, in step S4, the semantic segmentation network specifically performs the following operations: predicting the category labels of the points in the point cloud to complete semantic label prediction; wherein:
the semantic segmentation network adopts an encoder-decoder structure, as shown in fig. 5, in which the MLP is a multi-layer perceptron, and specifically performs the following steps:
s41: the manually labeled point cloud is input into the semantic segmentation network, and a multilayer perceptron performs a dimension-raising operation on the point cloud; the input point cloud has dimension N × 9, where N represents the number of points (8092 in this embodiment) and 9 represents the feature dimension of each point; the multilayer perceptron raises the per-point dimension from 9 to 32;
s42: the point cloud after the dimension raising is encoded by four encoders, which gradually reduce the number of points while increasing the feature dimension of each point; each encoder is composed of a down-sampling module and an attention module. The point cloud is down-sampled at a four-fold sampling rate, i.e., only 25% of the point features are retained after each layer, so the cardinality of the point cloud changes from N to N/256 (N → N/4 → N/16 → N/64 → N/256); meanwhile, the feature dimension of each layer is gradually increased to retain more information, changing from M to 2⁴M, where M represents the per-point dimension after the dimension raising in step S41, i.e., [32 → 64 → 128 → 256 → 512] in this embodiment;
S43: after the encoders, four decoders are used to restore the number of points of the point cloud to N; each layer of the decoder comprises an up-sampling module and a multilayer perceptron: first, the K nearest neighbor points of each point are queried with the KNN algorithm, and the point cloud is up-sampled by nearest-neighbor interpolation; then the up-sampled features are spliced, through a skip connection, with the intermediate features generated by the corresponding encoder to obtain a fused feature map; the fused feature map is then input into the multilayer perceptron to obtain the output;
s44: the result passes through three shared FC layers, i.e., (N, 128) → (N, 32) → (N, C), and after the first FC layer a dropout layer with the random loss rate set to 0.5 is applied. The output of the network is the N × C semantic label prediction result, where C represents the number of categories.
In a specific implementation process, the down-sampling module can perform down-sampling on the input point clouds, so that the number of the point clouds is reduced, and the feature of each point is subjected to dimension increasing. According to the scheme, farthest point sampling is used, farthest points from an existing sampling point set are selected continuously and iteratively to obtain sampled point clouds, then a ball query algorithm is used for fixing a radius for each point cloud to find all points in the radius, a plurality of local areas are generated, each local area passes through a multilayer perceptron, and finally maximum pooling is carried out, so that the obtained characteristic is the output of downsampling.
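The farthest point sampling used by the down-sampling module can be sketched as below (seeding with the first point is an assumption of this sketch; any starting point works):

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively select the point farthest from the set of
    already-selected points, as in the down-sampling module."""
    selected = [0]                       # seed with the first point
    min_dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(min_dist))   # farthest from current set
        selected.append(nxt)
        # Keep, per point, its distance to the nearest selected point
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(points - points[nxt], axis=1))
    return points[selected]
```

After sampling, a ball query gathers every point within a fixed radius of each sampled point to build the local regions described above.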
In a specific implementation process, in order to extract deep semantic features of the point cloud, a multi-head two-stage attention module is designed in this embodiment, consisting of a local attention module and a global attention module, as shown in fig. 6. The local attention module uses ball query to search for local regions, focusing on important neighbor features and giving them more weight. The global attention module focuses on the feature dependencies among all points based on a self-attention mechanism. Finally, the output features of the local attention module and the global attention module are spliced to obtain the output of the whole attention module. In addition, a multi-head attention mechanism is introduced to obtain more complete feature information and improve the generalization capability of the network.
First, a position embedding module is introduced that explicitly expresses the spatial position encoding as:

r_i^k = ( p_i ⊕ p_i^k ⊕ (p_i − p_i^k) ⊕ ‖p_i − p_i^k‖ )

Then the corresponding point features are spliced to it:

f̂_i^k = r_i^k ⊕ f_i^k

where p_i and p_i^k respectively represent the coordinates of the center point and of its k-th neighboring point, ‖·‖ represents the Euclidean distance between the neighboring point and the center point, and ⊕ denotes the splicing (concatenation) operation on features. Then, instead of max/average pooling, which often loses most of the information, a powerful attention mechanism is used to automatically aggregate the feature information. In detail, this embodiment uses a function g(·) to learn attention scores and computes the weighted sum of these features:

s_i^k = g(f̂_i^k, W),    f̃_i = Σ_k ( s_i^k · f̂_i^k )

where the function g is a shared multilayer perceptron (MLP) and W represents its learnable weights.
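A minimal numpy sketch of this local attentive aggregation for one center point follows. For simplicity the learnable scoring function g is a single linear map W rather than a shared MLP, and scores are normalized per feature channel over the neighbors (both assumptions of this sketch):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pooling(center, neighbors, feats, W):
    """Position-encode each neighbour as [p ; p_k ; p - p_k ; ||p - p_k||],
    append its feature, score with a learnable map W, and aggregate by a
    softmax-weighted sum instead of max pooling."""
    diff = center[None, :] - neighbors                       # (k, 3)
    dist = np.linalg.norm(diff, axis=1, keepdims=True)       # (k, 1)
    r = np.concatenate([np.broadcast_to(center, neighbors.shape),
                        neighbors, diff, dist], axis=1)      # (k, 10)
    f_hat = np.concatenate([r, feats], axis=1)               # (k, 10 + d)
    scores = softmax(f_hat @ W, axis=0)                      # weights over neighbours
    return (scores * f_hat).sum(axis=0)                      # aggregated feature
```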
After the local features are aggregated, a global attention module is designed to update the global features based on self-attention, computing the attention scores of all points with matrix dot products. Let Q, K, V respectively denote the query, key and value matrices, generated from linear mappings of the input features. First, the attention weights are computed with the matrix dot product:

Ã = Q · Kᵀ

The attention scores are then normalized with the softmax function; to enhance the normalization effect, this embodiment additionally applies the L1 norm:

ā_{i,j} = softmax(ã_{i,j}),    a_{i,j} = ā_{i,j} / Σ_k ā_{i,k}

The feature F_s is obtained by multiplying the normalized attention by the value matrix: F_s = A · V. On this basis, element-wise subtraction is introduced to compute the offset between the input features and F_s; the function g(·), two shared multilayer perceptrons followed by a nonlinear ReLU layer, maps this offset and the result is added back to the input:

F_out = g(F_in − F_s) + F_in

In addition, a multi-head attention mechanism is introduced to obtain more complete information and further improve the generalization capability of the network; the outputs of the M attention heads are spliced:

F = F¹ ⊕ F² ⊕ … ⊕ F^M

where F^m represents the output of the m-th attention head and the number of attention heads M is 4 in this embodiment.
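A single-head numpy sketch of the offset-attention step described above follows. The linear maps are passed in as plain matrices, g is supplied as a callable, and the exact order of the softmax and L1 normalizations (softmax over keys, then L1 over queries) is an assumed reading of the text:

```python
import numpy as np

def offset_attention(F_in, Wq, Wk, Wv, g):
    """Global attention with an offset: dot-product scores, softmax
    plus an L1 normalisation, then g(F_in - F_s) added back to F_in."""
    Q, K, V = F_in @ Wq, F_in @ Wk, F_in @ Wv
    A = Q @ K.T                                    # raw attention weights
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)           # softmax over keys
    A = A / (np.abs(A).sum(axis=0, keepdims=True) + 1e-9)  # L1 normalisation
    F_s = A @ V                                    # attended features
    return g(F_in - F_s) + F_in                    # offset-attention output
```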
In a specific implementation process, for dense tasks such as semantic segmentation that classify every point in the point cloud, the point set needs to be up-sampled and restored to the original number of points. Analogous to the deconvolution operation in a convolutional neural network, the point cloud is up-sampled with the up-sampling module to transfer features from the shape level to the point level. The KNN algorithm is used for each point p to find its k nearest neighboring points, and interpolation is performed based on the Euclidean distances in three-dimensional space between p and its k neighbors. The specific calculation formulas are:

f(p) = ( Σ_{i=1}^{k} w_i · f(p_i) ) / ( Σ_{i=1}^{k} w_i ),    w_i = 1 / d(p, p_i)

where p_i is a neighboring point of p, the weight w_i is inversely proportional to the distance between p and p_i (the farther p_i is from p, the smaller this weight), and d(p, p_i) represents the distance function between p and p_i; the Euclidean distance is used in this embodiment.
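The inverse-distance interpolation used by the up-sampling module can be sketched as below (brute-force nearest-neighbor search; `eps` guards against division by zero when a query coincides with a support point):

```python
import numpy as np

def idw_upsample(query, support, support_feats, k=2, eps=1e-8):
    """Interpolate features for query points as the inverse-distance
    weighted average of their k nearest support points."""
    d = np.linalg.norm(query[:, None, :] - support[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]                 # k nearest supports
    nd = np.take_along_axis(d, idx, axis=1)            # their distances
    w = 1.0 / (nd + eps)                               # inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)               # normalise per query
    return (w[..., None] * support_feats[idx]).sum(axis=1)
```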
Example 2
The scheme also provides a three-dimensional point cloud semantic segmentation system based on multi-head two-stage attention, as shown in FIG. 4, the three-dimensional point cloud semantic segmentation system based on multi-head two-stage attention is used for realizing the three-dimensional point cloud semantic segmentation method based on multi-head two-stage attention, and comprises an image acquisition platform, a three-dimensional reconstruction module, a preprocessing module, an artificial labeling module and a semantic segmentation network construction module; wherein:
the image acquisition platform acquires 2D sequence images of plants with high precision and multiple angles through a camera;
the three-dimensional reconstruction module is used for performing three-dimensional reconstruction according to the acquired 2D sequence images of the plants to obtain 3D point clouds of the plants;
the preprocessing module is used for preprocessing the 3D point cloud of the plant;
the manual marking module is used for manually marking the preprocessed point cloud;
the semantic segmentation network construction module is used for constructing a semantic segmentation network, the marked point cloud is used as input, and semantic label prediction is carried out by the semantic segmentation network to obtain the segmentation result of the plant organs.
More specifically, the three-dimensional reconstruction module comprises a matching point acquisition unit, a matching point screening unit, a sparse point cloud acquisition unit and a dense point cloud acquisition unit; wherein:
the matching point obtaining unit adopts a scale-invariant local feature description operator to extract feature points from the plant 2D sequence images, and establishes a K-dimensional space binary tree model to calculate the Euclidean distance between the feature points of every two plant 2D sequence images by using a nearest neighbor search algorithm so as to perform three-dimensional matching on the feature points, so as to obtain matching points;
the matching point screening unit adopts a consistency algorithm to solve the camera attitude, screens the matching points and eliminates wrong or outlier matching points;
the sparse point cloud obtaining unit recovers three-dimensional point coordinates corresponding to the matching points by using a triangulation algorithm based on the obtained camera pose, and then performs iterative optimization on the camera pose and the three-dimensional point coordinates by using a light beam adjustment algorithm to obtain a sparse point cloud;
and the dense point cloud obtaining unit adopts CMVS and PMVS algorithms to expand the pixels around the feature points of the sparse point cloud to form dense point cloud, namely obtaining the 3D point cloud of the plant.
More specifically, the preprocessing module comprises a background removing unit and a point cloud filtering unit; wherein:
in a background removing unit, extracting green plant point cloud and red paper background by using RGB (red, green and blue) channels according to different color characteristics of the 3D point cloud of the plant by using a method based on a color threshold value, and removing irrelevant red paper background parts of the point cloud;
in the point cloud filtering unit, the StatisticalOutlierRemoval filter in the PCL point cloud library is used to filter the point cloud obtained in the last step. The principle is as follows: the average neighbor distances are assumed to follow a normal distribution whose shape is determined by the mean μ and the standard deviation σ, so an outlier is defined as a point whose average distance exceeds a threshold. All point data in the point cloud are traversed and the average distance between each point and its K nearest neighboring points is calculated; then the mean μ and the standard deviation σ of all the average distances are calculated, and the distance threshold is computed by the following formula:

threshold = μ + α · σ

where α is a constant.
Finally, the point cloud is traversed again, and every point whose average distance to its K neighboring points is greater than the threshold is removed.
More specifically, the manual labeling module is implemented with the CloudCompare software, specifically:
the point cloud to be manually labeled is input into the CloudCompare software, a segmentation tool is used to carry out the semantic segmentation and labeling process on the point cloud, and each point is assigned one of the category labels leaf, stem or non-plant, thereby completing the manual labeling.
More specifically, the semantic segmentation network constructed by the semantic segmentation network construction module specifically comprises the following operations: predicting the category labels of the points in the point cloud to complete semantic label prediction; wherein:
the semantic segmentation network adopts an encoder-decoder structure, and specifically comprises the following steps:
a 32 → 64 dimension-raising operation is performed on the point cloud by a fully connected layer (FC); then four encoders are used for encoding, gradually reducing the number of points and increasing the dimension of each point; each encoder consists of a down-sampling module and an attention module; the point cloud is down-sampled at a four-fold sampling rate, i.e., only 25% of the point features remain after each layer, so the cardinality of the resulting point cloud changes as (N → N/4 → N/16 → N/64 → N/256), while the feature dimension of each layer is gradually increased to retain more information (32 → 64 → 128 → 256 → 512);
then the encoded point cloud is decoded by four decoders, restoring the number of points of the point cloud to N; each layer of the decoder comprises an up-sampling module and a multilayer perceptron: first, the KNN algorithm is used to query the K nearest neighbor points of each point, and the point cloud is up-sampled by nearest-neighbor interpolation; then the up-sampled features are spliced, through a skip connection, with the intermediate features generated by the corresponding encoder to obtain a fused feature map; the fused feature map is then input into the multilayer perceptron.
Finally, the output semantic tag prediction result is obtained through three shared FC layers, namely dimension change is (N, 128) → (N, 32) → (N, C), and a dropout layer with a random loss rate set to 0.5 is applied after the first FC layer.
In the specific implementation process, the system is used for realizing the point cloud segmentation method, and is simple to realize, convenient to operate and easy to apply and popularize in reality.
Example 3
More specifically, in order to further explain the technical effects of the present solution, the present embodiment will explain the solution in more detail.
In this embodiment, the segmentation result of the proposed point cloud semantic segmentation method is evaluated at the point-wise level. The common semantic segmentation evaluation indexes are the mean accuracy (mAcc) and the mean intersection-over-union (mIoU) over the segmented categories, calculated as:

mAcc = (1/k) · Σ_i p_ii / Σ_j p_ij

mIoU = (1/k) · Σ_i p_ii / ( Σ_j p_ij + Σ_j p_ji − p_ii )

where k represents the number of categories, p_ij represents the points belonging to class i that are predicted as class j, p_ji represents the points belonging to class j that are predicted as class i (both are cases of classification error), and p_ii indicates the correctly classified points.
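These indexes (together with overall accuracy OA) can be computed from a confusion matrix whose entry p[i, j] counts points of class i predicted as class j; a minimal sketch:

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, num_classes):
    """Compute OA, mAcc and mIoU from per-point labels via the
    confusion-matrix entries p_ij."""
    p = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, q in zip(y_true, y_pred):
        p[t, q] += 1
    tp = np.diag(p).astype(float)                    # p_ii, correct points
    oa = tp.sum() / p.sum()                          # overall accuracy
    acc = tp / p.sum(axis=1)                         # per-class accuracy
    iou = tp / (p.sum(axis=1) + p.sum(axis=0) - tp)  # per-class IoU
    return oa, acc.mean(), iou.mean()
```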
The final experimental result of this embodiment is that the semantic segmentation of the 3D plant point cloud aims to segment the point into three categories, namely Leaf, stem, and Non-plant. The semantic segmentation results are shown in table 1, and the visualization of the segmentation results is shown in fig. 7. In addition, the method provided by the embodiment is compared with the performance of other advanced deep neural networks.
TABLE 1 Semantic segmentation results
The columns of the table correspond to the experimental results of four segmentation models. It can be noted in table 1 that in all the models the segmentation precision of the stem is lower than that of the other two categories, because the stem is a slender plant organ and the average number of stem points in each point cloud model is only a small fraction of the number of leaf points. Since the number of stem points is small, few points can be predicted accurately, and each mispredicted point has a large influence on the accuracy of stem segmentation. It can be seen that this embodiment obtains the best segmentation performance, with 99.17% OA, 95.62% mAcc and 93.62% mIoU, and has certain advantages in local-information perception capability and segmentation accuracy compared with other mainstream algorithms. In addition, this embodiment uses other advanced deep learning models as backbone networks: 1) PointNet, 2) PointNet++, 3) DGCNN, 4) ShellNet, 5) PointWeb; their semantic segmentation performance was then explored and compared with the method herein. It can be seen from the table that the present scheme achieves the highest segmentation accuracy in all 3 categories of the data set.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to exhaustively list all embodiments here. Any modification, equivalent replacement and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (8)

1. A three-dimensional point cloud semantic segmentation method based on multi-head two-stage attention is characterized by comprising the following steps:
s1: constructing an image acquisition platform, and acquiring a high-precision and multi-angle plant 2D sequence image through a camera;
s2: performing three-dimensional reconstruction according to the acquired 2D sequence image of the plant to obtain a 3D point cloud of the plant;
s3: preprocessing and manually marking the 3D point cloud of the plant to obtain marked point cloud;
s4: constructing a multi-head two-stage attention three-dimensional point cloud semantic segmentation network, taking the marked point cloud as input, and performing semantic label prediction by the three-dimensional point cloud semantic segmentation network to finish the segmentation of three-dimensional point cloud semantics;
in the step S4, the three-dimensional point cloud semantic segmentation network adopts an encoder-decoder structure, and specifically performs the following steps:
s41: the manually labeled point cloud is input into the semantic segmentation network, and a fully connected layer performs a dimension-raising operation on the point cloud;
s42: the point cloud after the dimension raising is encoded by four encoders, gradually reducing the number of points and increasing the dimension of each point; each encoder consists of a down-sampling module and a multi-head two-stage attention module; the point cloud is down-sampled at a four-fold sampling rate, each layer retaining only 25% of the point features, i.e., the cardinality of the resulting point cloud changes as N → N/4 → N/16 → N/64 → N/256, where N represents the number of points; meanwhile, the multi-head two-stage attention module is used to acquire the geometric features of the plant point cloud in a layered manner, and the feature dimension of each layer is gradually increased to retain more information, i.e., the feature dimension is transformed as 32 → 64 → 128 → 256 → 512;
s43: after the encoders, four decoders are used to restore the number of points of the point cloud to N; each layer of the decoder comprises an up-sampling module and a multilayer perceptron; the up-sampling module first uses the KNN algorithm to query the K nearest neighbor points of each point, and then up-samples the point cloud through a nearest-neighbor interpolation algorithm; then, the up-sampled features are spliced, through a skip connection, with the intermediate features generated by the corresponding encoder to obtain a fused feature map; finally, the fused feature map is input into the multilayer perceptron to obtain the output;
s44: the result passes through three shared fully connected layers, the dimension changing as (N, 128) → (N, 32) → (N, C), and after the first fully connected layer a dropout layer with random loss rate p is applied, where p is a number less than 1; the output of the semantic segmentation network is the N × C semantic label, where C represents the number of categories.
2. The method for semantically segmenting the three-dimensional point cloud based on the multi-head two-stage attention according to claim 1, wherein the step S2 specifically comprises the following steps:
s21: extracting feature points from the plant 2D sequence images by adopting a scale-invariant feature transformation operator, establishing a K-dimensional space binary tree model by using a nearest neighbor search algorithm, and calculating the Euclidean distance between the feature points of every two plant 2D sequence images through the K-dimensional space binary tree model to perform stereo matching of the feature points, obtaining matching points;
s22: solving the camera attitude by adopting a consistency algorithm, screening the matching points, and eliminating wrong or outlier matching points;
s23: based on the obtained camera pose, recovering three-dimensional point coordinates corresponding to the matching points by using a triangulation algorithm, and performing iterative optimization on the camera pose and the three-dimensional point coordinates by using a beam adjustment algorithm to obtain sparse point cloud;
s24: and expanding pixels around the feature points of the sparse point cloud by adopting CMVS and PMVS algorithms to form dense point cloud, namely obtaining the 3D point cloud of the plant.
3. The method for semantically segmenting the three-dimensional point cloud based on the multi-head two-stage attention according to claim 1, wherein in the step S3, the preprocessing specifically comprises a background removal process and a point cloud filtering process; wherein:
in the background removal process, based on the differences in the color features of the 3D point cloud of the plant obtained in the step S2, a color-threshold method operating on the RGB channels is used to separate the green plant point cloud from the red paper background and remove the irrelevant background part of the point cloud; then, in the point cloud filtering process, the StatisticalOutlierRemoval filter in the PCL point cloud library is used to filter the point cloud with the irrelevant background part removed, the specific process being as follows:
all point data in the point cloud are traversed and the average distance between each point and its K nearest neighboring points is calculated; then the mean μ and the standard deviation σ of all the average distances are calculated; here, the average distances are assumed to follow a normal distribution whose shape is determined by the mean μ and the standard deviation σ, so a point whose average distance exceeds the threshold is defined as an outlier. The distance threshold is calculated by the following formula:

threshold = μ + α · σ

where α is a constant; finally, all points are traversed again, and every point whose average distance to its K neighboring points is greater than the distance threshold is removed.
4. The method for semantic segmentation of the three-dimensional point cloud based on multi-head two-stage attention according to claim 3, wherein in the step S3, the manual labeling specifically comprises:
the point cloud to be manually labeled is input into the CloudCompare software, a segmentation tool is used to carry out the semantic segmentation and labeling process on the point cloud, and each point in the point cloud is assigned one of the category labels leaf, stem or non-plant, thereby completing the manual labeling.
5. A three-dimensional point cloud semantic segmentation system based on multi-head two-stage attention is characterized by comprising an image acquisition platform, a three-dimensional reconstruction module, a preprocessing module, a manual labeling module and a semantic segmentation network construction module; wherein:
the image acquisition platform acquires a high-precision and multi-angle plant 2D sequence image through a camera;
the three-dimensional reconstruction module is used for performing three-dimensional reconstruction according to the collected 2D sequence images of the plants to obtain 3D point clouds of the plants;
the preprocessing module is used for preprocessing the 3D point cloud of the plant;
the manual marking module is used for manually marking the preprocessed point cloud;
the semantic segmentation network construction module is used for constructing a semantic segmentation network, the marked point cloud is used as input, and the semantic segmentation network is used for performing semantic label prediction to obtain a segmentation result of a plant organ level;
the semantic segmentation network constructed by the semantic segmentation network construction module specifically executes the following operations: predicting the category labels of the points in the point cloud to complete semantic label prediction, wherein:
the three-dimensional point cloud semantic segmentation network adopts an encoder-decoder structure and specifically executes the following steps:
the manually labeled point cloud is input into the semantic segmentation network, and a fully connected layer performs a dimension-raising operation on the point cloud;
the point cloud after the dimension raising is encoded by four encoders, gradually reducing the number of points and increasing the dimension of each point; each encoder consists of a down-sampling module and a multi-head two-stage attention module; the point cloud is down-sampled at a four-fold sampling rate, each layer retaining only 25% of the point features, i.e., the cardinality of the resulting point cloud changes as N → N/4 → N/16 → N/64 → N/256, where N represents the number of points; meanwhile, the multi-head two-stage attention module is used to acquire the geometric features of the plant point cloud in a layered manner, and the feature dimension of each layer is gradually increased to retain more information, i.e., the feature dimension is transformed as 32 → 64 → 128 → 256 → 512;
after the encoder, four decoders restore the number of points in the point cloud to N; each decoder layer contains an up-sampling module and a multi-layer perceptron; the up-sampling module first queries the K nearest neighbors of each point with the KNN algorithm, then up-samples the point cloud by nearest-neighbor interpolation; the up-sampled features are then concatenated with the intermediate features produced by the corresponding encoder through a skip connection to obtain a fused feature map; finally, the fused feature map is fed into the multi-layer perceptron to obtain the output;
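A minimal sketch of the decoder's up-sampling step: nearest-neighbor interpolation (K = 1 for brevity) via a KD-tree, followed by the skip-connection concatenation. The shapes and random features are illustrative, not from the patent.

```python
import numpy as np
from scipy.spatial import cKDTree

def upsample_nn(sparse_xyz, sparse_feats, dense_xyz):
    """Nearest-neighbor interpolation: each dense point takes the
    feature of its closest sparse point."""
    tree = cKDTree(sparse_xyz)
    _, idx = tree.query(dense_xyz, k=1)
    return sparse_feats[idx]

rng = np.random.default_rng(1)
dense_xyz = rng.standard_normal((64, 3))      # target resolution of this layer
sparse_xyz = dense_xyz[:16]                   # coarser layer (a subset here)
sparse_feats = rng.standard_normal((16, 128))

up = upsample_nn(sparse_xyz, sparse_feats, dense_xyz)
skip = rng.standard_normal((64, 128))         # encoder features via the skip connection
fused = np.concatenate([up, skip], axis=1)    # concatenation before the shared MLP
print(fused.shape)                            # (64, 256)
```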
the decoder output then passes through three shared fully connected layers, i.e. the dimensions change as (N, 128) → (N, 32) → (N, C), with a dropout layer of drop rate p (p a constant less than 1) applied after the first fully connected layer; the output of the semantic segmentation network is an N × C semantic label map, where C is the number of categories.
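As an illustration of this segmentation head, a sketch with random placeholder weights; the ReLU nonlinearity and inverted-dropout scaling are assumptions for the sketch, not details specified in the claim.

```python
import numpy as np

def seg_head(feats, num_classes, p=0.5, train=True, rng=None):
    """(N, 128) -> (N, 32) -> (N, C) shared fully connected layers,
    with dropout (rate p < 1) after the first layer."""
    rng = rng or np.random.default_rng(0)
    w1 = rng.standard_normal((feats.shape[1], 32)) * 0.1
    w2 = rng.standard_normal((32, num_classes)) * 0.1
    h = np.maximum(feats @ w1, 0.0)            # assumed ReLU between layers
    if train:                                  # inverted dropout with rate p
        mask = rng.random(h.shape) >= p
        h = h * mask / (1.0 - p)
    logits = h @ w2                            # (N, C) semantic label scores
    return logits.argmax(axis=1)               # per-point predicted category

N, C = 100, 3                                  # e.g. leaf / stem / non-plant
labels = seg_head(np.random.default_rng(2).standard_normal((N, 128)), C)
print(labels.shape)                            # (100,)
```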
6. The multi-head two-stage attention-based three-dimensional point cloud semantic segmentation system according to claim 5, wherein the three-dimensional reconstruction module comprises a matching point acquisition unit, a matching point screening unit, a sparse point cloud acquisition unit and a dense point cloud acquisition unit; wherein:
the matching point acquisition unit extracts feature points from the plant 2D sequence images with a scale-invariant local feature descriptor, builds a K-dimensional binary search tree (KD-tree), and uses a nearest-neighbor search algorithm to compute the Euclidean distances between the feature points of each pair of plant 2D sequence images for stereo matching of the feature points, obtaining the matching points;
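The KD-tree nearest-neighbor matching of descriptors can be sketched as follows. Random 128-d vectors stand in for the actual scale-invariant descriptors, and the ratio test used to keep only distinctive matches is a common filtering heuristic, not something mandated by the claim.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """KD-tree nearest-neighbor matching between two images' feature
    descriptors (Euclidean distance), filtered by a ratio test."""
    tree = cKDTree(desc_b)
    dists, idx = tree.query(desc_a, k=2)       # two closest candidates in image B
    keep = dists[:, 0] < ratio * dists[:, 1]   # distinctive matches only
    return np.stack([np.nonzero(keep)[0], idx[keep, 0]], axis=1)

rng = np.random.default_rng(3)
desc_b = rng.standard_normal((200, 128))       # descriptors of image B
noise = rng.standard_normal((50, 128)) * 0.01
desc_a = desc_b[:50] + noise                   # 50 true correspondences in image A

matches = match_descriptors(desc_a, desc_b)    # pairs (index in A, index in B)
print(matches.shape)
```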
the matching point screening unit solves the camera pose with a sample-consensus (RANSAC-type) algorithm, screens the matching points, and eliminates wrong or outlying matches;
based on the obtained camera pose, the sparse point cloud acquisition unit recovers the three-dimensional coordinates corresponding to the matching points with a triangulation algorithm, then iteratively optimizes the camera pose and the three-dimensional point coordinates with a bundle adjustment algorithm to obtain the sparse point cloud;
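The triangulation step can be illustrated with a standard direct-linear-transform (DLT) solver for two views; the toy camera matrices below are assumptions for the example, and bundle adjustment would further refine the recovered point.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Recover the 3D point whose projections through camera matrices
    P1, P2 are the pixel coordinates x1, x2 (linear triangulation)."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)                # null vector of A = homogeneous point
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two toy cameras: identity pose, and a unit translation along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])

X_hat = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
print(np.round(X_hat, 6))                      # recovers [0.5, 0.2, 4.0]
```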
and the dense point cloud acquisition unit uses the CMVS and PMVS algorithms to expand the pixels around the feature points of the sparse point cloud into a dense point cloud, i.e. the 3D point cloud of the plant.
7. The multi-head two-stage attention-based three-dimensional point cloud semantic segmentation system according to claim 5, wherein the preprocessing module comprises a background removal unit and a point cloud filtering unit; wherein:
in the background removal unit, a color-threshold method on the RGB channels extracts the green plant points and the red reference-paper points from the 3D point cloud of the plant according to their distinct color characteristics, removing the irrelevant background portion of the point cloud;
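A color-threshold split of this kind might look as follows; the channel margin of 20 is an illustrative threshold, not a value taken from the patent.

```python
import numpy as np

def remove_background(xyz, rgb):
    """Keep points that are predominantly green (plant) or predominantly
    red (the reference paper); drop everything else as background."""
    r = rgb[:, 0].astype(int)
    g = rgb[:, 1].astype(int)
    b = rgb[:, 2].astype(int)
    green = (g - r > 20) & (g - b > 20)   # plant points
    red = (r - g > 20) & (r - b > 20)     # reference-paper points
    keep = green | red
    return xyz[keep], rgb[keep]

xyz = np.zeros((3, 3))
rgb = np.array([[30, 200, 40],            # leaf-like green -> kept
                [220, 30, 25],            # paper-like red  -> kept
                [128, 128, 128]])         # grey background -> removed
kept_xyz, kept_rgb = remove_background(xyz, rgb)
print(kept_rgb.shape)                     # (2, 3)
```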
in the point cloud filtering unit, the point cloud with the background removed is filtered with the StatisticalOutlierRemoval filter from the PCL point cloud library; the specific process is as follows:
traverse all points in the point cloud and compute the average distance from each point to its K nearest neighbors; then compute the mean μ and standard deviation σ of all these average distances. It is assumed that the average distances follow a normal distribution whose shape is determined by μ and σ, so a point is defined as an outlier when its average distance exceeds the distance threshold, computed as:

threshold = μ + α · σ

where α is a constant; finally, traverse the point cloud again and remove every point whose average distance to its K nearest neighbors exceeds the distance threshold.
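The filter described above is straightforward to reimplement directly; the values of k and α below are illustrative defaults, not parameters from the patent.

```python
import numpy as np
from scipy.spatial import cKDTree

def statistical_outlier_removal(points, k=8, alpha=1.0):
    """Per point, compute the mean distance to its k nearest neighbors;
    remove points whose mean distance exceeds mu + alpha * sigma, where
    mu and sigma are taken over all points' mean distances."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)     # first neighbor is the point itself
    mean_d = dists[:, 1:].mean(axis=1)
    mu, sigma = mean_d.mean(), mean_d.std()
    threshold = mu + alpha * sigma
    return points[mean_d <= threshold]

rng = np.random.default_rng(4)
cluster = rng.standard_normal((200, 3)) * 0.1  # dense plant-like cluster
outliers = np.array([[10.0, 0, 0], [0, 10, 0], [0, 0, 10],
                     [-10, 0, 0], [0, -10, 0]])  # stray points far away
cloud = np.vstack([cluster, outliers])

filtered = statistical_outlier_removal(cloud)
print(len(cloud), len(filtered))               # 205 -> 200: the 5 outliers are removed
```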
8. The multi-head two-stage attention-based three-dimensional point cloud semantic segmentation system according to claim 6, wherein the manual labeling module is implemented with CloudCompare software, specifically: the point cloud to be manually labeled is loaded into CloudCompare, a segmentation tool is used to carry out the semantic segmentation and labeling process, and each point is assigned one of the labels leaf, stem, or non-plant, completing the manual labeling.
CN202210709918.8A 2022-06-22 2022-06-22 Three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention Active CN114792372B (en)

Publications (2)

Publication Number Publication Date
CN114792372A CN114792372A (en) 2022-07-26
CN114792372B true CN114792372B (en) 2022-11-04

Family

ID=82463395





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant