CN116824143A - Point cloud segmentation method based on bilateral feature fusion and vector self-attention - Google Patents
- Publication number: CN116824143A (application CN202310780811.7A)
- Authority
- CN
- China
- Prior art keywords: point cloud, feature, attention, information, layer
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Learning methods
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/82 — Arrangements for image or video recognition or understanding using neural networks
Abstract
The invention discloses a point cloud segmentation method based on bilateral feature fusion and vector self-attention, in the technical field of point cloud semantic segmentation. The method comprises the following steps: inputting original point cloud data; encoding the input point cloud with a bilateral feature fusion module and a vector self-attention module; and decoding the point cloud features, up-sampling them with successive FP layers to obtain the point cloud segmentation result. The invention provides an efficient point cloud semantic segmentation network that makes semantic segmentation of point clouds faster and more accurate, with superior segmentation performance; a correction-information calculation module based on bilateral geometric-semantic feature information, which adjusts edge information, alleviates the edge-ambiguity problem in local-region aggregation, and strengthens the aggregation of local information; and a novel offset vector self-attention module that effectively extracts global features of the point cloud, achieving better global feature extraction while reducing the computational cost of the network.
Description
Technical Field
The invention relates to the technical field of point cloud semantic segmentation, in particular to a point cloud segmentation method based on bilateral feature fusion and vector self-attention, which is suitable for semantic segmentation of indoor point clouds.
Background
As research has deepened, deep-learning methods for processing 3D point clouds have been markedly successful. These methods can generally be divided into three types: projection-based, voxel-based, and point-based. Among them, point-based methods, which process the point set directly with a multi-layer perceptron (MLP), have become mainstream owing to their efficiency and high performance.
Among point-based approaches, PointNet is a classical network. It extracts features with a shared multi-layer perceptron (MLP) and aggregates global features through a symmetric function, making the result invariant to the internal ordering of the points. However, PointNet samples points individually, so it cannot effectively extract local features. PointNet++ addresses this by adding sampling and grouping operations on top of PointNet, giving a multi-level feature extraction method. However, the grouping-based local feature extraction of PointNet++ causes edge ambiguity in local regions: during neighborhood construction, outliers and overlap between neighborhoods are hard to avoid, and they are most prominent where multiple semantic classes meet. Moreover, aggregation regions partitioned by Euclidean distance cannot adapt well to semantic features within a local range of the semantic space, so PointNet++ focuses on extracting geometric information, aggregates local feature information insufficiently, is weak at semantic extraction, and, relying only on FPS, extracts global features inadequately. Compared with PointNet++, the recently proposed PointNeXt focuses on training techniques and scaling strategies to further improve PointNet++'s performance, while PointMLP achieves very high classification performance without any complex local feature extractor by introducing a residual MLP structure. However, these methods concentrate on feature extraction in geometric space, and the edge-ambiguity and insufficient-global-feature problems of PointNet++ remain unsolved.
Some methods instead focus on semantic feature extraction. DGCNN proposes edge convolution (EdgeConv) for learning edge features: it constructs a local neighborhood graph, performs the EdgeConv operation on each adjacent edge, and dynamically updates the graph structure between levels. AdaptiveGraph proposes assigning learned weights to each edge to better evaluate and aggregate information. These methods also adopt similar grouping schemes, so they share the edge-ambiguity problem of PointNet++, and because their feature extraction focuses on the semantic space, the geometric structure is missing in the high-dimensional semantic space. Following its great success in natural language processing and two-dimensional image tasks, the self-attention mechanism has also been applied to three-dimensional point clouds; attention has a strong ability to extract global features but suffers from a large computational cost.
In summary, most current point cloud semantic segmentation networks extract local features by grouping and aggregating the point cloud, and the ambiguity of neighborhood edges within a group is difficult to resolve. Aggregation regions partitioned by Euclidean distance cannot adapt well to semantic features within the local scope of the semantic space, and the geometric structure is missing in the high-dimensional semantic space, so local feature information is aggregated insufficiently. Building multi-scale feature extraction by down-sampling alone loses much detail, and global features cannot be fully extracted. Solving these problems is the technical challenge addressed by the invention.
Disclosure of Invention
Based on the technical problems described above, the invention provides a point cloud segmentation method based on bilateral feature fusion and vector self-attention. It improves the robustness and accuracy of semantic segmentation of point cloud scenes, strengthens the feature-learning ability of the segmentation network, and alleviates the edge ambiguity of local-region features and the insufficient representativeness of global features.
The technical scheme adopted by the invention is as follows:
a point cloud segmentation method based on bilateral feature fusion and vector self-attention comprises the following steps:
s1: inputting original point cloud data;
s2: encoding the input original point cloud data by using a bilateral feature fusion module and a vector self-attention module;
s3: and decoding the point cloud characteristics, and up-sampling the point cloud characteristics by using a continuous FP layer to obtain a point cloud segmentation result.
Further, the input of original point cloud data in step S1 specifically includes:
taking S3DIS as the indoor data set of the test, S3DIS is a large indoor scene segmentation data set and comprises 13 categories and 271 rooms. Each point cloud data has 9 features, namely color information R, G, B, coordinate information x, y, z, and 3 normal vectors. 271 rooms are divided into 6 areas, each room being divided into 1 m x 1 m blocks. Setting an input point cloud position F in Its dimension is [ B, N,9]Wherein B is a batch, N is the number of points, 9 is a feature, and the total number of input features is B.times.N.times.9.
Further, in the step S2, the encoding of the input original point cloud data using the bilateral feature fusion module and the vector self-attention module specifically includes:
The input point cloud is divided into a geometric space, containing the coordinate information of the point cloud with dimension [B, N, 3], and a semantic space, containing the color information and normal vectors with dimension [B, N, 6]. The semantic part is passed through an MLP to map it into semantic space, and both parts are fed into the encoder SA. Sampling points p_i, with corresponding feature representations f_i, are generated from the original data by FPS. Taking the sampling points as centers, the point cloud is grouped by ball query under the three-dimensional Euclidean metric: given a radius r, the ball query finds the points inside the sphere of radius r centered at each FPS sampling point; these are called the neighbor points p_j of the center point p_i. An SA layer and a VA layer are then added to extract point cloud features: the SA layer performs bilateral feature-information fusion and local feature extraction of the point cloud, and the VA layer, an improved self-attention layer, performs global feature extraction of the point cloud.
In the SA layer, the absolute position of the center point and the relative positions of its neighborhood are combined into the geometric-space local feature G(p_i, p_j) = [p_i; p_j − p_i]; likewise, S(f_i, f_j) = [f_i; f_j − f_i] represents the local feature in semantic space. The geometric information is mapped into semantic space by an MLP and converted into a feature mask by a softmax function, which corrects the edge feature information of the semantic space; in the same way, the geometric edge information of the geometric space is dynamically adjusted. The adjusted edge information is added to the original edge information, establishing a residual structure that ensures the robustness of the information optimization. The calculation of this correction information can be formulated as:
f_s = Softmax(MLP(G(p_i, p_j))) * S_e + S_e  (1)
p_s = Softmax(MLP(S(f_i, f_j))) * G_e + G_e  (2)
where G_e(p_i, p_j) = [p_i; p_j − p_i], S_e(f_i, f_j) = [f_i; f_j − f_i], p_i, p_j, p_s ∈ R^(N×3), and f_i, f_j, f_s ∈ R^(N×d).
The obtained supplementary information p_s, f_s is then combined with G(p_i, p_j), S(f_i, f_j) to give G′ = [p_i; p_j − p_i; p_s] and S′ = [f_i; f_j − f_i; f_s], with G′ ∈ R^(N×9) and S′ ∈ R^(N×3×d). These are concatenated into the enhanced local feature F_c, which is fed through two LBR layers for feature extraction to obtain F_c′. Finally, F_c′ is sent through a max-pooling layer to complete the feature aggregation of the local region, yielding the local aggregation feature F_a. After F_a is obtained, the local-region aggregation information is fed into the VA layer for global feature extraction.
In the traditional attention mechanism as applied to point clouds, attention is computed as follows. First, the embedded feature F_a is fed to three separable convolutions of kernel size 1 × 1 to produce three new feature maps F_q, F_k, and F_v. Then F_q is transposed and multiplied by F_k, and an attention matrix of size N × N is generated after the softmax layer:
A = Softmax(F_q^T F_k)  (3)
The vector attention mechanism is introduced into this traditional attention calculation: the attention matrix is generated from a subtraction and two linear layers, followed by a softmax function. The calculation formula is
Attention(F_Q, F_K, F_V) = Softmax(γ(F_Q − F_K)) F_V  (4)
where γ is a mapping function that generates the attention vectors for feature aggregation. The obtained attention matrix F_A is multiplied element-wise with the input feature F_a to give the attention feature. On this basis, an offset attention mechanism is designed, expressed by the following formula:
F_out = Relu(Batchnorm(MLP(F_a − F_A))) + F_a  (5)
which yields the output feature F_out.
Further, the decoding of the point cloud features in step S3 specifically includes:
The decoding part up-samples the point cloud features with successive FP layers to produce the point cloud segmentation. Distance-based interpolation and a hierarchical propagation strategy across skip links are adopted. In a feature propagation level, point features are propagated from N_l × (d + C) points to N_{l−1} points, where N_{l−1} and N_l (with N_l ≤ N_{l−1}) are the point-set sizes of the input and output of the l-th SA layer. Feature propagation is realized by interpolating the feature values f of the N_l points at the coordinates of the N_{l−1} points. Among the many choices of interpolation, an inverse-distance weighted average based on the k nearest neighbors is used. The interpolated features at the N_{l−1} points are then concatenated with the skip-linked point features from the set-abstraction level, passed through a "PointNet unit", and shared fully connected and ReLU layers are applied to update the feature vector of each point. This process is repeated until the features have been propagated to the original set of points.
and finally, obtaining a segmentation result of the point cloud.
The invention provides a point cloud segmentation method based on bilateral feature fusion and an attention mechanism, which, compared with the prior art, has at least the following beneficial effects:
1) The invention provides a high-efficiency point cloud semantic segmentation network, so that the semantic segmentation of the point cloud is quicker and more accurate, and the segmentation performance is superior.
2) The invention provides a group of correction information calculation modules based on geometric semantic bilateral characteristic information, which adjusts edge information, relieves the problem of edge ambiguity in local area aggregation and strengthens the aggregation effect of local information.
3) The invention provides a new offset vector self-attention module, which effectively extracts the global features of the point cloud and obtains better global feature extraction effect on the basis of reducing the network calculation amount.
Drawings
FIG. 1 is a process schematic diagram of a bilateral feature aggregation module;
FIG. 2 is a process schematic of the attention module;
FIG. 3 is a schematic diagram of a point cloud semantic segmentation network process based on bilateral feature fusion and attention;
FIG. 4 is a point cloud semantic segmentation effect diagram.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings; the embodiments described are evidently only some, not all, of the embodiments of the invention.
Example 1.
As shown in fig. 1-4, a point cloud segmentation method based on bilateral feature fusion and vector self-attention, the method comprises the following steps:
s1: inputting original point cloud data
S3DIS, a large indoor scene segmentation dataset containing 13 categories and 271 rooms, is used as the indoor test dataset. Each point has 9 features: color information R, G, B; coordinate information x, y, z; and a 3-dimensional normal vector. The 271 rooms are divided into 6 areas, and each room is divided into 1 m × 1 m blocks. The input point cloud F_in has dimension [B, N, 9], where B is the batch size, N is the number of points, and 9 is the number of features per point, so the total input size is B × N × 9;
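The input layout described above can be sketched in NumPy as follows; the array values and the batch/block sizes are illustrative placeholders, not S3DIS data:

```python
import numpy as np

# Illustrative input tensor F_in with dimension [B, N, 9]:
# 9 features per point = xyz coordinates + RGB color + 3-dim normal vector.
B, N = 2, 4096                       # placeholder batch size and points per block
F_in = np.random.rand(B, N, 9).astype(np.float32)

coords = F_in[..., 0:3]              # geometric space: x, y, z  -> [B, N, 3]
sem    = F_in[..., 3:9]              # semantic space: R, G, B + normals -> [B, N, 6]
```

This split into a [B, N, 3] geometric part and a [B, N, 6] semantic part is exactly the bilateral division the encoder consumes in step S2.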
s2: encoding input original point cloud data by using bilateral feature fusion module and vector self-attention module
The input point cloud is divided into a geometric space, containing the coordinate information of the point cloud with dimension [B, N, 3], and a semantic space, containing the color information and normal vectors with dimension [B, N, 6]. The semantic part is passed through an MLP to map it into semantic space, and both parts are fed into the encoder SA. First, sampling points p_i, with corresponding feature representations f_i, are generated from the original data by FPS. Taking the sampling points as centers, the point cloud is grouped by ball query under the three-dimensional Euclidean metric: given a radius r, the ball query finds the points inside the sphere of radius r centered at each FPS sampling point; these are called the neighbor points p_j of the center point p_i. In the model, four consecutive FPS sampling layers construct point cloud sampling sets at the scales N/4, N/16, N/64, and N/128, with grouping radii 0.1, 0.2, 0.4, and 0.8 at the respective layers, and an SA layer and a VA layer are added to extract point cloud features. The SA layer performs bilateral feature-information fusion and local feature extraction of the point cloud, and the VA layer, an improved self-attention layer, performs global feature extraction of the point cloud.
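The FPS sampling and ball-query grouping described above can be sketched as follows. This is a naive illustrative NumPy implementation, not the patent's code; the function names and the padding convention for under-full balls are our own assumptions:

```python
import numpy as np

def farthest_point_sampling(xyz, m):
    """Naive FPS: iteratively pick the point farthest from all points chosen so far."""
    n = xyz.shape[0]
    idx = np.zeros(m, dtype=np.int64)
    dist = np.full(n, np.inf)        # running min-distance to the chosen set
    idx[0] = 0                       # start from an arbitrary point
    for i in range(1, m):
        d = np.linalg.norm(xyz - xyz[idx[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        idx[i] = np.argmax(dist)     # farthest remaining point
    return idx

def ball_query(xyz, centers, r, k):
    """For each center p_i, return up to k neighbor indices p_j within radius r.
    Under-full balls are padded by repeating the first hit (an assumed convention)."""
    groups = []
    for c in centers:
        d = np.linalg.norm(xyz - c, axis=1)
        hits = np.nonzero(d <= r)[0]         # the center itself is always a hit (d = 0)
        pad = np.full(k, hits[0])
        pad[:min(k, len(hits))] = hits[:k]
        groups.append(pad)
    return np.stack(groups)                  # shape [m, k]
```

Stacking four such stages with m = N/4, N/16, N/64, N/128 and r = 0.1, 0.2, 0.4, 0.8 reproduces the multi-scale grouping structure of the encoder.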
In the SA layer, the absolute position of the center point and the relative positions of its neighborhood are combined into the geometric-space local feature G(p_i, p_j) = [p_i; p_j − p_i]; likewise, S(f_i, f_j) = [f_i; f_j − f_i] represents the local feature in semantic space. The geometric information is mapped into semantic space by an MLP and converted into a feature mask by a softmax function, which corrects the edge feature information of the semantic space; in the same way, the geometric edge information of the geometric space is dynamically adjusted. The adjusted edge information is added to the original edge information, establishing a residual structure that ensures the robustness of the information optimization. The calculation of this correction information can be formulated as:
f_s = Softmax(MLP(G(p_i, p_j))) * S_e + S_e  (1)
p_s = Softmax(MLP(S(f_i, f_j))) * G_e + G_e  (2)
where G_e(p_i, p_j) = [p_i; p_j − p_i], S_e(f_i, f_j) = [f_i; f_j − f_i], p_i, p_j, p_s ∈ R^(N×3), and f_i, f_j, f_s ∈ R^(N×d).
The obtained supplementary information p_s, f_s is then combined with G(p_i, p_j), S(f_i, f_j) to give G′ = [p_i; p_j − p_i; p_s] and S′ = [f_i; f_j − f_i; f_s], with G′ ∈ R^(N×9) and S′ ∈ R^(N×3×d). These are concatenated into the local feature F_c, which is fed through two LBR layers (Linear + BatchNorm + ReLU) for feature extraction to obtain F_c′. Finally, F_c′ is sent through a max-pooling layer to complete the feature aggregation of the local region, yielding the local aggregation feature F_a.
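The bilateral correction of Eqs. (1)-(2) can be sketched as below. This is a minimal NumPy sketch under stated assumptions: the shared MLPs are stood in for by single random linear maps, the edge features are laid out per neighborhood ([N, k, ·]), and all dimensions are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mlp(x, out_dim, rng):
    # Hypothetical stand-in for the shared MLP in Eqs. (1)-(2): one linear map + ReLU.
    W = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return np.maximum(x @ W, 0.0)

rng = np.random.default_rng(0)
N, k, d = 128, 16, 32
G = rng.standard_normal((N, k, 6))       # [p_i; p_j - p_i]: geometric edge feature
S = rng.standard_normal((N, k, 2 * d))   # [f_i; f_j - f_i]: semantic edge feature
G_e, S_e = G, S

# Eq. (1): geometric mask corrects semantic edges; residual add keeps robustness.
f_s = softmax(mlp(G, S.shape[-1], rng)) * S_e + S_e
# Eq. (2): semantic mask corrects geometric edges, symmetrically.
p_s = softmax(mlp(S, G.shape[-1], rng)) * G_e + G_e
```

The corrected edges f_s, p_s would then be concatenated with the originals to form G′ and S′ before the LBR layers and max pooling.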
After the local aggregation feature F_a is obtained, the local-region aggregation information is fed into the VA layer for global feature extraction.
Before introducing the improved attention mechanism, first consider the conventional attention mechanism, which is usually computed in point clouds as follows. First, the embedded feature F_a is fed to three separable convolutions of kernel size 1 × 1 to produce three new feature maps F_q, F_k, and F_v. Then F_q is transposed and multiplied by F_k, and an attention matrix of size N × N is generated after the softmax layer:
A = Softmax(F_q^T F_k)  (3)
The attention module introduces the vector attention mechanism into the traditional attention calculation: the attention matrix is generated from a subtraction and two linear layers, followed by a softmax function. Unlike conventional attention, the attention weights in vector attention are vectors that can modulate individual feature channels, and replacing the matrix product with a subtraction reduces the computation required for the attention matrix. The calculation formula is
Attention(F_Q, F_K, F_V) = Softmax(γ(F_Q − F_K)) F_V  (4)
where γ is a mapping function (e.g. an MLP) that generates the attention vectors for feature aggregation. The obtained attention matrix F_A is multiplied element-wise with the input feature F_a to give the attention feature. Typically, the attention feature is then fed into an MLP layer and the input feature F_a is added through a residual link to give the final output feature. Here, to strengthen the attention weights and reduce the effect of noise, an offset attention mechanism is designed whose principle is to replace the attention feature with the offset between the input of the self-attention module and the attention feature. It can be expressed by the following formula:
F_out = Relu(Batchnorm(MLP(F_a − F_A))) + F_a  (5)
which yields the output feature F_out.
S3: decoding point cloud features
The decoding part up-samples the point cloud features with successive FP layers to produce the point cloud segmentation. Briefly, the FP layer aggregates features back onto the original point cloud; equivalently, it propagates features from the sub-sampled points to the original points.
Distance-based interpolation and a hierarchical propagation strategy across skip links are adopted. In a feature propagation level, point features are propagated from N_l × (d + C) points to N_{l−1} points, where N_{l−1} and N_l (with N_l ≤ N_{l−1}) are the point-set sizes of the input and output of the l-th SA layer. Feature propagation is realized by interpolating the feature values f of the N_l points at the coordinates of the N_{l−1} points. Among the many choices of interpolation, an inverse-distance weighted average based on the k nearest neighbors is used (p = 2, k = 3 by default). The interpolated features at the N_{l−1} points are then concatenated with the skip-linked point features from the set-abstraction level. The concatenated features are passed through a "PointNet unit", similar to one-by-one convolution in CNNs, and shared fully connected and ReLU layers are applied to update the feature vector of each point. This process is repeated until the features have been propagated to the original set of points.
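The inverse-distance weighted interpolation at the heart of the FP layer (k nearest neighbors, p = 2, k = 3 by default) can be sketched as below; this is a minimal NumPy sketch, with the function name and the epsilon guard as our own assumptions:

```python
import numpy as np

def idw_interpolate(xyz_dense, xyz_sparse, feat_sparse, k=3, p=2, eps=1e-12):
    """Propagate features from the sparse (sub-sampled) points to the dense points
    by an inverse-distance weighted average over the k nearest sparse neighbors."""
    out = np.empty((xyz_dense.shape[0], feat_sparse.shape[1]))
    for i, q in enumerate(xyz_dense):
        d = np.linalg.norm(xyz_sparse - q, axis=1)
        nn = np.argsort(d)[:k]                   # k nearest sparse points
        w = 1.0 / (d[nn] ** p + eps)             # inverse-distance weights, w = 1/d^p
        out[i] = (w[:, None] * feat_sparse[nn]).sum(0) / w.sum()
    return out
```

A dense point that coincides with a sparse point receives (almost exactly) that point's feature, since its weight dominates; the interpolated features would then be concatenated with the skip-linked features before the "PointNet unit".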
In the present model, four consecutive FP layers up-sample the point cloud, and finally the segmentation result of the point cloud is obtained, as shown in Table 1.
TABLE 1 qualitative segmentation results
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art, who is within the scope of the present invention, should make equivalent substitutions or modifications according to the technical scheme of the present invention and the inventive concept thereof, and should be covered by the scope of the present invention.
Claims (4)
1. A point cloud segmentation method based on bilateral feature fusion and vector self-attention is characterized by comprising the following steps:
s1: inputting original point cloud data;
s2: encoding the input original point cloud data by using a bilateral feature fusion module and a vector self-attention module;
s3: and decoding the point cloud characteristics, and up-sampling the point cloud characteristics by using a continuous FP layer to obtain a point cloud segmentation result.
2. The point cloud segmentation method based on bilateral feature fusion and vector self-attention as claimed in claim 1, wherein the input original point cloud data in step S1 specifically comprises:
S3DIS, a large indoor scene segmentation dataset containing 13 categories and 271 rooms, is used as the indoor test dataset; each point has 9 features: color information R, G, B, coordinate information x, y, z, and a 3-dimensional normal vector; the 271 rooms are divided into 6 areas, and each room is divided into 1 m × 1 m blocks; the input point cloud F_in has dimension [B, N, 9], where B is the batch size, N is the number of points, and 9 is the number of features per point, so the total input size is B × N × 9.
3. The method for point cloud segmentation based on bilateral feature fusion and vector self-attention according to claim 2, wherein the encoding of the input original point cloud data using the bilateral feature fusion module and the vector self-attention module in step S2 specifically comprises:
The input point cloud is divided into a geometric space, containing the coordinate information of the point cloud with dimension [B, N, 3], and a semantic space, containing the color information and normal vectors with dimension [B, N, 6]. The semantic part is passed through an MLP to map it into semantic space, and both parts are fed into the encoder SA. Sampling points p_i, with corresponding feature representations f_i, are generated from the original data by FPS. Taking the sampling points as centers, the point cloud is grouped by ball query under the three-dimensional Euclidean metric: given a radius r, the ball query finds the points inside the sphere of radius r centered at each FPS sampling point; these are called the neighbor points p_j of the center point p_i. An SA layer and a VA layer are added to extract point cloud features; the SA layer performs bilateral feature-information fusion and local feature extraction of the point cloud, and the VA layer, an improved self-attention layer, performs global feature extraction of the point cloud;
in the SA layer, the absolute position of the center point and the relative positions of its neighborhood are combined into a geometric-space local feature G(p_i, p_j) = [p_i; p_j − p_i]; likewise, S(f_i, f_j) = [f_i; f_j − f_i] represents the local feature in semantic space; the geometric information is converted into semantic space by an MLP and turned into a feature mask by a softmax function, which corrects the edge feature information of the semantic space; likewise, the same method is adopted in the geometric space to dynamically adjust the geometric edge information; meanwhile, the adjusted edge information is added to the original edge information, establishing a residual structure to ensure the robustness of the information optimization; the calculation of this correction information can be formulated as:
f_s = Softmax(mlp(G(p_i, p_j))) * S_e + S_e    (1)
p_s = Softmax(mlp(S(f_i, f_j))) * G_e + G_e    (2)
wherein G_e(p_i, p_j) = [p_i; p_j − p_i], S_e(f_i, f_j) = [f_i; f_j − f_i], p_i, p_j, p_s ∈ R^(N×3), and f_i, f_j, f_s ∈ R^(N×d).
The obtained supplementary information p_s, f_s is then combined with G(p_i, p_j) and S(f_i, f_j) to obtain G' = [p_i; p_j − p_i; p_s] and S' = [f_i; f_j − f_i; f_s], wherein G' ∈ R^(N×9) and S' ∈ R^(N×3×d); G' and S' are then concatenated to obtain the enhanced local feature information F_c; F_c is fed into two LBR layers for feature extraction to obtain F_c'; finally, F_c' is fed into a max-pooling layer to complete the feature aggregation of the local region, obtaining the local aggregation feature F_a; after the local aggregation feature F_a is obtained, the aggregated local-region information is fed into the VA layer for global feature extraction;
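The bilateral correction of equations (1)–(2) and the subsequent concatenation and max-pooling can be sketched as follows; this is a minimal numpy sketch in which a single random linear layer stands in for each learned MLP, the two LBR layers are omitted, and the sketch keeps the full corrected edge features (so the widths differ from the claim's G' ∈ R^(N×9)); all sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mlp(x, w):                       # stand-in for a learned MLP: one linear map
    return x @ w

N, k, d = 64, 16, 32                 # centers, neighbors per center, feature width
p_i = rng.random((N, 1, 3))          # center coordinates
p_j = rng.random((N, k, 3))          # neighbor coordinates
f_i = rng.random((N, 1, d))          # center semantic features
f_j = rng.random((N, k, d))          # neighbor semantic features

# Local edge features in geometric and semantic space: [x_i; x_j - x_i].
G_e = np.concatenate([np.broadcast_to(p_i, p_j.shape), p_j - p_i], axis=-1)  # [N,k,6]
S_e = np.concatenate([np.broadcast_to(f_i, f_j.shape), f_j - f_i], axis=-1)  # [N,k,2d]

# Eq. (1): geometric info, mapped by an MLP and softmax into a mask, corrects
# the semantic edge features; the residual "+ S_e" keeps the original signal.
W_g = rng.standard_normal((6, 2 * d))
f_s = softmax(mlp(G_e, W_g)) * S_e + S_e

# Eq. (2): semantic info likewise corrects the geometric edge features.
W_s = rng.standard_normal((2 * d, 6))
p_s = softmax(mlp(S_e, W_s)) * G_e + G_e

# Enhanced local features: concatenate corrections with the original edge
# features, then max-pool over the neighborhood (the two LBR layers would
# sit before this pooling in the claimed method).
F_c = np.concatenate([G_e, p_s, S_e, f_s], axis=-1)
F_a = F_c.max(axis=1)                # local aggregation feature
assert F_a.shape == (N, 12 + 4 * d)
```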
in the traditional attention mechanism, attention over the point cloud is calculated according to the following steps: first, the embedded feature F_a is fed into three separable convolutions with kernel size 1×1 to produce three new feature maps F_q, F_k and F_v; subsequently, the new feature map F_q is transposed and multiplied by the new feature map F_k, generating, after the softmax layer, an attention matrix of size N×N:
A = Softmax(F_q^T F_k)    (3)
the method of the vector attention mechanism is introduced into the calculation of the traditional attention mechanism: the attention matrix is generated from the subtraction, passed through two linear layers and a softmax function, with the calculation formula:
Attention(F_Q, F_K, F_V) = Softmax(γ(F_Q − F_K)) F_V    (4)
wherein γ is a mapping function that generates the attention vectors used for feature aggregation; the obtained attention matrix F_A is multiplied pointwise with the input feature F_a to obtain the attention feature F_v; on this basis, an offset attention mechanism is designed, expressed by the following formula:
F_out = Relu(Batchnorm(Mlp(F_a − F_A))) + F_a    (5)
obtaining the output feature information F_out.
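The vector-attention and offset-attention steps of equations (4)–(5) can be sketched as follows; this is a minimal numpy sketch under stated assumptions: the 1×1 convolutions reduce to per-point linear maps, γ is modeled by a single random linear layer (the claim uses two linear layers), the softmax is taken over the feature channels, and batch normalization is replaced by a simple per-feature standardization:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relu(x):
    return np.maximum(x, 0.0)

def batchnorm(x):                    # stand-in for BatchNorm in this sketch
    return (x - x.mean(0)) / (x.std(0) + 1e-5)

N, d = 128, 64
F_a = rng.random((N, d))             # aggregated local features from the SA layer

# 1x1 convolutions over points are per-point linear maps here.
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
F_q, F_k, F_v = F_a @ W_q, F_a @ W_k, F_a @ W_v

# Eq. (4): vector attention -- subtraction instead of a dot product; the
# resulting attention vectors weight F_v channel-wise.
W_gamma = rng.standard_normal((d, d)) * 0.1
F_A = softmax((F_q - F_k) @ W_gamma, axis=-1) * F_v

# Eq. (5): offset attention with a residual connection back to F_a.
W_out = rng.standard_normal((d, d)) * 0.1
F_out = relu(batchnorm((F_a - F_A) @ W_out)) + F_a
assert F_out.shape == (N, d)
```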
4. The point cloud segmentation method based on bilateral feature fusion and vector self-attention as claimed in claim 3, wherein the decoding of the point cloud features in step S3 specifically comprises:
the decoding part adopts successive FP layers to up-sample the point cloud features so as to achieve point cloud segmentation; distance-based interpolation and a hierarchical propagation strategy across skip links are adopted; at a feature propagation level, point features are propagated from N_l × (d + C) points to N_(l−1) points, wherein N_(l−1) and N_l are the point-set sizes of the input and output of the l-th SA layer, with N_l ≤ N_(l−1); feature propagation is realized by interpolating the feature values f of the N_l points at the coordinates of the N_(l−1) points; among the many choices of interpolation, the inverse-distance-weighted average based on the k nearest neighbors is used; the interpolated features on the N_(l−1) points are then concatenated with the skip-linked point features from the set abstraction level, and a shared fully connected layer and ReLU layer are applied to update the feature vector of each point; this process is repeated until the features have been propagated to the original point set;
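The inverse-distance-weighted interpolation used by the feature propagation step can be sketched as follows; this is a minimal numpy sketch with k = 3 neighbors, and the point counts and function name are illustrative assumptions:

```python
import numpy as np

def knn_interpolate(xyz_src, feat_src, xyz_dst, k=3, eps=1e-8):
    """Propagate features from a sparse set (N_l points) to a denser set
    (N_{l-1} points) by an inverse-distance-weighted average over the
    k nearest source points, as in the FP step described above."""
    out = np.empty((xyz_dst.shape[0], feat_src.shape[1]))
    for i, q in enumerate(xyz_dst):
        d = np.linalg.norm(xyz_src - q, axis=1)
        nn = np.argsort(d)[:k]            # k nearest source points
        w = 1.0 / (d[nn] + eps)           # inverse-distance weights
        w /= w.sum()
        out[i] = w @ feat_src[nn]         # weighted average of their features
    return out

sparse_xyz = np.random.rand(64, 3)        # N_l points (SA-layer output)
sparse_feat = np.random.rand(64, 32)
dense_xyz = np.random.rand(256, 3)        # N_{l-1} points to up-sample to
dense_feat = knn_interpolate(sparse_xyz, sparse_feat, dense_xyz)
assert dense_feat.shape == (256, 32)
# The interpolated features would then be concatenated with the skip-linked
# features and passed through shared fully connected + ReLU layers.
```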
and finally, obtaining a segmentation result of the point cloud.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310780811.7A CN116824143A (en) | 2023-06-29 | 2023-06-29 | Point cloud segmentation method based on bilateral feature fusion and vector self-attention |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116824143A true CN116824143A (en) | 2023-09-29 |
Family
ID=88116367
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117278150A (en) * | 2023-11-23 | 2023-12-22 | 成都工业学院 | Indoor wireless network signal measurement and calculation method, equipment and medium |
CN117278150B (en) * | 2023-11-23 | 2024-02-09 | 成都工业学院 | Indoor wireless network signal measurement and calculation method, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||