CN114743007A - Three-dimensional semantic segmentation method based on channel attention and multi-scale fusion

Three-dimensional semantic segmentation method based on channel attention and multi-scale fusion

Info

Publication number
CN114743007A
Authority
CN
China
Prior art keywords
convolution
point cloud
layer
channel
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210418602.3A
Other languages
Chinese (zh)
Inventor
张莹 (Zhang Ying)
孙月 (Sun Yue)
张露露 (Zhang Lulu)
王玉 (Wang Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University filed Critical Xiangtan University
Priority to CN202210418602.3A priority Critical patent/CN114743007A/en
Publication of CN114743007A publication Critical patent/CN114743007A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of three-dimensional point cloud data processing, and discloses a three-dimensional point cloud semantic segmentation method based on channel attention and multi-scale fusion. First, the point cloud data to be segmented are read, preprocessed, and input into the segmentation network. The data then pass sequentially through four modules, each consisting of an encoder and a channel attention layer, where the encoder comprises a downsampling layer, a grouping layer and a position-adaptive convolution. A multi-scale convolution context module extracts the point cloud context information, after which the features pass sequentially through four decoders, each consisting of an upsampling layer and a unit PointNet network. The final segmentation result is obtained through a fully connected layer of size k (the number of classes). The invention not only makes full use of the position information of the point cloud, but also introduces a channel attention layer that recalibrates the point cloud features along the channel dimension, paying more attention to the channel information useful for the segmentation task. It further provides a multi-scale convolution context module that captures features of different scales in parallel using dilated (hole) convolutions with the same dilation rate but different kernel sizes, thereby improving the segmentation result.

Description

Three-dimensional semantic segmentation method based on channel attention and multi-scale fusion
Technical Field
The invention belongs to the technical field of three-dimensional point cloud data processing, and particularly relates to a three-dimensional point cloud semantic segmentation method based on channel attention and multi-scale fusion.
Background
With the development and rise of artificial intelligence technology, 3D point cloud data analysis has drawn extensive attention. Compared with two-dimensional images, 3D point clouds contain richer three-dimensional spatial information, are not affected by external factors such as illumination and viewing angle, and can depict a model accurately and comprehensively. As a key component of scene understanding, 3D point cloud segmentation is one of the frontier research directions in artificial intelligence and is widely applied in fields such as robotics, virtual reality, autonomous driving, and laser remote sensing.
Point cloud segmentation methods can be divided into traditional point cloud segmentation and point cloud semantic segmentation. Traditional point cloud segmentation uses information such as the position and shape of the point cloud to delineate boundaries between regions, and mainly comprises edge-based, region-based, and model-fitting-based methods. The segmentation results obtained by these methods contain no semantic information; the results must be labeled semantically by hand, which is extremely inefficient at large data scales. Point cloud semantic segmentation builds on traditional point cloud segmentation by automatically assigning semantic labels to objects of different types in three-dimensional space, so that each object carries specific category information. Deep learning is currently the main means of implementation, and the processing approaches fall into the following three types:
(1) Voxel-based methods divide the three-dimensional scene into voxel grids, convert the original three-dimensional point cloud into voxels, and then process them with a three-dimensional convolutional network. However, three-dimensional points are mostly concentrated on object surfaces and become very sparse after conversion to voxels, so the time and space utilization of a dense convolutional network is low; moreover, some information is lost during voxelization, which affects network performance.
(2) Multi-view-projection-based methods first project the three-dimensional object into several views, then extract image features with a conventional two-dimensional convolutional neural network to recognize and analyze the target. Because of object occlusion in real scenes, part of the information is lost once an object is projected onto a two-dimensional plane, the spatial structure information contained in the three-dimensional data cannot be fully exploited, and the choice of projection plane also influences the result of the algorithm.
(3) Methods based on the raw point cloud process the three-dimensional point cloud data in the scene directly, without resorting to intermediate data types. Compared with the former two approaches, these methods require less memory and incur no information loss; features are extracted mainly with multilayer perceptrons or with convolution methods suited to point cloud data.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a three-dimensional point cloud semantic segmentation method based on channel attention and multi-scale fusion, thereby improving segmentation accuracy.
The invention discloses a three-dimensional point cloud semantic segmentation method based on channel attention and multi-scale fusion, which comprises the following steps of:
step 1, reading and preprocessing point cloud data;
step 2, passing the point cloud data through an encoder consisting of a down-sampling layer, a grouping layer and a position-adaptive convolution, mainly responsible for down-sampling and feature extraction;
and 2.1, downsampling the point cloud by using a downsampling layer.
And 2.2, dividing the point set obtained in the last step into a plurality of areas by using a grouping layer.
And 2.3, extracting initial features of each region by using a position self-adaptive convolution method.
Step 3, recalibrating the point cloud features with a channel attention layer, modeling the correlation among channel feature information, and adjusting the relative weight of different features in the overall feature representation by learning their weight values;
and 4, repeating the steps 2 to 3 for 4 times, and performing down-sampling layer by layer to extract point cloud characteristics.
And 5, inputting the feature vector output by the last channel attention layer into a multi-scale convolution context module, which samples the features in parallel with dilated (hole) convolutions of the same dilation rate but different kernel sizes, gradually enlarging the receptive field and compensating for lost detail information.
And 6, passing the feature vector output by the multi-scale convolution context module through a decoder consisting of an up-sampling layer and a unit PointNet network, mainly responsible for up-sampling and feature decoding; the features of the corresponding encoder stage are taken as the other input of the decoder through a skip connection.
And 6.1, upsampling the point cloud characteristics by using an upsampling layer.
And 6.2, decoding the characteristics by using a unit PointNet network.
And 7, repeating the step 6 for 4 times, and up-sampling and decoding the point cloud characteristics layer by layer.
And 8, obtaining classification scores for the k classes through a fully connected layer of size k (the number of classes), thereby obtaining the segmentation result.
Compared with the prior art, the invention has the following advantages:
(1) The invention extracts point cloud features with a position-adaptive convolution instead of an ordinary multilayer perceptron, constructing the convolution kernel in a dynamic, data-driven manner; this makes full use of the position information of the points and adapts flexibly to the irregular geometric structure of the 3D point cloud.
(2) The invention introduces a channel attention layer, which can fully exploit the channel information of the features, increasing the weight of information that contributes strongly to the network model and, conversely, decreasing the weight of features carrying less information, so that the model recalibrates the features.
(3) The invention provides a multi-scale convolution context module for extracting point cloud context information. It captures features of different scales in parallel using dilated (hole) convolutions with the same dilation rate but different kernel sizes, improving the segmentation result.
Drawings
FIG. 1 is a schematic diagram of the three-dimensional point cloud semantic segmentation network structure of the present invention.
FIG. 2 is a schematic diagram of the encoder and decoder of FIG. 1.
FIG. 3 is a schematic diagram of the structure of the position-adaptive convolution of FIG. 2.
FIG. 4 is a flow chart of the channel attention layer of FIG. 1.
FIG. 5 is a diagram comparing dilated (hole) convolution and standard convolution in the multi-scale convolution context module.
FIG. 6 is a block diagram of the structure of the context module of FIG. 1.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
To address the problems of raw-point-cloud segmentation methods based on deep learning, the invention provides a three-dimensional point cloud semantic segmentation method based on channel attention and multi-scale fusion; the network structure is shown in FIG. 1. As in image segmentation methods, the attention layer focuses on the channel information most useful to the task, and the multi-scale fusion module further samples the features with dilated convolutions of different receptive field sizes, emphasizing otherwise ignored local information; meanwhile, preliminary features are extracted with a position-adaptive convolution better suited to point cloud data. The specific implementation process is as follows:
step 1, reading and preprocessing point cloud data;
the existing point cloud data set is mainly divided into an indoor scene and an outdoor scene, wherein the indoor data set comprises S3DIS, ScanNet, Semantics and the like, and the storage formats mainly comprise TXT, PLY, OBJ and BIN, so that the point cloud format is unified and data is read at first, and then the point cloud data is simplified on the basis of keeping geometric characteristics through preprocessing operations such as rotation, denoising and the like, and a stable data basis is provided for subsequent processing.
And 2, passing the point cloud data through an encoder composed of a down-sampling layer, a grouping layer and a position-adaptive convolution, as shown on the left side of FIG. 2.
And 2.1, downsampling the point cloud by using a downsampling layer.
Given input points $\{x_1, x_2, \ldots, x_n\}$, the farthest point sampling (FPS) method is used to select a subset of $m$ center points $\{x_{i_1}, x_{i_2}, \ldots, x_{i_m}\}$ such that each $x_{i_j}$ is the point farthest from the already selected set $\{x_{i_1}, \ldots, x_{i_{j-1}}\}$ among the remaining points. Compared with random sampling, for the same number of centroids FPS covers the entire point set better and generates receptive fields in a data-dependent manner.
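For illustration, a minimal PyTorch sketch of farthest point sampling is given below; the function name, tensor shapes, and the random choice of the first centroid are assumptions for this sketch, not details fixed by the invention.

```python
# A minimal sketch of farthest point sampling (FPS) in PyTorch.
import torch

def farthest_point_sample(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """xyz: (B, N, 3) point coordinates; returns (B, m) indices of the centroids."""
    B, N, _ = xyz.shape
    centroids = torch.zeros(B, m, dtype=torch.long, device=xyz.device)
    # Distance from every point to its nearest already-selected centroid.
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.randint(0, N, (B,), device=xyz.device)  # random first pick
    batch = torch.arange(B, device=xyz.device)
    for i in range(m):
        centroids[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)          # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))
        farthest = dist.argmax(dim=-1)                        # next farthest point
    return centroids
```

Each iteration greedily picks the point farthest from everything chosen so far, which is what yields the data-dependent coverage described above.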
And 2.2, dividing the point set obtained in the last step into a plurality of areas by using a grouping layer.
The inputs to this layer are a point set of size $N \times (d + C)$ and centroid coordinates of size $N' \times d$. The output is a set of point groups of size $N' \times K \times (d + C)$, where each group corresponds to a local region and $K$ is the number of points near the centroid. The grouping uses a ball query, which selects up to $K$ points within a given radius, with the query radius measured in metric distance; the number of points actually found may differ between local regions. Compared with k-nearest-neighbor (kNN) search, the ball query guarantees a fixed region scale, making local region features more generalizable across space.
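A corresponding sketch of the ball-query grouping follows, under the same shape assumptions as the FPS sketch; padding groups that contain fewer than K in-radius points by repeating the first valid neighbor is a common convention and is assumed here.

```python
# A sketch of the ball query grouping step.
import torch

def ball_query(radius: float, K: int, xyz: torch.Tensor,
               centroids: torch.Tensor) -> torch.Tensor:
    """xyz: (B, N, 3); centroids: (B, M, 3); returns (B, M, K) neighbor indices."""
    B, N, _ = xyz.shape
    M = centroids.shape[1]
    sqrdist = torch.cdist(centroids, xyz) ** 2                # (B, M, N)
    idx = torch.arange(N, device=xyz.device).expand(B, M, N).clone()
    idx[sqrdist > radius ** 2] = N                            # mark points outside the ball
    idx = idx.sort(dim=-1).values[:, :, :K]                   # keep up to K in-ball points
    first = idx[:, :, :1].repeat(1, 1, K)
    mask = idx == N
    idx[mask] = first[mask]                                   # pad groups that came up short
    return idx
```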
And 2.3, extracting initial features for each region by using a position adaptive convolution method (PAConv).
As shown in FIG. 3, PAConv first defines a weight bank composed of weight matrices; a scoring network (ScoreNet) then learns a coefficient vector from the point positions, and finally a dynamic kernel is generated by combining the weight matrices with their associated position-adaptive coefficients. The resulting convolution kernel is applied to the input features, and the output features are obtained through max pooling. The detailed process is as follows:

The weight bank $B = \{B_m \mid m = 1, \ldots, M\}$ is generated by random initialization, where each $B_m$ represents a weight matrix and $M$ is the number of matrices. ScoreNet is responsible for associating the relative positions of the points with the weight matrices. Given a center point $p_i$ and a neighboring point $p_j$ with positional relation $(p_i, p_j) \in \mathbb{R}^{D_{in}}$, ScoreNet predicts the position-adaptive coefficients of $B_m$ according to equation (1):

$$S_{ij} = \alpha(\theta(p_i, p_j)) \qquad (1)$$

In equation (1), $\theta$ denotes a multilayer perceptron (MLP) and $\alpha$ is a normalization operation implemented with the softmax function. The output vector is $S_{ij} = (S_{ij}^{1}, \ldots, S_{ij}^{M})$, where $S_{ij}^{m}$ is the coefficient of $B_m$ when constructing the kernel $K(p_i, p_j)$ and $M$ is the number of weight matrices. The softmax function keeps each value between 0 and 1, ensuring that every weight matrix is selected with some probability; the larger the value, the stronger the relation between the position input and that weight matrix. The kernel of PAConv is obtained from equation (2) by combining the weight matrices in the weight bank with the position-adaptive coefficients predicted by ScoreNet:

$$K(p_i, p_j) = \sum_{m=1}^{M} S_{ij}^{m} B_m \qquad (2)$$

Finally, the generated kernel is applied to the input features according to equation (3), and a new feature vector is obtained through max pooling:

$$P_{out} = \mathrm{MAX}\big(\{K(p_i, p_j) \cdot P_{in}^{j}\}\big) \qquad (3)$$

In equation (3), $K$ denotes the convolution kernel, $\mathrm{MAX}$ denotes the max pooling operation, and $P_{in}$ and $P_{out}$ denote the input and output features, respectively.
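The following PyTorch sketch illustrates formulas (1) to (3). The ScoreNet width, the weight-bank initialization scale, and the encoding of the positional relation $(p_i, p_j)$ as a single concatenated vector are illustrative assumptions rather than details specified by the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAConvSketch(nn.Module):
    """A sketch of position-adaptive convolution per formulas (1)-(3)."""

    def __init__(self, d_in: int, d_out: int, M: int = 8, pos_dim: int = 3):
        super().__init__()
        # Weight bank B = {B_m}, randomly initialized: (M, d_in, d_out).
        self.bank = nn.Parameter(0.1 * torch.randn(M, d_in, d_out))
        # ScoreNet (theta): an MLP over the positional relation of (p_i, p_j).
        self.scorenet = nn.Sequential(
            nn.Linear(2 * pos_dim, 16), nn.ReLU(), nn.Linear(16, M))

    def forward(self, pos_rel: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        """pos_rel: (B, N, K, 2*pos_dim) positional relations;
        feats: (B, N, K, d_in) grouped input features; returns (B, N, d_out)."""
        S = F.softmax(self.scorenet(pos_rel), dim=-1)            # formula (1)
        kernel = torch.einsum("bnkm,mio->bnkio", S, self.bank)   # formula (2)
        out = torch.einsum("bnki,bnkio->bnko", feats, kernel)    # apply the kernel
        return out.max(dim=2).values                             # formula (3): max pooling
```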
And 3, recalibrating the point cloud features with the channel attention layer (L_SE layer).
The L_SE layer consists of three parts: Squeeze, Excitation and Reweight. Squeeze compresses the features along the spatial dimension, turning each feature channel into a single real number that has, to some extent, a global receptive field; the output dimension equals the number of input feature channels. Excitation generates a weight for each feature channel based on the correlation between channels, representing the importance of that channel. Reweight treats the weights output by Excitation as the importance of each feature channel and multiplies them channel by channel onto the earlier features, completing the recalibration of the original features along the channel dimension. The detailed process is as follows:
For point cloud data, Squeeze is implemented by one-dimensional global average pooling, as shown in equation (4), completing the correlation statistics of information among the feature mapping channels:

$$P_{avg} = \mathrm{AvgPool1D}(P_{in}) \qquad (4)$$

On the basis of the information obtained by the Squeeze operation, to further capture the correlation between channels, an operation is performed with a sigmoid activation function, as shown in equation (5):

$$P_s = \sigma(L(\delta(L(P_{avg})))) \qquad (5)$$

In equation (5), $\sigma$ denotes the sigmoid function, $L$ a Linear function, and $\delta$ the Leaky_ReLU activation function. During back propagation, unlike the ReLU function of the original network, the Leaky_ReLU activation function selected by the invention also yields a gradient where the input is below zero, alleviating the dying-neuron problem that arises when a ReLU output is 0; the two functions are shown in equations (6) and (7):

$$\mathrm{ReLU}(x) = \max(0, x) \qquad (6)$$

$$\mathrm{Leaky\_ReLU}(x) = \max(\alpha x, x) \qquad (7)$$

To reduce the complexity of the network model and improve its adaptability to different data, the first Linear function reduces the input channel dimension to $C/r$ (with $r$ the reduction ratio); after the Leaky_ReLU activation, the second Linear function expands the dimension back so that it matches the original input dimension. Finally, the result is fed into the sigmoid function to normalize the weight values to between 0 and 1, and the weights are applied to the original channel information through equation (8) to complete the recalibration:

$$P_{out} = P_s \otimes P_{in} \qquad (8)$$

$P_{out}$ in equation (8) is the new feature output by the L_SE layer; the calculation process is shown in FIG. 4.
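A compact PyTorch sketch of the L_SE layer per formulas (4) to (8) follows; the reduction ratio r and the Leaky_ReLU slope are assumed hyperparameters.

```python
import torch
import torch.nn as nn

class LSELayer(nn.Module):
    """A sketch of the channel attention (L_SE) layer per formulas (4)-(8)."""

    def __init__(self, channels: int, r: int = 4, slope: float = 0.01):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),   # reduce the channel dimension to C/r
            nn.LeakyReLU(slope),                  # keeps a gradient where x < 0
            nn.Linear(channels // r, channels),   # restore the original dimension
            nn.Sigmoid())                         # normalize the weights into (0, 1)

    def forward(self, p_in: torch.Tensor) -> torch.Tensor:
        """p_in: (B, C, N) point features; returns recalibrated (B, C, N) features."""
        p_avg = p_in.mean(dim=-1)                 # formula (4): 1D global average pooling
        p_s = self.fc(p_avg)                      # formula (5): per-channel weights
        return p_in * p_s.unsqueeze(-1)           # formula (8): channel-wise reweighting
```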
And 4, repeating the steps 2 to 3 for 4 times, down-sampling layer by layer to extract point cloud features.
And 5, extracting detail information by using a multi-scale convolution context (MSCC) module.
MSCC is designed to extract rich point cloud features. Unlike standard convolution, the invention selects one-dimensional dilated (hole) convolution. Dilated convolution is in effect a process of sampling the point cloud features, with the sampling interval set by the dilation rate parameter (rate). When rate = 1, feature sampling loses no information and reduces to the standard convolution operation; when rate > 1, a sample is taken every (rate − 1) points of the raw data, enlarging the receptive field. The actual kernel size K is calculated according to equation (9):

$$K = \text{kernel\_size} + (\text{kernel\_size} - 1)(\text{rate} - 1) \qquad (9)$$

In equation (9), kernel_size is the initial kernel size. Thus when standard convolution is selected, K equals kernel_size, while K for dilated convolution is larger; the comparison is shown in FIG. 5.
Dilated convolution enlarges the receptive field without reducing the spatial dimension or increasing the number of parameters, achieving a balance between accuracy and speed. The size of the output point cloud after convolution is calculated according to equation (10):

input: $(B, C_{in}, N_{in})$
output: $(B, C_{out}, N_{out})$

$$N_{out} = \left\lfloor \frac{N_{in} + 2 \times \text{padding} - \text{dilation} \times (\text{kernel\_size} - 1) - 1}{\text{stride}} \right\rfloor + 1 \qquad (10)$$

In equation (10), $N$ is the number of points and dilation denotes the rate. For the different convolution kernel sizes, to keep the output length $N$ unchanged, the stride is 1, the dilation is set to 2 and the padding is set to (kernel_size − 1).
The structure of MSCC is shown in FIG. 6: global information is first obtained with a standard convolution of kernel size 1, and parallel sampling is then performed with dilated convolutions of dilation rate 2 and kernel sizes 3, 5 and 7, respectively. Context features are thus extracted with different receptive fields, strengthening the relation between adjacent point clouds.
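A sketch of the MSCC module under these settings is given below; the description does not spell out how the four branches are fused, so the concatenation followed by a 1×1 convolution is an assumption.

```python
import torch
import torch.nn as nn

class MSCCSketch(nn.Module):
    """A sketch of the multi-scale convolution context module: a kernel-size-1
    standard branch plus parallel dilated branches (rate 2, kernel sizes 3/5/7);
    with stride 1 and padding = kernel_size - 1, each branch preserves N per
    formula (10)."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.branches = nn.ModuleList([nn.Conv1d(c_in, c_out, kernel_size=1)])
        for k in (3, 5, 7):
            self.branches.append(
                nn.Conv1d(c_in, c_out, kernel_size=k, dilation=2, padding=k - 1))
        # Fusing the branches by concatenation + 1x1 convolution is an assumption.
        self.fuse = nn.Conv1d(4 * c_out, c_out, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, C_in, N) -> (B, C_out, N)."""
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))
```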
And 6, passing the feature vector output by the multi-scale convolution context module through a decoder consisting of an up-sampling layer and a unit PointNet network, as shown on the right side of FIG. 2; the output of the corresponding encoder stage is taken as the other input of the decoder through a skip connection.
And 6.1, upsampling the point cloud characteristics by using an upsampling layer.
Up-sampling is performed by interpolation to restore the original point cloud scale. Based on the coordinates of the center points, inverse-distance-weighted interpolation over the k nearest neighbors with k = 3 is used, as shown in equation (11):

$$f^{(j)}(x) = \frac{\sum_{i=1}^{k} w_i(x) f_i^{(j)}}{\sum_{i=1}^{k} w_i(x)}, \quad \text{where } w_i(x) = \frac{1}{d(x, x_i)^2}, \; k = 3 \qquad (11)$$
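A sketch of this interpolation step follows; the inverse-square distance weighting matches equation (11) as reconstructed above, and the tensor shapes are assumptions of this sketch.

```python
import torch

def interpolate_features(xyz_dst: torch.Tensor, xyz_src: torch.Tensor,
                         feats_src: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Inverse-distance-weighted kNN interpolation per equation (11).
    xyz_dst: (B, N, 3) points to restore; xyz_src: (B, M, 3) known points;
    feats_src: (B, M, C) known features; returns (B, N, C)."""
    d, idx = torch.cdist(xyz_dst, xyz_src).topk(k, dim=-1, largest=False)
    w = 1.0 / (d ** 2 + 1e-8)                     # w_i(x) = 1 / d(x, x_i)^2
    w = w / w.sum(dim=-1, keepdim=True)           # normalize over the k neighbors
    B, N, _ = idx.shape
    batch = torch.arange(B, device=idx.device).view(B, 1, 1).expand(B, N, k)
    neighbors = feats_src[batch, idx]             # (B, N, k, C)
    return (w.unsqueeze(-1) * neighbors).sum(dim=2)
```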
and 6.2, decoding the characteristics by using a PointNet network unit.
The PointNet network mainly comprises a conversion network (T-Net) and a multilayer perceptron (MLP). T-Net is used to generate a transformation matrix and apply this transformation directly to the coordinates of the input points, specifically using two-dimensional regularization, and in order to keep the point cloud rotation invariant, as much as possible using orthogonal matrices, as shown in equation (12). The T-Net network is used for aligning the features, so that the features are more beneficial to extraction.
Preg=||I-AAT||2 (12)
P in the formula (10)regAnd I is an identity matrix corresponding to the dimension of the input matrix, and A is a feature matrix needing to be converted.
MLP is a neural network model composed of an input layer, a hidden layer and an output layer, with the output hw,b(x) Where w represents the inter-layer weight matrix and b represents the offset. The number of the unit PointNet networks is 3 or 4, and the dimensionality reduction is carried out on the feature vectors in sequence.
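A short sketch of the orthogonality regularizer of equation (12); averaging the penalty over the batch is an assumption of this sketch.

```python
import torch

def tnet_regularizer(A: torch.Tensor) -> torch.Tensor:
    """Equation (12): P_reg = ||I - A A^T||^2, pushing the learned feature
    transform A of shape (B, d, d) toward an orthogonal matrix."""
    I = torch.eye(A.size(-1), device=A.device).expand_as(A)
    diff = I - torch.bmm(A, A.transpose(1, 2))
    return (diff ** 2).sum(dim=(1, 2)).mean()     # batch mean is an assumption
```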
And 7, repeating the step 6 for 4 times, up-sampling and decoding the point cloud features layer by layer.
And 8, obtaining classification scores for the k classes through a fully connected layer of size k (the number of classes), thereby obtaining the segmentation result.
Examples
The dataset used in this embodiment is the S3DIS dataset, collected from the indoor environments of three different buildings and containing 271 rooms across 6 areas. There are 695,878,620 points in total; each point has corresponding coordinate and color information and one of 13 semantic labels, such as chair, table, floor and wall. This embodiment selects areas 1, 2, 3, 4 and 6 for training and area 5 for testing. During training, the input is sampled to a uniform 4096 points, while all points are used during testing.
In this embodiment, 150 epochs are trained on two GeForce RTX 2080 Ti GPUs with a batch size of 16, using an SGD optimizer with an initial learning rate of 0.05, a momentum of 0.9, and a weight decay of 10⁻⁴; the method is implemented with PyTorch on Linux. After the network is trained on the training set, model performance is evaluated on the test set, with mIoU (mean intersection over union) as the evaluation metric. The IoU (intersection over union) of each category on the S3DIS dataset is shown in Table 1; the mIoU reaches 64.8, showing that the method achieves good segmentation performance in the three-dimensional point cloud semantic segmentation task.
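For illustration, a sketch of the optimizer configuration described above is given below; `model` is a stand-in placeholder, not the segmentation network of the invention.

```python
import torch

# `model` is a placeholder module standing in for the segmentation network.
model = torch.nn.Linear(9, 13)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.05,            # initial learning rate
                            momentum=0.9,
                            weight_decay=1e-4)  # weight decay of 10^-4
# Training runs for 150 epochs with batch size 16; each input block is
# sampled to 4096 points during training, and all points are used at test time.
```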
Table 1: IoU results for each category on S3DIS dataset
[Table 1 image: per-category IoU results on the S3DIS dataset]
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (4)

1. A three-dimensional point cloud semantic segmentation method based on channel attention and multi-scale fusion is characterized by comprising the following steps:
step 1, reading and preprocessing point cloud data;
step 2, passing the point cloud data through an encoder composed of a down-sampling layer, a grouping layer and a position-adaptive convolution, mainly responsible for down-sampling and feature extraction;
step 3, recalibrating the point cloud features with a channel attention layer, modeling the correlation among channel feature information, and adjusting the relative weight of different features in the overall feature representation by learning their weight values;
and 4, repeating the steps 2 to 3 for 4 times, and performing down-sampling layer by layer to extract point cloud characteristics.
And 5, inputting the feature vector output by the last channel attention layer into a multi-scale convolution context module, which samples the features in parallel with dilated (hole) convolutions of the same dilation rate but different kernel sizes, gradually enlarging the receptive field and compensating for lost detail information.
And 6, passing the feature vector output by the multi-scale convolution context module through a decoder consisting of an up-sampling layer and a unit PointNet network, mainly responsible for up-sampling and feature decoding; the features of the corresponding encoder stage are taken as the other input of the decoder through a skip connection.
And 7, repeating the step 6 for 4 times, and up-sampling and decoding the point cloud characteristics layer by layer.
And 8, obtaining classification scores for the k classes through a fully connected layer of size k (the number of classes), thereby obtaining the segmentation result.
2. The method of claim 1, wherein in step 2, the position-adaptive convolution first defines a weight bank composed of weight matrices; a scoring network (ScoreNet) then learns a coefficient vector from the point positions, and a dynamic kernel is generated by combining the weight matrices with their associated position-adaptive coefficients. The resulting convolution kernel is applied to the input features, and the output features are obtained through max pooling. The detailed process is as follows:
The weight bank $B = \{B_m \mid m = 1, \ldots, M\}$ is generated by random initialization, where each $B_m$ represents a weight matrix and $M$ is the number of matrices. ScoreNet is responsible for associating the relative positions of the points with the weight matrices. Given a center point $p_i$ and a neighboring point $p_j$ with positional relation $(p_i, p_j) \in \mathbb{R}^{D_{in}}$, ScoreNet predicts the position-adaptive coefficients of $B_m$ as:

$$S_{ij} = \alpha(\theta(p_i, p_j))$$

where $\theta$ denotes a multilayer perceptron (MLP) and $\alpha$ is a normalization operation implemented with the softmax function. The output vector is $S_{ij} = (S_{ij}^{1}, \ldots, S_{ij}^{M})$, where $S_{ij}^{m}$ is the coefficient of $B_m$ when constructing the kernel $K(p_i, p_j)$ and $M$ is the number of weight matrices. The softmax function keeps each value between 0 and 1, ensuring that every weight matrix is selected with some probability; the larger the value, the stronger the relation between the position input and that weight matrix. The kernel of PAConv is obtained by combining the weight matrices in the weight bank with the position-adaptive coefficients predicted by ScoreNet:

$$K(p_i, p_j) = \sum_{m=1}^{M} S_{ij}^{m} B_m$$

The generated kernel is applied to the input features, and a new feature vector is obtained through max pooling:

$$P_{out} = \mathrm{MAX}\big(\{K(p_i, p_j) \cdot P_{in}^{j}\}\big)$$

where $K$ denotes the convolution kernel, $\mathrm{MAX}$ denotes the max pooling operation, and $P_{in}$ and $P_{out}$ denote the input and output features, respectively.
3. The method of claim 1, wherein in step 3, the channel attention layer consists of three parts: Squeeze, Excitation and Reweight. Squeeze compresses the features along the spatial dimension, turning each feature channel into a single real number that has, to some extent, a global receptive field; the output dimension equals the number of input feature channels. Excitation generates a weight for each feature channel based on the correlation between channels, representing the importance of that channel. Reweight treats the weights output by Excitation as the importance of each feature channel and multiplies them channel by channel onto the earlier features, completing the recalibration of the original features along the channel dimension. The detailed process is as follows:
For point cloud data, Squeeze is implemented by one-dimensional global average pooling, completing the correlation statistics of information among the feature mapping channels:

$$P_{avg} = \mathrm{AvgPool1D}(P_{in})$$

On the basis of the information obtained by the Squeeze operation, to further capture the correlation between channels, an operation is performed with a sigmoid activation function:

$$P_s = \sigma(L(\delta(L(P_{avg}))))$$

where $\sigma$ denotes the sigmoid function, $L$ a Linear function, and $\delta$ the Leaky_ReLU activation function. During back propagation, unlike the ReLU function of the original network, the Leaky_ReLU function selected by the invention also yields a gradient where the input is below zero, alleviating the dying-neuron problem that arises when a ReLU output is 0:

$$\mathrm{ReLU}(x) = \max(0, x)$$

$$\mathrm{Leaky\_ReLU}(x) = \max(\alpha x, x)$$

To reduce the complexity of the network model and improve its adaptability to different data, the first Linear function reduces the input channel dimension to $C/r$ (with $r$ the reduction ratio); after the Leaky_ReLU activation, the second Linear function expands the dimension back so that it matches the original input dimension. Finally, the result is fed into the sigmoid function to normalize the weight values to between 0 and 1, and the weights are applied to the original channel information to complete the recalibration:

$$P_{out} = P_s \otimes P_{in}$$

where $P_{out}$ is the new feature output by the L_SE layer.
4. The method of claim 1, wherein in step 5, the multi-scale convolution context module is used to extract rich point cloud features; unlike standard convolution, the invention selects one-dimensional dilated (hole) convolution. Dilated convolution is in effect a process of sampling the point cloud features, with the sampling interval set by the dilation rate parameter (rate). When rate = 1, feature sampling loses no information and reduces to the standard convolution operation; when rate > 1, a sample is taken every (rate − 1) points of the raw data, enlarging the receptive field. The actual kernel size K is calculated according to the following formula:

$$K = \text{kernel\_size} + (\text{kernel\_size} - 1)(\text{rate} - 1)$$

where kernel_size is the initial kernel size. Thus when standard convolution is selected, K equals kernel_size, while K for dilated convolution is larger.
Dilated convolution enlarges the receptive field without reducing the spatial dimension or increasing the number of parameters, achieving a balance between accuracy and speed. The output point cloud size after convolution is:

input: $(B, C_{in}, N_{in})$
output: $(B, C_{out}, N_{out})$

$$N_{out} = \left\lfloor \frac{N_{in} + 2 \times \text{padding} - \text{dilation} \times (\text{kernel\_size} - 1) - 1}{\text{stride}} \right\rfloor + 1$$

where $N$ is the number of points and dilation denotes the rate. For the different convolution kernel sizes, to keep the output length $N$ unchanged, the stride is 1, the dilation is set to 2 and the padding is set to (kernel_size − 1). Based on this setting, the multi-scale convolution context module first obtains global information with a standard convolution of kernel size 1, and then performs parallel sampling with dilated convolutions of dilation rate 2 and kernel sizes 3, 5 and 7, respectively. Context features are thus extracted with different receptive fields, strengthening the relation between adjacent point clouds.
CN202210418602.3A 2022-04-20 2022-04-20 Three-dimensional semantic segmentation method based on channel attention and multi-scale fusion Pending CN114743007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210418602.3A CN114743007A (en) 2022-04-20 2022-04-20 Three-dimensional semantic segmentation method based on channel attention and multi-scale fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210418602.3A CN114743007A (en) 2022-04-20 2022-04-20 Three-dimensional semantic segmentation method based on channel attention and multi-scale fusion

Publications (1)

Publication Number Publication Date
CN114743007A true CN114743007A (en) 2022-07-12

Family

ID=82283487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210418602.3A Pending CN114743007A (en) 2022-04-20 2022-04-20 Three-dimensional semantic segmentation method based on channel attention and multi-scale fusion

Country Status (1)

Country Link
CN (1) CN114743007A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115588090A (en) * 2022-10-19 2023-01-10 南京航空航天大学深圳研究院 Aircraft point cloud semantic extraction method with spatial relationship and characteristic information decoupled
CN115588090B (en) * 2022-10-19 2023-09-19 南京航空航天大学深圳研究院 Aircraft point cloud semantic extraction method for decoupling spatial relationship and characteristic information
CN115862013A (en) * 2023-02-09 2023-03-28 南方电网数字电网研究院有限公司 Attention mechanism-based power transmission and distribution scene point cloud semantic segmentation model training method
CN115862013B (en) * 2023-02-09 2023-06-27 南方电网数字电网研究院有限公司 Training method for power transmission and distribution electric field scenic spot cloud semantic segmentation model based on attention mechanism
CN116958553A (en) * 2023-07-27 2023-10-27 石河子大学 Lightweight plant point cloud segmentation method based on non-parametric attention and point-level convolution
CN116958553B (en) * 2023-07-27 2024-04-16 石河子大学 Lightweight plant point cloud segmentation method based on non-parametric attention and point-level convolution
CN117058380A (en) * 2023-08-15 2023-11-14 北京学图灵教育科技有限公司 Multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention
CN117058380B (en) * 2023-08-15 2024-03-26 北京学图灵教育科技有限公司 Multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention
CN117132501A (en) * 2023-09-14 2023-11-28 武汉纺织大学 Human body point cloud cavity repairing method and system based on depth camera
CN117132501B (en) * 2023-09-14 2024-02-23 武汉纺织大学 Human body point cloud cavity repairing method and system based on depth camera

Similar Documents

Publication Publication Date Title
CN114743007A (en) Three-dimensional semantic segmentation method based on channel attention and multi-scale fusion
Zou et al. Manhattan Room Layout Reconstruction from a Single 360° Image: A Comparative Study of State-of-the-Art Methods
CN111489358A (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN111368769B (en) Ship multi-target detection method based on improved anchor point frame generation model
CN107871106A (en) Face detection method and device
CN114841257B (en) Small sample target detection method based on self-supervision comparison constraint
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
CN111382300B (en) Multi-view three-dimensional model retrieval method and system based on pairing depth feature learning
CN110610210B (en) Multi-target detection method
CN106844620B (en) View-based feature matching three-dimensional model retrieval method
CN112329871B (en) Pulmonary nodule detection method based on self-correction convolution and channel attention mechanism
CN112784782B (en) Three-dimensional object identification method based on multi-view double-attention network
CN111860587A (en) Method for detecting small target of picture
Zhao et al. Character‐object interaction retrieval using the interaction bisector surface
CN111524140B (en) Medical image semantic segmentation method based on CNN and random forest method
CN116824585A (en) Aviation laser point cloud semantic segmentation method and device based on multistage context feature fusion network
Zhang et al. Joint information fusion and multi-scale network model for pedestrian detection
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN114821341A (en) Remote sensing small target detection method based on double attention of FPN and PAN network
Fan et al. A novel sonar target detection and classification algorithm
CN117710760A (en) Method for detecting chest X-ray focus by using residual noted neural network
CN117237643A (en) Point cloud semantic segmentation method and system
CN113128564A (en) Typical target detection method and system based on deep learning under complex background
CN111597367A (en) Three-dimensional model retrieval method based on view and Hash algorithm
CN116524495A (en) Traditional Chinese medicine microscopic identification method and system based on multidimensional channel attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination