CN114255238A - Three-dimensional point cloud scene segmentation method and system fusing image features - Google Patents

Three-dimensional point cloud scene segmentation method and system fusing image features

Info

Publication number
CN114255238A
CN114255238A (application CN202111423794.9A)
Authority
CN
China
Prior art keywords
point cloud
dimensional
data
fused
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111423794.9A
Other languages
Chinese (zh)
Inventor
饶云波
杨泽雨
曾少宁
牟洪雨
郑伟斌
王发新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou filed Critical Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202111423794.9A priority Critical patent/CN114255238A/en
Publication of CN114255238A publication Critical patent/CN114255238A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/50 - Depth or shape recovery
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10028 - Range image; Depth image; 3D point clouds
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/047 - Probabilistic or stochastic networks
    • G06N 3/048 - Activation functions
    • G06N 3/08 - Learning methods

Abstract

The invention provides a three-dimensional point cloud scene segmentation method and system fusing image features, which relate to the technical field of computer vision and can achieve effective fusion of a two-dimensional image and a three-dimensional point cloud as well as accurate segmentation of a three-dimensional scene. The method comprises the following steps: S1, acquiring two-dimensional data including images, point cloud data and depth data, and computing the association relation between the scene images and the point cloud from the acquired data; S2, performing feature extraction on the two-dimensional data to obtain a high-dimensional feature map to be fused; S3, fusing the feature map to be fused with the point cloud data according to a fusion strategy to obtain fused point cloud data, the fusion strategy comprising searching for the pixels adjacent to certain point cloud data and warping the corresponding pixel features onto that point cloud data; and S4, inputting the fused point cloud data into a three-dimensional segmentation network for feature extraction, thereby obtaining the required global and local semantic information. The technical solution provided by the invention is suitable for three-dimensional point cloud processing.

Description

Three-dimensional point cloud scene segmentation method and system fusing image features
Technical Field
The invention relates to the technical field of computer vision, in particular to a three-dimensional point cloud scene segmentation method and system fusing image features.
Background
With the development of artificial intelligence, machines are expected to acquire real-world information as people do, analyze the data intelligently, and make corresponding judgments. Computer vision, as an important technology by which machines perceive the real world, has long been a research hotspot in the field of artificial intelligence. Semantic segmentation of three-dimensional scenes, as a fundamental task of computer vision, has important research value and application prospects, and the accuracy it requires keeps increasing.
Scene data includes two-dimensional data, represented by two-dimensional images, and three-dimensional data, represented by RGB-D depth maps and three-dimensional point clouds. A two-dimensional image is affected by illumination conditions, complex environments and the like, which degrades imaging quality to a certain extent; object occlusion is another important reason why a two-dimensional image cannot describe the real world well. An RGB-D depth map adds per-pixel depth information on top of a two-dimensional image. A three-dimensional point cloud is sparse, unstructured three-dimensional data that describes the shape and location of objects in the real world. Compared with two-dimensional images, three-dimensional point clouds represent a real scene with more detail and carry richer structural and semantic information. With the rapid development and application of laser sensors and depth sensors, point clouds have become easier to acquire; the development of three-dimensional scanning technology has also increased the quantity and quality of three-dimensional data, and many well-curated three-dimensional point cloud scene datasets have appeared online.
Point cloud segmentation is the basis and premise of three-dimensional scene understanding: it divides the points in a point set into different categories according to certain rules and assigns them corresponding semantic labels. Over a long period of development, researchers have proposed a large number of traditional segmentation algorithms and improved them continuously; however, because traditional segmentation algorithms require manually designed feature descriptors, they cannot adapt to different segmentation tasks and generalize poorly. In recent years, deep learning has been widely applied in the field of computer vision thanks to its self-learning capability. Convolutional neural networks have great advantages in two-dimensional scene segmentation tasks and have driven the rapid development of the computer vision field. Due to the disorder and sparsity of point clouds, however, scene segmentation based on three-dimensional point clouds remains a difficult research problem.
A point cloud is an unordered set of spatial points; the points in the set do not have the ordering and consistency of two-dimensional pixels. Therefore, conventional convolutional neural networks cannot be applied to point clouds directly. In recent years, researchers have proposed many approaches to overcome this problem, generally divided into two categories: regularization-based methods and methods based on the raw point cloud.
The regularization-based method converts the point cloud into a regularized structure such as voxels or multiple views and then feeds it into a convolutional neural network for processing. Representative of such methods are VoxNet and MVCNN. VoxNet converts the original unstructured point cloud into regular three-dimensional voxel data and uses a three-dimensional convolutional neural network to extract features from the voxel data. MVCNN is a pioneer in applying the multi-view method to point cloud recognition: it obtains the features of multi-view images through a convolutional neural network and aggregates the multi-view features by pooling.
The method based on the raw point cloud learns point cloud features directly, without converting the point cloud into other forms of data. PointNet was the first neural network to process raw point cloud data directly; it obtains a global feature vector of the point cloud through multilayer perceptrons and a max-pooling layer, thereby addressing the disorder and transformation-invariance problems of point clouds. Subsequently, the original authors proposed the improved PointNet++ model, which introduces a multi-level receptive-field feature extraction structure and enhances the network's ability to extract local features at different densities. However, positional distribution information in three-dimensional space is still difficult to capture, and how to convert the association information between points into semantic information remains a problem to be solved.
When two-dimensional and three-dimensional data can be aligned, making full use of both is one way to improve current visual tasks. Many researchers have addressed visual problems by spatially mapping the two types of information: 3DMV maps 2D image features into a 3D voxel grid, where each voxel corresponds to multiple views and each voxel feature is obtained by pooling the features of the corresponding views. Similarly, some researchers project the three-dimensional point cloud onto two-dimensional planes and perform feature extraction and fusion on the views of the different planes to predict the semantic label of each point. These methods attempt mutual transformation between two-dimensional and three-dimensional data to assist feature extraction. Combining the two kinds of data effectively and improving the accuracy of the point cloud segmentation network is very challenging. First, for a point cloud, as data in three-dimensional space, the spatial distribution characteristics and high-dimensional point-to-point information are difficult to capture; second, a three-dimensional point cloud consists of unordered points while a two-dimensional image consists of ordered pixels, so fusing these two different types of data or features is difficult. Compared with the regular grid structure of image data, the sparsity and unstructured nature of the point cloud means that features cannot be extracted directly by a CNN.
Therefore, there is a need to develop a three-dimensional point cloud scene segmentation method and system fusing image features, so as to overcome the shortcomings of the prior art and solve or alleviate one or more of the above problems.
Disclosure of Invention
In view of this, the invention provides a three-dimensional point cloud scene segmentation method and system fusing image features, which can realize effective fusion of a two-dimensional image and a three-dimensional point cloud and accurate segmentation of a three-dimensional scene.
In one aspect, the invention provides a three-dimensional point cloud scene segmentation method fusing image features, which comprises the following steps:
S1, acquiring two-dimensional data including images, point cloud data and depth data, and computing the association relation between the scene images and the point cloud from the acquired data;
S2, performing feature extraction on the two-dimensional data to obtain a high-dimensional feature map to be fused;
S3, fusing the feature map to be fused with the point cloud data according to a fusion strategy to obtain fused point cloud data; the fusion strategy comprises: searching, through the association relation, for pixels adjacent to certain point cloud data, and warping the corresponding pixel features onto that point cloud data;
and S4, inputting the fused point cloud data into a three-dimensional segmentation network for feature extraction, thereby obtaining the required global and local semantic information.
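For illustration only, the following is a minimal sketch of how steps S1 to S4 could be chained in PyTorch; the names encoder_2d, fuse_features, seg_net_3d and assoc are assumptions introduced for this sketch and are not part of the disclosure itself.

import torch

def segment_scene(images, depths, points, encoder_2d, fuse_features, seg_net_3d, assoc):
    # images: (N, 3, H, W); depths: (N, H, W); points: (P, 3)
    # assoc: image/point-cloud association precomputed in step S1
    with torch.no_grad():                    # the 2D encoder only supplies features
        feat_maps = encoder_2d(images)       # S2: high-dimensional feature maps (N, C, H, W)
    fused = fuse_features(points, feat_maps, depths, assoc)   # S3: warp pixel features onto points
    logits = seg_net_3d(fused)               # S4: global and local semantic extraction
    return logits.argmax(dim=-1)             # per-point semantic labels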
In the above aspect and any possible implementation manner, an implementation manner is further provided in which, before the fusion, the point cloud data is subjected to key-point sampling to obtain point cloud data to be fused, and the point cloud data to be fused is then fused with the feature map to be fused.
As for the above-mentioned aspect and any possible implementation manner, there is further provided an implementation manner, where the feature extraction on the two-dimensional data in step S2 includes: extracting features through a two-dimensional segmentation network;
preferably, the image feature extraction is performed by a U-Net-ResNet network whose encoder is ResNet, wherein the U-Net-ResNet network has been trained in advance until the model reaches sufficient accuracy.
In the foregoing aspect and any possible implementation manner, there is further provided an implementation manner, wherein the association relation between the scene images and the point cloud in step S1 is computed as follows: select certain point cloud data, project each pixel into three-dimensional space through the depth data of the image, compute the spatial distance between the pixel and the selected point cloud, and regard the pixel as a valid pixel when the spatial distance is smaller than a decision threshold; count the total number of valid pixels of every scene image in a scene, sort the images by this number in descending order, and select the first N scene images as the original images of the two-dimensional data in step S2, where N is a positive integer.
In the above aspect and any possible implementation manner, an implementation manner is further provided wherein the point cloud data selected when counting the total number of valid pixels is either all point cloud data of the scene or the point cloud data of the key points in the scene.
The above-described aspect and any possible implementation manner further provide an implementation manner, wherein the feature extraction in step S4 includes: capturing the feature relation between points and their neighborhoods in three-dimensional space through an RFE module and a CSA module, and performing spatial feature extraction and feature semantic correction on the fused point cloud data.
The above-mentioned aspects and any possible implementation manner further provide an implementation manner, wherein the RFE module includes three RE sub-modules and is configured to encode the position information of the key points of the point cloud and their neighboring points, so as to capture the spatial distribution information of the key points and the neighboring points;
the CSA module processes the neighborhood spatial features through SharedMLP, alternately generating an attention matrix and the next layer of local features and multiplying them, thereby achieving spatial feature extraction and semantic information enhancement.
The above-described aspects and any possible implementation further provide an implementation in which the RFE module uses a shared function g comprising one or more SharedMLPs to learn the features of each piece of encoded information, followed by nonlinear learning through an activation function.
The above-described aspects and any possible implementations further provide an implementation in which the CSA module generates the attention matrix and the local features by a function g consisting of a set of SharedMLPs and Softmax;
wherein the attention matrix is defined as:
A_i = Softmax(g(N, r_i, W))
where A_i denotes the attention matrix at the current layer i, W is the parameter matrix learned by SharedMLP, N is the position-encoding information obtained from the last RFE module, and r_i is the feature extracted at layer i; r_0 represents the feature information F obtained by the RFE module.
In another aspect, the present invention provides a three-dimensional point cloud scene segmentation system fusing image features, the system being configured to implement the method as described in any one of the above; the system comprises:
the data processing module is used for acquiring two-dimensional data, point cloud data and depth data, calculating to obtain an association relation between the scene image and the point cloud, and packaging the association relation and the point cloud data together;
the two-dimensional image feature coding module is used for extracting features of the two-dimensional data to obtain a high-dimensional feature map to be fused;
the fusion module is used for fusing semantic features carried by pixels to point cloud data according to the position relation between the image pixels and the point cloud, so that the data are effectively fused and semantic information is effectively enhanced;
and the point cloud feature extraction module is used for extracting features of the fused point cloud data and performing semantic segmentation on the fused point cloud data under multiple scales.
The terms U-Net-ResNet network, ResNet, SharedMLP, Softmax, RFE module, CSA module and the like used in the invention are common terms or English abbreviations in the field, and refer to common tools in image processing or tools improved on the basis of such common tools.
Compared with the prior art, one of the above technical solutions has the following advantages or beneficial effects: it can effectively fuse the two-dimensional image with the three-dimensional point cloud, enhance structural and semantic information, facilitate accurate segmentation of the three-dimensional scene and accurate feature extraction, and effectively capture global and local semantic information;
another of the above technical solutions has the following advantages or beneficial effects: by selecting the point cloud data of key points and selecting the N most representative scene images for processing, the amount of data can be greatly reduced while maintaining high segmentation accuracy, improving computational efficiency.
Of course, it is not necessary for any one product in which the invention is practiced to achieve all of the above-described technical effects simultaneously.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of a three-dimensional point cloud segmentation method for fusing image features according to an embodiment of the present invention;
FIG. 2 is a block diagram of feature fusion provided by one embodiment of the present invention;
FIG. 3 is a diagram of a point cloud segmentation network provided by an embodiment of the invention;
FIG. 4 is a JALayer structure diagram provided by an embodiment of the present invention;
FIG. 5 is a block diagram of RE {64,64} provided in accordance with an embodiment of the present invention;
FIG. 6 is a neighborhood space feature encoding graph according to an embodiment of the present invention;
fig. 7 is a CSA module according to an embodiment of the invention.
Detailed Description
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To address the shortcomings of the prior art, the invention designs a point cloud segmentation model that fuses image features. The network model is built from a two-dimensional feature encoder, feature fusion, and a three-dimensional segmentation network: the two-dimensional feature encoder extracts the features of the two-dimensional image as accurately as possible, the image features are then fused with the point cloud, and the result is finally input into the three-dimensional segmentation network to achieve accurate segmentation of the three-dimensional scene. Specifically, a fusion strategy for image and point cloud is designed, namely searching for pixels adjacent to the point cloud and warping the pixel features onto the point cloud data. In the two-dimensional feature encoding stage, a two-dimensional segmentation network is trained in advance until its segmentation results reach a certain level, and its encoder is used as the two-dimensional feature encoder of the whole scene segmentation network. The point cloud data and the two-dimensional features extracted by the two-dimensional feature encoder are fused and input into the three-dimensional segmentation network. In the three-dimensional segmentation network, the invention designs two efficient feature extraction modules with different functions: a Relative Feature Extraction (RFE) module captures the feature relation between points and their neighborhoods in three-dimensional space and better extracts the association information between points, while a Cross Spatial Attention (CSA) module enhances and corrects the extracted point cloud semantic features. Features are extracted from the point cloud over different scale ranges, capturing different receptive fields, so that global and local semantic information can be effectively grasped. The flow of segmenting a three-dimensional point cloud scene according to the invention is shown in FIG. 1.
In terms of function, the point cloud segmentation model of the invention mainly comprises a data processing module, a two-dimensional image feature encoding module, a fusion module, and a point cloud feature extraction module. The four functional modules are as follows:
A data processing module: matches and packs the two-dimensional images, the depth information, and the three-dimensional point cloud, facilitating network reading and training.
A two-dimensional image feature encoding module: performs feature extraction on the two-dimensional scene images so as to supplement the semantics of the point cloud data.
A fusion module: fuses the semantic features of pixels onto the point cloud data according to the positional relation between image pixels and the point cloud, effectively fusing the data and enhancing the semantic information.
A point cloud feature extraction module: extracts features from the fused data and performs semantic segmentation and prediction on the fused data at multiple scales.
The invention discloses a three-dimensional point cloud scene segmentation method fusing image characteristics, which mainly comprises the following steps:
step 1, preprocessing picture and point cloud data
The images and point clouds are taken from the ScanNet v2 dataset, which consists of 1513 scenes with 21 semantic object categories. Each scene contains the video frames acquired for that scene, depth information, and a scene point cloud, which allows the performance of the invention to be evaluated well. From the video frames of each scene, the pictures that cover the key points of the scene point cloud to the greatest extent are selected. For the feature module to work, the adjacency relation between the scene point cloud and the two-dimensional images is computed, compressed, and packed at this stage. The data used by the segmentation network of the invention is obtained through these preprocessing and packing operations on the pictures and the point cloud.
Step 2, building a point cloud segmentation network model based on fused image features.
Image features are extracted by the trained two-dimensional segmentation network, which encodes the view images of the scene into high-dimensional feature maps. The high-dimensional feature maps and the point cloud are fused according to the spatial positional relation between the images and the point cloud, and enhanced fused point cloud data is finally output. The fused data is then input into the three-dimensional point cloud segmentation network to obtain the final segmentation result.
In the three-dimensional point cloud segmentation part, in order to capture local point cloud features, a Relative Feature Extraction (RFE) module is designed to capture the feature relation between points and their neighborhoods in three-dimensional space; more specifically, the relation between a neighborhood and its center point is encoded with a Position Encoder (PE), and this relation is abstracted by a neural network. In order to extract spatial features and correct the feature semantics of the fused data, a Cross Spatial Attention (CSA) module is provided; the CSA module comprises three groups of cross attention blocks, each of which generates an attention matrix adapted to its own feature map and operates on that feature map to achieve semantic correction and enhancement.
Step 3, training the network model and optimizing its parameters.
In order to improve network efficiency, the invention computes the spatial positional relation between the images and the point cloud in the preprocessing stage and compresses and packs the result, thereby avoiding excessively long network training times. The two-dimensional images used by the network are stored in separate folders per scene, and the spatial positional relation and the point cloud data are packed and saved in a single file format.
First, the two-dimensional segmentation network is trained: supervised learning is performed on the two-dimensional scene images, and the parameters and the model are iteratively optimized through back-propagation to obtain the two-dimensional image feature encoder. Then, the two-dimensional image feature encoder and the three-dimensional point cloud segmentation network are connected through the feature fusion module, and the parameters of the three-dimensional point cloud segmentation network are trained on the picture data and the packed data. The trained three-dimensional point cloud segmentation network model fusing image features is thus obtained, and its network parameters and model structure are saved as a .pth file.
Step 4, inputting the scene pictures and point cloud to be segmented into the trained neural network to obtain the segmentation result.
The test set of scene pictures and point clouds is input into the pre-trained network model for test experiments, yielding complete, high-precision scene segmentation results.
Example 1:
step one, preparation and pretreatment of scene picture and point cloud data
ScanNet v2 is an indoor scene dataset captured with the iPad's internal camera and an additional depth camera. The data of each scan consists of an RGB-D sequence with associated poses, a complete scene mesh, and semantic and instance labels; the dataset comprises 2.5M view images, 1513 point cloud meshes, and 21 semantic object categories. Each scene includes two-dimensional data, namely the video frames acquired for the scene, and three-dimensional data, namely the point cloud information. The invention uses 1201 scans for training and 312 scans for testing.
In order to improve network training efficiency and save training time, the association relation between the scene images and the point cloud is computed in the preprocessing step. First, the scene point cloud is down-sampled and its key points are selected; image pixels are projected into three-dimensional space through the depth information of the image, the spatial distance between each pixel and the point cloud is computed, and a distance threshold is set. When the distance between a pixel and a key point is smaller than the threshold, the pixel is considered to represent that key point, and the more such pixels an image contains, the stronger its ability to represent the scene point cloud. For each scene, the relation between the scene point cloud and all images is computed, the 5 most representative scene images are selected as subsequent training data, and the picture numbers are packed together with the scene point cloud, finally yielding the training and test data usable by the network.
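As an illustration of this preprocessing step, the sketch below back-projects depth pixels into world coordinates and ranks the frames of a scene by their number of valid pixels; the camera model, the helper names and the 0.05 m threshold are assumptions, since the filing does not state a concrete threshold value.

import numpy as np
from scipy.spatial import cKDTree

def count_valid_pixels(depth, intrinsic, pose, keypoints, threshold=0.05):
    # Back-project every depth pixel and count those lying within `threshold`
    # metres of some key point of the scene point cloud.
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    z = depth.reshape(-1)
    keep = z > 0
    x = (u.reshape(-1) - intrinsic[0, 2]) * z / intrinsic[0, 0]
    y = (v.reshape(-1) - intrinsic[1, 2]) * z / intrinsic[1, 1]
    cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[keep]   # camera coordinates
    world = (pose @ cam.T).T[:, :3]                            # world coordinates
    dist, _ = cKDTree(keypoints).query(world, k=1)             # nearest key point
    return int((dist < threshold).sum())

def select_top_images(frames, keypoints, n=5):
    # Rank all frames of the scene and keep the n most representative ones.
    scores = [count_valid_pixels(f['depth'], f['intrinsic'], f['pose'], keypoints)
              for f in frames]
    order = np.argsort(scores)[::-1][:n]
    return [frames[i]['frame_id'] for i in order]   # packed with the scene point cloud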
Step two, image feature extraction and feature fusion module design
For the problem of fusing point cloud and image, in order to fuse the two kinds of data more effectively, the invention designs a distance-based feature fusion method, as shown in FIG. 2. Two-dimensional pixels are mapped into three-dimensional space through the image depth information, and feature fusion is performed using the global point cloud position information and the picture pixel position information, thereby supplementing the semantic information of the point cloud data.
The image feature information can be obtained through a conventional two-dimensional segmentation network. A U-Net-ResNet network is used for image feature extraction, and the image data is trained on the U-Net-ResNet network until it reaches a certain accuracy. In the U-Net-ResNet network, ResNet is adopted as the encoder, which further improves the feature extraction capability for images. The trained two-dimensional network is used as the two-dimensional feature encoder during three-dimensional network training to extract image features, and the image features and the point cloud data are fused in the feature fusion module.
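A possible form of such a U-Net-ResNet encoder-decoder is sketched below; the ResNet depth (ResNet-34) and the decoder widths are assumptions, as the filing only states that ResNet is used as the encoder.

import torch
import torch.nn as nn
import torchvision

class UNetResNet(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        r = torchvision.models.resnet34(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu)               # 1/2, 64 ch
        self.enc1 = nn.Sequential(r.maxpool, r.layer1)                  # 1/4, 64 ch
        self.enc2, self.enc3, self.enc4 = r.layer2, r.layer3, r.layer4  # 1/8, 1/16, 1/32
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec3 = nn.Conv2d(512 + 256, 256, 3, padding=1)
        self.dec2 = nn.Conv2d(256 + 128, 128, 3, padding=1)
        self.dec1 = nn.Conv2d(128 + 64, 64, 3, padding=1)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        s = self.stem(x); e1 = self.enc1(s); e2 = self.enc2(e1)
        e3 = self.enc3(e2); e4 = self.enc4(e3)
        d3 = torch.relu(self.dec3(torch.cat([self.up(e4), e3], 1)))
        d2 = torch.relu(self.dec2(torch.cat([self.up(d3), e2], 1)))
        d1 = torch.relu(self.dec1(torch.cat([self.up(d2), e1], 1)))
        return self.head(d1), d1   # logits for 2D pre-training, feature map for fusion

After 2D pre-training, the network would be frozen and reused as the two-dimensional feature encoder of the overall segmentation network.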
In the feature fusion module, assume that N two-dimensional feature maps of size H x W x C and a set of point clouds are input. First, the pixel points of the two-dimensional feature maps are projected into three-dimensional space and the dense pixels are down-sampled, yielding an M x C pixel point cloud, where each point of the pixel point cloud is computed from pixel coordinates and depth information and M < N x H x W.
For a point p_i of the point cloud, the Euclidean distance is used to find its K nearest points of the pixel point cloud in three-dimensional space. By encoding the positional information of these pixel points (the spatial coordinates of the points of the M x C pixel point cloud, combined through a concatenation operation) together with that of the point cloud, the semantic contribution of each neighboring pixel point to p_i is estimated. Finally, a set of adjacent pixel-point encodings is obtained and fused with the image features in the next step.
As shown in FIG. 2, the feature fusion module integrates the information of each point with the features of its neighboring pixels: F_i denotes the information of the fused point, which fuses the point p_i with the feature information f_i^k of its neighboring pixels, and the semantic information of the fused data is enhanced through f_i^k.
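A minimal sketch of such distance-based fusion is given below, assuming the back-projected pixel positions and their 2D features are already available; the choice K = 3 and the MLP width are assumptions, and the exact encoding formulas appear only as images in the original filing.

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, pix_feat_dim=64, out_dim=64, k=3):
        super().__init__()
        self.k = k
        # encodes pixel position, relative position, distance and pixel feature
        self.mlp = nn.Sequential(nn.Linear(3 + 3 + 1 + pix_feat_dim, out_dim), nn.ReLU())

    def forward(self, points, pix_xyz, pix_feat):
        # points: (P, 3); pix_xyz: (M, 3) back-projected pixels; pix_feat: (M, C)
        dist = torch.cdist(points, pix_xyz)                 # (P, M) Euclidean distances
        d, idx = dist.topk(self.k, largest=False)           # K nearest pixels per point
        nbr_xyz, nbr_feat = pix_xyz[idx], pix_feat[idx]     # (P, K, 3), (P, K, C)
        rel = points.unsqueeze(1) - nbr_xyz                 # relative position to p_i
        enc = torch.cat([nbr_xyz, rel, d.unsqueeze(-1), nbr_feat], dim=-1)
        f = self.mlp(enc).max(dim=1).values                 # aggregate the K neighbours
        return torch.cat([points, f], dim=-1)               # fused point cloud (P, 3 + out_dim)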
Step three, point cloud segmentation network design
The invention designs a point cloud segmentation network based on PointNet + +, which is shown in figure 3. The network structure can better capture point cloud characteristics with different granularities, and then predict semantic labels of fused point cloud data.
Aiming at the fusion data mentioned in the previous step, the invention designs a feature extraction structure JALayer to better extract the spatial features and semantic features of the point cloud. JALayer is composed of Relative Feature Extraction (RFE) and Cross Spatial Attention (CSA), and the detailed structure is shown in fig. 4.
In practical applications, a scene contains a huge number of points; even the point cloud of a single object consists of very many points. Conventional convolutional neural networks require structured data for feature extraction and prediction, and the parameters of a neural network grow as the amount of data increases. For the massive points in point cloud data, key points that can represent the point cloud, together with the neighboring points around each key point, are extracted by sampling and grouping. In this way, the burden on the neural network is reduced and training efficiency is improved. In JALayer, the point cloud is first sampled by farthest point sampling to collect key points that represent the point cloud, the neighbors of each key point are obtained by ball query, and the coordinate and feature information of the key points and their neighbors is then fed to the RFE module for spatial feature encoding.
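The sampling-and-grouping front end of JALayer could look like the following plain-PyTorch sketch (production code would normally rely on CUDA kernels from a PointNet++ implementation); the function names are assumptions.

import torch

def farthest_point_sample(xyz, npoint):
    # xyz: (N, 3); returns indices of npoint key points spread over the cloud
    n = xyz.shape[0]
    idx = torch.zeros(npoint, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    farthest = int(torch.randint(0, n, (1,)))
    for i in range(npoint):
        idx[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=-1)
        dist = torch.minimum(dist, d)
        farthest = int(dist.argmax())
    return idx

def ball_query(xyz, centers, radius, nsample):
    # returns, for each center, nsample neighbour indices within the query radius
    d = torch.cdist(centers, xyz)                         # (S, N)
    grouped = d.argsort(dim=-1)[:, :nsample]              # nearest-first candidates
    out_of_range = torch.gather(d, 1, grouped) > radius
    grouped[out_of_range] = grouped[:, :1].expand_as(grouped)[out_of_range]  # pad with closest
    return grouped                                        # (S, nsample)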
The RFE module is composed of three Relative Encoding (RE) modules and aims to capture the spatial distribution information of key points and their neighboring points. As shown in FIG. 5, the RE module uses Position Encoding to encode the position information of a key point and its neighboring points. Given the point cloud P and the feature of each point, let the k nearest points of a center point p_i be p_i^1, ..., p_i^k. A shared function g is designed to learn the features of each piece of encoded information; the g function contains one or more SharedMLPs, each followed by nonlinear learning through the ReLU activation function. The Position Encoding formula is as follows:
e_i^k = g( p_i ⊕ p_i^k ⊕ (p_i - p_i^k) ⊕ ||p_i - p_i^k|| )
where p_i and p_i^k respectively denote the spatial positions of the center point and its neighbor, ⊕ denotes the concatenation (splicing) operator, and ||·|| computes the Euclidean distance. The relative position information of the points in the local neighborhood is thus encoded and abstracted into high-dimensional features by the g function.
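A sketch of one RE block implementing this position encoding with a SharedMLP (a 1 x 1 convolution shared over all points and neighbours) is shown below; the channel widths follow the RE{64,64} label of FIG. 5, and the concrete layout is otherwise an assumption.

import torch
import torch.nn as nn

class RelativeEncoding(nn.Module):
    def __init__(self, out_channels=(64, 64)):
        super().__init__()
        layers, c_in = [], 3 + 3 + 3 + 1        # p_i, p_i^k, p_i - p_i^k, ||p_i - p_i^k||
        for c_out in out_channels:
            layers += [nn.Conv2d(c_in, c_out, 1), nn.ReLU()]   # SharedMLP + ReLU
            c_in = c_out
        self.shared_mlp = nn.Sequential(*layers)

    def forward(self, center_xyz, neighbor_xyz):
        # center_xyz: (B, S, 3) key points; neighbor_xyz: (B, S, K, 3) their neighbours
        center = center_xyz.unsqueeze(2).expand_as(neighbor_xyz)
        rel = center - neighbor_xyz
        dist = rel.norm(dim=-1, keepdim=True)
        enc = torch.cat([center, neighbor_xyz, rel, dist], dim=-1)   # (B, S, K, 10)
        return self.shared_mlp(enc.permute(0, 3, 1, 2))              # (B, C_out, S, K)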
As shown in FIG. 6, in RFE, for each point p_i the local features and the global features of its neighborhood are encoded by two RE modules and concatenated into the neighborhood spatial feature, where W_0 and W_1 are the weight matrices of the respective SharedMLPs.
In the feature abstraction stage, the SharedMLP module serves as an effective encoder and, combined with an attention mechanism, can complete the feature extraction task well. Point cloud data consists of unordered points; after it is abstracted into features by SharedMLP, unlike with two-dimensional convolution, the receptive field of each point is not enlarged, so in the 3D network feature extraction is performed on different center points through multiple JALayer layers. As shown in FIG. 7, a Cross Spatial Attention (CSA) module is designed; its core idea is to process the neighborhood spatial features through SharedMLP, alternately generating attention matrices and next-layer local features and multiplying them, thereby achieving spatial feature extraction and semantic information enhancement.
A function g consisting of a set of SharedMLPs and Softmax is designed to generate the attention matrix and the local features. The attention matrix is defined as
A_i = Softmax(g(N, r_i, W))
where A_i denotes the attention matrix at the current layer i and W is the parameter matrix learned by SharedMLP; the local features of the next layer are obtained as
r_i = g(A_{i-1} · r_{i-1})
As shown in FIG. 7, three element-wise matrix multiplications are performed in the CSA module. Given the neighborhood spatial features and the local features of the point cloud, the CSA module learns and aggregates the local features, corrects them through semantic enhancement, and finally obtains the encoded feature information vector F_i through max pooling.
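The alternating attention scheme of the CSA module could be sketched as follows; because the attention formula is reproduced only as an image in the filing, the concrete computation here (SharedMLP followed by Softmax over the neighbourhood) is an assumption.

import torch
import torch.nn as nn

class CrossSpatialAttention(nn.Module):
    def __init__(self, channels, rounds=3):
        super().__init__()
        self.att = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Softmax(dim=-1))
             for _ in range(rounds)])          # generates the attention matrix A_i
        self.feat = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU())
             for _ in range(rounds)])          # generates the next-layer local feature r_i

    def forward(self, r):
        # r: (B, C, S, K) neighbourhood features; r_0 comes from the RFE module
        for att, feat in zip(self.att, self.feat):
            a = att(r)                         # attention over the K neighbours
            r = feat(a * r)                    # element-wise product, then SharedMLP
        return r.max(dim=-1).values            # F_i: max pooling over the neighbourhood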
Step four, building a neural network model based on multi-level feature fusion
As shown in FIG. 1, the overall architecture of the segmentation network of the invention is divided into two branches. The upper part is the two-dimensional image feature extraction branch, the lower part is the three-dimensional point cloud scene segmentation branch, and the two branches are connected through the feature fusion module. In the two-dimensional image feature extraction branch, a two-dimensional segmentation network is adopted to obtain the two-dimensional features of the scene views, encoding the two-dimensional images of the scene into high-dimensional feature maps. In the feature fusion module, the two-dimensional features and the three-dimensional point cloud data are fused by means of position encoding, and enhanced fused point cloud data is finally output. The fused data is then input into the three-dimensional point cloud segmentation network. The three-dimensional point cloud scene segmentation branch captures the feature relation between points and their neighborhoods in three-dimensional space through the RFE and CSA modules and performs spatial feature extraction and feature semantic correction on the fused data, further enhancing the network's semantic prediction and point cloud segmentation capability.
Step five, training the network model and verifying the segmentation effect.
In order to improve network efficiency, the images and the point cloud of the same scene are placed in two folders under the same scene name; the picture numbers and the scene point cloud are compressed and packed in the preprocessing stage, and pictures are looked up by picture number in the training and verification stages, reducing memory usage.
The experiments of the invention use an RTX 2080 Ti graphics processor for accelerated learning; the system platform is Ubuntu 16.04 and the deep learning framework is PyTorch. The two-dimensional feature encoder is obtained by training U-Net-ResNet on the scene images. The model optimizer is Adam with a learning rate of 0.004, the batch_size of the experiments is set to 16, and cross entropy is used as the loss function. In the three-dimensional point cloud scene segmentation network, the number of JALayer layers is set to 4, the key points of each layer are 2048, 512, 128 and 32 respectively, and the query radii for neighboring points are set to 0.1 m, 0.2 m, 0.4 m and 0.8 m respectively. In the decoding stage, the invention adopts the same settings as PointNet++.
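For reference, the hyper-parameters stated in this paragraph can be collected into a single configuration; the dictionary below only restates those values, and any field not mentioned in the text is a placeholder.

import torch

config = {
    'device': 'cuda',                    # RTX 2080 Ti, Ubuntu 16.04, PyTorch
    'optimizer': 'Adam',
    'lr': 0.004,
    'batch_size': 16,
    'loss': torch.nn.CrossEntropyLoss(),
    'num_jalayers': 4,
    'npoints': [2048, 512, 128, 32],     # key points sampled at each JALayer
    'radii': [0.1, 0.2, 0.4, 0.8],       # ball-query radius (metres) at each JALayer
    'decoder': 'same settings as PointNet++',
}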
In the training stage, U-Net-ResNet is first trained on the scene picture dataset, and the trained U-Net-ResNet is used as the two-dimensional feature encoder of the whole network; its parameters are not updated in the subsequent stage, and the encoder only participates in feature extraction. Second, the three-dimensional network is trained: the two-dimensional feature encoder provides scene image features, the feature fusion module fuses them with the scene point cloud, and the fused data is fed to the three-dimensional point cloud segmentation network, whose parameters are continuously optimized and updated until training of the three-dimensional network is complete. After training, the models and parameters of the two-dimensional and three-dimensional networks are saved to the hard disk as .pth files.
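A sketch of the second training stage with the frozen two-dimensional encoder is given below; the loader layout, the back_project helper and the epoch count are assumptions made for the sketch.

import torch

def train_stage_two(encoder_2d, fusion, seg_net_3d, loader, epochs=100):
    encoder_2d.eval()
    for p in encoder_2d.parameters():
        p.requires_grad = False                          # the 2D encoder is not updated
    opt = torch.optim.Adam(seg_net_3d.parameters(), lr=0.004)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, depths, points, labels, assoc in loader:
            with torch.no_grad():
                feat_maps = encoder_2d(images)           # scene image features
            pix_xyz, pix_feat = back_project(feat_maps, depths, assoc)  # assumed helper
            fused = fusion(points, pix_xyz, pix_feat)    # feature fusion module
            loss = criterion(seg_net_3d(fused), labels)
            opt.zero_grad(); loss.backward(); opt.step()
    torch.save(seg_net_3d.state_dict(), 'seg3d.pth')     # saved as a .pth file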
The test set of scene point clouds and pictures is preprocessed to obtain image data and packed point cloud data, which are input into the trained network model for test experiments, yielding complete prediction results.
The method and the system for segmenting the three-dimensional point cloud scene fusing the image features provided by the embodiment of the application are introduced in detail. The above description of the embodiments is only for the purpose of helping to understand the method of the present application and its core ideas; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a good or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such good or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a commodity or system that includes the element. "substantially" means within an acceptable error range, and a person skilled in the art can solve the technical problem within a certain error range to substantially achieve the technical effect.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" as used herein is merely an associative relationship that describes an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

Claims (10)

1. A three-dimensional point cloud scene segmentation method fusing image features, characterized by comprising the following steps:
S1, acquiring two-dimensional data including images, point cloud data and depth data, and computing the association relation between the scene images and the point cloud from the acquired data;
S2, performing feature extraction on the two-dimensional data to obtain a high-dimensional feature map to be fused;
S3, fusing the feature map to be fused with the point cloud data according to a fusion strategy to obtain fused point cloud data; the fusion strategy comprises: searching, through the association relation, for pixels adjacent to certain point cloud data, and warping the corresponding pixel features onto that point cloud data;
and S4, inputting the fused point cloud data into a three-dimensional segmentation network for feature extraction, thereby obtaining the required global and local semantic information.
2. The three-dimensional point cloud scene segmentation method fusing image features according to claim 1, wherein before the fusion, the point cloud data is subjected to key-point sampling to obtain point cloud data to be fused, and the point cloud data to be fused is then fused with the feature map to be fused.
3. The three-dimensional point cloud scene segmentation method fusing image features according to claim 1, wherein the feature extraction on the two-dimensional data in step S2 includes: extracting features through a two-dimensional segmentation network;
preferably, the image feature extraction is performed by a U-Net-ResNet network whose encoder is ResNet, wherein the U-Net-ResNet network has been trained in advance until the model reaches sufficient accuracy.
4. The three-dimensional point cloud scene segmentation method fusing image features according to claim 1, wherein the association relation between the scene images and the point cloud in step S1 is computed as follows: select certain point cloud data, project each pixel into three-dimensional space through the depth data of the image, compute the spatial distance between the pixel and the selected point cloud, and regard the pixel as a valid pixel when the spatial distance is smaller than a decision threshold; count the total number of valid pixels of every scene image in a scene, sort the images by this number in descending order, and select the first N scene images as the original images of the two-dimensional data in step S2, where N is a positive integer.
5. The three-dimensional point cloud scene segmentation method fusing image features according to claim 4, wherein the point cloud data selected when counting the total number of valid pixels is either all point cloud data of the scene or the point cloud data of the key points in the scene.
6. The three-dimensional point cloud scene segmentation method fusing image features according to claim 1, wherein the feature extraction in step S4 includes: capturing the feature relation between points and their neighborhoods in three-dimensional space through an RFE module and a CSA module, and performing spatial feature extraction and feature semantic correction on the fused point cloud data.
7. The three-dimensional point cloud scene segmentation method fusing image features according to claim 6, wherein the RFE module comprises three RE sub-modules for encoding the position information of the key points of the point cloud and their neighboring points and capturing the spatial distribution information of the key points and the neighboring points;
the CSA module processes the neighborhood spatial features through SharedMLP, alternately generating an attention matrix and the next layer of local features and multiplying them, thereby achieving spatial feature extraction and semantic information enhancement.
8. The three-dimensional point cloud scene segmentation method fusing image features according to claim 7, wherein the RFE module uses a shared function g comprising one or more SharedMLPs to learn the features of each piece of encoded information, followed by nonlinear learning through an activation function.
9. The three-dimensional point cloud scene segmentation method fusing image features according to claim 7, wherein the CSA module generates the attention matrix and the local features by a function g consisting of a set of SharedMLPs and Softmax;
wherein the attention matrix is defined as:
A_i = Softmax(g(N, r_i, W))
where A_i denotes the attention matrix at the current layer i, W is the parameter matrix learned by SharedMLP, N is the position-encoding information obtained from the last RFE module, and r_i is the feature extracted at layer i.
10. A three-dimensional point cloud scene segmentation system fusing image features, wherein the system is used for implementing the method according to any one of claims 1 to 9; the system comprises:
the data processing module, used for acquiring two-dimensional data, point cloud data and depth data and computing the association relation between the scene images and the point cloud;
the two-dimensional image feature coding module is used for extracting features of the two-dimensional data to obtain a high-dimensional feature map to be fused;
the fusion module is used for fusing semantic features carried by pixels to point cloud data according to the position relation between the image pixels and the point cloud, so that the data are effectively fused and semantic information is effectively enhanced;
and the point cloud feature extraction module is used for extracting features of the fused point cloud data and performing semantic segmentation on the fused point cloud data under multiple scales.
CN202111423794.9A 2021-11-26 2021-11-26 Three-dimensional point cloud scene segmentation method and system fusing image features Pending CN114255238A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111423794.9A CN114255238A (en) 2021-11-26 2021-11-26 Three-dimensional point cloud scene segmentation method and system fusing image features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111423794.9A CN114255238A (en) 2021-11-26 2021-11-26 Three-dimensional point cloud scene segmentation method and system fusing image features

Publications (1)

Publication Number Publication Date
CN114255238A true CN114255238A (en) 2022-03-29

Family

ID=80793421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111423794.9A Pending CN114255238A (en) 2021-11-26 2021-11-26 Three-dimensional point cloud scene segmentation method and system fusing image features

Country Status (1)

Country Link
CN (1) CN114255238A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419412A (en) * 2022-03-31 2022-04-29 江西财经大学 Multi-modal feature fusion method and system for point cloud registration
CN114972763A (en) * 2022-07-28 2022-08-30 香港中文大学(深圳)未来智联网络研究院 Laser radar point cloud segmentation method, device, equipment and storage medium
CN115239915A (en) * 2022-09-21 2022-10-25 季华实验室 VR scene real-time reconstruction method and device, electronic equipment and storage medium
CN115862013A (en) * 2023-02-09 2023-03-28 南方电网数字电网研究院有限公司 Attention mechanism-based power transmission and distribution scene point cloud semantic segmentation model training method
CN115880243A (en) * 2022-12-02 2023-03-31 广东机电职业技术学院 Rail surface damage detection method, system and medium based on 3D point cloud segmentation
CN115908519A (en) * 2023-02-24 2023-04-04 南京航空航天大学 Three-dimensional measurement registration error control method for large composite material component
CN116258719A (en) * 2023-05-15 2023-06-13 北京科技大学 Flotation foam image segmentation method and device based on multi-mode data fusion
CN116258970A (en) * 2023-05-15 2023-06-13 中山大学 Geographic element identification method integrating remote sensing image and point cloud data
CN116468892A (en) * 2023-04-24 2023-07-21 北京中科睿途科技有限公司 Semantic segmentation method and device of three-dimensional point cloud, electronic equipment and storage medium
WO2024001093A1 (en) * 2022-07-01 2024-01-04 北京京东乾石科技有限公司 Semantic segmentation method, environment perception method, apparatus, and unmanned vehicle
CN117409209A (en) * 2023-12-15 2024-01-16 深圳大学 Multi-task perception three-dimensional scene graph element segmentation and relationship reasoning method
CN117523547A (en) * 2024-01-04 2024-02-06 山东省凯麟环保设备股份有限公司 Three-dimensional scene semantic perception method, system, equipment and medium
CN117726822A (en) * 2024-02-18 2024-03-19 安徽大学 Three-dimensional medical image classification segmentation system and method based on double-branch feature fusion

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419412A (en) * 2022-03-31 2022-04-29 江西财经大学 Multi-modal feature fusion method and system for point cloud registration
WO2024001093A1 (en) * 2022-07-01 2024-01-04 北京京东乾石科技有限公司 Semantic segmentation method, environment perception method, apparatus, and unmanned vehicle
CN114972763A (en) * 2022-07-28 2022-08-30 香港中文大学(深圳)未来智联网络研究院 Laser radar point cloud segmentation method, device, equipment and storage medium
WO2024021194A1 (en) * 2022-07-28 2024-02-01 香港中文大学(深圳)未来智联网络研究院 Lidar point cloud segmentation method and apparatus, device, and storage medium
CN115239915A (en) * 2022-09-21 2022-10-25 季华实验室 VR scene real-time reconstruction method and device, electronic equipment and storage medium
CN115239915B (en) * 2022-09-21 2022-12-09 季华实验室 VR scene real-time reconstruction method and device, electronic equipment and storage medium
CN115880243A (en) * 2022-12-02 2023-03-31 广东机电职业技术学院 Rail surface damage detection method, system and medium based on 3D point cloud segmentation
CN115862013A (en) * 2023-02-09 2023-03-28 南方电网数字电网研究院有限公司 Attention mechanism-based power transmission and distribution scene point cloud semantic segmentation model training method
CN115908519A (en) * 2023-02-24 2023-04-04 南京航空航天大学 Three-dimensional measurement registration error control method for large composite material component
CN116468892A (en) * 2023-04-24 2023-07-21 北京中科睿途科技有限公司 Semantic segmentation method and device of three-dimensional point cloud, electronic equipment and storage medium
CN116258970A (en) * 2023-05-15 2023-06-13 中山大学 Geographic element identification method integrating remote sensing image and point cloud data
CN116258719B (en) * 2023-05-15 2023-07-18 北京科技大学 Flotation foam image segmentation method and device based on multi-mode data fusion
CN116258970B (en) * 2023-05-15 2023-08-08 中山大学 Geographic element identification method integrating remote sensing image and point cloud data
CN116258719A (en) * 2023-05-15 2023-06-13 北京科技大学 Flotation foam image segmentation method and device based on multi-mode data fusion
CN117409209A (en) * 2023-12-15 2024-01-16 深圳大学 Multi-task perception three-dimensional scene graph element segmentation and relationship reasoning method
CN117409209B (en) * 2023-12-15 2024-04-16 深圳大学 Multi-task perception three-dimensional scene graph element segmentation and relationship reasoning method
CN117523547A (en) * 2024-01-04 2024-02-06 山东省凯麟环保设备股份有限公司 Three-dimensional scene semantic perception method, system, equipment and medium
CN117523547B (en) * 2024-01-04 2024-03-29 山东省凯麟环保设备股份有限公司 Three-dimensional scene semantic perception method, system, equipment and medium
CN117726822A (en) * 2024-02-18 2024-03-19 安徽大学 Three-dimensional medical image classification segmentation system and method based on double-branch feature fusion

Similar Documents

Publication Publication Date Title
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
CN111340738B (en) Image rain removing method based on multi-scale progressive fusion
CN115601549B (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
CN112529015A (en) Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping
CN112085072B (en) Cross-modal retrieval method of sketch retrieval three-dimensional model based on space-time characteristic information
CN110852182A (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
CN112560865B (en) Semantic segmentation method for point cloud under outdoor large scene
CN111127538A (en) Multi-view image three-dimensional reconstruction method based on convolution cyclic coding-decoding structure
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN112257727A (en) Feature image extraction method based on deep learning self-adaptive deformable convolution
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Hong et al. USOD10K: a new benchmark dataset for underwater salient object detection
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
CN115423982B (en) Three-dimensional detection method for desktop curling based on images and depth
CN117011380A (en) 6D pose estimation method of target object
CN115170746B (en) Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
CN114937153B (en) Visual characteristic processing system and method based on neural network in weak texture environment
CN116485892A (en) Six-degree-of-freedom pose estimation method for weak texture object
CN114638866A (en) Point cloud registration method and system based on local feature learning
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion
CN114187506A (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
Li et al. Multi-view convolutional vision transformer for 3D object recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination