CN116071747A - 3D point cloud data and 2D image data fusion matching semantic segmentation method - Google Patents
- Publication number: CN116071747A (application CN202211722227.8A)
- Authority: CN (China)
- Prior art keywords: feature, feature map, point cloud, image, block structure
- Legal status: Pending (assumed; Google has not performed a legal analysis)
Classifications
- G06V20/70 — Labelling scene content, e.g. deriving syntactic or semantic representations
- G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82 — Arrangements for image or video recognition or understanding using neural networks
Abstract
The invention relates to a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data, and belongs to the technical field of image processing. The method comprises the following steps: extracting feature maps from the 2D image and the 3D point cloud, respectively, using a 2D image network and a 3D point cloud network into which a multi-scale fusion attention mechanism is integrated; using a feature fusion module to obtain the sparse-dense feature sampling result produced by projecting the 2D feature map; fusing the result obtained in S1 by channel concatenation; finally outputting the predicted segmentation result; and training the model on the target domain and the source domain. The multi-scale fusion attention mechanism reduces the 2D feature-map features lost to multi-scale feature fusion and improves segmentation accuracy; a feature fusion module is added on top of the feature matching between the 2D image and the 3D point cloud data and combined with the deformable convolution pooling part of the original model, improving the prediction accuracy of the model.
Description
Technical Field
The invention relates to a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data, and belongs to the technical field of image processing.
Background
With the rapid development of autonomous driving technology and intelligent robot research, a deep understanding of the surrounding environment has become indispensable, so accurate semantic segmentation is increasingly important. The key to fully understanding an image is to divide it into its constituent parts. Scene understanding has now progressed to pixel-level refinement: pixels are processed so that every entity in a picture can be detected and given a clear boundary.
With the continued development of computer vision, many researchers have taken an interest in semantic segmentation. Segmentation of static images has been studied in great depth, and many mature algorithms have been proposed. However, semantic segmentation of 2D images has some non-negligible shortcomings that have long troubled researchers, such as a strong dependence on illumination conditions, blurred segmentation of small-object edges, and confusion when segmenting occluded objects. 3D point cloud data can effectively mitigate these problems, and thanks to the development of LiDAR equipment in recent years, acquiring 3D point cloud data is no longer difficult. How to extract useful information from large amounts of 3D point cloud data so as to analyze a scene better is therefore an important topic in current computer vision research.
Object detection, classification and recognition based on 3D point cloud data are the main technologies for scene analysis, and semantic segmentation of 3D point clouds is their foundation. When a new scene is encountered, domain adaptation is an important means of understanding it in the absence of data annotation. Although 3D point cloud data has become easy to acquire and ever more varied, annotating it requires an enormous amount of time, which makes domain adaptation for 3D point cloud semantic segmentation attractive. Moreover, compared with 2D images, labeling point cloud data consumes considerable manpower and material resources, and the point clouds themselves are quite sparse for segmentation work; the many missing points inevitably affect the subsequent segmentation result.
Existing methods mainly downsample the densely featured 2D feature map to obtain a feature map as sparse as the 3D point cloud data, so that cross-modal interaction can be carried out after the 2D and 3D features are aligned. In this way, 2D data can be used for 3D domain-adaptive cross-modal learning, reducing the time wasted on 3D point cloud labeling. In conventional intra-domain cross-modal learning, however, the dense 2D pixel features are sampled down to the size of the sparse 3D point cloud features, so a large number of 2D features are discarded. Domain adaptation plays its role when applied to semantic segmentation: a deep-learning segmentation model that performs well on one dataset often degrades on another, making it difficult to predict unseen scenes accurately, and scene conditions such as illumination also strongly affect accuracy across scenes. A domain adaptation method therefore avoids the large amount of time consumed by manual data labeling. The present application aims to solve the above problems with a domain-adaptive approach.
Disclosure of Invention
The invention aims to solve the technical problem that semantic segmentation results are inaccurate when relying solely on 3D point cloud data, owing to the lack of labels for 3D point cloud data and its sparsity, and provides a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the invention discloses a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data.
Based on the existing xMUDA and DsCML models, the method adds a feature fusion part and modifies the segmentation network on top of their inter-domain and intra-domain cross-modal semantic segmentation models, thereby improving the results of inter-domain and intra-domain semantic segmentation networks that use multi-modal datasets. The method comprises the following: the source domain and target domain of a scene-to-scene dataset pair are learned jointly, exploiting the differences between the scenes of the source and target datasets to obtain a degree of domain adaptability, and the trained model is tested on test sets whose source and target domains are the same and on test sets where they differ; a multi-scale fusion attention mechanism is integrated into the atrous spatial pyramid pooling (ASPP) structure, reducing the 2D feature-map features lost to multi-scale feature fusion and improving segmentation accuracy; and a feature fusion module is added on top of the feature matching between the 2D image and the 3D point cloud data and combined with the deformable convolution pooling part of the original model, improving the prediction accuracy of the model.
The semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data specifically comprises the following steps:
step 1: extracting a feature map from the 2D picture by using the DeepLabv3 model to obtain a 2D feature map; the method comprises the following steps:
step 1.1: constructing block structures, wherein each block structure comprises convolution layers, batch normalization functions and linear rectification activation functions; the original picture passes through a convolution layer, a batch normalization function, a linear rectification activation function, a second convolution layer and a batch normalization function, the result is spliced with the original input, and the spliced result passes through a batch normalization function and a linear rectification activation function to serve as the output of the block structure; the original 2D image is input into the block structure, and the step length of the block structure is set to n, so that an output picture is obtained;
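The layer ordering of the block structure in step 1.1 can be sketched as follows. This is an illustrative toy, not the patented network: channel counts are invented, and the convolutions are reduced to 1×1 channel mixing so that only the ordering (conv → batch norm → ReLU → conv → batch norm → splice with the input → batch norm → ReLU) is demonstrated.

```python
# Minimal sketch of the step 1.1 "block structure" (hypothetical shapes; the
# patent fixes only the layer ordering, not channel counts or kernel sizes).
import numpy as np

def batch_norm(x, eps=1e-5):
    # Inference-style normalization over the spatial axes of each channel.
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W); stands in for a real conv.
    return np.tensordot(w, x, axes=([1], [0]))

def block(x, w1, w2):
    out = relu(batch_norm(conv1x1(x, w1)))      # conv -> BN -> ReLU
    out = batch_norm(conv1x1(out, w2))          # conv -> BN
    out = np.concatenate([out, x], axis=0)      # splice with the original input
    return relu(batch_norm(out))                # BN -> ReLU as the block output

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
y = block(x, rng.standard_normal((4, 4)), rng.standard_normal((4, 4)))
print(y.shape)  # channel dimension doubles because of the splice
```

Note that the splice (channel concatenation) doubles the channel count, unlike the additive shortcut of a classic residual block; the patent text explicitly says the result is spliced with the original input.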
step 1.2: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.1 into the block structure constructed in this step, and setting the step length of the block structure to m to obtain the output picture;
step 1.3: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.2 into the block structure constructed in this step, and setting the step length of the block structure to q to obtain the output picture;
step 1.4: constructing a block structure in the manner of step 1.1 with the dilation rate set to t1, inputting the output picture obtained in step 1.3 into the block structure constructed in this step, and setting the step length of the block structure to q to obtain the output picture;
step 1.5: constructing an atrous spatial pyramid pooling structure to process the output picture obtained in step 1.4, specifically comprising the following sub-steps:
step 1.5.1: constructing a convolution layer of size a and several dilated convolutions of size b, and processing the output picture obtained in step 1.4 to obtain picture features at several different scales;
step 1.5.2: constructing a global average pooling layer, and processing the output picture obtained in the step 1.4 to obtain the image level characteristics of the output picture;
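Step 1.5 (the ASPP structure) can be sketched as parallel branches — one 1×1 convolution, several dilated 3×3 convolutions, and a global average pooling branch. The sketch below is a single-channel numpy toy; the dilation rates and kernel weights are illustrative assumptions, not values from the patent.

```python
# Single-channel numpy sketch of the ASPP structure in step 1.5
# ("same" padding; one filter per branch; rates are illustrative).
import numpy as np

def dilated_conv3x3(x, k, rate):
    # Naive 3x3 dilated convolution with zero padding so output size == input size.
    H, W = x.shape
    xp = np.pad(x, rate)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            # Pick the 3x3 grid of samples spaced `rate` apart around pixel (i, j).
            patch = xp[i:i + 2 * rate + 1:rate, j:j + 2 * rate + 1:rate]
            out[i, j] = (patch * k).sum()
    return out

def aspp(x, rates=(6, 12, 18)):
    k3 = np.ones((3, 3)) / 9.0
    branches = [x * 1.0]                                     # 1x1 convolution branch
    branches += [dilated_conv3x3(x, k3, r) for r in rates]   # dilated 3x3 branches
    image_level = np.full_like(x, x.mean())                  # global average pooling branch
    return branches, image_level

x = np.arange(64, dtype=float).reshape(8, 8)
branches, image_level = aspp(x)
print(len(branches), image_level[0, 0])
```

Each branch keeps the spatial size of the input while looking at a different receptive field, which is what produces the "picture features at several different scales" of step 1.5.1.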
step 2: extracting the feature map of the 3D point cloud data based on the SparseConvNet model to obtain a 3D feature map; the method comprises the following steps:
step 2.1: preprocessing the input 3D point cloud data and arranging the input tensor in NCHW order, wherein the non-zero data in the point cloud are defined as activated input sites;
step 2.2: constructing a convolution kernel with the kernel size of c;
step 2.3: establishing serial-number/coordinate hash tables for the input and output tensors; first building an input hash table Hash_in, in which key_in stores the coordinates of an input pixel and v_in its serial number, each row representing an activated input site; recording the input pixels related to each pixel of the output tensor as P_out, and on that premise building an output hash table Hash_out, in which key_out stores a coordinate in the output tensor and v_out its serial number;
step 2.4: establishing a RuleBook that associates the serial numbers in the input and output hash tables obtained in step 2.3 so as to realize sparse convolution, and convolving the 3D point cloud data preprocessed in step 2.1 with the convolution kernel constructed in step 2.2 to obtain the 3D feature map;
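Steps 2.1–2.4 can be sketched in miniature: activated sites go into a coordinate→serial-number hash table, a RuleBook lists which (kernel offset, input index, output index) triples actually need computing, and the convolution touches only those triples. The coordinates, values, the 2D grid, and the submanifold assumption (output sites equal input sites) are all illustrative, not taken from the patent.

```python
# Toy sparse convolution via hash tables and a RuleBook (2D grid for brevity).
import numpy as np

points = {(1, 1): 2.0, (2, 2): 3.0, (4, 1): 1.0}            # activated (non-zero) input sites
hash_in = {coord: idx for idx, coord in enumerate(points)}   # key_in -> v_in

# Submanifold-style assumption: outputs live on the same activated sites,
# so the output hash table mirrors the input one here.
hash_out = dict(hash_in)                                     # key_out -> v_out

# RuleBook: for every kernel offset, which input index feeds which output index.
offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]  # 3x3 kernel
rulebook = []  # entries: (kernel_offset, v_in, v_out)
for (i, j), v_out in hash_out.items():
    for di, dj in offsets:
        nb = (i + di, j + dj)
        if nb in hash_in:                                    # only active neighbours matter
            rulebook.append(((di, dj), hash_in[nb], v_out))

# Apply an all-ones 3x3 kernel via the RuleBook: only listed pairs are computed.
kernel = {off: 1.0 for off in offsets}
vals = list(points.values())
out = np.zeros(len(hash_out))
for off, v_in, v_out in rulebook:
    out[v_out] += kernel[off] * vals[v_in]
print(out)
```

The empty grid cells never enter the computation, which is the point of the hash-table/RuleBook construction: work scales with the number of activated sites rather than with the full tensor volume.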
step 3: applying the self-attention mechanism with multi-scale feature fusion to the 2D feature map obtained in step 1 to obtain a 2D dependency feature map with global dependencies; the method comprises the following steps:
step 3.1: calculating the pairwise similarities among the picture features of different scales obtained in step 1.5.1 and the image-level features obtained in step 1.5.2;
step 3.2: normalizing the similarities obtained in step 3.1 with the normalized exponential (softmax) function and using them as key values for a weighted sum to obtain the 2D dependency feature map;
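Steps 3.1–3.2 amount to a small attention computation: score each branch feature against the image-level feature, softmax the scores, and take the weighted sum. The sketch below flattens features to short vectors and uses dot-product similarity, which is an assumption — the patent does not fix the similarity measure.

```python
# Sketch of steps 3.1-3.2: similarity -> softmax -> weighted sum.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())      # subtract max for numerical stability
    return e / e.sum()

def fuse(branch_feats, image_level_feat):
    # branch_feats: list of (D,) vectors; image_level_feat: (D,) query.
    sims = np.array([f @ image_level_feat for f in branch_feats])  # step 3.1
    w = softmax(sims)                                              # step 3.2 key values
    fused = sum(wi * f for wi, f in zip(w, branch_feats))          # weighted sum
    return fused, w

feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
q = np.array([1.0, 0.0])
fused, w = fuse(feats, q)
print(w.round(3), fused.round(3))
```

Branches that resemble the image-level feature receive larger weights, so multi-scale branches that would otherwise be diluted by the fusion still contribute according to their relevance.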
step 4.1: processing the 2D dependency feature map with global dependencies obtained in step 3.2 to obtain the corresponding offset map;
step 4.2: constructing a deformable convolution layer, inputting the 2D dependency feature map obtained in step 3.2 together with the offset map obtained in step 4.1 into the constructed deformable convolution layer, and obtaining three 2D feature maps through maximum, minimum and average pooling;
step 4.3: constructing a 2D-3D projection model, sampling the three 2D feature maps obtained in step 4.2, and performing the final segmentation prediction after the feature matching process between the two modalities is completed, obtaining the maximum, minimum and average probability scores, respectively;
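The sparse-dense sampling of step 4.3 can be sketched as follows: each 3D point, projected to a pixel, gathers a small neighborhood of the 2D score map and reduces it with maximum, minimum and average pooling to obtain its three probability scores. The window size and the integer projected coordinates are illustrative assumptions.

```python
# Sketch of steps 4.2-4.3: per projected 3D point, pool a small 2D neighborhood
# three ways (max / min / average) to get three scores per point.
import numpy as np

def sample_scores(score_map, uv, k=1):
    # score_map: (H, W) 2D scores; uv: (N, 2) integer pixel coords of projected points.
    H, W = score_map.shape
    out = []
    for u, v in uv:
        i0, i1 = max(u - k, 0), min(u + k + 1, H)   # clamp the window at the border
        j0, j1 = max(v - k, 0), min(v + k + 1, W)
        patch = score_map[i0:i1, j0:j1]
        out.append((patch.max(), patch.min(), patch.mean()))
    return np.array(out)  # (N, 3): max / min / average score per 3D point

scores = np.arange(16, dtype=float).reshape(4, 4) / 15.0
uv = np.array([[1, 1], [3, 3]])
res = sample_scores(scores, uv)
print(res.round(3))
```

Pooling over a neighborhood rather than reading a single pixel is what makes the sampling "sparse-dense": each sparse 3D point still sees a dense patch of 2D features.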
step 4.4: exploiting the property that in two-dimensional semantic segmentation most adjacent pixels are assigned to the same category, a variable number of pixels around each sampled pixel are taken to interact many-to-one with the corresponding three-dimensional feature point; the maximum and minimum probability scores obtained in step 4.3 are used to construct a loss function against the three-dimensional semantic segmentation, as follows:
$$L=\sum_{n}\left[K\!\left(P_{n}^{2D,\max}\,\middle\|\,P_{n}^{3D}\right)+K\!\left(P_{n}^{2D,\min}\,\middle\|\,P_{n}^{3D}\right)\right]$$
where $P_{n}^{2D,\max}$ denotes the maximum probability score of the $n$-th 2D feature map sampling result, $P_{n}^{2D,\min}$ denotes the minimum probability score of the $n$-th 2D feature map sampling result, $P_{n}^{3D}$ denotes the probability score of the $n$-th point of the corresponding 3D point cloud, $K(\cdot\|\cdot)$ denotes the KL divergence, and $P^{2D}$ denotes the 2D feature map;
step 4.5: training with the loss function obtained in step 4.4, using the average probability score obtained in step 4.3 as the finally output semantic segmentation prediction;
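A plausible reading of the step 4.4 loss — KL divergence pulling the 3D per-point class distribution toward the range spanned by the max- and min-pooled 2D distributions — can be sketched as below. The direction of the KL terms and the summed reduction are assumptions, since the patent's formula survives only as a where-clause.

```python
# Assumed sketch of the step 4.4 cross-modal loss: KL between the 3D
# distribution and both the max- and min-pooled 2D distributions.
import numpy as np

def kl(p, q, eps=1e-12):
    # KL(p || q) for two discrete probability vectors.
    p = p + eps
    q = q + eps
    return float((p * np.log(p / q)).sum())

def cross_modal_loss(p2d_max, p2d_min, p3d):
    # each argument: (N, C) rows of class probabilities
    return sum(kl(p3d[n], p2d_max[n]) + kl(p3d[n], p2d_min[n])
               for n in range(p3d.shape[0]))

p3d = np.array([[0.7, 0.3]])
loss_same = cross_modal_loss(p3d, p3d, p3d)            # identical distributions
loss_diff = cross_modal_loss(np.array([[0.5, 0.5]]),
                             np.array([[0.9, 0.1]]), p3d)
print(loss_same, loss_diff)
```

When the 2D and 3D predictions agree the loss vanishes, and it grows as the pooled 2D distributions drift away from the 3D one, which matches the stated purpose of the max/min construction.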
step 5: carrying out feature fusion on the sparse-dense feature sampling result obtained in the step 4 in a channel splicing mode, and finally outputting a predicted segmentation result; the method comprises the following steps:
step 5.1: splicing the semantic segmentation result obtained in step 4.5, after its sparse sampling and pooling process, with the 3D feature map obtained in step 2.4 to obtain the 2D-3D image feature fusion result;
step 5.2: training the model on the source domain and the target domain, using a cross-entropy loss function on the 2D-3D image feature fusion result obtained in step 5.1; the loss function is as follows:
$$L_{seg}=-\sum_{n} y_{n}^{t}\left[\log P_{n}^{2D,avg}+\log P_{n}^{3D,t}\right]$$
where $y_{n}^{t}$ denotes the label of the $n$-th point on the target domain (the labels on the target dataset remain consistent with those on the source dataset, since the target dataset is the result of the cross-domain training to be tested), $P_{n}^{2D,avg}$ denotes the average probability score of the sampling result on the 2D feature map, and $P_{n}^{3D,t}$ denotes the probability score of the $n$-th point on the 3D feature map on the target domain.
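The step 5.2 training objective can be sketched as a cross-entropy of the averaged 2D scores and of the 3D scores against the point labels. The mean-per-modality reduction below is an assumption; the patent's formula is not reproduced in the surviving text.

```python
# Assumed sketch of the step 5.2 segmentation loss: cross-entropy of the
# averaged 2D scores plus cross-entropy of the 3D scores, per labeled point.
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    # probs: (N, C) class probabilities; labels: (N,) integer class ids.
    n = np.arange(len(labels))
    return float(-np.log(probs[n, labels] + eps).mean())

def seg_loss(p2d_avg, p3d, labels):
    return cross_entropy(p2d_avg, labels) + cross_entropy(p3d, labels)

p2d = np.array([[0.8, 0.2], [0.3, 0.7]])   # averaged 2D sampling scores
p3d = np.array([[0.6, 0.4], [0.2, 0.8]])   # 3D per-point scores
labels = np.array([0, 1])                  # ground-truth class per point
loss = seg_loss(p2d, p3d, labels)
print(round(loss, 4))
```

Both modalities are supervised by the same point labels, so the fused model is trained jointly rather than the 2D and 3D branches being optimized in isolation.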
Advantageous effects
1. The semantic segmentation method integrates a multi-scale fusion attention mechanism into the atrous spatial pyramid pooling structure, reducing the 2D feature-map features lost to multi-scale feature fusion and improving segmentation accuracy;
2. The method adds a feature fusion module on top of the feature matching between the 2D image and the 3D point cloud data and combines it with the deformable convolution pooling part of the original model, improving the prediction accuracy of the model;
drawings
FIG. 1 is a flow chart of a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data;
fig. 2 is a flowchart of extracting the feature map of the 3D point cloud data based on the SparseConvNet model according to the present invention;
FIG. 3 is a flow chart of a self-attention mechanism for 2D feature map through multi-scale feature fusion in accordance with the present invention;
FIG. 4 is a flow chart of the sparse-dense feature sampling obtained by projecting the 2D feature map using the 3D feature map in accordance with the present invention;
FIG. 5 shows two examples from the A2D2 dataset and the associated labeling results, where FIG. 5 (a) is the real image of the first example, FIG. 5 (b) is its labeled image segmentation result, FIG. 5 (c) is the real image of the second example, and FIG. 5 (d) is its labeled image segmentation result;
fig. 6 is a comparison of a result of the semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data with other models.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples.
Examples
Autonomous driving is a product of the deep fusion of the automobile industry with new-generation information technologies such as artificial intelligence, the Internet of Things and high-performance computing, and it is a main direction of intelligent, networked development in the global automotive and transportation field. Although autonomous vehicles are developing well, bottlenecks remain in core technologies such as sensor perception, control decisions, vehicle interaction and road condition recognition; in road condition recognition in particular, a vehicle must recognize surrounding obstacles, traffic signals, pedestrians and the states of other vehicles, yet few autonomous vehicles achieve this so far. Research based on autonomous driving can also be extended to other fields such as multi-modal image registration and three-dimensional visualization.
The input data of the present embodiment are a 2D image and its corresponding 3D point cloud data. The image and point cloud data input from the source domain are labeled, while those input from the target domain are unlabeled. After the data are input, a feature map of each modality must be obtained: before the data are passed to the classifier, they are first fed through their corresponding segmentation networks to produce feature maps of a suitable size.
In this embodiment, the A2D2 dataset is used as the source domain of our target object. This large autonomous driving dataset was introduced in a paper published by Audi in 2020, A2D2: Audi Autonomous Driving Dataset, with the aim of advancing commercial and academic research in computer vision and autonomous driving. Its data types include RGB images and the corresponding 3D point cloud data, recorded synchronously. A2D2 covers scenes of different categories, such as highways, villages and cities, and its semantic segmentation subset contains 41,277 annotated 2D pictures. Of these, 31,448 were taken facing front, 1,966 front-left, 1,797 front-right, 1,650 and 2,722 from the left and right sides respectively, and the remaining 1,694 from the rear. Each pixel of each picture carries the label of its corresponding class. The point cloud segmentation is generated by fusing the semantic pixel information with the LiDAR point cloud, so each 3D point is assigned an object class label; this depends on accurate registration between camera and LiDAR. The dataset also provides annotations for 3D bounding boxes, which are outside the scope of this experiment. In this embodiment, 20 scenes with 40,335 pictures are used as the training set and 1 scene with 942 pictures as the test set. The sensor configuration of A2D2 consists of 6 cameras and five Velodyne VLP-16 LiDAR sensors, giving 360-degree coverage of the vehicle's surroundings. The dataset is very large: besides the annotated non-sequential data, it comprises 392,556 consecutive frames of unannotated sensor data.
The traffic participant instances annotated with semantic labels in the A2D2 dataset consist largely of cars, trucks and pedestrians; two examples are illustrated in fig. 5.
SemanticKITTI, provided by Behley et al. of the University of Bonn, Germany, serves as the target domain of our target object. It is a semantic segmentation dataset built on the KITTI Vision Odometry Benchmark, providing a large amount of useful data for semantic segmentation based on vehicle-mounted LiDAR. Its scene categories include inner-city traffic areas, residential areas, highways and rural lanes in Germany. The original Odometry dataset consists of 22 scenes in total: scenes 00 to 10 form the training set and carry dense annotations; scenes 11 to 21 form the test set and contain a large number of complex traffic environments. In this embodiment, instead of using scenes 11-21, scenes 07 and 08 are used as the test set and the remaining scenes as the training set. SemanticKITTI contains 28 classes, covering both moving and non-moving objects; besides numerous traffic participants, the categories also cover ground content such as parking lots and sidewalks. Because SemanticKITTI consists of point cloud data, the experiment also requires the 2D pictures corresponding to the point clouds, so the picture data provided by KITTI Odometry were downloaded as well. The image part of KITTI Odometry mainly comprises calibration files, color images, gray-scale images and ground-truth trajectories; only the color images are used in this experiment.
The evaluation metric in this embodiment is the mean IoU: the intersection of the ground-truth region and the predicted region divided by their union (i.e. the ratio of the intersection of the two sets to their union). Equivalently, for each class it is the ratio of true positives to the sum of true positives, false positives and false negatives; IoU is computed for every class and then averaged. The calculation formula is:
$$mIoU=\frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}$$
where $i$ denotes the ground-truth class, $j$ the predicted class, $p_{ij}$ the number of points of class $i$ predicted as class $j$, and $p_{ji}$ the number of points of class $j$ predicted as class $i$.
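The mIoU described above reduces to a computation over a confusion matrix p, where p[i, j] counts points of true class i predicted as class j; a minimal sketch with an invented two-class matrix:

```python
# mIoU from a confusion matrix: IoU_c = TP_c / (TP_c + FN_c + FP_c), averaged.
import numpy as np

def mean_iou(p):
    tp = np.diag(p).astype(float)                 # true positives per class
    denom = p.sum(axis=1) + p.sum(axis=0) - tp    # TP + FN + FP per class
    return float((tp / denom).mean())

p = np.array([[3, 1],
              [1, 5]])   # rows: ground truth, columns: prediction (toy counts)
miou = mean_iou(p)
print(round(miou, 4))
```

Averaging per-class IoU rather than pooling all pixels keeps rare classes from being swamped by frequent ones, which is why mIoU is the standard segmentation metric here.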
The operation flow of the semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data in the embodiment is shown in the attached figure 1, and the method specifically comprises the following implementation steps:
step 1: extracting a feature map from the 2D picture by using the DeepLabv3 model to obtain a 2D feature map; the method comprises the following steps:
step 1.1: constructing block structures, wherein each block structure comprises convolution layers, batch normalization functions and linear rectification activation functions; the original picture passes through a convolution layer, a batch normalization function, a linear rectification activation function, a second convolution layer and a batch normalization function, the result is spliced with the original input, and the spliced result passes through a batch normalization function and a linear rectification activation function to serve as the output of the block structure; the original 2D image is input into the block structure, and the step length of the block structure is set to 4, so that an output picture is obtained;
step 1.2: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.1 into the block structure constructed in this step, and setting the step length of the block structure to 8 to obtain the output picture;
step 1.3: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.2 into the block structure constructed in this step, and setting the step length of the block structure to 16 to obtain the output picture;
step 1.4: constructing a block structure in the manner of step 1.1 with the dilation rate set to 2, inputting the output picture obtained in step 1.3 into the block structure constructed in this step, and setting the step length of the block structure to 16 to obtain the output picture;
step 1.5: constructing an atrous spatial pyramid pooling structure to process the output picture obtained in step 1.4, specifically comprising the following sub-steps:
step 1.5.1: constructing a 1×1 convolution layer and three 3×3 dilated convolutions, and processing the output picture obtained in step 1.4 to obtain picture features at several different scales;
step 1.5.2: constructing a global average pooling layer, and processing the output picture obtained in the step 1.4 to obtain the image level characteristics of the output picture;
step 2: extracting the feature map of the 3D point cloud data based on the SparseConvNet model to obtain a 3D feature map, wherein the overall flow of step 2 is shown in the attached figure 2;
step 2.1: preprocessing input 3D point cloud data, and arranging input tensors according to NCHW sequence, wherein non-zero data in the point cloud data is defined as an activated input site;
step 2.2: constructing a convolution kernel with a kernel size of 3×3;
step 2.3: establishing a sequence number–coordinate hash table for the input tensor and the output tensor; an input hash table Hash_in is established first, in which key_in represents the coordinates of an input pixel and v_in represents the sequence number of that input pixel, each row representing an activated input site; the relevant pixel points of each pixel point of the output tensor are marked as P_out, and on this premise an output hash table Hash_out is constructed, in which key_out represents coordinates in the output tensor and v_out represents sequence numbers of the output tensor;
step 2.4: establishing a RuleBook, associating the sequence numbers in the input and output hash tables obtained in the step 2.3 so as to realize sparse convolution, and convolving the 3D point cloud data preprocessed in the step 2.1 with the convolution kernel constructed in the step 2.2 to obtain a 3D feature map;
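The hash-table and RuleBook bookkeeping of steps 2.1–2.4 can be sketched in a few lines. This is a simplified 2-D toy version under assumed conventions (the names `build_hash` and `build_rulebook` are ours, not SparseConvNet's API): only active input sites are enumerated, so the cost scales with the number of non-zero points rather than with the grid size.

```python
def build_hash(sites):
    """Hash_in: map each active site's coordinate (key) to a sequence number (v)."""
    return {coord: idx for idx, coord in enumerate(sites)}

def build_rulebook(hash_in, kernel_size=3, stride=1):
    """For every (kernel offset, active input site) pair, record which output
    site it contributes to.  Hash_out only contains sites reachable from an
    active input site, which is what keeps the convolution sparse."""
    half = kernel_size // 2
    hash_out = {}
    rulebook = []  # entries: (kernel_offset, v_in, v_out)
    for (x, y), v_in in hash_in.items():
        for dx in range(-half, half + 1):
            for dy in range(-half, half + 1):
                out_coord = ((x + dx) // stride, (y + dy) // stride)
                if out_coord not in hash_out:
                    hash_out[out_coord] = len(hash_out)
                rulebook.append(((dx, dy), v_in, hash_out[out_coord]))
    return hash_out, rulebook

# Two active input sites out of an arbitrarily large grid; everything else is skipped.
hash_in = build_hash([(4, 4), (10, 7)])
hash_out, rulebook = build_rulebook(hash_in)
print(len(hash_in), len(rulebook))  # 2 active sites -> 2 * 9 = 18 rules
```

At convolution time, each RuleBook entry gathers the feature of input site v_in, multiplies it by the kernel weight at the recorded offset, and scatter-adds into output site v_out.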
step 3, obtaining a 2D dependency feature map with global dependency relationships by applying a self-attention mechanism with multi-scale feature fusion to the 2D feature map obtained in the step 1, wherein the overall flow of step 3 is shown in figure 3;
step 3.1: calculating the pairwise similarities between the image features of multiple different scales obtained in the step 1.5.1 and the image-level features obtained in the step 1.5.2;
step 3.2: normalizing the similarities obtained in the step 3.1 with the softmax (normalized exponential) function, and using them as key values for a weighted summation to obtain the 2D dependency feature map;
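Steps 3.1–3.2 amount to dot-product attention: pairwise similarities are softmax-normalized and used as weights in a weighted sum over the features. A minimal sketch with toy feature vectors (the helper names `softmax` and `attend` are ours):

```python
import math

def softmax(scores):
    """Normalized exponential function of step 3.2."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product similarity of the query with each key, softmax-normalized,
    then a weighted sum over the value vectors."""
    sims = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(sims)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy multi-scale features (one vector per scale) and an image-level feature.
scale_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_level = [1.0, 0.0]
print(attend(image_level, scale_feats, scale_feats))
```

Because every scale attends to every other feature, each output position carries a global dependency on the whole set of multi-scale features, which is what step 3 feeds into step 4.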
step 4.1: processing the 2D dependency feature map with global dependency relationships obtained in the step 3.2 to obtain a related offset map;
step 4.2: constructing a deformable convolution layer, inputting the 2D dependency feature map obtained in the step 3.2 and the offset map obtained in the step 4.1 into the constructed deformable convolution layer, and obtaining three 2D feature maps through maximum, minimum and average pooling;
step 4.3: constructing a 2D-3D projection model, sampling the three 2D feature maps obtained in the step 4.2, and performing the final segmentation prediction after the feature matching process between the 2D and 3D feature maps is completed, obtaining maximum, minimum and average probability scores respectively;
step 4.4: utilizing the property that most adjacent pixels in two-dimensional semantic segmentation are divided into the same category, a variable number of pixels around the currently sampled pixel are considered for many-to-one interaction with the corresponding three-dimensional feature point, and the maximum and minimum probability scores obtained in the step 4.3 are used to construct a loss function computed against the three-dimensional semantic segmentation, the loss function being shown in formula (2);
wherein P_2D^(n,max) represents the maximum probability score of the nth 2D feature map sampling result, P_2D^(n,min) represents the minimum probability score of the nth 2D feature map sampling result, P_3D^(n) represents the probability score of the nth point of the corresponding 3D point cloud, KL(·,·) represents the KL divergence, and P_2D represents the 2D feature map;
step 4.5: training with the loss function constructed in the step 4.4, and taking the average probability score obtained in the step 4.3 as the finally output semantic segmentation prediction;
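Since formula (2) appears only as an image in the source, the following is a hedged reconstruction of the step 4.4 objective from the symbol definitions above: the KL divergence between the max/min 2D probability scores and the corresponding 3D probability score, averaged over the sampled points (the function names and the exact averaging are our assumptions).

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) for two discrete class distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_modal_loss(p2d_max, p2d_min, p3d):
    """Assumed layout of formula (2): for each sampled point n, penalize the
    divergence of both the max and the min 2D scores from the 3D score."""
    n = len(p3d)
    return sum(kl(p2d_max[i], p3d[i]) + kl(p2d_min[i], p3d[i])
               for i in range(n)) / n

# Toy class-probability vectors for two sampled points.
p2d_max = [[0.7, 0.3], [0.6, 0.4]]
p2d_min = [[0.5, 0.5], [0.4, 0.6]]
p3d     = [[0.6, 0.4], [0.5, 0.5]]
print(cross_modal_loss(p2d_max, p2d_min, p3d))
```

The loss is zero only when both 2D score extremes already agree with the 3D prediction, which is what drives the two modalities toward consistent segmentations.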
step 5, carrying out feature fusion on the sparse-dense feature sampling result obtained in the step 4 in a channel splicing mode, and finally outputting a predicted segmentation result; the method comprises the following steps:
step 5.1: splicing the semantic segmentation result obtained in the step 4.5, after the sparse sampling and pooling process, with the 3D feature map obtained in the step 2.4 to obtain a 2D-3D image feature fusion result;
step 5.2: training the model on the source domain and the target domain by applying a cross entropy loss function to the 2D-3D image feature fusion result obtained in the step 5.1, the loss function being shown in formula (3);
wherein y^(n) represents the label of the nth point on the target domain (the labels on the target data set and the source data set remain consistent, since the target data set is the result to be tested in cross-domain training), P_2D^(n,avg) represents the average probability score of the sampling result in the 2D feature map, and P_3D^(n) represents the probability score of the nth point of the 3D feature map on the target domain.
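Formula (3) is likewise only an image in the source; the following is a plausible sketch of the step 5.2 objective under the assumption that per-point cross entropy is applied to both the averaged 2D probability scores and the 3D probability scores (function names are ours).

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class over all points."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def fusion_loss(p2d_avg, p3d, labels):
    """Assumed layout of formula (3): supervise the averaged 2D scores and
    the 3D scores with the same per-point labels."""
    return cross_entropy(p2d_avg, labels) + cross_entropy(p3d, labels)

# Toy two-class scores for two points, with ground-truth labels 0 and 1.
p2d_avg = [[0.8, 0.2], [0.3, 0.7]]
p3d     = [[0.7, 0.3], [0.4, 0.6]]
labels  = [0, 1]
print(fusion_loss(p2d_avg, p3d, labels))
```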
The operation results of the method on the A2D2-SemanticKITTI data set are shown in figure 6.
in summary, the above embodiments are only examples of the present invention, and are not intended to limit the scope of the present invention, but any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (2)
1. The semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data is characterized by comprising the following steps of:
step 1: extracting a feature map from the 2D picture by using the DeepLabv3 model to obtain the 2D feature map, specifically comprising the following steps:
step 1.1: constructing block structures, wherein each block structure comprises a convolution layer, a batch normalization function and a linear rectification (ReLU) activation function; an original picture passes sequentially through a convolution layer, a batch normalization function, a linear rectification activation function, a second convolution layer and a batch normalization function; the result is spliced with the original input, and the spliced result passes through a batch normalization function and a linear rectification activation function to serve as the output of the block structure; the original 2D image is input into the block structure, and the step length of the block structure is set to n, so as to obtain an output picture;
step 1.2: constructing a block structure according to the mode of the step 1.1, inputting the output picture obtained in the step 1.1 into the block structure constructed in the step, and setting the step length of the block structure as m to obtain the output picture;
step 1.3: constructing a block structure according to the mode of the step 1.1, inputting the output picture obtained in the step 1.2 into the block structure constructed in the step, and setting the step length of the block structure as q to obtain the output picture;
step 1.4: constructing a block structure according to the mode of the step 1.1, setting the dilation rate (void ratio) to t_1, inputting the output picture obtained in the step 1.3 into the block structure constructed in this step, and setting the step length of the block structure to q to obtain the output picture;
step 1.5: constructing an atrous spatial pyramid pooling (ASPP) module to process the output picture obtained in the step 1.4, specifically comprising the following sub-steps:
step 1.5.1: constructing a convolution layer with a size of a and a plurality of atrous (cavity) convolutions with a size of b, and processing the output picture obtained in the step 1.4 to obtain multiple picture features of different scales;
step 1.5.2: constructing a global average pooling layer, and processing the output picture obtained in the step 1.4 to obtain the image level characteristics of the output picture;
step 2: realizing the extraction of the feature map of the 3D point cloud data based on the SparseConvNet model to obtain a 3D feature map;
step 3, obtaining a 2D dependency feature map with global dependency relationships by applying a self-attention mechanism with multi-scale feature fusion to the 2D feature map obtained in the step 1, specifically comprising the following steps:
step 3.1: calculating the pairwise similarities between the image features of multiple different scales obtained in the step 1.5.1 and the image-level features obtained in the step 1.5.2;
step 3.2: normalizing the similarities obtained in the step 3.1 with the softmax (normalized exponential) function, and using them as key values for a weighted summation to obtain the 2D dependency feature map;
step 4, inputting the 2D dependency feature map obtained in the step 3.2 and the 3D feature map obtained in the step 2 into the deformable convolution, pooling layer and feature fusion modules to obtain the projection of the 3D feature map onto the 2D feature map and its sparse-dense feature sampling result, specifically comprising the following steps:
step 4.1: processing the 2D dependency feature map with global dependency relationships obtained in the step 3.2 to obtain a related offset map;
step 4.2: constructing a deformable convolution layer, inputting the 2D dependency feature map obtained in the step 3.2 and the offset map obtained in the step 4.1 into the constructed deformable convolution layer, and obtaining three 2D feature maps through maximum, minimum and average pooling;
step 4.3: constructing a 2D-3D projection model, sampling the three 2D feature maps obtained in the step 4.2, and performing the final segmentation prediction after the feature matching process between the 2D and 3D feature maps is completed, obtaining maximum, minimum and average probability scores respectively;
step 4.4: utilizing the property that most adjacent pixels in two-dimensional semantic segmentation are divided into the same category, a variable number of pixels around the currently sampled pixel are considered for many-to-one interaction with the corresponding three-dimensional feature point, and the maximum and minimum probability scores obtained in the step 4.3 are used to construct a loss function computed against the three-dimensional semantic segmentation, the loss function being as follows:
wherein P_2D^(n,max) represents the maximum probability score of the nth 2D feature map sampling result, P_2D^(n,min) represents the minimum probability score of the nth 2D feature map sampling result, P_3D^(n) represents the probability score of the nth point of the corresponding 3D point cloud, KL(·,·) represents the KL divergence, and P_2D represents the 2D feature map;
step 4.5: training with the loss function constructed in the step 4.4, and taking the average probability score obtained in the step 4.3 as the finally output semantic segmentation prediction;
step 5: and (3) carrying out feature fusion on the sparse-dense feature sampling result obtained in the step (4) in a channel splicing mode, and finally outputting a predicted segmentation result, wherein the method specifically comprises the following steps:
step 5.1: splicing the semantic segmentation result obtained in the step 4.5, after the sparse sampling and pooling process, with the 3D feature map obtained in the step 2.4 to obtain a 2D-3D image feature fusion result;
step 5.2: training the model on the source domain and the target domain by applying the cross entropy loss function to the 2D-3D image feature fusion result obtained in the step 5.1, the loss function being as follows:
wherein y^(n) represents the label of the nth point on the target domain (the labels on the target data set and the source data set remain consistent, since the target data set is the result to be tested in cross-domain training), P_2D^(n,avg) represents the average probability score of the sampling result in the 2D feature map, and P_3D^(n) represents the probability score of the nth point of the 3D feature map on the target domain.
2. The 3D point cloud data and 2D image data fusion matching semantic segmentation method as claimed in claim 1, wherein the step 2 specifically comprises:
step 2.1: preprocessing input 3D point cloud data, and arranging input tensors according to NCHW sequence, wherein non-zero data in the point cloud data is defined as an activated input site;
step 2.2: constructing a convolution kernel with the kernel size of c;
step 2.3: establishing a sequence number–coordinate hash table for the input tensor and the output tensor; an input hash table Hash_in is established first, in which key_in represents the coordinates of an input pixel and v_in represents the sequence number of that input pixel, each row representing an activated input site; the relevant pixel points of each pixel point of the output tensor are marked as P_out, and on this premise an output hash table Hash_out is constructed, in which key_out represents coordinates in the output tensor and v_out represents sequence numbers of the output tensor;
step 2.4: establishing a RuleBook, establishing a relation between the sequence numbers in the input and output hash tables obtained in the step 2.3 so as to realize sparse convolution, and convolving the 3D point cloud data preprocessed in the step 2.1 with the convolution kernel constructed in the step 2.2 to obtain a 3D feature map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211722227.8A CN116071747A (en) | 2022-12-30 | 2022-12-30 | 3D point cloud data and 2D image data fusion matching semantic segmentation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116071747A true CN116071747A (en) | 2023-05-05 |
Family
ID=86183101
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211722227.8A Pending CN116071747A (en) | 2022-12-30 | 2022-12-30 | 3D point cloud data and 2D image data fusion matching semantic segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116071747A (en) |
Cited By (5)

Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116258719A (en) * | 2023-05-15 | 2023-06-13 | 北京科技大学 | Flotation foam image segmentation method and device based on multi-mode data fusion |
CN116258719B (en) * | 2023-05-15 | 2023-07-18 | 北京科技大学 | Flotation foam image segmentation method and device based on multi-mode data fusion |
CN116258970A (en) * | 2023-05-15 | 2023-06-13 | 中山大学 | Geographic element identification method integrating remote sensing image and point cloud data |
CN116258970B (en) * | 2023-05-15 | 2023-08-08 | 中山大学 | Geographic element identification method integrating remote sensing image and point cloud data |
CN117953335A (en) * | 2024-03-27 | 2024-04-30 | 中国兵器装备集团自动化研究所有限公司 | Cross-domain migration continuous learning method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sengupta et al. | Urban 3d semantic modelling using stereo vision | |
US8620026B2 (en) | Video-based detection of multiple object types under varying poses | |
Chen et al. | Moving-object detection from consecutive stereo pairs using slanted plane smoothing | |
CN116071747A (en) | 3D point cloud data and 2D image data fusion matching semantic segmentation method | |
CN108830171B (en) | Intelligent logistics warehouse guide line visual detection method based on deep learning | |
Matzen et al. | Nyc3dcars: A dataset of 3d vehicles in geographic context | |
Hoppe et al. | Incremental Surface Extraction from Sparse Structure-from-Motion Point Clouds. | |
CN106951830B (en) | Image scene multi-object marking method based on prior condition constraint | |
Zhang et al. | CDNet: A real-time and robust crosswalk detection network on Jetson nano based on YOLOv5 | |
Nemoto et al. | Building change detection via a combination of CNNs using only RGB aerial imageries | |
Taran et al. | Impact of ground truth annotation quality on performance of semantic image segmentation of traffic conditions | |
Jensen et al. | Traffic light detection at night: Comparison of a learning-based detector and three model-based detectors | |
Li et al. | Enhancing 3-D LiDAR point clouds with event-based camera | |
Karkera et al. | Autonomous bot using machine learning and computer vision | |
Bu et al. | A UAV photography–based detection method for defective road marking | |
Liu et al. | Road segmentation with image-LiDAR data fusion in deep neural network | |
Yan et al. | Video scene parsing: An overview of deep learning methods and datasets | |
Zhang et al. | Improved Lane Detection Method Based on Convolutional Neural Network Using Self-attention Distillation. | |
CN111626971B (en) | Smart city CIM real-time imaging method with image semantic perception | |
Lertniphonphan et al. | 2d to 3d label propagation for object detection in point cloud | |
CN109740405B (en) | Method for detecting front window difference information of non-aligned similar vehicles | |
Tian et al. | Vision-based mapping of lane semantics and topology for intelligent vehicles | |
Sharma et al. | Deep Learning-Based Object Detection and Classification for Autonomous Vehicles in Different Weather Scenarios of Quebec, Canada | |
Acun et al. | D3net (divide and detect drivable area net): deep learning based drivable area detection and its embedded application | |
Ding et al. | A comprehensive approach for road marking detection and recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||