CN116071747A - 3D point cloud data and 2D image data fusion matching semantic segmentation method - Google Patents
- Publication number: CN116071747A (application CN202211722227.8A)
- Authority: CN (China)
- Prior art keywords: feature, feature map, point cloud, image, block structure
- Legal status: Pending (assumed; Google has not performed a legal analysis)
Classifications
- G06V20/70 — Labelling scene content, e.g. deriving syntactic or semantic representations
- G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82 — Arrangements for image or video recognition or understanding using neural networks
Abstract
The invention relates to a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data, and belongs to the technical field of image processing. The method comprises the following steps: extracting feature maps from the 2D image and the 3D point cloud, respectively, using a 2D image network and a 3D point cloud network into which a multi-scale fusion attention mechanism is integrated; using a feature fusion module to obtain the sparse-dense feature sampling result produced by projecting the 2D feature map; fusing the result obtained in S1 by channel concatenation; finally outputting the predicted segmentation result; and training the model on the target domain and the source domain. The multi-scale fusion attention mechanism reduces the 2D feature-map features lost to multi-scale feature fusion and improves segmentation accuracy; a feature fusion module is added on top of the feature matching between the 2D image and the 3D point cloud data and combined with the deformable convolution pooling part of the original model, improving the prediction accuracy of the model.
Description
Technical Field
The invention relates to a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data, and belongs to the technical field of image processing.
Background
With the rapid development of autonomous driving technology and intelligent robot research, a deep understanding of the surrounding environment has become indispensable, so accurate semantic segmentation is increasingly important. The key to fully understanding an image is to divide it into its constituent parts. Scene understanding has now progressed to pixel-level refinement: pixels are processed so that every entity in a picture can be detected and given a clear boundary.
With the continued development of computer vision, many researchers have taken an interest in semantic segmentation. Segmentation of static images has been studied in great depth, and many mature algorithms have been proposed. However, semantic segmentation of 2D images has some non-negligible shortcomings that have long troubled researchers, such as a strong dependence on illumination conditions, blurred segmentation of small-object edges, and confusion when segmenting occluded objects. 3D point cloud data can effectively mitigate these problems, and thanks to the development of LiDAR equipment in recent years, acquiring 3D point cloud data is no longer difficult. How to extract useful information from large amounts of 3D point cloud data so as to analyze a scene better is therefore an important topic in current computer vision research.
Object detection, classification and recognition based on 3D point cloud data are the main technologies for scene analysis, and semantic segmentation of 3D point clouds is their foundation. When a new scene is encountered, domain adaptation is an important means of understanding it in the absence of data annotation. Although 3D point cloud data has become easy to acquire and ever more varied, annotating it requires an enormous amount of time, which makes domain adaptation for 3D point cloud semantic segmentation attractive. Moreover, compared with 2D images, labeling point cloud data consumes considerable manpower and material resources, and the point clouds themselves are quite sparse for segmentation work; the many missing points inevitably affect the subsequent segmentation result.
Existing methods mainly downsample the densely featured 2D feature map to obtain a feature map as sparse as the 3D point cloud data, so that cross-modal interaction can be carried out after the 2D and 3D features are aligned. In this way, 2D data can be used for 3D domain-adaptive cross-modal learning, reducing the time wasted on 3D point cloud labeling. In conventional intra-domain cross-modal learning, however, the dense 2D pixel features are sampled down to the size of the sparse 3D point cloud features, so a large number of 2D features are discarded. Domain adaptation plays its role when applied to semantic segmentation: a deep-learning segmentation model that performs well on one dataset often degrades on another, making it difficult to predict unseen scenes accurately, and scene conditions such as illumination also strongly affect accuracy across scenes. A domain adaptation method therefore avoids the large amount of time consumed by manual data labeling. The present application aims to solve the above problems with a domain-adaptive approach.
Disclosure of Invention
The invention aims to solve the technical problem that semantic segmentation results are inaccurate when relying solely on 3D point cloud data, owing to the lack of labels for 3D point cloud data and its sparsity, and provides a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the invention discloses a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data.
Based on the existing xMUDA and DsCML models, the method adds a feature fusion part and modifies the segmentation network on top of their inter-domain and intra-domain cross-modal semantic segmentation models, thereby improving the results of inter-domain and intra-domain semantic segmentation networks that use multi-modal datasets. The method comprises the following: the source domain and target domain of a scene-to-scene dataset pair are learned jointly, exploiting the differences between the scenes of the source and target datasets to obtain a degree of domain adaptability, and the trained model is tested on test sets whose source and target domains are the same and on test sets where they differ; a multi-scale fusion attention mechanism is integrated into the atrous spatial pyramid pooling (ASPP) structure, reducing the 2D feature-map features lost to multi-scale feature fusion and improving segmentation accuracy; and a feature fusion module is added on top of the feature matching between the 2D image and the 3D point cloud data and combined with the deformable convolution pooling part of the original model, improving the prediction accuracy of the model.
The semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data specifically comprises the following steps:
step 1: extracting a feature map from the 2D picture by using the DeepLabv3 model to obtain a 2D feature map; the method comprises the following steps:
step 1.1: constructing block structures, wherein each block structure comprises convolution layers, batch normalization functions and linear rectification activation functions; the original picture passes through a convolution layer, a batch normalization function, a linear rectification activation function, a second convolution layer and a batch normalization function, the result is spliced with the original input, and the spliced result passes through a batch normalization function and a linear rectification activation function to serve as the output of the block structure; the original 2D image is input into the block structure, and the step length of the block structure is set to n, so that an output picture is obtained;
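The layer ordering of the block structure in step 1.1 can be sketched as follows. This is an illustrative toy, not the patented network: channel counts are invented, and the convolutions are reduced to 1×1 channel mixing so that only the ordering (conv → batch norm → ReLU → conv → batch norm → splice with the input → batch norm → ReLU) is demonstrated.

```python
# Minimal sketch of the step 1.1 "block structure" (hypothetical shapes; the
# patent fixes only the layer ordering, not channel counts or kernel sizes).
import numpy as np

def batch_norm(x, eps=1e-5):
    # Inference-style normalization over the spatial axes of each channel.
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W); stands in for a real conv.
    return np.tensordot(w, x, axes=([1], [0]))

def block(x, w1, w2):
    out = relu(batch_norm(conv1x1(x, w1)))      # conv -> BN -> ReLU
    out = batch_norm(conv1x1(out, w2))          # conv -> BN
    out = np.concatenate([out, x], axis=0)      # splice with the original input
    return relu(batch_norm(out))                # BN -> ReLU as the block output

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
y = block(x, rng.standard_normal((4, 4)), rng.standard_normal((4, 4)))
print(y.shape)  # channel dimension doubles because of the splice
```

Note that the splice (channel concatenation) doubles the channel count, unlike the additive shortcut of a classic residual block; the patent text explicitly says the result is spliced with the original input.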
step 1.2: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.1 into the block structure constructed in this step, and setting the step length of the block structure to m to obtain the output picture;
step 1.3: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.2 into the block structure constructed in this step, and setting the step length of the block structure to q to obtain the output picture;
step 1.4: constructing a block structure in the manner of step 1.1 with the dilation rate set to t1, inputting the output picture obtained in step 1.3 into the block structure constructed in this step, and setting the step length of the block structure to q to obtain the output picture;
step 1.5: constructing an atrous spatial pyramid pooling structure to process the output picture obtained in step 1.4, specifically comprising the following sub-steps:
step 1.5.1: constructing a convolution layer of size a and several dilated convolutions of size b, and processing the output picture obtained in step 1.4 to obtain picture features at several different scales;
step 1.5.2: constructing a global average pooling layer, and processing the output picture obtained in the step 1.4 to obtain the image level characteristics of the output picture;
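Step 1.5 (the ASPP structure) can be sketched as parallel branches — one 1×1 convolution, several dilated 3×3 convolutions, and a global average pooling branch. The sketch below is a single-channel numpy toy; the dilation rates and kernel weights are illustrative assumptions, not values from the patent.

```python
# Single-channel numpy sketch of the ASPP structure in step 1.5
# ("same" padding; one filter per branch; rates are illustrative).
import numpy as np

def dilated_conv3x3(x, k, rate):
    # Naive 3x3 dilated convolution with zero padding so output size == input size.
    H, W = x.shape
    xp = np.pad(x, rate)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            # Pick the 3x3 grid of samples spaced `rate` apart around pixel (i, j).
            patch = xp[i:i + 2 * rate + 1:rate, j:j + 2 * rate + 1:rate]
            out[i, j] = (patch * k).sum()
    return out

def aspp(x, rates=(6, 12, 18)):
    k3 = np.ones((3, 3)) / 9.0
    branches = [x * 1.0]                                     # 1x1 convolution branch
    branches += [dilated_conv3x3(x, k3, r) for r in rates]   # dilated 3x3 branches
    image_level = np.full_like(x, x.mean())                  # global average pooling branch
    return branches, image_level

x = np.arange(64, dtype=float).reshape(8, 8)
branches, image_level = aspp(x)
print(len(branches), image_level[0, 0])
```

Each branch keeps the spatial size of the input while looking at a different receptive field, which is what produces the "picture features at several different scales" of step 1.5.1.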
step 2: extracting the feature map of the 3D point cloud data based on the SparseConvNet model to obtain a 3D feature map; the method comprises the following steps:
step 2.1: preprocessing the input 3D point cloud data and arranging the input tensor in NCHW order, wherein the non-zero data in the point cloud are defined as activated input sites;
step 2.2: constructing a convolution kernel with the kernel size of c;
step 2.3: establishing serial-number/coordinate hash tables for the input and output tensors; first building an input hash table Hash_in, in which key_in stores the coordinates of an input pixel and v_in its serial number, each row representing an activated input site; recording the input pixels related to each pixel of the output tensor as P_out, and on that premise building an output hash table Hash_out, in which key_out stores a coordinate in the output tensor and v_out its serial number;
step 2.4: establishing a RuleBook that associates the serial numbers in the input and output hash tables obtained in step 2.3 so as to realize sparse convolution, and convolving the 3D point cloud data preprocessed in step 2.1 with the convolution kernel constructed in step 2.2 to obtain the 3D feature map;
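Steps 2.1–2.4 can be sketched in miniature: activated sites go into a coordinate→serial-number hash table, a RuleBook lists which (kernel offset, input index, output index) triples actually need computing, and the convolution touches only those triples. The coordinates, values, the 2D grid, and the submanifold assumption (output sites equal input sites) are all illustrative, not taken from the patent.

```python
# Toy sparse convolution via hash tables and a RuleBook (2D grid for brevity).
import numpy as np

points = {(1, 1): 2.0, (2, 2): 3.0, (4, 1): 1.0}            # activated (non-zero) input sites
hash_in = {coord: idx for idx, coord in enumerate(points)}   # key_in -> v_in

# Submanifold-style assumption: outputs live on the same activated sites,
# so the output hash table mirrors the input one here.
hash_out = dict(hash_in)                                     # key_out -> v_out

# RuleBook: for every kernel offset, which input index feeds which output index.
offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]  # 3x3 kernel
rulebook = []  # entries: (kernel_offset, v_in, v_out)
for (i, j), v_out in hash_out.items():
    for di, dj in offsets:
        nb = (i + di, j + dj)
        if nb in hash_in:                                    # only active neighbours matter
            rulebook.append(((di, dj), hash_in[nb], v_out))

# Apply an all-ones 3x3 kernel via the RuleBook: only listed pairs are computed.
kernel = {off: 1.0 for off in offsets}
vals = list(points.values())
out = np.zeros(len(hash_out))
for off, v_in, v_out in rulebook:
    out[v_out] += kernel[off] * vals[v_in]
print(out)
```

The empty grid cells never enter the computation, which is the point of the hash-table/RuleBook construction: work scales with the number of activated sites rather than with the full tensor volume.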
step 3: applying the self-attention mechanism with multi-scale feature fusion to the 2D feature map obtained in step 1 to obtain a 2D dependency feature map with global dependencies; the method comprises the following steps:
step 3.1: calculating the pairwise similarities among the picture features of different scales obtained in step 1.5.1 and the image-level features obtained in step 1.5.2;
step 3.2: normalizing the similarities obtained in step 3.1 with the normalized exponential (softmax) function and using them as key values for a weighted sum to obtain the 2D dependency feature map;
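Steps 3.1–3.2 amount to a small attention computation: score each branch feature against the image-level feature, softmax the scores, and take the weighted sum. The sketch below flattens features to short vectors and uses dot-product similarity, which is an assumption — the patent does not fix the similarity measure.

```python
# Sketch of steps 3.1-3.2: similarity -> softmax -> weighted sum.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())      # subtract max for numerical stability
    return e / e.sum()

def fuse(branch_feats, image_level_feat):
    # branch_feats: list of (D,) vectors; image_level_feat: (D,) query.
    sims = np.array([f @ image_level_feat for f in branch_feats])  # step 3.1
    w = softmax(sims)                                              # step 3.2 key values
    fused = sum(wi * f for wi, f in zip(w, branch_feats))          # weighted sum
    return fused, w

feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
q = np.array([1.0, 0.0])
fused, w = fuse(feats, q)
print(w.round(3), fused.round(3))
```

Branches that resemble the image-level feature receive larger weights, so multi-scale branches that would otherwise be diluted by the fusion still contribute according to their relevance.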
step 4.1: processing the 2D dependency feature map with global dependencies obtained in step 3.2 to obtain the corresponding offset map;
step 4.2: constructing a deformable convolution layer, inputting the 2D dependency feature map obtained in step 3.2 together with the offset map obtained in step 4.1 into the constructed deformable convolution layer, and obtaining three 2D feature maps through maximum, minimum and average pooling;
step 4.3: constructing a 2D-3D projection model, sampling the three 2D feature maps obtained in step 4.2, and performing the final segmentation prediction after the feature matching process between the two modalities is completed, obtaining the maximum, minimum and average probability scores, respectively;
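The sparse-dense sampling of step 4.3 can be sketched as follows: each 3D point, projected to a pixel, gathers a small neighborhood of the 2D score map and reduces it with maximum, minimum and average pooling to obtain its three probability scores. The window size and the integer projected coordinates are illustrative assumptions.

```python
# Sketch of steps 4.2-4.3: per projected 3D point, pool a small 2D neighborhood
# three ways (max / min / average) to get three scores per point.
import numpy as np

def sample_scores(score_map, uv, k=1):
    # score_map: (H, W) 2D scores; uv: (N, 2) integer pixel coords of projected points.
    H, W = score_map.shape
    out = []
    for u, v in uv:
        i0, i1 = max(u - k, 0), min(u + k + 1, H)   # clamp the window at the border
        j0, j1 = max(v - k, 0), min(v + k + 1, W)
        patch = score_map[i0:i1, j0:j1]
        out.append((patch.max(), patch.min(), patch.mean()))
    return np.array(out)  # (N, 3): max / min / average score per 3D point

scores = np.arange(16, dtype=float).reshape(4, 4) / 15.0
uv = np.array([[1, 1], [3, 3]])
res = sample_scores(scores, uv)
print(res.round(3))
```

Pooling over a neighborhood rather than reading a single pixel is what makes the sampling "sparse-dense": each sparse 3D point still sees a dense patch of 2D features.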
step 4.4: exploiting the property that in two-dimensional semantic segmentation most adjacent pixels are assigned to the same category, a variable number of pixels around each sampled pixel are taken to interact many-to-one with the corresponding three-dimensional feature point; the maximum and minimum probability scores obtained in step 4.3 are used to construct a loss function against the three-dimensional semantic segmentation, as follows:
$$L=\sum_{n}\left[K\!\left(P_{n}^{2D,\max}\,\middle\|\,P_{n}^{3D}\right)+K\!\left(P_{n}^{2D,\min}\,\middle\|\,P_{n}^{3D}\right)\right]$$
where $P_{n}^{2D,\max}$ denotes the maximum probability score of the $n$-th 2D feature map sampling result, $P_{n}^{2D,\min}$ denotes the minimum probability score of the $n$-th 2D feature map sampling result, $P_{n}^{3D}$ denotes the probability score of the $n$-th point of the corresponding 3D point cloud, $K(\cdot\|\cdot)$ denotes the KL divergence, and $P^{2D}$ denotes the 2D feature map;
step 4.5: training with the loss function obtained in step 4.4, using the average probability score obtained in step 4.3 as the finally output semantic segmentation prediction;
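A plausible reading of the step 4.4 loss — KL divergence pulling the 3D per-point class distribution toward the range spanned by the max- and min-pooled 2D distributions — can be sketched as below. The direction of the KL terms and the summed reduction are assumptions, since the patent's formula survives only as a where-clause.

```python
# Assumed sketch of the step 4.4 cross-modal loss: KL between the 3D
# distribution and both the max- and min-pooled 2D distributions.
import numpy as np

def kl(p, q, eps=1e-12):
    # KL(p || q) for two discrete probability vectors.
    p = p + eps
    q = q + eps
    return float((p * np.log(p / q)).sum())

def cross_modal_loss(p2d_max, p2d_min, p3d):
    # each argument: (N, C) rows of class probabilities
    return sum(kl(p3d[n], p2d_max[n]) + kl(p3d[n], p2d_min[n])
               for n in range(p3d.shape[0]))

p3d = np.array([[0.7, 0.3]])
loss_same = cross_modal_loss(p3d, p3d, p3d)            # identical distributions
loss_diff = cross_modal_loss(np.array([[0.5, 0.5]]),
                             np.array([[0.9, 0.1]]), p3d)
print(loss_same, loss_diff)
```

When the 2D and 3D predictions agree the loss vanishes, and it grows as the pooled 2D distributions drift away from the 3D one, which matches the stated purpose of the max/min construction.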
step 5: carrying out feature fusion on the sparse-dense feature sampling result obtained in the step 4 in a channel splicing mode, and finally outputting a predicted segmentation result; the method comprises the following steps:
step 5.1: splicing the semantic segmentation result obtained in step 4.5, after its sparse sampling and pooling process, with the 3D feature map obtained in step 2.4 to obtain the 2D-3D image feature fusion result;
step 5.2: training the model on the source domain and the target domain, using a cross-entropy loss function on the 2D-3D image feature fusion result obtained in step 5.1; the loss function is as follows:
$$L_{seg}=-\sum_{n} y_{n}^{t}\left[\log P_{n}^{2D,avg}+\log P_{n}^{3D,t}\right]$$
where $y_{n}^{t}$ denotes the label of the $n$-th point on the target domain (the labels on the target dataset remain consistent with those on the source dataset, since the target dataset is the result of the cross-domain training to be tested), $P_{n}^{2D,avg}$ denotes the average probability score of the sampling result on the 2D feature map, and $P_{n}^{3D,t}$ denotes the probability score of the $n$-th point on the 3D feature map on the target domain.
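The step 5.2 training objective can be sketched as a cross-entropy of the averaged 2D scores and of the 3D scores against the point labels. The mean-per-modality reduction below is an assumption; the patent's formula is not reproduced in the surviving text.

```python
# Assumed sketch of the step 5.2 segmentation loss: cross-entropy of the
# averaged 2D scores plus cross-entropy of the 3D scores, per labeled point.
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    # probs: (N, C) class probabilities; labels: (N,) integer class ids.
    n = np.arange(len(labels))
    return float(-np.log(probs[n, labels] + eps).mean())

def seg_loss(p2d_avg, p3d, labels):
    return cross_entropy(p2d_avg, labels) + cross_entropy(p3d, labels)

p2d = np.array([[0.8, 0.2], [0.3, 0.7]])   # averaged 2D sampling scores
p3d = np.array([[0.6, 0.4], [0.2, 0.8]])   # 3D per-point scores
labels = np.array([0, 1])                  # ground-truth class per point
loss = seg_loss(p2d, p3d, labels)
print(round(loss, 4))
```

Both modalities are supervised by the same point labels, so the fused model is trained jointly rather than the 2D and 3D branches being optimized in isolation.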
Advantageous effects
1. The semantic segmentation method integrates a multi-scale fusion attention mechanism into the atrous spatial pyramid pooling structure, reducing the 2D feature-map features lost to multi-scale feature fusion and improving segmentation accuracy;
2. The method adds a feature fusion module on top of the feature matching between the 2D image and the 3D point cloud data and combines it with the deformable convolution pooling part of the original model, improving the prediction accuracy of the model;
drawings
FIG. 1 is a flow chart of a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data;
fig. 2 is a flowchart of extracting the feature map of the 3D point cloud data based on the SparseConvNet model according to the present invention;
FIG. 3 is a flow chart of a self-attention mechanism for 2D feature map through multi-scale feature fusion in accordance with the present invention;
FIG. 4 is a flow chart of the sparse-dense feature sampling obtained by projecting the 2D feature map using the 3D feature map in accordance with the present invention;
FIG. 5 shows two examples from the A2D2 dataset and the associated labeling results, where FIG. 5 (a) is the real image of the first example, FIG. 5 (b) is its labeled image segmentation result, FIG. 5 (c) is the real image of the second example, and FIG. 5 (d) is its labeled image segmentation result;
fig. 6 is a comparison of a result of the semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data with other models.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples.
Examples
Autonomous driving is a product of the deep fusion of the automobile industry with new-generation information technologies such as artificial intelligence, the Internet of Things and high-performance computing, and it is a main direction of intelligent, networked development in the global automotive and transportation field. Although autonomous vehicles are developing well, bottlenecks remain in core technologies such as sensor perception, control decisions, vehicle interaction and road condition recognition; in road condition recognition in particular, a vehicle must recognize surrounding obstacles, traffic signals, pedestrians and the states of other vehicles, yet few autonomous vehicles achieve this so far. Research based on autonomous driving can also be extended to other fields such as multi-modal image registration and three-dimensional visualization.
The input data of the present embodiment are a 2D image and its corresponding 3D point cloud data. The image and point cloud data input from the source domain are labeled, while those input from the target domain are unlabeled. After the data are input, a feature map of each modality must be obtained: before the data are passed to the classifier, they are first fed through their corresponding segmentation networks to produce feature maps of a suitable size.
In this embodiment, the A2D2 dataset is used as the source domain of our target object. This large autonomous driving dataset was introduced in a paper published by Audi in 2020, A2D2: Audi Autonomous Driving Dataset, with the aim of advancing commercial and academic research in computer vision and autonomous driving. Its data types include RGB images and the corresponding 3D point cloud data, recorded synchronously. A2D2 covers scenes of different categories, such as highways, villages and cities, and its semantic segmentation subset contains 41,277 annotated 2D pictures. Of these, 31,448 were taken facing front, 1,966 front-left, 1,797 front-right, 1,650 and 2,722 from the left and right sides respectively, and the remaining 1,694 from the rear. Each pixel of each picture carries the label of its corresponding class. The point cloud segmentation is generated by fusing the semantic pixel information with the LiDAR point cloud, so each 3D point is assigned an object class label; this depends on accurate registration between camera and LiDAR. The dataset also provides annotations for 3D bounding boxes, which are outside the scope of this experiment. In this embodiment, 20 scenes with 40,335 pictures are used as the training set and 1 scene with 942 pictures as the test set. The sensor configuration of A2D2 consists of 6 cameras and five Velodyne VLP-16 LiDAR sensors, giving 360-degree coverage of the vehicle's surroundings. The dataset is very large: besides the annotated non-sequential data, it comprises 392,556 consecutive frames of unannotated sensor data.
The traffic participant instances annotated with semantic labels in the A2D2 dataset consist largely of cars, trucks and pedestrians; two examples are illustrated in fig. 5.
SemanticKITTI, provided by Behley et al. of the University of Bonn, Germany, serves as the target domain of our target object. It is a semantic segmentation dataset built on the KITTI Vision Odometry Benchmark, providing a large amount of useful data for semantic segmentation based on vehicle-mounted LiDAR. Its scene categories include inner-city traffic areas, residential areas, highways and rural lanes in Germany. The original Odometry dataset consists of 22 scenes in total: scenes 00 to 10 form the training set and carry dense annotations; scenes 11 to 21 form the test set and contain a large number of complex traffic environments. In this embodiment, instead of using scenes 11-21, scenes 07 and 08 are used as the test set and the remaining scenes as the training set. SemanticKITTI contains 28 classes, covering both moving and non-moving objects; besides numerous traffic participants, the categories also cover ground content such as parking lots and sidewalks. Because SemanticKITTI consists of point cloud data, the experiment also requires the 2D pictures corresponding to the point clouds, so the picture data provided by KITTI Odometry were downloaded as well. The image part of KITTI Odometry mainly comprises calibration files, color images, gray-scale images and ground-truth trajectories; only the color images are used in this experiment.
The evaluation metric in this embodiment is the mean IoU: the intersection of the ground-truth region and the predicted region divided by their union (i.e. the ratio of the intersection of the two sets to their union). Equivalently, for each class it is the ratio of true positives to the sum of true positives, false positives and false negatives; IoU is computed for every class and then averaged. The calculation formula is:
$$mIoU=\frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}$$
where $i$ denotes the ground-truth class, $j$ the predicted class, $p_{ij}$ the number of points of class $i$ predicted as class $j$, and $p_{ji}$ the number of points of class $j$ predicted as class $i$.
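The mIoU described above reduces to a computation over a confusion matrix p, where p[i, j] counts points of true class i predicted as class j; a minimal sketch with an invented two-class matrix:

```python
# mIoU from a confusion matrix: IoU_c = TP_c / (TP_c + FN_c + FP_c), averaged.
import numpy as np

def mean_iou(p):
    tp = np.diag(p).astype(float)                 # true positives per class
    denom = p.sum(axis=1) + p.sum(axis=0) - tp    # TP + FN + FP per class
    return float((tp / denom).mean())

p = np.array([[3, 1],
              [1, 5]])   # rows: ground truth, columns: prediction (toy counts)
miou = mean_iou(p)
print(round(miou, 4))
```

Averaging per-class IoU rather than pooling all pixels keeps rare classes from being swamped by frequent ones, which is why mIoU is the standard segmentation metric here.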
The operation flow of the semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data in the embodiment is shown in the attached figure 1, and the method specifically comprises the following implementation steps:
step 1: extracting a feature map from the 2D picture by using the DeepLabv3 model to obtain a 2D feature map; the method comprises the following steps:
step 1.1: constructing block structures, wherein each block structure comprises convolution layers, batch normalization functions and linear rectification activation functions; the original picture passes through a convolution layer, a batch normalization function, a linear rectification activation function, a second convolution layer and a batch normalization function, the result is spliced with the original input, and the spliced result passes through a batch normalization function and a linear rectification activation function to serve as the output of the block structure; the original 2D image is input into the block structure, and the step length of the block structure is set to 4, so that an output picture is obtained;
step 1.2: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.1 into the block structure constructed in this step, and setting the step length of the block structure to 8 to obtain the output picture;
step 1.3: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.2 into the block structure constructed in this step, and setting the step length of the block structure to 16 to obtain the output picture;
step 1.4: constructing a block structure in the manner of step 1.1 with the dilation rate set to 2, inputting the output picture obtained in step 1.3 into the block structure constructed in this step, and setting the step length of the block structure to 16 to obtain the output picture;
step 1.5: constructing an atrous spatial pyramid pooling structure to process the output picture obtained in step 1.4, specifically comprising the following sub-steps:
step 1.5.1: constructing a 1×1 convolution layer and three 3×3 dilated convolutions, and processing the output picture obtained in step 1.4 to obtain picture features at several different scales;
step 1.5.2: constructing a global average pooling layer, and processing the output picture obtained in the step 1.4 to obtain the image level characteristics of the output picture;
step 2: extracting the feature map of the 3D point cloud data based on the SparseConvNet model to obtain a 3D feature map, wherein the overall flow of step 2 is shown in the attached figure 2;
step 2.1: preprocessing input 3D point cloud data, and arranging input tensors according to NCHW sequence, wherein non-zero data in the point cloud data is defined as an activated input site;
step 2.2: constructing a convolution kernel with a kernel size of 3×3;
step 2.3: establishing a sequence number–coordinate hash table for the input tensor and the output tensor; an input hash table Hash_in is established first, in which key_in represents the coordinates of an input pixel and v_in represents the sequence number of that input pixel, each row representing an activated input site; the relevant pixel points of each pixel point of the output tensor are marked as P_out, and on this premise an output hash table Hash_out is constructed, in which key_out represents coordinates in the output tensor and v_out represents sequence numbers of the output tensor;
step 2.4: establishing a RuleBook, associating the sequence numbers in the input and output hash tables obtained in the step 2.3 so as to realize sparse convolution, and convolving the 3D point cloud data preprocessed in the step 2.1 with the convolution kernel constructed in the step 2.2 to obtain a 3D feature map;
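The hash-table and RuleBook bookkeeping of steps 2.1–2.4 can be sketched in a few lines. This is a simplified 2-D toy version under assumed conventions (the names `build_hash` and `build_rulebook` are ours, not SparseConvNet's API): only active input sites are enumerated, so the cost scales with the number of non-zero points rather than with the grid size.

```python
def build_hash(sites):
    """Hash_in: map each active site's coordinate (key) to a sequence number (v)."""
    return {coord: idx for idx, coord in enumerate(sites)}

def build_rulebook(hash_in, kernel_size=3, stride=1):
    """For every (kernel offset, active input site) pair, record which output
    site it contributes to.  Hash_out only contains sites reachable from an
    active input site, which is what keeps the convolution sparse."""
    half = kernel_size // 2
    hash_out = {}
    rulebook = []  # entries: (kernel_offset, v_in, v_out)
    for (x, y), v_in in hash_in.items():
        for dx in range(-half, half + 1):
            for dy in range(-half, half + 1):
                out_coord = ((x + dx) // stride, (y + dy) // stride)
                if out_coord not in hash_out:
                    hash_out[out_coord] = len(hash_out)
                rulebook.append(((dx, dy), v_in, hash_out[out_coord]))
    return hash_out, rulebook

# Two active input sites out of an arbitrarily large grid; everything else is skipped.
hash_in = build_hash([(4, 4), (10, 7)])
hash_out, rulebook = build_rulebook(hash_in)
print(len(hash_in), len(rulebook))  # 2 active sites -> 2 * 9 = 18 rules
```

At convolution time, each RuleBook entry gathers the feature of input site v_in, multiplies it by the kernel weight at the recorded offset, and scatter-adds into output site v_out.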
step 3, obtaining a 2D dependency feature map with global dependency relationships by applying a self-attention mechanism with multi-scale feature fusion to the 2D feature map obtained in the step 1, wherein the overall flow of step 3 is shown in figure 3;
step 3.1: calculating the pairwise similarities between the image features of multiple different scales obtained in the step 1.5.1 and the image-level features obtained in the step 1.5.2;
step 3.2: normalizing the similarities obtained in the step 3.1 with the softmax (normalized exponential) function, and using them as key values for a weighted summation to obtain the 2D dependency feature map;
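Steps 3.1–3.2 amount to dot-product attention: pairwise similarities are softmax-normalized and used as weights in a weighted sum over the features. A minimal sketch with toy feature vectors (the helper names `softmax` and `attend` are ours):

```python
import math

def softmax(scores):
    """Normalized exponential function of step 3.2."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product similarity of the query with each key, softmax-normalized,
    then a weighted sum over the value vectors."""
    sims = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(sims)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy multi-scale features (one vector per scale) and an image-level feature.
scale_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_level = [1.0, 0.0]
print(attend(image_level, scale_feats, scale_feats))
```

Because every scale attends to every other feature, each output position carries a global dependency on the whole set of multi-scale features, which is what step 3 feeds into step 4.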
step 4.1: processing the 2D dependency feature map with global dependency relationships obtained in the step 3.2 to obtain a related offset map;
step 4.2: constructing a deformable convolution layer, inputting the 2D dependency feature map obtained in the step 3.2 and the offset map obtained in the step 4.1 into the constructed deformable convolution layer, and obtaining three 2D feature maps through maximum, minimum and average pooling;
step 4.3: constructing a 2D-3D projection model, sampling the three 2D feature maps obtained in the step 4.2, and performing the final segmentation prediction after the feature matching process between the 2D and 3D feature maps is completed, obtaining maximum, minimum and average probability scores respectively;
step 4.4: utilizing the property that most adjacent pixels in two-dimensional semantic segmentation are divided into the same category, a variable number of pixels around the currently sampled pixel are considered for many-to-one interaction with the corresponding three-dimensional feature point, and the maximum and minimum probability scores obtained in the step 4.3 are used to construct a loss function computed against the three-dimensional semantic segmentation, the loss function being shown in formula (2);
wherein P_2D^(n,max) represents the maximum probability score of the nth 2D feature map sampling result, P_2D^(n,min) represents the minimum probability score of the nth 2D feature map sampling result, P_3D^(n) represents the probability score of the nth point of the corresponding 3D point cloud, KL(·,·) represents the KL divergence, and P_2D represents the 2D feature map;
step 4.5: training with the loss function constructed in the step 4.4, and taking the average probability score obtained in the step 4.3 as the finally output semantic segmentation prediction;
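Since formula (2) appears only as an image in the source, the following is a hedged reconstruction of the step 4.4 objective from the symbol definitions above: the KL divergence between the max/min 2D probability scores and the corresponding 3D probability score, averaged over the sampled points (the function names and the exact averaging are our assumptions).

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) for two discrete class distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_modal_loss(p2d_max, p2d_min, p3d):
    """Assumed layout of formula (2): for each sampled point n, penalize the
    divergence of both the max and the min 2D scores from the 3D score."""
    n = len(p3d)
    return sum(kl(p2d_max[i], p3d[i]) + kl(p2d_min[i], p3d[i])
               for i in range(n)) / n

# Toy class-probability vectors for two sampled points.
p2d_max = [[0.7, 0.3], [0.6, 0.4]]
p2d_min = [[0.5, 0.5], [0.4, 0.6]]
p3d     = [[0.6, 0.4], [0.5, 0.5]]
print(cross_modal_loss(p2d_max, p2d_min, p3d))
```

The loss is zero only when both 2D score extremes already agree with the 3D prediction, which is what drives the two modalities toward consistent segmentations.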
step 5, carrying out feature fusion on the sparse-dense feature sampling result obtained in the step 4 in a channel splicing mode, and finally outputting a predicted segmentation result; the method comprises the following steps:
step 5.1: splicing the semantic segmentation result obtained in the step 4.5, after the sparse sampling and pooling process, with the 3D feature map obtained in the step 2.4 to obtain a 2D-3D image feature fusion result;
step 5.2: training the model on the source domain and the target domain by applying a cross entropy loss function to the 2D-3D image feature fusion result obtained in the step 5.1, the loss function being shown in formula (3);
wherein y^(n) represents the label of the nth point on the target domain (the labels on the target data set and the source data set remain consistent, since the target data set is the result to be tested in cross-domain training), P_2D^(n,avg) represents the average probability score of the sampling result in the 2D feature map, and P_3D^(n) represents the probability score of the nth point of the 3D feature map on the target domain.
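Formula (3) is likewise only an image in the source; the following is a plausible sketch of the step 5.2 objective under the assumption that per-point cross entropy is applied to both the averaged 2D probability scores and the 3D probability scores (function names are ours).

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class over all points."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def fusion_loss(p2d_avg, p3d, labels):
    """Assumed layout of formula (3): supervise the averaged 2D scores and
    the 3D scores with the same per-point labels."""
    return cross_entropy(p2d_avg, labels) + cross_entropy(p3d, labels)

# Toy two-class scores for two points, with ground-truth labels 0 and 1.
p2d_avg = [[0.8, 0.2], [0.3, 0.7]]
p3d     = [[0.7, 0.3], [0.4, 0.6]]
labels  = [0, 1]
print(fusion_loss(p2d_avg, p3d, labels))
```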
The operation results of the method on the A2D2-SemanticKITTI data set are shown in figure 6.
in summary, the above embodiments are only examples of the present invention, and are not intended to limit the scope of the present invention, but any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (2)
1. The semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data is characterized by comprising the following steps of:
step 1: extracting a feature map from the 2D picture by using the DeepLabv3 model to obtain the 2D feature map, specifically comprising the following steps:
step 1.1: constructing block structures, wherein each block structure comprises a convolution layer, a batch normalization function and a linear rectification (ReLU) activation function; an original picture passes sequentially through a convolution layer, a batch normalization function, a linear rectification activation function, a second convolution layer and a batch normalization function; the result is spliced with the original input, and the spliced result passes through a batch normalization function and a linear rectification activation function to serve as the output of the block structure; the original 2D image is input into the block structure, and the step length of the block structure is set to n, so as to obtain an output picture;
step 1.2: constructing a block structure according to the mode of the step 1.1, inputting the output picture obtained in the step 1.1 into the block structure constructed in the step, and setting the step length of the block structure as m to obtain the output picture;
step 1.3: constructing a block structure according to the mode of the step 1.1, inputting the output picture obtained in the step 1.2 into the block structure constructed in the step, and setting the step length of the block structure as q to obtain the output picture;
step 1.4: constructing a block structure according to the mode of the step 1.1, setting the dilation rate (void ratio) to t_1, inputting the output picture obtained in the step 1.3 into the block structure constructed in this step, and setting the step length of the block structure to q to obtain the output picture;
step 1.5: constructing an atrous spatial pyramid pooling (ASPP) module to process the output picture obtained in the step 1.4, specifically comprising the following sub-steps:
step 1.5.1: constructing a convolution layer with a size of a and a plurality of atrous (cavity) convolutions with a size of b, and processing the output picture obtained in the step 1.4 to obtain multiple picture features of different scales;
step 1.5.2: constructing a global average pooling layer, and processing the output picture obtained in the step 1.4 to obtain the image level characteristics of the output picture;
step 2: realizing the extraction of the feature map of the 3D point cloud data based on the SparseConvNet model to obtain a 3D feature map;
step 3, obtaining a 2D dependency feature map with global dependency relationships by applying a self-attention mechanism with multi-scale feature fusion to the 2D feature map obtained in the step 1, specifically comprising the following steps:
step 3.1: calculating the pairwise similarities between the image features of multiple different scales obtained in the step 1.5.1 and the image-level features obtained in the step 1.5.2;
step 3.2: normalizing the similarities obtained in the step 3.1 with the softmax (normalized exponential) function, and using them as key values for a weighted summation to obtain the 2D dependency feature map;
step 4, inputting the 2D dependency feature map obtained in the step 3.2 and the 3D feature map obtained in the step 2 into the deformable convolution, pooling layer and feature fusion modules to obtain the projection of the 3D feature map onto the 2D feature map and its sparse-dense feature sampling result, specifically comprising the following steps:
step 4.1: processing the 2D dependency feature map with global dependency relationships obtained in the step 3.2 to obtain a related offset map;
step 4.2: constructing a deformable convolution layer, inputting the 2D dependency feature map obtained in the step 3.2 and the offset map obtained in the step 4.1 into the constructed deformable convolution layer, and obtaining three 2D feature maps through maximum, minimum and average pooling;
step 4.3: constructing a 2D-3D projection model, sampling the three 2D feature maps obtained in the step 4.2, and performing the final segmentation prediction after the feature matching process between the 2D and 3D feature maps is completed, obtaining maximum, minimum and average probability scores respectively;
step 4.4: utilizing the property that most adjacent pixels in two-dimensional semantic segmentation are divided into the same category, a variable number of pixels around the currently sampled pixel are considered for many-to-one interaction with the corresponding three-dimensional feature point, and the maximum and minimum probability scores obtained in the step 4.3 are used to construct a loss function computed against the three-dimensional semantic segmentation, the loss function being as follows:
wherein P_2D^(n,max) represents the maximum probability score of the nth 2D feature map sampling result, P_2D^(n,min) represents the minimum probability score of the nth 2D feature map sampling result, P_3D^(n) represents the probability score of the nth point of the corresponding 3D point cloud, KL(·,·) represents the KL divergence, and P_2D represents the 2D feature map;
step 4.5: training with the loss function constructed in the step 4.4, and taking the average probability score obtained in the step 4.3 as the finally output semantic segmentation prediction;
step 5: and (3) carrying out feature fusion on the sparse-dense feature sampling result obtained in the step (4) in a channel splicing mode, and finally outputting a predicted segmentation result, wherein the method specifically comprises the following steps:
step 5.1: splicing the semantic segmentation result obtained in the step 4.5, after the sparse sampling and pooling process, with the 3D feature map obtained in the step 2.4 to obtain a 2D-3D image feature fusion result;
step 5.2: training the model on the source domain and the target domain by applying the cross entropy loss function to the 2D-3D image feature fusion result obtained in the step 5.1, the loss function being as follows:
wherein y^(n) represents the label of the nth point on the target domain (the labels on the target data set and the source data set remain consistent, since the target data set is the result to be tested in cross-domain training), P_2D^(n,avg) represents the average probability score of the sampling result in the 2D feature map, and P_3D^(n) represents the probability score of the nth point of the 3D feature map on the target domain.
2. The 3D point cloud data and 2D image data fusion matching semantic segmentation method as claimed in claim 1, wherein the step 2 specifically comprises:
step 2.1: preprocessing input 3D point cloud data, and arranging input tensors according to NCHW sequence, wherein non-zero data in the point cloud data is defined as an activated input site;
step 2.2: constructing a convolution kernel with the kernel size of c;
step 2.3: establishing a sequence number–coordinate hash table for the input tensor and the output tensor; an input hash table Hash_in is established first, in which key_in represents the coordinates of an input pixel and v_in represents the sequence number of that input pixel, each row representing an activated input site; the relevant pixel points of each pixel point of the output tensor are marked as P_out, and on this premise an output hash table Hash_out is constructed, in which key_out represents coordinates in the output tensor and v_out represents sequence numbers of the output tensor;
step 2.4: establishing a RuleBook, establishing a relation between the sequence numbers in the input and output hash tables obtained in the step 2.3 so as to realize sparse convolution, and convolving the 3D point cloud data preprocessed in the step 2.1 with the convolution kernel constructed in the step 2.2 to obtain a 3D feature map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211722227.8A CN116071747A (en) | 2022-12-30 | 2022-12-30 | 3D point cloud data and 2D image data fusion matching semantic segmentation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116071747A true CN116071747A (en) | 2023-05-05 |
Family
ID=86183101
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211722227.8A Pending CN116071747A (en) | 2022-12-30 | 2022-12-30 | 3D point cloud data and 2D image data fusion matching semantic segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116071747A (en) |
Cited By (5)

Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116258719A (en) * | 2023-05-15 | 2023-06-13 | 北京科技大学 | Flotation foam image segmentation method and device based on multi-mode data fusion |
CN116258719B (en) * | 2023-05-15 | 2023-07-18 | 北京科技大学 | Flotation foam image segmentation method and device based on multi-mode data fusion |
CN116258970A (en) * | 2023-05-15 | 2023-06-13 | 中山大学 | Geographic element identification method integrating remote sensing image and point cloud data |
CN116258970B (en) * | 2023-05-15 | 2023-08-08 | 中山大学 | Geographic element identification method integrating remote sensing image and point cloud data |
CN117953335A (en) * | 2024-03-27 | 2024-04-30 | 中国兵器装备集团自动化研究所有限公司 | Cross-domain migration continuous learning method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sengupta et al. | Urban 3d semantic modelling using stereo vision | |
US8620026B2 (en) | Video-based detection of multiple object types under varying poses | |
Chen et al. | Moving-object detection from consecutive stereo pairs using slanted plane smoothing | |
CN116071747A (en) | 3D point cloud data and 2D image data fusion matching semantic segmentation method | |
CN108830171B (en) | Intelligent logistics warehouse guide line visual detection method based on deep learning | |
Matzen et al. | Nyc3dcars: A dataset of 3d vehicles in geographic context | |
Hoppe et al. | Incremental Surface Extraction from Sparse Structure-from-Motion Point Clouds. | |
CN106951830B (en) | Image scene multi-object marking method based on prior condition constraint | |
Zhang et al. | CDNet: A real-time and robust crosswalk detection network on Jetson nano based on YOLOv5 | |
Nemoto et al. | Building change detection via a combination of CNNs using only RGB aerial imageries | |
Taran et al. | Impact of ground truth annotation quality on performance of semantic image segmentation of traffic conditions | |
Jensen et al. | Traffic light detection at night: Comparison of a learning-based detector and three model-based detectors | |
Li et al. | Enhancing 3-D LiDAR point clouds with event-based camera | |
Karkera et al. | Autonomous bot using machine learning and computer vision | |
Bu et al. | A UAV photography–based detection method for defective road marking | |
Liu et al. | Road segmentation with image-LiDAR data fusion in deep neural network | |
Yan et al. | Video scene parsing: An overview of deep learning methods and datasets | |
Zhang et al. | Improved Lane Detection Method Based on Convolutional Neural Network Using Self-attention Distillation. | |
CN111626971B (en) | Smart city CIM real-time imaging method with image semantic perception | |
Lertniphonphan et al. | 2d to 3d label propagation for object detection in point cloud | |
CN109740405B (en) | Method for detecting front window difference information of non-aligned similar vehicles | |
Tian et al. | Vision-based mapping of lane semantics and topology for intelligent vehicles | |
Sharma et al. | Deep Learning-Based Object Detection and Classification for Autonomous Vehicles in Different Weather Scenarios of Quebec, Canada | |
Acun et al. | D3net (divide and detect drivable area net): deep learning based drivable area detection and its embedded application | |
Ding et al. | A comprehensive approach for road marking detection and recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||