CN116071747A - 3D point cloud data and 2D image data fusion matching semantic segmentation method - Google Patents

3D point cloud data and 2D image data fusion matching semantic segmentation method

Info

Publication number
CN116071747A
CN116071747A (application CN202211722227.8A)
Authority
CN
China
Prior art keywords
feature
feature map
point cloud
image
block structure
Prior art date
Legal status
Pending
Application number
CN202211722227.8A
Other languages
Chinese (zh)
Inventor
Xiang Chao (项超)
Li Xuesong (李雪松)
Yao Yuxiang (姚宇翔)
Jia Yuhan (贾雨涵)
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202211722227.8A priority Critical patent/CN116071747A/en
Publication of CN116071747A publication Critical patent/CN116071747A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data, and belongs to the technical field of image processing. The method comprises the following steps: extracting feature maps from the 2D image and the 3D point cloud respectively by using a 2D image network and a 3D point cloud network that incorporate a multi-scale fused attention mechanism; obtaining, with a feature fusion module, the sparse-dense feature sampling result produced by projecting onto the 2D feature map; fusing this result with the extracted features by channel concatenation; outputting the predicted segmentation result; and training the model on the target domain and the source domain. The multi-scale fused attention mechanism reduces the 2D feature-map information lost during multi-scale feature fusion and improves segmentation accuracy; a feature fusion module is added on top of the feature matching between the 2D image and the 3D point cloud data and is combined with the deformable convolution and pooling part of the original model, which improves the prediction accuracy of the model.

Description

3D point cloud data and 2D image data fusion matching semantic segmentation method
Technical Field
The invention relates to a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data, and belongs to the technical field of image processing.
Background
With the rapid development of autonomous driving technology and intelligent robotics research, a deep understanding of the surrounding environment has become indispensable, and accurate semantic segmentation is therefore increasingly important. The key to understanding an image as a whole is to split it into its constituent parts. Scene understanding has now reached pixel-level refinement: pixels are processed so that every entity in a picture can be detected and given clear boundaries.
With the continuous development of computer vision, semantic segmentation has attracted many researchers; semantic segmentation of static images has been studied in depth and many mature algorithms have been proposed. However, 2D image semantic segmentation has non-negligible drawbacks, such as strong dependence on illumination conditions, unclear segmentation of the edges of small objects, and confusion when segmenting occluded objects, which remain open problems for researchers. 3D point cloud data can effectively alleviate these problems, and thanks to the development of LiDAR devices in recent years, acquiring 3D point cloud data is no longer difficult; how to extract useful information from large amounts of 3D point cloud data, so as to analyze a scene better, is an important topic in current computer vision research.
Target detection, classification and identification based on 3D point cloud data are the main technologies for scene analysis, and semantic segmentation of 3D point clouds is the basis of these technologies. When facing a new scene, domain adaptation is an important factor for achieving understanding of the new scene in the absence of data annotations. Although acquiring 3D point cloud data has become easy and the variety of 3D point cloud data keeps growing, labeling 3D point cloud data takes a huge amount of time, so semantic segmentation domain adaptation for 3D point clouds is highly desirable. Moreover, compared with 2D images, labeling point cloud data consumes considerable manpower and material resources, and point cloud data are quite sparse for segmentation, where many missing points inevitably affect the subsequent segmentation result.
Existing methods mainly downsample the densely featured 2D feature map to obtain a feature map as sparse as the 3D point cloud data, with the aim of achieving cross-modal interaction after the 2D and 3D features are aligned. In this way, 3D domain-adaptive cross-modal learning can be performed with the help of 2D data, reducing the time wasted on 3D point cloud labeling. In conventional intra-domain cross-modal learning, because dense 2D pixel features are sampled into a map of the same size as the sparse 3D point cloud features, a large number of 2D features are discarded. Domain adaptation plays its role when applied to semantic segmentation: an important problem of deep-learning-based semantic segmentation is that a model that performs well on one dataset degrades when applied to another, so it struggles to predict unseen scenes accurately; in addition, conditions such as illumination strongly influence accuracy when the same model is applied to different scenes. A domain adaptation method is therefore used to avoid the large amount of time consumed by manual data labeling. The present application aims to solve the above problems with a domain-adaptive approach.
Disclosure of Invention
The invention aims to solve the technical defect that image semantic segmentation results are inaccurate when relying solely on 3D point cloud data, owing to the lack of relevant labels and the sparseness of the data, and provides a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The invention discloses a semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data. Based on the existing xMUDA and DsCML models, the method adds a feature fusion part and modifies the segmentation network of the inter-domain and intra-domain cross-modal-learning semantic segmentation models, so that the results of inter-domain and intra-domain semantic segmentation networks using a multi-modal dataset are improved. The method comprises the following: using the source domain and the target domain of a scene-to-scene dataset, joint learning is performed on a group of datasets by exploiting the differences between scenes in the source dataset and the target dataset, so as to obtain a certain degree of domain adaptability, and the trained model is tested on test sets in which the source domain and the target domain are the same and in which they differ; a multi-scale fused attention mechanism is integrated into the atrous spatial pyramid pooling (ASPP) structure, which reduces the 2D feature-map information lost during multi-scale feature fusion and improves segmentation accuracy; and a feature fusion module is added on top of the feature matching between the 2D image and the 3D point cloud data and combined with the deformable convolution and pooling part of the original model, which improves the prediction accuracy of the model.
The semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data specifically comprises the following steps:
step 1: extracting a feature map from the 2D picture by using the DeepLabv3 model to obtain a 2D feature map; the method comprises the following steps:
step 1.1: constructing block structures, wherein each block structure comprises convolution layers, batch normalization functions and rectified linear unit (ReLU) activation functions; the original picture passes through a convolution layer, batch normalization, ReLU activation, a second convolution layer and batch normalization, the result is concatenated with the original input, and the concatenation passes through batch normalization and ReLU activation to form the output of the block structure; the original 2D image is input into this block structure with its stride set to n, so as to obtain an output picture;
step 1.2: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.1 into the block structure constructed in this step, and setting its stride to m to obtain an output picture;
step 1.3: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.2 into the block structure constructed in this step, and setting its stride to q to obtain an output picture;
step 1.4: constructing a block structure in the manner of step 1.1 with the dilation rate set to t1, inputting the output picture obtained in step 1.3 into the block structure constructed in this step, and setting its stride to q to obtain an output picture;
step 1.5: constructing an atrous spatial pyramid pooling (ASPP) module to process the output picture obtained in step 1.4, which specifically comprises the following sub-steps:
step 1.5.1: constructing a convolution layer of size a and several dilated convolutions of size b, and processing the output picture obtained in step 1.4 to obtain picture features at several different scales;
step 1.5.2: constructing a global average pooling layer, and processing the output picture obtained in step 1.4 to obtain its image-level features;
step 2: extracting the feature map of the 3D point cloud data based on the SparseConvNet model to obtain a 3D feature map; the method comprises the following steps:
step 2.1: preprocessing the input 3D point cloud data and arranging the input tensor in NCHW order, wherein non-zero data in the point cloud data are defined as active input sites;
step 2.2: constructing a convolution kernel with a kernel size of c;
step 2.3: establishing serial-number-to-coordinate hash tables for the input tensor and the output tensor; first building the input hash table Hash_in, in which key_in represents the coordinates of an input pixel and v_in represents the serial number of the input pixel, each row corresponding to one active input site; the pixel points of the output tensor related to each input pixel point are recorded as P_out, and on this premise the output hash table Hash_out is built, in which key_out represents coordinates in the output tensor and v_out represents the serial number in the output tensor;
step 2.4: establishing a RuleBook that associates the serial numbers in the input and output hash tables obtained in step 2.3 so as to realize sparse convolution, and convolving the 3D point cloud data preprocessed in step 2.1 with the convolution kernel constructed in step 2.2 to obtain a 3D feature map;
step 3: applying a self-attention mechanism with multi-scale feature fusion to the 2D feature map obtained in step 1 to obtain a 2D dependency feature map with global dependency relations; the method comprises the following steps:
step 3.1: calculating the pairwise similarity between the picture features at the different scales obtained in step 1.5.1 and the image-level features obtained in step 1.5.2;
step 3.2: normalizing the similarities obtained in step 3.1 with a softmax function and using them as key values for a weighted summation to obtain the 2D dependency feature map;
step 4: inputting the 2D dependency feature map obtained in step 3.2 and the 3D feature map obtained in step 2 into the deformable convolution, the pooling layer and the feature fusion module to obtain the projection of the 3D points onto the 2D feature map and the corresponding sparse-dense feature sampling result; the method comprises the following steps:
step 4.1: processing the 2D dependency feature map with local dependency relations obtained in step 3.2 to obtain the corresponding offset map;
step 4.2: constructing a deformable convolution layer, inputting the 2D dependency feature map with local dependency relations obtained in step 3.2 and the offset map obtained in step 4.1 into the constructed deformable convolution layer, and obtaining three 2D feature maps through maximum, minimum and average pooling;
step 4.3: constructing a 2D-3D projection model, sampling the three 2D feature maps obtained in step 4.2, and performing the final segmentation prediction after the feature matching process of the two modalities is completed, obtaining the maximum, minimum and average probability scores respectively;
step 4.4: exploiting the property that most adjacent pixels in 2D semantic segmentation belong to the same category, a variable number of pixels around the currently sampled pixel are considered to interact many-to-one with the corresponding 3D feature point; the maximum and minimum probability scores obtained in step 4.3 are used to construct the loss between the 2D and 3D semantic segmentation, the loss function being as follows:
L_{2D\text{-}3D} = \sum_{n=1}^{N} \left[ K\left( P_{n}^{2D,\max} \,\|\, P_{n}^{3D} \right) + K\left( P_{n}^{2D,\min} \,\|\, P_{n}^{3D} \right) \right]

wherein P_{n}^{2D,\max} represents the maximum probability score of the nth 2D feature map sampling result, P_{n}^{2D,\min} represents the minimum probability score of the nth 2D feature map sampling result, P_{n}^{3D} represents the probability score of the nth point of the corresponding 3D point cloud, K(\cdot\|\cdot) denotes the KL divergence, and P^{2D} denotes a 2D feature map;
step 4.5: training with the loss function constructed in step 4.4 on the average probability score obtained in step 4.3, and taking the average probability score as the finally output semantic segmentation prediction;
step 5: performing feature fusion on the sparse-dense feature sampling result obtained in step 4 by channel concatenation, and finally outputting the predicted segmentation result; the method comprises the following steps:
step 5.1: concatenating the result obtained in step 4.5, after the semantic segmentation has gone through the sparse sampling and pooling process, with the 3D feature map obtained in step 2.4 to obtain the 2D-3D image feature fusion result;
step 5.2: training the model on the source domain and the target domain by applying a cross-entropy loss function to the 2D-3D image feature fusion result obtained in step 5.1, the loss function being as follows:
L_{seg} = -\frac{1}{N} \sum_{n=1}^{N} y_{n} \left( \log P_{n}^{2D,avg} + \log P_{n}^{3D} \right)

wherein y_{n} represents the label of the nth point on the target domain (the labels on the target dataset and the source dataset remain consistent, since the target dataset is the result of the cross-domain training to be tested), P_{n}^{2D,avg} represents the average probability score of the sampled results on the 2D feature map, and P_{n}^{3D} represents the probability score of the nth point on the 3D feature map on the target domain.
Advantageous effects
1. The semantic segmentation method integrates a multi-scale fused attention mechanism into the atrous spatial pyramid pooling structure, which reduces the 2D feature-map information lost during multi-scale feature fusion and improves segmentation accuracy;
2. The method adds a feature fusion module on top of the feature matching between the 2D image and the 3D point cloud data and combines it with the deformable convolution and pooling part of the original model, which improves the prediction accuracy of the model.
drawings
FIG. 1 is a flow chart of the semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data;
FIG. 2 is a flow chart of extracting the feature map of the 3D point cloud data based on the SparseConvNet model according to the present invention;
FIG. 3 is a flow chart of the self-attention mechanism with multi-scale feature fusion applied to the 2D feature map according to the present invention;
FIG. 4 is a flow chart of sparse-dense feature sampling from the projection of the 3D feature map onto the 2D feature map according to the present invention;
FIG. 5 shows two examples from the A2D2 dataset and the associated labeling results, where FIG. 5(a) is the real image of the first example, FIG. 5(b) is the labeled image segmentation result of the first example, FIG. 5(c) is the real image of the second example, and FIG. 5(d) is the labeled image segmentation result of the second example;
FIG. 6 is a comparison of the results of the semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data with other models.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples.
Examples
Autonomous driving is a product of the deep fusion of the automobile industry with new-generation information technologies such as artificial intelligence, the Internet of Things and high-performance computing, and is the main direction of intelligent and connected development in the global automobile and transportation field. Although autonomous vehicles are developing well, bottlenecks still exist in core technologies such as sensor perception, control and decision-making, vehicle interaction and road condition recognition; in particular, for road condition recognition, a vehicle needs to perceive and recognize the states of surrounding obstacles, traffic signals, pedestrians and other vehicles, but so far few autonomous vehicles achieve this. Research based on autonomous driving can also be extended to other fields such as multi-modal image registration and three-dimensional visualization.
The input data of the present embodiment are a 2D image and its corresponding 3D point cloud data. The image and point cloud data input from the source domain are labeled, while those input from the target domain are unlabeled. Therefore, after the data are input, feature maps of the corresponding data must be obtained; before the data are fed into the classifier, they must first be fed into their corresponding segmentation networks to obtain feature maps of suitable size.
In this embodiment, the A2D2 dataset is used as the source domain of our target object. This dataset is the large autonomous driving dataset proposed in the paper published by Audi in 2020, "A2D2: Audi Autonomous Driving Dataset", with the aim of advancing commercial and academic research in computer vision and autonomous driving. Its data types include RGB images and the corresponding 3D point cloud data, recorded synchronously. A2D2 covers different categories of scenes such as highways, villages and cities, and the dataset for semantic segmentation contains 41,277 annotated 2D pictures. Of these, 31,448 pictures were taken from the front, 1,966 from the front left, 1,797 from the front right, 1,650 and 2,722 from the left and right sides respectively, and the remaining 1,694 from the rear. Each pixel of each picture is given the label of the corresponding class. The point cloud segmentation is generated by fusing the semantic pixel information with the lidar point cloud, so that each 3D point is assigned an object category label, which depends on accurate registration between the camera and the LiDAR. The dataset also provides 3D bounding box annotations, which are not considered in this experiment. In this embodiment, the whole dataset is divided into 20 scenes with 40,335 pictures as the training set and 1 scene with 942 pictures as the test set. The sensor configuration of the A2D2 dataset consists of six cameras and five Velodyne VLP-16 LiDAR sensors, providing 360-degree coverage of the vehicle's surroundings. The dataset is also very large: in addition to the annotated non-sequential data, it contains 392,556 sequential frames of unannotated sensor data. The traffic participant instances annotated with semantic labels in the A2D2 dataset mainly comprise cars, trucks and pedestrians; two examples are illustrated in FIG. 5.
SemanticKITTI, used as the target domain of our target object, is provided by Behley et al. of the University of Bonn, Germany. It is a semantic segmentation dataset built on the KITTI Vision Odometry Benchmark and provides a large amount of useful data for semantic segmentation based on vehicle-mounted lidar. The scene categories of the SemanticKITTI dataset include inner-city traffic areas, residential areas, highways and rural lanes in Germany. The original Odometry dataset consists of 22 scenes in total: scenes 00 to 10 form the training set and are provided with dense annotations, while scenes 11 to 21 form the test set and contain a large number of complex traffic environments. In this embodiment, instead of using scenes 11 to 21, scenes 07 and 08 are used as the test set and the remaining scenes are used as the training set. The SemanticKITTI dataset contains 28 classes, covering both moving and non-moving objects; it includes not only numerous traffic participants but also ground content such as parking lots and sidewalks. Because SemanticKITTI consists of point cloud data, the 2D pictures corresponding to the point clouds are also needed in this experiment, so the picture data provided by KITTI Odometry were downloaded as well. The image part of KITTI Odometry mainly comprises calibration files, color images, grayscale images and ground-truth trajectories; only the color images are used in this experiment.
The evaluation index in this embodiment is the mean IoU (mIoU), i.e. the intersection of the ground-truth region and the predicted region divided by their union (the ratio of the intersection to the union of the two sets). Equivalently, this ratio is the number of true positives divided by the sum of true positives, false negatives and false positives; IoU is computed for each class and then averaged. The calculation formula is:

mIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}

where i denotes the ground-truth class, j denotes the predicted class, p_{ij} denotes the number of samples of class i predicted as class j, and p_{ji} denotes the number of samples of class j predicted as class i.
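For reference, the mIoU evaluation described above can be sketched in a few lines of Python; the confusion-matrix construction and the handling of classes that never occur are illustrative choices of this sketch, not part of the patent.

```python
# Minimal sketch of the mIoU computation, assuming integer class labels in [0, k]
# given as NumPy arrays of per-point (or per-pixel) ground truth and predictions.
import numpy as np

def mean_iou(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> float:
    # p[i, j] counts samples whose true class is i and predicted class is j
    p = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(p, (y_true, y_pred), 1)

    ious = []
    for i in range(num_classes):
        tp = p[i, i]                                 # true positives of class i
        denom = p[i, :].sum() + p[:, i].sum() - tp   # TP + FN + FP
        if denom > 0:                                # skip classes that never occur
            ious.append(tp / denom)
    return float(np.mean(ious))

# Example: two classes, four samples
print(mean_iou(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]), num_classes=2))
```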
The operation flow of the semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data in the embodiment is shown in the attached figure 1, and the method specifically comprises the following implementation steps:
step 1: extracting a feature map from the 2D picture by using the DeepLabv3 model to obtain a 2D feature map; the method comprises the following steps:
step 1.1: constructing block structures, wherein each block structure comprises convolution layers, batch normalization functions and rectified linear unit (ReLU) activation functions; the original picture passes through a convolution layer, batch normalization, ReLU activation, a second convolution layer and batch normalization, the result is concatenated with the original input, and the concatenation passes through batch normalization and ReLU activation to form the output of the block structure; the original 2D image is input into this block structure with its stride set to 4, so as to obtain an output picture;
step 1.2: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.1 into the block structure constructed in this step, and setting its stride to 8 to obtain an output picture;
step 1.3: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.2 into the block structure constructed in this step, and setting its stride to 16 to obtain an output picture;
step 1.4: constructing a block structure in the manner of step 1.1 with the dilation rate set to 2, inputting the output picture obtained in step 1.3 into the block structure constructed in this step, and setting its stride to 16 to obtain an output picture;
step 1.5: constructing an atrous spatial pyramid pooling (ASPP) module to process the output picture obtained in step 1.4, which specifically comprises the following sub-steps:
step 1.5.1: constructing a 1×1 convolution layer and three 3×3 dilated convolutions, and processing the output picture obtained in step 1.4 to obtain picture features at several different scales;
step 1.5.2: constructing a global average pooling layer, and processing the output picture obtained in step 1.4 to obtain its image-level features; a code sketch of this 2D branch is given below;
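A minimal PyTorch sketch of the 2D branch of steps 1.1-1.5 follows. The strides (4, 8, 16, 16), the dilation rate 2 of the fourth block and the 1×1-plus-three-3×3 ASPP layout come from the steps above; the channel widths, the 1×1 skip projection used to make the concatenation possible, and the ASPP dilation rates (6, 12, 18) are assumptions of the sketch, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """conv-BN-ReLU-conv-BN, concatenated with the (resized) input, then BN-ReLU (step 1.1)."""
    def __init__(self, in_ch, out_ch, stride, dilation=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride)  # resize the input for concatenation
        self.post = nn.Sequential(nn.BatchNorm2d(2 * out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = torch.cat([self.body(x), self.skip(x)], dim=1)  # splice with the input
        return self.post(y)

class ASPP(nn.Module):
    """1x1 conv, three dilated 3x3 convs and global average pooling (step 1.5)."""
    def __init__(self, ch):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(ch, ch, 1)] +
            [nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in (6, 12, 18)])
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        scales = [b(x) for b in self.branches]   # multi-scale picture features (step 1.5.1)
        image_level = self.gap(x)                # image-level feature (step 1.5.2)
        return scales, image_level

# Example usage
x = torch.randn(1, 3, 64, 64)
feats = Block(3, 16, stride=4)(x)                # step 1.1 block with stride 4
scales, img_level = ASPP(32)(feats)              # ASPP on the 32-channel block output
```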
step 2: extracting the feature map of the 3D point cloud data based on the SparseConvNet model to obtain a 3D feature map, the overall flow of step 2 being shown in FIG. 2;
step 2.1: preprocessing the input 3D point cloud data and arranging the input tensor in NCHW order, wherein non-zero data in the point cloud data are defined as active input sites;
step 2.2: constructing a convolution kernel with a kernel size of 3×3;
step 2.3: establishing serial-number-to-coordinate hash tables for the input tensor and the output tensor; first building the input hash table Hash_in, in which key_in represents the coordinates of an input pixel and v_in represents the serial number of the input pixel, each row corresponding to one active input site; the pixel points of the output tensor related to each input pixel point are recorded as P_out, and on this premise the output hash table Hash_out is built, in which key_out represents coordinates in the output tensor and v_out represents the serial number in the output tensor;
step 2.4: establishing a RuleBook that associates the serial numbers in the input and output hash tables obtained in step 2.3 so as to realize sparse convolution, and convolving the 3D point cloud data preprocessed in step 2.1 with the convolution kernel constructed in step 2.2 to obtain a 3D feature map; a sketch of this hash-table and RuleBook bookkeeping is given below;
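The hash-table and RuleBook bookkeeping of steps 2.1-2.4 can be sketched in plain Python as follows; the function name, the use of 2D coordinates for brevity, and the submanifold-style rule that outputs are produced only at active input sites are illustrative assumptions of the sketch.

```python
from collections import defaultdict

def build_rulebook(active_coords, kernel_size=3):
    offsets = range(-(kernel_size // 2), kernel_size // 2 + 1)
    # Hash_in: coordinate of an active input site -> input serial number v_in
    hash_in = {coord: idx for idx, coord in enumerate(active_coords)}
    # Hash_out: output coordinate -> output serial number v_out
    # (here the output sites are taken to be the active input sites, as in a
    #  submanifold sparse convolution; a dense variant would enumerate all neighbours)
    hash_out = {coord: idx for idx, coord in enumerate(active_coords)}

    # RuleBook: kernel offset -> list of (v_in, v_out) pairs connected by that weight
    rulebook = defaultdict(list)
    for out_coord, v_out in hash_out.items():
        for dy in offsets:
            for dx in offsets:
                in_coord = (out_coord[0] + dy, out_coord[1] + dx)
                if in_coord in hash_in:              # only active input sites contribute
                    rulebook[(dy, dx)].append((hash_in[in_coord], v_out))
    return hash_in, hash_out, dict(rulebook)

# Example: three active (non-zero) sites of a sparse tensor
h_in, h_out, rules = build_rulebook([(0, 0), (0, 1), (5, 5)])
print(len(rules[(0, 0)]))  # each active site connects at least to itself -> 3
```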
step 3: applying a self-attention mechanism with multi-scale feature fusion to the 2D feature map obtained in step 1 to obtain a 2D dependency feature map with global dependency relations, the overall flow of step 3 being shown in FIG. 3;
step 3.1: calculating the pairwise similarity between the picture features at the different scales obtained in step 1.5.1 and the image-level features obtained in step 1.5.2;
step 3.2: normalizing the similarities obtained in step 3.1 with a softmax function and using them as key values for a weighted summation to obtain the 2D dependency feature map; a sketch of this fusion is given below;
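A hedged sketch of the attention fusion of steps 3.1-3.2 follows; the patent specifies only a similarity, softmax normalization and a weighted sum, so the channel-wise dot-product similarity and the spatial pooling used to summarize each scale are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def attention_fuse(scale_feats, image_level):
    # scale_feats: list of tensors [B, C, H, W]; image_level: [B, C, 1, 1]
    sims = []
    for f in scale_feats:
        pooled = f.mean(dim=(2, 3))                                  # [B, C] summary of one scale
        sims.append((pooled * image_level.flatten(1)).sum(dim=1))    # similarity per sample
    weights = F.softmax(torch.stack(sims, dim=1), dim=1)             # softmax-normalized key values
    fused = sum(w.view(-1, 1, 1, 1) * f
                for w, f in zip(weights.unbind(dim=1), scale_feats)) # weighted summation
    return fused                                                     # 2D dependency feature map

# Example with two scale features
feats = [torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)]
fused = attention_fuse(feats, torch.randn(2, 64, 1, 1))
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```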
step 4: inputting the 2D dependency feature map obtained in step 3.2 and the 3D feature map obtained in step 2 into the deformable convolution and pooling layers and the feature fusion module to obtain the sparse-dense feature sampling result produced by projecting the 3D points onto the 2D feature map, the overall flow of step 4 being shown in FIG. 4;
step 4.1: processing the 2D dependency feature map with local dependency relations obtained in step 3.2 to obtain the corresponding offset map;
step 4.2: constructing a deformable convolution layer, inputting the 2D dependency feature map with local dependency relations obtained in step 3.2 and the offset map obtained in step 4.1 into the constructed deformable convolution layer, and obtaining three 2D feature maps through maximum, minimum and average pooling;
step 4.3: constructing a 2D-3D projection model, sampling the three 2D feature maps obtained in step 4.2, and performing the final segmentation prediction after the feature matching process of the two modalities is completed, obtaining the maximum, minimum and average probability scores respectively;
step 4.4: exploiting the property that most adjacent pixels in 2D semantic segmentation belong to the same category, a variable number of pixels around the currently sampled pixel are considered to interact many-to-one with the corresponding 3D feature point; the maximum and minimum probability scores obtained in step 4.3 are used to construct the loss between the 2D and 3D semantic segmentation, the loss function being shown in formula (2);
L_{2D\text{-}3D} = \sum_{n=1}^{N} \left[ K\left( P_{n}^{2D,\max} \,\|\, P_{n}^{3D} \right) + K\left( P_{n}^{2D,\min} \,\|\, P_{n}^{3D} \right) \right]   (2)

wherein P_{n}^{2D,\max} represents the maximum probability score of the nth 2D feature map sampling result, P_{n}^{2D,\min} represents the minimum probability score of the nth 2D feature map sampling result, P_{n}^{3D} represents the probability score of the nth point of the corresponding 3D point cloud, K(\cdot\|\cdot) denotes the KL divergence, and P^{2D} denotes a 2D feature map;
step 4.5: training with the loss function constructed in step 4.4 on the average probability score obtained in step 4.3, and taking the average probability score as the finally output semantic segmentation prediction; a sketch of the sampling and loss computation is given below;
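The sparse sampling of the pooled 2D score maps at the projected point locations and the loss of formula (2), in the reconstructed form given above, can be sketched as follows; the grid_sample-based lookup and the normalization of raw scores into probabilities are assumptions of the sketch rather than details fixed by the patent.

```python
import torch
import torch.nn.functional as F

def sample_scores(score_map, uv):
    # score_map: [B, C, H, W] class scores; uv: [B, N, 2] projected pixel coords in [-1, 1]
    grid = uv.unsqueeze(2)                                          # [B, N, 1, 2]
    sampled = F.grid_sample(score_map, grid, align_corners=False)   # [B, C, N, 1]
    return sampled.squeeze(-1).transpose(1, 2)                      # [B, N, C]

def cross_modal_loss(max_map, min_map, p3d_logits, uv):
    p2d_max = F.softmax(sample_scores(max_map, uv), dim=-1)   # max probability scores (step 4.3)
    p2d_min = F.softmax(sample_scores(min_map, uv), dim=-1)   # min probability scores (step 4.3)
    log_p3d = F.log_softmax(p3d_logits, dim=-1)               # 3D per-point log-probabilities
    # K(P2D_max || P3D) + K(P2D_min || P3D), averaged over the batch
    kl_max = F.kl_div(log_p3d, p2d_max, reduction='batchmean')
    kl_min = F.kl_div(log_p3d, p2d_min, reduction='batchmean')
    return kl_max + kl_min

# Example shapes: 2 images, 10 classes, 100 projected points each
loss = cross_modal_loss(torch.randn(2, 10, 64, 64), torch.randn(2, 10, 64, 64),
                        torch.randn(2, 100, 10), torch.rand(2, 100, 2) * 2 - 1)
print(loss.item())
```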
step 5: performing feature fusion on the sparse-dense feature sampling result obtained in step 4 by channel concatenation, and finally outputting the predicted segmentation result; the method comprises the following steps:
step 5.1: concatenating the result obtained in step 4.5, after the semantic segmentation has gone through the sparse sampling and pooling process, with the 3D feature map obtained in step 2.4 to obtain the 2D-3D image feature fusion result;
step 5.2: training the model on the source domain and the target domain by applying a cross-entropy loss function to the 2D-3D image feature fusion result obtained in step 5.1, the loss function being shown in formula (3);
L_{seg} = -\frac{1}{N} \sum_{n=1}^{N} y_{n} \left( \log P_{n}^{2D,avg} + \log P_{n}^{3D} \right)   (3)

wherein y_{n} represents the label of the nth point on the target domain (the labels on the target dataset and the source dataset remain consistent, since the target dataset is the result of the cross-domain training to be tested), P_{n}^{2D,avg} represents the average probability score of the sampled results on the 2D feature map, and P_{n}^{3D} represents the probability score of the nth point on the 3D feature map on the target domain.
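A minimal sketch of the channel-concatenation fusion of step 5.1 and the cross-entropy training step of step 5.2 is given below; the single-linear-layer classifier and the per-point integer labels are illustrative assumptions, since the patent only specifies concatenation along the channel dimension and a cross-entropy loss on the source and target domains.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, c2d, c3d, num_classes):
        super().__init__()
        self.classifier = nn.Linear(c2d + c3d, num_classes)

    def forward(self, feat2d_sampled, feat3d):
        # feat2d_sampled: [N, C2d] 2D features sampled at the projected points (step 4)
        # feat3d:         [N, C3d] per-point 3D features (step 2.4)
        fused = torch.cat([feat2d_sampled, feat3d], dim=1)   # channel splicing (step 5.1)
        return self.classifier(fused)                        # per-point segmentation logits

head = FusionHead(c2d=64, c3d=16, num_classes=10)
logits = head(torch.randn(100, 64), torch.randn(100, 16))
labels = torch.randint(0, 10, (100,))                        # per-point class labels
loss = nn.CrossEntropyLoss()(logits, labels)                 # cross-entropy of formula (3)
loss.backward()
print(float(loss))
```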
The results of the method on the A2D2-to-SemanticKITTI dataset pair, compared with other models, are shown in FIG. 6.
In summary, the above embodiments are only examples of the present invention and are not intended to limit the scope of protection of the present invention; any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention shall be included in the scope of protection of the present invention.

Claims (2)

1. The semantic segmentation method based on fusion matching of 3D point cloud data and 2D image data is characterized by comprising the following steps of:
step 1: extracting a feature map from the 2D picture by using the DeepLabv3 model to obtain the 2D feature map, which specifically comprises the following steps:
step 1.1: constructing block structures, wherein each block structure comprises convolution layers, batch normalization functions and rectified linear unit (ReLU) activation functions; the original picture passes through a convolution layer, batch normalization, ReLU activation, a second convolution layer and batch normalization, the result is concatenated with the original input, and the concatenation passes through batch normalization and ReLU activation to form the output of the block structure; the original 2D image is input into this block structure with its stride set to n, so as to obtain an output picture;
step 1.2: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.1 into the block structure constructed in this step, and setting its stride to m to obtain an output picture;
step 1.3: constructing a block structure in the manner of step 1.1, inputting the output picture obtained in step 1.2 into the block structure constructed in this step, and setting its stride to q to obtain an output picture;
step 1.4: constructing a block structure in the manner of step 1.1 with the dilation rate set to t1, inputting the output picture obtained in step 1.3 into the block structure constructed in this step, and setting its stride to q to obtain an output picture;
step 1.5: constructing an atrous spatial pyramid pooling module to process the output picture obtained in step 1.4, which specifically comprises the following sub-steps:
step 1.5.1: constructing a convolution layer of size a and several dilated convolutions of size b, and processing the output picture obtained in step 1.4 to obtain picture features at several different scales;
step 1.5.2: constructing a global average pooling layer, and processing the output picture obtained in step 1.4 to obtain its image-level features;
step 2: extracting the feature map of the 3D point cloud data based on the SparseConvNet model to obtain a 3D feature map;
step 3: applying a self-attention mechanism with multi-scale feature fusion to the 2D feature map obtained in step 1 to obtain a 2D dependency feature map with global dependency relations, which specifically comprises the following steps:
step 3.1: calculating the pairwise similarity between the picture features at the different scales obtained in step 1.5.1 and the image-level features obtained in step 1.5.2;
step 3.2: normalizing the similarities obtained in step 3.1 with a softmax function and using them as key values for a weighted summation to obtain the 2D dependency feature map;
step 4: inputting the 2D dependency feature map obtained in step 3.2 and the 3D feature map obtained in step 2 into the deformable convolution, the pooling layer and the feature fusion module to obtain the projection of the 3D points onto the 2D feature map and the corresponding sparse-dense feature sampling result, which specifically comprises the following steps:
step 4.1: processing the 2D dependency feature map with local dependency relations obtained in step 3.2 to obtain the corresponding offset map;
step 4.2: constructing a deformable convolution layer, inputting the 2D dependency feature map with local dependency relations obtained in step 3.2 and the offset map obtained in step 4.1 into the constructed deformable convolution layer, and obtaining three 2D feature maps through maximum, minimum and average pooling;
step 4.3: constructing a 2D-3D projection model, sampling the three 2D feature maps obtained in step 4.2, and performing the final segmentation prediction after the feature matching process of the two modalities is completed, obtaining the maximum, minimum and average probability scores respectively;
step 4.4: exploiting the property that most adjacent pixels in 2D semantic segmentation belong to the same category, a variable number of pixels around the currently sampled pixel are considered to interact many-to-one with the corresponding 3D feature point; the maximum and minimum probability scores obtained in step 4.3 are used to construct the loss between the 2D and 3D semantic segmentation, the loss function being as follows:
L_{2D\text{-}3D} = \sum_{n=1}^{N} \left[ K\left( P_{n}^{2D,\max} \,\|\, P_{n}^{3D} \right) + K\left( P_{n}^{2D,\min} \,\|\, P_{n}^{3D} \right) \right]

wherein P_{n}^{2D,\max} represents the maximum probability score of the nth 2D feature map sampling result, P_{n}^{2D,\min} represents the minimum probability score of the nth 2D feature map sampling result, P_{n}^{3D} represents the probability score of the nth point of the corresponding 3D point cloud, K(\cdot\|\cdot) denotes the KL divergence, and P^{2D} denotes a 2D feature map;
step 4.5: training with the loss function constructed in step 4.4 on the average probability score obtained in step 4.3, and taking the average probability score as the finally output semantic segmentation prediction;
step 5: performing feature fusion on the sparse-dense feature sampling result obtained in step 4 by channel concatenation, and finally outputting the predicted segmentation result, which specifically comprises the following steps:
step 5.1: concatenating the result obtained in step 4.5, after the semantic segmentation has gone through the sparse sampling and pooling process, with the 3D feature map obtained in step 2.4 to obtain the 2D-3D image feature fusion result;
step 5.2: training the model on the source domain and the target domain by applying a cross-entropy loss function to the 2D-3D image feature fusion result obtained in step 5.1, the loss function being as follows:
L_{seg} = -\frac{1}{N} \sum_{n=1}^{N} y_{n} \left( \log P_{n}^{2D,avg} + \log P_{n}^{3D} \right)

wherein y_{n} represents the label of the nth point on the target domain (the labels on the target dataset and the source dataset remain consistent, since the target dataset is the result of the cross-domain training to be tested), P_{n}^{2D,avg} represents the average probability score of the sampled results on the 2D feature map, and P_{n}^{3D} represents the probability score of the nth point on the 3D feature map on the target domain.
2. The 3D point cloud data and 2D image data fusion matching semantic segmentation method as claimed in claim 1, wherein the step 2 specifically comprises:
step 2.1: preprocessing the input 3D point cloud data and arranging the input tensor in NCHW order, wherein non-zero data in the point cloud data are defined as active input sites;
step 2.2: constructing a convolution kernel with a kernel size of c;
step 2.3: establishing serial-number-to-coordinate hash tables for the input tensor and the output tensor; first building the input hash table Hash_in, in which key_in represents the coordinates of an input pixel and v_in represents the serial number of the input pixel, each row corresponding to one active input site; the pixel points of the output tensor related to each input pixel point are recorded as P_out, and on this premise the output hash table Hash_out is built, in which key_out represents coordinates in the output tensor and v_out represents the serial number in the output tensor;
step 2.4: establishing a RuleBook that associates the serial numbers in the input and output hash tables obtained in step 2.3 so as to realize sparse convolution, and convolving the 3D point cloud data preprocessed in step 2.1 with the convolution kernel constructed in step 2.2 to obtain a 3D feature map.
CN202211722227.8A 2022-12-30 2022-12-30 3D point cloud data and 2D image data fusion matching semantic segmentation method Pending CN116071747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211722227.8A CN116071747A (en) 2022-12-30 2022-12-30 3D point cloud data and 2D image data fusion matching semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211722227.8A CN116071747A (en) 2022-12-30 2022-12-30 3D point cloud data and 2D image data fusion matching semantic segmentation method

Publications (1)

Publication Number Publication Date
CN116071747A true CN116071747A (en) 2023-05-05

Family

ID=86183101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211722227.8A Pending CN116071747A (en) 2022-12-30 2022-12-30 3D point cloud data and 2D image data fusion matching semantic segmentation method

Country Status (1)

Country Link
CN (1) CN116071747A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116258719A (en) * 2023-05-15 2023-06-13 北京科技大学 Flotation foam image segmentation method and device based on multi-mode data fusion
CN116258970A (en) * 2023-05-15 2023-06-13 中山大学 Geographic element identification method integrating remote sensing image and point cloud data
CN116258719B (en) * 2023-05-15 2023-07-18 北京科技大学 Flotation foam image segmentation method and device based on multi-mode data fusion
CN116258970B (en) * 2023-05-15 2023-08-08 中山大学 Geographic element identification method integrating remote sensing image and point cloud data
CN117953335A (en) * 2024-03-27 2024-04-30 中国兵器装备集团自动化研究所有限公司 Cross-domain migration continuous learning method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Sengupta et al. Urban 3d semantic modelling using stereo vision
US8620026B2 (en) Video-based detection of multiple object types under varying poses
Chen et al. Moving-object detection from consecutive stereo pairs using slanted plane smoothing
CN116071747A (en) 3D point cloud data and 2D image data fusion matching semantic segmentation method
CN108830171B (en) Intelligent logistics warehouse guide line visual detection method based on deep learning
Matzen et al. Nyc3dcars: A dataset of 3d vehicles in geographic context
Hoppe et al. Incremental Surface Extraction from Sparse Structure-from-Motion Point Clouds.
CN106951830B (en) Image scene multi-object marking method based on prior condition constraint
Zhang et al. CDNet: A real-time and robust crosswalk detection network on Jetson nano based on YOLOv5
Nemoto et al. Building change detection via a combination of CNNs using only RGB aerial imageries
Taran et al. Impact of ground truth annotation quality on performance of semantic image segmentation of traffic conditions
Jensen et al. Traffic light detection at night: Comparison of a learning-based detector and three model-based detectors
Li et al. Enhancing 3-D LiDAR point clouds with event-based camera
Karkera et al. Autonomous bot using machine learning and computer vision
Bu et al. A UAV photography–based detection method for defective road marking
Liu et al. Road segmentation with image-LiDAR data fusion in deep neural network
Yan et al. Video scene parsing: An overview of deep learning methods and datasets
Zhang et al. Improved Lane Detection Method Based on Convolutional Neural Network Using Self-attention Distillation.
CN111626971B (en) Smart city CIM real-time imaging method with image semantic perception
Lertniphonphan et al. 2d to 3d label propagation for object detection in point cloud
CN109740405B (en) Method for detecting front window difference information of non-aligned similar vehicles
Tian et al. Vision-based mapping of lane semantics and topology for intelligent vehicles
Sharma et al. Deep Learning-Based Object Detection and Classification for Autonomous Vehicles in Different Weather Scenarios of Quebec, Canada
Acun et al. D3net (divide and detect drivable area net): deep learning based drivable area detection and its embedded application
Ding et al. A comprehensive approach for road marking detection and recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination