CN117934831A - Three-dimensional semantic segmentation method based on camera and laser fusion - Google Patents
Three-dimensional semantic segmentation method based on camera and laser fusion
- Publication number
- CN117934831A (application CN202311872786.1A)
- Authority
- CN
- China
- Prior art keywords
- camera
- point cloud
- laser
- cloud data
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S17/00—Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
- G01S17/88—Lidar systems specially adapted for specific applications
- G01S17/89—Lidar systems specially adapted for specific applications for mapping or imaging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- Electromagnetism (AREA)
- Radar, Positioning & Navigation (AREA)
- Computer Networks & Wireless Communication (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Remote Sensing (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a three-dimensional semantic segmentation method based on camera and laser fusion, comprising the following steps: inputting a camera image and laser point cloud data into a camera module and a laser module, respectively, to extract features and obtain a camera image feature map and a laser point cloud data feature map; inputting the two feature maps into a fusion module for feature fusion to obtain fused camera image features and fused laser point cloud data features; inputting the fused features back into the camera module and the laser module, respectively, to obtain a camera image feature map and a laser point cloud data feature map; inputting these feature maps into a supervision module to calculate a loss function, and updating the parameter weights of the three-dimensional semantic segmentation network to obtain a trained three-dimensional semantic segmentation network; and acquiring a camera image and laser point cloud data and inputting them into the trained network to obtain semantic segmentation results for the laser point cloud data and the camera image. The method effectively combines the texture information of the image with the distance information of the laser and improves the accuracy of semantic segmentation.
Description
Technical Field
The invention relates to the technical field of autonomous driving and semantic segmentation, and in particular to a three-dimensional semantic segmentation method based on fusion of a camera and a laser.
Background
In the field of autonomous driving, semantic segmentation is essential for scene understanding. The task is to assign a semantic label to every camera pixel or laser point of the input. Current methods fall mainly into two types: camera-based and lidar-based.
A camera image contains three channels of color data and therefore carries rich appearance information such as color and texture. However, the camera is a passive sensor that is easily affected by lighting conditions and weather, and because it is a 2D sensor without depth information, it is difficult to obtain accurate distances to the surrounding environment. The lidar is an active sensor: it computes accurate distances by emitting laser light and receiving the reflections, so its performance is nearly unaffected by illumination conditions. However, because the point cloud is sparse, irregularly distributed, and lacks texture, lidar-based segmentation performs poorly on small objects, distant objects, and objects with similar structures.
Existing fusion schemes based on camera images and laser point clouds combine the advantages of the camera-based and lidar-based methods and perform three-dimensional semantic segmentation by considering both the texture of the image and the distance measured by the laser. However, such schemes still suffer from a lack of texture features in lidar segmentation and a lack of distance information in image segmentation.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a three-dimensional semantic segmentation method based on fusion of a camera and laser.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
A three-dimensional semantic segmentation method based on camera and laser fusion comprises the following steps:
S1, inputting a camera image into a camera module of a three-dimensional semantic segmentation network to extract image features, and obtaining a camera image feature map with an original size;
S2, inputting laser point cloud data into a laser module of a three-dimensional semantic segmentation network to extract laser point cloud data features, and obtaining a laser point cloud data feature map with an original size;
S3, inputting the camera image feature map in the step S1 and the laser point cloud data feature map in the step S2 into a fusion module of a three-dimensional semantic segmentation network for feature fusion to obtain fused camera image features and laser point cloud data features;
S4, inputting the fused image features and the laser point cloud data features obtained in the step S3 into a camera module and a laser module respectively to obtain a camera image feature map and a laser point cloud data feature map;
S5, inputting the camera image feature map and the laser point cloud data feature map obtained in the step S4 into a supervision module of the three-dimensional semantic segmentation network, and calculating a loss function in either a self-supervised mode or a supervised mode;
S6, calculating gradients of a camera module, a laser module, a fusion module and a supervision module of the three-dimensional semantic segmentation network according to the loss function calculated in the step S5, and updating parameter weights of the three-dimensional semantic segmentation network by adopting a gradient descent method to obtain a trained three-dimensional semantic segmentation network;
S7, acquiring camera image and laser point cloud data, and inputting the trained three-dimensional semantic segmentation network in the step S6 to obtain semantic segmentation results of the laser point cloud data and the camera image.
Further, the camera module and the laser module each consist of an encoder and a decoder, wherein the feature map size is reduced layer by layer in the encoder and increased layer by layer in the decoder, and a skip connection is added between encoder and decoder layers with the same feature map size.
Further, the step S1 specifically includes:
S11, acquiring a camera image acquired by a camera, inputting the camera image into an encoder, and extracting local features of the camera image by adopting a convolutional neural network to obtain a camera image feature map;
S12, reducing the size of the camera image feature map layer by using a pooling layer according to the camera image feature map obtained in the step S11;
S13, inputting the reduced-size camera image feature map into a decoder, and recovering the size of the camera image feature map layer by layer by adopting a convolutional neural network and a bilinear upsampling method to obtain a camera image feature map with the original size.
Further, step S2 specifically includes:
S21, acquiring laser point cloud data acquired by a laser radar, and projecting the laser point cloud data on a camera plane to obtain two-dimensional laser point cloud data;
S22, inputting the two-dimensional laser point cloud data obtained in the step S21 into an encoder, and extracting local features of the two-dimensional laser point cloud data by adopting a convolutional neural network to obtain a laser point cloud data feature map;
S23, reducing the size of the laser point cloud data feature map layer by using a pooling layer according to the laser point cloud data feature map obtained in the step S22;
S24, inputting the reduced-size laser point cloud data feature map into a decoder, and recovering the size of the laser point cloud data feature map layer by utilizing a convolutional neural network and a bilinear upsampling method to obtain the laser point cloud data feature map with the original size.
Further, the laser point cloud data is projected onto the camera plane according to the following formulas:
$[x'_i, y'_i, z'_i]^T = K \times T_r \times [x_i, y_i, z_i, 1]^T$
$M_l[u_i][v_i] = 1$
Where $x'_i$, $y'_i$, $z'_i$ represent the position of the i-th laser point in the camera coordinate system, the superscript $T$ represents the transpose, $K$ represents the camera intrinsic matrix, $T_r$ represents the lidar-to-camera transformation matrix, $x_i$, $y_i$, $z_i$ represent the position of the i-th laser point along the x, y and z axes, $u_i$, $v_i$ represent the indices of the i-th laser point in the vertical and horizontal directions of the camera plane, respectively, and $M_l$ represents the lidar mask.
Further, the fusion module is composed of a splicing module, a convolution layer and a sliding window attention module, wherein the sliding window attention module is composed of a first sliding window attention layer and a second sliding window attention layer, the first sliding window attention layer is composed of a layer normalization module, a W-MSA module and a multi-layer perceptron module, and the second sliding window attention layer is composed of a layer normalization module, a SW-MSA module and a multi-layer perceptron module.
Further, the step S3 specifically includes:
S31, inputting the camera image feature map from the step S1 and the laser point cloud data feature map from the step S2 into the splicing module of the fusion module to obtain the spliced features of the camera and the lidar;
S32, inputting the spliced features obtained in the step S31 into the convolution layer to obtain the fusion features of the camera and the lidar;
S33, inputting the fusion features obtained in the step S32 into the sliding window attention module to obtain the fusion attention features of the camera and the lidar;
S34, merging the fusion attention features obtained in the step S33 and the fusion features obtained in the step S32, in proportion, into the camera image feature map from the step S1 and the laser point cloud data feature map from the step S2 to obtain the fused camera image features and the fused laser point cloud data features.
Further, the fused camera image features and laser point cloud data features in step S34 are calculated as follows:
$C_{fusion} = C_{origin} + a_1 \times \mathrm{SelfAttention} \times \mathrm{FusionFeature}$
$L_{fusion} = L_{origin} + a_2 \times \mathrm{SelfAttention} \times \mathrm{FusionFeature}$
Wherein $C_{fusion}$ represents the fused camera image feature, $C_{origin}$ represents the camera image feature of the original size, $a_1$ and $a_2$ represent the fusion scale factors, $\mathrm{SelfAttention}$ represents the fusion attention feature of the camera and the lidar, $\mathrm{FusionFeature}$ represents the fusion feature of the camera and the lidar, $L_{fusion}$ represents the fused laser point cloud data feature, and $L_{origin}$ represents the laser point cloud data feature of the original size.
Further, the camera image feature map and the laser point cloud data feature map obtained in the step S4 are input into the supervision module, and the loss function is calculated in the self-supervised mode as follows:
The supervision module generates pseudo labels with a PIDNet network extended with a confidence output, retains only high-confidence pixels and laser points, and obtains the loss function of the self-supervised mode by setting a camera mask and a lidar mask, namely:
$L_{\text{self-supervised}} = L_{foc1} + L_{lov1} + L_{fov2} + L_{lov2} + L_{kl}$
Wherein $L_{\text{self-supervised}}$ represents the loss function of the self-supervised mode, $L_{kl}$ represents a one-way KL divergence, $L_{foc1}$ and $L_{lov1}$ represent the focal loss and the Lovász loss between the prediction of the camera branch and the pseudo label, $L_{fov2}$ and $L_{lov2}$ represent the focal loss and the Lovász loss between the prediction of the lidar branch and the pseudo label, $u$ and $v$ represent the length and width of the prediction feature map, $C$ represents the confidence, $\mathrm{focalloss}(\cdot)$ represents the focal loss function, $pred_{camera}$ represents the prediction of the camera branch, $pred_{Lidar}$ represents the prediction of the lidar branch, $label$ represents the pseudo label value, $M_{\theta 1}$ represents the camera mask, $M_{\theta 2}$ represents the lidar mask, and $M_l$ represents the lidar mask obtained from the projection.
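The per-term formulas of the self-supervised loss are not reproduced in this text, so the following is only a minimal NumPy sketch of how a confidence mask, a masked focal loss and a one-way KL divergence could be computed. The threshold value, the focusing parameter `gamma`, the averaging over masked positions, the direction of the KL term, and all function names are assumptions for illustration; the Lovász terms are omitted.

```python
import numpy as np

def confidence_mask(confidence, threshold=0.9):
    """Keep positions whose pseudo-label confidence reaches the threshold (hypothetical value)."""
    return confidence >= threshold

def focal_loss(prob, label, mask, gamma=2.0, eps=1e-8):
    """Masked focal loss.

    prob:  (H, W, K) class probabilities of one branch
    label: (H, W) integer pseudo labels
    mask:  (H, W) boolean mask of positions that contribute to the loss
    """
    h, w, _ = prob.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    p_t = prob[rows, cols, label]                      # probability of the labelled class
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t + eps)
    m = mask.astype(float)
    return float((loss * m).sum() / max(m.sum(), 1.0))

def one_way_kl(p_target, q_pred, mask, eps=1e-8):
    """KL(p_target || q_pred) per position, averaged over the mask.

    Which branch plays the role of the target is an assumption here.
    """
    kl = (p_target * (np.log(p_target + eps) - np.log(q_pred + eps))).sum(axis=-1)
    m = mask.astype(float)
    return float((kl * m).sum() / max(m.sum(), 1.0))
```

Under these assumptions, the lidar branch would typically use the elementwise AND of the confidence mask and the projection mask $M_l$, so that only pixels that both carry a laser point and have a confident pseudo label contribute.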
Further, the camera image feature map and the laser point cloud data feature map obtained in the step S4 are input into the supervision module, and the loss function is calculated in the supervised mode as follows:
The supervision module adjusts the parameter weights using the focal loss and the Lovász loss, which gives the loss function of the supervised mode, namely:
$L_{supervised} = L_{foc1} + L_{lov1} + L_{fov2} + L_{lov2}$
Where $L_{supervised}$ represents the loss function of the supervised mode, $L_{foc1}$ and $L_{lov1}$ represent the focal loss and the Lovász loss, respectively, between the prediction of the camera branch and the ground-truth labels, and $L_{fov2}$ and $L_{lov2}$ represent the focal loss and the Lovász loss, respectively, between the prediction of the lidar branch and the ground-truth labels.
The invention has the following beneficial effects:
1. The three-dimensional semantic segmentation method based on camera and laser fusion improves the accuracy of semantic segmentation by effectively combining the texture information of the camera image with the distance information of the lidar; where the laser point cloud of a small object is sparse, the added camera image information makes the prediction more accurate;
2. The fusion module adopts a sliding window attention mechanism, which makes the three-dimensional semantic segmentation network more robust to severe changes in illumination and color;
3. Pseudo labels are generated by the PIDNet network extended with a confidence output, so the three-dimensional semantic segmentation network can be trained across datasets and across modalities without any manual labeling of the point cloud, which improves the prediction accuracy of the network.
Drawings
FIG. 1 is a schematic flow chart of a three-dimensional semantic segmentation method based on camera and laser fusion;
FIG. 2 is a schematic diagram of a three-dimensional semantic segmentation network architecture;
FIG. 3 is a schematic diagram of a fusion module;
FIG. 4 is a schematic diagram of a PIDNet network with added confidence in the supervision module;
FIG. 5 is a schematic diagram of a three-dimensional semantic segmentation network segmentation result.
Detailed Description
The following description of embodiments is provided to help those skilled in the art understand the invention. It should be understood that the invention is not limited to the scope of these embodiments; to those skilled in the art, all variations within the spirit and scope of the invention as defined in the appended claims, and all inventions that make use of the inventive concept, are protected.
As shown in FIG. 1, a three-dimensional semantic segmentation method based on camera and laser fusion includes the following steps S1-S7:
FIG. 2 is a schematic diagram of the three-dimensional semantic segmentation network structure. The network includes a camera module, a laser module, a fusion module, and a supervision module. It adopts a two-stream architecture with two inputs: the camera image is input into the camera module, and the laser point cloud data from the lidar is input into the laser module. The camera module and the laser module have similar structures, each consisting of an encoder whose feature map size decreases layer by layer and a decoder whose feature map size increases layer by layer. In the encoder stage, a convolutional neural network extracts local features of the camera image or the laser point cloud data, and a pooling layer reduces the feature map size; in the decoder stage, a convolutional neural network and bilinear upsampling restore the feature map size layer by layer until it matches the size of the original input camera image or laser point cloud data image. In this embodiment, a skip connection is also added between encoder and decoder feature maps of the same size in the camera module and the laser module; in FIG. 2, features are propagated directly from the encoder to the decoder along the paths shown by dotted lines.
S1, inputting a camera image into a camera module of a three-dimensional semantic segmentation network to extract image features, and obtaining a camera image feature map with an original size.
S2, inputting the laser point cloud data into a laser module of the three-dimensional semantic segmentation network to extract laser point cloud data features, and obtaining a laser point cloud data feature map with the original size.
Specifically, the camera module and the laser module each consist of an encoder and a decoder; the feature map size is reduced layer by layer in the encoder and increased layer by layer in the decoder, and a skip connection is added between encoder and decoder layers with the same feature map size.
In this embodiment, the layer-by-layer design enlarges the receptive field of the convolutions and improves the segmentation of objects of different sizes, while the skip connections preserve the edge position information of the original image features and make the semantic segmentation more accurate. The number of encoder and decoder layers can be set freely and tuned to the input image size; it is usually 4.
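The size changes through the encoder and decoder can be illustrated with a small NumPy sketch. It leaves out the convolutions entirely and tracks only pooling, bilinear upsampling and the skip connections. The 2x2 pooling window, the additive merge of the skip features, the corner-aligned interpolation, and all function names are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) map; H and W must be even."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def bilinear_upsample(x, out_h, out_w):
    """Bilinear resize of an (H, W) map, with the corner samples aligned."""
    h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]          # vertical interpolation weights
    wx = (xs - x0)[None, :]          # horizontal interpolation weights
    top = (1 - wx) * x[y0][:, x0] + wx * x[y0][:, x1]
    bottom = (1 - wx) * x[y1][:, x0] + wx * x[y1][:, x1]
    return (1 - wy) * top + wy * bottom

def encode_decode(x, num_layers=4):
    """Shrink the map num_layers times, then restore it with skip connections."""
    skips, sizes = [], [x.shape]
    for _ in range(num_layers):
        skips.append(x)              # feature kept for the skip connection
        x = max_pool_2x2(x)
        sizes.append(x.shape)
    for skip in reversed(skips):
        x = bilinear_upsample(x, *skip.shape)
        x = x + skip                 # additive skip merge (assumption)
    return x, sizes
```

With a 16 x 16 input and 4 layers, the encoder sizes in this sketch run 16, 8, 4, 2, 1 and the decoder returns a 16 x 16 map.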
Specifically, step S1 includes steps S11-S13:
S11, acquiring a camera image acquired by a camera, inputting the camera image into an encoder, and extracting local features of the camera image by adopting a convolutional neural network to obtain a camera image feature map.
S12, reducing the size of the camera image feature map layer by using a pooling layer according to the camera image feature map obtained in the step S11.
S13, inputting the reduced-size camera image feature map into a decoder, and recovering the size of the camera image feature map layer by layer by adopting a convolutional neural network and a bilinear upsampling method to obtain a camera image feature map with the original size.
Specifically, step S2 includes steps S21-S24:
S21, acquiring laser point cloud data acquired by a laser radar, and projecting the laser point cloud data on a camera plane to obtain two-dimensional laser point cloud data.
Specifically, the laser point cloud data is projected onto the camera plane according to the following formulas:
$[x'_i, y'_i, z'_i]^T = K \times T_r \times [x_i, y_i, z_i, 1]^T$
$M_l[u_i][v_i] = 1$
Where $x'_i$, $y'_i$, $z'_i$ represent the position of the i-th laser point in the camera coordinate system, the superscript $T$ represents the transpose, $K$ represents the camera intrinsic matrix, $T_r$ represents the lidar-to-camera transformation matrix, $x_i$, $y_i$, $z_i$ represent the position of the i-th laser point along the x, y and z axes, $u_i$, $v_i$ represent the indices of the i-th laser point in the vertical and horizontal directions of the camera plane, respectively, and $M_l$ represents the lidar mask.
In this embodiment, the camera data is two-dimensional while the laser point cloud data is three-dimensional, so the two modalities do not share a common space; in addition, the field of view of a mechanically rotating lidar is generally larger than that of the camera. The laser point cloud data is therefore projected onto the camera plane and converted into two-dimensional laser point cloud data. The formula $[x'_i, y'_i, z'_i]^T = K \times T_r \times [x_i, y_i, z_i, 1]^T$ projects the laser point cloud into the camera image, where the laser point is $P_i = \{x_i, y_i, z_i\}^T$ and $x_i$, $y_i$, $z_i$ are its coordinates in three-dimensional space; $T_r$ is the lidar-to-camera transformation matrix, which describes the physical offset between the lidar and the camera; $K$ is the camera intrinsic matrix, which describes how three-dimensional space is mapped onto the two-dimensional photosensitive plane; and $[x'_i, y'_i, z'_i]^T$ is the three-dimensional point in the camera coordinate system. Scaling by $z'_i$ then gives $[u_i, v_i, 1]^T$, the two-dimensional coordinate of the laser point on the camera plane. Because laser point cloud data is typically sparse, not every image pixel has a corresponding laser point, so the positions onto which a laser point is mapped are marked by $M_l[u_i][v_i] = 1$.
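The projection and the lidar mask can be sketched in NumPy as follows. This is an illustrative sketch, not the patent's implementation: it assumes that $K$ is a 3 x 3 matrix and $T_r$ a 3 x 4 matrix, that the first normalized component is the horizontal (column) coordinate as in the usual pinhole convention, and that points behind the camera or outside the image are discarded. The function name and return values are hypothetical.

```python
import numpy as np

def project_lidar_to_image(points, K, Tr, height, width):
    """Project (N, 3) lidar points onto the image and build the lidar mask M_l.

    K:  (3, 3) camera intrinsic matrix (assumed shape)
    Tr: (3, 4) lidar-to-camera transformation matrix (assumed shape)
    Returns row indices, column indices, depths z', and the (height, width) mask.
    """
    n = points.shape[0]
    homogeneous = np.hstack([points, np.ones((n, 1))])   # rows [x, y, z, 1]
    cam = (K @ Tr @ homogeneous.T).T                     # rows [x', y', z']

    cam = cam[cam[:, 2] > 1e-6]                          # drop points behind the camera
    pix = cam[:, :2] / cam[:, 2:3]                       # scale by z'
    col = np.floor(pix[:, 0]).astype(int)                # horizontal index (assumed)
    row = np.floor(pix[:, 1]).astype(int)                # vertical index (assumed)

    inside = (row >= 0) & (row < height) & (col >= 0) & (col < width)
    row, col, depth = row[inside], col[inside], cam[inside, 2]

    mask = np.zeros((height, width), dtype=np.uint8)
    mask[row, col] = 1                                   # M_l = 1 where a laser point lands
    return row, col, depth, mask
```

The returned depth values could then be written into a two-dimensional array at `(row, col)` to form the two-dimensional laser point cloud data that is fed to the encoder in step S22.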
S22, inputting the two-dimensional laser point cloud data obtained in the step S21 into an encoder, and extracting local features of the two-dimensional laser point cloud data by adopting a convolutional neural network to obtain a laser point cloud data feature map.
S23, according to the laser point cloud data feature map obtained in the step S22, reducing the size of the laser point cloud data feature map layer by using a pooling layer.
S24, inputting the reduced-size laser point cloud data feature map into a decoder, and recovering the size of the laser point cloud data feature map layer by utilizing a convolutional neural network and a bilinear upsampling method to obtain the laser point cloud data feature map with the original size.
And S3, inputting the camera image feature map in the step S1 and the laser point cloud data feature map in the step S2 into a fusion module of the three-dimensional semantic segmentation network for feature fusion, and obtaining fused camera image features and laser point cloud data features.
FIG. 3 is a schematic diagram of the fusion module structure. In FIG. 3 (a), the fusion module includes a splicing module (C), a convolution layer (Conv), and a sliding window attention module (Swin Transformer). In this embodiment, the image feature of the original size in the camera image feature map (C origin) and the laser point cloud data feature of the original size in the laser point cloud data feature map (L origin) are input into the splicing module to obtain a spliced feature (Concat Feature). The spliced feature is input into the convolution layer to obtain a fusion feature (Fusion Feature), and the fusion feature is input into the sliding window attention module to obtain a fusion attention feature (Self-Attention Feature). Finally, the fusion attention feature is merged, in proportion, with the image feature of the original size and the laser point cloud data feature of the original size to obtain the fused camera image feature (C fusion) and the fused laser point cloud data feature (L fusion).
In FIG. 3 (b), the sliding window attention module contains a first sliding window attention layer and a second sliding window attention layer. The first sliding window attention layer consists of a layer normalization module (LN), a W-MSA module (Window Multi-Head Self-Attention), and a multi-layer perceptron module (MLP); the second sliding window attention layer consists of a layer normalization module (LN), a SW-MSA module (Shifted Window Multi-Head Self-Attention), and a multi-layer perceptron module (MLP). The fusion feature is first flattened into a patch feature (Patch Feature) and then passed through the first and second sliding window attention layers. The two layers differ only in their attention structure: the first uses W-MSA, which reduces the amount of computation by calculating self-attention only within each window, and the second uses SW-MSA, which exchanges information between windows by shifting the windows.
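The difference between attention inside fixed windows and attention inside shifted windows can be shown with a simplified NumPy sketch. It is not the patent's implementation: it uses a single head, identity query/key/value projections, no layer normalization or MLP, a cyclic shift via `np.roll`, and no attention mask for windows that wrap around the border. All names are illustrative.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) map into non-overlapping windows: (num_windows, ws*ws, C)."""
    h, w, c = x.shape
    x = x.reshape(h // ws, ws, w // ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, c)

def window_reverse(windows, ws, h, w):
    """Inverse of window_partition."""
    c = windows.shape[-1]
    x = windows.reshape(h // ws, w // ws, ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, ws, shift=0):
    """Self-attention restricted to ws x ws windows.

    shift=0 mimics W-MSA; shift>0 cyclically shifts the map first, which
    mimics SW-MSA without the border attention mask (simplification).
    """
    h, w, c = x.shape
    if shift:
        x = np.roll(x, (-shift, -shift), axis=(0, 1))
    win = window_partition(x, ws)                          # (nW, ws*ws, C)
    scores = win @ win.transpose(0, 2, 1) / np.sqrt(c)     # identity Q/K projections (assumption)
    out = softmax(scores) @ win                            # identity V projection (assumption)
    x = window_reverse(out, ws, h, w)
    if shift:
        x = np.roll(x, (shift, shift), axis=(0, 1))
    return x
```

In this sketch, a change to one pixel affects only the outputs of its own window when `shift=0`, whereas with `shift=ws // 2` the same pixel is grouped with pixels from neighbouring windows, which is how information passes between windows.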
Specifically, the fusion module is composed of a splicing module, a convolution layer and a sliding window attention module, wherein the sliding window attention module is composed of a first sliding window attention layer and a second sliding window attention layer, the first sliding window attention layer is composed of a layer normalization module, a W-MSA module and a multi-layer perceptron module, and the second sliding window attention layer is composed of a layer normalization module, a SW-MSA module and a multi-layer perceptron module.
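The difference between W-MSA and SW-MSA can be illustrated by how the feature map is cut into windows before attention is computed. The following is a minimal NumPy sketch, not the embodiment's implementation: the function names, the window size of 4 and the 8×8 toy feature map are assumptions chosen for illustration, and the attention computation itself (as well as the attention mask that Swin-style models apply to wrapped-around windows) is omitted.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows.

    Returns an array of shape (num_windows, ws * ws, C). In W-MSA,
    self-attention would be computed independently inside each window.
    """
    H, W, C = x.shape
    assert H % ws == 0 and W % ws == 0, "H and W must be multiples of ws"
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (nH, nW, ws, ws, C)
    return x.reshape(-1, ws * ws, C)

def shifted_window_partition(x, ws):
    """SW-MSA style partition: cyclically shift by ws // 2, then split."""
    shift = ws // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, ws)

# Toy 8x8 single-channel feature map whose values encode the pixel index.
feat = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)
regular = window_partition(feat, 4)
shifted = shifted_window_partition(feat, 4)
print(regular.shape, shifted.shape)  # (4, 16, 1) (4, 16, 1)
```

In this toy case the first shifted window covers rows 2 to 5 and columns 2 to 5 of the original map, so it straddles all four regular windows; this is how the second layer lets information cross window boundaries.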
Specifically, step S3 includes sub-steps S31 to S34:
S31, inputting the camera image feature map from step S1 and the laser point cloud data feature map from step S2 into the splicing module of the fusion module to obtain the splicing features of the camera and the lidar.
S32, inputting the splicing features of the camera and the lidar obtained in step S31 into the convolution layer to obtain the fusion features of the camera and the lidar.
S33, inputting the fusion features of the camera and the lidar obtained in step S32 into the sliding window attention module to obtain the fusion attention features of the camera and the lidar.
S34, merging the fusion attention features of the camera and the lidar obtained in step S33 and the fusion features of the camera and the lidar obtained in step S32, in proportion, into the camera image feature map from step S1 and the laser point cloud data feature map from step S2 to obtain the fused camera image features and the fused laser point cloud data features.
Specifically, the calculation formula of the fused image feature and the laser point cloud data feature in step S34 is as follows:
$$C_{fusion} = C_{origin} + a_1 \times \mathrm{SelfAttention} \times \mathrm{FusionFeature}$$
$$L_{fusion} = L_{origin} + a_2 \times \mathrm{SelfAttention} \times \mathrm{FusionFeature}$$
Wherein $C_{fusion}$ represents the fused camera image feature, $C_{origin}$ represents the camera image feature of the original size, $a_1$ and $a_2$ represent the fusion scale factors, $\mathrm{SelfAttention}$ represents the fusion attention feature of the camera and the lidar, $\mathrm{FusionFeature}$ represents the fusion feature of the camera and the lidar, $L_{fusion}$ represents the fused laser point cloud data feature, and $L_{origin}$ represents the laser point cloud data feature of the original size.
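Steps S31 to S34 and the two formulas above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the embodiment's code: the convolution is assumed to be a 1×1 convolution (written as a per-pixel matrix product), the product between the fusion attention feature and the fusion feature is assumed to be element-wise, and `attention_fn` is a hypothetical placeholder for the sliding window attention module (a sigmoid gate is used in the example purely as a stand-in).

```python
import numpy as np

def fuse(c_origin, l_origin, conv_weight, attention_fn, a1=0.5, a2=0.5):
    """Proportional fusion of camera and lidar features.

    c_origin, l_origin: (H, W, C) feature maps of the original size.
    conv_weight: (2C, C) weights of an assumed 1x1 convolution.
    attention_fn: callable mapping (H, W, C) -> (H, W, C); placeholder
                  for the sliding window attention module.
    """
    concat = np.concatenate([c_origin, l_origin], axis=-1)  # S31: (H, W, 2C)
    fusion_feature = concat @ conv_weight                   # S32: 1x1 conv
    self_attention = attention_fn(fusion_feature)           # S33
    update = self_attention * fusion_feature                # element-wise (assumed)
    c_fusion = c_origin + a1 * update                       # S34
    l_fusion = l_origin + a2 * update
    return c_fusion, l_fusion

rng = np.random.default_rng(0)
c = rng.normal(size=(4, 4, 8))
l = rng.normal(size=(4, 4, 8))
w = rng.normal(size=(16, 8)) * 0.1
sigmoid_gate = lambda f: 1.0 / (1.0 + np.exp(-f))  # stand-in, not Swin attention
c_f, l_f = fuse(c, l, w, sigmoid_gate, a1=0.3, a2=0.7)
print(c_f.shape, l_f.shape)  # (4, 4, 8) (4, 4, 8)
```

Because the same update term is added to both branches, the scale factors alone control how strongly each branch is modified; setting a scale factor to zero leaves that branch's original feature unchanged.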
S4, inputting the fused image features and the laser point cloud data features obtained in the step S3 into a camera module and a laser module respectively to obtain a camera image feature map and a laser point cloud data feature map.
In this embodiment, the image features and laser point cloud data features fused by the fusion module are input into the camera module and the laser module respectively, where feature extraction continues. The purpose of the fusion module is to enable effective data interaction between the two branches. Compared with a convolution-based approach, the fusion module in this embodiment adopts a sliding window attention structure, which not only performs better feature selection through a global attention mechanism, but also uses the attention mechanism to weight the fused image features and laser point cloud data features onto the respective original features of the image and the laser for the next stage of feature extraction.
S5, inputting the camera image feature map and the laser point cloud data feature map obtained in the step S4 into a supervision module of the three-dimensional semantic segmentation network, and calculating a loss function by adopting a self-supervision mode or a supervised mode.
In this embodiment, the supervision module supports two modes: a supervised mode and a self-supervised mode. The supervised mode trains with ground-truth labels, that is, the ground-truth labels of the laser point cloud in the original dataset supervise the predicted values of both the camera and the lidar. The self-supervised mode supervises network convergence with pseudo labels: no ground-truth labels of the laser point cloud are used; instead, a 2D image segmentation network pre-trained on other training sets performs inference on the image, the resulting pseudo labels and their confidences are retained, and the pseudo labels and confidences are used jointly to supervise the predicted values of the camera and the lidar.
Specifically, the camera image feature map and the laser point cloud data feature map obtained in the step S4 are input into a supervision module, and the specific process of calculating the loss function by adopting the self-supervision mode is as follows:
The supervision module generates pseudo labels through a PIDNet network with added confidence, retains only the high-confidence pixels and laser point cloud data, and obtains the loss function of the self-supervised mode by setting a camera mask and a lidar mask, namely:
$$L_{\text{self-supervised}} = L_{foc1} + L_{lov1} + L_{fov2} + L_{lov2} + L_{kl}$$
Wherein $L_{\text{self-supervised}}$ represents the loss function of the self-supervised mode; $L_{kl}$ represents the one-way KL divergence; $L_{foc1}$ and $L_{lov1}$ represent the focal loss and the Lovász loss between the prediction result of the camera branch and the pseudo labels; $L_{fov2}$ and $L_{lov2}$ represent the focal loss and the Lovász loss between the prediction result of the lidar branch and the pseudo labels; $u$ and $v$ represent the length and width of the prediction result feature map; $C$ represents the confidence; $\mathrm{focalloss}(\cdot)$ represents the focal loss calculation formula; $pred_{camera}$ represents the prediction result of the camera branch; $pred_{Lidar}$ represents the prediction result of the lidar branch; $label$ represents the pseudo label value; $M_{\theta 1}$ represents the camera mask; $M_{\theta 2}$ represents the lidar mask; and $M_l$ represents the lidar mask obtained when the laser point cloud is projected onto the camera plane.
In this embodiment, the camera mask is activated only where the confidence is greater than θ1, and the lidar mask is activated only where the confidence is greater than θ2.
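The thresholding rule can be sketched as follows. This is a hedged illustration only: the threshold values, the function names, and the choice to intersect the lidar confidence mask with the projection mask $M_l$ (so that only pixels that actually contain a lidar point contribute) are assumptions, and the per-pixel loss values are made up for the example.

```python
import numpy as np

def confidence_masks(confidence, lidar_mask, theta1=0.7, theta2=0.7):
    """Build the camera mask and the lidar mask from per-pixel confidence.

    confidence: (u, v) pseudo-label confidence in [0, 1].
    lidar_mask: (u, v) array, 1 where a lidar point projects, else 0.
    """
    cam_mask = confidence > theta1
    # Assumption: the lidar branch is supervised only where a point exists.
    lid_mask = (confidence > theta2) & (lidar_mask > 0)
    return cam_mask, lid_mask

def masked_mean(per_pixel_loss, mask):
    """Average a per-pixel loss over the active mask entries only."""
    n = mask.sum()
    return float((per_pixel_loss * mask).sum() / n) if n > 0 else 0.0

confidence = np.array([[0.9, 0.4], [0.8, 0.95]])
lidar_mask = np.array([[1, 0], [0, 1]])
loss = np.array([[1.0, 2.0], [3.0, 4.0]])  # illustrative per-pixel loss values

cam_mask, lid_mask = confidence_masks(confidence, lidar_mask)
print(masked_mean(loss, cam_mask))  # mean of 1, 3, 4
print(masked_mean(loss, lid_mask))  # 2.5
```

Low-confidence pseudo labels are thus excluded from the loss instead of being down-weighted, which keeps unreliable pixels from steering either branch.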
Fig. 4 is a schematic structural diagram of the PIDNet network with added confidence in the supervision module. In this embodiment, the self-supervised mode generates pseudo labels to train the network, and the pseudo labels supervise the predicted value of the camera and the predicted value of the lidar. The PIDNet network with added confidence shown in Fig. 4 computes both the confidence and the pseudo labels; it is used only to generate pseudo labels and does not participate in the training process of the overall three-dimensional semantic segmentation network. The confidence C is defined in terms of the entropy E, which is calculated from the probability p: the more the network output is concentrated in a single class, the smaller the entropy E and the closer the confidence C is to 1. In the confidence formula, n represents the number of semantic segmentation classes.
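One definition that matches this description is a confidence equal to one minus the Shannon entropy normalized by log n. The sketch below assumes that form; it is offered as an illustration of the stated behavior (concentrated output gives C close to 1, uniform output gives C close to 0) and should not be read as the exact formula of the embodiment. The function name is hypothetical.

```python
import math

def confidence_from_probs(p):
    """Confidence C = 1 - E, with E the entropy of p normalized by log(n).

    p: list of n >= 2 class probabilities for one pixel (sums to 1).
    Assumed form: E = -(1 / log n) * sum_i p_i * log p_i.
    """
    n = len(p)
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(n)
    return 1.0 - entropy

print(confidence_from_probs([1.0, 0.0, 0.0, 0.0]))  # 1.0
print(confidence_from_probs([0.7, 0.1, 0.1, 0.1]))  # between 0 and 1
```

Normalizing by log n keeps the confidence in the range from 0 to 1 regardless of the number of classes, so a single threshold can be applied across datasets with different label sets.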
Specifically, the camera image feature map and the laser point cloud data feature map obtained in the step S4 are input into a supervision module, and the specific process of calculating the loss function by adopting the supervised mode is as follows:
The supervision module adjusts the parameter weights by adopting the focal loss and the Lovász loss to obtain the loss function of the supervised mode, namely:
$$L_{supervised} = L_{foc1} + L_{lov1} + L_{fov2} + L_{lov2}$$
Where $L_{supervised}$ represents the loss function of the supervised mode, $L_{foc1}$ and $L_{lov1}$ represent the focal loss and the Lovász loss between the prediction result of the camera branch and the ground-truth labels, respectively, and $L_{fov2}$ and $L_{lov2}$ represent the focal loss and the Lovász loss between the prediction result of the lidar branch and the ground-truth labels, respectively.
S6, calculating gradients of a camera module, a laser module, a fusion module and a supervision module of the three-dimensional semantic segmentation network according to the loss function calculated in the step S5, and updating parameter weights of the three-dimensional semantic segmentation network by adopting a gradient descent method to obtain the trained three-dimensional semantic segmentation network.
In this embodiment, the calculated loss function is used to compute the gradients of the camera module, the laser module, the fusion module and the supervision module, which are back-propagated layer by layer to update the parameter weights of the three-dimensional semantic segmentation network. Training continues until the network converges, yielding the trained three-dimensional semantic segmentation network.
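The weight update itself follows the usual gradient descent rule, in which each parameter moves against its gradient by a step proportional to the learning rate. The sketch below uses a one-parameter quadratic as a stand-in for the network loss; the function name, the learning rate and the toy loss are assumptions for illustration, and in practice the gradients would come from back-propagation through the four modules.

```python
def sgd_step(params, grads, lr):
    """Return new parameters after one plain gradient-descent update."""
    return {name: value - lr * grads[name] for name, value in params.items()}

# Toy stand-in for the network loss: loss(w) = (w - 3)^2, gradient 2 * (w - 3).
params = {"w": 0.0}
for _ in range(100):
    grads = {"w": 2.0 * (params["w"] - 3.0)}
    params = sgd_step(params, grads, lr=0.1)

print(round(params["w"], 6))  # 3.0
```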
S7, acquiring camera images and laser point cloud data, and inputting them into the three-dimensional semantic segmentation network trained in step S6 to obtain semantic segmentation results of the laser point cloud data and the camera images.
In this embodiment, the three-dimensional semantic segmentation method provided by the invention is compared experimentally with a lidar-only segmentation method. The proposed method improves on the lidar-only method by 2.6%, and the introduction of images is particularly advantageous for the segmentation of small objects. The specific segmentation results are shown in Table 1:
Table 1. Results on the SemanticKITTI dataset
* Our implementation, from which the forward-looking lidar results are derived. + Other results are taken from the benchmark. Bold marks the highest result; underline marks the second highest result.
Fig. 5 is a schematic diagram of the segmentation results of the three-dimensional semantic segmentation network, obtained by applying the three-dimensional semantic segmentation method based on camera and laser fusion provided by the invention to camera images and laser point cloud data. As can be seen from Fig. 5, the shadow of the trees on the left side of the road makes the ground difficult to distinguish, yet the proposed method still identifies the road boundary accurately; likewise, the distant cyclist is predicted accurately even though the reflected point cloud is sparse.
The principles and embodiments of the present invention have been described in detail with reference to specific examples, which are provided only to aid understanding of the method and core ideas of the present invention. Because those skilled in the art may vary the specific embodiments and the scope of application in accordance with the ideas of the present invention, the contents of this description should not be construed as limiting the present invention.
Those of ordinary skill in the art will recognize that the embodiments described herein are for the purpose of aiding the reader in understanding the principles of the present invention and should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.
Claims (10)
1. The three-dimensional semantic segmentation method based on the fusion of the camera and the laser is characterized by comprising the following steps of:
S1, inputting a camera image into a camera module of a three-dimensional semantic segmentation network to extract image features, and obtaining a camera image feature map with an original size;
S2, inputting laser point cloud data into a laser module of a three-dimensional semantic segmentation network to extract laser point cloud data characteristics, and obtaining a laser point cloud data characteristic diagram with an original size;
S3, inputting the camera image feature map in the step S1 and the laser point cloud data feature map in the step S2 into a fusion module of a three-dimensional semantic segmentation network for feature fusion to obtain fused camera image features and laser point cloud data features;
S4, inputting the fused image features and the laser point cloud data features obtained in the step S3 into a camera module and a laser module respectively to obtain a camera image feature map and a laser point cloud data feature map;
S5, inputting the camera image feature map and the laser point cloud data feature map obtained in the step S4 into a supervision module of the three-dimensional semantic segmentation network, and calculating a loss function by adopting a self-supervision mode or a supervised mode;
S6, calculating gradients of a camera module, a laser module, a fusion module and a supervision module of the three-dimensional semantic segmentation network according to the loss function calculated in the step S5, and updating parameter weights of the three-dimensional semantic segmentation network by adopting a gradient descent method to obtain a trained three-dimensional semantic segmentation network;
S7, acquiring camera images and laser point cloud data, and inputting them into the three-dimensional semantic segmentation network trained in step S6 to obtain semantic segmentation results of the laser point cloud data and the camera images.
2. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 1, wherein the camera module and the laser module are each composed of an encoder and a decoder, wherein the feature map size in the encoder is reduced layer by layer, the feature map size in the decoder is increased layer by layer, and a skip connection structure is added between the encoder layer and the decoder layer with the same feature map size.
3. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 2, wherein step S1 specifically comprises:
S11, acquiring a camera image acquired by a camera, inputting the camera image into an encoder, and extracting local features of the camera image by adopting a convolutional neural network to obtain a camera image feature map;
S12, reducing the size of the camera image feature map layer by using a pooling layer according to the camera image feature map obtained in the step S11;
S13, inputting the camera image feature images with reduced sizes into a decoder, and recovering the sizes of the camera image feature images layer by adopting a convolutional neural network and a bilinear upsampling method to obtain camera image feature images with original sizes.
4. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 2, wherein step S2 specifically comprises:
S21, acquiring laser point cloud data acquired by a laser radar, and projecting the laser point cloud data on a camera plane to obtain two-dimensional laser point cloud data;
S22, inputting the two-dimensional laser point cloud data obtained in the step S21 into an encoder, and extracting local features of the two-dimensional laser point cloud data by adopting a convolutional neural network to obtain a laser point cloud data feature map;
S23, reducing the size of the laser point cloud data feature map layer by using a pooling layer according to the laser point cloud data feature map obtained in the step S22;
S24, inputting the reduced-size laser point cloud data feature map into a decoder, and recovering the size of the laser point cloud data feature map layer by utilizing a convolutional neural network and a bilinear upsampling method to obtain the laser point cloud data feature map with the original size.
5. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 4, wherein the calculation formula for performing projection of the laser point cloud data on the camera plane is:
$$[x_i', y_i', z_i']^T = K \times T_r \times [x_i, y_i, z_i, 1]^T$$
$$M_l[u_i][v_i] = 1$$
Where $x_i'$, $y_i'$ and $z_i'$ represent the position of the $i$-th laser point in the camera coordinate system, $T$ represents the transpose, $K$ represents the intrinsic matrix of the camera, $T_r$ represents the lidar-to-camera transformation matrix, $x_i$, $y_i$ and $z_i$ represent the position of the $i$-th laser point along the x, y and z axes, $u_i$ and $v_i$ represent the indices of the $i$-th laser point in the vertical and horizontal directions of the camera plane, respectively, and $M_l$ represents the lidar mask.
6. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 1, wherein the fusion module is composed of a splicing module, a convolution layer and a sliding window attention module, wherein the sliding window attention module is composed of a first sliding window attention layer and a second sliding window attention layer, the first sliding window attention layer is composed of a layer normalization module, a W-MSA module and a multi-layer perceptron module, and the second sliding window attention layer is composed of a layer normalization module, a SW-MSA module and a multi-layer perceptron module.
7. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 6, wherein step S3 specifically comprises:
S31, inputting the camera image feature map from step S1 and the laser point cloud data feature map from step S2 into the splicing module of the fusion module to obtain the splicing features of the camera and the lidar;
S32, inputting the splicing features of the camera and the lidar obtained in step S31 into the convolution layer to obtain the fusion features of the camera and the lidar;
S33, inputting the fusion features of the camera and the lidar obtained in step S32 into the sliding window attention module to obtain the fusion attention features of the camera and the lidar;
S34, merging the fusion attention features of the camera and the lidar obtained in step S33 and the fusion features of the camera and the lidar obtained in step S32, in proportion, into the camera image feature map from step S1 and the laser point cloud data feature map from step S2 to obtain the fused camera image features and the fused laser point cloud data features.
8. The three-dimensional semantic segmentation method based on the fusion of a camera and laser according to claim 7, wherein the calculation formula of the fused image features and the laser point cloud data features in the step S34 is as follows:
$$C_{fusion} = C_{origin} + a_1 \times \mathrm{SelfAttention} \times \mathrm{FusionFeature}$$
$$L_{fusion} = L_{origin} + a_2 \times \mathrm{SelfAttention} \times \mathrm{FusionFeature}$$
Wherein $C_{fusion}$ represents the fused camera image feature, $C_{origin}$ represents the camera image feature of the original size, $a_1$ and $a_2$ represent the fusion scale factors, $\mathrm{SelfAttention}$ represents the fusion attention feature of the camera and the lidar, $\mathrm{FusionFeature}$ represents the fusion feature of the camera and the lidar, $L_{fusion}$ represents the fused laser point cloud data feature, and $L_{origin}$ represents the laser point cloud data feature of the original size.
9. The three-dimensional semantic segmentation method based on the fusion of the camera and the laser according to claim 1, wherein the specific process of inputting the camera image feature map and the laser point cloud data feature map obtained in the step S4 into a supervision module and calculating a loss function by adopting a self-supervision mode is as follows:
The supervision module generates pseudo labels through a PIDNet network with added confidence, retains only the high-confidence pixels and laser point cloud data, and obtains the loss function of the self-supervised mode by setting a camera mask and a lidar mask, namely:
$$L_{\text{self-supervised}} = L_{foc1} + L_{lov1} + L_{fov2} + L_{lov2} + L_{kl}$$
Wherein $L_{\text{self-supervised}}$ represents the loss function of the self-supervised mode; $L_{kl}$ represents the one-way KL divergence; $L_{foc1}$ and $L_{lov1}$ represent the focal loss and the Lovász loss between the prediction result of the camera branch and the pseudo labels; $L_{fov2}$ and $L_{lov2}$ represent the focal loss and the Lovász loss between the prediction result of the lidar branch and the pseudo labels; $u$ and $v$ represent the length and width of the prediction result feature map; $C$ represents the confidence; $\mathrm{focalloss}(\cdot)$ represents the focal loss calculation formula; $pred_{camera}$ represents the prediction result of the camera branch; $pred_{Lidar}$ represents the prediction result of the lidar branch; $label$ represents the pseudo label value; $M_{\theta 1}$ represents the camera mask; $M_{\theta 2}$ represents the lidar mask; and $M_l$ represents the lidar mask obtained when the laser point cloud is projected onto the camera plane.
10. The three-dimensional semantic segmentation method based on the fusion of a camera and laser according to claim 1, wherein the specific process of inputting the camera image feature map and the laser point cloud data feature map obtained in the step S4 into a supervision module and calculating a loss function by adopting a supervised mode is as follows:
The supervision module adjusts the parameter weights by adopting the focal loss and the Lovász loss to obtain the loss function of the supervised mode, namely:
$$L_{supervised} = L_{foc1} + L_{lov1} + L_{fov2} + L_{lov2}$$
Where $L_{supervised}$ represents the loss function of the supervised mode, $L_{foc1}$ and $L_{lov1}$ represent the focal loss and the Lovász loss between the prediction result of the camera branch and the ground-truth labels, respectively, and $L_{fov2}$ and $L_{lov2}$ represent the focal loss and the Lovász loss between the prediction result of the lidar branch and the ground-truth labels, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311872786.1A CN117934831A (en) | 2023-12-29 | 2023-12-29 | Three-dimensional semantic segmentation method based on camera and laser fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117934831A (en) | 2024-04-26 |
Family
ID=90762311
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202311872786.1A | Pending | CN117934831A (en) | 2023-12-29 | 2023-12-29 | Three-dimensional semantic segmentation method based on camera and laser fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117934831A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118262385A (en) * | 2024-05-30 | 2024-06-28 | 齐鲁工业大学(山东省科学院) | Scheduling sequence based on camera difference and pedestrian re-recognition method based on training |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||