CN117934831A - Three-dimensional semantic segmentation method based on camera and laser fusion - Google Patents

Three-dimensional semantic segmentation method based on camera and laser fusion

Info

Publication number
CN117934831A
CN117934831A (application CN202311872786.1A)
Authority
CN
China
Prior art keywords
camera
point cloud
laser
cloud data
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311872786.1A
Other languages
Chinese (zh)
Inventor
肖卓凌
王天越
胡信为
向禹骄
张新辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202311872786.1A priority Critical patent/CN117934831A/en
Publication of CN117934831A publication Critical patent/CN117934831A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/88Lidar systems specially adapted for specific applications
    • G01S17/89Lidar systems specially adapted for specific applications for mapping or imaging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Electromagnetism (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Remote Sensing (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional semantic segmentation method based on camera and laser fusion, which comprises the following steps: the camera image and the laser point cloud data are input into a camera module and a laser module respectively to extract features, yielding a camera image feature map and a laser point cloud data feature map; the two feature maps are input into a fusion module for feature fusion, yielding fused camera image features and fused laser point cloud data features; the fused features are fed back into the camera module and the laser module respectively to obtain refined camera image and laser point cloud data feature maps; these feature maps are input into a supervision module to calculate a loss function and update the parameter weights of the three-dimensional semantic segmentation network, giving a trained three-dimensional semantic segmentation network; finally, newly acquired camera image and laser point cloud data are input into the trained network to obtain semantic segmentation results for both the laser point cloud data and the camera image. The method effectively combines the texture information of the image with the distance information of the laser and improves the accuracy of semantic segmentation.

Description

Three-dimensional semantic segmentation method based on camera and laser fusion
Technical Field
The invention relates to the technical field of automatic driving and semantic segmentation, in particular to a three-dimensional semantic segmentation method based on fusion of a camera and laser.
Background
In the field of autonomous driving, semantic segmentation is essential for scene understanding. The semantic segmentation task assigns a corresponding semantic label to every camera pixel and every laser point in the input. Current methods fall mainly into two categories: camera-based and lidar-based.
A camera image contains three color channels and therefore carries rich appearance information such as color and texture. However, the camera is a passive sensor that is easily affected by lighting conditions and weather, and because it is a 2D sensor lacking depth information, it is often difficult to obtain accurate distance information about the surrounding environment. Lidar, in contrast, is an active sensor: it emits laser pulses and receives their reflections to compute accurate distances, so its performance is hardly affected by different illumination conditions. However, because the point cloud is sparse, irregularly distributed and lacks texture, segmentation quality degrades in scenes with small objects, long distances and structurally similar objects.
Existing fusion schemes based on camera images and laser point clouds combine the advantages of camera-based and lidar-based methods and achieve three-dimensional semantic segmentation by exploiting both image texture and laser distance. Nevertheless, lidar segmentation still suffers from the lack of texture features, and image segmentation from the lack of distance information.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a three-dimensional semantic segmentation method based on fusion of a camera and laser.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
A three-dimensional semantic segmentation method based on camera and laser fusion comprises the following steps:
S1, inputting a camera image into a camera module of a three-dimensional semantic segmentation network to extract image features, and obtaining a camera image feature map with an original size;
S2, inputting laser point cloud data into a laser module of a three-dimensional semantic segmentation network to extract laser point cloud data characteristics, and obtaining a laser point cloud data characteristic diagram with an original size;
S3, inputting the camera image feature map in the step S1 and the laser point cloud data feature map in the step S2 into a fusion module of a three-dimensional semantic segmentation network for feature fusion to obtain fused camera image features and laser point cloud data features;
S4, inputting the fused image features and the laser point cloud data features obtained in the step S3 into a camera module and a laser module respectively to obtain a camera image feature map and a laser point cloud data feature map;
S5, inputting the camera image feature map and the laser point cloud data feature map obtained in the step S4 into a supervision module of the three-dimensional semantic segmentation network, and calculating a loss function by adopting a self-supervision mode or a supervised mode;
S6, calculating gradients of a camera module, a laser module, a fusion module and a supervision module of the three-dimensional semantic segmentation network according to the loss function calculated in the step S5, and updating parameter weights of the three-dimensional semantic segmentation network by adopting a gradient descent method to obtain a trained three-dimensional semantic segmentation network;
S7, acquiring camera image and laser point cloud data, and inputting the trained three-dimensional semantic segmentation network in the step S6 to obtain semantic segmentation results of the laser point cloud data and the camera image.
Further, the camera module and the laser module each consist of an encoder and a decoder, wherein the feature map size in the encoder is reduced layer by layer, the feature map size in the decoder is increased layer by layer, and a skip connection structure is added between the encoder layer and the decoder layer with the same feature map size.
Further, the step S1 specifically includes:
S11, acquiring a camera image acquired by a camera, inputting the camera image into the encoder, and extracting local features of the camera image by adopting a convolutional neural network to obtain a camera image feature map;
S12, reducing the size of the camera image feature map layer by layer by using a pooling layer, according to the camera image feature map obtained in the step S11;
S13, inputting the reduced-size camera image feature map into the decoder, and recovering the size of the camera image feature map layer by layer by adopting a convolutional neural network and a bilinear upsampling method to obtain a camera image feature map with the original size.
Further, step S2 specifically includes:
S21, acquiring laser point cloud data acquired by a laser radar, and projecting the laser point cloud data on a camera plane to obtain two-dimensional laser point cloud data;
S22, inputting the two-dimensional laser point cloud data obtained in the step S21 into the encoder, and extracting local features of the two-dimensional laser point cloud data by adopting a convolutional neural network to obtain a laser point cloud data feature map;
S23, reducing the size of the laser point cloud data feature map layer by layer by using a pooling layer, according to the laser point cloud data feature map obtained in the step S22;
S24, inputting the reduced-size laser point cloud data feature map into the decoder, and recovering the size of the laser point cloud data feature map layer by layer by utilizing a convolutional neural network and a bilinear upsampling method to obtain the laser point cloud data feature map with the original size.
Further, a calculation formula for performing projection of the laser point cloud data on the camera plane is as follows:
[x'_i, y'_i, z'_i]^T = K × T_r × [x_i, y_i, z_i, 1]^T
M_l[u_i][v_i] = 1
Where x'_i, y'_i, z'_i represent the position of the ith laser point in the camera coordinate system, T represents the transpose, K represents the camera intrinsic matrix, T_r represents the laser-to-camera transfer matrix, x_i, y_i, z_i represent the position of the ith laser point along the x, y and z axes, u_i, v_i represent the indices of the ith laser point in the vertical and horizontal directions of the camera plane respectively, and M_l represents the lidar mask.
Further, the fusion module is composed of a splicing module, a convolution layer and a sliding window attention module, wherein the sliding window attention module is composed of a first sliding window attention layer and a second sliding window attention layer, the first sliding window attention layer is composed of a layer standardization module, a W-MSA module and a multi-layer perceptron module, and the second sliding window attention layer is composed of a layer standardization module, a SW-MSA module and a multi-layer perceptron module.
Further, the step S3 specifically includes:
S31, inputting the camera image feature map in the step S1 and the laser point cloud data feature map in the step S2 into a splicing module of a fusion module to obtain splicing features of a camera and a laser radar;
S32, inputting the splicing characteristics of the camera and the laser radar obtained in the step S31 into a convolution layer to obtain the fusion characteristics of the camera and the laser radar;
S33, inputting the fusion characteristics of the camera and the laser radar obtained in the step S32 into a sliding window attention module to obtain the fusion attention characteristics of the camera and the laser radar;
S34, merging the fusion attention characteristic of the camera and the laser radar obtained in the step S33 and the fusion characteristic of the camera and the laser radar obtained in the step S32 into the image characteristic diagram in the step S1 and the laser point cloud data characteristic diagram in the step S2 in proportion to obtain the fused camera image characteristic and the laser point cloud data characteristic.
Further, the calculation formula of the fused image features and the laser point cloud data features in step S34 is as follows:
C_fusion = C_origin + a_1 × SelfAttention × FusionFeature
L_fusion = L_origin + a_2 × SelfAttention × FusionFeature
Where C_fusion represents the fused camera image feature, C_origin represents the camera image feature of the original size, a_1 and a_2 represent the fusion scale factors, SelfAttention represents the fused attention feature of the camera and the lidar, FusionFeature represents the fusion feature of the camera and the lidar, L_fusion represents the fused laser point cloud data feature, and L_origin represents the laser point cloud data feature of the original size.
Further, the camera image feature map and the laser point cloud data feature map obtained in the step S4 are input into a supervision module, and the specific process of calculating the loss function by adopting the self-supervision mode is as follows:
The supervision module generates pseudo labels through a PIDNet network with added confidence estimation, retains only the high-confidence pixels and laser point cloud data, and obtains the loss function of the self-supervision mode by setting a camera mask and a lidar mask, namely:
L_self-supervised = L_foc1 + L_lov1 + L_foc2 + L_lov2 + L_kl
Where L_self-supervised represents the loss function in the self-supervision mode, L_kl represents the one-way KL divergence, L_foc1 and L_lov1 represent the focal loss and the Lovász loss between the prediction result of the camera branch and the pseudo label, L_foc2 and L_lov2 represent the focal loss and the Lovász loss between the prediction result of the lidar branch and the pseudo label, u and v represent the length and width of the prediction feature map, C represents the confidence, focalloss(·) represents the focal loss function, pred_camera represents the prediction result of the camera branch, pred_Lidar represents the prediction result of the lidar branch, label represents the pseudo label value, M_θ1 represents the camera confidence mask, M_θ2 represents the lidar confidence mask, and M_l represents the lidar projection mask.
Further, the camera image feature map and the laser point cloud data feature map obtained in the step S4 are input into a supervision module, and the specific process of calculating the loss function by adopting the supervised mode is as follows:
The supervision module adjusts the parameter weights using the focal loss and the Lovász loss to obtain the loss function of the supervised mode, namely:
L_supervised = L_foc1 + L_lov1 + L_foc2 + L_lov2
Where L_supervised represents the loss function in the supervised mode, L_foc1 and L_lov1 represent the focal loss and the Lovász loss between the prediction result of the camera branch and the ground-truth label respectively, and L_foc2 and L_lov2 represent the focal loss and the Lovász loss between the prediction result of the lidar branch and the ground-truth label respectively.
The invention has the following beneficial effects:
1. The three-dimensional semantic segmentation method based on camera and laser fusion improves the accuracy of semantic segmentation by effectively combining the texture information of the camera image with the distance information of the lidar; at the same time, introducing camera image information makes the prediction more accurate when the laser point cloud of a small object is sparse;
2. The fusion module adopts a sliding window attention mechanism, so the three-dimensional semantic segmentation network is more robust to severe changes in illumination and color;
3. Pseudo labels are generated by a PIDNet network augmented with confidence estimation, so the three-dimensional semantic segmentation network can be trained across datasets and across modalities without any manually annotated point cloud labels, which improves the prediction accuracy of the three-dimensional semantic segmentation network.
Drawings
FIG. 1 is a schematic flow chart of a three-dimensional semantic segmentation method based on camera and laser fusion;
FIG. 2 is a schematic diagram of a three-dimensional semantic segmentation network architecture;
FIG. 3 is a schematic diagram of a fusion module;
FIG. 4 is a schematic diagram of a PIDNet network with added confidence in the supervision module;
FIG. 5 is a schematic diagram of a three-dimensional semantic segmentation network segmentation result.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the present invention, but it should be understood that the present invention is not limited to the scope of these embodiments; to those skilled in the art, any invention that makes use of the inventive concept falls within the protection of the spirit and scope of the present invention as defined in the appended claims.
As shown in fig. 1, a three-dimensional semantic segmentation method based on camera and laser fusion includes the following steps S1-S7:
As shown in fig. 2, fig. 2 is a schematic diagram of the three-dimensional semantic segmentation network structure. The three-dimensional semantic segmentation network in fig. 2 includes a camera module, a laser module, a fusion module, and a supervision module. The network as a whole adopts a two-stream architecture, i.e., two input ends: the camera image is fed into the camera module, and the laser point cloud data from the lidar is fed into the laser module. The camera module and the laser module are similar in structure; each consists of an encoder in which the feature map size gradually decreases and a decoder in which the feature map size gradually increases. In the encoder stage, a convolutional neural network extracts local features of the camera image or laser point cloud data and a pooling layer reduces the feature map size; in the decoder stage, a convolutional neural network together with bilinear upsampling restores the feature map size layer by layer until the size of the original input camera image or laser point cloud image is recovered. Meanwhile, in this embodiment, a skip connection structure is added between encoder and decoder feature maps of the same size in the camera module or the laser module; as shown in fig. 2, features are propagated directly from the encoder to the decoder along the dashed paths.
S1, inputting a camera image into a camera module of a three-dimensional semantic segmentation network to extract image features, and obtaining a camera image feature map with an original size.
S2, inputting the laser point cloud data into a laser module of the three-dimensional semantic segmentation network to extract the laser point cloud data characteristics, and obtaining a laser point cloud data characteristic diagram with the original size.
Specifically, the camera module and the laser module each consist of an encoder and a decoder, wherein the feature map size in the encoder is reduced layer by layer, the feature map size in the decoder is increased layer by layer, and a skip connection structure is added between the encoder layer and the decoder layer with the same feature map size.
In this embodiment, the receptive field of the convolutions is enlarged layer by layer, which improves segmentation performance for objects of different sizes; at the same time, the skip connection structure preserves the edge and position information of the original image features, making the semantic segmentation more accurate. The number of encoder and decoder layers can be set freely and fine-tuned according to the input image size; it is usually 4.
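As a concrete illustration of the encoder-decoder layout just described, the following is a minimal PyTorch-style sketch of one branch (camera or laser module): a 4-level encoder with convolution and pooling, a decoder with convolution and bilinear upsampling, and skip connections between levels of equal size. PyTorch itself, the channel widths and the helper names are assumptions for illustration, not details taken from the patent.

```python
# Minimal sketch of a camera/laser branch: 4-level encoder-decoder with skip
# connections; all layer widths are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class BranchUNet(nn.Module):
    def __init__(self, in_ch=3, base=32, num_levels=4):
        super().__init__()
        chans = [base * 2 ** i for i in range(num_levels)]        # e.g. 32, 64, 128, 256
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chans:
            self.encoders.append(conv_block(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)
        self.decoders = nn.ModuleList()
        for i in range(num_levels - 1, 0, -1):
            # decoder input = upsampled deeper feature + skip connection
            self.decoders.append(conv_block(chans[i] + chans[i - 1], chans[i - 1]))

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.encoders):
            x = enc(x)
            if i < len(self.encoders) - 1:
                skips.append(x)          # kept for the skip connection
                x = self.pool(x)         # feature map size shrinks layer by layer
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            x = dec(torch.cat([x, skip], dim=1))   # skip connection from the encoder
        return x                                   # original-size feature map
```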
Specifically, step S1 specifically includes S11-S13:
S11, acquiring a camera image acquired by a camera, inputting the camera image into the encoder, and extracting local features of the camera image by adopting a convolutional neural network to obtain a camera image feature map.
S12, reducing the size of the camera image feature map layer by layer by using a pooling layer, according to the camera image feature map obtained in the step S11.
S13, inputting the reduced-size camera image feature map into the decoder, and recovering the size of the camera image feature map layer by layer by adopting a convolutional neural network and a bilinear upsampling method to obtain a camera image feature map with the original size.
Specifically, step S2 specifically includes S21-S24:
S21, acquiring laser point cloud data acquired by a laser radar, and projecting the laser point cloud data on a camera plane to obtain two-dimensional laser point cloud data.
Specifically, a calculation formula for performing projection of the laser point cloud data on the camera plane is as follows:
[x'_i, y'_i, z'_i]^T = K × T_r × [x_i, y_i, z_i, 1]^T
M_l[u_i][v_i] = 1
Where x'_i, y'_i, z'_i represent the position of the ith laser point in the camera coordinate system, T represents the transpose, K represents the camera intrinsic matrix, T_r represents the laser-to-camera transfer matrix, x_i, y_i, z_i represent the position of the ith laser point along the x, y and z axes, u_i, v_i represent the indices of the ith laser point in the vertical and horizontal directions of the camera plane respectively, and M_l represents the lidar mask.
In this embodiment, because the camera data is two-dimensional while the laser point cloud data is three-dimensional, the two live in different spaces; moreover, the field of view of a mechanically rotating lidar is generally larger than that of the camera. Therefore, the laser point cloud data is projected onto the camera plane and converted into two-dimensional laser point cloud data. The formula [x'_i, y'_i, z'_i]^T = K × T_r × [x_i, y_i, z_i, 1]^T projects a laser point onto the camera image, where the laser point is P_i = {x_i, y_i, z_i}^T and x_i, y_i, z_i are its coordinates in three-dimensional space; T_r is the laser-to-camera transfer matrix, which describes the physical relationship between the lidar and the camera; K is the camera intrinsic matrix, i.e., the mapping from three-dimensional space onto the two-dimensional photosensitive plane; and [x'_i, y'_i, z'_i]^T is the resulting three-dimensional point in the camera coordinate system. Dividing by z'_i, i.e. [u_i, v_i, 1]^T = [x'_i, y'_i, z'_i]^T / z'_i, gives the two-dimensional coordinate of the laser point on the camera plane. Since laser point cloud data is typically sparse, not every image pixel has a corresponding laser point, so the positions that do receive a laser point mapping are marked by the formula M_l[u_i][v_i] = 1.
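The projection can be written compactly as a small NumPy routine. The sketch below assumes K is the 3×3 intrinsic matrix and T_r the 3×4 lidar-to-camera extrinsic matrix; the row/column variable names and the out-of-view filtering are illustrative conventions, not details fixed by the patent.

```python
# Sketch of the lidar-to-camera projection above:
# [x', y', z']^T = K × T_r × [x, y, z, 1]^T, then divide by z' to get pixel
# coordinates, and mark the sparse lidar mask M_l where a point lands.
import numpy as np

def project_lidar_to_camera(points, K, Tr, H, W):
    """points: (N, 3) lidar xyz; K: (3, 3) intrinsics; Tr: (3, 4) extrinsics."""
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # (N, 4) homogeneous
    cam = (K @ Tr @ pts_h.T).T                                   # (N, 3): x', y', z'
    keep = cam[:, 2] > 0                                         # keep points in front of the camera
    cam = cam[keep]
    col = np.round(cam[:, 0] / cam[:, 2]).astype(int)            # horizontal pixel index
    row = np.round(cam[:, 1] / cam[:, 2]).astype(int)            # vertical pixel index
    inside = (col >= 0) & (col < W) & (row >= 0) & (row < H)     # lidar FOV is wider than the camera's
    mask = np.zeros((H, W), dtype=np.uint8)
    mask[row[inside], col[inside]] = 1                           # corresponds to M_l[u_i][v_i] = 1
    return row[inside], col[inside], cam[inside, 2], mask
```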
S22, inputting the two-dimensional laser point cloud data obtained in the step S21 into the encoder, and extracting local features of the two-dimensional laser point cloud data by adopting a convolutional neural network to obtain a laser point cloud data feature map.
S23, reducing the size of the laser point cloud data feature map layer by layer by using a pooling layer, according to the laser point cloud data feature map obtained in the step S22.
S24, inputting the reduced-size laser point cloud data feature map into the decoder, and recovering the size of the laser point cloud data feature map layer by layer by utilizing a convolutional neural network and a bilinear upsampling method to obtain the laser point cloud data feature map with the original size.
And S3, inputting the camera image feature map in the step S1 and the laser point cloud data feature map in the step S2 into a fusion module of the three-dimensional semantic segmentation network for feature fusion, and obtaining fused camera image features and laser point cloud data features.
As shown in fig. 3, fig. 3 is a schematic diagram of the fusion module structure. In fig. 3 (a), the fusion module includes a splicing module (C), a convolution layer (Conv), and a sliding window attention module (Swin Transformer). In this embodiment, the original-size image feature (C_origin) from the camera image feature map and the original-size laser point cloud data feature (L_origin) from the laser point cloud data feature map are input into the splicing module to obtain the concatenated feature (Concat Feature); the concatenated feature is then input into the convolution layer to obtain the fusion feature (Fusion Feature); the fusion feature is then input into the sliding window attention module to obtain the fusion attention feature (Self-Attention Feature); finally, the fusion attention feature is merged in proportion with the original-size image feature and the original-size laser point cloud data feature to obtain the fused camera image feature (C_fusion) and the fused laser point cloud data feature (L_fusion). In fig. 3 (b), the sliding window attention module contains a first sliding window attention layer and a second sliding window attention layer; the first layer consists of a layer normalization module (LN), a W-MSA module (Windows Multi-Head Self-Attention), and a multi-layer perceptron module (MLP), and the second layer consists of a layer normalization module (LN), an SW-MSA module (Shifted Windows Multi-Head Self-Attention), and a multi-layer perceptron module (MLP). The fusion feature is first flattened to obtain patch features (Patch Feature) and then passes through the first and second sliding window attention layers. The two layers differ only in that the first uses the W-MSA structure, whose purpose is to reduce computation by computing self-attention only within each window, while the second uses the SW-MSA structure, whose purpose is to provide information exchange between windows through shifted windows.
Specifically, the fusion module is composed of a splicing module, a convolution layer and a sliding window attention module, wherein the sliding window attention module is composed of a first sliding window attention layer and a second sliding window attention layer, the first sliding window attention layer is composed of a layer standardization module, a W-MSA module and a multi-layer perceptron module, and the second sliding window attention layer is composed of a layer standardization module, a SW-MSA module and a multi-layer perceptron module.
Specifically, step S3 specifically includes S31-S34:
S31, inputting the camera image feature map in the step S1 and the laser point cloud data feature map in the step S2 into a splicing module of the fusion module to obtain splicing features of the camera and the laser radar.
S32, inputting the splicing characteristics of the camera and the laser radar obtained in the step S31 into a convolution layer to obtain the fusion characteristics of the camera and the laser radar.
S33, inputting the fusion characteristics of the camera and the laser radar obtained in the step S32 into a sliding window attention module to obtain the fusion attention characteristics of the camera and the laser radar.
S34, merging the fusion attention characteristic of the camera and the laser radar obtained in the step S33 and the fusion characteristic of the camera and the laser radar obtained in the step S32 into the image characteristic diagram in the step S1 and the laser point cloud data characteristic diagram in the step S2 in proportion to obtain the fused camera image characteristic and the laser point cloud data characteristic.
Specifically, the calculation formula of the fused image feature and the laser point cloud data feature in step S34 is as follows:
C_fusion = C_origin + a_1 × SelfAttention × FusionFeature
L_fusion = L_origin + a_2 × SelfAttention × FusionFeature
Where C_fusion represents the fused camera image feature, C_origin represents the camera image feature of the original size, a_1 and a_2 represent the fusion scale factors, SelfAttention represents the fused attention feature of the camera and the lidar, FusionFeature represents the fusion feature of the camera and the lidar, L_fusion represents the fused laser point cloud data feature, and L_origin represents the laser point cloud data feature of the original size.
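The flow of steps S31-S34 can be sketched as below. For brevity the sliding window (W-MSA/SW-MSA) attention is replaced here by a plain multi-head self-attention over flattened positions; this substitution, the layer widths and the default scale factors are assumptions for illustration, not the patent's exact layers.

```python
# Sketch of the fusion module: concatenate (S31), convolve (S32), attend (S33),
# and merge back into each branch with scale factors a1, a2 (S34).
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, ch, a1=0.5, a2=0.5, heads=4):
        super().__init__()
        # ch must be divisible by heads for multi-head attention.
        self.conv = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)   # S32: fusion feature
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.a1, self.a2 = a1, a2                                     # fusion scale factors

    def forward(self, c_origin, l_origin):
        concat = torch.cat([c_origin, l_origin], dim=1)               # S31: splice features
        fusion = self.conv(concat)                                    # S32: FusionFeature
        b, ch, h, w = fusion.shape
        tokens = fusion.flatten(2).transpose(1, 2)                    # (B, H*W, C) patch features
        attn_out, _ = self.attn(tokens, tokens, tokens)               # S33: self-attention
        self_attention = attn_out.transpose(1, 2).reshape(b, ch, h, w)
        weighted = self_attention * fusion                            # SelfAttention × FusionFeature
        c_fusion = c_origin + self.a1 * weighted                      # S34: C_fusion
        l_fusion = l_origin + self.a2 * weighted                      # S34: L_fusion
        return c_fusion, l_fusion
```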
S4, inputting the fused image features and the laser point cloud data features obtained in the step S3 into a camera module and a laser module respectively to obtain a camera image feature map and a laser point cloud data feature map.
In this embodiment, the image features and the laser point cloud data features fused by the fusion module are input into the camera module and the laser module respectively, where feature extraction continues. The purpose of the fusion module is to exchange information between the two branches effectively. Compared with a purely convolutional scheme, the fusion module in this embodiment adopts a sliding window attention structure: the global attention mechanism allows better feature selection, and the attention mechanism also weights the fused image features and laser point cloud data features back onto the respective original image and laser features before the next stage of feature extraction.
S5, inputting the camera image feature map and the laser point cloud data feature map obtained in the step S4 into a supervision module of the three-dimensional semantic segmentation network, and calculating a loss function by adopting a self-supervision mode or a supervised mode.
In this embodiment, the supervision module supports two modes: a supervised mode and a self-supervision mode. The supervised mode is trained with real labels, i.e., the ground-truth labels of the laser point cloud in the original dataset, which supervise both the camera predictions and the lidar predictions. The self-supervision mode supervises network convergence with pseudo labels: no ground-truth point cloud labels are used; instead, the image is inferred by a 2D image segmentation network pre-trained on other datasets, the pseudo-label result and its confidence are retained, and the pseudo labels together with the confidence jointly supervise the camera predictions and the lidar predictions.
Specifically, the camera image feature map and the laser point cloud data feature map obtained in the step S4 are input into a supervision module, and the specific process of calculating the loss function by adopting the self-supervision mode is as follows:
The supervision module generates pseudo labels through a PIDNet network with added confidence estimation, retains only the high-confidence pixels and laser point cloud data, and obtains the loss function of the self-supervision mode by setting a camera mask and a lidar mask, namely:
L_self-supervised = L_foc1 + L_lov1 + L_foc2 + L_lov2 + L_kl
Where L_self-supervised represents the loss function in the self-supervision mode, L_kl represents the one-way KL divergence, L_foc1 and L_lov1 represent the focal loss and the Lovász loss between the prediction result of the camera branch and the pseudo label, L_foc2 and L_lov2 represent the focal loss and the Lovász loss between the prediction result of the lidar branch and the pseudo label, u and v represent the length and width of the prediction feature map, C represents the confidence, focalloss(·) represents the focal loss function, pred_camera represents the prediction result of the camera branch, pred_Lidar represents the prediction result of the lidar branch, label represents the pseudo label value, M_θ1 represents the camera confidence mask, M_θ2 represents the lidar confidence mask, and M_l represents the lidar projection mask.
In this embodiment, the camera mask and the lidar mask are activated only when the confidence exceeds the thresholds θ1 and θ2, respectively.
As shown in fig. 4, fig. 4 is a schematic structural diagram of the PIDNet network with added confidence in the supervision module. In this embodiment, the self-supervision mode generates pseudo labels to train the network, and the pseudo labels supervise the camera predictions and the lidar predictions. Fig. 4 shows the PIDNet network augmented with confidence estimation, which computes the confidence and the pseudo labels; it is used only to generate pseudo labels and does not participate in the training of the three-dimensional semantic segmentation network itself. The confidence C is defined from the entropy E, and the entropy E is calculated from the probability p: the more concentrated the output of the three-dimensional semantic segmentation network is on a single class, the smaller the entropy E and the closer the confidence C is to 1. In the confidence calculation, N represents the number of semantic segmentation classes.
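The exact confidence formula appears only as a figure in the original publication. The sketch below therefore assumes the common normalized-entropy form C = 1 − E / log(N), which matches the stated behavior (output concentrated on one class ⇒ small E ⇒ C close to 1); the normalization is an assumption, not the patent's verbatim formula.

```python
# Sketch of an entropy-based confidence, assuming C = 1 - E / log(N) with
# E = -sum_k p_k log p_k; this normalization is an assumption that matches
# the behavior described above.
import torch
import torch.nn.functional as F

def confidence_from_logits(logits):
    """logits: (B, N, H, W) raw network outputs; returns per-pixel confidence in [0, 1]."""
    num_classes = logits.shape[1]
    p = F.softmax(logits, dim=1)                                  # class probabilities
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=1)     # entropy E per pixel
    return 1.0 - entropy / torch.log(torch.tensor(float(num_classes)))

# High-confidence pixels/points are then kept by thresholding, e.g.
# M_theta1 = confidence_camera > theta1, M_theta2 = confidence_lidar > theta2.
```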
Specifically, the camera image feature map and the laser point cloud data feature map obtained in the step S4 are input into a supervision module, and the specific process of calculating the loss function by adopting the supervised mode is as follows:
The supervision module adjusts the parameter weights using the focal loss and the Lovász loss to obtain the loss function of the supervised mode, namely:
L_supervised = L_foc1 + L_lov1 + L_foc2 + L_lov2
Where L_supervised represents the loss function in the supervised mode, L_foc1 and L_lov1 represent the focal loss and the Lovász loss between the prediction result of the camera branch and the ground-truth label respectively, and L_foc2 and L_lov2 represent the focal loss and the Lovász loss between the prediction result of the lidar branch and the ground-truth label respectively.
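Assembling the supervised loss from the two branches can be sketched as follows. The focal loss below is a minimal version, and lovasz_softmax is a stand-in for a Lovász-Softmax implementation assumed to be provided elsewhere; neither is the patent's exact formulation.

```python
# Sketch of L_supervised = L_foc1 + L_lov1 + L_foc2 + L_lov2 over the camera
# and lidar branches; `lovasz_softmax` is an assumed external callable.
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, ignore_index=255):
    ce = F.cross_entropy(logits, target, reduction="none", ignore_index=ignore_index)
    pt = torch.exp(-ce)                               # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()

def supervised_loss(pred_camera, pred_lidar, label, lovasz_softmax):
    """pred_*: (B, N, H, W) logits; label: (B, H, W) class indices."""
    l_foc1 = focal_loss(pred_camera, label)
    l_lov1 = lovasz_softmax(F.softmax(pred_camera, dim=1), label)
    l_foc2 = focal_loss(pred_lidar, label)
    l_lov2 = lovasz_softmax(F.softmax(pred_lidar, dim=1), label)
    return l_foc1 + l_lov1 + l_foc2 + l_lov2
```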
S6, calculating gradients of a camera module, a laser module, a fusion module and a supervision module of the three-dimensional semantic segmentation network according to the loss function calculated in the step S5, and updating parameter weights of the three-dimensional semantic segmentation network by adopting a gradient descent method to obtain the trained three-dimensional semantic segmentation network.
In this embodiment, using the calculated loss function, the gradients of the camera module, the laser module, the fusion module and the supervision module are computed and back-propagated layer by layer, the parameter weights of the three-dimensional semantic segmentation network are updated, and the network finally converges to give the trained three-dimensional semantic segmentation network.
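One training step covering S1-S6 might look like the sketch below. The head layers producing per-branch predictions, the choice of optimizer and the learning rate are all assumptions for illustration.

```python
# Sketch of one training step: forward both branches and the fusion module,
# compute the chosen loss, backpropagate through all modules, and update the
# weights by gradient descent. Module/optimizer details are assumed.
import torch

def train_step(camera_module, laser_module, fusion_module, supervision_loss,
               optimizer, image, lidar_2d, label):
    optimizer.zero_grad()
    cam_feat = camera_module(image)                            # S1: camera feature map
    lid_feat = laser_module(lidar_2d)                          # S2: lidar feature map
    cam_fused, lid_fused = fusion_module(cam_feat, lid_feat)   # S3: fused features
    pred_camera = camera_module.head(cam_fused)                # S4: per-branch predictions
    pred_lidar = laser_module.head(lid_fused)                  #     (`head` layers are assumed)
    loss = supervision_loss(pred_camera, pred_lidar, label)    # S5: supervised or self-supervised
    loss.backward()                                            # S6: gradients for every module
    optimizer.step()                                           #     gradient-descent weight update
    return loss.item()

# optimizer = torch.optim.SGD(
#     list(camera_module.parameters()) + list(laser_module.parameters())
#     + list(fusion_module.parameters()), lr=1e-2, momentum=0.9)
```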
S7, acquiring camera image and laser point cloud data, and inputting the trained three-dimensional semantic segmentation network in the step S6 to obtain semantic segmentation results of the laser point cloud data and the camera image.
In this embodiment, the proposed three-dimensional semantic segmentation method was compared experimentally with a lidar-only segmentation method; the proposed method improves on the lidar-only method by 2.6%, and the introduction of images is particularly advantageous for segmenting small objects. The specific segmentation results are shown in Table 1:
Table 1: Results on the SemanticKITTI dataset
* results from our implementation (the front-view lidar data are derived from it); + results taken from the benchmark; bold marks the best result and underline the second-best result.
As shown in fig. 5, fig. 5 is a schematic diagram of the segmentation results of the three-dimensional semantic segmentation network: the proposed camera-and-laser-fusion method segments the camera image and the laser point cloud data to obtain three-dimensional semantic segmentation results for both. As can be seen from fig. 5, the shadow of the trees on the left side of the road makes the ground hard to distinguish, yet the proposed method still identifies the road boundary accurately; likewise, the distant cyclist is predicted correctly even though its reflected point cloud is sparse.
The principles and embodiments of the present invention have been described in detail with reference to specific examples, which are provided only to help understand the method and core ideas of the present invention; meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present invention, and this description should therefore not be construed as limiting the present invention.
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and it should be understood that the scope of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations based on the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.

Claims (10)

1. The three-dimensional semantic segmentation method based on the fusion of the camera and the laser is characterized by comprising the following steps of:
S1, inputting a camera image into a camera module of a three-dimensional semantic segmentation network to extract image features, and obtaining a camera image feature map with an original size;
S2, inputting laser point cloud data into a laser module of a three-dimensional semantic segmentation network to extract laser point cloud data characteristics, and obtaining a laser point cloud data characteristic diagram with an original size;
S3, inputting the camera image feature map in the step S1 and the laser point cloud data feature map in the step S2 into a fusion module of a three-dimensional semantic segmentation network for feature fusion to obtain fused camera image features and laser point cloud data features;
S4, inputting the fused image features and the laser point cloud data features obtained in the step S3 into a camera module and a laser module respectively to obtain a camera image feature map and a laser point cloud data feature map;
S5, inputting the camera image feature map and the laser point cloud data feature map obtained in the step S4 into a supervision module of the three-dimensional semantic segmentation network, and calculating a loss function by adopting a self-supervision mode or a supervised mode;
S6, calculating gradients of a camera module, a laser module, a fusion module and a supervision module of the three-dimensional semantic segmentation network according to the loss function calculated in the step S5, and updating parameter weights of the three-dimensional semantic segmentation network by adopting a gradient descent method to obtain a trained three-dimensional semantic segmentation network;
S7, acquiring camera image and laser point cloud data, and inputting the trained three-dimensional semantic segmentation network in the step S6 to obtain semantic segmentation results of the laser point cloud data and the camera image.
2. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 1, wherein the camera module and the laser module each consist of an encoder and a decoder, wherein the feature map size in the encoder is reduced layer by layer, the feature map size in the decoder is increased layer by layer, and a skip connection structure is added between the encoder layer and the decoder layer with the same feature map size.
3. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 2, wherein step S1 specifically comprises:
S11, acquiring a camera image acquired by a camera, inputting the camera image into the encoder, and extracting local features of the camera image by adopting a convolutional neural network to obtain a camera image feature map;
S12, reducing the size of the camera image feature map layer by layer by using a pooling layer, according to the camera image feature map obtained in the step S11;
S13, inputting the reduced-size camera image feature map into the decoder, and recovering the size of the camera image feature map layer by layer by adopting a convolutional neural network and a bilinear upsampling method to obtain a camera image feature map with the original size.
4. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 2, wherein step S2 specifically comprises:
S21, acquiring laser point cloud data acquired by a laser radar, and projecting the laser point cloud data on a camera plane to obtain two-dimensional laser point cloud data;
S22, inputting the two-dimensional laser point cloud data obtained in the step S21 into the encoder, and extracting local features of the two-dimensional laser point cloud data by adopting a convolutional neural network to obtain a laser point cloud data feature map;
S23, reducing the size of the laser point cloud data feature map layer by layer by using a pooling layer, according to the laser point cloud data feature map obtained in the step S22;
S24, inputting the reduced-size laser point cloud data feature map into the decoder, and recovering the size of the laser point cloud data feature map layer by layer by utilizing a convolutional neural network and a bilinear upsampling method to obtain the laser point cloud data feature map with the original size.
5. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 4, wherein the calculation formula for performing projection of the laser point cloud data on the camera plane is:
[x'_i, y'_i, z'_i]^T = K × T_r × [x_i, y_i, z_i, 1]^T
M_l[u_i][v_i] = 1
Where x'_i, y'_i, z'_i represent the position of the ith laser point in the camera coordinate system, T represents the transpose, K represents the camera intrinsic matrix, T_r represents the laser-to-camera transfer matrix, x_i, y_i, z_i represent the position of the ith laser point along the x, y and z axes, u_i, v_i represent the indices of the ith laser point in the vertical and horizontal directions of the camera plane respectively, and M_l represents the lidar mask.
6. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 1, wherein the fusion module is composed of a splicing module, a convolution layer and a sliding window attention module, wherein the sliding window attention module is composed of a first sliding window attention layer and a second sliding window attention layer, the first sliding window attention layer is composed of a layer standardization module, a W-MSA module and a multi-layer perceptron module, and the second sliding window attention layer is composed of a layer standardization module, a SW-MSA module and a multi-layer perceptron module.
7. The three-dimensional semantic segmentation method based on camera and laser fusion according to claim 6, wherein step S3 specifically comprises:
S31, inputting the camera image feature map in the step S1 and the laser point cloud data feature map in the step S2 into a splicing module of a fusion module to obtain splicing features of a camera and a laser radar;
S32, inputting the splicing characteristics of the camera and the laser radar obtained in the step S31 into a convolution layer to obtain the fusion characteristics of the camera and the laser radar;
S33, inputting the fusion characteristics of the camera and the laser radar obtained in the step S32 into a sliding window attention module to obtain the fusion attention characteristics of the camera and the laser radar;
S34, merging the fusion attention characteristic of the camera and the laser radar obtained in the step S33 and the fusion characteristic of the camera and the laser radar obtained in the step S32 into the image characteristic diagram in the step S1 and the laser point cloud data characteristic diagram in the step S2 in proportion to obtain the fused camera image characteristic and the laser point cloud data characteristic.
8. The three-dimensional semantic segmentation method based on the fusion of a camera and laser according to claim 7, wherein the calculation formula of the fused image features and the laser point cloud data features in the step S34 is as follows:
C_fusion = C_origin + a_1 × SelfAttention × FusionFeature
L_fusion = L_origin + a_2 × SelfAttention × FusionFeature
Where C_fusion represents the fused camera image feature, C_origin represents the camera image feature of the original size, a_1 and a_2 represent the fusion scale factors, SelfAttention represents the fused attention feature of the camera and the lidar, FusionFeature represents the fusion feature of the camera and the lidar, L_fusion represents the fused laser point cloud data feature, and L_origin represents the laser point cloud data feature of the original size.
9. The three-dimensional semantic segmentation method based on the fusion of the camera and the laser according to claim 1, wherein the specific process of inputting the camera image feature map and the laser point cloud data feature map obtained in the step S4 into a supervision module and calculating a loss function by adopting a self-supervision mode is as follows:
The supervision module generates pseudo labels through a PIDNet network with added confidence estimation, retains only the high-confidence pixels and laser point cloud data, and obtains the loss function of the self-supervision mode by setting a camera mask and a lidar mask, namely:
L_self-supervised = L_foc1 + L_lov1 + L_foc2 + L_lov2 + L_kl
Where L_self-supervised represents the loss function in the self-supervision mode, L_kl represents the one-way KL divergence, L_foc1 and L_lov1 represent the focal loss and the Lovász loss between the prediction result of the camera branch and the pseudo label, L_foc2 and L_lov2 represent the focal loss and the Lovász loss between the prediction result of the lidar branch and the pseudo label, u and v represent the length and width of the prediction feature map, C represents the confidence, focalloss(·) represents the focal loss function, pred_camera represents the prediction result of the camera branch, pred_Lidar represents the prediction result of the lidar branch, label represents the pseudo label value, M_θ1 represents the camera confidence mask, M_θ2 represents the lidar confidence mask, and M_l represents the lidar projection mask.
10. The three-dimensional semantic segmentation method based on the fusion of a camera and laser according to claim 1, wherein the specific process of inputting the camera image feature map and the laser point cloud data feature map obtained in the step S4 into a supervision module and calculating a loss function by adopting a supervision mode is as follows:
The supervision module adjusts the parameter weights using the focal loss and the Lovász loss to obtain the loss function of the supervised mode, namely:
L_supervised = L_foc1 + L_lov1 + L_foc2 + L_lov2
Where L_supervised represents the loss function in the supervised mode, L_foc1 and L_lov1 represent the focal loss and the Lovász loss between the prediction result of the camera branch and the ground-truth label respectively, and L_foc2 and L_lov2 represent the focal loss and the Lovász loss between the prediction result of the lidar branch and the ground-truth label respectively.
CN202311872786.1A 2023-12-29 2023-12-29 Three-dimensional semantic segmentation method based on camera and laser fusion Pending CN117934831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311872786.1A CN117934831A (en) 2023-12-29 2023-12-29 Three-dimensional semantic segmentation method based on camera and laser fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311872786.1A CN117934831A (en) 2023-12-29 2023-12-29 Three-dimensional semantic segmentation method based on camera and laser fusion

Publications (1)

Publication Number Publication Date
CN117934831A true CN117934831A (en) 2024-04-26

Family

ID=90762311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311872786.1A Pending CN117934831A (en) 2023-12-29 2023-12-29 Three-dimensional semantic segmentation method based on camera and laser fusion

Country Status (1)

Country Link
CN (1) CN117934831A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118262385A (en) * 2024-05-30 2024-06-28 齐鲁工业大学(山东省科学院) Scheduling sequence based on camera difference and pedestrian re-recognition method based on training

Similar Documents

Publication Publication Date Title
WO2021233029A1 (en) Simultaneous localization and mapping method, device, system and storage medium
CN109241913B (en) Ship detection method and system combining significance detection and deep learning
US20210390329A1 (en) Image processing method, device, movable platform, unmanned aerial vehicle, and storage medium
CN107025668B (en) Design method of visual odometer based on depth camera
CN114782691A (en) Robot target identification and motion detection method based on deep learning, storage medium and equipment
CN113409459B (en) Method, device and equipment for producing high-precision map and computer storage medium
CN112258600A (en) Simultaneous positioning and map construction method based on vision and laser radar
WO2021082745A1 (en) Information completion method, lane line recognition method, intelligent driving method and related product
CN111968229A (en) High-precision map making method and device
CN111507210A (en) Traffic signal lamp identification method and system, computing device and intelligent vehicle
CN115830265A (en) Automatic driving movement obstacle segmentation method based on laser radar
CN116188893A (en) Image detection model training and target detection method and device based on BEV
CN113313763A (en) Monocular camera pose optimization method and device based on neural network
CN111161334B (en) Semantic map construction method based on deep learning
CN113255779A (en) Multi-source perception data fusion identification method and system and computer readable storage medium
CN117934831A (en) Three-dimensional semantic segmentation method based on camera and laser fusion
CN112561996A (en) Target detection method in autonomous underwater robot recovery docking
CN116309705B (en) Satellite video single-target tracking method and system based on feature interaction
Ni et al. Second-order semi-global stereo matching algorithm based on slanted plane iterative optimization
CN115511759A (en) Point cloud image depth completion method based on cascade feature interaction
CN115578516A (en) Three-dimensional imaging method, device, equipment and storage medium
CN114608522A (en) Vision-based obstacle identification and distance measurement method
CN113436239A (en) Monocular image three-dimensional target detection method based on depth information estimation
CN116740488B (en) Training method and device for feature extraction model for visual positioning
CN117848348A (en) Fusion perception method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination