CN112598675A - Indoor scene semantic segmentation method based on an improved fully convolutional neural network


Info

Publication number: CN112598675A
Application number: CN202011559942.5A
Authority: CN (China)
Prior art keywords: layer, convolution, block, attention, neural network
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 周武杰, 岳雨纯, 雷景生, 强芳芳, 周扬, 邱薇薇, 何成, 王海江, 马骁, 郭翔
Assignee (current and original): Zhejiang Lover Health Science and Technology Development Co Ltd
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd; priority to CN202011559942.5A; publication of CN112598675A

Classifications

    • G06T 7/10 Image analysis; Segmentation; Edge detection
    • G06F 18/25 Pattern recognition; Fusion techniques
    • G06N 3/045 Neural networks; Combinations of networks
    • G06N 3/048 Neural networks; Activation functions
    • G06N 3/08 Neural networks; Learning methods
    • G06T 5/30 Image enhancement or restoration using local operators; Erosion or dilatation, e.g. thinning
    • G06T 2207/10024 Image acquisition modality; Color image
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an indoor scene semantic segmentation method based on an improved fully convolutional neural network. First, a convolutional neural network is constructed whose hidden layer comprises 5 neural network blocks, 5 feature re-extraction convolution blocks, 5 block attention convolution blocks, 12 fusion layers and 4 upsampling layers. The original indoor scene images are input into the convolutional neural network for training to obtain the corresponding semantic segmentation prediction maps. The loss function value between the set formed by the semantic segmentation prediction maps corresponding to the original indoor scene images and the set formed by the one-hot encoded images obtained from the corresponding real semantic segmentation images is calculated to obtain the optimal weight vector and bias term of the convolutional neural network classification training model. Finally, the indoor scene image to be semantically segmented is input into the trained convolutional neural network classification training model to obtain the predicted semantic segmentation image. The method has the advantage of improving the efficiency and accuracy of semantic segmentation of indoor scene images.

Description

Indoor scene semantic segmentation method based on an improved fully convolutional neural network
Technical Field
The invention relates to a deep-learning-based semantic segmentation method, and in particular to an indoor scene semantic segmentation method based on an improved fully convolutional neural network.
Background
Image semantic segmentation is one of the most challenging tasks in computer vision and plays a key role in applications such as automatic driving, medical image analysis, virtual reality and human-computer interaction. The core purpose of semantic segmentation is to assign a category label to each pixel in an image, determining which category that pixel belongs to.
From the perspective of supervised learning, image semantic segmentation methods can be divided into fully supervised, semi-supervised and unsupervised types. Considering operability, theoretical applicability and similar factors, most current mainstream models are fully supervised and a small number are semi-supervised, which makes the models easier to implement and train.
In terms of model application, since the appearance and development of the fully convolutional neural network, its use has achieved excellent performance and segmentation results in image semantic segmentation tasks, but many defects and shortcomings remain, such as a large number of parameters, a large amount of redundant information, and insufficient feature extraction and expression. Therefore, image semantic segmentation models based on fully convolutional neural networks still have considerable room for improvement, and proposing and training models with superior performance, guided by the characteristics of the image, the structure of the model and the operating principle of the human visual system, is a current and future development goal of the image semantic segmentation field.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides an indoor scene semantic segmentation method based on an improved fully convolutional neural network, which derives its model design direction and ideas from semantic feature expression, the operating principle of the human visual system and other aspects, improves the traditional fully convolutional neural network, and effectively improves image segmentation performance.
The technical scheme adopted by the invention comprises the following steps:
step 1: select Q pairs of original indoor scene images and corresponding real semantic segmentation images, and form a training set from all the original indoor scene images and the corresponding real semantic segmentation images; each pair of original indoor scene images comprises an original indoor scene color image and an original indoor scene depth image, and the real semantic segmentation images in the training set are processed into 41 one-hot encoded images using the one-hot encoding technique;
step 2: construct a convolutional neural network classification training model: the convolutional neural network classification training model comprises an input layer, a hidden layer and an output layer; the input layer comprises a color image input layer and a depth image input layer; the hidden layer comprises a color image processing module and a depth image processing module; the color image processing module and the depth image processing module are symmetrical in structure and each comprises five neural network blocks, five feature re-extraction convolution blocks and ten fusion layers; the hidden layer also comprises five block attention convolution blocks, four upsampling layers and two fusion layers;
step 3: input the training set into the convolutional neural network classification training model of step 2 for training; in each training iteration, 41 semantic segmentation prediction images corresponding to each pair of original indoor scene images are obtained, and the loss function value between the set formed by the 41 semantic segmentation prediction images and the set formed by the 41 one-hot encoded images of the corresponding real semantic segmentation image is calculated;
the loss function value is obtained using the categorical cross entropy (a training-loop sketch covering steps 3 to 5 is given after step 5);
step 4: repeat step 3 a total of V times to obtain Q × V loss function values; then find the minimum loss function value among the Q × V loss function values, and take the weight vector and bias term corresponding to the minimum loss function value as the optimal weight vector and optimal bias term of the convolutional neural network classification training model, thereby completing the training of the convolutional neural network classification training model;
step 5: use the trained convolutional neural network classification training model to perform prediction on an indoor scene image to be predicted, and output the corresponding predicted semantic segmentation image, thereby realizing indoor scene image semantic segmentation.
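For illustration, the following PyTorch-style sketch shows one possible realization of the training procedure in steps 3 to 5; the model interface, data loader, optimizer and learning rate are assumptions and are not prescribed by the invention.

    import torch
    import torch.nn as nn

    def train(model, train_loader, epochs, device="cuda"):
        """Hedged sketch of steps 3-4: categorical cross-entropy training that keeps
        the weights corresponding to the minimum loss function value."""
        model.to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer and learning rate
        # nn.CrossEntropyLoss over class indices 0..40 is equivalent to the categorical
        # cross entropy over the 41 one-hot encoded label maps.
        criterion = nn.CrossEntropyLoss()
        best_loss, best_state = float("inf"), None
        for _ in range(epochs):                          # repeat step 3 a total of V times
            for rgb, depth, label in train_loader:       # label: (N, H, W) map of class ids
                rgb, depth, label = rgb.to(device), depth.to(device), label.to(device)
                pred = model(rgb, depth)                 # (N, 41, H, W) semantic segmentation prediction
                loss = criterion(pred, label)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                if loss.item() < best_loss:              # keep the weights with the minimum loss value
                    best_loss = loss.item()
                    best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        model.load_state_dict(best_state)                # optimal weight vector and bias terms (step 4)
        return model

At test time (step 5), the trained model is applied to a new RGB-Depth pair and the class with the maximum score at each pixel yields the predicted semantic segmentation image.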
Step 2 is specifically as follows:
the color image input layer and the depth image input layer are respectively input into a first neural network block in the color image processing module and the depth image processing module;
the color image processing module and the depth image processing module have the same structure, and specifically comprise:
one output of the first neural network block is input into a first fusion layer through a first feature re-extraction convolution block, and the other output of the first neural network block is input into a first fusion layer; one output of the second neural network block is input into a third fusion layer through a second feature re-extraction convolution block, and the other output of the second neural network block is input into the third fusion layer; one output of the third neural network block is input into a fifth fusion layer through a third feature re-extraction convolution block, and the other output of the third neural network block is input into the fifth fusion layer; one output of the fourth neural network block is input into the seventh fusion layer through the fourth feature re-extraction convolution block, and the other output of the fourth neural network block is input into the seventh fusion layer; one output of the fifth neural network block is input into a ninth fusion layer through a fifth feature re-extraction convolution block, and the other output of the fifth neural network block is input into the ninth fusion layer; the two inputs of each fusion layer are fused in an element-by-element addition mode;
the output of the first fusion layer is respectively input into the first block attention convolution block and the corresponding second fusion layer, the output of the third fusion layer is respectively input into the second block attention convolution block and the corresponding fourth fusion layer, the output of the fifth fusion layer is respectively input into the third block attention convolution block and the corresponding sixth fusion layer, the output of the seventh fusion layer is respectively input into the fourth block attention convolution block and the corresponding eighth fusion layer, and the output of the ninth fusion layer is respectively input into the fifth block attention convolution block and the corresponding tenth fusion layer;
two outputs of the first block attention convolution block are respectively input into the second fusion layer of the color image processing module and the second fusion layer of the depth image processing module, two outputs of the second block attention convolution block are respectively input into the fourth fusion layer of the color image processing module and the fourth fusion layer of the depth image processing module, two outputs of the third block attention convolution block are respectively input into the sixth fusion layer of the color image processing module and the sixth fusion layer of the depth image processing module, two outputs of the fourth block attention convolution block are respectively input into the eighth fusion layer of the color image processing module and the eighth fusion layer of the depth image processing module, and two outputs of the fifth block attention convolution block are respectively input into the tenth fusion layer of the color image processing module and the tenth fusion layer of the depth image processing module;
the two inputs of the second fusion layer are fused in an element-by-element addition mode and then respectively input into an eleventh fusion layer and a corresponding second neural network block, the two inputs of the fourth fusion layer are fused in an element-by-element addition mode and then respectively input into a first up-sampling layer and a corresponding third neural network block, the two inputs of the sixth fusion layer are fused in an element-by-element addition mode and then respectively input into a second up-sampling layer and a corresponding fourth neural network block, and the two inputs of the eighth fusion layer are fused in an element-by-element addition mode and then respectively input into a third up-sampling layer and a corresponding fifth neural network block; the output of the tenth fusion layer is input into the fourth upsampling layer;
the two inputs of the eleventh fusion layer are fused in an element-by-element addition mode; the output of the eleventh fusion layer and the outputs of the first, second, third and fourth upsampling layers are all input into the twelfth fusion layer;
all the inputs of the twelfth fusion layer are connected in a concatenate (channel concatenation) mode and then output through the output layer; the output layer mainly comprises a convolution layer and a fifth upsampling layer which are connected in sequence.
Here, "corresponding" indicates that the preceding input and the subsequent output are located in the same color image processing module or the same depth image processing module.
The five neural network blocks adopt the MobileNetV2 network structure: the first neural network block adopts layers 1-4 of MobileNetV2 (repetition numbers n = 1 and n = 2, four layers in total), the second neural network block adopts layers 5-7 of MobileNetV2 (repetition number n = 3, three layers in total), the third neural network block adopts layers 8-11 of MobileNetV2 (repetition number n = 4, four layers in total), the fourth neural network block adopts layers 12-14 of MobileNetV2 (repetition number n = 3, three layers in total), and the fifth neural network block adopts layers 15-17 of MobileNetV2 (repetition number n = 3, three layers in total).
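Assuming the torchvision implementation of MobileNetV2 is used (an assumption; the text only names which MobileNetV2 layers each block takes), the five neural network blocks could be sliced out as follows:

    import torch
    import torchvision

    # torchvision's MobileNetV2 exposes its layers as the `features` Sequential;
    # slicing it by index reproduces the layer ranges named above (0-based indices).
    features = torchvision.models.mobilenet_v2().features

    block1 = features[0:4]    # layers 1-4,   output: 24 channels
    block2 = features[4:7]    # layers 5-7,   output: 32 channels
    block3 = features[7:11]   # layers 8-11,  output: 64 channels
    block4 = features[11:14]  # layers 12-14, output: 96 channels
    block5 = features[14:17]  # layers 15-17, output: 160 channels

    x = torch.randn(1, 3, 640, 480)   # H = 640, W = 480 as in the embodiment
    for blk in (block1, block2, block3, block4, block5):
        x = blk(x)
        print(x.shape)
    # The depth branch is symmetrical; its single-channel input would need the first
    # convolution adapted (or the depth map replicated to three channels), an
    # implementation choice not fixed by the text.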
Each feature re-extraction convolution block consists of four re-extraction modules connected in sequence, and each re-extraction module comprises a convolution layer, a normalization layer and an activation layer which are connected in sequence; the activation mode of all activation layers is ReLU6; the stride of all convolution layers is 1; the number of convolution kernels of the convolution layer in each re-extraction module is the same as the normalization parameter of its normalization layer. The numbers of convolution kernels of the convolution layers in the first, third and fourth re-extraction modules are the same, and the number of convolution kernels of the convolution layer in the second re-extraction module is half of that in the first re-extraction module; the convolution kernel sizes of the convolution layers in the first and third re-extraction modules are both 3×3, and the convolution kernel sizes of the convolution layers in the second and fourth re-extraction modules are both 1×1; the dilation factor of the convolution layer in the first re-extraction module is 1, and the dilation factor of the convolution layer in the third re-extraction module is 2; the zero padding parameter of the convolution layer in the first re-extraction module is 1, the zero padding parameter of the convolution layer in the third re-extraction module is 2, and the zero padding parameters of the convolution layers in the second and fourth re-extraction modules are both 0.
The number of convolution kernels of the convolution layer of the first re-extraction module is 24 in the first feature re-extraction convolution block, 32 in the second feature re-extraction convolution block, 64 in the third feature re-extraction convolution block, 96 in the fourth feature re-extraction convolution block, and 160 in the fifth feature re-extraction convolution block;
the grouping parameters of the four sequentially arranged convolution layers are 24, 12, 24 and 24 respectively in the first feature re-extraction convolution block, 32, 16, 16 and 32 respectively in the second feature re-extraction convolution block, 64, 32, 32 and 64 respectively in the third feature re-extraction convolution block, 64, 48, 48 and 96 respectively in the fourth feature re-extraction convolution block, and 96, 80, 80 and 160 respectively in the fifth feature re-extraction convolution block.
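A minimal sketch of one feature re-extraction convolution block is given below, assuming the grouping pattern (c, c/2, c/2, c) that the second to fourth blocks follow; the class and helper names are illustrative and not taken from the text.

    import torch
    import torch.nn as nn

    def conv_bn_relu6(c_in, c_out, k, dilation=1, padding=0, groups=1):
        """convolution -> normalization -> ReLU6, as in each re-extraction module"""
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, k, stride=1, dilation=dilation,
                      padding=padding, groups=groups, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU6(inplace=True),
        )

    class FeatureReExtractionBlock(nn.Module):
        """Hedged sketch of a feature re-extraction convolution block for c input
        channels, following the 3x3 / 1x1 / dilated 3x3 / 1x1 layout described above."""
        def __init__(self, c):
            super().__init__()
            h = c // 2
            self.body = nn.Sequential(
                conv_bn_relu6(c, c, 3, dilation=1, padding=1, groups=c),   # 3x3, keeps c channels
                conv_bn_relu6(c, h, 1, groups=h),                          # 1x1, halves the channels
                conv_bn_relu6(h, c, 3, dilation=2, padding=2, groups=h),   # dilated 3x3, back to c channels
                conv_bn_relu6(c, c, 1, groups=c),                          # final 1x1
            )

        def forward(self, x):
            return self.body(x)

    # e.g. the 2nd feature re-extraction convolution block works on 32-channel features
    x = torch.randn(1, 32, 80, 60)
    print(FeatureReExtractionBlock(32)(x).shape)  # same spatial size, 32 channels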
Each block attention convolution block comprises a blocking layer, a channel attention layer and a spatial attention layer; the input of the block attention convolution block first passes through the blocking layer (which divides it into two channel groups, one for each attention layer) and is then fed to the channel attention layer and the spatial attention layer; the input of the block attention convolution block is multiplied by the output of the channel attention layer and by the output of the spatial attention layer respectively, and the two products are added to form the output of the block attention convolution block.
The blocking layer uses the split function provided by PyTorch, with the parameter equal to half of the number of channels of the input feature maps of the block attention convolution block;
the channel attention layer comprises an adaptive maximum pooling layer, a channel attention first convolution layer, a channel attention second convolution layer and a channel attention activation layer which are connected in sequence; the maximum pooling parameter of the adaptive maximum pooling layer is 1; the convolution kernel sizes of the channel attention first convolution layer and the channel attention second convolution layer are both 1×1, their strides are 1, their bias terms are False, their grouping parameters are the same, and the number of convolution kernels of the channel attention first convolution layer is half of that of the channel attention second convolution layer;
the number of convolution kernels of the channel attention first convolution layer of the channel attention layer in the first block attention convolution block is 12, and the grouping parameter is 12; the number of convolution kernels of the channel attention first convolution layer of the channel attention layer in the second block attention convolution block is 16, and the grouping parameter is 16; the number of convolution kernels of the first convolution layer of the channel attention layer in the third block attention convolution block is 32, and the grouping parameter is 32; the number of convolution kernels of the first convolution layer of the channel attention layer in the fourth block attention convolution block is 48, and the grouping parameter is 48; the number of convolution kernels of the first convolution layer of the channel attention layer in the fifth block attention convolution block is 80, and the grouping parameter is 80;
the spatial attention layer comprises a per-channel maximization layer, a spatial attention convolution layer and a spatial attention activation layer which are connected in sequence; the per-channel maximization layer uses the max function provided by PyTorch; the convolution kernel size of the spatial attention convolution layer is 3×3, the number of convolution kernels is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 1, and the bias term is False;
the activation mode adopted by both the channel attention layer and the spatial attention layer is the Sigmoid function.
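The following sketch shows how such a block attention convolution block could be written in PyTorch for c input channels (c = 24 for the first block); the class name is illustrative, while the layer parameters follow the description above.

    import torch
    import torch.nn as nn

    class BlockAttentionConvBlock(nn.Module):
        """Hedged sketch of a block attention convolution block for c input channels."""
        def __init__(self, c):
            super().__init__()
            h = c // 2  # split parameter: half of the input channel count
            self.channel_att = nn.Sequential(
                nn.AdaptiveMaxPool2d(1),
                nn.Conv2d(h, h, 1, stride=1, groups=h, bias=False),  # channel attention first convolution
                nn.Conv2d(h, c, 1, stride=1, groups=h, bias=False),  # channel attention second convolution
                nn.Sigmoid(),
            )
            self.spatial_conv = nn.Conv2d(1, 1, 3, stride=1, dilation=2,
                                          padding=2, groups=1, bias=False)

        def forward(self, x):
            h = x.size(1) // 2
            x1, x2 = torch.split(x, h, dim=1)   # blocking layer: two channel groups
            ca = self.channel_att(x1)           # c x 1 x 1 channel attention weights
            sa = torch.sigmoid(self.spatial_conv(torch.max(x2, dim=1, keepdim=True)[0]))  # 1 x H x W map
            return x * ca + x * sa              # reweight the input and add the two products

    rt = BlockAttentionConvBlock(24)(torch.randn(1, 24, 160, 120))
    print(rt.shape)  # torch.Size([1, 24, 160, 120])

Because the channel attention branch outputs c per-channel weights and the spatial attention branch outputs a single-channel map, both can be broadcast-multiplied with the full c-channel input before the two products are added.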
The upsampling layers use the UpsamplingBilinear2d function provided by PyTorch; the function parameter (scale factor) of the first upsampling layer is 2, and the function parameters of the second, third, fourth and fifth upsampling layers are 4; the convolution kernel size of the convolution layer in the output layer is 1×1, the number of convolution kernels is 41, the stride is 1, and the bias term is False.
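A sketch of the output layer follows; the number of input channels depends on how many feature maps the twelfth fusion layer concatenates (24 + 32 + 64 + 96 + 160 = 376 under one reading of the decoder) and is therefore an assumption here.

    import torch
    import torch.nn as nn

    c_in = 376  # assumed channel count after the concatenation in the twelfth fusion layer
    output_layer = nn.Sequential(
        nn.Conv2d(c_in, 41, kernel_size=1, stride=1, bias=False),  # 41 semantic classes
        nn.UpsamplingBilinear2d(scale_factor=4),                   # fifth upsampling layer
    )
    print(output_layer(torch.randn(1, c_in, 160, 120)).shape)  # torch.Size([1, 41, 640, 480])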
The input end of the color image input layer receives an indoor scene color image, the input end of the depth image input layer receives an indoor scene depth image, and the output of the output layer is 41 semantic segmentation predicted images corresponding to the indoor scene image input by the input layer.
The invention has the beneficial effects that:
1) The method starts from effectively extracting image channel and spatial semantic information and reducing information loss during gradient propagation as much as possible, and designs a module called the feature re-extraction module without significantly increasing the number of model parameters. The module has a roughly uniform columnar structure and comprises a 1×1 convolution and two 3×3 convolutional neural networks, one of which uses dilated convolution. To promote efficient propagation of the gradient, the two 3×3 convolutional neural networks are located at the two ends of the module and the 1×1 convolution is located inside the module.
2) Based on the operating principle of the human visual system, the method combines the attention mechanism of human vision with block (grouped) convolution to design a module called the block attention convolution block. The module contains channel-based and spatial-based attention mechanisms: using the block convolution principle, the input convolution feature map is first divided into two parts along the channel dimension; one part undergoes feature screening based on the channel attention mechanism and the other part undergoes feature screening based on the spatial attention mechanism; finally the two screened attention results are multiplied with the input convolution feature map, which effectively reduces redundant features.
3) Combining the two modules above and taking MobileNetV2 as the backbone, the method designs the network model MSCNet. Experiments show that the model has fewer parameters and higher speed, requiring only about 43 seconds per training epoch, while maintaining high accuracy; it is a lightweight network suitable for mobile terminals.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2a is the 1 st original color image of an indoor scene of the same scene;
FIG. 2b is the 1 st original indoor scene depth image of the same scene;
FIG. 2c is a predicted semantic segmentation image obtained by predicting the original indoor scene image shown in FIGS. 2a and 2b by the method of the present invention;
FIG. 3a is the 2 nd original color image of the indoor scene of the same scene;
FIG. 3b is the 2 nd original indoor scene depth image of the same scene;
FIG. 3c is a predicted semantic segmentation image obtained by predicting the original indoor scene image shown in FIGS. 3a and 3b by the method of the present invention;
FIG. 4a is the 3 rd original color image of the indoor scene of the same scene;
FIG. 4b is the 3 rd original indoor scene depth image of the same scene;
FIG. 4c is a predicted semantic segmentation image obtained by predicting the original indoor scene image shown in FIGS. 4a and 4b by the method of the present invention;
FIG. 5a is the 4th original indoor scene color image of the same scene;
FIG. 5b is the 4 th original indoor scene depth image of the same scene;
fig. 5c is a predicted semantic segmentation image obtained by predicting the original indoor scene image shown in fig. 5a and 5b by using the method of the present invention.
FIG. 6 is a structural diagram of a block attention convolution block of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the embodiments.
The invention provides an indoor scene semantic segmentation method based on an improved fully convolutional neural network, the overall implementation block diagram of which is shown in Fig. 1; it comprises two processes, namely a training stage and a testing stage.
the specific steps of the training phase process are as follows:
step 1_ 1: respectively selecting Q pairs of original indoor scene RGB color images and Depth map images and real semantic segmentation images corresponding to each pair of original indoor scene images, forming a training set, and recording the Q-th pair of original indoor scene images in the training set as { RGB (red green blue) imagesq(i,j),Depthq(i, j) }, set of training sets { RGBq(i,j),Depthq(i, j) } and the corresponding real semantic segmentation image are recorded as
Figure BDA0002860067140000081
Then, the real semantic segmentation images corresponding to each pair of original indoor scene images in the training set are processed into 41 independent thermal coding images by adopting the existing independent thermal coding technology (one-hot), and the 41 independent thermal coding images are obtained
Figure BDA0002860067140000082
The processed set of 41 one-hot coded images is denoted as
Figure BDA0002860067140000083
The indoor scene image comprises an RGB color image and a Depth map, Q is a positive integer, Q is more than or equal to 200, if Q is 794, Q is a positive integer, Q is more than or equal to 1 and less than or equal to Q, i is more than or equal to 1 and less than or equal to W, j is more than or equal to 1 and less than or equal to H, and W represents { RGB (red, green and blue) }q(i,j),Depthq(i, j) } width of the color map RGB and the depth map Dept, H denotes { RGBq(i,j),Depthq(i j) } the height of the color map RGB and Depth map Depth, e.g. taking W480, H640, RGBq(i,j),Depthq(i, j) respectively represent { RGB }q(i,j),DepthqThe pixel value of the pixel point with the coordinate position (i, j) in (i, j),
Figure BDA0002860067140000084
to represent
Figure BDA0002860067140000085
The middle coordinate position is the pixel value of the pixel point of (i, j); in this case, 1448 images in the training set of the indoor scene image database NYUv2 are directly selected as the original indoor scene image, and the purpose is to further selectTraining is facilitated by reducing the size of each image to 480 a width and 640 a height. In addition, in order to effectively relieve the problem of model overfitting, three data enhancement methods of random clipping, random horizontal turning and random scaling are adopted to expand data in a training set.
Step 1_2: construct a convolutional neural network classification training model: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises the 1st to 5th neural network blocks, the 1st to 5th feature re-extraction convolution blocks, the 1st to 5th block attention convolution blocks, the 1st to 12th fusion layers, and the 1st to 4th upsampling layers.
For the input layer, the invention has two inputs, namely a color image RGB input layer and a Depth image Depth input layer. The color image RGB input end of the input layer receives the R channel component, the G channel component and the B channel component of the original RGB input image, and the Depth image Depth input end of the input layer receives the single channel component of the original Depth input image; the color image RGB output end of the input layer outputs the R, G and B channel components of the original input image to the hidden layer, and the Depth image Depth output end of the input layer outputs the single channel component of the original input image to the hidden layer. The input end of the input layer is required to receive an original input image with a width of 480 and a height of 640. Furthermore, the convolutional neural network structure whose input is the color map RGB is symmetrical to the convolutional neural network structure whose input is the Depth map Depth.
For the five neural network blocks, the MobileNetV2 network structure is adopted: the 1st neural network block adopts layers 1-4 of MobileNetV2 (repetition numbers n = 1 and n = 2, four layers in total), the 2nd neural network block adopts layers 5-7 of MobileNetV2 (repetition number n = 3, three layers in total), the 3rd neural network block adopts layers 8-11 of MobileNetV2 (repetition number n = 4, four layers in total), the 4th neural network block adopts layers 12-14 of MobileNetV2 (repetition number n = 3, three layers in total), and the 5th neural network block adopts layers 15-17 of MobileNetV2 (repetition number n = 3, three layers in total).
The color image RGB input end of the 1st neural network block receives the R, G and B channel components of the original input image output by the input layer, and its Depth map Depth input end receives the single channel component of the original input image output by the input layer; the color image RGB output end of the 1st neural network block outputs 24 feature maps, whose set is denoted R1, and its Depth map Depth output end outputs 24 feature maps, whose set is denoted D1; each feature map in R1 and D1 has a width of W/4 and a height of H/4.
The color image RGB input end of the 2nd neural network block receives all the feature maps in RF1, and its color image RGB output end outputs 32 feature maps, whose set is denoted R2; the Depth map Depth input end of the 2nd neural network block receives all the feature maps in DF1, and its Depth map Depth output end outputs 32 feature maps, whose set is denoted D2; each feature map in R2 and D2 has a width of W/8 and a height of H/8.
The color image RGB input end of the 3rd neural network block receives all the feature maps in RF2, and its color image RGB output end outputs 64 feature maps, whose set is denoted R3; the Depth map Depth input end of the 3rd neural network block receives all the feature maps in DF2, and its Depth map Depth output end outputs 64 feature maps, whose set is denoted D3; each feature map in R3 and D3 has a width of W/16 and a height of H/16.
The color image RGB input end of the 4th neural network block receives all the feature maps in RF3, and its color image RGB output end outputs 96 feature maps, whose set is denoted R4; the Depth map Depth input end of the 4th neural network block receives all the feature maps in DF3, and its Depth map Depth output end outputs 96 feature maps, whose set is denoted D4; each feature map in R4 and D4 has a width of W/16 and a height of H/16.
The color image RGB input end of the 5th neural network block receives all the feature maps in RF4, and its color image RGB output end outputs 160 feature maps, whose set is denoted R5; the Depth map Depth input end of the 5th neural network block receives all the feature maps in DF4, and its Depth map Depth output end outputs 160 feature maps, whose set is denoted D5; each feature map in R5 and D5 has a width of W/32 and a height of H/32.
The 1st feature re-extraction convolution block is composed of a fifty-first convolution layer, a fifty-first normalization layer, a fifty-first activation layer, a fifty-second convolution layer, a fifty-second normalization layer, a fifty-second activation layer, a fifty-third convolution layer, a fifty-third normalization layer, a fifty-third activation layer, a fifty-fourth convolution layer, a fifty-fourth normalization layer and a fifty-fourth activation layer which are arranged in sequence. The input end of the 1st feature re-extraction convolution block receives all the feature maps in R1 and D1; its color map RGB output end outputs 24 feature maps, whose set is denoted RS1, and its Depth map Depth output end outputs 24 feature maps, whose set is denoted DS1. The convolution kernel size of the fifty-first convolution layer is 3×3, the number of convolution kernels is 24, the stride is 1, the dilation factor is 1, the zero padding parameter is 1, the grouping parameter is 24, and the fifty-first layer normalization parameter is 24; the convolution kernel size of the fifty-second convolution layer is 1×1, the number of convolution kernels is 12, the stride is 1, the zero padding parameter is 0, the grouping parameter is 12, and the fifty-second layer normalization parameter is 12; the convolution kernel size of the fifty-third convolution layer is 3×3, the number of convolution kernels is 24, the stride is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 24, and the fifty-third layer normalization parameter is 24; the convolution kernel size of the fifty-fourth convolution layer is 1×1, the number of convolution kernels is 24, the stride is 1, the zero padding parameter is 0, the grouping parameter is 24, and the fifty-fourth layer normalization parameter is 24. The activation mode of all activation layers is 'ReLU6', and each feature map in RS1 and DS1 has a width of W/4 and a height of H/4.
For the 1st fusion layer, its color image RGB input end receives all the feature maps in R1 and all the feature maps in RS1; the 1st fusion layer fuses R1 and RS1 in the existing add (element-by-element addition) mode to obtain a set RA1, and its color image RGB output end outputs RA1. The Depth map Depth input end of the 1st fusion layer receives all the feature maps in D1 and all the feature maps in DS1; the 1st fusion layer fuses D1 and DS1 in the existing add (element-by-element addition) mode to obtain a set DA1, and its Depth map Depth output end outputs DA1. The total number of feature maps contained in each of RA1 and DA1 is 24, and each feature map in RA1 and DA1 has a width of W/4 and a height of H/4.
For the 1st block attention convolution block, whose structure is shown in Fig. 6, it is composed of a first blocking layer, a first channel attention layer and a first spatial attention layer which are arranged in sequence. The color map RGB input end of the 1st block attention convolution block receives all the feature maps in RA1, and its Depth map Depth input end receives all the feature maps in DA1; all the feature maps in RA1 and all the feature maps in DA1 are respectively input into the first block attention convolution block, where each input is first divided into two parts along the channel dimension by the first blocking layer, one part serving as the input of the channel attention layer and the other part as the input of the spatial attention layer; the results of the two attention branches are then multiplied with the original input feature maps respectively and the two products are added. The color map output end of the 1st block attention convolution block outputs 24 feature maps, whose set is denoted RT1, and its Depth map Depth output end outputs 24 feature maps, whose set is denoted DT1. The first blocking layer uses the split function provided by PyTorch, with the parameter equal to half of the number of channels of the original input feature maps; the first channel attention layer comprises a first adaptive maximum pooling layer, a first channel attention first convolution layer, a first channel attention second convolution layer and an activation layer, where the first adaptive maximum pooling parameter is 1, the convolution kernel size of the first channel attention first convolution layer is 1×1, the number of convolution kernels is 12, the stride is 1, the grouping parameter is 12, and the bias term is False; the convolution kernel size of the first channel attention second convolution layer is 1×1, the number of convolution kernels is 24, the stride is 1, the grouping parameter is 12, and the bias term is False; the activation mode of the first channel attention activation layer is 'Sigmoid'. The first spatial attention layer comprises a first per-channel maximization layer, a first spatial attention convolution layer and an activation function, where the first per-channel maximization layer uses the max function provided by PyTorch; the convolution kernel size of the first spatial attention convolution layer is 3×3, the number of convolution kernels is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 1, and the bias term is False; the activation mode of the first spatial attention activation layer is 'Sigmoid'. Each feature map in RT1 and DT1 has a width of W/4 and a height of H/4.
For the 2nd fusion layer, its color image RGB input end receives all the feature maps in RA1 and all the feature maps in DT1; the 2nd fusion layer fuses RA1 and DT1 in the existing add (element-by-element addition) mode to obtain a set RF1, and its color image RGB output end outputs RF1. The Depth map Depth input end of the 2nd fusion layer receives all the feature maps in RT1 and all the feature maps in DA1; the 2nd fusion layer fuses RT1 and DA1 in the existing add (element-by-element addition) mode to obtain a set DF1, and its Depth map Depth output end outputs DF1. The total number of feature maps contained in each of RF1 and DF1 is 24, and each feature map in RF1 and DF1 has a width of W/4 and a height of H/4.
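The data flow of this first stage (and, analogously, of the later stages) can be summarized by the following sketch; the function and argument names are illustrative, while the tensor names mirror the sets R1/D1, RA1/DA1, RT1/DT1 and RF1/DF1 above.

    import torch
    import torch.nn as nn

    def cross_modal_stage(R, D, reextract_rgb, reextract_depth, attention):
        """One encoder stage of the symmetric two-branch network (illustrative sketch)."""
        RA = R + reextract_rgb(R)    # neural network block output + feature re-extraction, element-wise add
        DA = D + reextract_depth(D)
        RT = attention(RA)           # the shared block attention convolution block refines each modality
        DT = attention(DA)
        RF = RA + DT                 # cross-modal fusion: each branch adds the other branch's attention output
        DF = DA + RT
        return RF, DF                # fed to the next neural network blocks and to the decoder

    # minimal usage with identity placeholders standing in for the sub-blocks
    R1, D1 = torch.randn(1, 24, 160, 120), torch.randn(1, 24, 160, 120)
    RF1, DF1 = cross_modal_stage(R1, D1, nn.Identity(), nn.Identity(), nn.Identity())
    print(RF1.shape, DF1.shape)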
The 2nd feature re-extraction convolution block is composed of a fifty-fifth convolution layer, a fifty-fifth normalization layer, a fifty-fifth activation layer, a fifty-sixth convolution layer, a fifty-sixth normalization layer, a fifty-sixth activation layer, a fifty-seventh convolution layer, a fifty-seventh normalization layer, a fifty-seventh activation layer, a fifty-eighth convolution layer, a fifty-eighth normalization layer and a fifty-eighth activation layer which are arranged in sequence. The color image RGB input end of the 2nd feature re-extraction convolution block receives all the feature maps in R2, and its color image RGB output end outputs 32 feature maps, whose set is denoted RS2; the Depth map Depth input end of the 2nd feature re-extraction convolution block receives all the feature maps in D2, and its Depth map Depth output end outputs 32 feature maps, whose set is denoted DS2. The convolution kernel size of the fifty-fifth convolution layer is 3×3, the number of convolution kernels is 32, the stride is 1, the dilation factor is 1, the zero padding parameter is 1, the grouping parameter is 32, and the fifty-fifth layer normalization parameter is 32; the convolution kernel size of the fifty-sixth convolution layer is 1×1, the number of convolution kernels is 16, the stride is 1, the zero padding parameter is 0, the grouping parameter is 16, and the fifty-sixth layer normalization parameter is 16; the convolution kernel size of the fifty-seventh convolution layer is 3×3, the number of convolution kernels is 32, the stride is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 16, and the fifty-seventh layer normalization parameter is 32; the convolution kernel size of the fifty-eighth convolution layer is 1×1, the number of convolution kernels is 32, the stride is 1, the zero padding parameter is 0, the grouping parameter is 32, and the fifty-eighth layer normalization parameter is 32. The activation mode of all activation layers is 'ReLU6', and each feature map in RS2 and DS2 has a width of W/8 and a height of H/8.
For the 3rd fusion layer, its color image RGB input end receives all the feature maps in R2 and all the feature maps in RS2; the 3rd fusion layer fuses R2 and RS2 in the existing add (element-by-element addition) mode to obtain a set RA2, and its color image RGB output end outputs RA2. The Depth map Depth input end of the 3rd fusion layer receives all the feature maps in D2 and all the feature maps in DS2; the 3rd fusion layer fuses D2 and DS2 in the existing add (element-by-element addition) mode to obtain a set DA2, and its Depth map Depth output end outputs DA2. The total number of feature maps contained in each of RA2 and DA2 is 32, and each feature map in RA2 and DA2 has a width of W/8 and a height of H/8.
The 2nd block attention convolution block is composed of a second blocking layer, a second channel attention layer and a second spatial attention layer which are arranged in sequence. The color image RGB input end of the 2nd block attention convolution block receives all the feature maps in RA2; all the feature maps in RA2 are input into the second block attention convolution block, where the input feature maps are first divided into two parts along the channel dimension by the second blocking layer, one part serving as the input of the channel attention layer and the other part as the input of the spatial attention layer; the results of the two attention branches are then multiplied with the original input feature maps respectively and the two products are added; the color image RGB output end of the 2nd block attention convolution block outputs 32 feature maps, whose set is denoted RT2. The Depth map Depth input end of the 2nd block attention convolution block receives all the feature maps in DA2, which are processed in the same way, and its Depth map Depth output end outputs 32 feature maps, whose set is denoted DT2. The second blocking layer uses the split function provided by PyTorch, with the parameter equal to half of the number of channels of the original input feature maps; the second channel attention layer comprises a second adaptive maximum pooling layer, a second channel attention first convolution layer, a second channel attention second convolution layer and an activation layer, where the second adaptive maximum pooling parameter is 1, the convolution kernel size of the second channel attention first convolution layer is 1×1, the number of convolution kernels is 16, the stride is 1, the grouping parameter is 16, and the bias term is False; the convolution kernel size of the second channel attention second convolution layer is 1×1, the number of convolution kernels is 32, the stride is 1, the grouping parameter is 16, and the bias term is False; the activation mode of the second channel attention activation layer is 'Sigmoid'. The second spatial attention layer comprises a second per-channel maximization layer, a second spatial attention convolution layer and an activation function, where the second per-channel maximization layer uses the max function provided by PyTorch; the convolution kernel size of the second spatial attention convolution layer is 3×3, the number of convolution kernels is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 1, and the bias term is False; the activation mode of the second spatial attention activation layer is 'Sigmoid'. Each feature map in RT2 and DT2 has a width of W/8 and a height of H/8.
For the 4th fusion layer, its color image RGB input end receives all the feature maps in RA2 and DT2; the 4th fusion layer fuses RA2 and DT2 in the existing add (element-by-element addition) mode to obtain a set RF2, and its color image RGB output end outputs RF2. The Depth map Depth input end of the 4th fusion layer receives all the feature maps in RT2 and all the feature maps in DA2; the 4th fusion layer fuses RT2 and DA2 in the existing add (element-by-element addition) mode to obtain a set DF2, and its Depth map Depth output end outputs DF2. The total number of feature maps contained in each of RF2 and DF2 is 32, and each feature map in RF2 and DF2 has a width of W/8 and a height of H/8.
The 3rd feature re-extraction convolution block is composed of a fifty-ninth convolution layer, a fifty-ninth normalization layer, a fifty-ninth activation layer, a sixtieth convolution layer, a sixtieth normalization layer, a sixtieth activation layer, a sixty-first convolution layer, a sixty-first normalization layer, a sixty-first activation layer, a sixty-second convolution layer, a sixty-second normalization layer and a sixty-second activation layer which are arranged in sequence. The color image RGB input end of the 3rd feature re-extraction convolution block receives all the feature maps in R3, and its color image RGB output end outputs 64 feature maps, whose set is denoted RS3; the Depth map Depth input end of the 3rd feature re-extraction convolution block receives all the feature maps in D3, and its Depth map Depth output end outputs 64 feature maps, whose set is denoted DS3. The convolution kernel size of the fifty-ninth convolution layer is 3×3, the number of convolution kernels is 64, the stride is 1, the dilation factor is 1, the zero padding parameter is 1, the grouping parameter is 64, and the fifty-ninth layer normalization parameter is 64; the convolution kernel size of the sixtieth convolution layer is 1×1, the number of convolution kernels is 32, the stride is 1, the zero padding parameter is 0, the grouping parameter is 32, and the sixtieth layer normalization parameter is 32; the convolution kernel size of the sixty-first convolution layer is 3×3, the number of convolution kernels is 64, the stride is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 32, and the sixty-first layer normalization parameter is 64; the convolution kernel size of the sixty-second convolution layer is 1×1, the number of convolution kernels is 64, the stride is 1, the zero padding parameter is 0, the grouping parameter is 64, and the sixty-second layer normalization parameter is 64. The activation mode of all activation layers is 'ReLU6', and each feature map in RS3 and DS3 has a width of W/16 and a height of H/16.
For the 5th fusion layer, its color image RGB input end receives all the feature maps in R3 and all the feature maps in RS3; the 5th fusion layer fuses R3 and RS3 in the existing add (element-by-element addition) mode to obtain a set RA3, and its color image RGB output end outputs RA3. The Depth map Depth input end of the 5th fusion layer receives all the feature maps in D3 and all the feature maps in DS3; the 5th fusion layer fuses D3 and DS3 in the existing add (element-by-element addition) mode to obtain a set DA3, and its Depth map Depth output end outputs DA3. The total number of feature maps contained in each of RA3 and DA3 is 64, and each feature map in RA3 and DA3 has a width of W/16 and a height of H/16.
For the 3rd block attention convolution block, it consists of a third block layer, a third channel attention layer and a third spatial dimension attention layer which are sequentially arranged; the color image RGB input end of the 3rd block attention convolution block receives all the feature maps in RA3, all the feature maps in RA3 are input into the third block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the third block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the color image RGB output end of the 3rd block attention convolution block outputs 64 feature maps, and the set formed by the 64 feature maps is denoted as RT3; the Depth map Depth input end of the 3rd block attention convolution block receives all the feature maps in DA3, all the feature maps in DA3 are input into the third block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the third block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the Depth map Depth output end of the 3rd block attention convolution block outputs 64 feature maps, and the set formed by the 64 feature maps is denoted as DT3; the third block layer adopts the split function provided by PyTorch, with the parameter set to half the number of channels of the original input feature maps; the third channel attention layer comprises a third adaptive maximum pooling layer, a third channel attention first convolution layer, a third channel attention second convolution layer and an activation layer, wherein the third adaptive maximum pooling parameter is 1, the convolution kernel size of the third channel attention first convolution layer is 1x1, the number of convolution kernels is 32, the stride is 1, the grouping (groups) parameter is 32, and the bias term (bias) is False; the convolution kernel size of the third channel attention second convolution layer is 1x1, the number of convolution kernels is 64, the stride is 1, the grouping (groups) parameter is 32, and the bias term (bias) is False; the activation mode of the third channel attention activation layer is "Sigmoid"; the third spatial dimension attention layer comprises a third per-channel maximization layer, a third spatial dimension attention convolution layer and an activation function, wherein the third per-channel maximization layer adopts the max function provided by PyTorch; the convolution kernel size of the third spatial dimension attention convolution layer is 3x3, the number of convolution kernels is 1, the dilation factor (dilation) is 2, the zero padding (padding) parameter is 2, the grouping (groups) parameter is 1, the bias term (bias) is False, and the activation mode of the third spatial dimension attention activation layer is "Sigmoid". Each feature map in RT3 and DT3 has the same width and height as the feature maps in RA3.
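As a minimal illustrative sketch (not part of the original description), the block attention convolution block described above can be written in PyTorch roughly as follows; the class name BlockAttention and the argument in_channels are assumed names, and the parameter values correspond to the 3rd block (64 input channels):

import torch
import torch.nn as nn

class BlockAttention(nn.Module):
    # Sketch of the block attention convolution block (illustrative names).
    def __init__(self, in_channels=64):          # 64 channels for the 3rd block
        super().__init__()
        half = in_channels // 2
        # channel attention: adaptive max pooling -> two grouped 1x1 convolutions -> Sigmoid
        self.channel_att = nn.Sequential(
            nn.AdaptiveMaxPool2d(1),
            nn.Conv2d(half, half, kernel_size=1, stride=1, groups=half, bias=False),
            nn.Conv2d(half, in_channels, kernel_size=1, stride=1, groups=half, bias=False),
            nn.Sigmoid(),
        )
        # spatial dimension attention: per-channel max -> dilated 3x3 convolution -> Sigmoid
        self.spatial_conv = nn.Conv2d(1, 1, kernel_size=3, stride=1,
                                      dilation=2, padding=2, groups=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        half = x.size(1) // 2
        # block layer: split the input feature maps into two parts along the channel dimension
        x_c, x_s = torch.split(x, half, dim=1)
        ca = self.channel_att(x_c)                    # B x C x 1 x 1 channel weights
        sa, _ = torch.max(x_s, dim=1, keepdim=True)   # per-channel maximization, B x 1 x H x W
        sa = self.sigmoid(self.spatial_conv(sa))
        # multiply both attention results with the original input and add the products
        return x * ca + x * sa

A forward pass such as BlockAttention(64)(torch.randn(1, 64, 30, 40)) returns a tensor of the same shape, which is consistent with the statement that RT3 and DT3 have the same size as RA3 and DA3.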
For the sixth fusion layer, the color image RGB input end of the 6th fusion layer receives all the feature maps in RA3 and all the feature maps in DT3, the 6th fusion layer fuses RA3 and DT3 in the existing add (element-by-element addition) mode to obtain a set RF3, and the color image RGB output end of the 6th fusion layer outputs RF3; the Depth map Depth input end of the 6th fusion layer receives all the feature maps in RT3 and all the feature maps in DA3, the 6th fusion layer fuses RT3 and DA3 in the existing add (element-by-element addition) mode to obtain a set DF3, and the Depth map Depth output end of the 6th fusion layer outputs DF3; wherein RF3 and DF3 each contain 64 feature maps, and each feature map in RF3 and DF3 has the same width and height as the feature maps in RA3.
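The data flow of this stage (feature re-extraction, element-by-element addition, block attention, and cross-modal addition) can be summarized by the following sketch; the function and module names are illustrative placeholders for the layers defined above, not identifiers from the original description:

# Sketch of one stage of the two-stream fusion pattern, using stage 3 as an example.
# R3 and D3 are the RGB and Depth backbone outputs (tensors of equal shape).
def fuse_stage(R3, D3, reextract_rgb, reextract_depth, attention_rgb, attention_depth):
    RS3 = reextract_rgb(R3)      # 3rd feature re-extraction convolution block, RGB stream
    DS3 = reextract_depth(D3)    # 3rd feature re-extraction convolution block, Depth stream
    RA3 = R3 + RS3               # 5th fusion layer (element-by-element addition)
    DA3 = D3 + DS3
    RT3 = attention_rgb(RA3)     # 3rd block attention convolution block, RGB side
    DT3 = attention_depth(DA3)   # 3rd block attention convolution block, Depth side
    RF3 = RA3 + DT3              # 6th fusion layer: cross-modal addition
    DF3 = RT3 + DA3
    return RF3, DF3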
For the 4th feature re-extraction convolution block, it consists of a sixty-third convolution layer, a sixty-third normalization layer, a sixty-third activation layer, a sixty-fourth convolution layer, a sixty-fourth normalization layer, a sixty-fourth activation layer, a sixty-fifth convolution layer, a sixty-fifth normalization layer, a sixty-fifth activation layer, a sixty-sixth convolution layer, a sixty-sixth normalization layer and a sixty-sixth activation layer which are sequentially arranged; the color image RGB input end of the 4th feature re-extraction convolution block receives all the feature maps in R4, the output end of the 4th feature re-extraction convolution block outputs 96 feature maps, and the set formed by the 96 feature maps is denoted as RS4; the Depth map Depth input end of the 4th feature re-extraction convolution block receives all the feature maps in D4, the output end of the 4th feature re-extraction convolution block outputs 96 feature maps, and the set formed by the 96 feature maps is denoted as DS4; wherein the convolution kernel size of the sixty-third convolution layer is 3x3, the number of convolution kernels is 96, the stride is 1, the dilation factor (dilation) is 1, the zero padding (padding) parameter is 1, the grouping (groups) parameter is 64, and the sixty-third layer normalization parameter is 96; the convolution kernel size of the sixty-fourth convolution layer is 1x1, the number of convolution kernels is 48, the stride is 1, the zero padding (padding) parameter is 0, the grouping (groups) parameter is 48, and the sixty-fourth layer normalization parameter is 48; the convolution kernel size of the sixty-fifth convolution layer is 3x3, the number of convolution kernels is 96, the stride is 1, the dilation factor (dilation) is 2, the zero padding (padding) parameter is 2, the grouping (groups) parameter is 48, and the sixty-fifth layer normalization parameter is 96; the convolution kernel size of the sixty-sixth convolution layer is 1x1, the number of convolution kernels is 96, the stride is 1, the zero padding (padding) parameter is 0, the grouping (groups) parameter is 96, and the sixty-sixth layer normalization parameter is 96. The activation mode of all the activation layers is "ReLU6", and each feature map in RS4 and DS4 has the same width and height as the feature maps in R4.
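A hedged PyTorch sketch of this feature re-extraction convolution block is given below; the helper name conv_bn_relu6 and the variable reextract4 are illustrative. Note that the description above lists a grouping parameter of 64 for the sixty-third convolution layer, which would not divide the 96 channels evenly in PyTorch, so this sketch assumes a grouping of 96 for that layer:

import torch.nn as nn

def conv_bn_relu6(in_ch, out_ch, k, dilation, padding, groups):
    # one re-extraction module: convolution -> batch normalization -> ReLU6 activation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=1,
                  dilation=dilation, padding=padding, groups=groups),
        nn.BatchNorm2d(out_ch),
        nn.ReLU6(inplace=True),
    )

# Sketch of the 4th feature re-extraction convolution block (96-channel input from R4/D4).
reextract4 = nn.Sequential(
    conv_bn_relu6(96, 96, k=3, dilation=1, padding=1, groups=96),  # 63rd conv layer (groups assumed 96)
    conv_bn_relu6(96, 48, k=1, dilation=1, padding=0, groups=48),  # 64th conv layer
    conv_bn_relu6(48, 96, k=3, dilation=2, padding=2, groups=48),  # 65th conv layer
    conv_bn_relu6(96, 96, k=1, dilation=1, padding=0, groups=96),  # 66th conv layer
)

Because every convolution uses stride 1 with a padding matched to its kernel size and dilation, the spatial size of the input is preserved, which is consistent with RS4 and DS4 having the same width and height as R4 and D4.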
For the seventh fusion layer, the color image RGB input end of the 7th fusion layer receives all the feature maps in R4 and all the feature maps in RS4, the 7th fusion layer fuses R4 and RS4 in the existing add (element-by-element addition) mode to obtain a set RA4, and the color image RGB output end of the 7th fusion layer outputs RA4; the Depth map Depth input end of the 7th fusion layer receives all the feature maps in D4 and all the feature maps in DS4, the 7th fusion layer fuses D4 and DS4 in the existing add (element-by-element addition) mode to obtain a set DA4, and the Depth map Depth output end of the 7th fusion layer outputs DA4; wherein RA4 and DA4 each contain 96 feature maps, and each feature map in RA4 and DA4 has the same width and height as the feature maps in R4.
For the 4th block attention convolution block, it consists of a fourth block layer, a fourth channel attention layer and a fourth spatial dimension attention layer which are sequentially arranged; the color image RGB input end of the 4th block attention convolution block receives all the feature maps in RA4, all the feature maps in RA4 are input into the fourth block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the fourth block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the color image RGB output end of the 4th block attention convolution block outputs 96 feature maps, and the set formed by the 96 feature maps is denoted as RT4; the Depth map Depth input end of the 4th block attention convolution block receives all the feature maps in DA4, all the feature maps in DA4 are input into the fourth block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the fourth block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the Depth map Depth output end of the 4th block attention convolution block outputs 96 feature maps, and the set formed by the 96 feature maps is denoted as DT4; the fourth block layer adopts the split function provided by PyTorch, with the parameter set to half the number of channels of the original input feature maps; the fourth channel attention layer comprises a fourth adaptive maximum pooling layer, a fourth channel attention first convolution layer, a fourth channel attention second convolution layer and an activation layer, wherein the fourth adaptive maximum pooling parameter is 1, the convolution kernel size of the fourth channel attention first convolution layer is 1x1, the number of convolution kernels is 48, the stride is 1, the grouping (groups) parameter is 48, and the bias term (bias) is False; the convolution kernel size of the fourth channel attention second convolution layer is 1x1, the number of convolution kernels is 96, the stride is 1, the grouping (groups) parameter is 48, and the bias term (bias) is False; the activation mode of the fourth channel attention activation layer is "Sigmoid"; the fourth spatial dimension attention layer comprises a fourth per-channel maximization layer, a fourth spatial dimension attention convolution layer and an activation function, wherein the fourth per-channel maximization layer adopts the max function provided by PyTorch; the convolution kernel size of the fourth spatial dimension attention convolution layer is 3x3, the number of convolution kernels is 1, the dilation factor (dilation) is 2, the zero padding (padding) parameter is 2, the grouping (groups) parameter is 1, the bias term (bias) is False, and the activation mode of the fourth spatial dimension attention activation layer is "Sigmoid". Each feature map in RT4 and DT4 has the same width and height as the feature maps in RA4.
For the eighth fusion layer, the color image RGB input end of the 8th fusion layer receives all the feature maps in RA4 and all the feature maps in DT4, the 8th fusion layer fuses RA4 and DT4 in the existing add (element-by-element addition) mode to obtain a set RF4, and the color image RGB output end of the 8th fusion layer outputs RF4; the Depth map Depth input end of the 8th fusion layer receives all the feature maps in RT4 and all the feature maps in DA4, the 8th fusion layer fuses RT4 and DA4 in the existing add (element-by-element addition) mode to obtain a set DF4, and the Depth map Depth output end of the 8th fusion layer outputs DF4; wherein RF4 and DF4 each contain 96 feature maps, and each feature map in RF4 and DF4 has the same width and height as the feature maps in RA4.
For the 5th feature re-extraction convolution block, it consists of a sixty-seventh convolution layer, a sixty-seventh normalization layer, a sixty-seventh activation layer, a sixty-eighth convolution layer, a sixty-eighth normalization layer, a sixty-eighth activation layer, a sixty-ninth convolution layer, a sixty-ninth normalization layer, a sixty-ninth activation layer, a seventieth convolution layer, a seventieth normalization layer and a seventieth activation layer which are sequentially arranged; the color image RGB input end of the 5th feature re-extraction convolution block receives all the feature maps in R5, the color image RGB output end of the 5th feature re-extraction convolution block outputs 160 feature maps, and the set formed by the 160 feature maps is denoted as RS5; the Depth map Depth input end of the 5th feature re-extraction convolution block receives all the feature maps in D5, the Depth map Depth output end of the 5th feature re-extraction convolution block outputs 160 feature maps, and the set formed by the 160 feature maps is denoted as DS5; wherein the convolution kernel size of the sixty-seventh convolution layer is 3x3, the number of convolution kernels is 160, the stride is 1, the dilation factor (dilation) is 1, the zero padding (padding) parameter is 1, the grouping (groups) parameter is 96, and the sixty-seventh layer normalization parameter is 160; the convolution kernel size of the sixty-eighth convolution layer is 1x1, the number of convolution kernels is 80, the stride is 1, the zero padding (padding) parameter is 0, the grouping (groups) parameter is 80, and the sixty-eighth layer normalization parameter is 80; the convolution kernel size of the sixty-ninth convolution layer is 3x3, the number of convolution kernels is 160, the stride is 1, the dilation factor (dilation) is 2, the zero padding (padding) parameter is 2, the grouping (groups) parameter is 80, and the sixty-ninth layer normalization parameter is 160; the convolution kernel size of the seventieth convolution layer is 1x1, the number of convolution kernels is 160, the stride is 1, the zero padding (padding) parameter is 0, the grouping (groups) parameter is 160, and the seventieth layer normalization parameter is 160. The activation mode of all the activation layers is "ReLU6", and each feature map in RS5 and DS5 has the same width and height as the feature maps in R5.
For the ninth fusion layer, the color image RGB input end of the 9th fusion layer receives all the feature maps in R5 and all the feature maps in RS5, the 9th fusion layer fuses R5 and RS5 in the existing add (element-by-element addition) mode to obtain a set RA5, and the color image RGB output end of the 9th fusion layer outputs RA5; the Depth map Depth input end of the 9th fusion layer receives all the feature maps in D5 and all the feature maps in DS5, the 9th fusion layer fuses D5 and DS5 in the existing add (element-by-element addition) mode to obtain a set DA5, and the Depth map Depth output end of the 9th fusion layer outputs DA5; wherein RA5 and DA5 each contain 160 feature maps, and each feature map in RA5 and DA5 has the same width and height as the feature maps in R5.
For the 5th block attention convolution block, it consists of a fifth block layer, a fifth channel attention layer and a fifth spatial dimension attention layer which are sequentially arranged; the color image RGB input end of the 5th block attention convolution block receives all the feature maps in RA5, all the feature maps in RA5 are input into the fifth block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the fifth block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the color image RGB output end of the 5th block attention convolution block outputs 160 feature maps, and the set formed by the 160 feature maps is denoted as RT5; the Depth map Depth input end of the 5th block attention convolution block receives all the feature maps in DA5, all the feature maps in DA5 are input into the fifth block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the fifth block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the Depth map Depth output end of the 5th block attention convolution block outputs 160 feature maps, and the set formed by the 160 feature maps is denoted as DT5; the fifth block layer adopts the split function provided by PyTorch, with the parameter set to half the number of channels of the original input feature maps; the fifth channel attention layer comprises a fifth adaptive maximum pooling layer, a fifth channel attention first convolution layer, a fifth channel attention second convolution layer and an activation layer, wherein the fifth adaptive maximum pooling parameter is 1, the convolution kernel size of the fifth channel attention first convolution layer is 1x1, the number of convolution kernels is 80, the stride is 1, the grouping (groups) parameter is 80, and the bias term (bias) is False; the convolution kernel size of the fifth channel attention second convolution layer is 1x1, the number of convolution kernels is 160, the stride is 1, the grouping (groups) parameter is 80, and the bias term (bias) is False; the activation mode of the fifth channel attention activation layer is "Sigmoid"; the fifth spatial dimension attention layer comprises a fifth per-channel maximization layer, a fifth spatial dimension attention convolution layer and an activation function, wherein the fifth per-channel maximization layer adopts the max function provided by PyTorch; the convolution kernel size of the fifth spatial dimension attention convolution layer is 3x3, the number of convolution kernels is 1, the dilation factor (dilation) is 2, the zero padding (padding) parameter is 2, the grouping (groups) parameter is 1, the bias term (bias) is False, and the activation mode of the fifth spatial dimension attention activation layer is "Sigmoid". Each feature map in RT5 and DT5 has the same width and height as the feature maps in RA5.
For the tenth fusion layer, the color image RGB input end of the 10th fusion layer receives all the feature maps in RA5 and all the feature maps in DT5, the 10th fusion layer fuses RA5 and DT5 in the existing add (element-by-element addition) mode to obtain a set RF5, and the color image RGB output end of the 10th fusion layer outputs RF5; the Depth map Depth input end of the 10th fusion layer receives all the feature maps in RT5 and all the feature maps in DA5, the 10th fusion layer fuses RT5 and DA5 in the existing add (element-by-element addition) mode to obtain a set DF5, and the Depth map Depth output end of the 10th fusion layer outputs DF5; wherein RF5 and DF5 each contain 160 feature maps, and each feature map in RF5 and DF5 has the same width and height as the feature maps in RA5.
For the first upsampling layer, the input end of the first upsampling layer receives all the feature maps in RF2 and all the feature maps in DF2, RF2 and DF2 are fused in the existing add (element-by-element addition) mode to obtain a set RD2, the first upsampling layer enlarges the feature maps in RD2 to twice their height and width through the UpsamplingBilinear2d function provided by PyTorch with the function parameter set to 2, the set formed by the enlarged feature maps is denoted as RD2x2, and the output end of the first upsampling layer outputs RD2x2; wherein RD2x2 contains 32 feature maps (channels) in total, and each feature map in RD2x2 has twice the width and height of the feature maps in RF2.
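A minimal sketch of this upsampling layer follows; the function name first_upsampling_layer is illustrative, and nn.UpsamplingBilinear2d is the PyTorch module referred to above:

import torch.nn as nn

# Sketch of the first upsampling layer: element-by-element addition of the two
# streams followed by 2x bilinear upsampling of height and width.
upsample_x2 = nn.UpsamplingBilinear2d(scale_factor=2)

def first_upsampling_layer(RF2, DF2):
    RD2 = RF2 + DF2          # fuse RF2 and DF2 by element-by-element addition
    return upsample_x2(RD2)  # RD2x2: same 32 channels, twice the height and width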
For the second upsampling layer, the input end of the second upsampling layer receives all the feature maps in RF3 and all the feature maps in DF3, RF3 and DF3 are fused in the existing add (element-by-element addition) mode to obtain a set RD3, the second upsampling layer enlarges the feature maps in RD3 to four times their height and width through the UpsamplingBilinear2d function provided by PyTorch with the function parameter set to 4, the set formed by the enlarged feature maps is denoted as RD3x4, and the output end of the second upsampling layer outputs RD3x4; wherein RD3x4 contains 64 feature maps in total, and each feature map in RD3x4 has four times the width and height of the feature maps in RF3.
For the third upsampling layer, the input end of the third upsampling layer receives all the feature maps in RF4 and all the feature maps in DF4, RF4 and DF4 are fused in the existing add (element-by-element addition) mode to obtain a set RD4, the third upsampling layer enlarges the feature maps in RD4 to four times their height and width through the UpsamplingBilinear2d function provided by PyTorch with the function parameter set to 4, the set formed by the enlarged feature maps is denoted as RD4x4, and the output end of the third upsampling layer outputs RD4x4; wherein RD4x4 contains 96 feature maps in total, and each feature map in RD4x4 has four times the width and height of the feature maps in RF4.
For the fourth upsampling layer, the input end of the fourth upsampling layer receives all the feature maps in RF5 and all the feature maps in DF5, RF5 and DF5 are fused in the existing add (element-by-element addition) mode to obtain a set RD5, the fourth upsampling layer enlarges the feature maps in RD5 to four times their height and width through the UpsamplingBilinear2d function provided by PyTorch with the function parameter set to 4, the set formed by the enlarged feature maps is denoted as RD5x4, and the output end of the fourth upsampling layer outputs RD5x4; wherein RD5x4 contains 160 feature maps in total, and each feature map in RD5x4 has four times the width and height of the feature maps in RF5.
For the eleventh fusion layer, the input end of the eleventh fusion layer receives all the feature maps in RF1 and all the feature maps in DF1, and RF1 and DF1 are fused in the existing add (element-by-element addition) mode to obtain a set RD1; wherein RD1 contains 24 feature maps in total, and each feature map in RD1 has the same width and height as the feature maps in RF1.
For the twelfth fusion layer, the input end of the twelfth fusion layer receives all the feature maps in RD1, RD2x2, RD3x4, RD4x4 and RD5x4, the twelfth fusion layer connects RD1, RD2x2, RD3x4, RD4x4 and RD5x4 in the existing concatenate mode to obtain a set RDM, and the output end of the twelfth fusion layer outputs the RDM; the RDM contains 376 (24+32+64+96+160) feature maps in total, and each feature map in the RDM has the same width and height as the feature maps in RD1.
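The twelfth fusion layer corresponds to a channel-wise concatenation, sketched below with illustrative names:

import torch

# Sketch of the twelfth fusion layer: concatenate the five sets of feature maps
# along the channel dimension, giving 24 + 32 + 64 + 96 + 160 = 376 channels.
def twelfth_fusion_layer(RD1, RD2x2, RD3x4, RD4x4, RD5x4):
    return torch.cat([RD1, RD2x2, RD3x4, RD4x4, RD5x4], dim=1)  # RDM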
For the output layer, the output layer consists of a seventy-first convolution layer and a fifth upsampling layer which are sequentially connected, wherein the convolution kernel size of the seventy-first convolution layer is 1x1, the number of convolution kernels is 41, the stride is 1, and the bias term (bias) is False; the fifth upsampling layer enlarges the feature maps to four times their height and width through the UpsamplingBilinear2d function provided by PyTorch with the function parameter set to 4; the input end of the output layer receives all the feature maps in the RDM, and the output end of the output layer outputs 41 semantic segmentation prediction maps corresponding to the original input image.
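A sketch of the output layer follows; the input channel count of 376 is taken from the RDM described above, and the variable name output_layer is illustrative:

import torch.nn as nn

# Sketch of the output layer: a 1x1 convolution producing 41 class score maps,
# followed by 4x bilinear upsampling back to the original input resolution.
output_layer = nn.Sequential(
    nn.Conv2d(376, 41, kernel_size=1, stride=1, bias=False),
    nn.UpsamplingBilinear2d(scale_factor=4),
)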
Step 1_3: Each pair of original indoor scene RGB image and Depth image in the training set is input into the convolutional neural network as the original input images for training, obtaining 41 semantic segmentation prediction maps corresponding to each pair of original indoor scene images in the training set; the set formed by the 41 semantic segmentation prediction maps corresponding to {RGB_q(i,j), Depth_q(i,j)} is denoted as the prediction set of {RGB_q(i,j), Depth_q(i,j)}.
Step 1_4: Calculate the loss function value between the set formed by the 41 semantic segmentation prediction maps corresponding to each pair of original indoor scene images in the training set and the set formed by the 41 one-hot encoded images obtained from the corresponding real semantic segmentation image; this loss function value is obtained using categorical cross entropy.
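In PyTorch, categorical cross entropy over the 41 classes can be computed as sketched below; converting the one-hot encoded label maps to class indices with argmax is an assumption of this sketch, since nn.CrossEntropyLoss expects index targets:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def segmentation_loss(pred, one_hot_target):
    # pred:           B x 41 x H x W raw scores output by the network
    # one_hot_target: B x 41 x H x W one-hot encoded real semantic segmentation maps
    class_indices = one_hot_target.argmax(dim=1)   # B x H x W class index map
    return criterion(pred, class_indices)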
Step 1_5: Repeat step 1_3 and step 1_4 a total of V times to obtain a trained convolutional neural network classification model and Q multiplied by V loss function values; then find the smallest loss function value among the Q multiplied by V loss function values; then take the weight vector and bias term corresponding to the smallest loss function value as the optimal weight vector and optimal bias term of the convolutional neural network classification training model, correspondingly denoted as W_best and b_best; where V > 1, and in this embodiment V is 500.
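The training procedure of steps 1_3 to 1_5 can be sketched as below; model, train_loader and optimizer are assumed to be defined elsewhere, and segmentation_loss is the loss sketch given above. This is an illustrative loop, not the exact training script of the invention:

import torch

def train(model, train_loader, optimizer, V=500, device="cuda"):
    # Repeat the forward/backward pass V times over the training set and keep the
    # parameters (weight vectors and bias terms) that produced the smallest loss.
    best_loss, best_state = float("inf"), None
    model.to(device)
    for epoch in range(V):
        for rgb, depth, one_hot_target in train_loader:
            rgb, depth = rgb.to(device), depth.to(device)
            one_hot_target = one_hot_target.to(device)
            pred = model(rgb, depth)                      # 41 semantic segmentation prediction maps
            loss = segmentation_loss(pred, one_hot_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:                   # track W_best and b_best
                best_loss = loss.item()
                best_state = {k: v.detach().cpu().clone()
                              for k, v in model.state_dict().items()}
    return best_state, best_loss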
The test stage process comprises the following specific steps:
step 2_ 1: separating P color map and depth map from original data set as test set, where P is positive integer, and taking 699 as P in the invention to make { RGBp(i',j'),Depthp(i ', j') } represents an indoor scene image to be subjected to semantic segmentation in the test set, and P is more than or equal to 1 and less than or equal to P; wherein i ' is not less than 1 and not more than W ', j ' is not less than 1 and not more than H ', and W ' represents { RGB ≦p(i',j'),Depthp(i ', j ') } widths of color and depth maps, H ' denotes { RGB }p(i',j'),Depthp(i ', j)' } height of color and depth maps, RGBp(i',j'),Depthp(i ', j') denote { RGB } respectivelyp(i',j'),Depthp(i ', j') } the pixel value of the pixel point with the coordinate position of the color map RGB and the Depth map Depth being (i, j).
Step 2_2: Input the color image RGB and the Depth image Depth in {RGB_p(i',j'), Depth_p(i',j')} into the trained improved fully convolutional neural network semantic segmentation model, and use the optimal convolution kernel weights W_best and the optimal bias term b_best obtained in training to predict the predicted semantic segmentation image corresponding to {RGB_p(i',j'), Depth_p(i',j')}; the pixel value of the pixel point whose coordinate position is (i',j') in the predicted semantic segmentation image is the prediction result for that pixel point.
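The test-stage prediction of step 2_2 can be sketched as follows; model and best_state are assumed to come from the training sketch above, and taking the argmax over the 41 channels is one common way to turn the prediction maps into a per-pixel label image:

import torch

@torch.no_grad()
def predict(model, best_state, rgb, depth, device="cuda"):
    model.load_state_dict(best_state)                # load W_best and b_best
    model.to(device).eval()
    pred = model(rgb.to(device), depth.to(device))   # B x 41 x H' x W'
    return pred.argmax(dim=1)                        # predicted class label of every pixel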
To further verify the effects of the present invention, the following experiments were conducted.
The architecture of the improved fully convolutional neural network semantic segmentation model of the invention is built with the deep learning library PyTorch 1.1.0 based on Python 3.6. The indoor scene image database NYUv2 test set (699 pairs of indoor scene images) is used to analyze the segmentation effect of the indoor scene images predicted by the method of the invention. Three common objective parameters for evaluating a semantic segmentation method are used as evaluation indexes, namely Class Accuracy (CA), Mean Pixel Accuracy (MPA) and the ratio of the intersection to the union of the segmented image and the label image (Mean Intersection over Union, MIoU), to evaluate the segmentation performance of the predicted semantic segmentation images. The higher the three index values, the better the model performance.
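These indexes can be computed from a confusion matrix accumulated over the test set, as sketched below; the exact definitions used in the experiments are not spelled out here, so common definitions (mean per-class accuracy, overall pixel accuracy and mean IoU) are assumed, and preds and labels are assumed to be NumPy integer label maps with values in [0, 41):

import numpy as np

def evaluate(preds, labels, num_classes=41):
    # Accumulate a confusion matrix over all test images.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, l in zip(preds, labels):                   # p, l: H x W integer label maps
        idx = num_classes * l.reshape(-1) + p.reshape(-1)
        cm += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(np.float64)
    class_acc = tp / np.maximum(cm.sum(axis=1), 1)    # per-class accuracy
    pixel_acc = tp.sum() / max(cm.sum(), 1)           # overall pixel accuracy
    iou = tp / np.maximum(cm.sum(axis=1) + cm.sum(axis=0) - tp, 1)
    return class_acc.mean(), pixel_acc, iou.mean()    # CA, MPA, MIoU (assumed definitions)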
Each pair of indoor scene images in the indoor scene image database NYUv2 test set is predicted by the method of the invention to obtain the corresponding predicted semantic segmentation image; the class accuracy CA, the mean pixel accuracy MPA and the ratio MIoU of the intersection to the union of the segmented image and the label image, which reflect the semantic segmentation effect of the method of the invention, are listed in Table 1. As can be seen from the data in Table 1, the method of the invention achieves good results on the three indexes CA, MPA and MIoU, which indicates the effectiveness of the method of the invention.
TABLE 1 Evaluation results of the method of the invention on the NYUv2 test set
CA 61.01%
MPA 74.34%
MIoU 47.92%
Fig. 2a shows the 1st original indoor scene color RGB image, Fig. 2b shows the original Depth image of the same indoor scene, and Fig. 2c shows the predicted semantic segmentation image obtained by predicting the original indoor scene images shown in Fig. 2a and Fig. 2b with the method of the present invention; Fig. 3a shows the 2nd original indoor scene color RGB image, Fig. 3b shows the original Depth image of the same indoor scene, and Fig. 3c shows the predicted semantic segmentation image obtained by predicting the original indoor scene images shown in Fig. 3a and Fig. 3b with the method of the present invention; Fig. 4a shows the 3rd original indoor scene color RGB image, Fig. 4b shows the original Depth image of the same indoor scene, and Fig. 4c shows the predicted semantic segmentation image obtained by predicting the original indoor scene images shown in Fig. 4a and Fig. 4b with the method of the present invention; Fig. 5a shows the 4th original indoor scene color RGB image, Fig. 5b shows the original Depth image of the same indoor scene, and Fig. 5c shows the predicted semantic segmentation image obtained by predicting the original indoor scene images shown in Fig. 5a and Fig. 5b with the method of the present invention. Comparing Fig. 2a and Fig. 2b with Fig. 2c, Fig. 3a and Fig. 3b with Fig. 3c, Fig. 4a and Fig. 4b with Fig. 4c, and Fig. 5a and Fig. 5b with Fig. 5c, it can be seen that the predicted semantic segmentation images obtained by the method of the present invention have high segmentation accuracy.

Claims (9)

1. The indoor scene semantic segmentation method based on the improved full convolution neural network is characterized by comprising the following steps of:
step 1: selecting Q pairs of original indoor scene images and corresponding real semantic segmentation images, and forming a training set by all the original indoor scene images and the corresponding real semantic segmentation images; each pair of original indoor scene images comprises an original indoor scene color image and an original indoor scene depth image, and the real semantic segmentation images in the training set are processed into 41 independent thermal coding images by adopting an independent thermal coding technology;
step 2: constructing a convolutional neural network classification training model: the convolutional neural network classification training model comprises an input layer, a hidden layer and an output layer; the input layer comprises a color image input layer and a depth image input layer; the hidden layer comprises a color image processing module and a depth image processing module; the color image processing module and the depth image processing module are symmetrical in structure and respectively comprise five neural network blocks, five feature re-extraction volume blocks and ten fusion layers; the hidden layer also comprises five block attention volume blocks, four upper sampling layers and two fusion layers;
step 3: inputting the training set into the convolutional neural network classification training model of step 2 for training, performing iterative training, each iteration obtaining 41 semantic segmentation predicted images corresponding to each pair of original indoor scene images, and calculating a loss function value between the set formed by the 41 semantic segmentation predicted images and the set formed by the 41 one-hot coded images corresponding to the real semantic segmentation images;
step 4: repeating step 3 for a total of V times to obtain Q multiplied by V loss function values; then finding out the minimum loss function value from the Q multiplied by V loss function values, and taking the weight vector and the bias item corresponding to the minimum loss function value as the optimal weight vector and the optimal bias item of the convolutional neural network classification training model, thereby completing the training of the convolutional neural network classification training model;
and step 5: performing prediction processing on an indoor scene image to be predicted by using the convolutional neural network classification training model obtained after training, and outputting a corresponding predicted semantic segmentation image, thereby realizing indoor scene image semantic segmentation.
2. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 1, wherein: the step 2) is specifically as follows:
the color image input layer and the depth image input layer are respectively input into a first neural network block in the color image processing module and the depth image processing module;
the color image processing module and the depth image processing module have the same structure, and specifically comprise:
one output of the first neural network block is input into a first fusion layer through a first feature re-extraction convolution block, and the other output of the first neural network block is input into a first fusion layer; one output of the second neural network block is input into a third fusion layer through a second feature re-extraction convolution block, and the other output of the second neural network block is input into the third fusion layer; one output of the third neural network block is input into a fifth fusion layer through a third feature re-extraction convolution block, and the other output of the third neural network block is input into the fifth fusion layer; one output of the fourth neural network block is input into the seventh fusion layer through the fourth feature re-extraction convolution block, and the other output of the fourth neural network block is input into the seventh fusion layer; one output of the fifth neural network block is input into a ninth fusion layer through a fifth feature re-extraction convolution block, and the other output of the fifth neural network block is input into the ninth fusion layer; the two inputs of each fusion layer are fused in an element-by-element addition mode;
the output of the first fusion layer is respectively input into a first block attention volume block and a corresponding second fusion layer, the output of the third fusion layer is respectively input into a second block attention volume block and a corresponding fourth fusion layer, the output of the fifth fusion layer is respectively input into a third block attention volume block and a corresponding sixth fusion layer, the output of the seventh fusion layer is respectively input into a fourth block attention volume block and a corresponding eighth fusion layer, and the output of the ninth fusion layer is respectively input into a fifth block attention volume block and a corresponding tenth fusion layer;
two outputs of the first block attention volume block are respectively input into a second fusion layer of the color image processing module and a second fusion layer of the depth image processing module, two outputs of the second block attention volume block are respectively input into a fourth fusion layer of the color image processing module and a fourth fusion layer of the depth image processing module, two outputs of the third block attention volume block are respectively input into a sixth fusion layer of the color image processing module and a sixth fusion layer of the depth image processing module, two outputs of the fourth block attention volume block are respectively input into an eighth fusion layer of the color image processing module and a eighth fusion layer of the depth image processing module, and two outputs of the fifth block attention volume block are respectively input into a tenth fusion layer of the color image processing module and the tenth fusion layer of the depth image processing module;
the two inputs of the second fusion layer are fused in an element-by-element addition mode and then respectively input into an eleventh fusion layer and a corresponding second neural network block, the two inputs of the fourth fusion layer are fused in an element-by-element addition mode and then respectively input into a first up-sampling layer and a corresponding third neural network block, the two inputs of the sixth fusion layer are fused in an element-by-element addition mode and then respectively input into a second up-sampling layer and a corresponding fourth neural network block, and the two inputs of the eighth fusion layer are fused in an element-by-element addition mode and then respectively input into a third up-sampling layer and a corresponding fifth neural network block; the output of the tenth fusion layer is input into the fourth upsampling layer;
the two inputs of the eleventh fusion layer, the first up-sampling layer, the second up-sampling layer, the third up-sampling layer and the fourth up-sampling layer are fused in an element-by-element addition mode and then are all input into the twelfth fusion layer;
and all the inputs of the twelfth fusion layer are connected in a concatenate mode and then output through the output layer, and the output layer mainly comprises a convolution layer and a fifth upper sampling layer which are sequentially connected.
3. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 1, wherein: the five neural network blocks adopt a MobileNet V2 network structure, the first neural network block adopts 1-4 layers in MobileNet V2, the second neural network block adopts 5-7 layers in MobileNet V2, the third neural network block adopts 8-11 layers in MobileNet V2, the fourth neural network block adopts 12-14 layers in MobileNet V2, and the fifth neural network block adopts 15-17 layers in MobileNet V2.
4. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 2, wherein: each feature re-extraction convolution block consists of four re-extraction modules which are connected in sequence, and each re-extraction module comprises a convolution layer, a normalization layer and an activation layer which are connected in sequence;
the activation mode of all the activation layers is ReLU6; the stride of all the convolution layers is 1; the number of convolution kernels of the convolution layer in each re-extraction module is the same as the normalization parameter of its normalization layer;
the numbers of convolution kernels of the convolution layers in the first re-extraction module, the third re-extraction module and the fourth re-extraction module are the same, and the number of convolution kernels of the convolution layer in the second re-extraction module is half of that of the convolution layer in the first re-extraction module;
the convolution kernel sizes of the convolution layers in the first re-extraction module and the third re-extraction module are both 3x3, and the convolution kernel sizes of the convolution layers in the second re-extraction module and the fourth re-extraction module are both 1x1; the dilation factor of the convolution layer in the first re-extraction module is 1, and the dilation factor of the convolution layer in the third re-extraction module is 2; the zero padding parameter of the convolution layer in the first re-extraction module is 1, the zero padding parameter of the convolution layer in the third re-extraction module is 2, and the zero padding parameters of the convolution layers in the second re-extraction module and the fourth re-extraction module are both 0.
5. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 4, wherein: the number of convolution kernels of the convolution layer of the first re-extraction module in the first feature re-extraction convolution block is 24, the number of convolution kernels of the convolution layer of the first re-extraction module in the second feature re-extraction convolution block is 32, the number of convolution kernels of the convolution layer of the first re-extraction module in the third feature re-extraction convolution block is 64, the number of convolution kernels of the convolution layer of the first re-extraction module in the fourth feature re-extraction convolution block is 96, and the number of convolution kernels of the convolution layer of the first re-extraction module in the fifth feature re-extraction convolution block is 160;
the grouping parameters of the four sequentially arranged convolution layers in the first feature re-extraction convolution block are respectively 24, 12, 24 and 24, the grouping parameters of the four sequentially arranged convolution layers in the second feature re-extraction convolution block are respectively 32, 16, 16 and 32, the grouping parameters of the four sequentially arranged convolution layers in the third feature re-extraction convolution block are respectively 64, 32, 32 and 64, the grouping parameters of the four sequentially arranged convolution layers in the fourth feature re-extraction convolution block are respectively 64, 48, 48 and 96, and the grouping parameters of the four sequentially arranged convolution layers in the fifth feature re-extraction convolution block are respectively 96, 80, 80 and 160.
6. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 2, wherein: each of the block attention convolution blocks comprises a block layer, a channel attention layer and a space size attention layer, wherein the input of the block attention convolution block is input into the channel attention layer and the space size attention layer after passing through the block layer, the input of the block attention convolution block is multiplied by the output of the channel attention layer and the output of the space size attention layer, and the multiplied results are added to be used as the output of the block attention convolution block.
7. The method for indoor scene semantic segmentation based on the improved full convolution neural network as claimed in claim 6, wherein: the block layer adopts the split function provided by PyTorch, and the parameter is half of the number of channels of the input feature map of the block attention convolution block;
the channel attention layer comprises an adaptive maximum pooling layer, a channel attention first convolution layer, a channel attention second convolution layer and a channel attention activation layer which are sequentially connected; the maximum pooling parameter of the adaptive maximum pooling layer is 1; the convolution kernels of the channel attention first convolution layer and the channel attention second convolution layer are both 1x1 in size, the step length is 1, the bias terms are False, the grouping parameters are the same, and the number of convolution kernels in the channel attention first convolution layer is half of that of the channel attention second convolution layer;
the number of convolution kernels of the channel attention first convolution layer of the channel attention layer in the first block attention convolution block is 12, and the grouping parameter is 12; the number of convolution kernels of the channel attention first convolution layer of the channel attention layer in the second block attention convolution block is 16, and the grouping parameter is 16; the number of convolution kernels of the first convolution layer of the channel attention layer in the third block attention convolution block is 32, and the grouping parameter is 32; the number of convolution kernels of the first convolution layer of the channel attention layer in the fourth block attention convolution block is 48, and the grouping parameter is 48; the number of convolution kernels of the first convolution layer of the channel attention layer in the fifth block attention convolution block is 80, and the grouping parameter is 80;
the spatial dimension attention layer comprises a per-channel maximization layer, a spatial dimension attention convolution layer and a spatial dimension attention activation layer which are sequentially connected; adopting a max function carried by the pytorch according to a channel maximization layer; the convolution kernel size of the spatial dimension attention convolution layer is 3 multiplied by 3, the number of the convolution kernels is 1, the expansion factor is 2, the zero padding parameter is 2, the grouping parameter is 1, and the bias term is False;
the channel attention layer and the space size attention layer adopt an activation mode which is a Sigmoid function.
8. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 2, wherein: the upsampling layers adopt the UpsamplingBilinear2d function provided by PyTorch, the function parameter of the first upsampling layer is 2, and the function parameters of the second upsampling layer, the third upsampling layer, the fourth upsampling layer and the fifth upsampling layer are 4; the convolution kernel size of the convolution layer in the output layer is 1x1, the number of convolution kernels is 41, the stride is 1, and the bias term is False.
9. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 4, wherein: the input end of the color image input layer receives an indoor scene color image, the input end of the depth image input layer receives an indoor scene depth image, and the output of the output layer is 41 semantic segmentation predicted images corresponding to the indoor scene image input by the input layer.
CN202011559942.5A 2020-12-25 2020-12-25 Indoor scene semantic segmentation method based on improved full convolution neural network Pending CN112598675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011559942.5A CN112598675A (en) 2020-12-25 2020-12-25 Indoor scene semantic segmentation method based on improved full convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011559942.5A CN112598675A (en) 2020-12-25 2020-12-25 Indoor scene semantic segmentation method based on improved full convolution neural network

Publications (1)

Publication Number Publication Date
CN112598675A true CN112598675A (en) 2021-04-02

Family

ID=75202440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011559942.5A Pending CN112598675A (en) 2020-12-25 2020-12-25 Indoor scene semantic segmentation method based on improved full convolution neural network

Country Status (1)

Country Link
CN (1) CN112598675A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403430A (en) * 2017-06-15 2017-11-28 中山大学 A kind of RGBD image, semantics dividing method
CN107563388A (en) * 2017-09-18 2018-01-09 东北大学 A kind of convolutional neural networks object identification method based on depth information pre-segmentation
KR101970488B1 (en) * 2017-12-28 2019-04-19 포항공과대학교 산학협력단 RGB-D Multi-layer Residual Feature Fusion Network for Indoor Semantic Segmentation
CN110619638A (en) * 2019-08-22 2019-12-27 浙江科技学院 Multi-mode fusion significance detection method based on convolution block attention module
CN110782462A (en) * 2019-10-30 2020-02-11 浙江科技学院 Semantic segmentation method based on double-flow feature fusion
CN111563507A (en) * 2020-04-14 2020-08-21 浙江科技学院 Indoor scene semantic segmentation method based on convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王孙平 (Wang Sunping); 陈世峰 (Chen Shifeng): "Semantic segmentation method of a convolutional neural network fused with depth images", 集成技术 (Journal of Integration Technology), no. 05, 7 June 2018 (2018-06-07) *
章程 (Zhang Cheng): "Research on an RGB-D semantic segmentation model combining deep learning and transfer learning", 中国硕士论文数据库 信息科技辑 (China Master's Theses Database, Information Science and Technology), 15 August 2019 (2019-08-15) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192073A (en) * 2021-04-06 2021-07-30 浙江科技学院 Clothing semantic segmentation method based on cross fusion network
CN113191213A (en) * 2021-04-12 2021-07-30 桂林电子科技大学 High-resolution remote sensing image newly-added building detection method
CN113469064A (en) * 2021-07-05 2021-10-01 安徽大学 Method and system for identifying corn leaf disease images in complex environment
CN113469064B (en) * 2021-07-05 2024-03-29 安徽大学 Identification method and system for corn leaf disease image in complex environment
CN114723951A (en) * 2022-06-08 2022-07-08 成都信息工程大学 Method for RGB-D image segmentation
CN115601550A (en) * 2022-12-13 2023-01-13 深圳思谋信息科技有限公司(Cn) Model determination method, model determination device, computer equipment and computer-readable storage medium
CN116757988A (en) * 2023-08-17 2023-09-15 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on semantic enrichment and segmentation tasks
CN116757988B (en) * 2023-08-17 2023-12-22 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on semantic enrichment and segmentation tasks
CN118172559A (en) * 2024-05-15 2024-06-11 齐鲁工业大学(山东省科学院) Image fusion method based on semantic segmentation and extraction of edge features and gradient features
CN118172559B (en) * 2024-05-15 2024-07-23 齐鲁工业大学(山东省科学院) Image fusion method based on semantic segmentation and extraction of edge features and gradient features

Similar Documents

Publication Publication Date Title
CN112598675A (en) Indoor scene semantic segmentation method based on improved full convolution neural network
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN112766087A (en) Optical remote sensing image ship detection method based on knowledge distillation
CN107679462A (en) A kind of depth multiple features fusion sorting technique based on small echo
CN107832835A (en) The light weight method and device of a kind of convolutional neural networks
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN113034506B (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN117237559B (en) Digital twin city-oriented three-dimensional model data intelligent analysis method and system
CN110782458B (en) Object image 3D semantic prediction segmentation method of asymmetric coding network
CN115035508B (en) Theme-guided transducer-based remote sensing image subtitle generation method
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN112686276A (en) Flame detection method based on improved RetinaNet network
CN113192073A (en) Clothing semantic segmentation method based on cross fusion network
CN109862350A (en) No-reference video quality evaluating method based on time-space domain feature extraction
CN113781410B (en) Medical image segmentation method and system based on MEDU-Net+network
CN113313721B (en) Real-time semantic segmentation method based on multi-scale structure
CN113255574B (en) Urban street semantic segmentation method and automatic driving method
Lin et al. Full-scale selective transformer for semantic segmentation
CN110738645A (en) 3D image quality detection method based on convolutional neural network
Wang et al. Dynamic-boosting attention for self-supervised video representation learning
CN110287763A (en) A kind of candidate frame ratio optimization method towards ship seakeeping application
CN113743188B (en) Feature fusion-based internet video low-custom behavior detection method
Hasan et al. E-DARTS: Enhanced differentiable architecture search for acoustic scene classification
Guo et al. Lae-net: Light and efficient network for compressed video action recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination