CN109460815A - Monocular depth estimation method - Google Patents

Monocular depth estimation method

Info

Publication number
CN109460815A
Authority
CN
China
Prior art keywords: feature maps, layer, output, feature, width
Legal status: Granted
Application number
CN201811246664.0A
Other languages: Chinese (zh)
Other versions: CN109460815B (en)
Inventor
周武杰
袁建中
吕思嘉
钱亚冠
向坚
张宇来
Current Assignee: Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee: Zhejiang Lover Health Science and Technology Development Co Ltd
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201811246664.0A
Publication of CN109460815A
Application granted
Publication of CN109460815B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention discloses a monocular vision depth estimation method. The method first constructs a convolutional neural network comprising an input layer, a hidden layer and an output layer, where the hidden layer comprises an encoding framework, a decoding framework and an upsampling framework. The monocular images of a training set are then fed into the convolutional neural network as original input images for training, yielding the estimated depth image corresponding to each original monocular image in the training set. Next, by computing the loss function value between the estimated depth image corresponding to each monocular image in the training set and the corresponding ground-truth depth image, the trained convolutional neural network model together with the optimal weight vector and optimal bias term is obtained. Finally, a monocular image to be predicted is input into the trained convolutional neural network model, and the corresponding predicted depth image is obtained using the optimal weight vector and the optimal bias term. The advantage of the method is its high prediction accuracy.

Description

A Monocular Vision Depth Estimation Method

Technical Field

The present invention relates to an image signal processing technique, and in particular to a monocular vision depth estimation method.

Background Art

The rapid development of the economy has brought a continuous improvement in living standards; as people's expectations for quality of life grow, so does the demand for convenient transportation. As a key part of transportation, the automobile receives ever more attention. With artificial intelligence booming, autonomous driving has become one of the hottest topics of recent years, and after Baidu announced that its driverless cars were entering mass production and would soon be put into use, interest in autonomous driving has kept rising. Monocular vision depth estimation of the scene in front of the vehicle is one component of autonomous driving, and it can effectively help keep the vehicle safe while it is driving.

Monocular vision depth estimation methods fall mainly into traditional methods and deep learning methods. Before deep learning appeared, the results of depth estimation that relied on traditional methods fell far short of people's minimum expectations; after deep learning appeared, end-to-end training on large amounts of data improved the accuracy of the estimated depth dramatically. In "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network", a two-scale neural network was proposed for depth estimation: a coarse-scale network predicts the global depth distribution and a fine-scale network refines the depth map locally. Eigen et al. then extended this two-scale network to three scales: the first scale predicts a rather coarse result from the whole image region, the second scale refines it at medium resolution, and the third scale upsamples the result and refines it to obtain the predicted depth map. However, that three-scale architecture was proposed for the joint prediction of three different computer vision tasks, namely depth prediction, surface normal estimation and semantic segmentation; when used for depth estimation alone, its accuracy is not very high, and the final predicted depth map is only half the size of the original image, an inconsistency in size that hinders direct use of the depth information.

Summary of the Invention

The technical problem to be solved by the present invention is to provide a monocular vision depth estimation method with high prediction accuracy.

The technical solution adopted by the present invention to solve the above technical problem is a monocular vision depth estimation method, characterized in that it comprises two processes: a training stage and a testing stage.

The specific steps of the training stage are as follows:

Step 1_1: select N original monocular images and the ground-truth depth image corresponding to each original monocular image to form a training set; denote the n-th original monocular image in the training set as {Qn(x,y)} and the corresponding ground-truth depth image as {Dn(x,y)}, where N is a positive integer, N ≥ 100, n is a positive integer, 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R denotes the width of {Qn(x,y)} and {Dn(x,y)}, L denotes their height, both R and L are divisible by 2, Qn(x,y) denotes the pixel value of the pixel at coordinate position (x,y) in {Qn(x,y)}, and Dn(x,y) denotes the pixel value of the pixel at coordinate position (x,y) in {Dn(x,y)};

Step 1_2: construct an end-to-end convolutional neural network. The convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises an encoding framework, a decoding framework and an upsampling framework;

For the input layer: the input of the input layer receives an original input image, and the output of the input layer passes the original input image on to the hidden layer, where the original input image received at the input of the input layer is required to have width R and height L;

The encoding framework consists, in order, of: the first convolutional layer, the first batch normalization layer, the first activation layer, the first max pooling layer, the second convolutional layer, the second batch normalization layer, the second activation layer, the third convolutional layer, the third batch normalization layer, the first Concatenate fusion layer, the third activation layer, the second max pooling layer, the fourth convolutional layer, the fourth batch normalization layer, the fourth activation layer, the fifth convolutional layer, the fifth batch normalization layer, the second Concatenate fusion layer, the fifth activation layer, the third max pooling layer, the first atrous (dilated) convolutional layer, the sixth batch normalization layer, the sixth activation layer, the second atrous convolutional layer, the seventh batch normalization layer, the third Concatenate fusion layer, the seventh activation layer, the fourth max pooling layer, the third atrous convolutional layer, the eighth batch normalization layer and the eighth activation layer.

The decoding framework consists, in order, of: the first deconvolution layer, the ninth batch normalization layer, the fourth Concatenate fusion layer, the ninth activation layer, the sixth convolutional layer, the tenth batch normalization layer, the tenth activation layer, the second deconvolution layer, the eleventh batch normalization layer, the fifth Concatenate fusion layer, the eleventh activation layer, the seventh convolutional layer, the twelfth batch normalization layer, the twelfth activation layer, the third deconvolution layer, the thirteenth batch normalization layer, the sixth Concatenate fusion layer, the thirteenth activation layer, the eighth convolutional layer, the fourteenth batch normalization layer, the fourteenth activation layer, the fourth deconvolution layer, the fifteenth batch normalization layer and the seventh Concatenate fusion layer.

The upsampling framework consists, in order, of: the first upsampling layer, the tenth convolutional layer, the seventeenth batch normalization layer, the seventeenth activation layer, the second upsampling layer, the eleventh convolutional layer, the eighteenth batch normalization layer, the eighteenth activation layer, the third upsampling layer, the twelfth convolutional layer, the nineteenth batch normalization layer, the nineteenth activation layer, the fourth upsampling layer, the thirteenth convolutional layer, the twentieth batch normalization layer and the twentieth activation layer.

The output layer consists, in order, of: the fifteenth activation layer, the ninth convolutional layer, the sixteenth batch normalization layer and the sixteenth activation layer.

The first through thirteenth convolutional layers, the first through third atrous convolutional layers and the first through fourth deconvolution layers all use 3×3 convolution kernels. The numbers of convolution kernels are: 32 in the first convolutional layer; 64 in the second and third convolutional layers; 128 in the fourth and fifth convolutional layers; 256 in the first and second atrous convolutional layers; 512 in the third atrous convolutional layer; 256 in the first deconvolution layer and the sixth convolutional layer; 128 in the second deconvolution layer and the seventh convolutional layer; 64 in the third deconvolution layer and the eighth convolutional layer; 32 in the fourth deconvolution layer; 1 in the ninth convolutional layer; 256 in the tenth convolutional layer; 128 in the eleventh convolutional layer; 64 in the twelfth convolutional layer; and 32 in the thirteenth convolutional layer. The convolution strides of the first through thirteenth convolutional layers and of the first through third atrous convolutional layers take the default value; the convolution strides of the first through fourth deconvolution layers are 2×2; the parameters of the first through twentieth batch normalization layers take default values; the activation functions of the first through twentieth activation layers are ReLU; the pooling strides of the first through fourth max pooling layers are 2×2; and the sampling strides of the first through fourth upsampling layers are 2×2;
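
The patent names no software framework, but Concatenate fusion, batch normalization, ReLU and max pooling map one-to-one onto Keras layers, so the sketches below assume TensorFlow/Keras. The helper names conv_bn_relu, conv_bn and deconv_bn are ours, not the patent's; they package the recurring 3×3 convolution, batch normalization and ReLU pattern described above.

```python
# Minimal sketch, assuming TensorFlow/Keras (the patent itself names no framework).
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, dilation=1):
    """3x3 convolution (default stride) -> batch normalization -> ReLU."""
    x = layers.Conv2D(filters, 3, padding='same', dilation_rate=dilation)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def conv_bn(x, filters, dilation=1):
    """3x3 convolution -> batch normalization; used where the activation
    is applied only after a Concatenate fusion layer."""
    x = layers.Conv2D(filters, 3, padding='same', dilation_rate=dilation)(x)
    return layers.BatchNormalization()(x)

def deconv_bn(x, filters):
    """3x3 deconvolution with 2x2 stride -> batch normalization."""
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding='same')(x)
    return layers.BatchNormalization()(x)
```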

For the encoding framework: the input of the first convolutional layer receives the original input image output by the input layer, and the first convolutional layer outputs 32 feature maps, whose set is denoted J1; the first batch normalization layer receives all feature maps in J1 and outputs 32 feature maps, denoted P1; the first activation layer receives all feature maps in P1 and outputs 32 feature maps, denoted H1; each feature map in J1, P1 and H1 has width R and height L. The first max pooling layer receives all feature maps in H1 and outputs 32 feature maps, denoted Z1, each of width R/2 and height L/2. The second convolutional layer receives all feature maps in Z1 and outputs 64 feature maps, denoted J2; the second batch normalization layer receives J2 and outputs 64 feature maps, denoted P2; the second activation layer receives P2 and outputs 64 feature maps, denoted H2; the third convolutional layer receives H2 and outputs 64 feature maps, denoted J3; the third batch normalization layer receives J3 and outputs 64 feature maps, denoted P3; the first Concatenate fusion layer receives all feature maps in P3 and all feature maps in H2 and outputs 128 feature maps, denoted C1; the third activation layer receives C1 and outputs 128 feature maps, denoted H3; each feature map in J2, P2, H2, J3, P3, C1 and H3 has width R/2 and height L/2. The second max pooling layer receives all feature maps in H3 and outputs 128 feature maps, denoted Z2, each of width R/4 and height L/4. The fourth convolutional layer receives all feature maps in Z2 and outputs 128 feature maps, denoted J4; the fourth batch normalization layer receives J4 and outputs 128 feature maps, denoted P4; the fourth activation layer receives P4 and outputs 128 feature maps, denoted H4; the fifth convolutional layer receives H4 and outputs 128 feature maps, denoted J5; the fifth batch normalization layer receives J5 and outputs 128 feature maps, denoted P5; the second Concatenate fusion layer receives all feature maps in P5 and all feature maps in H4 and outputs 256 feature maps, denoted C2; the fifth activation layer receives C2 and outputs 256 feature maps, denoted H5; each feature map in J4, P4, H4, J5, P5, C2 and H5 has width R/4 and height L/4. The third max pooling layer receives all feature maps in H5 and outputs 256 feature maps, denoted Z3, each of width R/8 and height L/8. The first atrous convolutional layer receives all feature maps in Z3 and outputs 256 feature maps, denoted K1; the sixth batch normalization layer receives K1 and outputs 256 feature maps, denoted P6; the sixth activation layer receives P6 and outputs 256 feature maps, denoted H6; the second atrous convolutional layer receives H6 and outputs 256 feature maps, denoted K2; the seventh batch normalization layer receives K2 and outputs 256 feature maps, denoted P7; the third Concatenate fusion layer receives all feature maps in P7 and all feature maps in H6 and outputs 512 feature maps, denoted C3; the seventh activation layer receives C3 and outputs 512 feature maps, denoted H7; each feature map in K1, P6, H6, K2, P7, C3 and H7 has width R/8 and height L/8. The fourth max pooling layer receives all feature maps in H7 and outputs 512 feature maps, denoted Z4, each of width R/16 and height L/16. The third atrous convolutional layer receives all feature maps in Z4 and outputs 512 feature maps, denoted K3; the eighth batch normalization layer receives K3 and outputs 512 feature maps, denoted P8; the eighth activation layer receives P8 and outputs 512 feature maps, denoted H8, which is the output of the encoding framework; each feature map in K3, P8 and H8 has width R/16 and height L/16;
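
Under the Keras assumption above, this encoder data flow can be written down directly as a sketch. The dilation rate of 2 for the three atrous layers is our assumption, as the patent does not state a rate.

```python
def build_encoder(inp):
    """Encoding framework; returns the final output H8 plus the tensors
    reused later by the skip connections and the upsampling framework."""
    h1 = conv_bn_relu(inp, 32)                  # H1: 32 maps, R x L
    z1 = layers.MaxPooling2D(2)(h1)             # Z1: R/2 x L/2
    h2 = conv_bn_relu(z1, 64)                   # H2
    p3 = conv_bn(h2, 64)                        # P3
    c1 = layers.Concatenate()([p3, h2])         # C1: 128 maps (short skip)
    h3 = layers.ReLU()(c1)                      # H3
    z2 = layers.MaxPooling2D(2)(h3)             # Z2: R/4 x L/4
    h4 = conv_bn_relu(z2, 128)                  # H4
    p5 = conv_bn(h4, 128)                       # P5
    c2 = layers.Concatenate()([p5, h4])         # C2: 256 maps (short skip)
    h5 = layers.ReLU()(c2)                      # H5
    z3 = layers.MaxPooling2D(2)(h5)             # Z3: R/8 x L/8
    h6 = conv_bn_relu(z3, 256, dilation=2)      # H6 (atrous; rate assumed)
    p7 = conv_bn(h6, 256, dilation=2)           # P7 (atrous)
    c3 = layers.Concatenate()([p7, h6])         # C3: 512 maps (short skip)
    h7 = layers.ReLU()(c3)                      # H7
    z4 = layers.MaxPooling2D(2)(h7)             # Z4: R/16 x L/16
    h8 = conv_bn_relu(z4, 512, dilation=2)      # H8 (atrous): encoder output
    return h8, p7, p5, p3, h1, z4
```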

For the decoding framework: the input of the first deconvolution layer receives the output of the encoding framework, i.e. all feature maps in H8, and the first deconvolution layer outputs 256 feature maps, denoted F1; the ninth batch normalization layer receives all feature maps in F1 and outputs 256 feature maps, denoted P9; the fourth Concatenate fusion layer receives all feature maps in P9 and all feature maps in P7 and outputs 512 feature maps, denoted C4; the ninth activation layer receives C4 and outputs 512 feature maps, denoted H9; the sixth convolutional layer receives H9 and outputs 256 feature maps, denoted J6; the tenth batch normalization layer receives J6 and outputs 256 feature maps, denoted P10; the tenth activation layer receives P10 and outputs 256 feature maps, denoted H10; each feature map in F1, P9, C4, H9, J6, P10 and H10 has width R/8 and height L/8. The second deconvolution layer receives all feature maps in H10 and outputs 128 feature maps, denoted F2; the eleventh batch normalization layer receives F2 and outputs 128 feature maps, denoted P11; the fifth Concatenate fusion layer receives all feature maps in P11 and all feature maps in P5 and outputs 256 feature maps, denoted C5; the eleventh activation layer receives C5 and outputs 256 feature maps, denoted H11; the seventh convolutional layer receives H11 and outputs 128 feature maps, denoted J7; the twelfth batch normalization layer receives J7 and outputs 128 feature maps, denoted P12; the twelfth activation layer receives P12 and outputs 128 feature maps, denoted H12; each feature map in F2, P11, C5, H11, J7, P12 and H12 has width R/4 and height L/4. The third deconvolution layer receives all feature maps in H12 and outputs 64 feature maps, denoted F3; the thirteenth batch normalization layer receives F3 and outputs 64 feature maps, denoted P13; the sixth Concatenate fusion layer receives all feature maps in P13 and all feature maps in P3 and outputs 128 feature maps, denoted C6; the thirteenth activation layer receives C6 and outputs 128 feature maps, denoted H13; the eighth convolutional layer receives H13 and outputs 64 feature maps, denoted J8; the fourteenth batch normalization layer receives J8 and outputs 64 feature maps, denoted P14; the fourteenth activation layer receives P14 and outputs 64 feature maps, denoted H14; each feature map in F3, P13, C6, H13, J8, P14 and H14 has width R/2 and height L/2. The fourth deconvolution layer receives all feature maps in H14 and outputs 32 feature maps, denoted F4; the fifteenth batch normalization layer receives F4 and outputs 32 feature maps, denoted P15; the seventh Concatenate fusion layer receives all feature maps in P15, all feature maps in H1 and the output of the upsampling framework, and outputs 96 feature maps, denoted C7; each feature map in F4, P15 and C7 has width R and height L;
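
Continuing the sketch, the decoder mirrors the flow above; the long skip connections bring P7, P5, P3 and H1 back in from the encoder, and the final fusion also takes the upsampling framework's output.

```python
def build_decoder(h8, p7, p5, p3, h1, upsample_out):
    """Decoding framework with long skip connections back to the encoder."""
    p9 = deconv_bn(h8, 256)                     # F1 -> P9: R/8 x L/8
    c4 = layers.Concatenate()([p9, p7])         # C4: 512 maps (long skip)
    h9 = layers.ReLU()(c4)
    h10 = conv_bn_relu(h9, 256)                 # H10
    p11 = deconv_bn(h10, 128)                   # F2 -> P11: R/4 x L/4
    c5 = layers.Concatenate()([p11, p5])        # C5: 256 maps (long skip)
    h11 = layers.ReLU()(c5)
    h12 = conv_bn_relu(h11, 128)                # H12
    p13 = deconv_bn(h12, 64)                    # F3 -> P13: R/2 x L/2
    c6 = layers.Concatenate()([p13, p3])        # C6: 128 maps (long skip)
    h13 = layers.ReLU()(c6)
    h14 = conv_bn_relu(h13, 64)                 # H14
    p15 = deconv_bn(h14, 32)                    # F4 -> P15: R x L
    c7 = layers.Concatenate()([p15, h1, upsample_out])  # C7: 96 maps
    return c7
```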

For the upsampling framework: the input of the first upsampling layer receives all feature maps in Z4, and the first upsampling layer outputs 512 feature maps, denoted Y1, each of width R/8 and height L/8; the tenth convolutional layer receives Y1 and outputs 256 feature maps, denoted J10; the seventeenth batch normalization layer receives J10 and outputs 256 feature maps, denoted P17; the seventeenth activation layer receives P17 and outputs 256 feature maps, denoted H17; each feature map in J10, P17 and H17 has width R/8 and height L/8. The second upsampling layer receives all feature maps in H17 and outputs 256 feature maps, denoted Y2, each of width R/4 and height L/4; the eleventh convolutional layer receives Y2 and outputs 128 feature maps, denoted J11; the eighteenth batch normalization layer receives J11 and outputs 128 feature maps, denoted P18; the eighteenth activation layer receives P18 and outputs 128 feature maps, denoted H18; each feature map in J11, P18 and H18 has width R/4 and height L/4. The third upsampling layer receives all feature maps in H18 and outputs 128 feature maps, denoted Y3, each of width R/2 and height L/2; the twelfth convolutional layer receives Y3 and outputs 64 feature maps, denoted J12; the nineteenth batch normalization layer receives J12 and outputs 64 feature maps, denoted P19; the nineteenth activation layer receives P19 and outputs 64 feature maps, denoted H19; each feature map in J12, P19 and H19 has width R/2 and height L/2. The fourth upsampling layer receives all feature maps in H19 and outputs 64 feature maps, denoted Y4, each of width R and height L; the thirteenth convolutional layer receives Y4 and outputs 32 feature maps, denoted J13; the twentieth batch normalization layer receives J13 and outputs 32 feature maps, denoted P20; the twentieth activation layer receives P20 and outputs 32 feature maps, denoted H20; each feature map in J13, P20 and H20 has width R and height L;
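
The upsampling framework is a plain restoration path from Z4 back to full resolution; sketched under the same Keras assumption:

```python
def build_upsampler(z4):
    """Upsampling framework: restores Z4 (R/16 x L/16) to R x L."""
    y1 = layers.UpSampling2D(2)(z4)             # Y1: R/8 x L/8
    h17 = conv_bn_relu(y1, 256)                 # H17
    y2 = layers.UpSampling2D(2)(h17)            # Y2: R/4 x L/4
    h18 = conv_bn_relu(y2, 128)                 # H18
    y3 = layers.UpSampling2D(2)(h18)            # Y3: R/2 x L/2
    h19 = conv_bn_relu(y3, 64)                  # H19
    y4 = layers.UpSampling2D(2)(h19)            # Y4: R x L
    h20 = conv_bn_relu(y4, 32)                  # H20: 32 maps, R x L
    return h20
```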

For the output layer: the input of the fifteenth activation layer receives the output of the decoding framework, i.e. all feature maps in C7, and the fifteenth activation layer outputs 96 feature maps, denoted H15, each of width R and height L; the ninth convolutional layer receives all feature maps in H15 and outputs 1 feature map, denoted J9, of width R and height L; the sixteenth batch normalization layer receives the feature map in J9 and outputs 1 feature map, denoted P16, of width R and height L; the sixteenth activation layer receives the feature map in P16 and outputs 1 feature map, denoted H16, of width R and height L; the feature map in H16 is the estimated depth image corresponding to the original input image;
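
Putting the pieces together, a hedged assembly of the whole network; the three input channels are our assumption (RGB monocular images), and note that the height and width must be divisible by 16 for the skip concatenations to align.

```python
def build_model(height, width, channels=3):   # channel count assumed (RGB)
    """Assemble input layer, the three hidden-layer frameworks and the output layer."""
    inp = layers.Input(shape=(height, width, channels))
    h8, p7, p5, p3, h1, z4 = build_encoder(inp)
    h20 = build_upsampler(z4)
    c7 = build_decoder(h8, p7, p5, p3, h1, h20)
    h15 = layers.ReLU()(c7)                   # fifteenth activation: 96 maps
    p16 = conv_bn(h15, 1)                     # ninth conv (1 kernel) + sixteenth BN
    h16 = layers.ReLU()(p16)                  # sixteenth activation: estimated depth
    return tf.keras.Model(inp, h16)
```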

Step 1_3: take each original monocular image in the training set as the original input image and input it into the convolutional neural network for training, obtaining the estimated depth image corresponding to each original monocular image in the training set; denote the estimated depth image corresponding to {Qn(x,y)} as {D̂n(x,y)}, where D̂n(x,y) denotes the pixel value of the pixel at coordinate position (x,y) in {D̂n(x,y)};

Step 1_4: compute the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding ground-truth depth image; denote the loss function value between {D̂n(x,y)} and {Dn(x,y)} as Lossn;

Step 1_5: repeat step 1_3 and step 1_4 a total of V times to obtain the trained convolutional neural network model, yielding N×V loss function values in total; then find the smallest of the N×V loss function values; finally, take the weight vector and bias term corresponding to that smallest loss function value as the optimal weight vector and optimal bias term of the trained convolutional neural network model, denoted Wbest and bbest respectively, where V > 1.
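
Continuing the Keras sketch, steps 1_3 to 1_5 correspond to an ordinary training loop in which keeping the weights with the smallest loss plays the role of selecting Wbest and bbest. Everything named here (train_images, train_depths, the Adam optimizer, the batch size, the checkpoint file name) is an assumption, not from the patent.

```python
# Hedged training sketch for steps 1_3 to 1_5. `train_images` of shape
# (N, L, R, 3) and `train_depths` of shape (N, L, R, 1) are hypothetical
# arrays built from the training set; R, L, V are as defined above.
model = build_model(height=L, width=R)
model.compile(optimizer='adam', loss='mse')   # MSE loss per step 1_4

# Saving the weights that achieve the smallest loss stands in for
# selecting the optimal weight vector Wbest and bias term bbest.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best_weights.h5', monitor='loss', save_best_only=True,
    save_weights_only=True)
model.fit(train_images, train_depths, epochs=V, batch_size=4,
          callbacks=[checkpoint])
```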

The specific steps of the testing stage are as follows:

Step 2_1: let {Q(x',y')} denote the monocular image to be predicted, where 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' denotes the width of {Q(x',y')}, L' denotes the height of {Q(x',y')}, and Q(x',y') denotes the pixel value of the pixel at coordinate position (x',y') in {Q(x',y')};

Step 2_2: input {Q(x',y')} into the trained convolutional neural network model and predict using Wbest and bbest, obtaining the predicted depth image corresponding to {Q(x',y')}, denoted {Qdepth(x',y')}, where Qdepth(x',y') denotes the pixel value of the pixel at coordinate position (x',y') in {Qdepth(x',y')}.
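
A hedged prediction sketch for this step; the test file name and the absence of input scaling are illustrative assumptions.

```python
import numpy as np

model.load_weights('best_weights.h5')         # Wbest and bbest
img = tf.keras.utils.img_to_array(
    tf.keras.utils.load_img('test.png'))      # hypothetical test image
pred = model.predict(img[np.newaxis, ...])    # input scaling, if any, assumed
q_depth = pred[0, :, :, 0]                    # predicted depth image {Qdepth(x',y')}
```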

In step 1_4, Lossn is obtained using the mean squared error function.
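
Step 1_4 only names the function; written out with the notation above, and with the standard 1/(R·L) normalization of the mean squared error (the normalization constant is our assumption), the loss for the n-th image is

$$\mathrm{Loss}_n = \frac{1}{R\,L}\sum_{x=1}^{R}\sum_{y=1}^{L}\bigl(\hat{D}_n(x,y) - D_n(x,y)\bigr)^2 .$$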

Compared with the prior art, the advantages of the present invention are:

1) In constructing the convolutional neural network, the method uses skip connections, i.e. Concatenate fusion layers. Short skip connections are used within the encoding framework, namely the first, second and third Concatenate fusion layers, and long skip connections are used between the encoding framework and the decoding framework, namely the fourth, fifth, sixth and seventh Concatenate fusion layers. Skip connections benefit multi-scale feature fusion and boundary preservation: the short skip connections enrich the diversity of information during encoding, and the long skip connections compensate for the loss of original boundary information in the decoding part, so that depth estimation with the trained convolutional neural network model is more accurate.

2) The method uses an end-to-end convolutional neural network training framework, and after the third max pooling layer of the encoding framework it uses three atrous convolutional layers to extract feature information; an atrous convolutional layer can enlarge the receptive field of the neurons and obtain more feature information without increasing the number of training parameters.
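
As a worked illustration of this claim: a k×k kernel with dilation rate d covers an effective extent of

$$k_{\mathrm{eff}} = k + (k-1)(d-1),$$

so with the network's 3×3 kernels and an assumed rate of d = 2 (the patent does not state the rate), k_eff = 3 + 2·1 = 5. Each atrous layer thus sees a 5×5 window while training only 3×3 weights per filter per channel, and stacking the three atrous layers widens the receptive field further with no extra parameters.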

3) The hidden layer of the convolutional neural network created by the method comprises an encoding framework, a decoding framework and an upsampling framework; the combination of the three frameworks enables the trained convolutional neural network model to extract features rich in information, so that highly accurate depth information can be obtained, which in turn improves the precision of the depth estimation results.

4) The predicted depth image obtained by the method has the same size as the original monocular image, which facilitates direct use of the depth information in it.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the composition of the encoding framework in the hidden layer of the convolutional neural network created in the method of the present invention;

Fig. 2 is a schematic diagram of the respective compositions of the decoding framework in the hidden layer of the convolutional neural network created in the method of the present invention and of the output layer of the created convolutional neural network;

Fig. 3 is a schematic diagram of the composition of the upsampling framework in the hidden layer of the convolutional neural network created in the method of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the embodiments and the accompanying drawings.

The monocular vision depth estimation method proposed by the present invention is characterized in that it comprises two processes: a training stage and a testing stage.

The specific steps of the training stage are as follows:

Step 1_1: select N original monocular images and the ground-truth depth image corresponding to each original monocular image to form a training set; denote the n-th original monocular image in the training set as {Qn(x,y)} and the corresponding ground-truth depth image as {Dn(x,y)}, where N is a positive integer with N ≥ 100 (here N = 1000 is taken), n is a positive integer with 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R denotes the width of {Qn(x,y)} and {Dn(x,y)}, L denotes their height, both R and L are divisible by 2, Qn(x,y) denotes the pixel value of the pixel at coordinate position (x,y) in {Qn(x,y)}, and Dn(x,y) denotes the pixel value of the pixel at coordinate position (x,y) in {Dn(x,y)}. Here, the original monocular images and their corresponding ground-truth depth images are provided directly by the official KITTI website.
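
For concreteness, a minimal loading sketch for one image/depth pair of this step; the file names, formats and scaling below are hypothetical and should be checked against the actual KITTI packaging.

```python
# Hedged data-preparation sketch: read a monocular image {Qn(x,y)} and its
# ground-truth depth map; paths and scaling are illustrative assumptions.
import numpy as np
from PIL import Image

def load_training_pair(image_path, depth_path):
    q = np.asarray(Image.open(image_path), dtype=np.float32) / 255.0
    d = np.asarray(Image.open(depth_path), dtype=np.float32)
    return q, d[..., np.newaxis]   # depth as a single-channel map
```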

Step 1_2: construct an end-to-end convolutional neural network. The convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises an encoding framework, a decoding framework and an upsampling framework.

For the input layer: the input of the input layer receives an original input image, and the output of the input layer passes the original input image on to the hidden layer, where the original input image received at the input of the input layer is required to have width R and height L.

For the encoding framework, as shown in Figure 1, it consists of the following layers arranged in sequence: the first convolutional layer, the first batch normalization layer, the first activation layer, the first max-pooling layer, the second convolutional layer, the second batch normalization layer, the second activation layer, the third convolutional layer, the third batch normalization layer, the first Concatenate fusion layer, the third activation layer, the second max-pooling layer, the fourth convolutional layer, the fourth batch normalization layer, the fourth activation layer, the fifth convolutional layer, the fifth batch normalization layer, the second Concatenate fusion layer, the fifth activation layer, the third max-pooling layer, the first dilated (atrous) convolutional layer, the sixth batch normalization layer, the sixth activation layer, the second dilated convolutional layer, the seventh batch normalization layer, the third Concatenate fusion layer, the seventh activation layer, the fourth max-pooling layer, the third dilated convolutional layer, the eighth batch normalization layer, and the eighth activation layer.

For the decoding framework, as shown in Figure 2, it consists of the following layers arranged in sequence: the first deconvolution layer, the ninth batch normalization layer, the fourth Concatenate fusion layer, the ninth activation layer, the sixth convolutional layer, the tenth batch normalization layer, the tenth activation layer, the second deconvolution layer, the eleventh batch normalization layer, the fifth Concatenate fusion layer, the eleventh activation layer, the seventh convolutional layer, the twelfth batch normalization layer, the twelfth activation layer, the third deconvolution layer, the thirteenth batch normalization layer, the sixth Concatenate fusion layer, the thirteenth activation layer, the eighth convolutional layer, the fourteenth batch normalization layer, the fourteenth activation layer, the fourth deconvolution layer, the fifteenth batch normalization layer, and the seventh Concatenate fusion layer.

For the upsampling framework, as shown in Figure 3, it consists of the following layers arranged in sequence: the first upsampling layer, the tenth convolutional layer, the seventeenth batch normalization layer, the seventeenth activation layer, the second upsampling layer, the eleventh convolutional layer, the eighteenth batch normalization layer, the eighteenth activation layer, the third upsampling layer, the twelfth convolutional layer, the nineteenth batch normalization layer, the nineteenth activation layer, the fourth upsampling layer, the thirteenth convolutional layer, the twentieth batch normalization layer, and the twentieth activation layer.

For the output layer, as shown in Figure 2, it consists of the fifteenth activation layer, the ninth convolutional layer, the sixteenth batch normalization layer, and the sixteenth activation layer, arranged in sequence.

The kernel size of the first through thirteenth convolutional layers, the first through third dilated convolutional layers, and the first through fourth deconvolution layers is 3×3. The numbers of convolution kernels are: 32 for the first convolutional layer; 64 for the second and third convolutional layers; 128 for the fourth and fifth convolutional layers; 256 for the first and second dilated convolutional layers; 512 for the third dilated convolutional layer; 256 for the first deconvolution layer and the sixth convolutional layer; 128 for the second deconvolution layer and the seventh convolutional layer; 64 for the third deconvolution layer and the eighth convolutional layer; 32 for the fourth deconvolution layer; 1 for the ninth convolutional layer; 256 for the tenth convolutional layer; 128 for the eleventh convolutional layer; 64 for the twelfth convolutional layer; and 32 for the thirteenth convolutional layer. The convolution strides of the first through thirteenth convolutional layers and of the first through third dilated convolutional layers take the default value; the stride of the first through fourth deconvolution layers is 2×2; the parameters of the first through twentieth batch normalization layers take the default values; the activation function of the first through twentieth activation layers is ReLU; the pooling stride of the first through fourth max-pooling layers is 2×2; and the sampling stride of the first through fourth upsampling layers is 2×2.
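The patent does not name an implementation framework; as a minimal sketch, the repeated building blocks described above can be expressed with Keras-style layers (Keras is an assumption, suggested by the layer name "Concatenate" in the text, and "same" padding is likewise an assumption, needed so that convolutions preserve the stated widths and heights):

    # Sketch of the repeated building blocks (Keras assumed, not specified
    # in the patent). All kernels are 3x3, batch normalization uses default
    # parameters, activations are ReLU, and deconvolution strides are 2x2.
    from tensorflow.keras import layers

    def conv_bn(x, filters, dilation=1):
        # 3x3 convolution (dilated when dilation > 1) followed by batch norm
        x = layers.Conv2D(filters, 3, padding="same",
                          dilation_rate=dilation)(x)
        return layers.BatchNormalization()(x)

    def conv_bn_relu(x, filters, dilation=1):
        # convolution -> batch normalization -> ReLU activation
        return layers.Activation("relu")(conv_bn(x, filters, dilation))

    def deconv_bn(x, filters):
        # 3x3 deconvolution (transposed convolution) with a 2x2 stride,
        # followed by batch normalization
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
        return layers.BatchNormalization()(x)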

For the encoding framework, the input end of the first convolutional layer receives the original input image output by the output end of the input layer, and the output end of the first convolutional layer outputs 32 feature maps; the set of all output feature maps is denoted J1, where each feature map in J1 has width R and height L. The first batch normalization layer receives all feature maps in J1 and outputs 32 feature maps, denoted P1, each of width R and height L. The first activation layer receives all feature maps in P1 and outputs 32 feature maps, denoted H1, each of width R and height L. The first max-pooling layer receives all feature maps in H1 and outputs 32 feature maps, denoted Z1, each of width R/2 and height L/2. The second convolutional layer receives all feature maps in Z1 and outputs 64 feature maps, denoted J2, each of width R/2 and height L/2. The second batch normalization layer receives all feature maps in J2 and outputs 64 feature maps, denoted P2, each of width R/2 and height L/2. The second activation layer receives all feature maps in P2 and outputs 64 feature maps, denoted H2, each of width R/2 and height L/2. The third convolutional layer receives all feature maps in H2 and outputs 64 feature maps, denoted J3, each of width R/2 and height L/2. The third batch normalization layer receives all feature maps in J3 and outputs 64 feature maps, denoted P3, each of width R/2 and height L/2. The first Concatenate fusion layer receives all feature maps in P3 and all feature maps in H2, and outputs 128 feature maps, denoted C1, each of width R/2 and height L/2. The third activation layer receives all feature maps in C1 and outputs 128 feature maps, denoted H3, each of width R/2 and height L/2. The second max-pooling layer receives all feature maps in H3 and outputs 128 feature maps, denoted Z2, each of width R/4 and height L/4. The fourth convolutional layer receives all feature maps in Z2 and outputs 128 feature maps, denoted J4, each of width R/4 and height L/4. The fourth batch normalization layer receives all feature maps in J4 and outputs 128 feature maps, denoted P4, each of width R/4 and height L/4. The fourth activation layer receives all feature maps in P4 and outputs 128 feature maps, denoted H4, each of width R/4 and height L/4. The fifth convolutional layer receives all feature maps in H4 and outputs 128 feature maps, denoted J5, each of width R/4 and height L/4. The fifth batch normalization layer receives all feature maps in J5 and outputs 128 feature maps, denoted P5, each of width R/4 and height L/4. The second Concatenate fusion layer receives all feature maps in P5 and all feature maps in H4, and outputs 256 feature maps, denoted C2, each of width R/4 and height L/4. The fifth activation layer receives all feature maps in C2 and outputs 256 feature maps, denoted H5, each of width R/4 and height L/4. The third max-pooling layer receives all feature maps in H5 and outputs 256 feature maps, denoted Z3, each of width R/8 and height L/8. The first dilated convolutional layer receives all feature maps in Z3 and outputs 256 feature maps, denoted K1, each of width R/8 and height L/8. The sixth batch normalization layer receives all feature maps in K1 and outputs 256 feature maps, denoted P6, each of width R/8 and height L/8. The sixth activation layer receives all feature maps in P6 and outputs 256 feature maps, denoted H6, each of width R/8 and height L/8. The second dilated convolutional layer receives all feature maps in H6 and outputs 256 feature maps, denoted K2, each of width R/8 and height L/8. The seventh batch normalization layer receives all feature maps in K2 and outputs 256 feature maps, denoted P7, each of width R/8 and height L/8. The third Concatenate fusion layer receives all feature maps in P7 and all feature maps in H6, and outputs 512 feature maps, denoted C3, each of width R/8 and height L/8. The seventh activation layer receives all feature maps in C3 and outputs 512 feature maps, denoted H7, each of width R/8 and height L/8. The fourth max-pooling layer receives all feature maps in H7 and outputs 512 feature maps, denoted Z4, each of width R/16 and height L/16. The third dilated convolutional layer receives all feature maps in Z4 and outputs 512 feature maps, denoted K3, each of width R/16 and height L/16. The eighth batch normalization layer receives all feature maps in K3 and outputs 512 feature maps, denoted P8, each of width R/16 and height L/16. The eighth activation layer receives all feature maps in P8 and outputs 512 feature maps, denoted H8; H8 is the output of the encoding framework, where each feature map in H8 has width R/16 and height L/16.
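Assembling the encoding framework from the hypothetical helpers sketched earlier gives the following outline; the dilation rate of 2 for the dilated convolutions is an assumption, since the text fixes only their kernel size and kernel counts:

    # Encoder sketch (Keras assumed). Variable names mirror the text;
    # only the max-pooling layers change the spatial size.
    from tensorflow.keras import layers

    def encoder(inp):
        h1 = conv_bn_relu(inp, 32)                       # H1: R x L
        z1 = layers.MaxPooling2D(2)(h1)                  # Z1: R/2 x L/2
        h2 = conv_bn_relu(z1, 64)                        # H2
        p3 = conv_bn(h2, 64)                             # P3
        h3 = layers.Activation("relu")(
            layers.Concatenate()([p3, h2]))              # C1 -> H3: 128 maps
        z2 = layers.MaxPooling2D(2)(h3)                  # Z2: R/4 x L/4
        h4 = conv_bn_relu(z2, 128)                       # H4
        p5 = conv_bn(h4, 128)                            # P5
        h5 = layers.Activation("relu")(
            layers.Concatenate()([p5, h4]))              # C2 -> H5: 256 maps
        z3 = layers.MaxPooling2D(2)(h5)                  # Z3: R/8 x L/8
        h6 = conv_bn_relu(z3, 256, dilation=2)           # H6 (dilated, rate assumed)
        p7 = conv_bn(h6, 256, dilation=2)                # P7 (dilated)
        h7 = layers.Activation("relu")(
            layers.Concatenate()([p7, h6]))              # C3 -> H7: 512 maps
        z4 = layers.MaxPooling2D(2)(h7)                  # Z4: R/16 x L/16
        h8 = conv_bn_relu(z4, 512, dilation=2)           # H8: encoder output
        return h8, h1, p3, p5, p7, z4                    # skips reused later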

For the decoding framework, the input end of the first deconvolution layer receives the output of the encoding framework, i.e., all feature maps in H8, and the output end of the first deconvolution layer outputs 256 feature maps, denoted F1, each of width R/8 and height L/8. The ninth batch normalization layer receives all feature maps in F1 and outputs 256 feature maps, denoted P9, each of width R/8 and height L/8. The fourth Concatenate fusion layer receives all feature maps in P9 and all feature maps in P7, and outputs 512 feature maps, denoted C4, each of width R/8 and height L/8. The ninth activation layer receives all feature maps in C4 and outputs 512 feature maps, denoted H9, each of width R/8 and height L/8. The sixth convolutional layer receives all feature maps in H9 and outputs 256 feature maps, denoted J6, each of width R/8 and height L/8. The tenth batch normalization layer receives all feature maps in J6 and outputs 256 feature maps, denoted P10, each of width R/8 and height L/8. The tenth activation layer receives all feature maps in P10 and outputs 256 feature maps, denoted H10, each of width R/8 and height L/8. The second deconvolution layer receives all feature maps in H10 and outputs 128 feature maps, denoted F2, each of width R/4 and height L/4. The eleventh batch normalization layer receives all feature maps in F2 and outputs 128 feature maps, denoted P11, each of width R/4 and height L/4. The fifth Concatenate fusion layer receives all feature maps in P11 and all feature maps in P5, and outputs 256 feature maps, denoted C5, each of width R/4 and height L/4. The eleventh activation layer receives all feature maps in C5 and outputs 256 feature maps, denoted H11, each of width R/4 and height L/4. The seventh convolutional layer receives all feature maps in H11 and outputs 128 feature maps, denoted J7, each of width R/4 and height L/4. The twelfth batch normalization layer receives all feature maps in J7 and outputs 128 feature maps, denoted P12, each of width R/4 and height L/4. The twelfth activation layer receives all feature maps in P12 and outputs 128 feature maps, denoted H12, each of width R/4 and height L/4. The third deconvolution layer receives all feature maps in H12 and outputs 64 feature maps, denoted F3, each of width R/2 and height L/2. The thirteenth batch normalization layer receives all feature maps in F3 and outputs 64 feature maps, denoted P13, each of width R/2 and height L/2. The sixth Concatenate fusion layer receives all feature maps in P13 and all feature maps in P3, and outputs 128 feature maps, denoted C6, each of width R/2 and height L/2. The thirteenth activation layer receives all feature maps in C6 and outputs 128 feature maps, denoted H13, each of width R/2 and height L/2. The eighth convolutional layer receives all feature maps in H13 and outputs 64 feature maps, denoted J8, each of width R/2 and height L/2. The fourteenth batch normalization layer receives all feature maps in J8 and outputs 64 feature maps, denoted P14, each of width R/2 and height L/2. The fourteenth activation layer receives all feature maps in P14 and outputs 64 feature maps, denoted H14, each of width R/2 and height L/2. The fourth deconvolution layer receives all feature maps in H14 and outputs 32 feature maps, denoted F4, each of width R and height L. The fifteenth batch normalization layer receives all feature maps in F4 and outputs 32 feature maps, denoted P15, each of width R and height L. The seventh Concatenate fusion layer receives all feature maps in P15, all feature maps in H1, and the output of the upsampling framework, and outputs 96 feature maps, denoted C7, each of width R and height L.
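A matching decoder sketch under the same Keras assumption; the skip connections follow the text exactly (C4 fuses P9 with P7, C5 fuses P11 with P5, C6 fuses P13 with P3, and C7 fuses P15 with H1 and the upsampling-framework output):

    # Decoder sketch (Keras assumed), using the hypothetical helpers above.
    from tensorflow.keras import layers

    def decoder(h8, h1, p3, p5, p7, up_out):
        p9 = deconv_bn(h8, 256)                          # F1/P9: R/8 x L/8
        h9 = layers.Activation("relu")(
            layers.Concatenate()([p9, p7]))              # C4 -> H9: 512 maps
        h10 = conv_bn_relu(h9, 256)                      # J6/P10/H10
        p11 = deconv_bn(h10, 128)                        # F2/P11: R/4 x L/4
        h11 = layers.Activation("relu")(
            layers.Concatenate()([p11, p5]))             # C5 -> H11: 256 maps
        h12 = conv_bn_relu(h11, 128)                     # J7/P12/H12
        p13 = deconv_bn(h12, 64)                         # F3/P13: R/2 x L/2
        h13 = layers.Activation("relu")(
            layers.Concatenate()([p13, p3]))             # C6 -> H13: 128 maps
        h14 = conv_bn_relu(h13, 64)                      # J8/P14/H14
        p15 = deconv_bn(h14, 32)                         # F4/P15: R x L
        return layers.Concatenate()([p15, h1, up_out])   # C7: 96 maps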

For the upsampling framework, the input end of the first upsampling layer receives all feature maps in Z4, and the output end of the first upsampling layer outputs 512 feature maps, denoted Y1, each of width R/8 and height L/8. The tenth convolutional layer receives all feature maps in Y1 and outputs 256 feature maps, denoted J10, each of width R/8 and height L/8. The seventeenth batch normalization layer receives all feature maps in J10 and outputs 256 feature maps, denoted P17, each of width R/8 and height L/8. The seventeenth activation layer receives all feature maps in P17 and outputs 256 feature maps, denoted H17, each of width R/8 and height L/8. The second upsampling layer receives all feature maps in H17 and outputs 256 feature maps, denoted Y2, each of width R/4 and height L/4. The eleventh convolutional layer receives all feature maps in Y2 and outputs 128 feature maps, denoted J11, each of width R/4 and height L/4. The eighteenth batch normalization layer receives all feature maps in J11 and outputs 128 feature maps, denoted P18, each of width R/4 and height L/4. The eighteenth activation layer receives all feature maps in P18 and outputs 128 feature maps, denoted H18, each of width R/4 and height L/4. The third upsampling layer receives all feature maps in H18 and outputs 128 feature maps, denoted Y3, each of width R/2 and height L/2. The twelfth convolutional layer receives all feature maps in Y3 and outputs 64 feature maps, denoted J12, each of width R/2 and height L/2. The nineteenth batch normalization layer receives all feature maps in J12 and outputs 64 feature maps, denoted P19, each of width R/2 and height L/2. The nineteenth activation layer receives all feature maps in P19 and outputs 64 feature maps, denoted H19, each of width R/2 and height L/2. The fourth upsampling layer receives all feature maps in H19 and outputs 64 feature maps, denoted Y4, each of width R and height L. The thirteenth convolutional layer receives all feature maps in Y4 and outputs 32 feature maps, denoted J13, each of width R and height L. The twentieth batch normalization layer receives all feature maps in J13 and outputs 32 feature maps, denoted P20, each of width R and height L. The twentieth activation layer receives all feature maps in P20 and outputs 32 feature maps, denoted H20, each of width R and height L; H20 is the output of the upsampling framework.
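The upsampling framework restores resolution while reducing channels; a sketch under the same Keras assumption:

    # Upsampling-framework sketch (Keras assumed): four 2x upsamplings take
    # Z4 from R/16 x L/16 back to R x L, halving the kernel count each time.
    from tensorflow.keras import layers

    def upsampler(z4):
        x = z4
        for filters in (256, 128, 64, 32):               # J10..J13 kernel counts
            x = layers.UpSampling2D(2)(x)                # Y1..Y4
            x = conv_bn_relu(x, filters)                 # J/P/H triplets
        return x                                         # H20: 32 maps, R x L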

For the output layer, the input end of the fifteenth activation layer receives the output of the decoding framework, i.e., all feature maps in C7, and the output end of the fifteenth activation layer outputs 96 feature maps, denoted H15, each of width R and height L. The ninth convolutional layer receives all feature maps in H15 and outputs 1 feature map, denoted J9, where the feature map in J9 has width R and height L. The sixteenth batch normalization layer receives the feature map in J9 and outputs 1 feature map, denoted P16, where the feature map in P16 has width R and height L. The sixteenth activation layer receives the feature map in P16 and outputs 1 feature map, denoted H16, where the feature map in H16 has width R and height L; the feature map in H16 is the estimated depth image corresponding to the original input image.
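The output layer and the overall assembly can then be sketched as follows (still assuming Keras; the three-channel input and the build_model helper are illustrative assumptions, and in practice R and L must be divisible by 16 for the four 2×2 poolings to divide evenly):

    # Output layer and overall assembly (Keras assumed), wiring together the
    # hypothetical encoder/decoder/upsampler sketches above.
    from tensorflow.keras import Input, Model, layers

    def build_model(R, L):
        inp = Input((L, R, 3))                           # height x width x RGB
        h8, h1, p3, p5, p7, z4 = encoder(inp)            # encoding framework
        up_out = upsampler(z4)                           # upsampling framework
        c7 = decoder(h8, h1, p3, p5, p7, up_out)         # decoding framework
        h15 = layers.Activation("relu")(c7)              # fifteenth activation
        j9 = layers.Conv2D(1, 3, padding="same")(h15)    # single 3x3 kernel
        p16 = layers.BatchNormalization()(j9)
        h16 = layers.Activation("relu")(p16)             # estimated depth image
        return Model(inp, h16)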

Step 1_3: Take each original monocular image in the training set as the original input image and input it into the convolutional neural network for training, obtaining the estimated depth image corresponding to each original monocular image in the training set; denote the estimated depth image corresponding to {Qn(x,y)} as {D̂n(x,y)}, where D̂n(x,y) denotes the pixel value of the pixel whose coordinate position is (x,y) in {D̂n(x,y)}.

Step 1_4: Compute the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image; the loss function value between {D̂n(x,y)} and the corresponding real depth image {Dn(x,y)} is denoted Lossn and is obtained using the mean squared error function.
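Written out with the notation above, the per-image mean squared error takes its standard form (the patent states only that the mean squared error function is used; the normalization by R×L is the usual convention and is an assumption here):

    \mathrm{Loss}_n = \frac{1}{R \times L} \sum_{x=1}^{R} \sum_{y=1}^{L}
        \left( \hat{D}_n(x,y) - D_n(x,y) \right)^2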

Step 1_5: Repeat step 1_3 and step 1_4 a total of V times to obtain a trained convolutional neural network training model, yielding N×V loss function values in total; then find the smallest of the N×V loss function values; then take the weight vector and bias term corresponding to the smallest loss function value as the optimal weight vector and optimal bias term of the trained convolutional neural network training model, denoted Wbest and bbest respectively, where V>1; in this embodiment, V=20.
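In code, steps 1_3 through 1_5 amount to an ordinary supervised training loop; a sketch under the Keras assumption (the Adam optimizer, the file name, and the arrays train_images/train_depths are placeholders, and Keras's epoch-level best-loss checkpoint is used here only as an approximation of the per-image minimum selection described in step 1_5):

    # Training sketch (Keras assumed): V = 20 epochs with MSE loss, keeping
    # the weights with the smallest observed loss as W_best / b_best.
    from tensorflow.keras.callbacks import ModelCheckpoint

    model = build_model(R, L)
    model.compile(optimizer="adam", loss="mse")          # optimizer assumed
    ckpt = ModelCheckpoint("best_weights.h5", monitor="loss",
                           save_best_only=True, save_weights_only=True)
    model.fit(train_images, train_depths, epochs=20, callbacks=[ckpt])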

The specific steps of the testing stage are as follows:

Step 2_1: Let {Q(x',y')} denote the monocular image to be predicted, where 1≤x'≤R', 1≤y'≤L', R' denotes the width of {Q(x',y')}, L' denotes the height of {Q(x',y')}, and Q(x',y') denotes the pixel value of the pixel whose coordinate position is (x',y') in {Q(x',y')}.

Step 2_2: Input {Q(x',y')} into the trained convolutional neural network training model and use Wbest and bbest for prediction, obtaining the predicted depth image corresponding to {Q(x',y')}, denoted {Qdepth(x',y')}, where Qdepth(x',y') denotes the pixel value of the pixel whose coordinate position is (x',y') in {Qdepth(x',y')}.
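The testing stage is a single forward pass with the saved optimal weights; a sketch reusing the model object and file name assumed in the training sketch, where q stands for one test image array of shape (L', R', 3):

    # Inference sketch (Keras assumed): load W_best / b_best and predict the
    # depth image for one monocular test image q.
    import numpy as np

    model.load_weights("best_weights.h5")
    q_depth = model.predict(q[np.newaxis, ...])[0, :, :, 0]  # predicted depth map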

To verify the feasibility and effectiveness of the method of the present invention, experiments were carried out on the method.

Here, both the monocular images constituting the training set and the monocular images used for testing are provided by the official KITTI website, so the test data set provided by the official KITTI website is used directly to analyze and test the accuracy of the method of the present invention. Each monocular image in the test data set is input, as the monocular image to be predicted, into the trained deep convolutional neural network training model; the optimal weights Wbest obtained in the training stage are then loaded, and the corresponding predicted depth image is obtained.

Here, six objective parameters commonly used to evaluate monocular depth prediction are adopted as evaluation indicators, namely: root mean squared error (rms), logarithmic root mean squared error (log_rms), average log10 error (log10), and threshold accuracy (thr): δ1, δ2, δ3. Lower values of the root mean squared error, logarithmic root mean squared error, and average log10 error indicate that the predicted depth image is closer to the real depth image, while higher values of δ1, δ2, and δ3 indicate higher accuracy of the predicted depth image. The root mean squared error, logarithmic root mean squared error, average log10 error, and δ1, δ2, δ3 results reflecting the performance of the method of the present invention are listed in Table 1. From the data listed in Table 1, the difference between the predicted depth images obtained by the method of the present invention and the real depth images is very small, which shows that the prediction results of the method are highly accurate and demonstrates the feasibility and effectiveness of the method of the present invention.
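The patent gives no formulas for the six indicators; the definitions below are the ones commonly used for monocular depth evaluation on KITTI (the thresholds 1.25, 1.25², and 1.25³ for δ1, δ2, and δ3 are the usual convention and are an assumption here):

    # Standard depth-evaluation metrics (numpy sketch); pred and gt are
    # same-shape arrays of positive depth values.
    import numpy as np

    def depth_metrics(pred, gt):
        rms = np.sqrt(np.mean((pred - gt) ** 2))                  # rms
        log_rms = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
        log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))    # average log10
        ratio = np.maximum(pred / gt, gt / pred)
        d1, d2, d3 = (np.mean(ratio < 1.25 ** k) for k in (1, 2, 3))
        return rms, log_rms, log10, d1, d2, d3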

Table 1: Evaluation indicators comparing the predicted depth images obtained with the method of the present invention against the real depth images

Claims (2)

1.一种单目视觉深度估计方法,其特征在于包括训练阶段和测试阶段两个过程;1. a monocular vision depth estimation method is characterized in that comprising two processes of training stage and testing stage; 所述的训练阶段过程的具体步骤为:The specific steps of the training phase process are: 步骤1_1:选取N幅原始的单目图像及每幅原始的单目图像对应的真实深度图像,并构成训练集,将训练集中的第n幅原始的单目图像记为{Qn(x,y)},将训练集中与{Qn(x,y)}对应的真实深度图像记为其中,N为正整数,N≥100,n为正整数,1≤n≤N,1≤x≤R,1≤y≤L,R表示{Qn(x,y)}和的宽度,L表示{Qn(x,y)}和的高度,R和L均能被2整除,Qn(x,y)表示{Qn(x,y)}中坐标位置为(x,y)的像素点的像素值,表示中坐标位置为(x,y)的像素点的像素值;Step 1_1: Select N original monocular images and the real depth image corresponding to each original monocular image, and form a training set, and record the nth original monocular image in the training set as {Q n (x, y)}, denote the real depth image corresponding to {Q n (x, y)} in the training set as Among them, N is a positive integer, N≥100, n is a positive integer, 1≤n≤N, 1≤x≤R, 1≤y≤L, R represents {Q n (x, y)} and The width of , L represents {Q n (x, y)} and The height of , R and L are both divisible by 2, Q n (x, y) represents the pixel value of the pixel at the coordinate position (x, y) in {Q n (x, y)}, express The pixel value of the pixel whose middle coordinate position is (x, y); 步骤1_2:构建端到端的卷积神经网络:卷积神经网络包括输入层、隐层和输出层;隐层包括编码框架、译码框架和上采样框架;Step 1_2: Build an end-to-end convolutional neural network: the convolutional neural network includes an input layer, a hidden layer, and an output layer; the hidden layer includes an encoding frame, a decoding frame, and an upsampling frame; 对于输入层,输入层的输入端接收一幅原始输入图像,输入层的输出端输出原始输入图像给隐层;其中,要求输入层的输入端接收的原始输入图像的宽度为R、高度为L;For the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; among them, the width of the original input image received by the input end of the input layer is required to be R and the height is L ; 对于编码框架,其由依次设置的第一卷积层、第一批规范化层、第一激活层、第一最大池化层、第二卷积层、第二批规范化层、第二激活层、第三卷积层、第三批规范化层、第一Concatenate融合层、第三激活层、第二最大池化层、第四卷积层、第四批规范化层、第四激活层、第五卷积层、第五批规范化层、第二Concatenate融合层、第五激活层、第三最大池化层、第一带孔卷积层、第六批规范化层、第六激活层、第二带孔卷积层、第七批规范化层、第三Concatenate融合层、第七激活层、第四最大池化层、第三带孔卷积层、第八批规范化层、第八激活层组成;对于译码框架,其由依次设置的第一反卷积层、第九批规范化层、第四Concatenate融合层、第九激活层、第六卷积层、第十批规范化层、第十激活层、第二反卷积层、第十一批规范化层、第五Concatenate融合层、第十一激活层、第七卷积层、第十二批规范化层、第十二激活层、第三反卷积层、第十三批规范化层、第六Concatenate融合层、第十三激活层、第八卷积层、第十四批规范化层、第十四激活层、第四反卷积层、第十五批规范化层、第七Concatenate融合层组成;对于上采样框架,其由依次设置的第一上采样层、第十卷积层、第十七批规范化层、第十七激活层、第二上采样层、第十一卷积层、第十八批规范化层、第十八激活层、第三上采样层、第十二卷积层、第十九批规范化层、第十九激活层、第四上采样层、第十三卷积层、第二十批规范化层、第二十激活层组成;对于输出层,其由依次设置的第十五激活层、第九卷积层、第十六批规范化层、第十六激活层组成,其中,第一卷积层至第十三卷积层、第一带孔卷积层至第三带孔卷积层、第一反卷积层至第四反卷积层各自的卷积核大小为3×3,第一卷积层的卷积核个数为32、第二卷积层和第三卷积层的卷积核个数为64、第四卷积层和第五卷积层的卷积核个数为128、第一带孔卷积层和第二带孔卷积层的卷积核个数为256、第三带孔卷积层的卷积核个数为512、第一反卷积层和第六卷积层的卷积核个数为256、第二反卷积层和第七卷积层的卷积核个数为128、第三反卷积层和第八卷积层的卷积核个数为64、第四反卷积层的卷积核个数为32、第九卷积层的卷积核个数为1、第十卷积层的卷积核个数为256、第十一卷积层的卷积核个数为128、第十二卷积层的卷积核个数为64、第十三卷积层的卷积核个数为32,第一卷积层至第十三卷积层、第一带孔卷积层至第三带孔卷积层各自的卷积步长采用默认值,第一反卷积层至第四反卷积层各自的卷积步长为2×2,第一批规范化层至第二十批规范化层的参数采用默认值,第一激活层至第二十激活层的激活函数采用ReLu,第一最大池化层至第四最大池化层的池化步长为2×2,第一上采样层至第四上采样层的采样步长为2×2;For the encoding framework, it consists of the first convolutional layer, the first normalization layer, the first activation layer, the first max pooling layer, the second convolutional layer, the second normalization layer, the second activation layer, 3rd Convolutional Layer, 3rd Batch Normalization Layer, 1st Concatenate Fusion Layer, 3rd Activation Layer, 2nd Max Pooling Layer, 4th Convolutional Layer, 4th Batch Normalization Layer, 4th Activation Layer, 5th Volume Convolution layer, fifth batch 
normalization layer, second concatenate fusion layer, fifth activation layer, third max pooling layer, first convolutional layer with holes, sixth batch normalization layer, sixth activation layer, second hole Convolutional layer, seventh batch of normalization layer, third Concatenate fusion layer, seventh activation layer, fourth maximum pooling layer, third convolutional layer with holes, eighth batch of normalization layer, eighth activation layer; The code frame is composed of the first deconvolution layer, the ninth batch of normalization layers, the fourth Concatenate fusion layer, the ninth activation layer, the sixth convolution layer, the tenth batch of normalization layers, the tenth activation layer, the fourth batch of The second deconvolution layer, the eleventh normalization layer, the fifth concatenate fusion layer, the eleventh activation layer, the seventh convolution layer, the twelfth normalization layer, the twelfth activation layer, the third deconvolution layer , the thirteenth batch of normalization layer, the sixth Concatenate fusion layer, the thirteenth activation layer, the eighth convolution layer, the fourteenth batch of normalization layer, the fourteenth activation layer, the fourth deconvolution layer, the fifteenth batch The normalization layer and the seventh Concatenate fusion layer are composed; for the upsampling framework, it consists of the first upsampling layer, the tenth convolutional layer, the seventeenth batch normalization layer, the seventeenth activation layer, and the second upsampling layer. , the eleventh convolutional layer, the eighteenth batch of normalization layers, the eighteenth activation layer, the third upsampling layer, the twelfth convolutional layer, the nineteenth batch of normalization layers, the nineteenth activation layer, the fourth upper The sampling layer, the thirteenth convolutional layer, the twentieth batch of normalization layers, and the twentieth activation layer are composed; for the output layer, it consists of the fifteenth activation layer, the ninth convolutional layer, and the sixteenth batch of normalization set in sequence. layer and the sixteenth activation layer, among which, the first convolutional layer to the thirteenth convolutional layer, the first convolutional layer with holes to the third convolutional layer with holes, the first deconvolution layer to the fourth inverse convolutional layer The convolution kernel size of each convolution layer is 3 × 3, the number of convolution kernels of the first convolution layer is 32, the number of convolution kernels of the second convolution layer and the third convolution layer is 64, and the number of convolution kernels of the fourth convolution layer is 64. 
The number of convolution kernels of the convolutional layer and the fifth convolutional layer is 128, the number of convolution kernels of the first convolutional layer and the second convolutional layer with holes is 256, and the number of convolutional kernels of the third convolutional layer with holes The number of convolution kernels is 512, the number of convolution kernels of the first deconvolution layer and the sixth convolution layer is 256, the number of convolution kernels of the second deconvolution layer and the seventh convolution layer is 128, The number of convolution kernels in the third deconvolution layer and the eighth convolution layer is 64, the number of convolution kernels in the fourth deconvolution layer is 32, and the number of convolution kernels in the ninth convolution layer is 1. The number of convolution kernels in the tenth convolution layer is 256, the number of convolution kernels in the eleventh convolution layer is 128, the number of convolution kernels in the twelfth convolution layer is 64, and the number of convolution kernels in the thirteenth convolution layer is 64. The number of convolution kernels is 32, the first The convolution strides from the first convolutional layer to the thirteenth convolutional layer, the first convolutional convolutional layer to the third convolutional convolutional layer with holes adopt the default values, and the first deconvolutional layer to the fourth deconvolutional layer The respective convolution step size is 2×2, the parameters of the first batch of normalization layers to the twentieth batch of normalization layers use default values, the activation functions of the first activation layer to the twentieth activation layer use ReLu, and the first maximum pooling is used. 
The pooling step size from the layer to the fourth maximum pooling layer is 2×2, and the sampling step size from the first upsampling layer to the fourth upsampling layer is 2×2; 对于编码框架,第一卷积层的输入端接收输入层的输出端输出的原始输入图像,第一卷积层的输出端输出32幅特征图,将输出的所有特征图构成的集合记为J1,其中,J1中的每幅特征图的宽度为R、高度为L;第一批规范化层的输入端接收J1中的所有特征图,第一批规范化层的输出端输出32幅特征图,将输出的所有特征图构成的集合记为P1,其中,P1中的每幅特征图的宽度为R、高度为L;第一激活层的输入端接收P1中的所有特征图,第一激活层的输出端输出32幅特征图,将输出的所有特征图构成的集合记为H1,其中,H1中的每幅特征图的宽度为R、高度为L;第一最大池化层的输入端接收H1中的所有特征图,第一最大池化层的输出端输出32幅特征图,将输出的所有特征图构成的集合记为Z1,其中,Z1中的每幅特征图的宽度为高度为第二卷积层的输入端接收Z1中的所有特征图,第二卷积层的输出端输出64幅特征图,将输出的所有特征图构成的集合记为J2,其中,J2中的每幅特征图的宽度为高度为第二批规范化层的输入端接收J2中的所有特征图,第二批规范化层的输出端输出64幅特征图,将输出的所有特征图构成的集合记为P2,其中,P2中的每幅特征图的宽度为高度为第二激活层的输入端接收P2中的所有特征图,第二激活层的输出端输出64幅特征图,将输出的所有特征图构成的集合记为H2,其中,H2中的每幅特征图的宽度为高度为第三卷积层的输入端接收H2中的所有特征图,第三卷积层的输出端输出64幅特征图,将输出的所有特征图构成的集合记为J3,其中,J3中的每幅特征图的宽度为高度为第三批规范化层的输入端接收J3中的所有特征图,第三批规范化层的输出端输出64幅特征图,将输出的所有特征图构成的集合记为P3,其中,P3中的每幅特征图的宽度为高度为第一Concatenate融合层的输入端接收P3中的所有特征图和H2中的所有特征图,第一Concatenate融合层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为C1,其中,C1中的每幅特征图的宽度为高度为第三激活层的输入端接收C1中的所有特征图,第三激活层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为H3,其中,H3中的每幅特征图的宽度为高度为第二最大池化层的输入端接收H3中的所有特征图,第二最大池化层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为Z2,其中,Z2中的每幅特征图的宽度为高度为第四卷积层的输入端接收Z2中的所有特征图,第四卷积层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为J4,其中,J4中的每幅特征图的宽度为高度为第四批规范化层的输入端接收J4中的所有特征图,第四批规范化层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为P4,其中,P4中的每幅特征图的宽度为高度为第四激活层的输入端接收P4中的所有特征图,第四激活层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为H4,其中,H4中的每幅特征图的宽度为高度为第五卷积层的输入端接收H4中的所有特征图,第五卷积层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为J5,其中,J5中的每幅特征图的宽度为高度为第五批规范化层的输入端接收J5中的所有特征图,第五批规范化层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为P5,其中,P5中的每幅特征图的宽度为高度为第二Concatenate融合层的输入端接收P5中的所有特征图和H4中的所有特征图,第二Concatenate融合层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为C2,其中,C2中的每幅特征图的宽度为高度为第五激活层的输入端接收C2中的所有特征图,第五激活层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为H5,其中,H5中的每幅特征图的宽度为高度为第三最大池化层的输入端接收H5中的所有特征图,第三最大池化层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为Z3,其中,Z3中的每幅特征图的宽度为高度为第一带孔卷积层的输入端接收Z3中的所有特征图,第一带孔卷积层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为K1,其中,K1中的每幅特征图的宽度为高度为第六批规范化层的输入端接收K1中的所有特征图,第六批规范化层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为P6,其中,P6中的每幅特征图的宽度为高度为第六激活层的输入端接收P6中的所有特征图,第六激活层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为H6,其中,H6中的每幅特征图的宽度为高度为第二带孔卷积层的输入端接收H6中的所有特征图,第二带孔卷积层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为K2,其中,K2中的每幅特征图的宽度为高度为第七批规范化层的输入端接收K2中的所有特征图,第七批规范化层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为P7,其中,P7中的每幅特征图的宽度为高度为第三Concatenate融合层的输入端接收P7中的所有特征图和H6中的所有特征图,第三Concatenate融合层的输出端输出512幅特征图,将输出的所有特征图构成的集合记为C3,其中,C3中的每幅特征图的宽度为高度为第七激活层的输入端接收C3中的所有特征图,第七激活层的输出端输出512幅特征图,将输出的所有特征图构成的集合记为H7,其中,H7中的每幅特征图的宽度为高度为第四最大池化层的输入端接收H7中的所有特征图,第四最大池化层的输出端输出512幅特征图,将输出的所有特征图构成的集合记为Z4,其中,Z4中的每幅特征图的宽度为高度为第三带孔卷积层的输入端接收Z4中的所有特征图,第三带孔卷积层的输出端输出512幅特征图,将输出的所有特征图构成的集合记为K3,其中,K3中的每幅特征图的宽度为高度为第八批规范化层的输入端接收K3中的所有特征图,第八批规范化层的输出端输出512幅特征图,将输出的所有特征图构成的集合记为P8,其中,P8中的每幅特征图的宽度为高度为第八激活层的输入端接收P8中的所有特征图,第八激活层的输出端输出512幅特征图,将输出的所有特征图构成的集合记为H8,H8也即为编码框架的输出,其中,H8中的每幅特征图的宽度为高度为 For the coding framework, the input of the first convolutional layer receives the original input image output by the output of the input layer, the output of the first convolutional layer outputs 32 feature maps, and the set of all the output feature maps is denoted as J 1 , where the width of each feature map in J 1 is R and the height is L; the input end of the first batch of normalization layers receives all the feature maps in J 1 , and the output end of the first batch of normalization layers outputs 32 features The set of all output feature maps is denoted as P 1 , wherein the width of each feature map in P 1 is R and the height is L; the input end of the first activation layer receives all feature maps in P 1 , the output end of the first activation 
layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as H 1 , wherein the width of each feature map in H 1 is R and the height is L; the first largest The input end of the pooling layer receives all the feature maps in H 1 , the output end of the first maximum pooling layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as Z 1 , wherein, in Z 1 The width of each feature map is height is The input of the second convolutional layer receives all the feature maps in Z 1 , the output of the second convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J 2 , where in J 2 The width of each feature map is height is The input end of the second batch of normalization layers receives all feature maps in J 2 , and the output end of the second batch of normalization layers outputs 64 feature maps. The width of each feature map is height is The input terminal of the second activation layer receives all feature maps in P 2 , and the output terminal of the second activation layer outputs 64 feature maps. The width of the feature map is height is The input end of the third convolutional layer receives all the feature maps in H 2 , and the output end of the third convolution layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J 3 , where in The width of each feature map is height is The input end of the third batch of normalization layers receives all feature maps in J 3 , the output end of the third batch of normalization layers outputs 64 feature maps, and the set formed by all the output feature maps is denoted as P 3 , where in P 3 The width of each feature map is height is The input of the first Concatenate fusion layer receives all the feature maps in P3 and all the feature maps in H2 , and the output of the first Concatenate fusion layer outputs 128 feature maps, and the set composed of all the output feature maps is recorded as C 1 , where the width of each feature map in C 1 is height is The input terminal of the third activation layer receives all feature maps in C 1 , and the output terminal of the third activation layer outputs 128 feature maps. The width of the feature map is height is The input of the second maximum pooling layer receives all the feature maps in H 3 , the output of the second maximum pooling layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as Z 2 , where Z The width of each feature map in 2 is height is The input end of the fourth convolutional layer receives all the feature maps in Z 2 , and the output end of the fourth convolution layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J 4 , where in J 4 The width of each feature map is height is The input terminal of the fourth batch of normalization layers receives all feature maps in J 4 , and the output terminal of the fourth batch of normalization layers outputs 128 feature maps. The width of each feature map is height is The input end of the fourth activation layer receives all the feature maps in P 4 , and the output end of the fourth activation layer outputs 128 feature maps. 
The width of the feature map is height is The input end of the fifth convolutional layer receives all the feature maps in H 4 , and the output end of the fifth convolution layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J 5 , where in J 5 The width of each feature map is height is The input end of the fifth batch of normalization layers receives all feature maps in J 5 , and the output end of the fifth batch of normalization layers outputs 128 feature maps, and the set formed by all the output feature maps is denoted as P 5 , where in P 5 The width of each feature map is height is The input end of the second Concatenate fusion layer receives all the feature maps in P 5 and all the feature maps in H 4 , and the output end of the second Concatenate fusion layer outputs 256 feature maps, and the set formed by all the output feature maps is recorded as C 2 , where the width of each feature map in C 2 is height is The input end of the fifth activation layer receives all the feature maps in C 2 , and the output end of the fifth activation layer outputs 256 feature maps. The width of the feature map is height is The input terminal of the third maximum pooling layer receives all the feature maps in H 5 , the output terminal of the third maximum pooling layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as Z 3 , where Z 3 The width of each feature map in 3 is height is The input of the first atrous convolutional layer receives all feature maps in Z 3 , the output of the first atrous convolutional layer outputs 256 feature maps, and the set of all the output feature maps is denoted as K 1 , where , the width of each feature map in K1 is height is The input end of the sixth batch of normalization layers receives all the feature maps in K 1 , the output end of the sixth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P 6 , where in P 6 The width of each feature map is height is The input end of the sixth activation layer receives all the feature maps in P 6 , and the output end of the sixth activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H 6 , where each of the feature maps in H 6 The width of the feature map is height is The input end of the second atrous convolutional layer receives all the feature maps in H 6 , the output end of the second atrous convolutional layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as K 2 , where , the width of each feature map in K2 is height is The input end of the seventh batch of normalization layers receives all the feature maps in K 2 , the output end of the seventh batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P 7 , where in P 7 The width of each feature map is height is The input end of the third Concatenate fusion layer receives all the feature maps in P 7 and all the feature maps in H 6 , and the output end of the third Concatenate fusion layer outputs 512 feature maps, and the set composed of all the output feature maps is denoted as C 3 , where the width of each feature map in C 3 is height is The input terminal of the seventh activation layer receives all the feature maps in C 3 , and the output terminal of the seventh activation layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as H 
7 . The width of the feature map is height is The input end of the fourth maximum pooling layer receives all the feature maps in H 7 , and the output end of the fourth maximum pooling layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as Z 4 , where Z 4 The width of each feature map in 4 is height is The input of the third atrous convolutional layer receives all the feature maps in Z 4 , the output of the third atrous convolutional layer outputs 512 feature maps, and the set of all the output feature maps is denoted as K 3 , where , the width of each feature map in K3 is height is The input end of the eighth batch of normalization layers receives all the feature maps in K 3 , the output end of the eighth batch of normalization layers outputs 512 feature maps, and the set formed by all the output feature maps is denoted as P 8 , where in P 8 The width of each feature map is height is The input end of the eighth activation layer receives all the feature maps in P8, and the output end of the eighth activation layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as H 8 , and H 8 is also the coding frame. The output of , where the width of each feature map in H8 is height is 对于译码框架,第一反卷积层的输入端接收编码框架的输出即H8中的所有特征图,第一反卷积层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为F1,其中,F1中的每幅特征图的宽度为高度为第九批规范化层的输入端接收F1中的所有特征图,第九批规范化层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为P9,其中,P9中的每幅特征图的宽度为高度为第四Concatenate融合层的输入端接收P9中的所有特征图和P7中的所有特征图,第四Concatenate融合层的输出端输出512幅特征图,将输出的所有特征图构成的集合记为C4,其中,C4中的每幅特征图的宽度为高度为第九激活层的输入端接收C4中的所有特征图,第九激活层的输出端输出512幅特征图,将输出的所有特征图构成的集合记为H9,其中,H9中的每幅特征图的宽度为高度为第六卷积层的输入端接收H9中的所有特征图,第六卷积层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为J6,其中,J6中的每幅特征图的宽度为高度为第十批规范化层的输入端接收J6中的所有特征图,第十批规范化层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为P10,其中,P10中的每幅特征图的宽度为高度为第十激活层的输入端接收P10中的所有特征图,第十激活层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为H10,其中,H10中的每幅特征图的宽度为高度为第二反卷积层的输入端接收编码框架的输出即H10中的所有特征图,第二反卷积层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为F2,其中,F2中的每幅特征图的宽度为高度为第十一批规范化层的输入端接收F2中的所有特征图,第十一批规范化层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为P11,其中,P11中的每幅特征图的宽度为高度为第五Concatenate融合层的输入端接收P11中的所有特征图和P5中的所有特征图,第五Concatenate融合层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为C5,其中,C5中的每幅特征图的宽度为高度为第十一激活层的输入端接收C5中的所有特征图,第十一激活层的输出端输出256幅特征图,将输出的所有特征图构成的集合记为H11,其中,H11中的每幅特征图的宽度为高度为第七卷积层的输入端接收H11中的所有特征图,第七卷积层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为J7,其中,J7中的每幅特征图的宽度为高度为第十二批规范化层的输入端接收J7中的所有特征图,第十二批规范化层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为P12,其中,P12中的每幅特征图的宽度为高度为第十二激活层的输入端接收P12中的所有特征图,第十二激活层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为H12,其中,H12中的每幅特征图的宽度为高度为第三反卷积层的输入端接收H12中的所有特征图,第三反卷积层的输出端输出64幅特征图,将输出的所有特征图构成的集合记为F3,其中,F3中的每幅特征图的宽度为高度为第十三批规范化层的输入端接收F3中的所有特征图,第十三批规范化层的输出端输出64幅特征图,将输出的所有特征图构成的集合记为P13,其中,P13中的每幅特征图的宽度为高度为第六Concatenate融合层的输入端接收P13中的所有特征图和P3中的所有特征图,第六Concatenate融合层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为C6,其中,C6中的每幅特征图的宽度为高度为第十三激活层的输入端接收C6中的所有特征图,第十三激活层的输出端输出128幅特征图,将输出的所有特征图构成的集合记为H13,其中,H13中的每幅特征图的宽度为高度为第八卷积层的输入端接收H13中的所有特征图,第八卷积层的输出端输出64幅特征图,将输出的所有特征图构成的集合记为J8,其中,J8中的每幅特征图的宽度为高度为第十四批规范化层的输入端接收J8中的所有特征图,第十四批规范化层的输出端输出64幅特征图,将输出的所有特征图构成的集合记为P14,其中,P14中的每幅特征图的宽度为高度为第十四激活层的输入端接收P14中的所有特征图,第十四激活层的输出端输出64幅特征图,将输出的所有特征图构成的集合记为H14,其中,H14中的每幅特征图的宽度为高度为第四反卷积层的输入端接收H14中的所有特征图,第四反卷积层的输出端输出32幅特征图,将输出的所有特征图构成的集合记为F4,其中,F4中的每幅特征图的宽度为R、高度为L;第十五批规范化层的输入端接收F4中的所有特征图,第十五批规范化层的输出端输出32幅特征图,将输出的所有特征图构成的集合记为P15,其中,P15中的每幅特征图的宽度为R、高度为L;第七Concatenate融合层的输入端接收P15中的所有特征图、H1中的所有特征图、上采样框架的输出,第七Concatenate融合层的输出端输出96幅特征图,将输出的所有特征图构成的集合记为C7,其中,C7中的每幅特征图的宽度为R、高度为L;For the decoding framework, the input end of the first deconvolution layer receives the output of the encoding framework, that is, all feature maps in H8 , and 
For the decoding framework, the input end of the first deconvolution layer receives the output of the encoding framework, that is, all the feature maps in H8, and the output end of the first deconvolution layer outputs 256 feature maps; the set of all the output feature maps is denoted as F1, where the width of each feature map in F1 is R/8 and the height is L/8. The input end of the ninth batch normalization layer receives all the feature maps in F1, and its output end outputs 256 feature maps; the set of all the output feature maps is denoted as P9, where the width of each feature map in P9 is R/8 and the height is L/8. The input end of the fourth Concatenate fusion layer receives all the feature maps in P9 and all the feature maps in P7, and its output end outputs 512 feature maps; the set of all the output feature maps is denoted as C4, where the width of each feature map in C4 is R/8 and the height is L/8. The input end of the ninth activation layer receives all the feature maps in C4, and its output end outputs 512 feature maps; the set of all the output feature maps is denoted as H9, where the width of each feature map in H9 is R/8 and the height is L/8. The input end of the sixth convolution layer receives all the feature maps in H9, and its output end outputs 256 feature maps; the set of all the output feature maps is denoted as J6, where the width of each feature map in J6 is R/8 and the height is L/8. The input end of the tenth batch normalization layer receives all the feature maps in J6, and its output end outputs 256 feature maps; the set of all the output feature maps is denoted as P10, where the width of each feature map in P10 is R/8 and the height is L/8. The input end of the tenth activation layer receives all the feature maps in P10, and its output end outputs 256 feature maps; the set of all the output feature maps is denoted as H10, where the width of each feature map in H10 is R/8 and the height is L/8. The input end of the second deconvolution layer receives all the feature maps in H10, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as F2, where the width of each feature map in F2 is R/4 and the height is L/4. The input end of the eleventh batch normalization layer receives all the feature maps in F2, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as P11, where the width of each feature map in P11 is R/4 and the height is L/4. The input end of the fifth Concatenate fusion layer receives all the feature maps in P11 and all the feature maps in P5, and its output end outputs 256 feature maps; the set of all the output feature maps is denoted as C5, where the width of each feature map in C5 is R/4 and the height is L/4. The input end of the eleventh activation layer receives all the feature maps in C5, and its output end outputs 256 feature maps; the set of all the output feature maps is denoted as H11, where the width of each feature map in H11 is R/4 and the height is L/4. The input end of the seventh convolution layer receives all the feature maps in H11, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as J7, where the width of each feature map in J7 is R/4 and the height is L/4. The input end of the twelfth batch normalization layer receives all the feature maps in J7, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as P12, where the width of each feature map in P12 is R/4 and the height is L/4. The input end of the twelfth activation layer receives all the feature maps in P12, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as H12, where the width of each feature map in H12 is R/4 and the height is L/4.
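Illustrative sketch (not claim language): each decoder block repeats the same pattern — deconvolution, batch normalization, Concatenate fusion with an encoder skip, activation, convolution, batch normalization, activation. The kernel sizes, the ReLU choice, and the 256-channel skip (chosen so that the fused C4 has the stated 512 maps) are assumptions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: deconv -> BN -> concat skip -> ReLU -> conv -> BN -> ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # doubles width/height
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv = nn.Conv2d(out_ch * 2, out_ch, kernel_size=3, padding=1)       # halves channels after fusion
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, skip):
        x = self.bn1(self.deconv(x))                # e.g. H8 -> F1 -> P9
        x = self.act(torch.cat([x, skip], dim=1))   # e.g. cat(P9, P7) -> C4 -> H9
        return self.act(self.bn2(self.conv(x)))     # e.g. H9 -> J6 -> P10 -> H10

# First decoder block: 512 -> 256 maps, fused with a 256-map skip standing in for P7.
block1 = DecoderBlock(512, 256)
h8, p7 = torch.randn(1, 512, 20, 15), torch.randn(1, 256, 40, 30)
h10 = block1(h8, p7)   # -> (1, 256, 40, 30), matching H10 at R/8 x L/8
```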
The input end of the third deconvolution layer receives all the feature maps in H12, and its output end outputs 64 feature maps; the set of all the output feature maps is denoted as F3, where the width of each feature map in F3 is R/2 and the height is L/2. The input end of the thirteenth batch normalization layer receives all the feature maps in F3, and its output end outputs 64 feature maps; the set of all the output feature maps is denoted as P13, where the width of each feature map in P13 is R/2 and the height is L/2. The input end of the sixth Concatenate fusion layer receives all the feature maps in P13 and all the feature maps in P3, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as C6, where the width of each feature map in C6 is R/2 and the height is L/2. The input end of the thirteenth activation layer receives all the feature maps in C6, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as H13, where the width of each feature map in H13 is R/2 and the height is L/2. The input end of the eighth convolution layer receives all the feature maps in H13, and its output end outputs 64 feature maps; the set of all the output feature maps is denoted as J8, where the width of each feature map in J8 is R/2 and the height is L/2. The input end of the fourteenth batch normalization layer receives all the feature maps in J8, and its output end outputs 64 feature maps; the set of all the output feature maps is denoted as P14, where the width of each feature map in P14 is R/2 and the height is L/2. The input end of the fourteenth activation layer receives all the feature maps in P14, and its output end outputs 64 feature maps; the set of all the output feature maps is denoted as H14, where the width of each feature map in H14 is R/2 and the height is L/2. The input end of the fourth deconvolution layer receives all the feature maps in H14, and its output end outputs 32 feature maps; the set of all the output feature maps is denoted as F4, where the width of each feature map in F4 is R and the height is L. The input end of the fifteenth batch normalization layer receives all the feature maps in F4, and its output end outputs 32 feature maps; the set of all the output feature maps is denoted as P15, where the width of each feature map in P15 is R and the height is L. The input end of the seventh Concatenate fusion layer receives all the feature maps in P15, all the feature maps in H1, and the output of the up-sampling framework, and its output end outputs 96 feature maps; the set of all the output feature maps is denoted as C7, where the width of each feature map in C7 is R and the height is L.
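Illustrative sketch (not claim language): the 96 maps of C7 are simply the channel-wise concatenation of three full-resolution inputs of 32 maps each — P15, H1, and the up-sampling framework's output H20 described below. That H1 and H20 each carry 32 maps is an assumption consistent with the stated 96-map total.

```python
import torch

# Three full-resolution inputs of 32 maps each: 32 + 32 + 32 = 96.
p15 = torch.randn(1, 32, 320, 240)   # toy R x L size
h1  = torch.randn(1, 32, 320, 240)
h20 = torch.randn(1, 32, 320, 240)   # up-sampling framework output (see below)
c7 = torch.cat([p15, h1, h20], dim=1)   # -> (1, 96, 320, 240): 96 maps of size R x L
```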
For the up-sampling framework, the input end of the first up-sampling layer receives all the feature maps in Z4, and its output end outputs 512 feature maps; the set of all the output feature maps is denoted as Y1, where the width of each feature map in Y1 is R/8 and the height is L/8. The input end of the tenth convolution layer receives all the feature maps in Y1, and its output end outputs 256 feature maps; the set of all the output feature maps is denoted as J10, where the width of each feature map in J10 is R/8 and the height is L/8. The input end of the seventeenth batch normalization layer receives all the feature maps in J10, and its output end outputs 256 feature maps; the set of all the output feature maps is denoted as P17, where the width of each feature map in P17 is R/8 and the height is L/8. The input end of the seventeenth activation layer receives all the feature maps in P17, and its output end outputs 256 feature maps; the set of all the output feature maps is denoted as H17, where the width of each feature map in H17 is R/8 and the height is L/8. The input end of the second up-sampling layer receives all the feature maps in H17, and its output end outputs 256 feature maps; the set of all the output feature maps is denoted as Y2, where the width of each feature map in Y2 is R/4 and the height is L/4. The input end of the eleventh convolution layer receives all the feature maps in Y2, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as J11, where the width of each feature map in J11 is R/4 and the height is L/4. The input end of the eighteenth batch normalization layer receives all the feature maps in J11, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as P18, where the width of each feature map in P18 is R/4 and the height is L/4. The input end of the eighteenth activation layer receives all the feature maps in P18, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as H18, where the width of each feature map in H18 is R/4 and the height is L/4. The input end of the third up-sampling layer receives all the feature maps in H18, and its output end outputs 128 feature maps; the set of all the output feature maps is denoted as Y3, where the width of each feature map in Y3 is R/2 and the height is L/2. The input end of the twelfth convolution layer receives all the feature maps in Y3, and its output end outputs 64 feature maps; the set of all the output feature maps is denoted as J12, where the width of each feature map in J12 is R/2 and the height is L/2. The input end of the nineteenth batch normalization layer receives all the feature maps in J12, and its output end outputs 64 feature maps; the set of all the output feature maps is denoted as P19, where the width of each feature map in P19 is R/2 and the height is L/2. The input end of the nineteenth activation layer receives all the feature maps in P19, and its output end outputs 64 feature maps; the set of all the output feature maps is denoted as H19, where the width of each feature map in H19 is R/2 and the height is L/2. The input end of the fourth up-sampling layer receives all the feature maps in H19, and its output end outputs 64 feature maps; the set of all the output feature maps is denoted as Y4, where the width of each feature map in Y4 is R and the height is L. The input end of the thirteenth convolution layer receives all the feature maps in Y4, and its output end outputs 32 feature maps; the set of all the output feature maps is denoted as J13, where the width of each feature map in J13 is R and the height is L. The input end of the twentieth batch normalization layer receives all the feature maps in J13, and its output end outputs 32 feature maps; the set of all the output feature maps is denoted as P20, where the width of each feature map in P20 is R and the height is L. The input end of the twentieth activation layer receives all the feature maps in P20, and its output end outputs 32 feature maps; the set of all the output feature maps is denoted as H20, where the width of each feature map in H20 is R and the height is L; H20 is the output of the up-sampling framework received by the seventh Concatenate fusion layer.
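Illustrative sketch (not claim language): one stage of this up-sampling branch is an up-sampling layer followed by a convolution, batch normalization, and activation. Bilinear interpolation, the 3×3 kernel, and ReLU are assumptions; only the layer order, map counts, and sizes come from the text.

```python
import torch
import torch.nn as nn

def upsample_stage(in_ch, out_ch):
    """One up-sampling stage: 2x upsample -> conv -> BN -> ReLU."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# First stage: Z4 (512 maps at R/16 x L/16) -> Y1 -> J10 -> P17 -> H17 (256 maps at R/8 x L/8).
stage1 = upsample_stage(512, 256)
z4 = torch.randn(1, 512, 20, 15)
h17 = stage1(z4)   # -> (1, 256, 40, 30)
```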
For the output layer, the input end of the fifteenth activation layer receives the output of the decoding framework, that is, all the feature maps in C7, and its output end outputs 96 feature maps; the set of all the output feature maps is denoted as H15, where the width of each feature map in H15 is R and the height is L. The input end of the ninth convolution layer receives all the feature maps in H15, and its output end outputs 1 feature map; the set formed by the output feature map is denoted as J9, where the width of the feature map in J9 is R and the height is L. The input end of the sixteenth batch normalization layer receives the feature map in J9, and its output end outputs 1 feature map; the set formed by the output feature map is denoted as P16, where the width of the feature map in P16 is R and the height is L. The input end of the sixteenth activation layer receives the feature map in P16, and its output end outputs 1 feature map; the set formed by the output feature map is denoted as H16, where the width of the feature map in H16 is R and the height is L. The feature map in H16 is the estimated depth image corresponding to the original input image.
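Illustrative sketch (not claim language): the output head collapses the 96 fused maps to a single R×L depth map. The 3×3 kernel and ReLU activations are assumptions; the text fixes only the layer order and the one-map output.

```python
import torch
import torch.nn as nn

# Output head: activation on C7 -> conv to one map -> BN -> activation.
output_head = nn.Sequential(
    nn.ReLU(inplace=True),                       # 15th activation: C7 -> H15
    nn.Conv2d(96, 1, kernel_size=3, padding=1),  # 9th conv: H15 -> J9, a single map
    nn.BatchNorm2d(1),                           # 16th BN: J9 -> P16
    nn.ReLU(inplace=True),                       # 16th activation: P16 -> H16, the estimated depth image
)

c7 = torch.randn(1, 96, 320, 240)   # toy R x L size
depth = output_head(c7)             # -> (1, 1, 320, 240): one R x L depth map
```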
Step 1_3: Take each original monocular image in the training set as the original input image, input it into the convolutional neural network for training, and obtain the estimated depth image corresponding to each original monocular image in the training set, where the pixel value at coordinate position (x,y) of the estimated depth image corresponding to {Qn(x,y)} is the estimated depth of that pixel.

Step 1_4: Calculate the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image.

Step 1_5: Repeat step 1_3 and step 1_4 a total of V times to obtain the trained convolutional neural network training model, obtaining N×V loss function values in total; then find the loss function value with the smallest value among the N×V loss function values.
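Illustrative sketch (not claim language): a minimal training loop matching steps 1_3 through 1_5, using mean squared error as the loss (named by claim 2 below) and keeping the weights and biases that yield the smallest loss value. The optimizer choice, learning rate, and data loading are assumptions.

```python
import copy
import torch
import torch.nn as nn

def train(model, loader, V, lr=1e-4, device='cuda'):
    """Steps 1_3-1_5: train for V passes, tracking the weights with the smallest loss."""
    assert V > 1
    model.to(device)
    criterion = nn.MSELoss()                                  # claim 2: mean squared error
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # optimizer is an assumption
    best_loss, best_state = float('inf'), None

    for _ in range(V):                                # repeat steps 1_3 and 1_4 V times
        for image, true_depth in loader:              # N monocular images with real depth maps
            image, true_depth = image.to(device), true_depth.to(device)
            est_depth = model(image)                  # step 1_3: estimated depth image
            loss = criterion(est_depth, true_depth)   # step 1_4: loss vs. real depth image
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:               # step 1_5: keep W_best and b_best
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss
```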
Then take the weight vector and the bias term corresponding to the loss function value with the smallest value as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, correspondingly recorded as Wbest and bbest, where V>1.

The specific steps of the test phase process are as follows:

Step 2_1: Let {Q(x',y')} denote the monocular image to be predicted, where 1≤x'≤R', 1≤y'≤L', R' denotes the width of {Q(x',y')}, L' denotes the height of {Q(x',y')}, and Q(x',y') denotes the pixel value of the pixel whose coordinate position in {Q(x',y')} is (x',y').

Step 2_2: Input {Q(x',y')} into the trained convolutional neural network training model, and use Wbest and bbest for prediction to obtain the predicted depth image corresponding to {Q(x',y')}, denoted as {Qdepth(x',y')}, where Qdepth(x',y') denotes the pixel value of the pixel whose coordinate position in {Qdepth(x',y')} is (x',y').

2. The monocular vision depth estimation method according to claim 1, characterized in that in step 1_4, the loss function value is obtained using the mean squared error function.
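Illustrative sketch of step 2_2 (not claim language), under the same assumptions as the training sketch above: load the optimal weights and biases Wbest and bbest into the trained model and run a single forward pass on the image to be predicted.

```python
import torch

def predict_depth(model, best_state, image):
    """Step 2_2: predict the depth image using the optimal weights and biases."""
    model.load_state_dict(best_state)   # W_best and b_best from training
    model.eval()
    with torch.no_grad():
        return model(image)             # (1, 1, L', R') predicted depth image {Q_depth}

# q = torch.randn(1, 3, 240, 320)               # monocular image {Q(x',y')} to be predicted
# q_depth = predict_depth(model, best_state, q)
```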
CN201811246664.0A 2018-10-25 2018-10-25 Monocular vision depth estimation method Active CN109460815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811246664.0A CN109460815B (en) 2018-10-25 2018-10-25 Monocular vision depth estimation method

Publications (2)

Publication Number Publication Date
CN109460815A true CN109460815A (en) 2019-03-12
CN109460815B CN109460815B (en) 2021-12-10

Family

ID=65608334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811246664.0A Active CN109460815B (en) 2018-10-25 2018-10-25 Monocular vision depth estimation method

Country Status (1)

Country Link
CN (1) CN109460815B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180260703A1 (en) * 2016-11-22 2018-09-13 Massachusetts Institute Of Technology Systems and methods for training neural networks
CN107886165A (en) * 2017-12-30 2018-04-06 北京工业大学 A kind of parallel-convolution neural net method based on CRT technology
CN108090472A (en) * 2018-01-12 2018-05-29 浙江大学 Pedestrian based on multichannel uniformity feature recognition methods and its system again
CN108681692A (en) * 2018-04-10 2018-10-19 华南理工大学 Increase Building recognition method in a kind of remote sensing images based on deep learning newly

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210209453A1 (en) * 2019-03-14 2021-07-08 Infineon Technologies Ag Fmcw radar with interference signal suppression using artificial neural network
US11885903B2 (en) 2019-03-14 2024-01-30 Infineon Technologies Ag FMCW radar with interference signal suppression using artificial neural network
US11907829B2 (en) * 2019-03-14 2024-02-20 Infineon Technologies Ag FMCW radar with interference signal suppression using artificial neural network
US12032089B2 (en) 2019-03-14 2024-07-09 Infineon Technologies Ag FMCW radar with interference signal suppression using artificial neural network
CN110414674A (en) * 2019-07-31 2019-11-05 浙江科技学院 A Monocular Depth Estimation Method Based on Residual Network and Local Refinement
CN110414674B (en) * 2019-07-31 2021-09-10 浙江科技学院 Monocular depth estimation method based on residual error network and local refinement
CN111161166A (en) * 2019-12-16 2020-05-15 西安交通大学 Image moire eliminating method based on depth multi-resolution network
WO2022193866A1 (en) * 2021-03-16 2022-09-22 Huawei Technologies Co., Ltd. Methods, systems and computer medium for scene-adaptive future depth prediction in monocular videos
US12033342B2 (en) 2021-03-16 2024-07-09 Huawei Technologies Co., Ltd. Methods, systems and computer medium for scene-adaptive future depth prediction in monocular videos

Also Published As

Publication number Publication date
CN109460815B (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN109460815A (en) A kind of monocular depth estimation method
CN110738697B (en) Monocular depth estimation method based on deep learning
CN111784602B (en) Method for generating countermeasure network for image restoration
CN107506761B (en) Brain image segmentation method and system based on saliency learning convolutional neural network
CN109146944B (en) Visual depth estimation method based on depth separable convolutional neural network
CN111462230B (en) Typhoon center positioning method based on deep reinforcement learning
CN110175986A (en) A kind of stereo-picture vision significance detection method based on convolutional neural networks
CN111126599B (en) A neural network weight initialization method based on transfer learning
CN108711141A (en) The motion blur image blind restoration method of network is fought using improved production
CN106934456A (en) A kind of depth convolutional neural networks model building method
CN116206214B (en) A method, system, device and medium for automatically identifying landslides based on lightweight convolutional neural network and dual attention
CN110059728A (en) RGB-D image vision conspicuousness detection method based on attention model
CN111062395A (en) Real-time video semantic segmentation method
CN110503063A (en) Fall Detection Method Based on Hourglass Convolutional Autoencoding Neural Network
CN111259735B (en) Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN113420643A (en) Lightweight underwater target detection method based on depth separable cavity convolution
CN114255239B (en) A cell segmentation fine-tuning method
CN114972753A (en) A lightweight semantic segmentation method and system based on contextual information aggregation and assisted learning
CN109461177A (en) A kind of monocular image depth prediction approach neural network based
CN109448039B (en) Monocular vision depth estimation method based on deep convolutional neural network
CN112330705A (en) Image binarization method based on deep learning semantic segmentation
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7
CN111639751A (en) Non-zero padding training method for binary convolutional neural network
CN114495210A (en) Posture change face recognition method based on attention mechanism
CN109377498A (en) An Interactive Mapping Method Based on Recurrent Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant