CN109460815A

CN109460815A - A kind of monocular depth estimation method

Info

Publication number: CN109460815A
Application number: CN201811246664.0A
Authority: CN
Inventors: 周武杰; 袁建中; 吕思嘉; 钱亚冠; 向坚; 张宇来
Original assignee: Zhejiang Lover Health Science and Technology Development Co Ltd
Current assignee: Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date: 2018-10-25
Filing date: 2018-10-25
Publication date: 2019-03-12
Anticipated expiration: 2038-10-25
Also published as: CN109460815B

Abstract

The invention discloses a monocular visual depth estimation method, which firstly constructs a convolutional neural network, which includes an input layer, a hidden layer and an output layer; the hidden layer includes an encoding frame, a decoding frame and an upsampling frame; and then uses a training set As the original input image, the monocular image is input into the convolutional neural network for training, and the estimated depth image corresponding to each original monocular image in the training set is obtained; then the estimated depth image corresponding to the monocular image in the training set is calculated. The loss function value between the corresponding real depth image and the convolutional neural network training model, the optimal weight vector and the optimal bias term are obtained; then the monocular image to be predicted is input into the convolutional neural network training model , and use the optimal weight vector and the optimal bias term to predict the corresponding predicted depth image; the advantage is that its prediction accuracy is high.

Description

A Monocular Vision Depth Estimation Method

技术领域technical field

本发明涉及一种图像信号处理技术，尤其是涉及一种单目视觉深度估计方法。The invention relates to an image signal processing technology, in particular to a monocular visual depth estimation method.

背景技术Background technique

经济的高速发展带来了人们生活水平的不断提升，随着人们对好的生活质量的要求逐渐增强，交通的便利性也越来越好。汽车作为交通中的重要一环，其发展更加被重视。在人工智能大火的如今，无人驾驶也是近年来较为热门的话题之一，并且在百度宣布无人驾驶车进入批量生产即将投入使用之后，无人驾驶的热潮持续提高。车前的单目视觉深度估计是无人驾驶领域的一部分，它可以有效地保障汽车行驶过程中的安全。The rapid development of the economy has brought about the continuous improvement of people's living standards. As people's requirements for a good quality of life gradually increase, the convenience of transportation is also getting better and better. As an important part of transportation, the development of automobiles has been paid more and more attention. In the age of artificial intelligence, driverless cars are also one of the hottest topics in recent years, and after Baidu announced that driverless cars will enter mass production and will be put into use, the upsurge of driverless cars continues to increase. The monocular visual depth estimation in front of the car is a part of the field of unmanned driving, which can effectively ensure the safety of the car during driving.

单目视觉深度估计的方法主要有传统方法和深度学习方法。在深度学习方法出现之前，依赖于传统方法的深度估计得出的结果远不能满足人们的最低的期望标准；在深度学习方法出现后，在深度学习中使用端到端的训练方法，使用大量的训练数据，进行学习后深度估计得出的结果精度得到了极大的提升。Eigen等人在文献《Depth Map Predictionfrom a Single Image using a Multi-Scale Deep Network》(《基于多尺度深度网络的单幅图像深度图预测》)中提出的神经网络的基础上进行了进一步提升，《基于多尺度深度网络的单幅图像深度图预测》提出使用两个尺度的神经网络来做深度估计：粗规模网络预测全局深度分布和精细规模网络以局部细化深度图，而Eigen等人在这两个尺度的神经网络的基础上将其拓展到三个尺度。该三个尺度的神经网络架构首先使用第一个尺度来根据整个图像区域预测出一个较为粗略的结果，然后使用第二个尺度对其在中等分辨率的基础上进行优化，最后使用第三个尺度对结果上采样后做细化提炼获得预测深度图，但是，该三个尺度的神经网络架构是针对深度预测、表面法线估计和语义分割这三种不同的计算机视觉任务的联合预测而提出的，若将其单独用于深度估计，则深度估计的准确性却不是很高，而且最终得到的预测深度图只有原本图像尺寸的一半，而尺寸的不一致性不利于对其中深度信息的直接使用。The methods of monocular visual depth estimation mainly include traditional methods and deep learning methods. Before the emergence of deep learning methods, the results obtained by relying on traditional methods for depth estimation are far from meeting people's minimum expectations; after the emergence of deep learning methods, end-to-end training methods are used in deep learning, using a large number of training The accuracy of the results obtained by depth estimation after learning has been greatly improved. Eigen et al. further improved the neural network proposed in the document "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network" ("Depth Map Prediction of Single Image Based on Multi-Scale Deep Network"). Single Image Depth Map Prediction Based on Multi-scale Deep Networks" proposed to use two-scale neural networks for depth estimation: a coarse-scale network to predict the global depth distribution and a fine-scale network to locally refine the depth map, and Eigen et al. Based on the neural network of two scales, it is extended to three scales. The three-scale neural network architecture first uses the first scale to predict a rougher result based on the entire image area, then uses the second scale to optimize it on the basis of medium resolution, and finally uses the third scale. The scales upsample the results and refine and refine the predicted depth map. However, the three-scale neural network architecture is proposed for the joint prediction of three different computer vision tasks: depth prediction, surface normal estimation, and semantic segmentation. If it is used for depth estimation alone, the accuracy of depth estimation is not very high, and the final predicted depth map is only half the size of the original image, and the inconsistency of the size is not conducive to the direct use of depth information. .

发明内容SUMMARY OF THE INVENTION

本发明所要解决的技术问题是提供一种单目视觉深度估计方法，其预测精度高。The technical problem to be solved by the present invention is to provide a monocular visual depth estimation method with high prediction accuracy.

本发明解决上述技术问题所采用的技术方案为：一种单目视觉深度估计方法，其特征在于包括训练阶段和测试阶段两个过程；The technical solution adopted by the present invention to solve the above technical problems is: a monocular visual depth estimation method, which is characterized in that it includes two processes: a training phase and a testing phase;

所述的训练阶段过程的具体步骤为：The specific steps of the training phase process are:

步骤1_1：选取N幅原始的单目图像及每幅原始的单目图像对应的真实深度图像，并构成训练集，将训练集中的第n幅原始的单目图像记为{Qⁿ(x,y)}，将训练集中与{Qⁿ(x,y)}对应的真实深度图像记为其中，N为正整数，N≥100，n为正整数，1≤n≤N，1≤x≤R，1≤y≤L，R表示{Qⁿ(x,y)}和的宽度，L表示{Qⁿ(x,y)}和的高度，R和L均能被2整除，Qⁿ(x,y)表示{Qⁿ(x,y)}中坐标位置为(x,y)的像素点的像素值，表示中坐标位置为(x,y)的像素点的像素值；Step 1_1: Select N original monocular images and the real depth image corresponding to each original monocular image, and form a training set, and record the nth original monocular image in the training set as {Q ⁿ (x, y)}, denote the real depth image corresponding to {Q ⁿ (x, y)} in the training set as Among them, N is a positive integer, N≥100, n is a positive integer, 1≤n≤N, 1≤x≤R, 1≤y≤L, R represents {Q ⁿ (x, y)} and The width of , L represents {Q ⁿ (x, y)} and The height of , R and L are both divisible by 2, Q ⁿ (x, y) represents the pixel value of the pixel at the coordinate position (x, y) in {Q ⁿ (x, y)}, express The pixel value of the pixel whose middle coordinate position is (x, y);

步骤1_2：构建端到端的卷积神经网络：卷积神经网络包括输入层、隐层和输出层；隐层包括编码框架、译码框架和上采样框架；Step 1_2: Build an end-to-end convolutional neural network: the convolutional neural network includes an input layer, a hidden layer, and an output layer; the hidden layer includes an encoding frame, a decoding frame, and an upsampling frame;

对于输入层，输入层的输入端接收一幅原始输入图像，输入层的输出端输出原始输入图像给隐层；其中，要求输入层的输入端接收的原始输入图像的宽度为R、高度为L；For the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; among them, the width of the original input image received by the input end of the input layer is required to be R and the height is L ;

对于编码框架，其由依次设置的第一卷积层、第一批规范化层、第一激活层、第一最大池化层、第二卷积层、第二批规范化层、第二激活层、第三卷积层、第三批规范化层、第一Concatenate融合层、第三激活层、第二最大池化层、第四卷积层、第四批规范化层、第四激活层、第五卷积层、第五批规范化层、第二Concatenate融合层、第五激活层、第三最大池化层、第一带孔卷积层、第六批规范化层、第六激活层、第二带孔卷积层、第七批规范化层、第三Concatenate融合层、第七激活层、第四最大池化层、第三带孔卷积层、第八批规范化层、第八激活层组成；对于译码框架，其由依次设置的第一反卷积层、第九批规范化层、第四Concatenate融合层、第九激活层、第六卷积层、第十批规范化层、第十激活层、第二反卷积层、第十一批规范化层、第五Concatenate融合层、第十一激活层、第七卷积层、第十二批规范化层、第十二激活层、第三反卷积层、第十三批规范化层、第六Concatenate融合层、第十三激活层、第八卷积层、第十四批规范化层、第十四激活层、第四反卷积层、第十五批规范化层、第七Concatenate融合层组成；对于上采样框架，其由依次设置的第一上采样层、第十卷积层、第十七批规范化层、第十七激活层、第二上采样层、第十一卷积层、第十八批规范化层、第十八激活层、第三上采样层、第十二卷积层、第十九批规范化层、第十九激活层、第四上采样层、第十三卷积层、第二十批规范化层、第二十激活层组成；对于输出层，其由依次设置的第十五激活层、第九卷积层、第十六批规范化层、第十六激活层组成，其中，第一卷积层至第十三卷积层、第一带孔卷积层至第三带孔卷积层、第一反卷积层至第四反卷积层各自的卷积核大小为3×3，第一卷积层的卷积核个数为32、第二卷积层和第三卷积层的卷积核个数为64、第四卷积层和第五卷积层的卷积核个数为128、第一带孔卷积层和第二带孔卷积层的卷积核个数为256、第三带孔卷积层的卷积核个数为512、第一反卷积层和第六卷积层的卷积核个数为256、第二反卷积层和第七卷积层的卷积核个数为128、第三反卷积层和第八卷积层的卷积核个数为64、第四反卷积层的卷积核个数为32、第九卷积层的卷积核个数为1、第十卷积层的卷积核个数为256、第十一卷积层的卷积核个数为128、第十二卷积层的卷积核个数为64、第十三卷积层的卷积核个数为32，第一卷积层至第十三卷积层、第一带孔卷积层至第三带孔卷积层各自的卷积步长采用默认值，第一反卷积层至第四反卷积层各自的卷积步长为2×2，第一批规范化层至第二十批规范化层的参数采用默认值，第一激活层至第二十激活层的激活函数采用ReLu，第一最大池化层至第四最大池化层的池化步长为2×2，第一上采样层至第四上采样层的采样步长为2×2；For the encoding framework, it consists of the first convolutional layer, the first normalization layer, the first activation layer, the first max pooling layer, the second convolutional layer, the second normalization layer, the second activation layer, 3rd Convolutional Layer, 3rd Batch Normalization Layer, 1st Concatenate Fusion Layer, 3rd Activation Layer, 2nd Max Pooling Layer, 4th Convolutional Layer, 4th Batch Normalization Layer, 4th Activation Layer, 5th Volume Convolution layer, fifth batch normalization layer, second concatenate fusion layer, fifth activation layer, third max pooling layer, first convolutional layer with holes, sixth batch normalization layer, sixth activation layer, second hole Convolutional layer, seventh batch of normalization layer, third Concatenate fusion layer, seventh activation layer, fourth maximum pooling layer, third convolutional layer with holes, eighth batch of normalization layer, eighth activation layer; The code frame is composed of the first deconvolution layer, the ninth batch of normalization layers, the fourth Concatenate fusion layer, the ninth activation layer, the sixth convolution layer, the tenth batch of normalization layers, the tenth activation layer, the fourth batch of The second deconvolution layer, the eleventh normalization layer, the fifth concatenate fusion layer, the eleventh activation layer, the seventh convolution layer, the twelfth normalization layer, the twelfth activation layer, the third deconvolution layer , the thirteenth batch of normalization layer, the sixth Concatenate fusion layer, the thirteenth activation layer, the eighth convolution layer, the fourteenth batch of normalization layer, the fourteenth activation layer, the fourth deconvolution layer, the fifteenth batch The normalization layer and the seventh Concatenate fusion layer are composed; for the upsampling framework, it consists of the first upsampling layer, the tenth convolutional layer, the seventeenth batch normalization layer, the seventeenth activation layer, and the second upsampling layer. , the eleventh convolutional layer, the eighteenth batch of normalization layers, the eighteenth activation layer, the third upsampling layer, the twelfth convolutional layer, the nineteenth batch of normalization layers, the nineteenth activation layer, the fourth upper The sampling layer, the thirteenth convolutional layer, the twentieth batch of normalization layers, and the twentieth activation layer are composed; for the output layer, it consists of the fifteenth activation layer, the ninth convolutional layer, and the sixteenth batch of normalization set in sequence. layer and the sixteenth activation layer, among which, the first convolutional layer to the thirteenth convolutional layer, the first convolutional layer with holes to the third convolutional layer with holes, the first deconvolution layer to the fourth inverse convolutional layer The convolution kernel size of each convolution layer is 3 × 3, the number of convolution kernels of the first convolution layer is 32, the number of convolution kernels of the second convolution layer and the third convolution layer is 64, and the number of convolution kernels of the fourth convolution layer is 64. The number of convolution kernels of the convolutional layer and the fifth convolutional layer is 128, the number of convolution kernels of the first convolutional layer and the second convolutional layer with holes is 256, and the number of convolutional kernels of the third convolutional layer with holes The number of convolution kernels is 512, the number of convolution kernels of the first deconvolution layer and the sixth convolution layer is 256, the number of convolution kernels of the second deconvolution layer and the seventh convolution layer is 128, The number of convolution kernels in the third deconvolution layer and the eighth convolution layer is 64, the number of convolution kernels in the fourth deconvolution layer is 32, and the number of convolution kernels in the ninth convolution layer is 1. The number of convolution kernels in the tenth convolution layer is 256, the number of convolution kernels in the eleventh convolution layer is 128, the number of convolution kernels in the twelfth convolution layer is 64, and the number of convolution kernels in the thirteenth convolution layer is 64. The number of convolution kernels is 32, the first The convolution strides from the first convolutional layer to the thirteenth convolutional layer, the first convolutional convolutional layer to the third convolutional convolutional layer with holes adopt the default values, and the first deconvolutional layer to the fourth deconvolutional layer The respective convolution step size is 2×2, the parameters of the first batch of normalization layers to the twentieth batch of normalization layers use default values, the activation functions of the first activation layer to the twentieth activation layer use ReLu, and the first maximum pooling is used. The pooling step size from the layer to the fourth maximum pooling layer is 2×2, and the sampling step size from the first upsampling layer to the fourth upsampling layer is 2×2;

对于编码框架，第一卷积层的输入端接收输入层的输出端输出的原始输入图像，第一卷积层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为J₁，其中，J₁中的每幅特征图的宽度为R、高度为L；第一批规范化层的输入端接收J₁中的所有特征图，第一批规范化层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为P₁，其中，P₁中的每幅特征图的宽度为R、高度为L；第一激活层的输入端接收P₁中的所有特征图，第一激活层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为H₁，其中，H₁中的每幅特征图的宽度为R、高度为L；第一最大池化层的输入端接收H₁中的所有特征图，第一最大池化层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为Z₁，其中，Z₁中的每幅特征图的宽度为高度为第二卷积层的输入端接收Z₁中的所有特征图，第二卷积层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为J₂，其中，J₂中的每幅特征图的宽度为高度为第二批规范化层的输入端接收J₂中的所有特征图，第二批规范化层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为P₂，其中，P₂中的每幅特征图的宽度为高度为第二激活层的输入端接收P₂中的所有特征图，第二激活层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为H₂，其中，H₂中的每幅特征图的宽度为高度为第三卷积层的输入端接收H₂中的所有特征图，第三卷积层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为J₃，其中，J₃中的每幅特征图的宽度为高度为第三批规范化层的输入端接收J₃中的所有特征图，第三批规范化层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为P₃，其中，P₃中的每幅特征图的宽度为高度为第一Concatenate融合层的输入端接收P₃中的所有特征图和H₂中的所有特征图，第一Concatenate融合层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为C₁，其中，C₁中的每幅特征图的宽度为高度为第三激活层的输入端接收C₁中的所有特征图，第三激活层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为H₃，其中，H₃中的每幅特征图的宽度为高度为第二最大池化层的输入端接收H₃中的所有特征图，第二最大池化层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为Z₂，其中，Z₂中的每幅特征图的宽度为高度为第四卷积层的输入端接收Z₂中的所有特征图，第四卷积层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为J₄，其中，J₄中的每幅特征图的宽度为高度为第四批规范化层的输入端接收J₄中的所有特征图，第四批规范化层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为P₄，其中，P₄中的每幅特征图的宽度为高度为第四激活层的输入端接收P₄中的所有特征图，第四激活层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为H₄，其中，H₄中的每幅特征图的宽度为高度为第五卷积层的输入端接收H₄中的所有特征图，第五卷积层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为J₅，其中，J₅中的每幅特征图的宽度为高度为第五批规范化层的输入端接收J₅中的所有特征图，第五批规范化层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为P₅，其中，P₅中的每幅特征图的宽度为高度为第二Concatenate融合层的输入端接收P₅中的所有特征图和H₄中的所有特征图，第二Concatenate融合层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为C₂，其中，C₂中的每幅特征图的宽度为高度为第五激活层的输入端接收C₂中的所有特征图，第五激活层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为H₅，其中，H₅中的每幅特征图的宽度为高度为第三最大池化层的输入端接收H₅中的所有特征图，第三最大池化层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为Z₃，其中，Z₃中的每幅特征图的宽度为高度为第一带孔卷积层的输入端接收Z₃中的所有特征图，第一带孔卷积层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为K₁，其中，K₁中的每幅特征图的宽度为高度为第六批规范化层的输入端接收K₁中的所有特征图，第六批规范化层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为P₆，其中，P₆中的每幅特征图的宽度为高度为第六激活层的输入端接收P₆中的所有特征图，第六激活层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为H₆，其中，H₆中的每幅特征图的宽度为高度为第二带孔卷积层的输入端接收H₆中的所有特征图，第二带孔卷积层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为K₂，其中，K₂中的每幅特征图的宽度为高度为第七批规范化层的输入端接收K₂中的所有特征图，第七批规范化层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为P₇，其中，P₇中的每幅特征图的宽度为高度为第三Concatenate融合层的输入端接收P₇中的所有特征图和H₆中的所有特征图，第三Concatenate融合层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为C₃，其中，C₃中的每幅特征图的宽度为高度为第七激活层的输入端接收C₃中的所有特征图，第七激活层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为H₇，其中，H₇中的每幅特征图的宽度为高度为第四最大池化层的输入端接收H₇中的所有特征图，第四最大池化层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为Z₄，其中，Z₄中的每幅特征图的宽度为高度为第三带孔卷积层的输入端接收Z₄中的所有特征图，第三带孔卷积层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为K₃，其中，K₃中的每幅特征图的宽度为高度为第八批规范化层的输入端接收K₃中的所有特征图，第八批规范化层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为P₈，其中，P₈中的每幅特征图的宽度为高度为第八激活层的输入端接收P₈中的所有特征图，第八激活层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为H₈，H₈也即为编码框架的输出，其中，H₈中的每幅特征图的宽度为高度为 For the coding framework, the input of the first convolutional layer receives the original input image output by the output of the input layer, the output of the first convolutional layer outputs 32 feature maps, and the set of all the output feature maps is denoted as J ₁ , where the width of each feature map in J ₁ is R and the height is L; the input end of the first batch of normalization layers receives all the feature maps in J ₁ , and the output end of the first batch of normalization layers outputs 32 features The set of all output feature maps is denoted as P ₁ , wherein the width of each feature map in P ₁ is R and the height is L; the input end of the first activation layer receives all feature maps in P ₁ , the output end of the first activation layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as H ₁ , wherein the width of each feature map in H ₁ is R and the height is L; the first largest The input end of the pooling layer receives all the feature maps in H ₁ , the output end of the first maximum pooling layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as Z ₁ , wherein, in Z ₁ The width of each feature map is height is The input of the second convolutional layer receives all the feature maps in Z ₁ , the output of the second convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J ₂ , where in J ₂ The width of each feature map is height is The input end of the _second batch of normalization layers receives all feature maps in J ₂ , and the output end of the _second batch of normalization layers outputs 64 feature maps. The width of each feature map is height is The input terminal of the _second activation layer receives all feature maps in P ₂ , and the output terminal of the _second activation layer outputs 64 feature maps. The width of the feature map is height is The input end of the third convolutional layer receives all the feature maps in H ₂ , and the output end of the _third convolution layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J ₃ , where in The width of each feature map is height is The input end of the third batch of normalization layers receives all feature maps in J ₃ , the output end of the third batch of normalization layers outputs 64 feature maps, and the set formed by all the output feature maps is denoted as P ₃ , where in P ₃ The width of each feature map is height is The input of the first Concatenate fusion layer receives all the feature maps in _P3 and all the feature maps in _H2 , and the output of the first Concatenate fusion layer outputs 128 feature maps, and the set composed of all the output feature maps is recorded as C ₁ , where the width of each feature map in C ₁ is height is The input terminal of the _third activation layer receives all feature maps in C ₁ , and the output terminal of the _third activation layer outputs 128 feature maps. The width of the feature map is height is The input of the second maximum pooling layer receives all the feature maps in H ₃ , the output of the second maximum pooling layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as Z ₂ , where Z The width of each feature map in ₂ is height is The input end of the fourth convolutional layer receives all the feature maps in Z ₂ , and the output end of the fourth convolution layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J ₄ , where in J ₄ The width of each feature map is height is The input terminal of the _fourth batch of normalization layers receives all feature maps in J ₄ , and the output terminal of the _fourth batch of normalization layers outputs 128 feature maps. The width of each feature map is height is The input end of the _fourth activation layer receives all the feature maps in P ₄ , and the output end of the _fourth activation layer outputs 128 feature maps. The width of the feature map is height is The input end of the fifth convolutional layer receives all the feature maps in H ₄ , and the output end of the fifth convolution layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J ₅ , where in J ₅ The width of each feature map is height is The input end of the fifth batch of normalization layers receives all feature maps in J ₅ , and the output end of the fifth batch of normalization layers outputs 128 feature maps, and the set formed by all the output feature maps is denoted as P ₅ , where in P ₅ The width of each feature map is height is The input end of the second Concatenate fusion layer receives all the feature maps in P ₅ and all the feature maps in H ₄ , and the output end of the second Concatenate fusion layer outputs 256 feature maps, and the set formed by all the output feature maps is recorded as C ₂ , where the width of each feature map in C ₂ is height is The input end of the _fifth activation layer receives all the feature maps in C ₂ , and the output end of the _fifth activation layer outputs 256 feature maps. The width of the feature map is height is The input terminal of the third maximum pooling layer receives all the feature maps in H ₅ , the output terminal of the third maximum pooling layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as Z ₃ , where Z 3 The width of each feature map in ₃ is height is The input of the first atrous convolutional layer receives all feature maps in Z ₃ , the output of the first atrous convolutional layer outputs 256 feature maps, and the set of all the output feature maps is denoted as K ₁ , where , the width _of each feature map in K1 is height is The input end of the sixth batch of normalization layers receives all the feature maps in K ₁ , the output end of the sixth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₆ , where in P ₆ The width of each feature map is height is The input end of the sixth activation layer receives all the feature maps in P ₆ , and the output end of the sixth activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H ₆ , where each of the feature maps in H ₆ The width of the feature map is height is The input end of the second atrous convolutional layer receives all the feature maps in H ₆ , the output end of the second atrous convolutional layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as K ₂ , where , the width of each feature map in _K2 is height is The input end of the seventh batch of normalization layers receives all the feature maps in K ₂ , the output end of the seventh batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₇ , where in P ₇ The width of each feature map is height is The input end of the third Concatenate fusion layer receives all the feature maps in P ₇ and all the feature maps in H ₆ , and the output end of the third Concatenate fusion layer outputs 512 feature maps, and the set composed of all the output feature maps is denoted as C ₃ , where the width of each feature map in C ₃ is height is The input terminal of the seventh activation layer receives all the feature maps in C ₃ , and the output terminal of the _seventh activation layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as H ₇ . The width of the feature map is height is The input end of the fourth maximum pooling layer receives all the feature maps in H ₇ , and the output end of the fourth maximum pooling layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as Z ₄ , where Z 4 The width of each feature map in ₄ is height is The input of the third atrous convolutional layer receives all the feature maps in Z ₄ , the output of the third atrous convolutional layer outputs 512 feature maps, and the set of all the output feature maps is denoted as K ₃ , where , the width _of each feature map in K3 is height is The input end of the eighth batch of normalization layers receives all the feature maps in K ₃ , the output end of the eighth batch of normalization layers outputs 512 feature maps, and the set formed by all the output feature maps is denoted as P ₈ , where in P ₈ The width of each feature map is height is The input end of the _eighth activation layer receives all the feature maps in P8, and the output end of the eighth activation layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as H ₈ , and H ₈ is also the coding frame. _The output of , where the width of each feature map in H8 is height is

对于译码框架，第一反卷积层的输入端接收编码框架的输出即H₈中的所有特征图，第一反卷积层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为F₁，其中，F₁中的每幅特征图的宽度为高度为第九批规范化层的输入端接收F₁中的所有特征图，第九批规范化层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为P₉，其中，P₉中的每幅特征图的宽度为高度为第四Concatenate融合层的输入端接收P₉中的所有特征图和P₇中的所有特征图，第四Concatenate融合层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为C₄，其中，C₄中的每幅特征图的宽度为高度为第九激活层的输入端接收C₄中的所有特征图，第九激活层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为H₉，其中，H₉中的每幅特征图的宽度为高度为第六卷积层的输入端接收H₉中的所有特征图，第六卷积层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为J₆，其中，J₆中的每幅特征图的宽度为高度为第十批规范化层的输入端接收J₆中的所有特征图，第十批规范化层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为P₁₀，其中，P₁₀中的每幅特征图的宽度为高度为第十激活层的输入端接收P₁₀中的所有特征图，第十激活层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为H₁₀，其中，H₁₀中的每幅特征图的宽度为高度为第二反卷积层的输入端接收编码框架的输出即H₁₀中的所有特征图，第二反卷积层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为F₂，其中，F₂中的每幅特征图的宽度为高度为第十一批规范化层的输入端接收F₂中的所有特征图，第十一批规范化层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为P₁₁，其中，P₁₁中的每幅特征图的宽度为高度为第五Concatenate融合层的输入端接收P₁₁中的所有特征图和P₅中的所有特征图，第五Concatenate融合层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为C₅，其中，C₅中的每幅特征图的宽度为高度为第十一激活层的输入端接收C₅中的所有特征图，第十一激活层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为H₁₁，其中，H₁₁中的每幅特征图的宽度为高度为第七卷积层的输入端接收H₁₁中的所有特征图，第七卷积层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为J₇，其中，J₇中的每幅特征图的宽度为高度为第十二批规范化层的输入端接收J₇中的所有特征图，第十二批规范化层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为P₁₂，其中，P₁₂中的每幅特征图的宽度为高度为第十二激活层的输入端接收P₁₂中的所有特征图，第十二激活层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为H₁₂，其中，H₁₂中的每幅特征图的宽度为高度为第三反卷积层的输入端接收H₁₂中的所有特征图，第三反卷积层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为F₃，其中，F₃中的每幅特征图的宽度为高度为第十三批规范化层的输入端接收F₃中的所有特征图，第十三批规范化层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为P₁₃，其中，P₁₃中的每幅特征图的宽度为高度为第六Concatenate融合层的输入端接收P₁₃中的所有特征图和P₃中的所有特征图，第六Concatenate融合层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为C₆，其中，C₆中的每幅特征图的宽度为高度为第十三激活层的输入端接收C₆中的所有特征图，第十三激活层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为H₁₃，其中，H₁₃中的每幅特征图的宽度为高度为第八卷积层的输入端接收H₁₃中的所有特征图，第八卷积层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为J₈，其中，J₈中的每幅特征图的宽度为高度为第十四批规范化层的输入端接收J₈中的所有特征图，第十四批规范化层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为P₁₄，其中，P₁₄中的每幅特征图的宽度为高度为第十四激活层的输入端接收P₁₄中的所有特征图，第十四激活层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为H₁₄，其中，H₁₄中的每幅特征图的宽度为高度为第四反卷积层的输入端接收H₁₄中的所有特征图，第四反卷积层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为F₄，其中，F₄中的每幅特征图的宽度为R、高度为L；第十五批规范化层的输入端接收F₄中的所有特征图，第十五批规范化层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为P₁₅，其中，P₁₅中的每幅特征图的宽度为R、高度为L；第七Concatenate融合层的输入端接收P₁₅中的所有特征图、H₁中的所有特征图、上采样框架的输出，第七Concatenate融合层的输出端输出96幅特征图，将输出的所有特征图构成的集合记为C₇，其中，C₇中的每幅特征图的宽度为R、高度为L；For the decoding framework, the input end of the first deconvolution layer receives the output of the encoding framework, that is, all feature maps in _H8 , and the output end of the first deconvolution layer outputs 256 feature maps, which are composed of all the output feature maps. The set of is denoted as F ₁ , where the width of each feature map in F ₁ is height is The input end of the ninth batch of normalization layers receives all the feature maps in F ₁ , and the output end of the ninth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₉ , where in P ₉ The width of each feature map is height is The input end of the fourth Concatenate fusion layer receives all the feature maps in P ₉ and all the feature maps in P ₇ , and the output end of the fourth Concatenate fusion layer outputs 512 feature maps, and the set composed of all the output feature maps is recorded as C ₄ , where the width of each feature map in C ₄ is height is The input end of the _ninth activation layer receives all the feature maps in C ₄ , and the output end of the _ninth activation layer outputs 512 feature maps. The width of the feature map is height is The input end of the sixth convolutional layer receives all the feature maps in H ₉ , the output end of the sixth convolution layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as J ₆ , where in J ₆ The width of each feature map is height is The input end of the tenth batch of normalization layers receives all the feature maps in J ₆ , the output end of the tenth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₁₀ , where in P ₁₀ The width of each feature map is height is The input end of the tenth activation layer receives all the feature maps in P ₁₀ , and the output end of the tenth activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H ₁₀ , where each H ₁₀ The width of the feature map is height is The input end of the second deconvolution layer receives the output of the coding framework, that is, all feature maps in _H10 , and the output end of the second deconvolution layer outputs 128 feature maps, and the set of all the output feature maps is denoted as F ₂ , where the width of each feature map in _F2 is height is The input end of the eleventh batch of normalization layers receives all the feature maps in F ₂ , and the output end of the eleventh batch of normalization layers outputs 128 feature maps, and the set of all the output feature maps is denoted as P ₁₁ , where P The width of each feature map in ₁₁ is height is The input of the fifth Concatenate fusion layer receives all the feature maps in P ₁₁ and all the feature maps in P ₅ , and the output of the fifth Concatenate fusion layer outputs 256 feature maps, and the set of all the output feature maps is recorded as C ₅ , where the width of each feature map in C ₅ is height is The input end of the eleventh activation layer receives all the feature maps in _C5 , and the output end of the _{eleventh activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H 11} _. The width of each feature map is height is The input end of the seventh convolutional layer receives all the feature maps in H ₁₁ , the output end of the seventh convolution layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J ₇ , where in J ₇ The width of each feature map is height is The input end of the twelfth batch of normalization layers receives all feature maps in J ₇ , the output end of the twelfth batch of normalization layers outputs 128 feature maps, and the set formed by all the output feature maps is denoted as P ₁₂ , where P The width of each feature map in ₁₂ is height is The input end of the twelfth activation layer receives all the feature maps in P ₁₂ , the output end of the twelfth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as H ₁₂ , where in H ₁₂ The width of each feature map is height is The input end of the third deconvolution layer receives all the feature maps in H ₁₂ , the output end of the third deconvolution layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as F ₃ , where F The width of each feature map in ₃ is height is The input end of the thirteenth batch of normalization layers receives all the feature maps in F ₃ , the output end of the thirteenth batch of normalization layers outputs 64 feature maps, and the set of all the output feature maps is denoted as P ₁₃ , where P The width of each feature map in ₁₃ is height is The input end of the sixth Concatenate fusion layer receives all the feature maps in P ₁₃ and all the feature maps in P ₃ , and the output end of the sixth Concatenate fusion layer outputs 128 feature maps, and the set composed of all the output feature maps is recorded as C ₆ , where the width of each feature map in C ₆ is height is The input end of the thirteenth activation layer receives all the feature maps in C ₆ , the output end of the thirteenth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as H ₁₃ , wherein, in H ₁₃ The width of each feature map is height is The input end of the eighth convolutional layer receives all the feature maps in H ₁₃ , the output end of the eighth convolution layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J ₈ , where in J ₈ The width of each feature map is height is The input end of the fourteenth batch of normalization layers receives all the feature maps in J ₈ , the output end of the fourteenth batch of normalization layers outputs 64 feature maps, and the set formed by all the output feature maps is denoted as P ₁₄ , where P The width of each feature map in ₁₄ is height is The input end of the fourteenth activation layer receives all the feature maps in P ₁₄ , the output end of the fourteenth activation layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as H ₁₄ , wherein, in H ₁₄ The width of each feature map is height is The input end of the fourth deconvolution layer receives all the feature maps in H ₁₄ , the output end of the fourth deconvolution layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as F ₄ , where F The width of each feature map in ₄ is R and the height is L; the input end of the fifteenth batch of normalization layers receives all the feature maps in F ₄ , and the output end of the fifteenth batch of normalization layers outputs 32 feature maps. The set composed of all the output feature maps is denoted as P ₁₅ , wherein the width of each feature map in P ₁₅ is R and the height is L; the input end of the seventh Concatenate fusion layer receives all the feature maps in P ₁₅ , H All feature maps in ₁ , the output of the upsampling framework, the output of the seventh Concatenate fusion layer outputs 96 feature maps, and the set of all the output feature maps is denoted as C ₇ , where each feature in C ₇ The width of the graph is R and the height is L;

对于上采样框架，第一上采样层的输入端接收Z₄中的所有特征图，第一上采样层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为Y₁，其中，Y₁中的每幅特征图的宽度为高度为第十卷积层的输入端接收Y₁中的所有特征图，第十卷积层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为J₁₀，其中，J₁₀中的每幅特征图的宽度为高度为第十七批规范化层的输入端接收J₁₀中的所有特征图，第十七批规范化层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为P₁₇，其中，P₁₇中的每幅特征图的宽度为高度为第十七激活层的输入端接收P₁₇中的所有特征图，第十七激活层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为H₁₇，其中，H₁₇中的每幅特征图的宽度为高度为第二上采样层的输入端接收H₁₇中的所有特征图，第二上采样层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为Y₂，其中，Y₂中的每幅特征图的宽度为高度为第十一卷积层的输入端接收Y₂中的所有特征图，第十一卷积层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为J₁₁，其中，J₁₁中的每幅特征图的宽度为高度为第十八批规范化层的输入端接收J₁₁中的所有特征图，第十八批规范化层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为P₁₈，其中，P₁₈中的每幅特征图的宽度为高度为第十八激活层的输入端接收P₁₈中的所有特征图，第十八激活层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为H₁₈，其中，H₁₈中的每幅特征图的宽度为高度为第三上采样层的输入端接收H₁₈中的所有特征图，第三上采样层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为Y₃，其中，Y₃中的每幅特征图的宽度为高度为第十二卷积层的输入端接收Y₃中的所有特征图，第十二卷积层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为J₁₂，其中，J₁₂中的每幅特征图的宽度为高度为第十九批规范化层的输入端接收J₁₂中的所有特征图，第十九批规范化层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为P₁₉，其中，P₁₉中的每幅特征图的宽度为高度为第十九激活层的输入端接收P₁₉中的所有特征图，第十九激活层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为H₁₉，其中，H₁₉中的每幅特征图的宽度为高度为第四上采样层的输入端接收H₁₉中的所有特征图，第四上采样层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为Y₄，其中，Y₄中的每幅特征图的宽度为R、高度为L；第十三卷积层的输入端接收Y₄中的所有特征图，第十三卷积层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为J₁₃，其中，J₁₃中的每幅特征图的宽度为R、高度为L；第二十批规范化层的输入端接收J₁₃中的所有特征图，第二十批规范化层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为P₂₀，其中，P₂₀中的每幅特征图的宽度为R、高度为L；第二十激活层的输入端接收P₂₀中的所有特征图，第二十激活层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为H₂₀，其中，H₂₀中的每幅特征图的宽度为R、高度为L；For the up-sampling framework, the input of the first up-sampling layer receives all the feature maps in Z ₄ , the output of the first up-sampling layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as Y ₁ , where the width of each feature map in Y ₁ is height is The input end of the tenth convolutional layer receives all the feature maps in Y ₁ , and the output end of the tenth convolution layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as J ₁₀ , where in J ₁₀ The width of each feature map is height is The input end of the seventeenth batch of normalization layers receives all feature maps in J ₁₀ , the output end of the seventeenth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₁₇ , where P The width of each feature map in ₁₇ is height is The input end of the seventeenth activation layer receives all the feature maps in P ₁₇ , the output end of the seventeenth activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H ₁₇ , wherein, in H ₁₇ The width of each feature map is height is The input end of the second upsampling layer receives all the feature maps in H ₁₇ , the output end of the second upsampling layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as Y ₂ , where in Y ₂ The width of each feature map is height is The input of the eleventh convolutional layer receives all the feature maps in Y ₂ , the output of the eleventh convolutional layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J ₁₁ , where J The width of each feature map in ₁₁ is height is The input end of the eighteenth batch of normalization layers receives all feature maps in J ₁₁ , the output end of the eighteenth batch of normalization layers outputs 128 feature maps, and the set formed by all the output feature maps is denoted as P ₁₈ , where P The width of each feature map in ₁₈ is height is The input end of the eighteenth activation layer receives all the feature maps in P ₁₈ , the output end of the eighteenth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as H ₁₈ , where in H ₁₈ The width of each feature map is height is The input end of the third up-sampling layer receives all the feature maps in H ₁₈ , the output end of the third up-sampling layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as Y ₃ , where in Y ₃ The width of each feature map is height is The input of the twelfth convolutional layer receives all feature maps in Y ₃ , and the output of the twelfth convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J ₁₂ , where J The width of each feature map in ₁₂ is height is The input end of the nineteenth batch of normalization layers receives all the feature maps in J ₁₂ , the output end of the nineteenth batch of normalization layers outputs 64 feature maps, and the set formed by all the output feature maps is denoted as P ₁₉ , where P The width of each feature map in ₁₉ is height is The input end of the nineteenth activation layer receives all the feature maps in P ₁₉ , and the output end of the nineteenth activation layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as H ₁₉ , where in H ₁₉ The width of each feature map is height is The input end of the fourth upsampling layer receives all the feature maps in H ₁₉ , the output end of the fourth upsampling layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as Y ₄ , where in Y ₄ _The width of each feature map of the The set composed of all feature maps is denoted as J ₁₃ , wherein the width of each feature map in J ₁₃ is R and the height is L; the input end of the twentieth batch of normalization layers receives all the feature maps in J ₁₃ , the second The output of ten batches of normalization layers outputs 32 feature maps, and the set of all output feature maps is denoted as P ₂₀ , where the width of each feature map in P ₂₀ is R and the height is L; the twentieth activation The input end of the layer receives all the feature maps in P ₂₀ , the output end of the twentieth activation layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as H ₂₀ , where each feature in H ₂₀ The width of the graph is R and the height is L;

对于输出层，第十五激活层的输入端接收译码框架的输出即C₇中的所有特征图，第十五激活层的输出端输出96幅特征图，将输出的所有特征图构成的集合记为H₁₅，其中，H₁₅中的每幅特征图的宽度为R、高度为L；第九卷积层的输入端接收H₁₅中的所有特征图，第九卷积层的输出端输出1幅特征图，将输出的所有特征图构成的集合记为J₉，其中，J₉中的特征图的宽度为R、高度为L；第十六批规范化层的输入端接收J₉中的特征图，第十六批规范化层的输出端输出1幅特征图，将输出的所有特征图构成的集合记为P₁₆，其中，P₁₆中的特征图的宽度为R、高度为L；第十六激活层的输入端接收P₁₆中的特征图，第十六激活层的输出端输出1幅特征图，将输出的所有特征图构成的集合记为H₁₆，其中，H₁₆中的特征图的宽度为R、高度为L，H₁₆中的特征图即为原始输入图像对应的估计深度图像；For the output layer, the input terminal of the fifteenth activation layer receives the output of the decoding framework, that is, all feature maps in _C7 , and the output terminal of the fifteenth activation layer outputs 96 feature maps. Denoted as H ₁₅ , wherein the width of each feature map in H ₁₅ is R and the height is L; the input end of the ninth convolution layer receives all the feature maps in H ₁₅ , and the output end of the ninth convolution layer outputs 1 feature map, the set of all output feature maps is denoted as J ₉ , where the width of the feature map in J ₉ is R and the height is L; the input end of the sixteenth batch of normalization layers receives J ₉ . Feature map, the output end of the sixteenth batch of normalization layers outputs one feature map, and the set composed of all the output feature maps is denoted as P ₁₆ , where the width of the feature map in P ₁₆ is R and the height is L; The input terminal of the sixteenth activation layer receives the feature map in P ₁₆ , the output terminal of the sixteenth activation layer outputs one feature map, and the set formed by all the output feature maps is denoted as H ₁₆ , wherein the features in H ₁₆ The width of the image is R and the height is L, and the feature map in H ₁₆ is the estimated depth image corresponding to the original input image;

步骤1_3：将训练集中的每幅原始的单目图像作为原始输入图像，输入到卷积神经网络中进行训练，得到训练集中的每幅原始的单目图像对应的估计深度图像，将{Qⁿ(x,y)}对应的估计深度图像记为其中，表示中坐标位置为(x,y)的像素点的像素值；Step 1_3: Take each original monocular image in the training set as the original input image, input it into the convolutional neural network for training, and obtain the estimated depth image corresponding to each original monocular image in the training set, and set {Q ⁿ The estimated depth image corresponding to (x,y)} is denoted as in, express The pixel value of the pixel whose middle coordinate position is (x, y);

步骤1_4：计算训练集中的每幅原始的单目图像对应的估计深度图像与对应的真实深度图像之间的损失函数值，将与之间的损失函数值记为 Step 1_4: Calculate the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image, and The loss function value between is denoted as

步骤1_5：重复执行步骤1_3和步骤1_4共V次，得到训练好的卷积神经网络训练模型，并共得到N×V个损失函数值；然后从N×V个损失函数值中找出值最小的损失函数值；接着将值最小的损失函数值对应的权值矢量和偏置项对应作为训练好的卷积神经网络训练模型的最优权值矢量和最优偏置项，对应记为W^best和b^best；其中，V＞1；Step 1_5: Repeat steps 1_3 and 1_4 for a total of V times to obtain the trained convolutional neural network training model, and obtain a total of N×V loss function values; then find the smallest value from the N×V loss function values. Then take the weight vector and the bias term corresponding to the loss function value with the smallest value as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, which are recorded as W ^best and b ^best ; wherein, V>1;

所述的测试阶段过程的具体步骤为：The specific steps of the test phase process are:

步骤2_1：令{Q(x',y')}表示待预测的单目图像；其中，1≤x'≤R'，1≤y'≤L'，R'表示{Q(x',y')}的宽度，L'表示{Q(x',y')}的高度，Q(x',y')表示{Q(x',y')}中坐标位置为(x',y')的像素点的像素值；Step 2_1: Let {Q(x',y')} denote the monocular image to be predicted; wherein, 1≤x'≤R', 1≤y'≤L', R' denotes {Q(x',y ')} width, L' means the height of {Q(x',y')}, Q(x',y') means the coordinate position in {Q(x',y')} is (x',y ') the pixel value of the pixel point;

步骤2_2：将{Q(x',y')}输入到训练好的卷积神经网络训练模型中，并利用W^best和b^best进行预测，得到{Q(x',y')}对应的预测深度图像，记为{Q_depth(x',y')}；其中，Q_depth(x',y')表示{Q_depth(x',y')}中坐标位置为(x',y')的像素点的像素值。Step 2_2: Input {Q(x',y')} into the trained convolutional neural network training model, and use W ^best and b ^best to predict, and get {Q(x',y')} corresponding The predicted depth image is recorded as {Q _depth (x',y')}; among them, Q _depth (x',y') indicates that the coordinate position in {Q _depth (x',y')} is (x',y) ') of the pixel value of the pixel point.

所述的步骤1_4中，采用均方误差函数获得。In the described steps 1-4, Obtained using the mean squared error function.

与现有技术相比，本发明的优点在于：Compared with the prior art, the advantages of the present invention are:

1)本发明方法在构建卷积神经网络的过程中采用了跳层连接方式，即采用了Concatenate融合层，并同时在编码框架内使用了短跳层连接，即使用了第一Concatenate融合层、第二Concatenate融合层、第三Concatenate融合层进行连接；在编码框架和译码框架间使用了长跳层连接，即使用了第四Concatenate融合层、第五Concatenate融合层、第六Concatenate融合层、第七Concatenate融合层进行连接，使用跳层连接有益于多尺度特征融合和边界保持，短跳层连接丰富了在编码过程中信息多样性，长跳层连接解决了译码部分原始边界信息的缺失，从而使得利用训练得到的卷积神经网络训练模型进行深度估计更准确。1) The method of the present invention adopts the skip-layer connection mode in the process of constructing the convolutional neural network, that is, the Concatenate fusion layer is adopted, and the short-jump layer connection is used in the coding framework at the same time, that is, the first Concatenate fusion layer, The second Concatenate fusion layer and the third Concatenate fusion layer are connected; a long-hop connection is used between the encoding frame and the decoding frame, that is, the fourth Concatenate fusion layer, the fifth Concatenate fusion layer, the sixth Concatenate fusion layer, The seventh Concatenate fusion layer is connected. The use of skip layer connection is beneficial to multi-scale feature fusion and boundary preservation. Short skip layer connection enriches the diversity of information in the encoding process, and long skip layer connection solves the lack of original boundary information in the decoding part. , so that the depth estimation using the trained convolutional neural network training model is more accurate.

2)本发明方法使用端到端的卷积神经网络训练框架，在编码框架的第三最大池化层之后使用了三个带孔卷积层来提取特征信息，而带孔卷积层能够在不增加训练参数的数量的前提下可以扩大神经元的感受野，得到更多的特征信息。2) The method of the present invention uses an end-to-end convolutional neural network training framework, and uses three atrous convolutional layers after the third maximum pooling layer of the coding frame to extract feature information, and the atroused convolutional layer can On the premise of increasing the number of training parameters, the receptive field of neurons can be expanded and more feature information can be obtained.

3)本发明方法创建的卷积神经网络的隐层包括编码框架、译码框架和上采样框架，三个框架的结合使得利用训练得到的卷积神经网络训练模型能够提取到具有丰富信息的特征，从而可以获得准确性高的深度信息，进而提高了深度估计结果的精度。3) The hidden layer of the convolutional neural network created by the method of the present invention includes an encoding frame, a decoding frame and an upsampling frame, and the combination of the three frames enables the convolutional neural network training model obtained by training to extract features with rich information , so that the depth information with high accuracy can be obtained, thereby improving the accuracy of the depth estimation result.

4)利用本发明方法得到的预测深度图像的尺寸与原始的单目图像的尺寸相同，有利于对其中深度信息的直接使用。4) The size of the predicted depth image obtained by the method of the present invention is the same as the size of the original monocular image, which is beneficial to the direct use of the depth information therein.

附图说明Description of drawings

图1为本发明方法中创建的卷积神经网络的隐层中的编码框架的组成结构示意图；Fig. 1 is the composition structure schematic diagram of the coding frame in the hidden layer of the convolutional neural network created in the method of the present invention;

图2为本发明方法中创建的卷积神经网络的隐层中的译码框架和创建的卷积神经网络的输出层各自的组成结构示意图；Fig. 2 is the respective composition structure diagram of the decoding frame in the hidden layer of the convolutional neural network created in the method of the present invention and the output layer of the created convolutional neural network;

图3为本发明方法中创建的卷积神经网络的隐层中的上采样框架的组成结构示意图。FIG. 3 is a schematic diagram of the composition and structure of the upsampling framework in the hidden layer of the convolutional neural network created in the method of the present invention.

具体实施方式Detailed ways

以下结合附图实施例对本发明作进一步详细描述。The present invention will be further described in detail below with reference to the embodiments of the accompanying drawings.

本发明提出的一种单目视觉深度估计方法，其特征在于包括训练阶段和测试阶段两个过程。A monocular visual depth estimation method proposed by the present invention is characterized in that it includes two processes: a training phase and a testing phase.

步骤1_1：选取N幅原始的单目图像及每幅原始的单目图像对应的真实深度图像，并构成训练集，将训练集中的第n幅原始的单目图像记为{Qⁿ(x,y)}，将训练集中与{Qⁿ(x,y)}对应的真实深度图像记为其中，N为正整数，N≥100，如取N＝1000，n为正整数，1≤n≤N，1≤x≤R，1≤y≤L，R表示{Qⁿ(x,y)}和的宽度，L表示{Qⁿ(x,y)}和的高度，R和L均能被2整除，Qⁿ(x,y)表示{Qⁿ(x,y)}中坐标位置为(x,y)的像素点的像素值，表示中坐标位置为(x,y)的像素点的像素值；在此，原始的单目图像和其对应的真实深度图像直接由KITTI官网提供。Step 1_1: Select N original monocular images and the real depth image corresponding to each original monocular image, and form a training set, and record the nth original monocular image in the training set as {Q ⁿ (x, y)}, denote the real depth image corresponding to {Q ⁿ (x, y)} in the training set as Among them, N is a positive integer, N≥100, if N=1000, n is a positive integer, 1≤n≤N, 1≤x≤R, 1≤y≤L, R represents {Q ⁿ (x,y) }and The width of , L represents {Q ⁿ (x, y)} and The height of , R and L are both divisible by 2, Q ⁿ (x, y) represents the pixel value of the pixel at the coordinate position (x, y) in {Q ⁿ (x, y)}, express The pixel value of the pixel whose mid-coordinate position is (x, y); here, the original monocular image and its corresponding real depth image are directly provided by the KITTI official website.

步骤1_2：构建端到端的卷积神经网络：卷积神经网络包括输入层、隐层和输出层；隐层包括编码框架、译码框架和上采样框架。Step 1_2: Build an end-to-end convolutional neural network: The convolutional neural network includes an input layer, a hidden layer, and an output layer; the hidden layer includes an encoding frame, a decoding frame, and an upsampling frame.

对于输入层，输入层的输入端接收一幅原始输入图像，输入层的输出端输出原始输入图像给隐层；其中，要求输入层的输入端接收的原始输入图像的宽度为R、高度为L。For the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; among them, the width of the original input image received by the input end of the input layer is required to be R and the height is L .

对于编码框架，如图1所示，其由依次设置的第一卷积层、第一批规范化层、第一激活层、第一最大池化层、第二卷积层、第二批规范化层、第二激活层、第三卷积层、第三批规范化层、第一Concatenate融合层、第三激活层、第二最大池化层、第四卷积层、第四批规范化层、第四激活层、第五卷积层、第五批规范化层、第二Concatenate融合层、第五激活层、第三最大池化层、第一带孔卷积层、第六批规范化层、第六激活层、第二带孔卷积层、第七批规范化层、第三Concatenate融合层、第七激活层、第四最大池化层、第三带孔卷积层、第八批规范化层、第八激活层组成；对于译码框架，如图2所示，其由依次设置的第一反卷积层、第九批规范化层、第四Concatenate融合层、第九激活层、第六卷积层、第十批规范化层、第十激活层、第二反卷积层、第十一批规范化层、第五Concatenate融合层、第十一激活层、第七卷积层、第十二批规范化层、第十二激活层、第三反卷积层、第十三批规范化层、第六Concatenate融合层、第十三激活层、第八卷积层、第十四批规范化层、第十四激活层、第四反卷积层、第十五批规范化层、第七Concatenate融合层组成；对于上采样框架，如图3所示，其由依次设置的第一上采样层、第十卷积层、第十七批规范化层、第十七激活层、第二上采样层、第十一卷积层、第十八批规范化层、第十八激活层、第三上采样层、第十二卷积层、第十九批规范化层、第十九激活层、第四上采样层、第十三卷积层、第二十批规范化层、第二十激活层组成；对于输出层，如图2所示，其由依次设置的第十五激活层、第九卷积层、第十六批规范化层、第十六激活层组成，其中，第一卷积层至第十三卷积层、第一带孔卷积层至第三带孔卷积层、第一反卷积层至第四反卷积层各自的卷积核大小为3×3，第一卷积层的卷积核个数为32、第二卷积层和第三卷积层的卷积核个数为64、第四卷积层和第五卷积层的卷积核个数为128、第一带孔卷积层和第二带孔卷积层的卷积核个数为256、第三带孔卷积层的卷积核个数为512、第一反卷积层和第六卷积层的卷积核个数为256、第二反卷积层和第七卷积层的卷积核个数为128、第三反卷积层和第八卷积层的卷积核个数为64、第四反卷积层的卷积核个数为32、第九卷积层的卷积核个数为1、第十卷积层的卷积核个数为256、第十一卷积层的卷积核个数为128、第十二卷积层的卷积核个数为64、第十三卷积层的卷积核个数为32，第一卷积层至第十三卷积层、第一带孔卷积层至第三带孔卷积层各自的卷积步长采用默认值，第一反卷积层至第四反卷积层各自的卷积步长为2×2，第一批规范化层至第二十批规范化层的参数采用默认值，第一激活层至第二十激活层的激活函数采用ReLu，第一最大池化层至第四最大池化层的池化步长为2×2，第一上采样层至第四上采样层的采样步长为2×2。For the encoding framework, as shown in Figure 1, it consists of the first convolutional layer, the first normalization layer, the first activation layer, the first max pooling layer, the second convolutional layer, and the second batch of normalization layers. , second activation layer, third convolution layer, third batch normalization layer, first concatenate fusion layer, third activation layer, second max pooling layer, fourth convolution layer, fourth batch normalization layer, fourth batch Activation layer, fifth convolution layer, fifth batch normalization layer, second concatenate fusion layer, fifth activation layer, third max pooling layer, first atrous convolutional layer, sixth batch normalization layer, sixth activation layer layer, second atrous convolutional layer, seventh batch normalization layer, third concatenate fusion layer, seventh activation layer, fourth max pooling layer, third atrous convolutional layer, eighth batch normalization layer, eighth batch The activation layer consists of the activation layer; for the decoding framework, as shown in Figure 2, it consists of the first deconvolution layer, the ninth batch of normalization layers, the fourth Concatenate fusion layer, the ninth activation layer, the sixth convolution layer, The tenth normalization layer, the tenth activation layer, the second deconvolution layer, the eleventh normalization layer, the fifth Concatenate fusion layer, the eleventh activation layer, the seventh convolution layer, the twelfth batch normalization layer, Twelfth Activation Layer, Third Deconvolution Layer, Thirteenth Batch Normalization Layer, Sixth Concatenate Fusion Layer, Thirteenth Activation Layer, Eighth Convolutional Layer, Fourteenth Batch Normalization Layer, Fourteenth Activation Layer , the fourth deconvolution layer, the fifteenth batch normalization layer, and the seventh Concatenate fusion layer; for the upsampling framework, as shown in Figure 3, it consists of the first upsampling layer, the tenth convolutional layer, Batch 17 Normalization Layer, Activation Layer 17, Upsampling Layer 2, Convolution Layer 11, Normalization Layer 18, Activation Layer 18, Upsampling Layer 3, Convolution 12 layer, the nineteenth batch normalization layer, the nineteenth activation layer, the fourth upsampling layer, the thirteenth convolution layer, the twentieth batch normalization layer, and the twentieth activation layer; for the output layer, as shown in Figure 2 It is composed of the fifteenth activation layer, the ninth convolution layer, the sixteenth batch of normalization layers, and the sixteenth activation layer, which are set in sequence. The size of the convolution kernels from the atrous convolutional layer to the third atrous convolutional layer and the first deconvolutional layer to the fourth deconvolutional layer is 3×3, and the number of convolution kernels of the first convolutional layer is 32. The number of convolution kernels of the second convolutional layer and the third convolutional layer is 64, the number of convolutional kernels of the fourth convolutional layer and the fifth convolutional layer is 128, the first convolutional layer with holes and The number of convolution kernels in the second convolutional layer with holes is 256, the number of convolution kernels in the third convolutional layer with holes is 512, and the number of convolution kernels in the first deconvolution layer and the sixth convolutional layer is 256, the number of convolution kernels of the second deconvolution layer and the seventh convolution layer is 128, the number of convolution kernels of the third deconvolution layer and the eighth convolution layer is 64, and the number of convolution kernels of the fourth deconvolution layer is 64. The number of convolution kernels of the layer is 32, the number of convolution kernels of the ninth convolution layer is 1, the number of convolution kernels of the tenth convolution layer is 256, and the number of convolution kernels of the eleventh convolution layer For 128, the convolution of the twelfth convolutional layer The number of kernels is 64, the number of convolution kernels of the thirteenth convolutional layer is 32, the first convolutional layer to the thirteenth convolutional layer, the first convolutional layer to the third convolutional layer with holes are respectively The convolution step size of the first deconvolution layer to the fourth deconvolution layer is 2 × 2, and the parameters of the first batch of normalization layers to the twentieth batch of normalization layers use the default values. , the activation functions from the first activation layer to the twentieth activation layer use ReLu, the pooling step size from the first maximum pooling layer to the fourth maximum pooling layer is 2×2, and the first upsampling layer to the fourth upsampling The sampling stride of the layer is 2×2.

对于译码框架，第一反卷积层的输入端接收编码框架的输出即H₈中的所有特征图，第一反卷积层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为F₁，其中，F₁中的每幅特征图的宽度为高度为第九批规范化层的输入端接收F₁中的所有特征图，第九批规范化层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为P₉，其中，P₉中的每幅特征图的宽度为高度为第四Concatenate融合层的输入端接收P₉中的所有特征图和P₇中的所有特征图，第四Concatenate融合层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为C₄，其中，C₄中的每幅特征图的宽度为高度为第九激活层的输入端接收C₄中的所有特征图，第九激活层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为H₉，其中，H₉中的每幅特征图的宽度为高度为第六卷积层的输入端接收H₉中的所有特征图，第六卷积层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为J₆，其中，J₆中的每幅特征图的宽度为高度为第十批规范化层的输入端接收J₆中的所有特征图，第十批规范化层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为P₁₀，其中，P₁₀中的每幅特征图的宽度为高度为第十激活层的输入端接收P₁₀中的所有特征图，第十激活层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为H₁₀，其中，H₁₀中的每幅特征图的宽度为高度为第二反卷积层的输入端接收编码框架的输出即H₁₀中的所有特征图，第二反卷积层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为F₂，其中，F₂中的每幅特征图的宽度为高度为第十一批规范化层的输入端接收F₂中的所有特征图，第十一批规范化层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为P₁₁，其中，P₁₁中的每幅特征图的宽度为高度为第五Concatenate融合层的输入端接收P₁₁中的所有特征图和P₅中的所有特征图，第五Concatenate融合层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为C₅，其中，C₅中的每幅特征图的宽度为高度为第十一激活层的输入端接收C₅中的所有特征图，第十一激活层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为H₁₁，其中，H₁₁中的每幅特征图的宽度为高度为第七卷积层的输入端接收H₁₁中的所有特征图，第七卷积层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为J₇，其中，J₇中的每幅特征图的宽度为高度为第十二批规范化层的输入端接收J₇中的所有特征图，第十二批规范化层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为P₁₂，其中，P₁₂中的每幅特征图的宽度为高度为第十二激活层的输入端接收P₁₂中的所有特征图，第十二激活层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为H₁₂，其中，H₁₂中的每幅特征图的宽度为高度为第三反卷积层的输入端接收H₁₂中的所有特征图，第三反卷积层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为F₃，其中，F₃中的每幅特征图的宽度为高度为第十三批规范化层的输入端接收F₃中的所有特征图，第十三批规范化层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为P₁₃，其中，P₁₃中的每幅特征图的宽度为高度为第六Concatenate融合层的输入端接收P₁₃中的所有特征图和P₃中的所有特征图，第六Concatenate融合层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为C₆，其中，C₆中的每幅特征图的宽度为高度为第十三激活层的输入端接收C₆中的所有特征图，第十三激活层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为H₁₃，其中，H₁₃中的每幅特征图的宽度为高度为第八卷积层的输入端接收H₁₃中的所有特征图，第八卷积层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为J₈，其中，J₈中的每幅特征图的宽度为高度为第十四批规范化层的输入端接收J₈中的所有特征图，第十四批规范化层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为P₁₄，其中，P₁₄中的每幅特征图的宽度为高度为第十四激活层的输入端接收P₁₄中的所有特征图，第十四激活层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为H₁₄，其中，H₁₄中的每幅特征图的宽度为高度为第四反卷积层的输入端接收H₁₄中的所有特征图，第四反卷积层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为F₄，其中，F₄中的每幅特征图的宽度为R、高度为L；第十五批规范化层的输入端接收F₄中的所有特征图，第十五批规范化层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为P₁₅，其中，P₁₅中的每幅特征图的宽度为R、高度为L；第七Concatenate融合层的输入端接收P₁₅中的所有特征图、H₁中的所有特征图、上采样框架的输出，第七Concatenate融合层的输出端输出96幅特征图，将输出的所有特征图构成的集合记为C₇，其中，C₇中的每幅特征图的宽度为R、高度为L。For the decoding framework, the input end of the first deconvolution layer receives the output of the encoding framework, that is, all feature maps in _H8 , and the output end of the first deconvolution layer outputs 256 feature maps, which are composed of all the output feature maps. The set is denoted as F ₁ , where the width of each feature map in F ₁ is height is The input end of the ninth batch of normalization layers receives all the feature maps in F ₁ , and the output end of the ninth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₉ , where in P ₉ The width of each feature map is height is The input end of the fourth Concatenate fusion layer receives all the feature maps in P ₉ and all the feature maps in P ₇ , and the output end of the fourth Concatenate fusion layer outputs 512 feature maps, and the set composed of all the output feature maps is recorded as C ₄ , where the width of each feature map in C ₄ is height is The input end of the _ninth activation layer receives all the feature maps in C ₄ , and the output end of the _ninth activation layer outputs 512 feature maps. The width of the feature map is height is The input end of the sixth convolutional layer receives all the feature maps in H ₉ , the output end of the sixth convolution layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as J ₆ , where in J ₆ The width of each feature map is height is The input end of the tenth batch of normalization layers receives all the feature maps in J ₆ , the output end of the tenth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₁₀ , where in P ₁₀ The width of each feature map is height is The input end of the tenth activation layer receives all the feature maps in P ₁₀ , and the output end of the tenth activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H ₁₀ , where each H ₁₀ The width of the feature map is height is The input end of the second deconvolution layer receives the output of the coding framework, that is, all feature maps in _H10 , and the output end of the second deconvolution layer outputs 128 feature maps, and the set of all the output feature maps is denoted as F ₂ , where the width of each feature map in _F2 is height is The input end of the eleventh batch of normalization layers receives all the feature maps in F ₂ , and the output end of the eleventh batch of normalization layers outputs 128 feature maps, and the set of all the output feature maps is denoted as P ₁₁ , where P The width of each feature map in ₁₁ is height is The input of the fifth Concatenate fusion layer receives all the feature maps in P ₁₁ and all the feature maps in P ₅ , and the output of the fifth Concatenate fusion layer outputs 256 feature maps, and the set of all the output feature maps is recorded as C ₅ , where the width of each feature map in C ₅ is height is The input end of the eleventh activation layer receives all the feature maps in _C5 , and the output end of the _{eleventh activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H 11} _. The width of each feature map is height is The input end of the seventh convolutional layer receives all the feature maps in H ₁₁ , the output end of the seventh convolution layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J ₇ , where in J ₇ The width of each feature map is height is The input end of the twelfth batch of normalization layers receives all feature maps in J ₇ , the output end of the twelfth batch of normalization layers outputs 128 feature maps, and the set formed by all the output feature maps is denoted as P ₁₂ , where P The width of each feature map in ₁₂ is height is The input end of the twelfth activation layer receives all the feature maps in P ₁₂ , the output end of the twelfth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as H ₁₂ , where in H ₁₂ The width of each feature map is height is The input end of the third deconvolution layer receives all the feature maps in H ₁₂ , the output end of the third deconvolution layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as F ₃ , where F The width of each feature map in ₃ is height is The input end of the thirteenth batch of normalization layers receives all the feature maps in F ₃ , the output end of the thirteenth batch of normalization layers outputs 64 feature maps, and the set of all the output feature maps is denoted as P ₁₃ , where P The width of each feature map in ₁₃ is height is The input end of the sixth Concatenate fusion layer receives all the feature maps in P ₁₃ and all the feature maps in P ₃ , and the output end of the sixth Concatenate fusion layer outputs 128 feature maps, and the set composed of all the output feature maps is recorded as C ₆ , where the width of each feature map in C ₆ is height is The input end of the thirteenth activation layer receives all the feature maps in C ₆ , the output end of the thirteenth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as H ₁₃ , wherein, in H ₁₃ The width of each feature map is height is The input end of the eighth convolutional layer receives all the feature maps in H ₁₃ , the output end of the eighth convolution layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J ₈ , where in J ₈ The width of each feature map is height is The input end of the fourteenth batch of normalization layers receives all the feature maps in J ₈ , the output end of the fourteenth batch of normalization layers outputs 64 feature maps, and the set formed by all the output feature maps is denoted as P ₁₄ , where P The width of each feature map in ₁₄ is height is The input end of the fourteenth activation layer receives all the feature maps in P ₁₄ , the output end of the fourteenth activation layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as H ₁₄ , wherein, in H ₁₄ The width of each feature map is height is The input end of the fourth deconvolution layer receives all the feature maps in H ₁₄ , the output end of the fourth deconvolution layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as F ₄ , where F The width of each feature map in ₄ is R and the height is L; the input end of the fifteenth batch of normalization layers receives all the feature maps in F ₄ , and the output end of the fifteenth batch of normalization layers outputs 32 feature maps. The set composed of all the output feature maps is denoted as P ₁₅ , wherein the width of each feature map in P ₁₅ is R and the height is L; the input end of the seventh Concatenate fusion layer receives all the feature maps in P ₁₅ , H All feature maps in ₁ , the output of the upsampling framework, the output of the seventh Concatenate fusion layer outputs 96 feature maps, and the set of all the output feature maps is denoted as C ₇ , where each feature in C ₇ The width of the graph is R and the height is L.

对于上采样框架，第一上采样层的输入端接收Z₄中的所有特征图，第一上采样层的输出端输出512幅特征图，将输出的所有特征图构成的集合记为Y₁，其中，Y₁中的每幅特征图的宽度为高度为第十卷积层的输入端接收Y₁中的所有特征图，第十卷积层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为J₁₀，其中，J₁₀中的每幅特征图的宽度为高度为第十七批规范化层的输入端接收J₁₀中的所有特征图，第十七批规范化层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为P₁₇，其中，P₁₇中的每幅特征图的宽度为高度为第十七激活层的输入端接收P₁₇中的所有特征图，第十七激活层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为H₁₇，其中，H₁₇中的每幅特征图的宽度为高度为第二上采样层的输入端接收H₁₇中的所有特征图，第二上采样层的输出端输出256幅特征图，将输出的所有特征图构成的集合记为Y₂，其中，Y₂中的每幅特征图的宽度为高度为第十一卷积层的输入端接收Y₂中的所有特征图，第十一卷积层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为J₁₁，其中，J₁₁中的每幅特征图的宽度为高度为第十八批规范化层的输入端接收J₁₁中的所有特征图，第十八批规范化层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为P₁₈，其中，P₁₈中的每幅特征图的宽度为高度为第十八激活层的输入端接收P₁₈中的所有特征图，第十八激活层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为H₁₈，其中，H₁₈中的每幅特征图的宽度为高度为第三上采样层的输入端接收H₁₈中的所有特征图，第三上采样层的输出端输出128幅特征图，将输出的所有特征图构成的集合记为Y₃，其中，Y₃中的每幅特征图的宽度为高度为第十二卷积层的输入端接收Y₃中的所有特征图，第十二卷积层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为J₁₂，其中，J₁₂中的每幅特征图的宽度为高度为第十九批规范化层的输入端接收J₁₂中的所有特征图，第十九批规范化层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为P₁₉，其中，P₁₉中的每幅特征图的宽度为高度为第十九激活层的输入端接收P₁₉中的所有特征图，第十九激活层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为H₁₉，其中，H₁₉中的每幅特征图的宽度为高度为第四上采样层的输入端接收H₁₉中的所有特征图，第四上采样层的输出端输出64幅特征图，将输出的所有特征图构成的集合记为Y₄，其中，Y₄中的每幅特征图的宽度为R、高度为L；第十三卷积层的输入端接收Y₄中的所有特征图，第十三卷积层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为J₁₃，其中，J₁₃中的每幅特征图的宽度为R、高度为L；第二十批规范化层的输入端接收J₁₃中的所有特征图，第二十批规范化层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为P₂₀，其中，P₂₀中的每幅特征图的宽度为R、高度为L；第二十激活层的输入端接收P₂₀中的所有特征图，第二十激活层的输出端输出32幅特征图，将输出的所有特征图构成的集合记为H₂₀，其中，H₂₀中的每幅特征图的宽度为R、高度为L。For the up-sampling framework, the input of the first up-sampling layer receives all the feature maps in Z ₄ , the output of the first up-sampling layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as Y ₁ , where the width of each feature map in Y ₁ is height is The input end of the tenth convolutional layer receives all the feature maps in Y ₁ , and the output end of the tenth convolution layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as J ₁₀ , where in J ₁₀ The width of each feature map is height is The input end of the seventeenth batch of normalization layers receives all feature maps in J ₁₀ , the output end of the seventeenth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₁₇ , where P The width of each feature map in ₁₇ is height is The input end of the seventeenth activation layer receives all the feature maps in P ₁₇ , the output end of the seventeenth activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H ₁₇ , wherein, in H ₁₇ The width of each feature map is height is The input end of the second upsampling layer receives all the feature maps in H ₁₇ , the output end of the second upsampling layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as Y ₂ , where in Y ₂ The width of each feature map is height is The input of the eleventh convolutional layer receives all the feature maps in Y ₂ , the output of the eleventh convolutional layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J ₁₁ , where J The width of each feature map in ₁₁ is height is The input end of the eighteenth batch of normalization layers receives all feature maps in J ₁₁ , the output end of the eighteenth batch of normalization layers outputs 128 feature maps, and the set formed by all the output feature maps is denoted as P ₁₈ , where P The width of each feature map in ₁₈ is height is The input end of the eighteenth activation layer receives all the feature maps in P ₁₈ , the output end of the eighteenth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as H ₁₈ , where in H ₁₈ The width of each feature map is height is The input end of the third up-sampling layer receives all the feature maps in H ₁₈ , the output end of the third up-sampling layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as Y ₃ , where in Y ₃ The width of each feature map is height is The input of the twelfth convolutional layer receives all feature maps in Y ₃ , and the output of the twelfth convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J ₁₂ , where J The width of each feature map in ₁₂ is height is The input end of the nineteenth batch of normalization layers receives all the feature maps in J ₁₂ , the output end of the nineteenth batch of normalization layers outputs 64 feature maps, and the set formed by all the output feature maps is denoted as P ₁₉ , where P The width of each feature map in ₁₉ is height is The input end of the nineteenth activation layer receives all the feature maps in P ₁₉ , and the output end of the nineteenth activation layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as H ₁₉ , where in H ₁₉ The width of each feature map is height is The input end of the fourth upsampling layer receives all the feature maps in H ₁₉ , the output end of the fourth upsampling layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as Y ₄ , where in Y ₄ _The width of each feature map of the The set composed of all feature maps is denoted as J ₁₃ , wherein the width of each feature map in J ₁₃ is R and the height is L; the input end of the twentieth batch of normalization layers receives all the feature maps in J ₁₃ , the second The output of ten batches of normalization layers outputs 32 feature maps, and the set of all output feature maps is denoted as P ₂₀ , where the width of each feature map in P ₂₀ is R and the height is L; the twentieth activation The input end of the layer receives all the feature maps in P ₂₀ , and the output end of the _twentieth activation layer outputs ₃₂ feature maps. The width of the graph is R and the height is L.

对于输出层，第十五激活层的输入端接收译码框架的输出即C₇中的所有特征图，第十五激活层的输出端输出96幅特征图，将输出的所有特征图构成的集合记为H₁₅，其中，H₁₅中的每幅特征图的宽度为R、高度为L；第九卷积层的输入端接收H₁₅中的所有特征图，第九卷积层的输出端输出1幅特征图，将输出的所有特征图构成的集合记为J₉，其中，J₉中的特征图的宽度为R、高度为L；第十六批规范化层的输入端接收J₉中的特征图，第十六批规范化层的输出端输出1幅特征图，将输出的所有特征图构成的集合记为P₁₆，其中，P₁₆中的特征图的宽度为R、高度为L；第十六激活层的输入端接收P₁₆中的特征图，第十六激活层的输出端输出1幅特征图，将输出的所有特征图构成的集合记为H₁₆，其中，H₁₆中的特征图的宽度为R、高度为L，H₁₆中的特征图即为原始输入图像对应的估计深度图像。For the output layer, the input terminal of the fifteenth activation layer receives the output of the decoding framework, that is, all the feature maps in _C7 , and the output terminal of the fifteenth activation layer outputs 96 feature maps. It is denoted as H ₁₅ , wherein the width of each feature map in H ₁₅ is R and the height is L; the input end of the ninth convolution layer receives all the feature maps in H ₁₅ , and the output end of the ninth convolution layer outputs 1 feature map, the set of all output feature maps is denoted as J ₉ , where the width of the feature map in J ₉ is R, and the height is L; the input of the sixteenth batch of normalization layers receives J ₉ . Feature map, the output end of the sixteenth batch of normalization layers outputs one feature map, and the set composed of all the output feature maps is denoted as P ₁₆ , where the width of the feature map in P ₁₆ is R and the height is L; The input terminal of the sixteenth activation layer receives the feature map in P ₁₆ , and the output terminal of the sixteenth activation layer outputs one feature map, and the set formed by all the output feature maps is denoted as H ₁₆ , wherein the features in H ₁₆ The width of the map is R and the height is L, and the feature map in H ₁₆ is the estimated depth image corresponding to the original input image.

步骤1_3：将训练集中的每幅原始的单目图像作为原始输入图像，输入到卷积神经网络中进行训练，得到训练集中的每幅原始的单目图像对应的估计深度图像，将{Qⁿ(x,y)}对应的估计深度图像记为其中，表示中坐标位置为(x,y)的像素点的像素值。Step 1_3: Take each original monocular image in the training set as the original input image, input it into the convolutional neural network for training, and obtain the estimated depth image corresponding to each original monocular image in the training set, and set {Q ⁿ The estimated depth image corresponding to (x,y)} is denoted as in, express The pixel value of the pixel whose middle coordinate position is (x, y).

步骤1_4：计算训练集中的每幅原始的单目图像对应的估计深度图像与对应的真实深度图像之间的损失函数值，将与之间的损失函数值记为采用均方误差函数获得。Step 1_4: Calculate the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image, and The loss function value between is denoted as Obtained using the mean squared error function.

步骤1_5：重复执行步骤1_3和步骤1_4共V次，得到训练好的卷积神经网络训练模型，并共得到N×V个损失函数值；然后从N×V个损失函数值中找出值最小的损失函数值；接着将值最小的损失函数值对应的权值矢量和偏置项对应作为训练好的卷积神经网络训练模型的最优权值矢量和最优偏置项，对应记为W^best和b^best；其中，V＞1，在本实施例中取V＝20。Step 1_5: Repeat steps 1_3 and 1_4 for a total of V times to obtain a trained convolutional neural network training model, and obtain a total of N×V loss function values; then find the smallest value from the N×V loss function values. Then take the weight vector and the bias term corresponding to the loss function value with the smallest value as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, corresponding to W ^best and b ^best ; where V>1, in this embodiment, V=20.

所述的测试阶段过程的具体步骤为：The specific steps of the described testing phase process are:

步骤2_1：令{Q(x',y')}表示待预测的单目图像；其中，1≤x'≤R'，1≤y'≤L'，R'表示{Q(x',y')}的宽度，L'表示{Q(x',y')}的高度，Q(x',y')表示{Q(x',y')}中坐标位置为(x',y')的像素点的像素值。Step 2_1: Let {Q(x',y')} denote the monocular image to be predicted; wherein, 1≤x'≤R', 1≤y'≤L', R' denotes {Q(x',y ')} width, L' means the height of {Q(x',y')}, Q(x',y') means the coordinate position in {Q(x',y')} is (x',y ') of the pixel value of the pixel point.

步骤2_2：将{Q(x',y')}输入到训练好的卷积神经网络训练模型中，并利用W^best和b^best进行预测，得到{Q(x',y')}对应的预测深度图像，记为{Q_depth(x',y')}；其中，Q_depth(x',y')表示{Q_depth(x',y')}中坐标位置为(x',y')的像素点的像素值。Step 2_2: Input {Q(x',y')} into the trained convolutional neural network training model, and use W ^best and b ^best to predict, and get {Q(x',y')} corresponding The predicted depth image is recorded as {Q _depth (x', y')}; among them, Q _depth (x', y') indicates that the coordinate position in {Q _depth (x', y')} is (x', y ') of the pixel value of the pixel point.

为了验证本发明方法的可行性和有效性，对本发明方法进行实验。In order to verify the feasibility and effectiveness of the method of the present invention, experiments were carried out on the method of the present invention.

在此，本发明方法中构成训练集的单目图像和用于测试的单目图像均由KITTI官方网站给出，因此直接使用KITTI官方网站给出的测试数据集来分析测试本发明方法的准确性。将测试数据集中的每幅单目图像作为待预测的单目图像输入到训练好的深度卷积神经网络训练模型中，再载入训练阶段得到的最优权重W^best，获得对应的预测深度图像。Here, the monocular images constituting the training set and the monocular images used for testing in the method of the present invention are given by the official website of KITTI, so the test data set given by the official website of KITTI is directly used to analyze and test the accuracy of the method of the present invention sex. Input each monocular image in the test data set as the monocular image to be predicted into the trained deep convolutional neural network training model, and then load the optimal weight W ^best obtained in the training stage to obtain the corresponding predicted depth image .

在此，采用单目视觉深度预测评价方法的6个常用客观参量作为评价指标，即：均方根误差(root mean squared error，rms)、对数均方根误差(log_rms)、平均对数误差(average log₁₀error，log10)、阈值准确性(thr)：δ₁、δ₂、δ₃。均方根误差、对数均方根误差、平均对数误差的数值越低代表预测深度图像与真实深度图像越接近，δ₁、δ₂、δ₃的数值越高说明预测深度图像的准确性越高。反映本发明方法的评价性能优劣指标的均方根误差、对数均方根误差、平均对数误差和δ₁、δ₂、δ₃的结果如表1所列。从表1所列的数据可知，按本发明方法获得的预测深度图像与真实深度图像之间的差别很小，这说明了本发明方法的预测结果的精度很高，体现了本发明方法的可行性和有效性。Here, six common objective parameters of the monocular visual depth prediction evaluation method are used as evaluation indicators, namely: root mean squared error (rms), logarithmic root mean squared error (log_rms), average logarithmic error (average log ₁₀ error, log 10), threshold accuracy (thr): δ ₁ , δ ₂ , δ ₃ . The lower the values of root mean square error, logarithmic root mean square error, and average logarithmic error, the closer the predicted depth image is to the real depth image, and the higher the values of δ ₁ , δ ₂ , and δ ₃ , the accuracy of the predicted depth image. higher. Table 1 lists the results of root mean square error, logarithmic root mean square error, average logarithmic error, and δ ₁ , δ ₂ , and δ ₃ reflecting the evaluation performance indicators of the present invention. From the data listed in Table 1, it can be seen that the difference between the predicted depth image obtained by the method of the present invention and the real depth image is very small, which shows that the accuracy of the prediction result of the method of the present invention is very high, and reflects the feasibility of the method of the present invention. sex and effectiveness.

表1利用本发明方法预测得到的预测深度图像与真实深度图像之间的对比评价指标Table 1 Comparison and evaluation index between the predicted depth image and the real depth image predicted by the method of the present invention

Claims

1. a monocular vision depth estimation method is characterized in that comprising two processes of training stage and testing stage;

The specific steps of the training phase process are:

Step 1_1: Select N original monocular images and the real depth image corresponding to each original monocular image, and form a training set, and record the nth original monocular image in the training set as {Q ⁿ (x, y)}, denote the real depth image corresponding to {Q ⁿ (x, y)} in the training set as Among them, N is a positive integer, N≥100, n is a positive integer, 1≤n≤N, 1≤x≤R, 1≤y≤L, R represents {Q ⁿ (x, y)} and The width of , L represents {Q ⁿ (x, y)} and The height of , R and L are both divisible by 2, Q ⁿ (x, y) represents the pixel value of the pixel at the coordinate position (x, y) in {Q ⁿ (x, y)}, express The pixel value of the pixel whose middle coordinate position is (x, y);

Step 1_2: Build an end-to-end convolutional neural network: the convolutional neural network includes an input layer, a hidden layer, and an output layer; the hidden layer includes an encoding frame, a decoding frame, and an upsampling frame;

For the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; among them, the width of the original input image received by the input end of the input layer is required to be R and the height is L ;

For the encoding framework, it consists of the first convolutional layer, the first normalization layer, the first activation layer, the first max pooling layer, the second convolutional layer, the second normalization layer, the second activation layer, 3rd Convolutional Layer, 3rd Batch Normalization Layer, 1st Concatenate Fusion Layer, 3rd Activation Layer, 2nd Max Pooling Layer, 4th Convolutional Layer, 4th Batch Normalization Layer, 4th Activation Layer, 5th Volume Convolution layer, fifth batch normalization layer, second concatenate fusion layer, fifth activation layer, third max pooling layer, first convolutional layer with holes, sixth batch normalization layer, sixth activation layer, second hole Convolutional layer, seventh batch of normalization layer, third Concatenate fusion layer, seventh activation layer, fourth maximum pooling layer, third convolutional layer with holes, eighth batch of normalization layer, eighth activation layer; The code frame is composed of the first deconvolution layer, the ninth batch of normalization layers, the fourth Concatenate fusion layer, the ninth activation layer, the sixth convolution layer, the tenth batch of normalization layers, the tenth activation layer, the fourth batch of The second deconvolution layer, the eleventh normalization layer, the fifth concatenate fusion layer, the eleventh activation layer, the seventh convolution layer, the twelfth normalization layer, the twelfth activation layer, the third deconvolution layer , the thirteenth batch of normalization layer, the sixth Concatenate fusion layer, the thirteenth activation layer, the eighth convolution layer, the fourteenth batch of normalization layer, the fourteenth activation layer, the fourth deconvolution layer, the fifteenth batch The normalization layer and the seventh Concatenate fusion layer are composed; for the upsampling framework, it consists of the first upsampling layer, the tenth convolutional layer, the seventeenth batch normalization layer, the seventeenth activation layer, and the second upsampling layer. , the eleventh convolutional layer, the eighteenth batch of normalization layers, the eighteenth activation layer, the third upsampling layer, the twelfth convolutional layer, the nineteenth batch of normalization layers, the nineteenth activation layer, the fourth upper The sampling layer, the thirteenth convolutional layer, the twentieth batch of normalization layers, and the twentieth activation layer are composed; for the output layer, it consists of the fifteenth activation layer, the ninth convolutional layer, and the sixteenth batch of normalization set in sequence. layer and the sixteenth activation layer, among which, the first convolutional layer to the thirteenth convolutional layer, the first convolutional layer with holes to the third convolutional layer with holes, the first deconvolution layer to the fourth inverse convolutional layer The convolution kernel size of each convolution layer is 3 × 3, the number of convolution kernels of the first convolution layer is 32, the number of convolution kernels of the second convolution layer and the third convolution layer is 64, and the number of convolution kernels of the fourth convolution layer is 64. The number of convolution kernels of the convolutional layer and the fifth convolutional layer is 128, the number of convolution kernels of the first convolutional layer and the second convolutional layer with holes is 256, and the number of convolutional kernels of the third convolutional layer with holes The number of convolution kernels is 512, the number of convolution kernels of the first deconvolution layer and the sixth convolution layer is 256, the number of convolution kernels of the second deconvolution layer and the seventh convolution layer is 128, The number of convolution kernels in the third deconvolution layer and the eighth convolution layer is 64, the number of convolution kernels in the fourth deconvolution layer is 32, and the number of convolution kernels in the ninth convolution layer is 1. The number of convolution kernels in the tenth convolution layer is 256, the number of convolution kernels in the eleventh convolution layer is 128, the number of convolution kernels in the twelfth convolution layer is 64, and the number of convolution kernels in the thirteenth convolution layer is 64. The number of convolution kernels is 32, the first The convolution strides from the first convolutional layer to the thirteenth convolutional layer, the first convolutional convolutional layer to the third convolutional convolutional layer with holes adopt the default values, and the first deconvolutional layer to the fourth deconvolutional layer The respective convolution step size is 2×2, the parameters of the first batch of normalization layers to the twentieth batch of normalization layers use default values, the activation functions of the first activation layer to the twentieth activation layer use ReLu, and the first maximum pooling is used. The pooling step size from the layer to the fourth maximum pooling layer is 2×2, and the sampling step size from the first upsampling layer to the fourth upsampling layer is 2×2;

For the coding framework, the input of the first convolutional layer receives the original input image output by the output of the input layer, the output of the first convolutional layer outputs 32 feature maps, and the set of all the output feature maps is denoted as J ₁ , where the width of each feature map in J ₁ is R and the height is L; the input end of the first batch of normalization layers receives all the feature maps in J ₁ , and the output end of the first batch of normalization layers outputs 32 features The set of all output feature maps is denoted as P ₁ , wherein the width of each feature map in P ₁ is R and the height is L; the input end of the first activation layer receives all feature maps in P ₁ , the output end of the first activation layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as H ₁ , wherein the width of each feature map in H ₁ is R and the height is L; the first largest The input end of the pooling layer receives all the feature maps in H ₁ , the output end of the first maximum pooling layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as Z ₁ , wherein, in Z ₁ The width of each feature map is height is The input of the second convolutional layer receives all the feature maps in Z ₁ , the output of the second convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J ₂ , where in J ₂ The width of each feature map is height is The input end of the _second batch of normalization layers receives all feature maps in J ₂ , and the output end of the _second batch of normalization layers outputs 64 feature maps. The width of each feature map is height is The input terminal of the _second activation layer receives all feature maps in P ₂ , and the output terminal of the _second activation layer outputs 64 feature maps. The width of the feature map is height is The input end of the third convolutional layer receives all the feature maps in H ₂ , and the output end of the _third convolution layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J ₃ , where in The width of each feature map is height is The input end of the third batch of normalization layers receives all feature maps in J ₃ , the output end of the third batch of normalization layers outputs 64 feature maps, and the set formed by all the output feature maps is denoted as P ₃ , where in P ₃ The width of each feature map is height is The input of the first Concatenate fusion layer receives all the feature maps in _P3 and all the feature maps in _H2 , and the output of the first Concatenate fusion layer outputs 128 feature maps, and the set composed of all the output feature maps is recorded as C ₁ , where the width of each feature map in C ₁ is height is The input terminal of the _third activation layer receives all feature maps in C ₁ , and the output terminal of the _third activation layer outputs 128 feature maps. The width of the feature map is height is The input of the second maximum pooling layer receives all the feature maps in H ₃ , the output of the second maximum pooling layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as Z ₂ , where Z The width of each feature map in ₂ is height is The input end of the fourth convolutional layer receives all the feature maps in Z ₂ , and the output end of the fourth convolution layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J ₄ , where in J ₄ The width of each feature map is height is The input terminal of the _fourth batch of normalization layers receives all feature maps in J ₄ , and the output terminal of the _fourth batch of normalization layers outputs 128 feature maps. The width of each feature map is height is The input end of the _fourth activation layer receives all the feature maps in P ₄ , and the output end of the _fourth activation layer outputs 128 feature maps. The width of the feature map is height is The input end of the fifth convolutional layer receives all the feature maps in H ₄ , and the output end of the fifth convolution layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J ₅ , where in J ₅ The width of each feature map is height is The input end of the fifth batch of normalization layers receives all feature maps in J ₅ , and the output end of the fifth batch of normalization layers outputs 128 feature maps, and the set formed by all the output feature maps is denoted as P ₅ , where in P ₅ The width of each feature map is height is The input end of the second Concatenate fusion layer receives all the feature maps in P ₅ and all the feature maps in H ₄ , and the output end of the second Concatenate fusion layer outputs 256 feature maps, and the set formed by all the output feature maps is recorded as C ₂ , where the width of each feature map in C ₂ is height is The input end of the _fifth activation layer receives all the feature maps in C ₂ , and the output end of the _fifth activation layer outputs 256 feature maps. The width of the feature map is height is The input terminal of the third maximum pooling layer receives all the feature maps in H ₅ , the output terminal of the third maximum pooling layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as Z ₃ , where Z 3 The width of each feature map in ₃ is height is The input of the first atrous convolutional layer receives all feature maps in Z ₃ , the output of the first atrous convolutional layer outputs 256 feature maps, and the set of all the output feature maps is denoted as K ₁ , where , the width _of each feature map in K1 is height is The input end of the sixth batch of normalization layers receives all the feature maps in K ₁ , the output end of the sixth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₆ , where in P ₆ The width of each feature map is height is The input end of the sixth activation layer receives all the feature maps in P ₆ , and the output end of the sixth activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H ₆ , where each of the feature maps in H ₆ The width of the feature map is height is The input end of the second atrous convolutional layer receives all the feature maps in H ₆ , the output end of the second atrous convolutional layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as K ₂ , where , the width of each feature map in _K2 is height is The input end of the seventh batch of normalization layers receives all the feature maps in K ₂ , the output end of the seventh batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₇ , where in P ₇ The width of each feature map is height is The input end of the third Concatenate fusion layer receives all the feature maps in P ₇ and all the feature maps in H ₆ , and the output end of the third Concatenate fusion layer outputs 512 feature maps, and the set composed of all the output feature maps is denoted as C ₃ , where the width of each feature map in C ₃ is height is The input terminal of the seventh activation layer receives all the feature maps in C ₃ , and the output terminal of the _seventh activation layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as H ₇ . The width of the feature map is height is The input end of the fourth maximum pooling layer receives all the feature maps in H ₇ , and the output end of the fourth maximum pooling layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as Z ₄ , where Z 4 The width of each feature map in ₄ is height is The input of the third atrous convolutional layer receives all the feature maps in Z ₄ , the output of the third atrous convolutional layer outputs 512 feature maps, and the set of all the output feature maps is denoted as K ₃ , where , the width _of each feature map in K3 is height is The input end of the eighth batch of normalization layers receives all the feature maps in K ₃ , the output end of the eighth batch of normalization layers outputs 512 feature maps, and the set formed by all the output feature maps is denoted as P ₈ , where in P ₈ The width of each feature map is height is The input end of the _eighth activation layer receives all the feature maps in P8, and the output end of the eighth activation layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as H ₈ , and H ₈ is also the coding frame. _The output of , where the width of each feature map in H8 is height is

For the decoding framework, the input end of the first deconvolution layer receives the output of the encoding framework, that is, all feature maps in _H8 , and the output end of the first deconvolution layer outputs 256 feature maps, which are composed of all the output feature maps. The set is denoted as F ₁ , where the width of each feature map in F ₁ is height is The input end of the ninth batch of normalization layers receives all the feature maps in F ₁ , and the output end of the ninth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₉ , where in P ₉ The width of each feature map is height is The input end of the fourth Concatenate fusion layer receives all the feature maps in P ₉ and all the feature maps in P ₇ , and the output end of the fourth Concatenate fusion layer outputs 512 feature maps, and the set composed of all the output feature maps is recorded as C ₄ , where the width of each feature map in C ₄ is height is The input end of the _ninth activation layer receives all the feature maps in C ₄ , and the output end of the _ninth activation layer outputs 512 feature maps. The width of the feature map is height is The input end of the sixth convolutional layer receives all the feature maps in H ₉ , the output end of the sixth convolution layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as J ₆ , where in J ₆ The width of each feature map is height is The input end of the tenth batch of normalization layers receives all the feature maps in J ₆ , the output end of the tenth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₁₀ , where in P ₁₀ The width of each feature map is height is The input end of the tenth activation layer receives all the feature maps in P ₁₀ , and the output end of the tenth activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H ₁₀ , where each H ₁₀ The width of the feature map is height is The input end of the second deconvolution layer receives the output of the coding framework, that is, all feature maps in _H10 , and the output end of the second deconvolution layer outputs 128 feature maps, and the set of all the output feature maps is denoted as F ₂ , where the width of each feature map in _F2 is height is The input end of the eleventh batch of normalization layers receives all the feature maps in F ₂ , and the output end of the eleventh batch of normalization layers outputs 128 feature maps, and the set of all the output feature maps is denoted as P ₁₁ , where P The width of each feature map in ₁₁ is height is The input of the fifth Concatenate fusion layer receives all the feature maps in P ₁₁ and all the feature maps in P ₅ , and the output of the fifth Concatenate fusion layer outputs 256 feature maps, and the set of all the output feature maps is recorded as C ₅ , where the width of each feature map in C ₅ is height is The input end of the eleventh activation layer receives all the feature maps in _C5 , and the output end of the _{eleventh activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H 11} _. The width of each feature map is height is The input end of the seventh convolutional layer receives all the feature maps in H ₁₁ , the output end of the seventh convolution layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J ₇ , where in J ₇ The width of each feature map is height is The input end of the twelfth batch of normalization layers receives all feature maps in J ₇ , the output end of the twelfth batch of normalization layers outputs 128 feature maps, and the set formed by all the output feature maps is denoted as P ₁₂ , where P The width of each feature map in ₁₂ is height is The input end of the twelfth activation layer receives all the feature maps in P ₁₂ , the output end of the twelfth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as H ₁₂ , where in H ₁₂ The width of each feature map is height is The input end of the third deconvolution layer receives all the feature maps in H ₁₂ , the output end of the third deconvolution layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as F ₃ , where F The width of each feature map in ₃ is height is The input end of the thirteenth batch of normalization layers receives all the feature maps in F ₃ , the output end of the thirteenth batch of normalization layers outputs 64 feature maps, and the set of all the output feature maps is denoted as P ₁₃ , where P The width of each feature map in ₁₃ is height is The input end of the sixth Concatenate fusion layer receives all the feature maps in P ₁₃ and all the feature maps in P ₃ , and the output end of the sixth Concatenate fusion layer outputs 128 feature maps, and the set composed of all the output feature maps is recorded as C ₆ , where the width of each feature map in C ₆ is height is The input end of the thirteenth activation layer receives all the feature maps in C ₆ , the output end of the thirteenth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as H ₁₃ , wherein, in H ₁₃ The width of each feature map is height is The input end of the eighth convolutional layer receives all the feature maps in H ₁₃ , the output end of the eighth convolution layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J ₈ , where in J ₈ The width of each feature map is height is The input end of the fourteenth batch of normalization layers receives all the feature maps in J ₈ , the output end of the fourteenth batch of normalization layers outputs 64 feature maps, and the set formed by all the output feature maps is denoted as P ₁₄ , where P The width of each feature map in ₁₄ is height is The input end of the fourteenth activation layer receives all the feature maps in P ₁₄ , the output end of the fourteenth activation layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as H ₁₄ , wherein, in H ₁₄ The width of each feature map is height is The input end of the fourth deconvolution layer receives all the feature maps in H ₁₄ , the output end of the fourth deconvolution layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as F ₄ , where F The width of each feature map in ₄ is R and the height is L; the input end of the fifteenth batch of normalization layers receives all the feature maps in F ₄ , and the output end of the fifteenth batch of normalization layers outputs 32 feature maps. The set composed of all the output feature maps is denoted as P ₁₅ , wherein the width of each feature map in P ₁₅ is R and the height is L; the input end of the seventh Concatenate fusion layer receives all the feature maps in P ₁₅ , H All feature maps in ₁ , the output of the upsampling framework, the output of the seventh Concatenate fusion layer outputs 96 feature maps, and the set of all the output feature maps is denoted as C ₇ , where each feature in C ₇ The width of the graph is R and the height is L;

For the up-sampling framework, the input of the first up-sampling layer receives all the feature maps in Z ₄ , the output of the first up-sampling layer outputs 512 feature maps, and the set formed by all the output feature maps is denoted as Y ₁ , where the width of each feature map in Y ₁ is height is The input end of the tenth convolutional layer receives all the feature maps in Y ₁ , and the output end of the tenth convolution layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as J ₁₀ , where in J ₁₀ The width of each feature map is height is The input end of the seventeenth batch of normalization layers receives all feature maps in J ₁₀ , the output end of the seventeenth batch of normalization layers outputs 256 feature maps, and the set formed by all the output feature maps is denoted as P ₁₇ , where P The width of each feature map in ₁₇ is height is The input end of the seventeenth activation layer receives all the feature maps in P ₁₇ , the output end of the seventeenth activation layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as H ₁₇ , wherein, in H ₁₇ The width of each feature map is height is The input end of the second upsampling layer receives all the feature maps in H ₁₇ , the output end of the second upsampling layer outputs 256 feature maps, and the set formed by all the output feature maps is denoted as Y ₂ , where in Y ₂ The width of each feature map is height is The input of the eleventh convolutional layer receives all the feature maps in Y ₂ , the output of the eleventh convolutional layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as J ₁₁ , where J The width of each feature map in ₁₁ is height is The input end of the eighteenth batch of normalization layers receives all feature maps in J ₁₁ , the output end of the eighteenth batch of normalization layers outputs 128 feature maps, and the set formed by all the output feature maps is denoted as P ₁₈ , where P The width of each feature map in ₁₈ is height is The input end of the eighteenth activation layer receives all the feature maps in P ₁₈ , the output end of the eighteenth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as H ₁₈ , where in H ₁₈ The width of each feature map is height is The input end of the third up-sampling layer receives all the feature maps in H ₁₈ , the output end of the third up-sampling layer outputs 128 feature maps, and the set formed by all the output feature maps is denoted as Y ₃ , where in Y ₃ The width of each feature map is height is The input of the twelfth convolutional layer receives all feature maps in Y ₃ , and the output of the twelfth convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as J ₁₂ , where J The width of each feature map in ₁₂ is height is The input end of the nineteenth batch of normalization layers receives all the feature maps in J ₁₂ , the output end of the nineteenth batch of normalization layers outputs 64 feature maps, and the set formed by all the output feature maps is denoted as P ₁₉ , where P The width of each feature map in ₁₉ is height is The input end of the nineteenth activation layer receives all the feature maps in P ₁₉ , and the output end of the nineteenth activation layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as H ₁₉ , where in H ₁₉ The width of each feature map is height is The input end of the fourth upsampling layer receives all the feature maps in H ₁₉ , the output end of the fourth upsampling layer outputs 64 feature maps, and the set formed by all the output feature maps is denoted as Y ₄ , where in Y ₄ _The width of each feature map of the The set formed by all feature maps is denoted as J ₁₃ , wherein the width of each feature map in J ₁₃ is R and the height is L; the input end of the twentieth batch of normalization layers receives all feature maps in J ₁₃ , the second The output of ten batches of normalization layers outputs 32 feature maps, and the set of all output feature maps is denoted as P ₂₀ , where the width of each feature map in P ₂₀ is R and the height is L; the twentieth activation The input end of the layer receives all the feature maps in P ₂₀ , and the output end of the _twentieth activation layer outputs ₃₂ feature maps. The width of the graph is R and the height is L;

For the output layer, the input terminal of the fifteenth activation layer receives the output of the decoding framework, that is, all the feature maps in _C7 , and the output terminal of the fifteenth activation layer outputs 96 feature maps. It is denoted as H ₁₅ , wherein the width of each feature map in H ₁₅ is R and the height is L; the input end of the ninth convolution layer receives all the feature maps in H ₁₅ , and the output end of the ninth convolution layer outputs 1 feature map, the set of all output feature maps is denoted as J ₉ , where the width of the feature map in J ₉ is R, and the height is L; the input of the sixteenth batch of normalization layers receives J ₉ . Feature map, the output end of the sixteenth batch of normalization layers outputs one feature map, and the set composed of all the output feature maps is denoted as P ₁₆ , where the width of the feature map in P ₁₆ is R and the height is L; The input terminal of the sixteenth activation layer receives the feature map in P ₁₆ , and the output terminal of the sixteenth activation layer outputs one feature map, and the set formed by all the output feature maps is denoted as H ₁₆ , wherein the features in H ₁₆ The width of the image is R and the height is L, and the feature map in H ₁₆ is the estimated depth image corresponding to the original input image;

Step 1_3: Take each original monocular image in the training set as the original input image, input it into the convolutional neural network for training, and obtain the estimated depth image corresponding to each original monocular image in the training set, and set {Q ⁿ The estimated depth image corresponding to (x,y)} is denoted as in, express The pixel value of the pixel whose middle coordinate position is (x, y);

Step 1_4: Calculate the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image, and The loss function value between is denoted as

Step 1_5: Repeat steps 1_3 and 1_4 for a total of V times to obtain the trained convolutional neural network training model, and obtain a total of N×V loss function values; then find the smallest value from the N×V loss function values. Then take the weight vector and the bias term corresponding to the loss function value with the smallest value as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, which are recorded as W ^best and b ^best ; wherein, V>1;

The specific steps of the test phase process are:

Step 2_1: Let {Q(x',y')} denote the monocular image to be predicted; wherein, 1≤x'≤R', 1≤y'≤L', R' denotes {Q(x',y ')} width, L' means the height of {Q(x',y')}, Q(x',y') means the coordinate position in {Q(x',y')} is (x',y ') the pixel value of the pixel point;

Step 2_2: Input {Q(x',y')} into the trained convolutional neural network training model, and use W ^best and b ^best to predict, and get {Q(x',y')} corresponding to The predicted depth image is recorded as {Q _depth (x',y')}; among them, Q _depth (x',y') indicates that the coordinate position in {Q _depth (x',y')} is (x',y) ') of the pixel value of the pixel point.

2. a kind of monocular vision depth estimation method according to claim 1, is characterized in that in described step 1-4, Obtained using the mean squared error function.