CN112085776B

CN112085776B - A direct method for scene depth estimation in unsupervised monocular images

Info

Publication number: CN112085776B
Application number: CN202010754803.1A
Authority: CN
Inventors: 张治国; 孙业昊; 孙浩然; 王海霞; 卢晓; 盛春阳; 李玉霞
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2020-07-31
Filing date: 2020-07-31
Publication date: 2022-07-19
Anticipated expiration: 2040-07-31
Also published as: CN112085776A

Abstract

The invention discloses a direct method unsupervised monocular image scene depth estimation method, which belongs to the field of computer vision and depth estimation and comprises the following steps: constructing a neural network; calculating an image reprojection error; image masking calculation and camera pose updating. The method overcomes the defects that monocular image depth estimation has high requirement on environment, is easily interfered by low texture regions, has poor camera pose estimation precision and the like, and not only obviously improves the depth estimation precision but also has auxiliary action on positioning and navigation of a moving vehicle by combining the traditional monocular depth estimation problem with a visual odometer; the method has the advantages of high precision, strong flexibility, wide application range and the like, can be used for sensing, preventing collision and positioning navigation of the surrounding environment of equipment such as an automatic driving vehicle, a mobile robot and the like, and has various application scenes.

Description

A direct method for scene depth estimation in unsupervised monocular images

技术领域technical field

本发明属于计算机视觉、神经网络、深度估计领域，更具体地，涉及一种直接法无监督单目图像场景深度估计方法。The invention belongs to the fields of computer vision, neural network and depth estimation, and more particularly, relates to a direct method unsupervised monocular image scene depth estimation method.

背景技术Background technique

目前，深度估计在神经网路、传感器等相关技术的推动下，得到了飞速发展，并且在智能机器人、行人识别、人脸解锁、VR应用和自动驾驶等领域得到了广泛地应用。深度估计的首要任务是根据摄像头采集到的单张彩色图像估计出前方物体到摄像头的距离。At present, depth estimation has developed rapidly, driven by related technologies such as neural networks and sensors, and has been widely used in intelligent robots, pedestrian recognition, face unlocking, VR applications, and autonomous driving. The primary task of depth estimation is to estimate the distance from the object ahead to the camera based on a single color image captured by the camera.

从真实场景中获得场景对应的三维信息主要有两种方式：一是使用能够感知场景三维深度信息的传感器来采集场景中的深度信息，另一种方式是从场景对应的二维图像中恢复三维深度信息。结构光和飞行时间测距法(ToF)是目前常用的基于传感器的深度估计算法。结构光技术由一个特殊的投影仪和摄像头组成，摄像头通过采集投影仪发出的特定的光信息投影到场景之后的变化获得场景的三维信息。结构光使用特殊的传感器采集高精度的场景三维信息，目前在面部解锁、安全支付等领域被广泛采用。但是由于技术原理限制只能用在近距离物体测距和小场景范围内使用，所以并不适用于道路场景下的深度估计。ToF是另一种常用的深度信息采集技术，它利用信号在两个收发器之间的往返飞行时间来获得深度信息。手机中常用的深度摄像头和激光雷达测距设备都是使用ToF技术获得场景的高精度深度图，ToF技术在AR、体感游戏和自动驾驶领域都有广泛的应用，但是ToF需要的激光雷达和传感器价格昂贵，而且应用于自动驾驶的车顶激光雷达体积过大存在局限性，所以从场景对应的二维图像中恢复三维深度信息逐渐成为主流。There are two main ways to obtain the three-dimensional information corresponding to the scene from the real scene: one is to use a sensor that can perceive the three-dimensional depth information of the scene to collect the depth information in the scene, and the other is to restore the three-dimensional image from the corresponding two-dimensional image of the scene. in-depth information. Structured light and time-of-flight (ToF) are commonly used sensor-based depth estimation algorithms. Structured light technology consists of a special projector and a camera. The camera obtains three-dimensional information of the scene by collecting the specific light information emitted by the projector and projecting it into the scene. Structured light uses special sensors to collect high-precision three-dimensional scene information, and is currently widely used in face unlocking, secure payment and other fields. However, due to the limitation of technical principles, it can only be used for distance measurement of close-range objects and small scene ranges, so it is not suitable for depth estimation in road scenes. ToF is another commonly used depth information acquisition technique, which utilizes the round-trip flight time of the signal between two transceivers to obtain depth information. The depth cameras and lidar ranging devices commonly used in mobile phones use ToF technology to obtain high-precision depth maps of the scene. ToF technology has a wide range of applications in the fields of AR, somatosensory games and autonomous driving, but the lidar and sensors required by ToF The price is expensive, and the roof lidar applied to autonomous driving has limitations due to its large volume, so the recovery of 3D depth information from the 2D image corresponding to the scene has gradually become the mainstream.

基于图像的深度估计可以分为监督学习和无监督学习，监督学习的方法都依赖于训练图片对应的深度三维地图。通常深度三维地图都是由雷达传感器采集获得，所以基于神经网络的监督学习方法的训练数据集规模通常很小，而且数据集获得的成本也很高，这极大的限制了监督学习方法的可迁移性和适应性。Image-based depth estimation can be divided into supervised learning and unsupervised learning. Supervised learning methods all rely on the depth 3D map corresponding to the training image. Usually, deep 3D maps are collected by radar sensors, so the training datasets of neural network-based supervised learning methods are usually small in size, and the cost of dataset acquisition is also high, which greatly limits the availability of supervised learning methods. Mobility and adaptability.

无监督学习依据恢复深度时需要的相机数目，通常可以分为多目、单目和双目深度估计方法。传统的深度估计方法主要是基于特征点匹配和环境假设的几何约束，双目法和多目深度估计法需要精准的摄像头外参数，而且无法消除外参变化带来的误差。而单目图像深度估计法只需要相机的内参并且不需要特征匹配过程，算法更加的简洁，适应范围更加广泛。Unsupervised learning can usually be divided into multi-camera, monocular and binocular depth estimation methods according to the number of cameras needed to recover depth. The traditional depth estimation method is mainly based on the geometric constraints of feature point matching and environmental assumptions. The binocular method and the multi-eye depth estimation method require accurate external parameters of the camera, and cannot eliminate the error caused by the change of external parameters. The monocular image depth estimation method only needs the internal parameters of the camera and does not require the feature matching process. The algorithm is more concise and has a wider range of adaptability.

发明内容SUMMARY OF THE INVENTION

针对现有技术中存在的上述技术问题，本发明提出了一种直接法无监督单目图像深度估计方法，不仅显著提高了深度估计精度，而且对移动车辆的定位与导航也有辅助作用，设计合理，克服了现有技术的不足，具有良好的效果。In view of the above technical problems existing in the prior art, the present invention proposes a direct method for unsupervised monocular image depth estimation, which not only significantly improves the depth estimation accuracy, but also has an auxiliary effect on the positioning and navigation of moving vehicles, and the design is reasonable. , overcomes the deficiencies of the prior art, and has a good effect.

为了实现上述目的，本发明采用如下技术方案：In order to achieve the above object, the present invention adopts the following technical solutions:

一种直接法无监督单目图像场景深度估计方法，包括如下步骤：A direct method for unsupervised monocular image scene depth estimation method, comprising the following steps:

步骤1：构建深度估计神经网络,以单目连续图像作为输入图像，利用深度估计神经网络输出深度估计图像；Step 1: Build a depth estimation neural network, take the monocular continuous image as the input image, and use the depth estimation neural network to output the depth estimation image;

步骤2：计算初始相机位姿，利用输入图像的上一帧图像与深度估计神经图像以及相机位姿计算重投影图像，并将重投影图像与当前帧图像进行重投影误差计算，利用反向传播进行深度估计神经网络的参数更新，获得一个新的深度估计图像；Step 2: Calculate the initial camera pose, use the previous frame image of the input image, the depth estimation neural image and the camera pose to calculate the reprojection image, and calculate the reprojection error between the reprojection image and the current frame image, and use back propagation. Update the parameters of the depth estimation neural network to obtain a new depth estimation image;

步骤3：利用重投影图像与输入图像计算图像蒙版，更新前后两帧图像之间的相机位姿估计，重复迭代步骤2、3。Step 3: Use the reprojected image and the input image to calculate the image mask, update the camera pose estimation between the two frames of images before and after, and repeat iterative steps 2 and 3.

优选地，在步骤1中，具体包括如下步骤：Preferably, in step 1, it specifically includes the following steps:

步骤1.1：构建的深度估计神经网络为全卷积U型，卷积部分使用Res-Net18网络结构中的卷积网络作为主体结构网络；Step 1.1: The constructed depth estimation neural network is fully convolutional U-shaped, and the convolutional network in the Res-Net18 network structure is used as the main structure network in the convolution part;

步骤1.2：反卷积部分包含若干层反卷积层与ReLu激活层叠加作为主体结构，每层反卷积层与卷积部分中尺度相同的卷积块中的最后一层卷积层相连接，形成最终的反卷积层；Step 1.2: The deconvolution part includes several layers of deconvolution layers and ReLu activation layers superimposed as the main structure. Each deconvolution layer is connected to the last layer of convolutional layers in the convolutional block with the same scale in the convolutional part. , forming the final deconvolution layer;

步骤1.3：以单目连续图像作为训练数据集，并在输入深度估计神经网络之前进行图像反转、伽马变换、颜色通道变化等样本扩充和泛化操作。Step 1.3: Take the monocular continuous image as the training data set, and perform sample expansion and generalization operations such as image inversion, gamma transformation, and color channel change before inputting to the depth estimation neural network.

优选地，在步骤2中，具体包括如下步骤：Preferably, in step 2, it specifically includes the following steps:

步骤2.1：预先标定出相机的内参数或从数据集中获得相机的内参数；Step 2.1: pre-calibrate the internal parameters of the camera or obtain the internal parameters of the camera from the data set;

步骤2.2：利用当前帧图像以及上一帧图像，采用直接法计算初始相机位姿；Step 2.2: Using the current frame image and the previous frame image, use the direct method to calculate the initial camera pose;

步骤2.3：将利用输入图像的上一帧图像、深度估计图像以及相机位姿计算重投影图像；Step 2.3: The reprojection image will be calculated using the previous frame image of the input image, the depth estimation image and the camera pose;

步骤2.4：计算当前帧图像与重投影图像之间的重投影误差；Step 2.4: Calculate the reprojection error between the current frame image and the reprojection image;

步骤2.5：利用深度估计神经网络输出的深度估计图像和图像重投影误差，并通过反向传播进行深度估计神经网络的参数更新，获得一个新的深度估计图像。Step 2.5: Use the depth estimation image output by the depth estimation neural network and the image reprojection error, and update the parameters of the depth estimation neural network through backpropagation to obtain a new depth estimation image.

优选地，在步骤3中，具体包括如下步骤：Preferably, in step 3, it specifically includes the following steps:

步骤3.1：根据重投影图像与当前帧图像计算相似度误差，获得图像蒙版；Step 3.1: Calculate the similarity error according to the reprojected image and the current frame image to obtain an image mask;

步骤3.2：计算上一帧图像与获得的相机位姿之间的雅克比矩阵；Step 3.2: Calculate the Jacobian matrix between the previous frame image and the obtained camera pose;

步骤3.3：将图像蒙版与雅克比矩阵相乘，获得改进的雅克比矩阵；Step 3.3: Multiply the image mask with the Jacobian matrix to obtain the improved Jacobian matrix;

步骤3.4：利用改进的雅克比矩阵，更新前后两帧图像之间的相机位姿估计。Step 3.4: Using the improved Jacobian matrix, update the camera pose estimation between the two frames of images before and after.

步骤3.5：重复迭代步骤2、3。Step 3.5: Repeat iterative steps 2 and 3.

优选地，所述卷积部分使用Res-Net18网络结构中的卷积网络作为主体结构网络，由5个卷积块和5个反卷积块组成，第1个卷积块中包含1个卷积组，卷积组中包含1个卷积层，该卷积层的输入为3通道彩色图像，输出为64通道，第2、3、4、5个卷积块中分别包含2个卷积组，每个卷积组中分别包含2个卷积层，第2、3、4、5个卷积块中包含的通道数分别为64、128、256和512，所述反卷积块中反卷积组的数量是对应尺度卷积块中卷积组数量的2倍，各个反卷积块中的每个卷积组中包含的通道数分别为512、256、128、64、1。Preferably, the convolutional part uses the convolutional network in the Res-Net18 network structure as the main structure network, which consists of 5 convolutional blocks and 5 deconvolutional blocks, and the first convolutional block contains 1 volume Product group, the convolution group contains 1 convolution layer, the input of the convolution layer is a 3-channel color image, and the output is 64 channels. The 2nd, 3rd, 4th, and 5th convolution blocks respectively contain 2 convolutions Each convolution group contains 2 convolution layers respectively, and the number of channels contained in the 2nd, 3rd, 4th, and 5th convolution blocks are 64, 128, 256, and 512, respectively. The number of deconvolution groups is twice the number of convolution groups in the corresponding scale convolution block, and the number of channels contained in each convolution group in each deconvolution block is 512, 256, 128, 64, and 1, respectively.

优选地，上一个卷积块与下一个卷积块之间通过最大池操作连接，最大池操作也将相邻的两个卷积块中的下一个卷积块中的尺寸缩放为上一个卷积块的二分之一，相邻的两个卷积块中下一个反卷积块的尺度为上一个反卷积块尺度的2倍，每个卷机组和反卷积组中的卷积层和反卷积层之间通过3*3卷积方式连接，此外，将第二个卷积块的内容复制到第四个反卷积块中，第五个卷积块的内容复制到第三个反卷积块中。Preferably, the previous convolutional block and the next convolutional block are connected through a max pooling operation, and the max pooling operation also scales the size of the next convolutional block in the two adjacent convolutional blocks to the previous convolutional block One-half of the accumulation block, the scale of the next deconvolution block in the two adjacent convolution blocks is twice the scale of the previous deconvolution block, and the convolution in each convolution group and deconvolution group The layer and the deconvolution layer are connected by 3*3 convolution. In addition, the content of the second convolution block is copied to the fourth deconvolution block, and the content of the fifth convolution block is copied to the first convolution block. in three deconvolution blocks.

本发明所带来的有益技术效果：Beneficial technical effects brought by the present invention:

本发明通过单目相机获取环境的二维图像，然后使用设计的全卷积神经网络计算得到二维图像对应的三维深度图。在训练网络时，本发明使用设计的图像重投影误差与图像蒙版，增加了训练效率和深度估计精度。方法中用到的图像蒙版可以有效地去除道路环境中的地纹理区域与移动的车辆等干扰因素，大大地提高了单目深度估计的精度，同时降低了对环境的要求和训练成本；另外，由于本发明提供的定位方法可以实时获取摄像头前方的深度图，该方法不仅可以用于地面移动机器人的导航和自动驾驶汽车，对于空中飞行的无人机同样适用。The present invention obtains a two-dimensional image of the environment through a monocular camera, and then uses a designed fully convolutional neural network to obtain a three-dimensional depth map corresponding to the two-dimensional image. When training the network, the present invention uses the designed image reprojection error and image mask, which increases training efficiency and depth estimation accuracy. The image mask used in the method can effectively remove interference factors such as ground texture areas and moving vehicles in the road environment, greatly improving the accuracy of monocular depth estimation, while reducing the requirements for the environment and training costs; , because the positioning method provided by the present invention can obtain the depth map in front of the camera in real time, the method can not only be used for the navigation of ground mobile robots and automatic driving cars, but also suitable for drones flying in the air.

本发明方法不仅克服了传统深度估计方法对雷达传感器依赖性强、对环境要求高、不灵活等缺陷，而且有效地克服了单目深度估计中的因为低纹理区域产生的空洞问题，同时对空中无人机的定位与导航亦非常适用；具有精度高、灵活性强、应用范围广等诸多优势，可用于室内环境下移动机器人、无人机等智能移动设备的导航与避障，拓宽了应用场景。The method of the invention not only overcomes the defects of the traditional depth estimation method, such as strong dependence on radar sensors, high requirements for the environment, and inflexibility, but also effectively overcomes the problem of holes in the monocular depth estimation due to low texture areas. The positioning and navigation of UAV is also very suitable; it has many advantages such as high precision, strong flexibility, and wide application range, and can be used for navigation and obstacle avoidance of intelligent mobile devices such as mobile robots and UAVs in indoor environments, broadening the application. Scenes.

附图说明Description of drawings

图1是本发明基于全卷积神经网络的无监督单目深度估计模块示意图；1 is a schematic diagram of an unsupervised monocular depth estimation module based on a fully convolutional neural network of the present invention;

图2是本发明全卷积神经网络的训练过程示意图；Fig. 2 is the training process schematic diagram of the fully convolutional neural network of the present invention;

具体实施方式Detailed ways

下面结合附图以及具体实施方式对本发明作进一步详细说明：The present invention is described in further detail below in conjunction with the accompanying drawings and specific embodiments:

如图1所示，一种直接法无监督单目图像场景深度估计方法，包括：As shown in Figure 1, a direct method for unsupervised scene depth estimation from monocular images includes:

步骤1：构建的深度估计神经网络为全卷积U型，包括卷积部分和反卷积部分。深度估计神经网络以单目连续图像作为输入图像，利用深度估计神经网络输出深度估计图像。Step 1: The constructed depth estimation neural network is a fully convolutional U-shaped, including a convolution part and a deconvolution part. The depth estimation neural network takes the monocular continuous image as the input image, and uses the depth estimation neural network to output the depth estimation image.

步骤1包括以下子步骤：Step 1 includes the following sub-steps:

步骤1.1：卷积部分使用Res-Net18网络结构中的卷积网络作为主体结构网络。卷积部分由若干个卷积块组成，上一个卷积块与下一个卷积块之间通过最大池操作连接，每个卷积块中又包含若干个卷积组，每个卷积组中包含若干个卷积层，同时，最大池操作也将相邻的两个卷积块中的下一个卷积块中的尺寸缩放为上一个卷积块的二分之一。如图2所示，在本发明实施例中卷积块的数量为5个，每个卷积模块中包含若干个不同的卷积组。第1个卷积块中包含1个卷积组，卷积组中包含1个卷积层，该卷积层的输入为3通道彩色图像，输出为64通道；第2、3、4、5个卷积块中分别包含2个卷积组，每个卷积组中分别包含2个卷积层，第2、3、4、5个卷积块中包含的通道数分别为64、128、256和512。第1个卷积块中采用的是7*7的卷积核，第2、3、4、5个卷积块中采用3*3的卷积核，每个卷积块中的卷积层之间采用线性校正单元(Rectified Linear Units，ReLu)作为激活函数。Step 1.1: The convolutional part uses the convolutional network in the Res-Net18 network structure as the main structure network. The convolution part consists of several convolution blocks. The previous convolution block and the next convolution block are connected by a maximum pool operation. Each convolution block contains several convolution groups. It contains several convolutional layers, and at the same time, the maximum pooling operation also scales the size of the next convolutional block in two adjacent convolutional blocks to one-half of the previous convolutional block. As shown in FIG. 2 , in the embodiment of the present invention, the number of convolution blocks is 5, and each convolution module includes several different convolution groups. The first convolution block contains 1 convolution group, and the convolution group contains 1 convolution layer. The input of the convolution layer is a 3-channel color image and the output is 64 channels; the second, third, fourth, and fifth Each convolution block contains 2 convolution groups, and each convolution group contains 2 convolution layers respectively. The number of channels contained in the 2nd, 3rd, 4th, and 5th convolution blocks are 64, 128, 256 and 512. The 7*7 convolution kernel is used in the first convolution block, the 3*3 convolution kernel is used in the 2nd, 3rd, 4th, and 5th convolution blocks, and the convolution layer in each convolution block is Linear correction units (Rectified Linear Units, ReLu) are used as the activation function.

步骤1.2：反卷积部分由若干个反卷积块组成。反卷积部分中同一个尺度的反卷积块由两部分组成，分别是卷积部分中尺度与当前反卷积部分中尺度相同的基础块和当前尺度的反卷积块。下一反卷积块的尺度为上一反卷积块尺度的2倍。每个反卷积块中包含若干个反卷积组。如图2所示，本发明中反卷积块的数量为5，反卷积块中反卷积组的数量是对应尺度卷积块中卷积组数量的2倍，每个反卷积组中包含的卷积层个数与对应尺度卷积组中卷积层个数相同，各个反卷积块中的每个卷积组中包含的通道数分别为512、256、128、64、1。Step 1.2: The deconvolution part consists of several deconvolution blocks. The deconvolution block of the same scale in the deconvolution part consists of two parts, namely the base block of the same scale in the convolution part as the current deconvolution part and the deconvolution block of the current scale. The scale of the next deconvolution block is twice the scale of the previous deconvolution block. Each deconvolution block contains several deconvolution groups. As shown in Figure 2, the number of deconvolution blocks in the present invention is 5, and the number of deconvolution groups in the deconvolution block is twice the number of convolution groups in the corresponding scale convolution block. The number of convolutional layers contained in it is the same as the number of convolutional layers in the corresponding scale convolutional group, and the number of channels contained in each convolutional group in each deconvolution block is 512, 256, 128, 64, 1 .

步骤1.3：利用卷积部分和反卷积部分构建深度估计神经网络，以单目连续图像作为训练数据集，输入构建的神经网络进行深度估计，并在输入深度估计神经网络之前进行图像反转、伽马变换、颜色通道变化等样本扩充和泛化操作。Step 1.3: Use the convolution part and the deconvolution part to construct a depth estimation neural network, take the monocular continuous image as the training data set, input the constructed neural network for depth estimation, and perform image inversion before inputting the depth estimation neural network, Sample expansion and generalization operations such as gamma transform, color channel change, etc.

步骤2：利用输入图像的上一帧图像、深度估计图像以及相机位姿计算重投影图像，并将重投影图像与当前帧图像进行重投影误差计算，获得图像的重投影误差，利用反向传播进行深度估计神经网络的参数更新，获得一个新的深度估计图像。Step 2: Calculate the reprojection image using the previous frame image of the input image, the depth estimation image and the camera pose, and calculate the reprojection error between the reprojection image and the current frame image to obtain the reprojection error of the image, and use back propagation. Perform parameter update of the depth estimation neural network to obtain a new depth estimation image.

步骤2包括以下子步骤：Step 2 includes the following sub-steps:

步骤2.1：预先标定出相机的内参或从数据集中获得相机的内参数，相机的内参数通过离线的方式，利用张正友相机标定方法，采用棋盘格进行标定。Step 2.1: Pre-calibrate the camera's internal parameters or obtain the camera's internal parameters from the data set. The camera's internal parameters are calibrated off-line, using Zhang Zhengyou's camera calibration method, and using a checkerboard.

步骤2.2：利用当前帧图像以及上一帧图像，采用直接法计算初始相机位姿。Step 2.2: Using the current frame image and the previous frame image, use the direct method to calculate the initial camera pose.

步骤2.3：将将利用输入图像的上一帧图像与深度估计图像以及相机位姿计算重投影图像；Step 2.3: The reprojection image will be calculated using the previous frame of the input image, the depth estimation image and the camera pose;

步骤2.4：计算当前帧图像与重投影图像之间的重投影误差，重投影误差就是深度估计神经网络的损失函数。Step 2.4: Calculate the reprojection error between the current frame image and the reprojection image. The reprojection error is the loss function of the depth estimation neural network.

深度估计神经网络中损失函数采用的是重投影图像与原图像之间的误差，定义为：The loss function in the depth estimation neural network uses the error between the reprojected image and the original image, which is defined as:

其中，I_i，

分别表示输入图像和重投影图像的像素值，i表示像素值的索引。Among them, I _i ,

represent the pixel values of the input image and the reprojected image, respectively, and i represents the index of the pixel value.

步骤2.5：利用深度估计神经网络输出的深度估计图像和图像重投影误差，通过反向传播进行深度估计网络的参数更新，获得一个新的深度估计图像。Step 2.5: Use the depth estimation image output by the depth estimation neural network and the image reprojection error to update the parameters of the depth estimation network through backpropagation to obtain a new depth estimation image.

步骤3中包括以下子步骤：Step 3 includes the following sub-steps:

步骤3.1：根据重投影图像与当前帧图像计算相似度误差，获得图像蒙版。Step 3.1: Calculate the similarity error according to the reprojected image and the current frame image to obtain an image mask.

具体地，图像蒙版为二值图像。当重投影误差小于前后帧之间图像之差时，图像蒙版置为1，反之，置为0。在本发明方法中，低纹理区域和移动车辆区域图像蒙版为0，其他地方区域图像蒙版为1。Specifically, the image mask is a binary image. The image mask is set to 1 when the reprojection error is less than the difference between the images before and after the frame, otherwise, it is set to 0. In the method of the present invention, the image mask of the low texture area and the moving vehicle area is 0, and the image mask of other places is 1.

步骤3.3：计算上一帧图像与获得的相机位姿之间的雅克比矩阵。Step 3.3: Calculate the Jacobian matrix between the previous frame image and the obtained camera pose.

步骤3.4：将图像蒙版与雅克比矩阵相乘，获得改进的雅克比矩阵。Step 3.4: Multiply the image mask with the Jacobian matrix to obtain the improved Jacobian matrix.

步骤3.5：利用改进的雅克比矩阵，更新前后两帧图像之间的相机位姿估计。Step 3.5: Using the improved Jacobian matrix, update the camera pose estimation between the two frames of images before and after.

步骤3.6：重复迭代步骤2、3。Step 3.6: Repeat iterative steps 2 and 3.

设置迭代次数为6万次，在第5万次训练保存的模型取得了最好的估计效果，总训练时间为40小时。Setting the number of iterations to 60,000, the model saved at the 50,000th training achieved the best estimate, with a total training time of 40 hours.

以上为本实施例的完整实现过程。The above is the complete implementation process of this embodiment.

当然，上述说明并非是对本发明的限制，本发明也并不仅限于上述举例，本技术领域的技术人员在本发明的实质范围内所做出的变化、改型、添加或替换，也应属于本发明的保护范围。Of course, the above description is not intended to limit the present invention, and the present invention is not limited to the above examples. Changes, modifications, additions or substitutions made by those skilled in the art within the essential scope of the present invention should also belong to the present invention. the scope of protection of the invention.

Claims

1. A direct method unsupervised monocular image scene depth estimation method is characterized by comprising the following steps: the method comprises the following steps:

step 1: constructing a depth estimation neural network, taking a monocular continuous image as an input image, and outputting a depth estimation image by using the depth estimation neural network;

step 2: calculating an initial camera pose, calculating a re-projection image by using a previous frame image, a depth estimation image and the camera pose of an input image, calculating a re-projection error of the re-projection image and a current frame image, and updating parameters of a depth estimation network by using back propagation to obtain a new depth estimation image;

and step 3: calculating an image mask by utilizing the reprojected image and the input image, updating the camera pose estimation between the front frame image and the rear frame image, and repeating the iteration steps 2 and 3;

in step 3, the method specifically comprises the following steps:

step 3.1: calculating a similarity error according to the reprojected image and the current frame image to obtain an image mask;

step 3.2: calculating a Jacobian matrix between the last frame of image and the obtained camera pose;

step 3.3: multiplying the image mask by the Jacobian matrix to obtain an improved Jacobian matrix;

step 3.4: updating the camera pose estimation between the front frame image and the rear frame image by using the improved Jacobian matrix;

Step 3.5: and repeating the iteration steps 2 and 3.

2. A direct method unsupervised monocular image scene depth estimation method according to claim 1, characterized in that: in step 1, the method specifically comprises the following steps:

step 1.1: the constructed depth estimation neural network is in a full convolution U type, and the convolution part uses a convolution network in a Res-Net18 network structure as a main structure network;

step 1.2: the deconvolution part comprises a plurality of deconvolution layers and ReLu activated layer stacking as a main body structure, and each deconvolution layer is connected with the last convolution layer in the convolution blocks with the same mesoscale in the convolution part to form a final deconvolution layer;

step 1.3: monocular continuous images are taken as a training data set, and sample expansion and generalization operations including image inversion, gamma transformation and color channel change are carried out before the input of the depth estimation neural network.

3. The direct method unsupervised monocular image scene depth estimation method of claim 1, characterized by: in the step 2, the method specifically comprises the following steps:

step 2.1: pre-calibrating intrinsic parameters of the camera or obtaining the intrinsic parameters of the camera from the data set;

step 2.2: calculating the initial camera pose by using the current frame image and the previous frame image and adopting a direct method;

Step 2.3: calculating a re-projection image by using a last frame image and a depth estimation image of the input image and the camera pose;

step 2.4: calculating a reprojection error between the current frame image and the reprojected image;

step 2.5: and updating parameters of the depth estimation neural network by utilizing the depth estimation image and the image reprojection error output by the depth estimation neural network through back propagation to obtain a new depth estimation image.

4. A direct method unsupervised monocular image scene depth estimation method according to claim 2, characterized in that: the convolution part uses a convolution network in a Res-Net18 network structure as a main structure network, and comprises 5 convolution blocks and 5 deconvolution blocks, wherein the 1 st convolution block comprises 1 convolution group, the convolution group comprises 1 convolution layer, the input of the convolution layer is a 3-channel color image, the output is 64 channels, the 2 nd, 3 rd, 4 th and 5 th convolution blocks respectively comprise 2 convolution groups, each convolution group comprises 2 convolution layers, the number of channels contained in the 2 nd, 3 th, 4 th and 5 th convolution blocks is respectively 64, 128, 256 and 512, the number of deconvolution groups in the deconvolution blocks is 2 times of the number of convolution groups in the corresponding scale convolution blocks, and the number of channels contained in each convolution group in each deconvolution block is respectively 512, 256, 128, 64 and 1.

5. A direct method unsupervised monocular image scene depth estimation method according to claim 2, characterized in that: the last volume block and the next volume block are connected through a maximum pool operation, the maximum pool operation also scales the size of the next volume block in the two adjacent volume blocks to be one half of the size of the last volume block, the size of the next deconvolution block in the two adjacent volume blocks is 2 times of the size of the last deconvolution block, the convolution layer and the deconvolution layer in each volume unit and the deconvolution group are connected through a 3 x 3 convolution mode, in addition, the content of the second volume block is copied into the fourth deconvolution block, and the content of the fifth volume block is copied into the third deconvolution block.