CN114998411B - Self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss - Google Patents

Self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss

Info

Publication number
CN114998411B
Authority
CN
China
Prior art keywords
luminosity
depth
max
self
reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210475411.0A
Other languages
Chinese (zh)
Other versions
CN114998411A (en)
Inventor
李嘉茂
张天宇
朱冬晨
张广慧
石文君
刘衍青
张晓林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202210475411.0A
Publication of CN114998411A
Application granted
Publication of CN114998411B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss. The method includes: acquiring several adjacent frames from an image sequence; and inputting the images into a trained deep learning network to obtain depth information and pose information, where the photometric loss information of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional auto-mask is used to keep the pixels of moving objects out of the photometric error computation. The invention improves the accuracy of the photometric loss and thereby better supervises the training of the depth network.

Description

Self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss

Technical Field

The present invention relates to the field of computer vision, and in particular to a self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss.

Background Art

Estimating the depth of a scene from images, i.e., image depth estimation, is a fundamental and important task in computer vision. A good image depth estimation algorithm can be applied to outdoor driving scenes, small indoor robots, and other fields, and therefore has great application value. While a robot or a self-driving car operates, the scene depth obtained with a depth estimation algorithm helps it plan the path of its next movement or avoid obstacles.

Depth estimation from images is divided into supervised and self-supervised methods. Supervised methods mainly use neural networks to learn a mapping from images to depth maps and are trained under ground-truth supervision, so that the network gradually acquires the ability to regress depth. However, because ground truth is expensive to obtain for supervised methods, self-supervised methods have gradually become mainstream in recent years. Compared with methods that require stereo image pairs for training, methods based on image sequences have attracted wide attention from researchers because of their broader applicability.

A self-supervised monocular depth framework based on image sequences mainly consists of a depth estimation network and a pose estimation network, which respectively predict the depth of the target frame and the pose transformation between the target frame and a source frame. Combining the estimated depth and pose, the source frame can be warped into the coordinate system of the target frame to obtain a reconstructed image, and the photometric difference between the target frame and the reconstructed image, i.e., the photometric loss, supervises the joint training of the two networks. As the photometric loss decreases, the depth estimated by the network becomes increasingly accurate.

Generating the photometric loss requires a spatial transformation model. Although existing spatial transformation models follow the theoretical rigid-body transformation, errors in the translation vector of the estimated pose introduce a depth estimation error during the computation; that is, the larger the depth, the larger the depth estimation error. In addition, to address the inaccurate photometric loss caused by moving pixels that violate photometric consistency, the main idea of existing approaches is to generate, during training, a binary mask that filters out pixels whose photometric error does not decrease from one frame to the next; however, such a binary mask can only identify objects moving in the same direction as the camera.

Summary of the Invention

The inventors of the present invention found that the reason why a larger depth leads to a larger depth estimation error is as follows. The purpose of the spatial transformation is to make corresponding pixels in the target frame and the source frame coincide on the pixel plane after the transformation, for example using a nearby point P_N to solve the correspondence between the corresponding pixels p_t and p_s, as shown in Figure 1. The principle of self-supervised depth estimation is to make the estimated pose and depth more accurate by minimizing the photometric error between p_t and p_s. For nearby regions, as shown in Figure 1, given a certain number of points, the estimated pose can only become more accurate, and the depth performance better, when p_t and the transformed point p_F nearly coincide. For distant regions, as shown in Figure 2, it is enough for the predicted rotation matrix to be accurate to make the photometric error between p_t and p_s small. Therefore, if the photometric error is constructed from the estimated rotation matrix and translation vector without distinguishing near from far regions, the uncertainty of the photometric error increases greatly, which degrades the depth estimation results.

The technical problem to be solved by the present invention is to provide a self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss, which can improve the accuracy of the photometric loss and thereby better supervise the training of the depth network.

The technical solution adopted by the present invention to solve this technical problem is to provide a self-supervised monocular depth estimation method combining spatiotemporally enhanced photometric loss, comprising the following steps:

acquiring several adjacent frames from an image sequence;

inputting the images into a trained deep learning network to obtain depth information and pose information, wherein the photometric loss information of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional auto-mask is used to keep the pixels of moving objects out of the photometric error computation.

Obtaining the photometric loss information from the spatial transformation model based on depth-aware pixel correspondence specifically comprises:

for distant regions, performing the spatial transformation with a homography matrix and constructing a first reconstructed image, wherein the distant region is treated as a plane at infinity;

performing the spatial transformation with the basic matrix and constructing a second reconstructed image;

computing a photometric error map based on the first reconstructed image and a photometric error map based on the second reconstructed image from the two pixel correspondences, and then taking the minimum value pixel by pixel to obtain the final photometric loss information.

Using the omnidirectional auto-mask to keep the pixels of moving objects out of the photometric error computation specifically comprises:

predicting an initial depth and an initial pose of the target frame with a pre-trained network, and generating an initial reconstructed image;

adding perturbation terms to the initial pose and obtaining several hypothetical reconstructed frames by spatial transformation; generating multiple photometric error maps from the hypothetical reconstructed frames combined with the photometry of the target frame, and obtaining multiple binary masks from the multiple photometric error maps;

selecting the minimum over the multiple binary masks as the final mask.

The perturbation terms are translational perturbations, including [t_max, 0, 0], [-t_max, 0, 0], [0, 0, t_max], and [0, 0, -t_max], where t_max denotes the maximum value in the initial translation vector.

The technical solution adopted by the present invention to solve this technical problem further provides a self-supervised monocular depth estimation device combining spatiotemporally enhanced photometric loss, comprising:

an acquisition module for acquiring several adjacent frames from an image sequence;

an estimation module for inputting the images into a trained deep learning network to obtain depth information and pose information, wherein the photometric loss information of the deep learning network is obtained from the spatial transformation model of a depth-aware pixel correspondence module, and an omnidirectional auto-mask module is used to keep the pixels of moving objects out of the photometric error computation.

The depth-aware pixel correspondence module comprises:

a first construction unit for performing, for distant regions, the spatial transformation with a homography matrix and constructing a first reconstructed image, wherein the distant region is treated as a plane at infinity;

a second construction unit for performing the spatial transformation with the basic matrix and constructing a second reconstructed image;

a photometric loss information acquisition unit for computing a photometric error map based on the first reconstructed image and a photometric error map based on the second reconstructed image from the two pixel correspondences, and then taking the minimum value pixel by pixel to obtain the final photometric loss information.

The omnidirectional auto-mask module comprises:

an initial reconstruction generation unit for predicting an initial depth and an initial pose of the target frame with a pre-trained network and generating an initial reconstructed image;

a binary mask generation unit for adding perturbation terms to the initial pose and obtaining several hypothetical reconstructed frames by spatial transformation, generating multiple photometric error maps from the hypothetical reconstructed frames combined with the photometry of the target frame, and obtaining multiple binary masks from the multiple photometric error maps;

a mask selection unit for selecting the minimum over the multiple binary masks as the final mask.

The perturbation terms are translational perturbations, including [t_max, 0, 0], [-t_max, 0, 0], [0, 0, t_max], and [0, 0, -t_max], where t_max denotes the maximum value in the initial translation vector.

Beneficial Effects

Owing to the above technical solution, the present invention has the following advantages and positive effects compared with the prior art. The invention mines the pixel correspondence of distant regions in a depth-aware manner, alleviating the problem of inaccurate pixel correspondence in distant regions, and uses an omnidirectional auto-masking scheme to obtain an omnidirectional binary mask that keeps the pixels of moving objects out of the photometric error computation. By improving the spatial transformation and automatically generating masks for dynamic objects, the invention improves the accuracy of the photometric loss and thereby better supervises the training of the depth network.

Brief Description of the Drawings

Figure 1 is a schematic diagram of pose estimation from a nearby point;

Figure 2 is a schematic diagram of pose estimation from a distant point;

Figure 3 is a schematic diagram of the basic framework of Monodepth2;

Figure 4 is a schematic diagram of the generation of the photometric loss in the first embodiment of the present invention;

Figure 5 is a schematic diagram of the omnidirectional auto-mask in the first embodiment of the present invention.

Detailed Description of the Embodiments

The present invention is further described below with reference to specific embodiments. It should be understood that these embodiments are only used to illustrate the invention and are not intended to limit its scope. In addition, it should be understood that, after reading the teachings of the present invention, those skilled in the art can make various changes or modifications to the invention, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.

The first embodiment of the present invention relates to a self-supervised monocular depth estimation method combining spatiotemporally enhanced photometric loss, comprising the following steps: acquiring several adjacent frames from an image sequence; and inputting the images into a trained deep learning network to obtain depth information and pose information, wherein the photometric loss information of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional auto-mask is used to keep the pixels of moving objects out of the photometric error computation.

The method of this embodiment can be used directly in general self-supervised monocular depth estimation; any work that takes the SfMLearner framework as its implementation principle can use it. It is only necessary to replace the spatial transformation part of the original framework with the depth-aware pixel-correspondence spatial transformation model of this embodiment, and the auto-masking part with the omnidirectional auto-mask of this application.

The following further illustrates the present invention using the basic framework of Monodepth2 by Godard et al. as an example.

For ease of understanding, the overall framework of Monodepth2 is introduced first, as shown in Figure 3. Its input is three adjacent RGB frames from a sequence; its output is the depth of the target frame and the pose transformations between the target frame and the source frames.
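To make these interfaces concrete, the following PyTorch-style sketch shows how such a pipeline could consume three adjacent frames. The module names DepthNet and PoseNet, their toy architectures, and their signatures are illustrative assumptions, not the exact implementation of Monodepth2 or of this patent.

    import torch
    import torch.nn as nn

    class DepthNet(nn.Module):
        """Assumed encoder-decoder stand-in: maps an RGB frame to a dense positive depth map."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(32, 1, 3, padding=1), nn.Softplus())
        def forward(self, img):                       # img: (B, 3, H, W)
            return self.net(img)                      # depth: (B, 1, H, W)

    class PoseNet(nn.Module):
        """Assumed pose regressor: maps a frame pair to a 6-DoF pose (axis-angle + translation)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Conv2d(6, 32, 7, stride=4), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 6))
        def forward(self, target, source):            # each: (B, 3, H, W)
            return self.net(torch.cat([target, source], dim=1))  # pose: (B, 6)

    # Three adjacent frames: previous source, target, next source.
    I_prev, I_t, I_next = (torch.rand(1, 3, 192, 640) for _ in range(3))
    depth_net, pose_net = DepthNet(), PoseNet()
    D_t = depth_net(I_t)                              # depth of the target frame
    T_t_prev = pose_net(I_t, I_prev)                  # pose target -> previous source
    T_t_next = pose_net(I_t, I_next)                  # pose target -> next source
    # In a full pipeline the 6-DoF vectors would be converted to 4x4 transforms T_{t->s} for warping.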

The basic framework of this embodiment is the same as that in Figure 3. Since this embodiment mainly improves the way the spatial transformation generates the photometric loss and the auto-masking part, these two parts of Monodepth2 are introduced first:

Monodepth2 uses the same spatial transformation model as SfMLearner, based on the depth D_t of the target frame I_t and the pose T_{t→s} = [R_{t→s} | t_{t→s}] between the target frame I_t and the source frame I_s. For corresponding pixels p_t and p_s in the target frame and the source frame, if they correspond to the same 3D point, they should satisfy:

D_s K^{-1} p_s = T_{t→s} D_t K^{-1} p_t

where K is the camera intrinsic matrix. Since monocular depth has scale ambiguity, this can be rewritten as the following relation used for the spatial transformation:

p_s ~ K T_{t→s} D_t K^{-1} p_t

In the spatial geometric transformation, K T_{t→s} K^{-1} is defined as the basic matrix F, which establishes the pixel correspondence between frames. This correspondence can then be used to construct the reconstructed frame Î_{s→t}.
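As an illustration of this warping step, the sketch below realises the relation p_s ~ K T_{t→s} D_t K^{-1} p_t in PyTorch and bilinearly samples the source frame at the resulting coordinates. The function name, tensor layout, and use of grid_sample are assumptions for illustration rather than the patent's exact implementation.

    import torch
    import torch.nn.functional as F_nn

    def reconstruct_from_source(I_s, D_t, K, T_t_s):
        """Warp the source frame into the target view using depth D_t and pose T_t_s.
        I_s: (B,3,H,W) source image, D_t: (B,1,H,W) target depth,
        K: (B,3,3) intrinsics, T_t_s: (B,4,4) target-to-source transform."""
        B, _, H, W = I_s.shape
        ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                                torch.arange(W, dtype=torch.float32), indexing="ij")
        ones = torch.ones_like(xs)
        pix = torch.stack([xs, ys, ones], dim=0).reshape(1, 3, -1).expand(B, -1, -1)  # homogeneous p_t
        cam = torch.linalg.inv(K) @ pix * D_t.reshape(B, 1, -1)        # D_t * K^-1 * p_t
        cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)       # homogeneous 3D points
        cam_s = (T_t_s @ cam_h)[:, :3]                                 # T_{t->s} applied to the points
        p_s = K @ cam_s                                                # K * T_{t->s} * D_t * K^-1 * p_t
        p_s = p_s[:, :2] / p_s[:, 2:3].clamp(min=1e-6)                 # perspective division
        # Normalise to [-1, 1] for grid_sample and bilinearly sample the source frame.
        grid = torch.stack([2 * p_s[:, 0] / (W - 1) - 1,
                            2 * p_s[:, 1] / (H - 1) - 1], dim=-1).reshape(B, H, W, 2)
        return F_nn.grid_sample(I_s, grid, padding_mode="border", align_corners=True)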

From the target frame and the reconstructed frame, the photometric loss pe can be constructed; it consists of an L1 error and a structural similarity (SSIM) error, specifically:

pe(I_a, I_b) = (α / 2) (1 - SSIM(I_a, I_b)) + (1 - α) ||I_a - I_b||_1

where α is a hyperparameter, set to 0.85 in Monodepth2.
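A compact sketch of this loss follows, using the common 3x3 averaging-window form of SSIM; the helper names and the simplified SSIM are illustrative assumptions.

    import torch.nn.functional as F_nn

    def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
        """Simplified SSIM over a 3x3 window, per pixel and channel."""
        mu_x, mu_y = F_nn.avg_pool2d(x, 3, 1, 1), F_nn.avg_pool2d(y, 3, 1, 1)
        sigma_x = F_nn.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
        sigma_y = F_nn.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
        sigma_xy = F_nn.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
        num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
        den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
        return (num / den).clamp(0, 1)

    def photometric_error(I_a, I_b, alpha=0.85):
        """pe = alpha/2 * (1 - SSIM) + (1 - alpha) * |I_a - I_b|_1, averaged over channels."""
        l1 = (I_a - I_b).abs().mean(1, keepdim=True)
        ssim_term = (1 - ssim(I_a, I_b)).mean(1, keepdim=True) / 2
        return alpha * ssim_term + (1 - alpha) * l1     # (B,1,H,W) per-pixel error map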

The auto-mask of Monodepth2 is mainly intended to solve the problem of inaccurate photometric loss caused by moving pixels in the image that violate photometric consistency. Its main idea is to filter out, during training, pixels whose photometric error does not become smaller from one frame to another; the generated binary mask μ is:

μ = [ min_s pe(I_t, Î_{s→t}) < min_s pe(I_t, I_s) ]

where [·] is the Iverson bracket used to produce the binary mask, I_t is the target frame, I_s is a source frame, and Î_{s→t} is the reconstructed frame obtained by the spatial transformation.
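A minimal sketch of this mask is given below, reusing the photometric_error helper from the previous sketch; the minimum over source frames follows the formulation above, and the function name is an assumption.

    import torch
    # photometric_error is the helper defined in the previous sketch.

    def automask(I_t, sources, reconstructions):
        """mu = [ min_s pe(I_t, I_s->t) < min_s pe(I_t, I_s) ], as a {0,1} float map."""
        pe_warped = torch.stack([photometric_error(I_t, r) for r in reconstructions]).min(dim=0).values
        pe_source = torch.stack([photometric_error(I_t, s) for s in sources]).min(dim=0).values
        return (pe_warped < pe_source).float()          # 1 keeps a pixel, 0 masks it out of the loss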

For the photometric loss generated by the spatial transformation, this embodiment obtains the photometric loss from a spatial transformation model based on depth-aware pixel correspondence, as shown in Figure 4. Specifically:

During the spatial transformation, a sufficiently distant region can be regarded as a plane at infinity, and a plane satisfies:

n^T P + D = 0

where n is the normal vector of the plane, P is a three-dimensional point on the plane, and D is the depth of the point. Rearranging gives:

-n^T P / D = 1

Substituting this into the spatial transformation relation gives:

p_s ~ K (R_{t→s} - t_{t→s} n^T / D) D_t K^{-1} p_t

When D_t is infinitely large, that is, for the plane at infinity:

p_s ~ K R_{t→s} D_t K^{-1} p_t

K R_{t→s} K^{-1} is defined as the homography at infinity H_∞; therefore, for distant regions only the rotation matrix is used to perform the spatial transformation and construct the reconstructed image Î_H. To distinguish the two, the reconstructed image obtained with the basic matrix F is denoted Î_F. Because the depth estimated by a monocular network has scale ambiguity, the two pixel correspondences cannot be selected directly according to the predicted depth. This embodiment therefore designs an adaptive selection method: the two pixel correspondences are used to compute two photometric error maps, and the minimum is then taken pixel by pixel, that is, the final photometric error is the per-pixel value of:

pe_final = min( pe(I_t, Î_F), pe(I_t, Î_H) )
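The adaptive selection can be sketched as follows, reusing reconstruct_from_source and photometric_error from the earlier sketches. Zeroing the translation of the estimated pose is one illustrative way to realise the rotation-only warp through the homography at infinity; the function name is an assumption.

    import torch
    # reconstruct_from_source and photometric_error are the helpers from the earlier sketches.

    def depth_aware_photometric_error(I_t, I_s, D_t, K, T_t_s):
        """Per-pixel minimum of the error from the full-pose warp (basic matrix F)
        and the rotation-only warp (homography at infinity)."""
        I_F = reconstruct_from_source(I_s, D_t, K, T_t_s)       # reconstruction using R and t
        T_rot = T_t_s.clone()
        T_rot[:, :3, 3] = 0.0                                   # drop the translation -> rotation-only warp
        I_H = reconstruct_from_source(I_s, D_t, K, T_rot)       # reconstruction using R only
        return torch.minimum(photometric_error(I_t, I_F),
                             photometric_error(I_t, I_H))       # final per-pixel photometric error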

For the omnidirectional auto-mask, this embodiment feeds the image sequence directly into the module; after the mask is obtained, it is applied to the photometric error to block out the unreliable parts, as shown in Figure 5. Specifically:

This embodiment introduces a pre-trained Monodepth2 network that predicts the initial depth D_init and the initial pose T_init of the target frame, and further generates an initial reconstructed image I_init. Since this depth and pose are already fairly accurate, the photometric error in regions that obey photometric consistency is already small, whereas in regions that violate photometric consistency it still has the potential to become smaller.

Following this idea, perturbation terms are added to the initial pose to obtain several perturbed poses, and spatial transformation then yields several hypothetical reconstructed frames. From these reconstructed frames I_i, where i ∈ {1, 2, …}, combined with the photometry of the target frame, multiple photometric error maps can be generated, and comparing these photometric error values yields multiple binary masks, each corresponding to the pixels of objects moving in a particular direction, as follows:

M_i = [ pe(I_t, I_init) < pe(I_t, I_i) ]

To capture objects moving in any direction, the final mask is the per-pixel minimum over the generated masks, that is:

M_oA = min(M_1, M_2, …)

In the implementation of this embodiment, only the translation vector is perturbed; the specific translational perturbation terms t_i are t_1 = [t_max, 0, 0], t_2 = [-t_max, 0, 0], t_3 = [0, 0, t_max], and t_4 = [0, 0, -t_max], where t_max is the maximum value in the initial translation vector.
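Under the same assumptions as the earlier sketches (a pre-trained network supplying D_init and T_init, plus the reconstruct_from_source and photometric_error helpers), the omnidirectional auto-mask could be realised roughly as follows; the helper name and the exact way the four perturbations are applied are illustrative.

    import torch
    # reconstruct_from_source and photometric_error are the helpers from the earlier sketches.

    def omnidirectional_automask(I_t, I_s, D_init, K, T_init):
        """M_oA = min_i [ pe(I_t, I_init) < pe(I_t, I_i) ] over four translation perturbations."""
        I_init = reconstruct_from_source(I_s, D_init, K, T_init)
        pe_init = photometric_error(I_t, I_init)
        t_max = T_init[:, :3, 3].abs().max()                    # largest component of the initial translation
        offsets = [(t_max, 0.0, 0.0), (-t_max, 0.0, 0.0),
                   (0.0, 0.0, t_max), (0.0, 0.0, -t_max)]
        masks = []
        for dx, dy, dz in offsets:
            T_i = T_init.clone()
            T_i[:, 0, 3] = T_i[:, 0, 3] + dx                    # perturbed translation t_i
            T_i[:, 1, 3] = T_i[:, 1, 3] + dy
            T_i[:, 2, 3] = T_i[:, 2, 3] + dz
            I_i = reconstruct_from_source(I_s, D_init, K, T_i)  # hypothetical reconstructed frame
            masks.append((pe_init < photometric_error(I_t, I_i)).float())
        return torch.stack(masks).min(dim=0).values             # M_oA = min(M_1, ..., M_4)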

It is easy to see that the present invention mines the pixel correspondence of distant regions in a depth-aware manner, alleviating the problem of inaccurate pixel correspondence in distant regions, and uses an omnidirectional auto-masking scheme to obtain an omnidirectional binary mask that keeps the pixels of moving objects out of the photometric error computation. By improving the spatial transformation and automatically generating masks for dynamic objects, the invention improves the accuracy of the photometric loss and thereby better supervises the training of the depth network. Applying the depth-aware pixel correspondence and the omnidirectional auto-mask of this embodiment to the Monodepth2 framework of Godard et al. therefore yields monocular depth estimation results with higher accuracy.

The second embodiment of the present invention relates to a self-supervised monocular depth estimation device combining spatiotemporally enhanced photometric loss, comprising: an acquisition module for acquiring several adjacent frames from an image sequence; and an estimation module for inputting the images into a trained deep learning network to obtain depth information and pose information, wherein the photometric loss information of the deep learning network is obtained from the spatial transformation model of a depth-aware pixel correspondence module, and an omnidirectional auto-mask module is used to keep the pixels of moving objects out of the photometric error computation.

The depth-aware pixel correspondence module comprises: a first construction unit for performing, for distant regions, the spatial transformation with a homography matrix and constructing a first reconstructed image, wherein the distant region is treated as a plane at infinity; a second construction unit for performing the spatial transformation with the basic matrix and constructing a second reconstructed image; and a photometric loss information acquisition unit for computing a photometric error map based on the first reconstructed image and a photometric error map based on the second reconstructed image from the two pixel correspondences, and then taking the minimum value pixel by pixel to obtain the final photometric loss information.

The omnidirectional auto-mask module comprises: an initial reconstruction generation unit for predicting an initial depth and an initial pose of the target frame with a pre-trained network and generating an initial reconstructed image; a binary mask generation unit for adding perturbation terms to the initial pose and obtaining several hypothetical reconstructed frames by spatial transformation, generating multiple photometric error maps from the hypothetical reconstructed frames combined with the photometry of the target frame, and obtaining multiple binary masks from the multiple photometric error maps; and a mask selection unit for selecting the minimum over the multiple binary masks as the final mask. The perturbation terms are translational perturbations, including [t_max, 0, 0], [-t_max, 0, 0], [0, 0, t_max], and [0, 0, -t_max], where t_max denotes the maximum value in the initial translation vector.

Claims (6)

1. A self-supervised monocular depth estimation method combining spatiotemporally enhanced photometric loss, comprising the steps of:
acquiring a plurality of adjacent frame images in an image sequence;
inputting the images into a trained deep learning network to obtain depth information and pose information, wherein photometric loss information of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional automatic mask is used to prevent pixels of moving objects from participating in the photometric error calculation; obtaining the photometric loss information from the spatial transformation model based on depth-aware pixel correspondence specifically comprises:
performing a spatial transformation on a distant region using a homography matrix, and constructing a first reconstructed image, wherein the distant region is treated as a plane at infinity;
performing a spatial transformation using the basic matrix, and constructing a second reconstructed image; and
solving a photometric error map based on the first reconstructed image and a photometric error map based on the second reconstructed image through the two pixel correspondences, and then selecting the minimum value pixel by pixel to obtain final photometric loss information.
2. The self-supervised monocular depth estimation method combining spatiotemporally enhanced photometric loss of claim 1, wherein using the omnidirectional automatic mask to prevent pixels of moving objects from participating in the photometric error calculation specifically comprises:
predicting an initial depth and an initial pose of a target frame through a pre-trained network, and generating an initial reconstructed image;
adding perturbation terms to the initial pose, and obtaining a plurality of hypothetical reconstructed frames by spatial transformation; generating a plurality of photometric error maps from the hypothetical reconstructed frames combined with the photometry of the target frame, and obtaining a plurality of binary masks from the plurality of photometric error maps; and
selecting a minimum value from the plurality of binary masks as a final mask.
3. The self-supervised monocular depth estimation method combining spatiotemporally enhanced photometric loss of claim 2, wherein the perturbation terms are translational perturbation terms comprising [t_max, 0, 0], [-t_max, 0, 0], [0, 0, t_max], and [0, 0, -t_max], where t_max represents the maximum value in the initialized translation vector.
4. A self-supervised monocular depth estimation device combining spatiotemporally enhanced photometric loss, comprising:
an acquisition module for acquiring a plurality of adjacent frame images in an image sequence; and
an estimation module for inputting the images into a trained deep learning network to obtain depth information and pose information, wherein photometric loss information of the deep learning network is obtained from the spatial transformation model of a depth-aware pixel correspondence module, and an omnidirectional automatic mask module is used to prevent pixels of moving objects from participating in the photometric error calculation; the depth-aware pixel correspondence module comprises:
a first construction unit for performing a spatial transformation on a distant region using a homography matrix and constructing a first reconstructed image, wherein the distant region is treated as a plane at infinity;
a second construction unit for performing a spatial transformation using the basic matrix and constructing a second reconstructed image; and
a photometric loss information acquisition unit for solving a photometric error map based on the first reconstructed image and a photometric error map based on the second reconstructed image through the two pixel correspondences, and then selecting the minimum value pixel by pixel to obtain final photometric loss information.
5. The self-supervised monocular depth estimation device combining spatiotemporally enhanced photometric loss of claim 4, wherein the omnidirectional automatic mask module comprises:
an initial reconstruction generation unit for predicting an initial depth and an initial pose of a target frame through a pre-trained network and generating an initial reconstructed image;
a binary mask generation unit for adding perturbation terms to the initial pose and obtaining a plurality of hypothetical reconstructed frames by spatial transformation, generating a plurality of photometric error maps from the hypothetical reconstructed frames combined with the photometry of the target frame, and obtaining a plurality of binary masks from the plurality of photometric error maps; and
a mask selection unit for selecting a minimum value from the plurality of binary masks as a final mask.
6. The self-supervised monocular depth estimation device combining spatiotemporally enhanced photometric loss of claim 5, wherein the perturbation terms are translational perturbation terms comprising [t_max, 0, 0], [-t_max, 0, 0], [0, 0, t_max], and [0, 0, -t_max], where t_max represents the maximum value in the initialized translation vector.
CN202210475411.0A 2022-04-29 2022-04-29 Self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss Active CN114998411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210475411.0A CN114998411B (en) 2022-04-29 2022-04-29 Self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss


Publications (2)

Publication Number Publication Date
CN114998411A CN114998411A (en) 2022-09-02
CN114998411B (en) 2024-01-09

Family

ID=83025390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210475411.0A Active CN114998411B (en) Self-supervised monocular depth estimation method and device combining spatiotemporally enhanced photometric loss

Country Status (1)

Country Link
CN (1) CN114998411B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245927B (en) * 2023-02-09 2024-01-16 湖北工业大学 A self-supervised monocular depth estimation method and system based on ConvDepth


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11082681B2 (en) * 2018-05-17 2021-08-03 Niantic, Inc. Self-supervised training of a depth estimation system
US10970856B2 (en) * 2018-12-27 2021-04-06 Baidu Usa Llc Joint learning of geometry and motion with three-dimensional holistic understanding
US11176709B2 (en) * 2019-10-17 2021-11-16 Toyota Research Institute, Inc. Systems and methods for self-supervised scale-aware training of a model for monocular depth estimation
US11257231B2 (en) * 2020-06-17 2022-02-22 Toyota Research Institute, Inc. Camera agnostic depth network

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264509A (en) * 2018-04-27 2019-09-20 腾讯科技(深圳)有限公司 Determine the method, apparatus and its storage medium of the pose of image-capturing apparatus
CN111260680A (en) * 2020-01-13 2020-06-09 杭州电子科技大学 An Unsupervised Pose Estimation Network Construction Method Based on RGBD Cameras
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 A Visual Odometry Method Based on Image Depth Estimation
CN111739078A (en) * 2020-06-15 2020-10-02 大连理工大学 A Monocular Unsupervised Depth Estimation Method Based on Context Attention Mechanism
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 An unsupervised monocular depth estimation algorithm based on deep learning
CN113160390A (en) * 2021-04-28 2021-07-23 北京理工大学 Three-dimensional dense reconstruction method and system
CN113240722A (en) * 2021-04-28 2021-08-10 浙江大学 Self-supervision depth estimation method based on multi-frame attention
CN113570658A (en) * 2021-06-10 2021-10-29 西安电子科技大学 Depth estimation method for monocular video based on deep convolutional network
CN113313732A (en) * 2021-06-25 2021-08-27 南京航空航天大学 Forward-looking scene depth estimation method based on self-supervision learning
CN113450410A (en) * 2021-06-29 2021-09-28 浙江大学 Monocular depth and pose joint estimation method based on epipolar geometry
CN114022799A (en) * 2021-09-23 2022-02-08 中国人民解放军军事科学院国防科技创新研究院 Self-supervision monocular depth estimation method and device
CN114170286A (en) * 2021-11-04 2022-03-11 西安理工大学 Monocular depth estimation method based on unsupervised depth learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Unsupervised learning of depth and ego-motion from video; T. Zhou et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 1851-1858 *
Research on image depth estimation methods based on domain adaptation; Zhan Yan; China Master's Theses Full-text Database, Information Science and Technology (No. 2021(04)); I138-811 *
Monocular image depth estimation based on unsupervised learning; Hu Zhicheng; China Master's Theses Full-text Database, Information Science and Technology (No. 2021(08)); I138-615 *
RGB-D SLAM algorithm for indoor dynamic scenes based on semantic priors and depth constraints; Jiang Haochen et al.; Information and Control; Vol. 50 (No. 2021(03)); 275-286 *
Monocular depth estimation combining attention and unsupervised deep learning; Cen Shijie et al.; Journal of Guangdong University of Technology (No. 04); 35-41 *

Also Published As

Publication number Publication date
CN114998411A (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN111105432B (en) Unsupervised end-to-end driving environment perception method based on deep learning
Jin et al. Neu-nbv: Next best view planning using uncertainty estimation in image-based neural rendering
CN110610486B (en) Monocular image depth estimation method and device
CN110599522B (en) Method for detecting and removing dynamic target in video sequence
Ubina et al. Intelligent underwater stereo camera design for fish metric estimation using reliable object matching
CN110910437A (en) A Depth Prediction Method for Complex Indoor Scenes
Miao et al. Ds-depth: Dynamic and static depth estimation via a fusion cost volume
Gao et al. Joint optimization of depth and ego-motion for intelligent autonomous vehicles
Goncalves et al. Deepdive: An end-to-end dehazing method using deep learning
CN114998411B (en) Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss
CN118822906A (en) Indoor dynamic environment map construction method and system based on image restoration and completion
Gong et al. Skipcrossnets: Adaptive skip-cross fusion for road detection
Su et al. Omnidirectional depth estimation with hierarchical deep network for multi-fisheye navigation systems
CN113920270B (en) Layout reconstruction method and system based on multi-view panorama
Zhang et al. Spatiotemporally enhanced photometric loss for self-supervised monocular depth estimation
Rohan et al. A Systematic Literature Review on Deep Learning-based Depth Estimation in Computer Vision
CN114943762A (en) A Binocular Visual Odometry Method Based on Event Camera
Liao et al. VI-NeRF-SLAM: A real-time visual–inertial SLAM with NeRF mapping
CN116188550A (en) Self-supervision depth vision odometer based on geometric constraint
Guo et al. A simple baseline for supervised surround-view depth estimation
Bhutani et al. Unsupervised Depth and Confidence Prediction from Monocular Images using Bayesian Inference
Han et al. Weakly Supervised Monocular 3D Object Detection by Spatial-Temporal View Consistency
Wang et al. Unsupervised Scale Network for Monocular Relative Depth and Visual Odometry
CN116630953A (en) Monocular image 3D target detection method based on nerve volume rendering
Hirose et al. Depth360: Self-supervised Learning for Monocular Depth Estimation using Learnable Camera Distortion Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant